# Phase 4 Project

### Business Problem

Company Zen has been marketing and selling Apple and Google products for a year. Twitter is one of its main platforms, and the company has hired me to build a model that classifies the sentiment of tweets about these products. The results should help Zen make an informed decision on whether to strengthen or reassess its relationships with Apple and Google, based on how customers respond to the products.

### Business Understanding

Understanding how customers feel and think about products is crucial because it shapes both marketing and the product offering. Twitter offers real-time insights that help the company react quickly to feedback and, where possible, recover service. Identifying trends and sentiment patterns early will enable Zen to outpace competitors in refining its campaigns and addressing customer pain points.
The model seeks to answer the following questions:
* What percentage of tweets express positive and negative sentiment?
* Which products have the best and worst sentiment?
* Should Company Zen continue this partnership with Apple and Google?

### Data Understanding

The dataset was downloaded from CrowdFlower and is stored in the `data` folder. Human raters labeled the sentiment of over 9,000 tweets as positive, negative, or neither. The columns are described below.

**tweet_text** -- actual text of the tweet

**emotion_in_tweet_is_directed_at** -- who or what the emotion expressed in the tweet is directed at

**is_there_an_emotion_directed_at_a_brand_or_product** -- whether the tweet contains an emotion directed at a specific brand or product


#### Import Libraries

In [1]:
#Import relevant libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from scipy.sparse import csr_matrix  # public import path; scipy.sparse.csr is deprecated
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

#### Load & Inspect Data

In [2]:
# Load the dataset
df = pd.read_csv('data/judge-1377884607_tweet_product_company.csv', encoding='ISO-8859-1')
# Preview the first five rows
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


* The column names are long, so we rename them for ease of access.

In [3]:
# Rename columns for ease of access
df.rename(columns={
    'tweet_text': 'Text',
    'emotion_in_tweet_is_directed_at': 'Product',
    'is_there_an_emotion_directed_at_a_brand_or_product': 'Sentiment'
}, inplace=True)

# Preview the renamed columns
df.head()

| | Text | Product | Sentiment |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [4]:
#To view the shape of the dataset
df.shape

(9093, 3)

* The dataset has 9,093 rows (one per tweet) and 3 columns, renamed as above.

In [5]:
#To view the dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Text       9092 non-null   object
 1   Product    3291 non-null   object
 2   Sentiment  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


* All three columns have dtype `object`. The non-null counts reveal missing values: 1 in `Text` and 5,802 in `Product` (9,093 rows minus 3,291 non-null), which we investigate further below.
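
The non-null counts from `df.info()` can also be expressed as a per-column missing-value summary. The following is a minimal sketch on a small made-up frame rather than the real data; the helper `missing_summary` and the `toy` rows are hypothetical names introduced only for illustration.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = (frame.isnull().mean() * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Toy rows for illustration only; they are not taken from the dataset
toy = pd.DataFrame({
    "Text": ["great phone", None, "battery died", "love it"],
    "Product": ["iPhone", None, None, "iPad"],
    "Sentiment": ["Positive emotion", "No emotion toward brand or product",
                  "Negative emotion", "Positive emotion"],
})

summary = missing_summary(toy)
print(summary)
```

On the notebook's data the equivalent call would be `missing_summary(df)`.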

#### Checking For Duplicates


In [6]:
# Check the number of duplicate rows
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")
# Remove duplicate rows
df_cleaned = df.drop_duplicates()

Number of duplicate rows: 22
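
Before dropping duplicates it can help to look at the repeated rows themselves. The sketch below uses a made-up frame (the `toy` rows are hypothetical) and shows the difference between the default `duplicated()`, which flags only the repeats after the first occurrence, and `keep=False`, which flags every copy.

```python
import pandas as pd

# Toy rows for illustration only; they are not taken from the dataset
toy = pd.DataFrame({
    "Text": ["RT great app", "RT great app", "new iPad out", "RT great app"],
    "Sentiment": ["Positive emotion", "Positive emotion",
                  "No emotion toward brand or product", "Positive emotion"],
})

# Default keep='first' counts only the repeats after the first occurrence
num_repeats = toy.duplicated().sum()

# keep=False flags every member of a duplicated group, useful for inspection
all_copies = toy[toy.duplicated(keep=False)]

# Drop the repeats and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(num_repeats, len(all_copies), len(deduped))
```

Resetting the index is optional; it only avoids gaps in the row labels after rows are removed.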


#### Checking For Missing Values

In [7]:
#To view the missing values from the cleaned dataset
df_cleaned.isnull().sum()

Text            1
Product      5789
Sentiment       0
dtype: int64

* Only one row has a null `Text`, so we drop it. `Product` has far too many nulls to drop (5,789), and the column is important to the analysis, so we fill those with a placeholder instead.

In [8]:
# Drop missing values in 'Text' column
df_cleaned = df_cleaned.dropna(subset=['Text'])

# Fill missing values in 'Product' with a placeholder. Assign the result back
# rather than using inplace=True on a column selection, which may not update df_cleaned.
df_cleaned['Product'] = df_cleaned['Product'].fillna('Unknown Product')

# Check if missing values are handled
print(df_cleaned.isnull().sum())

Text         0
Product      0
Sentiment    0
dtype: int64


In [9]:
# View the class distribution of the target variable
df_cleaned['Sentiment'].value_counts()

No emotion toward brand or product    5375
Positive emotion                      2970
Negative emotion                       569
I can't tell                           156
Name: Sentiment, dtype: int64
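
The first objective asks for the share of each sentiment. A minimal sketch that converts the counts printed above into percentages follows; the `counts` Series is rebuilt by hand from that output so the snippet runs on its own. On the notebook's frame, `df_cleaned['Sentiment'].value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "No emotion toward brand or product": 5375,
    "Positive emotion": 2970,
    "Negative emotion": 569,
    "I can't tell": 156,
}, name="Sentiment")

# Share of each class as a percentage of all labeled tweets
percentages = (counts / counts.sum() * 100).round(2)
print(percentages)
```

The classes are clearly imbalanced: negative tweets are a small minority next to the neutral and positive ones.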

In [10]:
# View the distribution of products
df_cleaned['Product'].value_counts()

Unknown Product                    5788
iPad                                945
Apple                               659
iPad or iPhone App                  469
Google                              428
iPhone                              296
Other Google product or service     293
Android App                          80
Android                              77
Other Apple product or service       35
Name: Product, dtype: int64
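
Because the business question is framed in terms of Apple and Google, one possible next step is to roll the product labels up to their parent brand. This is only a sketch: `BRAND_MAP` is a hypothetical mapping written for illustration, not something defined in the notebook, and the counts are copied by hand from the output above so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical mapping from the Product labels above to a parent brand
BRAND_MAP = {
    "iPad": "Apple",
    "Apple": "Apple",
    "iPad or iPhone App": "Apple",
    "iPhone": "Apple",
    "Other Apple product or service": "Apple",
    "Google": "Google",
    "Other Google product or service": "Google",
    "Android App": "Google",
    "Android": "Google",
}

# Counts copied from the value_counts() output above
products = pd.DataFrame({
    "Product": ["Unknown Product", "iPad", "Apple", "iPad or iPhone App",
                "Google", "iPhone", "Other Google product or service",
                "Android App", "Android", "Other Apple product or service"],
    "n_tweets": [5788, 945, 659, 469, 428, 296, 293, 80, 77, 35],
})

# Labels missing from the mapping (here the placeholder) fall back to "Unknown"
products["Brand"] = products["Product"].map(BRAND_MAP).fillna("Unknown")
brand_counts = products.groupby("Brand")["n_tweets"].sum()
print(brand_counts)
```

On the notebook's frame the same idea would be `df_cleaned['Product'].map(BRAND_MAP).fillna('Unknown')`, stored in a new column.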