# Apple and Google NLP Twitter Sentiment Analysis
## Business Overview

## Introduction
Social media platforms like Twitter are a rich source of public sentiment and opinion, and understanding how users feel on these platforms can give businesses useful insights. This project uses Natural Language Processing (NLP) to analyze Twitter sentiment about products from two tech giants, Apple and Google.

## Business Understanding
Apple and Google are two of the most prominent companies in the tech industry, producing a wide range of products that have a significant impact on people's lives. Monitoring the sentiment expressed by Twitter users towards these companies and their products can help businesses make informed decisions. This sentiment analysis can inform product development, marketing strategies, and customer relations.

## Business Problem
The primary business problem is the need for a systematic, automated way to gauge the sentiment of tweets related to Apple and Google products. Twitter is a platform where millions of users express their opinions and experiences daily, and manual analysis of these tweets is not feasible due to the sheer volume of data. There is therefore a need for a reliable NLP model that can classify tweets into positive, negative, or neutral sentiment categories.

## Main Objective
The main objective of this project is to build a proof-of-concept NLP model that can accurately rate the sentiment of tweets about Apple and Google products. This model will enable businesses to gain real-time insights into how their products are perceived by the Twitter community.

## Specific Objectives
* **Data Collection:** A dataset of tweets related to Apple and Google products will be gathered. This dataset should include tweets that express both positive and negative sentiments.

* **Data Preprocessing:** Cleaning and preprocessing the collected data to prepare it for NLP analysis. This includes tasks like text normalization, tokenization, and handling of missing or irrelevant data.

* **Model Development:** Developing a baseline NLP model for binary sentiment classification, categorizing tweets as either positive or negative. This model will serve as a starting point for further improvements.

* **Model Evaluation:** Evaluating the binary sentiment classifier using appropriate metrics like accuracy, precision, recall, and F1-score. This will help assess the model's performance and identify areas for improvement.

* **Multiclass Classification:** Extending the binary classifier to a multiclass classifier by incorporating a neutral sentiment category. This will provide a more nuanced understanding of sentiment.

* **Business Insights:** Interpreting the results and providing actionable insights to businesses.

In conclusion, this project aims to create a valuable tool for businesses in the tech industry by leveraging NLP to understand and react to public sentiment on Twitter. By achieving the specific objectives outlined, the project will provide a scalable solution for sentiment analysis that can be adapted to other products and industries, ultimately enhancing decision-making processes and customer satisfaction.
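As a rough illustration of the model development objective, the sketch below fits a binary baseline on a handful of made-up tweets. The toy tweets, their labels, and the names `toy_tweets`, `toy_labels`, and `baseline_model` are assumptions for illustration only; the choice of TF-IDF features with a multinomial Naive Bayes classifier is one plausible baseline, not necessarily the model used later in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets, invented for illustration (not taken from the dataset)
toy_tweets = [
    "I love my new iPad, it is awesome",
    "Great Android app, really enjoying it",
    "The iPhone battery is terrible",
    "I hate this buggy Google app",
]
toy_labels = [
    "Positive emotion",
    "Positive emotion",
    "Negative emotion",
    "Negative emotion",
]

# TF-IDF features feeding a multinomial Naive Bayes classifier
baseline_model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
baseline_model.fit(toy_tweets, toy_labels)

# Classify an unseen (also invented) tweet
prediction = baseline_model.predict(["awesome app, love it"])
print(prediction[0])
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary learned from the training tweets tied to the model, so new tweets are transformed the same way at prediction time.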

## Metrics of Success
After modeling, the success metrics for the sentiment analysis in this project include:

* **Accuracy:** Measure the accuracy of the sentiment classification model in correctly categorizing tweets into positive, negative, or neutral sentiments. This metric indicates the model's ability to make accurate predictions.

* **Precision, Recall, and F1 Score:** Calculate precision, recall, and F1 score for the positive, negative, and neutral classes. Precision is the share of tweets predicted as a given sentiment that actually carry it; recall is the share of tweets with a given sentiment that the model finds. The F1 score, their harmonic mean, summarizes how well the model balances the two.
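The metrics above can be computed with scikit-learn. The following is a minimal sketch on invented labels; `y_true` and `y_pred` are hypothetical, and macro averaging is an assumption chosen here because it weights the three classes equally.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted labels, invented for illustration only
y_true = ["Positive", "Negative", "Neutral", "Positive", "Neutral", "Negative"]
y_pred = ["Positive", "Neutral", "Neutral", "Positive", "Positive", "Negative"]

# Share of tweets whose predicted label matches the true label
accuracy = accuracy_score(y_true, y_pred)

# Macro averaging computes each metric per class, then takes the unweighted mean
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.3f}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

On a dataset where one class dominates, accuracy alone can look good even when the rare class is mostly misclassified, which is why the per-class metrics are reported alongside it.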

## Data Understanding
The data was sourced from the [CrowdFlower Brands and Product Emotions dataset on data.world](https://data.world/crowdflower/brands-and-product-emotions). Contributors evaluated tweets about multiple brands and products. The crowd was asked whether each tweet expressed positive, negative, or no emotion towards a brand and/or product. If some emotion was expressed, they were also asked to say which brand or product was the target of that emotion.





In [1]:
# importing necessary packages
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import train_test_split
from matplotlib import cm
from sklearn.ensemble import RandomForestClassifier

nltk.download("stopwords")
nltk.download("wordnet")



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# Reading the Dataset 
# Use 'latin-1' encoding to handle special characters
review_df = pd.read_csv("judge-1377884607_tweet_product_company.csv", encoding="latin-1")
review_df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


The column names are overly lengthy and challenging to read. To improve readability, we can rename the columns.

In [3]:
review_df.columns = ["tweet", "products", "emotion"]
review_df.head()

Unnamed: 0,tweet,products,emotion
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
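Assigning a list to `columns` relies on the columns being in the expected order. As an alternative sketch, `rename` with an explicit mapping ties each new name to the original one. The frame `rename_demo` and its single row are hypothetical stand-ins; only the column names come from the dataset.

```python
import pandas as pd

# Tiny stand-in frame using the original column names (row content is invented)
rename_demo = pd.DataFrame({
    "tweet_text": ["example tweet"],
    "emotion_in_tweet_is_directed_at": ["iPad"],
    "is_there_an_emotion_directed_at_a_brand_or_product": ["Positive emotion"],
})

# Map each original name to its short replacement explicitly
rename_demo = rename_demo.rename(columns={
    "tweet_text": "tweet",
    "emotion_in_tweet_is_directed_at": "products",
    "is_there_an_emotion_directed_at_a_brand_or_product": "emotion",
})

print(list(rename_demo.columns))
```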


Checking the unique values of products and emotion to get a better understanding of what we are working with.

In [4]:
review_df.products.unique()

array(['iPhone', 'iPad or iPhone App', 'iPad', 'Google', nan, 'Android',
       'Apple', 'Android App', 'Other Google product or service',
       'Other Apple product or service'], dtype=object)

In [5]:
review_df.emotion.unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

Based on the unique values observed in the **"products"** and **"emotion"** columns, we can make several observations:

The **"products"** column contains a variety of product-related values, including specific Apple and Google products like 'iPhone,' 'iPad,' 'Google,' 'Android,' as well as more general categories like 'iPad or iPhone App,' 'Android App,' 'Other Google product or service,' and 'Other Apple product or service.' Additionally, there are some missing values (NaN).

The **"emotion"** column represents the sentiment or emotion associated with the tweets. It includes categories such as 'Negative emotion,' 'Positive emotion,' 'No emotion toward brand or product,' and "I can't tell."

Renaming the values in the emotion column to make them easier to interpret.

In [6]:
# Replace 'No emotion toward brand or product' with 'Neutral emotion'
# and 'I can't tell' with 'Unknown'.
# Assign the result back instead of using inplace on a column selection.
review_df['emotion'] = review_df['emotion'].replace({'No emotion toward brand or product': 'Neutral emotion',
                                                     "I can't tell": 'Unknown'})

In [7]:
# confirming the changes by checking the value counts
review_df.emotion.value_counts()

Neutral emotion     5389
Positive emotion    2978
Negative emotion     570
Unknown              156
Name: emotion, dtype: int64

In [8]:
# checking structure of the dataset
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   tweet     9092 non-null   object
 1   products  3291 non-null   object
 2   emotion   9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


From the output above, we can observe that one row is missing its tweet text and that the products column is missing for most rows: only 3,291 of the 9,093 entries record which product the tweet was about.
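The same information can be read off directly as per-column counts rather than inferred from `info()`. The sketch below uses a small invented frame, `missing_demo`, as a stand-in for `review_df`.

```python
import pandas as pd

# Invented stand-in for review_df with a few missing entries
missing_demo = pd.DataFrame({
    "tweet": ["tweet a", None, "tweet c", "tweet d"],
    "products": ["iPad", None, None, "Google"],
    "emotion": ["Positive emotion", "Neutral emotion",
                "Neutral emotion", "Negative emotion"],
})

# Number of missing values in each column
missing_counts = missing_demo.isna().sum()

# Percentage of missing values in each column
missing_share = missing_demo.isna().mean() * 100

print(missing_counts)
print(missing_share)
```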

### Dealing with missing values

In [9]:
#inspecting the row with the missing tweet
review_df[pd.isna(review_df["tweet"])]

Unnamed: 0,tweet,products,emotion
6,,,Neutral emotion


We can see that both tweet and product information are missing in this row, so it would be appropriate to drop it.

In [10]:
review_df.dropna(subset=["tweet"], inplace=True)

In [11]:
#inspecting the rows where product column has missing value
review_df[pd.isna(review_df['products'])].head(10)


Unnamed: 0,tweet,products,emotion
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,Neutral emotion
16,Holler Gram for iPad on the iTunes App Store -...,,Neutral emotion
32,"Attn: All #SXSW frineds, @mention Register fo...",,Neutral emotion
33,Anyone at #sxsw want to sell their old iPad?,,Neutral emotion
34,Anyone at #SXSW who bought the new iPad want ...,,Neutral emotion
35,At #sxsw. Oooh. RT @mention Google to Launch ...,,Neutral emotion
37,SPIN Play - a new concept in music discovery f...,,Neutral emotion
39,VatorNews - Google And Apple Force Print Media...,,Neutral emotion
41,HootSuite - HootSuite Mobile for #SXSW ~ Updat...,,Neutral emotion
42,Hey #SXSW - How long do you think it takes us ...,,Neutral emotion


In [12]:
# checking the percentage of the null values
missing_products_percentage = (review_df['products'].isna().sum() / len(review_df)) * 100
print(round(missing_products_percentage, 2))

63.8


The tweets sampled above do not appear to direct an emotion at a specific product or brand, so we can fill the null values with "Unknown" as a placeholder. Because approximately 63.80% of the "products" column is missing, dropping these rows would discard most of the dataset; filling them with "Unknown" keeps the tweets available for analysis.
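One way to check how the missing product labels relate to sentiment is to cross-tabulate missingness against the emotion column. This is a sketch on invented rows; `crosstab_demo`, `product_status`, and `missing_by_emotion` are hypothetical names, and the counts say nothing about the real dataset.

```python
import pandas as pd

# Invented stand-in rows; the real check would use review_df instead
crosstab_demo = pd.DataFrame({
    "products": ["iPad", None, None, "Google", None],
    "emotion": ["Positive emotion", "Neutral emotion", "Neutral emotion",
                "Negative emotion", "Positive emotion"],
})

# Label each row by whether its product value is missing
product_status = crosstab_demo["products"].isna().map(
    {True: "missing", False: "present"}
)

# Count rows for every combination of emotion and product status
missing_by_emotion = pd.crosstab(crosstab_demo["emotion"], product_status)
print(missing_by_emotion)
```

If most rows with a missing product turn out to be neutral, that would support treating the missing values as "no product targeted" rather than as data-entry gaps.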

In [13]:
review_df["products"] = review_df["products"].fillna("Unknown")

In [14]:
# verifying that the missing values have been dealt with
review_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   tweet     9092 non-null   object
 1   products  9092 non-null   object
 2   emotion   9092 non-null   object
dtypes: object(3)
memory usage: 284.1+ KB


### Dropping some rows

Calculating the percentage of each emotion category

In [15]:
# Calculate the percentage of each emotion category
emotion_percentage = (review_df['emotion'].value_counts() / len(review_df)) * 100

# Display the percentage of each emotion
print(emotion_percentage)

Neutral emotion     59.260889
Positive emotion    32.754070
Negative emotion     6.269248
Unknown              1.715794
Name: emotion, dtype: float64


For this project, our focus is solely on emotions categorized as positive, neutral, or negative. Therefore, we will remove rows with the "Unknown" emotion category. These rows account for only about 1.7% of our dataset, so removing them costs little data.

In [19]:
# Dropping rows with "Unknown" emotion category
review_df = review_df[review_df.emotion != "Unknown"].copy()

# Checking the changes
review_df.emotion.value_counts()

Neutral emotion     5388
Positive emotion    2978
Negative emotion     570
Name: emotion, dtype: int64

### Checking and dealing with duplicates

In [26]:
# Calculating the number of duplicate rows 
len(review_df[review_df.duplicated()])

22

In [28]:
# Checking for duplicate rows in the DataFrame
review_df[review_df.duplicated()].head(10)


Unnamed: 0,tweet,products,emotion
468,"Before It Even Begins, Apple Wins #SXSW {link}",Apple,Positive emotion
776,"Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw",Unknown,Neutral emotion
2232,Marissa Mayer: Google Will Connect the Digital &amp; Physical Worlds Through Mobile - {link} #sxsw,Unknown,Neutral emotion
2559,Counting down the days to #sxsw plus strong Canadian dollar means stock up on Apple gear,Apple,Positive emotion
3950,Really enjoying the changes in Gowalla 3.0 for Android! Looking forward to seeing what else they &amp; Foursquare have up their sleeves at #SXSW,Android App,Positive emotion
3962,"#SXSW is just starting, #CTIA is around the corner and #googleio is only a hop skip and a jump from there, good time to be an #android fan",Android,Positive emotion
4897,"Oh. My. God. The #SXSW app for iPad is pure, unadulterated awesome. It's easier to browse events on iPad than on the website!!!",iPad or iPhone App,Positive emotion
5338,RT @mention ÷¼ GO BEYOND BORDERS! ÷_ {link} ã_ #edchat #musedchat #sxsw #sxswi #classical #newTwitter,Unknown,Neutral emotion
5341,"RT @mention ÷¼ Happy Woman's Day! Make love, not fuss! ÷_ {link} ã_ #edchat #musedchat #sxsw #sxswi #classical #newTwitter",Unknown,Neutral emotion
5881,"RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw",Unknown,Neutral emotion


It appears that there are 22 duplicate rows. These duplicates will be removed, retaining only the first occurrence of each row.
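To make the "first occurrence" behavior concrete, the sketch below runs `duplicated` and `drop_duplicates` on a small invented frame, `dup_demo`. By default `duplicated` flags only the second and later copies of a row, which is why the count above covers the rows to be removed and not the originals.

```python
import pandas as pd

# Invented frame in which one row appears three times
dup_demo = pd.DataFrame({
    "tweet": ["same tweet", "same tweet", "other tweet", "same tweet"],
    "products": ["Apple", "Apple", "Unknown", "Apple"],
    "emotion": ["Positive emotion", "Positive emotion",
                "Neutral emotion", "Positive emotion"],
})

# keep="first" (the default) marks every copy after the first as a duplicate
flags = dup_demo.duplicated(keep="first")

# Dropping duplicates keeps the first occurrence and its original index label
deduped = dup_demo.drop_duplicates(keep="first")

print(flags.tolist())
print(deduped.index.tolist())
```

Note that this only catches rows that match exactly in every column; a retweet that prepends `RT @mention` to the same text counts as a different row.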

In [29]:
# Remove duplicate rows and keep the first occurrence
review_df.drop_duplicates(inplace=True)

In [30]:
# Display information about the DataFrame
review_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8914 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   tweet     8914 non-null   object
 1   products  8914 non-null   object
 2   emotion   8914 non-null   object
dtypes: object(3)
memory usage: 278.6+ KB
