<center><h1> Natural Language Processing Project </h1></center>


Data Science & Machine Learning


**Contributors**
- Lynette Wangari - lynettewangari26@gmail.com
- Jackson Munene - jacmwaniki@gmail.com
- Julius Kinyua - juliusczar36@gmail.com
- Philip Oluoch - 

## Overview

In today's technologically driven society, social media (especially Twitter) often acts as a central repository for thoughts, feelings, and opinions. Fortunately, technology has also provided a means by which we can attempt to evaluate or analyze some of those opinions. Through the use of machine learning and natural language processing, we will identify users who have expressed dissatisfaction with Apple and Google products. The objective is to categorize tweets related to Apple and Google into positive, neutral, and negative sentiments, providing actionable insights for their advertising strategies.

In this project, we will build a predictive model that can monitor recent tweets about various technology brands. Consumer sentiment is a key indicator of purchasing decisions and, by extension, of the financial performance of the companies behind these products. Sentiment analysis can therefore be used to filter and identify brands with positive consumer sentiment, which in turn can guide investment decisions. We will use supervised learning and natural language processing (NLP) techniques to solve the tweet sentiment classification problem on a dataset of human-labeled tweets.

## Business Understanding

### Business Context:

Before we begin exploring the data, we first need to understand why an NLP model that can predict sentiment would be useful. The dataset contains numerous tweets centered on Apple or Google products and the SXSW festival. SXSW draws thousands of people each year and gives large tech companies such as Apple and Google an opportunity to showcase their products and services and to gauge how much engagement and interest they garner from attendees.

The tweets included were sent out during the South by Southwest (SXSW) conference. A model can analyze each of those tweets and give near-instant feedback on how people feel. We can learn which features were well received and gain insight into what needs improvement.

### Goal:

Develop an initial Proof of Concept (PoC) for a sentiment analysis model that can classify Tweets about Apple and Google products into positive or negative sentiments. Once a reliable binary classifier is achieved, extend it to include neutral sentiments for a more comprehensive analysis.
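
The two stages can share one label column. The sketch below is a minimal illustration on a made-up frame, assuming the simplified labels "Positive emotion", "Negative emotion", and "No emotion". The names `BINARY_LABELS`, `MULTICLASS_LABELS`, and `binary_subset` are hypothetical and not part of the project code.

```python
import pandas as pd

# Toy stand-in for the tweet data; the real labels come from the CrowdFlower file.
toy = pd.DataFrame({
    "tweet": ["love my new iPad", "battery died again",
              "at #SXSW today", "great Google party"],
    "sentiment": ["Positive emotion", "Negative emotion",
                  "No emotion", "Positive emotion"],
})

# Stage 1 (binary PoC): keep only positive/negative rows, encode positive as 1.
BINARY_LABELS = {"Negative emotion": 0, "Positive emotion": 1}

# Stage 2 (multiclass): the neutral class gets its own code.
MULTICLASS_LABELS = {**BINARY_LABELS, "No emotion": 2}


def binary_subset(frame: pd.DataFrame) -> pd.DataFrame:
    """Return positive/negative rows with an integer `target` column."""
    subset = frame[frame["sentiment"].isin(list(BINARY_LABELS))].copy()
    subset["target"] = subset["sentiment"].map(BINARY_LABELS)
    return subset


binary_df = binary_subset(toy)
multiclass_target = toy["sentiment"].map(MULTICLASS_LABELS)

print(binary_df[["sentiment", "target"]])
print(multiclass_target.tolist())
```

Keeping the binary codes as a subset of the multiclass codes means the same downstream code can be reused when the neutral class is added.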

### Business Problem:

The primary business problem is to ensure the best possible experience for customers by accurately predicting whether a given tweet expresses positive or negative sentiment about Apple or Google products. Misclassifying negative sentiment as positive (false positives) can lead to misleading insights, which might result in poor strategic decisions.

### Objectives:

1. Sentiment Analysis Model: Develop a model that can accurately classify tweets as positive, negative, or neither based on their content.

2. Minimize False Positives: Focus on reducing the instances where the model incorrectly identifies a negative sentiment as positive.

3. Maximize Accuracy and Precision: Ensure the model is reliable by targeting high accuracy and precision.
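
To make objectives 2 and 3 concrete, the sketch below counts false positives and computes accuracy and precision for the binary case, treating "positive" as the positive class. The label lists are made up for illustration, and `binary_metrics` is a hypothetical helper rather than project code.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Count confusion-matrix cells and derive accuracy and precision."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Precision = TP / (TP + FP); fewer false positives means higher precision.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "accuracy": accuracy, "precision": precision}


# Made-up labels: one negative tweet is wrongly predicted as positive.
y_true = ["positive", "negative", "positive", "negative", "positive"]
y_pred = ["positive", "positive", "positive", "negative", "negative"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the single false positive pulls precision down to 2/3 while accuracy is 3/5, which shows why the two metrics are tracked separately.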

## Data Understanding

- Contributors evaluated tweets about multiple brands and products. The crowd was asked if the tweet expressed positive, negative, or no emotion towards a brand and/or product. If some emotion was expressed they were also asked to say which brand or product was the target of that emotion.

- The tweets were sent out during the South by Southwest (SXSW) conference and are mostly about Google and Apple products. The dataset was put together in 2013.

- The dataset includes 9,093 rows and three columns: the first contains the tweet text, the second the subject of the tweet, and the third the emotion of the tweet.

- Data comes from CrowdFlower via data.world.

- Human raters rated the sentiment in over 9,000 tweets as positive, negative, neither, or "can't tell". The "can't tell" label is not useful for this analysis, so those rows will be dropped.
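
Dropping the "can't tell" rows could look like the following sketch. It assumes the label column is named `sentiment` and that the label text is "I can't tell", and it uses a small made-up frame in place of the real data.

```python
import pandas as pd

# Made-up rows standing in for the real tweets.
toy = pd.DataFrame({
    "tweet": ["nice app", "hmm", "queue was long", "new phone?"],
    "sentiment": ["Positive emotion", "I can't tell",
                  "Negative emotion", "I can't tell"],
})

# Keep every row whose label is something other than "I can't tell".
labelled = toy[toy["sentiment"] != "I can't tell"].reset_index(drop=True)

print(labelled)
```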



## Data Preparation

### Preview the data

Let's first import the various libraries we will use in our analysis and preview the data.

In [2]:
pip install pandas







In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [6]:

df = pd.read_csv('tweets.csv', encoding='latin1')
df

|  | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
| --- | --- | --- | --- |
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |
| ... | ... | ... | ... |
| 9088 | Ipad everywhere. #SXSW {link} | iPad | Positive emotion |
| 9089 | Wave, buzz... RT @mention We interrupt your re... |  | No emotion toward brand or product |
| 9090 | Google's Zeiger, a physician never reported po... |  | No emotion toward brand or product |
| 9091 | Some Verizon iPhone customers complained their... |  | No emotion toward brand or product |


The tweets contain mentions, hashtags, and external links to various websites. In order to make manipulation of the dataframe easier, we will rename the columns to something simpler.

In [12]:
df.rename(columns={'tweet_text':'tweet', 'emotion_in_tweet_is_directed_at':'brand_product', 
                  'is_there_an_emotion_directed_at_a_brand_or_product':'sentiment'}, inplace=True)
df

|  | tweet | brand_product | sentiment |
| --- | --- | --- | --- |
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |
| ... | ... | ... | ... |
| 9088 | Ipad everywhere. #SXSW {link} | iPad | Positive emotion |
| 9089 | Wave, buzz... RT @mention We interrupt your re... |  | No emotion toward brand or product |
| 9090 | Google's Zeiger, a physician never reported po... |  | No emotion toward brand or product |
| 9091 | Some Verizon iPhone customers complained their... |  | No emotion toward brand or product |
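
Mentions, hashtags, and link placeholders such as `{link}` are often normalized before modeling. The sketch below shows one possible approach using only the standard library. The `clean_tweet` helper and its regular expressions are assumptions for illustration, not the preprocessing used in this project.

```python
import re

MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+|\{link\}")
HASHTAG = re.compile(r"#(\w+)")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip mentions, links, and the '#' of hashtags."""
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    text = HASHTAG.sub(r"\1", text)        # keep the hashtag word, drop the '#'
    return " ".join(text.lower().split())  # collapse repeated whitespace


cleaned = clean_tweet("Ipad everywhere. #SXSW {link}")
print(cleaned)
```

Keeping the hashtag word while dropping the `#` is a design choice: tags such as `#iPad` name the product being discussed, so the word itself may still be useful to a model.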


We have a total of 9,093 tweets. Next we will check for missing values and duplicates, then examine the tweets by sentiment.

In [13]:
#check for null values in columns
df.isna().sum()

tweet               1
brand_product    5802
sentiment           0
dtype: int64

In [14]:
df.brand_product.value_counts()

brand_product
iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: count, dtype: int64

In [15]:
#looking at data, duplicates and null values
print(df.info())
print(("-"*20))
print('Total duplicated rows')
print(df.duplicated().sum())
print(("-"*20))
print('Total null values')
print(df.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tweet          9092 non-null   object
 1   brand_product  3291 non-null   object
 2   sentiment      9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB
None
--------------------
Total duplicated rows
22
--------------------
Total null values
tweet               1
brand_product    5802
sentiment           0
dtype: int64
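
Before dropping the duplicated rows, it can help to look at them. The sketch below uses a made-up frame and `DataFrame.duplicated` with `keep=False`, which flags every copy in a duplicate group rather than only the repeats.

```python
import pandas as pd

toy = pd.DataFrame({
    "tweet": ["RT great demo", "RT great demo", "long line at the store"],
    "sentiment": ["Positive emotion", "Positive emotion", "Negative emotion"],
})

# Default keep='first' flags only the later copies, which is what .sum() counts.
n_repeats = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, useful for inspection.
all_copies = toy[toy.duplicated(keep=False)]

print(n_repeats)
print(all_copies)
```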


In [18]:
#Simplify sentiment labels for visualizations
dict_sent = {'No emotion toward brand or product':"No emotion", 
             'Positive emotion':'Positive emotion',
             'Negative emotion':'Negative emotion',
             "I can't tell": "I can't tell"}
df['sentiment'] = df['sentiment'].map(dict_sent)

In [20]:
#sentiment breakdown
df['sentiment'].value_counts()

sentiment
No emotion          5389
Positive emotion    2978
Negative emotion     570
I can't tell         156
Name: count, dtype: int64
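
The counts above are heavily skewed toward "No emotion". To express them as percentages of the 9,093 rows, the sketch below rebuilds the counts as a Series (values copied from the output above) and normalizes them. On the real frame, `df['sentiment'].value_counts(normalize=True)` should give the same proportions as fractions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "No emotion": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
})

# Share of each class as a percentage of all rows, rounded to one decimal.
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

An imbalance of this size matters for the stated objectives, because a model can reach high accuracy while performing poorly on the small negative class.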

In [22]:
#drop rows with NaN in the tweet column
#(calling dropna on df['tweet'] alone would not remove the row from df)
df.dropna(subset=['tweet'], inplace=True)
df

|  | tweet | brand_product | sentiment | brand |
| --- | --- | --- | --- | --- |
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | Apple |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |  |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | Apple |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |  |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | Google |
| ... | ... | ... | ... | ... |
| 9088 | Ipad everywhere. #SXSW {link} | iPad | Positive emotion | Apple |
| 9089 | Wave, buzz... RT @mention We interrupt your re... |  | No emotion |  |
| 9090 | Google's Zeiger, a physician never reported po... |  | No emotion |  |
| 9091 | Some Verizon iPhone customers complained their... |  | No emotion |  |


In [23]:
#drop duplicates
df.drop_duplicates(inplace=True)
df

|  | tweet | brand_product | sentiment | brand |
| --- | --- | --- | --- | --- |
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | Apple |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |  |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | Apple |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |  |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | Google |
| ... | ... | ... | ... | ... |
| 9088 | Ipad everywhere. #SXSW {link} | iPad | Positive emotion | Apple |
| 9089 | Wave, buzz... RT @mention We interrupt your re... |  | No emotion |  |
| 9090 | Google's Zeiger, a physician never reported po... |  | No emotion |  |
| 9091 | Some Verizon iPhone customers complained their... |  | No emotion |  |
