# <ins>Project 4 Twitter Sentiment</ins>

## Business Understanding

## Data Understanding

Data is taken from [data.world](https://data.world/crowdflower/brands-and-product-emotions/workspace/file?filename=judge-1377884607_tweet_product_company.csv). According to their website, "Contributors evaluated tweets about multiple brands and products. The crowd was asked if the tweet expressed positive, negative, or no emotion towards a brand and/or product. If some emotion was expressed they were also asked to say which brand or product was the target of that emotion. Added: August 30, 2013 by Kent Cavender-Bares | Data Rows: 9093"

In [30]:
import pandas as pd
import numpy as np
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, regexp_tokenize
import matplotlib.pyplot as plt
import string
import re


In [31]:
## Data Preparation
df = pd.read_csv("data/judge-1377884607_tweet_product_company.csv", encoding='latin1')

In [32]:
df 

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


The `emotion_in_tweet_is_directed_at` column is missing 5,802 of its 9,093 values, `tweet_text` is missing one value, and `is_there_an_emotion_directed_at_a_brand_or_product` has no missing values. We will drop the single row with the missing tweet. In addition, many tweets appear to be labeled as neutral even though they contain an emotion, and we will want to fill in those values as we go through data cleaning.
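The missing-value counts can be read off `df.info()` by subtracting each non-null count from the row total, or computed directly. The following is a minimal sketch on a hypothetical four-row sample (the `sample` frame and its contents are made up for illustration; only the column names come from the dataset).

```python
import pandas as pd

# Hypothetical four-row sample that reuses the dataset's column names
sample = pd.DataFrame({
    "tweet_text": ["great app", None, "battery died", "at #sxsw"],
    "emotion_in_tweet_is_directed_at": ["iPad", None, "iPhone", None],
    "is_there_an_emotion_directed_at_a_brand_or_product": [
        "Positive emotion",
        "No emotion toward brand or product",
        "Negative emotion",
        "No emotion toward brand or product",
    ],
})

# Number of missing values in each column
missing = sample.isna().sum()
print(missing)

# Share of missing values in each column (0.0 to 1.0)
missing_share = sample.isna().mean()
print(missing_share)
```

Running the same two calls on the full `df` should reproduce the counts implied by `df.info()`.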

In [40]:
df = df.dropna(subset=['tweet_text'])
df['tweet_text'].isna().sum()

0

### Data Cleaning
Our first step will be tokenizing the `tweet_text` column. Before doing so, we make sure every value in that column is a Python string.

In [41]:
df['tweet_text'] = df['tweet_text'].astype(str)

In [43]:
# Check that an individual value is now a Python str
print(type(df['tweet_text'].iloc[13]))

<class 'str'>


In [45]:
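# Tokenize each tweet with NLTK's TweetTokenizer
# The cell that defined the tokenizer is missing from this export; default settings are assumed
tweet_tokenizer = TweetTokenizer()
df['tweet_tokens'] = df['tweet_text'].apply(tweet_tokenizer.tokenize)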
print(df[['tweet_text', 'tweet_tokens']])

                                             tweet_text  \
0     .@wesley83 I have a 3G iPhone. After 3 hrs twe...   
1     @jessedee Know about @fludapp ? Awesome iPad/i...   
2     @swonderlin Can not wait for #iPad 2 also. The...   
3     @sxsw I hope this year's festival isn't as cra...   
4     @sxtxstate great stuff on Fri #SXSW: Marissa M...   
...                                                 ...   
9088                      Ipad everywhere. #SXSW {link}   
9089  Wave, buzz... RT @mention We interrupt your re...   
9090  Google's Zeiger, a physician never reported po...   
9091  Some Verizon iPhone customers complained their...   
9092  Ï¡Ïàü_ÊÎÒ£Áââ_£â_ÛâRT @...   

                                           tweet_tokens  
0     [., @wesley83, I, have, a, 3G, iPhone, ., Afte...  
1     [@jessedee, Know, about, @fludapp, ?, Awesome,...  
2     [@swonderlin, Can, not, wait, for, #iPad, 2, a...  
3     [@sxsw, I, hope, this, year's, festival, isn't...  
4

In [54]:
print(df['tweet_tokens'].apply(type).unique())  # Should only show <class 'list'>

[<class 'list'>]


In [55]:
# Getting rid of stop words: load NLTK's English stop-word list
stop_words = set(stopwords.words('english'))

In [57]:
# Lowercase all tokens in tweet_tokens
df['filtered'] = df['tweet_tokens'].apply(lambda tokens: [token.lower() for token in tokens])

In [60]:
df['filtered'].head()

0    [., @wesley83, i, have, a, 3g, iphone, ., afte...
1    [@jessedee, know, about, @fludapp, ?, awesome,...
2    [@swonderlin, can, not, wait, for, #ipad, 2, a...
3    [@sxsw, i, hope, this, year's, festival, isn't...
4    [@sxtxstate, great, stuff, on, fri, #sxsw, :, ...
Name: filtered, dtype: object

In [61]:
# Remove stop words from the lowercased 'filtered' column
df['filtered_tokens'] = df['filtered'].apply(
    lambda tokens: [token for token in tokens if token not in stop_words]
)

# Check the result
print(df[['tweet_tokens', 'filtered_tokens']].head())

                                        tweet_tokens  \
0  [., @wesley83, I, have, a, 3G, iPhone, ., Afte...   
1  [@jessedee, Know, about, @fludapp, ?, Awesome,...   
2  [@swonderlin, Can, not, wait, for, #iPad, 2, a...   
3  [@sxsw, I, hope, this, year's, festival, isn't...   
4  [@sxtxstate, great, stuff, on, Fri, #SXSW, :, ...   

                                     filtered_tokens  
0  [., @wesley83, 3g, iphone, ., 3, hrs, tweeting...  
1  [@jessedee, know, @fludapp, ?, awesome, ipad, ...  
2  [@swonderlin, wait, #ipad, 2, also, ., sale, #...  
3  [@sxsw, hope, year's, festival, crashy, year's...  
4  [@sxtxstate, great, stuff, fri, #sxsw, :, mari...  
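The output above shows that the default stop-word list also removes negations: "Can not wait" is reduced to `wait`, and "isn't" disappears from the fourth tweet. For sentiment work those words may carry signal. A minimal sketch of one way to keep them follows; the small `stop_words` set, the `negations` set, and the helper `remove_stop_words` are hypothetical stand-ins, and which words to keep is an assumption to be checked against model performance.

```python
# Hypothetical miniature stop-word set; the notebook uses stopwords.words('english')
stop_words = {"i", "can", "not", "for", "this", "isn't", "as", "a", "have"}

# Assumption: these negation words carry sentiment and should be kept
negations = {"not", "no", "isn't"}
sentiment_stop_words = stop_words - negations


def remove_stop_words(tokens, stop_words):
    """Lowercase the tokens, then drop any that appear in the stop-word set."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in stop_words]


tokens = ["@swonderlin", "Can", "not", "wait", "for", "#iPad", "2"]

default_result = remove_stop_words(tokens, stop_words)
negation_aware = remove_stop_words(tokens, sentiment_stop_words)

print(default_result)   # negation removed
print(negation_aware)   # negation kept
```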


In [62]:
# Strip non-word characters (\W keeps letters, digits, and underscores) and drop tokens left empty
df['clean_tokens'] = df['filtered_tokens'].apply(
    lambda tokens: [re.sub(r'\W+', '', token) for token in tokens if re.sub(r'\W+', '', token)]
)

print(df[['filtered_tokens', 'clean_tokens']].head())

                                     filtered_tokens  \
0  [., @wesley83, 3g, iphone, ., 3, hrs, tweeting...   
1  [@jessedee, know, @fludapp, ?, awesome, ipad, ...   
2  [@swonderlin, wait, #ipad, 2, also, ., sale, #...   
3  [@sxsw, hope, year's, festival, crashy, year's...   
4  [@sxtxstate, great, stuff, fri, #sxsw, :, mari...   

                                        clean_tokens  
0  [wesley83, 3g, iphone, 3, hrs, tweeting, rise_...  
1  [jessedee, know, fludapp, awesome, ipad, iphon...  
2      [swonderlin, wait, ipad, 2, also, sale, sxsw]  
3  [sxsw, hope, years, festival, crashy, years, i...  
4  [sxtxstate, great, stuff, fri, sxsw, marissa, ...  
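Two side effects of the `\W+` pattern are visible in the output above: apostrophes are stripped (`year's` becomes `years`), and underscores survive because `_` counts as a word character in Python regular expressions. The sketch below compares the notebook's pattern with a variant that also removes underscores. The helper `clean_tokens`, the sample tokens, and the choice to treat underscores as noise are all assumptions made for illustration.

```python
import re


# Hypothetical helper; the pattern argument lets the two variants be compared
def clean_tokens(tokens, pattern=r"\W+"):
    """Strip characters matching `pattern` and drop tokens left empty."""
    cleaned = (re.sub(pattern, "", token) for token in tokens)
    return [token for token in cleaned if token]


# Made-up tokens covering the cases of interest
tokens = [".", "@wesley83", "3g", "#some_tag", "year's", "{", "link", "}"]

# Pattern used in the notebook: underscores are kept
original = clean_tokens(tokens)

# Variant (assumption: underscores are noise too)
no_underscore = clean_tokens(tokens, r"[\W_]+")

print(original)
print(no_underscore)
```

Note that stripping `@` and `#` merges mentions and hashtags with ordinary words, so `#ipad` and `ipad` become the same token. Whether that is desirable depends on the model.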


In [63]:
df['filtered_tokens'].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas\_libs\hashtable_class_helper.pxi", line 1709, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[rt, @mention, marissa, mayer, :, google, connect, digital, &, physical, worlds, mobile, -, {, link, }, #sxsw]                                             9
[rt, @mention, google, launch, major, new, social, network, called, circles, ,, possibly, today, {, link, }, #sxsw]                                        9
[win, free, ipad, 2, webdoc.com, #sxsw, rt]                                                                                                                7
[google, launch, major, new, social, network, called, circles, ,, possibly, today, {, link, }, #sxsw]                                                      6
[rt, @mention, rumor, :, apple, opening, temporary, store, downtown, austin, #sxsw, ipad, 2, launch, {, link, }]                                           4
                                                                                                                                                          ..
[watching, twit, live, #sxsw, ,, want, go, next, year, ., 
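The `TypeError` above arises because counting requires hashing each value, and Python lists are mutable and therefore unhashable. Converting each token list to a tuple first avoids the error. The sketch below uses `collections.Counter` from the standard library rather than `value_counts()`, on hypothetical token lists standing in for `df['filtered_tokens']`. It also shows per-token counts, which is often the more useful frequency for text data.

```python
from collections import Counter

# Hypothetical token lists standing in for df['filtered_tokens']
token_lists = [
    ["win", "free", "ipad", "2", "#sxsw"],
    ["google", "launch", "circles", "#sxsw"],
    ["win", "free", "ipad", "2", "#sxsw"],
]

# Tuples are hashable, so whole tweets can be counted as units
tweet_counts = Counter(tuple(tokens) for tokens in token_lists)
most_common_tweet, n = tweet_counts.most_common(1)[0]
print(most_common_tweet, n)

# Counting individual tokens across all tweets
token_counts = Counter(token for tokens in token_lists for token in tokens)
print(token_counts.most_common(3))
```

In pandas, joining each list into a single string before counting, for example `df['filtered_tokens'].apply(' '.join).value_counts()`, is another option that keeps the values hashable.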

In [None]:

## Modeling

In [None]:
## Evaluation
