# EDA Notebook

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



In [4]:
raw_data = pd.read_csv('data/judge-1377884607_tweet_product_company.csv', encoding='unicode_escape')

In [5]:
raw_data.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


## Exploring NaN Values

The `info()` summary below shows that the 'emotion_in_tweet_is_directed_at' column is mostly missing: only 3,291 of 9,093 rows have a value.

In [6]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB
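The non-null counts above can also be turned into explicit missing-value counts and shares. The following is a minimal sketch on a toy frame (the `toy` data is made up for illustration and only borrows the real column names); on the actual data the same two calls would be applied to `raw_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_data; the values are made up for illustration.
toy = pd.DataFrame({
    "tweet_text": ["tweet a", "tweet b", None, "tweet d"],
    "emotion_in_tweet_is_directed_at": ["iPad", np.nan, np.nan, np.nan],
})

# Number of missing values per column.
missing_counts = toy.isna().sum()

# Fraction of rows that are missing per column.
missing_share = toy.isna().mean()

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```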


The tweets are about Apple and Google and their more specific products. The NaN values likely belong to tweets about something other than these products, but let's investigate further.

In [7]:
raw_data['emotion_in_tweet_is_directed_at'].value_counts()

iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: emotion_in_tweet_is_directed_at, dtype: int64

In [9]:
raw_data[raw_data['emotion_in_tweet_is_directed_at'].isna()].head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
6,,,No emotion toward brand or product
16,Holler Gram for iPad on the iTunes App Store -...,,No emotion toward brand or product
32,"Attn: All #SXSW frineds, @mention Register fo...",,No emotion toward brand or product
33,Anyone at #sxsw want to sell their old iPad?,,No emotion toward brand or product


In [10]:
nan_rows = raw_data[raw_data['emotion_in_tweet_is_directed_at'].isna()]
nan_rows['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

No emotion toward brand or product    5298
Positive emotion                       306
I can't tell                           147
Negative emotion                        51
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
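A cross-tabulation gives the same breakdown for rows with and without a listed target in a single table. This is a sketch on made-up rows, not the notebook's own method; the `toy` frame and the `target_status` name are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

target = "emotion_in_tweet_is_directed_at"
label = "is_there_an_emotion_directed_at_a_brand_or_product"

# Toy stand-in for raw_data; the rows are made up for illustration.
toy = pd.DataFrame({
    target: ["iPad", np.nan, np.nan, "Google", np.nan],
    label: [
        "Positive emotion",
        "No emotion toward brand or product",
        "Positive emotion",
        "Negative emotion",
        "I can't tell",
    ],
})

# Mark each row as having a missing or present target (hypothetical helper column).
target_status = (
    toy[target].isna().map({True: "missing", False: "present"}).rename("target_status")
)

# Rows: target missing or present. Columns: sentiment label. Cells: tweet counts.
table = pd.crosstab(target_status, toy[label])
print(table)
```

On the real data, the "missing" row of such a table should match the counts shown above.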

#### Positive Emotion NaN

Let's look at some example tweets with positive emotion, but without a listed product or brand.

In [17]:
positive_nan = nan_rows[nan_rows['is_there_an_emotion_directed_at_a_brand_or_product']=='Positive emotion']
positive_nan['tweet_text']

46      Hand-Held Û÷HoboÛª: Drafthouse launches Û÷H...
112     Spark for #android is up for a #teamandroid aw...
131     Does your #SmallBiz need reviews to play on Go...
157     @mention  #SXSW LonelyPlanet Austin guide for ...
337     First day at sxsw.  Fun final presentation on ...
                              ...                        
8898    @mention What's the wait time lookin like? The...
9011    apparently the line to get an iPad at the #sxs...
9049    @mention you can buy my used iPad and I'll pic...
9052    @mention You could buy a new iPad 2 tmrw at th...
9054    Guys, if you ever plan on attending #SXSW, you...
Name: tweet_text, Length: 306, dtype: object

Some appear to be tweets about things only loosely related to the brands, such as apps available on their platforms.

In [23]:
positive_nan.iloc[0, 0]

'Hand-Held \x89Û÷Hobo\x89Ûª: Drafthouse launches \x89Û÷Hobo With a Shotgun\x89Ûª iPhone app #SXSW {link}'

In [29]:
positive_nan.iloc[20, 0]

'@mention  Its bigger than an iphone and smaller than a PC, so good for big events like #SXSW and meeti? {link}'

Others are cases of people using "Google" as a verb.

In [25]:
positive_nan.iloc[5, 0]

'&quot;You can Google Canadian Tuxedo and lose yourself for hours&quot; #sxsw'

Some do seem to be about Google or Apple products but aren't classified as such.

In [27]:
positive_nan.iloc[10, 0]

'\x89ÛÏ@mention Google to Launch New Social Network Called Circles, Today? {link} #sxsw\x89Û\x9d Applause for a new privacy model attempt'

In [28]:
positive_nan.iloc[15, 0]

'Catch 22\x89Û_ I mean iPad 2 at #SXSW - {link} #apple #ipad2'
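Tweets like these, which name a brand or product but have no listed target, could be surfaced with a simple keyword search. The sketch below assumes a hypothetical keyword list (`BRAND_KEYWORDS`) and runs on made-up rows; it is one possible approach, not something the notebook itself does.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical keyword list, chosen for illustration only.
BRAND_KEYWORDS = ["google", "apple", "ipad", "iphone", "android"]
pattern = "|".join(re.escape(keyword) for keyword in BRAND_KEYWORDS)

# Toy stand-in for the rows with a missing target; the texts are made up.
toy = pd.DataFrame({
    "tweet_text": [
        "Catch 22 I mean iPad 2 at #SXSW",
        "First day at sxsw. Fun final presentation",
        None,
    ],
    "emotion_in_tweet_is_directed_at": [np.nan, np.nan, np.nan],
})

# Case-insensitive match; a missing tweet text counts as "no mention".
mentions_brand = toy["tweet_text"].str.contains(pattern, case=False, na=False)

# Rows that mention a keyword but have no listed target.
unlabeled_mentions = toy[toy["emotion_in_tweet_is_directed_at"].isna() & mentions_brand]
print(unlabeled_mentions["tweet_text"].tolist())
```

A plain keyword match cannot tell "Google" the company from "Google" used as a verb, so the flagged rows would still need manual review.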

#### No Emotion NaN

Let's look at some of the no-emotion examples to see whether they share any interesting features.

In [30]:
no_emotion_nan = nan_rows[nan_rows['is_there_an_emotion_directed_at_a_brand_or_product']=='No emotion toward brand or product']

In [31]:
no_emotion_nan.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
6,,,No emotion toward brand or product
16,Holler Gram for iPad on the iTunes App Store -...,,No emotion toward brand or product
32,"Attn: All #SXSW frineds, @mention Register fo...",,No emotion toward brand or product
33,Anyone at #sxsw want to sell their old iPad?,,No emotion toward brand or product


All of these tweets seem to allude to the products at least vaguely, or to contain related keywords; none appears to be completely disconnected from the products or brands. This class seems to collect tweets that are emotionally neutral or ambiguous, or that are not clearly about a brand even though they contain brand keywords.

In [32]:
no_emotion_nan.iloc[0,0]

'@teachntech00 New iPad Apps For #SpeechTherapy And Communication Are Showcased At The #SXSW Conference http://ht.ly/49n4M #iear #edchat #asd'

In [33]:
no_emotion_nan.iloc[5,0]

'Anyone at  #SXSW who bought the new iPad want to sell their older iPad to me?'

In [34]:
no_emotion_nan.iloc[10,0]

'Hey #SXSW - How long do you think it takes us to make an iPhone case? answer @mention using #zazzlesxsw and we\x89Ûªll make you one!'

In [35]:
no_emotion_nan.iloc[20,0]

'{link} RT @mention Those at #SXSW check out the Holler Gram ipad app from @mention  {link}'

All of these tweets also carry the #SXSW hashtag, probably because the data was collected from tweets that were likely to mention brand products. One potential limitation is that a classifier trained on this data may be biased toward assuming a tweet is about a brand, since tweets in general are much less likely to be. If the classifier is used in practice, it makes sense to apply it only to prefiltered data.
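The hashtag observation can be checked directly instead of by eye. This is a minimal sketch on a made-up series (the `toy` texts are illustrative); on the real data the same call would be applied to `raw_data['tweet_text']`.

```python
import pandas as pd

# Toy stand-in for raw_data['tweet_text']; the texts are made up for illustration.
toy = pd.Series(
    [
        "Anyone at #sxsw want to sell their old iPad?",
        "great stuff on Fri #SXSW",
        "First day at sxsw.",
        None,
    ],
    name="tweet_text",
)

# Case-insensitive check for the hashtag; missing text counts as no hashtag.
has_hashtag = toy.str.contains("#sxsw", case=False, na=False)

# Share of tweets that carry the hashtag.
share_with_hashtag = has_hashtag.mean()
print(share_with_hashtag)
```

A share below 1.0 on the real data would mean some tweets mention the festival without the hashtag, or not at all.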

### Missing tweet text?

One row has no tweet text, presumably by accident.

In [36]:
raw_data[raw_data['tweet_text'].isna()]

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
6,,,No emotion toward brand or product
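One possible way to handle this row is to drop it before any text processing, since there is nothing to analyze. The sketch below uses a made-up frame; whether to drop the row is a design choice the notebook does not state.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_data; the rows are made up for illustration.
toy = pd.DataFrame({
    "tweet_text": ["tweet one", np.nan, "tweet three"],
    "is_there_an_emotion_directed_at_a_brand_or_product": [
        "Positive emotion",
        "No emotion toward brand or product",
        "Negative emotion",
    ],
})

# Drop only rows whose tweet text is missing; other NaN columns are left alone.
cleaned = toy.dropna(subset=["tweet_text"]).reset_index(drop=True)
print(len(toy), "->", len(cleaned))
```

Restricting `dropna` to `subset=["tweet_text"]` matters here, because most rows have a missing target and would otherwise be dropped as well.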


## Labels

There are currently four label types.

In [8]:
raw_data['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
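The counts above are easier to compare as proportions. The sketch below rebuilds them from the output shown, so it runs without the CSV; on the real data, `value_counts(normalize=True)` on the label column would give the same shares directly.

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
label_counts = pd.Series({
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
})

# Share of each label among all tweets, rounded for readability.
label_share = (label_counts / label_counts.sum()).round(3)
print(label_share)
```

The classes are clearly imbalanced: roughly 59% of tweets carry no emotion toward a brand, while negative tweets make up only about 6%.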