# Classifying Tweet Emotions

In [5]:
# Import necessary modules
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

pd.set_option('display.max_colwidth', 160)

## Understanding Data

In [6]:
# Read data
df = pd.read_csv('data/text_emotion.csv')

In [7]:
# Print the head of the dataset
df.head()

Unnamed: 0,tweet_id,sentiment,author,content
0,1956967341,empty,xoshayzers,@tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =[
1,1956967666,sadness,wannamama,Layin n bed with a headache ughhhh...waitin on your call...
2,1956967696,sadness,coolfunky,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,czareaquino,wants to hang out with friends SOON!
4,1956968416,neutral,xkilljoyx,"@dannycastillo We want to trade with someone who has Houston tickets, but no one will."


In [14]:
# Print the tail of the dataset
df.tail()

Unnamed: 0,tweet_id,sentiment,author,content
39995,1753918954,neutral,showMe_Heaven,@JohnLloydTaylor
39996,1753919001,love,drapeaux,Happy Mothers Day All my love
39997,1753919005,love,JenniRox,"Happy Mother's Day to all the mommies out there, be you woman or man as long as you're 'momma' to someone this is your day!"
39998,1753919043,happiness,ipdaman1,@niariley WASSUP BEAUTIFUL!!! FOLLOW ME!! PEEP OUT MY NEW HIT SINGLES WWW.MYSPACE.COM/IPSOHOT I DEF. WAT U IN THE VIDEO!!
39999,1753919049,love,Alpharalpha,@mopedronin bullet train from tokyo the gf and i have been visiting japan since thursday vacation/sightseeing gaijin godzilla


In [16]:
# Take a random sample to investigate
df.sample(10)

Unnamed: 0,tweet_id,sentiment,author,content
33304,1752564774,love,zoemoon,"Happy Mothers Day all! Hugs and love, Zoe"
28641,1750929126,enthusiasm,ivanaruggiero,i decided that myspacee is wayy better
25737,1695190657,happiness,joshjones_75,"One final down, two to go!"
24267,1694792741,happiness,SimplePlan2k8,"Oh, and now Mondays also mean new American Dad! So glad I watched that show, so funny, and it makes Mondays even better"
25048,1695001108,neutral,jasonaltenburg,"I'm going to be doing the FAFSA form today. I hope to help out in the Ann Arbor / Detroit Metro Area with computers, art, and design."
2884,1957647362,worry,banksy34,I have a broken wrist
13142,1963906866,hate,kencamp,@SherylBreuker I hate that Costco always costs us so much money
13419,1964030288,surprise,thagolden1,everyone has left me
26521,1695441898,hate,sunniesosweet,@JuliusLionheart y r we giving up on people? I actually thought that way for a while too before I found someone who is very intriguing
29262,1751143124,neutral,ozigal72,@BrianMcnugget nothing beats nurofen plus!


Many of the tweets contain misspellings, slang, and stretched words (for example "listenin", "ughhhh", "myspacee", "wayy"). This is a problem for the analysis and needs to be taken care of.
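
One way to take care of this might be a light normalization pass over `content`. The sketch below is only an illustration under my own assumptions, not a step this notebook has run: `clean_tweet` is a hypothetical helper, and dropping @mentions and URLs or squeezing repeated letters are choices that may or may not help the classifier.

```python
import re

def clean_tweet(text):
    """Hypothetical helper: lowercase, drop @mentions and URLs, squeeze stretched letters."""
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)                   # remove @mentions
    text = re.sub(r"(https?://|www\.)\S+", " ", text)   # remove URLs
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)      # "ughhhh" -> "ughh"
    text = re.sub(r"\s+", " ", text)                    # collapse whitespace
    return text.strip()

# Two tweets taken from the outputs above
examples = [
    "Layin n bed with a headache ughhhh...waitin on your call...",
    "@niariley WASSUP BEAUTIFUL!!!",
]
for tweet in examples:
    print(clean_tweet(tweet))
```

On the real data this could be applied with `df['content'].apply(clean_tweet)`. Misspellings such as "listenin" are left untouched here; correcting them would need a separate step.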

In [8]:
# Print the structure of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 4 columns):
tweet_id     40000 non-null int64
sentiment    40000 non-null object
author       40000 non-null object
content      40000 non-null object
dtypes: int64(1), object(3)
memory usage: 1.2+ MB


In [19]:
# Print summary statistics of the data
df.describe()

Unnamed: 0,tweet_id
count,40000.0
mean,1845184000.0
std,118857900.0
min,1693956000.0
25%,1751431000.0
50%,1855443000.0
75%,1962781000.0
max,1966441000.0


In [21]:
# Describe "object" columns
df.describe(exclude="number")

Unnamed: 0,sentiment,author,content
count,40000,40000,40000
unique,13,33871,39827
top,neutral,MissxMarisa,I just received a mothers day card from my lovely daughter wishing my a happy mothers day http://tr.im/kWK9
freq,8638,23,14


In [13]:
# Print the number of unique values of tweet_id
df.tweet_id.nunique()

40000

In [18]:
# Print the number of unique values of author
df.author.nunique()

33871

In [24]:
# Count the tweets in each sentiment class
df.sentiment.value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64

## Initial Findings

### Features and the Target

`sentiment`, with 13 emotion categories, is the target, and `content` is the feature.
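
The 13 classes are far from balanced, which is easier to see as shares than as raw counts. The sketch below rebuilds the counts from the `value_counts()` output above (so it runs without the CSV file) and derives each class share and the ratio between the largest and smallest class; the variable names are my own.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "neutral": 8638, "worry": 8459, "happiness": 5209, "sadness": 5165,
    "love": 3842, "surprise": 2187, "fun": 1776, "relief": 1526,
    "hate": 1323, "empty": 827, "enthusiasm": 759, "boredom": 179,
    "anger": 110,
}, name="sentiment")

# Share of the dataset taken by each class
shares = (counts / counts.sum()).round(4)
print(shares)

# How many times larger the biggest class is than the smallest
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 1))
```

With the DataFrame loaded, `df.sentiment.value_counts(normalize=True)` gives the same shares directly.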

### Further Information

We could also use `author` as a feature, since some people may tend to express the same type of emotion on Twitter. However, with 33,871 distinct authors across 40,000 tweets, most authors tweeted only once in the dataset at hand. Therefore, I won't use it as a feature.
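
A quick way to check the "most authors tweeted only once" claim is to count tweets per author and look at the share with exactly one tweet. The sketch below uses a small made-up `authors` Series as a stand-in for `df['author']`, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for df['author'] (made-up names, not from the dataset)
authors = pd.Series(["ann", "bob", "ann", "cy", "dee", "ann", "bob"], name="author")

# Number of tweets written by each author
tweets_per_author = authors.value_counts()

# Fraction of authors who appear exactly once
share_single = (tweets_per_author == 1).mean()

print(tweets_per_author)
print(share_single)
```

On the real data the same two lines would run on `df['author']` instead of the toy Series.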

`tweet_id` is unique, as expected. Since it is basically an index, we may drop it for the analysis. Some authors have more than one tweet in the dataset, so I'll keep the `author` column even though I won't use it as a feature.

Some values in the `content` column appear more than once: there are only 39,827 unique values among the 40,000 rows, so some distinct `tweet_id`s carry **identical** text.
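
To inspect these repeats, one option is to flag every row whose `content` occurs more than once and then check whether the copies agree on their label. The sketch below runs on a made-up `toy` frame with the same column names; the rows are illustrative assumptions, not records from the dataset.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (made-up rows)
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "sentiment": ["love", "happiness", "love", "neutral"],
    "content": ["Happy Mothers Day", "Happy Mothers Day",
                "Good morning", "Happy Mothers Day"],
})

# keep=False marks every copy of a repeated text, not only the later ones
dup_mask = toy.duplicated(subset="content", keep=False)

# For each repeated text, count how many different labels it received
labels_per_text = toy[dup_mask].groupby("content")["sentiment"].nunique()

print(toy[dup_mask])
print(labels_per_text)
```

Repeated texts with more than one label would be worth a closer look before training, since they give the model conflicting targets.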

There seem to be no missing values at all: `df.info()` reports 40,000 non-null entries for every column. I'll still investigate further for blank or otherwise wrong entries.
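
A non-null count does not rule out entries that are present but empty. A minimal sketch of such a check, assuming that whitespace-only strings should count as missing, is shown below on a made-up `toy` frame (the rows are illustrative, apart from the mention-only text seen in the tail output).

```python
import pandas as pd

# Toy frame: one normal text, one whitespace-only text, one true null
toy = pd.DataFrame({
    "sentiment": ["neutral", "empty", "love"],
    "content": ["@JohnLloydTaylor", "   ", None],
})

# True nulls per column (what df.info() already reports indirectly)
n_null = toy.isna().sum()

# Non-null entries that are empty after stripping whitespace
n_blank = toy["content"].dropna().str.strip().eq("").sum()

print(n_null)
print(n_blank)
```

Tweets that consist only of an @mention, such as `@JohnLloydTaylor`, pass both checks but carry little text to classify, so they may deserve a separate look.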

The data seem to be noisy: tweets contain misspellings, slang, and @mentions.