# Sentiment Analysis

#### The objective of this assignment is to perform sentiment analysis on data collected from Twitter.
#### The data consists of tweets from multiple users about multiple mobile devices.
#### A model is trained to predict the emotion expressed in each tweet (textual data).

## Question 1

### Read the data
- read tweets.csv
- use latin1 encoding if loading raises an encoding error

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
df = pd.read_csv("tweets.csv",encoding='latin1')
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [3]:
df.shape

(9093, 3)
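
The encoding fallback mentioned in the read step can be sketched as a small helper. This is a minimal sketch, not part of the assignment: the function name `read_csv_with_fallback` and the default encoding order are assumptions made for illustration.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in turn and return the first DataFrame that loads."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate encoding.
            last_error = err
    raise last_error

# Hypothetical usage with the file from this notebook:
# df = read_csv_with_fallback("tweets.csv")
```

Because latin1 maps every byte to some character, it never raises a decode error, which is why it works as the last fallback; the cost is that text in another encoding may load as garbled characters.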

### Drop null values
- Check for null values
- drop rows with null values

In [4]:
df.isnull().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [5]:
df['emotion_in_tweet_is_directed_at'] = df['emotion_in_tweet_is_directed_at'].fillna('NA')

In [6]:
df.isnull().sum()

tweet_text                                            1
emotion_in_tweet_is_directed_at                       0
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64

In [7]:
df.dropna(inplace=True)

In [8]:
df.isnull().sum()

tweet_text                                            0
emotion_in_tweet_is_directed_at                       0
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64
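
The null handling above fills the product column and then drops whatever rows remain incomplete. An equivalent and slightly more explicit variant is sketched below on a made-up three-row frame: it drops only the rows that lack tweet text, then fills the product column. The toy data is an assumption for illustration.

```python
import pandas as pd

# Toy frame (made up): one row has no tweet text, two rows have no product.
toy = pd.DataFrame({
    "tweet_text": ["good phone", None, "bad app"],
    "emotion_in_tweet_is_directed_at": ["iPhone", None, None],
})

# Drop only the rows where the tweet text itself is missing.
cleaned = toy.dropna(subset=["tweet_text"]).copy()

# Keep rows without a product by marking them with the placeholder 'NA'.
cleaned["emotion_in_tweet_is_directed_at"] = (
    cleaned["emotion_in_tweet_is_directed_at"].fillna("NA")
)

print(cleaned.isnull().sum())
```

Naming the column in `subset` makes the intent visible: a missing product is acceptable, a missing tweet is not.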

### Print the dataframe
- print initial 5 rows of the data

In [9]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [10]:
df['emotion_in_tweet_is_directed_at'].unique()

array(['iPhone', 'iPad or iPhone App', 'iPad', 'Google', 'NA', 'Android',
       'Apple', 'Android App', 'Other Google product or service',
       'Other Apple product or service'], dtype=object)

In [11]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

## Question 2

### Preprocess data
- convert all text to lowercase - use .lower()
- keep only digits, letters, spaces, and the characters #, +, _ - use re.sub()
- strip the text - use .strip()
    - this removes leading and trailing spaces

In [12]:
import unicodedata
def preprocesstweet(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r'https?://\S+', '', tweet)   # Remove hyperlinks
    tweet = re.sub('[^0-9a-z #+_]', '', tweet)   # Keep only digits, lowercase letters, spaces, and #+_
    tweet = unicodedata.normalize('NFKD', tweet).encode('ascii', 'ignore').decode('utf-8', 'ignore') # Strip accented characters (no effect here: the filter above has already dropped all non-ASCII characters)
    tweet = tweet.strip()
    return tweet

In [13]:
#df=df.applymap(lambda s:s.lower())
#df=df.applymap(lambda s:re.sub('[^0-9a-z #+_]', '', s))
#df=df.applymap(lambda s:s.strip())

In [14]:
preprocesstweet("Find &amp; Start Impromptu Parties at #SXSW With @HurricaneParty http://bit.ly/gVLrIn I can't wait til the Android app comes out.")

'find amp start impromptu parties at #sxsw with hurricaneparty  i cant wait til the android app comes out'

In [15]:
preprocesstweet('‰÷¼ WHAT? ‰÷_ {link} ‰ã_ #edchat #musedchat #sxsw #sxswi #classical #newTwitter')

'what _ link _ #edchat #musedchat #sxsw #sxswi #classical #newtwitter'
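
In the cleaning order used above, the character filter runs before the Unicode normalization, so an accented letter such as é is deleted before it can be mapped to e. The sketch below is a hypothetical variant, not the function used in this notebook: it normalizes first and also collapses repeated spaces. The name `preprocess_tweet_normalized` is an assumption, and adopting this variant would change the cleaned text and therefore the vocabulary built later.

```python
import re
import unicodedata

def preprocess_tweet_normalized(tweet):
    """Hypothetical variant: transliterate accents before filtering characters."""
    tweet = tweet.lower()
    tweet = re.sub(r'https?://\S+', '', tweet)          # remove hyperlinks
    # Decompose accented characters, then drop the non-ASCII combining marks.
    tweet = unicodedata.normalize('NFKD', tweet).encode('ascii', 'ignore').decode('ascii')
    tweet = re.sub(r'[^0-9a-z #+_]', '', tweet)         # keep digits, letters, spaces, #+_
    tweet = re.sub(r'\s+', ' ', tweet)                  # collapse runs of spaces
    return tweet.strip()

cleaned_example = preprocess_tweet_normalized("Caf\u00e9 wifi at #SXSW http://bit.ly/x is great!")
print(cleaned_example)  # cafe wifi at #sxsw is great
```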

In [16]:
df['tweet_text'] = df['tweet_text'].apply(preprocesstweet)

### Print the dataframe

In [17]:
df.head(10)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweeti...,iPhone,Negative emotion
1,jessedee know about fludapp awesome ipadiphon...,iPad or iPhone App,Positive emotion
2,swonderlin can not wait for #ipad 2 also they ...,iPad,Positive emotion
3,sxsw i hope this years festival isnt as crashy...,iPad or iPhone App,Negative emotion
4,sxtxstate great stuff on fri #sxsw marissa may...,Google,Positive emotion
5,teachntech00 new ipad apps for #speechtherapy ...,NA,No emotion toward brand or product
7,#sxsw is just starting #ctia is around the cor...,Android,Positive emotion
8,beautifully smart and simple idea rt madebyman...,iPad or iPhone App,Positive emotion
9,counting down the days to #sxsw plus strong ca...,Apple,Positive emotion
10,excited to meet the samsungmobileus at #sxsw s...,Android,Positive emotion


## Question 3

### Preprocess data
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - select only the rows whose value is "Positive emotion" or "Negative emotion"
- find the value counts of "Positive emotion" and "Negative emotion"

In [18]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

No emotion toward brand or product    5388
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [19]:
df = df[df['is_there_an_emotion_directed_at_a_brand_or_product'].isin(['Positive emotion', 'Negative emotion'])].copy()  # .copy() avoids SettingWithCopyWarning on later assignments

In [20]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [21]:
df.shape

(3548, 3)

## Question 4

### Encode labels
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - change "Positive emotion" to 1
    - change "Negative emotion" to 0
- use map function to replace values

In [22]:
df['is_there_an_emotion_directed_at_a_brand_or_product'] = df['is_there_an_emotion_directed_at_a_brand_or_product'].map({'Positive emotion':1,'Negative emotion':0})

In [23]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

1    2978
0     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
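
One property of dictionary-based `map` is worth seeing on a toy example: any label missing from the dictionary becomes NaN, so a category that slipped through the filtering step would show up in a null check. The Series below is made up for illustration.

```python
import pandas as pd

label_map = {'Positive emotion': 1, 'Negative emotion': 0}

# Toy labels (made up); the last one is deliberately absent from label_map.
toy = pd.Series(['Positive emotion', 'Negative emotion', "I can't tell"])

encoded = toy.map(label_map)
print(encoded.tolist())      # the unmapped label appears as nan
print(encoded.isna().sum())  # number of labels that had no mapping
```

In this notebook the frame was already filtered to the two emotion labels, so no NaN values are expected after encoding.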

## Question 5

### Get feature and label
- get column "tweet_text" as feature
- get column "is_there_an_emotion_directed_at_a_brand_or_product" as label

In [24]:
X = df.tweet_text
Y = df.is_there_an_emotion_directed_at_a_brand_or_product

### Create train and test data
- use train_test_split to get train and test set
- set a random_state
- test_size: 0.25

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=5)

In [26]:
X_train.shape,X_test.shape

((2661,), (887,))

In [27]:
X_train

7714                #sxsw android users were proud of you
7963    mention saw you awesome educational iphone mus...
8619    82 of future tablet buyers say theyll choose a...
6611    rt mention rt mention the day bank of america ...
8358    dont bite any ears rt mention ill be at the au...
                              ...                        
8153    i think #sxsw has taken it upon itself to make...
7769    awesome rt mention heading to austin for #sxsw...
4178    never realized that there was a storymeaning b...
5548    rt mention before it even begins apple wins #s...
7447    add the apple popup shop to your #sxsw mention...
Name: tweet_text, Length: 2661, dtype: object
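
The labels here are imbalanced (2978 positive against 570 negative), and a plain random split does not guarantee the same ratio in both parts. As an optional variation that the assignment does not ask for, `train_test_split` accepts a `stratify` argument that keeps the class proportions equal. The sketch uses made-up data with an 80/20 class ratio.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (made up): 80 positive and 20 negative examples.
texts = [f"tweet {i}" for i in range(100)]
labels = [1] * 80 + [0] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=5, stratify=labels
)

# Both parts keep the 4:1 ratio of the full data.
print(Counter(y_tr), Counter(y_te))
```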

## Question 6

### Vectorize data
- create document-term matrix
- use CountVectorizer()
    - ngram_range: (1, 2)
    - stop_words: 'english'
    - min_df: 2   
- do fit_transform on X_train
- do transform on X_test

In [28]:
# import and instantiate CountVectorizer with unigrams and bigrams, English stop words removed, and min_df=2
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(ngram_range=(1,2),stop_words='english',min_df=2)

In [29]:
cvect.fit(X_train)
len(cvect.vocabulary_)

5927

In [30]:
cvect.vocabulary_

{'sxsw': 4866,
 'android': 229,
 'users': 5538,
 'sxsw android': 4877,
 'mention': 3219,
 'saw': 4380,
 'awesome': 505,
 'iphone': 2582,
 'music': 3524,
 'app': 270,
 'mention saw': 3360,
 'future': 1766,
 'tablet': 5128,
 'buyers': 772,
 'say': 4382,
 'theyll': 5261,
 'apples': 382,
 'ipad': 2437,
 'popup': 3944,
 'store': 4738,
 'link': 2924,
 'apples ipad': 384,
 'ipad popup': 2511,
 'popup store': 3949,
 'store sxsw': 4773,
 'sxsw link': 4974,
 'rt': 4327,
 'day': 1172,
 'bank': 540,
 'america': 195,
 'launched': 2833,
 'got': 1976,
 '250k': 43,
 'new': 3578,
 'customers': 1146,
 'bankinnovation': 544,
 'rt mention': 4329,
 'mention rt': 3357,
 'mention day': 3257,
 'day bank': 1175,
 'bank america': 541,
 'america launched': 196,
 'launched iphone': 2834,
 'iphone app': 2586,
 'app got': 285,
 'got 250k': 1977,
 '250k new': 44,
 'new customers': 3585,
 'customers bankinnovation': 1147,
 'bankinnovation sxsw': 545,
 'dont': 1334,
 'ears': 1406,
 'ill': 2302,
 'austin': 456,
 'conve

In [31]:
X_train_ct = cvect.transform(X_train)

In [32]:
X_train

7714                #sxsw android users were proud of you
7963    mention saw you awesome educational iphone mus...
8619    82 of future tablet buyers say theyll choose a...
6611    rt mention rt mention the day bank of america ...
8358    dont bite any ears rt mention ill be at the au...
                              ...                        
8153    i think #sxsw has taken it upon itself to make...
7769    awesome rt mention heading to austin for #sxsw...
4178    never realized that there was a storymeaning b...
5548    rt mention before it even begins apple wins #s...
7447    add the apple popup shop to your #sxsw mention...
Name: tweet_text, Length: 2661, dtype: object

In [33]:
print(X_train_ct.shape)

(2661, 5927)


In [34]:
X_test_ct = cvect.transform(X_test)

In [35]:
print(X_test_ct.shape)

(887, 5927)
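
To make the effect of `ngram_range`, `stop_words`, and `min_df` concrete, here is a small sketch on three made-up sentences. The corpus and variable names are assumptions for illustration; only terms that occur in at least two documents survive `min_df=2`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration).
docs = [
    "the iphone app is great",
    "the iphone app crashed",
    "the android app is great",
]

toy_vect = CountVectorizer(ngram_range=(1, 2), stop_words='english', min_df=2)
toy_matrix = toy_vect.fit_transform(docs)

# Stop words ("the", "is") are removed before bigrams are built,
# so "app great" counts as a bigram in the first and third sentences.
print(sorted(toy_vect.vocabulary_))
print(toy_matrix.shape)

# Unseen text is projected onto the vocabulary learned from docs;
# words outside that vocabulary are ignored.
unseen = toy_vect.transform(["great android tablet"])
print(unseen.shape)
```

Fitting on the training set only, as done above with `X_train`, keeps the test tweets from influencing the vocabulary.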


## Question 7

### Select classifier logistic regression
- use logistic regression for predicting sentiment of the given tweet
- initialize classifier

In [36]:
from sklearn.linear_model import LogisticRegression

In [37]:
model=LogisticRegression()

### Fit the classifier
- fit logistic regression classifier

In [38]:
model.fit(X_train_ct, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Question 8

### Select classifier naive bayes
- use naive bayes for predicting sentiment of the given tweet
- initialize classifier
- use MultinomialNB

In [39]:
from sklearn.naive_bayes import MultinomialNB

In [40]:
mnb = MultinomialNB()

### Fit the classifier
- fit naive bayes classifier

In [41]:
mnb.fit(X_train_ct, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Question 9

### Make predictions on logistic regression
- use your trained logistic regression model to make predictions on X_test

In [42]:
y_pred_model = model.predict(X_test_ct)

### Make predictions on naive bayes
- use your trained naive bayes model to make predictions on X_test
- use a different variable name to store predictions so that they are kept separately

In [43]:
y_pred_mnb = mnb.predict(X_test_ct)
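
The vectorize, fit, and predict steps can also be chained with scikit-learn's `Pipeline`, which lets the model accept raw text directly. This is a sketch of an alternative arrangement, not the approach used in this notebook; the four training sentences and the name `sentiment_pipeline` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (made up): 1 = positive, 0 = negative.
train_texts = [
    "love this iphone",
    "great app love it",
    "hate this battery",
    "terrible app hate it",
]
train_labels = [1, 1, 0, 0]

sentiment_pipeline = Pipeline([
    ("vect", CountVectorizer()),   # text -> document-term matrix
    ("clf", MultinomialNB()),      # document-term matrix -> label
])
sentiment_pipeline.fit(train_texts, train_labels)

# predict() vectorizes the raw strings with the fitted vocabulary first.
toy_predictions = sentiment_pipeline.predict(["love it", "hate it"])
print(toy_predictions)
```

With a pipeline, `fit` only ever sees the training texts and `predict` reuses the fitted vocabulary, so the separate `transform` calls cannot be applied to the wrong split by accident.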

## Question 10

### Calculate accuracy of logistic regression
- check accuracy of logistic regression classifier
- use sklearn.metrics.accuracy_score

In [44]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_model)

0.8838782412626832

### Calculate accuracy of naive bayes
- check accuracy of naive bayes classifier
- use sklearn.metrics.accuracy_score

In [45]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_mnb)

0.8613303269447576
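
Accuracy alone can flatter a model on imbalanced labels. After filtering, the data holds 2978 positive and 570 negative tweets, so always predicting "positive" would score roughly 0.84 on the full filtered set (2978 / 3548). The sketch below, with made-up toy labels and a hypothetical helper named `summarize`, shows how per-class recall and a confusion matrix expose that situation.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

def summarize(y_true, y_pred):
    """Return accuracy, per-class recall, and the confusion matrix."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall_negative": recall_score(y_true, y_pred, pos_label=0),
        "recall_positive": recall_score(y_true, y_pred, pos_label=1),
        # Rows are true classes, columns are predicted classes, ordered [0, 1].
        "confusion": confusion_matrix(y_true, y_pred, labels=[0, 1]).tolist(),
    }

# Toy labels (made up): 8 positive, 2 negative, and a "model" that always says positive.
toy_true = [1] * 8 + [0] * 2
toy_pred = [1] * 10

report = summarize(toy_true, toy_pred)
print(report)
```

The toy model reaches 0.8 accuracy while finding none of the negative examples. Calling `summarize(y_test, y_pred_model)` and `summarize(y_test, y_pred_mnb)` would give the same breakdown for the two classifiers trained above.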