# Sentiment Analysis

#### The objective of this assignment is to perform sentiment analysis on data collected from Twitter.
#### The data consists of tweets from multiple users about multiple mobile devices.
#### Based on the emotion expressed in each tweet (textual data), a model is trained to predict that emotion.

## Question 1

### Read the data
- read tweets.csv
- if loading raises an encoding error, pass encoding='latin' (an alias for latin-1)
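
The encoding fallback described above can be sketched as a small helper. This is only an illustration: the function name `read_csv_with_fallback`, the list of encodings to try, and the throwaway demo file are assumptions, not part of the assignment.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in turn, return the first DataFrame that loads."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demo on a temporary file that contains a byte (0xE9) which is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as fh:
        fh.write("tweet_text,label\ncaf\u00e9 wifi is great,Positive emotion\n".encode("latin-1"))
    demo_df = read_csv_with_fallback(demo_path)

print(demo_df.shape)
```

Latin-1 maps every possible byte to a character, so it never raises a decode error. That makes it a convenient last resort, but it can silently mis-decode text that was actually written in another encoding.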

In [1]:
import pandas as pd

In [3]:
df=pd.read_csv('tweets.csv',encoding='latin')

In [10]:

print(df.shape)

(9093, 3)


### Drop null values
- Check for null values
- drop rows with null values
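
A minimal sketch of the null-handling steps on a made-up three-row frame (the toy rows are assumptions for illustration). It fills the optional brand column with a placeholder before dropping rows, so that only rows missing the text or the label are removed.

```python
import pandas as pd

LABEL_COL = "is_there_an_emotion_directed_at_a_brand_or_product"

# Toy data: one row lacks the tweet text, one lacks the brand.
toy = pd.DataFrame({
    "tweet_text": ["love my ipad", None, "battery died"],
    "emotion_in_tweet_is_directed_at": ["iPad", "iPhone", None],
    LABEL_COL: ["Positive emotion", "Negative emotion", "Negative emotion"],
})

# Count missing values per column.
null_counts = toy.isna().sum()

# Keep rows whose brand is unknown by filling a placeholder string.
toy["emotion_in_tweet_is_directed_at"] = toy["emotion_in_tweet_is_directed_at"].fillna("NA")

# Drop only rows that are missing the text or the label.
cleaned = toy.dropna(subset=["tweet_text", LABEL_COL])

print(null_counts)
print(cleaned.shape)
```

Calling `dropna()` with no `subset` on the unfilled frame would also discard every row whose brand is missing, which is why the placeholder is filled in first.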

In [11]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [5]:
df.isna().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [13]:
df['emotion_in_tweet_is_directed_at']=df['emotion_in_tweet_is_directed_at'].fillna('NA')

In [15]:
df.dropna(subset=['tweet_text','is_there_an_emotion_directed_at_a_brand_or_product'],inplace=True)

In [19]:
df.isna().sum()

tweet_text                                            0
emotion_in_tweet_is_directed_at                       0
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64

### Print the dataframe
- print initial 5 rows of the data

In [17]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


## Question 2

### Preprocess data
- convert all text to lowercase - use .lower()
- keep only digits, letters, spaces, and the characters # + _, replacing everything else with a space - use re.sub()
- strip all the text - use .strip()
    - this removes leading and trailing spaces
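
The three steps above can be sketched on a single string using only the standard library. The helper name `clean_text` is a hypothetical name chosen for this illustration.

```python
import re

# Anything that is not a digit, lowercase letter, space, '#', '+' or '_'.
_DISALLOWED = re.compile(r"[^0-9a-z #+_]")


def clean_text(text):
    """Lowercase, replace disallowed characters with a space, then strip the ends."""
    lowered = text.lower()
    replaced = _DISALLOWED.sub(" ", lowered)
    return replaced.strip()


example = clean_text(".@wesley83 I have a 3G iPhone.")
print(example)
```

Lowercasing must come first: the pattern only allows `a-z`, so uppercase letters would otherwise be replaced by spaces. Note also that `.strip()` only trims the ends, so runs of spaces inside the text are left as they are.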

In [22]:
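import re

# lowercase every cell (all three columns hold strings at this point)
# note: pandas 2.1+ deprecates DataFrame.applymap in favour of DataFrame.map
df=df.applymap(lambda s:s.lower())

# keep digits, lowercase letters, spaces and # + _; replace anything else with a space
df=df.applymap(lambda s: re.sub('[^0-9a-z #+_]'," ",s))

# strip leading and trailing whitespace
df=df.applymap(lambda s:s.strip())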

### Print dataframe

In [23]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweet...,iphone,negative emotion
1,jessedee know about fludapp awesome ipad ip...,ipad or iphone app,positive emotion
2,swonderlin can not wait for #ipad 2 also they...,ipad,positive emotion
3,sxsw i hope this year s festival isn t as cras...,ipad or iphone app,negative emotion
4,sxtxstate great stuff on fri #sxsw marissa ma...,google,positive emotion


## Question 3

### Preprocess data
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - select only those rows where value equal to "positive emotion" or "negative emotion"
- find the value counts of "positive emotion" and "negative emotion"
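
One way to express the row selection is with `Series.isin`, sketched here on made-up labels (the toy frame is an assumption for illustration).

```python
import pandas as pd

LABEL_COL = "is_there_an_emotion_directed_at_a_brand_or_product"

toy = pd.DataFrame({
    LABEL_COL: [
        "positive emotion",
        "negative emotion",
        "no emotion toward brand or product",
        "positive emotion",
        "i can t tell",
    ]
})

# Keep only the two classes of interest; .copy() gives an independent frame.
keep = ["positive emotion", "negative emotion"]
filtered = toy[toy[LABEL_COL].isin(keep)].copy()

counts = filtered[LABEL_COL].value_counts()
print(counts)
```

Taking a `.copy()` of the filtered frame avoids pandas' SettingWithCopyWarning if a column of the result is reassigned later.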

In [24]:
df.is_there_an_emotion_directed_at_a_brand_or_product.value_counts()

no emotion toward brand or product    5388
positive emotion                      2978
negative emotion                       570
i can t tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [26]:
# keep only rows labelled positive or negative; .copy() avoids a SettingWithCopyWarning when the label column is reassigned later
df = df[df['is_there_an_emotion_directed_at_a_brand_or_product'].isin(['positive emotion', 'negative emotion'])].copy()

In [28]:
df.is_there_an_emotion_directed_at_a_brand_or_product.value_counts()

positive emotion    2978
negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [30]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweet...,iphone,negative emotion
1,jessedee know about fludapp awesome ipad ip...,ipad or iphone app,positive emotion
2,swonderlin can not wait for #ipad 2 also they...,ipad,positive emotion
3,sxsw i hope this year s festival isn t as cras...,ipad or iphone app,negative emotion
4,sxtxstate great stuff on fri #sxsw marissa ma...,google,positive emotion


In [33]:
df.shape

(3548, 3)

## Question 4

### Encode labels
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - change "positive emotion" to 1
    - change "negative emotion" to 0
- use map function to replace values
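
A short sketch of label encoding with `Series.map` on made-up values. The second series is an assumption added to show what happens to a value that is missing from the mapping.

```python
import pandas as pd

mapping = {"negative emotion": 0, "positive emotion": 1}

labels = pd.Series(["positive emotion", "negative emotion", "positive emotion"])
encoded = labels.map(mapping)

# Values that are not keys of the mapping become NaN rather than raising an error.
unmapped = pd.Series(["positive emotion", "i can t tell"]).map(mapping)

print(encoded.tolist())
print(unmapped.isna().sum())
```

Because unmapped values turn into NaN silently, it matters that the rows were restricted to the two classes before the mapping is applied.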

In [31]:
df['is_there_an_emotion_directed_at_a_brand_or_product'] = df.is_there_an_emotion_directed_at_a_brand_or_product.map({'negative emotion':0,'positive emotion':1})

In [32]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,wesley83 i have a 3g iphone after 3 hrs tweet...,iphone,0
1,jessedee know about fludapp awesome ipad ip...,ipad or iphone app,1
2,swonderlin can not wait for #ipad 2 also they...,ipad,1
3,sxsw i hope this year s festival isn t as cras...,ipad or iphone app,0
4,sxtxstate great stuff on fri #sxsw marissa ma...,google,1


In [34]:
df.shape

(3548, 3)

## Question 5

### Get feature and label
- get column "tweet_text" as feature
- get column "is_there_an_emotion_directed_at_a_brand_or_product" as label

In [35]:
feature=df['tweet_text']
label=df['is_there_an_emotion_directed_at_a_brand_or_product']

### Create train and test data
- use train_test_split to get train and test set
- set a random_state
- test_size: 0.25
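
A small sketch of how `test_size` and `random_state` behave, using eight made-up samples (the toy data is an assumption). With `test_size=0.25`, a quarter of the rows go to the test set; for the 3548 filtered tweets that would be 887 test rows and 2661 training rows.

```python
from sklearn.model_selection import train_test_split

texts = [f"tweet {i}" for i in range(8)]
labels = [1, 1, 1, 1, 1, 1, 0, 0]

# 25% of 8 samples -> 2 test samples, 6 training samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=2
)

# The same random_state reproduces exactly the same split.
X_tr2, X_te2, _, _ = train_test_split(
    texts, labels, test_size=0.25, random_state=2
)

print(len(X_tr), len(X_te))
print(X_te == X_te2)
```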

In [36]:
from sklearn.model_selection import train_test_split
tweet_train,tweet_test,y_train,y_test=train_test_split(feature,label,test_size=0.25,random_state=2)

## Question 6

### Vectorize data
- create document-term matrix
- use CountVectorizer()
    - ngram_range: (1, 2)
    - stop_words: 'english'
    - min_df: 2   
- do fit_transform on X_train
- do transform on X_test
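
A minimal sketch of these settings on a made-up corpus (the documents are assumptions for illustration). The vocabulary is learned from the training documents only, and the test documents are projected onto that same vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the battery is great", "great battery life", "the screen cracked"]
test_docs = ["great screen and great battery"]

# Unigrams and bigrams, English stop words removed,
# terms kept only if they occur in at least 2 training documents.
vec = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)

X_train_toy = vec.fit_transform(train_docs)  # learns the vocabulary, then counts
X_test_toy = vec.transform(test_docs)        # reuses the learned vocabulary

vocab = sorted(vec.vocabulary_)
print(vocab)
print(X_train_toy.shape, X_test_toy.shape)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, so the columns would no longer line up with the ones the classifier was trained on.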

In [40]:
from sklearn.feature_extraction.text import CountVectorizer

X_train=tweet_train
X_test=tweet_test
cvect=CountVectorizer(ngram_range=(1,2),stop_words='english',min_df=2)
X_train=cvect.fit_transform(X_train)
X_test=cvect.transform(X_test)

## Question 7

### Select classifier logistic regression
- use logistic regression for predicting sentiment of the given tweet
- initialize classifier

In [41]:
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()

### Fit the classifier
- fit logistic regression classifier

In [42]:
logreg.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Question 8

### Select classifier naive bayes
- use naive bayes for predicting sentiment of the given tweet
- initialize classifier
- use MultinomialNB

In [44]:
from sklearn.naive_bayes import MultinomialNB
mnb=MultinomialNB()


### Fit the classifier
- fit naive bayes classifier

In [45]:
mnb.fit(X_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Question 9

### Make predictions on logistic regression
- use your trained logistic regression model to make predictions on X_test

In [47]:
lr_predict=logreg.predict(X_test)

### Make predictions on naive bayes
- use your trained naive bayes model to make predictions on X_test
- use a different variable name to store predictions so that they are kept separately
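
The fit-then-predict flow for both classifiers can be sketched end to end on a made-up corpus. The documents, labels, and variable names ending in `_toy` are assumptions for illustration. Both models use their default settings here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "love this phone",
    "great app love it",
    "hate this phone",
    "terrible app hate it",
]
train_y = [1, 1, 0, 0]  # 1 = positive, 0 = negative
test_docs = ["love this app", "hate this app"]

# Both models share the same document-term matrix.
vec = CountVectorizer()
X_tr = vec.fit_transform(train_docs)
X_te = vec.transform(test_docs)

lr_toy = LogisticRegression().fit(X_tr, train_y)
nb_toy = MultinomialNB().fit(X_tr, train_y)

# Separate variables keep the two sets of predictions apart.
lr_toy_pred = lr_toy.predict(X_te)
nb_toy_pred = nb_toy.predict(X_te)

print(lr_toy_pred, nb_toy_pred)
```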

In [48]:
nb_pred=mnb.predict(X_test)

## Question 10

### Calculate accuracy of logistic regression
- check accuracy of logistic regression classifier
- use sklearn.metrics.accuracy_score
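
A short sketch of what `accuracy_score` computes, on made-up labels (an assumption for illustration). It also works out a majority-class baseline from the class counts reported after filtering (2978 positive, 570 negative). That baseline is computed on the full filtered data rather than on the test split, so it is only a rough point of comparison.

```python
from sklearn.metrics import accuracy_score

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]

# accuracy = correct predictions / total predictions
acc = accuracy_score(y_true, y_pred)
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Accuracy of always predicting "positive" on the filtered data.
baseline = 2978 / (2978 + 570)

print(acc, manual)
print(round(baseline, 3))
```

With classes this imbalanced, accuracy alone can look high even for a model that rarely predicts the negative class, so a baseline like this helps put the score in context.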

In [50]:
from sklearn.metrics import accuracy_score
lr_acc=accuracy_score(y_test,lr_predict)

In [51]:
lr_acc

0.8804960541149943

### Calculate accuracy of naive bayes
- check accuracy of naive bayes classifier
- use sklearn.metrics.accuracy_score

In [52]:
mnb_acc=accuracy_score(y_test,nb_pred)

In [53]:
mnb_acc

0.8804960541149943