# Sentiment analysis 

The objective of the second problem is to perform sentiment analysis on tweets in which users express opinions about various mobile devices.
Based on the text of a tweet, we will classify whether the sentiment the user directs at a particular mobile device is positive or negative.

## Question 1

### Read the data
- read tweets.csv
- use latin encoding if loading raises an encoding error

In [104]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


In [105]:
df = pd.read_csv('tweets.csv', encoding="latin")
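The "use latin encoding if it gives encoding error" instruction can also be written as an explicit fallback. The sketch below is only an illustration: the helper name `read_csv_with_fallback` and its `fallback_encoding` parameter are hypothetical and not part of the assignment.

```python
import pandas as pd


def read_csv_with_fallback(path, fallback_encoding="latin-1"):
    """Try the default UTF-8 decoding first, then retry with a fallback encoding."""
    try:
        return pd.read_csv(path)
    except UnicodeDecodeError:
        # The file contains bytes that are not valid UTF-8.
        return pd.read_csv(path, encoding=fallback_encoding)


# Hypothetical usage with the assignment's file:
# df = read_csv_with_fallback("tweets.csv")
```

Passing `encoding="latin"` directly, as in the cell above, is simpler when the encoding problem is already known.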

In [106]:
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


### Drop null values
- drop all the rows with null values

In [107]:
df = df.dropna()

In [108]:
df.shape

(3291, 3)

### Print the dataframe
- print the first 5 rows of the data
- use df.head()

In [109]:
df.head(5)

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


## Question 2

### Preprocess data
- convert all text to lowercase - use .lower()
- keep only digits, letters, spaces, and the characters #, + and _ in the text; replace everything else with a space - use re.sub()
- strip all the text - use .strip()
    - this removes leading and trailing spaces

In [110]:
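import re
# All columns hold strings after dropna(), so each step is applied cell by cell.
# (pandas 2.1+ deprecates DataFrame.applymap in favor of DataFrame.map.)
df = df.applymap(lambda s: s.lower())
# Replace anything other than digits, lowercase letters, space, #, + and _ with a space.
df = df.applymap(lambda s: re.sub(r'[^0-9a-z #+_]', ' ', s))
# Remove leading and trailing whitespace.
df = df.applymap(lambda s: s.strip())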
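To see what the three preprocessing steps do to a single tweet, they can be wrapped in one function. This is a minimal sketch: the name `clean_tweet` is hypothetical, and the sample string is a shortened version of the first tweet shown earlier.

```python
import re


def clean_tweet(text: str) -> str:
    """Lowercase, replace disallowed characters with spaces, and trim the ends."""
    text = text.lower()
    # Same character class as in the cell above.
    text = re.sub(r"[^0-9a-z #+_]", " ", text)
    return text.strip()


sample = ".@wesley83 I have a 3G iPhone."
cleaned = clean_tweet(sample)
print(cleaned)  # wesley83 i have a 3g iphone
```

Note that `.strip()` only trims the ends; repeated spaces inside the text are left in place.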



- print the dataframe

In [111]:
print(df)

                                             tweet_text  ... is_there_an_emotion_directed_at_a_brand_or_product
0     wesley83 i have a 3g iphone  after 3 hrs tweet...  ...                                   negative emotion
1     jessedee know about  fludapp   awesome ipad ip...  ...                                   positive emotion
2     swonderlin can not wait for #ipad 2 also  they...  ...                                   positive emotion
3     sxsw i hope this year s festival isn t as cras...  ...                                   negative emotion
4     sxtxstate great stuff on fri #sxsw  marissa ma...  ...                                   positive emotion
...                                                 ...  ...                                                ...
9077  mention your pr guy just convinced me to switc...  ...                                   positive emotion
9079  quot papyrus   sort of like the ipad quot    n...  ...                                   positive 

## Question 3

### Preprocess data
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - select only those rows where the value is "positive emotion" or "negative emotion"
- find the value counts of "positive emotion" and "negative emotion"

In [112]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

positive emotion                      2672
negative emotion                       519
no emotion toward brand or product      91
i can t tell                             9
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
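The cell above only counts the labels; in this notebook the unwanted rows are removed later, when the labels are mapped and the resulting null values are dropped. A direct way to select the two classes could be sketched with `isin`. The toy frame and its rows below are made up for illustration and are not taken from tweets.csv.

```python
import pandas as pd

label_col = "is_there_an_emotion_directed_at_a_brand_or_product"

# Toy frame standing in for the cleaned tweets (hypothetical rows).
toy = pd.DataFrame({
    "tweet_text": ["love my ipad", "battery died again", "at #sxsw today", "hmm"],
    label_col: ["positive emotion", "negative emotion",
                "no emotion toward brand or product", "i can t tell"],
})

# Keep only the rows whose label is one of the two target classes.
keep = ["positive emotion", "negative emotion"]
filtered = toy[toy[label_col].isin(keep)]
counts = filtered[label_col].value_counts()
print(counts)
```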

## Question 4

### Encode labels
- in column "is_there_an_emotion_directed_at_a_brand_or_product"
    - change "positive emotion" to 1
    - change "negative emotion" to 0
- use map function to replace values

In [113]:
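# Labels other than the two mapped values become NaN, which is why the column is shown as float (0.0 / 1.0).
df.is_there_an_emotion_directed_at_a_brand_or_product = df.is_there_an_emotion_directed_at_a_brand_or_product.map({'positive emotion': 1, 'negative emotion': 0})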


In [114]:
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | wesley83 i have a 3g iphone after 3 hrs tweet... | iphone | 0.0 |
| 1 | jessedee know about fludapp awesome ipad ip... | ipad or iphone app | 1.0 |
| 2 | swonderlin can not wait for #ipad 2 also they... | ipad | 1.0 |
| 3 | sxsw i hope this year s festival isn t as cras... | ipad or iphone app | 0.0 |
| 4 | sxtxstate great stuff on fri #sxsw marissa ma... | google | 1.0 |
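The float labels come from how `Series.map` handles values that are missing from the dictionary: they become NaN. A small sketch on a made-up series (not the assignment data):

```python
import pandas as pd

labels = pd.Series(["positive emotion", "negative emotion", "i can t tell"])

# Values not present in the dictionary are mapped to NaN,
# so the result is stored as float64.
encoded = labels.map({"positive emotion": 1, "negative emotion": 0})
print(encoded.tolist())  # [1.0, 0.0, nan]

# After dropping the NaN rows, the labels can be cast back to integers if desired.
clean = encoded.dropna().astype(int)
print(clean.tolist())  # [1, 0]
```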


## Question 5

In [136]:
df.shape

(3291, 3)

In [138]:
# Drop the rows whose label became NaN in the mapping step (neither positive nor negative).
df = df.dropna()

In [139]:
df.shape

(3191, 3)

### Get feature and label
- get column "tweet_text" as feature
- get column "is_there_an_emotion_directed_at_a_brand_or_product" as label

In [141]:
df_feature = df['tweet_text']
df_labels = df['is_there_an_emotion_directed_at_a_brand_or_product']


In [142]:
df_labels.value_counts()

1.0    2672
0.0     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

### Create train and test data
- use train_test_split to get train and test set
- set a random_state
- test_size: 0.25

In [146]:
X = df.tweet_text
y = df.is_there_an_emotion_directed_at_a_brand_or_product
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)
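The effect of `test_size` and `random_state` can be checked on a small made-up dataset. The sketch below also passes `stratify`, which is not used in the cell above; it is shown only as an option for imbalanced labels such as the ones counted earlier (2672 positive against 519 negative).

```python
from sklearn.model_selection import train_test_split

# Made-up data: 20 texts, 16 positive and 4 negative.
texts = [f"tweet {i}" for i in range(20)]
labels = [1] * 16 + [0] * 4

# stratify keeps the class ratio the same in both parts (an assumption, not in the original cell).
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)
print(len(X_tr), len(X_te))  # 15 5

# The same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)
print(X_te == X_te2)  # True
```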


## Question 6

### Vectorize data
- create document-term matrix
- use CountVectorizer()
    - ngram_range: (1, 2)
    - stop_words: 'english'
    - min_df: 2
- do fit_transform on X_train
- do transform on X_test

In [147]:
vectorizer = CountVectorizer(ngram_range = (1, 2), stop_words = 'english', min_df = 2)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
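To see what these three settings do, the same vectorizer can be run on a few made-up sentences. This is a sketch on toy data, not the assignment tweets; the variable names are chosen for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["love my ipad", "love my ipad case", "my iphone battery died"]
test_docs = ["love my android"]

vec = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
X_tr_toy = vec.fit_transform(train_docs)  # learns the vocabulary from the training texts
X_te_toy = vec.transform(test_docs)       # reuses that vocabulary, unseen words are ignored

# "my" is removed as a stop word; only terms found in at least 2 training documents remain.
print(sorted(vec.vocabulary_))  # ['ipad', 'love', 'love ipad']
print(X_tr_toy.shape, X_te_toy.shape)  # (3, 3) (1, 3)
```

The bigram "love ipad" appears because stop words are removed before the n-grams are built. Calling only `transform` on the test texts keeps the test set from influencing the vocabulary.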


## Question 7

### Select classifier logistic regression
- use logistic regression for predicting sentiment of the given tweet
- initialize classifier

### Fit the classifier
- fit logistic regression classifier

In [149]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Question 8

### Select classifier naive bayes
- use naive bayes for predicting sentiment of the given tweet
- initialize classifier
- use MultinomialNB

In [150]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()


### Fit the classifier
- fit naive bayes classifier

In [151]:
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Question 9

### Make predictions on logistic regression
- use your trained logistic regression model to make predictions on X_test

In [158]:
lr_result = logreg.predict(X_test)

### Make predictions on naive bayes
- use your trained naive bayes model to make predictions on X_test
- use a different variable name to store predictions so that they are kept separately

In [163]:
nb_result = clf.predict(X_test)
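In this notebook `X_train` and `X_test` were overwritten with the vectorized matrices, so both models expect count vectors, not raw text. One way to keep the vectorizer and the classifier together is a scikit-learn pipeline. The sketch below uses made-up tweets and default vectorizer settings; it is not the configuration from the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training tweets: 1 = positive, 0 = negative.
train_texts = ["love my ipad", "great iphone app", "awesome ipad launch",
               "battery died again", "app keeps crashing", "terrible iphone battery"]
train_labels = [1, 1, 1, 0, 0, 0]
new_texts = ["love this awesome app", "battery keeps crashing"]

models = {
    "logreg": make_pipeline(CountVectorizer(), LogisticRegression()),
    "nb": make_pipeline(CountVectorizer(), MultinomialNB()),
}

# Predictions are stored under separate keys so they are kept apart.
predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(new_texts)
    print(name, predictions[name])
```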

## Question 10

### Calculate accuracy of logistic regression
- check accuracy of logistic regression classifier
- use sklearn.metrics.accuracy_score

In [164]:
from sklearn.metrics import accuracy_score
lr_accuracy = accuracy_score(y_test.values, lr_result)
print(lr_accuracy)

0.868421052631579


In [165]:
print("Testing Accuracy")
print(logreg.score(X_test,y_test))
print("Training Accuracy")
print(logreg.score(X_train,y_train))

Testing Accuracy
0.868421052631579
Training Accuracy
0.9736732135394902


### Calculate accuracy of naive bayes
- check accuracy of naive bayes classifier
- use sklearn.metrics.accuracy_score

In [172]:
from sklearn.metrics import accuracy_score
clf_accuracy = accuracy_score(y_test.values, nb_result)
print(clf_accuracy)


0.8646616541353384


In [173]:
print("Testing Accuracy")
print(clf.score(X_test,y_test))
print("Training Accuracy")
print(clf.score(X_train,y_train))

Testing Accuracy
0.8646616541353384
Training Accuracy
0.9335562055996657
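Accuracy is easier to interpret next to a majority-class baseline. From the value counts above, 2672 of the 3191 labelled tweets are positive (about 84%), so always predicting positive would already be correct for roughly that share of the full dataset. The sketch below shows how such a baseline and a confusion matrix could be computed; the labels and predictions are hypothetical and are NOT the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for 10 tweets (not the notebook's results).
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)

# Baseline: always predict the most frequent class.
majority = np.bincount(y_true).argmax()
baseline = accuracy_score(y_true, np.full_like(y_true, majority))

# Rows are true labels (0, 1), columns are predicted labels (0, 1).
cm = confusion_matrix(y_true, y_pred)

print(acc)       # 0.9
print(baseline)  # 0.8
print(cm)
```

With `y_test` and `lr_result` or `nb_result` in place of the toy arrays, the same calls would show how many negative tweets each model actually catches.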
