# Sentiment Analysis

The first step of this project is to perform **sentiment analysis with NLTK's VADER** module (https://www.nltk.org/_modules/nltk/sentiment/vader.html) on a dataset of 10 000 Amazon reviews. VADER stands for <b>V</b>alence <b>A</b>ware <b>D</b>ictionary for s<b>E</b>ntiment <b>R</b>easoning and is a **rule-based algorithm** developed by Hutto and Gilbert: http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf.

### Step 1

#### 1. Perform initial imports

In [1]:
import numpy as np
import pandas as pd

#### 2. Load data

In [2]:
df = pd.read_csv("data/amazonreviews.tsv", sep='\t')

#### 3. Check the dataframe

In [3]:
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


#### 4. Check missing values

In [4]:
df.isnull().sum()

label     0
review    0
dtype: int64

There are no missing reviews.

#### 5. Check empty strings

In [5]:
# str.isspace() returns True only when the string is non-empty
# and contains nothing but whitespace characters; otherwise it returns False

empty_strings = []

for i, lb, rv in df.itertuples():
    if rv.isspace():
        empty_strings.append(i)

In [6]:
print(empty_strings)
print(len(empty_strings))

[]
0


There are no reviews that correspond to empty strings.
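
One caveat about this check: `str.isspace()` returns `False` for a zero-length string, so it only catches reviews made up of whitespace. With `pd.read_csv` defaults, empty fields are normally read as `NaN` and are therefore already covered by the missing-values check, but a stricter test is easy to write. The sketch below uses made-up sample strings, and the helper name `is_blank` is hypothetical.

```python
def is_blank(text):
    """Return True for zero-length or whitespace-only strings."""
    return not text.strip()


samples = ["", "   ", "\t\n", "Great product!"]

# str.isspace() misses the zero-length string
isspace_flags = [s.isspace() for s in samples]
print(isspace_flags)

# the strip()-based check catches both cases
blank_flags = [is_blank(s) for s in samples]
print(blank_flags)
```

In the loop above, replacing `rv.isspace()` with `not rv.strip()` would apply the stricter test.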

In [7]:
# check length

len(df)

10000

In [8]:
# check number of both labels

df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

We have 10 000 Amazon reviews (5097 negative and 4903 positive). Our dataset is clean and we can now analyse it with VADER.

#### 6. Import `SentimentIntensityAnalyzer` and create an sid object

In [9]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()



#### 7. Check the lexicon (vocabulary or all valid tokens) of our sid object

In [10]:
sid.lexicon

{'$:': -1.5,
 '%)': -0.4,
 '%-)': -1.5,
 '&-:': -0.4,
 '&:': -0.7,
 "( '}{' )": 1.6,
 '(%': -0.9,
 "('-:": 2.2,
 "(':": 2.3,
 '((-:': 2.1,
 '(*': 1.1,
 '(-%': -0.7,
 '(-*': 1.3,
 '(-:': 1.6,
 '(-:0': 2.8,
 '(-:<': -0.4,
 '(-:o': 1.5,
 '(-:O': 1.5,
 '(-:{': -0.1,
 '(-:|>*': 1.9,
 '(-;': 1.3,
 '(-;|': 2.1,
 '(8': 2.6,
 '(:': 2.2,
 '(:0': 2.4,
 '(:<': -0.2,
 '(:o': 2.5,
 '(:O': 2.5,
 '(;': 1.1,
 '(;<': 0.3,
 '(=': 2.2,
 '(?:': 2.1,
 '(^:': 1.5,
 '(^;': 1.5,
 '(^;0': 2.0,
 '(^;o': 1.9,
 '(o:': 1.6,
 ")':": -2.0,
 ")-':": -2.1,
 ')-:': -2.1,
 ')-:<': -2.2,
 ')-:{': -2.1,
 '):': -1.8,
 '):<': -1.9,
 '):{': -2.3,
 ');<': -2.6,
 '*)': 0.6,
 '*-)': 0.3,
 '*-:': 2.1,
 '*-;': 2.4,
 '*:': 1.9,
 '*<|:-)': 1.6,
 '*\\0/*': 2.3,
 '*^:': 1.6,
 ',-:': 1.2,
 "---'-;-{@": 2.3,
 '--<--<@': 2.2,
 '.-:': -1.2,
 '..###-:': -1.7,
 '..###:': -1.9,
 '/-:': -1.3,
 '/:': -1.3,
 '/:<': -1.4,
 '/=': -0.9,
 '/^:': -1.0,
 '/o:': -1.4,
 '0-8': 0.1,
 '0-|': -1.2,
 '0:)': 1.9,
 '0:-)': 1.4,
 '0:-3': 1.5,
 '0:03': 1.9,
 ...

In [11]:
len(sid.lexicon)

7502

In [12]:
# check n-grams

[(token, score) for token, score in sid.lexicon.items() if " " in token]

[("( '}{' )", 1.6),
 ("can't stand", -2.0),
 ('fed up', -1.8),
 ('screwed up', -1.5)]

We see that emoticons are also part of this lexicon of 7502 tokens. Only four entries contain a space: one is an emoticon (`( '}{' )`) and the other three are n-grams (bigrams in this case).

#### 8. Add scores and new labels to the dataframe

We'll append 3 columns to our dataset:
* `scores` with the polarity scores (negative, neutral, positive and compound)
* `compound` with the extracted compound score
* `comp_label` with the label derived from the compound score

In [13]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])

df['comp_label'] = df['compound'].apply(lambda comp: 'pos' if comp >=0 else 'neg')

df.head()

Unnamed: 0,label,review,scores,compound,comp_label
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos
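
Note that the rule used here (`comp >= 0`) labels a perfectly neutral score of 0 as `pos`. The VADER authors' documentation suggests thresholds of about ±0.05 on the compound score, with a neutral band in between. The following is a minimal sketch of that convention; the function name `label_from_compound` and its `threshold` parameter are hypothetical and not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to 'pos', 'neg' or 'neu'."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


for score in (0.9454, 0.04, 0.0, -0.3):
    print(score, label_from_compound(score))
```

Since this dataset only has `pos` and `neg` labels, reviews falling in the neutral band would need to be assigned to one class or set aside before computing metrics.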


#### 9. Compare the original label with the new label and evaluate the results

In [14]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [15]:
# confusion matrix

print(confusion_matrix(df['label'], df['comp_label'], labels =['pos', 'neg']))

[[4468  435]
 [2474 2623]]


In [16]:
# classification report

print(classification_report(df['label'], df['comp_label']))

              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

   micro avg       0.71      0.71      0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



In [17]:
# accuracy score

print(accuracy_score(df['label'], df['comp_label']))

0.7091


VADER **was not very accurate**, but it still correctly classified about **71%** of the reviews as positive or negative.
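
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps to see where VADER goes wrong. This short sketch only uses the four counts printed above (rows are true labels, columns are predicted labels, both in the order `['pos', 'neg']`).

```python
# confusion matrix printed above, labels ordered as ['pos', 'neg']
cm = [[4468, 435],
      [2474, 2623]]

pos_as_pos, pos_as_neg = cm[0]   # true 'pos' reviews
neg_as_pos, neg_as_neg = cm[1]   # true 'neg' reviews

total = pos_as_pos + pos_as_neg + neg_as_pos + neg_as_neg
accuracy = (pos_as_pos + neg_as_neg) / total

precision_pos = pos_as_pos / (pos_as_pos + neg_as_pos)
recall_pos = pos_as_pos / (pos_as_pos + pos_as_neg)
precision_neg = neg_as_neg / (neg_as_neg + pos_as_neg)
recall_neg = neg_as_neg / (neg_as_neg + neg_as_pos)

print(f"accuracy: {accuracy:.4f}")
print(f"pos: precision={precision_pos:.2f}, recall={recall_pos:.2f}")
print(f"neg: precision={precision_neg:.2f}, recall={recall_neg:.2f}")
```

The low recall for `neg` reflects the 2474 negative reviews (out of 5097) that were labelled as positive.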

Instead of a rule-based algorithm like VADER, we can build a **machine learning model** from labeled data. This is the second step of this project, and we'll use a **Naive Bayes** model for it.

### Step 2

#### 1. Tokenize reviews with casual_tokenize and create a bag of words

`casual_tokenize` was built to deal with short, informal texts from social networks, and it can handle emoticons, hashtags and other particularities of this kind of text.

In [18]:
from nltk.tokenize import casual_tokenize

In [19]:
from collections import Counter

# one Counter (token -> count) per review
bow = []
for review in df['review']:
    bow.append(Counter(casual_tokenize(review)))

In [20]:
# check first two elements of bow

bow[:2]

[Counter({'Stuning': 1,
          'even': 2,
          'for': 1,
          'the': 5,
          'non-gamer': 1,
          ':': 1,
          'This': 1,
          'sound': 1,
          'track': 1,
          'was': 1,
          'beautiful': 1,
          '!': 4,
          'It': 3,
          'paints': 1,
          'senery': 1,
          'in': 1,
          'your': 1,
          'mind': 1,
          'so': 1,
          'well': 1,
          'I': 3,
          'would': 2,
          'recomend': 1,
          'it': 2,
          'to': 2,
          'people': 1,
          'who': 2,
          'hate': 1,
          'vid': 1,
          '.': 2,
          'game': 2,
          'music': 2,
          'have': 2,
          'played': 2,
          'Chrono': 1,
          'Cross': 1,
          'but': 1,
          'out': 1,
          'of': 2,
          'all': 1,
          'games': 1,
          'ever': 1,
          'has': 1,
          'best': 1,
          'backs': 1,
          'away': 1,
          'from': 1,
          ...

Our variable `bow` is a list of "Counter dictionaries".

#### 2. Create a dataframe with the from_records() method

In [21]:
df_bow = pd.DataFrame.from_records(bow)

In [22]:
df_bow.head()

Unnamed: 0,!,"""",#,#10,#10162,#11,#114622,#12,#13,#15,...,ÁNGEL,ÚNICA,à,è,émouvantes,étai,étre,éviter,﻿,�
0,4.0,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,1.0,12.0,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,2.0,,,,,,,,,,...,,,,,,,,,,


#### 3. Fill all the NaN's  with zeros and set the numbers to integers

In [23]:
df_bow = df_bow.fillna(0).astype(int)

In [24]:
df_bow.head()

Unnamed: 0,!,"""",#,#10,#10162,#11,#114622,#12,#13,#15,...,ÁNGEL,ÚNICA,à,è,émouvantes,étai,étre,éviter,﻿,�
0,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,12,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
df_bow.shape

(10000, 46435)

For our 10 000 reviews, we have a total of 46435 distinct tokens, one column per token.
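
To make the structure of this document-term matrix concrete, here is a minimal pure-Python sketch of what the `from_records()` and `fillna(0)` steps produce. The three toy documents are made up, a whitespace split stands in for `casual_tokenize`, and the columns are sorted only for readability.

```python
from collections import Counter

docs = ["good good movie", "bad movie", "good !"]

# one Counter per document (whitespace split stands in for casual_tokenize)
toy_bow = [Counter(doc.split()) for doc in docs]

# columns: the union of all tokens seen in any document
vocabulary = sorted(set().union(*toy_bow))

# a Counter returns 0 for a missing key, which plays the role of fillna(0)
matrix = [[counts[token] for token in vocabulary] for counts in toy_bow]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary token, so most entries are zero. This is why the real matrix is so wide and sparse.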

#### 4. Fit a Naive Bayes model

In [26]:
from sklearn.naive_bayes import MultinomialNB

In [27]:
# check reviews dataframe

df.head()

Unnamed: 0,label,review,scores,compound,comp_label
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos


In [28]:
# split the data into train and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_bow, df['label'], test_size=0.3, random_state=42)

In [29]:
# fit

my_model = MultinomialNB()
my_model.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

#### 5. Make predictions

In [30]:
# predict

predicted_sentiment = my_model.predict(X_test)

In [31]:
# first 10 predictions

predicted_sentiment[:10]

array(['neg', 'neg', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'neg',
       'neg'], dtype='<U3')

#### 6. Compare the original label with the new label and evaluate the results

In [32]:
# confusion matrix

print(confusion_matrix(y_test, predicted_sentiment, labels =['pos', 'neg']))

[[1125  357]
 [ 164 1354]]


In [33]:
# classification report

print(classification_report(y_test, predicted_sentiment))

              precision    recall  f1-score   support

         neg       0.79      0.89      0.84      1518
         pos       0.87      0.76      0.81      1482

   micro avg       0.83      0.83      0.83      3000
   macro avg       0.83      0.83      0.83      3000
weighted avg       0.83      0.83      0.83      3000



In [34]:
# accuracy score

print(accuracy_score(y_test, predicted_sentiment))

0.8263333333333334


With this simple model, we have managed to improve our accuracy. Our **Naive Bayes model** was able to correctly identify about **83%** of the reviews of the test set as positive or negative.
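
For intuition about what the model learns, the following is a simplified pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It is not scikit-learn's implementation: the function names, the toy documents and the choice to ignore tokens unseen during training are assumptions made for illustration. The `alpha` argument plays the same role as the `alpha` hyperparameter of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # alpha is added to every token count, so no probability is zero
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik


def predict_nb(doc, log_prior, log_lik):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # tokens that were never seen during training are ignored
        scores[c] = log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
    return max(scores, key=scores.get)


train_docs = [["great", "sound"], ["love", "it"], ["awful", "sound"], ["hate", "it"]]
train_labels = ["pos", "pos", "neg", "neg"]

log_prior, log_lik = train_nb(train_docs, train_labels, alpha=1.0)
pred_a = predict_nb(["great", "it"], log_prior, log_lik)
pred_b = predict_nb(["awful", "it"], log_prior, log_lik)
print(pred_a, pred_b)
```

A smaller `alpha` trusts the observed counts more, while a larger one pulls all token probabilities towards a uniform distribution.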

We can now try to improve this model by:
* trying different text vectorization methods
* tuning the hyperparameter alpha of our model (default value is 1.0)

This will be the third step of this project.

### Step 3

The simplest text vectorization method is the **bag-of-words (BoW)** model that we used in step 2, based on simple counts of how many times each word appears in a given text (frequency). **One-hot encoding** and **TF-IDF (Term Frequency-Inverse Document Frequency)** are the other methods we'll implement in this step. To do this, we will use **sklearn's CountVectorizer** and **TfidfVectorizer**.
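
The difference between the three weightings can be sketched in pure Python on a made-up corpus. The TF-IDF part follows the smoothed formula documented for scikit-learn's defaults, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. Treat it as an approximation for illustration rather than a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

docs = [["good", "good", "film"], ["bad", "film"], ["good", "plot"]]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# bag of words: raw counts per document
counts = [[Counter(doc)[t] for t in vocab] for doc in docs]

# one-hot encoding: 1 if the token is present, 0 otherwise
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

# document frequency: number of documents containing each token
doc_freq = [sum(row[j] for row in binary) for j in range(len(vocab))]

# smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]


def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left as is)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row


tfidf_rows = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

print(vocab)
print(counts[0], binary[0])
print([round(x, 3) for x in tfidf_rows[1]])
```

In the second document, `bad` receives a higher weight than `film` because it occurs in fewer documents.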

#### 1. Import CountVectorizer and TfidfVectorizer

In [35]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#### 2. Vectorize text : BoW, one-hot encoding and TF-IDF

In [36]:
# use casual_tokenize as our tokenizer

bow_vectorizer = CountVectorizer(tokenizer=casual_tokenize)
ohe_vectorizer = CountVectorizer(binary=True, tokenizer=casual_tokenize)
tfidf_vectorizer = TfidfVectorizer(tokenizer=casual_tokenize)

cv_bow = bow_vectorizer.fit_transform(df['review'])
cv_ohe = ohe_vectorizer.fit_transform(df['review'])
tfidf = tfidf_vectorizer.fit_transform(df['review'])

#### 3. Create dataframes with the vectorized text

In [37]:
# check vocabulary

bow_vectorizer.vocabulary_

{'stuning': 32284,
 'even': 12003,
 'for': 13522,
 'the': 33429,
 'non-gamer': 22931,
 ':': 1044,
 'this': 33634,
 'sound': 31277,
 'track': 34260,
 'was': 36396,
 'beautiful': 3848,
 '!': 0,
 'it': 17976,
 'paints': 24265,
 'senery': 29789,
 'in': 17049,
 'your': 37601,
 'mind': 21437,
 'so': 30988,
 'well': 36616,
 'i': 16691,
 'would': 37291,
 'recomend': 27490,
 'to': 33978,
 'people': 24722,
 'who': 36835,
 'hate': 15622,
 'vid': 35970,
 '.': 38,
 'game': 14153,
 'music': 22265,
 'have': 15653,
 'played': 25322,
 'chrono': 6653,
 'cross': 8481,
 'but': 5448,
 'out': 23908,
 'of': 23398,
 'all': 1973,
 'games': 14179,
 'ever': 12017,
 'has': 15597,
 'best': 4125,
 'backs': 3473,
 'away': 3336,
 'from': 13888,
 'crude': 8512,
 'keyboarding': 18684,
 'and': 2291,
 'takes': 32968,
 'a': 1099,
 'fresher': 13827,
 'step': 31915,
 'with': 37039,
 'grate': 14968,
 'guitars': 15218,
 'soulful': 31268,
 'orchestras': 23750,
 'impress': 17010,
 'anyone': 2537,
 'cares': 5845,
 ...

In [38]:
# confirm that vocabulary is the same for all vectorizers

bow_vectorizer.vocabulary_ == ohe_vectorizer.vocabulary_ == tfidf_vectorizer.vocabulary_

True

In [39]:
# sort tokens in ascending order of the vocabulary values

_, tokens = zip(*sorted(zip(bow_vectorizer.vocabulary_.values(), bow_vectorizer.vocabulary_.keys())))
tokens[:10]

('!', '"', '#', '#10', '#10162', '#11', '#114622', '#12', '#13', '#15')

In [40]:
# create dataframes

df_cv_bow = pd.DataFrame(cv_bow.toarray(), columns=tokens)
df_cv_ohe = pd.DataFrame(cv_ohe.toarray(), columns=tokens)
df_tfidf = pd.DataFrame(tfidf.toarray(), columns=tokens)

#### 4. Check dataframes

In [41]:
df_cv_bow.head()

Unnamed: 0,!,"""",#,#10,#10162,#11,#114622,#12,#13,#15,...,à,ángel,è,émouvantes,étai,étre,éviter,única,﻿,�
0,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,12,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [42]:
df_cv_bow.shape

(10000, 37767)

In [43]:
df_cv_ohe.head()

Unnamed: 0,!,"""",#,#10,#10162,#11,#114622,#12,#13,#15,...,à,ángel,è,émouvantes,étai,étre,éviter,única,﻿,�
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
df_cv_ohe.shape

(10000, 37767)

In [45]:
df_tfidf.head()

Unnamed: 0,!,"""",#,#10,#10162,#11,#114622,#12,#13,#15,...,à,ángel,è,émouvantes,étai,étre,éviter,única,﻿,�
0,0.16666,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.029174,0.430374,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.081502,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
df_tfidf.shape

(10000, 37767)

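For our 10 000 reviews, we now have a vocabulary of 37767 distinct tokens. This is fewer than the 46435 obtained in step 2 because `CountVectorizer` lowercases the text by default, so tokens such as `This` and `this` share a single column.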

#### 5. Fit a Naive Bayes model, predict and evaluate the results for the different vectorization methods

In [47]:
vec_methods = {'BoW': cv_bow, 'One-hot encoding': cv_ohe, 'TF-IDF': tfidf}

for method, vec in vec_methods.items():
    
    # split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(vec, df['label'], test_size=0.3, random_state=42)
    
    # fit
    my_model = MultinomialNB()
    my_model.fit(X_train, y_train)
    
    # predict
    predicted_sentiment = my_model.predict(X_test)
    
    # accuracy score
    print(f"{method}: {accuracy_score(y_test, predicted_sentiment)}")

BoW: 0.832
One-hot encoding: 0.832
TF-IDF: 0.815


With `CountVectorizer`, BoW and one-hot encoding both reach an accuracy of 0.832, a slight improvement over the 0.826 obtained in step 2, while TF-IDF scores slightly lower (0.815). Let's try to improve the model even further with the help of **n-grams**.

#### 6. Vectorize text: BoW with n-grams

In [48]:
ngrams_bow_vectorizer = CountVectorizer(tokenizer=casual_tokenize, ngram_range=(1,2))
cv_ngrams_bow = ngrams_bow_vectorizer.fit_transform(df['review'])

#### 7. Fit a Naive Bayes model, predict and evaluate the result using n-grams

In [49]:
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(cv_ngrams_bow, df['label'], test_size=0.3, random_state=42)
    
# fit
my_model = MultinomialNB()
my_model.fit(X_train, y_train)

# predict
predicted_sentiment = my_model.predict(X_test)

# accuracy score
print(f"BoW with n-grams: {accuracy_score(y_test, predicted_sentiment)}")

BoW with n-grams: 0.8596666666666667


Considering both **unigrams and bigrams**, our Naive Bayes model was able to correctly identify about **86%** of the reviews of the test set as positive or negative.
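
To illustrate what `ngram_range=(1,2)` adds to the vocabulary, here is a small sketch that builds unigrams and bigrams from a list of tokens. The helper `word_ngrams` is hypothetical and only mimics the idea; it is not `CountVectorizer`'s internal code.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams with min_n <= n <= max_n, joined by a space."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = ["not", "good", "at", "all"]
grams = word_ngrams(tokens)
print(grams)
```

Bigrams such as `not good` keep a little local context, for instance negation, that single tokens lose, which is a plausible reason for the gain in accuracy.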

We can now try to **tune the hyperparameter alpha** of our Naive Bayes model. We'll do this with sklearn's GridSearchCV.

#### 8. Import GridSearchCV

In [50]:
from sklearn.model_selection import GridSearchCV

#### 9. Tune hyperparameter alpha 

In [51]:
param_grid = [{'alpha': [0.1, 0.3, 0.4, 0.5, 1]}]

my_model = MultinomialNB()

grid_search = GridSearchCV(my_model, param_grid, cv = 5, scoring = 'accuracy')

grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'alpha': [0.1, 0.3, 0.4, 0.5, 1]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [52]:
grid_search.best_params_

{'alpha': 0.4}

#### 10. Retrain Naive Bayes model with tuned alpha hyperparameter

In [53]:
# fit
my_model = MultinomialNB(alpha=0.4)
my_model.fit(X_train, y_train)

# predict
predicted_sentiment = my_model.predict(X_test)

# accuracy score
print(f"BoW with n-grams: {accuracy_score(y_test, predicted_sentiment)}")

BoW with n-grams: 0.8666666666666667


We were able to slightly improve our model. Our final accuracy score is now close to **0.87**.
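
One possible refinement: in the steps above, the vectorizers were fitted on all 10 000 reviews before the train/test split, so the vocabulary was learned from the test reviews as well. Wrapping the vectorizer and the classifier in a scikit-learn `Pipeline` keeps vectorization inside each cross-validation fold. The sketch below runs on a handful of made-up reviews and uses the default tokenizer; the step names `vec` and `nb` are arbitrary, and no particular result is implied for the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy data, for illustration only
reviews = [
    "great sound and great value", "love this album",
    "excellent quality", "works very well",
    "awful sound and poor value", "hate this album",
    "terrible quality", "stopped working quickly",
]
labels = ["pos"] * 4 + ["neg"] * 4

pipeline = Pipeline([
    ("vec", CountVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ("nb", MultinomialNB()),
])

# parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"nb__alpha": [0.1, 0.3, 0.4, 0.5, 1]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(reviews, labels)

print(search.best_params_)
prediction = search.predict(["great quality"])
print(prediction)
```

On the real data, the pipeline would be fitted on the raw training reviews (a split of `df['review']` and `df['label']`), with `tokenizer=casual_tokenize` passed to the vectorizer as before.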