# Sentiment Analysis

The goal of this project is to perform **sentiment analysis with NLTK's VADER** module (https://www.nltk.org/_modules/nltk/sentiment/vader.html) on a dataset of 10 000 Amazon reviews.

#### 1. Perform initial imports

In [1]:
import numpy as np
import pandas as pd

#### 2. Load data

In [2]:
df = pd.read_csv("data/amazonreviews.tsv", sep='\t')

#### 3. Check the dataframe

In [3]:
df.head()

| | label | review |
|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... |


In [4]:
len(df)

10000

In [5]:
# count the reviews per label

df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

#### 4. Check missing values

In [6]:
df.isnull().sum()

label     0
review    0
dtype: int64

There are no missing reviews.

#### 5. Check empty strings

In [7]:
# collect the index of every whitespace-only review
# (str.isspace() is True only for non-empty strings made up of whitespace)

empty_strings = []

for i, lb, rv in df.itertuples():
    if rv.isspace():
        empty_strings.append(i)

In [8]:
print(empty_strings)
print(len(empty_strings))

[]
0


None of the reviews is a whitespace-only string.
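
One subtlety of the check in step 5: `str.isspace()` returns `False` for a zero-length string, so it only flags reviews made up of whitespace characters. With the default `pd.read_csv` settings, empty fields would normally be read as missing values and caught in step 4, but a stricter test is easy to write. The following is a minimal sketch; the helper name `is_blank` and the sample list are illustrative and not part of the original notebook.

```python
def is_blank(text):
    """Return True for zero-length or whitespace-only strings."""
    # strip() removes leading and trailing whitespace; what remains is empty
    # exactly when the string had no visible content.
    return not text.strip()


samples = ["", "   ", "\t\n", "Great product"]

# isspace() misses the zero-length string ...
print([s.isspace() for s in samples])   # [False, True, True, False]
# ... while is_blank() flags it as well.
print([is_blank(s) for s in samples])   # [True, True, True, False]
```

On the dataframe, the same idea could be expressed as `df['review'].str.strip().eq('')`, assuming the column contains only strings.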

In [9]:
# check length

len(df)

10000

In [10]:
# count the reviews per label

df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

We have 10 000 Amazon reviews (5097 negative and 4903 positive). The dataset is clean, so we can now analyse it with VADER.

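#### 6. Import `SentimentIntensityAnalyzer` and create a `sid` object

VADER relies on the `vader_lexicon` resource. If it is not installed yet, it can be fetched once with `nltk.download('vader_lexicon')`.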

In [11]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()



#### 7. Add scores and new labels to the dataframe

We'll append 3 columns to our dataset:
* `scores` with the polarity scores (negative, neutral, positive and compound)
* `compound` with the extracted compound score
* `comp_label` with the label derived from the compound score

In [12]:
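# full dictionary of polarity scores for each review
df['scores'] = df['review'].apply(sid.polarity_scores)

# extract the compound score from each dictionary
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])

# non-negative compound scores (including exactly 0) are labelled 'pos'
df['comp_label'] = df['compound'].apply(lambda comp: 'pos' if comp >= 0 else 'neg')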

df.head()

| | label | review | scores | compound | comp_label |
|---|---|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... | 0.9454 | pos |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... | 0.8957 | pos |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... | 0.9858 | pos |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... | 0.9814 | pos |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... | 0.9781 | pos |
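
The rule in cell 12 uses 0 as the cut-off and assigns a compound score of exactly 0 to `pos`. The README of the VADER authors' `vaderSentiment` package describes ±0.05 as typical thresholds, with scores in between treated as neutral. A minimal sketch of such a three-way rule follows. The function name `label_from_compound`, the `threshold` parameter and the `neu` label are assumptions made for illustration; since this dataset only carries `pos`/`neg` labels, the notebook keeps its two-way rule for the evaluation.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to 'pos', 'neg' or 'neu'."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


# Two compound scores from the first rows of the dataframe,
# followed by three made-up values near and below zero.
for score in (0.9454, 0.8957, 0.02, 0.0, -0.3):
    print(score, label_from_compound(score))
```

Calling the helper with `threshold=0.0` reproduces the notebook's behaviour: every non-negative score becomes `pos` and every negative score becomes `neg`.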


#### 8. Compare the original label with the new label and evaluate the results

In [13]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [14]:
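# confusion matrix: rows are true labels, columns are predicted labels,
# both in the order given by `labels` (pos, neg)

print(confusion_matrix(df['label'], df['comp_label'], labels=['pos', 'neg']))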

[[4468  435]
 [2474 2623]]


In [15]:
# classification report

print(classification_report(df['label'], df['comp_label']))

              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

   micro avg       0.71      0.71      0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



In [16]:
# accuracy score

print(accuracy_score(df['label'], df['comp_label']))

0.7091
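
The figures in the classification report and the accuracy score follow directly from the confusion matrix in cell 14. The sketch below recomputes them with plain arithmetic; the variable names are chosen for illustration only.

```python
# Confusion matrix from cell 14: rows = true label, columns = predicted label,
# both in the order ['pos', 'neg'].
pos_as_pos, pos_as_neg = 4468, 435     # true 'pos' reviews
neg_as_pos, neg_as_neg = 2474, 2623    # true 'neg' reviews

total = pos_as_pos + pos_as_neg + neg_as_pos + neg_as_neg

# accuracy: share of reviews on the diagonal
accuracy = (pos_as_pos + neg_as_neg) / total

# precision: correct predictions of a class / all predictions of that class
precision_pos = pos_as_pos / (pos_as_pos + neg_as_pos)
precision_neg = neg_as_neg / (neg_as_neg + pos_as_neg)

# recall: correct predictions of a class / all true members of that class
recall_pos = pos_as_pos / (pos_as_pos + pos_as_neg)
recall_neg = neg_as_neg / (neg_as_neg + neg_as_pos)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(f"accuracy {accuracy:.4f}")
print(f"neg  precision {precision_neg:.2f}  recall {recall_neg:.2f}  "
      f"f1 {f1(precision_neg, recall_neg):.2f}")
print(f"pos  precision {precision_pos:.2f}  recall {recall_pos:.2f}  "
      f"f1 {f1(precision_pos, recall_pos):.2f}")
```

Rounded to two decimals, these values match the `neg` and `pos` rows of the classification report, and the accuracy matches the output of cell 16.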


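VADER **was not very accurate** on this dataset: it labelled about **71%** of the reviews correctly. Most of the errors are negative reviews classified as positive (2474 of 5097), so recall for the `neg` class is only 0.51, compared with 0.91 for `pos`.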