# Sentiment Analysis with VADER

This notebook introduces the problem of **sentiment analysis** and shows how to perform it using the NLTK module [VADER](https://www.nltk.org/_modules/nltk/sentiment/vader.html) (Valence Aware Dictionary and sEntiment Reasoner).

We can apply VADER directly to unlabelled text data to obtain the strength of the sentiment in three directions: positive, neutral and negative. A fourth value, `compound`, summarizes the overall sentiment as a single normalized score between -1 (most negative) and +1 (most positive):

```python
{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}
```

In the background, VADER maps lexical features to sentiment scores; the sentiment score of a text is essentially the sum of the sentiment scores of its words. Note that:
- Negation is understood: `did not love` is scored as the opposite of `did love`.
- Exclamation marks and capitalization are taken into account.
- If a text has both positive and negative parts, their scores are summed and can cancel out, which may not reflect the intended sentiment.
- Sarcasm is not captured, e.g., positive words intended as negative.
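
The summing idea and the first three notes can be illustrated with a toy scorer. This is a minimal sketch, not VADER's actual implementation: the mini-lexicon `TOY_LEXICON`, the helper `toy_score` and all constants are assumptions chosen for illustration, only loosely modelled on the kind of rules VADER applies.

```python
import math

# Hypothetical mini-lexicon (word -> valence); illustration only
TOY_LEXICON = {"love": 3.0, "good": 2.0, "hate": -3.0, "bad": -2.5}
NEGATIONS = {"not", "never", "no"}

NEGATION_FACTOR = -0.74  # assumed: flip and dampen a negated word
EXCLAMATION_BOOST = 0.3  # assumed: emphasis added per '!' (capped at 4)
CAPS_BOOST = 0.7         # assumed: emphasis for an ALL-CAPS sentiment word
ALPHA = 15               # assumed normalization constant


def toy_score(text):
    """Sum word valences with simple negation, caps and '!' handling."""
    tokens = text.replace("!", " ").replace(".", " ").replace(",", " ").split()
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token.lower(), 0.0)
        if valence == 0.0:
            continue
        # Capitalized sentiment words count a bit more
        if token.isupper():
            valence += CAPS_BOOST if valence > 0 else -CAPS_BOOST
        # A negation within the three preceding tokens flips the valence
        preceding = [t.lower() for t in tokens[max(0, i - 3):i]]
        if any(t in NEGATIONS for t in preceding):
            valence *= NEGATION_FACTOR
        total += valence
    # Exclamation marks push the total further away from zero
    boost = min(text.count("!"), 4) * EXCLAMATION_BOOST
    if total > 0:
        total += boost
    elif total < 0:
        total -= boost
    # Squash the unbounded sum into the interval (-1, 1)
    return total / math.sqrt(total * total + ALPHA)


for sentence in ["I did love it", "I did not love it", "I LOVE it!!!",
                 "good acting, bad plot"]:
    print(f"{sentence!r}: {toy_score(sentence):+.3f}")
```

In this toy version the mixed sentence ends up slightly negative even though it contains praise, which is the cancellation effect described in the third note.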

Note that the VADER model is ready to use: we need neither a labelled dataset nor any training. However, the performance is modest: around 60%-70% accuracy in the examples below, where 50% corresponds to a pure random guess. That is probably because the model has not been trained on the dataset at hand and because human communication is very nuanced: we might reveal what we really mean only in the last sentence, we can be sarcastic, etc.

I think that one could build a regression model for sentiment analysis if we had labelled data (e.g., 5-star ratings); using TF-IDF features would also make sense.
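
As a minimal sketch of that idea, assuming scikit-learn is available: TF-IDF features feed a ridge regression that predicts a star rating. The six reviews and their ratings are made up for illustration, and `Ridge` is just one possible regressor, not something taken from the course material.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Made-up labelled data: review text and a 1-5 star rating
reviews = [
    "great product, love it",
    "terrible, broke after a day",
    "good value and works well",
    "awful quality, do not buy",
    "love it, great sound",
    "bad and terrible support",
]
stars = [5, 1, 4, 1, 5, 2]

# TF-IDF turns each review into a weighted bag-of-words vector;
# the regressor then maps that vector to a rating
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(reviews, stars)

pred_pos, pred_neg = model.predict(["great, love it", "terrible and bad"])
print(f"predicted stars: {pred_pos:.2f} vs {pred_neg:.2f}")
```

With so little data the predicted values are not meaningful in themselves; the point is only that the pipeline learns from labels, which VADER does not.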

Overview of contents:
1. NLTK's VADER Module for Sentiment Analysis
2. Example 1: Amazon Reviews Sentiment Analysis with NLTK-VADER
    - 2.1 Load and Clean the Data
    - 2.2 Add Sentiment Scores
    - 2.3 Check Accuracy
3. Example 2: Movie Reviews Sentiment Analysis with NLTK-VADER
    - 3.1 Load and Clean the Data
    - 3.2 Add Sentiment Scores
    - 3.3 Check Accuracy

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided via a download link; I haven't found a repository to fork from.*

## 1. NLTK's VADER Module for Sentiment Analysis

In [2]:
import nltk
# Load the VADER lexicon
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/mxagar/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [3]:
# Import the sentiment analysis class that makes use of the previous lexicon
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [5]:
# VADER's SentimentIntensityAnalyzer() takes in a string
# and returns a dictionary of scores in each of four categories
# - negative
# - neutral
# - positive
# - compound (sum of the word valence scores, normalized to [-1, 1])
a = 'This was a good movie.'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

In [6]:
a = 'This was the best, most awesome movie EVER MADE!!!'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}

In [7]:
a = 'This was the worst film to ever disgrace the screen.'
sid.polarity_scores(a)

{'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074}

In [11]:
sid.polarity_scores('best')

{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.6369}

In [12]:
sid.polarity_scores('worst')

{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.6249}
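
The two single-word results above can be reproduced by hand. NLTK's VADER module normalizes the summed valence `x` as `x / sqrt(x*x + alpha)` with `alpha = 15`. The sketch below assumes raw lexicon valences of 3.2 for `best` and -3.1 for `worst`; these two numbers are assumptions (they are the values that reproduce the outputs above), and `normalize_valence` is a local helper written for this example.

```python
import math


def normalize_valence(score, alpha=15):
    """Map an unbounded valence sum to the interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# Assumed raw valences for the two words (not read from the lexicon here)
assumed_valence = {"best": 3.2, "worst": -3.1}

for word, valence in assumed_valence.items():
    print(word, round(normalize_valence(valence), 4))
```

Under these assumptions the printed values are 0.6369 and -0.6249, matching the `compound` entries returned for the two words.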

## 2. Example 1: Amazon Reviews Sentiment Analysis with NLTK-VADER

In [70]:
import numpy as np
import pandas as pd

### 2.1 Load and Clean the Data

In [71]:
df = pd.read_csv('../data/amazonreviews.tsv', sep='\t')
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [72]:
# Balanced? Quite.
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [73]:
# Check for missing reviews (isna() and isnull() are aliases, so one call is enough)
df['review'].isnull().sum()

0

In [74]:
# Remove rows with missing reviews
df.dropna(inplace=True)

In [75]:
# Check for empty strings and remove them
blanks = []  # start with an empty list
for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):      # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
df.drop(blanks, inplace=True)
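
The same cleaning can be written without an explicit loop. A minimal sketch on a made-up four-row frame (`df_demo` is hypothetical, not the Amazon data); note that `str.strip()` also catches completely empty strings, which `isspace()` does not.

```python
import pandas as pd

# Made-up frame with one missing and one whitespace-only review
df_demo = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["Great!", "   ", None, "Bad."],
})

# Drop missing reviews, then keep only rows with non-blank text
df_clean = df_demo.dropna(subset=["review"])
df_clean = df_clean[df_clean["review"].str.strip() != ""]

print(df_clean)
```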

In [76]:
# Still Balanced? Same as before: nothing removed
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

### 2.2 Add Sentiment Scores

In [77]:
# Check a single review
sid.polarity_scores(df.loc[0]['review'])

{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

In [78]:
# Label seems to match
df.loc[0]['label']

'pos'

In [79]:
# Check the review text
df.loc[0]['review']

'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

In [80]:
# Create new column/feature: predicted sentiment score
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df.head()

Unnamed: 0,label,review,scores
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co..."
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co..."
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com..."
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com..."
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp..."


In [81]:
# Take compound value: usually, that's what we take
df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])
df.head()

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781
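
If the other three values are needed as well, the dictionaries can be expanded into one column per key instead of extracting a single entry. A small sketch, assuming pandas is available; it uses the score dictionaries of the three example sentences from section 1, and `df_scores` is a stand-in frame rather than the review data.

```python
import pandas as pd

# Score dictionaries copied from the single-sentence examples in section 1
scores = [
    {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404},
    {'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877},
    {'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074},
]
df_scores = pd.DataFrame({'scores': scores})

# One column per dictionary key, aligned on the original index
expanded = pd.DataFrame(df_scores['scores'].tolist(), index=df_scores.index)
df_scores = df_scores.join(expanded)

print(df_scores[['neg', 'neu', 'pos', 'compound']])
```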


In [82]:
# Binarize
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')
df.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos
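
The cut at zero forces every review into `pos` or `neg`. The VADER authors suggest a three-way split instead, with a neutral band of about ±0.05 around zero. A sketch of that rule follows; `label_compound` and its `threshold` parameter are names made up for this example.

```python
def label_compound(compound, threshold=0.05):
    """Turn a compound score into 'pos', 'neu' or 'neg'."""
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'


for c in [0.9454, 0.0, -0.2484]:
    print(c, label_compound(c))
```

With purely binary labels, as in this dataset, the zero cut is the simpler choice, since a `neu` prediction would always count as an error.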


### 2.3 Check Accuracy

In [83]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [87]:
# Not a very good result: random guessing would yield about 50%
accuracy_score(df['label'],df['comp_score'])

0.7097

In [85]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.86      0.52      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



In [86]:
print(confusion_matrix(df['label'],df['comp_score']))

[[2629 2468]
 [ 435 4468]]
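
The figures in the classification report follow directly from this confusion matrix, whose rows are the true labels and whose columns are the predicted labels, both in the order `neg`, `pos`. A plain-Python sketch that recomputes them from the four counts:

```python
# Confusion matrix from above: rows = true label, columns = predicted label,
# both in the order ['neg', 'pos']
cm = [[2629, 2468],
      [435, 4468]]
labels = ['neg', 'pos']

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total
print(f"accuracy: {accuracy:.4f}")

for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    actually_label = sum(cm[i])                     # row sum
    precision = cm[i][i] / predicted_as_label
    recall = cm[i][i] / actually_label
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

The low recall for `neg` (0.52) shows that about half of the negative reviews receive a non-negative compound score, so the errors are heavily skewed toward false positives.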


## 3. Example 2: Movie Reviews Sentiment Analysis with NLTK-VADER

This example uses a dataset of 2,000 IMDb movie reviews, which can be loaded with NLTK (below, it is read from a TSV file instead):

```python
from nltk.corpus import movie_reviews
```

### 3.1 Load and Clean the Data

In [89]:
import numpy as np
import pandas as pd

In [90]:
df = pd.read_csv('../data/moviereviews.tsv', sep='\t')
df.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


In [91]:
# Check for missing items (isna() and isnull() are aliases, so one call is enough)
df.isnull().sum()

label      0
review    35
dtype: int64

In [92]:
df.dropna(inplace=True)

In [93]:
# Remove items with empty reviews
blanks = []  # start with an empty list
for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):      # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
df.drop(blanks, inplace=True)

In [94]:
# Check for missing items again
df.isnull().sum()

label     0
review    0
dtype: int64

In [95]:
# Check balance: perfect!
df['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64

### 3.2 Add Sentiment Scores

In [96]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [97]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')

In [98]:
df.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125,neg
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618,neg
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com...",0.9951,pos
3,pos,according to hollywood movies made in last few...,"{'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co...",0.9972,pos
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co...",-0.2484,neg


### 3.3 Check Accuracy

In [99]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [100]:
# Not a very good result: random guessing would yield about 50%
accuracy_score(df['label'],df['comp_score'])

0.6357069143446853

In [101]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938



In [102]:
print(confusion_matrix(df['label'],df['comp_score']))

[[427 542]
 [164 805]]
