# Sentiment Analysis
In this exercise, we will explore a movie review dataset.


**Task 1:** Load the data from `/dsa/data/all_datasets/movie_reviews` into the `mvr` variable. When loading, use `encoding='utf-8'`. (Solved for you)


In [1]:
# Packages used in this exercise (some carried over from earlier ones)
import nltk, re, pprint

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import word_tokenize
from nltk import FreqDist

# Import matplotlib for plotting; %matplotlib inline renders figures in the notebook
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

In [2]:
from sklearn.datasets import load_files

data_dir = '/dsa/data/all_datasets/movie_reviews'

mvr = load_files(data_dir, encoding = 'utf-8')

In [3]:
mvr.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [4]:
print('Number of Reviews: {0}'.format(len(mvr.filenames)))

Number of Reviews: 2000


**Task 2:** Apply `SentimentIntensityAnalyzer` to the entire dataset to estimate polarity scores. Print the top 3 `positive`, `negative`, and `neutral` reviews based on the following rules:


* positive sentiment: compound score >= 0.05
* neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
* negative sentiment: compound score <= -0.05
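
The thresholds above can be sketched as a small helper. The name `label_compound` and its `threshold` parameter are hypothetical, introduced only for illustration; the function takes a compound score like those returned by `polarity_scores` and returns one of three labels.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to POS, NEU, or NEG using the rule above."""
    if compound >= threshold:
        return 'POS'
    if compound <= -threshold:
        return 'NEG'
    return 'NEU'  # strictly between -threshold and +threshold


# Example scores, including the boundary values
for score in (0.6567, 0.05, 0.0, -0.05, -0.8142):
    print(score, label_compound(score))
```

Because both boundary values (0.05 and -0.05) belong to the positive and negative classes, the neutral band is an open interval.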

In [5]:
analyzer = SentimentIntensityAnalyzer()
vs = [analyzer.polarity_scores(t) for t in mvr['data']]

df = pd.DataFrame(vs)
df['review'] = mvr['data']

df = df[['review', 'neg', 'neu', 'pos', 'compound']]

df.head()

| | review | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 0 | arnold schwarzenegger has been an icon for act... | 0.153 | 0.678 | 0.169 | 0.6567 |
| 1 | good films are hard to find these days . \ngre... | 0.075 | 0.802 | 0.123 | 0.9783 |
| 2 | quaid stars as a man who has taken up the prof... | 0.083 | 0.766 | 0.151 | 0.9827 |
| 3 | we could paraphrase michelle pfieffer's charac... | 0.095 | 0.801 | 0.104 | -0.8142 |
| 4 | kolya is one of the richest films i've seen in... | 0.015 | 0.835 | 0.15 | 0.9538 |
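
Several compound scores in the output above sit close to ±1. A likely reason, assuming the normalization used in the VADER implementation (the summed word valences passed through x / sqrt(x² + α) with α = 15), is that long reviews accumulate a large valence sum and the score saturates. The sketch below illustrates that saturation; `normalize_valence` is a hypothetical stand-in, not a call into NLTK.

```python
import math


def normalize_valence(valence_sum, alpha=15):
    """Squash an unbounded valence sum into (-1, 1).

    alpha=15 is an assumption based on VADER's default normalization constant.
    """
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# The larger the summed valence, the closer the result gets to 1
for total in (1, 5, 20, 100):
    print(total, round(normalize_valence(total), 4))
```

If this assumption holds, the compound score of a full-length review mostly reflects the sign of the accumulated valence rather than its magnitude.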


In [6]:
POS = df[df['compound'] >= 0.05]
NEU = df[(df['compound'] > -0.05) & (df['compound'] < 0.05)]
NEG = df[df['compound'] <= -0.05]

In [7]:
POS.head()

| | review | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 0 | arnold schwarzenegger has been an icon for act... | 0.153 | 0.678 | 0.169 | 0.6567 |
| 1 | good films are hard to find these days . \ngre... | 0.075 | 0.802 | 0.123 | 0.9783 |
| 2 | quaid stars as a man who has taken up the prof... | 0.083 | 0.766 | 0.151 | 0.9827 |
| 4 | kolya is one of the richest films i've seen in... | 0.015 | 0.835 | 0.15 | 0.9538 |
| 5 | i don't know how many other people have had th... | 0.081 | 0.78 | 0.14 | 0.9934 |


In [8]:
POS.sort_values(by = ['compound'], axis = 0, ascending=False).head(n=3)

| | review | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 1338 | as i write the review for the new hanks/ryan r... | 0.044 | 0.668 | 0.288 | 0.9999 |
| 1351 | note : some may consider portions of the follo... | 0.038 | 0.675 | 0.287 | 0.9999 |
| 1157 | i actually am a fan of the original 1961 or so... | 0.055 | 0.732 | 0.213 | 0.9998 |


In [9]:
NEU.sort_values(by = ['compound'], axis = 0, ascending=False).head(n=3)

| | review | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 1014 | pulp fiction , quentin tarantino's anxiously a... | 0.104 | 0.799 | 0.097 | -0.0488 |


In [10]:
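# ascending=False lists the NEG reviews closest to the -0.05 threshold;
# pass ascending=True to rank the most strongly negative reviews first
NEG.sort_values(by = ['compound'], axis = 0, ascending=False).head(n=3)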

| | review | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 1852 | it's a sad state of affairs when the back box ... | 0.071 | 0.853 | 0.075 | -0.0933 |
| 1397 | the word to describe sharon stone is " wonder ... | 0.141 | 0.708 | 0.152 | -0.1023 |
| 1781 | " very bad things , " is the most delightfull... | 0.171 | 0.652 | 0.176 | -0.1411 |


**Task 3:** Apply `SentimentIntensityAnalyzer` on the entire dataset to estimate polarity scores. Print a classification report based on the following rule: 


* positive sentiment: compound score >= 0
* negative sentiment: compound score < 0

In [11]:
# Add a sentiment column: default to POS, then mark negative compound scores as NEG
df['sentiment'] = 'POS'
df.loc[df['compound'] < 0, 'sentiment'] = 'NEG'

df.head()

| | review | neg | neu | pos | compound | sentiment |
|---|---|---|---|---|---|---|
| 0 | arnold schwarzenegger has been an icon for act... | 0.153 | 0.678 | 0.169 | 0.6567 | POS |
| 1 | good films are hard to find these days . \ngre... | 0.075 | 0.802 | 0.123 | 0.9783 | POS |
| 2 | quaid stars as a man who has taken up the prof... | 0.083 | 0.766 | 0.151 | 0.9827 | POS |
| 3 | we could paraphrase michelle pfieffer's charac... | 0.095 | 0.801 | 0.104 | -0.8142 | NEG |
| 4 | kolya is one of the richest films i've seen in... | 0.015 | 0.835 | 0.15 | 0.9538 | POS |


In [12]:
mvr['target']

array([0, 1, 1, ..., 1, 0, 0])

In [21]:
df_true = pd.DataFrame()
df_true['review'] = mvr['data']
df_true['score'] = mvr['target']

In [23]:
df_true['score'] = df_true['score'].map({0: 'NEG', 1: 'POS'})
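
The mapping above hard-codes 0 and 1. `load_files` stores the class folder names in `target_names`, and each entry of `target` is an index into that list, so the labels can also be derived instead of typed by hand. A minimal sketch with stand-in values follows; the folder names `neg` and `pos` are an assumption about this dataset's layout, so check `mvr['target_names']` before relying on them.

```python
# Stand-ins for mvr['target_names'] and mvr['target'] (assumed values)
target_names = ['neg', 'pos']
target = [0, 1, 1, 0, 1]

# Each target value indexes into target_names; upper-case to match df['sentiment']
labels = [target_names[t].upper() for t in target]
print(labels)
```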

In [24]:
df_true.head()

| | review | score |
|---|---|---|
| 0 | arnold schwarzenegger has been an icon for act... | NEG |
| 1 | good films are hard to find these days . \ngre... | POS |
| 2 | quaid stars as a man who has taken up the prof... | POS |
| 3 | we could paraphrase michelle pfieffer's charac... | NEG |
| 4 | kolya is one of the richest films i've seen in... | POS |


In [26]:
from sklearn.metrics import classification_report

print(classification_report(df_true['score'], df['sentiment']))

              precision    recall  f1-score   support

         NEG       0.72      0.44      0.55      1000
         POS       0.60      0.83      0.69      1000

    accuracy                           0.64      2000
   macro avg       0.66      0.64      0.62      2000
weighted avg       0.66      0.64      0.62      2000
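
To read the report: precision for a class is the share of reviews predicted as that class that truly belong to it, and recall is the share of true members of the class that were predicted correctly. The sketch below recomputes these quantities on a small made-up label set; `per_class_metrics` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels for illustration only
y_true = ['NEG', 'NEG', 'NEG', 'NEG', 'POS', 'POS', 'POS', 'POS']
y_pred = ['NEG', 'NEG', 'POS', 'POS', 'POS', 'POS', 'POS', 'NEG']

for label in ('NEG', 'POS'):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

In the report above, a NEG recall of 0.44 means that more than half of the truly negative reviews received a non-negative compound score and were labeled POS.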

