# Project Overview

For this project we'll use the Cornell University Movie Review polarity dataset v2.0, obtained from http://www.cs.cornell.edu/people/pabo/movie-review-data/

We will try to predict whether each review is positive or negative.

# Importing Basic Libraries

In [1]:
# These are the libraries I typically use in my analysis so I find it easier to import them all at once
# If I need more libraries I will import them as needed

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline

# Importing the Dataset

In [2]:
# Our dataset is moviereviews.tsv, where tsv stands for "tab-separated values"
# To import the file correctly we need to pass delimiter = '\t'
# We will name the dataframe "movies"

movies = pd.read_csv('moviereviews.tsv', delimiter = '\t')

In [3]:
movies.head()

|   | label | review |
|---|-------|--------|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [4]:
# There are 2000 movie reviews in our dataset

movies.shape

(2000, 2)

In [5]:
# We do not have any missing values in the label column
# We do however have 35 missing values in the review column

movies.isnull().sum()

label      0
review    35
dtype: int64

In [6]:
# Here we will drop the missing values

movies.dropna(inplace = True)

In [7]:
# Just confirming that we no longer have any null values

movies.isnull().sum()

label     0
review    0
dtype: int64

In [8]:
# As an extra check, we will iterate through the dataset to look for reviews that contain only whitespace

# Initialize an empty list to collect the matching index labels
blanks = []

# itertuples() yields a tuple with the index, the label value, and the review text
# i = index, lb = label value, rv = review
for i, lb, rv in movies.itertuples():
    # isspace() returns True only when the string is non-empty and made up entirely of whitespace
    if rv.isspace():
        # Add the index to the blanks list we initialized above
        blanks.append(i)

In [9]:
# Here are the 27 indices where the review contains only whitespace

blanks

[57,
 71,
 147,
 151,
 283,
 307,
 313,
 323,
 343,
 351,
 427,
 501,
 633,
 675,
 815,
 851,
 977,
 1079,
 1299,
 1455,
 1493,
 1525,
 1531,
 1763,
 1851,
 1905,
 1993]
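
One caveat about the check above: `str.isspace()` returns `False` for an empty string, so it flags whitespace-only reviews but not zero-length ones. A minimal sketch of the difference, using a small made-up list (the `samples` values and the `is_blank` helper are illustrative and not part of the dataset or the original notebook):

```python
# Hypothetical sample strings, not taken from moviereviews.tsv
samples = ["a great film", "   ", "\t\n", ""]

def is_blank(text):
    """Return True for empty or whitespace-only strings."""
    return not text.strip()

# isspace() misses the empty string; is_blank() catches it
only_isspace = [s.isspace() for s in samples]
with_strip = [is_blank(s) for s in samples]

print(only_isspace)  # [False, True, True, False]
print(with_strip)    # [False, True, True, True]
```

In this notebook the distinction is unlikely to matter, because `pd.read_csv` reads empty fields as NaN by default and those rows are removed by `dropna`. If empty strings could be present, a comparison such as `movies['review'].str.strip() == ''` would be one way to cover both cases.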

In [10]:
# We will remove the rows with empty reviews by passing in our blanks list to our dataframe

movies.drop(blanks, inplace = True)

In [11]:
# After the cleaning we have a perfectly balanced split: 969 negative and 969 positive reviews

movies['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64

# Use VADER For Sentiment Analysis

In [12]:
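# VADER relies on its lexicon file; if it is missing, run nltk.download('vader_lexicon') once beforehand
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()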

In [13]:
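# Compute the VADER polarity scores (neg, neu, pos, compound) for each review
movies['scores'] = movies['review'].apply(lambda review: sid.polarity_scores(review))

# Extract the compound score, a normalized value between -1 (most negative) and +1 (most positive)
movies['compound'] = movies['scores'].apply(lambda scores_dict: scores_dict['compound'])

# Label a review 'pos' if its compound score is >= 0, otherwise 'neg'
movies['comp_score'] = movies['compound'].apply(lambda score: 'pos' if score >= 0 else 'neg')

movies.head()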

|   | label | review | scores | compound | comp_score |
|---|-------|--------|--------|----------|------------|
| 0 | neg | how do films like mouse hunt get into theatres... | {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... | -0.9125 | neg |
| 1 | neg | some talented actresses are blessed with a dem... | {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... | -0.8618 | neg |
| 2 | pos | this has been an extraordinary year for austra... | {'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com... | 0.9951 | pos |
| 3 | pos | according to hollywood movies made in last few... | {'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co... | 0.9972 | pos |
| 4 | neg | my first press screening of 1998 and already i... | {'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co... | -0.2484 | neg |
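
The labeling step above reduces to a single threshold on the compound score. Here is a minimal sketch of that rule in plain Python, applied to the five compound scores shown in the table. The function names are hypothetical. The three-way variant uses the ±0.05 cutoffs suggested in VADER's documentation, which this notebook does not use; it is included only for comparison.

```python
def binary_label(compound):
    """Notebook's rule: non-negative compound scores count as positive."""
    return 'pos' if compound >= 0 else 'neg'

def three_way_label(compound, cutoff=0.05):
    """Alternative rule with a neutral band around zero (assumed cutoff)."""
    if compound >= cutoff:
        return 'pos'
    if compound <= -cutoff:
        return 'neg'
    return 'neu'

# Compound scores copied from the first five rows above
compounds = [-0.9125, -0.8618, 0.9951, 0.9972, -0.2484]

binary = [binary_label(c) for c in compounds]
three_way = [three_way_label(c) for c in compounds]

print(binary)     # ['neg', 'neg', 'pos', 'pos', 'neg']
print(three_way)  # ['neg', 'neg', 'pos', 'pos', 'neg']

# A score of exactly zero is where the two rules disagree
print(binary_label(0.0), three_way_label(0.0))  # pos neu
```

Under the binary rule a review with no detected sentiment (compound of 0) is counted as positive, which is worth keeping in mind when reading the error breakdown later on.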


# Compare Polarity Scores to Original Label

In [14]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [15]:
# VADER labels only about 64% of the reviews correctly, not far above the 50% baseline for a balanced dataset.
# This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics.
# Many of the reviews had positive things to say about a movie, reserving final judgement for the last sentence.

accuracy_score(movies['label'], movies['comp_score'])

0.6357069143446853

In [16]:
print(classification_report(movies['label'], movies['comp_score']))

              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938



In [17]:
print(confusion_matrix(movies['label'], movies['comp_score']))

[[427 542]
 [164 805]]
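
The figures in the classification report can be recomputed directly from the confusion matrix, which helps show where the errors come from. This is a short sketch in plain Python, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, in the order neg, pos). The variable names are illustrative.

```python
# Counts copied from the confusion matrix above: rows = true label, columns = predicted label
neg_as_neg, neg_as_pos = 427, 542
pos_as_neg, pos_as_pos = 164, 805

total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos
accuracy = (neg_as_neg + pos_as_pos) / total

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision: of everything predicted as a class, how much was right
# Recall: of everything truly in a class, how much was found
precision_neg = neg_as_neg / (neg_as_neg + pos_as_neg)
recall_neg = neg_as_neg / (neg_as_neg + neg_as_pos)
precision_pos = pos_as_pos / (pos_as_pos + neg_as_pos)
recall_pos = pos_as_pos / (pos_as_pos + pos_as_neg)

f1_neg = f1(precision_neg, recall_neg)
f1_pos = f1(precision_pos, recall_pos)

print(f"accuracy: {accuracy:.4f}")
print(f"neg: precision={precision_neg:.2f} recall={recall_neg:.2f} f1={f1_neg:.2f}")
print(f"pos: precision={precision_pos:.2f} recall={recall_pos:.2f} f1={f1_pos:.2f}")
```

The 542 negative reviews labeled positive account for most of the errors, which is why recall for `neg` is only 0.44 while recall for `pos` is 0.83.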
