### Michelle Kouba
### Movie Reviews - Sentiment Analysis and Vader Analysis

In [None]:
# Import libraries
import pandas as pd
import string
from sklearn import metrics
from textblob import TextBlob

In [None]:
# Import and inspect dataframe
movie_review_df = pd.read_csv('labeledTrainData.tsv',sep='\t')

### Sentiment Analysis

In [None]:
## Finding the polarity and subjectivity of every review then categorizing it to match the sentiment score already provided
## for accuracy analysis.
# Function that gets the subjectivity of each review
def get_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity
# Function that gets the polarity of each review
def get_polarity(text):
    return TextBlob(text).sentiment.polarity
# Function that sets a category/sentiment for each score
def get_analysis(score):
    if score < 0:
        return 'Negative'
    elif score >= 0:
        return 'Positive'
# Creates the variables named Subjectivity and Polarity that will be populated
# by the functions above
movie_review_df['Subjectivity'] = movie_review_df['review'].apply(get_subjectivity)
movie_review_df['Polarity'] = movie_review_df['review'].apply(get_polarity)
# Creates a variable that will store the sentiment for each review
movie_review_df['Sentiment_Analysis'] = movie_review_df['Polarity'].apply(get_analysis)

In [None]:
## Checking the accuracy of the model with a confusion matrix
# Creating a Binary value to match to Sentiment
movie_review_df['Sentiment_Numeric'] = movie_review_df['Sentiment_Analysis'].map({'Negative':0, 'Positive':1})
# Create confusion matrix with True Positives and True Negatives on the diagonal
confusion_matrix = pd.crosstab(movie_review_df.sentiment, movie_review_df.Sentiment_Numeric, rownames=['Actual'], colnames=['Predicted'])
print(confusion_matrix)
# Print accuracy of the model
print(metrics.accuracy_score(movie_review_df.sentiment, movie_review_df.Sentiment_Numeric))


Predicted     0      1
Actual                
0          5307   7193
1           676  11824
0.68524


The accuracy of the model is roughly 69%. Because the data set is evenly split between positive and negative reviews (12,500 each), random guessing would be expected to score around 50%, so the model is more accurate than random guessing. It does a much better job identifying positive reviews (roughly 95% correct) than negative reviews (roughly 42% correct). While the positive reviews are coded very accurately, the result for negative reviews is worse than the 50% expected from random guessing.
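
The per-class figures quoted above come straight from the rows of the confusion matrix. As a minimal sketch (the helper names `per_class_recall` and `overall_accuracy` are illustrative and not part of the original notebook), the arithmetic can be reproduced from the printed counts:

```python
# Counts copied from the TextBlob confusion matrix above
# (outer key = actual label, inner key = predicted label).
confusion = {
    0: {0: 5307, 1: 7193},   # actual negative reviews
    1: {0: 676, 1: 11824},   # actual positive reviews
}

def per_class_recall(matrix):
    """Return the fraction of each actual class that was predicted correctly."""
    return {label: row[label] / sum(row.values()) for label, row in matrix.items()}

def overall_accuracy(matrix):
    """Return the share of all reviews whose predicted label matches the actual label."""
    correct = sum(row[label] for label, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

recall = per_class_recall(confusion)
accuracy = overall_accuracy(confusion)

print(f"negative reviews correct: {recall[0]:.3f}")  # 5307 / 12500
print(f"positive reviews correct: {recall[1]:.3f}")  # 11824 / 12500
print(f"overall accuracy: {accuracy:.5f}")           # 17131 / 25000
```

Each row sums to 12,500, which is why a coin flip would be expected to land near 50% on either class.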

In [None]:
## Vader analysis for extra credit
# Vader analysis generally uses unstemmed text
# Importing needed libraries/features
# import nltk; nltk.download('vader_lexicon')  # run once if the lexicon is not installed
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Create a single SentimentIntensityAnalyzer object so the lexicon is loaded only once.
sid_obj = SentimentIntensityAnalyzer()
# Function that returns the compound sentiment score of a movie review.
def sentiment_scores(text):
    # The polarity_scores method returns a sentiment dictionary
    # that contains pos, neg, neu, and compound scores.
    sentiment_dict = sid_obj.polarity_scores(text)
    return sentiment_dict['compound']
# Categorizes each review using the compound score as positive or negative
def get_analysis_Vader(score):
    if score >= 0.0:
        return 'Positive'
    elif score < 0.0:
        return 'Negative'
# Calculates and saves the compound score
movie_review_df['Compound'] =  movie_review_df['review'].apply(sentiment_scores)
movie_review_df['Compound'].head()
# Creates a variable that will store the sentiment for each review
movie_review_df['Sentiment_Analysis_Vader'] = movie_review_df['Compound'].apply(get_analysis_Vader)
## Checking the accuracy of the Vader model with a confusion matrix
# Creating a Binary value to match to Sentiment
movie_review_df['Sentiment_Numeric_Vader'] = movie_review_df['Sentiment_Analysis_Vader'].map({'Negative':0, 'Positive':1})
# Create confusion matrix with True Positives and True Negatives on the diagonal
confusion_matrix_Vader = pd.crosstab(movie_review_df.sentiment, movie_review_df.Sentiment_Numeric_Vader, rownames=['Actual'], colnames=['Predicted'])
print(confusion_matrix_Vader)
# Print accuracy of the model
print(metrics.accuracy_score(movie_review_df.sentiment, movie_review_df.Sentiment_Numeric_Vader))

Predicted     0      1
Actual                
0          6682   5818
1          1843  10657
0.69356


The accuracy of the Vader model is also roughly 69% (a little under one percentage point higher). Random guessing would be expected to score around 50%, so it is also more accurate than random guessing. This model also does a much better job identifying positive reviews (roughly 85% correct) than negative reviews (roughly 53% correct). It's important to note that this model doesn't do as well on the positive reviews as the earlier model and is only comparable to random guessing on the negative reviews. Neither model handles negative reviews well, so those might be worth reviewing by hand.
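
To make the comparison between the two models concrete, the sketch below puts both printed confusion matrices side by side. The `summarize` helper and the dictionary layout are assumptions made for illustration; only the counts come from the outputs above.

```python
# Counts copied from the two confusion matrices above (rows = actual, columns = predicted).
matrices = {
    "TextBlob": [[5307, 7193], [676, 11824]],
    "Vader": [[6682, 5818], [1843, 10657]],
}

def summarize(matrix):
    """Return the share of each actual class predicted correctly, plus overall accuracy."""
    negative_recall = matrix[0][0] / sum(matrix[0])
    positive_recall = matrix[1][1] / sum(matrix[1])
    accuracy = (matrix[0][0] + matrix[1][1]) / (sum(matrix[0]) + sum(matrix[1]))
    return {
        "negative_recall": negative_recall,
        "positive_recall": positive_recall,
        "accuracy": accuracy,
    }

summary = {name: summarize(matrix) for name, matrix in matrices.items()}

for metric in ("negative_recall", "positive_recall", "accuracy"):
    textblob_value = summary["TextBlob"][metric]
    vader_value = summary["Vader"][metric]
    change = (vader_value - textblob_value) * 100  # difference in percentage points
    print(f"{metric}: TextBlob {textblob_value:.3f}, Vader {vader_value:.3f}, change {change:+.1f} points")
```

Read this way, Vader gains about 11 points on negative reviews and gives up about 9 points on positive reviews, which is why the overall accuracy barely moves.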

### Data Preparation for a Custom Model

In [None]:
## Prepping the data for a custom model analysis by converting all the words in each review to lowercase
## letters, removing all punctuation and other characters, and removing stop words.
# Forces all words in the reviews to lower case
movie_review_df['cleaned_review'] = movie_review_df['review'].str.lower()
# Function that removes all punctuation
def remove_punctuation(text):
    for char in string.punctuation:
        text = text.replace(char, '')
    return text
# Removing punctuation from each review
movie_review_df['cleaned_review'] = movie_review_df['cleaned_review'].apply(remove_punctuation)
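# Function for removing stopwords
## Importing a list of stopwords (run nltk.download('stopwords') once if it is not installed):
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
def remove_stopwords(text):
    # Split the review into words and keep only the words that are not stopwords
    return ' '.join(word for word in text.split() if word not in stop)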
# Removing stopwords
movie_review_df['cleaned_review'] = movie_review_df['cleaned_review'].apply(remove_stopwords)
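
The three cleaning steps (lowercasing, stripping punctuation, dropping stopwords) can be checked on a single sentence before applying them to 25,000 reviews. This is a minimal sketch: `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list so the example runs without any downloads, and `clean_text` is an illustrative helper, not a function from the notebook.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list, kept tiny so the example runs offline.
SAMPLE_STOPWORDS = {"the", "a", "is", "it", "and", "this"}

def clean_text(text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    lowered = text.lower()
    # str.translate removes every punctuation character in a single pass
    no_punct = lowered.translate(str.maketrans('', '', string.punctuation))
    # Stopwords are matched against whole words, so the text is split first
    kept = [word for word in no_punct.split() if word not in stopwords]
    return ' '.join(kept)

print(clean_text("This movie is GREAT, and the ending surprised me!"))
# → movie great ending surprised me
```

Matching on whole words matters here: iterating over a string directly yields single characters, which would never match a multi-letter stopword.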

In [None]:
## Using NLTK's PorterStemmer to reduce the words to their stems.
# Importing needed libraries/features
from nltk.stem import PorterStemmer
# Shortcut
ps = PorterStemmer()
# Function that creates tokenized/stemmed words for easier matching and analysis.
def stem_sen(sentence):
    tokens = sentence.split()
    stem_tokens = [ps.stem(token) for token in tokens]
    return ' '.join(stem_tokens)
# Stemming words
movie_review_df['cleaned_review'] = movie_review_df['cleaned_review'].apply(stem_sen)

0    with all thi stuff go down at the moment with ...
1    the classic war of the world by timothi hine i...
2    the film start with a manag nichola bell give ...
3    it must be assum that those who prai thi film ...
4    superbl trashi and wondrou unpretenti 80 explo...
Name: cleaned_review, dtype: object

In [None]:
## Creating a bag of words matrix from the stemmed text.
# Importing needed libraries/features
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
# Creating the bag of words matrix
bows = count.fit_transform(movie_review_df['cleaned_review'])
# Checking the shape of the 'bag of words' matrix
bows.shape
# The number of rows matches the original list of reviews (25,000).

(25000, 91611)

In [None]:
## Creating a term frequency-inverse document frequency (tf-idf) matrix from the stemmed text.
# Importing needed libraries/features
from sklearn.feature_extraction.text import TfidfVectorizer
# Shortcut
tfidf = TfidfVectorizer()
# Creating the tfidf matrix
feature_matrix = tfidf.fit_transform(movie_review_df['cleaned_review'])
feature_matrix.shape
# These dimensions match the 'bag of words' matrix.

(25000, 91611)
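
The two matrices share the same dimensions because tf-idf reweights the same word counts rather than building a different vocabulary. The hand-rolled sketch below illustrates that on a hypothetical three-document corpus. It is not scikit-learn's implementation; the smoothed idf formula and L2 row normalization are assumptions modeled on the library's documented defaults and have not been checked against its output here.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the cleaned reviews
docs = ["good movie", "bad movie", "good good plot"]

# One column per distinct word, in alphabetical order
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Count how often each vocabulary word appears in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

bows = [bag_of_words(doc) for doc in docs]

# Document frequency: in how many documents each word appears
n_docs = len(docs)
doc_freq = [sum(1 for row in bows if row[j] > 0) for j in range(len(vocab))]
# Smoothed inverse document frequency (assumed formula, see the note above)
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight the counts by idf, then scale the row to unit length."""
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value ** 2 for value in weighted))
    return [value / norm for value in weighted]

tfidf = [tfidf_row(row) for row in bows]

print(vocab)
print(bows)
print((len(bows), len(bows[0])), (len(tfidf), len(tfidf[0])))
```

Both matrices come out as 3 rows by 4 columns. The word "movie" appears in two of the three documents, so it receives a smaller idf weight than "bad" or "plot", which appear in only one.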