# MSDS 7337 HW-8 Sentiment Analysis

Author: Nathan Wall

Date: 8/13/2019

This notebook contains the code and output for a sentiment analysis of a sample of IMDB reviews from the crime genre.

Notebook Sections:
- [Data Preparation](#prep)
- [Clustering](#clustering)
- [Sentiment](#sentiment)

In [32]:
import random 
random.seed(13)
import operator
import re
import json
import pandas as pd
import numpy as np

#text pre-processing
import spacy
!python -m spacy download en_core_web_sm
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


## Preparing the reviews
<a id='prep'></a>

In [4]:
with open('msds7337_nwall_reviews.json', 'r') as json_file:
    data = json.load(json_file)

reviews = json.loads(data)
corpus = [r['reviewText'] for r in reviews]
len(corpus)

1219
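The cell above decodes twice: `json.load` on the file, then `json.loads` on the result. That pattern only works if the file stores a JSON string that itself contains JSON. The source does not show how the file was written, so the following is a minimal sketch under that assumption; the sample records and the in-memory buffer are hypothetical stand-ins for the real file.

```python
import io
import json

# Hypothetical records shaped like the ones the notebook reads
records = [{"reviewText": "Great heist movie."}, {"reviewText": "Too slow for me."}]

# Dumping an already-serialized string encodes the data a second time
buffer = io.StringIO()
json.dump(json.dumps(records), buffer)

# Reading it back therefore needs two decoding steps
buffer.seek(0)
data = json.load(buffer)      # first decode: returns a str
reviews = json.loads(data)    # second decode: returns a list of dicts
corpus = [r["reviewText"] for r in reviews]
print(type(data).__name__, len(corpus))
```

If the file had been written with a single `json.dump(records, f)`, one `json.load` would be enough and the second call would raise a `TypeError`.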

In [5]:
#load default pipeline
nlp = spacy.load("en_core_web_sm")
def preprocess_review(text, remove_ne=True):
    """Apply basic preprocessing steps to a single document.

    :param text: A string, the single document you want to process.
    :param remove_ne: A boolean, whether to remove named entities and proper nouns from the text.

    returns: A string with all tokens lemmatized and lower-cased, with stop words, pronouns, punctuation, numbers, and whitespace removed
    """
    doc = nlp(text)
    if remove_ne:
        named_entity = [i.text for i in doc.ents] #create list of named entities
        tokens = [token.lemma_.lower() for token in doc #take the lower case lemmatized word
                  if token.lemma_ != "-PRON-" #remove pronouns (spaCy v2 lemmatizes them to "-PRON-")
                  and token.pos_ != "PROPN" #remove proper nouns
                  and not token.is_punct #remove punctuation
                  and not token.is_stop #remove stop words
                  and not token.is_digit #remove numbers
                  and not token.is_space #remove whitespace
                  and token.text not in named_entity #remove tokens that match a named entity
                 ]
    else:
        tokens = [token.lemma_.lower() for token in doc #take the lower case lemmatized word
                  if token.lemma_ != "-PRON-" #remove pronouns (spaCy v2 lemmatizes them to "-PRON-")
                  and not token.is_punct #remove punctuation
                  and not token.is_stop #remove stop words
                  and not token.is_digit #remove numbers
                  and not token.is_space #remove whitespace
                 ]
    tokens = " ".join(tokens)
    return tokens

tokens = preprocess_review(corpus[1])
print(tokens)

nice easy breezy murder mystery fun count deep sit popcorn soda enjoy movie offencive adult murder mystery romp like anymore ignore people like criticize think actual critic chemistry awesome hope movie


In [29]:
docs = [preprocess_review(doc) for doc in corpus]
print(len(docs))
print(docs[1])

1219
nice easy breezy murder mystery fun count deep sit popcorn soda enjoy movie offencive adult murder mystery romp like anymore ignore people like criticize think actual critic chemistry awesome hope movie


## Clustering the reviews
<a id='clustering'></a>

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.6, min_df=1, stop_words='english', use_idf=True)
tf_idf = vectorizer.fit_transform(docs)
tf_idf.shape

(1219, 7196)

In [36]:
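n_clusters = 4
#note: random.seed() above does not seed scikit-learn; pass random_state to KMeans for repeatable labels
kmeans_model = KMeans(n_clusters=n_clusters, init='k-means++')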

km4 = kmeans_model.fit(tf_idf)

unique, counts = np.unique(km4.labels_, return_counts=True)
print(np.asarray((unique, counts)).T)

[[  0 372]
 [  1 552]
 [  2 144]
 [  3 151]]


## Sentiment Analysis
<a id='sentiment'></a>

### 1. In Python, load one of the sentiment vocabularies referenced in the textbook, and run the sentiment analyzer as explained in the corresponding reference.

In [25]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/newall/nltk_data...


True

In [26]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [34]:
analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores(docs[1])
#remove the compound label
del scores['compound']

In [35]:
print(scores)
print("The sentiment of this review is {}".format(max(scores.items(), key=operator.itemgetter(1))[0]))

{'neg': 0.282, 'neu': 0.306, 'pos': 0.412}
The sentiment of this review is pos


The sample review scores as mostly positive, which makes sense given the content of that review shown above.
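The cell above labels the review by taking the largest of the `neg`, `neu`, and `pos` proportions after dropping `compound`. A common alternative is to threshold the `compound` score instead. The sketch below assumes the ±0.05 cutoffs suggested in the VADER documentation; the function name `label_from_compound` and its `threshold` parameter are illustrative and not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to 'pos', 'neu', or 'neg'.

    The default cutoff of 0.05 is an assumption based on the convention
    suggested in the VADER documentation, not something set in this notebook.
    """
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


# Compound scores taken from the outputs in this notebook
for score in (0.765, 0.0258, -0.9934):
    print(score, label_from_compound(score))
```

Thresholding `compound` uses the single normalized score that the argmax approach discards, so the two methods can disagree on reviews with mixed language.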

### 2. For each of the clusters you created in homework 7, compute the average, median, high, and low sentiment scores for each cluster.

In [51]:
#compute sentiment for each review and build a dataframe.
analyzer = SentimentIntensityAnalyzer()
#DataFrame.append was removed in pandas 2.0, so score every review first and build the frame once
df_sentiment = pd.DataFrame([analyzer.polarity_scores(d) for d in docs])

In [53]:
df_sentiment['cluster'] = km4.labels_
df_sentiment.head()

| | neg | neu | pos | compound | cluster |
|---|---|---|---|---|---|
| 0 | 0.141 | 0.703 | 0.157 | 0.0772 | 2 |
| 1 | 0.282 | 0.306 | 0.412 | 0.765 | 3 |
| 2 | 0.0 | 0.527 | 0.473 | 0.9244 | 0 |
| 3 | 0.282 | 0.423 | 0.296 | 0.0258 | 2 |
| 4 | 0.299 | 0.388 | 0.313 | 0.7783 | 1 |


Now we have the sentiment for all the clustered data. Let's summarize the scores by cluster.

In [55]:
cluster_summ = df_sentiment.groupby('cluster').agg({'neg':['mean', 'median', 'min', 'max'],
                                     'pos':['mean', 'median', 'min', 'max'],
                                     'compound':['mean', 'median', 'min', 'max']
                                     })

In [56]:
cluster_summ

| cluster | neg mean | neg median | neg min | neg max | pos mean | pos median | pos min | pos max | compound mean | compound median | compound min | compound max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.181387 | 0.0605 | 0.0 | 1.0 | 0.311046 | 0.3015 | 0.0 | 1.0 | 0.211208 | 0.2263 | -0.9934 | 0.9947 |
| 1 | 0.131467 | 0.117 | 0.0 | 0.778 | 0.350803 | 0.3415 | 0.0 | 1.0 | 0.660251 | 0.94195 | -0.9978 | 0.999 |
| 2 | 0.150208 | 0.146 | 0.0 | 0.605 | 0.311368 | 0.2855 | 0.0 | 0.802 | 0.401949 | 0.65415 | -0.9648 | 0.9976 |
| 3 | 0.151887 | 0.133 | 0.0 | 0.646 | 0.343132 | 0.334 | 0.0 | 0.767 | 0.537387 | 0.872 | -0.986 | 0.9991 |
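As a dependency-free cross-check of what the `groupby('cluster').agg(...)` call computes, the same four statistics can be sketched with the standard library. The rows below are the five `(compound, cluster)` pairs from the `df_sentiment.head()` output, so the numbers will not match the full-sample table above; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean, median

# (compound, cluster) pairs copied from the df_sentiment.head() output
rows = [(0.0772, 2), (0.765, 3), (0.9244, 0), (0.0258, 2), (0.7783, 1)]

# Group the compound scores by cluster label
by_cluster = defaultdict(list)
for compound, cluster in rows:
    by_cluster[cluster].append(compound)

# Average, median, low, and high score for each cluster
summary = {
    cluster: {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }
    for cluster, values in sorted(by_cluster.items())
}

for cluster, stats in summary.items():
    print(cluster, stats)
```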


Overall, these reviews are largely positive. From the findings in HW7, we know the sample is biased toward popular, well-liked titles, since the reviews come from the top 50 crime movies and TV shows on IMDB.

Cluster 0 appears to be the most neutral cluster (mean compound score of 0.21) and is also our second largest. It may contain our most critical reviews, which makes it a group of interest for the types of movies and TV shows it covers.

Cluster 1 appears to be the most positive (mean compound score of 0.66) and is also our largest cluster. In HW7 our largest cluster was usually a fairly noisy group, but here it may be held together by the 'positivity' of the language used.

Clusters 2 and 3 are similar in both sentiment and size. Interestingly, neither contains any entirely positive reviews (maximum pos scores of 0.802 and 0.767), unlike the other two clusters, which each reach a maximum pos score of 1.0. Both are very specific clusters, so a shared, mostly positive sentiment is somewhat more informative.
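The ranking of clusters described above can be reproduced from the summary table. The sketch below uses the mean and median compound scores copied from that table; the gap between median and mean is offered only as a rough, assumed indicator of how much a tail of negative reviews pulls each cluster's average down.

```python
# (mean, median) of the compound score per cluster, copied from the summary table
compound_stats = {
    0: (0.211208, 0.2263),
    1: (0.660251, 0.94195),
    2: (0.401949, 0.65415),
    3: (0.537387, 0.872),
}

# Rank clusters from most to least positive by mean compound score
ranked = sorted(compound_stats, key=lambda c: compound_stats[c][0], reverse=True)
print("most to least positive:", ranked)

# Median minus mean: a positive gap suggests a tail of low-scoring reviews
gaps = {c: round(med - avg, 3) for c, (avg, med) in compound_stats.items()}
print("median - mean gap:", gaps)
```

Cluster 0 has by far the smallest gap, which is consistent with its scores sitting closer to neutral than being split between very positive and very negative reviews.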