# MSDS 7337 Homework 7

Author: Nathan Wall

Date: 7/30/2019

The notebook below works through the questions for homework 7.

Notebook Sections:
- [Data Preparation](#prep)
- [Q1: Cluster the Reviews](#q1)
- [Q2: Cluster Interpretations](#q2)
- [Q3: Clustering Evaluations](#q3)

In [49]:
import re
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

## Preparing the reviews
<a id='prep'></a>


In [2]:
with open('msds7337_nwall_reviews.json', 'r') as json_file:
    data = json.load(json_file)

reviews = json.loads(data)
len(reviews)

1219
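
The two-step decode above (`json.load` followed by `json.loads`) suggests that the file holds a JSON string whose contents are themselves JSON, which can happen when already-serialized text is passed to `json.dump`. The following is a minimal sketch of that situation under this assumption, using a made-up in-memory record rather than the real file; only the `reviewText` key is taken from the notebook.

```python
import json

# Hypothetical sample record; only the 'reviewText' key comes from the notebook.
sample_reviews = [{"reviewText": "A nice easy breezy murder mystery."}]

# Serialising twice produces a JSON string that wraps another JSON document.
double_encoded = json.dumps(json.dumps(sample_reviews))

first_pass = json.loads(double_encoded)   # still a str, like `data` above
second_pass = json.loads(first_pass)      # now a list of dicts, like `reviews`

print(type(first_pass).__name__, type(second_pass).__name__)
```

If the file were written with a single `json.dump` call on the list itself, the second decode would not be needed.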

In [3]:
text = []
for r in reviews:
    text.append(r['reviewText'])

len(text)

1219

In [27]:
text[1]

"A nice easy breezy murder mystery. Full of fun. Don't count on anything serious or deep here just sit back with your popcorn and a soda and enjoy the movie. Nothing offencive here. Just an adult murder mystery romp. We don't get many like these anymore. Ignore the people who like to criticize everything because they think they are actual critics. Chemistry between Aniston and Sadler is awesome. I hope they make more movies together."

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.8, min_df=1, stop_words='english', use_idf=True)
tf_idf = vectorizer.fit_transform(text)
tf_idf.shape

(1219, 11378)
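
To make the weights in the matrix above less opaque, here is a rough pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, then L2-normalised per document. The helper `tfidf_rows` and the toy documents are hypothetical, and the sketch splits on whitespace without removing stop words, so its numbers will not match the notebook output.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed idf, matching scikit-learn's default (smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Made-up documents for illustration only.
toy_docs = ["fun murder mystery movie", "slow boring movie", "great season finale"]
toy_rows = tfidf_rows(toy_docs)
print(toy_rows[0])
```

In the first toy document, "movie" appears in two of the three documents and so receives a lower weight than "murder", which appears in only one.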

In [26]:
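# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names()
feature_names = vectorizer.get_feature_names_out()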
doc = 1
feature_index = tf_idf[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tf_idf[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)

nice 0.13785749978974188
easy 0.14683477115187507
breezy 0.23134091124737477
murder 0.28404883516915064
mystery 0.25252532463070654
fun 0.11782810949127226
don 0.17652914783004411
count 0.1922477796359169
deep 0.14683477115187507
just 0.13537395099279823
sit 0.15948752449288753
popcorn 0.20274757389917886
soda 0.23134091124737477
enjoy 0.12039439288142967
movie 0.09363720448505282
offencive 0.23134091124737477
adult 0.18808086184108347
romp 0.20274757389917886
like 0.12866669012040097
anymore 0.16108815889756356
ignore 0.18808086184108347
people 0.08907096321377915
criticize 0.2186881579063623
think 0.09344865086381315
actual 0.1627753551590585
critics 0.17814334458055253
chemistry 0.15379808379692533
aniston 0.18808086184108347
sadler 0.23134091124737477
awesome 0.14114533045591285
hope 0.1077416395404172
make 0.0930749274823218
movies 0.1136099089743408



## Q1: Cluster the Reviews
<a id='q1'></a>

Select any one of the clustering methods covered in this course. Run it over the collection of reviews, and show at least two different ways of clustering the reviews, e.g., changing k in k-Means clustering or changing where you “cut” in Agnes or Diana.

Below we create two different clusterings using k-means with k=3 and k=5.

In [45]:
n_clusters = 3
kmeans_model = KMeans(n_clusters=n_clusters, init='k-means++')

km3 = kmeans_model.fit(tf_idf)

unique, counts = np.unique(km3.labels_, return_counts=True)
print(np.asarray((unique, counts)).T)

[[  0 130]
 [  1 149]
 [  2 940]]


With k=3, cluster 2 holds 940 of the 1,219 reviews (about 77%). Such an imbalance suggests the reviews may be poorly separated, which we explore further below.

In [29]:
n_clusters = 5
kmeans_model = KMeans(n_clusters=n_clusters, init='k-means++')

km5 = kmeans_model.fit(tf_idf)

unique, counts = np.unique(km5.labels_, return_counts=True)
print(np.asarray((unique, counts)).T)

[[  0 142]
 [  1 141]
 [  2  98]
 [  3 435]
 [  4 403]]


With k=5 there are two large clusters (3 and 4, with 435 and 403 reviews) and three smaller ones of roughly 100 to 150 reviews each. This is likely a little better balanced than k=3, but let's review the semantic interpretations of the two clusterings.
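
Because k-means starts from random centroids, the cluster sizes above can change between runs unless a seed is fixed. The sketch below shows one way to compare cluster sizes for several values of k reproducibly. The helper `cluster_sizes`, the seed value, and the toy corpus are assumptions for illustration; in the notebook, `tf_idf` would be passed in place of `toy_matrix`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus standing in for the scraped reviews.
toy_reviews = [
    "great season finale cannot wait for next season",
    "this season was slow but the last episode delivered",
    "binge watched every episode of the series",
    "fun action movie with great stunts",
    "the film is a solid action movie",
    "best movie of the year loved the film",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_reviews)

def cluster_sizes(matrix, k, seed=0):
    """Fit k-means with a fixed seed and return the number of rows per cluster."""
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
    labels = model.fit_predict(matrix)
    return np.bincount(labels, minlength=k)

for k in (2, 3):
    print(k, cluster_sizes(toy_matrix, k))
```

Fixing `random_state` does not make a clustering better, it only makes the reported sizes repeatable.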

## Q2: Interpreting the Clusters
<a id='q2'></a>

Write a short phrase to characterize (give a natural interpretation of) what each cluster is generally centered on semantically. Is this hard to do in some cases? If so, make note of that fact. 

First, let's define some functions to help determine the semantic meaning behind each cluster.

In [35]:
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

In [36]:
def top_mean_feats(Xtr, features, grp_ids=None, min_tfidf=0.1, top_n=25):
    ''' Return the top n features that on average are most important amongst documents in rows
        identified by indices in grp_ids. '''
    if grp_ids is not None:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()

    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)

In [37]:
def top_feats_by_class(Xtr, y, features, min_tfidf=0.1, top_n=25):
    ''' Return a list of dfs, where each df holds top_n features and their mean tfidf value
        calculated across documents with the same class label. '''
    dfs = []
    labels = np.unique(y)
    for label in labels:
        ids = np.where(y==label)
        feats_df = top_mean_feats(Xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs

##### Kmeans where K = 3

In [46]:
cluster_description = top_feats_by_class(tf_idf, km3.labels_, feature_names, min_tfidf=0.1, top_n=25)

for i, df in enumerate(cluster_description):
    print("Cluster {}".format(i))
    print(df)

Cluster 0
         feature     tfidf
0         season  0.231639
1             10  0.028140
2           wait  0.025221
3           just  0.024387
4        episode  0.022559
5        netflix  0.020688
6        writers  0.016164
7       watching  0.015919
8        started  0.015917
9          thank  0.013940
10  disappointed  0.013914
11         worst  0.013872
12         story  0.013727
13        better  0.013033
14        series  0.012836
15          stop  0.012205
16          want  0.011764
17          love  0.011695
18         watch  0.011538
19       watched  0.011421
20           don  0.011393
21          best  0.011187
22         loved  0.011021
23      episodes  0.010918
24        second  0.010848
Cluster 1
    feature     tfidf
0     movie  0.145287
1      film  0.070289
2    action  0.053485
3      wick  0.037010
4      john  0.032943
5   redford  0.029700
6     bundy  0.024532
7    movies  0.019706
8     watch  0.017101
9     keanu  0.015445
10      fun  0.015202
11     just  0

Based on the above, it looks like two of the clusters are largely driven by words used to describe series versus movies. Cluster 0 seems specific to TV shows, while cluster 1 is specific to movies.

However, cluster 2 (our largest cluster) is also made up mainly of terms commonly used for television shows. This makes it difficult to pin down the semantic difference between cluster 0 and cluster 2.

##### Kmeans where K = 5

In [47]:
cluster_description = top_feats_by_class(tf_idf, km5.labels_, feature_names, min_tfidf=0.1, top_n=25)

for i, df in enumerate(cluster_description):
    print("Cluster {}".format(i))
    print(df)

Cluster 0
         feature     tfidf
0         season  0.211458
1       watching  0.036058
2             10  0.025693
3           wait  0.024525
4           just  0.022326
5        episode  0.020653
6        netflix  0.019758
7           stop  0.017428
8          watch  0.014563
9        started  0.014137
10         story  0.013731
11        series  0.013454
12       writers  0.013416
13          want  0.012816
14     political  0.012781
15         thank  0.012762
16        second  0.012744
17  disappointed  0.012738
18         worst  0.012700
19           don  0.012470
20          love  0.012144
21       watched  0.012129
22        people  0.012086
23          plot  0.011545
24       amazing  0.011422
Cluster 1
    feature     tfidf
0     movie  0.146067
1      film  0.069117
2    action  0.057386
3      wick  0.039110
4      john  0.034813
5   redford  0.032318
6     bundy  0.025924
7    movies  0.021689
8     keanu  0.016321
9     shaft  0.015540
10     just  0.015462
11    watch  0

As above, one cluster (cluster 1) seems very specific to movies, while the remaining clusters relate to television shows. Some seem specific to a particular show: cluster 2 looks specific to Breaking Bad, which makes sense as it is our smallest cluster. Others, such as clusters 0, 3, and 4, are vaguer: they mention several different shows or more general terms.
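
A complementary way to characterize clusters is to read the highest-weighted terms directly from each k-means centroid. This is a different technique from the thresholded mean TF-IDF helpers used above, so the rankings may differ. The sketch below is an illustration only: the helper `top_centroid_terms` and the toy corpus are hypothetical, and the feature-name lookup is written to work on both older and newer scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus standing in for the scraped reviews.
toy_reviews = [
    "great season finale cannot wait for next season",
    "this season was slow but the last episode delivered",
    "binge watched every episode of the series",
    "fun action movie with great stunts",
    "the film is a solid action movie",
    "best movie of the year loved the film",
]
vec = TfidfVectorizer(stop_words="english")
toy_matrix = vec.fit_transform(toy_reviews)

# get_feature_names_out() exists from scikit-learn 1.0; older releases use get_feature_names().
if hasattr(vec, "get_feature_names_out"):
    terms = list(vec.get_feature_names_out())
else:
    terms = list(vec.get_feature_names())

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_matrix)

def top_centroid_terms(model, terms, top_n=3):
    """Return the top_n highest-weighted terms of each cluster centroid."""
    order = np.argsort(model.cluster_centers_, axis=1)[:, ::-1]
    return [[terms[i] for i in row[:top_n]] for row in order]

top_terms = top_centroid_terms(model, terms)
for cluster_id, words in enumerate(top_terms):
    print(cluster_id, words)
```

Applied to the notebook objects, the call would look like `top_centroid_terms(km5, feature_names, top_n=25)`.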

## Q3: Clustering Evaluations
<a id='q3'></a>

Explain which of the two clustering results from question 1 is preferable (if one of them is), and why.

In [53]:
print("K-Means (k=3) silhouette score is {}".format(silhouette_score(tf_idf, km3.labels_)))
print("K-Means (k=5) silhouette score is {}".format(silhouette_score(tf_idf, km5.labels_)))

K-Means (k=3) silhouette score is 0.0045259664479304065
K-Means (k=5) silhouette score is 0.005248936404982356


By a numerical measure, neither of these clusterings is very good. The silhouette score ranges from -1 to 1, and values near 0 indicate overlapping clusters. While k=5 is a little better, both scores are very close to 0 and far from 1, meaning there is a lot of overlap between the clusters, which makes it very difficult to differentiate between the groups.
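
To give a feel for the scale of the silhouette score, the sketch below compares two synthetic data sets: one with tight, far-apart groups and one with wide groups whose centres almost coincide. The data, centres, and spreads are made up for illustration and have nothing to do with the review matrix.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two tight groups that are far apart.
X_sep, y_sep = make_blobs(
    n_samples=200, centers=[[0, 0], [10, 10]], cluster_std=0.5, random_state=0
)
# Two wide groups whose centres almost coincide.
X_mix, y_mix = make_blobs(
    n_samples=200, centers=[[0, 0], [0.5, 0.5]], cluster_std=3.0, random_state=0
)

separated_score = silhouette_score(X_sep, y_sep)
overlapping_score = silhouette_score(X_mix, y_mix)

print("separated:   {:.3f}".format(separated_score))
print("overlapping: {:.3f}".format(overlapping_score))
```

The separated groups score close to 1, while the overlapping groups score close to 0, which is the regime both review clusterings fall into.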

From a semantic perspective, you can begin to see more value in the clustering with k=5, as the title-specific reviews start to bubble up. This is perhaps a good indication that a number of clusters closer to the number of titles we scraped reviews from would likely improve the clustering.
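
One way to test whether more clusters would help is to sweep over several values of k and compare silhouette scores. The sketch below does this on synthetic data with four well-separated groups, so it only demonstrates the procedure. The helper `silhouette_by_k`, the seed, and the range of k are assumptions; in the notebook, `tf_idf` would be passed in place of `X_demo`, with a range reaching the number of scraped titles.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def silhouette_by_k(matrix, ks, seed=0):
    """Fit k-means for each k and return {k: silhouette score}."""
    scores = {}
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(matrix)
        scores[k] = silhouette_score(matrix, labels)
    return scores

# Synthetic data with four tight groups, used only to demonstrate the sweep.
X_demo, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = silhouette_by_k(X_demo, range(2, 7))
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("best k:", best_k)
```

On the synthetic data the score peaks at the true number of groups; whether the review data shows a similar peak is something the sweep would have to reveal.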

Overall, neither clustering performed well from a quantitative or semantic standpoint, and a higher number of clusters would likely improve the results.