This notebook processes the review text and combines it with the clustering results from the previous section, preparing both for modeling and testing in the next section.

First, I load the necessary Python libraries.

In [1]:
import pickle as pkl
import nltk
from wordcloud import WordCloud
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from nltk.stem import PorterStemmer
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import math

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nickj\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nickj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Here, I create a dictionary linking each movie ID to its reviews, and also initialize the Porter Stemmer, a text processing tool which reduces similar words to a shared base.

In [None]:
df = pd.read_csv('movie_reviews_clean.csv')
df = df.dropna()
df = df.reset_index(drop=True)
porter = PorterStemmer()

print(len(df))

# Map each movie ID to its review text. If a movieId appears more than once,
# the last row wins.
data = dict(zip(df['movieId'], df['review_text']))

print(df['review_text'][0])
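
To illustrate what the stemmer does, here is a minimal sketch on a few sample words. The words are chosen for illustration and are not taken from the review data.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Illustrative words only; related forms should collapse to one stem.
samples = ['movie', 'movies', 'story', 'stories']
stems = [porter.stem(w) for w in samples]
print(stems)
```

Both forms of each word map to the same stem, which is why truncated terms such as movi and stori appear in the rankings later in this notebook.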

Here, I split each movie's reviews into individual words, convert them to lower case, remove common stopwords, and use the Porter Stemmer to stem the remaining words, saving the result using Pickle.

In [None]:
totalwords = []

print(len(data))

# Build the stopword set once; calling stopwords.words() for every token is very slow.
stop_words = set(nltk.corpus.stopwords.words('english'))

for key in data.keys():
    words = nltk.word_tokenize(data[key])

    # Keep purely alphabetic tokens, lower-cased.
    words_no_p = [w.lower() for w in words if w.isalpha()]
    # Drop stopwords, then stem what remains.
    clean_words = [porter.stem(w) for w in words_no_p if w not in stop_words]
    totalwords.append(clean_words)

with open('totalwords.pkl','wb') as outfile:
    pkl.dump(totalwords,outfile)
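
The cleaning steps above can be sketched as a single function. To keep the sketch self-contained, it swaps in a regular-expression tokenizer, a tiny hand-written stopword set, and an identity stemmer. These are stand-ins for illustration, not the NLTK components used in this notebook, and the name `clean_review` is hypothetical.

```python
import re

# Stand-in stopword list for illustration; the notebook uses NLTK's English list.
TOY_STOPWORDS = {'the', 'a', 'is', 'and', 'of'}

def clean_review(text, stop_words=TOY_STOPWORDS, stem=lambda w: w):
    """Tokenize, keep alphabetic tokens, lower-case, drop stopwords, then stem."""
    # Crude tokenizer (assumption): runs of letters, runs of digits, or single symbols.
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)
    alpha = [t.lower() for t in tokens if t.isalpha()]
    return [stem(t) for t in alpha if t not in stop_words]

cleaned = clean_review("The plot is thin, and the ending of Part 2 drags!")
print(cleaned)
```

Passing `stem=porter.stem` and NLTK's stopword list would bring the sketch closer to the cell above, although the tokenizer would still differ from `nltk.word_tokenize`.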

I then join each movie's stemmed words back into a single string, giving a list with one string per movie.

In [3]:
with open('totalwords.pkl','rb') as infile:
    totalwords = pkl.load(infile)

# Join each movie's stemmed tokens into one space-separated string.
newlist = [" ".join(x) for x in totalwords]

print(len(newlist))

9985


Here, I extract 1grams (single words) and bigrams (pairs of adjacent words) from the data. To find the most relevant terms, I compute TF-IDF (term frequency-inverse document frequency) scores, sum each term's score over all movies, and rank the terms by that sum.

Some of this code is adapted from code that I found on the internet while researching this topic.

In [8]:
# TfidfVectorizer builds the same vocabulary a CountVectorizer would,
# so a single vectorizer supplies both the terms and the scores.
vectorizer = TfidfVectorizer(stop_words=None,ngram_range = (1,1))
X2 = vectorizer.fit_transform(newlist)
X2 = X2.astype('float32')
# get_feature_names() was removed in scikit-learn 1.2.
features = vectorizer.get_feature_names_out()

# Sum each term's TF-IDF score over all documents; X2 stays sparse to save memory.
sums = X2.sum(axis = 0)
data1 = []
for col, term in enumerate(features):
    data1.append( (term, sums[0,col] ))
ranking = pd.DataFrame(data1, columns = ['term','rank'])
words_1gram = (ranking.sort_values('rank', ascending = False))
print ("\n\nWords head : \n", words_1gram.head(20))



Words head : 
           term        rank
18692     film  716.570068
36320     movi  640.840515
38839      one  351.436432
31271     like  290.370239
9040   charact  258.497742
54558     time  234.471924
51778    stori  231.685501
21784     good  222.329208
47820      see  218.288284
32773     make  218.128036
31943     love  205.090851
21095      get  197.714996
59085    watch  196.089096
22207    great  189.719666
44139   realli  185.801620
47202    scene  182.003174
60402    would  180.950806
59372     well  178.284592
40668    peopl  177.449936
17361     even  174.749695
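
To make the ranking above more concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It assumes scikit-learn's default TF-IDF settings (smoothed idf and L2 normalization per document) and simplifies tokenization to whitespace splitting. The function name `tfidf_term_totals` and the toy documents are my own.

```python
import math
from collections import Counter

def tfidf_term_totals(docs):
    """Sum L2-normalized TF-IDF scores per term over all documents."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf, as in scikit-learn's default: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    totals = Counter()
    for tokens in tokenized:
        weights = {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            continue  # skip empty documents
        for t, w in weights.items():
            totals[t] += w / norm
    # Highest total first, mirroring sort_values('rank', ascending=False).
    return totals.most_common()

toy_docs = ["film good film", "movi good", "film stori"]
ranking = tfidf_term_totals(toy_docs)
print(ranking[:2])
```

Terms that appear in many documents still rise to the top here, because the per-document scores are summed. This matches the real output above, where generic terms such as film and movi lead the list.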


In [9]:
outfile = open('1grams.pkl','wb')
pkl.dump(words_1gram.head(2500),outfile)
outfile.close()

In [10]:
# Only the first 1000 documents are processed in this pass.
vectorizer = TfidfVectorizer(stop_words=None,ngram_range = (2,2))
X2 = vectorizer.fit_transform(newlist[0:1000])
X2 = X2.astype('float32')
# get_feature_names() was removed in scikit-learn 1.2.
features = vectorizer.get_feature_names_out()

# Sum each bigram's TF-IDF score over the documents; X2 stays sparse to save memory.
sums = X2.sum(axis = 0)
data1 = []
for col, term in enumerate(features):
    data1.append( (term, sums[0,col] ))
ranking = pd.DataFrame(data1, columns = ['term','rank'])
words_bigram = (ranking.sort_values('rank', ascending = False))
print ("\n\nWords head : \n", words_bigram.head(20))



Words head : 
                   term      rank
198525        one best  4.708863
306868      watch movi  3.825894
105626      first time  3.475363
192430        new york  3.242450
89722        ever seen  3.169294
89098      even though  3.108515
185926       movi like  3.061846
167387       love movi  2.995861
319299        year old  2.961021
185519       movi ever  2.946520
120126       good movi  2.885079
167617      love stori  2.878795
2560       action movi  2.849945
264265  special effect  2.830976
306682      watch film  2.739706
186108        movi one  2.719745
122453      great movi  2.687935
247917        see movi  2.671350
95975        fall love  2.664104
198854     one favorit  2.654391


In [11]:
outfile = open('bigram1.pkl','wb')
pkl.dump(words_bigram.head(5000),outfile)
outfile.close()
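
The bigram cell above covers `newlist[0:1000]` only and saves its result as bigram1.pkl. Here is a sketch of how the remaining slices could be enumerated, assuming every batch holds 1,000 documents and batch i is saved as bigram{i}.pkl. The helper name `batch_bounds` is hypothetical.

```python
def batch_bounds(n_items, batch_size):
    """Yield (batch_number, start, stop) triples covering range(n_items)."""
    for number, start in enumerate(range(0, n_items, batch_size), start=1):
        yield number, start, min(start + batch_size, n_items)

# 9985 documents in batches of 1000 (the batch size is inferred from the slice above).
bounds = list(batch_bounds(9985, 1000))
for number, start, stop in bounds:
    print(f"bigram{number}.pkl <- newlist[{start}:{stop}]")
```

Under these assumptions there are ten batches, with the last one shorter than the rest.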

It was necessary to extract the bigrams over several iterations due to memory constraints. I then combined all the separate bigram files with the 1grams file.

In [4]:
the1grams = pkl.load(open("1grams.pkl",'rb'))
bigrams1 = pkl.load(open("bigram" + str(1) + ".pkl",'rb'))
for i in range(2,11):
    bigrams1 = pd.merge(pkl.load(open("bigram" + str(i) + ".pkl",'rb')),bigrams1,how='inner',on='term')


the1grams = pd.merge(the1grams,bigrams1,how='outer',on='term')
the1grams.to_csv('total_words.csv')
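
Note that `pd.merge(..., how='inner')` keeps only the terms present in both inputs, so chaining it over the batches retains only bigrams that made the top list in every batch. If the intent is instead to keep every term from every batch, one alternative is to stack the rankings and sum the scores of repeated terms. This is a sketch of that alternative, not the method used above. The function name `combine_rankings` and the toy values are hypothetical, and scores from different batches are only roughly comparable because each batch has its own idf.

```python
import pandas as pd

def combine_rankings(frames):
    """Stack per-batch rankings and sum the 'rank' of terms that repeat."""
    stacked = pd.concat(frames, ignore_index=True)
    combined = stacked.groupby('term', as_index=False)['rank'].sum()
    return combined.sort_values('rank', ascending=False).reset_index(drop=True)

# Toy rankings standing in for two bigram batches.
batch_a = pd.DataFrame({'term': ['new york', 'love stori'], 'rank': [3.2, 2.9]})
batch_b = pd.DataFrame({'term': ['new york', 'special effect'], 'rank': [1.0, 2.8]})
combined = combine_rankings([batch_a, batch_b])
print(combined)
```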

After saving the terms to the file total_words.csv, I removed irrelevant terms by hand, keeping only those that are relevant descriptors of a movie and would be valuable features. For example, I removed "good movie" because it is nonspecific and offers no information, but kept "action scene" because an action scene is a feature that some viewers find more desirable than others. The final list of terms is saved as total_words_reduced.csv.

I then construct a new dataframe, and store the count of each relevant word/bigram feature in the reviews for each movie.

In [None]:
rows = []
# Sets give constant-time membership tests.
words_used = set(pd.read_csv('total_words_reduced.csv')['term'].to_list())
stop_words = set(nltk.corpus.stopwords.words('english'))

for key in data.keys():
    words = nltk.word_tokenize(data[key])

    # Same cleaning as before: alphabetic tokens, lower-cased, no stopwords, stemmed.
    words_no_p = [w.lower() for w in words if w.isalpha()]
    clean_words = [porter.stem(w) for w in words_no_p if w not in stop_words]

    # Adjacent pairs of stemmed words form the bigrams.
    bigrams = [a + " " + b for a, b in zip(clean_words, clean_words[1:])]
    c = Counter(clean_words)
    bc = Counter(bigrams)
    for k,v in c.items():
        if k in words_used:
            rows.append({"movieId":key,"Word":k,"Count":v})

    for k,v in bc.items():
        if k in words_used:
            rows.append({"movieId":key,"Word":k,"Count":v})

# DataFrame.append was removed in pandas 2.0; build the frame once from the rows.
df_new = pd.DataFrame(rows, columns=['movieId','Word','Count'])
df_new.to_csv('all_words_final.csv')
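
The counting step can be sketched on a toy token list as follows. The tokens and vocabulary are made up for illustration and stand in for one movie's stemmed words and the terms in total_words_reduced.csv. The function name `count_features` is hypothetical.

```python
from collections import Counter

def count_features(tokens, vocabulary):
    """Count unigrams and adjacent-pair bigrams, keeping only vocabulary terms."""
    bigrams = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    counts = Counter(tokens) + Counter(bigrams)
    return {term: n for term, n in counts.items() if term in vocabulary}

tokens = ['great', 'action', 'scene', 'slow', 'action', 'scene']
vocab = {'action', 'action scene', 'slow'}
features = count_features(tokens, vocab)
print(features)
```

Terms outside the vocabulary, such as great or scene on its own, are dropped, so each movie contributes rows only for the hand-picked features.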


Having acquired the word features, I turn my attention back to the genre information. To make the clusters disjoint, I need a tie-breaker that decides which cluster a movie belongs to when its genres span multiple clusters. The importance dict below fills this need: a genre with a higher value takes precedence.

In [6]:
importance_dict = {'Documentary':17,'Family':16,'Horror':15,"Comedy":14,"Animation":13,"Thriller":12,"Sci-Fi":11,"Action":10,"Drama":9,"Romance":8,"Fantasy":7,"Musical":6,"Mystery":5,"Adventure":4,"War":3,"Western":2,"Crime":1,"Film-Noir":0,"Biography":-1,"History":-2}
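
The basic idea of the tie-breaker can be sketched as picking the genre with the highest importance value. This shows only the core comparison on a subset of the values above. It is not the full logic of the functions that follow, which also handle paired genre columns such as Drama_Horror. The function name `most_important` is hypothetical.

```python
# Subset of the importance values listed above.
importance = {'Horror': 15, 'Comedy': 14, 'Thriller': 12, 'Drama': 9, 'Crime': 1}

def most_important(genres, importance=importance):
    """Return the genre with the highest importance value."""
    return max(genres, key=importance.__getitem__)

winner = most_important(['Drama', 'Horror', 'Crime'])
print(winner)
```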

The following functions are used to split all the movies into the appropriate clusters, determined in the previous notebook, and the appropriate superclusters (larger clusters that several clusters can belong to).

In [20]:
def get_cluster(x):
    clusters_dict = pkl.load(open('movie_clusters.pkl','rb'))[0]
    #print(clusters_dict)
    genres = x[11:82]
    #print(genres)
    genres_used = genres[genres == 1]
    genres_dropped = set()
    for genre in genres_used.index:
        genre_split = genre.split('_')
        for genre2 in genres_used.index:
            if genre == genre2:
                continue
            genre2_split = genre2.split('_')
            if (len(genre_split) == 2) and (len(genre2_split) == 2):
                for i in range(len(genre_split)):
                    for j in range(len(genre2_split)):
                        if genre_split[i] == genre2_split[j]:
                            if importance_dict[genre_split[1-i]] > importance_dict[genre2_split[1-j]]:
                                genres_dropped.add(genre2)
                            else:
                                genres_dropped.add(genre)
    #print(genres_dropped)
    for genre in genres_dropped:
        genres_used = genres_used.drop(genre)  
    #print(genres_used) 
    #print(genres)
    if len(genres_used) == 0:
        return -1
    determining_genre = genres_used.index[0]
    split = []
    for genre in genres_used.index:
        if '_' in genre:
            determining_genre = genre
            split = genre.split('_')
    
    for genre in genres_used.index:
        if '_' not in genre:
            if len(split) != 0:
                if (importance_dict[genre] > importance_dict[split[0]]) and (importance_dict[genre] > importance_dict[split[1]]):
                    determining_genre = genre
            else:
                for genre2 in genres_used.index:
                    if genre == genre2:
                        continue
                    if importance_dict[genre] > importance_dict[genre2]:
                        determining_genre = genre
                    else:
                        determining_genre = genre2
                        
    return(clusters_dict[determining_genre])

In [21]:
def get_supercluster(x):
    # Read the pickle once; element 0 maps genres to clusters, element 1 to superclusters.
    with open('movie_clusters.pkl','rb') as f:
        loaded = pkl.load(f)
    clusters_dict, superclusters_dict = loaded[0], loaded[1]
    super_clusters = {}
    for key in clusters_dict.keys():
        super_clusters[clusters_dict[key]] = superclusters_dict[key]
    return(super_clusters[x['cluster']])
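
Since the cluster-to-supercluster lookup is the same for every row, it could also be built once and applied with `Series.map`. This is a sketch of that alternative under toy data. The dictionary contents below are hypothetical stand-ins for the two dictionaries stored in movie_clusters.pkl.

```python
import pandas as pd

# Hypothetical stand-ins for the two dictionaries stored in movie_clusters.pkl.
clusters_dict = {'Horror': 8, 'Drama_Crime': 17, 'Comedy': 19}
superclusters_dict = {'Horror': 2, 'Drama_Crime': 0, 'Comedy': 1}

# cluster id -> supercluster id, built once instead of once per row.
cluster_to_super = {clusters_dict[g]: superclusters_dict[g] for g in clusters_dict}

toy = pd.DataFrame({'cluster': [17, 8, 19, 17]})
toy['supercluster'] = toy['cluster'].map(cluster_to_super)
print(toy)
```

One difference worth noting: `map` yields NaN for a cluster id that is missing from the lookup, whereas the dictionary indexing in `get_supercluster` raises a `KeyError`.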

In [15]:
df = pd.read_csv('movies_everything.csv').iloc[:,1:]
print(df.columns)

Index(['imdbId', 'title', 'age_rating', 'time_minutes', 'genres', 'imdb_score',
       'imdb_votes', 'director', 'actors', 'language', 'movieId', 'Drama',
       'Fantasy', 'War', 'Sci-Fi', 'Biography', 'Thriller', 'Action', 'Family',
       'History', 'Animation', 'Mystery', 'Musical', 'Documentary', 'Crime',
       'Romance', 'Comedy', 'Adventure', 'Film-Noir', 'Horror', 'Western',
       'Drama_Fantasy', 'Drama_War', 'Drama_Sci-Fi', 'Drama_Biography',
       'Drama_Thriller', 'Drama_Action', 'Drama_Family', 'Drama_History',
       'Drama_Mystery', 'Drama_Crime', 'Drama_Romance', 'Drama_Comedy',
       'Drama_Adventure', 'Drama_Horror', 'Fantasy_Sci-Fi', 'Fantasy_Action',
       'Fantasy_Family', 'Fantasy_Animation', 'Fantasy_Romance',
       'Fantasy_Comedy', 'Fantasy_Adventure', 'Fantasy_Horror',
       'Sci-Fi_Thriller', 'Sci-Fi_Action', 'Sci-Fi_Comedy', 'Sci-Fi_Adventure',
       'Sci-Fi_Horror', 'Thriller_Action', 'Thriller_Mystery',
       'Thriller_Crime', 'Thriller_Romance', 

Here, I call the functions to cluster the data, and combine the genre clusters with the decade information from earlier, to determine the final clusters.

In [22]:
df['cluster'] = df.apply(lambda x: get_cluster(x),axis=1)
df['supercluster'] = df.apply(lambda x: get_supercluster(x),axis=1)
# Build each combined label element-wise from the genre and decade cluster columns.
df['combined_cluster'] = df['cluster'].astype(str) + '_' + df['decade_cluster'].astype(str)
df['combined_supercluster'] = df['cluster'].astype(str) + '_' + df['decade_supercluster'].astype(str)
df['combined_megacluster'] = df['supercluster'].astype(str) + '_' + df['decade_supercluster'].astype(str)
print(df['cluster'])

df.to_csv('movies_everything_new.csv')

0       17
1       17
2        8
3        8
4       17
        ..
9816    19
9817    22
9818    19
9819    23
9820     8
Name: cluster, Length: 9821, dtype: int64


In the next section, I will use the model I have developed to recommend movies based on a single movie input, and test the functionality of the model.