### KMeans Clustering
The goal of this notebook is to:
* Create a KMeans clustering model.
* Clean the corpus before clustering. After the first run of the KMeans model, I noticed some words that I wanted to filter out (for example, "production", "film", and "films"), along with any word of 3 characters or fewer.

References:
* <https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f>
* <https://towardsdatascience.com/machine-learning-algorithms-part-9-k-means-example-in-python-f2ad05ed5203>
* <https://towardsdatascience.com/cluster-then-predict-for-classification-tasks-142fdfdc87d6>

In [1]:
#Import Necessary Libraries
import pandas as pd
import numpy as np
import gensim
import nltk
nltk.download('punkt')
from gensim.models import Word2Vec 
import string
from sklearn.metrics import jaccard_score


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
#Load dataframe
content_df = pd.read_pickle("content_df.pkl")
#Sample 80% of content_df as the training set
train = content_df.sample(frac = 0.80, random_state = 8)
train.shape

(49938, 15)

In [3]:
#function to drop short words from a string (implemented after the first run of kmeans)
def modify_charac_length(text, len_num):
  #keep only the words that are longer than len_num characters
  words = text.split()
  result = [x for x in words if len(x) > len_num]
  return ' '.join(result)


#words to filter out of the corpus (chosen after the first run of kmeans)
givenwords_list = ['production', 'genres', 'film', 'films', 'best', 'pictures', 'drama', 'movie', 'corporation', 'listed', 'overview']

def remove_givenwords(text):
    tokens = nltk.word_tokenize(text)
    filtered_tokens = [x for x in tokens if x not in givenwords_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [4]:
#remove givenwords and words with 3 characters or less from corpus
train['corpus_edited'] = train['corpus'].apply(lambda x: modify_charac_length(x, 3))
train['corpus_edited'] = train['corpus_edited'].apply(lambda x: remove_givenwords(x))
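
To make the effect of the two filters concrete, here is a minimal sketch on a single toy string. It assumes plain `str.split()` in place of `nltk.word_tokenize`, and `STOP_TERMS`, `drop_short_words` and `drop_terms` are hypothetical stand-ins for `givenwords_list`, `modify_charac_length` and `remove_givenwords`, not the notebook's own code.

```python
# Hypothetical subset of givenwords_list, used only for this illustration
STOP_TERMS = {"production", "film", "films", "movie"}

def drop_short_words(text, min_len):
    """Keep only the words that are longer than min_len characters."""
    return " ".join(w for w in text.split() if len(w) > min_len)

def drop_terms(text, terms):
    """Remove every word that appears in terms (whitespace tokenization)."""
    return " ".join(w for w in text.split() if w not in terms)

sample = "groom ed burn four film production wrestle"
step1 = drop_short_words(sample, 3)   # "ed" is dropped, 4-letter words survive
step2 = drop_terms(step1, STOP_TERMS)

print(step1)  # groom burn four film production wrestle
print(step2)  # groom burn four wrestle
```

Note that the length filter keeps 4-letter words such as "film", which is why the explicit word list is still needed as a second pass.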

In [5]:
#First transform the corpus into a TF-IDF matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = train['corpus_edited']

vectorizer = TfidfVectorizer(min_df= 1, ngram_range=(1, 1), stop_words='english')

feature_matrix = vectorizer.fit_transform(corpus).astype(float)
feature_matrix.shape

#get feature names
feature_names = vectorizer.get_feature_names_out()

In [6]:
#KMeans model
from sklearn.cluster import KMeans

def k_means(feature_matrix, num_clusters= 20):
    km = KMeans(n_clusters=num_clusters,
                max_iter=10000, random_state = 42)
    km.fit(feature_matrix)
    clusters = km.labels_
    return km, clusters

#implement kmeans model
num_clusters = 20   
km_obj, clusters = k_means(feature_matrix=feature_matrix, num_clusters=num_clusters)
train['Cluster'] = clusters
train.head()

Unnamed: 0,movieId,imdb_id,tmdbId,overview,production_companies,tagline,title,genres,year,user_tag_list,tagline_clean,overview_clean,user_tag_list_clean,corpus_tokens,corpus,corpus_edited,Cluster
39026,47047.0,455958,14963.0,A groom (Ed Burns) and his four attendants wre...,[],,Groomsmen The,"[comedy, drama, romance]",[2006],"[independent film, Jinni Top Pick]",[],"[groom, ed, burn, four, attendant, wrestle, is...","[independent film, jinni top pick]","[groom, ed, burn, four, attendant, wrestle, is...",groom ed burn four attendant wrestle issue rel...,groom burn four attendant wrestle issue relate...,1
56671,150682.0,2458412,257627.0,,[],,NSFW Not Safe for Work,"[comedy, drama, romance]",[2014],,[],[],[],"[comedy, drama, romance]",comedy drama romance,comedy romance,15
32975,132198.0,130671,90509.0,"It's hard, in the year 2001, to remember what ...",[Annazan],,Exhausted John C Holmes the Real Story,[documentary],[1981],,[],"[hard, year, 2001, remember, porn, like, 70s, ...",[],"[hard, year, 2001, remember, porn, like, 70s, ...",hard year remember porn like s decade pornsta...,hard year remember porn like decade pornstars ...,0
44825,80230.0,1083853,44255.0,,[],,Off and Running,[documentary],[2009],[woman director],[],[],[woman director],"[documentary, woman director]",documentary woman director,documentary woman director,6
63610,174663.0,5351458,448847.0,,[],,The Hatton Garden Job,[crime],[2017],"[crime, heist]",[],[],"[crime, heist]","[crime, crime, heist]",crime crime heist,crime crime heist,5


In [7]:
#see the number of movies in each cluster
from collections import Counter
c = Counter(clusters)
c

Counter({0: 21154,
         1: 6429,
         2: 6101,
         3: 638,
         4: 1741,
         5: 605,
         6: 805,
         7: 893,
         8: 543,
         9: 2476,
         10: 585,
         11: 458,
         12: 516,
         13: 986,
         14: 1440,
         15: 905,
         16: 680,
         17: 1595,
         18: 536,
         19: 852})
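
Cluster 0 holds 21154 of the 49938 training movies, so it may be worth checking whether 20 is a good choice of `num_clusters`. The references above describe the elbow method. A minimal sketch of that idea, together with the silhouette score, is shown here on synthetic data from `make_blobs` rather than on the movie corpus. The range of `k` values and all variable names are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with 4 blobs; stands in for feature_matrix in this sketch
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_: within-cluster sum of squared distances (lower is tighter)
    # silhouette_score: between -1 and 1 (higher means better separated clusters)
    scores[k] = (model.inertia_, silhouette_score(X, model.labels_))

for k, (inertia, sil) in scores.items():
    print(k, round(inertia, 1), round(sil, 3))
```

Inertia keeps falling as `k` grows, so the usual heuristic is to look for the "elbow" where the drop flattens, or to prefer the `k` with the highest silhouette score. On the full TF-IDF matrix, computing the silhouette on a sample of rows would keep the run time manageable.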

In [8]:
#Function to organize the data for presentation
def get_cluster_data(clustering_obj, movie_data, feature_names, num_clusters, topn_features=10):

    cluster_details = {}  
    # for each cluster, sort the feature indices by descending centroid weight
    ordered_centroids = clustering_obj.cluster_centers_.argsort()[:, ::-1]
    # for each cluster, collect its key features and the movies assigned to it
    for cluster_num in range(num_clusters):
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster_num'] = cluster_num
        key_features = [feature_names[index] 
                        for index 
                        in ordered_centroids[cluster_num, :topn_features]]
        cluster_details[cluster_num]['key_features'] = key_features
        
        movies = movie_data[movie_data['Cluster'] == cluster_num]['title'].values.tolist()
        cluster_details[cluster_num]['movies'] = movies
    
    return cluster_details

In [9]:
#Applying function to data
cluster_data = get_cluster_data(clustering_obj = km_obj, movie_data= train,
                                feature_names = feature_names, num_clusters = num_clusters,
                                topn_features = 5)
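
The key step inside `get_cluster_data` is the `argsort()[:, ::-1]` call, which ranks the features of each centroid from heaviest to lightest. A small sketch with made-up centroid weights (the `toy_*` arrays are hypothetical, not values from the fitted model) shows how the top features fall out:

```python
import numpy as np

# Hypothetical vocabulary and centroid weights: 2 clusters x 4 features
toy_features = np.array(["crime", "heist", "love", "romance"])
toy_centroids = np.array([[0.7, 0.5, 0.0, 0.1],
                          [0.0, 0.1, 0.6, 0.8]])

# argsort sorts ascending; reversing each row gives descending weight order
order = toy_centroids.argsort()[:, ::-1]

# Take the 2 heaviest features of every cluster
top2 = [toy_features[order[c, :2]].tolist() for c in range(len(toy_centroids))]
print(top2)  # [['crime', 'heist'], ['romance', 'love']]
```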

In [10]:
#Printing results
for x in range(num_clusters):

  print('cluster number:', cluster_data[x]['cluster_num'])
  print('cluster_features:', cluster_data[x]['key_features'])
  print('cluster_movies:', cluster_data[x]['movies'])
  print(' ')

cluster number: 0
cluster_features: ['documentary', 'comedy', 'action', 'world', 'life']
 
cluster number: 1
cluster_features: ['romance', 'love', 'woman', 'relationship', 'family']
cluster_movies: ['Groomsmen The ', 'Gainsbourg Vie Héroïque ', 'Before Sunset ', 'Chizuko s Younger Sister Futari ', 'Sophiiiie ', 'Love Me or Leave Me ', 'The Amazons ', 'Free Enterprise ', 'The Quiet Love ', 'Only the Young ', 'State Fair ', 'Midnight Masquerade ', 'Savages The ', 'Och Karol ', 'Nobody s Fool ', 'Tomorrow ', 'All Around Us ', 'China Girl ', 'Joe Versus the Volcano ', 'My Dog Rusty ', 'Better Tomorrow III Love and Death in Saigon A ', 'Un homme à la hauteur ', 'Touch of Pink ', 'Dream Bi mong ', 'Henry Poole is Here ', 'Boy Shônen ', 'Common Law The ', 'Majority of One A ', 'Kid Stays in the Picture The ', 'Red Firecracker Green Firecracker Pao Da Shuang Deng ', 'Making Mr Right ', 'Gross Misconduct ', 'She s Gotta Have It ', 'A Star Is Born ', 'Brothers The Return ', 'Disappearing Acts ',

In [13]:
#Using the model to predict the clusters

#create a test dataframe from the rows that are not in train
#(.copy() avoids the pandas SettingWithCopyWarning when columns are added below)
test = content_df[~content_df.movieId.isin(train.movieId)].copy()

test['corpus_edited'] = test['corpus'].apply(lambda x: modify_charac_length(x, 3))
test['corpus_edited'] = test['corpus_edited'].apply(lambda x: remove_givenwords(x))

#Kmeans clustering model (same settings and training matrix as km_obj above)
km = KMeans(n_clusters=num_clusters, max_iter=10000, random_state = 42)
km.fit(feature_matrix)


KMeans(max_iter=10000, n_clusters=20, random_state=42)

In [16]:
#Create test corpus
test_corpus = test['corpus_edited']
#Transform into a TF-IDF matrix with the vectorizer fitted on train
test_corpus_matrix= vectorizer.transform(test_corpus)
#Use model to predict cluster
prediction = km.predict(test_corpus_matrix)
prediction.shape

#Add cluster to dataframe
test['Cluster'] = prediction
test.head()



Unnamed: 0,movieId,imdb_id,tmdbId,overview,production_companies,tagline,title,genres,year,user_tag_list,tagline_clean,overview_clean,user_tag_list_clean,corpus_tokens,corpus,corpus_edited,Cluster
5,6.0,113277,949.0,"Obsessive master thief, Neil McCauley leads a ...","[Regency Enterprises, Forward Pass, Warner Bros.]",A Los Angeles Crime Saga,Heat,"[action, crime, thriller]",[1995],"[imdb top 250, great acting, realistic action,...","[los, angeles, crime, saga]","[obsessive, master, thief, neil, mccauley, lea...","[imdb top 250, great act, realistic action, su...","[obsessive, master, thief, neil, mccauley, lea...",obsessive master thief neil mccauley lead top ...,obsessive master thief neil mccauley lead notc...,2
8,9.0,114576,9091.0,International action superstar Jean Claude Van...,"[Universal Pictures, Imperial Entertainment, S...",Terror goes into overtime.,Sudden Death,[action],[1995],"[explosive, hostage, terrorist, vice president...","[terror, go, overtime]","[international, action, superstar, jean, claud...","[explosive, hostage, terrorist, vice president...","[international, action, superstar, jean, claud...",international action superstar jean claude van...,international action superstar jean claude dam...,0
10,11.0,112346,9087.0,"Widowed U.S. president Andrew Shepherd, one of...","[Columbia Pictures, Castle Rock Entertainment]",Why can't the most powerful man in the world h...,American President The,"[comedy, drama, romance]",[1995],"[Romance, white house, new love, usa president...","[powerful, man, world, one, thing, want]","[widowed, u, president, andrew, shepherd, one,...","[romance, white house, new love, usa president...","[widowed, u, president, andrew, shepherd, one,...",widowed u president andrew shepherd one world ...,widowed president andrew shepherd world powerf...,1
12,13.0,112453,21032.0,An outcast half-wolf risks his life to prevent...,"[Universal Pictures, Amblin Entertainment, Amb...",Part Dog. Part Wolf. All Hero.,Balto,"[adventure, animation, children]",[1995],"[Ei muista, alaska, bear attack, dog, dog sled...","[part, dog, part, wolf, hero]","[outcast, half, wolf, risk, life, prevent, dea...","[ei muista, alaska, bear attack, dog, dog sled...","[outcast, half, wolf, risk, life, prevent, dea...",outcast half wolf risk life prevent deadly epi...,outcast half wolf risk life prevent deadly epi...,4
13,14.0,113987,10858.0,An all-star cast powers this epic look at Amer...,"[Hollywood Pictures, Cinergi Pictures Entertai...","Triumphant in Victory, Bitter in Defeat. He Ch...",Nixon,[drama],[1995],"[biography, government, historical figure, pre...","[triumphant, victory, bitter, defeat, change, ...","[star, cast, power, epic, look, american, pres...","[biography, government, historical figure, pre...","[star, cast, power, epic, look, american, pres...",star cast power epic look american president r...,star cast power epic look american president r...,0
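
The prediction step relies on reusing the vectorizer fitted on the training corpus: `transform` (not `fit_transform`) maps new text onto the same columns the model was trained on. The sketch below shows this on toy documents. The documents and names are invented for illustration, and `n_init=10` is set explicitly as an assumption rather than taken from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training and held-out documents
train_docs = ["crime heist thriller", "crime heist robbery",
              "romance love wedding", "romance love family"]
new_docs = ["heist crime gang", "love wedding comedy"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)       # learns vocabulary and IDF weights
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)

X_new = vec.transform(new_docs)               # same columns; unseen words are ignored
pred = model.predict(X_new)                   # nearest centroid for each new document

print(model.labels_)
print(pred)
```

Words that never appeared in the training corpus ("gang", "comedy" here) have no column, so they contribute nothing to the prediction. Calling `fit_transform` on the new documents instead would build a different vocabulary whose columns no longer line up with the centroids.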


In [17]:
#combine the two data_frames
frames = [train, test]
content_KMeans_clusters = pd.concat(frames)

In [19]:
content_KMeans_clusters['Cluster'] = np.array(content_KMeans_clusters['Cluster'])

In [20]:
#Write to pkl file
content_KMeans_clusters.to_pickle("content_KMeans_clusters.pkl")