## MovieLens clustering

In this notebook, we look for user clusters in the MovieLens data using _k_-means clustering.

In [2]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans #The k-means algorithm

First, let's start by creating a user-item matrix, as explained in the other Notebook.

In [17]:
movie_file = pd.read_csv('movies.csv')
ratings_file = pd.read_csv('ratings.csv')
#Join the movie titles to the ratings via the shared movieId column
df = pd.merge(movie_file, ratings_file, on='movieId')

ratings = pd.pivot_table(df, index='userId', columns='title', values='rating')
ratings.head(3)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
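
To see what the merge and the pivot do, here is a minimal sketch on a tiny made-up dataset. The names `toy_movies`, `toy_ratings`, `toy_df` and `toy_matrix`, as well as the movie titles and ratings, are invented for illustration and are not part of the MovieLens files.

```python
import pandas as pd

# Hypothetical stand-ins for movies.csv and ratings.csv
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [1, 2, 2, 3],
    'rating':  [4.0, 3.0, 5.0, 2.0],
})

# Attach the title to every rating via the shared movieId column
toy_df = pd.merge(toy_movies, toy_ratings, on='movieId')

# One row per user, one column per movie; unrated movies become NaN
toy_matrix = pd.pivot_table(toy_df, index='userId', columns='title', values='rating')
print(toy_matrix)
```

Every user-movie pair without a rating ends up as NaN, which is why the real matrix above is almost entirely empty.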


Let's put the most popular movies at the front.

In [66]:
#This piece of code is a bit complex. Here it is, step by step (reading from the inside out):
#1. ratings.count() gets the number of non-NaN values per column/movie
#2. sort_values() sorts those counts in descending order (because ascending=False)
#3. .index gets the names of the columns/movies in that sorted order
#4. finally, reindex reorders the dataframe according to that list of names
#axis=1 tells Pandas we want to reorder the columns (not the rows)
ratings = ratings.reindex(ratings.count().sort_values(ascending=False).index, axis=1)
ratings.head(3)

title,cluster,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),...,"Last Song, The (2010)",Last Train Home (2009),"Last Waltz, The (1978)","Last Wave, The (1977)","Last Wedding, The (Kivenpyörittäjän kylä) (1995)","Last Winter, The (2006)",Last Year's Snow Was Falling (1983),Last of the Dogmen (1995),Late Marriage (Hatuna Meuheret) (2001),'71 (2014)
userId,,,,,,,,,,,,,,,,,,,,,
1,0,4.0,,3.0,4.0,5.0,5.0,4.0,4.0,,...,,,,,,,,,,
2,1,,3.0,,,,,,,,...,,,,,,,,,,
3,1,,,,,,,,,,...,,,,,,,,,,
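
The one-liner above can also be unpacked into separate steps. This is a small sketch on a made-up matrix; `toy`, `counts`, `ordered`, `new_order` and `toy_sorted` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix with a different number of ratings per movie
toy = pd.DataFrame(
    {'Movie A': [4.0, np.nan, np.nan],
     'Movie B': [3.0, 5.0, 4.0],
     'Movie C': [np.nan, 2.0, 1.0]},
    index=pd.Index([1, 2, 3], name='userId'),
)

counts = toy.count()                            # non-NaN ratings per movie
ordered = counts.sort_values(ascending=False)   # most-rated movie first
new_order = ordered.index                       # just the movie names, in that order
toy_sorted = toy.reindex(new_order, axis=1)     # reorder the columns

print(list(toy_sorted.columns))
```

Note that "popular" here means "rated by the most users", not "rated the highest".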


We will now find clusters. Unfortunately, the _k_-means algorithm cannot handle NaN values, so we will put a 0 in the empty cells. This is not ideal, mainly because a 0 is then treated as a very low rating rather than as a missing one, but it is the best we can do for now without getting really complex.

In [67]:
ratings_full = ratings.fillna(0) #fill the NaN values with 0
ratings_full.head(3)

title,cluster,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),...,"Last Song, The (2010)",Last Train Home (2009),"Last Waltz, The (1978)","Last Wave, The (1977)","Last Wedding, The (Kivenpyörittäjän kylä) (1995)","Last Winter, The (2006)",Last Year's Snow Was Falling (1983),Last of the Dogmen (1995),Late Marriage (Hatuna Meuheret) (2001),'71 (2014)
userId,,,,,,,,,,,,,,,,,,,,,
1,0,4.0,0.0,3.0,4.0,5.0,5.0,4.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
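
To make the drawback of filling with 0 concrete, here is a small sketch with two made-up users (`ann` and `bob` are invented, as are their ratings). They agree perfectly on the one movie they both rated, but after filling with 0 they look far apart, because the missing rating counts as a very low one.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: bob has not rated Movie B
toy = pd.DataFrame(
    {'Movie A': [5.0, 5.0], 'Movie B': [4.0, np.nan]},
    index=pd.Index(['ann', 'bob'], name='userId'),
)
filled = toy.fillna(0)

# Euclidean distance after filling with 0 (what k-means works with)
diff = filled.loc['ann'] - filled.loc['bob']
distance_filled = float(np.sqrt((diff ** 2).sum()))

# Distance using only the movies that both users actually rated
shared = toy.dropna(axis=1)
diff_shared = shared.loc['ann'] - shared.loc['bob']
distance_shared = float(np.sqrt((diff_shared ** 2).sum()))

print(distance_filled, distance_shared)
```

In a matrix as sparse as ours, this effect probably means the clusters partly reflect which movies a user has rated at all, not only how much they liked them.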


Finish the code below. You need to:
1. Pick a suitable number of clusters (somewhere between 4 and 10 will work).
2. Apply the _k_-means algorithm to the MovieLens user-item matrix that is in the code. Store the cluster predictions in the original `ratings` dataframe and continue working with that dataframe.
3. Print the number of users per cluster (do you remember the relevant Pandas function?).
4. Calculate the mean rating by user cluster using the Pandas `pivot_table` function. Pandas will sort the columns alphabetically after making the pivot table, so you will need to reorder your pivot table with `my_pivot.reindex(ratings.count().sort_values(ascending=False).index, axis=1)`. Replace `my_pivot` with the name of your pivot table.
5. Examine the mean ratings of the most-rated movies by user cluster. Can you describe the user clusters in plain language (e.g., ‘simple-minded action movie lover’)? This may be hard…
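
The steps above can be sketched on a tiny made-up matrix before you try them on the real data. This is only one possible approach, not the reference solution: the data, the names `toy`, `toy_full`, `movie_order` and `cluster_means`, and the choices `n_clusters=2`, `n_init=10` and `random_state=0` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: users 1-3 prefer action, users 4-6 prefer romance
toy = pd.DataFrame(
    {'Action 1':  [5, 4, 5, 1, np.nan, 1],
     'Action 2':  [4, 5, np.nan, 1, 2, 1],
     'Romance 1': [1, np.nan, 2, 5, 4, 5],
     'Romance 2': [1, 2, 1, np.nan, 5, 4]},
    index=pd.Index(range(1, 7), name='userId'),
    dtype=float,
)
toy_full = toy.fillna(0)  # k-means cannot handle NaN

# Remember the column order now, before a cluster column is added
movie_order = toy.count().sort_values(ascending=False).index

# Steps 1 and 2: pick k, fit, and store the cluster label per user
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy['cluster'] = kmeans.fit_predict(toy_full)

# Step 3: number of users per cluster
print(toy['cluster'].value_counts())

# Step 4: mean rating per cluster (NaN values are ignored by the mean)
cluster_means = pd.pivot_table(toy, index='cluster')
cluster_means = cluster_means.reindex(movie_order, axis=1)
print(cluster_means)
```

In this sketch the column order is computed before the `cluster` column is added. If you compute it afterwards, `count()` will also include `cluster`, and the reindexed pivot table gets an extra, empty `cluster` column. For the same reason, the matrix passed to `fit_predict` should not contain the cluster labels of an earlier run.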
