### Finding Similar Movies

We'll start by loading the MovieLens dataset. Using Pandas, we can quickly load the columns we care about from the movies.dat, users.dat, and ratings.dat files, and merge them so we can work with movie names instead of IDs. (In a real production job, you'd stick with IDs and resolve the names at the display layer, which is more efficient. Working with titles just makes it easier to see what's going on for now.)

In [28]:
import pandas as pd

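col_name = ['movie_id','movie_title','movie_genre']
# A multi-character separator like '::' needs the Python parsing engine;
# naming it explicitly avoids a ParserWarning about the fallback.
# If this raises a UnicodeDecodeError, try adding encoding='latin-1'.
movies = pd.read_csv('movies.dat', sep='::', names=col_name, usecols=range(3), engine='python')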

movies



,movie_id,movie_title,movie_genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [34]:
user_col = ['user_id','user_gender','user_age']
users = pd.read_csv('users.dat', sep='::', names=user_col, usecols=range(3), engine='python')
users

  


,user_id,user_gender,user_age
0,1,F,1
1,2,M,56
2,3,M,25
3,4,M,45
4,5,M,25
5,6,F,50
6,7,M,35
7,8,M,25
8,9,M,25
9,10,F,35


In [35]:
ratings_col = ['user_id','movie_id','ratings']
ratings = pd.read_csv('ratings.dat', sep='::', names=ratings_col, usecols=range(3), engine='python')
ratings

  


,user_id,movie_id,ratings
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


In [39]:
# Join on the shared movie_id column (stated explicitly rather than inferred).
ratings_by_user = pd.merge(movies, ratings, on='movie_id')
ratings_by_user.head()

,movie_id,movie_title,movie_genre,user_id,ratings
0,1,Toy Story (1995),Animation|Children's|Comedy,1,5
1,1,Toy Story (1995),Animation|Children's|Comedy,6,4
2,1,Toy Story (1995),Animation|Children's|Comedy,8,4
3,1,Toy Story (1995),Animation|Children's|Comedy,9,5
4,1,Toy Story (1995),Animation|Children's|Comedy,10,5
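
To see what the merge is doing, here is a minimal sketch on made-up data. The `movies_demo` and `ratings_demo` frames are hypothetical stand-ins, not part of MovieLens. The join happens on the shared `movie_id` column, and with the default inner join a movie that nobody rated simply drops out of the result.

```python
import pandas as pd

# Hypothetical toy data standing in for the movies and ratings frames.
movies_demo = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'movie_title': ['Movie A', 'Movie B', 'Movie C'],
})
ratings_demo = pd.DataFrame({
    'user_id': [10, 11, 10],
    'movie_id': [1, 1, 2],
    'ratings': [5, 4, 3],
})

# Inner join on movie_id: one output row per rating, with the title attached.
merged_demo = pd.merge(movies_demo, ratings_demo, on='movie_id', how='inner')
print(merged_demo)
```

Each row of the result is a single rating, which is why Toy Story repeats once per user in the real output above. Movie C has no ratings in the toy data, so it does not appear at all.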


Now the amazing pivot_table function on a DataFrame will construct a user / movie rating matrix, with one row per user and one column per movie title. Note how NaN indicates missing data, that is, movies that a given user didn't rate.

In [44]:
movieRatings = ratings_by_user.pivot_table(index = ['user_id'], columns = ['movie_title'], values = 'ratings')
movieRatings

movie_title,"$1,000,000 Duck (1971)",'Night Mother (1986),'Til There Was You (1997),"'burbs, The (1989)",...And Justice for All (1979),1-900 (1994),10 Things I Hate About You (1999),101 Dalmatians (1961),101 Dalmatians (1996),12 Angry Men (1957),...,"Young Poisoner's Handbook, The (1995)",Young Sherlock Holmes (1985),Young and Innocent (1937),Your Friends and Neighbors (1998),Zachariah (1971),"Zed & Two Noughts, A (1985)",Zero Effect (1998),Zero Kelvin (Kjærlighetens kjøtere) (1995),Zeus and Roxanne (1997),eXistenZ (1999)
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,2.0
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,4.0,,,,,,3.0,...,,,,,,,,,,
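
The reshaping is easier to follow on a tiny example. The sketch below uses a made-up long-format table (`long_demo` is hypothetical) and pivots it the same way, so we can see exactly where the NaN values come from.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) rating.
long_demo = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'movie_title': ['Movie A', 'Movie B', 'Movie A', 'Movie B'],
    'ratings': [5, 3, 4, 2],
})

# Rows become users, columns become titles, cells hold the rating.
# Any (user, movie) pair with no rating is filled with NaN.
wide_demo = long_demo.pivot_table(index='user_id',
                                  columns='movie_title',
                                  values='ratings')
print(wide_demo)
```

User 2 never rated Movie B and user 3 never rated Movie A, so those two cells come out as NaN. If a user had rated the same title more than once, pivot_table would average the duplicates, since its default aggregation is the mean.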


Let's extract the Series of user ratings for Young Sherlock Holmes (1985), which holds NaN for every user who didn't rate it:

In [49]:
sherlockRatings = movieRatings['Young Sherlock Holmes (1985)']
sherlockRatings

user_id
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN
9       NaN
10      NaN
11      NaN
12      NaN
13      2.0
14      NaN
15      NaN
16      NaN
17      NaN
18      4.0
19      3.0
20      NaN
21      NaN
22      NaN
23      NaN
24      NaN
25      NaN
26      NaN
27      NaN
28      NaN
29      NaN
30      NaN
       ... 
6011    NaN
6012    NaN
6013    NaN
6014    NaN
6015    NaN
6016    NaN
6017    NaN
6018    NaN
6019    NaN
6020    NaN
6021    NaN
6022    NaN
6023    NaN
6024    NaN
6025    NaN
6026    NaN
6027    NaN
6028    NaN
6029    NaN
6030    NaN
6031    NaN
6032    NaN
6033    NaN
6034    NaN
6035    NaN
6036    3.0
6037    NaN
6038    NaN
6039    3.0
6040    NaN
Name: Young Sherlock Holmes (1985), Length: 6040, dtype: float64

In [50]:
sherlockRatings.mean()

3.390501319261214
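
That mean only covers the users who actually rated the movie, because pandas skips NaN values by default. A small sketch with made-up ratings (the `demo_ratings` Series is hypothetical) makes this visible, and `count()` tells us how many real ratings sit behind the average.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one movie: five users, only three of whom rated it.
demo_ratings = pd.Series([np.nan, 2.0, 4.0, 3.0, np.nan],
                         index=[1, 2, 3, 4, 5],
                         name='Demo Movie')

demo_mean = demo_ratings.mean()    # NaN entries are ignored
demo_count = demo_ratings.count()  # number of non-NaN ratings
print(demo_mean, demo_count)
```

The average is taken over the three real ratings, not over all five users.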

Pandas' corrwith function makes it really easy to compute the pairwise correlation of the Young Sherlock Holmes (1985) rating vector with every other movie's! After that, we'll drop any results that have no data, and construct a new DataFrame of movies and their correlation score (similarity) to Young Sherlock Holmes (1985):

In [51]:
similarMovies = movieRatings.corrwith(sherlockRatings)
similarMovies = similarMovies.dropna()
df = pd.DataFrame(similarMovies)
df.head(10)

  c = cov(x, y, rowvar)
  c *= 1. / np.float64(fact)


movie_title,0
"$1,000,000 Duck (1971)",-0.106243
'Night Mother (1986),-0.276359
'Til There Was You (1997),-0.016137
"'burbs, The (1989)",0.278425
...And Justice for All (1979),0.052517
10 Things I Hate About You (1999),0.185754
101 Dalmatians (1961),0.013905
101 Dalmatians (1996),0.059504
12 Angry Men (1957),0.281127
"13th Warrior, The (1999)",0.231933

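To make the mechanics concrete, here is a minimal sketch of corrwith on a made-up rating matrix (`demo_matrix` and its movie names are hypothetical). For each column, the correlation is computed only over the users who rated both that movie and the reference movie. A movie with no raters in common yields NaN, which is what dropna removes.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie matrix; NaN means "user did not rate this movie".
demo_matrix = pd.DataFrame(
    {
        'Movie A': [5.0, 4.0, np.nan, 1.0, 2.0],
        'Movie B': [4.0, 5.0, np.nan, 2.0, 1.0],
        'Movie C': [np.nan, np.nan, 3.0, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name='user_id'),
)

# Correlate every column with Movie A's ratings.
demo_similar = demo_matrix.corrwith(demo_matrix['Movie A'])

# Movie C shares no raters with Movie A, so its score is NaN and is dropped.
demo_clean = demo_similar.dropna()
print(demo_clean)
```

Movie A correlates perfectly with itself, and Movie B scores high because the four users who rated both movies mostly agree. Movie C disappears from the cleaned result.
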

(The warning printed by the corrwith cell is safe to ignore.) Let's sort the results by similarity score, and we should have the movies most similar to Young Sherlock Holmes (1985)! Except... we don't. These results make no sense at all! This is why it's important to know your data: clearly we missed something important.

In [52]:
similarMovies.sort_values(ascending=False)

movie_title
Tie That Binds, The (1995)                                                      1.000000
Golden Earrings (1947)                                                          1.000000
Steam: The Turkish Bath (Hamam) (1997)                                          1.000000
Native Son (1986)                                                               1.000000
Young Sherlock Holmes (1985)                                                    1.000000
Trick or Treat (1986)                                                           1.000000
School of Flesh, The (L' École de la chair) (1998)                              1.000000
Source, The (1999)                                                              1.000000
Beyond Silence (1996)                                                           1.000000
Infinity (1996)                                                                 1.000000
Born American (1986)                                                            1.000000
My Life a