# Making Recommendations Based on Correlation

Item-based collaborative filtering uses item similarity rather than user similarity to make predictions. It resembles the content-based filtering we looked at above, with one difference: content-based filtering compares products through their metadata, whereas item-based collaborative filtering compares them through user preferences.

In [2]:
import numpy as np
import pandas as pd

In [3]:
# import data
movies = pd.read_csv('data/ml-latest-small/movies.csv')
ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
movies_ratings = movies.merge(ratings)

### Preparing Data For Correlation

We will look for movies similar to "Forrest Gump (1994)", the most popular movie from the previous notebook, 1_movies_popularity. "Similarity" is defined by how well another movie's ratings correlate with those of "Forrest Gump (1994)" in the user-item matrix. This matrix has one row per user and one column per movie. Most of its entries are NaN because each user has rated only a small fraction of the movies; such a matrix is called sparse.

In [9]:
movies_crosstab = pd.pivot_table(data=movies_ratings,values='rating', index='userId', columns='movieId')
movies_crosstab.head(10)

userId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
6,,4.0,5.0,3.0,5.0,4.0,4.0,3.0,,3.0,...,,,,,,,,,,
7,4.5,,,,,,,,,,...,,,,,,,,,,
8,,4.0,,,,,,,,2.0,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,
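To make the idea of a sparse user-item matrix concrete, here is a minimal sketch on a tiny, made-up ratings table (the `toy_ratings` data and variable names are assumptions for illustration, not MovieLens data). It builds the matrix with `pivot_table` and measures the share of missing entries.

```python
import pandas as pd

# Toy ratings in long format (hypothetical data, not from MovieLens)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 10, 20, 30],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 1.0],
})

# Pivot to a user-item matrix: users in rows, movies in columns
toy_crosstab = toy_ratings.pivot_table(values="rating", index="userId", columns="movieId")

# Fraction of user-movie pairs that have no rating
sparsity = toy_crosstab.isna().to_numpy().mean()

print(toy_crosstab)
print(f"sparsity: {sparsity:.2%}")
```

The same `isna().to_numpy().mean()` expression could be applied to `movies_crosstab` to see how sparse the real matrix is.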


Let's look at the users who have rated "Forrest Gump (1994)":

In [10]:
# "Forrest Gump (1994)"
top_popular_movieId = 356

In [12]:
Forrest_Gump_ratings = movies_crosstab[top_popular_movieId]
Forrest_Gump_ratings[Forrest_Gump_ratings > 0]  # NaN > 0 is False, so this excludes NaNs

userId
1      4.0
6      5.0
7      5.0
8      3.0
10     3.5
      ... 
605    3.0
606    4.0
608    3.0
609    4.0
610    3.0
Name: 356, Length: 329, dtype: float64

## Evaluating Similarity Based on Correlation

Now we will look at how well other movies correlate with "Forrest Gump (1994)". A strong positive correlation between two movies indicates that users who liked one movie also liked the other. A negative correlation means that users who liked one movie tended to dislike the other. So, to find similar movies, we will look for strong positive correlations.
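As a small illustration of positive versus negative correlation, the sketch below applies `corrwith` to a made-up matrix of five users and three movies (the data and the column names `A`, `B`, `C` are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical ratings: rows are users, columns are movies
toy = pd.DataFrame(
    {
        "A": [5.0, 4.0, 3.0, 2.0, 1.0],  # reference movie
        "B": [4.5, 4.0, 3.0, 2.5, 1.0],  # rated much like A
        "C": [1.0, 2.0, 3.0, 4.0, 5.0],  # rated opposite to A
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

# Pearson correlation of every column with the reference movie
toy_corr = toy.corrwith(toy["A"])
print(toy_corr)
```

Here `B` gets a correlation close to 1 and `C` a correlation of -1, so only `B` would be treated as similar to `A`.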

In [None]:
# we get warnings because the Pearson correlation coefficient is computed on data with NaNs, but the results are still OK
similar_to_Forrest = movies_crosstab.corrwith(Forrest_Gump_ratings)
similar_to_Forrest

Many movies get a NaN because no user rated both that movie and "Forrest Gump (1994)". Others do get a correlation score. Let's drop the NaNs and look at the valid results:

In [16]:
corr_Forrest = pd.DataFrame(similar_to_Forrest, columns=['PearsonR'])
corr_Forrest.dropna(inplace=True)
corr_Forrest.head(10)

movieId,PearsonR
1,0.303465
2,0.367247
3,0.534682
4,0.388514
5,0.349541
6,0.137421
7,0.106567
8,0.65602
9,0.0
10,0.217441
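To see where the dropped NaNs come from, one option is to count, for each movie, how many users rated both it and the reference movie. The sketch below does this on made-up data (the frame `toy` and its columns `F`, `X`, `Y` are hypothetical): a movie with no shared raters gets a NaN correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix; "F" plays the role of the reference movie
toy = pd.DataFrame(
    {
        "F": [4.0, 5.0, np.nan, 3.0],
        "X": [np.nan, np.nan, 2.0, np.nan],  # no user rated both X and F
        "Y": [3.5, 5.0, 1.0, 2.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

# Keep only the users who rated F, then count non-missing ratings per movie
overlap = toy.loc[toy["F"].notna()].count()

# Pearson correlation with F, computed on shared raters only
corr = toy.corrwith(toy["F"])

print(pd.DataFrame({"shared_raters": overlap, "PearsonR": corr}))
```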


In [18]:
# mean rating and number of ratings per movie
rating = pd.DataFrame(movies_ratings.groupby('movieId')['rating'].mean())
rating['rating_count'] = movies_ratings.groupby('movieId')['rating'].count()

In [21]:
Forrest_corr_summary = corr_Forrest.join(rating['rating_count'])
Forrest_corr_summary.drop(top_popular_movieId, inplace=True)  # drop Forrest Gump itself
Forrest_corr_summary

movieId,PearsonR,rating_count
1,0.303465,215
2,0.367247,110
3,0.534682,52
4,0.388514,7
5,0.349541,49
...,...,...
185585,-1.000000,2
187541,1.000000,4
187593,-0.203519,12
187595,0.870388,5
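The values of -1.000000 and 1.000000 in the table above belong to movies with very few ratings. A quick sketch with made-up numbers shows why such scores say little: when only two users rated both movies, and their ratings differ, the Pearson correlation is always exactly +1 or -1.

```python
import pandas as pd

# Two shared raters only (hypothetical ratings)
reference = pd.Series([3.0, 4.0])
falling = pd.Series([5.0, 2.0])   # goes down while the reference goes up
rising = pd.Series([1.0, 4.5])    # goes up together with the reference

r_neg = reference.corr(falling)
r_pos = reference.corr(rising)
print(r_neg, r_pos)
```

Any two distinct points lie on a straight line, so the sign of the slope is the only information left in the coefficient.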


Let's filter out movies with a rating count below 100, since correlations based on only a few ratings are unreliable.

Then, take the top 10 movies in terms of similarity to Forrest:

In [32]:
top10 = Forrest_corr_summary[Forrest_corr_summary['rating_count']>=100].sort_values('PearsonR', ascending=False).head(10)
top10

movieId,PearsonR,rating_count
1704,0.484042,141
588,0.464268,183
2329,0.457287,129
1682,0.432556,125
110,0.416976,237
2918,0.40583,109
500,0.401408,144
1222,0.397241,102
2028,0.390074,188
6377,0.385565,141


In [27]:
data = movies[['movieId', 'title']]

In [31]:
top10 = top10.merge(data,left_index=True, right_on="movieId")
top10

,PearsonR,rating_count,movieId,title
1284,0.484042,141,1704,Good Will Hunting (1997)
506,0.464268,183,588,Aladdin (1992)
1734,0.457287,129,2329,American History X (1998)
1267,0.432556,125,1682,"Truman Show, The (1998)"
97,0.416976,237,110,Braveheart (1995)
2195,0.40583,109,2918,Ferris Bueller's Day Off (1986)
436,0.401408,144,500,Mrs. Doubtfire (1993)
923,0.397241,102,1222,Full Metal Jacket (1987)
1503,0.390074,188,2028,Saving Private Ryan (1998)
4360,0.385565,141,6377,Finding Nemo (2003)


### Function

We can create a function that takes a movie id and an integer n (how many movies we wish to display) as input, and outputs the titles of the n movies most similar to the input movie.

You can assume that the user-item matrix (movies_crosstab) is already created.

In [40]:
def top_n_movie(movie_id, n):
    # ratings of the input movie across all users
    movie_ratings = movies_crosstab[movie_id]
    # correlate every movie's ratings with the input movie's ratings
    similar_to_movie = movies_crosstab.corrwith(movie_ratings)
    corr_movie = pd.DataFrame(similar_to_movie, columns=['PearsonR'])
    corr_movie.dropna(inplace=True)
    movie_corr_summary = corr_movie.join(rating['rating_count'])
    movie_corr_summary.drop(movie_id, inplace=True)  # drop the input movie itself
    # keep movies with at least 10 ratings (a lower threshold than the 100 used above)
    top_n = movie_corr_summary[movie_corr_summary['rating_count'] >= 10].sort_values('PearsonR', ascending=False).head(n)
    top_n = top_n.merge(data, left_index=True, right_on="movieId")
    return list(top_n["title"])
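The function above relies on the notebook's global variables. As a hedged alternative, the sketch below passes everything in explicitly so it can be tried on toy data; the name `top_n_similar`, its parameters, and the toy ratings are all assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def top_n_similar(crosstab, titles, movie_id, n, min_ratings=1):
    """Hypothetical variant: titles of the n movies most correlated with movie_id."""
    # Pearson correlation of every movie with the input movie, NaNs removed
    corr = crosstab.corrwith(crosstab[movie_id]).dropna().rename("PearsonR")
    # number of non-missing ratings per movie
    counts = crosstab.count().rename("rating_count")
    summary = pd.concat([corr, counts], axis=1, join="inner")
    summary = summary.drop(index=movie_id)  # drop the input movie itself
    summary = summary[summary["rating_count"] >= min_ratings]
    top = summary.sort_values("PearsonR", ascending=False).head(n)
    return [titles[m] for m in top.index]

# Toy user-item matrix (made-up ratings): columns are movie ids
toy_crosstab = pd.DataFrame(
    {1: [5.0, 4.0, 2.0, 1.0], 2: [4.5, 4.0, 2.5, 1.0], 3: [1.0, 2.0, 4.0, 5.0]},
    index=pd.Index([1, 2, 3, 4], name="userId"),
)
toy_titles = {1: "Movie A", 2: "Movie B", 3: "Movie C"}

result = top_n_similar(toy_crosstab, toy_titles, movie_id=1, n=2)
print(result)
```

Making the rating-count threshold a parameter (`min_ratings` here) is one way to keep the cut-off visible to the caller instead of hard-coding it.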

In [41]:
import warnings
warnings.filterwarnings('ignore')

In [42]:
rating.sort_values(by="rating_count", ascending=False).head(10)

movieId,rating,rating_count
356,4.164134,329
318,4.429022,317
296,4.197068,307
593,4.16129,279
2571,4.192446,278
260,4.231076,251
480,3.75,238
110,4.031646,237
589,3.970982,224
527,4.225,220


In [43]:
top_n_movie(356, 10)

['Unbearable Lightness of Being, The (1988)',
 'Beethoven (1992)',
 'Tales from the Crypt Presents: Demon Knight (1995)',
 "Ocean's Eleven (a.k.a. Ocean's 11) (1960)",
 'Charade (1963)',
 'Elite Squad (Tropa de Elite) (2007)',
 'Something to Talk About (1995)',
 'Mighty Morphin Power Rangers: The Movie (1995)',
 'Inside Job (2010)',
 'Sherlock: The Abominable Bride (2016)']