# Collaborative Filtering Recommender System

The goal is to recommend movies to a user based on the movies that similar users liked.

Import the necessary modules:

In [41]:
import pandas as pd
from math import sqrt

Get the datasets and see a sample of the movies dataset:

In [42]:
movies_df = pd.read_csv("../datasets/movies.csv")
ratings_df = pd.read_csv("../datasets/ratings.csv")
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


The preprocessing is almost identical to last time. The only difference is that we also drop the `genres` column from the dataset:

In [43]:
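# Pull the "(YYYY)" suffix out of the title, then keep only the digits
movies_df["year"] = movies_df["title"].str.extract(r"(\(\d\d\d\d\))", expand=False)
movies_df["year"] = movies_df["year"].str.extract(r"(\d\d\d\d)", expand=False)
# regex=True is required for the pattern to be treated as a regular expression in recent pandas
movies_df["title"] = movies_df["title"].str.replace(r"(\(\d\d\d\d\))", "", regex=True)
movies_df["title"] = movies_df["title"].str.strip()
movies_df = movies_df.drop(columns="genres")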
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995
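As a quick check of the title cleanup, the sketch below runs the same idea on a toy frame. The titles and the `toy` name are made up for illustration, and the single-pass pattern `\((\d{4})\)` is an assumed simplification of the two-step extraction above, not the notebook's exact code.

```python
import pandas as pd

# Toy titles (hypothetical); the last one has no year suffix.
toy = pd.DataFrame({"title": ["Toy Story (1995)", "Jumanji (1995)", "Untitled Project"]})

# Capture only the four digits inside the parentheses in one pass.
toy["year"] = toy["title"].str.extract(r"\((\d{4})\)", expand=False)

# Remove the "(YYYY)" suffix, then trim the leftover whitespace.
toy["title"] = toy["title"].str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

print(toy)
```

Titles without a year keep their text and get a missing value in `year`, so it may be worth checking for missing years before relying on that column.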


Now, let's look at the ratings dataset:

In [44]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


Drop the `timestamp` column, which we don't need:

In [45]:
ratings_df = ratings_df.drop(columns="timestamp")
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0


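Create a dummy user with a few rated movies. The titles must match the cleaned `title` column exactly (for example, `Breakfast Club, The`), otherwise the merge silently drops them: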

In [46]:
test_movies = pd.DataFrame([
    {'title':'Breakfast Club, The', 'rating':5},
    {'title':'Toy Story', 'rating':3.5},
    {'title':'Jumanji', 'rating':2},
    {'title':"Pulp Fiction", 'rating':5},
    {'title':'Akira', 'rating':4.5}
])
test_ids = movies_df[movies_df["title"].isin(test_movies["title"].tolist())]
test_movies = pd.merge(test_ids, test_movies).drop(columns="year")
test_movies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0
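Because an inner merge drops rows without a match, a misspelled title disappears without any error. The sketch below shows one way to surface such titles with a left merge and the `indicator` flag. The `catalog` and `wanted` frames are hypothetical stand-ins for `movies_df` and the dummy user's input.

```python
import pandas as pd

# Hypothetical stand-ins for movies_df and the dummy user's ratings.
catalog = pd.DataFrame({"movieId": [1, 2, 296],
                        "title": ["Toy Story", "Jumanji", "Pulp Fiction"]})
wanted = pd.DataFrame({"title": ["Toy Story", "Pulp Fiction", "The Breakfast Club"],
                       "rating": [3.5, 5.0, 5.0]})

# A left merge keeps every requested title; "_merge" records where each row matched.
merged = wanted.merge(catalog, on="title", how="left", indicator=True)

# Titles found only on the left side have no movieId in the catalog.
missing = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
print(missing)
```

Here "The Breakfast Club" is reported as missing because the catalog would store it as "Breakfast Club, The".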


Create a new dataset with the ratings from users who rated at least one of the dummy user's movies:

In [47]:
neighbors_df = ratings_df[ratings_df["movieId"].isin(test_movies["movieId"].tolist())]
neighbors_df.head()

Unnamed: 0,userId,movieId,rating
19,4,296,4.0
441,12,1968,3.0
479,13,2,2.0
531,13,1274,5.0
681,14,296,2.0


Group the dataset by user IDs:

In [48]:
neighborhoods_df = neighbors_df.groupby("userId")

Let's see the dataset for one user:

In [49]:
neighborhoods_df.get_group(1130)

Unnamed: 0,userId,movieId,rating
104167,1130,1,0.5
104168,1130,2,4.0
104214,1130,296,4.0
104363,1130,1274,4.5
104443,1130,1968,4.5


To get the best recommendations, we want the users who have the most movies in common with the dummy user, so we sort the groups by size in descending order:

In [50]:
neighborhoods_df = sorted(neighborhoods_df, key=lambda x: len(x[1]), reverse=True)
neighborhoods_df[:3]

[(75,
        userId  movieId  rating
  7507      75        1     5.0
  7508      75        2     3.5
  7540      75      296     5.0
  7633      75     1274     4.5
  7673      75     1968     5.0),
 (106,
        userId  movieId  rating
  9083     106        1     2.5
  9084     106        2     3.0
  9115     106      296     3.5
  9198     106     1274     3.0
  9238     106     1968     3.5),
 (686,
         userId  movieId  rating
  61336     686        1     4.0
  61337     686        2     3.0
  61377     686      296     4.0
  61478     686     1274     4.0
  61569     686     1968     5.0)]

We don't need all the users, so we keep only the 100 largest groups to save memory:

In [51]:
neighborhoods_df = neighborhoods_df[:100]

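To get recommendations, we need a measure of how similar each neighbor is to the dummy user. We use the Pearson correlation coefficient between the dummy user's ratings $x_i$ and a neighbor's ratings $y_i$ over the $n$ movies they share, which is calculated as: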

$$r = \frac{
            \sum_{i=1}^{n}{
                               (x_i - \bar{x})
                           \times
                               (y_i - \bar{y})
                           }
           }
           {\sqrt{
                  \sum_{i=1}^{n}{
                                (x_i - \bar{x})^2
                                }
            }
            \sqrt{
                  \sum_{i=1}^{n}{
                                (y_i - \bar{y})^2
                                }
            }
}$$
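The code in the next cell does not subtract the means explicitly. It relies on the equivalent sum-of-squares form, which follows from expanding the squares and using $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. The symbols $S_{xx}$, $S_{yy}$ and $S_{xy}$ mirror the variable names used in that cell:

$$
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \\
S_{yy} &= \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \\
S_{xy} &= \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n} \\
r &= \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}
\end{aligned}
$$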

We compute this correlation for each of the 100 selected users and store it in a `dict` object keyed by user ID:

In [52]:
pearson_corr = {}

# Sort once so that both rating lists are aligned by movieId
test_movies = test_movies.sort_values(by="movieId")

for name, neighbor in neighborhoods_df:
    neighbor = neighbor.sort_values(by="movieId")
    # Dummy-user ratings restricted to the movies this neighbor also rated
    temp_df = test_movies[test_movies["movieId"].isin(neighbor["movieId"].tolist())]
    temp_ratings = temp_df["rating"].tolist()
    temp_groups = neighbor["rating"].tolist()
    n = float(len(neighbor))
    # Sum-of-squares form of the Pearson formula
    S_xx = sum(i**2 for i in temp_ratings) - sum(temp_ratings)**2 / n
    S_yy = sum(i**2 for i in temp_groups) - sum(temp_groups)**2 / n
    S_xy = sum(i*j for i, j in zip(temp_ratings, temp_groups)) - sum(temp_ratings)*sum(temp_groups) / n
    # The correlation is undefined when either user gave every shared movie the same rating
    if S_xx > 0 and S_yy > 0:
        pearson_corr[name] = S_xy / sqrt(S_xx*S_yy)
    else:
        pearson_corr[name] = 0
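To sanity-check the calculation, the following sketch wraps the same arithmetic in a standalone function and applies it to the dummy user's ratings and user 75's ratings for the five shared movies, both taken from the outputs in this notebook. The `pearson` helper is a hypothetical name introduced here for illustration.

```python
from math import sqrt, isclose

def pearson(x, y):
    """Pearson correlation via sums of squares; returns 0.0 when it is undefined."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    s_yy = sum(b * b for b in y) - sum(y) ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    # A constant rating list has zero variance, so no correlation can be computed.
    if s_xx <= 0 or s_yy <= 0:
        return 0.0
    return s_xy / sqrt(s_xx * s_yy)

# Ratings for movieIds 1, 2, 296, 1274, 1968, in that order.
dummy_user = [3.5, 2.0, 5.0, 4.5, 5.0]
user_75 = [5.0, 3.5, 5.0, 4.5, 5.0]

r = pearson(dummy_user, user_75)
print(round(r, 6))
```

The result should agree with the `similarityIndex` listed for user 75 in the dataframe built from `pearson_corr`.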

Convert this dictionary into a dataframe object for further calculations:

In [53]:
pearson_df = pd.DataFrame.from_dict(pearson_corr, orient="index")
pearson_df.columns = ["similarityIndex"]
pearson_df["userId"] = pearson_df.index
pearson_df.index = range(len(pearson_df))
pearson_df.head()

Unnamed: 0,similarityIndex,userId
0,0.827278,75
1,0.586009,106
2,0.83205,686
3,0.576557,815
4,0.943456,1040


Since we want the best recommendations, sort the dataset by similarity in descending order and keep the 50 most similar users:

In [54]:
best_neighbors = pearson_df.sort_values(by="similarityIndex", ascending=False)[:50]
best_neighbors.head()

Unnamed: 0,similarityIndex,userId
64,0.961678,12325
34,0.961538,6207
55,0.961538,10707
67,0.960769,13053
4,0.943456,1040


Merge the above dataset with the ratings dataset to get every movie rated by these users:

In [55]:
best_rated = best_neighbors.merge(ratings_df, left_on="userId", right_on="userId", how="inner")
best_rated.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,0.961678,12325,1,3.5
1,0.961678,12325,2,1.5
2,0.961678,12325,3,3.0
3,0.961678,12325,5,0.5
4,0.961678,12325,6,2.5


Create a weighted rating by multiplying each movie rating by the Pearson correlation of the user who gave it:

In [56]:
best_rated["weightedRating"] = best_rated["similarityIndex"]*best_rated["rating"]
best_rated.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,0.961678,12325,1,3.5,3.365874
1,0.961678,12325,2,1.5,1.442517
2,0.961678,12325,3,3.0,2.885035
3,0.961678,12325,5,0.5,0.480839
4,0.961678,12325,6,2.5,2.404196


Now, group the dataset by movie ID and sum the similarity indices and weighted ratings for each movie:

In [57]:
best_rated = best_rated.groupby("movieId")[["similarityIndex", "weightedRating"]].sum()
best_rated.columns = ["sum_similarityIndex", "sum_weightedRating"]
best_rated.head()

movieId,sum_similarityIndex,sum_weightedRating
1,38.376281,140.800834
2,38.376281,96.656745
3,10.253981,27.254477
4,0.929294,2.787882
5,11.723262,27.151751


Create a weighted score for every movie by dividing the summed weighted rating by the summed similarity index, which gives a similarity-weighted average rating. Then sort the dataset in descending order:

In [58]:
recom_df = pd.DataFrame()
recom_df["weightedScore"] = best_rated["sum_weightedRating"] / best_rated["sum_similarityIndex"]
recom_df["movieId"] = best_rated.index
recom_df = recom_df.sort_values(by="weightedScore", ascending=False)
recom_df.head()

movieId,weightedScore,movieId
5073,5.0,5073
3329,5.0,3329
2284,5.0,2284
26801,5.0,26801
6776,5.0,6776
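The weighted score is easier to follow on a small example. The sketch below uses made-up similarities and ratings for three hypothetical movies. The `n_raters` column and the minimum of two raters are assumptions added for illustration and are not part of the notebook's method.

```python
import pandas as pd

# Made-up data: movies 10 and 20 have two raters each, movie 30 has one.
toy = pd.DataFrame({
    "movieId":         [10, 10, 20, 20, 30],
    "similarityIndex": [0.9, 0.5, 0.9, 0.5, 0.6],
    "rating":          [5.0, 3.0, 2.0, 4.0, 5.0],
})
toy["weightedRating"] = toy["similarityIndex"] * toy["rating"]

# Sum similarities and weighted ratings per movie, and count the raters.
sums = toy.groupby("movieId").agg(
    sim_sum=("similarityIndex", "sum"),
    weighted_sum=("weightedRating", "sum"),
    n_raters=("rating", "count"),
)
sums["weightedScore"] = sums["weighted_sum"] / sums["sim_sum"]

ranked = sums.sort_values(by="weightedScore", ascending=False)
# Optional filter (an assumption): require at least two raters per movie.
filtered = ranked[ranked["n_raters"] >= 2]

print(ranked[["weightedScore", "n_raters"]])
print(filtered.index.tolist())
```

A movie rated by a single neighbor scores exactly that neighbor's rating, so movie 30 ranks first here on one 5.0 rating. This is a plausible reason why little-known titles with a score of 5.0 can dominate the top of the list.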


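Finally, use the movie IDs of the 20 highest-scoring movies to look up the recommended titles. Because `isin` only filters `movies_df`, the result is listed in `movieId` order rather than by score: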

In [59]:
recommendations_df = movies_df.loc[movies_df["movieId"].isin(recom_df.head(20)["movieId"].tolist())]
recommendations_df

Unnamed: 0,movieId,title,year
97,99,Heidi Fleiss: Hollywood Madam,1995
119,121,"Boys of St. Vincent, The",1992
2200,2284,Bandit Queen,1994
3243,3329,"Year My Voice Broke, The",1987
3449,3539,"Filth and the Fury, The",2000
3669,3759,Fun and Fancy Free,1947
3679,3769,Thunderbolt and Lightfoot,1974
3685,3775,Make Mine Music,1946
3686,3776,Melody Time,1948
3759,3851,I'm the One That I Want,2000
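If the recommendations should appear in score order, one option is to merge the top-scoring rows with the movie table instead of filtering with `isin`. This is a minimal sketch on hypothetical `scores` and `catalog` frames. With the notebook's `recom_df`, where `movieId` is both the index name and a column, calling `reset_index(drop=True)` first is probably needed to avoid an ambiguity error in the merge.

```python
import pandas as pd

# Hypothetical stand-ins for recom_df and movies_df.
scores = pd.DataFrame({"movieId": [3, 1, 2], "weightedScore": [5.0, 4.5, 4.0]})
catalog = pd.DataFrame({"movieId": [1, 2, 3, 4], "title": ["A", "B", "C", "D"]})

# Take the highest-scoring rows, then attach titles; a left merge keeps the left row order.
top = scores.sort_values(by="weightedScore", ascending=False).head(2)
ranked = top.merge(catalog, on="movieId", how="left")

print(ranked)
```

The merged frame keeps the `weightedScore` column next to each title, which makes the ranking easier to inspect.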


And we're done!