## Recommend System Using Collaborative Filtering

This project aims to create a movie recommendation mechanism using Collaborative filtering. Collaborative filtering, as defined by Extstrand et al, ”is a recommendation algorithm that bases its pre- dictions and recommendations on the ratings or be- havior of other users in the system.” This approach is based on the assumption that users with simi- lar tastes will rate the same item similarly.

The dataset for this project can be downloaded [here](https://www.kaggle.com/laowingkin/netflix-movie-recommendation). 

## Table of contents
- [Data Processing]
    - [Movie data]
- [Model Training]
- [Model Prediction]
    - [get the top-N recommendations for each user]
- [Dump and Reload the Model]

In [3]:
import os
import pandas as pd
import numpy as np
import time
from collections import defaultdict
import pprint
import random

from surprise import Reader, Dataset, SVD, dump, evaluate, accuracy
from surprise.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [4]:
filename = "Data/all_ratings.csv"

 
## Data Processing
 


In [5]:
df = pd.read_csv(filename)
print("Number of data points ", len(df))
df = df.drop_duplicates()
print("Number of data points after dedup", len(df))

Number of data points  100480507
Number of data points after dedup 100480507


In [6]:
df.head()

Unnamed: 0,movie_id,user_id,rating
0,4500,2532865,4
1,4500,573364,3
2,4500,1696725,3
3,4500,1253431,3
4,4500,1265574,2


In [7]:
print("Number of Movies", len(df.movie_id.unique()))
print("Number of Users", len(df.user_id.unique()))
print("Range of Ratings", min(df.rating), " to ", max(df.rating))
print("Average Ratings", np.mean(df.rating))

Number of Movies 17770
Number of Users 480189
Range of Ratings 1  to  5
Average Ratings 3.604289964420661


In [6]:
f = ['count','mean']

df_movie_summary = df.groupby('movie_id')['rating'].agg(f)
df_movie_summary.index = df_movie_summary.index.map(int)
movie_benchmark = round(df_movie_summary['count'].quantile(0.8),0)
drop_movie_list = df_movie_summary[df_movie_summary['count'] < movie_benchmark].index

print('Movie minimum times of review: {}'.format(movie_benchmark))

df_cust_summary = df.groupby('user_id')['rating'].agg(f)
df_cust_summary.index = df_cust_summary.index.map(int)
cust_benchmark = round(df_cust_summary['count'].quantile(0.8),0)
drop_cust_list = df_cust_summary[df_cust_summary['count'] < cust_benchmark].index

print('Customer minimum times of review: {}'.format(cust_benchmark))

Movie minimum times of review: 4040.0
Customer minimum times of review: 322.0


 
### Movie data

In [8]:
movie_df = pd.read_csv('Data/movie_titles.csv', encoding = "ISO-8859-1", header = None, 
                       names = ['Movie_Id', 'Year', 'Name'])
movie_df.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Movie_Id,Year,Name
movie_id,Year,Title,C,D,E,F,G
1,2003,Dinosaur Planet,,,,,
2,2004,Isle of Man TT 2004 Review,,,,,
3,1997,Character,,,,,
4,1994,Paula Abdul's Get Up & Dance,,,,,


What are the highly ranked movies?

In [9]:
movie_avg_rating = df.groupby("movie_id")[["rating"]].mean().reset_index()
movie_avg_df = movie_df.merge(movie_avg_rating, left_on="Movie_Id", right_on="movie_id")
print("Top 10 highly ranked movies are")
movie_avg_df.sort_values("rating", ascending=False)[:10]

Top 10 highly ranked movies are


Unnamed: 0,Movie_Id,Year,Name,movie_id,rating


In [10]:
# convert the movie data into a dictionary for later use
movie_dict = {}
for ind, row in movie_df.iterrows():
    movie_dict[row["Movie_Id"]] = row["Name"]

 
## Model Training
 

Here, we use `surprise` package to train our model. You can install it in your terminal by useing:
    `pip install scikit-surprise`
    
You could find more details about this package [here](http://surpriselib.com/).

In [None]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'movie_id', 'rating']], reader)
trainset = data.build_full_trainset()

It takes about **1 hours** to train the whole data set. If you don't want to train the model again, just go to the last part (Dump and Reload the Model) to reload the model.

In [8]:
a = time.time()
svd = SVD()
svd.fit(trainset)
b = time.time()
print("Training time ", (b-a)/60, "min")

NameError: name 'trainset' is not defined

 
## Model Prediction
 

Now, we can predict ratings for individual user.

In [12]:
u = 2532865
m = 4500
print("predicted rating for user id ", u)
svd.predict(u, m, verbose=True)
# svd.test(trainset.build_testset()[:10])

predicted rating for user id  2532865
user: 2532865    item: 4500       r_ui = None   est = 3.44   {'was_impossible': False}


Prediction(uid=2532865, iid=4500, r_ui=None, est=3.442622414175244, details={'was_impossible': False})

We could also predict ratings using a dataframe as an input.

In [16]:
def create_right_format(test_df):
    """
    creat file with the right format for prediction
    """
    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(test_df[['user_id', 'movie_id', 'rating']], reader)
    test_data = data.build_full_trainset().build_testset()
    return test_data
    
def get_pred(model, test_df):
    """
    predict a pandas dataframe
    """
    test_data = create_right_format(test_df)
    predictions = model.test(test_data)
    pred = [p.est for p in predictions]
    test_df["prediction"] = pred
    return test_df

In [33]:
test_df = df.sample(100000)
test_data = create_right_format(test_df)
res = get_pred(svd, test_df)
res.head()

Unnamed: 0,movie_id,user_id,rating,prediction
1269691,4698,2320706,4,3.783523
89428453,15887,1032499,4,3.543714
28831444,9617,578853,4,3.355648
2216503,4883,2186818,3,3.212031
28698059,9598,288949,3,2.825447


 
### get the top-N recommendations for each user
 

Here is an example where we retrieve retrieve the top-10 items with highest rating prediction for each user. Since the data if huge, we use a subset to demostrate this function.

In [7]:
sample = df.sample(1000)

In [9]:
def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [10]:
# First train an SVD algorithm on the dataset.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(sample[['user_id', 'movie_id', 'rating']], reader)
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

testset = trainset.build_anti_testset()

In [11]:
predictions = algo.test(testset)
top_n = get_top_n(predictions, n=10)

In [16]:
# Print the recommended items for each user
d = top_n.keys()
keys = random.sample(list(d), 10)

# for uid, user_ratings in top_n.items():
for uid in keys:
    user_ratings = top_n[uid]
    print("Recommended movies for user ", uid)
    print([iid for (iid, _) in user_ratings])
    pprint.pprint([movie_dict[iid] for (iid, _) in user_ratings])
    print("")

Recommended movies for user  722473
[1406, 1854, 8216, 13728, 12338, 9098, 17764, 15702, 11214, 5261]
['Hook',
 'Crazy/Beautiful',
 'Dummy',
 'Gladiator',
 'Harry Potter and the Prisoner of Azkaban',
 'A Cry in the Dark',
 'Shakespeare in Love',
 'Close Encounters of the Third Kind',
 'La Bamba',
 'Looney Tunes: Reality Check!']

Recommended movies for user  1393246
[1854, 5862, 12338, 10042, 14364, 9886, 3962, 16711, 5847, 11234]
['Crazy/Beautiful',
 'Memento',
 'Harry Potter and the Prisoner of Azkaban',
 'Raiders of the Lost Ark',
 'What a Girl Wants',
 'Star Wars: Episode I: The Phantom Menace',
 'Finding Nemo (Widescreen)',
 'Sex and the City: Season 6: Part 1',
 'A Passage to India',
 'Simon Birch']

Recommended movies for user  1499529
[13763, 13636, 16452, 2452, 9690, 10042, 14667, 15755, 692, 14047]
['Jerry Maguire',
 'Fast Times at Ridgemont High',
 'Chocolat',
 'Lord of the Rings: The Fellowship of the Ring',
 'Duck Soup',
 'Raiders of the Lost Ark',
 'Field of Dreams',
 'Bi

 
## Dump and Reload the Model
 

Next, We will dump our model to a file.

In [12]:
model_file = "svd_model"
# Dump algorithm
file_name = os.path.expanduser(model_file)

### Dump algorithm

In [18]:
dump.dump(file_name, algo=svd)

### reload algorithm

In [13]:
# reload algorithm
_, svd = dump.load(file_name)
# We now ensure that the algo is still the same by checking the predictions.

In [14]:
df = pd.read_csv(filename)
test_df = df.sample(1000)

After reload the model, we could let the model do prediction.

In [17]:
res = get_pred(svd, test_df)
res.head()

Unnamed: 0,movie_id,user_id,rating,prediction
74798172,13540,2085397,1,2.672629
24437171,8751,876018,4,3.594337
27870702,9426,1960115,5,4.652851
33878529,10607,1640835,4,4.460634
39268347,11521,496374,5,4.391726
