A collaborative filtering system typically works as follows (source: https://en.wikipedia.org/wiki/Collaborative_filtering):

A user expresses his or her preferences by rating items (e.g. books, movies or CDs) of the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain.

The system matches this user's ratings against those of other users and finds the users with the most "similar" tastes.

Using these similar users, the system recommends items that they have rated highly but that this user has not yet rated (the absence of a rating is usually taken to mean the user is unfamiliar with the item).
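
The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the users, items, and ratings are made up, and cosine similarity over co-rated items is one assumed choice of similarity measure among several.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "alice": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob":   {"A": 5.0, "B": 5.0, "C": 1.0, "D": 5.0, "E": 2.0},
    "carol": {"A": 1.0, "B": 2.0, "C": 5.0, "D": 1.0, "E": 5.0},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def recommend(target, ratings, k=1, min_rating=4.0):
    """Items rated >= min_rating by the k most similar users and unrated by target."""
    scored = [(cosine_similarity(ratings[target], r), name)
              for name, r in ratings.items() if name != target]
    neighbours = [name for _, name in sorted(scored, reverse=True)[:k]]
    seen = set(ratings[target])
    recs = set()
    for name in neighbours:
        for item, score in ratings[name].items():
            if item not in seen and score >= min_rating:
                recs.add(item)
    return sorted(recs)

recommendations = recommend("alice", ratings)
print(recommendations)
```

In this toy data, "bob" is the nearest neighbour of "alice", so only the items he rated highly and she has not rated are candidates.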

(The datasets can be found here: https://grouplens.org/datasets/movielens/)

We will use the Surprise library to build and analyze recommender systems: http://surpriselib.com/

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os
import datetime as dt
import warnings  
warnings.filterwarnings('ignore')

In [2]:
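# Note: evaluate(), data.split() and algo.train() as used in this notebook belong to
# older Surprise releases; recent releases replace them with
# surprise.model_selection.cross_validate() and algo.fit().
from surprise import Reader, Dataset, SVD, evaluate

# Reader() uses the default rating_scale of (1, 5). MovieLens allows half-star
# ratings down to 0.5, so Reader(rating_scale=(0.5, 5)) may be a better fit.
reader = Reader()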

In [3]:
ratings = pd.read_csv('./ratings_small.csv')
ratings.head()

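|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|------------|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |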


In [10]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
userId       100004 non-null int64
movieId      100004 non-null int64
rating       100004 non-null float64
timestamp    100004 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


For more usage examples (SVD and other algorithms), see the Surprise getting-started guide: http://surprise.readthedocs.io/en/stable/getting_started.html

In [4]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
data.split(n_folds=5)
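
To make the idea of a 5-fold split concrete, here is a minimal sketch in plain Python. The helper `k_fold_indices` is hypothetical and is not how Surprise implements its split; it only shows that every rating lands in exactly one test fold while the remaining folds form the training set.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 100004 is the number of ratings reported by ratings.info() above.
splits = list(k_fold_indices(100004, n_folds=5))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

Each model is then trained on four folds and scored on the held-out fifth, which is what produces the per-fold RMSE and MAE values.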

In [5]:
svd = SVD()
evaluate(svd, data, measures=['RMSE', 'MAE'])

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8859
MAE:  0.6821
------------
Fold 2
RMSE: 0.8959
MAE:  0.6925
------------
Fold 3
RMSE: 0.9023
MAE:  0.6925
------------
Fold 4
RMSE: 0.9020
MAE:  0.6950
------------
Fold 5
RMSE: 0.8978
MAE:  0.6913
------------
------------
Mean RMSE: 0.8968
Mean MAE : 0.6907
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'mae': [0.68210043147319666,
                             0.69249583650094226,
                             0.69252724281467992,
                             0.69496472144804933,
                             0.69131467182476503],
                            'rmse': [0.88589623314177279,
                             0.89588434655711402,
                             0.9022602831738441,
                             0.9020066808715631,
                             0.89781968945876134]})
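
For reference, RMSE and MAE can be computed by hand as in the sketch below. The toy true and predicted ratings are made up for illustration; the per-fold scores are copied from the output above, and averaging them reproduces the reported means.

```python
from math import sqrt

def rmse(true, pred):
    """Root mean squared error between two equal-length rating lists."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))

def mae(true, pred):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

# Hypothetical toy example
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
toy_rmse = rmse(true_ratings, predicted)
toy_mae = mae(true_ratings, predicted)

# Per-fold scores taken from the evaluation output above
fold_rmse = [0.88589623314177279, 0.89588434655711402, 0.9022602831738441,
             0.9020066808715631, 0.89781968945876134]
fold_mae = [0.68210043147319666, 0.69249583650094226, 0.69252724281467992,
            0.69496472144804933, 0.69131467182476503]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
mean_mae = sum(fold_mae) / len(fold_mae)
print(round(mean_rmse, 4), round(mean_mae, 4))
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging, which is why it is the larger of the two numbers here.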

**Train on a whole trainset and specifically query for predictions**

Here we review how to get a prediction for a specific user and item. Along the way, we also review how to train on the whole dataset without cross-validation (i.e. there is no test set).

In [6]:
# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = SVD()
algo.train(trainset)
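
To give a feel for what training does, here is a small sketch of a matrix-factorization model fitted by stochastic gradient descent. Surprise documents its `SVD` as predicting a rating from a global mean, a user bias, an item bias, and a dot product of latent factors. The code below follows that form but is not Surprise's implementation: the function names, toy ratings, and hyperparameters are all assumptions chosen for illustration.

```python
import random
from math import sqrt

def train_svd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit mu + b_u + b_i + q_i . p_u by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    pu = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    qi = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            est = mu + bu[u] + bi[i] + sum(p * q for p, q in zip(pu[u], qi[i]))
            err = r - est
            # Gradient steps with L2 regularization on biases and factors.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                p, q = pu[u][f], qi[i][f]
                pu[u][f] += lr * (err * q - reg * p)
                qi[i][f] += lr * (err * p - reg * q)
    return mu, bu, bi, pu, qi

def predict(model, u, i):
    """Estimate a rating; unknown users or items fall back to the known terms."""
    mu, bu, bi, pu, qi = model
    est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
    if u in pu and i in qi:
        est += sum(p * q for p, q in zip(pu[u], qi[i]))
    return est

# Hypothetical toy data: (userId, movieId, rating)
toy = [(1, 10, 5.0), (1, 11, 3.0), (1, 12, 1.0), (2, 10, 4.0),
       (2, 11, 3.0), (3, 11, 2.0), (3, 12, 1.0), (3, 10, 4.0)]
model = train_svd(toy)
mu = model[0]
train_rmse = sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in toy) / len(toy))
baseline_rmse = sqrt(sum((r - mu) ** 2 for _, _, r in toy) / len(toy))
print(train_rmse, baseline_rmse)
```

The baseline here predicts the global mean for every pair; the fitted model should reach a lower error on the data it was trained on. Because there is no held-out set, this says nothing about how well it generalizes.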

Documentation for the prediction parameters: http://surprise.readthedocs.io/en/stable/predictions_module.html

Let’s say you’re interested in user 100 and movie 302, and you know that the true rating r_ui=4.

In [14]:
ratings[ratings['userId'] == 100]

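|       | userId | movieId | rating | timestamp |
|-------|--------|---------|--------|-----------|
| 15273 | 100 | 1 | 4.0 | 854193977 |
| 15274 | 100 | 3 | 4.0 | 854194024 |
| 15275 | 100 | 6 | 3.0 | 854194023 |
| 15276 | 100 | 7 | 3.0 | 854194024 |
| 15277 | 100 | 25 | 4.0 | 854193977 |
| 15278 | 100 | 32 | 5.0 | 854193977 |
| 15279 | 100 | 52 | 3.0 | 854194056 |
| 15280 | 100 | 62 | 3.0 | 854193977 |
| 15281 | 100 | 86 | 3.0 | 854194208 |
| 15282 | 100 | 88 | 2.0 | 854194208 |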


Movie 302 does not appear among the ratings shown above for user 100, so we don't yet know how this user would rate it. We can use the algorithm's `predict()` method to estimate that rating.

In [13]:
uid = 100 
iid = 302 

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

user: 100        item: 302        r_ui = 4.00   est = 3.39   {'was_impossible': False}


Here `r_ui=4` is the value supplied as the true rating; it is only reported alongside the result and does not influence the estimate. `est = 3.39` is the model's predicted rating for user 100 on movie 302, a movie this user has not rated (or watched) before.

If the predicted rating is high (say, >= 4), then we should recommend this movie to the user. Here the estimate of 3.39 falls below that threshold.
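
The threshold rule can be sketched as a small helper. The function name `recommend_from_estimates`, the cutoff of 4.0, and the extra movie ids with their estimates are hypothetical; only the 3.39 estimate for movie 302 comes from the output above.

```python
def recommend_from_estimates(estimates, threshold=4.0, top_n=10):
    """Return up to top_n (item, estimate) pairs with estimate >= threshold, best first."""
    picked = [(iid, est) for iid, est in estimates.items() if est >= threshold]
    picked.sort(key=lambda pair: pair[1], reverse=True)
    return picked[:top_n]

# Estimated ratings for user 100: movie 302 is from the prediction above,
# the 9001-9003 entries are made-up placeholders.
estimates = {302: 3.39, 9001: 4.6, 9002: 4.1, 9003: 2.8}
top = recommend_from_estimates(estimates)
print(top)
```

In practice the estimates would come from calling `algo.predict()` for every movie the user has not rated yet.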

### More work to do:
* Try other algorithms provided by Surprise
* Implement the algorithms from scratch
