## Surprise Library Introduction

Surprise is an easy-to-use, scikit-learn-like Python library for recommender systems. It provides the essential tools to build and experiment with various collaborative filtering methods, including support for:
1. Cross Validation
2. Grid Search
3. Built-in Datasets
4. Various Collaborative filtering methods


## Installation
Surprise can be installed in the current environment using pip or conda:


`$ pip install numpy`

`$ pip install scikit-surprise`

With conda you can use:

`$ conda install -c conda-forge scikit-surprise`

For the latest version, you can also clone the repo and build the source (you’ll first need Cython and numpy):

`$ pip install numpy cython`

`$ git clone https://github.com/NicolasHug/surprise.git`

`$ cd surprise`

`$ python setup.py install`

## Table of Contents

[1. Reading Dataset](#Reading-Dataset)

[2. Merging Movie information to ratings dataframe](#merge)

[3. Creating train and test data & setting evaluation metric](#eval)

[4. Importing Surprise & Loading Dataset](#dataload)

[5. Grid Search CV for Neighbourhood size and similarity measure](#gridsearch)

[6. Fitting Model on complete train set & checking performance on test data](#testperf)

[7. What's Next?](#whatsnext)

## 1. Reading Dataset <a class="anchor" id="Reading-Dataset"></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

In [2]:
#Reading ratings file:
ratings = pd.read_csv('ratings.csv')

#Reading Movie Info File
movie_info = pd.read_csv('movie_info.csv')

## 2.  Merging Movie information to ratings dataframe <a class="anchor" id="merge"></a>

The movie names are stored in a separate file. Let's merge that data into the ratings dataframe so that each rating carries its movie title, which will be useful later on.

In [3]:
ratings = ratings.merge(movie_info[['movie id','movie title']], how='left', left_on = 'movie_id', right_on = 'movie id')

In [4]:
ratings.head()

| | user_id | movie_id | rating | unix_timestamp | movie id | movie title |
|---|---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 | 242 | Kolya (1996) |
| 1 | 186 | 302 | 3 | 891717742 | 302 | L.A. Confidential (1997) |
| 2 | 22 | 377 | 1 | 878887116 | 377 | Heavyweights (1994) |
| 3 | 244 | 51 | 2 | 880606923 | 51 | Legends of the Fall (1994) |
| 4 | 166 | 346 | 1 | 886397596 | 346 | Jackie Brown (1997) |
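A left merge keeps every rating even when its `movie_id` has no match in the movie file, in which case the title is simply missing. A minimal sketch of how one might check for such rows, using made-up toy frames (not the real files) and pandas' `indicator=True` option:

```python
import pandas as pd

# Toy stand-ins for the ratings and movie_info frames (values are made up)
toy_ratings = pd.DataFrame({"user_id": [1, 2, 3],
                            "movie_id": [10, 20, 99],
                            "rating": [4, 3, 5]})
toy_movies = pd.DataFrame({"movie id": [10, 20],
                           "movie title": ["Movie A", "Movie B"]})

# indicator=True adds a `_merge` column telling where each row came from
merged = toy_ratings.merge(toy_movies, how="left",
                           left_on="movie_id", right_on="movie id",
                           indicator=True)

# Rows marked 'left_only' are ratings whose movie_id had no match
unmatched = merged[merged["_merge"] == "left_only"]
n_unmatched = len(unmatched)
print(n_unmatched)
```

If the count is zero on the real data, every rating received a title and the `_merge` column can be dropped.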


Let's also combine the movie id and movie title, separated by ': ', and store the result in a new column named `movie`.

In [5]:
ratings['movie'] = ratings['movie_id'].astype(str) + ': ' + ratings['movie title'].astype(str)

In [6]:
ratings.columns

Index(['user_id', 'movie_id', 'rating', 'unix_timestamp', 'movie id',
       'movie title', 'movie'],
      dtype='object')

Keep only the `movie`, `user_id` and `rating` columns in the ratings dataframe and drop all others.

In [7]:
ratings = ratings.drop(['movie id', 'movie title', 'movie_id','unix_timestamp'], axis = 1)

In [8]:
ratings = ratings[['user_id','movie','rating']]

## 3. Creating Train & Test Data & Setting Evaluation Metric <a class="anchor" id="eval"></a>
To test how well a given rating prediction method performs, we first need to define our train and test sets. We will use only the train set to build the different models, and the test set to evaluate them.

In [9]:
#Assign X as the original ratings dataframe
X = ratings.copy()

#Split into training and test datasets
X_train, X_test = train_test_split(X, test_size = 0.25, random_state=42)
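Because a random split works row by row, some users or movies in the test set may never appear in the train set, and a neighbourhood model has nothing to learn from for them. A minimal sketch of how one might list such rows, using made-up toy pairs rather than the real `X_train` and `X_test`:

```python
# Toy (user, item) pairs standing in for X_train and X_test rows
train_pairs = [(1, "A"), (1, "B"), (2, "A"), (3, "C")]
test_pairs = [(1, "C"), (2, "D"), (4, "A")]

train_users = {u for u, _ in train_pairs}
train_items = {i for _, i in train_pairs}

# Test rows whose user or item never appears in the training data
cold_rows = [(u, i) for u, i in test_pairs
             if u not in train_users or i not in train_items]
print(cold_rows)
```

Here item `"D"` and user `4` are unseen in training, so two of the three toy test rows are flagged.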

In [10]:
#Function that computes the root mean squared error (or RMSE)
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
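As a quick worked example of what this metric computes, the sketch below evaluates the RMSE by hand on three made-up ratings, using only the standard library:

```python
import math

y_true = [3, 4, 5]        # made-up actual ratings
y_pred = [2.5, 4.0, 4.0]  # made-up predicted ratings

# Squared error per rating: 0.25, 0.0 and 1.0
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

# Mean of the squared errors, then the square root
rmse_value = math.sqrt(sum(squared_errors) / len(squared_errors))
print(round(rmse_value, 4))  # 0.6455
```

Because the errors are squared before averaging, the one-star miss on the last rating dominates the result.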

## 4. Importing Surprise & Loading Dataset <a class="anchor" id="dataload"></a>

In [12]:
#Importing functions to be used in this notebook from Surprise Package
from surprise import Dataset, Reader
from surprise.model_selection import GridSearchCV
from surprise.prediction_algorithms import KNNWithMeans

To load a dataset from a pandas dataframe into Surprise, use the `Dataset.load_from_df()` method.
1. You also need a `Reader` object, and its `rating_scale` parameter must be specified.
2. The dataframe must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order.
3. Each row thus corresponds to a single rating.
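The requirements above can be checked before loading. The sketch below uses a hypothetical helper, `check_ratings_frame` (not part of Surprise), on a made-up frame; it assumes that ratings outside the declared scale are worth flagging, since they usually mean `rating_scale` is set wrongly:

```python
import pandas as pd

def check_ratings_frame(df, rating_scale=(1, 5)):
    """Hypothetical helper: basic checks before Dataset.load_from_df()."""
    problems = []
    if df.shape[1] != 3:
        problems.append(f"expected 3 columns, got {df.shape[1]}")
        return problems
    ratings_col = df.iloc[:, 2]  # ratings must be the third column
    low, high = rating_scale
    if ratings_col.isna().any():
        problems.append("missing ratings")
    if not ratings_col.dropna().between(low, high).all():
        problems.append("ratings outside the rating scale")
    return problems

# Made-up frame in (user, item, rating) order with one out-of-scale rating
toy = pd.DataFrame({"user_id": [1, 2],
                    "movie": ["10: A", "20: B"],
                    "rating": [4, 7]})
issues = check_ratings_frame(toy)
print(issues)
```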

In [13]:
#Reader object to import ratings from X_train
reader = Reader(rating_scale=(1, 5))

#Storing Data in surprise format from X_train
data = Dataset.load_from_df(X_train[['user_id','movie','rating']], reader)

## 5. Grid Search for Neighbourhood size and similarity measure <a class="anchor" id="gridsearch"></a>

The `cross_validate()` function reports accuracy metrics over a cross-validation procedure for a given set of parameters. To find out which parameter combination yields the best results, use the `GridSearchCV` class.

Given a dict of parameters, this class exhaustively tries all the combinations of parameters and reports the best parameters for any accuracy measure (averaged over the different splits). It is heavily inspired by scikit-learn’s GridSearchCV.
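To make "exhaustively tries all the combinations" concrete, the sketch below enumerates a flattened version of the grid of `k` values and similarity names searched in this section. It only illustrates the idea with `itertools.product`; it is not how Surprise implements the search, and the key `sim_name` is a made-up label:

```python
from itertools import product

# Flattened illustration of the grid: 10 values of k, 2 similarity measures
grid = {"k": list(range(1, 50, 5)),
        "sim_name": ["cosine", "pearson"]}

# Every combination of one value per parameter
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

n_folds = 5
print(len(combinations))            # number of parameter settings
print(len(combinations) * n_folds)  # number of model fits across all folds
```

Each extra parameter multiplies the number of fits, which is worth keeping in mind before widening the grid.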

In [14]:
#Defining the parameter grid with k as the neighbourhood size & trying 2 similarity measures for KNNWithMeans
#(the 5 folds are set through cv below)
param_grid = {"k":list(range(1,50,5)),
              "sim_options":{"name":["cosine","pearson"]}}

#KNNWithMeans by default does user based collaborative filtering and here we are trying to find the best set of 
#parameters
gs = GridSearchCV(KNNWithMeans, 
                  param_grid, 
                  measures=['rmse'], 
                  cv=5, 
                  n_jobs = -1)

#We fit the grid search on data to find out the best score
gs.fit(data)

#Printing the best score
print(gs.best_score['rmse'])

#Printing the best set of parameters
print(gs.best_params['rmse'])

0.9625349082619087
{'k': 46, 'sim_options': {'name': 'pearson', 'user_based': True}}


## 6. Fitting Model on complete train set & checking performance on test data <a class="anchor" id="testperf"></a>

In [19]:
#Defining similarity measure as per the best parameters
sim_options = {'name': 'pearson'}

#Fitting the model on train data
model = KNNWithMeans(k = 46, sim_options = sim_options)

#build_full_trainset() lets us fit KNNWithMeans on the complete train set instead of on a part of it,
#as happens in cross validation
model.fit(data.build_full_trainset())

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f92c1433b90>

In [40]:
#id pairs for test set
id_pairs = zip(X_test['user_id'], X_test['movie'])

#Making predictions for test set using predict method from Surprise
[model.predict(uid = user, iid = movie) for (user, movie) in id_pairs]

[Prediction(uid=877, iid="381: Muriel's Wedding (1994)", r_ui=None, est=4.104196596682839, details={'actual_k': 38, 'was_impossible': False}),
 Prediction(uid=815, iid='602: American in Paris, An (1951)', r_ui=None, est=3.728344392131596, details={'actual_k': 32, 'was_impossible': False}),
 Prediction(uid=94, iid='431: Highlander (1986)', r_ui=None, est=3.556145118850295, details={'actual_k': 46, 'was_impossible': False}),
 Prediction(uid=416, iid="875: She's So Lovely (1997)", r_ui=None, est=3.331263765779608, details={'actual_k': 37, 'was_impossible': False}),
 Prediction(uid=500, iid='182: GoodFellas (1990)', r_ui=None, est=3.9628659616218593, details={'actual_k': 46, 'was_impossible': False}),
 Prediction(uid=259, iid='1074: Reality Bites (1994)', r_ui=None, est=3.277829437744131, details={'actual_k': 46, 'was_impossible': False}),
 Prediction(uid=598, iid='286: English Patient, The (1996)', r_ui=None, est=3.534383641100695, details={'actual_k': 46, 'was_impossible': False}),
 Pred

We can see that, unlike scikit-learn, `predict` doesn't just return a predicted value: each `Prediction` also carries details such as `actual_k` and `was_impossible`.

1. `was_impossible`
This flag is `True` when the algorithm could not compute a prediction, for example because the user or item is unknown to the trained model or, for some algorithms, because there were not enough neighbours.

2. `actual_k`
For each of these algorithms, the actual number of neighbors that are aggregated to compute an estimate is necessarily less than or equal to 𝑘, for two reasons:

First, there might simply not be enough neighbors.

Second, only neighbors with a positive similarity are included in the prediction.

For a given prediction, the actual number of neighbors can be retrieved from the `actual_k` field of the `details` dictionary of the prediction.
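The two reasons above can be illustrated with a small, self-contained sketch of a mean-centred KNN estimate. This is a simplified illustration with made-up numbers, not Surprise's actual implementation; the function name and the tuple layout are assumptions:

```python
import heapq

def knn_with_means_estimate(user_mean, neighbors, k):
    """Sketch of a mean-centred KNN estimate (illustrative only).

    neighbors: list of (similarity, rating, neighbor_mean) tuples for
    the users who rated the target item.
    """
    # Keep at most the k most similar neighbors
    top_k = heapq.nlargest(k, neighbors, key=lambda n: n[0])

    sum_sim = 0.0
    weighted_dev = 0.0
    actual_k = 0
    for sim, rating, neighbor_mean in top_k:
        if sim > 0:  # only positively similar neighbors contribute
            sum_sim += sim
            weighted_dev += sim * (rating - neighbor_mean)
            actual_k += 1

    estimate = user_mean
    if sum_sim > 0:
        estimate += weighted_dev / sum_sim
    return estimate, actual_k

# Four made-up neighbors, one of them with a negative similarity
toy_neighbors = [(0.9, 5, 4.0), (0.5, 3, 3.0), (-0.2, 1, 2.0), (0.1, 4, 3.5)]
est, actual_k = knn_with_means_estimate(user_mean=3.5, neighbors=toy_neighbors, k=4)
print(round(est, 3), actual_k)
```

With `k=4` all four neighbors are candidates, but the one with negative similarity is skipped, so only three neighbors are actually aggregated.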

To check performance on the test set, we need to extract the predicted rating from each `Prediction`. Since `Prediction` is a named tuple, the estimate can be read either by position (index 3) or by name (`est`).

In [42]:
#id pairs for test set
id_pairs = zip(X_test['user_id'], X_test['movie'])

#Making predictions for test set using predict method from Surprise
y_pred = [model.predict(uid = user, iid = movie).est for (user, movie) in id_pairs]

#Actual rating values for test set
y_true = X_test['rating']

# Checking performance on test set
rmse(y_true, y_pred)

0.9512496782880885
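The extraction step relies on `Prediction` behaving like a named tuple. The sketch below uses a stand-in `namedtuple` with the same field names as the printed predictions (it is not imported from Surprise, and the estimate is rounded) to show that positional and named access return the same value:

```python
from collections import namedtuple

# Stand-in with the same field names as the Prediction objects printed above;
# it is NOT the class from Surprise
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

pred = Prediction(uid=94, iid="431: Highlander (1986)", r_ui=None,
                  est=3.556, details={"actual_k": 46, "was_impossible": False})

by_position = pred[3]  # positional access
by_name = pred.est     # named access, same value
print(by_position == by_name, pred.details["actual_k"])
```

Named access tends to be easier to read and does not depend on the order of the fields.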

## 7. What's Next? <a class="anchor" id="whatsnext"></a>
We again see an improvement compared to the last notebook, where we did not optimize the number of neighbours. Another parameter that you can play around with is `min_k`, the minimum number of neighbors to take into account for aggregation. If there are fewer than `min_k` neighbors, KNNWithMeans sets the neighbor aggregation to zero, so the prediction falls back to the mean rating of the user (or of the item, for item-based filtering). Try tweaking this parameter along with the others to see if you get a further improvement in performance.
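As a starting point, the grid from section 5 could be extended with a few `min_k` candidates. The values `[1, 3, 5]` below are assumptions chosen for illustration, not tuned recommendations; the sketch only builds the dict and counts how many settings an exhaustive search would have to try:

```python
from itertools import product

# Grid from this notebook plus a few candidate min_k values (assumed, not tuned)
param_grid = {"k": list(range(1, 50, 5)),
              "min_k": [1, 3, 5],
              "sim_options": {"name": ["cosine", "pearson"]}}

# Count the settings an exhaustive search would try: 10 * 3 * 2
n_settings = len(list(product(param_grid["k"],
                              param_grid["min_k"],
                              param_grid["sim_options"]["name"])))
print(n_settings)
```

A dict like this can be passed to `GridSearchCV` in the same way as before; with 5 folds, the search then fits three times as many models as the original grid.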