# Building a recommender system with the Surprise package

**Key takeaways**

* I chose SVD as the final model because it has the lowest 10-fold cross-validation RMSE.
* The recommender system is built with the [Surprise](http://surpriselib.com/) package.
* Running on the whole song dataset takes a long time, so the final parameters were tuned on a random sample of 2000 users (about 53k rows).

**Outline**

* [Introduction](#intro)
* [Read Data](#read)
* [Fit SVD model](#svd)
* [Fit Collaborative Filtering model](#cf)
* [Model Selection](#select)

In [1]:
%load_ext watermark

In [2]:
import pandas as pd
import random
from surprise import Dataset
from surprise import Reader
from surprise import SVD, KNNBasic, KNNWithMeans, KNNWithZScore
from surprise import accuracy
from surprise.model_selection import KFold
from surprise.model_selection import GridSearchCV
from collections import defaultdict

%watermark -a 'Johnny' -d -t -v -p pandas,random,surprise 

Johnny 2018-03-09 13:37:34 

CPython 3.6.3
IPython 6.1.0

pandas 0.20.3
random n
surprise 1.0.5


---

# <a id='intro'>Introduction</a>

[**Surprise**](http://surpriselib.com/) is an easy-to-use Python scikit for recommender systems (scikits, short for SciPy Toolkits, are add-on packages for SciPy). It ships with a set of built-in algorithms and datasets to experiment with, ranging from basic collaborative filtering to matrix factorization. Ten of the built-in prediction algorithms are listed below:

> **random_pred.NormalPredictor:**
*  Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.

> **baseline_only.BaselineOnly:**
*  Algorithm predicting the baseline estimate for given user and item.

> **knns.KNNBasic:**
*  A basic collaborative filtering algorithm.

> **knns.KNNWithMeans:**
*  A basic collaborative filtering algorithm, taking into account the mean ratings of each user.

> **knns.KNNBaseline:**
*  A basic collaborative filtering algorithm taking into account a baseline rating.

> **matrix_factorization.SVD:**
*  The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.

> **matrix_factorization.SVDpp:**
*  The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

> **matrix_factorization.NMF:**
*  A collaborative filtering algorithm based on Non-negative Matrix Factorization.

> **slope_one.SlopeOne:**
*  A simple yet accurate collaborative filtering algorithm.

> **co_clustering.CoClustering:**
*  A collaborative filtering algorithm based on co-clustering.
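
To make the simplest entry in the list concrete, here is a minimal standard-library sketch of the idea behind `NormalPredictor`: estimate the mean and standard deviation of the training ratings, then draw each prediction from that normal distribution. The helper names and toy ratings are made up for illustration, and the clipping to the `(1, 100)` scale is an assumption matching the rating scale used later in this notebook. It is not Surprise's implementation.

```python
import random
from statistics import mean, pstdev


def fit_normal_predictor(ratings):
    """Estimate the mean and (population) standard deviation of the ratings."""
    return mean(ratings), pstdev(ratings)


def predict_random(mu, sigma, rng, rating_scale=(1, 100)):
    """Draw one rating from N(mu, sigma^2) and clip it to the rating scale."""
    low, high = rating_scale
    return min(high, max(low, rng.gauss(mu, sigma)))


# Toy listen counts (made up for illustration).
train_ratings = [1, 1, 2, 5, 1, 3, 8, 3]
mu, sigma = fit_normal_predictor(train_ratings)

rng = random.Random(12345)  # local generator so the draws are reproducible
samples = [predict_random(mu, sigma, rng) for _ in range(5)]
print(mu, round(sigma, 3), samples)
```

A predictor like this ignores the user and the item entirely, which is why it is mostly useful as a floor to compare real models against.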

The following helper functions are used throughout this notebook.

In [3]:
DEFAULT_COUNT=50
SEED=12345

def readSongData(top):
    """
    Read the song data from a pickle file and keep a random sample of users

    Parameters
    ----------
    top: number of users to randomly sample from song_df

    Returns
    -------
    a pandas dataframe with columns 'user_id', 'song_id', 'listen_count', 'title', 'release', 'artist_name',
    'year', 'song'

    """

    song_df = pd.read_pickle('../../data/song.pkl')
    # random sample n users from song_df
    user_list= list(song_df.user_id.unique())
    random.seed(SEED)
    random.shuffle(user_list)
    song_df = song_df[song_df.user_id.isin(user_list[0:top])]

    return song_df

def createNewObs(songidList):
    """
    Create rows for a new user 'johnny' who has listened to the selected songs

    Parameters
    ----------
    songidList: the user selected song_ids with format like `SOAKIMP12A8C130995`

    Returns
    -------
    a pandas dataframe with columns 'user_id', 'song_id', 'listen_count'

    """

    ratings_dict = {'user_id': ['johnny']*len(songidList),
                    'song_id': songidList,
                    'listen_count': [DEFAULT_COUNT]*len(songidList)}
    newObs = pd.DataFrame(ratings_dict)
    newObs = newObs[['user_id', 'song_id', 'listen_count']]

    return newObs

def readSurpriseFormat(newObs, song_df):
    """
    Combine the newObs dataframe with the song dataframe and transform it into the Surprise data format

    Parameters
    ----------
    newObs: the dataframe obtained from the function createNewObs
    song_df: a dataframe containing the song information

    Returns
    -------
    a surprise.dataset

    """

    # A reader is still needed but only the rating_scale param is required.
    reader = Reader(rating_scale=(1, 100))

    # get train data
    train = song_df[['user_id', 'song_id', 'listen_count']]

    # combine together
    full = pd.concat([train, newObs]).reset_index(drop=True)

    # The columns must correspond to user id, item id and ratings (in that order).
    data = Dataset.load_from_df(full, reader)

    return data

def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
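
As a quick check of how `get_top_n` behaves, the sketch below repeats the function and feeds it a few hand-made predictions. The `Prediction` namedtuple is a stand-in that assumes the same five fields as Surprise's prediction objects (`uid`, `iid`, `r_ui`, `est`, `details`), and the song ids and estimates are made up.

```python
from collections import defaultdict, namedtuple

# Stand-in for Surprise's prediction objects (assumed field order).
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def get_top_n(predictions, n=10):
    """Same logic as the helper above: keep the n highest estimates per user."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n


# Made-up predictions for two users.
toy_predictions = [
    Prediction("johnny", "song_a", None, 12.0, {}),
    Prediction("johnny", "song_b", None, 40.5, {}),
    Prediction("johnny", "song_c", None, 3.2, {}),
    Prediction("alice", "song_a", None, 7.0, {}),
]

top_2 = get_top_n(toy_predictions, n=2)
print(top_2["johnny"])  # [('song_b', 40.5), ('song_a', 12.0)]
print(top_2["alice"])   # [('song_a', 7.0)]
```

Users with fewer than `n` predictions simply get a shorter list, as `alice` does here.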

# <a id='read'>Read Data</a>

**Read filtered song data**

In [4]:
n_users = 2000
song_df = readSongData(n_users)
song_df.head()

| | user_id | song_id | listen_count | title | release | artist_name | year | song |
|---|---|---|---|---|---|---|---|---|
| 46 | bd4c6e843f00bd476847fb75c47b4fb430a06856 | SOBDRND12A8C13FD08 | 1 | Sun Hands | Gorilla Manor | Local Natives | 2009 | Sun Hands - Local Natives |
| 47 | bd4c6e843f00bd476847fb75c47b4fb430a06856 | SOCHBAJ12AAF3B3A4F | 1 | Camera Talk | Camera Talk | Local Natives | 2009 | Camera Talk - Local Natives |
| 48 | bd4c6e843f00bd476847fb75c47b4fb430a06856 | SOCZTMT12AF72A078E | 1 | Belle | In Between Dreams | Jack Johnson | 2005 | Belle - Jack Johnson |
| 49 | bd4c6e843f00bd476847fb75c47b4fb430a06856 | SOHRQZQ12A6D4F81D2 | 1 | Auto Rock | Mr. Beast | Mogwai | 2006 | Auto Rock - Mogwai |
| 50 | bd4c6e843f00bd476847fb75c47b4fb430a06856 | SOJGMYY12AB01809BE | 2 | Who Knows Who Cares | Gorilla Manor | Local Natives | 2009 | Who Knows Who Cares - Local Natives |


In [5]:
song_df.shape

(53035, 8)
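
The user sampling inside `readSongData` is a seeded shuffle followed by a slice. A minimal sketch of that idea is shown below. `sample_users` is a hypothetical helper, the user ids are made up, and it uses a local `random.Random` instance rather than seeding the global generator as the notebook does.

```python
import random

SEED = 12345


def sample_users(user_ids, top, seed=SEED):
    """Return `top` user ids chosen by a seeded shuffle (hypothetical helper)."""
    shuffled = list(user_ids)              # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # same seed -> same order every run
    return shuffled[:top]


# Made-up user ids for illustration.
all_users = [f"user_{i}" for i in range(10)]

first = sample_users(all_users, top=3)
second = sample_users(all_users, top=3)
print(first == second)  # True
```

Because the seed is fixed, rerunning the notebook selects the same 2000 users, provided the input list arrives in the same order.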

**Create dummy user data**

In [6]:
newObs = createNewObs(['SOAKIMP12A8C130995','SOBBMDR12A8C13253B','SOBXHDL12A81C204C0','SOBYHAJ12A6701BF1D','SODACBL12A8C13C273'])
newObs

| | user_id | song_id | listen_count |
|---|---|---|---|
| 0 | johnny | SOAKIMP12A8C130995 | 50 |
| 1 | johnny | SOBBMDR12A8C13253B | 50 |
| 2 | johnny | SOBXHDL12A81C204C0 | 50 |
| 3 | johnny | SOBYHAJ12A6701BF1D | 50 |
| 4 | johnny | SODACBL12A8C13C273 | 50 |


**Transform into Surprise data format**

In [7]:
data = readSurpriseFormat(newObs, song_df)

# <a id='svd'>SVD</a>

Fit model using Singular Value Decomposition. For more information, see [here](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD).

**Tune algorithm parameters with GridSearchCV**

In [7]:
param_grid = {'n_factors': [10, 50, 100, 200], 'lr_all': [0.002, 0.005],
              'reg_all': [0.02, 0.04, 0.08, 0.1]}
gs_svd = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=10)
gs_svd.fit(data)

# best RMSE score
print(gs_svd.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs_svd.best_params['rmse'])

6.35423331887
{'n_factors': 50, 'lr_all': 0.002, 'reg_all': 0.04}
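
To see why this grid search is slow, it helps to count the fits. An exhaustive grid search trains one model per parameter combination per fold. The sketch below expands the same `param_grid` with `itertools.product`. It only illustrates the combinatorics and does not call Surprise.

```python
from itertools import product

param_grid = {'n_factors': [10, 50, 100, 200], 'lr_all': [0.002, 0.005],
              'reg_all': [0.02, 0.04, 0.08, 0.1]}
n_folds = 10

# Every combination of one value per parameter.
combos = [dict(zip(param_grid, values))
          for values in product(*param_grid.values())]

print(len(combos))            # 32 parameter combinations
print(len(combos) * n_folds)  # 320 model fits with 10-fold cross-validation
```

Each extra value added to any one list multiplies the total, so trimming the grid is usually the cheapest way to speed up tuning.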


**Fit Model and make prediction**

In [8]:
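# fit model on the full dataset
trainset = data.build_full_trainset()
# note: lr_all=0.005 differs from the grid-search optimum printed above (lr_all=0.002)
algo_svd = SVD(n_factors=50, lr_all=0.005, reg_all=0.04, random_state=12345)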
algo_svd.fit(trainset)

# predict all the cells without values
testset = trainset.build_anti_testset()
predictions_svd = algo_svd.test(testset)

# get the top n recommendations for every user
top_n_svd = get_top_n(predictions_svd, n=10)

In [9]:
top_n_svd['johnny']

[('SOBDRND12A8C13FD08', 100),
 ('SOCHBAJ12AAF3B3A4F', 100),
 ('SOCZTMT12AF72A078E', 100),
 ('SOHRQZQ12A6D4F81D2', 100),
 ('SOJGMYY12AB01809BE', 100),
 ('SOQFEDG12AB018DD24', 100),
 ('SOVRZIX12AAF3B2A32', 100),
 ('SOZMJFG12AB017BDAF', 100),
 ('SOAFTRR12AF72A8D4D', 100),
 ('SOALEQA12A58A77839', 100)]
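
Every estimate for `johnny` sits exactly at 100, the upper bound of the rating scale. A plausible explanation is clipping: Surprise clips estimates to the rating scale by default, so any raw estimate above 100 is reported as 100. The sketch below shows the biased matrix factorization estimate (global mean plus user bias plus item bias plus the dot product of the factor vectors) followed by such a clip. `svd_estimate` is a hypothetical helper and all numbers are made up.

```python
def svd_estimate(global_mean, user_bias, item_bias,
                 user_factors, item_factors, rating_scale=(1, 100)):
    """Biased matrix factorization estimate, clipped to the rating scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + dot
    low, high = rating_scale
    return min(high, max(low, raw))


# A user with a very large bias: the raw estimate (109.0) exceeds the scale.
print(svd_estimate(3.0, 45.0, 60.0, [1.0, 2.0], [0.5, 0.25]))  # 100

# A typical user: the raw estimate (4.0) is inside the scale and is kept.
print(svd_estimate(3.0, 0.5, -0.5, [1.0, 2.0], [0.5, 0.25]))   # 4.0
```

When many items tie at the bound, the order within the top-N list carries no information beyond the order in which the predictions were generated.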

# <a id='cf'>Collaborative Filtering</a>

This notebook compares three of the collaborative filtering (k-NN) prediction algorithms in the *Surprise* package:

* **KNNBasic**: A basic collaborative filtering algorithm.
* **KNNWithMeans**: A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
* **KNNWithZScore**: A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
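
The three variants in the list correspond to three ways of scaling a user's ratings before neighbours are combined: the original scale, mean-centered, and z-score standardized. The sketch below applies the two transformations to one user's listen counts. The counts are made up, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

# One user's listen counts (made up for illustration).
listen_counts = [2, 4, 4, 4, 5, 5, 7, 9]

user_mean = mean(listen_counts)   # 5
user_std = pstdev(listen_counts)  # 2.0 (population standard deviation)

# KNNWithMeans-style scaling: subtract the user's mean.
centered = [c - user_mean for c in listen_counts]

# KNNWithZScore-style scaling: subtract the mean, divide by the std.
z_scores = [(c - user_mean) / user_std for c in listen_counts]

print(centered)  # [-3, -1, -1, -1, 0, 0, 2, 4]
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Centering removes the difference between heavy and light listeners, while standardizing also removes differences in how widely each user's counts spread.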

These three methods differ in how they scale the utility matrix. Each can run in either of two modes, selected through the `user_based` key of the `sim_options` argument:
* **user_based=True**: user-user collaborative filtering 
* **user_based=False**: item-item collaborative filtering
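
Surprise documents `user_based` as a key of the `sim_options` dictionary, and its `GridSearchCV` accepts a nested `sim_options` entry in the parameter grid. The sketch below shows what such a grid could look like and expands it by hand to count the combinations. The expansion is a plain-Python illustration rather than Surprise's own code, and the `'name': ['msd']` entry is an assumption based on the similarity measure named in the logs.

```python
from itertools import product

# Hypothetical grid with `user_based` nested under `sim_options`.
param_grid = {
    'k': [10, 50, 100, 200],
    'sim_options': {'name': ['msd'], 'user_based': [True, False]},
}

# Expand the nested similarity options first ...
sim_variants = [dict(zip(param_grid['sim_options'], values))
                for values in product(*param_grid['sim_options'].values())]

# ... then pair every k with every similarity configuration.
combos = [{'k': k, 'sim_options': sim}
          for k in param_grid['k'] for sim in sim_variants]

print(len(combos))  # 8
print(combos[1])    # {'k': 10, 'sim_options': {'name': 'msd', 'user_based': False}}
```

With 2000 users and many more songs, the item-item similarity matrix is considerably larger than the user-user one, so the `user_based=False` runs can be expected to need more time and memory.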

**Tune algorithm parameters with GridSearchCV**

In [10]:
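# note: Surprise documents `user_based` as a key of `sim_options`; as a top-level
# key it appears to be ignored, so this grid effectively varies only `k`
param_grid = {'k': [10, 50, 100, 200], 'user_based': [True, False]}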
gs_cf_NULL = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=10)

gs_cf_NULL.fit(data)

# best RMSE score
print(gs_cf_NULL.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs_cf_NULL.best_params['rmse'])

Computing the msd similarity matrix...
Done computing similarity matrix.

In [11]:
# note: Surprise documents `user_based` as a key of `sim_options`, not as a top-level key
param_grid = {'k': [10, 50, 100, 200], 'user_based': [True, False]}
gs_cf_center = GridSearchCV(KNNWithMeans, param_grid, measures=['rmse', 'mae'], cv=10)

gs_cf_center.fit(data)

# best RMSE score
print(gs_cf_center.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs_cf_center.best_params['rmse'])

Computing the msd similarity matrix...
Done computing similarity matrix.

In [12]:
# note: Surprise documents `user_based` as a key of `sim_options`, not as a top-level key
param_grid = {'k': [10, 50, 100, 200], 'user_based': [True, False]}
gs_cf_zscore = GridSearchCV(KNNWithZScore, param_grid, measures=['rmse', 'mae'], cv=10)

gs_cf_zscore.fit(data)

# best RMSE score
print(gs_cf_zscore.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs_cf_zscore.best_params['rmse'])

Computing the msd similarity matrix...
Done computing similarity matrix.

# <a id='select'>Model Selection</a>

Choose the best model according to the 10-fold cross-validation RMSE. *Random* and *popularity* baselines are not included here; the comparison covers only collaborative filtering and SVD.
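
For reference, the selection metric can be sketched in a few lines of plain Python: RMSE is the square root of the mean squared difference between true and estimated ratings, and k-fold cross-validation averages it over k held-out folds. The `k_fold_indices` helper below is hypothetical and uses a simple stride split without shuffling, which is a simplification and not how Surprise builds its folds. The ratings are made up.

```python
from math import sqrt


def rmse(true_ratings, estimates):
    """Root mean squared error between two equally long sequences."""
    squared = [(t - e) ** 2 for t, e in zip(true_ratings, estimates)]
    return sqrt(sum(squared) / len(squared))


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; fold i holds out rows i, i+k, ..."""
    indices = list(range(n_rows))
    for fold in range(k):
        test_idx = indices[fold::k]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


# Squared errors are 1, 0 and 4, so RMSE = sqrt(5 / 3).
print(round(rmse([1, 2, 5], [2, 2, 3]), 4))  # 1.291

folds = list(k_fold_indices(n_rows=20, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 18 2
```

Every row lands in exactly one test fold, so each rating is predicted once by a model that never saw it during training.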

In [13]:
result = {
    'model':['SVD','Collaborative Filtering (Original Scale)', 'Collaborative Filtering (Centered)', 'Collaborative Filtering (Standardized)'],
    'CV RMSE':[gs_svd.best_score['rmse'], gs_cf_NULL.best_score['rmse'], gs_cf_center.best_score['rmse'], gs_cf_zscore.best_score['rmse']]
}
model_df = pd.DataFrame(result)
model_df[['model', 'CV RMSE']].sort_values(by='CV RMSE', ascending=True)

| | model | CV RMSE |
|---|---|---|
| 0 | SVD | 6.354233 |
| 2 | Collaborative Filtering (Centered) | 6.808045 |
| 3 | Collaborative Filtering (Standardized) | 7.204787 |
| 1 | Collaborative Filtering (Original Scale) | 7.286879 |
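
The selection step itself is a minimum over the cross-validated scores. The sketch below copies the four RMSE values from the table into a plain dictionary and picks the lowest one. It is a restatement of the table rather than a new experiment.

```python
# CV RMSE values copied from the table above.
cv_rmse = {
    'SVD': 6.354233,
    'Collaborative Filtering (Centered)': 6.808045,
    'Collaborative Filtering (Standardized)': 7.204787,
    'Collaborative Filtering (Original Scale)': 7.286879,
}

# Rank the models from lowest (best) to highest RMSE.
ranking = sorted(cv_rmse, key=cv_rmse.get)
best_model, runner_up = ranking[0], ranking[1]

gap = cv_rmse[runner_up] - cv_rmse[best_model]
print(best_model)     # SVD
print(round(gap, 6))  # 0.453812
```

The gap of roughly 0.45 listens between SVD and the best k-NN variant is measured on the 2000-user sample, so it may shift when the models are tuned on the full dataset.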


SVD is the best model, with the parameters shown below. I'll use these parameters to make predictions in my application. Note that they were tuned on only a sample of user_ids.

In [16]:
n_users

2000

In [14]:
print(gs_svd.best_params['rmse'])

{'n_factors': 50, 'lr_all': 0.002, 'reg_all': 0.04}
