### 256-SURPRISE LIBRARY DEMO

### Importing libraries

In [2]:
#Credits-Prof Eirinaki, Rashmi Sharma and Aditya Patel
#conda install -c conda-forge scikit-surprise
#!pip install scikit-surprise
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection.split import train_test_split
from surprise.model_selection import cross_validate, GridSearchCV
import pandas as pd
import numpy as np
import os, io
from surprise import KNNBasic, KNNWithMeans
from surprise import SVDpp
from surprise import SVD
from surprise import accuracy

In [4]:
ratings_df  = pd.read_csv('movie_night_ratings.csv')# read csv into ratings_df dataframe
#ratings_df.head()

### Reading a file

In [5]:
reader = Reader(rating_scale=(1,5))  #invoke reader instance of surprise library
data=Dataset.load_from_df(ratings_df,reader) #load dataset into Surprise datastructure Dataset

### Understanding the recommendations' generation problem
Basic recommender system design revolves around three fields user id,item id and ratings. 

The dataset consists of all three columns and they could be visualised as matrix containing userid as rows,item id as columns and ratings as data given by user for that item. As seen in class, major techniques to predict ratings of the user for an item are collaborative filtering and matrix factorization. We will work through the collaborative filtering  technique using Python's surprise library (https://surprise.readthedocs.io/en/stable/index.html) which provides a lot of built in function tailored to build recommender system. 

In [6]:
ratings_df.head(5)

Unnamed: 0,user,movie,rating
0,1,1,4.0
1,1,2,5.0
2,1,3,4.0
3,1,5,4.0
4,1,6,4.0



### Training the model 

There are several ways to train a recommender system using the surprise library.

The first way is to set similarity measures and employ one of the collaborative filtering algorithms (i.e. the "original" algorithms and their variations). 
There is also an option of using baseline estimates (i.e.minimizing error using some optimization).

We follow the first approach here.

#### Neighborhood-based Collaborative Filtering

Before training the model, we need to create a training set. This needs to be distinct from any set used for cross-validation or testing/evaluation. 

There are several ways to perform hyperparameter tuning and/or evaluation. 

Surprise library provides several cross-validation iterators that allow to do the split from user-item matrix as below. (Ref: https://surprise.readthedocs.io/en/stable/getting_started.html#use-cross-validation-iterators)

##### Option 1: Holdout set

In [7]:
#create training set
trainingSet, testSet = train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)

#### Training 

Surprise provides inbuilt algorithms like KNN (neighborhood-based CF), SVD (latent factor CF), CoClustering etc.

You can check them all here https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html

In [8]:
#lets configure some parameters for Collaborative Filtering Algorithm
sim_options = {
    'name': 'pearson', #similarity measure default is MSD
    'user_based': True #user-based CF
}
#Other options:
#For item-based CF ->False
#For name ->pearson,cosine,msd,pearson_baseline


In [9]:
#KNN
knn = KNNBasic(sim_options=sim_options,k=3,min_k=1) #neighbours=3, other parameters set as above
knn.fit(trainingSet) #fit model to the training set
predictions_knn = knn.test(testSet) #predict for test set values

Computing the pearson similarity matrix...
Done computing similarity matrix.


##### Testing

We will check the algorithm's accuracy or using the RMSE score of the predicted ratings.

In [10]:
#validating rating predictions using RMSE
accuracy.rmse(predictions_knn, verbose=True) 

RMSE: 1.1309


1.1309185685925158

In [11]:
# for each user-item combination in the test set we get predictions
predictions_knn

#We can also predict for a particular user-item combination, if we know the actual rating
#pred = knn.predict(152, 10, r_ui=3, verbose=True)

[Prediction(uid=83, iid=15, r_ui=1.0, est=2.3169093665564735, details={'actual_k': 3, 'was_impossible': False}),
 Prediction(uid=5, iid=7, r_ui=3.0, est=3.3333333333333335, details={'actual_k': 3, 'was_impossible': False}),
 Prediction(uid=38, iid=10, r_ui=5.0, est=4.045454545454546, details={'actual_k': 3, 'was_impossible': False}),
 Prediction(uid=156, iid=15, r_ui=5.0, est=3.8793456032719837, details={'was_impossible': True, 'reason': 'User and/or item is unkown.'}),
 Prediction(uid=143, iid=10, r_ui=2.0, est=4.138714505676302, details={'actual_k': 3, 'was_impossible': False}),
 Prediction(uid=282, iid=3, r_ui=3.0, est=3.3333333333333335, details={'actual_k': 3, 'was_impossible': False}),
 Prediction(uid=236, iid=5, r_ui=5.0, est=3.6666666666666665, details={'actual_k': 3, 'was_impossible': False}),
 Prediction(uid=144, iid=10, r_ui=4.0, est=4.3026861097448945, details={'actual_k': 3, 'was_impossible': False}),
 Prediction(uid=28, iid=5, r_ui=5.0, est=4.333333333333333, details={'ac

##### Option 2: Cross-validation

Run a cross validation procedure for a given algorithm, reporting accuracy measures and computation times.

You have several options in surprise library: https://surprise.readthedocs.io/en/stable/model_selection.html

In [12]:
cross_validate(knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=True) ##rerun the training part with different parameters

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1833  1.1221  1.1378  1.1725  1.1908  1.1613  0.0267  
MAE (testset)     0.9454  0.8659  0.9006  0.9258  0.9317  0.9139  0.0280  
Fit time          0.20    0.21    0.24    0.19    0.22    0.21    0.02    
Test time         0.11    0.09    0.16    0.22    0.13    0.14    0.04    


{'test_rmse': array([1.18331557, 1.1221027 , 1.1377998 , 1.17249608, 1.19081863]),
 'test_mae': array([0.94535196, 0.86585543, 0.90059459, 0.92584096, 0.93169156]),
 'fit_time': (0.20246100425720215,
  0.20719599723815918,
  0.23529267311096191,
  0.18636155128479004,
  0.21964478492736816),
 'test_time': (0.1102447509765625,
  0.0942697525024414,
  0.16352128982543945,
  0.21567869186401367,
  0.1340794563293457)}

##### Option 3:  GridSearchCV
The GridSearchCV class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. This is useful for finding the best set of parameters for a prediction algorithm. It is analogous to GridSearchCV from scikit-learn.

In [13]:
param_grid = {'k': [5, 10, 20],
              'sim_options': {'name': ['pearson', 'cosine'],
                              'min_support': [1, 5],   #the minimum number of common items needed between users to consider them for similarity. For the item-based approach, this corresponds to the minimum number of common users for two items.
                              'user_based': [True]}
              }


In [14]:
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=5) 


In [15]:
gs.fit(data)

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Co

In [17]:
# best RMSE score
print(gs.best_score['rmse'])

1.0300555052316596


In [18]:
# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

{'k': 20, 'sim_options': {'name': 'cosine', 'min_support': 5, 'user_based': True}}


In [19]:
# We can now use the algorithm that yields the best rmse:
knn = gs.best_estimator['rmse']
knn.fit(data.build_full_trainset())

#You may use this instead of some parts of the following section, to make predictions for the unseen data (i.e. all the missing ratings)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x25dea81df28>


### Example -- Making predictions for unknown ratings


#### UI prep
For our demo, we will create a user dictionary and movie dictionary where for user dictionary key is username and value userId which is used in our original dataset. For movie dictionary key is movieId and value is movie name. 

In [20]:
user_df = pd.read_csv("user_name.csv")


In [21]:
user_df.head(5)

Unnamed: 0,username,id
0,SPRING-20-933,1
1,SPRING-20-287,2
2,SPRING-20-013,1
3,SPRING-20-157,2
4,SPRING-20-891,3


In [22]:
user_dict = {}
for i in range(len(user_df)):
    user_dict[user_df.iloc[i].username] = user_df.iloc[i].id

In [23]:
movie_df = pd.read_csv("movie_name.csv")

In [24]:
movie_df.head(5)

Unnamed: 0,movieName,id
0,Rating [Rogue One/Star Wars],1
1,Rating [Fight Club],2
2,Rating [The Lord of the Rings],3
3,Rating [Trolls],4
4,Rating [Despicable Me],5


In [25]:
movie_dict = {}
for i in range(len(movie_df)):
    movie_dict[movie_df.iloc[i].id] = movie_df.iloc[i].movieName

#### Find user-item pairs with no ratings

The build_anti_testset() function returns all the ratings that are not in the trainset, i.e. all the ratings 𝑟𝑢𝑖 where the user 𝑢 is known, the item 𝑖 is known, but the rating 𝑟𝑢𝑖 is not in the trainset. As 𝑟𝑢𝑖 is unknown, it is either replaced by the fill value or assumed to be equal to the mean of all ratings global_mean.

In [26]:
#for simplicity, we use the entire dataset, with the default algorithm to traing the model. 
#In reality, you should follow one of the techniques above to find the optimal parameters.

# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it. Follow methodology provided previously
algo = KNNBasic()
algo.fit(trainset)

# Find missing values and predict
anti_test_set = trainset.build_anti_testset() 
predictions = knn.test(anti_test_set)

Computing the msd similarity matrix...
Done computing similarity matrix.


The getMovieRecommendations function takes topN parameter which is how many movies you want to recommend to the users. It uses predictions which we generate from anti-test.

In [27]:
from collections import defaultdict

def getMovieRecommendations(topN=3):
    top_recs = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions: 
        top_recs[uid].append((iid, est))
     
    for uid, user_ratings in top_recs.items():
        user_ratings.sort(key = lambda x: x[1], reverse = True)
        top_recs[uid] = user_ratings[:topN]
     
    return top_recs 

In [28]:
recommendations = getMovieRecommendations(3)

Fetch thee movie name from movie dict and clean it

In [29]:
def getMovieName(movie_id):
    if movie_id not in movie_dict:
        return ""
    m = movie_dict[movie_id].split('[')
    temp = m[1].split(']')
    return temp[0]

The getMovieRecommendationsForUser fuction takes username, and recommendations which we get from getMovieRecommendations function. 

In [30]:
def getMovieRecommendationsForUser(userId, recommendations):
    if userId not in user_dict:
        print("User id is not present")
        return
    u_id = user_dict[userId]
    recommended_movies = recommendations[u_id]
    movie_list = []
    for movie in recommended_movies:
        movie_list.append((getMovieName(movie[0]),movie[1]))
    return movie_list    

In [32]:
getMovieRecommendationsForUser('SPRING-20-663',recommendations)

[('Rogue One/Star Wars', 4.350296486127488),
 ('The hangover', 4.349583286452919),
 ('Pulp Fiction', 4.100834799392867)]

### Tips

1.Surprise dataset function just takes three columns,user-item and ratings so be careful.

2.Building Antitest set gives you all the unknown user-item ratings,you may not require all of them.

3.Explore more and have fun!
