# Surprise recommender systems library with MLflow
## Development Stage
In this development notebook we show how to use the Surprise package to train different recommender algorithms. Using MLflow, we demonstrate how to select the best-performing algorithm in terms of accuracy (RMSE) and of training and prediction time. First we import the required libraries:

In [3]:
import pandas as pd 

from surprise import Reader, Dataset #converts pandas dataframe to surprise objects
from surprise import NormalPredictor, KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline #algorithms
from surprise import SVD, BaselineOnly, SVDpp, NMF, SlopeOne, CoClustering #algorithms

from surprise.model_selection import cross_validate, GridSearchCV
from surprise.accuracy import rmse
#from surprise import accuracy
#from surprise.model_selection import train_test_split

import mlflow #load mlflow - no further setup is needed in faculty as it has a native UI.
#Warning: if you are going to use mlflow in other environment you will need to set it up first

import pickle

We load our ratings file into a pandas DataFrame and remove the timestamp column, as we will not use it further. We then transform the pandas DataFrame into a Surprise object:

In [10]:
ratings = pd.read_csv('/project/DataCollection/ratings.csv')
ratings['rating'] = ratings['rating'].astype(int)
ratings = ratings.drop('timestamp', axis=1)
ratings.head()

|   | userId | movieId | rating |
|---|--------|---------|--------|
| 0 | 1      | 1       | 5      |
| 1 | 1      | 2       | 3      |
| 2 | 1      | 3       | 4      |
| 3 | 1      | 4       | 3      |
| 4 | 1      | 5       | 3      |


In [11]:
#now we convert it to a Surprise object
reader = Reader(rating_scale=(0, 5)) #set the range of rating column
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

Now we create a new MLflow experiment. It will store the metrics from the different models that we experiment with.

In [12]:
# End any run that is still active, then select (or create) the experiment
mlflow.end_run()
mlflow.set_experiment("Surprise_train")

INFO: 'Surprise_train' does not exist. Creating a new experiment


Below we create a loop which will train different algorithms from the Surprise package. With MLflow we store the results.

In [13]:
#Below we select several algorithms from surprise package:
algorithms = [SVD(), SVDpp(), SlopeOne(), NormalPredictor(), 
              KNNBaseline(), KNNBasic(), KNNWithMeans(), 
              KNNWithZScore(), BaselineOnly(), CoClustering()]

# Iterate over all algorithms
for algorithm in algorithms:
    mlflow.start_run() #for each iteration we will save the new results
    
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, n_jobs=-1, verbose=False)

    # Average each measure over the folds and get the algorithm's class name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    model_name = type(algorithm).__name__
    
    #Save results to MLflow (label-based access avoids relying on column order)
    mlflow.set_tag("Model", model_name)
    mlflow.log_param("TestRMSE", str(tmp['test_rmse']))
    mlflow.log_param("Training Time", tmp['fit_time'])
    mlflow.log_param("Predicting Time", tmp['test_time'])
    mlflow.end_run()
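
To make the averaging step concrete, here is a minimal pure-Python sketch of what a 3-fold RMSE average amounts to. It is not Surprise's implementation: the `rmse` and `k_fold_indices` helpers, the mean-rating stand-in model and the toy ratings are all assumptions introduced for illustration.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) lists for k contiguous folds (no shuffling)."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_items) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

# Toy ratings (hypothetical values, not taken from ratings.csv)
toy_ratings = [5, 3, 4, 3, 3, 4]

fold_scores = []
for train_idx, test_idx in k_fold_indices(len(toy_ratings), k=3):
    # Stand-in "model": predict the mean rating of the training folds
    mean_rating = sum(toy_ratings[i] for i in train_idx) / len(train_idx)
    fold_scores.append(rmse([(toy_ratings[i], mean_rating) for i in test_idx]))

# One score per fold, averaged into a single number per algorithm
mean_rmse = sum(fold_scores) / len(fold_scores)
```

The loop above logs the same kind of per-algorithm average, together with the mean fit and test times.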

At this stage we open the Experiments section of Faculty.ai, select the experiment that we have logged, and examine the results:

[comment]: <> (Local file is mlflow1.png)
![](https://i.imgur.com/qLarbSP.png)


We select all of the runs and request a comparison. In the scatter plot below we put the test RMSE on the x-axis and the prediction time on the y-axis.
After examining the different models, we note that SVD achieved both the best accuracy and the best prediction time:

[comment]: <> (Local file is mlflow2.png)
![](https://i.imgur.com/pXweI6M.png)
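
The visual comparison can also be expressed in code. The sketch below keeps only the runs that no other run beats on both RMSE and prediction time (a Pareto front). The `pareto_front` helper is hypothetical, and the numbers are placeholders for illustration only, not the results logged above.

```python
def pareto_front(runs):
    """Return runs not dominated on (rmse, predict_time); lower is better for both."""
    front = []
    for run in runs:
        dominated = any(
            other["rmse"] <= run["rmse"]
            and other["predict_time"] <= run["predict_time"]
            and (other["rmse"] < run["rmse"] or other["predict_time"] < run["predict_time"])
            for other in runs
        )
        if not dominated:
            front.append(run)
    return front

# Placeholder numbers for illustration only; they are NOT the logged results
example_runs = [
    {"model": "A", "rmse": 0.95, "predict_time": 0.30},
    {"model": "B", "rmse": 0.98, "predict_time": 0.20},
    {"model": "C", "rmse": 1.00, "predict_time": 0.50},
]

best_candidates = [run["model"] for run in pareto_front(example_runs)]
```

When a single model is best on both axes, the front contains only that model and the choice is unambiguous. Otherwise the front lists the trade-offs to choose from.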

## Production Stage
Based on the best algorithm from the previous stage (SVD), we have created a **surprise_SVD_job.py** file which will be executed once a day on Faculty.ai Jobs to update the recommendations.




This script trains the model and saves the movie recommendations for each user in a pickle file.
In the next cells we replicate the code found in this script file.

In [27]:
#stage 1
ratings = pd.read_csv('/project/DataCollection/ratings.csv')
ratings['rating'] = ratings['rating'].astype(int)
ratings = ratings.drop('timestamp', axis=1)
reader = Reader(rating_scale=(0, 5)) #set the range of rating column
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Search for the best SVD parameters with a 3-fold grid search
param_grid = {'n_factors':[80,100,120],
              #'n_epochs': [15,20,25], 
              #'lr_all': [0.002, 0.005, 0.01],
              #'reg_all': [0.01,0.02,0.03],
              'random_state':[94]
             }
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, n_jobs=-1)
grid_search.fit(data)

# Get the best model (an SVD instance configured with the best parameters)
best_model = grid_search.best_estimator['rmse']

# Refit the best model on the full dataset
data = data.build_full_trainset()
best_model = best_model.fit(data)
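
Grid search tries every combination of the listed values, so the cost grows quickly once the commented-out parameters are enabled. The sketch below counts the model fits for both grids. The `expand_grid` helper is hypothetical and only mirrors the exhaustive enumeration; it does not call Surprise.

```python
from itertools import product

def expand_grid(param_grid):
    """Return one dict per parameter combination in the grid."""
    names = list(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]

# The grid as used above (other parameters commented out)
active_grid = {'n_factors': [80, 100, 120], 'random_state': [94]}

# The same grid with the commented-out parameters enabled
full_grid = {
    'n_factors': [80, 100, 120],
    'n_epochs': [15, 20, 25],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.01, 0.02, 0.03],
    'random_state': [94],
}

cv_folds = 3
active_fits = len(expand_grid(active_grid)) * cv_folds  # 3 combinations x 3 folds = 9
full_fits = len(expand_grid(full_grid)) * cv_folds      # 81 combinations x 3 folds = 243
```

With a job that runs daily, this difference is worth keeping in mind before widening the grid.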


With the best model we now retrieve the top 10 predictions for each user.
With Surprise's .build_anti_testset() method we create a list of all the (user, movie) pairs for the movies that each user has not rated so far.
The following code has been adapted from Surprise's documentation (reference in the comment).

In [35]:
#stage 2
#https://surprise.readthedocs.io/en/stable/FAQ.html?highlight=get%20top%20predictions#how-to-get-the-top-n-recommendations-for-each-user
from collections import defaultdict

def get_top_n(predictions, n=10):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
data_anti_testset = data.build_anti_testset()
predictions = best_model.test(data_anti_testset)

top_10 = get_top_n(predictions, n=10)
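
To illustrate what the anti-testset contains, here is a small pure-Python sketch that lists the unrated (user, item) pairs for a toy dataset. The `build_anti_pairs` helper and the toy ratings are assumptions made for illustration; Surprise's own method additionally attaches a fill value to each pair.

```python
def build_anti_pairs(ratings):
    """Return every (user, item) pair that does not appear in `ratings`."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    seen = {(u, i) for u, i, _ in ratings}
    return [(u, i) for u in users for i in items if (u, i) not in seen]

# Toy (userId, movieId, rating) triples, hypothetical values
toy = [(1, 10, 5), (1, 11, 3), (2, 10, 4), (3, 12, 2)]

anti_pairs = build_anti_pairs(toy)
# 3 users x 3 items = 9 possible pairs, minus the 4 rated ones
```

The number of pairs is (users × items) minus the number of known ratings, so for large catalogues the prediction step can dominate the running time and memory use of the job.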


In [36]:
#Here we get the predictions for user 1.
#For the movie with ID 483, the predicted rating is 5
top_10[1]

[(483, 5),
 (513, 4.796565258129434),
 (474, 4.771459346501919),
 (408, 4.753590453730842),
 (285, 4.743957308039365),
 (511, 4.738838663889011),
 (603, 4.72774841361666),
 (479, 4.674467842693398),
 (498, 4.620256873215769),
 (302, 4.613314636579151)]

In [38]:
#stage 3
#Finally we save the recommendations to a pickle file in our local directory
#(the with-block closes the file even if pickling fails)
with open("top_10_recomm.pkl", "wb") as f:
    pickle.dump(top_10, f)
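
One detail worth knowing when the file is loaded again: the pickled object is still a `defaultdict`, so indexing it with an unknown user silently creates an empty entry. The sketch below shows this with an in-memory round trip. The single recommendation and the user ID 999 are illustrative values, and converting to a plain `dict` is one possible choice rather than what the job script does.

```python
import pickle
from collections import defaultdict

# A tiny stand-in for top_10 (illustrative content)
recs = defaultdict(list)
recs[1].append((483, 5.0))

payload = pickle.dumps(recs)    # same format pickle.dump writes to a file
loaded = pickle.loads(payload)  # comes back as a defaultdict

unknown_before = 999 in loaded  # membership test does not insert anything
_ = loaded[999]                 # indexing creates an empty list for user 999
unknown_after = 999 in loaded

# A plain dict with .get() leaves the data untouched for unknown users
plain = dict(pickle.loads(payload))
lookup = plain.get(999, [])
```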

## Deployment Stage
Below we demonstrate how to serve these recommendations through an API on Faculty.ai.
First we created the **svd_app.py** script, which hosts our Flask API. The script loads the top_10_recomm.pkl pickle file (produced by the previous script) and serves its contents through an API. The API offers a single GET endpoint, since the internal clients (on premises or remote) only need to request the recommendations for a given user.

In [40]:
import logging
import os
os.environ['PYSPARK_PYTHON'] = "python3" #requirement for faculty.ai
os.environ['PYSPARK_DRIVER_PYTHON'] = "python3" 
from flask import Flask

import pandas as pd
import pickle

flask_server = Flask(__name__)

with open('top_10_recomm.pkl', 'rb') as f:  #we load the recommendations
    top_10 = pickle.load(f)
    
@flask_server.route('/predict/<int:uid>', methods=['GET'])  #we define our get endpoint
def get_top_10(uid):
    # .get() avoids adding an empty entry to the loaded defaultdict for unknown users
    recommendations = top_10.get(uid)
    if not recommendations:
        # Unknown user: return an empty frame that still carries the requested userId
        empty = pd.DataFrame(columns=['movieId' , 'pred_rating', 'userId'])
        empty.loc[0,'userId']=uid
        return empty.to_json()
    # Known user: one row per recommended movie
    extr = pd.DataFrame(recommendations, columns=['movieId', 'pred_rating'])
    extr['userId'] = uid
    return extr.to_json()
    
if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    logging.info('Listening on port {}'.format(8080))
    flask_server.run(debug=True, host='0.0.0.0', port=8080, threaded=False)  #(debug=False, host='127.0.0.1', port=8080,  threaded=False) 
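
On the client side, the response has to be unpacked again. Since the endpoint returns `DataFrame.to_json()` with pandas' default column orientation, each column arrives as a mapping from row label to value. The sketch below parses such a body with the standard library. The `parse_recommendations` helper and the sample bodies are assumptions for illustration, not output captured from the deployed API.

```python
import json

def parse_recommendations(body):
    """Turn the column-oriented JSON into a list of (movieId, pred_rating) tuples."""
    payload = json.loads(body)
    movie_ids = payload.get("movieId", {})
    ratings = payload.get("pred_rating", {})
    rows = sorted(movie_ids, key=int)  # row labels arrive as strings: "0", "1", ...
    # Skip placeholder rows, such as the single null row returned for an unknown user
    return [(movie_ids[r], ratings.get(r)) for r in rows if movie_ids[r] is not None]

# Illustrative bodies in the assumed shape
sample_body = ('{"movieId":{"0":483,"1":513},'
               '"pred_rating":{"0":5.0,"1":4.8},'
               '"userId":{"0":1,"1":1}}')
unknown_user_body = '{"movieId":{"0":null},"pred_rating":{"0":null},"userId":{"0":999}}'

known = parse_recommendations(sample_body)
unknown = parse_recommendations(unknown_user_body)
```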

To deploy the API we go to Faculty.ai > Deployments > API. We select our script file and the server object inside the script (in our case flask_server). Finally we generate a deployment API key.

[comment]: <> (Local file is facultyapi.png)
![](https://i.imgur.com/j4Il1O7.png)
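
Once the deployment is running, a client can request the recommendations for a user with the generated key. The sketch below only builds the request object without sending it. The base URL is a placeholder, and the `UserAPI-Key` header name is an assumption that should be checked against the details shown for your deployment; `build_recommendation_request` is a hypothetical helper.

```python
from urllib.request import Request

def build_recommendation_request(base_url, api_key, user_id):
    """Build (but do not send) a GET request for one user's recommendations."""
    url = "{}/predict/{}".format(base_url.rstrip("/"), int(user_id))
    # Assumption: the deployment expects the key in a 'UserAPI-Key' header
    return Request(url, headers={"UserAPI-Key": api_key}, method="GET")

# Placeholder URL and key; replace them with the values from the deployment page
request = build_recommendation_request("https://example.invalid/", "my-key", 1)
```

Passing the resulting object to `urllib.request.urlopen` would send the request and return the JSON body served by the `/predict/<uid>` endpoint.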

In [None]:
In the deploy tab 