<a href="https://colab.research.google.com/github/michaelwnau/ai-academy-machine-learning-2023/blob/main/W10S2_RecSys_SVDTuning_Sol.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Week 10 - Session 2: Recommendation System - Grid Search

 - Finding the best hyperparameters for the MovieLens `ml-latest-small` data (about 100k ratings) with `GridSearchCV`

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Dependencies
- Install the Surprise package: `pip install surprise` (this pulls in `scikit-surprise`, as the install log below shows)

In [3]:
# Dependencies
!pip install surprise
# or
# !conda install -c conda-forge scikit-surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.0/772.0 kB 3.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163768 sha256=66ef4399b184cc85db5f422eae27cadff76a29585169c7083be9ef77846ce650
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.3 surprise-0.1


## 1. Load Data

In [4]:
import numpy as np
import pandas as pd
from surprise import SVD
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import GridSearchCV
from surprise import accuracy

df = pd.read_csv("/content/drive/MyDrive/ai-academy-machine-learning-2023/week-10/2_W10S2_Code_RecSys2_SVDTuning/2_W10S2_Code_RecSys2_SVDTuning/ml-latest-small/ratings.csv")
df.columns = ['userID', 'itemID', 'rating', 'timestamp']

In [5]:
df.head()

| | userID | itemID | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


## 2. Parameter Search - Tuning

- Set up a movie recommender system using the matrix-factorization algorithm `SVD`, which was popularized by Simon Funk during the Netflix Prize.
- Tune the model with 3-fold cross-validation over the parameter grid defined in the code below.
- `surprise.SVD` documentation: https://surprise.readthedocs.io/en/stable/matrix_factorization.html
- Paper: https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf
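
To make the role of the tuned parameters concrete, here is a minimal pure-Python sketch of the biased matrix-factorization model trained by SGD, with the prediction rule `mu + b_u + b_i + q_i . p_u`. It is an illustration only, not Surprise's implementation: the function names (`train_svd`, `predict`, `rmse`) and the toy ratings are assumptions made for this example, and the argument names simply mirror the `n_factors`, `n_epochs`, `lr_all` and `reg_all` parameters of `surprise.SVD`.

```python
import math
import random


def train_svd(ratings, n_factors=2, n_epochs=200, lr_all=0.01, reg_all=0.02, seed=0):
    """Fit biases and latent factors by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    # Small random initial latent factors
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Gradient step with L2 regularization on every parameter
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr_all * (err * qif - reg_all * puf)
                q[i][f] += lr_all * (err * puf - reg_all * qif)
    return mu, bu, bi, p, q


def predict(model, u, i):
    mu, bu, bi, p, q = model
    return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))


def rmse(model, ratings):
    squared = [(r - predict(model, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: (userID, itemID, rating)
toy_ratings = [(1, 10, 4.0), (1, 20, 5.0), (2, 10, 3.0),
               (2, 30, 1.0), (3, 20, 4.0), (3, 30, 2.0)]

model = train_svd(toy_ratings)
mu = model[0]
baseline_rmse = math.sqrt(sum((r - mu) ** 2 for _, _, r in toy_ratings) / len(toy_ratings))
train_rmse = rmse(model, toy_ratings)
print(f"global-mean baseline RMSE: {baseline_rmse:.3f}, fitted RMSE: {train_rmse:.3f}")
```

A larger `reg_all` pulls the biases and factors toward zero, and a larger `lr_all` takes bigger steps per rating. Note that this example only reports the fit on its own training ratings.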

In [6]:
def tuning(df, target_rating, rating_scale, n_jobs):
    # A reader is still needed, but only the rating_scale param is required.
    reader = Reader(rating_scale=rating_scale)

    # The columns must correspond to user id, item id and ratings (in that order).
    data = Dataset.load_from_df(df[['userID', 'itemID', target_rating]], reader)

    # SVD parameters
    # 'n_factors': number of latent factors
    # 'n_epochs': number of iterations (epochs) of the SGD procedure
    # 'lr_all': learning rate for all parameters
    # 'reg_all': regularization term for all parameters

    param_grid = {'n_factors': [100, 200, 300], 'n_epochs': [20, 30], 'lr_all': [0.005, 0.01],
                  'reg_all': [0.02, 0.2]}

    # SVD: the algorithm class to tune
    # param_grid: parameter search space
    # measures: the performance measures to compute
    # cv: number of cross-validation folds
    # n_jobs: the maximum number of parallel training procedures

    gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=n_jobs, joblib_verbose=50)
    gs.fit(data)

    # Best (lowest) mean RMSE across the folds
    print(gs.best_score['rmse'])

    # Combination of parameters that gave the best RMSE score
    print(gs.best_params['rmse'])
    return data, gs
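
As a rough guide to how much work this grid implies, the sketch below enumerates the same `param_grid` with `itertools.product` and counts the model fits needed for 3-fold cross-validation. It uses only the standard library; the names `combinations` and `n_fits` are chosen here for illustration.

```python
from itertools import product

# Same search space as in tuning()
param_grid = {
    'n_factors': [100, 200, 300],
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.01],
    'reg_all': [0.02, 0.2],
}
cv = 3  # number of cross-validation folds

names = list(param_grid)
# One dict per parameter combination, with the last key varying fastest
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(combinations) * cv
print(f"{len(combinations)} combinations x {cv} folds = {n_fits} fits")
print(combinations[0])
```

Each extra value added to any list multiplies the total number of fits, so the grid grows quickly.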

In [7]:
data, gs = tuning(df, target_rating = 'rating', rating_scale = (1,5), n_jobs = 6)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done   1 tasks      | elapsed:    8.3s
[Parallel(n_jobs=6)]: Done   2 tasks      | elapsed:    8.8s
[Parallel(n_jobs=6)]: Done   3 tasks      | elapsed:   10.0s
[Parallel(n_jobs=6)]: Done   4 tasks      | elapsed:   11.2s
[Parallel(n_jobs=6)]: Done   5 tasks      | elapsed:   13.2s
[Parallel(n_jobs=6)]: Done   6 tasks      | elapsed:   20.6s
[Parallel(n_jobs=6)]: Done   7 tasks      | elapsed:   21.3s
[Parallel(n_jobs=6)]: Done   8 tasks      | elapsed:   21.6s
[Parallel(n_jobs=6)]: Done   9 tasks      | elapsed:   22.5s
[Parallel(n_jobs=6)]: Done  10 tasks      | elapsed:   22.5s
[Parallel(n_jobs=6)]: Done  11 tasks      | elapsed:   22.5s
[Parallel(n_jobs=6)]: Done  12 tasks      | elapsed:   29.6s
[Parallel(n_jobs=6)]: Done  13 tasks      | elapsed:   35.3s
[Parallel(n_jobs=6)]: Done  14 tasks      | elapsed:   36.3s
[Parallel(n_jobs=6)]: Done  15 tasks      | elapsed:   37.6s
[Parallel(

### GridSearchCV Result

 - The full tuning results from the 3-fold cross-validation above: RMSE and MAE for every fold and every parameter combination.

In [8]:
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df

Unnamed: 0,split0_test_rmse,split1_test_rmse,split2_test_rmse,mean_test_rmse,std_test_rmse,rank_test_rmse,split0_test_mae,split1_test_mae,split2_test_mae,mean_test_mae,...,rank_test_mae,mean_fit_time,std_fit_time,mean_test_time,std_test_time,params,param_n_factors,param_n_epochs,param_lr_all,param_reg_all
0,0.881631,0.882175,0.881182,0.881663,0.000406,13,0.678864,0.679008,0.677707,0.678526,...,13,5.066942,0.080472,1.171917,0.088817,"{'n_factors': 100, 'n_epochs': 20, 'lr_all': 0...",100,20,0.005,0.02
1,0.878713,0.87802,0.878667,0.878467,0.000316,10,0.678105,0.676317,0.677435,0.677286,...,10,6.807975,1.855632,1.906147,0.43616,"{'n_factors': 100, 'n_epochs': 20, 'lr_all': 0...",100,20,0.005,0.2
2,0.888985,0.885889,0.884655,0.88651,0.001821,17,0.682934,0.680652,0.678761,0.680782,...,16,9.226729,0.281512,1.515217,0.107656,"{'n_factors': 100, 'n_epochs': 20, 'lr_all': 0...",100,20,0.01,0.02
3,0.874513,0.874329,0.875119,0.874654,0.000338,6,0.674157,0.672983,0.673941,0.673694,...,6,7.211572,0.972513,1.620541,0.348462,"{'n_factors': 100, 'n_epochs': 20, 'lr_all': 0...",100,20,0.01,0.2
4,0.884734,0.881454,0.882038,0.882742,0.001429,14,0.680365,0.67667,0.678558,0.678531,...,14,11.658336,0.640538,2.172846,0.248948,"{'n_factors': 100, 'n_epochs': 30, 'lr_all': 0...",100,30,0.005,0.02
5,0.875345,0.875286,0.875673,0.875435,0.00017,7,0.67509,0.67356,0.674537,0.674396,...,7,12.451009,0.493842,1.530296,0.121733,"{'n_factors': 100, 'n_epochs': 30, 'lr_all': 0...",100,30,0.005,0.2
6,0.892475,0.891654,0.890823,0.891651,0.000674,23,0.686222,0.685527,0.685835,0.685861,...,23,9.593564,0.597389,2.275002,0.245859,"{'n_factors': 100, 'n_epochs': 30, 'lr_all': 0...",100,30,0.01,0.02
7,0.87295,0.873127,0.873903,0.873327,0.000414,3,0.672703,0.671658,0.672871,0.672411,...,3,11.458215,0.656762,2.252803,0.546809,"{'n_factors': 100, 'n_epochs': 30, 'lr_all': 0...",100,30,0.01,0.2
8,0.888048,0.887495,0.887796,0.88778,0.000226,18,0.684993,0.682732,0.683542,0.683756,...,19,10.743034,0.932133,1.537446,0.201584,"{'n_factors': 200, 'n_epochs': 20, 'lr_all': 0...",200,20,0.005,0.02
9,0.87917,0.877846,0.878987,0.878668,0.000586,11,0.678793,0.676222,0.677887,0.677634,...,11,10.193856,1.499117,2.195036,0.445272,"{'n_factors': 200, 'n_epochs': 20, 'lr_all': 0...",200,20,0.005,0.2
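
To show how the summary columns relate to the per-fold columns, this short check recomputes `mean_test_rmse` and `std_test_rmse` for row 0 from its three `split*_test_rmse` values, copied from the table above. Judging from this one row, the reported standard deviation appears to be the population standard deviation; that is an inference from the numbers, not something the notebook states.

```python
from statistics import mean, pstdev

# split0/split1/split2 test RMSE for row 0, copied from the results table
row0_rmse = [0.881631, 0.882175, 0.881182]

fold_mean = mean(row0_rmse)   # compare with mean_test_rmse = 0.881663
fold_std = pstdev(row0_rmse)  # compare with std_test_rmse = 0.000406

print(f"mean: {fold_mean:.6f}, std: {fold_std:.6f}")
```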


## 3. Evaluate the model

- Here we refit the best model (lowest cross-validated `RMSE`) on the full dataset and compute `RMSE` on its predictions.
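
For reference, RMSE and MAE can be computed by hand from pairs of true and estimated ratings. The sketch below uses made-up `(true, estimated)` pairs and helper names (`rmse`, `mae`) chosen for this example; `surprise.accuracy` provides the equivalent metrics for Surprise prediction lists.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


# Hypothetical (true rating, estimated rating) pairs
pairs = [(4.0, 3.0), (2.0, 2.0), (5.0, 3.0), (1.0, 1.0)]

print(f"RMSE: {rmse(pairs):.4f}")
print(f"MAE:  {mae(pairs):.4f}")
```

Because the errors are squared before averaging, the single error of 2 raises RMSE more than it raises MAE.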

In [9]:
%%time
# 1) Train the best model with full trainset.
trainset = data.build_full_trainset()
algo = gs.best_estimator['rmse']
algo.fit(trainset)

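# 2) Predict ratings for all (user, item) pairs that are NOT in the training set.
# Caution: the anti-testset has no true ratings; Surprise fills r_ui with the
# global mean, so the RMSE below measures the distance of the predictions from
# the global mean, not a held-out error. The cross-validated RMSE from the
# grid search is the meaningful accuracy estimate.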
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
accuracy.rmse(predictions)

RMSE: 0.5064
CPU times: user 1min 8s, sys: 4.27 s, total: 1min 12s
Wall time: 1min 12s


0.5063952801124072

### Questions
- Select one specific user ID and recommend movies for that user (see the Surprise FAQ): https://surprise.readthedocs.io/en/stable/FAQ.html
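
One possible starting point, modeled on the `get_top_n` helper in the Surprise FAQ, is to group predictions by user and keep the highest estimates. The sketch below runs on plain tuples shaped like Surprise's `Prediction` objects, `(uid, iid, r_ui, est, details)`. The tuples are hypothetical values made up for this example; in the notebook you would pass the `predictions` list returned by `algo.test(testset)` instead.

```python
from collections import defaultdict


def get_top_n(predictions, n=10):
    """Return {uid: [(iid, est), ...]} with the n highest estimates per user."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Sort each user's items by estimated rating and keep the n best
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda pair: pair[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n


# Hypothetical predictions: (uid, iid, r_ui, est, details)
example_predictions = [
    (1, 10, 3.5, 4.2, {}),
    (1, 20, 3.5, 4.8, {}),
    (1, 30, 3.5, 3.1, {}),
    (2, 10, 3.5, 2.9, {}),
]

top_n = get_top_n(example_predictions, n=2)
user_id = 1  # the user selected for recommendations
for item_id, est in top_n[user_id]:
    print(f"user {user_id}: item {item_id} (estimated rating {est:.1f})")
```

Mapping each `itemID` back to a movie title would need the `movies.csv` file that ships with the MovieLens data, which this notebook does not load.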