# Movie Recommender System

This notebook contains the implementation of the following recommender algorithms:

- Content-Based Filtering
- Item-Based Collaborative Filtering
- User-Based Collaborative Filtering
- Matrix Factorization Collaborative Filtering
- Weighted Hybrid Filtering
- Mixed Hybrid Filtering
- Cascade Hybrid Filtering

## Experimentation setup

* Objective

    * To compare how each recommender algorithm performs at predicting ratings and recommending relevant items.

* Environment

    * The comparison is run on Google Colab.
        
* Datasets
        
    * MovieLens 100K.


* Data split

    * The data is split into train and test sets.
    * The split ratio is 80-20 (train-test).
    * The splitting is done per item (the code passes `col_split = 'item'` to `stratified_split`).


* Evaluation metrics

    * Ranking metrics:

        * Precision@k.
        * Recall@k.
        * Normalized discounted cumulative gain@k (NDCG@k).

    * Rating metrics:
        * Root mean squared error (RMSE).
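
The metrics listed above come from the project's own `evaluation.metric` module, which is not shown here. As a rough illustration only, the following sketch computes each metric for a single user with binary relevance. The `_sketch` function names and the toy inputs are hypothetical, and library implementations may differ in details such as the denominator used for Precision@k.

```python
import math

def precision_at_k_sketch(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k_sketch(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k_sketch(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def rmse_sketch(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Toy data (made up for illustration).
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}

print(precision_at_k_sketch(recommended, relevant, k=4))
print(recall_at_k_sketch(recommended, relevant, k=4))
print(ndcg_at_k_sketch(recommended, relevant, k=4))
print(rmse_sketch([4, 3], [3, 5]))
```

In a full evaluation these per-user values would typically be averaged over all users in the test set.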

## Imports

In [1]:
import numpy as np
import pandas as pd
import feather
import random

## Parameters

In [2]:
k = 100
n_splits = 1
col_split = 'item'
n_samples = 100
n_iter = 4
col_eval = 'RMSE'

In [3]:
COL_DICT = {
    'col_user':'userID',
    'col_item':'itemID',
    'col_rating':'rating',
    'col_prediction':'prediction',
}

## Algorithms

In [4]:
from recommender.content_based_recommender import train_content, predict_content
from recommender.item_item_collab_recommender import train_item, predict_item
from recommender.user_user_collab_recommender import train_user, predict_user
from recommender.matrix_fact_collab_recommender import train_mf, predict_mf
from recommender.weighted_hybrid_recommender import train_weighted, predict_weighted
from recommender.mixed_hybrid_recommender import train_mixed, predict_mixed
from recommender.cascade_hybrid_recommender import train_cascade, predict_cascade

In [5]:
# Map each algorithm name to its training function (data, catalog, params).
trainer = {
    "content": train_content,
    "item-item": train_item,
    "user-user": train_user,
    "matrix-factorization": train_mf,
    "weighted": train_weighted,
    "mixed": train_mixed,
    "cascade": train_cascade,
}

In [6]:
# Map each algorithm name to its prediction function (model, test, train).
predictor = {
    "content": predict_content,
    "item-item": predict_item,
    "user-user": predict_user,
    "matrix-factorization": predict_mf,
    "weighted": predict_weighted,
    "mixed": predict_mixed,
    "cascade": predict_cascade,
}

## Helper Functions

In [7]:
def generate_summary(algo, rating_metrics, ranking_metrics, params):
    summary = {"Algo": algo, **params}
    if rating_metrics is None:
        rating_metrics = {
            "RMSE": np.nan,
        }
    if ranking_metrics is None:
        ranking_metrics = {
            "nDCG@k": np.nan,
            "Precision@k": np.nan,
            "Recall@k": np.nan,
            "F1@k": np.nan,
        }
    summary.update(rating_metrics)
    summary.update(ranking_metrics)
    return summary

In [8]:
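def best_params(df, algo):
    # Pick, per algorithm, the run with the largest value of col_eval and return its params.
    # NOTE: col_eval is 'RMSE', an error metric where lower is better, so nlargest
    # returns the highest-error run; nsmallest would return the lowest-error one.
    grps = df.groupby(['Algo']).apply(lambda x: x.nlargest(1, col_eval))
    return grps[grps['Algo'] == algo]['params'].values[0]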
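
Since `col_eval` is set to `'RMSE'` in the parameters cell and RMSE is an error metric, selecting with `nlargest` returns the run with the highest error. The following is a minimal sketch of a direction-aware alternative, not the notebook's own code: the function name `best_params_sketch`, the `LOWER_IS_BETTER` set, and the toy results table are all hypothetical.

```python
import pandas as pd

# Hypothetical set of metrics for which a lower value means a better run.
LOWER_IS_BETTER = {"RMSE"}

def best_params_sketch(df, algo, metric="RMSE"):
    """Return the params of the best run of `algo` according to `metric`."""
    runs = df[df["Algo"] == algo]
    if runs.empty:
        raise ValueError(f"no results recorded for algorithm {algo!r}")
    values = runs[metric].astype(float)
    # Minimise error metrics, maximise everything else (e.g. nDCG@k).
    best_idx = values.idxmin() if metric in LOWER_IS_BETTER else values.idxmax()
    return runs.loc[best_idx, "params"]

# Toy results table (values made up for illustration).
toy_results = pd.DataFrame({
    "Algo": ["mf", "mf", "weighted"],
    "params": [{"alpha": 0.5}, {"alpha": 0.4}, {"wght": 0.4}],
    "RMSE": [3.6, 3.5, 1.8],
    "nDCG@k": [0.31, 0.33, 0.33],
})

print(best_params_sketch(toy_results, "mf"))            # lowest RMSE among "mf" runs
print(best_params_sketch(toy_results, "mf", "nDCG@k"))  # highest nDCG@k among "mf" runs
```

Filtering by algorithm before selecting also avoids the `groupby(...).apply(...)` round trip, and raising on an empty selection makes a missing tuning run explicit.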

## Metrics

In [9]:
from evaluation.metric import rmse, ndcg_at_k, precision_at_k, recall_at_k
from dataset.splitters import stratified_split
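
The `stratified_split` helper imported above lives in the project's `dataset.splitters` module and is not shown in this notebook. As a rough idea of what a stratified split can look like, here is a minimal pandas sketch under stated assumptions: the function name `stratified_split_sketch`, its parameters, and the toy data are hypothetical, and the real helper may also filter by a minimum number of ratings.

```python
import pandas as pd

def stratified_split_sketch(df, by="itemID", ratio=0.8, seed=42):
    """Split df so that roughly `ratio` of each group's rows go to train."""
    # Shuffle once so that the rows taken from each group are random.
    shuffled = df.sample(frac=1.0, random_state=seed)
    # Position of each row within its group, and the size of that group.
    position = shuffled.groupby(by).cumcount()
    group_size = shuffled.groupby(by)[by].transform("count")
    # The first round(ratio * size) rows of every group form the training set.
    in_train = position < (group_size * ratio).round()
    return shuffled[in_train], shuffled[~in_train]

# Toy ratings (made up): item 1 has 10 ratings, item 2 has 5.
ratings = pd.DataFrame({
    "userID": list(range(10)) + list(range(5)),
    "itemID": [1] * 10 + [2] * 5,
    "rating": [3.0] * 15,
})

train, test = stratified_split_sketch(ratings)
print(train.groupby("itemID").size())
print(test.groupby("itemID").size())
```

Splitting within each group keeps every item (or user, if `by="userID"`) represented in both sets, as long as the group has enough rows.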

In [10]:
def rating_metrics(test, predictions):
    return {
        "RMSE": rmse(test, predictions, **COL_DICT),
    }

In [11]:
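def ranking_metrics(test, predictions, k):
    precision = precision_at_k(test, predictions, k=k, **COL_DICT)
    recall = recall_at_k(test, predictions, k=k, **COL_DICT)
    # F1 is the harmonic mean of precision and recall; it is set to 0 when both
    # are 0 to avoid a division by zero.
    denom = precision + recall
    f1 = (2 * precision * recall) / denom if denom > 0 else 0.0
    return {
        "nDCG@k": ndcg_at_k(test, predictions, k=k, **COL_DICT),
        "Precision@k": precision,
        "Recall@k": recall,
        "F1@k": f1,
    }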

## Hyperparameter Tuning

In [12]:
from sklearn.model_selection import ParameterSampler
from scipy.stats.distributions import uniform, randint
from dataset.splitters import stratified_split

In [13]:
mf_params = {
    'alpha' : uniform(),
    'l1_ratio': uniform()
}

weighted_params = {
    'wght' : uniform(loc=0.2, scale=0.85)
}

cascade_params = {
    'threshold': uniform(loc=0,scale=5)
}

mixed_params = {
    'lmt' : randint(5,50) #(1,10)*5
}

hyper_params = {
    "matrix-factorization": mf_params,
    "weighted": weighted_params,
    "cascade": cascade_params,
    "mixed": mixed_params,
}
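
The search spaces above use SciPy frozen distributions, where `uniform(loc, scale)` covers the interval `[loc, loc + scale]` and `randint(low, high)` draws integers from `low` up to `high - 1`. The short sketch below, with a hypothetical `space` dictionary that copies two of the entries, shows how `ParameterSampler` draws from them. Note that `uniform(loc=0.2, scale=0.85)` can return values up to 1.05, which is consistent with the `wght` value above 1 that appears in the tuning output.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical search space mirroring two of the entries defined above.
space = {
    "wght": uniform(loc=0.2, scale=0.85),  # continuous on [0.2, 1.05]
    "lmt": randint(5, 50),                 # integers from 5 to 49
}

# Draw four random parameter sets; random_state makes the draw repeatable.
samples = list(ParameterSampler(space, n_iter=4, random_state=0))

for sample in samples:
    print(sample)
```

If the weight is meant to stay within `[0.2, 1.0]`, a scale of `0.8` would give that range.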

In [14]:
parmeterized_algorithms = ["matrix-factorization", "weighted", "cascade", "mixed"] 

In [15]:
%%time 

# For each  algorithm, a recommender is evaluated. 
cols = ["Algo", "params", "RMSE", "nDCG@k", "Precision@k", "Recall@k", "F1@k"]
df_results = pd.DataFrame(columns=cols)

# Load the dataset
df = pd.read_csv('data/ratings.csv', header=0,
    names=['userID', 'itemID', 'rating', 'timestamp']
)

# Keep only the ratings of a random sample of items to speed up tuning.
# random.choices draws with replacement, so at most n_samples distinct items are kept.
df = df[df['itemID'].isin(random.choices(df['itemID'].values,k=n_samples))]

print(f"Size of Movielens Sample: {df.shape}")

# Split the dataset
train, test  = stratified_split(df, filter_by=col_split, min_rating=1, ratio=0.8)

# keep only the catalog movies that appear in the training set
catalog = pd.read_feather('data/content_movie.ftr')
catalog = catalog[catalog['movieId'].isin(train.itemID)].reset_index()

# Loop through the algos
for algo in parmeterized_algorithms:
    print(f"\nTuning {algo} algorithm on Movielens Sample")
    
    # get model parameters
    model_hyper_params = hyper_params[algo]
    
    param_list = list(ParameterSampler(model_hyper_params, n_iter=n_iter))
    
    for model_param in param_list:
        
        if algo in ['weighted', 'cascade', 'mixed']:
            model_param =  {**model_param, **best_params(df_results, 'matrix-factorization')}
        
        print(f"\nUsing Parameters {model_param}")
        
        # Train the model
        model = trainer[algo](train, catalog, model_param)

        # Predict and evaluate
        preds = predictor[algo](model, test, train)

        # calculate metrics
        ratings = rating_metrics(test, preds)
        rankings = ranking_metrics(test, preds, k) 

        # Record results
        summary = generate_summary(algo, ratings, rankings, params={'params':model_param})

        df_results.loc[df_results.shape[0] + 1] = summary

print()

Size of Movielens Sample: (5288, 4)

Tuning matrix-factorization algorithm on Movielens Sample

Using Parameters {'alpha': 0.5019584286159557, 'l1_ratio': 0.9096845931938514}

Using Parameters {'alpha': 0.4289686294152538, 'l1_ratio': 0.6865587754478902}

Using Parameters {'alpha': 0.4022425164972864, 'l1_ratio': 0.027845334735442484}

Using Parameters {'alpha': 0.3905507270778187, 'l1_ratio': 0.5919920475925088}

Tuning weighted algorithm on Movielens Sample

Using Parameters {'wght': 1.0275380508522372, 'alpha': 0.5019584286159557, 'l1_ratio': 0.9096845931938514}

In [16]:
df_results

| | Algo | params | RMSE | nDCG@k | Precision@k | Recall@k | F1@k |
|---|---|---|---|---|---|---|---|
| 1 | matrix-factorization | {'alpha': 0.5019584286159557, 'l1_ratio': 0.90... | 3.583203 | 0.315251 | 0.025939 | 1.0 | 0.050567 |
| 2 | matrix-factorization | {'alpha': 0.4289686294152538, 'l1_ratio': 0.68... | 3.57831 | 0.329407 | 0.025939 | 1.0 | 0.050567 |
| 3 | matrix-factorization | {'alpha': 0.4022425164972864, 'l1_ratio': 0.02... | 3.573751 | 0.32915 | 0.025939 | 1.0 | 0.050567 |
| 4 | matrix-factorization | {'alpha': 0.3905507270778187, 'l1_ratio': 0.59... | 3.579297 | 0.325973 | 0.025939 | 1.0 | 0.050567 |
| 5 | weighted | {'wght': 1.0275380508522372, 'alpha': 0.501958... | 3.729382 | 0.31348 | 0.027033 | 1.0 | 0.052643 |
| 6 | weighted | {'wght': 0.7651361774452798, 'alpha': 0.501958... | 2.849484 | 0.329704 | 0.027033 | 1.0 | 0.052643 |
| 7 | weighted | {'wght': 0.503619312804398, 'alpha': 0.5019584... | 2.02321 | 0.334513 | 0.027033 | 1.0 | 0.052643 |
| 8 | weighted | {'wght': 0.41036076420498724, 'alpha': 0.50195... | 1.75262 | 0.334867 | 0.027033 | 1.0 | 0.052643 |
| 9 | cascade | {'threshold': 4.670539458865781, 'alpha': 0.50... | 3.655255 | 0.274592 | 0.025939 | 1.0 | 0.050567 |
| 10 | cascade | {'threshold': 2.8106435923542255, 'alpha': 0.5... | 2.096875 | 0.272622 | 0.025939 | 1.0 | 0.050567 |


## Optimal Parameters

In [18]:
params_mf = best_params(df_results, 'matrix-factorization')
params_weighted = {**best_params(df_results, 'weighted'), **best_params(df_results, 'matrix-factorization')}
params_cascade = {**best_params(df_results, 'cascade'), **best_params(df_results, 'matrix-factorization')}
params_mixed = {**best_params(df_results, 'mixed'),**best_params(df_results, 'matrix-factorization')}

params = {
    "content": {},
    "item-item": {},
    "user-user": {},
    "matrix-factorization": params_mf,
    "weighted": params_weighted,
    "cascade": params_cascade,
    "mixed": params_mixed
}

In [19]:
params

{'content': {},
 'item-item': {},
 'user-user': {},
 'matrix-factorization': {'alpha': 0.5019584286159557,
  'l1_ratio': 0.9096845931938514},
 'weighted': {'wght': 1.0275380508522372,
  'alpha': 0.5019584286159557,
  'l1_ratio': 0.9096845931938514},
 'cascade': {'threshold': 4.670539458865781,
  'alpha': 0.5019584286159557,
  'l1_ratio': 0.9096845931938514},
 'mixed': {'lmt': 10,
  'alpha': 0.5019584286159557,
  'l1_ratio': 0.9096845931938514}}

## Main Loop

In [20]:
# "user-user" is left out of this run.
algorithms = ["content", "item-item", "matrix-factorization", "weighted", "mixed", "cascade"]

In [21]:
%%time 

# For each  algorithm, a recommender is evaluated. 
cols = ["Algo","Fold", "K", "RMSE", "nDCG@k", "Precision@k", "Recall@k", "F1@k"]
df_results = pd.DataFrame(columns=cols)

# Load the dataset
df = pd.read_csv('data/ratings.csv', header=0,
    names=['userID', 'itemID', 'rating', 'timestamp']
)

print(f"Size of Movielens: {df.shape}")

for i in range(n_splits):
    
    print(f"Doing Fold {i+1}/{n_splits}")
    
    # Split the dataset
    train, test  = stratified_split(df, filter_by=col_split, min_rating=1, ratio=0.8)

    # keep only the catalog movies that appear in the training set
    catalog = pd.read_feather('data/content_movie.ftr')
    catalog = catalog[catalog['movieId'].isin(train.itemID)].reset_index()

    # Loop through the algos
    for algo in algorithms:
        print(f"\nComputing {algo} algorithm on Movielens")

        # get the parameters
        model_params  = params[algo]

        # Train the model
        model = trainer[algo](train, catalog, model_params)

        # Predict and evaluate
        preds = predictor[algo](model, test, train)

        # calculate metrics
        ratings = rating_metrics(test, preds)
        rankings = ranking_metrics(test, preds, k) 
        
        # Record results
        summary = generate_summary(algo, ratings, rankings, params={'Fold':i+1, 'K':k})

        df_results.loc[df_results.shape[0] + 1] = summary

    print()

Size of Movielens: (100836, 4)
Doing Fold 1/1

Computing content algorithm on Movielens

## Results

In [22]:
df_results

| | Algo | Fold | K | RMSE | nDCG@k | Precision@k | Recall@k | F1@k |
|---|---|---|---|---|---|---|---|---|
| 1 | content | 1 | 100 | 0.911782 | 0.050146 | 0.004323 | 0.01146 | 0.006278 |
| 2 | item-item | 1 | 100 | 0.916321 | 0.000995 | 6.6e-05 | 2.9e-05 | 4e-05 |
| 3 | matrix-factorization | 1 | 100 | 3.171938 | 0.519016 | 0.093251 | 0.464531 | 0.155322 |
| 4 | weighted | 1 | 100 | 3.248229 | 0.516927 | 0.09302 | 0.456799 | 0.154565 |
| 5 | mixed | 1 | 100 | 3.168481 | 0.364352 | 0.090759 | 0.44643 | 0.15085 |
| 6 | cascade | 1 | 100 | 3.631921 | 0.054026 | 0.004389 | 0.010546 | 0.006199 |
