<a href="https://colab.research.google.com/github/nathann3/better_than_netflix_movie_recommender/blob/dev/notebooks/comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Collaborative Filtering Comparison

In this notebook we compare several recommendation systems, starting with the state-of-the-art LightGCN and going back to SVD++, an algorithm popularized by the Netflix Prize competition that concluded in 2009.

The models, in order, are LightGCN, NGCF, SVAE, SVD++, and SVD. Each model has its own notebook where we go into more depth, especially LightGCN and NGCF, which we implemented from scratch in TensorFlow.

The last cell compares the performance of the different models using ranking metrics:


*   Precision@k
*   Recall@k
*   Mean Average Precision (MAP)
*   Normalized Discounted Cumulative Gain (NDCG)

where $k=10$
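
As a rough illustration of what three of these metrics measure, here is a minimal sketch for a single user with binary relevance. The function name `precision_recall_ndcg_at_k` and the toy data are hypothetical, and the definitions used by `src.models.metrics` may differ in detail (for example, in how the denominators are chosen). MAP is omitted to keep the sketch short.

```python
import math


def precision_recall_ndcg_at_k(ranked_items, relevant_items, k=10):
    """Return (precision@k, recall@k, ndcg@k) for one user with binary relevance."""
    top_k = ranked_items[:k]
    hits = [1 if item in relevant_items else 0 for item in top_k]
    n_hits = sum(hits)

    # Fraction of the k recommendations that are relevant.
    precision = n_hits / k
    # Fraction of the relevant items that were recommended.
    recall = n_hits / len(relevant_items) if relevant_items else 0.0

    # DCG discounts a hit at 1-based rank r by log2(r + 1).
    dcg = sum(hit / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1))
    # Ideal DCG: every relevant item (up to k) is ranked first.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg


ranked = ['A', 'B', 'C', 'D']  # model's ranking, best first
relevant = {'A', 'C', 'E'}     # items the user interacted with in the test set
p, r, n = precision_recall_ndcg_at_k(ranked, relevant, k=4)
print(p, r, n)
```

Two of the four recommendations are relevant, so precision is 2/4, while recall is 2/3 because one relevant item was never recommended.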



# Imports

In [1]:
import math
import numpy as np
import os
import pandas as pd
import random
import requests
import scipy.sparse as sp
import surprise
import tensorflow as tf


from src.data import make_dataset
from src.features import build_features
from src.models import SVAE, metrics
from src.models.GCN import LightGCN, NGCF
from sklearn.model_selection import train_test_split
from tensorflow.python.framework.ops import disable_eager_execution
from tqdm import tqdm

# Prepare data

In [2]:
fp = os.path.join('..', 'data', 'ml-100k.data')
make_dataset.download_movie(fp)

raw_data = pd.read_csv(fp, sep='\t', names=['userId', 'movieId', 'rating', 'timestamp'])
print(f'Shape: {raw_data.shape}')
raw_data.sample(5, random_state=123)

Shape: (100000, 4)


Unnamed: 0,userId,movieId,rating,timestamp
42083,600,651,4,888451492
71825,607,494,5,883879556
99535,875,1103,5,876465144
47879,648,238,3,882213535
36734,113,273,4,875935609


In [3]:
# Download movie titles.
url = 'https://files.grouplens.org/datasets/movielens/ml-100k/u.item'
fp = os.path.join('..', 'data', 'ml-100k.item')
r = requests.get(url, stream=True)
block_size = 1024
total_size = int(r.headers.get('content-length', 0))
num_iterables = math.ceil(total_size / block_size)

# Download if not already downloaded.
if not os.path.exists(fp):
    with open(fp, 'wb') as file:
        for data in tqdm(
            r.iter_content(block_size), total=num_iterables, unit='KB', unit_scale=True
        ):
            file.write(data)

movie_titles = pd.read_csv(fp, sep='|', names=['movieId', 'title'], usecols = range(2), encoding='iso-8859-1')
print(f'Shape: {movie_titles.shape}')
movie_titles.sample(10, random_state=123)

Shape: (1682, 2)


Unnamed: 0,movieId,title
304,305,"Ice Storm, The (1997)"
450,451,Grease (1978)
691,692,"American President, The (1995)"
1408,1409,"Swan Princess, The (1994)"
1075,1076,"Pagemaster, The (1994)"
103,104,Theodore Rex (1995)
167,168,Monty Python and the Holy Grail (1974)
1460,1461,Here Comes Cookie (1935)
1189,1190,That Old Feeling (1997)
1438,1439,Jason's Lyric (1994)


In [4]:
train_size = 0.75
train, test = make_dataset.stratified_split(raw_data, 'userId', train_size)

print(f'Train Shape: {train.shape}')
print(f'Test Shape: {test.shape}')
print(f'Do they have the same users?: {set(train.userId) == set(test.userId)}')

Train Shape: (74992, 4)
Test Shape: (25008, 4)
Do they have the same users?: True
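
The split above is stratified by user, which is why both sets contain the same users. The implementation of `make_dataset.stratified_split` is not shown in this notebook, so the following is only a sketch of the general idea, using a hypothetical helper `stratified_split_by_user` on plain tuples. It assumes every user has enough ratings to appear on both sides (in MovieLens 100k each user has at least 20).

```python
import random
from collections import defaultdict


def stratified_split_by_user(rows, train_size=0.75, seed=123):
    """Split (user, item, rating) rows so each user keeps about train_size of their rows in train."""
    rng = random.Random(seed)

    # Group the rows by user id.
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train, test = [], []
    for user_rows in by_user.values():
        shuffled = user_rows[:]
        rng.shuffle(shuffled)
        # Split each user's rows separately instead of splitting the whole dataset at once.
        cut = round(len(shuffled) * train_size)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Toy data: 3 users with 4 ratings each.
ratings = [(user, item, 5) for user in (1, 2, 3) for item in range(4)]
train_rows, test_rows = stratified_split_by_user(ratings)
print(len(train_rows), len(test_rows))
```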


In [5]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
combined = pd.concat([train, test])

n_users = combined['userId'].nunique()
print('Number of users:', n_users)

n_movies = combined['movieId'].nunique()
print('Number of movies:', n_movies)

Number of users: 943
Number of movies: 1682


In [6]:
# Create DataFrame with reset index of 0-n_movies.
movie_new = combined[['movieId']].drop_duplicates()
movie_new['movieId_new'] = np.arange(len(movie_new))

train_reindex = pd.merge(train, movie_new, on='movieId', how='left')
# Reset index to 0-n_users.
train_reindex['userId_new'] = train_reindex['userId'] - 1  
train_reindex = train_reindex[['userId_new', 'movieId_new', 'rating']]

test_reindex = pd.merge(test, movie_new, on='movieId', how='left')
# Reset index to 0-n_users.
test_reindex['userId_new'] = test_reindex['userId'] - 1
test_reindex = test_reindex[['userId_new', 'movieId_new', 'rating']]

# Create dictionaries so we can convert to and from indexes
item2id = dict(zip(movie_new['movieId'], movie_new['movieId_new']))
id2item = dict(zip(movie_new['movieId_new'], movie_new['movieId']))
user2id = dict(zip(train['userId'], train_reindex['userId_new']))
id2user = dict(zip(train_reindex['userId_new'], train['userId']))

In [7]:
# Create user-item graph (sparse matrix where users are rows and movies are columns:
# 1 if a user reviewed that movie, 0 if they didn't).
R = sp.dok_matrix((n_users, n_movies), dtype=np.float32)
R[train_reindex['userId_new'], train_reindex['movieId_new']] = 1

# Create the adjacency matrix for the user-item graph.
adj_mat = sp.dok_matrix((n_users + n_movies, n_users + n_movies), dtype=np.float32)

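# Convert R to LIL (list of lists) format for the block assignment below.
# tolil() returns a new matrix, so adj_mat itself stays in DOK format.
R = R.tolil()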

# Put together adjacency matrix. Movies and users are nodes/vertices.
# 1 if the movie and user are connected.
adj_mat[:n_users, n_users:] = R
adj_mat[n_users:, :n_users] = R.T

adj_mat

<2625x2625 sparse matrix of type '<class 'numpy.float32'>'
	with 149984 stored elements in Dictionary Of Keys format>

In [8]:
# Calculate degree matrix D (for every row count the number of nonzero entries)
D_values = np.array(adj_mat.sum(1))

# Square root and inverse.
D_inv_values = np.power(D_values  + 1e-9, -0.5).flatten()
D_inv_values[np.isinf(D_inv_values)] = 0.0

# Create a sparse diagonal matrix whose diagonal entries are D^(-0.5).
D_inv_sq_root = sp.diags(D_inv_values)

# Eval (D^-0.5 * A * D^-0.5).
norm_adj_mat = D_inv_sq_root.dot(adj_mat).dot(D_inv_sq_root)
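
To make the normalization concrete, here is a small worked example in plain Python on a toy graph with 2 users and 3 movies. All names (`toy_R`, `toy_A`, `toy_norm`) are hypothetical and the lists stand in for the sparse matrices used above. It relies on the fact that entry (i, j) of D^-0.5 * A * D^-0.5 equals A[i][j] / sqrt(d_i * d_j).

```python
import math

# Toy interaction matrix: 2 users (rows) x 3 movies (columns).
toy_R = [
    [1, 1, 0],
    [0, 1, 1],
]
toy_users, toy_movies = len(toy_R), len(toy_R[0])
n_nodes = toy_users + toy_movies

# Bipartite adjacency matrix A = [[0, R], [R^T, 0]]; users come first, then movies.
toy_A = [[0.0] * n_nodes for _ in range(n_nodes)]
for u in range(toy_users):
    for m in range(toy_movies):
        if toy_R[u][m]:
            toy_A[u][toy_users + m] = 1.0
            toy_A[toy_users + m][u] = 1.0

# Degree of each node = number of edges touching it.
degree = [sum(row) for row in toy_A]

# Symmetric normalization: each edge weight is divided by sqrt(d_i * d_j).
toy_norm = [
    [toy_A[i][j] / math.sqrt(degree[i] * degree[j]) if toy_A[i][j] else 0.0
     for j in range(n_nodes)]
    for i in range(n_nodes)
]

print(degree)          # user degrees first, then movie degrees
print(toy_norm[0][2])  # edge between user 0 and movie 0
print(toy_norm[0][3])  # edge between user 0 and movie 1
```

The edge to the movie rated by both users receives a smaller weight than the edge to the movie rated by only one, which is how the normalization damps the influence of popular nodes.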

In [9]:
# Convert to COOrdinate format first ((row, column), data).
coo = norm_adj_mat.tocoo().astype(np.float32)

# Build the (row, column) index pairs that tell SparseTensor where the non-zero entries are.
# np.mat is deprecated, so the pairs are stacked directly and cast to int64.
indices = np.column_stack((coo.row, coo.col)).astype(np.int64)

# Convert to a sparse tensor.
A_tilde = tf.SparseTensor(indices, coo.data, coo.shape)
A_tilde

<tensorflow.python.framework.sparse_tensor.SparseTensor at 0x7fee630dba90>

# Train models

## Graph Convolutional Networks (GCNs)

### Light Graph Convolution Network (LightGCN)

In [10]:
light_model = LightGCN(A_tilde,
                       n_users=n_users,
                       n_items=n_movies,
                       n_layers=3)

In [11]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
light_model.fit(epochs=25, batch_size=1024, optimizer=optimizer)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


### Neural Graph Collaborative Filtering (NGCF)

In [12]:
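ngcf_model = NGCF(A_tilde,
                  n_users=n_users,
                  n_items=n_movies,
                  n_layers=3)

# Use a fresh optimizer so that Adam's state from training LightGCN does not carry over.
ngcf_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
ngcf_model.fit(epochs=25, batch_size=1024, optimizer=ngcf_optimizer)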

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


### Recommend with LightGCN and NGCF

In [13]:
# Convert test user ids to the new ids
users = np.array([user2id[x] for x in test['userId'].unique()])

recs = []
for model in [light_model, ngcf_model]:
    recommendations = model.recommend(users, k=10)
    recommendations = recommendations.replace({'userId': id2user, 'movieId': id2item})
    recommendations = recommendations.merge(
        movie_titles, how='left', on='movieId'
    )[['userId', 'movieId', 'title', 'prediction']]

    # Create a column with each predicted movie's rank for its user.
    top_k = recommendations.copy()
    top_k['rank'] = recommendations.groupby('userId', sort=False).cumcount() + 1

    recs.append(top_k)

## Standard Variational Autoencoder (SVAE)

In [14]:
# Binarize the data (only keep ratings >= 4)
df_preferred = raw_data[raw_data['rating'] > 3.5]
df_low_rating = raw_data[raw_data['rating'] <= 3.5]

df = df_preferred.groupby('userId').filter(lambda x: len(x) >= 5)
df = df.groupby('movieId').filter(lambda x: len(x) >= 1)

# Obtain both usercount and itemcount after filtering
usercount = df[['userId']].groupby('userId', as_index = False).size()
itemcount = df[['movieId']].groupby('movieId', as_index = False).size()

unique_users = sorted(df.userId.unique())
np.random.seed(123)
unique_users = np.random.permutation(unique_users)

HELDOUT_USERS = 200

# Create train/validation/test users.
# A separate name keeps the n_users used by the graph models from being overwritten.
n_svae_users = len(unique_users)
train_users = unique_users[:(n_svae_users - HELDOUT_USERS * 2)]
val_users = unique_users[(n_svae_users - HELDOUT_USERS * 2):(n_svae_users - HELDOUT_USERS)]
test_users = unique_users[(n_svae_users - HELDOUT_USERS):]

train_set = df.loc[df['userId'].isin(train_users)]
val_set = df.loc[df['userId'].isin(val_users)]
test_set = df.loc[df['userId'].isin(test_users)]
unique_train_items = pd.unique(train_set['movieId'])
val_set = val_set.loc[val_set['movieId'].isin(unique_train_items)]
test_set = test_set.loc[test_set['movieId'].isin(unique_train_items)]

# Instantiate the sparse matrix generation for train, validation and test sets
# use list of unique items from training set for all sets
am_train = build_features.AffinityMatrix(df=train_set, items_list=unique_train_items)
am_val = build_features.AffinityMatrix(df=val_set, items_list=unique_train_items)
am_test = build_features.AffinityMatrix(df=test_set, items_list=unique_train_items)

# Obtain the sparse matrix for train, validation and test sets
train_data, _, _ = am_train.gen_affinity_matrix()
val_data, val_map_users, val_map_items = am_val.gen_affinity_matrix()
test_data, test_map_users, test_map_items = am_test.gen_affinity_matrix()

# Split validation and test data into training and testing parts
val_data_tr, val_data_te = make_dataset.numpy_stratified_split(val_data, ratio=0.75, seed=123)
test_data_tr, test_data_te = make_dataset.numpy_stratified_split(test_data, ratio=0.75, seed=123)

# Binarize train, validation and test data
train_data = np.where(train_data > 3.5, 1.0, 0.0)
val_data = np.where(val_data > 3.5, 1.0, 0.0)
test_data = np.where(test_data > 3.5, 1.0, 0.0)

# Binarize validation data
val_data_tr = np.where(val_data_tr > 3.5, 1.0, 0.0)
val_data_te_ratings = val_data_te.copy()
val_data_te = np.where(val_data_te > 3.5, 1.0, 0.0)

# Binarize test data: training part 
test_data_tr = np.where(test_data_tr > 3.5, 1.0, 0.0)

# Binarize test data: testing part (save non-binary version in the separate object, will be used for calculating NDCG)
test_data_te_ratings = test_data_te.copy()
test_data_te = np.where(test_data_te > 3.5, 1.0, 0.0)

# Retrieve the real ratings from the initial dataset.
test_data_te_ratings = pd.DataFrame(test_data_te_ratings)
val_data_te_ratings = pd.DataFrame(val_data_te_ratings)

for _, row in df_low_rating.iterrows():
    user_old = row['userId']   # original user id
    item_old = row['movieId']  # original movie id
    rating = row['rating']

    if (test_map_users.get(user_old) is not None) and (test_map_items.get(item_old) is not None):
        user_new = test_map_users.get(user_old)  # row index in the test matrix
        item_new = test_map_items.get(item_old)  # column index in the test matrix
        test_data_te_ratings.at[user_new, item_new] = rating

    if (val_map_users.get(user_old) is not None) and (val_map_items.get(item_old) is not None):
        user_new = val_map_users.get(user_old)  # row index in the validation matrix
        item_new = val_map_items.get(item_old)  # column index in the validation matrix
        val_data_te_ratings.at[user_new, item_new] = rating

val_data_te_ratings = val_data_te_ratings.to_numpy()
test_data_te_ratings = test_data_te_ratings.to_numpy()

In [15]:
disable_eager_execution()
svae_model = SVAE.StandardVAE(
    n_users=train_data.shape[0],
    original_dim=train_data.shape[1],
    intermediate_dim=200,
    latent_dim=64,
    n_epochs=400,
    batch_size=100,
    k=10,
    verbose=0,
    seed=123,
    drop_encoder=0.5,
    drop_decoder=0.5,
    annealing=False,
    beta=1.0,
)

svae_model.fit(
    x_train=train_data,
    x_valid=val_data,
    x_val_tr=val_data_tr,
    x_val_te=val_data_te_ratings,
    mapper=am_val,
)



### Recommend with SVAE

In [16]:
# Model prediction on the training part of the test set.
top_k = svae_model.recommend_k_items(x=test_data_tr, k=10, remove_seen=True)

# Convert the sparse matrices back to DataFrames.
recommendations = am_test.map_back_sparse(top_k, kind='prediction')
# Use test_data_te_ratings, which holds the original (non-binarized) ratings.
test_df = am_test.map_back_sparse(test_data_te_ratings, kind='ratings')

# Create a column with each predicted movie's rank for its user.
top_k = recommendations.copy()
top_k['rank'] = recommendations.groupby('userId', sort=False).cumcount() + 1

recs.append(top_k)

## Singular Value Decomposition (SVD)

### SVD++

In [17]:
surprise_train = surprise.Dataset.load_from_df(train.drop('timestamp', axis=1), reader=surprise.Reader('ml-100k')).build_full_trainset()
svdpp = surprise.SVDpp(random_state=0, n_factors=64, n_epochs=10, verbose=True)
svdpp.fit(surprise_train)

 processing epoch 0
 processing epoch 1
 processing epoch 2
 processing epoch 3
 processing epoch 4
 processing epoch 5
 processing epoch 6
 processing epoch 7
 processing epoch 8
 processing epoch 9


<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x7fee58b07c90>

### SVD

In [18]:
svd = surprise.SVD(random_state=0, n_factors=64, n_epochs=10, verbose=True)
svd.fit(surprise_train)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fee5e47a110>

### Recommend with SVD++ and SVD

In [19]:
for model in [svdpp, svd]:
    predictions = []
    users = train['userId'].unique()
    items = train['movieId'].unique()

    # Predict a rating for every (user, movie) pair seen in training.
    for user in users:
        for item in items:
            predictions.append([user, item, model.predict(user, item).est])

    predictions = pd.DataFrame(predictions, columns=['userId', 'movieId', 'prediction'])

    # Remove movies already seen by users
    # Create column of all 1s
    temp = train[['userId', 'movieId']].copy()
    temp['seen'] = 1

    # Outer join and remove movies that have already been seen (seen=1)
    merged = pd.merge(temp, predictions, on=['userId', 'movieId'], how="outer")
    merged = merged[merged['seen'].isnull()].drop('seen', axis=1)

    # Create filter for users that appear in both the train and test set
    common_users = set(test['userId']).intersection(set(predictions['userId']))

    # Filter the test and predictions so they have the same users between them
    test_common = test[test['userId'].isin(common_users)]
    svd_pred_common = merged[merged['userId'].isin(common_users)]

    if len(set(merged['userId'])) != len(set(test['userId'])):
        print('Number of users in train and test are NOT equal')
        print(f"# of users in train and test respectively: {len(set(merged['userId']))}, {len(set(test['userId']))}")
        print(f"# of users in BOTH train and test: {len(set(svd_pred_common['userId']))}")
        continue
        
    # From the predictions, keep only the top k recommendations for each user.
    top_k = (
        svd_pred_common.groupby('userId', as_index=False)
        .apply(lambda x: x.nlargest(10, 'prediction'))
        .reset_index(drop=True)
    )
    # Create a column with each predicted movie's rank for its user.
    top_k['rank'] = top_k.groupby('userId', sort=False).cumcount() + 1

    recs.append(top_k)
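
The cell above scores every user-movie pair, drops the pairs already seen in training, and keeps the k highest-scoring movies per user. The same idea can be sketched without pandas. The helper `top_k_unseen` and the toy scores below are hypothetical and only illustrate the filtering and ranking steps, not the Surprise API.

```python
from collections import defaultdict


def top_k_unseen(predictions, seen, k=10):
    """predictions: iterable of (user, item, score); seen: set of (user, item) pairs."""
    # Keep only the items each user has not interacted with in training.
    per_user = defaultdict(list)
    for user, item, score in predictions:
        if (user, item) not in seen:
            per_user[user].append((item, score))

    # Sort each user's candidates by score and attach a 1-based rank.
    ranked = {}
    for user, scored in per_user.items():
        scored.sort(key=lambda pair: pair[1], reverse=True)
        ranked[user] = [
            (item, score, rank)
            for rank, (item, score) in enumerate(scored[:k], start=1)
        ]
    return ranked


toy_predictions = [(1, 'a', 4.5), (1, 'b', 3.0), (1, 'c', 4.9), (2, 'a', 2.0), (2, 'b', 4.1)]
toy_seen = {(1, 'c'), (2, 'a')}
top = top_k_unseen(toy_predictions, toy_seen, k=2)
print(top)
```

User 1's highest score belongs to a movie already seen in training, so it is excluded before ranking.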

# Compare performance

Looking at all 5 of our models, we can see that the state-of-the-art model LightGCN outperforms every other model on all four metrics, with the largest margin over the SVD-based models. Compared to SVD++, a widely used algorithm during the Netflix Prize competition, LightGCN achieves an absolute increase of 0.29 in Precision@k, 0.18 in Recall@k, 0.12 in MAP, and 0.35 in NDCG (for example, Precision@k rises from 0.108 to 0.403, roughly 3.7 times higher).
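
The gap between LightGCN and SVD++ can be recomputed from the results table in the final cell. The sketch below copies those values by hand and reports both the absolute difference (in points, i.e. the difference multiplied by 100) and the ratio, since the two readings of "increase" give very different numbers.

```python
# Metric values copied from the comparison table at the end of this notebook.
lightgcn = {'Precision@k': 0.403181, 'Recall@k': 0.214257, 'MAP': 0.139248, 'NDCG': 0.460298}
svdpp = {'Precision@k': 0.108271, 'Recall@k': 0.038600, 'MAP': 0.015655, 'NDCG': 0.114023}

gains = {}
for metric in lightgcn:
    absolute = lightgcn[metric] - svdpp[metric]  # difference in metric value
    ratio = lightgcn[metric] / svdpp[metric]     # multiplicative improvement
    gains[metric] = (absolute, ratio)
    print(f'{metric}: +{absolute * 100:.1f} points, {ratio:.1f}x')
```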

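NGCF is the older sister model to LightGCN, but only by a single year. LightGCN improves on NGCF across all ranking metrics simply by removing operations that turn out to be unnecessary for collaborative filtering, namely the feature transformation and nonlinear activation in each layer.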
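
For reference, the propagation rules from the original NGCF and LightGCN papers make the removed operations explicit. Let $\tilde{A} = D^{-1/2} A D^{-1/2}$ be the normalized adjacency matrix, $E^{(k)}$ the embeddings at layer $k$, and $W_1^{(k)}, W_2^{(k)}$ trainable weight matrices. NGCF computes

$$E^{(k+1)} = \mathrm{LeakyReLU}\left( (\tilde{A} + I)\, E^{(k)} W_1^{(k+1)} + \left(\tilde{A} E^{(k)} \odot E^{(k)}\right) W_2^{(k+1)} \right)$$

whereas LightGCN keeps only the neighborhood aggregation and averages the layers:

$$E^{(k+1)} = \tilde{A}\, E^{(k)}, \qquad E = \frac{1}{K+1} \sum_{k=0}^{K} E^{(k)}$$

These are the published formulations. The implementations in the individual notebooks may differ in detail, for example in how the layers are combined.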

In conclusion, this demonstrates how far recommendation systems have advanced since 2009, and how new model architectures with notable performance increases can be developed in the span of just 1-2 years.

In [20]:
model_names = ['LightGCN', 'NGCF', 'SVAE', 'SVD++', 'SVD']
comparison = pd.DataFrame(columns=['Algorithm', 'Precision@k', 'Recall@k', 'MAP', 'NDCG'])

# Convert test user ids to the new ids
users = np.array([user2id[x] for x in test['userId'].unique()])

for rec, name in zip(recs, model_names):
    tester = test_df if name == 'SVAE' else test

    pak = metrics.precision_at_k(rec, tester, 'userId', 'movieId', 'rank')
    rak = metrics.recall_at_k(rec, tester, 'userId', 'movieId', 'rank')
    # Named map_score to avoid shadowing the built-in map().
    map_score = metrics.mean_average_precision(rec, tester, 'userId', 'movieId', 'rank')
    ndcg = metrics.ndcg(rec, tester, 'userId', 'movieId', 'rank')

    comparison.loc[len(comparison)] = [name, pak, rak, map_score, ndcg]

In [21]:
comparison

Unnamed: 0,Algorithm,Precision@k,Recall@k,MAP,NDCG
0,LightGCN,0.403181,0.214257,0.139248,0.460298
1,NGCF,0.357264,0.194407,0.117852,0.4059
2,SVAE,0.356,0.092862,0.048495,0.354768
3,SVD++,0.108271,0.0386,0.015655,0.114023
4,SVD,0.093531,0.033,0.011672,0.092656
