# Measuring the quality of recommender systems

In this notebook, I explore how the quality of recommender systems can be measured.

Recommender systems rely on predicting how users would rate items.

Based on these predictions, the recommender system recommends to each user the items with the highest predicted ratings.

So, measuring the quality of a recommender system comes down to comparing the predicted rating of an item with the rating the user actually gave. If the recommender predicts a 5-star rating when the true rating is 1 star, the prediction is off by 4 stars, which signals that the recommender's predictions are poor.

Some further principles of the general methodology:
- Ratings are predicted for a set of (user, item) tuples, and an aggregate metric such as __root mean square error__ (RMSE) reports the overall quality of the predictions.
- Rating predictions should be evaluated on an _unseen_ set of (user, item) observations. Otherwise the task is trivial: the recommender could simply repeat an observation it has already seen. Hence, the dataset is split into train and test sets, with the __train__ set shown to the recommender system and the __test__ set used to report its quality.
- Splitting the dataset into train and test sets can introduce unwanted __bias__, so that the metrics do not reflect the real quality of the recommender system. For example, if the train set only contained ratings of 1 and 2 stars, the recommender system might never predict a 5-star rating. The dataset contains many more 4-star ratings than 1-star ratings, so a purely random sample may under- or overrepresent certain star ratings. Hence, the train and test sets are created so that both preserve the distribution of the ratings. In other words, a __stratified__ split is applied.
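
To make the RMSE metric concrete, here is a minimal sketch that uses only the standard library. The function name `rmse` and the toy ratings are assumptions made for illustration; the notebook itself computes the metric with `sklearn.metrics.mean_squared_error` further down.

```python
from math import sqrt


def rmse(true_ratings, predicted_ratings):
    """Root mean square error between two equal-length sequences of ratings."""
    if len(true_ratings) != len(predicted_ratings):
        raise ValueError("both sequences must have the same length")
    if len(true_ratings) == 0:
        raise ValueError("at least one rating is required")
    squared_errors = [
        (true - pred) ** 2 for true, pred in zip(true_ratings, predicted_ratings)
    ]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Toy ratings, made up for illustration.
# Three predictions that are each at most half a star off:
small_errors = rmse([4, 3, 5], [3.5, 3.0, 4.5])
# The same, except the last prediction says 5 stars where the user gave 1:
one_big_miss = rmse([4, 3, 1], [3.5, 3.0, 5.0])
```

Because the errors are squared before averaging, a single 4-star miss raises the score far more than several half-star misses, so `one_big_miss` comes out much larger than `small_errors`.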

Some principles specific to measuring the quality of recommender systems based on __user-based collaborative filtering__:
- __the test set cannot contain users that were not part of the train set__. This is because the algorithm searches the ratings dataset for the most similar users to the current user. If the current user is not part of the ratings dataset, finding the most similar users becomes impossible.
- __the test set cannot contain items that were not part of the train set__. If the train set contains no ratings of the item by other users, it becomes impossible to predict a rating for that item, because the prediction relies on other users' ratings of that specific item.
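
The two constraints above can be illustrated with a small sketch of user-based collaborative filtering. This is not the `predict` function imported below, whose implementation is not shown in this notebook. It is a simplified stand-in that assumes cosine similarity over co-rated items and a similarity-weighted average; the names `cosine_similarity`, `sketch_predict` and the toy data are hypothetical.

```python
from math import sqrt


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity over the items both users rated; 0.0 if there are none."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def sketch_predict(user, item, ratings_by_user):
    """Similarity-weighted average of other users' ratings of `item`.

    Returns None when no prediction is possible.
    """
    if user not in ratings_by_user:
        return None  # unseen user: there is nothing to compare other users with
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings_by_user.items():
        if other == user or item not in other_ratings:
            continue  # only users who rated the item can contribute
        sim = cosine_similarity(ratings_by_user[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    if weight_total == 0.0:
        return None  # unseen item, or no similar user has rated it
    return weighted_sum / weight_total


# Toy train set, made up for illustration: {user: {item: rating}}
toy_train = {
    "ann": {"m1": 5, "m2": 3},
    "bob": {"m1": 4, "m2": 2, "m3": 4},
    "cat": {"m1": 1, "m2": 5, "m3": 2},
}

seen_pair = sketch_predict("ann", "m3", toy_train)    # a rating between 2 and 4
unseen_user = sketch_predict("dan", "m3", toy_train)  # None
unseen_item = sketch_predict("ann", "m9", toy_train)  # None
```

In this sketch both cold-start cases return `None`, which is why such rows are removed from the test set before the RMSE is computed.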

In [1]:
%load_ext autoreload
%autoreload 2

# data manipulation
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

import numpy as np

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

from user_based_collaborative_filtering import predict
from utils import read_ratings

In [2]:
# paths to the files. download them from http://files.grouplens.org/datasets/movielens/ml-100k.zip if you don't have them yet
MOVIE_RATINGS_PATH = 'ml-100k/u.data'
MOVIE_NAMES_PATH = 'ml-100k/u.item'

# fraction of the ratings held out for testing (1% of the 100,000 ratings = 1,000 rows)
TEST_SIZE = 0.01

In [3]:
ratings = read_ratings(MOVIE_RATINGS_PATH)
ratings = pd.DataFrame(data=ratings, columns=['user', 'movie', 'rating'])
ratings = ratings.astype(int)
ratings.head()

Unnamed: 0,user,movie,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


# Train-test split

In [4]:
# stratified to keep the distribution of movie ratings the same in train and test datasets.
train, test = train_test_split(
    ratings, 
    test_size=TEST_SIZE, 
    stratify=ratings.rating, 
    random_state=RANDOM_SEED
)

In [5]:
train.head()

Unnamed: 0,user,movie,rating
91378,882,151,5
89176,868,566,1
37540,194,91,3
2097,291,471,4
32223,334,221,5


In [6]:
train.rating.value_counts(normalize=True)

4    0.341737
3    0.271455
5    0.212010
2    0.113697
1    0.061101
Name: rating, dtype: float64

In [7]:
test.rating.value_counts(normalize=True)

4    0.342
3    0.271
5    0.212
2    0.114
1    0.061
Name: rating, dtype: float64
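
The two outputs above agree to about three decimal places, which is what the stratified split is meant to achieve. A sketch of how that check could be reduced to a single number follows; the helper names `rating_shares` and `max_share_gap` and the toy lists are assumptions for illustration, not part of the notebook.

```python
from collections import Counter


def rating_shares(ratings):
    """Return {rating: share of all ratings} for an iterable of ratings."""
    counts = Counter(ratings)
    total = sum(counts.values())
    return {star: n / total for star, n in counts.items()}


def max_share_gap(train_ratings, test_ratings):
    """Largest absolute difference in share over every rating value seen."""
    train_shares = rating_shares(train_ratings)
    test_shares = rating_shares(test_ratings)
    stars = set(train_shares) | set(test_shares)
    return max(
        abs(train_shares.get(s, 0.0) - test_shares.get(s, 0.0)) for s in stars
    )


# Toy ratings, made up for illustration (not taken from MovieLens)
toy_train = [4, 4, 3, 5, 4, 3, 2, 1, 5, 3]
toy_test = [4, 3, 5, 4, 2]
gap = max_share_gap(toy_train, toy_test)  # about 0.1 for this toy data
```

With the notebook's data, `max_share_gap(train.rating, test.rating)` should work the same way, since a pandas Series can be iterated like a list. A gap close to zero indicates that the rating distribution was preserved.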

In [8]:
test.shape

(1000, 3)

In [9]:
# remove users from test set that are not part of the train set
test = test.loc[test.user.isin(train.user)]

In [10]:
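# remove items from test set that are not part of train set
# .copy() makes test an independent DataFrame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning
test = test.loc[test.movie.isin(train.movie)].copy()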

In [11]:
# test = test.sample(100)
test.head()

Unnamed: 0,user,movie,rating
8548,276,282,4
6687,180,421,5
93118,308,121,3
31373,428,243,4
79953,487,301,4


# Predicting the rating for a sample (user, item)

In [12]:
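# pick one (user, movie) pair at random
# note: it is drawn from the full dataset, so it may also be part of the train set
sample = ratings.sample(random_state=RANDOM_SEED)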
user_id = sample.user.values[0]
item_id = sample.movie.values[0]

In [13]:
user_id, item_id

(877, 381)

In [14]:
predict(user_id, item_id, train)

4.0

# Rating predictions on the test set

In [15]:
%%time
test['pred'] = test.apply(lambda row: predict(row.user, row.movie, train), axis=1)

Wall time: 2min 57s


In [16]:
test.head()

Unnamed: 0,user,movie,rating,pred
8548,276,282,4,3.4
6687,180,421,5,4.3
93118,308,121,3,2.9
31373,428,243,4,1.5
79953,487,301,4,3.5


In [17]:
test.shape

(999, 4)

# Quality metrics

In [18]:
rms = sqrt(mean_squared_error(test.rating, test.pred))
rms

1.048973330780819

In [19]:
stars = train.rating.value_counts(normalize=True)

In [20]:
# baseline: draw random ratings from the train set's rating distribution,
# then compute their RMSE as a basis for comparison
test['random'] = np.random.choice(stars.index, p=stars.values, size=test.shape[0])
test.head()

Unnamed: 0,user,movie,rating,pred,random
8548,276,282,4,3.4,3
6687,180,421,5,4.3,1
93118,308,121,3,2.9,5
31373,428,243,4,1.5,3
79953,487,301,4,3.5,4


In [21]:
random_rms = sqrt(mean_squared_error(test.rating, test.random))
random_rms

1.540250749836653
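
The two RMSE values are easier to compare as a relative reduction. The sketch below plugs in the two numbers reported above; the helper name `relative_improvement` is an assumption for illustration. Note that the random baseline depends on the state of NumPy's random generator, so its RMSE may differ slightly between runs.

```python
def relative_improvement(model_rmse, baseline_rmse):
    """Fraction by which the model's RMSE is lower than the baseline's."""
    return 1 - model_rmse / baseline_rmse


# RMSE values copied from the outputs of the cells above
model_rmse = 1.048973330780819     # user-based collaborative filtering
baseline_rmse = 1.540250749836653  # random guesses from the train distribution

improvement = relative_improvement(model_rmse, baseline_rmse)
print(f"{improvement:.1%}")  # prints 31.9%
```

On this test set, the collaborative filtering predictions therefore have an RMSE roughly a third lower than random guessing, although an average error of about one star still leaves room for improvement.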