# Recommender systems

Recommender systems typically work in one of two settings:

- *Explicit feedback*: we know the **rating** that a user gave to an item (for example, "Joseph Marchand gave the movie Suzume 4 stars out of 5").
- *Implicit feedback*: we only observe the sequence of items a user interacted with (for example, which songs are played on Spotify or YouTube, and in which order), but **no ratings**.

In this homework, you will build a model for each of these two settings.

First, execute all cells to ensure you have the necessary packages.

## Part 1: Explicit feedback

In [1]:
!pip install numpy scikit-learn torch spotlight tqdm

In [2]:
# !pip install spotlight  # Used as baseline and for preparing datasets
import torch
from torch import nn
import numpy as np
from spotlight.cross_validation import random_train_test_split
from spotlight.datasets.movielens import get_movielens_dataset
from spotlight.evaluation import rmse_score
from spotlight.factorization.explicit import ExplicitFactorizationModel
from tqdm import tqdm

In [3]:
dataset = get_movielens_dataset(variant='100K')
print(dataset)

train, test = random_train_test_split(dataset)

<Interactions dataset (944 users x 1683 items x 100000 interactions)>


In [4]:
train.__dict__

{'num_users': 944,
 'num_items': 1683,
 'user_ids': array([642, 796, 557, ...,  79, 606, 201], dtype=int32),
 'item_ids': array([832, 226, 180, ..., 276, 288, 603], dtype=int32),
 'ratings': array([3., 3., 5., ..., 3., 4., 4.], dtype=float32),
 'timestamps': array([892240991, 893048410, 881179653, ..., 891271957, 877641931,
        884113924], dtype=int32),
 'weights': None}

Users are numbered from 1 to 943 and items from 1 to 1682.  
`num_users` is set to 944 and `num_items` to 1683 so that IDs can be used directly as indices, which avoids off-by-one errors.
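To see why the extra row matters, here is a minimal sketch that uses a NumPy array as a stand-in for an embedding table. The array names and the dimension 4 are illustrative assumptions, not part of the notebook.

```python
import numpy as np

MAX_USER_ID = 943  # user IDs run from 1 to 943 in this dataset

# One row per possible ID, including the unused ID 0.
table_ok = np.zeros((MAX_USER_ID + 1, 4))
row = table_ok[MAX_USER_ID]  # valid: index 943 exists

# With exactly 943 rows, the largest ID would fall outside the table.
table_too_small = np.zeros((MAX_USER_ID, 4))
try:
    table_too_small[MAX_USER_ID]
    lookup_failed = False
except IndexError:
    lookup_failed = True
```

The same reasoning applies to any lookup table indexed by raw IDs, such as an embedding layer.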

In [5]:
print(f"As an example, user {train.user_ids[0]} gave {train.ratings[0]} stars to item {train.item_ids[0]}.")

As an example, user 642 gave 3.0 stars to item 832.


In [6]:
len(train) / len(dataset)

0.8

The train/test split is 80:20. The task is to predict the rating of each (user, item) pair in the test set, none of which were seen during training: it is a regression problem.

In [7]:
%%time
model = ExplicitFactorizationModel(n_iter=3)
model.fit(train)

rmse_score(model, test)

CPU times: user 16.8 s, sys: 131 ms, total: 16.9 s
Wall time: 5.03 s


0.9924273

A first exercise is to reproduce this metric. We will do it together to make sure that we are talking about the same quantity, namely:

$$\text{RMSE}(y^*, y) = \sqrt{\frac1N \sum_{i = 1}^N (y^*_i - y_i)^2}$$

In [8]:
def our_rmse(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean() ** 0.5

In [9]:
y_pred = model.predict(test.user_ids, test.item_ids)
y_pred

array([4.544159 , 3.9084182, 3.9396346, ..., 3.5382147, 3.7397728,
       3.6165702], dtype=float32)

In [10]:
X_train = torch.LongTensor(np.column_stack((train.user_ids, train.item_ids)))
X_test = torch.LongTensor(np.column_stack((test.user_ids, test.item_ids)))
y_train = torch.Tensor(train.ratings)
y_test = torch.Tensor(test.ratings)

X_train[:5], y_train[:5]

(tensor([[642, 832],
         [796, 226],
         [557, 180],
         [347, 318],
         [895, 742]]),
 tensor([3., 3., 5., 3., 4.]))

In [11]:
our_rmse(y_test, y_pred)

tensor(0.9924)

> *Yes, I got the same thing.*

— The Social Network

In [12]:
our_rmse(y_test, torch.ones_like(y_test) * y_train.mean())  # Simplest baseline

tensor(1.1218)

In [13]:
BATCH_SIZE = 300
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
test_dataset = torch.utils.data.TensorDataset(X_test, y_test)
train_iter = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

In [14]:
class CollaborativeFilteringModel(nn.Module):
    """
    Recommender system for explicit feedback
    """
    def __init__(self, nb_users, nb_items, embedding_size):
        super().__init__()
        # Your code here
        pass

    def forward(self, x):
        # Your code here
        pass

EMBEDDING_SIZE = 20
model = CollaborativeFilteringModel(train.num_users, train.num_items, EMBEDDING_SIZE)
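As a starting point, one common choice for explicit feedback is matrix factorization: the predicted rating is the dot product of a user vector and an item vector, plus bias terms. The sketch below is one possible design under that assumption, not the expected solution; the class name `DotProductFactorization` and its attribute names are hypothetical.

```python
import torch
from torch import nn


class DotProductFactorization(nn.Module):
    """Hypothetical sketch: rating ~ <user vector, item vector> + biases."""

    def __init__(self, nb_users, nb_items, embedding_size):
        super().__init__()
        self.user_embeddings = nn.Embedding(nb_users, embedding_size)
        self.item_embeddings = nn.Embedding(nb_items, embedding_size)
        self.user_bias = nn.Embedding(nb_users, 1)
        self.item_bias = nn.Embedding(nb_items, 1)
        self.global_bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x is a LongTensor of shape (batch, 2): column 0 = user ID, column 1 = item ID
        users, items = x[:, 0], x[:, 1]
        dot = (self.user_embeddings(users) * self.item_embeddings(items)).sum(dim=1)
        bias = self.user_bias(users).squeeze(1) + self.item_bias(items).squeeze(1)
        return dot + bias + self.global_bias  # one predicted rating per row


sketch_model = DotProductFactorization(nb_users=944, nb_items=1683, embedding_size=20)
batch = torch.LongTensor([[642, 832], [796, 226]])
predictions = sketch_model(batch)  # shape (2,)
```

The input layout matches `X_train` above, where each row is a (user ID, item ID) pair.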

In [None]:
N_EPOCHS = 10
LEARNING_RATE = 0.01
loss_function = nn.MSELoss()  # It's a regression problem
# You can also check what happens when there is no weight decay, i.e. no L2 regularization
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-3)

In [None]:
%%time

losses = []
for epoch in tqdm(range(N_EPOCHS)):
    # Your code here, write a training loop and plot train and test RMSE.
    # Don't forget that the loss is the mean squared error.
    # If you want to display RMSE, you need to take its square root.
    pass
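For reference, a generic mini-batch training loop that tracks train and test RMSE could look like the sketch below. It runs on small synthetic data so that it is self-contained; the sizes, the `TinyFactorization` class and the variable names are assumptions made for illustration, and plotting is left out.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in for (user, item) -> rating data; sizes are arbitrary.
NB_USERS, NB_ITEMS, NB_SAMPLES = 50, 40, 2000
X = torch.stack((torch.randint(1, NB_USERS, (NB_SAMPLES,)),
                 torch.randint(1, NB_ITEMS, (NB_SAMPLES,))), dim=1)
y = torch.randint(1, 6, (NB_SAMPLES,)).float()  # ratings in {1, ..., 5}
X_tr, y_tr, X_te, y_te = X[:1600], y[:1600], X[1600:], y[1600:]


class TinyFactorization(nn.Module):
    """Hypothetical minimal model: dot product of embeddings plus a scalar bias."""

    def __init__(self, nb_users, nb_items, dim):
        super().__init__()
        self.users = nn.Embedding(nb_users, dim)
        self.items = nn.Embedding(nb_items, dim)
        self.bias = nn.Parameter(torch.tensor(3.0))

    def forward(self, x):
        return (self.users(x[:, 0]) * self.items(x[:, 1])).sum(dim=1) + self.bias


sketch_model = TinyFactorization(NB_USERS, NB_ITEMS, dim=8)
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(sketch_model.parameters(), lr=0.01, weight_decay=1e-3)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X_tr, y_tr), batch_size=300, shuffle=True)

train_rmse, test_rmse = [], []
for epoch in range(5):
    sketch_model.train()
    squared_error_sum = 0.0
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = loss_function(sketch_model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
        # loss is a batch mean, so weight it by the batch size before summing
        squared_error_sum += loss.item() * len(batch_y)
    train_rmse.append((squared_error_sum / len(y_tr)) ** 0.5)

    sketch_model.eval()
    with torch.no_grad():
        test_rmse.append(loss_function(sketch_model(X_te), y_te).item() ** 0.5)
```

Weighting each batch loss by its size before averaging gives the exact mean squared error over the epoch even when the last batch is smaller than the others.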

## Part 2: Implicit feedback

In this section, we do not observe numerical ratings anymore, just sequences of items.

In [17]:
%%time
from spotlight.cross_validation import user_based_train_test_split
from spotlight.evaluation import sequence_mrr_score
from spotlight.sequence.implicit import ImplicitSequenceModel

# If you want to debug on a smaller dataset first, you can use this
'''from spotlight.datasets.synthetic import generate_sequential
dataset = generate_sequential(num_users=100,
                              num_items=1000,
                              num_interactions=10000,
                              concentration_parameter=0.01,
                              order=3)'''

# Otherwise we reuse Movielens
train, test = user_based_train_test_split(dataset)

train = train.to_sequence()
test = test.to_sequence()

model = ImplicitSequenceModel(n_iter=3,
                              representation='pooling',
                              loss='pointwise')
model.fit(train)

sequence_mrr_score(model, test).mean()

CPU times: user 22.8 s, sys: 15.7 ms, total: 22.9 s
Wall time: 5.79 s


0.041603751419532216

In [18]:
train

<Sequence interactions dataset (8249 sequences x 10 sequence length)>

In [19]:
train.__dict__

{'sequences': array([[ 209,   32,  189, ...,    5,   74,  102],
        [ 255,  272,  271, ...,  244,   18,  270],
        [ 100,  154,    9, ...,  222,  258,  266],
        ...,
        [ 928,   24,  274, ..., 1047,  111,  284],
        [ 763,   50,  412, ...,  685,  471,  405],
        [   0,    0,   64, ..., 1067,  127,  508]], dtype=int32),
 'user_ids': array([  1,   1,   1, ..., 943, 943, 943], dtype=int32),
 'max_sequence_length': 10,
 'num_items': 1683}

The `train` dataset contains more than 8,000 sequences of length 10, each listing movies seen by one user. Movie IDs range from 1 to 1682, and 0 is used for padding.

We do have access to user IDs, but we will not need them. The maximum sequence length is 10, so longer sequences have been split into pieces of at most 10 items. Sequences with fewer than 10 items are padded on the left with 0s.
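The padding convention can be illustrated with a small helper. This is only a sketch of the idea, not Spotlight's implementation; `left_pad` is a hypothetical name.

```python
import numpy as np


def left_pad(items, max_length=10):
    """Hypothetical helper: pad a short sequence with 0s on the left."""
    items = list(items)
    if len(items) > max_length:
        raise ValueError("sequence is longer than max_length")
    return np.array([0] * (max_length - len(items)) + items, dtype=np.int32)


padded = left_pad([64, 1067, 127, 508], max_length=6)
# padded is [0, 0, 64, 1067, 127, 508]: the most recent item stays in last position
```

Padding on the left keeps the item to predict in the last column for every sequence, whatever its original length.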

The objective becomes: given the first 9 items, predict the 10th (a classification problem). To compare models, we are mainly interested in a ranking metric: when movies are ranked by decreasing predicted probability, at which position does the correct answer appear? For $N$ samples, the mean reciprocal rank is:

$${\text{MRR}}={\frac {1}{N}}\sum _{{i=1}}^{{N}}{\frac {1}{{\text{rank}}_{i}}}$$

where $\text{rank}_i$ is a number between 1 and 1683 (the number of items) giving the position of the expected answer when movies are ranked by decreasing probability. The MRR lies between 0 and 1 and higher is better; it equals 1 when every rank is 1.
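As a small worked example of the formula, the sketch below computes the MRR with NumPy on three toy samples. The function name `mean_reciprocal_rank` is hypothetical, and ties are broken in favor of the correct item, which is an assumption that other implementations may not share.

```python
import numpy as np


def mean_reciprocal_rank(scores, targets):
    """scores: (N, num_items) array, higher = more likely; targets: (N,) correct item IDs."""
    target_scores = scores[np.arange(len(targets)), targets]
    # Rank = 1 + number of items scored strictly higher than the correct one.
    ranks = 1 + (scores > target_scores[:, None]).sum(axis=1)
    return (1.0 / ranks).mean()


scores = np.array([[0.1, 0.7, 0.2],   # correct item 1 is ranked 1st
                   [0.5, 0.2, 0.3],   # correct item 2 is ranked 2nd
                   [0.6, 0.3, 0.1]])  # correct item 2 is ranked 3rd
targets = np.array([1, 2, 2])
mrr = mean_reciprocal_rank(scores, targets)  # (1 + 1/2 + 1/3) / 3 = 11/18
```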

Again, we will first attempt to reproduce the metric.

In [20]:
from sklearn.metrics import label_ranking_average_precision_score
from sklearn.preprocessing import OneHotEncoder
from scipy.special import softmax

ohe = OneHotEncoder(categories=[list(range(train.num_items))])
target = ohe.fit_transform(test.sequences[:, [-1]]).toarray()

target.shape  # The target is a one-hot encoding of correct answers for each sample

(2190, 1683)

In [21]:
y_pred = []
for seq in test.sequences:
    y_pred.append(softmax(model.predict(seq[:-1])))
y_pred = np.array(y_pred)
y_pred.round(2)  # The predictions are probabilities for each sample x item

array([[0.  , 0.01, 0.  , ..., 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ],
       ...,
       [0.  , 0.01, 0.  , ..., 0.  , 0.  , 0.  ],
       [0.  , 0.03, 0.  , ..., 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.  ]], dtype=float32)

In [22]:
# This function is actually more generic as it can also consider multilabel classification
# i.e. several correct answers for a sample. In our case (1 correct answer per sample) it is equal to MRR.
label_ranking_average_precision_score(target, y_pred)

0.04160375141953235

Again, we got the same number. We now show the metric for two simple baselines:

In [23]:
from collections import Counter

item_counts = Counter(train.sequences[:, -1])
most_popular_item = item_counts.most_common()[0][0]
most_popular_baseline = np.zeros_like(target)
most_popular_baseline[:, most_popular_item] = 1
label_ranking_average_precision_score(target, most_popular_baseline)

0.007895772118173742

In [24]:
popularity = np.zeros(train.num_items)
for item_id, count in item_counts.items():
    popularity[item_id] = count
popularity = softmax(popularity)
popularity_baseline = np.tile(popularity, (len(target), 1))  # Repeat for each test sample
label_ranking_average_precision_score(target, popularity_baseline)

0.024785953467286397

You should now write a model (an `nn.Module`) that takes a batch of sequences of at most 9 items and predicts the next item.

You can either take:
- a sequential approach, i.e. RNN / LSTM / GRU or a transformer like [minGPT](https://github.com/karpathy/minGPT) (which is more complex; please start simple);
- or a non-sequential one like [CBOW](https://lilianweng.github.io/posts/2017-10-15-word-embedding/#context-based-continuous-bag-of-words-cbow).

The loss can be cross-entropy (simple), or [noise contrastive estimation or negative sampling](https://lilianweng.github.io/posts/2017-10-15-word-embedding/#noise-contrastive-estimation-nce).

The goal is to reach an MRR comparable to or better than 0.04, the score of the Spotlight baseline above.

P. S. – In order to ignore index 0 you can use `padding_idx` in [nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) or the function [`masked_select()`](https://pytorch.org/docs/stable/generated/torch.masked_select.html) for RNN.
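To make the non-sequential option and the `padding_idx` hint concrete, here is a minimal CBOW-style sketch: it averages the embeddings of the non-padding items and scores every item with a linear layer, trained with cross-entropy. The class name `MeanPoolingNextItem`, the embedding size and the toy batch are assumptions for illustration, not the expected solution.

```python
import torch
from torch import nn


class MeanPoolingNextItem(nn.Module):
    """Hypothetical CBOW-style sketch: average the item embeddings, then score all items."""

    def __init__(self, nb_items, embedding_size):
        super().__init__()
        # padding_idx=0 keeps the vector of ID 0 at zero and excludes it from gradient updates
        self.embeddings = nn.Embedding(nb_items, embedding_size, padding_idx=0)
        self.output = nn.Linear(embedding_size, nb_items)

    def forward(self, sequences):
        # sequences: LongTensor of shape (batch, length), padded with 0
        mask = (sequences != 0).unsqueeze(-1).float()
        summed = (self.embeddings(sequences) * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1)  # avoid dividing by zero
        return self.output(summed / counts)  # logits over all items


torch.manual_seed(0)
sketch = MeanPoolingNextItem(nb_items=1683, embedding_size=16)
context = torch.LongTensor([[0, 0, 64, 1067, 127, 508, 3, 9, 12],
                            [5, 74, 102, 209, 32, 189, 7, 8, 11]])
next_item = torch.LongTensor([42, 17])
logits = sketch(context)  # shape (2, 1683)
loss = nn.CrossEntropyLoss()(logits, next_item)
loss.backward()
```

Dividing by the number of real items rather than the sequence length keeps short, heavily padded sequences on the same scale as full ones.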