#### Recommender System
**Course:** Data Mining <br>
**Authors:** Lada Morozova, Shamil Arslanov, Danis Alukaev, Maxim Faleev, Rizvan Iskaliev <br>
**Group:** B19-DS-01

## 0 Prerequisites

In [1]:
import os 
import random
import time
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.utils.data as data
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset
from pathlib import Path
from sklearn.preprocessing import LabelEncoder
import optuna

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 4. Modeling

## 1. Select Modeling Technique

This section describes the modelling techniques employed in this project. We organised development iteratively: on each iteration we proposed a new architecture that could potentially outperform the current state-of-the-art (SoA) method. Recent advances in the field were taken from ["A review on deep learning for recommender systems: challenges and remedies"](https://link.springer.com/article/10.1007/s10462-018-9654-y) by Batmaz Z. et al. In total, we implemented four models: an item popularity model, a multilayer perceptron, a factorization machine, and a behaviour sequence transformer.

### 1.1 Item Popularity Model

The model rests on the assumption that the most popular movies are likely to interest any user. We define popularity as the number of ratings a movie has received. Thus, for every user the recommender system always returns the top-k movies with the highest number of ratings. This is a naive approach and is unlikely to satisfy our data mining goal. For this reason, we assume that the company uses this model in its current setup, and all other (more sophisticated) models are benchmarked against this "baseline".

Although the model is simple, its main advantage is that it makes almost no assumptions about the data. Essentially, the method applies to any dataset as long as it contains user ratings and movie names.
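The idea can be illustrated on a toy table. This is only a minimal sketch: the `ratings` frame and its values are made up for illustration and are not taken from the project data.

```python
import pandas as pd

# Hypothetical toy ratings: one row per (user, movie) rating event.
ratings = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3, 3, 3],
    "Title": ["A", "B", "A", "C", "A", "B", "D"],
})

top_k = 2
# Popularity = number of ratings per movie; every user gets the same top-k list.
popular = ratings["Title"].value_counts().head(top_k).index.tolist()
print(popular)  # ['A', 'B']
```

Movie "A" has three ratings and "B" has two, so these two are recommended to every user regardless of their history.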

### 1.2 Multilayer Perceptron

Another natural option is a Multilayer Perceptron (MLP). As input, the model takes the concatenated latent representations of user, movie, age, occupation, and gender, as well as the other features derived manually in the previous step. This input is passed through multiple dense layers with non-linear activations, e.g. the rectified linear unit (ReLU). The output layer is a single neuron producing a logit, which is passed through a sigmoid function. The result can be interpreted as the probability that the user will enjoy the movie.

The developed model uses PyTorch embeddings to encode user, movie, age, occupation and gender. Therefore, the input tensor should consist of indices that refer to rows of these embeddings. It is also important that the input is numeric, contains no missing values, and is uniformly distributed.
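To make the "indices into embeddings" requirement concrete, here is a small sketch that uses a NumPy array as a stand-in for an embedding table. The sizes and index values are assumptions chosen for illustration; the real model uses `nn.Embedding`.

```python
import numpy as np

rng = np.random.default_rng(0)
num_embeddings, dim = 10, 4                    # hypothetical table size
table = rng.normal(size=(num_embeddings, dim))  # one row per encoded entity

# Each sample holds five indices: user, movie, age, gender, occupation.
batch = np.array([[0, 3, 5, 7, 9],
                  [1, 4, 6, 8, 9]])

embedded = table[batch]                         # lookup -> shape (2, 5, 4)
mlp_input = embedded.reshape(len(batch), -1)    # concatenation -> shape (2, 20)
print(embedded.shape, mlp_input.shape)
```

The flattened width (5 fields times the embedding size) is what the first dense layer has to accept.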

### 1.3 Factorization Machine

Until recently, factorization machines were the de facto silver bullet in the field of recommender systems. They were first presented in ["Factorization Machines"](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf) by Rendle S. in 2010, inspired by and proposed as a replacement for traditional matrix factorization. This approach learns interactions between users and movies while also using user metadata for the final prediction. Thus, it addresses the well-known [cold-start](https://en.wikipedia.org/wiki/Cold_start_(recommender_systems)) problem.

The developed model uses PyTorch embeddings to encode user, movie, age, occupation and gender. Therefore, the input tensor should consist of indices that refer to rows of these embeddings. It is also important that the input is numeric, contains no missing values, and is uniformly distributed.
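For reference, the second-order model from Rendle's paper can be written as follows, where $x \in \mathbb{R}^n$ is the feature vector, $w_0$ is the global bias, $w_i$ are per-feature weights, and $\mathbf{v}_i \in \mathbb{R}^k$ is the latent vector of feature $i$:

$$
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
$$

The pairwise term can be computed in $O(kn)$ instead of $O(kn^2)$ using the identity stated as Lemma 3.1 in the paper:

$$
\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
= \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} \, x_i \right)^{2} - \sum_{i=1}^{n} v_{i,f}^{2} \, x_i^{2} \right]
$$

In this notebook the inputs are categorical indices, so every active feature has $x_i = 1$, and a sigmoid is applied to $\hat{y}$ because the task is framed as binary classification.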

### 1.4 Transformer

Finally, we applied the Behaviour Sequence Transformer (BST) architecture to our problem. The implementation is mainly inspired by the manuscript ["Behaviour Sequence Transformer for E-commerce Recommendation in Alibaba"](https://arxiv.org/pdf/1905.06874.pdf). In short, the authors extend the MLP with a sequence-to-sequence model that uses a multi-head attention mechanism. The model now takes into account not only the metadata of the user and movie, but also the rating history. The authors report an online Click-Through Rate (CTR) gain of 7.57% for BST compared to a control group.

<img src="https://miro.medium.com/max/1400/1*ThXfG04qnukM6fqWLRb_JQ.jpeg" alt="drawing" width="700"/>

The developed model uses PyTorch embeddings to encode user, movie, age, occupation and gender. Therefore, the input tensor should consist of indices that refer to rows of these embeddings. It is also important that the input is numeric, contains no missing values, and is uniformly distributed. Moreover, the input sequence must have an upper length limit in order to feed the behaviour sequence into the model.

## 2. Generate Test Design

To slightly relax the problem, we reformulate it from regression to binary classification. Recall that ratings in the original dataset range from 1 to 5; the new target is 1 if the rating is above 3, and 0 otherwise.
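As a quick illustration of this target definition, the sketch below binarizes a hypothetical array containing each possible rating once.

```python
import numpy as np

ratings = np.array([1, 2, 3, 4, 5])    # illustrative values, one per rating level
target = np.where(ratings < 4, 0, 1)   # 1 for ratings 4 and 5, 0 otherwise
print(target)  # [0 0 0 1 1]
```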

The dataset is split into train, test, and validation subsets of approximately 70%, 20%, and 10% of the data respectively. The split is made by timestamp so that the historical order is preserved. Each model is iteratively trained on the train subset and evaluated on the test subset (to check how well it generalizes, detect over-fitting, etc.). The final performance of the models is checked on the validation subset.
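A chronological split can be sketched as below. Note that this is an illustrative alternative, not the notebook's method: the hypothetical helper `split_by_time` derives its thresholds from timestamp quantiles, whereas the notebook uses fixed timestamp cut-offs.

```python
import pandas as pd

def split_by_time(df, train_frac=0.7, test_frac=0.2, column="Timestamp"):
    """Split rows chronologically using quantiles of the timestamp column."""
    t_train = df[column].quantile(train_frac)
    t_test = df[column].quantile(train_frac + test_frac)
    train = df[df[column] <= t_train]
    test = df[(df[column] > t_train) & (df[column] <= t_test)]
    val = df[df[column] > t_test]
    return train, test, val

# Toy data: ten ratings with increasing timestamps.
toy = pd.DataFrame({"Timestamp": range(1, 11),
                    "Rating": [5, 3, 4, 2, 1, 5, 4, 3, 2, 5]})
train, test, val = split_by_time(toy)
print(len(train), len(test), len(val))  # 7 2 1
```

Because every row in `test` is later than every row in `train`, and every row in `val` is later still, no future information leaks into training.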

Strictly speaking, a recommender system cannot be properly tested without external validation. The best way is arguably to design an A/B test and check how the new approach influences a chosen metric, e.g. CTR. This project does not require deployment, so building such infrastructure is out of scope.

### 2.1 Data Preparation

In [4]:
movielens_dir = Path("./data/movie_lens/movie_lens_1m.csv")
joint_dir = Path("./data/augmented_movie_lens.csv")

In [5]:
movielens_df = pd.read_csv(movielens_dir)
joint_df = pd.read_csv(joint_dir)

In [6]:
def index(df, columns):
    _df = df.copy()
    offset = 0
    for column in columns:
        index_column = column + '_index'
        _df[index_column] = offset + _df[column].astype('category').cat.codes
        offset += len(_df[index_column].unique())
    return _df

columns = ['UserID', 'MovieID', 'Gender', 'Age', 'Occupation']
movielens_df_indexed = index(movielens_df, columns)
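The effect of the offsets in the `index` helper above is easiest to see on a toy frame. The sketch below re-implements the same idea under a hypothetical name, `index_with_offsets`; the explicit cast to `int64` is an added assumption to avoid small integer dtypes when offsets grow.

```python
import pandas as pd

def index_with_offsets(df, columns):
    """Encode each column as category codes shifted into a shared index space."""
    out = df.copy()
    offset = 0
    for column in columns:
        codes = out[column].astype("category").cat.codes.astype("int64")
        out[column + "_index"] = offset + codes
        offset += codes.nunique()   # next column starts after this one's range
    return out

toy = pd.DataFrame({"UserID": [10, 10, 42],
                    "MovieID": [7, 9, 7],
                    "Gender": ["F", "F", "M"]})
indexed = index_with_offsets(toy, ["UserID", "MovieID", "Gender"])
print(indexed[["UserID_index", "MovieID_index", "Gender_index"]])
```

Users occupy indices 0 and 1, movies 2 and 3, genders 4 and 5, so a single embedding table with six rows can serve all three fields without collisions.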

In [7]:
def split_dataset(df):
    _df = df.copy()
    train_df = _df[df.Timestamp <= 975e6]
    test_df = _df[(df.Timestamp > 975e6) & (df.Timestamp < 98e7)]
    val_df = _df[df.Timestamp >= 98e7]
    return train_df, test_df, val_df

In [8]:
train_movielens_df, test_movielens_df, val_movielens_df = split_dataset(movielens_df_indexed)
train_joint_df, test_joint_df, val_joint_df = split_dataset(joint_df)

In [9]:
assert len(movielens_df) == len(train_movielens_df) + len(test_movielens_df) + len(val_movielens_df)
assert len(joint_df) == len(train_joint_df) + len(test_joint_df) + len(val_joint_df)

In [10]:
movie_mapping = dict(movielens_df_indexed[['Title', 'MovieID_index']].drop_duplicates().values[:, ::-1])
gender_mapping = dict(movielens_df_indexed[['Gender', 'Gender_index']].drop_duplicates().values[:, ::-1])
age_mapping = dict(movielens_df_indexed[['Age', 'Age_index']].drop_duplicates().values[:, ::-1])
occupation_mapping = dict(movielens_df_indexed[['Occupation', 'Occupation_index']].drop_duplicates().values[:, ::-1])

In [11]:
features = ['UserID_index', 'MovieID_index', 'Age_index', 'Gender_index', 'Occupation_index']

In [12]:
binary_target = np.where(train_movielens_df.Rating.values < 4, 0, 1)
X = torch.LongTensor(train_movielens_df[features].values)
y = torch.LongTensor(binary_target).float()
trainset = data.TensorDataset(X, y)

binary_target = np.where(test_movielens_df.Rating.values < 4, 0, 1)
X = torch.LongTensor(test_movielens_df[features].values)
y = torch.LongTensor(binary_target).float()
testset = data.TensorDataset(X, y)

binary_target = np.where(val_movielens_df.Rating.values < 4, 0, 1)
X = torch.LongTensor(val_movielens_df[features].values)
y = torch.LongTensor(binary_target).float()
valset = data.TensorDataset(X, y)

In [13]:
trainloader = data.DataLoader(trainset, batch_size=1024, shuffle=True)
testloader = data.DataLoader(testset, batch_size=1024, shuffle=True)
valloader = data.DataLoader(valset, batch_size=1024, shuffle=True)

In [14]:
num_embeddings = movielens_df_indexed[features].values.max() + 1

### 2.2 Metric Implementation

The performance of a model is defined as the share of suggested movies that the user rated 4 or 5 among the suggested movies that the user rated at all (further referred to as accuracy).

In [15]:
def compute_accuracy(ratings, recommendation):
    """Compute accuracy of recommender system.

    Parameters
    ----------
    ratings : pandas.DataFrame
        dataframe with user's ratings
    
    recommendation : iterable 
        recommendations of a system

    Returns
    -------
    accuracy : float
        percentage of suggested movies which the user rated 4 or 5 
        among suggested movies which user rated
    """
    well_rated = ratings[ratings.Rating >= 4].MovieID_index
    all_rated = ratings.MovieID_index
    hits = len(list(set(well_rated) & set(recommendation)))
    intersection = len(list(set(all_rated) & set(recommendation)))
    if intersection == 0:
        return np.nan  # `np.NaN` alias was removed in NumPy 2.0
    accuracy = hits / intersection
    return accuracy
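A worked example may help to read this metric. The sketch below is a simplified, dictionary-based variant under the hypothetical name `hit_share`; the toy ratings and recommendations are made up.

```python
import math

def hit_share(ratings, recommendation):
    """ratings: dict mapping movie index -> rating given by one user."""
    rated = set(ratings) & set(recommendation)      # recommended and rated
    if not rated:
        return float("nan")                         # nothing to judge
    liked = {m for m in rated if ratings[m] >= 4}   # recommended and liked
    return len(liked) / len(rated)

ratings = {1: 5, 2: 3, 3: 4, 4: 1}
recommendation = [1, 2, 3, 99]                      # movie 99 was never rated
score = hit_share(ratings, recommendation)
print(round(score, 3))  # 0.667
```

Movies 1, 2, and 3 are both recommended and rated, and two of them received a rating of at least 4, hence 2/3. The unrated movie 99 does not affect the score, which is why the metric says nothing about recommendations the user has not rated.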

In [16]:
def compute_mean_accuracy(df, model):
    """Compute mean accuracy of inferenced model.

    Parameters
    ----------
    df : pandas.DataFrame
        validation dataset

    model : Object or torch.nn.Module
        instance of class implementing predict method
    
    Returns
    -------
    mean_accuracy : float
        mean accuracy of recommendations
    """
    accuracies = list()
    for user_id in df.UserID.unique():
        ratings = df[df.UserID == user_id]
        indices = model.predict(ratings)
        ratings = ratings[['MovieID_index', 'Rating']]
        accuracy = compute_accuracy(ratings, indices)
        accuracies.append(accuracy)
    accuracies = np.array(accuracies)
    accuracies = accuracies[~np.isnan(accuracies)]
    mean_accuracy = accuracies.mean()
    return mean_accuracy

## 3. Build Model

This section contains implementations for item popularity model, multilayer perceptron, factorization machine, and transformer.

### 3.1 Item Popularity Model

In [17]:
class ItemPopularityModel:
    """Implementation of Item Popularity model for recommender systems."""

    def __init__(self, top_k=50):
        """Constructor of Item Popularity model.

        Parameters
        ----------
        top_k : int
            number of items to recommend
        """
        self.top_k = top_k
        self.prediction = None

    def fit(self, df):
        """Training routine.

        Sorts movies by number of ratings and stores the sorted table in 
        attribute prediction.

        Parameters
        ----------
        df : pandas.DataFrame
            data to use for training
        """
        assert 'MovieID_index' in df and 'Rating' in df, '"MovieID_index" and "Rating" should be in column names'
        _df = df.copy()
        ratings_number = _df[['MovieID_index', 'Rating']].groupby(['MovieID_index']).count()
        sorted_titles = ratings_number.sort_values(by='Rating', ascending=False)
        sorted_titles.reset_index(inplace=True)
        self.prediction = sorted_titles
        
    def predict(self, df=None):
        """Inference model.

        Parameters
        ----------
        df : pandas.DataFrame, optional
            unused; accepted so that all models share the same predict interface

        Returns
        -------
        indices : pandas.Series
            indices of the top-k recommended movies
        """
        recommendation = self.prediction[:self.top_k]
        indices = recommendation['MovieID_index']
        return indices

In [18]:
item_popularity_model = ItemPopularityModel()
item_popularity_model.fit(train_movielens_df)

### 3.2 Multilayer Perceptron

In [19]:
class MLP(nn.Module):
    """Implementation of Multilayer Perceptron with embeddings."""

    def __init__(self, n, layers=[600, 400, 200, 50]):
        """Constructor of class MLP
        
        Parameters
        ----------
        n : int
            number of entities to embed
        
        layers : list
            list of layers sizes
        """
        super().__init__()
        assert layers[0] % 5 == 0
        self.embeddings = nn.Embedding(n, layers[0] // 5)
        architecture = list()
        for idx in range(len(layers) - 1):
            architecture.append(nn.Linear(layers[idx], layers[idx + 1]))
            architecture.append(nn.ReLU())
        architecture.append(nn.Linear(layers[-1], 1))
        architecture.append(nn.Sigmoid())
        self.main = nn.Sequential(*architecture)
    
    def forward(self, x):
        embeded = self.embeddings(x)
        x = embeded.flatten(start_dim=1)
        x = self.main(x)
        return x

In [20]:
class MLPModel: 
    
    def __init__(self, n, trainloader, testloader, layers=[600, 400, 200, 50],
                 lr=1e-3, weight_decay=1e-5, epochs=10, top_k=50, optimizer=None,
                 device="cuda"):
        self.n = n
        self.trainloader = trainloader
        self.testloader = testloader
        self.lr = lr
        self.layers = layers
        self.weight_decay = weight_decay
        self.epochs = epochs
        self.device = device
        self.top_k = top_k
        
        self.model = MLP(n, layers).to(device)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr, weight_decay=weight_decay)
        if optimizer is not None:
            self.optimizer = optimizer
        self.scheduler = optim.lr_scheduler.MultiStepLR(self.optimizer, 
                                                        milestones=list(range(0, epochs + 1, 5))[1:], 
                                                        gamma=0.1)
        self.criterion = nn.BCELoss().to(device)
        
    def fit(self, verbose=False):
        "High-level outline of the training process."
        min_val_loss = 1e9
        for epoch in range(self.epochs):
            train_loss = self._train_one_epoch()
            valid_loss = self._test()
            self.scheduler.step()
            # NOTE: the flag is inverted: per-epoch losses are printed when `verbose` is False.
            if not verbose:
                print(f"Epoch #{epoch + 1} | Train loss: {train_loss:.4f} | Test loss: {valid_loss:.4f} |")
            if valid_loss < min_val_loss:
                torch.save({'epoch': epoch,
                            'model_state_dict': self.model.state_dict(),
                            'optimizer_state_dict': self.optimizer.state_dict(),
                            'loss': valid_loss, 
                            }, 
                            os.path.join("./weights/mlp/", f"{epoch}.pt")
                )
                min_val_loss = valid_loss
        return train_loss, valid_loss
    
    def _train_one_epoch(self):
        "Train model for one epoch on train dataset."
        device = self.device
        train_loss = 0
        self.model.train()
        for x, y in self.trainloader:
            self.optimizer.zero_grad()
            y_hat = self.model(x.to(device)).flatten()
            loss = self.criterion(y_hat, y.to(device))
            train_loss += loss.item() * x.shape[0]
            loss.backward()
            self.optimizer.step()
        return train_loss / len(self.trainloader.dataset)
    
    def _test(self):
        "Test model on validation dataset."
        device = self.device
        test_loss = 0
        self.model.eval()
        for x, y in self.testloader:                    
            with torch.no_grad():
                y_hat = self.model(x.to(device)).flatten()
            loss = self.criterion(y_hat, y.to(device))
            test_loss += loss.item() * x.shape[0]
        return test_loss / len(self.testloader.dataset)
    
    def predict(self, x):
        user_id = int(x.UserID_index.unique())
        age = int(x.Age_index.unique())
        gender = int(x.Gender_index.unique())
        occupation = int(x.Occupation_index.unique())

        movies = movielens_df_indexed.drop_duplicates('MovieID_index').copy()

        rankings = []
        for movie in movies.MovieID_index.unique():
            # Use the device the model was created on rather than the global `device`.
            ranking = self.model(torch.LongTensor([user_id, movie, age, gender, occupation]).unsqueeze(0).to(self.device))
            rankings.append(ranking.item())
        
        rankings = torch.FloatTensor(rankings)
        recommendations = [i for i in movies.iloc[rankings.argsort(descending=True).cpu()]['MovieID_index'].values][:self.top_k]
        return recommendations

In [21]:
def define_mlp_model(trial):
    """Optuna auxiliary method."""
    input_dim = trial.suggest_int(name="input_dim", low=50, high=1000, step=5)
    hidden_dim = trial.suggest_int(name="hidden_dim", low=50, high=1000, step=1)
    depth_hidden = trial.suggest_int(name="depth", low=1, high=10, step=1)    
    layers = [input_dim] + [hidden_dim] * depth_hidden 
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD"]) 
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    mlp = MLPModel(num_embeddings, trainloader, testloader, epochs=5, layers=layers, device=device)
    optimizer = getattr(optim, optimizer_name)(mlp.model.parameters(), lr=lr)
    mlp.optimizer = optimizer
    return mlp

In [22]:
def objective_mlp(trial):
    """Optuna auxiliary method."""
    model = define_mlp_model(trial)
    _, loss = model.fit(verbose=True)
    return loss

In [23]:
def hyperparameter_optimization_mlp(n_trials=15):
    """Hyperparameter optimization for MLP model."""
    study = optuna.create_study(direction="minimize")
    study.optimize(objective_mlp, n_trials=n_trials, timeout=900)
    trial = study.best_trial
    return trial.params

In [24]:
params = hyperparameter_optimization_mlp()
print(f"Best params: {params}")
layers = [params['input_dim']] + [params['hidden_dim']] * params['depth']
mlp_model = MLPModel(num_embeddings, trainloader, testloader, layers=layers, device=device)
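# NOTE: this optimizer is created but never assigned to `mlp_model`, so the
# training run below uses the default Adam optimizer from MLPModel.__init__.
optimizer = getattr(optim, params['optimizer'])(mlp_model.model.parameters(), lr=params['lr'])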

[I 2022-05-21 04:49:44,999] A new study created in memory with name: no-name-27363ed1-f64b-4f57-b151-e3e71ce5e0bb
[I 2022-05-21 04:50:35,122] Trial 0 finished with value: 0.6419955693373888 and parameters: {'input_dim': 850, 'hidden_dim': 918, 'depth': 4, 'optimizer': 'RMSprop', 'lr': 1.3659722937472789e-05}. Best is trial 0 with value: 0.6419955693373888.
[I 2022-05-21 04:51:22,936] Trial 1 finished with value: 43.27547892681892 and parameters: {'input_dim': 760, 'hidden_dim': 605, 'depth': 8, 'optimizer': 'RMSprop', 'lr': 0.0032513879083991538}. Best is trial 0 with value: 0.6419955693373888.
[I 2022-05-21 04:52:08,389] Trial 2 finished with value: 0.6022213128398723 and parameters: {'input_dim': 265, 'hidden_dim': 119, 'depth': 6, 'optimizer': 'Adam', 'lr': 0.0076556290815639715}. Best is trial 2 with value: 0.6022213128398723.
[I 2022-05-21 04:52:46,734] Trial 3 finished with value: 0.5900556705759868 and parameters: {'in

Best params: {'input_dim': 320, 'hidden_dim': 298, 'depth': 1, 'optimizer': 'Adam', 'lr': 0.02870069564380648}


In [25]:
_, _ = mlp_model.fit()

Epoch #1 | Train loss: 0.6111 | Test loss: 0.6236 |
Epoch #2 | Train loss: 0.5526 | Test loss: 0.5999 |
Epoch #3 | Train loss: 0.5379 | Test loss: 0.5897 |
Epoch #4 | Train loss: 0.5308 | Test loss: 0.5864 |
Epoch #5 | Train loss: 0.5247 | Test loss: 0.5866 |
Epoch #6 | Train loss: 0.5102 | Test loss: 0.5827 |
Epoch #7 | Train loss: 0.5074 | Test loss: 0.5819 |
Epoch #8 | Train loss: 0.5058 | Test loss: 0.5818 |
Epoch #9 | Train loss: 0.5044 | Test loss: 0.5830 |
Epoch #10 | Train loss: 0.5031 | Test loss: 0.5821 |


In [26]:
checkpoint = torch.load(os.path.join("weights/mlp/", "7.pt"))
mlp_model.model.load_state_dict(checkpoint['model_state_dict'])
mlp_model.model.eval()

MLP(
  (embeddings): Embedding(9776, 64)
  (main): Sequential(
    (0): Linear(in_features=320, out_features=298, bias=True)
    (1): ReLU()
    (2): Linear(in_features=298, out_features=1, bias=True)
    (3): Sigmoid()
  )
)

### 3.3 Factorization Machine

In [27]:
class FactorizationMachine(nn.Module):
    """Implementation of Factorization Machine according to "Factorization 
    Machines" by Rendle.
    https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
    """
    
    def __init__(self, n, k):
        """Constructor of FactorizationMachine class.
        
        Parameters
        ----------
        n : int
            size of feature vector
        
        k : int
            size of embedding to use
        
        Returns
        -------
        None
        """
        super().__init__()
        self.w0 = nn.Parameter(torch.zeros(1))
        self.bias = nn.Embedding(n, 1)
        self.embeddings = nn.Embedding(n, k)
        self._init_trunc_normal()
    
    def _init_trunc_normal(self, mean=0., std=0.01):
        """Initialize weights via truncated normal function.
        
        Implemented according to "An Exploration of Word Embedding 
        Initialization in Deep-Learning Tasks" by Kocmi T. and Bojar O.
        https://arxiv.org/pdf/1711.09160.pdf
        
        Parameters
        ----------
        mean : float (default: 0.00)
            mean of normal distribution
        
        std : float (default: 0.01)
            standard deviation of normal distribution
        
        Returns
        -------
        None
        """
        with torch.no_grad(): 
            self.embeddings.weight.normal_().fmod_(2).mul_(std).add_(mean)
            self.bias.weight.normal_().fmod_(2).mul_(std).add_(mean)
    
    def forward(self, x):
        "Compute interactions using Lemma 3.1"
        # squeeze only the last dim so that a batch of size 1 keeps its batch axis
        bias = self.bias(x).squeeze(-1).sum(1)
        embeded = self.embeddings(x)
        pow_of_sum = embeded.sum(dim=1).pow(2)
        sum_of_pow = embeded.pow(2).sum(dim=1)
        pairwise = (pow_of_sum - sum_of_pow).sum(1) * 0.5
        y = torch.sigmoid(self.w0 + bias + pairwise)
        return y
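The `pow_of_sum - sum_of_pow` trick used in `forward` can be checked numerically against the naive sum over all feature pairs. The sketch below uses NumPy and random vectors; the sizes and variable names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fields, k = 5, 3
V = rng.normal(size=(n_fields, k))   # latent vectors of the active features

# Naive pairwise term: sum of <v_i, v_j> over all pairs i < j.
naive = sum(float(V[i] @ V[j])
            for i in range(n_fields)
            for j in range(i + 1, n_fields))

# Linear-time reformulation: 0.5 * sum_f [(sum_i v_if)^2 - sum_i v_if^2].
pow_of_sum = V.sum(axis=0) ** 2
sum_of_pow = (V ** 2).sum(axis=0)
fast = 0.5 * float((pow_of_sum - sum_of_pow).sum())

assert np.isclose(naive, fast)
```

Both expressions agree up to floating-point error, while the second one avoids the explicit double loop over feature pairs.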

In [28]:
class FactorizationMachineModel: 
    
    def __init__(self, n, k, trainloader, testloader, 
                 lr=1e-3, weight_decay=1e-5, epochs=10, device="cuda", top_k=50):
        """Constructor of training routine.
        
        Parameters
        ----------
        n : int
            size of feature vector
        
        k : int
            size of embedding to use
        
        trainloader : torch.utils.data.DataLoader 
            iterator over training data
        
        testloader : torch.utils.data.DataLoader 
            iterator over test data
        
        lr : float
            learning rate for optimizer
        
        weight_decay : float
            regularization parameter for optimizer
        
        epochs : int
            number of training epochs
        
        device : str
            device to use for training
        
        Returns
        -------
        None
        """
        self.n = n
        self.k = k
        self.trainloader = trainloader
        self.testloader = testloader
        self.lr = lr
        self.weight_decay = weight_decay
        self.epochs = epochs
        self.device = device
        self.top_k = top_k
        
        self.model = FactorizationMachine(n, k).to(device)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr, weight_decay=weight_decay)
        self.scheduler = optim.lr_scheduler.MultiStepLR(self.optimizer, 
                                                        milestones=list(range(0, epochs + 1, 5))[1:], 
                                                        gamma=0.1)
        self.criterion = nn.BCELoss().to(device)
        
    def fit(self, verbose=False):
        "High-level outline of the training process."
        min_val_loss = 1e9
        for epoch in range(self.epochs):
            train_loss = self._train_one_epoch()
            valid_loss = self._test()
            self.scheduler.step()
            # NOTE: the flag is inverted: per-epoch losses are printed when `verbose` is False.
            if not verbose:
                print(f"Epoch #{epoch + 1} | Train loss: {train_loss:.4f} | Test loss: {valid_loss:.4f} |")
            if valid_loss < min_val_loss:
                torch.save({'epoch': epoch,
                            'model_state_dict': self.model.state_dict(),
                            'optimizer_state_dict': self.optimizer.state_dict(),
                            'loss': valid_loss, 
                            }, 
                            os.path.join("./weights/fm/", f"{epoch}.pt")
                )
                min_val_loss = valid_loss
        return train_loss, valid_loss
    
    def _train_one_epoch(self):
        "Train model for one epoch on train dataset."
        device = self.device
        train_loss = 0
        self.model.train()
        for x, y in self.trainloader:
            self.optimizer.zero_grad()
            y_hat = self.model(x.to(device))
            loss = self.criterion(y_hat, y.to(device))
            train_loss += loss.item() * x.shape[0]
            loss.backward()
            self.optimizer.step()
        return train_loss / len(self.trainloader.dataset)
    
    def _test(self):
        "Test model on validation dataset."
        device = self.device
        test_loss = 0
        self.model.eval()
        for x, y in self.testloader:                    
            with torch.no_grad():
                y_hat = self.model(x.to(device))
            loss = self.criterion(y_hat, y.to(device))
            test_loss += loss.item() * x.shape[0]
        return test_loss / len(self.testloader.dataset)
    
    def predict(self, x):
        user_id = int(x.UserID_index.unique())
        age = int(x.Age_index.unique())
        gender = int(x.Gender_index.unique())
        occupation = int(x.Occupation_index.unique())

        movies = movielens_df_indexed.drop_duplicates('MovieID_index').copy()
        # Use the device the model was created on rather than the global `device`.
        movie_embeddings = self.model.embeddings(torch.tensor(movies['MovieID_index'].values, device=self.device).long())
        movie_biases = self.model.bias(torch.tensor(movies['MovieID_index'].values, device=self.device).long())

        user_embedding = self.model.embeddings(torch.tensor(user_id, device=self.device))
        age_embedding = self.model.embeddings(torch.tensor(age, device=self.device))
        gender_embedding = self.model.embeddings(torch.tensor(gender, device=self.device))
        occupation_embedding = self.model.embeddings(torch.tensor(occupation, device=self.device))

        metadata_embedding = user_embedding + age_embedding + gender_embedding + occupation_embedding
        rankings = movie_biases.squeeze() + (metadata_embedding * movie_embeddings).sum(1)
        recommendations = [i for i in movies.iloc[rankings.argsort(descending=True).cpu()]['MovieID_index'].values][:self.top_k]
        return recommendations

In [29]:
def define_fm_model(trial):
    """Optuna auxiliary method."""
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD"]) 
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    embed_dim = trial.suggest_int(name="embed_dim", low=10, high=1000, step=1)
    fm = FactorizationMachineModel(num_embeddings, embed_dim, trainloader, testloader, epochs=5, device=device)
    optimizer = getattr(optim, optimizer_name)(fm.model.parameters(), lr=lr)
    fm.optimizer = optimizer
    return fm

In [30]:
def objective_fm(trial):
    """Optuna auxiliary method."""
    model = define_fm_model(trial)
    _, loss = model.fit(verbose=True)
    return loss

In [31]:
def hyperparameter_optimization_fm(n_trials=15):
    """Hyperparameter optimization for Factorization Machine."""
    study = optuna.create_study(direction="minimize")
    study.optimize(objective_fm, n_trials=n_trials, timeout=900)
    trial = study.best_trial
    return trial.params

In [32]:
params = hyperparameter_optimization_fm()
print(f"Best params: {params}")
factorization_machine = FactorizationMachineModel(num_embeddings, params['embed_dim'], trainloader, testloader, device=device)
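# NOTE: this optimizer is created but never assigned to `factorization_machine`, so the
# training run below uses the default Adam optimizer from FactorizationMachineModel.__init__.
optimizer = getattr(optim, params['optimizer'])(factorization_machine.model.parameters(), lr=params['lr'])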

[I 2022-05-21 05:02:55,880] A new study created in memory with name: no-name-371c4d77-2173-4fb5-a45c-d0c1077eb316
[I 2022-05-21 05:03:34,299] Trial 0 finished with value: 41.08917681138435 and parameters: {'optimizer': 'Adam', 'lr': 0.07066772899461042, 'embed_dim': 601}. Best is trial 0 with value: 41.08917681138435.
[I 2022-05-21 05:04:10,971] Trial 1 finished with value: 0.6807910058580265 and parameters: {'optimizer': 'SGD', 'lr': 0.02333246853259571, 'embed_dim': 354}. Best is trial 1 with value: 0.6807910058580265.
[I 2022-05-21 05:04:49,580] Trial 2 finished with value: 0.5965228174935786 and parameters: {'optimizer': 'Adam', 'lr': 2.0152501255926844e-05, 'embed_dim': 764}. Best is trial 2 with value: 0.5965228174935786.
[I 2022-05-21 05:05:26,474] Trial 3 finished with value: 0.58588119272917 and parameters: {'optimizer': 'RMSprop', 'lr': 8.019859129538426e-05, 'embed_dim': 109}. Best is trial 3 with value: 0.58588119

Best params: {'optimizer': 'RMSprop', 'lr': 0.00020153938430693123, 'embed_dim': 280}


In [33]:
_, _ = factorization_machine.fit()

Epoch #1 | Train loss: 0.5626 | Test loss: 0.5842 |
Epoch #2 | Train loss: 0.5369 | Test loss: 0.5835 |
Epoch #3 | Train loss: 0.5272 | Test loss: 0.5813 |
Epoch #4 | Train loss: 0.5173 | Test loss: 0.5819 |
Epoch #5 | Train loss: 0.5073 | Test loss: 0.5808 |
Epoch #6 | Train loss: 0.4867 | Test loss: 0.5806 |
Epoch #7 | Train loss: 0.4832 | Test loss: 0.5810 |
Epoch #8 | Train loss: 0.4810 | Test loss: 0.5816 |
Epoch #9 | Train loss: 0.4794 | Test loss: 0.5820 |
Epoch #10 | Train loss: 0.4779 | Test loss: 0.5826 |


In [34]:
checkpoint = torch.load(os.path.join("weights/fm/", "5.pt"))
factorization_machine.model.load_state_dict(checkpoint['model_state_dict'])
factorization_machine.model.eval()

FactorizationMachine(
  (bias): Embedding(9776, 1)
  (embeddings): Embedding(9776, 280)
)

### 3.4 Transformer

#### 3.4.1 Sequential Data Preparation

In [35]:
window_len = 10
last_n = 1

In [36]:
movielens_df['Rating'] = np.where(movielens_df.Rating.values < 4, 0, 1)

In [37]:
movielens_df['valid'] = False
last_ratings = movielens_df.groupby('UserID').filter(lambda x: len(x) >= last_n).sort_values('Timestamp').groupby('UserID').tail(last_n).sort_values('UserID')
movielens_df.loc[last_ratings.index, 'valid'] = True

In [38]:
seq_ratings = movielens_df.sort_values(by='Timestamp').groupby('UserID').agg(tuple).reset_index()
seq_ratings['NumRatings'] = seq_ratings['Rating'].apply(lambda row: len(row))

In [39]:
def make_seq(value, window_len):
    sequences = list()
    for idx in range(len(value)):
        seq = value[:idx + 1]
        if len(seq) > window_len:
            seq = seq[idx - window_len + 1: idx + 1]
        elif len(seq) < window_len:
            seq = [*(['[PAD]'] * (window_len - len(seq))), *seq]
        sequences.append(seq)
    return sequences

for column in ['Title', 'Rating', 'Timestamp', 'valid']:
    seq_ratings[column] = seq_ratings[column].apply(lambda x: make_seq(x, window_len))
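To show what these windows look like, the sketch below applies the same idea to a short toy history. `sliding_windows` is a hypothetical helper written for illustration; unlike `make_seq`, it always returns plain lists.

```python
def sliding_windows(values, window_len, pad="[PAD]"):
    """For every position, return the last `window_len` items, left-padded."""
    windows = []
    for end in range(1, len(values) + 1):
        window = list(values[max(0, end - window_len):end])
        windows.append([pad] * (window_len - len(window)) + window)
    return windows

windows = sliding_windows(("A", "B", "C", "D"), window_len=3)
for w in windows:
    print(w)
# ['[PAD]', '[PAD]', 'A']
# ['[PAD]', 'A', 'B']
# ['A', 'B', 'C']
# ['B', 'C', 'D']
```

Each window ends with the item whose rating is the prediction target, and the preceding items form the behaviour history of fixed length.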

In [40]:
exploded_ratings = seq_ratings[['UserID', 'Title']].explode('Title', ignore_index=True)
dfs = [seq_ratings[[col]].explode(col, ignore_index=True) for col in ['Rating', 'Timestamp', 'valid']]
seq_df = pd.concat([exploded_ratings, *dfs], axis=1)
seq_df['valid'] = seq_df['valid'].apply(lambda x: x[-1])

In [41]:
seq_df['TargetRating'] = seq_df['Rating'].apply(lambda x: x[-1])
seq_df['PrevRatings'] = seq_df['Rating'].apply(lambda x: x[:-1])
seq_df.drop(columns=['Rating'], inplace=True)

In [42]:
seq_df['PAD'] = seq_df['Title'].apply(lambda x: (np.array(x) == '[PAD]'))
seq_df['NumPAD'] = seq_df['PAD'].apply(sum)
seq_df['PAD'] = seq_df['PAD'].apply(lambda x: x.tolist())

In [43]:
user_lookup = {str(v): i+1 for i, v in enumerate(movielens_df['UserID'].unique())}

In [44]:
def create_feature_lookup(df, feature):
    lookup = {v: i+1 for i, v in enumerate(df[feature].unique())}
    lookup['[PAD]'] = 0
    return lookup

movie_lookup = create_feature_lookup(movielens_df, 'Title')
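
The lookup reserves index 0 for `'[PAD]'`, which is what lets the movie embedding use `padding_idx=0` later on. The sketch below shows, under that assumption, how a padded window turns into integer ids and a boolean padding mask. It takes a plain list instead of a DataFrame column, and the titles are hypothetical placeholders.

```python
def create_feature_lookup(values):
    """Map each distinct value to 1..N in first-seen order; 0 is kept for '[PAD]'."""
    lookup = {v: i + 1 for i, v in enumerate(dict.fromkeys(values))}
    lookup['[PAD]'] = 0
    return lookup

# Hypothetical titles with one repeat.
toy_titles = ['Heat', 'Alien', 'Heat', 'Brazil']
toy_lookup = create_feature_lookup(toy_titles)

# A window that was left-padded to length 4.
window = ['[PAD]', '[PAD]', 'Alien', 'Brazil']
movie_ids = [toy_lookup[title] for title in window]
padding_mask = [title == '[PAD]' for title in window]

print(movie_ids)      # [0, 0, 2, 3]
print(padding_mask)   # [True, True, False, False]
```

The mask marks the positions that the attention layers should ignore.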

In [45]:
le = LabelEncoder()
movielens_df['GenderEncoded'] = le.fit_transform(movielens_df.Gender)
movielens_df['AgeEncoded'] = le.fit_transform(movielens_df.Age)
movielens_df["UserID"] = movielens_df["UserID"].astype(str)

In [46]:
feats = ['UserID', 'GenderEncoded', 'AgeEncoded', 'Occupation']
movielens_df.UserID = movielens_df.UserID.astype(int)
seq_with_user_features = pd.merge(seq_df, movielens_df[feats].drop_duplicates(), on='UserID')
seq_with_user_features.UserID = seq_with_user_features.UserID.astype(str)

In [47]:
train_df = seq_with_user_features[seq_with_user_features.valid == False]
valid_df = seq_with_user_features[seq_with_user_features.valid == True]

In [48]:
class MovieSequenceDataset(Dataset):
    def __init__(self, df, movie_lookup, user_lookup):
        super().__init__()
        self.df = df
        self.movie_lookup = movie_lookup
        self.user_lookup = user_lookup

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        data = self.df.iloc[index]
        user_id = self.user_lookup[str(data.UserID)]
        movie_ids = torch.tensor([self.movie_lookup[title] for title in data.Title])

        # '[PAD]' placeholders map to index 0, the same index as a binarized rating of 0.
        previous_ratings = torch.tensor(
            [rating if rating != "[PAD]" else 0 for rating in data.PrevRatings]
        )

        attention_mask = torch.tensor(data.PAD)
        target_rating = data.TargetRating
        encoded_features = {
            "UserID": user_id,
            "MovieIDs": movie_ids,
            "Ratings": previous_ratings,
            "Age": data["AgeEncoded"],
            "Gender": data["GenderEncoded"],
            "Occupation": data["Occupation"],
        }

        return (encoded_features, attention_mask), torch.tensor(target_rating, dtype=torch.float32)

In [49]:
train_dataset = MovieSequenceDataset(train_df, movie_lookup, user_lookup)
valid_dataset = MovieSequenceDataset(valid_df, movie_lookup, user_lookup)

In [50]:
trainloader = data.DataLoader(train_dataset, batch_size=1024, shuffle=True)
valloader = data.DataLoader(valid_dataset, batch_size=1024, shuffle=False)

#### 3.4.2 Model

In [51]:
class BSTransformer(nn.Module):
    """Implementation of Behaviour Sequence Transformer."""

    def __init__(self, n_movies, n_users, window_len=10, mlp_layers=(1024, 512, 256, 1),
                 embedding_size=120, n_transformers=1):
        """Constructor of behaviour sequence transformer.

        Parameters
        ----------
        n_movies : int
            number of movies

        n_users : int
            number of users

        window_len : int
            length of sequence to consider

        mlp_layers : sequence of int
            sizes of the layers in the MLP head

        embedding_size : int
            size of embedding to use; must be divisible by the number of attention heads (12)

        n_transformers : int
            number of transformer encoder layers
        """
        super().__init__()
        self.n_movies = n_movies
        self.n_users = n_users
        self.window_len = window_len 
        self.embedding_size = embedding_size
        self.n_transformers = n_transformers
        self.__build__(n_movies, n_users, window_len, mlp_layers, embedding_size, n_transformers)

    def __build__(self, n_movies, n_users, window_len, mlp_layers, embedding_size, n_transformers):
        """Assemble behaviour sequence transformer."""
        self.movies_embeddings = nn.Embedding(n_movies + 1, embedding_size, padding_idx=0)
        self.user_embeddings = nn.Embedding(n_users + 1, embedding_size)
        self.ratings_embeddings = nn.Embedding(3, embedding_size, padding_idx=0)
        self.sex_embeddings = nn.Embedding(3, embedding_size)
        self.occupation_embeddings = nn.Embedding(22, embedding_size)
        self.age_group_embeddings = nn.Embedding(8, embedding_size)

        self.position_embeddings = nn.Embedding(window_len, embedding_size)
        self.encoder = nn.TransformerEncoder(
            num_layers=n_transformers,
            encoder_layer=nn.TransformerEncoderLayer(
                d_model=embedding_size, nhead=12,
                dropout=0.15, batch_first=True, activation="gelu"
            )
        )
        layers = []
        # The first batch norm must match the width of the first MLP layer.
        layers.extend([nn.Linear(embedding_size * (4 + window_len), mlp_layers[0]),
                       nn.BatchNorm1d(mlp_layers[0]), nn.ReLU()])
        for layer_idx in range(0, len(mlp_layers) - 1):
            in_dim, out_dim = mlp_layers[layer_idx], mlp_layers[layer_idx + 1]
            layers.append(nn.Linear(in_dim, out_dim))
            if layer_idx == len(mlp_layers) - 2:
                break
            layers.extend([nn.BatchNorm1d(out_dim), nn.ReLU()])
        layers.append(nn.Sigmoid())
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        """Predict the probability that the user likes the last movie in the window."""
        features, mask = x
        # Follow the device of the model parameters instead of hard-coding CUDA.
        device = self.movies_embeddings.weight.device
        # Static user features, each embedded to (batch, embedding_size).
        user_id = self.user_embeddings(features["UserID"].to(device))
        age_group = self.age_group_embeddings(features["Age"].to(device))
        sex = self.sex_embeddings(features["Gender"].to(device))
        occupation = self.occupation_embeddings(features["Occupation"].to(device))
        user_features = torch.cat((user_id, sex, age_group, occupation), 1)
        # The last title of each window is the target; the rest is the history.
        movie_history = features["MovieIDs"][:, :-1].to(device)
        target_movie = features["MovieIDs"][:, -1].to(device)
        ratings = self.ratings_embeddings(features["Ratings"].to(device))
        encoded_movies = self.movies_embeddings(movie_history)
        encoded_target_movie = self.movies_embeddings(target_movie)
        positions = torch.arange(self.window_len - 1, dtype=torch.long, device=device)
        positions = self.position_embeddings(positions)
        encoded_sequence_movies_with_position_and_rating = (
            encoded_movies + ratings + positions
        )
        encoded_target_movie = encoded_target_movie.unsqueeze(1)
        transformer_features = torch.cat(
            (encoded_sequence_movies_with_position_and_rating, encoded_target_movie),
            dim=1,
        )
        transformer_output = self.encoder(
            transformer_features, src_key_padding_mask=mask.to(device)
        )
        transformer_output = torch.flatten(transformer_output, start_dim=1)
        combined_output = torch.cat((transformer_output, user_features), dim=1)
        rating = self.mlp(combined_output)
        # Squeeze only the feature dimension so the batch dimension always survives.
        rating = rating.squeeze(-1)
        return rating
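
The input width of the MLP head, `embedding_size * (4 + window_len)`, follows from the shapes in `forward`. The arithmetic below is a quick sanity check using the default hyperparameters of the class; the variable names are introduced here for illustration only.

```python
embedding_size = 120
window_len = 10
n_heads = 12          # value of `nhead` in the encoder layer
n_user_features = 4   # user id, sex, age group, occupation

# Encoder input: (window_len - 1) history items plus the target movie.
seq_len = (window_len - 1) + 1
flattened_encoder_dim = seq_len * embedding_size

# The flattened encoder output is concatenated with the four user embeddings.
mlp_input_dim = flattened_encoder_dim + n_user_features * embedding_size
assert mlp_input_dim == embedding_size * (4 + window_len)

# Multi-head attention splits the embedding evenly across heads.
assert embedding_size % n_heads == 0
head_dim = embedding_size // n_heads

print(mlp_input_dim, head_dim)  # 1680 10
```

Changing `embedding_size` to a value that is not a multiple of 12 would make the attention layer reject the configuration.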

In [52]:
class BSTransformerModel: 
    
    def __init__(self, n_movies, n_users, k, trainloader, testloader, 
                 lr=1e-3, weight_decay=1e-5, epochs=10, device="cuda", top_k=50):
        """Constructor of training routine.
        
        Parameters
        ----------
        n_movies : int
            number of movies
        
        n_users : int
            number of users

        k : int
            size of embedding to use
        
        trainloader : torch.utils.data.DataLoader 
            iterator over training data
        
        testloader : torch.utils.data.DataLoader 
            iterator over test data
        
        lr : float
            learning rate for optimizer
        
        weight_decay : float
            regularization parameter for optimizer
        
        epochs : int
            number of training epochs
        
        device : str
            device to use for training
        
        top_k : int
            number of movies to recommend
        
        Returns
        -------
        None
        """
        self.n_movies = n_movies
        self.n_users = n_users
        self.k = k
        self.trainloader = trainloader
        self.testloader = testloader
        self.lr = lr
        self.weight_decay = weight_decay
        self.epochs = epochs
        self.device = device
        self.top_k = top_k
        
        self.model = BSTransformer(n_movies, n_users, embedding_size=k).to(device)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr, weight_decay=weight_decay)
        self.scheduler = optim.lr_scheduler.MultiStepLR(self.optimizer, 
                                                        milestones=list(range(0, epochs + 1, 5))[1:], 
                                                        gamma=0.1)
        self.criterion = nn.BCELoss().to(device)
        
    def fit(self, verbose=True):
        "High-level outline of the training process."
        min_val_loss = float("inf")
        weights_dir = "./weights/bst/"
        os.makedirs(weights_dir, exist_ok=True)
        for epoch in range(self.epochs):
            train_loss = self._train_one_epoch()
            valid_loss = self._test()
            self.scheduler.step()
            if verbose:
                print(f"Epoch #{epoch + 1} | Train loss: {train_loss:.4f} | Test loss: {valid_loss:.4f} |")
            # Save a checkpoint, named by zero-based epoch, whenever validation loss improves.
            if valid_loss < min_val_loss:
                torch.save({'epoch': epoch,
                            'model_state_dict': self.model.state_dict(),
                            'optimizer_state_dict': self.optimizer.state_dict(),
                            'loss': valid_loss,
                            },
                           os.path.join(weights_dir, f"{epoch}.pt")
                )
                min_val_loss = valid_loss
        return train_loss, valid_loss
    
    def _train_one_epoch(self):
        "Train model for one epoch on train dataset."
        train_loss = 0
        self.model.train()
        for x, y in self.trainloader:
            self.optimizer.zero_grad()
            y_hat = self.model(x)
            loss = self.criterion(y_hat, y.to(self.device))
            # Weight the batch-mean loss by batch size to get a per-sample average.
            train_loss += loss.item() * len(x[0]['UserID'])
            loss.backward()
            self.optimizer.step()
        return train_loss / len(self.trainloader.dataset)
    
    def _test(self):
        "Test model on validation dataset."
        test_loss = 0
        self.model.eval()
        with torch.no_grad():
            for x, y in self.testloader:
                y_hat = self.model(x)
                loss = self.criterion(y_hat, y.to(self.device))
                test_loss += loss.item() * len(x[0]['UserID'])
        return test_loss / len(self.testloader.dataset)
    
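    def predict(self, x):
        """Return indices of the top-k recommended movies for the user in `x`.

        NOTE: this routine expects the model to expose `embeddings` and `bias`
        modules, as FactorizationMachine does; BSTransformer defines neither.
        It also relies on the globals `movielens_df_indexed` and `device`.
        """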
        user_id = int(x.UserID_index.unique())
        age = int(x.Age_index.unique())
        gender = int(x.Gender_index.unique())
        occupation = int(x.Occupation_index.unique())

        movies = movielens_df_indexed.drop_duplicates('MovieID_index').copy()
        movie_embeddings = self.model.embeddings(torch.tensor(movies['MovieID_index'].values,device=device).long())
        movie_biases = self.model.bias(torch.tensor(movies['MovieID_index'].values,device=device).long())

        user_embedding = self.model.embeddings(torch.tensor(user_id,device=device))
        age_embedding = self.model.embeddings(torch.tensor(age,device=device))
        gender_embedding = self.model.embeddings(torch.tensor(gender,device=device))
        occupation_embedding = self.model.embeddings(torch.tensor(occupation,device=device))

        metadata_embedding = user_embedding + age_embedding + gender_embedding + occupation_embedding
        rankings = movie_biases.squeeze() + (metadata_embedding * movie_embeddings).sum(1)
        recommendations = [i for i in movies.iloc[rankings.argsort(descending=True).cpu()]['MovieID_index'].values][:self.top_k]
        return recommendations
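
The `MultiStepLR` milestones in the constructor are built as `list(range(0, epochs + 1, 5))[1:]`. The sketch below traces, in plain Python, which learning rate each epoch would use under that schedule, assuming `scheduler.step()` is called once at the end of every epoch as in `fit`. The helper `lr_schedule` is hypothetical and only mimics the decay rule; it does not call PyTorch.

```python
def lr_schedule(base_lr, epochs, step=5, gamma=0.1):
    """Return the milestones and the learning rate in effect during each epoch."""
    milestones = list(range(0, epochs + 1, step))[1:]
    lrs = []
    lr = base_lr
    for epoch in range(epochs):
        lrs.append(lr)
        # After the scheduler step that ends this epoch, decay if a milestone is reached.
        if (epoch + 1) in milestones:
            lr *= gamma
    return milestones, lrs

milestones, lrs = lr_schedule(base_lr=1e-3, epochs=10)
print(milestones)  # [5, 10]
for epoch, lr in enumerate(lrs, start=1):
    print(f"Epoch #{epoch}: lr = {lr:.0e}")
```

With the defaults, epochs 1 to 5 train at `1e-3` and epochs 6 to 10 at `1e-4`; the second milestone only takes effect after the final epoch.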

In [53]:
transformer = BSTransformerModel(len(movie_lookup), len(user_lookup), 120, trainloader, valloader)

In [55]:
_, _ = transformer.fit()


Epoch #1 | Train loss: 0.5704 | Test loss: 0.5553 |
Epoch #2 | Train loss: 0.5471 | Test loss: 0.5384 |
Epoch #3 | Train loss: 0.5347 | Test loss: 0.5319 |
Epoch #4 | Train loss: 0.5272 | Test loss: 0.5347 |
Epoch #5 | Train loss: 0.5214 | Test loss: 0.5290 |
Epoch #6 | Train loss: 0.5022 | Test loss: 0.5258 |
Epoch #7 | Train loss: 0.4954 | Test loss: 0.5275 |
Epoch #8 | Train loss: 0.4899 | Test loss: 0.5267 |
Epoch #9 | Train loss: 0.4849 | Test loss: 0.5286 |
Epoch #10 | Train loss: 0.4795 | Test loss: 0.5278 |
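
The log prints one-based epoch numbers, while `fit` names checkpoints by the zero-based loop index. The short sketch below replays the checkpointing rule on the validation losses printed above to show which files would be written and which one holds the best model; the variable names are introduced for this illustration.

```python
# Validation ("Test") losses copied from the training log, epochs #1 to #10.
valid_losses = [0.5553, 0.5384, 0.5319, 0.5347, 0.5290,
                0.5258, 0.5275, 0.5267, 0.5286, 0.5278]

# Replay the rule from fit(): save whenever the validation loss improves.
saved_epochs = []
min_val_loss = float("inf")
for epoch, loss in enumerate(valid_losses):  # zero-based, as in fit()
    if loss < min_val_loss:
        saved_epochs.append(epoch)
        min_val_loss = loss

best_epoch = saved_epochs[-1]
checkpoint_name = f"{best_epoch}.pt"
print(saved_epochs)     # [0, 1, 2, 4, 5]
print(checkpoint_name)  # 5.pt
```

The lowest validation loss (0.5258) is reported as "Epoch #6", which corresponds to the file `5.pt`.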


In [56]:
checkpoint = torch.load(os.path.join("weights/bst/", "5.pt"), map_location=device)
transformer.model.load_state_dict(checkpoint['model_state_dict'])
transformer.model.eval()

BSTransformer(
  (movies_embeddings): Embedding(3708, 120, padding_idx=0)
  (user_embeddings): Embedding(6041, 120)
  (ratings_embeddings): Embedding(3, 120, padding_idx=0)
  (sex_embeddings): Embedding(3, 120)
  (occupation_embeddings): Embedding(22, 120)
  (age_group_embeddings): Embedding(8, 120)
  (position_embeddings): Embedding(10, 120)
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=120, out_features=120, bias=True)
        )
        (linear1): Linear(in_features=120, out_features=2048, bias=True)
        (dropout): Dropout(p=0.15, inplace=False)
        (linear2): Linear(in_features=2048, out_features=120, bias=True)
        (norm1): LayerNorm((120,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((120,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.15, inplace=False)
        (dropout2): 

## 4. Assess Model

In [57]:
accuracy = compute_mean_accuracy(val_movielens_df, item_popularity_model)
print(f"Accuracy of Item Popularity model is {accuracy * 100:.2f}%")

Accuracy of Item Popularity model is 74.11%


In [58]:
accuracy = compute_mean_accuracy(val_movielens_df, mlp_model)
print(f"Accuracy of Multilayer Perceptron is {accuracy * 100:.2f}%")

Accuracy of Multilayer Perceptron is 85.97%


In [59]:
accuracy = compute_mean_accuracy(val_movielens_df, factorization_machine)
print(f"Accuracy of Factorization Machine is {accuracy * 100:.2f}%")

Accuracy of Factorization Machine is 89.46%


In [60]:
accuracy = compute_mean_accuracy(val_movielens_df, transformer)
print(f"Accuracy of Behaviour Sequence Transformer is {accuracy * 100:.2f}%")

Accuracy of Behaviour Sequence Transformer is 93.15%
