<a href="https://colab.research.google.com/github/januverma/transformers-for-sequential-recommendation/blob/main/Transformer_For_Sequential_Recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sequential Recommendation

Existing works on recommendation systems can be divided into two paradigms:
- **General recommendation** mine static pair-wise correlations between entities from the historical interactions, and are modeled by the collaborative filtering models. 
- **Sequential recommendation** aims to predict users’ next actions based on the sequential interactions in the past. Most sequential recommendation models involve sequential data mining techniques e.g. Markov chains, RNNs, self-attention etc.

While general recommendation methods consider user-item relationships as static and ignore the dynamics & the evolution of users’ preferences, the sequential recommendation models take into consideration the short-term user behaviour and are oblivious to the global user preferences which have stabilized over time.

Recently, Transformer architecture has been shown to have superior performance for sequential data modelling. The transformer architecture lends itself to efficient parallelization and is effective at modeling long-range sequences.In this notebook, we will build a sequential recommendation model using transformer architecture. 

In [1]:
import pandas as pd
import numpy as np
import math

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

We will use the [MovieLens]( https://files.grouplens.org/datasets/movielens) dataset which is a popular benchmark for recommendation systems research. We will use the 1-Million review dataset. 



In [3]:
data_path = 'ml-1m'

In [4]:
! wget https://files.grouplens.org/datasets/movielens/ml-1m.zip

--2023-01-10 13:03:27--  https://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip.1’


2023-01-10 13:03:30 (3.37 MB/s) - ‘ml-1m.zip.1’ saved [5917549/5917549]



In [6]:
! unzip ml-1m.zip

Archive:  ml-1m.zip
   creating: ml-1m/
  inflating: ml-1m/movies.dat        
  inflating: ml-1m/ratings.dat       
  inflating: ml-1m/README            
  inflating: ml-1m/users.dat         


Loading data into `pandas` 

In [4]:
ratings_path = data_path + '/ratings.dat'
ratings = pd.read_csv(ratings_path, header=None, sep='::', engine='python', names=['userId', 'movieId', 'rating', 'timestamp'])
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [5]:
ratings.shape

(1000209, 4)

The data has over 1M ratings by more than 6k users for 3.7k movies.

## User Interaction Sequences
Following the standard protocol for research in sequential recommendation, we treat the presence of a rating as implict feedback i.e. the user interacted with the movie and use the `timestamps` to determine the sequential structure of the interactions. 

In [6]:
ratings = ratings.sort_values(by=['timestamp'])

In [7]:
user_seq_df = ratings.groupby('userId').agg({'movieId': lambda x:list(x), 'timestamp': lambda x:list(x)})
user_seq_df.head()

Unnamed: 0_level_0,movieId,timestamp
userId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,"[3186, 1721, 1270, 1022, 2340, 1836, 3408, 120...","[978300019, 978300055, 978300055, 978300055, 9..."
2,"[1198, 1217, 1210, 2717, 1293, 2943, 1225, 119...","[978298124, 978298151, 978298151, 978298196, 9..."
3,"[593, 2858, 3534, 1968, 1961, 1431, 1266, 3671...","[978297018, 978297039, 978297068, 978297068, 9..."
4,"[1210, 1097, 480, 3468, 3527, 1196, 260, 1198,...","[978293924, 978293964, 978294008, 978294008, 9..."
5,"[2717, 919, 908, 356, 1250, 2188, 2858, 1127, ...","[978241072, 978241072, 978241072, 978241112, 9..."


We have converted users' interactions with movies into sequences (time-ordered) of movies. 

## Pre-processing Interaction Sequences

To keep things simple, we will work with shorter sequences (`len <= 50`). 

In [8]:
MAX_LENGTH = 50
BATCH_SIZE = 8

For partitioning into train-val-test sets, we split the historical sequence for each user into three parts:
- the most recent action for testing
- the second most recent action for validation
- all remaining actions for training. 

Note that during testing, the input sequences contain training actions
and the validation action.

In [9]:
train_seqs = user_seq_df['movieId'].apply(lambda x:x[:-2]).values
train_seqs.shape, len(train_seqs[1])

((6040,), 127)

In [10]:
def create_valid_seqs(seq):
  if len(seq) > MAX_LENGTH:
    seq_len = len(seq) - MAX_LENGTH - 2
    return seq[seq_len: -1]
  else:
    return seq[:-1]

len(create_valid_seqs(list(range(58))))

51

In [11]:
valid_seqs = user_seq_df['movieId'].apply(lambda x:create_valid_seqs(x)).values
valid_seqs.shape, len(valid_seqs[0])

((6040,), 51)

In [12]:
def create_test_seqs(seq):
  if len(seq) > MAX_LENGTH:
    seq_len = len(seq) - MAX_LENGTH - 1
    return seq[seq_len:]
  else:
    return seq

len(create_test_seqs(list(range(58))))

51

### Dataloaders for train and valid data

We leverage PyTorch `dataloaders` to generate batches of sequences for training and evaluation. 

In [13]:
class TrainSequenceData(Dataset):
    def __init__(self, user_seqs, maxlen):
        '''
        Input is a list of lists where each list contains movie ids for a specific user. 
        '''
        self.seqs = []
        self.labels = []
        self.maxlen = maxlen
        for x in user_seqs:
          seqs, labels = self.build_examples(x, maxlen)
          self.seqs.extend(seqs)
          self.labels.extend(labels)

    def build_examples(self, x, max_len):
        seqs = []
        labels = []
        for i in range(0, len(x) - 1, max_len):
          seq_len = min(max_len, len(x) - 1 - i)
          src = x[i:i+seq_len]
          tgt = x[i+1:i+1+seq_len]
          seqs.append(src)
          labels.append(tgt)
        return seqs, labels

    def __len__(self):
        return len(self.seqs)
    
    def __getitem__(self, index):
        seq = self.seqs[index]
        label = self.labels[index]
        return (seq, label)


In [14]:
def generate_square_subsequent_mask(sz):
    """
    Generates an upper-triangular matrix of ones, with zeros on diag.
    Shape max_length * max_length
    """
    return torch.triu(torch.ones(sz, sz, dtype=torch.bool), diagonal=1)

def generate_key_padding_mask(max_len, seq_lengths):
    """
    Generates a tensor of shape batch_size * max_length
    """
    padding_mask = torch.zeros(seq_lengths.shape[0], max_len)
    for i,x in enumerate(seq_lengths):
        padding_mask[i][x:] = 1
    return padding_mask

In [15]:
def collate_fn(batch, max_len=MAX_LENGTH):
    seqs = []
    labels = []
    seq_lens = []

    for (_seq, _label) in batch:
        label_tensor = torch.tensor(_label, dtype=torch.int64)
        labels.append(label_tensor)
        seq_tensor = torch.tensor(_seq, dtype=torch.int64)
        seqs.append(seq_tensor)
        seq_lens.append(seq_tensor.size(0))

    seq_lens = torch.tensor(seq_lens, dtype=torch.int64)
    padded_seqs = pad_sequence(seqs, batch_first=True, padding_value=0)
    padded_labels = pad_sequence(labels, batch_first=True, padding_value=0).reshape(-1)
    
    # max_len = seq_lens.max().item()
    attn_mask = generate_square_subsequent_mask(max_len)
    key_padding_mask = generate_key_padding_mask(max_len, seq_lens)
    return padded_seqs, attn_mask, key_padding_mask, padded_labels

In [16]:
train_ds = TrainSequenceData(train_seqs, maxlen=MAX_LENGTH)
train_dataloader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
len(train_dataloader)

2819

In [17]:
class ValidTestSequenceData(Dataset):
    def __init__(self, user_seqs, maxlen):
        '''
        Input is a list of lists where each list contains movie ids for a specific user. 
        '''
        self.seqs = []
        self.labels = []
        self.maxlen = maxlen
        for x in user_seqs:
          if len(x) > 2:
            seq, label = self.build_examples(x, maxlen)
            self.seqs.append(seq)
            self.labels.append(label)

    def build_examples(self, x, max_len):
        seq = x[:-1]
        label = x[1:]
        return seq, label

    def __len__(self):
        return len(self.seqs)
    
    def __getitem__(self, index):
        seq = self.seqs[index]
        label = self.labels[index]
        return (seq, label)


In [18]:
valid_ds = ValidTestSequenceData(valid_seqs, maxlen=MAX_LENGTH)
valid_dataloader = DataLoader(valid_ds, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
len(valid_dataloader)

750

## Model Architecture

Here we will implement a transformer-based sequential recommendation model from scratch using PyTorch. For more detailed description of the model components and their implementation from scratch, refer to notebooks in this [GitHub repository](https://github.com/januverma/transformers-stuff) 

In [19]:
class MultiHeadSelfAttention(nn.Module):
    '''
    Implements MHSA using the PyTorch MultiheadAttention Layer.
    '''
    def __init__(self, hidden_dim, num_heads, dropout):
        '''
        Arguments:
            hidden_dim: Dimension of the output of the self-attention.
            num_heads: Number of heads for the multi-head attention. 
            dropout: Dropout probability for the self-attention. If `0.0` then no dropout will be used.
            
        Returns:
            A tensor of shape `num_tokens x hidden_size` containing output of the MHSA for each token.
        '''
        super().__init__()
        if hidden_dim % num_heads != 0:
            print('The hidden size {} is not a multiple of the number of heads {}'.format(hidden_dim, num_heads))
        self.attention_layer = nn.MultiheadAttention(hidden_dim, num_heads, dropout=dropout, batch_first=True)
    def forward(self, x, key_padding_mask=None, attention_mask=None):
        '''
        Arguments:
            x: Tensor containing input token embeddings.
            key_padding_mask: Mask indicating which elements within the input sequence to be considered as padding and ignored for the computation of self-attention scores.  
            attention_mask: Mask indicating which relative positions are allowed to attend.  
        '''
        return self.attention_layer(query=x, key=x, value=x, key_padding_mask=key_padding_mask, attn_mask=attention_mask)


In [20]:
class FeedForward(nn.Module):
    '''
    Implements the feed-forward component of the transfomer model.
    '''
    def __init__(self, input_dim, hidden_dim, dropout=0.0):
        '''
        Arguments:
            input_dim: Dimension of the token embedding, output of the MHSA layer.
            hidden_dim: Hidden size of the Transformer that this feed-forward layer is part of.
            dropout: Dropout probability to use for the projected activations. If `0.0` then no dropout will be used.
        Returns:
            A tensor of shape `num_tokens x hidden_dim` containing projections for each token.
        '''
        super().__init__()
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.layer_1 = nn.Linear(input_dim, hidden_dim)
        self.layer_2 = nn.Linear(hidden_dim, input_dim)
    def forward(self, x):
        x = self.layer_1(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.layer_2(x)
        return x

In [21]:
class TransformerLayerNorm(nn.Module):
    '''
    Implements LayerNorm for self-attention and feed-forward networks.

    Arguments:
        input_dim: Input dimension.
    
    Returns:
        A normalized tensor of the same dimension as the input. 
    '''
    def __init__(self, input_dim):
        super().__init__()
        self.layer_norm = nn.LayerNorm(input_dim)
    def forward(self, x):
        x = x.to(self.layer_norm.weight.dtype)
        return self.layer_norm(x) 

In [22]:
class TransformerLayer(nn.Module):
    '''
    A transformer layer which is a sequential model consisting of self-attention, layer norm, residual connection, feed-forward projection, layer norm, residual connection. 
    
    Arguments:
        hidden_dim: Hidden dimension transformer layers.  
        num_heads: Number of attention heads. 
        attn_dropout: Dropout for MHSA layers. 
        ffn_dropout: Dropout for feed-forward layers.
    Returns:
        A tensor containing attention scores for each token. 
        attn_weights: A tensor of shape `num_tokens x num_tokens` containing the attention weights. 
    '''
    def __init__(self, hidden_dim, num_heads, attn_dropout=0.0, ffn_dropout=0.0):
        super().__init__()
        self.attn_layer = MultiHeadSelfAttention(hidden_dim, num_heads, dropout=attn_dropout)
        self.ffn_layer = FeedForward(hidden_dim, hidden_dim, dropout=ffn_dropout)
        self.layer_norm = TransformerLayerNorm(hidden_dim)
    def forward(self, x, key_padding_mask=None, attention_mask=None):
        attn_out, attn_weights = self.attn_layer(x, key_padding_mask, attention_mask)
        x = self.layer_norm(x + attn_out)
        ffn_out = self.ffn_layer(x)
        x = self.layer_norm(x + ffn_out)
        return x, attn_weights

In [23]:
class TransformerEncoder(nn.Module):
    '''
    Transformer Encoder which is composed for a stack of TransformerLayers. 

    Arguments:
        num_layers: Number of Transformer layers in the encoder. 
        hidden_dim: Hidden dimension of the transformer layers.  
        num_heads: Number of heads. 
        attn_dropout: Dropout for MHSA layers. 
        ffn_dropout: Dropout for feed-forward layers.
    '''
    def __init__(self, num_layers, hidden_dim, num_heads, attn_dropout=0.0, ffn_dropout=0.0):
        super().__init__()
        self.layers = nn.ModuleList([TransformerLayer(hidden_dim, num_heads, attn_dropout, ffn_dropout) for _ in range(num_layers)])
        self.attn_weights = []
    def forward(self, x, key_padding_mask=None, attention_mask=None):
        for layer in self.layers:
            x, weights = layer(x, key_padding_mask, attention_mask)
            self.attn_weights.append(weights)
        return x
    def get_attention_weights(self):
        if len(self.attn_weights) != 0:
            return self.attn_weights
        else:
            print("The model hasn't been training yet")

In [24]:
class PositionalEncoding(nn.Module):
    '''
    Implements the sinusoidal positional encoding for the input tokens. 

    Arguments:
        embed_dim: Dimension of the positional encoding, should be the same as input token embedding. 
        dropout: Dropout probability to be used for positional encoding. 
        max_len: Maximum length of the input token sequences. 
    Returns:
      A tensor containing positional embeddings for each token.
    '''
    def __init__(self, embed_dim, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_dim, 2) * (-math.log(10000.0)/embed_dim))
        pe = torch.zeros(max_len, 1, embed_dim)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)
    def forward(self, x):
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

In [25]:
class MySASRec(nn.Module):
    '''
    SASRec-like recommendation model.

    Arguments:
        vocab_size: Size of vocabulary.
        embed_dim: Dimension of the input token embedding. 
        num_layers: Number of Transformer layers in the encoder. 
        hidden_dim: Hidden dimension of the transformer layers.  
        ffn_hidden_dim: Hidden dimension of the Feed-forward layers. 
        num_heads: Number of heads. 
        attn_dropout: Dropout for MHSA layers. 
        ffn_dropout: Dropout for feed-forward layers.
    '''
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, attn_dropout, ffn_dropout):
        super(MySASRec, self).__init__()
        self.embedding_layer = nn.Embedding(vocab_size, embed_dim)
        self.pos_encoder = PositionalEncoding(embed_dim)
        self.transformer_encoder = TransformerEncoder(num_layers, embed_dim, num_heads, attn_dropout, ffn_dropout)
        self.decoder = nn.Linear(embed_dim, vocab_size)
        self.embed_layer_norm = nn.LayerNorm(embed_dim)
        self.init_weights()

    def init_weights(self):
        init_range = 0.5
        self.decoder.weight.data.uniform_(-init_range, init_range)
        self.decoder.bias.data.zero_()
        
    
    def forward(self, seqs, attn_mask=None, key_padding_mask=None):
        embedded_seq = self.embedding_layer(seqs)
        embedded_seq = self.pos_encoder(embedded_seq)
        embedded_seq = self.embed_layer_norm(embedded_seq)
        out = self.transformer_encoder(x=embedded_seq, key_padding_mask=key_padding_mask, attention_mask=attn_mask)
        results = self.decoder(out)
        return results
      
    def get_attn_weights(self):
      return self.transformer_encoder.get_attention_weights()

## Training Module

In [26]:
num_users = max(ratings.userId)
num_movies = max(ratings.movieId) + 1

num_users, num_movies

(6040, 3953)

In [27]:
from torch.nn import CrossEntropyLoss
from torch.optim import SGD, lr_scheduler, Adam
from torch.nn.utils import clip_grad_norm_

In [28]:
def train(dataloader, model, criterion, optimizer, device):
    """
    Trains an epoch.
    """
    losses = []
    running_losses = []
    model.train()
    for i, batch in enumerate(dataloader):
        padded_seqs, attn_mask, key_padding_mask, out = batch
        padded_seqs = padded_seqs.to(device)
        attn_mask = attn_mask.to(device)
        key_padding_mask = key_padding_mask.to(device)
        out = out.to(device)

        logits = model(padded_seqs, attn_mask, key_padding_mask)
        J = criterion(logits.view(-1, num_movies), out)
        losses.append(J.item())
        running_losses.append(J.item())
        optimizer.zero_grad()
        J.backward()
        clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        if i%1000 == 0:
            print('|{:5d}/{:5d} batches done | loss {:8.3f}'.format(i, len(dataloader), torch.tensor(running_losses).mean()))
            running_losses = []
    
    epoch_loss = torch.tensor(losses).mean()
    return epoch_loss

In [29]:
def evaluate(dataloader, model, criterion, device):
    """
    Trains an epoch.
    """
    losses = []
    model.eval()
    with torch.no_grad():
      for i, batch in enumerate(dataloader):
          padded_seqs, attn_mask, key_padding_mask, out = batch
          padded_seqs = padded_seqs.to(device)
          attn_mask = attn_mask.to(device)
          key_padding_mask = key_padding_mask.to(device)
          out = out.to(device)

          logits = model(padded_seqs, attn_mask, key_padding_mask)
          J = criterion(logits.view(-1, num_movies), out)
          losses.append(J.item())

    epoch_loss = torch.tensor(losses).mean()
    return epoch_loss

## Training

In [30]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda', index=0)

In [31]:
torch.cuda.empty_cache()

In [32]:
import time

In [33]:
EPOCHS = 10
LR = 0.001
rec_model = MySASRec(vocab_size=num_movies, embed_dim=50, num_layers=2, num_heads=2, attn_dropout=0.2, ffn_dropout=0.2).to(device)

criterion = CrossEntropyLoss(ignore_index=0)
optimizer = Adam(rec_model.parameters(), lr=LR, betas=(0.9, 0.98), eps=1e-08)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

In [40]:
for epoch in range(EPOCHS):
        epoch_start_time = time.time()
        train_loss = train(train_dataloader, rec_model, criterion, optimizer, device)
        val_loss = evaluate(valid_dataloader, rec_model, criterion, device)

        print('-' * 59)
        print('| end of epoch {:3d} | time: {:5.2f}s | val loss {:8.3f}'.\
            format(epoch, time.time() - epoch_start_time, val_loss))
        print('-' * 59)
        scheduler.step()


|    0/ 2819 batches done | loss    5.680
| 1000/ 2819 batches done | loss    6.050
| 2000/ 2819 batches done | loss    6.049
-----------------------------------------------------------
| end of epoch   0 | time: 25.29s | val loss    6.049
-----------------------------------------------------------
|    0/ 2819 batches done | loss    5.663
| 1000/ 2819 batches done | loss    6.021
| 2000/ 2819 batches done | loss    6.025
-----------------------------------------------------------
| end of epoch   1 | time: 25.63s | val loss    6.024
-----------------------------------------------------------
|    0/ 2819 batches done | loss    5.590
| 1000/ 2819 batches done | loss    6.000
| 2000/ 2819 batches done | loss    6.004
-----------------------------------------------------------
| end of epoch   2 | time: 25.10s | val loss    6.002
-----------------------------------------------------------
|    0/ 2819 batches done | loss    5.613
| 1000/ 2819 batches done | loss    5.980
| 2000/ 2819 bat

This concludes the training process. We trained a simplified model (2 transformer layers with 2 attention heads) model for 50 epochs. In practice, much deeper versions (e.g. BERT has 12 transformer layers each with 12 heads) are trained for a longer periods of time to achieve desired results.  

## Evaluation on the test data

In [35]:
test_seqs = user_seq_df['movieId'].apply(lambda x:create_test_seqs(x)).values
test_seqs.shape, len(test_seqs[0])

((6040,), 51)

In [36]:
test_ds = ValidTestSequenceData(test_seqs, maxlen=MAX_LENGTH)
test_dataloader = DataLoader(test_ds, batch_size=1, shuffle=False, collate_fn=collate_fn)
len(test_dataloader)

6040

Our metric of choice is HitRate@k which is defined as 

In [37]:
def calcualte_hit_rate(rec_model, test_loader, k=10):
  total = 0
  cnt = 0
  for batch in test_dataloader:
    padded_seqs, attn_mask, key_padding_mask, out = batch
    logits = rec_model(padded_seqs.to(device))
    logits = logits.squeeze()
    pred = logits[-1].argsort(dim=0, descending=True)
    if out[-1].item() in pred[:k].cpu().numpy():
      cnt += 1
      total += 1
    else:
      total += 1
  return {'total instances': total, 'hits@k': cnt}


In [41]:
calcualte_hit_rate(rec_model, test_dataloader, k=20)

{'total instances': 6040, 'hits@k': 1264}

In [42]:
print('hit rate @ 20 after 50 epochs of training is : {}'.format(1264/6040))

hit rate @ 20 after 50 epochs of training is : 0.20927152317880796
