# Neural Collaborative Filtering With Time Decay
Our team has developed a recommendation model using Neural Collaborative Filtering (NCF), combining Matrix Factorisation with a Neural Network to perform collaborative filtering. This hybrid approach enables the model to capture both linear and non-linear relationships, enhancing its ability to predict user preferences effectively. We’ve further incorporated a Time Decay factor into the predicted watch ratio, which prioritises newer content by giving it a higher predicted watch ratio. This enables our recommendation system to prioritise trending or recent videos.

## Set up

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Set your root directory below. Make sure the `/data` and `/data_exports` folders are uploaded and is situated in this directory.

In [None]:
# Adjust your root directory
root = '/content/drive/MyDrive/KuaiRec/'

## Load the Train and Validation Dataset

In [1]:
import numpy as np
import pandas as pd
import torch
import os
import torch.nn as nn
import torch.optim as optim

from sklearn.preprocessing import LabelEncoder, StandardScaler
from torch.utils.data import Dataset, DataLoader

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Training data
train = pd.read_csv(root + "data_exports/joined_train_data_segmented.csv")
val = pd.read_csv(root + "data_exports/joined_val_data_FE.csv")

print(f'Total number of training data: {len(train)}')
print(f'Total number of validation data: {len(val)}')

Total number of training data: 2552082
Total number of validation data: 1376299


In [3]:
train.head()

Unnamed: 0,user_id,video_id,time,watch_ratio,user_active_degree,is_lowactive_period,is_live_streamer,is_video_author,follow_user_num,fans_user_num,...,avg_daily_watch_time,top_3_categories,cluster,News_Politics,Auto_Tech,Lifestyle,Sports_Fitness,Entertainment,Culture,Others
0,14,148,2020-07-05 05:27:48.378,0.722103,full_active,0,0,1,73,6,...,8360719000000.0,"['Car', 'Pets', 'Real estate家居']",0,0,1,1,0,0,0,1
1,14,183,2020-07-05 05:28:00.057,1.907377,full_active,0,0,1,73,6,...,8360719000000.0,"['Car', 'Pets', 'Real estate家居']",0,0,1,1,0,0,0,1
2,14,3649,2020-07-05 05:29:09.479,2.063311,full_active,0,0,1,73,6,...,8360719000000.0,"['Car', 'Pets', 'Real estate家居']",0,0,1,1,0,0,0,1
3,14,5262,2020-07-05 05:30:43.285,0.566388,full_active,0,0,1,73,6,...,8360719000000.0,"['Car', 'Pets', 'Real estate家居']",0,0,1,1,0,0,0,1
4,14,8234,2020-07-05 05:35:43.459,0.418364,full_active,0,0,1,73,6,...,8360719000000.0,"['Car', 'Pets', 'Real estate家居']",0,0,1,1,0,0,0,1


## Deriving Video Age

To calculate the age of each video in days, we first require a reference date, which serves as a baseline for computing the time decay in our model. For consistency, we assume this date to be the day of the latest recorded interaction within the dataset.

In [4]:
# Convert type to datetime
train['time'] = pd.to_datetime(train['time'])

# Assume current date is the next day of the last date
CURRENT_DATE_TRAIN = pd.to_datetime(train['time'].dt.date.max())

# Just the date portion
print(f'Current date: {CURRENT_DATE_TRAIN}')

Current date: 2020-08-03 00:00:00


We are then able to calculate the age of each video in the training dataset.

In [5]:
video_info = pd.read_csv(root + 'data/item_daily_features.csv', usecols=['video_id', 'upload_dt']).drop_duplicates()

video_info['upload_dt'] = pd.to_datetime(video_info['upload_dt'])

In [6]:
# Get video age for training data
video_info_train = video_info[video_info['video_id'].isin(train['video_id'].unique())]
video_info_train['video_age'] = (CURRENT_DATE_TRAIN - video_info_train['upload_dt']).dt.days
video_age_dict = video_info_train.set_index('video_id')['video_age'].to_dict()    # Convert to dictionary

## Preprocessing for Feeding into the Neural Network Component of NCF

### One hot encode categorical variables

Since neural networks require numerical inputs, we need to transform categorical variables, like `user_active_degree` and `time_period`, into a numerical format. We achieve this by one-hot encoding, which converts each category into a distinct binary vector.

In [7]:
# One hot encode 'user_active_degree', 'time_period'
train_processed = pd.get_dummies(train, columns=['user_active_degree', 'time_period'])

# Remove the column for user_active_degree = UNKNOWN
train_processed = train_processed.drop(columns=['user_active_degree_UNKNOWN'])

### Scale continuous variables

Our continuous features currently have varying scales: for instance, as seen below, `follow_user_num` is typically in the tens, whereas `like_cnt` can range from tens to hundreds of thousands. Without scaling, these differences would lead to imbalances during training, causing certain features to disproportionately influence the model. To address this, we apply feature scaling to standardise the values, ensuring each feature contributes more equally to model learning and improving overall performance.

In [8]:
train_processed[['follow_user_num',
       'fans_user_num', 'friend_user_num', 'register_days', 'video_duration',
       'show_cnt', 'play_cnt', 'like_cnt', 'comment_cnt',
       'share_cnt', 'follow_cnt', 'collect_cnt', 'count_afternoon_views', 'count_evening_views', 'count_midnight_views',
       'count_morning_views', 'avg_daily_watch_time']].describe().round(3)

Unnamed: 0,follow_user_num,fans_user_num,friend_user_num,register_days,video_duration,show_cnt,play_cnt,like_cnt,comment_cnt,share_cnt,follow_cnt,collect_cnt,count_afternoon_views,count_evening_views,count_midnight_views,count_morning_views,avg_daily_watch_time
count,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0
mean,53.814,3.873,1.332,265.334,11647.906,6959049.23,7052437.0,204477.96,8935.899,3805.251,20932.721,285.876,465.834,280.911,457.937,610.06,8062631000000.0
std,141.89,9.717,4.925,264.071,13441.156,9275604.9,9511481.0,320943.089,21119.828,12695.305,63310.055,1337.505,284.492,238.512,433.983,330.571,706882700000.0
min,0.0,0.0,0.0,8.0,3066.0,644.0,331.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4632392000000.0
25%,7.0,0.0,0.0,119.0,7333.0,832913.0,762922.0,15528.0,345.0,64.0,1002.0,5.0,249.0,84.0,53.0,374.0,7686325000000.0
50%,15.0,1.0,0.0,200.0,9383.0,3127692.0,3071419.0,73590.0,2171.0,414.0,4968.0,28.0,444.0,225.0,356.0,569.0,8158000000000.0
75%,43.0,4.0,1.0,302.0,11500.0,9372330.0,9544620.0,251209.0,8918.0,2275.0,17978.0,133.0,656.0,419.0,748.0,806.0,8518700000000.0
max,1811.0,251.0,71.0,2002.0,294520.0,65255077.0,64795780.0,2762854.0,338365.0,206105.0,1215372.0,29197.0,1477.0,1435.0,1852.0,1727.0,12772440000000.0


In [9]:
scaler = StandardScaler()

columns_to_scale = ['follow_user_num',
       'fans_user_num', 'friend_user_num', 'register_days', 'video_duration',
       'show_cnt', 'play_cnt', 
       'like_cnt', 'comment_cnt',
       'share_cnt', 'follow_cnt', 'collect_cnt', 
       'total_connections',
       'watch_frequency', 
       'count_afternoon_views', 'count_evening_views', 'count_midnight_views',
       'count_morning_views', 
       'avg_daily_watch_time', 
       ]

train_processed[columns_to_scale] = scaler.fit_transform(train_processed[columns_to_scale])

We now see that the mean of all features is (close to) 0 with standard deviation 1

In [10]:
# Round to 3 dp
train_processed[['follow_user_num',
       'fans_user_num', 'friend_user_num', 'register_days', 'video_duration',
       'show_cnt', 'play_cnt', 'like_cnt', 'comment_cnt',
       'share_cnt', 'follow_cnt', 'collect_cnt', 'count_afternoon_views', 'count_evening_views', 'count_midnight_views',
       'count_morning_views', 'avg_daily_watch_time']].describe().round(3)

Unnamed: 0,follow_user_num,fans_user_num,friend_user_num,register_days,video_duration,show_cnt,play_cnt,like_cnt,comment_cnt,share_cnt,follow_cnt,collect_cnt,count_afternoon_views,count_evening_views,count_midnight_views,count_morning_views,avg_daily_watch_time
count,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0,2552082.0
mean,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-0.379,-0.399,-0.27,-0.974,-0.638,-0.75,-0.741,-0.637,-0.423,-0.3,-0.331,-0.214,-1.637,-1.178,-1.055,-1.845,-4.853
25%,-0.33,-0.399,-0.27,-0.554,-0.321,-0.66,-0.661,-0.589,-0.407,-0.295,-0.315,-0.21,-0.762,-0.826,-0.933,-0.714,-0.532
50%,-0.274,-0.296,-0.27,-0.247,-0.169,-0.413,-0.419,-0.408,-0.32,-0.267,-0.252,-0.193,-0.077,-0.234,-0.235,-0.124,0.135
75%,-0.076,0.013,-0.067,0.139,-0.011,0.26,0.262,0.146,-0.001,-0.121,-0.047,-0.114,0.668,0.579,0.668,0.593,0.645
max,12.384,25.433,14.146,6.577,21.045,6.285,6.071,7.971,15.598,15.935,18.867,21.616,3.554,4.839,3.212,3.379,6.663


## Create the Dataset Class

To encapsulate the dataset effectively, we implement a KuaiShouDataset class. This class holds essential components, including encoders for user and video IDs, user and video features, the target variable (`watch_ratio`), and a dictionary for video_age. By centralising these elements, the KuaiShouDataset class streamlines data access and processing, allowing efficient and more convinient handling of features and targets.

In [11]:
class KuaiShouDataset(Dataset):
    def __init__(self, data, user_id_col, video_id_col, user_feature_cols, video_feature_cols, watch_ratio_col, video_age_dict):
        self.user_feature_cols = user_feature_cols
        self.video_feature_cols = video_feature_cols

        # Initialise and fit LabelEncoders
        self.user_encoder = LabelEncoder()
        self.video_encoder = LabelEncoder()
        
        self.user_indices = torch.tensor(self.user_encoder.fit_transform(data[user_id_col]), dtype=torch.long)
        self.video_indices = torch.tensor(self.video_encoder.fit_transform(data[video_id_col]), dtype=torch.long)

        # Convert user and video features and watch ratios to tensors
        self.user_features = torch.tensor(data[user_feature_cols].values.astype(np.float32), dtype=torch.float32)
        self.video_features = torch.tensor(data[video_feature_cols].values.astype(np.float32), dtype=torch.float32)
        self.watch_ratios = torch.tensor(data[watch_ratio_col].values.astype(np.float32), dtype=torch.float32)

        # Time related features
        self.video_age_dict = video_age_dict

    def __len__(self):
        return len(self.user_indices)

    def __getitem__(self, idx):
        return self.user_indices[idx], self.video_indices[idx], self.user_features[idx], self.video_features[idx], self.watch_ratios[idx]

    def inverse_transform_user_ids(self, encoded_user_idx):
        """Decode encoded user indices to original user_ids."""
        return self.user_encoder.inverse_transform(encoded_user_idx)
    
    def inverse_transform_video_ids(self, encoded_video_idx):
        """Decode encoded video indices to original video_ids."""
        return self.video_encoder.inverse_transform(encoded_video_idx)
    
    def get_video_age(self, video_idx):
        """Get video age."""
        video_ids = self.inverse_transform_video_ids(video_idx)

        ages = []
        for i in range(len(video_idx)):
            ages.append(self.video_age_dict[video_ids[i]])
        return torch.tensor(ages, dtype=torch.float32)
    
    def get_decoded_user_video_pairs(self):
        """Get decoded user-video pairs."""
        return self.inverse_transform_user_ids(self.user_indices), self.inverse_transform_video_ids(self.video_indices)

## Defining the Model Architecture

We designed our Neural Collaborative Filtering (NCF) model with a multi-branch architecture, combining a Generalised Matrix Factorization (GMF) component with two Multi-Layer Perceptron (MLP) components. Each branch serves a distinct purpose: the GMF branch captures linear interactions between users and videos, while the MLP branches—one for embeddings and one for additional features—model complex, non-linear interactions.
- **GMF Component**: This branch computes element-wise interactions between user and video embeddings to capture collaborative signals directly.
- **MLP Components**: We use two MLP branches. One MLP processes separate user and video embeddings, while the other processes user and video features together. In each MLP layer, we apply Batch Normalisation and ReLU activation to stabilise and enhance learning, along with Dropout to mitigate overfitting.

The outputs from the GMF and MLP components are concatenated, then passed through a fully connected layer to produce the final predicted watch ratio. By blending GMF and MLP outputs, our model can capture both linear and complex patterns.

_(insert diagram here)_

In [12]:
class NCF(nn.Module):
    def __init__(self, num_users, num_videos, embedding_dim, num_user_features, num_video_features, dropout):
        super(NCF, self).__init__()

        # Hyperparameters
        self.dropout = dropout
        
        # GMF Components for embeddings
        self.user_embeddings_gmf = nn.Embedding(num_users, embedding_dim)
        self.video_embeddings_gmf = nn.Embedding(num_videos, embedding_dim)

        # MLP Components for embeddings
        self.user_embeddings_mlp = nn.Embedding(num_users, embedding_dim)
        self.video_embeddings_mlp = nn.Embedding(num_videos, embedding_dim)

        # MLP layers for user and video embeddings
        self.fc1_mlp = nn.Linear(2 * embedding_dim, 128)
        self.fc2_mlp = nn.Linear(128, 64)

        # MLP layers for user and video features
        self.user_video_features_fc1 = nn.Linear(num_user_features + num_video_features, 128)
        self.user_video_features_fc2 = nn.Linear(128, 64)

        # Final layers combining GMF, MLP for embeddings, and additional features
        self.fc1_combined = nn.Linear(embedding_dim + 64 + 64, 128)
        self.fc2_combined = nn.Linear(128, 1)

    def forward(self, user_idx, video_idx, user_features, video_features):
        ####### GMF Embedding branch #######
        user_emb_gmf = self.user_embeddings_gmf(user_idx)
        video_emb_gmf = self.video_embeddings_gmf(video_idx)
        gmf_output = user_emb_gmf * video_emb_gmf                                   # dimension: (batch_size, embedding_dim)

        ####### MLP Embedding branch #######
        user_emb_mlp = self.user_embeddings_mlp(user_idx)
        video_emb_mlp = self.video_embeddings_mlp(video_idx)
        mlp_input = torch.cat([user_emb_mlp, video_emb_mlp], dim=-1)                # dimension: (batch_size, 2 * embedding_dim)

        # First fully connected layer with BatchNorm and ReLU
        mlp_output = self.fc1_mlp(mlp_input)
        if self.training:
            mlp_output = nn.BatchNorm1d(128)(mlp_output)
        mlp_output = torch.relu(mlp_output)
        mlp_output = nn.Dropout(self.dropout)(mlp_output)

        # Second fully connected layer with BatchNorm and ReLU
        mlp_output = self.fc2_mlp(mlp_output)                                       # dimension: (batch_size, 64)
        if self.training:
            mlp_output = nn.BatchNorm1d(64)(mlp_output)
        mlp_output = torch.relu(mlp_output)
        mlp_output = nn.Dropout(self.dropout)(mlp_output)

        ####### MLP Feature processing branch #######
        user_video_features = torch.cat([user_features, video_features], dim=-1)

        # First fully connected layer with BatchNorm and ReLU
        user_video_features_processed = self.user_video_features_fc1(user_video_features)  # dimension: (batch_size, 128)
        if self.training:
            user_video_features_processed = nn.BatchNorm1d(128)(user_video_features_processed)
        user_video_features_processed = torch.relu(user_video_features_processed)
        user_video_features_processed = nn.Dropout(self.dropout)(user_video_features_processed)

        # Second fully connected layer with BatchNorm and ReLU
        user_video_features_processed = self.user_video_features_fc2(user_video_features_processed)  # dimension: (batch_size, 64)
        if self.training:
            user_video_features_processed = nn.BatchNorm1d(64)(user_video_features_processed)
        user_video_features_processed = torch.relu(user_video_features_processed)
        user_video_features_processed = nn.Dropout(self.dropout)(user_video_features_processed)

        ####### Combine GMF, MLP, and additional features #######
        combined_input = torch.cat([gmf_output, mlp_output, user_video_features_processed], dim=-1)
        combined_output = self.fc1_combined(combined_input)
        combined_output = torch.relu(combined_output)
        combined_output = nn.Dropout(self.dropout)(combined_output)

        combined_output = self.fc2_combined(combined_output)
        combined_output = torch.sigmoid(combined_output) * 5
        
        return combined_output.squeeze()

## Building the Recommendation System

We encapsulate the recommendation system in a class, `KuaiShou_NCF_RecSys`, designed to handle the training and prediction phases efficiently. This class is built to accept training data and model specifications and enables scalable batch-based predictions for each user-video combination.

In [13]:
class KuaiShou_NCF_RecSys:
    def __init__(self, dataset_train: KuaiShouDataset, model: nn.Module, embedding_dim: int, dropout: float, decay: float):
        self.dataset = dataset_train
        self.num_users = len(dataset_train.user_encoder.classes_)
        self.num_videos = len(dataset_train.video_encoder.classes_)
        self.num_user_features = len(dataset_train.user_feature_cols)
        self.num_video_features = len(dataset_train.video_feature_cols)
        
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # Move model to GPU if available
        
        # Initialise the model
        self.model: nn.Module = model(self.num_users, self.num_videos, embedding_dim, self.num_user_features, self.num_video_features, dropout)

        # Time decay constants
        self.decay = decay

    def train(self, batch_size: int, num_epochs: int, lr: int, criterion, optimizer):
        # Initialise the DataLoader
        train_loader = DataLoader(self.dataset, batch_size=batch_size, shuffle=True)

        self.model.to(self.device)
        print(f"Model moved to {self.device}")

        # Optimizer and loss function
        optimizer = optimizer(self.model.parameters(), lr=lr)
        criterion = criterion

        # Training loop
        for epoch in range(num_epochs):
            self.model.train()
            total_loss = 0
            
            for user_idx, video_idx, user_features, video_features, watch_ratio in train_loader:
                user_idx, video_idx, user_features, video_features, watch_ratio = user_idx.to(self.device), video_idx.to(self.device), user_features.to(self.device), video_features.to(self.device), watch_ratio.to(self.device)
                
                # Forward pass
                optimizer.zero_grad()
                outputs = self.model(user_idx, video_idx, user_features, video_features)
                loss = criterion(outputs, watch_ratio)

                # Backward pass and optimization
                loss.backward()
                optimizer.step()

                # Accumulate loss for reporting
                total_loss += loss.item()

            # Print loss for each epoch
            avg_loss = total_loss / len(train_loader)
            print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

    def predict(self, batch_size=1024):
        """
        Generates a dataframe with predicted watch ratios for each user-video pair in batches.
        """
        self.model.eval()  # Set model to evaluation mode
        predictions_df = pd.DataFrame(columns=['user_id', 'video_id', 'watch_ratio'])

        for start_user_idx in range(0, self.num_users, batch_size):
            end_user_idx = min(start_user_idx + batch_size, self.num_users)
            user_indices = torch.arange(start_user_idx, end_user_idx, dtype=torch.long).to(self.device)

            # Gather user features in batch
            user_features_batch = self.dataset.user_features[user_indices].to(self.device)
            
            for video_idx in range(self.num_videos):
                video_indices_tensor = torch.tensor([video_idx], dtype=torch.long).to(self.device)
                video_feature_tensor = self.dataset.video_features[video_idx].unsqueeze(0).to(self.device)
                video_age = self.dataset.get_video_age(video_indices_tensor).to(self.device)
                
                # Repeat video data for the entire user batch
                video_tensor_batch = video_indices_tensor.expand(len(user_indices))
                video_feature_batch = video_feature_tensor.expand(len(user_indices), -1)
                video_age_batch = video_age.expand(len(user_indices))

                # Predict in batch
                with torch.no_grad():
                    predicted_watch_ratios = self.model(user_indices, video_tensor_batch, user_features_batch, video_feature_batch)

                # Apply time decay
                decay_weights = self.calculate_exponential_weight(video_age_batch)
                predicted_watch_ratios = predicted_watch_ratios * decay_weights
                
                # Append predictions to DataFrame
                batch_predictions_df = pd.DataFrame({'user_id': self.dataset.inverse_transform_user_ids(user_indices),
                                                    'video_id': self.dataset.inverse_transform_video_ids(video_tensor_batch),
                                                    'watch_ratio': predicted_watch_ratios.cpu().numpy()})
                predictions_df = pd.concat([predictions_df, batch_predictions_df])
            
        return predictions_df
    
    def get_parameters(self):
        """
        Returns the model parameters.
        """
        return self.model.state_dict()
    
    def calculate_exponential_weight(self, video_age_days):
        """
        Returns the decay weight based on the defined decay constant and the number of days since the video has been uploaded.
        """
        return torch.exp(-self.decay * video_age_days)

## Fitting the Training Data to the Model and Generating Predictions

We define the user and features which will be used in our model.

In [14]:
# Define the columns for user and video features in the user-item interaction data
user_cols = ['is_lowactive_period',
             'is_live_streamer', 'is_video_author', 'follow_user_num',
             'fans_user_num', 'friend_user_num', 'register_days', 'is_new_user',
             'total_connections', 'is_content_creator',
             'watch_frequency', 'is_weekend_interaction', 'is_weekend',
             'count_afternoon_views', 'count_evening_views', 'count_midnight_views', 'count_morning_views', 
             'avg_daily_watch_time', 
             'user_active_degree_full_active', 'user_active_degree_high_active', 'user_active_degree_middle_active', 
             'time_period_afternoon', 'time_period_evening', 'time_period_midnight', 'time_period_morning'
            ]
video_cols = ['video_duration', 'show_cnt', 'play_cnt', 
              'like_cnt', 'comment_cnt', 'share_cnt', 'follow_cnt', 'collect_cnt', 
              'News_Politics', 'Auto_Tech', 'Lifestyle', 'Sports_Fitness', 'Entertainment', 'Culture', 'Others',
            ]

Let's create a function which allows us to train and predict using the NCF model.

In [15]:
def train_and_predict(hyperparameters: dict, train_data: pd.DataFrame, video_age_train_dict, **kwargs):
    cluster = kwargs.get('cluster', None)

    # Set seed for reproducibility
    torch.manual_seed(0)

    BATCH_SIZE = hyperparameters['batch_size']
    NUM_EPOCHS = hyperparameters['num_epochs']
    LEARNING_RATE = hyperparameters['lr']
    EMBEDDING_DIM = hyperparameters['embedding_dim']
    DROPOUT = hyperparameters['dropout']
    DECAY = hyperparameters['decay']

    # Loss function and optimizer
    criterion = nn.MSELoss()
    optimiser = optim.Adam

    print(f"----- Training {'' if cluster == None else f'for cluster {cluster} '}-----")

    # Create the dataset
    dataset_train = KuaiShouDataset(train_data, 'user_id', 'video_id', user_cols, video_cols, 'watch_ratio', video_age_train_dict)

    # Initialise the NCF model
    print("Initialising...")
    ncf_rec_sys = KuaiShou_NCF_RecSys(dataset_train, NCF, EMBEDDING_DIM, DROPOUT, DECAY)

    # Train on data
    ncf_rec_sys.train(BATCH_SIZE, NUM_EPOCHS, LEARNING_RATE, criterion, optimiser)

    # Generate predictions
    print("Generating predictions...")
    predictions_df = ncf_rec_sys.predict()
    
    print("Complete!")
    return cluster, predictions_df

### Hyperparameter Tuning

#### Grid Search

In our customer segmentation process, we divided users into four distinct clusters based on their behavioral patterns. Each cluster exhibits unique characteristics, enabling us to tailor training specifically to each segment.

By training our model within each cluster, we enhance its ability to capture subtle patterns unique to each group. This approach not only improves model performance by focusing on relevant behaviors but also reduces computational complexity by minimising the dataset size for each cluster. This method allows for more efficient training and provides more personalised recommendations for each user segment.

After training with segmentation, we will also train the model on the full dataset, without segmentation, to compare and evaluate the impact of clustering on recommendation performance.

In addition, we will be performing a grid search across several hyperparameters to identify the best-performing hyperparameters. In doing so, we have held out another 20% of the latest user-video interactions from the training data for 'internal validation' and selection of the best hyperparameters, based on the average watch_ratio @ 100. This prevents data leakage and tuning on the validation/test sets.

However, do note that the notebook is ran across several days, across several PCs, and it may take a long time to re-run these parameters. Therefore, we recommend skipping running the section, and go straight into the final training of the model on the entire train set.

In [18]:
import itertools

In [None]:
def train_by_cluster_and_without(params: dict, data: pd.DataFrame, video_age_dict: dict,
                                train_by_cluster: bool = True, train_without_clustering: bool = False, 
                                is_final: bool = False):
    param_str = '_'.join([f'{key}{val}' for key, val in params.items()])
    
    # Split data based on time (last 20% for validation)
    time_threshold = np.percentile(data['time'], 80)
    train_data = data[data['time'] <= time_threshold]
    val_data = data[data['time'] > time_threshold]

    # Create directory to save predictions
    if not os.path.exists(root + 'recommendations'):
            os.makedirs(root + 'recommendations')
    
    # Train for each cluster
    if train_by_cluster:
        cluster_predictions = {}
        cluster_val_metrics = []
        
        for cluster in sorted(train_data['cluster'].unique()):
            train_cluster = train_data[train_data['cluster'] == cluster]
            val_cluster = val_data[val_data['cluster'] == cluster]
            
            # Train and get predictions
            cluster, predictions_df = train_and_predict(params, train_cluster, video_age_dict, 
                                                      validation_data=val_cluster, **{'cluster': cluster})
            
            # Store predictions
            cluster_predictions[cluster] = predictions_df
            
            # Calculate validation metrics for this cluster
            if len(val_cluster) > 0:
                val_predictions = predictions_df[
                    predictions_df['user_id'].isin(val_cluster['user_id']) & 
                    predictions_df['video_id'].isin(val_cluster['video_id'])
                ]
                
                # Calculate watch_ratio@100 for validation set
                top_100_preds = val_predictions.nlargest(100, 'watch_ratio')['video_id']
                avg_watch_ratio = val_data[val_data['video_id'].isin(top_100_preds)]['watch_ratio'].mean()
                
                cluster_val_metrics.append({
                    'cluster': cluster,
                    'watch_ratio@100': avg_watch_ratio,
                    'val_size': len(val_cluster)
                })
        
        # Combine predictions
        watch_ratio_predictions_df = pd.DataFrame()
        for cluster, df in cluster_predictions.items():
            cluster_predictions_df = df
            cluster_predictions_df['cluster'] = cluster
            watch_ratio_predictions_df = pd.concat([watch_ratio_predictions_df, cluster_predictions_df])
        
        # Calculate overall weighted validation metric
        if cluster_val_metrics:
            metrics_df = pd.DataFrame(cluster_val_metrics)
            weighted_watch_ratio = np.average(
                metrics_df['watch_ratio@100'],
                weights=metrics_df['val_size']
            )
            print(f"Overall validation watch_ratio@100: {weighted_watch_ratio:.4f}")
            
            # Save validation metrics
            metrics_df['params'] = param_str
            if not os.path.exists(root + 'metrics'):
                os.makedirs(root + 'metrics')
            metrics_df.to_csv(root + f'metrics/validation_metrics_{param_str}.csv', index=False)
        
        # Save predictions
        output_file = root + f'recommendations/w_clustering_{param_str}.csv' if not is_final else root + f'recommendations/final_w_clustering_{param_str}.csv'
        watch_ratio_predictions_df.to_csv(output_file, index=False)
        print(f'Predictions with segmentation saved to {output_file}')
        
        return weighted_watch_ratio
    
    # Train without clustering
    if train_without_clustering:
        _, predictions_df = train_and_predict(params, train_data, video_age_dict, validation_data=val_data)
        
        # Calculate validation metrics
        val_predictions = predictions_df[
            predictions_df['user_id'].isin(val_data['user_id']) & 
            predictions_df['video_id'].isin(val_data['video_id'])
        ]
        
        # Calculate watch_ratio@100 for validation set
        top_100_preds = val_predictions.nlargest(100, 'watch_ratio')
        avg_watch_ratio = top_100_preds['watch_ratio'].mean()
        print(f"Validation watch_ratio@100: {avg_watch_ratio:.4f}")
        
        # Save predictions
        output_file = root + f'recommendations/wo_clustering_{param_str}.csv' if not is_final else root + f'recommendations/final_wo_clustering_{param_str}.csv'
        predictions_df.to_csv(output_file, index=False)
        print(f'Predictions without segmentation saved to {output_file}')
        
        return avg_watch_ratio

In [None]:
hyperparameters = {
    'batch_size': [128, 256, 512],
    'num_epochs': [10, 20, 30, 40],
    'lr': [0.0001, 0.001, 0.01],
    'embedding_dim': [64, 128, 256, 512],
    'dropout': [0.0, 0.1, 0.3, 0.5],
    'decay': [0.01, 0.05, 0.10]
}

# Generate all possible combinations of hyperparameters
param_combinations = list(itertools.product(*hyperparameters.values()))

# Train for each combination of hyperparameters
best_params = None
best_watch_ratio = 0

for params in param_combinations:
    params_dict = {key: val for key, val in zip(hyperparameters.keys(), params)}
    print(f"Training with hyperparameters: {params_dict}")
    
    watch_ratio = train_by_cluster_and_without(params_dict, train_processed, video_age_dict, train_by_cluster=True, train_without_clustering=False, is_final=False)
    
    if watch_ratio > best_watch_ratio:
        best_watch_ratio = watch_ratio
        best_params = params_dict
        print(f"New best parameters found! Watch_ratio@100: {watch_ratio:.4f}")

# Train final model with best parameters
print(f"\nTraining final model with best parameters: {best_params}")
train_by_cluster_and_without(best_params, train_processed, video_age_dict, train_by_cluster=True, train_without_clustering=False, is_final=True)

Training with hyperparameters: {'batch_size': 128, 'num_epochs': 10, 'lr': 0.0001, 'embedding_dim': 64, 'dropout': 0.0, 'decay': 0.01}
----- Training for cluster 0 -----
Initialising...
Model moved to cpu
Epoch [1/10], Loss: 0.4112
Epoch [2/10], Loss: 0.3614
Epoch [3/10], Loss: 0.3537
Epoch [4/10], Loss: 0.3489
Epoch [5/10], Loss: 0.3445
Epoch [6/10], Loss: 0.3406
Epoch [7/10], Loss: 0.3364
Epoch [8/10], Loss: 0.3328
Epoch [9/10], Loss: 0.3291
Epoch [10/10], Loss: 0.3255
Generating predictions...
Complete!
----- Training for cluster 1 -----
Initialising...
Model moved to cpu
Epoch [1/10], Loss: 0.2193
Epoch [2/10], Loss: 0.1793
Epoch [3/10], Loss: 0.1742
Epoch [4/10], Loss: 0.1715
Epoch [5/10], Loss: 0.1691
Epoch [6/10], Loss: 0.1670
Epoch [7/10], Loss: 0.1651
Epoch [8/10], Loss: 0.1632
Epoch [9/10], Loss: 0.1614
Epoch [10/10], Loss: 0.1601
Generating predictions...
Complete!
----- Training for cluster 2 -----
Initialising...
Model moved to cpu
Epoch [1/10], Loss: 0.2738
Epoch [2/10], 

#### [TO BE COMPLETED] Evaluation metrics on validation set and choose final set of parameters

## Tuned Model

We have identified the set of parameters. We will therefore proceed to train the model using this set of hyperparameters on both the train and validation sets.

### Combining train and validation data
Since the user's cluster as well as the top 3 regrouped categories are only present in the trian dataset, we need to merge it into the validation set as well.

In [17]:
segmentation_categories_columns = ['user_id', 'cluster', 'News_Politics', 'Auto_Tech', 'Lifestyle', 'Sports_Fitness', 'Entertainment', 'Culture', 'Others']

user_segments_and_categories = train[segmentation_categories_columns].drop_duplicates()

# Merge into validation data
val_merged = val.merge(user_segments_and_categories, on='user_id', how='left')

# Combine train and validation data
train_val = pd.concat([train, val_merged])

### Preprocessing combined data

Similar to what we have done to the training data above, we get the new reference date and video ages.

In [18]:
# Convert type to datetime
train_val['time'] = pd.to_datetime(train_val['time'])

# Assume current date is the next day of the last date
CURRENT_DATE_VAL = train_val['time'].dt.date.max()

# Just the date portion
print(f'Current date with validation: {CURRENT_DATE_VAL}')

Current date with validation: 2020-08-19


In [19]:
# Get video age for training + val data
video_info_train_val = video_info[video_info['video_id'].isin(train_val['video_id'].unique())]
video_info_train_val['video_age'] = (CURRENT_DATE_VAL - video_info_train_val['upload_dt'].dt.date).dt.days

video_age_dict_new = video_info_train_val.set_index('video_id')['video_age'].to_dict()    # Convert to dictionary

Once again, one hot encoding of categorical variables as well as scaling of numerical variables is performed.

In [20]:
# One hot encode 'user_active_degree', 'time_period'
train_val_processed = pd.get_dummies(train_val, columns=['user_active_degree', 'time_period'])

# Remove the column for user_active_degree = UNKNOWN
train_val_processed = train_val_processed.drop(columns=['user_active_degree_UNKNOWN'])

In [21]:
scaler_new = StandardScaler()

columns_to_scale = ['follow_user_num',
       'fans_user_num', 'friend_user_num', 'register_days', 'video_duration',
       'show_cnt', 'play_cnt', 
       'like_cnt', 'comment_cnt',
       'share_cnt', 'follow_cnt', 'collect_cnt', 
       'total_connections',
       'watch_frequency', 
       'count_afternoon_views', 'count_evening_views', 'count_midnight_views',
       'count_morning_views', 
       'avg_daily_watch_time', 
       ]

train_val_processed[columns_to_scale] = scaler_new.fit_transform(train_val_processed[columns_to_scale])

### Train and predict with chosen parameters

The model is then trained on the combined dataset, using the final set of parameters we have selected.

In [None]:
tuned_params = {
    'batch_size': 512,
    'num_epochs': 20,
    'lr': 0.001,
    'embedding_dim': 64,
    'dropout': 0.3,
    'decay': 0.01
}

train_by_cluster_and_without(tuned_params, train_val_processed, video_age_dict_new, train_by_cluster=True, train_without_clustering=True, is_final=True)

----- Training for cluster 0 -----
Initialising...
Model moved to cpu
Epoch [1/20], Loss: 0.3898
Epoch [2/20], Loss: 0.3540
Epoch [3/20], Loss: 0.3437
Epoch [4/20], Loss: 0.3380
Epoch [5/20], Loss: 0.3343
Epoch [6/20], Loss: 0.3315
Epoch [7/20], Loss: 0.3291
Epoch [8/20], Loss: 0.3265
Epoch [9/20], Loss: 0.3235
Epoch [10/20], Loss: 0.3204
Epoch [11/20], Loss: 0.3171
Epoch [12/20], Loss: 0.3138
Epoch [13/20], Loss: 0.3097
Epoch [14/20], Loss: 0.3058
Epoch [15/20], Loss: 0.3015
Epoch [16/20], Loss: 0.2976
Epoch [17/20], Loss: 0.2929
Epoch [18/20], Loss: 0.2890
Epoch [19/20], Loss: 0.2846
Epoch [20/20], Loss: 0.2804
Generating predictions...
Complete!
----- Training for cluster 1 -----
Initialising...
Model moved to cpu
Epoch [1/20], Loss: 0.2172
Epoch [2/20], Loss: 0.1828
Epoch [3/20], Loss: 0.1749
Epoch [4/20], Loss: 0.1721
Epoch [5/20], Loss: 0.1703
Epoch [6/20], Loss: 0.1691
Epoch [7/20], Loss: 0.1676
Epoch [8/20], Loss: 0.1662
Epoch [9/20], Loss: 0.1649
Epoch [10/20], Loss: 0.1635
Ep