# Building a Collaborative Filtering news recommender
In this notebook we pre-process and train a news recommender system based on the MIND dataset.

> MIcrosoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of Microsoft News website. The mission of MIND is to serve as a benchmark dataset for news recommendation and facilitate the research in news recommendation and recommender systems area.

We are using a small and pre-processed version of the dataset to train a matrix-factorization recommender system from scratch. 

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch.nn as nn
import pytorch_lightning as pl
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch
from collections import Counter
import torchmetrics #accuracy
import ast
from pytorch_lightning import seed_everything

import os
os.environ['MASTER_PORT'] = '12356'  # Choose an available port


#### From documentation:  
The behaviours.cvs file contains the users' news click histories. It has 5 columns divided by the tab symbol:

- User ID. The anonymous ID of a user.  
- Epochhrs. Time passed in hrssince UNIX Epoch (00:00:00 UTC, 1 January 1970)
- Click. Id of the article that was clicked on. The dataset was prepared in a way that would only put one click per row.

In [2]:
behaviour = pd.read_csv("/kaggle/input/mind-processed-daniel/behaviour.csv")

print('Number of interactions in the behaviour dataset:', behaviour.shape[0])
print('Number of users in the behaviour dataset:', behaviour.userId.nunique())
print('Number of articles in the behaviour dataset:', behaviour.click.nunique())

behaviour.head()

Number of interactions in the behaviour dataset: 781871
Number of users in the behaviour dataset: 49832
Number of articles in the behaviour dataset: 2451


Unnamed: 0,epochhrs,userId,click
0,437073.0,U13740,N55689
1,437106.0,U91836,N17059
2,437143.0,U73700,N23814
3,437069.0,U34670,N49685
4,437083.0,U19739,N33619


## Train/Test Split

In [3]:
# Split dataset in train and test based on time
test_time_th = behaviour['epochhrs'].quantile(0.9)
train = behaviour[behaviour['epochhrs']< test_time_th].copy()

## Indexing

Before we carry on to define our first model we first need to apply indexizing for the users and items in the behaviour dataframe,
as pytorch requires integer indicies instead of strings for user and item IDs. 

We do this by two dictionaries:

- `ind2item`: mapping the item indicies given in behaviour to the real item Id given in the dataset.
- `ind2user`: mapping the user indicies given in behaviour to the real user Id given in the dataset.

Note that we also create `item2ind` and `user2ind` to do the reverse.

The indexing will be created based on the training data, where new unseen articles in the validation set will get the index 0.
We will use 90% for training 10% for validation, when we split the data it's important to make use of temporal `epochhrs` to divide the data, as a regular random split in this case does not make sense in recommender systems.


In [4]:
## Indexize items
# Allocate a unique index for each item, but let the zeroth index be a UNK index:
ind2item = {idx +1: itemid for idx, itemid in enumerate(train.click.unique())}
item2ind = {itemid : idx for idx, itemid in ind2item.items()}


## Indexize users
# Allocate a unique index for each user, but let the zeroth index be a UNK index:
ind2user = {idx +1: userid for idx, userid in enumerate(train['userId'].unique())}
user2ind = {userid : idx for idx, userid in ind2user.items()}


In [5]:
## Apply indexization

# Create a new column with userIdx:
train['userIdx'] = train['userId'].map(lambda x: user2ind.get(x,0))
train['click'] = train['click'].map(lambda item: item2ind.get(item, 0))

# Repeat for validation
valid =  behaviour[behaviour['epochhrs']>= test_time_th].copy()
valid["click"] = valid["click"].map(lambda item: item2ind.get(item, 0))
valid["userIdx"] = valid["userId"].map(lambda x: user2ind.get(x,0))

In [6]:
train

Unnamed: 0,epochhrs,userId,click,userIdx
0,437073.0,U13740,1,1
3,437069.0,U34670,2,2
4,437083.0,U19739,3,3
5,437076.0,U8355,4,4
8,437075.0,U53231,5,5
...,...,...,...,...
781866,437016.0,U66493,1944,49395
781867,437016.0,U66493,863,49395
781868,437016.0,U66493,1231,49395
781869,437016.0,U72015,1592,49396


In [7]:
valid

Unnamed: 0,epochhrs,userId,click,userIdx
1,437106.0,U91836,0,7849
2,437143.0,U73700,0,33857
6,437110.0,U46596,0,8643
7,437122.0,U79199,0,13953
9,437145.0,U89744,0,909
...,...,...,...,...
168640,437124.0,U66493,262,49395
168643,437105.0,U17467,0,21097
168644,437152.0,U72015,279,49396
168645,437127.0,U44625,0,22094


# Modeling & Negative sampling
We want to make a matrix factorization model where each user $u$ has a d-dimensional parameter vector $z_u$ and each item $i$ has a parameter vector $v_i$.

Second, we do not have a `noclicks` for every `click` interaction we will only utilize two **known** things in the training phase: The item the `userIdx` and `click`. However, as we want to model the binary behavior in terms of clicks and non-clicks we will make use of something called negative sampling. With negative sampling - we will draw a sample a random negative item for each known user-click combination to express  the lack of preference by the user for the sampled item.

In [8]:
class MindDataset(Dataset):
    # A fairly simple torch dataset module that can take a pandas dataframe (as above), 
    # and convert the relevant fields into a dictionary of arrays that can be used in a dataloader
    def __init__(self, df):
        # Create a dictionary of tensors out of the dataframe
        self.data = {
            'userIdx' : torch.tensor(df.userIdx.values.astype(np.int64)),
            'click' : torch.tensor(df.click.values.astype(np.int64))
        }
    def __len__(self):
        return len(self.data['userIdx'])
    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.data.items()}

In [9]:
# Build datasets and dataloaders of train and validation dataframes:
bs = 1024
ds_train = MindDataset(train)
train_loader = DataLoader(ds_train, batch_size=bs, shuffle=True)
ds_valid = MindDataset(valid)
valid_loader = DataLoader(ds_valid, batch_size=bs, shuffle=False)

batch = next(iter(train_loader))

## Model

#### Framework
We will use pytorch-lightning to define and train our model. It is a high-level framework (similar to fastAI) but with a slightly different way of defining things. It is my personal go-to framework and is very flexible. For more information, see https://pytorch-lightning.readthedocs.io/.

#### The model
We assume that each interaction goes as follow: the user is presented with two items: the click and no-click item, where the no-click item will be randomly chosen with negative sampling. After the user reviewed both items, she will choose the most relevant one. This can be modeled as a categorical distirbution with two options (yes, you could do binomial). There is a loss function in pytorch for this already, called the `F.binary_cross_entropy` that we will use.

In [10]:
import torch.nn.functional as F
import torch
import pytorch_lightning as pl
from torch import nn

# Defining the NewsMF class inheriting from pl.LightningModule
class NewsMF(pl.LightningModule):
    def __init__(self, num_users, num_items, dim=10):
        super().__init__()
        # Initializing dimensions, number of users, and number of items
        self.dim = dim
        self.num_users = num_users
        self.num_items = num_items
        
        # Creating embeddings for users and items
        self.useremb = nn.Embedding(num_embeddings=num_users, embedding_dim=dim)
        self.itememb = nn.Embedding(num_embeddings=num_items, embedding_dim=dim)

    # Forward function to compute relevancy scores between users and items
    def forward(self, userIdx, itemIdx):
        # Getting user vectors from user embeddings
        uservec = self.useremb(userIdx)
        # Getting item vectors from item embeddings
        itemvec = self.itememb(itemIdx)
        # Computing the dot product and applying sigmoid to get relevancy scores
        scores = torch.sigmoid((uservec * itemvec).sum(-1).unsqueeze(-1))
        return scores

    # Step function for both training and validation steps
    def step(self, batch, batch_idx, phase="train"):
        # Compute scores for clicked items using the forward function
        score_click = self.forward(batch['userIdx'], batch['click'])
        
        # Sample negative (not clicked) items and compute their scores
        neg_sample = torch.randint_like(batch["click"], 1, self.num_items)
        score_noclick = self.forward(batch['userIdx'], neg_sample)
        
        # Concatenate scores for clicked and not clicked items
        scores_all = torch.concat((score_click, score_noclick), dim=1)
        # Create target labels for clicked (1) and not clicked (0) items
        target_all = torch.concat((torch.ones_like(score_click), torch.zeros_like(score_noclick)), dim=1)
        # Compute binary cross-entropy loss
        loss = F.binary_cross_entropy(scores_all, target_all)
        
        # Log the loss for training or validation phase
        if phase == "train":
            self.log('train_loss', loss)
        elif phase == "val":
            self.log('val_loss', loss)
        
        return loss
    
    # Training step that calls the step function with phase="train"
    def training_step(self, batch, batch_idx):
        return self.step(batch, batch_idx, "train")
    
    # Validation step that calls the step function with phase="val"
    def validation_step(self, batch, batch_idx):
        return self.step(batch, batch_idx, "val")

    # Configuring the optimizer for the model
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer
    
    def training_epoch_end(self, outputs):
        # Calculate the average training loss for the epoch
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        self.log('epoch_avg_train_loss', avg_loss)

    def validation_epoch_end(self, outputs):
        # Calculate the average validation loss for the epoch
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        self.log('epoch_avg_val_loss', avg_loss)

In [11]:
from pytorch_lightning import Callback
import pandas as pd

# Create a logger
class EpochMetricsLogger(Callback):
    def __init__(self):
        super().__init__()
        self.metrics = []

    def on_epoch_end(self, trainer, pl_module):
        # Collect current epoch metrics from both training and validation
        metrics = trainer.logged_metrics
        self.metrics.append(metrics)

    def on_train_end(self, trainer, pl_module):
        # Convert collected metrics into a DataFrame for pretty printing
        metrics_df = pd.DataFrame(self.metrics)
        # Optionally, you can customize the DataFrame here (e.g., select specific columns, rename, etc.)
        print("Epoch Metrics Summary:")
        print(metrics_df)

In [12]:
from pytorch_lightning import Trainer

# Initialize the callback
epoch_metrics_logger = EpochMetricsLogger()

# Initialize your model
mf_model = NewsMF(num_users=len(ind2user) + 1, num_items=len(ind2item) + 1, dim=50)

seed_everything(42, workers=True)

# Set up the PyTorch Lightning trainer with the custom callback
trainer = Trainer(
    max_epochs=4,
    accelerator="gpu",
    deterministic=True,
    callbacks=[epoch_metrics_logger]  # Add your callback here
)

# Start the training process
trainer.fit(model=mf_model, train_dataloaders=train_loader)

  f"The `Callback.{hook}` hook was deprecated in v1.6 and"


Training: 0it [00:00, ?it/s]

Epoch Metrics Summary:
                        train_loss             epoch_avg_train_loss
0  tensor(1.9231, device='cuda:0')  tensor(2.1950, device='cuda:0')
1  tensor(1.9231, device='cuda:0')  tensor(2.1950, device='cuda:0')
2  tensor(1.9231, device='cuda:0')  tensor(2.1950, device='cuda:0')
3  tensor(1.9231, device='cuda:0')  tensor(2.1950, device='cuda:0')


## Verify Results
We load additional information about the news articles. Based on the item-embeddings we find the 10 most similar articles to a give article.

In [13]:
news = pd.read_csv("/kaggle/input/mind-processed-daniel/news.csv")
print(f"The article data consist in total of {len(news)} number of articles.")
news.head()

The article data consist in total of 51282 number of articles.


Unnamed: 0.1,Unnamed: 0,itemId,category,subcategory,title,abstract,url,title_entities,abstract_entities,ind,n_click_training
0,0,N55689,sports,football_nfl,"Charles Rogers, former Michigan State football...","Charles Rogers, the former Michigan State foot...",https://assets.msn.com/labs/mind/BBWAPO6.html,"[{""Label"": ""Charles Rogers (American football)...","[{""Label"": ""2003 NFL Draft"", ""Type"": ""U"", ""Wik...",1.0,4316.0
1,1,N49685,music,music-celebrity,Broadway Star Laurel Griggs Suffered Asthma At...,"Teen star Laurel Griggs, who passed away on No...",https://assets.msn.com/labs/mind/BBWyk8E.html,"[{""Label"": ""Broadway theatre"", ""Type"": ""F"", ""W...","[{""Label"": ""Once (musical)"", ""Type"": ""W"", ""Wik...",2.0,2294.0
2,2,N33619,news,newsus,College gymnast dies following training accide...,"Melanie Coleman, 20, of Milford, was practicin...",https://assets.msn.com/labs/mind/BBWBKRg.html,"[{""Label"": ""Connecticut"", ""Type"": ""G"", ""Wikida...",[],3.0,3245.0
3,3,N55204,entertainment,entertainment-celebrity,Stars Who Served in the Military,"Adam Driver, Jeff Bridges, Ice-T and more star...",https://assets.msn.com/labs/mind/AAJQC0V.html,[],"[{""Label"": ""Jeff Bridges"", ""Type"": ""P"", ""Wikid...",4.0,481.0
4,4,N53585,tv,tvnews,"Rip Taylor's Cause of Death Revealed, Memorial...",The comedian died at the age of 84 last month.,https://assets.msn.com/labs/mind/BBWBgRz.html,"[{""Label"": ""Rip Taylor"", ""Type"": ""P"", ""Wikidat...",[],5.0,2835.0


In [14]:
## Add more information to the article data 
# The item index
news["ind"] = news["itemId"].map(item2ind)
news = news.sort_values("ind").reset_index(drop=True)

# Number of clicks in training data per article, investigate the cold start issue
news["n_click_training"] = news["ind"].map(dict(Counter(train.click)))

In [15]:
# 5 most clicked articles
news.sort_values("n_click_training",ascending=False).head()

Unnamed: 0.1,Unnamed: 0,itemId,category,subcategory,title,abstract,url,title_entities,abstract_entities,ind,n_click_training
597,597,N306,movies,movies-celebrity,Kevin Spacey Won't Be Charged in Sexual Assaul...,The Los Angeles County District Attorney's Off...,https://assets.msn.com/labs/mind/AAJy6rv.html,"[{""Label"": ""Kevin Spacey"", ""Type"": ""P"", ""Wikid...","[{""Label"": ""Kevin Spacey"", ""Type"": ""P"", ""Wikid...",598.0,4802.0
0,0,N55689,sports,football_nfl,"Charles Rogers, former Michigan State football...","Charles Rogers, the former Michigan State foot...",https://assets.msn.com/labs/mind/BBWAPO6.html,"[{""Label"": ""Charles Rogers (American football)...","[{""Label"": ""2003 NFL Draft"", ""Type"": ""U"", ""Wik...",1.0,4316.0
656,656,N42620,lifestyle,lifestylebuzz,Heidi Klum's 2019 Halloween Costume Transforma...,You might say she's scary good at playing dres...,https://assets.msn.com/labs/mind/AAJFlhi.html,"[{""Label"": ""Heidi Klum"", ""Type"": ""P"", ""Wikidat...","[{""Label"": ""Heidi Klum"", ""Type"": ""P"", ""Wikidat...",657.0,4047.0
10,10,N47020,news,newsopinion,The News In Cartoons,News as seen through the eyes of the nation's ...,https://assets.msn.com/labs/mind/AAJ7oYd.html,[],[],11.0,3545.0
9,9,N35729,news,newsus,Porsche launches into second story of New Jers...,The Porsche went airborne off a median in Toms...,https://assets.msn.com/labs/mind/BBWyjM9.html,"[{""Label"": ""Porsche"", ""Type"": ""O"", ""WikidataId...","[{""Label"": ""Porsche"", ""Type"": ""O"", ""WikidataId...",10.0,3346.0


In [16]:
# store the learned item embedding into a seperate tensor
itememb = mf_model.itememb.weight.detach()
print(itememb)

tensor([[ 0.3336, -0.3396,  0.7344,  ...,  0.3277,  0.9406,  0.0717],
        [-0.1863,  0.2401,  0.3237,  ...,  0.8675,  0.5732, -0.3829],
        [-0.4728,  0.8010, -0.0662,  ..., -0.2519, -0.4118, -0.6288],
        ...,
        [-0.4627, -0.9460,  0.3129,  ...,  1.2525, -0.7791, -0.1034],
        [ 0.6584,  1.0735, -0.5054,  ..., -1.0410,  0.2840, -0.5321],
        [-1.1126, -0.2095, -0.4985,  ..., -0.5903,  1.5366,  0.1982]])


In [17]:
# Investigate different rows of the item embedding (articles embeddings) to see if the model works
## some examples N13259, N16636, N10272
## Can you find some examples that does not work good? Why?

ind = item2ind.get("N10272") 

print(ind)
# This calculates the cosine similarity and outputs the 10 most similar articles w.r.t to ind in descending order
similarity = torch.nn.functional.cosine_similarity(itememb[ind], itememb, dim=1)
print(similarity)
most_sim = news[~news.ind.isna()].iloc[(similarity.argsort(descending=True).cpu().numpy()-1)]
most_sim.head(5)


173
tensor([ 0.2665, -0.0143,  0.0351,  ...,  0.0035, -0.1390,  0.1026])


Unnamed: 0.1,Unnamed: 0,itemId,category,subcategory,title,abstract,url,title_entities,abstract_entities,ind,n_click_training
172,172,N10272,news,elections-2020-us,Bloomberg leads Trump by 6 points in 2020 elec...,Forty-three percent of likely voters would bac...,https://assets.msn.com/labs/mind/BBWwz21.html,"[{""Label"": ""Donald Trump"", ""Type"": ""P"", ""Wikid...","[{""Label"": ""Donald Trump"", ""Type"": ""P"", ""Wikid...",173.0,114.0
653,653,N8331,entertainment,celebrity,"Melanie Griffith, 62, shows off her insane fig...",Melanie Griffith looks more stunning than ever...,https://assets.msn.com/labs/mind/AAJBK7y.html,"[{""Label"": ""Melanie Griffith"", ""Type"": ""P"", ""W...","[{""Label"": ""Melanie Griffith"", ""Type"": ""P"", ""W...",654.0,366.0
475,475,N62801,news,newsus,3 students allegedly plotted to attack their m...,"The students, all of whom are under 16 years o...",https://assets.msn.com/labs/mind/BBWtxCa.html,[],[],476.0,142.0
224,224,N45428,entertainment,humor,Comics - 'Pluggers' by Gary Brookins,,https://assets.msn.com/labs/mind/AAJDUU0.html,"[{""Label"": ""Pluggers"", ""Type"": ""C"", ""WikidataI...",[],225.0,190.0
1890,1890,N48263,sports,basketball_nba,Winners and losers from NBA's opening night,Some of hype for NBA's opening night doublehea...,https://assets.msn.com/labs/mind/AAJcPUk.html,"[{""Label"": ""National Basketball Association"", ...","[{""Label"": ""Zion Williamson"", ""Type"": ""P"", ""Wi...",1891.0,134.0


In [18]:
#!pip install wandb
#import wandb
#wandb.login()
#from pytorch_lightning.loggers import WandbLogger
#wandb.finish()
#wandb_logger = WandbLogger(log_model="all", name="normalized")