# Baseline Content-Based Filtering
This notebook describes the baseline content-based filtering model for news recommendation using NRMS (Neural News Recommendation with Multi-head Self-attention).

# NRMS (Neural News Recommendation with Multi-Head Self-Attention):  
The NRMS [1] model is a state-of-the-art neural news recommendation approach built on multi-head self-attention. NRMS learns news representations from news titles using a multi-head self-attention network. The news encoder first maps each word in the news title to its embedding vector, and then uses the self-attention network to learn word-level representations. Next, a query vector is used to locate the important words in the title, and attention-based pooling aggregates the word-level representations into a single title representation. To learn user representations from browsed news, NRMS applies the multi-head self-attention network again, this time on top of the learned news representations. The probability of a user clicking a candidate news article is given by the dot product between the user representation and the candidate news representation.
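
The pooling and scoring steps described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the NRMS implementation: it skips the self-attention layers, scores words by a plain dot product with the query vector (the full model also applies a learned projection and nonlinearity before scoring), and all vectors are made-up toy values. The function names `attention_pool` and `click_score` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Aggregate vectors into one vector, weighted by their match with `query`."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def click_score(user_vec, news_vec):
    """Dot product between a user representation and a news representation."""
    return sum(u * n for u, n in zip(user_vec, news_vec))


# Toy word-level representations for a three-word title (assumed values).
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
title_vec = attention_pool(word_vectors, query)

# A user representation pooled from two browsed-news representations.
user_vec = attention_pool([[1.0, 0.0], [0.5, 0.5]], query)
score = click_score(user_vec, title_vec)
```

The same pooling function is reused for words (news encoder) and for browsed news (user encoder), which mirrors how the description applies attention at both levels.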

## Components:

### News Encoder:
- Uses multi-head self-attention to learn news representations from news titles
- Models interactions between words
  
### User Encoder:

- Learns user representations from their browsed news
  
## Key Features:
- Helps users find news that matches their interests
- Reduces information overload by ranking candidate articles for each user

## Load functionality

In [10]:
import os
import sys

In [2]:
#%cd ebnerd-benchmark
#!pip install .

## Importing the NRMS Model from News Recommendation Module

In [3]:
#Import Libraries

from transformers import AutoTokenizer, AutoModel
from pathlib import Path
import tensorflow as tf
import polars as pl

# Import Constants and Utilities
from ebrec.utils._constants import (
   DEFAULT_HISTORY_ARTICLE_ID_COL,
   DEFAULT_CLICKED_ARTICLES_COL,
   DEFAULT_INVIEW_ARTICLES_COL,
   DEFAULT_IMPRESSION_ID_COL,
   DEFAULT_SUBTITLE_COL,
   DEFAULT_LABELS_COL,
   DEFAULT_TITLE_COL,
   DEFAULT_USER_COL,
)

#Import Utility Functions
#
from ebrec.utils._behaviors import (
    create_binary_labels_column,
    sampling_strategy_wu2019,
    add_known_user_column,
    add_prediction_scores,
    truncate_history,
)
from ebrec.utils._articles import convert_text2encoding_with_transformers
from ebrec.utils._polars import concat_str_columns, slice_join_dataframes
from ebrec.utils._articles import create_article_id_to_value_mapping
from ebrec.utils._nlp import get_transformers_word_embeddings

#Import NRMS Components
#
from ebrec.models.newsrec.dataloader import NRMSDataLoader
from ebrec.models.newsrec.model_config import hparams_nrms
from ebrec.models.newsrec import NRMSModel


- `NRMSDataLoader` loads and preprocesses data for the NRMS model.
- `hparams_nrms` provides configuration parameters for the NRMS model.
- `NRMSModel` defines the architecture and training procedures for NRMS.

# Setting Up News Recommendation Model Components 
In this step, we configure essential components for the news recommendation model, including data preprocessing utilities and the NRMS model setup. 

In [4]:
# Import required modules and components

from ebrec.utils._articles import convert_text2encoding_with_transformers
from ebrec.utils._polars import concat_str_columns, slice_join_dataframes
from ebrec.utils._articles import create_article_id_to_value_mapping
from ebrec.utils._nlp import get_transformers_word_embeddings

#
from ebrec.models.newsrec.dataloader import NRMSDataLoader
from ebrec.models.newsrec.model_config import hparams_nrms
from ebrec.models.newsrec import NRMSModel

# Integrated Behavioral Data
This step joins each user's historical article interactions onto the behavioral data, so that every behavior row also carries the user's click history.

## Data Format Description
1. history.parquet
   - File content: historical interaction data between users and articles.
   - Columns used:
     - DEFAULT_USER_COL: the user identifier.
     - DEFAULT_HISTORY_ARTICLE_ID_COL: the identifiers of the articles the user interacted with.

2. behaviors.parquet
   - File content: behavioral data related to user interactions.
   - Columns: various columns related to user behavior and possibly article metadata.

In [48]:
def ebnerd_from_path(path: Path, history_size: int = 30) -> pl.DataFrame:
    """
    Load the behaviors of one EB-NeRD split and left-join each user's
    truncated (and zero-padded) article history onto them.
    """
    df_history = (
        pl.scan_parquet(path.joinpath("history.parquet"))
        .select(DEFAULT_USER_COL, DEFAULT_HISTORY_ARTICLE_ID_COL)
        .pipe(
            truncate_history,
            column=DEFAULT_HISTORY_ARTICLE_ID_COL,
            history_size=history_size,
            padding_value=0,
            enable_warning=False
        )
    )
    df_behaviors = (
        pl.scan_parquet(path.joinpath("behaviors.parquet"))
        .collect()
        .pipe(
            slice_join_dataframes,
            df2=df_history.collect(),
            on=DEFAULT_USER_COL,
            how="left",
        )
    )
    return df_behaviors
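
To make the two steps inside `ebnerd_from_path` concrete, here is a dependency-free sketch of history truncation followed by a left join. It is not the `ebrec` implementation: the helper `truncate_and_pad` is hypothetical, and it assumes that truncation keeps the most recent articles and that shorter histories are padded on the left.

```python
def truncate_and_pad(history, history_size, padding_value=0):
    """Keep the last `history_size` article IDs; left-pad shorter histories.

    Assumption: "most recent" means the end of the list.
    """
    recent = history[-history_size:] if history_size > 0 else []
    padding = [padding_value] * (history_size - len(recent))
    return padding + recent


# Toy histories keyed by user ID (assumed values).
histories = {1: [101, 102, 103, 104], 2: [201]}
truncated = {
    user: truncate_and_pad(items, history_size=3)
    for user, items in histories.items()
}

# Left join: every behavior row is kept, even when the user has no history.
behaviors = [
    {"user_id": 1, "impression_id": 10},
    {"user_id": 3, "impression_id": 11},
]
joined = [{**row, "history": truncated.get(row["user_id"])} for row in behaviors]
```

With `how="left"`, a user without any stored history still appears in the result, here with `history` set to `None`.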

In [49]:
PATH = Path("/home/e12242664/shared/194.035-2024S/groups/Gruppe_33/Group_33/downloads")
DATASPLIT = "demo"

# Data Preparation for Training
Load and preprocess the training, validation, and test splits of the EB-NeRD dataset.

### COLUMNS:
The column names to keep when loading and processing data from the EB-NeRD dataset.
### HISTORY_SIZE:
The maximum number of historical articles to retain per user (passed to the ebnerd_from_path function).
### FRACTION:
The fraction of rows to sample from each split.

### ebnerd_from_path:
Loads one split and joins the historical interaction data onto the behavioral data.
### select(COLUMNS):
Selects the columns defined in COLUMNS.
### sampling_strategy_wu2019:
Applies the negative sampling strategy of Wu et al. (2019). With npratio=4, each positive (clicked) sample is paired with four negative (non-clicked) samples, which fixes the negative-to-positive ratio of the training data.
### create_binary_labels_column:
Creates a binary label for each in-view article: 1 if it was clicked, 0 otherwise.
### sample(fraction=FRACTION):
Samples a fraction of the rows, for efficiency or testing purposes.
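
The sampling and labeling steps can be illustrated with a small standalone sketch. It is not the code behind `sampling_strategy_wu2019` or `create_binary_labels_column`: the function `sample_negatives` is hypothetical, the impression is a toy example, and it assumes every impression contains at least one non-clicked article.

```python
import random


def sample_negatives(inview, clicked, npratio, rng):
    """Return one clicked article plus `npratio` non-clicked ones, with labels."""
    negatives = [a for a in inview if a not in clicked]
    positive = rng.choice(clicked)
    # Sampling with replacement also works when fewer than `npratio`
    # negatives are available in the impression.
    sampled = rng.choices(negatives, k=npratio)
    candidates = [positive] + sampled
    rng.shuffle(candidates)
    # Binary labels: 1 for a clicked article, 0 otherwise.
    labels = [1 if a in clicked else 0 for a in candidates]
    return candidates, labels


rng = random.Random(123)
candidates, labels = sample_negatives(
    inview=[9774461, 9759966, 9759544, 9770541, 9767624],
    clicked=[9759966],
    npratio=4,
    rng=rng,
)
```

Each training row then holds `npratio + 1` candidates with exactly one positive label, regardless of how many articles were in view.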

In [50]:
COLUMNS = [
    DEFAULT_USER_COL,
    DEFAULT_HISTORY_ARTICLE_ID_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
    DEFAULT_CLICKED_ARTICLES_COL,
    DEFAULT_IMPRESSION_ID_COL,
]
HISTORY_SIZE = 50
FRACTION = 1

df_train = (
    ebnerd_from_path(PATH.joinpath(DATASPLIT, "train"), history_size=HISTORY_SIZE)
    .select(COLUMNS)
    .pipe(
        sampling_strategy_wu2019,
        npratio=4,
        shuffle=True,
        with_replacement=True,
        seed=123,
    )
    .pipe(create_binary_labels_column)
    .sample(fraction=FRACTION)
)
# =>
df_validation = (
    ebnerd_from_path(PATH.joinpath(DATASPLIT, "validation"), history_size=HISTORY_SIZE)
    .select(COLUMNS)
    .pipe(create_binary_labels_column)
    .sample(fraction=FRACTION)
)

# The test split has no clicked-articles column, so no labels are created.
df_test = (
    ebnerd_from_path(PATH.joinpath("test"), history_size=HISTORY_SIZE)
    .select(
        [
            DEFAULT_USER_COL,
            DEFAULT_HISTORY_ARTICLE_ID_COL,
            DEFAULT_INVIEW_ARTICLES_COL,
            DEFAULT_IMPRESSION_ID_COL,
        ]
    )
    .sample(fraction=FRACTION)
)

df_train.head(2)

| user_id | article_id_fixed | article_ids_inview | article_ids_clicked | impression_id | labels |
| --- | --- | --- | --- | --- | --- |
| u32 | list[i32] | list[i64] | list[i64] | u32 | list[i8] |
| 22779 | [9767624, 9767675, … 9770541] | [9774461, 9759966, … 9759544] | [9759966] | 48401 | [0, 1, … 0] |
| 150224 | [9755821, 9759109, … 9735909] | [9778682, 9777397, … 9482970] | [9778661] | 152513 | [0, 0, … 0] |


### Look at the difference between the training/validation sets and the test set
Note that the test set doesn't include labels, and we have removed some of the other columns.

In [51]:
df_train.head(2)

| user_id | article_id_fixed | article_ids_inview | article_ids_clicked | impression_id | labels |
| --- | --- | --- | --- | --- | --- |
| u32 | list[i32] | list[i64] | list[i64] | u32 | list[i8] |
| 22779 | [9767624, 9767675, … 9770541] | [9774461, 9759966, … 9759544] | [9759966] | 48401 | [0, 1, … 0] |
| 150224 | [9755821, 9759109, … 9735909] | [9778682, 9777397, … 9482970] | [9778661] | 152513 | [0, 0, … 0] |


In [52]:
df_test.head(2)

| user_id | article_id_fixed | article_ids_inview | impression_id |
| --- | --- | --- | --- |
| u32 | list[i32] | list[i32] | u32 |
| 35982 | [9782499, 9783024, … 9789494] | [9796527, 7851321, … 9492777] | 6451339 |
| 36012 | [9786247, 9786209, … 9790885] | [9798532, 9791602, … 9798958] | 6451363 |


## Load articles

In [53]:
df_articles = pl.read_parquet(PATH.joinpath(DATASPLIT, "articles.parquet"))
df_articles.head(2)

article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
i32,str,str,datetime[μs],bool,str,datetime[μs],list[i64],str,str,list[str],list[str],list[str],i16,list[i16],str,i32,i32,f32,f32,str
3037230,"""Ishockey-spill…","""ISHOCKEY: Isho…",2023-06-29 06:20:57,False,"""Ambitionerne o…",2003-08-28 08:55:00,,"""article_defaul…","""https://ekstra…",[],[],"[""Kriminalitet"", ""Kendt"", … ""Mindre ulykke""]",142,"[327, 334]","""sport""",,,,0.9752,"""Negative"""
3044020,"""Prins Harry tv…","""Hoffet tvang P…",2023-06-29 06:21:16,False,"""Den britiske t…",2005-06-29 08:47:00,"[3097307, 3097197, 3104927]","""article_defaul…","""https://ekstra…","[""Harry"", ""James Hewitt""]","[""PER"", ""PER""]","[""Kriminalitet"", ""Kendt"", … ""Personfarlig kriminalitet""]",414,[432],"""underholdning""",,,,0.7084,"""Negative"""


# Embedding Article Text with BERT
This step involves initializing and loading a pre-trained BERT model (bert-base-multilingual-cased) from Hugging Face.

### Transformer Model: 
Initializes a pre-trained BERT model (bert-base-multilingual-cased) from Hugging Face's Transformers library.
### Tokenizer:
Initializes a corresponding tokenizer for the BERT model to preprocess text inputs.
## Constants:
### TEXT_COLUMNS_TO_USE:
Specifies which columns of df_articles (DEFAULT_SUBTITLE_COL and DEFAULT_TITLE_COL) are concatenated into the text that gets tokenized.
### MAX_TITLE_LENGTH:
Defines the maximum number of tokens kept per article. Despite the name, it applies to the concatenated subtitle and title text.
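
The sketch below shows the general idea of turning article text into fixed-length token ID lists and building an article-to-tokens mapping. It is only an illustration: a toy whitespace vocabulary stands in for the BERT tokenizer (which works on subword pieces and adds special tokens), and the function `encode`, the vocabulary, and the padding on the right are assumptions.

```python
def encode(text, vocab, max_length, pad_id=0, unk_id=1):
    """Map whitespace tokens to IDs, then truncate or right-pad to `max_length`."""
    ids = [vocab.get(token, unk_id) for token in text.lower().split()]
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


# Toy articles and vocabulary (assumed values).
articles = [
    {"article_id": 1, "subtitle": "Short subtitle", "title": "A title"},
    {"article_id": 2, "subtitle": "", "title": "Another much longer title here"},
]
vocab = {"short": 2, "subtitle": 3, "a": 4, "title": 5, "another": 6}
MAX_LENGTH = 4

# Concatenate subtitle and title, encode, and index the result by article ID.
article_to_tokens = {
    a["article_id"]: encode(f'{a["subtitle"]} {a["title"]}', vocab, MAX_LENGTH)
    for a in articles
}
```

Because every article ends up with the same number of token IDs, the lists can later be stacked into a rectangular batch.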

In [54]:
TRANSFORMER_MODEL_NAME = "bert-base-multilingual-cased"
TEXT_COLUMNS_TO_USE = [DEFAULT_SUBTITLE_COL, DEFAULT_TITLE_COL]
MAX_TITLE_LENGTH = 30

# LOAD HUGGINGFACE:
transformer_model = AutoModel.from_pretrained(TRANSFORMER_MODEL_NAME)
transformer_tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL_NAME)

# Initialize the word embeddings from the transformer model:
word2vec_embedding = get_transformers_word_embeddings(transformer_model)
#
df_articles, cat_cal = concat_str_columns(df_articles, columns=TEXT_COLUMNS_TO_USE)
df_articles, token_col_title = convert_text2encoding_with_transformers(
    df_articles, transformer_tokenizer, cat_cal, max_length=MAX_TITLE_LENGTH
)
# =>
article_mapping = create_article_id_to_value_mapping(
    df=df_articles, value_col=token_col_title
)



# Initiate the dataloaders
In this implementation, the models are decoupled from the data. Hence, you should build a dataloader that fits your needs.

### behaviors:
The dataset to iterate over (df_train for training), containing each user's history, the in-view articles, and the labels.
### article_dict:
Dictionary mapping article IDs to their token encodings (article_mapping), which allows efficient lookup during training.
### unknown_representation:
Specifies how articles missing from article_dict are represented; "zeros" uses an all-zero encoding.
### history_column:
Column in behaviors that holds the user's historical interactions (DEFAULT_HISTORY_ARTICLE_ID_COL).
### eval_mode=False:
Indicates training mode, where the data loader prepares batches for model training. The validation data loader sets eval_mode=True.
### batch_size=64:
Determines the number of samples (user interactions) in each training batch. The validation data loader uses a batch size of 32.
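
As a rough mental model of what such a dataloader does, the sketch below replaces article IDs with token encodings and yields fixed-size batches. It does not reproduce the internals of `NRMSDataLoader`; the helpers `lookup` and `batches` and all data values are hypothetical.

```python
def lookup(article_ids, article_dict, token_length):
    """Replace each article ID by its token IDs; unknown IDs become all zeros."""
    zeros = [0] * token_length
    return [article_dict.get(article_id, zeros) for article_id in article_ids]


def batches(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Toy mapping and behavior rows (assumed values); ID 0 is history padding.
article_dict = {11: [5, 6, 7], 12: [8, 9, 0]}
rows = [
    {"history": [0, 11], "inview": [12, 99], "labels": [1, 0]},
    {"history": [11, 12], "inview": [11, 12], "labels": [0, 1]},
    {"history": [0, 0], "inview": [99, 12], "labels": [0, 1]},
]

model_inputs = []
for batch in batches(rows, batch_size=2):
    history_tokens = [lookup(r["history"], article_dict, 3) for r in batch]
    candidate_tokens = [lookup(r["inview"], article_dict, 3) for r in batch]
    labels = [r["labels"] for r in batch]
    model_inputs.append(((history_tokens, candidate_tokens), labels))
```

Padding IDs and articles that are absent from the mapping both resolve to the all-zero encoding, which corresponds to the "zeros" option described above.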

In [55]:
train_dataloader = NRMSDataLoader(
    behaviors=df_train,
    article_dict=article_mapping,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=False,
    batch_size=64,
)
val_dataloader = NRMSDataLoader(
    behaviors=df_validation,
    article_dict=article_mapping,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=True,
    batch_size=32,
)

# Train the model

### train_dataloader: 
DataLoader for training data.
### validation_data:
DataLoader for validation data.
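### epochs=10:
Maximum number of training epochs; early stopping can end training sooner.
### callbacks:
List of callbacks: TensorBoard logging, early stopping on val_loss with a patience of 2 epochs, and a checkpoint that saves the best model weights.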

In [56]:
MODEL_NAME = "NRMS"
LOG_DIR = PATH.joinpath(f"data/{MODEL_NAME}")
MODEL_WEIGHTS = PATH.joinpath(f"data/state_dict/{MODEL_NAME}/weights")

# CALLBACKS
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR, histogram_freq=1)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)
modelcheckpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=MODEL_WEIGHTS, save_best_only=True, save_weights_only=True, verbose=1
)

hparams_nrms.history_size = HISTORY_SIZE
model = NRMSModel(
    hparams=hparams_nrms,
    word2vec_embedding=word2vec_embedding,
    seed=42,
)
hist = model.model.fit(
    train_dataloader,
    validation_data=val_dataloader,
    epochs=10,
    callbacks=[tensorboard_callback, early_stopping, modelcheckpoint],
)
_ = model.model.load_weights(filepath=MODEL_WEIGHTS)

Epoch 1/10
Epoch 1: val_loss improved from inf to 0.00000, saving model to /home/e12242664/shared/194.035-2024S/groups/Gruppe_33/Group_33/downloads/data/state_dict/NRMS/weights
Epoch 2/10
Epoch 2: val_loss did not improve from 0.00000
Epoch 3/10
Epoch 3: val_loss did not improve from 0.00000
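
The log above ends after three of the ten epochs, which is consistent with patience-based early stopping. The following simplified model of that rule (not the Keras source; `early_stopping_epoch` is a hypothetical helper) shows why a validation loss that never improves after the first epoch stops training at epoch 3 when the patience is 2.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember the new best value and reset the counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Validation losses as reported in the log: 0.0 in every epoch.
stopped_at = early_stopping_epoch([0.0, 0.0, 0.0, 0.0], patience=2)  # -> 3
```

A loss that keeps decreasing never triggers the rule, in which case training would run for all configured epochs.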


# Example: how to compute some metrics

In [57]:
pred_validation = model.scorer.predict(val_dataloader)



In [58]:
df_validation = add_prediction_scores(df_validation, pred_validation.tolist()).pipe(
    add_known_user_column, known_users=df_train[DEFAULT_USER_COL]
)
df_validation.head(2)

| user_id | article_id_fixed | article_ids_inview | article_ids_clicked | impression_id | labels | scores | is_known_user |
| --- | --- | --- | --- | --- | --- | --- | --- |
| u32 | list[i32] | list[i32] | list[i32] | u32 | list[i8] | list[f64] | bool |
| 76658 | [9766238, 9767642, … 9779045] | [9787499, 9783042, … 9780702] | [9783042] | 144772 | [0, 1, … 0] | [0.383537, 0.401613, … 0.379169] | True |
| 76658 | [9766238, 9767642, … 9779045] | [9788352, 6741781, … 9788125] | [9788125] | 144777 | [0, 0, … 1] | [0.470372, 0.156016, … 0.490298] | True |


In [59]:
from ebrec.evaluation import MetricEvaluator, AucScore, NdcgScore, MrrScore

metrics = MetricEvaluator(
    labels=df_validation["labels"].to_list(),
    predictions=df_validation["scores"].to_list(),
    metric_functions=[AucScore(), MrrScore(), NdcgScore(k=5), NdcgScore(k=10)],
)
metrics.evaluate()

<MetricEvaluator class>: 
 {
    "auc": 0.5456715539995841,
    "mrr": 0.3341141497438294,
    "ndcg@5": 0.37227371388113534,
    "ndcg@10": 0.4530279432053414
}
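
For intuition about what these numbers measure, here is a minimal sketch of the three metrics for a single impression with binary labels. These are textbook definitions written for illustration; the `ebrec` implementations may differ in details such as tie handling or how impressions are averaged, and the example labels and scores are made up.

```python
import math


def ranked_labels(labels, scores):
    """Return the labels ordered by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def auc(labels, scores):
    """Share of (clicked, non-clicked) pairs ranked correctly; ties count half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrr(labels, scores):
    """Reciprocal rank of the first clicked article."""
    for rank, label in enumerate(ranked_labels(labels, scores), start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(labels, scores, k):
    """Discounted cumulative gain at k, normalized by the ideal ordering."""
    def dcg(ordered):
        return sum(
            label / math.log2(rank + 1)
            for rank, label in enumerate(ordered[:k], start=1)
        )
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(ranked_labels(labels, scores)) / ideal if ideal > 0 else 0.0


# One impression: the clicked article receives the third-highest score.
labels = [0, 1, 0, 0]
scores = [0.5, 0.4, 0.3, 0.6]
impression_auc = auc(labels, scores)
impression_mrr = mrr(labels, scores)
impression_ndcg5 = ndcg_at_k(labels, scores, k=5)
```

An AUC of 0.5 corresponds to random ordering of clicked versus non-clicked articles, which gives a reference point for the validation AUC reported above.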

# References:
[1] Yichao Lu, "Bag of Tricks and a Strong Baseline for Neural News Recommendation," Layer 6 AI, https://msnews.github.io/assets/doc/3.pdf