# NRMS (Neural News Recommendation with Multi-Head Self-Attention):  
The NRMS [1] model is a state-of-the-art neural news recommendation approach based on multi-head self-attention. NRMS learns news representations from news titles using a multi-head self-attention network. Briefly, the news encoder first maps each word in the news title to its embedding vector and then uses the self-attention network to learn word-level representations. Finally, a query vector is used to locate the important words in the title, and attention-based pooling aggregates the word-level representations into the learned title representation. To learn user representations from browsed news, NRMS again applies multi-head self-attention, this time on top of the learned news representations. The probability of a user clicking a candidate news article is scored by the dot product between the user representation and the news representation.
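
The pooling and scoring steps described above can be illustrated with a small NumPy sketch. It is not the code used by `NRMSModel`; the parameter names `W`, `b`, and `q`, the sizes, and the random inputs are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_pool(h: np.ndarray, W: np.ndarray, b: np.ndarray, q: np.ndarray):
    """Aggregate n vectors of size d into a single vector of size d.

    h: (n, d) word-level representations
    W: (d, a), b: (a,) projection parameters (hypothetical names)
    q: (a,) query vector that scores each word
    """
    scores = np.tanh(h @ W + b) @ q  # one relevance score per word, shape (n,)
    alpha = softmax(scores)          # attention weights that sum to 1
    return alpha @ h, alpha          # weighted sum of the word vectors


rng = np.random.default_rng(0)
n_words, dim, att_dim = 6, 8, 4
word_repr = rng.normal(size=(n_words, dim))
W = rng.normal(size=(dim, att_dim))
b = np.zeros(att_dim)
q = rng.normal(size=att_dim)

news_vec, alpha = attention_pool(word_repr, W, b, q)

# Click score: dot product between a user vector and the news vector.
user_vec = rng.normal(size=dim)
click_score = float(user_vec @ news_vec)
```

The same pooling function can be reused on the user side, where the inputs are news representations instead of word representations.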

## Components:

### News Encoder:
- Uses multi-head self-attention to learn news representations from news titles
- Models interactions between words
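
As a rough illustration of how self-attention models interactions between words, the sketch below implements the standard scaled dot-product form of multi-head self-attention in NumPy. This formulation is an assumption and may differ in detail from the attention layer inside `NRMSModel`; all weight names and sizes are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, Wq, Wk, Wv):
    """x: (n, d) token vectors; Wq, Wk, Wv: (heads, d, d_head).

    Returns an (n, heads * d_head) matrix in which every row mixes
    information from all n tokens.
    """
    q = np.einsum("nd,hde->hne", x, Wq)  # queries per head
    k = np.einsum("nd,hde->hne", x, Wk)  # keys per head
    v = np.einsum("nd,hde->hne", x, Wv)  # values per head
    d_head = q.shape[-1]
    # Pairwise token-to-token weights for each head, shape (heads, n, n)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ v                  # (heads, n, d_head)
    return np.concatenate(list(heads), axis=-1)


rng = np.random.default_rng(1)
n_tokens, dim, n_heads, d_head = 5, 8, 3, 4
x = rng.normal(size=(n_tokens, dim))
Wq, Wk, Wv = (rng.normal(size=(n_heads, dim, d_head)) for _ in range(3))
out = multi_head_self_attention(x, Wq, Wk, Wv)
```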
  
### User Encoder:

- Learns user representations from their browsed news
  
## Key Features:
- Helps users find the news they like
- Reduces information overload

In [3]:
import os
import sys

from transformers import AutoTokenizer, AutoModel
from pathlib import Path
import tensorflow as tf
import polars as pl

# Import Constants and Utilities
from ebrec.utils._constants import (
   DEFAULT_HISTORY_ARTICLE_ID_COL,
   DEFAULT_CLICKED_ARTICLES_COL,
   DEFAULT_INVIEW_ARTICLES_COL,
   DEFAULT_IMPRESSION_ID_COL,
   DEFAULT_SUBTITLE_COL,
   DEFAULT_LABELS_COL,
   DEFAULT_TITLE_COL,
   DEFAULT_USER_COL,
)

# Import utility functions
from ebrec.utils._behaviors import (
    create_binary_labels_column,
    sampling_strategy_wu2019,
    add_known_user_column,
    add_prediction_scores,
    truncate_history,
)
from ebrec.utils._articles import convert_text2encoding_with_transformers
from ebrec.utils._polars import concat_str_columns, slice_join_dataframes
from ebrec.utils._articles import create_article_id_to_value_mapping
from ebrec.utils._nlp import get_transformers_word_embeddings

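# Import the NRMS data loader, hyperparameters, and model
from ebrec.models.newsrec.dataloader import NRMSDataLoader
from ebrec.models.newsrec.model_config import hparams_nrms
from ebrec.models.newsrec import NRMSModel

# Import the evaluation metrics and submission helpers used in the evaluation cells
from ebrec.evaluation import MetricEvaluator, AucScore, NdcgScore, MrrScore
from ebrec.utils._python import write_submission_file, rank_predictions_by_score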


## Load and Process Behavior and History Parquet Files
The function below transforms the EB-NeRD dataset tables (`history.parquet` and `behaviors.parquet`) into a format suitable for training and testing. It joins the histories onto the behaviors by user ID and truncates each user history so that only the specified `history_size` is kept.

In [48]:
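def ebnerd_from_path(path: Path, history_size: int = 30) -> pl.DataFrame:
    """
    Load `behaviors.parquet` from `path` and left-join each user's history
    (from `history.parquet`), truncated to `history_size` with padding value 0.
    """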
    df_history = (
        pl.scan_parquet(path.joinpath("history.parquet"))
        .select(DEFAULT_USER_COL, DEFAULT_HISTORY_ARTICLE_ID_COL)
        .pipe(
            truncate_history,
            column=DEFAULT_HISTORY_ARTICLE_ID_COL,
            history_size=history_size,
            padding_value=0,
            enable_warning=False
        )
    )
    df_behaviors = (
        pl.scan_parquet(path.joinpath("behaviors.parquet"))
        .collect()
        .pipe(
            slice_join_dataframes,
            df2=df_history.collect(),
            on=DEFAULT_USER_COL,
            how="left",
        )
    )
    return df_behaviors
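
To make the truncation and join concrete, here is a pure-Python sketch on toy data. It assumes that the most recent entries are kept and that shorter histories are padded on the left; the actual behaviour of `truncate_history` and `slice_join_dataframes` is not shown in this notebook, and `truncate_and_pad` is a hypothetical helper.

```python
def truncate_and_pad(history, history_size, padding_value=0):
    """Keep the last `history_size` items; left-pad shorter histories (assumed side)."""
    recent = history[-history_size:] if history_size > 0 else []
    return [padding_value] * (history_size - len(recent)) + recent


# Toy stand-ins for history.parquet and behaviors.parquet
histories = {10: [1, 2, 3, 4, 5], 11: [7, 8]}
behaviors = [
    {"user_id": 10, "impression_id": 100},
    {"user_id": 12, "impression_id": 101},  # user without a history row
]

truncated = {user: truncate_and_pad(h, 3) for user, h in histories.items()}

# Left join on user_id: behaviors without a matching history get None
joined = [{**row, "history": truncated.get(row["user_id"])} for row in behaviors]
```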

## Setup Path and Data Configuration
Here we set up `DATA_PATH`, the location of the EB-NeRD dataset used for training. `COLUMNS` lists the columns used for model training, and `TEXT_COLUMNS_TO_USE` lists the columns that are considered in the embedding process. The remaining constants hold the "optimal" model parameters found through a hyperparameter-optimization process; see `NRMS-hyperopt.ipynb` and `NRMS-analysis.ipynb` for details.

In [49]:
MODEL_NAME = "NRMS"
# Path does not expand "~" on its own, so expand it explicitly
MODEL_PATH = Path(f"~/shared/194.035-2024S/groups/Gruppe_33/Group_33/models/{MODEL_NAME}/weights").expanduser()

DATA_PATH = Path("~/shared/194.035-2024S/groups/Gruppe_33/Group_33/downloads").expanduser()
DATASPLIT = "small"

COLUMNS = [
    DEFAULT_USER_COL,
    DEFAULT_HISTORY_ARTICLE_ID_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
    DEFAULT_CLICKED_ARTICLES_COL,
    DEFAULT_IMPRESSION_ID_COL,
]

TEXT_COLUMNS_TO_USE = [DEFAULT_SUBTITLE_COL, DEFAULT_TITLE_COL]

TRIAN_ON_TESTSET = False
TRAIN_MODEL = os.environ.get("TRAIN")

FRACTION = 1
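
Note that `os.environ.get("TRAIN")` returns a string or `None`, so any non-empty value (including `"0"` or `"false"`) is truthy in an `if TRAIN_MODEL:` check. A minimal sketch of a stricter parser follows; `env_flag` and the `TRAIN_DEMO` variable are hypothetical and not part of this notebook.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


os.environ["TRAIN_DEMO"] = "0"
flag_zero = env_flag("TRAIN_DEMO")      # "0" is treated as False here

os.environ["TRAIN_DEMO"] = "true"
flag_true = env_flag("TRAIN_DEMO")

del os.environ["TRAIN_DEMO"]
flag_missing = env_flag("TRAIN_DEMO")   # unset falls back to the default
```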

In [50]:
# COLUMNS, FRACTION, and TRAIN_MODEL are defined in the previous cell.
HISTORY_SIZE = 50

if TRAIN_MODEL:
    # Training set: negative sampling (4 negatives per click), then binary labels
    df_train = (
        ebnerd_from_path(DATA_PATH.joinpath(DATASPLIT, "train"), history_size=HISTORY_SIZE)
        .select(COLUMNS)
        .pipe(
            sampling_strategy_wu2019,
            npratio=4,
            shuffle=True,
            with_replacement=True,
            seed=123,
        )
        .pipe(create_binary_labels_column)
        .sample(fraction=FRACTION)
    )
    # Validation set: keep all in-view articles and only add binary labels
    df_validation = (
        ebnerd_from_path(DATA_PATH.joinpath(DATASPLIT, "validation"), history_size=HISTORY_SIZE)
        .select(COLUMNS)
        .pipe(create_binary_labels_column)
        .sample(fraction=FRACTION)
    )

    # Test set: the clicked-articles column is not selected here
    df_test = (
        ebnerd_from_path(DATA_PATH.joinpath("test"), history_size=HISTORY_SIZE)
        .select(
            [
                DEFAULT_USER_COL,
                DEFAULT_HISTORY_ARTICLE_ID_COL,
                DEFAULT_INVIEW_ARTICLES_COL,
                DEFAULT_IMPRESSION_ID_COL,
            ]
        )
        .sample(fraction=FRACTION)
    )

    print(df_test.head(2))
    print(df_train.head(2))

| user_id | article_id_fixed | article_ids_inview | article_ids_clicked | impression_id | labels |
| --- | --- | --- | --- | --- | --- |
| u32 | list[i32] | list[i64] | list[i64] | u32 | list[i8] |
| 22779 | [9767624, 9767675, … 9770541] | [9774461, 9759966, … 9759544] | [9759966] | 48401 | [0, 1, … 0] |
| 150224 | [9755821, 9759109, … 9735909] | [9778682, 9777397, … 9482970] | [9778661] | 152513 | [0, 0, … 0] |


## Load articles

In [53]:
df_articles = pl.read_parquet(DATA_PATH.joinpath(DATASPLIT, "articles.parquet"))
df_articles.head(2)

| column | dtype | row 1 | row 2 |
| --- | --- | --- | --- |
| article_id | i32 | 3037230 | 3044020 |
| title | str | "Ishockey-spill…" | "Prins Harry tv…" |
| subtitle | str | "ISHOCKEY: Isho…" | "Hoffet tvang P…" |
| last_modified_time | datetime[μs] | 2023-06-29 06:20:57 | 2023-06-29 06:21:16 |
| premium | bool | False | False |
| body | str | "Ambitionerne o…" | "Den britiske t…" |
| published_time | datetime[μs] | 2003-08-28 08:55:00 | 2005-06-29 08:47:00 |
| image_ids | list[i64] |  | [3097307, 3097197, 3104927] |
| article_type | str | "article_defaul…" | "article_defaul…" |
| url | str | "https://ekstra…" | "https://ekstra…" |
| ner_clusters | list[str] | [] | ["Harry", "James Hewitt"] |
| entity_groups | list[str] | [] | ["PER", "PER"] |
| topics | list[str] | ["Kriminalitet", "Kendt", … "Mindre ulykke"] | ["Kriminalitet", "Kendt", … "Personfarlig kriminalitet"] |
| category | i16 | 142 | 414 |
| subcategory | list[i16] | [327, 334] | [432] |
| category_str | str | "sport" | "underholdning" |
| total_inviews | i32 |  |  |
| total_pageviews | i32 |  |  |
| total_read_time | f32 |  |  |
| sentiment_score | f32 | 0.9752 | 0.7084 |
| sentiment_label | str | "Negative" | "Negative" |


## Prepare and Process Training and Validation Data
This section uses the previously defined functions to create training and validation datasets. Additional preprocessing steps include applying the `sampling_strategy_wu2019` strategy, creating binary labels using `create_binary_labels_column`, and sampling a fraction of the data for efficiency.

In [None]:
if TRAIN_MODEL:
    df_train = (
        ebnerd_from_path(DATA_PATH.joinpath(DATASPLIT, "train"), history_size=HISTORY_SIZE)
        .select(COLUMNS)
        .pipe(
            sampling_strategy_wu2019,
            npratio=4,
            shuffle=True,
            with_replacement=True,
            seed=123,
        )
        .pipe(create_binary_labels_column)
        .sample(fraction=FRACTION)
    )
    
    df_validation = (
        ebnerd_from_path(DATA_PATH.joinpath(DATASPLIT, "validation"), history_size=HISTORY_SIZE)
        .select(COLUMNS)
        .pipe(create_binary_labels_column)
        .sample(fraction=FRACTION)
    )
    
    print(df_validation.head(2))
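
The negative sampling and labelling steps can be sketched in plain Python as follows. This is only an approximation of what `sampling_strategy_wu2019` and `create_binary_labels_column` do: the helper names are hypothetical, and drawing with replacement only when too few negatives exist is an assumption about the details.

```python
import random


def sample_impression(inview, clicked, npratio, rng):
    """Keep the clicked articles and draw `npratio` non-clicked in-view articles."""
    clicked_set = set(clicked)
    negatives = [a for a in inview if a not in clicked_set]
    if len(negatives) >= npratio:
        sampled = rng.sample(negatives, npratio)
    elif negatives:
        sampled = rng.choices(negatives, k=npratio)  # falls back to replacement
    else:
        sampled = []
    candidates = list(clicked) + sampled
    rng.shuffle(candidates)  # avoid a fixed position for the positive
    return candidates


def binary_labels(inview, clicked):
    """1 for clicked articles, 0 for everything else in view."""
    clicked_set = set(clicked)
    return [1 if a in clicked_set else 0 for a in inview]


rng = random.Random(123)
inview = [9774461, 9759966, 9759544, 9778682, 9777397, 9482970]
clicked = [9759966]

candidates = sample_impression(inview, clicked, npratio=4, rng=rng)
labels = binary_labels(candidates, clicked)
```

With `npratio=4` and a single click, each training row ends up with five candidates, exactly one of which is labelled 1.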

## Load the Articles
This cell loads the articles into memory; they are used later to build the text embeddings.

In [5]:
if TRAIN_MODEL:
    df_articles = pl.read_parquet(DATA_PATH.joinpath(DATASPLIT, "articles.parquet"))
    print(df_articles.head(2))

## Initialize and Configure Transformer Model

This cell loads a pre-trained transformer model and tokenizer from Hugging Face, selected by `TRANSFORMER_MODEL_NAME` (here `bert-base-multilingual-cased`), establishing the NLP backbone for the notebook. The transformer turns raw text into token encodings and word embeddings that the recommendation model can consume. The following steps are executed:
- **Load Transformer Model and Tokenizer**: `AutoModel` and `AutoTokenizer` from Hugging Face load the pre-trained model named in `TRANSFORMER_MODEL_NAME`.
- **Initialize Word Embeddings**: Word embeddings are initialized using the transformer's word embeddings to enhance the text representation.
- **Concatenate Text Columns**: Text columns from the articles dataframe are concatenated to create a comprehensive text field.
- **Convert Text to Encodings**: The concatenated text is tokenized and converted to numerical encodings using the transformer tokenizer, with a specified maximum length.
- **Create Article Mapping**: A mapping from article IDs to their corresponding tokenized values is created, facilitating efficient lookup and processing in the recommendation pipeline.

In [None]:
TRANSFORMER_MODEL_NAME = "bert-base-multilingual-cased"
TEXT_COLUMNS_TO_USE = [DEFAULT_SUBTITLE_COL, DEFAULT_TITLE_COL]
MAX_TITLE_LENGTH = 30

# LOAD HUGGINGFACE:
transformer_model = AutoModel.from_pretrained(TRANSFORMER_MODEL_NAME)
transformer_tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL_NAME)

# Initialise the word embeddings from the transformer's word embeddings
word2vec_embedding = get_transformers_word_embeddings(transformer_model)
# Concatenate the text columns and tokenise them with a maximum length of MAX_TITLE_LENGTH
df_articles, cat_cal = concat_str_columns(df_articles, columns=TEXT_COLUMNS_TO_USE)
df_articles, token_col_title = convert_text2encoding_with_transformers(
    df_articles, transformer_tokenizer, cat_cal, max_length=MAX_TITLE_LENGTH
)
# Map each article ID to its tokenised text
article_mapping = create_article_id_to_value_mapping(
    df=df_articles, value_col=token_col_title
)
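
The concatenate, encode, and map steps can be pictured with the toy sketch below. A whitespace tokenizer with a tiny hand-made vocabulary stands in for the Hugging Face tokenizer, and right-padding with ID 0 is an assumption; none of these names come from `ebrec`.

```python
def encode(text, vocab, max_length, pad_id=0, unk_id=1):
    """Map whitespace tokens to IDs, then truncate or right-pad to max_length."""
    ids = [vocab.get(tok, unk_id) for tok in text.lower().split()]
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


# Hypothetical vocabulary and articles, for illustration only
vocab = {"royal": 2, "news": 3, "prince": 4, "interview": 5}
articles = [
    {"article_id": 101, "subtitle": "Prince interview", "title": "Royal news"},
    {"article_id": 102, "subtitle": "Unknown words here and more", "title": "Royal news today"},
]

# Concatenate subtitle and title, encode, and build the ID -> tokens mapping
toy_article_mapping = {
    a["article_id"]: encode(f'{a["subtitle"]} {a["title"]}', vocab, max_length=6)
    for a in articles
}
```

Every article maps to a list of the same length, which is what allows the data loader to stack them into fixed-shape batches.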

# Training of the NRMS Model
The following cells handle the setup and training of the NRMS model. If `TRAIN_MODEL` is `True`, the necessary data loaders are created, the model is configured with specific hyperparameters, and the training process begins. If `TRAIN_MODEL` is `False`, the pre-trained model weights are loaded instead.

### Data Loading for Model Input and Model Configuration
If `TRAIN_MODEL` is `True`, this cell creates data loaders for both training and validation. The `NRMSDataLoader` is initialized for each dataset and handles batching and input feature construction, looking up the tokenized text of every article ID in `article_mapping`. If `TRAIN_MODEL` is `False`, the data loaders are not created.

In [55]:
if TRAIN_MODEL:
    train_dataloader = NRMSDataLoader(
        behaviors=df_train,
        article_dict=article_mapping,
        unknown_representation="zeros",
        history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
        eval_mode=False,
        batch_size=64,
    )
    val_dataloader = NRMSDataLoader(
        behaviors=df_validation,
        article_dict=article_mapping,
        unknown_representation="zeros",
        history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
        eval_mode=True,
        batch_size=32,
    )
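
The sketch below illustrates one plausible reading of `unknown_representation="zeros"`: article IDs that are missing from the mapping (including the padding ID 0 used in truncated histories) are represented by an all-zero token vector. This is an assumption about the data loader's behaviour, and `lookup_titles` is a hypothetical helper.

```python
def lookup_titles(article_ids, article_dict, title_length):
    """Return one token list per article ID, using zeros for unknown IDs."""
    unknown = [0] * title_length
    return [list(article_dict.get(a, unknown)) for a in article_ids]


toy_article_dict = {101: [4, 5, 2, 3], 102: [1, 1, 2, 0]}
history = [0, 101, 999]  # padding ID, known article, unseen article

batch = lookup_titles(history, toy_article_dict, title_length=4)
```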

### Model Training and Configuration
If `TRAIN_MODEL` is `True`, this cell sets up the NRMS model with specific hyperparameters and begins training. It configures paths for logging and saving model weights, and sets up callbacks for TensorBoard logging, early stopping, and model checkpointing. The NRMS model is then trained using the training DataLoader and validated using the validation DataLoader. If `TRAIN_MODEL` is `False`, the model weights are loaded from the specified path instead of training.

In [56]:
# Build the model in both cases so that saved weights can be loaded when training is skipped
hparams_nrms.history_size = HISTORY_SIZE
model = NRMSModel(
    hparams=hparams_nrms,
    word2vec_embedding=word2vec_embedding,
    seed=42,
)

if TRAIN_MODEL:
    LOG_DIR = f"data/{MODEL_NAME}"

    # Callbacks: TensorBoard logging, early stopping, and best-weights checkpointing
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR, histogram_freq=1)
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)
    modelcheckpoint = tf.keras.callbacks.ModelCheckpoint(
        filepath=MODEL_PATH, save_best_only=True, save_weights_only=True, verbose=1
    )

    hist = model.model.fit(
        train_dataloader,
        validation_data=val_dataloader,
        epochs=10,
        callbacks=[tensorboard_callback, early_stopping, modelcheckpoint],
    )

# Load the best checkpoint after training, or previously saved weights if training was skipped
_ = model.model.load_weights(filepath=MODEL_PATH)

Epoch 1/10
Epoch 1: val_loss improved from inf to 0.00000, saving model to /home/e12242664/shared/194.035-2024S/groups/Gruppe_33/Group_33/downloads/data/state_dict/NRMS/weights
Epoch 2/10
Epoch 2: val_loss did not improve from 0.00000
Epoch 3/10
Epoch 3: val_loss did not improve from 0.00000
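
The log stops after epoch 3 because of the early-stopping callback with `patience=2`. The pure-Python sketch below mimics that rule (stop once the monitored loss has not strictly improved for `patience` consecutive epochs); it is a simplified illustration, not Keras code, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:          # strict improvement resets the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)       # never triggered: all epochs run


# Same pattern as the log above: no improvement after the first epoch
stop_flat = early_stopping_epoch([0.0] * 10, patience=2)

# A run that keeps improving for a while before stalling
stop_late = early_stopping_epoch([0.5, 0.4, 0.45, 0.39, 0.4, 0.41, 0.3], patience=2)
```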


# Evaluation of the Trained Model

The following cells evaluate the trained NRMS model on either the test set or the validation set. If `TRIAN_ON_TESTSET` is set to `True`, the evaluation is performed on the test set; otherwise, it is performed on the validation set. This process includes loading and preprocessing the data, making predictions with the model, calculating performance metrics, ranking predictions, and optionally writing the results to a submission file.

### Load and Preprocess the Test or Validation Data
This cell loads and preprocesses the data for evaluation. If `TRIAN_ON_TESTSET` is `True`, it loads the test dataset, adds an empty column for clicked articles, selects the required columns, creates binary labels, and samples a fraction of the dataset for efficiency. It then initializes the `NRMSDataLoader` for the test data. If `TRIAN_ON_TESTSET` is `False`, it reuses the previously initialized `val_dataloader` and `df_validation`.

In [None]:
if TRIAN_ON_TESTSET:
    df_test = (
        ebnerd_from_path(DATA_PATH.joinpath("test"), history_size=HISTORY_SIZE)
        # The test set has no click information, so add an empty clicked-articles column
        .with_columns(pl.Series(DEFAULT_CLICKED_ARTICLES_COL, [[]]))
        .select(COLUMNS)
        .pipe(create_binary_labels_column)
        .sample(fraction=FRACTION)
    )
    
    test_dataloader = NRMSDataLoader(
        behaviors=df_test,
        article_dict=article_mapping,
        unknown_representation="zeros",
        history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
        eval_mode=True,
        batch_size=5,
    )

    print(df_test.head(2))
    
else:
    test_dataloader = val_dataloader
    df_test = df_validation

### Model performs prediction
This cell uses the trained model to make predictions on the test or validation data. The `predict` method of the model scorer is applied to the `test_dataloader`.

In [None]:
pred_test = model.scorer.predict(test_dataloader)

In [None]:
df_test = add_prediction_scores(df_test, pred_test.tolist()).pipe(
    add_known_user_column, known_users=df_train[DEFAULT_USER_COL]
)
df_test.head(2)
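
One way to picture what happens when scores are attached to the dataframe: the model returns a flat sequence of scores, which is regrouped so that each impression receives one score per in-view article. The sketch below assumes that flat, row-ordered layout; it is not the implementation of `add_prediction_scores`, and `split_scores` is a hypothetical helper.

```python
def split_scores(flat_scores, group_sizes):
    """Regroup a flat score list into per-impression lists."""
    if sum(group_sizes) != len(flat_scores):
        raise ValueError("group sizes do not add up to the number of scores")
    grouped, start = [], 0
    for size in group_sizes:
        grouped.append(flat_scores[start:start + size])
        start += size
    return grouped


# Two impressions with 2 and 3 in-view articles respectively
flat = [0.1, 0.2, 0.3, 0.4, 0.5]
scores_per_impression = split_scores(flat, [2, 3])
```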

### Evaluate Model Prediction Performance
This cell evaluates the model's performance using the metrics: AUC, MRR, and NDCG (5 and 10). The `MetricEvaluator` class is initialized with the true labels and prediction scores, and the evaluation is performed using the specified metrics.

In [None]:
metrics = MetricEvaluator(
    labels=df_test["labels"].to_list(),
    predictions=df_test["scores"].to_list(),
    metric_functions=[AucScore(), MrrScore(), NdcgScore(k=5), NdcgScore(k=10)],
)
metrics.evaluate()
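
For intuition, the three metric families can be computed by hand for a single impression, as in the sketch below. These are textbook definitions written in plain Python, not the `ebrec.evaluation` implementations; the MRR variant here uses only the first clicked article, and the evaluator's handling of averaging across impressions is not shown.

```python
import math


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrr(labels, scores):
    """Reciprocal rank of the first clicked article."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            return 1.0 / rank
    return 0.0


def dcg_at_k(ordered_labels, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(ordered_labels[:k])
    )


def ndcg_at_k(labels, scores, k):
    """DCG of the predicted order divided by DCG of the ideal order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    actual = dcg_at_k([labels[i] for i in order], k)
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return actual / ideal if ideal > 0 else 0.0


# One impression: the clicked article has the second-highest score
toy_labels = [0, 1, 0, 0]
toy_scores = [0.1, 0.3, 0.8, 0.2]

toy_auc = auc(toy_labels, toy_scores)
toy_mrr = mrr(toy_labels, toy_scores)
toy_ndcg5 = ndcg_at_k(toy_labels, toy_scores, k=5)
```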

### Rank Predictions by Score
This cell ranks the predictions by their scores. It adds a new column `ranked_scores` to the dataframe, where the predictions are ranked based on their scores using the `rank_predictions_by_score` function.

In [None]:
df_test = df_test.with_columns(
    pl.col("scores")
    .map_elements(lambda x: list(rank_predictions_by_score(x)))
    .alias("ranked_scores")
)
df_test.head(2)
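
Converting scores to ranks means that the highest score receives rank 1, the next highest rank 2, and so on, with ranks reported in the original article order. A plain-Python sketch of that idea is shown below; `ranks_from_scores` is a hypothetical stand-in and its tie handling (earlier position wins) is an assumption, not necessarily what `rank_predictions_by_score` does.

```python
def ranks_from_scores(scores):
    """Return the 1-based rank of each score, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


toy_ranks = ranks_from_scores([0.2, 0.1, 0.9, 0.5])
```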

### Write Submission File
This cell writes the ranked predictions to a submission file. The `write_submission_file` function takes the impression IDs and ranked prediction scores from the dataframe and writes them to the specified path.

In [None]:
write_submission_file(
    impression_ids=df_test[DEFAULT_IMPRESSION_ID_COL],
    prediction_scores=df_test["ranked_scores"],
    path="downloads/predictions.txt",
)
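
As a rough picture of the resulting file, the sketch below writes one line per impression consisting of the impression ID followed by the bracketed, comma-separated ranks. This line format is an assumption about what `write_submission_file` produces, and `write_ranked_lines` is a hypothetical helper that writes to a temporary directory.

```python
import tempfile
from pathlib import Path


def write_ranked_lines(impression_ids, ranked_scores, path):
    """Write '<impression_id> [r1,r2,...]' lines (assumed format)."""
    with open(path, "w") as f:
        for impression_id, ranks in zip(impression_ids, ranked_scores):
            joined = ",".join(str(r) for r in ranks)
            f.write(f"{impression_id} [{joined}]\n")


with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "predictions.txt"
    write_ranked_lines([48401, 152513], [[2, 1, 3], [1, 3, 2]], out_path)
    written_lines = out_path.read_text().splitlines()
```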