# LSTUR: Long- and Short-term User Representation for News Recommendation

This notebook contains code for training and testing an LSTUR model on the Ebnerd Dataset. It builds on the LSTURDataLoader, the LSTURModel and the NRMS example notebook in the Ebnerd Benchmark Repository \[1\].

LSTUR \[2\] is a news recommendation approach that captures both the long-term preferences and the short-term interests of users. Its core consists of a news encoder and a user encoder. The news encoder learns news representations from titles, while the user encoder learns long-term user representations from the embeddings of user IDs and short-term user representations from recently browsed news via a GRU network.

## Properties of LSTUR:

- **Dual User Representations**: LSTUR captures both short-term and long-term preferences by using embeddings of users' IDs for long-term user representations and a GRU network to learn short-term user representations from recently browsed articles.
- **News Encoder**: Utilizes the news titles to generate news representations.
- **User Encoder**: Combines long-term and short-term user representations. Two methods are proposed for this combination:
  - Initializing the hidden state of the GRU network with the long-term user representation.
  - Concatenating both long-term and short-term user representations to form a unified user vector.
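
The two combination methods can be sketched with a tiny, bias-free GRU written in plain Python. This is an illustration under stated assumptions, not the `LSTURModel` implementation: `TinyGRU`, the random weights, the vector size of 4, and the stand-in vectors for news and user embeddings are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyGRU:
    """Bias-free GRU cell with random weights (illustration only)."""

    def __init__(self, dim, rng):
        # One input matrix (W) and one recurrent matrix (U) per gate:
        # update gate z, reset gate r, candidate state n.
        self.W = {g: rand_matrix(dim, dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(dim, dim, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], reset_h))]
        return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]

    def run(self, sequence, h0):
        h = h0
        for x in sequence:
            h = self.step(x, h)
        return h


rng = random.Random(0)
dim = 4
gru = TinyGRU(dim, rng)
# Stand-ins for news-encoder outputs of three recently browsed articles.
browsed_news = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
# Stand-in for the long-term representation (the user-ID embedding).
long_term = [rng.uniform(-1, 1) for _ in range(dim)]

# Method 1: the long-term vector initializes the GRU hidden state.
user_ini = gru.run(browsed_news, h0=long_term)
# Method 2: the GRU starts from zeros and its output is concatenated
# with the long-term vector.
user_con = gru.run(browsed_news, h0=[0.0] * dim) + long_term
```

With initialization the user vector keeps the GRU's dimensionality, whereas concatenation doubles it, so the candidate news vectors would need a matching size in that variant.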
 
\[1\] https://github.com/ebanalyse/ebnerd-benchmark

\[2\] https://aclanthology.org/P19-1033/

In [None]:
import os
from transformers import AutoTokenizer, AutoModel
from pathlib import Path
import tensorflow as tf
import polars as pl

from ebrec.utils._constants import (
    DEFAULT_HISTORY_ARTICLE_ID_COL,
    DEFAULT_CLICKED_ARTICLES_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
    DEFAULT_IMPRESSION_ID_COL,
    DEFAULT_SUBTITLE_COL,
    DEFAULT_LABELS_COL,
    DEFAULT_TITLE_COL,
    DEFAULT_USER_COL,
)

from ebrec.utils._behaviors import (
    create_binary_labels_column,
    sampling_strategy_wu2019,
    add_known_user_column,
    add_prediction_scores,
    truncate_history,
)
from ebrec.utils._articles import convert_text2encoding_with_transformers
from ebrec.utils._polars import concat_str_columns, slice_join_dataframes
from ebrec.utils._articles import create_article_id_to_value_mapping
from ebrec.utils._nlp import get_transformers_word_embeddings
from ebrec.utils._python import write_submission_file, rank_predictions_by_score

from ebrec.models.newsrec.dataloader import LSTURDataLoader
from ebrec.models.newsrec.model_config import hparams_lstur
from ebrec.models.newsrec import LSTURModel

from ebrec.evaluation import MetricEvaluator, AucScore, NdcgScore, MrrScore

## Load and Process Behavior and History Parquet Files
The functions below transform the Ebnerd Dataset tables (`history.parquet` and `behaviors.parquet`) into a format suitable for training and testing by joining histories and behaviors on the user ID. Preprocessing also truncates each user history to the specified `history_size`. The `ebnerd_from_path_lazy` function builds the same join lazily, which avoids impractical memory usage when loading the large test set.
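
The truncate-and-join step can be pictured with plain Python lists and dictionaries. This is a rough sketch, not the `ebrec` implementation: the helper names are hypothetical, and it assumes that the most recent entries are kept and that short histories are padded on the left.

```python
def truncate_history_sketch(history, history_size, padding_value=None):
    """Keep the last `history_size` items; optionally left-pad short histories."""
    recent = history[-history_size:]
    if padding_value is not None:
        recent = [padding_value] * (history_size - len(recent)) + recent
    return recent


def left_join_on_user(behaviors, histories):
    """Attach a user's history to each impression (None when no history exists)."""
    return [{**row, "history": histories.get(row["user_id"])} for row in behaviors]


histories = {
    1: truncate_history_sketch([11, 12, 13, 14, 15], history_size=3, padding_value=0),
    2: truncate_history_sketch([21], history_size=3, padding_value=0),
}
behaviors = [
    {"impression_id": 100, "user_id": 1},
    {"impression_id": 101, "user_id": 2},
    {"impression_id": 102, "user_id": 3},  # user without a recorded history
]
joined = left_join_on_user(behaviors, histories)
```

A left join keeps every impression, so users without a history row end up with a missing history rather than being dropped.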


In [2]:
def ebnerd_from_path(path: Path, history_size: int = 30) -> pl.DataFrame:
    """
    Load ebnerd dataset and join history with behaviors.
    """
    df_history = (
        pl.scan_parquet(path.joinpath("history.parquet"))
        .select(DEFAULT_USER_COL, DEFAULT_HISTORY_ARTICLE_ID_COL)
        .pipe(
            truncate_history,
            column=DEFAULT_HISTORY_ARTICLE_ID_COL,
            history_size=history_size,
            padding_value=0,
            enable_warning=False,
        )
    )
    df_behaviors = (
        pl.scan_parquet(path.joinpath("behaviors.parquet"))
        .collect()
        .pipe(
            slice_join_dataframes,
            df2=df_history.collect(),
            on=DEFAULT_USER_COL,
            how="left",
        )
    )
    return df_behaviors

def ebnerd_from_path_lazy(path: Path, history_size: int = 30) -> pl.LazyFrame:
    """
    Load ebnerd dataset and join history with behaviors
    in a lazy fashion.
    """
    df_history = (
        pl.scan_parquet(path.joinpath("history.parquet"))
        .select(DEFAULT_USER_COL, DEFAULT_HISTORY_ARTICLE_ID_COL)
        .pipe(
            truncate_history,
            column=DEFAULT_HISTORY_ARTICLE_ID_COL,
            history_size=history_size,
            padding_value=0,
        )
    )
    df_behaviors = (
        pl.scan_parquet(path.joinpath("behaviors.parquet"))
    )
    return df_behaviors.join(other=df_history, on=DEFAULT_USER_COL, how="left")

## Setup Path and Data Configuration
Here we set up the paths to the Ebnerd dataset used for training. `COLUMNS` defines the columns used for model training, and `TEXT_COLUMNS_TO_USE` lists the columns that are considered in the embedding process. The remaining constants hold the "optimal" model parameters found through a hyperopt process; see `lstur-hyperopt.ipynb` and `lstur-analysis.ipynb` for details.

In [None]:
MODEL_NAME = "LSTUR"
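# pathlib does not expand "~" by itself, so resolve the home directory explicitly.
MODEL_PATH = Path(f"~/shared/194.035-2024S/groups/Gruppe_33/Group_33/models/{MODEL_NAME}/weights").expanduser()

DATA_PATH = Path("~/shared/194.035-2024S/groups/Gruppe_33/Group_33/data").expanduser()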
DATASPLIT = "small"

COLUMNS = [
    DEFAULT_USER_COL,
    DEFAULT_HISTORY_ARTICLE_ID_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
    DEFAULT_CLICKED_ARTICLES_COL,
    DEFAULT_IMPRESSION_ID_COL,
]

TEXT_COLUMNS_TO_USE = [DEFAULT_SUBTITLE_COL, DEFAULT_TITLE_COL]

TITLE_LENGTH = 50
HISTORY_SIZE = 100
NUSER_SIZE = 70000
DROPOUT = 0.5 
LEARNING_RATE = 0.01

TRIAN_ON_TESTSET = False
# Training is enabled whenever the TRAIN environment variable holds any non-empty string.
TRAIN_MODEL = os.environ.get("TRAIN")

FRACTION = 1

## Prepare and Process Training and Validation Data
This section uses the previously defined functions to create training and validation datasets. Additional preprocessing steps include applying the `sampling_strategy_wu2019` strategy, creating binary labels using `create_binary_labels_column`, and sampling a fraction of the data for efficiency.
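
The effect of negative sampling with `npratio=4` followed by binary labelling can be sketched as follows. This is a simplified stand-in for `sampling_strategy_wu2019` and `create_binary_labels_column`, not their actual code: the function name is hypothetical and a single clicked article per impression is assumed.

```python
import random


def sample_negatives(inview, clicked, npratio, rng, with_replacement=True):
    """Return (candidates, labels): the clicked article plus `npratio` negatives."""
    negatives = [a for a in inview if a != clicked]
    if not negatives or (not with_replacement and len(negatives) < npratio):
        return None  # impression cannot supply enough negatives
    if with_replacement:
        sampled = rng.choices(negatives, k=npratio)
    else:
        sampled = rng.sample(negatives, npratio)
    candidates = [clicked] + sampled
    rng.shuffle(candidates)
    labels = [int(a == clicked) for a in candidates]
    return candidates, labels


rng = random.Random(123)
inview = [9784679, 9784591, 9784696, 9784710]
candidates, labels = sample_negatives(inview, clicked=9784696, npratio=4, rng=rng)
```

Each training row then holds `npratio + 1` candidates with exactly one positive label, which gives every training sample the same fixed-size candidate list.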

In [4]:
if TRAIN_MODEL:
    df_train = (
        ebnerd_from_path(DATA_PATH.joinpath(DATASPLIT, "train"), history_size=HISTORY_SIZE)
        .select(COLUMNS)
        .pipe(
            sampling_strategy_wu2019,
            npratio=4,
            shuffle=True,
            with_replacement=True,
            seed=123,
        )
        .pipe(create_binary_labels_column)
        .sample(fraction=FRACTION)
    )
    
    df_validation = (
        ebnerd_from_path(DATA_PATH.joinpath(DATASPLIT, "validation"), history_size=HISTORY_SIZE)
        .select(COLUMNS)
        .pipe(create_binary_labels_column)
        .sample(fraction=FRACTION)
    )
    
    print(df_validation.head(2))

user_id,article_id_fixed,article_ids_inview,article_ids_clicked,impression_id,labels
u32,list[i32],list[i32],list[i32],u32,list[i8]
22548,"[9752593, 9753773, … 9776929]","[9784679, 9784591, … 9784710]",[9784696],96791,"[0, 0, … 0]"
22548,"[9752593, 9753773, … 9776929]","[9784642, 9769155, … 9784444]",[9784281],96798,"[0, 0, … 0]"


## Load the Articles
This cell loads the articles into memory; their text columns are used later to build the article embeddings.

In [5]:
if TRAIN_MODEL:
    df_articles = pl.read_parquet(DATA_PATH.joinpath(DATASPLIT, "articles.parquet"))
    print(df_articles.head(2))

article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
i32,str,str,datetime[μs],bool,str,datetime[μs],list[i64],str,str,list[str],list[str],list[str],i16,list[i16],str,i32,i32,f32,f32,str
3001353,"""Natascha var i…","""Politiet frygt…",2023-06-29 06:20:33,False,"""Sagen om den ø…",2006-08-31 08:06:45,[3150850],"""article_defaul…","""https://ekstra…",[],[],"[""Kriminalitet"", ""Personfarlig kriminalitet""]",140,[],"""krimi""",,,,0.9955,"""Negative"""
3003065,"""Kun Star Wars …","""Biografgængern…",2023-06-29 06:20:35,False,"""Vatikanet har …",2006-05-21 16:57:00,[3006712],"""article_defaul…","""https://ekstra…",[],[],"[""Underholdning"", ""Film og tv"", ""Økonomi""]",414,"[433, 434]","""underholdning""",,,,0.846,"""Positive"""


## Initialize and Configure Transformer Model

This cell loads a pre-trained transformer model and tokenizer from Hugging Face, specifically FacebookAI/xlm-roberta-base, establishing the NLP backbone for the notebook. The transformer model is critical for transforming raw text data into structured embeddings that can be effectively utilized within the recommendation system. The following steps are executed:
- **Load Transformer Model and Tokenizer**: The AutoModel and AutoTokenizer from Hugging Face are used to load the pre-trained xlm-roberta-base model.
- **Initialize Word Embeddings**: Word embeddings are initialized using the transformer's word embeddings to enhance the text representation.
- **Concatenate Text Columns**: Text columns from the articles dataframe are concatenated to create a comprehensive text field.
- **Convert Text to Encodings**: The concatenated text is tokenized and converted to numerical encodings using the transformer tokenizer, with a specified maximum length.
- **Create Article Mapping**: A mapping from article IDs to their corresponding tokenized values is created, facilitating efficient lookup and processing in the recommendation pipeline.
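
The concatenate, encode, and map steps above can be sketched without any external dependency. The whitespace tokenizer, the padding ID of 0, the space separator, and the toy articles below are assumptions made for illustration; the notebook itself relies on the Hugging Face tokenizer and the `ebrec` helpers.

```python
def encode(text, vocab, max_length, pad_id=0):
    """Map whitespace tokens to integer IDs, then truncate or right-pad to `max_length`."""
    ids = [vocab.setdefault(token, len(vocab) + 1) for token in text.lower().split()]
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


articles = [
    {"article_id": 1, "subtitle": "Short subtitle", "title": "A title"},
    {"article_id": 2, "subtitle": "Another much longer subtitle here", "title": "Second title"},
]

vocab = {}
max_length = 5
# Concatenate the text columns, encode them, and key the result by article ID.
article_mapping = {
    a["article_id"]: encode(f'{a["subtitle"]} {a["title"]}', vocab, max_length)
    for a in articles
}


def lookup(article_id):
    """Return the encoding of an article, or all zeros for an unknown ID."""
    return article_mapping.get(article_id, [0] * max_length)
```

Fixed-length encodings let the data loader stack articles into rectangular batches, and the all-zeros fallback mirrors the idea behind `unknown_representation="zeros"`.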

In [None]:
if TRAIN_MODEL:
    TRANSFORMER_MODEL_NAME = "FacebookAI/xlm-roberta-base"
    
    # LOAD HUGGINGFACE:
    transformer_model = AutoModel.from_pretrained(TRANSFORMER_MODEL_NAME)
    transformer_tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL_NAME)
    
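    # Initialize the word embeddings from the transformer's embedding layer.
    word2vec_embedding = get_transformers_word_embeddings(transformer_model)
    # Concatenate the selected text columns into a single text column.
    df_articles, cat_cal = concat_str_columns(df_articles, columns=TEXT_COLUMNS_TO_USE)
    # Tokenize the concatenated text into fixed-length token-ID encodings.
    df_articles, token_col_title = convert_text2encoding_with_transformers(
        df_articles, transformer_tokenizer, cat_cal, max_length=TITLE_LENGTH
    )
    # Map each article ID to its token encoding for lookup in the data loader.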
    article_mapping = create_article_id_to_value_mapping(
        df=df_articles, value_col=token_col_title
    )

## Training of the LSTUR Model
The following cells handle the setup and training of the LSTUR model. If `TRAIN_MODEL` is `True`, the necessary data loaders are created, the model is configured with specific hyperparameters, and the training process begins. If `TRAIN_MODEL` is `False`, the pre-trained model weights are loaded instead.

### Data Loading for Model Input and Model Configuration
If `TRAIN_MODEL` is `True`, this cell creates data loaders for both training and validation. A mapping from user IDs to unique integer indices is created to facilitate embedding lookup. The `LSTURDataLoader` is initialized for both training and validation datasets, handling batching, shuffling, and input feature construction. If `TRAIN_MODEL` is `False`, the data loaders are not created.
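
The user-index mapping can be sketched as follows. The fallback index for users that do not appear in the training data is an assumption of this sketch (how `LSTURDataLoader` treats them is not shown here), and the user IDs are made up apart from the one visible in the output above.

```python
def build_user_index(user_ids):
    """Assign consecutive integer indices to user IDs in first-seen order."""
    mapping = {}
    for uid in user_ids:
        mapping.setdefault(uid, len(mapping))
    return mapping


train_users = [22548, 31337, 22548, 40404]
user_id_mapping = build_user_index(train_users)

# The embedding table must have at least one row per mapped user.
N_USERS = 70000
fits_embedding_table = len(user_id_mapping) <= N_USERS


def user_index(uid, unknown_index=0):
    """Look up the embedding row of a user; unseen users share `unknown_index`."""
    return user_id_mapping.get(uid, unknown_index)
```

Checking the mapping size against the configured number of users before training guards against out-of-range embedding lookups.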

In [7]:
if TRAIN_MODEL:
    user_id_mapping = {user_id: i for i, user_id in enumerate(df_train[DEFAULT_USER_COL].unique())}
    
    train_dataloader = LSTURDataLoader(
        user_id_mapping=user_id_mapping,
        behaviors=df_train,
        article_dict=article_mapping,
        unknown_representation="zeros",
        history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
        eval_mode=False,
        batch_size=5,
    )
    val_dataloader = LSTURDataLoader(
        user_id_mapping=user_id_mapping,
        behaviors=df_validation,
        article_dict=article_mapping,
        unknown_representation="zeros",
        history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
        eval_mode=True,
        batch_size=5,
    )

### Model Training and Configuration
If `TRAIN_MODEL` is `True`, this cell sets up the LSTUR model with specific hyperparameters and begins training. It configures paths for logging and saving model weights, and sets up callbacks for TensorBoard logging, early stopping, and model checkpointing. The LSTUR model is then trained using the training data loader and validated using the validation data loader. The final `load_weights` call runs regardless of `TRAIN_MODEL` and restores the checkpoint stored at `MODEL_PATH`; note that `model` itself is only constructed inside the `TRAIN_MODEL` branch.
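
How early stopping with a patience value interacts with best-only checkpointing can be illustrated with a small simulation. This is a simplified model of the patience rule, not the Keras implementation, and the validation losses are invented.

```python
def simulate_training(val_losses, patience):
    """Return (epochs_run, best_epoch) under a patience-based early-stopping rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: a best-only checkpoint would be written here.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


full_run = simulate_training([0.70, 0.65, 0.66, 0.67, 0.60], patience=2)
single_epoch = simulate_training([0.70], patience=2)
```

In the first run training halts after epoch 4 and the checkpoint from epoch 2 is the one kept. With a single epoch, as configured in the cell below, a patience of 2 can never trigger.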

In [None]:
if TRAIN_MODEL:
    LOG_DIR = f"tmp/runs/{MODEL_NAME}"
    
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR, histogram_freq=1)
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)
    modelcheckpoint = tf.keras.callbacks.ModelCheckpoint(
        filepath=MODEL_PATH, save_best_only=True, save_weights_only=True, verbose=1
    )
    
    hparams_lstur.history_size = HISTORY_SIZE
    hparams_lstur.title_size = TITLE_LENGTH
    hparams_lstur.n_users = NUSER_SIZE
    hparams_lstur.dropout = DROPOUT
    hparams_lstur.learning_rate = LEARNING_RATE
    
    model = LSTURModel(
        hparams=hparams_lstur,
        word2vec_embedding=word2vec_embedding,
        seed=42,
    )
    hist = model.model.fit(
        train_dataloader,
        validation_data=val_dataloader,
        epochs=1,
        callbacks=[tensorboard_callback, early_stopping, modelcheckpoint],
    )
    
_ = model.model.load_weights(filepath=MODEL_PATH)

## Evaluation of the Trained Model

The following cells evaluate the trained LSTUR model on either the test set or the validation set. If `TRIAN_ON_TESTSET` is set to `True`, the evaluation is performed on the test set; otherwise, it is performed on the validation set. This process includes loading and preprocessing the data, making predictions with the model, calculating performance metrics, ranking predictions, and optionally writing the results to a submission file.

### Load and Preprocess the Test or Validation Data
This cell loads and preprocesses the data for evaluation. If `TRIAN_ON_TESTSET` is `True`, it loads the test dataset, adds a column for clicked articles (initially empty), selects the required columns, creates binary labels, and samples a fraction of the dataset for efficiency. It then initializes the `LSTURDataLoader` for the test data. If `TRIAN_ON_TESTSET` is `False`, it reuses the previously initialized `val_dataloader` and `df_validation`.

In [13]:
if TRIAN_ON_TESTSET:
    df_test = (
        ebnerd_from_path_lazy(DATA_PATH.joinpath("test"), history_size=HISTORY_SIZE)
        .with_columns(pl.Series(DEFAULT_CLICKED_ARTICLES_COL, [[]]))
        .select(COLUMNS)
        .pipe(create_binary_labels_column)
        .collect()
        .sample(fraction=FRACTION)
    )
    
    test_dataloader = LSTURDataLoader(
        user_id_mapping=user_id_mapping,
        behaviors=df_test,
        article_dict=article_mapping,
        unknown_representation="zeros",
        history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
        eval_mode=True,
        batch_size=5,
    )

    print(df_test.head(2))
    
else:
    test_dataloader = val_dataloader
    df_test = df_validation

### Generate Predictions
This cell uses the trained model to make predictions on the test or validation data. The `predict` method of the model scorer is applied to the `test_dataloader`.

In [11]:
pred_test = model.scorer.predict(test_dataloader)



In [13]:
df_test = add_prediction_scores(df_test, pred_test.tolist()).pipe(
    add_known_user_column, known_users=df_train[DEFAULT_USER_COL]
)
df_test.head(2)

user_id,article_id_fixed,article_ids_inview,article_ids_clicked,impression_id,labels,scores,is_known_user
u32,list[i32],list[i32],list[i32],u32,list[i8],list[f64],bool
22548,"[9752593, 9753773, … 9776929]","[9784679, 9784591, … 9784710]",[9784696],96791,"[0, 0, … 0]","[NaN, NaN, … NaN]",True
22548,"[9752593, 9753773, … 9776929]","[9784642, 9769155, … 9784444]",[9784281],96798,"[0, 0, … 0]","[NaN, NaN, … NaN]",True


### Evaluate Model Prediction Performance
This cell evaluates the model's performance using the metrics: AUC, MRR, and NDCG (5 and 10). The `MetricEvaluator` class is initialized with the true labels and prediction scores, and the evaluation is performed using the specified metrics.
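
For reference, the three metrics can be computed for a single impression as sketched below. These are standard textbook definitions written in plain Python, not the `ebrec.evaluation` code; the labels and scores are invented, and the evaluator is assumed to average such per-impression values over the dataset.

```python
import math


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ordered correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def labels_by_score(labels, scores):
    """Reorder the labels by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def mrr(labels, scores):
    """Mean reciprocal rank of the positive items."""
    ranked = labels_by_score(labels, scores)
    return sum(l / rank for rank, l in enumerate(ranked, start=1)) / sum(labels)


def dcg(ranked_labels, k):
    return sum(
        (2 ** l - 1) / math.log2(rank + 1)
        for rank, l in enumerate(ranked_labels[:k], start=1)
    )


def ndcg(labels, scores, k):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    return dcg(labels_by_score(labels, scores), k) / dcg(sorted(labels, reverse=True), k)


labels = [0, 1, 0, 0]
scores = [0.3, 0.2, 0.9, 0.1]
results = {
    "auc": auc(labels, scores),
    "mrr": mrr(labels, scores),
    "ndcg@5": ndcg(labels, scores, k=5),
    "ndcg@10": ndcg(labels, scores, k=10),
}
```

In this toy impression the clicked article is ranked third of four, which yields an AUC and MRR of 1/3 and an NDCG of 0.5 at both cut-offs.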

In [None]:
metrics = MetricEvaluator(
    labels=df_test["labels"].to_list(),
    predictions=df_test["scores"].to_list(),
    metric_functions=[AucScore(), MrrScore(), NdcgScore(k=5), NdcgScore(k=10)],
)
metrics.evaluate()

### Rank Predictions by Score
This cell ranks the predictions by their scores. It adds a new column `ranked_scores` to the dataframe, where the predictions are ranked based on their scores using the `rank_predictions_by_score` function.
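
The conversion from scores to ranks can be sketched as follows, assuming that rank 1 denotes the highest score; the helper name is hypothetical and is not the `ebrec` function.

```python
def rank_by_score(scores):
    """Return 1-based ranks, where rank 1 marks the highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


ranks = rank_by_score([0.2, 0.9, 0.5])
```

The ranks stay aligned with the original article order, so position `i` holds the rank of the `i`-th in-view article. Scores that are NaN compare as neither greater nor smaller than any value, so ranks derived from them carry no information.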

In [15]:
df_test = df_test.with_columns(
    pl.col("scores")
    .map_elements(lambda x: list(rank_predictions_by_score(x)))
    .alias("ranked_scores")
)
df_test.head(2)

user_id,article_id_fixed,article_ids_inview,article_ids_clicked,impression_id,labels,scores,is_known_user,ranked_scores
u32,list[i32],list[i32],list[i32],u32,list[i8],list[f64],bool,list[i64]
22548,"[9752593, 9753773, … 9776929]","[9784679, 9784591, … 9784710]",[9784696],96791,"[0, 0, … 0]","[NaN, NaN, … NaN]",True,"[5, 4, … 1]"
22548,"[9752593, 9753773, … 9776929]","[9784642, 9769155, … 9784444]",[9784281],96798,"[0, 0, … 0]","[NaN, NaN, … NaN]",True,"[12, 1, … 24]"


### Write Submission File
This cell writes the ranked predictions to a submission file. The `write_submission_file` function takes the impression IDs and ranked prediction scores from the dataframe and writes them to the specified path.
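
A submission writer along these lines can be sketched with the standard library only. The line format (`<impression_id> [r1,r2,...]`) and the helper name are assumptions of this sketch and should be checked against the benchmark's own `write_submission_file` before relying on them.

```python
import tempfile
import zipfile
from pathlib import Path


def write_submission_sketch(impression_ids, ranked_scores, path):
    """Write one line per impression, then zip the file next to it."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        for impression_id, ranks in zip(impression_ids, ranked_scores):
            f.write(f"{impression_id} [{','.join(str(r) for r in ranks)}]\n")
    zip_path = path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(path, arcname=path.name)
    return zip_path


out_dir = Path(tempfile.mkdtemp())
zip_path = write_submission_sketch(
    impression_ids=[96791, 96798],
    ranked_scores=[[3, 1, 2], [1, 2]],
    path=out_dir / "predictions.txt",
)
```

Writing line by line keeps memory usage flat even for several hundred thousand impressions.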

In [16]:
write_submission_file(
    impression_ids=df_test[DEFAULT_IMPRESSION_ID_COL],
    prediction_scores=df_test["ranked_scores"],
    path="downloads/predictions.txt",
)

244647it [00:54, 4451.40it/s]


Zipping downloads/predictions.txt to downloads/predictions.zip
