# LSTUR: Hyperparameter Optimization for the LSTUR Model

This notebook tunes the hyperparameters of the LSTUR model to improve its performance. We use a grid search over the following search space:

```python
param_grid = {
    'history_size': [10, 50, 100],
    'n_users':  [20000, 50000, 70000],
    'title_size': [10, 50, 100],
    'learning_rate': [0.0001, 0.001, 0.01],
    'dropout':  [0.1, 0.3, 0.5]
}
```
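As a minimal sketch of what this search space implies, the grid can be expanded into one configuration per combination. The helper name `expand_grid` is hypothetical and not part of the notebook or of `ebrec`.

```python
import itertools

param_grid = {
    "history_size": [10, 50, 100],
    "n_users": [20000, 50000, 70000],
    "title_size": [10, 50, 100],
    "learning_rate": [0.0001, 0.001, 0.01],
    "dropout": [0.1, 0.3, 0.5],
}


def expand_grid(grid):
    """Return one dict per combination of the grid values (hypothetical helper)."""
    names = list(grid)
    return [dict(zip(names, values)) for values in itertools.product(*grid.values())]


configs = expand_grid(param_grid)
print(len(configs))  # 3**5 = 243 combinations
print(configs[0])
```

With three values per parameter, the full grid contains 243 runs, so restricting the grid to a subset of parameters reduces the cost considerably.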

LSTUR \[2\] is a news recommendation approach that captures both the long-term preferences and the short-term interests of users. The core of LSTUR is composed of a news encoder and a user encoder. The news encoder learns representations of news from their titles, while the user encoder learns long-term user representations from the embeddings of their IDs and short-term user representations from their recently browsed news via a GRU network.

## Properties of LSTUR:

- **Dual User Representations**: LSTUR captures both short-term and long-term preferences by using embeddings of users' IDs for long-term user representations and a GRU network to learn short-term user representations from recently browsed articles.
- **News Encoder**: Utilizes the news titles to generate news representations.
- **User Encoder**: Combines long-term and short-term user representations. Two methods are proposed for this combination:
  - Initializing the hidden state of the GRU network with the long-term user representation.
  - Concatenating both long-term and short-term user representations to form a unified user vector.
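
The two combination methods listed above can be illustrated with a toy GRU written in plain Python. This is only a sketch under simplifying assumptions: the `TinyGRU` class, its random weights, the missing bias terms, and the function names `user_vector_ini` and `user_vector_con` are all hypothetical and unrelated to the `ebrec` implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinyGRU:
    """Minimal GRU cell with random weights and no biases, for illustration only."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def mat():
            return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

        # Input (W) and recurrent (U) weights for update gate z, reset gate r, candidate h
        self.W = {gate: mat() for gate in "zrh"}
        self.U = {gate: mat() for gate in "zrh"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [
            math.tanh(a + b)
            for a, b in zip(matvec(self.W["h"], x), matvec(self.U["h"], reset_h))
        ]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, sequence, h0):
        h = list(h0)
        for x in sequence:
            h = self.step(x, h)
        return h


def user_vector_ini(gru, browsed_news, long_term):
    # Method 1: the long-term vector is the initial hidden state of the GRU
    return gru.run(browsed_news, h0=long_term)


def user_vector_con(gru, browsed_news, long_term):
    # Method 2: run the GRU from a zero state, then concatenate with the long-term vector
    short_term = gru.run(browsed_news, h0=[0.0] * len(long_term))
    return short_term + long_term


dim = 4
gru = TinyGRU(dim)
browsed = [[0.1 * (i + j) for j in range(dim)] for i in range(3)]  # 3 toy news vectors
long_term = [0.2, -0.1, 0.0, 0.3]  # toy user-ID embedding

u_ini = user_vector_ini(gru, browsed, long_term)
u_con = user_vector_con(gru, browsed, long_term)
print(len(u_ini), len(u_con))  # 4 8
```

Initialization keeps the user vector at the GRU's hidden size, whereas concatenation doubles it, which affects the dimension the news representation has to match when computing a click score.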
 
\[1\] https://github.com/ebanalyse/ebnerd-benchmark

\[2\] https://aclanthology.org/P19-1033/

In [None]:
from transformers import AutoTokenizer, AutoModel
from pathlib import Path
import tensorflow as tf
import polars as pl
import os
import itertools

from ebrec.utils._constants import (
    DEFAULT_HISTORY_ARTICLE_ID_COL,
    DEFAULT_CLICKED_ARTICLES_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
    DEFAULT_IMPRESSION_ID_COL,
    DEFAULT_SUBTITLE_COL,
    DEFAULT_TITLE_COL,
    DEFAULT_USER_COL,
)

from ebrec.utils._behaviors import (
    create_binary_labels_column,
    sampling_strategy_wu2019,
    add_known_user_column,
    add_prediction_scores,
    truncate_history,
)
from ebrec.utils._articles import convert_text2encoding_with_transformers
from ebrec.utils._polars import concat_str_columns, slice_join_dataframes
from ebrec.utils._articles import create_article_id_to_value_mapping
from ebrec.utils._nlp import get_transformers_word_embeddings
from ebrec.utils._python import write_submission_file, rank_predictions_by_score
from ebrec.evaluation import MetricEvaluator, AucScore, NdcgScore, MrrScore
    
from ebrec.models.newsrec.dataloader import LSTURDataLoader
from ebrec.models.newsrec.model_config import hparams_lstur
from ebrec.models.newsrec import LSTURModel

## Load and Process Behavior and History Parquet Files
The function below transforms the EB-NeRD dataset tables (`history.parquet` and `behaviors.parquet`) into a format suitable for training and testing. It joins the histories to the behaviors on the user ID column and truncates each user history so that only the specified `history_size` is kept.

In [2]:
def ebnerd_from_path(path: Path, history_size: int = 30) -> pl.DataFrame:
    """
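    Load the EB-NeRD behaviors from `path` and left-join each user's click history,
    truncated to `history_size` articles with 0 as the padding value.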
    """
    df_history = (
        pl.scan_parquet(path.joinpath("history.parquet"))
        .select(DEFAULT_USER_COL, DEFAULT_HISTORY_ARTICLE_ID_COL)
        .pipe(
            truncate_history,
            column=DEFAULT_HISTORY_ARTICLE_ID_COL,
            history_size=history_size,
            padding_value=0,
            enable_warning=False,
        )
    )
    df_behaviors = (
        pl.scan_parquet(path.joinpath("behaviors.parquet"))
        .collect()
        .pipe(
            slice_join_dataframes,
            df2=df_history.collect(),
            on=DEFAULT_USER_COL,
            how="left",
        )
    )
    return df_behaviors
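
To make the truncation and join concrete, here is a plain-Python sketch on toy data. It assumes that the most recent articles are kept and that shorter histories are padded on the left; the actual behavior of `truncate_history` should be checked in `ebrec`. The function names and the `user_id` and `history` keys are hypothetical.

```python
def truncate_and_pad(history, history_size, padding_value=0):
    """Keep the most recent `history_size` items and left-pad shorter lists.

    Assumes history_size >= 1.
    """
    recent = history[-history_size:]
    return [padding_value] * (history_size - len(recent)) + recent


def join_history(behaviors, histories, history_size):
    """Left join: every behavior row is kept, users without a history get None."""
    truncated = {uid: truncate_and_pad(h, history_size) for uid, h in histories.items()}
    return [{**row, "history": truncated.get(row["user_id"])} for row in behaviors]


histories = {1: [11, 12, 13, 14, 15], 2: [21]}
behaviors = [
    {"impression_id": 100, "user_id": 1},
    {"impression_id": 101, "user_id": 2},
    {"impression_id": 102, "user_id": 3},  # no history available
]

joined = join_history(behaviors, histories, history_size=3)
for row in joined:
    print(row)
```

Every row ends up with a history of the same length (or a missing value), which is what allows the histories to be batched into fixed-shape model inputs.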

## Setup Path and Data Configuration
Here we set up `DATA_PATH`, the path to the EB-NeRD dataset used for training. `COLUMNS` defines the columns used for model training, and `TEXT_COLUMNS_TO_USE` lists the text columns that are included in the embedding process.

In [3]:
MODEL_NAME = "LSTUR"

DATA_PATH = Path("~/shared/194.035-2024S/groups/Gruppe_33/Group_33/data").expanduser()
DATASPLIT = "small"

COLUMNS = [
    DEFAULT_USER_COL,
    DEFAULT_HISTORY_ARTICLE_ID_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
    DEFAULT_CLICKED_ARTICLES_COL,
    DEFAULT_IMPRESSION_ID_COL,
]

HISTORY_SIZE = 50
MAX_TITLE_LENGTH = 50

TEXT_COLUMNS_TO_USE = [DEFAULT_SUBTITLE_COL, DEFAULT_TITLE_COL]
FRACTION = 0.01

## Prepare and Process Training and Validation Data
This section uses the previously defined function to create the training and validation datasets. Both sets get binary labels via `create_binary_labels_column`, and the training set is additionally negatively sampled with `sampling_strategy_wu2019` (`npratio=4`). Finally, only a fraction (`FRACTION`) of each set is kept for efficiency.

In [None]:
df_train = (
    ebnerd_from_path(DATA_PATH.joinpath(DATASPLIT, "train"), history_size=HISTORY_SIZE)
    .select(COLUMNS)
    .pipe(
        sampling_strategy_wu2019,
        npratio=4,
        shuffle=True,
        with_replacement=True,
        seed=123,
    )
    .pipe(create_binary_labels_column)
    .sample(fraction=FRACTION)
)

df_validation = (
    ebnerd_from_path(DATA_PATH.joinpath(DATASPLIT, "validation"), history_size=HISTORY_SIZE)
    .select(COLUMNS)
    .pipe(create_binary_labels_column)
    .sample(fraction=FRACTION)
)
df_train.head(2)
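
The idea behind negative sampling with `npratio=4` and binary labels can be sketched in plain Python. This is an illustration under stated assumptions, not the `ebrec` implementation: the function `sample_impression` is hypothetical, it handles a single clicked article, and it assumes at least one non-clicked article is in view.

```python
import random


def sample_impression(inview, clicked, npratio, rng, with_replacement=True):
    """Build one training sample: `npratio` negatives plus the clicked article."""
    positive = clicked[0]
    negatives = [article for article in inview if article not in clicked]
    if with_replacement:
        # Sampling with replacement works even with fewer than `npratio` negatives
        sampled = rng.choices(negatives, k=npratio)
    else:
        sampled = rng.sample(negatives, k=min(npratio, len(negatives)))
    candidates = sampled + [positive]
    rng.shuffle(candidates)
    labels = [1 if article == positive else 0 for article in candidates]
    return candidates, labels


rng = random.Random(123)
candidates, labels = sample_impression(
    inview=[101, 102, 103], clicked=[102], npratio=4, rng=rng
)
print(candidates)
print(labels)
```

Each sample then has a fixed number of candidates (here five) with exactly one positive label, which turns click prediction into a small classification problem per impression.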

## Load the Articles
This cell loads the articles into memory. They are used later in the embedding step.

In [None]:
df_articles = pl.read_parquet(DATA_PATH.joinpath(DATASPLIT, "articles.parquet"))
df_articles.head(2)

## Initialize and Configure Transformer Model

This cell loads a pre-trained transformer model and tokenizer from Hugging Face, specifically `FacebookAI/xlm-roberta-base`, which serves as the NLP backbone for the notebook. The tokenizer turns raw article text into token IDs, and the model's word embeddings are passed to the recommendation model as its word embedding matrix. The following steps are executed:
- **Load Transformer Model and Tokenizer**: The AutoModel and AutoTokenizer from Hugging Face are used to load the pre-trained xlm-roberta-base model.
- **Initialize Word Embeddings**: Word embeddings are initialized using the transformer's word embeddings to enhance the text representation.
- **Concatenate Text Columns**: Text columns from the articles dataframe are concatenated to create a comprehensive text field.
- **Convert Text to Encodings**: The concatenated text is tokenized and converted to numerical encodings using the transformer tokenizer, with a specified maximum length.
- **Create Article Mapping**: A mapping from article IDs to their corresponding tokenized values is created, facilitating efficient lookup and processing in the recommendation pipeline.

In [None]:
TRANSFORMER_MODEL_NAME = "FacebookAI/xlm-roberta-base"
TEXT_COLUMNS_TO_USE = [DEFAULT_SUBTITLE_COL, DEFAULT_TITLE_COL]

# LOAD HUGGINGFACE:
transformer_model = AutoModel.from_pretrained(TRANSFORMER_MODEL_NAME)
transformer_tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL_NAME)

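# Initialize the word embeddings from the transformer model's embedding layer
word2vec_embedding = get_transformers_word_embeddings(transformer_model)
# Concatenate the text columns, then tokenize the result (at most MAX_TITLE_LENGTH tokens)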
df_articles, cat_cal = concat_str_columns(df_articles, columns=TEXT_COLUMNS_TO_USE)
df_articles, token_col_title = convert_text2encoding_with_transformers(
    df_articles, transformer_tokenizer, cat_cal, max_length=MAX_TITLE_LENGTH
)
# Map each article ID to its token encoding
article_mapping = create_article_id_to_value_mapping(
    df=df_articles, value_col=token_col_title
)
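
The pipeline in this cell (concatenate text, encode to a fixed length, build an ID lookup) can be sketched with a toy whitespace tokenizer. Everything here is an assumption for illustration: the real notebook uses the XLM-RoBERTa subword tokenizer, and `build_vocab`, `encode`, `lookup`, and the padding ID 0 are hypothetical.

```python
def build_vocab(texts):
    """Assign an integer ID to every whitespace token; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode(text, vocab, max_length):
    """Truncate to `max_length` token IDs and right-pad with 0."""
    ids = [vocab[token] for token in text.lower().split()][:max_length]
    return ids + [0] * (max_length - len(ids))


articles = [
    {"article_id": 1, "subtitle": "Markets rally", "title": "Stocks close higher"},
    {"article_id": 2, "subtitle": "Weather", "title": "Storm expected"},
]

# Concatenate the text columns, then encode each article to a fixed length
texts = [f'{a["subtitle"]} {a["title"]}' for a in articles]
vocab = build_vocab(texts)
max_length = 6
toy_article_mapping = {
    a["article_id"]: encode(text, vocab, max_length) for a, text in zip(articles, texts)
}


def lookup(article_id):
    """Return the encoding, or an all-zero vector for unknown articles."""
    return toy_article_mapping.get(article_id, [0] * max_length)


print(lookup(1))   # [1, 2, 3, 4, 5, 0]
print(lookup(2))   # [6, 7, 8, 0, 0, 0]
print(lookup(99))  # [0, 0, 0, 0, 0, 0]
```

The fixed length plays the role of `MAX_TITLE_LENGTH`, and the all-zero fallback mirrors the idea of representing unknown articles with zeros.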

### Data Loading for Model Input and Model Configuration
This cell creates data loaders for both training and validation. A mapping from user IDs to unique integer indices is created to facilitate embedding lookup. The `LSTURDataLoader` is initialized for both training and validation datasets, handling batching, shuffling, and input feature construction.

In [7]:
user_id_mapping = {user_id: i for i, user_id in enumerate(df_train[DEFAULT_USER_COL].unique())}

train_dataloader = LSTURDataLoader(
    user_id_mapping=user_id_mapping,
    behaviors=df_train,
    article_dict=article_mapping,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=False,
    batch_size=64,
)
val_dataloader = LSTURDataLoader(
    user_id_mapping=user_id_mapping,
    behaviors=df_validation,
    article_dict=article_mapping,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=True,
    batch_size=32,
)
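
Because the user index feeds an embedding lookup, it can help to check that the largest index fits the `n_users` setting. The sketch below is hypothetical: it reserves index 0 for unknown users, which differs from the mapping above, and it assumes an embedding table with exactly `n_users` rows. How `LSTURDataLoader` and `LSTURModel` handle unknown users and table size should be verified in `ebrec`.

```python
def build_user_index(user_ids, reserve_unknown=True):
    """Map each distinct user ID to a contiguous integer index."""
    offset = 1 if reserve_unknown else 0  # index 0 is kept for unknown users
    return {uid: i + offset for i, uid in enumerate(dict.fromkeys(user_ids))}


def check_embedding_size(mapping, n_users):
    """Raise if an index would fall outside a table with `n_users` rows."""
    largest = max(mapping.values(), default=0)
    if largest >= n_users:
        raise ValueError(f"n_users={n_users} is too small for index {largest}")


train_users = [501, 502, 501, 777]
toy_user_mapping = build_user_index(train_users)
print(toy_user_mapping)              # {501: 1, 502: 2, 777: 3}
print(toy_user_mapping.get(999, 0))  # 0, a user unseen during training

check_embedding_size(toy_user_mapping, n_users=4)  # passes silently
```

A check like this is cheap to run before each grid-search trial, since `n_users` is one of the values being varied.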

# Hyperparameter Optimization for LSTUR Model
In this section, we perform hyperparameter optimization to find the best combination of parameters for the LSTUR model. The optimization focuses on input dimensions such as `history_size`, `n_users`, and `title_size`, as well as model-specific parameters like `learning_rate` and `dropout`. The process involves evaluating different combinations of these parameters to identify the configuration that yields the best performance.

### Setting Up Hyperparameter Optimization
The `objective` function trains the LSTUR model with a given set of hyperparameters and evaluates its performance. It defines the output locations for logs, model weights, and evaluation results, sets the hyperparameters, trains the model, and then evaluates it on the validation data. The evaluation results are saved, and the ranked predictions are written to a submission file.

In [8]:
def objective(history_size, n_users, title_size, learning_rate, dropout, df_validation, df_train):

    # Define the output locations for logs, model weights, and evaluation results
    MODEL_NAME = f"LSTUR_l{learning_rate}_d{dropout}"
    LOG_DIR = f"downloads/runs/{MODEL_NAME}"
    MODEL_WEIGHTS = f"downloads/data/state_dict/{MODEL_NAME}/weights"
    RESULTS_DIR = f"downloads/evaluations/{MODEL_NAME}"
    
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR, histogram_freq=1)
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)
    modelcheckpoint = tf.keras.callbacks.ModelCheckpoint(
        filepath=MODEL_WEIGHTS, save_best_only=True, save_weights_only=True, verbose=1
    )

    # Set the parameters
    hparams_lstur.history_size = history_size
    hparams_lstur.n_users = n_users
    hparams_lstur.title_size = title_size
    hparams_lstur.learning_rate = learning_rate
    hparams_lstur.dropout = dropout
    

    model = LSTURModel(
        hparams=hparams_lstur,
        word2vec_embedding=word2vec_embedding,
        seed=42,
    )
    hist = model.model.fit(
        train_dataloader,
        validation_data=val_dataloader,
        epochs=1,
        callbacks=[tensorboard_callback, early_stopping, modelcheckpoint],
    )
    _ = model.model.load_weights(filepath=MODEL_WEIGHTS)


    
    pred_validation = model.scorer.predict(val_dataloader)
    df_validation = add_prediction_scores(df_validation, pred_validation.tolist()).pipe(
        add_known_user_column, known_users=df_train[DEFAULT_USER_COL]
    )

    metrics = MetricEvaluator(
        labels=df_validation["labels"].to_list(),
        predictions=df_validation["scores"].to_list(),
        metric_functions=[AucScore(), MrrScore(), NdcgScore(k=5), NdcgScore(k=10)],
    )
    evaluation_results = metrics.evaluate().evaluations

    # Save the evaluation results
    os.makedirs(RESULTS_DIR, exist_ok=True)
    with open(os.path.join(RESULTS_DIR, 'evaluation_results.txt'), 'w') as f:
        for key, value in evaluation_results.items():
            f.write(f"{key}: {value}\n")

    # Rank predictions and write submission file
    df_validation = df_validation.with_columns(
        pl.col("scores")
        .map_elements(lambda x: list(rank_predictions_by_score(x)))
        .alias("ranked_scores")
    )
    write_submission_file(
        impression_ids=df_validation[DEFAULT_IMPRESSION_ID_COL],
        prediction_scores=df_validation["ranked_scores"],
        path=os.path.join(RESULTS_DIR, "predictions.txt"),
    )

    return evaluation_results
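
For intuition about the metrics reported by `objective`, the following sketch computes AUC, MRR, nDCG@k, and score ranks for a single impression in plain Python. These follow common textbook definitions and are not copied from `ebrec`, so small differences from `AucScore`, `MrrScore`, `NdcgScore`, and `rank_predictions_by_score` are possible. All function names are hypothetical.

```python
import math


def order_by_score(scores):
    """Indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def rank_by_score(scores):
    """Rank 1 for the highest score, rank 2 for the next, and so on."""
    ranks = [0] * len(scores)
    for position, index in enumerate(order_by_score(scores)):
        ranks[index] = position + 1
    return ranks


def auc_score(labels, scores):
    """Fraction of (clicked, non-clicked) pairs that are ordered correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrr_score(labels, scores):
    """Average reciprocal rank of the clicked items."""
    ordered = [labels[i] for i in order_by_score(scores)]
    return sum(l / (pos + 1) for pos, l in enumerate(ordered)) / sum(labels)


def dcg_at_k(ordered_labels, k):
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(ordered_labels[:k])
    )


def ndcg_score(labels, scores, k):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ordered = [labels[i] for i in order_by_score(scores)]
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ordered, k) / ideal if ideal > 0 else 0.0


labels = [0, 1, 0, 0]          # the second article was clicked
scores = [0.2, 0.5, 0.9, 0.1]  # toy model scores
print(rank_by_score(scores))   # [3, 2, 1, 4]
print(mrr_score(labels, scores))  # 0.5
print(auc_score(labels, scores))
print(ndcg_score(labels, scores, k=5))
```

In this toy impression the clicked article is ranked second, so the reciprocal rank is 0.5 and two of the three non-clicked articles are scored below it.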

### Running Hyperparameter Optimization
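The grid of hyperparameter values is defined below, and the `objective` function is called once for each combination. In this run only `learning_rate` and `dropout` vary (3 × 3 = 9 combinations), while `history_size`, `n_users`, and `title_size` are each fixed to a single value. The results for each combination are collected and saved to a file.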

In [9]:
param_grid = {
    'history_size': [100],
    'n_users':  [70000],
    'title_size': [50],
    'learning_rate': [0.0001, 0.001, 0.01],
    'dropout':  [0.1, 0.3, 0.5]
}

combinations = list(
    itertools.product(
        param_grid['history_size'], param_grid['n_users'], param_grid['title_size'], param_grid['learning_rate'], param_grid['dropout']
    )
)


all_results = []
for history_size, n_users, title_size, learning_rate, dropout in combinations:
    print(
        f"Evaluating combination: history_size={history_size}, n_users={n_users}, "
        f"title_size={title_size}, learning_rate={learning_rate}, dropout={dropout}"
    )
    result = objective(history_size, n_users, title_size, learning_rate, dropout, df_validation, df_train)
    all_results.append({
        'history_size': history_size,
        'n_users': n_users,
        'title_size': title_size,
        'learning_rate': learning_rate,
        'dropout': dropout,
        'evaluation_results': result
    })

# Save all results to a file
with open("downloads/evaluations/all_results.txt", 'w') as f:
    for result in all_results:
        f.write(f"{result}\n")

print("All combinations evaluated.")


Evaluating combination: history_size=100, n_users=70000, title_size=50


2024-06-19 18:39:03.339396: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

KerasTensor(type_spec=TensorSpec(shape=(None, 400), dtype=tf.float32, name=None), name='att_layer2/Sum_1:0', description="created by layer 'att_layer2'")


2024-06-19 18:39:07.941617: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-06-19 18:39:15.400662: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8902
2024-06-19 18:39:15.670208: I external/local_tsl/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-06-19 18:39:18.300474: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f7bd5fac8f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-06-19 18:39:18.300548: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2024-06-19 18:39:18.316275: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1718822358.523325   11580 device_compiler.h:186] Compiled



2024-06-19 18:44:06.538086: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 46080368640 exceeds 10% of free system memory.
2024-06-19 18:44:35.844144: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 46080368640 exceeds 10% of free system memory.
2024-06-19 18:45:02.633908: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 6720096000 exceeds 10% of free system memory.



Epoch 1: val_loss improved from inf to 0.00000, saving model to downloads/data/state_dict/LSTUR_l0.0001_d0.1/weights


2446it [00:00, 10941.06it/s]


Zipping downloads/evaluations/LSTUR_l0.0001_d0.1/predictions.txt to downloads/evaluations/LSTUR_l0.0001_d0.1/predictions.zip
Evaluating combination: history_size=100, n_users=70000, title_size=50
KerasTensor(type_spec=TensorSpec(shape=(None, 400), dtype=tf.float32, name=None), name='att_layer2_1/Sum_1:0', description="created by layer 'att_layer2_1'")

2024-06-19 18:46:10.737159: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 46080368640 exceeds 10% of free system memory.
2024-06-19 18:46:40.059043: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 46080368640 exceeds 10% of free system memory.



Epoch 1: val_loss improved from inf to 0.00000, saving model to downloads/data/state_dict/LSTUR_l0.0001_d0.3/weights


2446it [00:00, 11404.25it/s]


Zipping downloads/evaluations/LSTUR_l0.0001_d0.3/predictions.txt to downloads/evaluations/LSTUR_l0.0001_d0.3/predictions.zip
Evaluating combination: history_size=100, n_users=70000, title_size=50
KerasTensor(type_spec=TensorSpec(shape=(None, 400), dtype=tf.float32, name=None), name='att_layer2_2/Sum_1:0', description="created by layer 'att_layer2_2'")
Epoch 1: val_loss improved from inf to 0.00000, saving model to downloads/data/state_dict/LSTUR_l0.0001_d0.5/weights


2446it [00:00, 11361.59it/s]


Zipping downloads/evaluations/LSTUR_l0.0001_d0.5/predictions.txt to downloads/evaluations/LSTUR_l0.0001_d0.5/predictions.zip
Evaluating combination: history_size=100, n_users=70000, title_size=50
KerasTensor(type_spec=TensorSpec(shape=(None, 400), dtype=tf.float32, name=None), name='att_layer2_3/Sum_1:0', description="created by layer 'att_layer2_3'")
Epoch 1: val_loss improved from inf to 0.00000, saving model to downloads/data/state_dict/LSTUR_l0.001_d0.1/weights


2446it [00:00, 11382.24it/s]


Zipping downloads/evaluations/LSTUR_l0.001_d0.1/predictions.txt to downloads/evaluations/LSTUR_l0.001_d0.1/predictions.zip
Evaluating combination: history_size=100, n_users=70000, title_size=50
KerasTensor(type_spec=TensorSpec(shape=(None, 400), dtype=tf.float32, name=None), name='att_layer2_4/Sum_1:0', description="created by layer 'att_layer2_4'")
Epoch 1: val_loss improved from inf to 0.00000, saving model to downloads/data/state_dict/LSTUR_l0.001_d0.3/weights


2446it [00:00, 11402.86it/s]


Zipping downloads/evaluations/LSTUR_l0.001_d0.3/predictions.txt to downloads/evaluations/LSTUR_l0.001_d0.3/predictions.zip
Evaluating combination: history_size=100, n_users=70000, title_size=50
KerasTensor(type_spec=TensorSpec(shape=(None, 400), dtype=tf.float32, name=None), name='att_layer2_5/Sum_1:0', description="created by layer 'att_layer2_5'")
Epoch 1: val_loss improved from inf to 0.00000, saving model to downloads/data/state_dict/LSTUR_l0.001_d0.5/weights


2446it [00:00, 11446.36it/s]


Zipping downloads/evaluations/LSTUR_l0.001_d0.5/predictions.txt to downloads/evaluations/LSTUR_l0.001_d0.5/predictions.zip
Evaluating combination: history_size=100, n_users=70000, title_size=50
KerasTensor(type_spec=TensorSpec(shape=(None, 400), dtype=tf.float32, name=None), name='att_layer2_6/Sum_1:0', description="created by layer 'att_layer2_6'")
Epoch 1: val_loss improved from inf to 0.00000, saving model to downloads/data/state_dict/LSTUR_l0.01_d0.1/weights


2446it [00:00, 11311.65it/s]


Zipping downloads/evaluations/LSTUR_l0.01_d0.1/predictions.txt to downloads/evaluations/LSTUR_l0.01_d0.1/predictions.zip
Evaluating combination: history_size=100, n_users=70000, title_size=50
KerasTensor(type_spec=TensorSpec(shape=(None, 400), dtype=tf.float32, name=None), name='att_layer2_7/Sum_1:0', description="created by layer 'att_layer2_7'")
Epoch 1: val_loss improved from inf to 0.00000, saving model to downloads/data/state_dict/LSTUR_l0.01_d0.3/weights


2446it [00:00, 11429.53it/s]


Zipping downloads/evaluations/LSTUR_l0.01_d0.3/predictions.txt to downloads/evaluations/LSTUR_l0.01_d0.3/predictions.zip
Evaluating combination: history_size=100, n_users=70000, title_size=50
KerasTensor(type_spec=TensorSpec(shape=(None, 400), dtype=tf.float32, name=None), name='att_layer2_8/Sum_1:0', description="created by layer 'att_layer2_8'")
Epoch 1: val_loss improved from inf to 0.00000, saving model to downloads/data/state_dict/LSTUR_l0.01_d0.5/weights


2446it [00:00, 11181.53it/s]

Zipping downloads/evaluations/LSTUR_l0.01_d0.5/predictions.txt to downloads/evaluations/LSTUR_l0.01_d0.5/predictions.zip
All combinations evaluated.
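
Once all combinations are evaluated, the best configuration can be picked from the collected results. The sketch below uses placeholder numbers that are NOT results from this notebook, and it assumes that each entry stores its hyperparameters next to an `evaluation_results` dict with an `auc` key. The helper `best_configuration` and the metric key are hypothetical.

```python
import json


def best_configuration(results, metric="auc"):
    """Return the entry with the highest value for `metric`."""
    return max(results, key=lambda entry: entry["evaluation_results"][metric])


# Placeholder values for illustration only, not measured results
toy_results = [
    {"learning_rate": 0.0001, "dropout": 0.1, "evaluation_results": {"auc": 0.50}},
    {"learning_rate": 0.001, "dropout": 0.3, "evaluation_results": {"auc": 0.55}},
    {"learning_rate": 0.01, "dropout": 0.5, "evaluation_results": {"auc": 0.52}},
]

best = best_configuration(toy_results)
print(best["learning_rate"], best["dropout"])  # 0.001 0.3

# JSON keeps the saved results machine-readable for later comparison
serialized = json.dumps(toy_results, indent=2)
restored = json.loads(serialized)
```

Writing the results as JSON instead of one `repr` line per entry makes it easier to reload and compare runs later.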



