**IMPORTANT NOTES**

- **Comments with two hash symbols (##) are notes for whoever runs this notebook. They give cell-by-cell instructions on what to modify and what to leave as is.**

- **To avoid out-of-memory issues, it is strongly recommended not to run unneeded cells. Some cells are included for demonstration purposes only and do not need to be run. Please follow the running instructions carefully: running additional or unnecessary code can lead to excessive memory usage, causing Colab to disconnect.**


---
# **DRIVE MOUNTING AND LOGGING**

This section installs the requirements and mounts Google Drive when the notebook runs in Colab, so that the required dependencies are available and the notebook can access the dataset stored on Drive.

> ## **Drive Mounting and CWD**
>
> **_Important:_**  
>
> **If the "Tabular_Transformer" folder is a shared folder, you will need to create a shortcut to it in your own Drive. You can do this by navigating to the shared folder, right-clicking, and selecting "Add shortcut to Drive". Once you add the shortcut to your Drive, you should be able to access it from Colab as described below. Be sure to set the correct ROOT_DIR path.**

In [None]:
%%capture
## CREATE A SHORTCUT TO THE DRIVE AND RUN THIS CELL
import os
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

## ADJUST THE PATHS IF NEEDED
ROOT_DIR='/content/drive/MyDrive/Tabular_Transformer/PRSA'
RAW_DATA_DIR = os.path.join(ROOT_DIR, 'data/raw')
PROCESSED_DATA_DIR = os.path.join(ROOT_DIR, 'data/processed')
VOCAB_DIR = os.path.join(ROOT_DIR, 'vocab')

# navigate to the root directory and install the required dependencies listed in requirements.txt
os.chdir(ROOT_DIR)
!pip install -r requirements.txt

> ## **Logging**
>
> This cell initializes a basic logging configuration to monitor the activities.

In [None]:
## RUN THIS CELL
from src.utils import setup_logging

setup_logging()

---
# **DATA EXTRACTION**



> ## **Data Extractor Class**
>
> The DataExtractor class is designed to extract and load the data. Here's a summary of what the class does:
>
> - **Class Initialization:**
>   - The constructor initializes the object with the directory path where the data is stored and the number of samples per file to extract (if the directory contains multiple files).
>   - The inputs are validated and the data is extracted.
>
> - **Data Extraction:**
>   - The *_extract_data* private method extracts the data from all the files in the data directory and loads everything into a single pandas DataFrame. The Pollution Dataset used for this project is a public UCI dataset for predicting both PM2.5 and PM10 air concentrations at 12 monitoring sites (12 files), each containing around 35k entries (rows). [Padhi et al.,  2021]
>
>   - The *_load_from_csv* private method loads the CSV file at the specified location into a pandas DataFrame. You can either specify the number of samples to extract per file or set it to None to extract all the samples.
>
>   - The *_merge_time_col* private method merges the various time-related columns into a single 'TIMESTAMP' column. This ensures a unified and consistent time representation, which is crucial for time-series analysis.
>
>   - The *_split_data* private method splits the data based on the Station and TIMESTAMP columns. The data is split into training, validation, and test sets to avoid data leakage between the sets during preprocessing.
>
> **_Important_:**
>
> **The next section loads the preprocessed data directly, so there is no need to run the code cell below.**
>
> **Feel free to review the code to understand how to use the DataExtractor class, but keep in mind that running it is not required and can lead to memory issues.**


[Padhi et al.,  2021]: https://arxiv.org/abs/2011.01843 "Padhi, I., Schiff, Y., Melnyk, I., et al. (2021). Tabular Transformers for Modeling Multivariate Time Series. arXiv:2011.01843 [cs.LG]"
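
The time merge and the per-station chronological split described above can be sketched with pandas as follows. This is a minimal illustration, not the project's `DataExtractor` code: the raw column names (`year`, `month`, `day`, `hour`, `station`), the toy values, and the helper names `merge_time_cols` and `chronological_split` are assumptions made for the example.

```python
import pandas as pd

# Toy frame mimicking the raw layout; column names and values are assumptions.
raw = pd.DataFrame({
    "year": [2013] * 20,
    "month": [3] * 20,
    "day": [1] * 20,
    "hour": list(range(10)) * 2,
    "station": ["A"] * 10 + ["B"] * 10,
    "PM2.5": [float(i) for i in range(20)],
})


def merge_time_cols(df):
    """Collapse year/month/day/hour into a single TIMESTAMP column."""
    out = df.copy()
    out["TIMESTAMP"] = pd.to_datetime(out[["year", "month", "day", "hour"]])
    return out.drop(columns=["year", "month", "day", "hour"])


def chronological_split(df, train_size=0.8, val_size=0.1):
    """Split each station's rows in time order into train/val/test parts."""
    parts = {"train": [], "val": [], "test": []}
    for _, group in df.groupby("station"):
        group = group.sort_values("TIMESTAMP")
        n_train = int(len(group) * train_size)
        n_val = int(len(group) * val_size)
        parts["train"].append(group.iloc[:n_train])
        parts["val"].append(group.iloc[n_train:n_train + n_val])
        parts["test"].append(group.iloc[n_train + n_val:])
    return {name: pd.concat(chunks, ignore_index=True)
            for name, chunks in parts.items()}


splits = chronological_split(merge_time_cols(raw))
print({name: len(part) for name, part in splits.items()})
```

Splitting before any statistics are computed keeps later rows of a station out of the training set. Under this rule, 12 files of 35,064 rows each would yield 336,612 / 42,072 / 42,084 rows, which matches the counts in the log below, although the actual implementation may differ in its details.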

In [None]:
## NO NEED TO RUN THIS CELL
from src.data.extraction import DataExtractor

# if None, extract all the samples
samples_per_file = None

extractor = DataExtractor(data_root_dir=RAW_DATA_DIR,
                          samples_per_file=samples_per_file,
                          train_size=0.8,
                          val_size=0.1)
train_data = extractor.train_data
val_data = extractor.val_data
test_data = extractor.test_data

2024-01-14 11:38:10,298 - INFO - numexpr.utils - NumExpr defaulting to 2 threads.
2024-01-14 11:38:11,565 - INFO - src.data.extraction - Initializing the DataExtractor...
Data Extraction:: 100%|██████████| 12/12 [00:15<00:00,  1.29s/it]
2024-01-14 11:38:27,555 - INFO - src.data.extraction - Successfully extracted 12 DataFrame. Train DataFrame has 336612 rows. Validation DataFrame has 42072 rows. Test DataFrame has 42084 rows.

2024-01-14 11:38:27,559 - INFO - src.data.extraction - DataExtractor successfully initialized.



---
# **THE DATASET**

> ## **PRSADataset Class**
>
> The PRSADataset class is designed to take raw data and prepare it for train and test phases.
>
> - **Class Initialization**:
    - Initializes with a DataFrame and the dataset mode (train, val, test).
    When the mode is 'train', the vocabulary is saved in the specified directory and transformations are applied based on statistical information from the training data.
    When the mode is 'val' or 'test', the vocabulary is loaded from the specified directory and transformations are applied based on the statistical information from the training data.
    - Other arguments include: directory paths where to save the vocabulary and the class instance, the columns to discretize, the columns to drop, the target columns, the sequence length and the stride.
    - The attributes are validated to ensure that the class is initialized correctly.
    - The data is preprocessed based on the mode.
    - The data is tokenized using the vocabulary.
    - The samples and targets are prepared for time-series analysis based on the sequence length and the stride.
>
> - **Data Preprocessing**:
    - Processes the input data based on the specified mode (train, val, test).
    - In training mode, the data is discretized and a vocabulary is created and saved.
    - In validation and test modes the data is discretized based on the bin edges found with the training data, then the vocabulary created with the training data is loaded and applied to the val/test data.
>
> - **Discretization**:
    - Computes the number of bins used to discretize the data from its interquartile range (IQR).
    - The method uses the Freedman-Diaconis rule, which is robust to outliers, to compute the width of each bin: $$ \text{Bin width} = \frac{2 \times \text{IQR}}{\sqrt[3]{\text{num observations}}} $$
    - The number of bins is calculated as $$ \text{Number of bins} = \frac{\text{max value} - \text{min value}}{\text{Bin width}} $$
    - The bin labels are computed from the number of bins.
    - The data is discretized by assigning each value to the closest bin label.
    - The bin labels are saved in a JSON file and loaded when the mode is 'val' or 'test'.
>
> - **Vocabulary**:
    - Uses the *Vocab* class to manage the vocabulary of the dataset.
    - The vocabulary is created based on the columns specified in *cols_for_vocab*.
    - Each column has its own vocabulary, with the following structure: {token: [global_index, local_index]}
    - The vocabulary is created and saved when the mode is 'train', otherwise it is loaded from the specified directory.
>
> - **Tokenization**:
>   - Maps the tokens in the columns specified in *cols_for_vocab* to the corresponding global indices.
> - **Sample Preparation**:
    - Structures the data into samples and targets with a format suitable for time-series analysis.
    - A single sample contains seq_len*(ncols+1) token ids. The shape of the sample is (seq_len, ncols+1).
    - The number of samples obtained in the end depends on the stride and on the number of subsequent rows considered for each sample (sequence length).
>
> - **Saving and Loading the Dataset**:
    - The entire class instance can be saved with pickle for efficient storage and retrieval.
    - A static method is provided to load a saved class instance.
>
> **_Important_:**
>
> **Run the following cell with the 'load' variable set to 'True' to load the preprocessed data we used for our experiments.
> If you want to preprocess the data again, set 'load' to 'False'. Keep in mind that this takes more time and requires the train_data, val_data and test_data variables created by the DataExtractor cell above.**
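
The Freedman-Diaconis computation described in the Discretization step can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the `PRSADataset` code: rounding the number of bins up, spacing the bin labels evenly between the minimum and maximum, and the helper names `freedman_diaconis_bins` and `discretize` are all choices made for this example.

```python
import numpy as np


def freedman_diaconis_bins(values):
    """Return (n_bins, bin_width) following the Freedman-Diaconis rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the rule is undefined for this column")
    bin_width = 2.0 * iqr / np.cbrt(values.size)
    # Assumption: round up so the bins cover the full range.
    n_bins = int(np.ceil((values.max() - values.min()) / bin_width))
    return n_bins, bin_width


def discretize(values, labels):
    """Replace each value with the closest bin label."""
    values = np.asarray(values, dtype=float)
    nearest = np.abs(values[:, None] - labels[None, :]).argmin(axis=1)
    return labels[nearest]


# Labels are fitted on the training column only ...
train_col = np.arange(0, 101)
n_bins, bin_width = freedman_diaconis_bins(train_col)
labels = np.linspace(train_col.min(), train_col.max(), n_bins)
print(n_bins, labels)  # 5 [  0.  25.  50.  75. 100.]

# ... and reused unchanged for validation/test values.
print(discretize([3, 40, 90, 130], labels))  # [  0.  50. 100. 100.]
```

Reusing the training labels for the 'val' and 'test' modes means out-of-range values (130 above) fall into the nearest existing label instead of creating tokens the vocabulary has never seen.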


In [None]:
## RUN THIS CELL - NOTHING TO CHANGE
from src.data.dataset import PRSADataset

load = True
if not load:
    train_dataset = PRSADataset(data=train_data,
                                mode='train',
                                vocab_dir=VOCAB_DIR,
                                save_dir=PROCESSED_DATA_DIR)
    val_dataset = PRSADataset(data=val_data,
                              mode='val',
                              vocab_dir=VOCAB_DIR,
                              save_dir=PROCESSED_DATA_DIR)
    test_dataset = PRSADataset(data=test_data,
                               mode='test',
                               vocab_dir=VOCAB_DIR,
                               save_dir=PROCESSED_DATA_DIR)
else:
    train_dataset = PRSADataset.load(data_dir=PROCESSED_DATA_DIR,
                                     mode='train')
    val_dataset = PRSADataset.load(data_dir=PROCESSED_DATA_DIR,
                                   mode='val')
    test_dataset = PRSADataset.load(data_dir=PROCESSED_DATA_DIR,
                                    mode='test')

2024-01-14 11:38:29,298 - INFO - src.data.dataset - Loading PRSA Dataset...
2024-01-14 11:38:37,026 - INFO - src.data.dataset - Class instance successfully loaded.

2024-01-14 11:38:37,035 - INFO - src.data.dataset - Loading PRSA Dataset...
2024-01-14 11:38:39,433 - INFO - src.data.dataset - Class instance successfully loaded.

2024-01-14 11:38:39,436 - INFO - src.data.dataset - Loading PRSA Dataset...
2024-01-14 11:38:41,784 - INFO - src.data.dataset - Class instance successfully loaded.



In [None]:
## RUN THIS CELL - NOTHING TO CHANGE
# -----------------------------------------------------------------------------------------------------------
# NOTE: This cell shows the structure of a single sample from the PRSADataset. Each sample comprises multiple rows,
# each with 12 columns. The data in these columns has been discretized, mapped to the nearest bin edge, and
# then converted to indices. Observing this sample helps in understanding the preprocessing steps applied to the
# dataset, such as discretization and tokenization, and how the data is presented to the data collator.
# -----------------------------------------------------------------------------------------------------------
first_sample = train_dataset[0][0]
first_sample_labels = train_dataset[0][1]
print(f"First sample:\n {first_sample}")
print(f"Shape [seq_len, num_cols]: {first_sample.shape}\n")
print(f"Targets associated to the first sample:\n {first_sample_labels}\n")
print(f"Shape:\n {first_sample_labels.shape}")

First sample:
 tensor([[  7,  86, 144, 183, 304, 318, 334, 348, 373, 381, 412,   1],
        [  7,  86, 144, 183, 304, 318, 334, 348, 373, 381, 413,   1],
        [  8,  87, 144, 184, 304, 318, 334, 348, 373, 381, 412,   1],
        [  9,  88, 144, 185, 304, 318, 334, 349, 373, 381, 414,   1],
        [ 10,  88, 144, 185, 305, 319, 334, 350, 373, 381, 413,   1],
        [ 11,  89, 145, 186, 305, 319, 334, 351, 373, 381, 413,   1],
        [ 11,  90, 146, 187, 305, 319, 334, 352, 373, 381, 415,   1],
        [ 12,  91, 146, 188, 305, 319, 334, 351, 373, 381, 412,   1],
        [ 13,  92, 146, 189, 304, 319, 334, 351, 373, 381, 412,   1],
        [ 10,  93, 145, 190, 304, 319, 334, 353, 373, 381, 413,   1]])
Shape [seq_len, num_cols]: torch.Size([10, 12])

Targets associated to the first sample:
 tensor([[4., 4.],
        [8., 8.],
        [7., 7.],
        [6., 6.],
        [3., 3.],
        [5., 5.],
        [3., 3.],
        [3., 6.],
        [3., 6.],
        [3., 8.]])

Shape:
 torc
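
The way the sequence length and the stride determine the samples can be sketched with a sliding window over the tokenized rows. This is a simplified illustration: the helper name `make_windows` is hypothetical, and the real dataset may handle station boundaries or leftover rows differently.

```python
import numpy as np


def make_windows(token_ids, seq_len, stride):
    """Cut a (n_rows, n_cols) array into windows of seq_len consecutive rows."""
    n_rows = token_ids.shape[0]
    if n_rows < seq_len:
        raise ValueError("not enough rows for a single window")
    starts = range(0, n_rows - seq_len + 1, stride)
    return np.stack([token_ids[s:s + seq_len] for s in starts])


rows = np.arange(25 * 12).reshape(25, 12)  # 25 tokenized rows, 12 columns
windows = make_windows(rows, seq_len=10, stride=5)
print(windows.shape)  # (4, 10, 12)
```

In this sketch the number of samples is `(n_rows - seq_len) // stride + 1`, so a smaller stride produces more, heavily overlapping samples.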

---
# **VOCABULARY**

> ## **Vocabulary Class**
>
> The *Vocab* class is designed to manage, create, save and load vocabularies.
>
> - **Class Initialization**:
    - The class starts by defining a set of custom special tokens as class-level constants.
    - The vocab object accepts columns from the DataFrame to create the vocabulary, the actual data, a directory to save the vocabulary, and target columns.
    - Various attributes are first initialized and validated to ensure they are correctly provided.
    - Special tokens are initialized and added to the vocabulary.
    - Vocabularies are created based on the provided data.
>
> - **Vocabulary Creation**:
    - The vocabulary is constructed using the unique values from the provided columns in the data. Each unique value is added to the vocabulary with a corresponding tag (the field name), global id (the index considering all the tokens of the vocabulary) and local id (the index considering only the tokens in the current field).
    - Vocabulary structure: {column tag: {token: [global_id, local_id]}}
    - id2token structure: {global_id: [token, tag, local_id]}
>    
> - **Utility Methods**:
    - The class also provides a method to retrieve the global id of a token, a method to retrieve the token corresponding to a global id, a method to map between global and local ids using a lookup tensor and two methods to save/load the vocabulary.
>   
> - **Vocabulary Summary Display**:
    - The Vocab class includes the print_vocab_summary method. This method allows for a detailed display of various statistics and characteristics of the vocabulary. It's possible to print special tokens, sample tokens from each column, the size of the vocabulary for each column, the data types of each column, and the total length of the vocabulary. You can specify the sample size and token limit per column.
>
> **_Important_**:
>
> **Run the following cell to extract the vocabulary from the training set.**
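
The two structures described above can be illustrated with a small pure-Python sketch. It is not the `Vocab` class itself: the function name `build_vocab`, the tag `SPECIAL`, and the choice to store tokens as strings are assumptions. Special tokens are registered first, which is consistent with the summary printed below, where the first `SO2` token has global index 7.

```python
SPECIAL_TOKENS = ["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]", "[START]", "[END]"]
SPECIAL_TAG = "SPECIAL"  # hypothetical tag name for the special-token field


def build_vocab(columns):
    """Return (vocab, id2token) for a mapping {column tag: list of values}."""
    vocab = {}
    id2token = {}
    fields = {SPECIAL_TAG: SPECIAL_TOKENS, **columns}
    for tag, values in fields.items():
        vocab[tag] = {}
        # dict.fromkeys keeps the first occurrence of each value, in order
        for local_id, token in enumerate(dict.fromkeys(str(v) for v in values)):
            global_id = len(id2token)
            vocab[tag][token] = [global_id, local_id]
            id2token[global_id] = [token, tag, local_id]
    return vocab, id2token


toy_vocab, toy_id2token = build_vocab({
    "SO2": [4.0, 5.0, 4.0, 11.0],
    "wd": ["NNW", "N", "NNW"],
})
print(toy_vocab["SO2"])   # {'4.0': [7, 0], '5.0': [8, 1], '11.0': [9, 2]}
print(toy_id2token[10])   # ['NNW', 'wd', 0]
```

The global id indexes a single embedding table shared by all columns, while the local id restarts at 0 within each column.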


In [None]:
## RUN THIS CELL
vocab = train_dataset.vocab
## ADJUST THE PARAMETERS AS YOU WANT (keep as they are to print everything)
vocab.print_vocab_summary(print_special_tokens=True,
                          print_sample_tokens=True,
                          sample_size=5,
                          token_limit_per_column=5,
                          print_vocab_size_per_column=True,
                          print_column_data_types=True,
                          print_vocab_length=True)

Special tokens: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]', '[START]', '[END]']

Sampling from the Vocabulary:

COLUMN_TAG: SO2
TOKEN: [GLOBAL IDX, LOCAL IDX]
4.0: [7, 0]
5.0: [8, 1]
11.0: [9, 2]
12.0: [10, 3]
18.0: [11, 4]


COLUMN_TAG: wd
TOKEN: [GLOBAL IDX, LOCAL IDX]
NNW: [412, 0]
N: [413, 1]
NW: [414, 2]
NNE: [415, 3]
ENE: [416, 4]


COLUMN_TAG: PRES
TOKEN: [GLOBAL IDX, LOCAL IDX]
1023.1: [318, 0]
1026.4: [319, 1]
1020.5: [320, 2]
1018.0: [321, 3]
1015.6: [322, 4]


COLUMN_TAG: TIMESTAMP
TOKEN: [GLOBAL IDX, LOCAL IDX]
0.0: [381, 0]
0.026666286398768335: [382, 1]
0.053332572797534894: [383, 2]
0.07999885919630323: [384, 3]
0.10666514559507156: [385, 4]


COLUMN_TAG: DEWP
TOKEN: [GLOBAL IDX, LOCAL IDX]
-18.2: [334, 0]
-14.1: [335, 1]
-10.2: [336, 2]
-6.5: [337, 3]
-3.2: [338, 4]


Number of tokens in column 'SO2': 79
Number of tokens in column 'NO2': 58
Number of tokens in column 'CO': 39
Number of tokens in column 'O3': 121
Number of tokens in column 'TEMP': 14
Number of tokens 

---
# **TOKENIZER AND DATA COLLATOR**

> ## **Tokenizer**
>
> This cell initializes the BERT tokenizer from the vocabulary file exposed by the vocab object.
> The tokenizer is used in the CustomDataCollator class to pad the input ids and to mask tokens for MLM tasks.
>
> **_Important_**:
>
> **Run the following cell to initialize the tokenizer.**

In [None]:
## RUN THIS CELL - NO CHANGES NEEDED
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast(vocab_file=vocab.vocab_file_for_bert,
                              do_lower_case=False,
                              **vocab.get_special_tokens())

> ## **Custom Data Collator**
>
> The *CustomDataCollator* class is designed to handle tabular and time series data, preparing the samples, the targets and the masked language model labels.
>
> - **Hugging Face Compatibility**:
    - It inherits from *DataCollatorForLanguageModeling*, ensuring compatibility with Hugging Face.
    - The class requires a Bert tokenizer for padding and masking the input ids.
>    
> - ***`__call__`* Method**:
    - Each row represents a collection of features (already converted to indices). Remember that the sequence length parameter of the dataset defines the number of consecutive rows that constitute a single sample.
    - The collator groups multiple samples into a batch. The batch received by the model has shape [batch, seq_len, ncols+1].
    - The method can handle both MLM and regression tasks.
    - In MLM mode, it masks certain tokens in the input ids based on mlm_probability and returns the labels.
    - In regression mode, it returns the targets.
> - For the source code and usage of the parent class, refer to the [Hugging Face implementation](https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/data/data_collator.py#L607).
>
> **_Important_:**
>
> **In the following cell, we demonstrate how to use the CustomDataCollator class for MLM and regression tasks.
> Note that there's no need to create the collator object manually as the CustomDataCollator object will be created in the training manager.**
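
The MLM branch of the collator can be pictured with the NumPy sketch below. It is a simplification, not the `CustomDataCollator` code: the function name `mask_tokens`, the mask token id 4, and the ids 0 to 6 for special tokens are assumptions. It always substitutes the mask token, whereas the Hugging Face parent class by default replaces only 80% of the selected tokens with the mask token, swaps 10% for random tokens, and leaves 10% unchanged.

```python
import numpy as np

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_tokens(input_ids, mask_token_id, special_ids, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one batch of token ids."""
    rng = np.random.default_rng(seed)
    input_ids = np.asarray(input_ids)
    # Special tokens (padding, separators, ...) are never selected.
    maskable = ~np.isin(input_ids, list(special_ids))
    selected = (rng.random(input_ids.shape) < mlm_probability) & maskable
    # Labels keep the original id at selected positions and are ignored elsewhere.
    labels = np.where(selected, input_ids, IGNORE_INDEX)
    masked_inputs = np.where(selected, mask_token_id, input_ids)
    return masked_inputs, labels


batch = np.array([[7, 86, 144, 183, 304, 318, 334, 348, 373, 381, 412, 1],
                  [7, 86, 144, 183, 304, 318, 334, 348, 373, 381, 413, 1]])
masked, labels = mask_tokens(batch, mask_token_id=4, special_ids=range(7))
print(masked)
print(labels)
```

Only the selected positions contribute to the loss, so the model is trained to recover the hidden field values from the surrounding columns and rows.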


In [None]:
## NO NEED TO RUN THIS CELL
from src.data.collator import CustomDataCollator

# data collator for Masked Language Model
data_collator_for_mlm = CustomDataCollator(tokenizer=tokenizer,
                                           mlm=True,
                                           mlm_probability=0.15)
# data collator for Regression task
data_collator_for_regression = CustomDataCollator(tokenizer=tokenizer,
                                                  mlm=False)

---
# **BERT PARAMETERS**

> ## **Bert Custom Config**
>
> The *CustomBertConfig* class is an extension of the BertConfig class, specifically designed for handling tabular and time series data.
>
> 1. **Number of Columns (ncols)**: Specifies the number of columns in the tabular data, aligning with the number of input indices in one row.
>
> 2. **Vocabulary Size (vocab_size)**: Number of unique tokens in the data.
>
> 3. **Field Hidden Size (field_hidden_size)**: Sets the hidden size for field embeddings.
>
> 4. **Hidden Size (hidden_size)**: Determines the dimensionality of the encoder output.
>
> 5. **Number of Hidden Layers (num_hidden_layers)**: Defines the depth of the Transformer encoder.
>
> 6. **Number of Attention Heads (num_attention_heads)**: The number of attention mechanisms in each encoder layer.
>
> 7. **Pad Token ID (pad_token_id)**:  Represents the index used for padding.
>
> 8. **Masked Language Model Probability (mlm_probability)**: Ratio of tokens to mask for masked language modeling.
>
> **_Important_**:
>
> **Run the cell below to create a CustomBertConfig object. This is for demonstration purposes only as the config object will be created in the training manager given the dictionary of parameters.**
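
The cell below sets `hidden_size` to 64 times the number of columns. The short sketch here works through that arithmetic for 12 columns and checks that the result is a multiple of the number of attention heads, which Hugging Face's BERT attention layers require. The helper name `derive_sizes` is hypothetical.

```python
def derive_sizes(ncols, field_hidden_size, num_attention_heads):
    """Compute the row-level hidden size and the per-head dimension."""
    hidden_size = ncols * field_hidden_size
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not a multiple of "
            f"num_attention_heads={num_attention_heads}"
        )
    return {"hidden_size": hidden_size,
            "head_dim": hidden_size // num_attention_heads}


sizes = derive_sizes(ncols=12, field_hidden_size=64, num_attention_heads=8)
print(sizes)  # {'hidden_size': 768, 'head_dim': 96}
```

With 12 columns and 64-dimensional field embeddings, one row concatenates to 768 values, the same width as the standard BERT base encoder.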


In [None]:
## RUN THIS CELL - NOTHING TO CHANGE
from src.models.config import CustomBertConfig

model_config_values = {"ncols": train_dataset.get_ncols(),
                       "vocab_size": len(vocab),
                       "field_hidden_size": 64,
                       "hidden_size": 64*train_dataset.get_ncols(),
                       "num_hidden_layers": 6,
                       "num_attention_heads": 8,
                       "pad_token_id": vocab.get_id(vocab.pad_token, vocab.special_tag),
                       "mlm_probability": 0.15}
config = CustomBertConfig(**model_config_values)
config

CustomBertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "field_hidden_size": 64,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "mlm_probability": 0.15,
  "model_type": "bert",
  "ncols": 12,
  "num_attention_heads": 8,
  "num_hidden_layers": 6,
  "pad_token_id": 2,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 429
}

---
# **THE MODEL**


In [None]:
## RUN THIS CELL - NO CHANGES NEEDED
from src.utils import set_seed

# setting the seed for reproducibility
set_seed(2024)


> ## **Hierarchical Bert Language Model**
>
> This module contains the implementation of the Hierarchical Bert Language Model.
> It can be used for Masked Language Modeling and regression tasks. It represents the core component of our project.
>
> The model is composed of three components:
> - **TabRowEmbeddings**:
    - An embedding layer for tabular data.
    - It is designed to handle tabular data where each row consists of multiple columns.
    - Each individual token is mapped to an embedding. The sequence is then passed to a transformer encoder to capture relationships between columns.
    - A final linear layer transforms the embeddings to the desired hidden size.
    - Thus, each row in a sample is represented by an embedding of dimension 'hidden_size'.
>
> - **BertModel**:
    - A BertModel from the HuggingFace library.
    - It is used to capture relationships between rows.
>
> - **MLM-specific layers or Regression-specific layers**:
    - The MLM layers are used for Masked Language Modeling. They are used when pretraining the model to obtain a representation of the field tokens.
    - The regression layers are used for the regression task. They are used to fine-tune the model after pretraining.
>
> The forward step of the model is different for MLM and regression tasks:
> - **Masked Language Modeling**:
    - The input ids are passed through the TabRowEmbeddings layer to obtain the embeddings of the tabular data.
    - The 'sequence_length' embeddings are then passed to the BertModel.
    - The outputs of the BertModel are passed through the MLM layers to obtain the predictions at field level.
    - The predictions are compared to the masked LM labels at field level and the cross entropy loss is computed.
>
> - **Regression**:  
    - The input ids are passed through the TabRowEmbeddings layer to obtain the embeddings of the tabular data.
    - The embeddings are then passed to the BertModel.
    - The outputs of the BertModel are passed through the regression layers to obtain the predictions.
    - The predictions are compared to the regression targets (two targets) and the mean squared error loss is computed.
>
> **_Important_**:
>
> **The following code is not meant to be executed. It is only used to show how to instantiate the model.
> The models will be instantiated in the training manager based on the mode (mlm or regression).**  
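
To make the row-embedding stage more concrete, here is a minimal PyTorch sketch of the idea described for TabRowEmbeddings: embed each field token, let a small transformer encoder mix the columns of one row, then flatten and project the row to `hidden_size`. It is not the project's implementation; the class name `RowEmbeddingSketch`, the single encoder layer, and the feed-forward width are assumptions.

```python
import torch
from torch import nn


class RowEmbeddingSketch(nn.Module):
    """Turn [batch, seq_len, ncols] token ids into [batch, seq_len, hidden_size]."""

    def __init__(self, vocab_size, ncols, field_hidden_size, hidden_size, nhead=8):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, field_hidden_size)
        layer = nn.TransformerEncoderLayer(d_model=field_hidden_size,
                                           nhead=nhead,
                                           dim_feedforward=field_hidden_size,
                                           batch_first=True)
        self.field_encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.project = nn.Linear(ncols * field_hidden_size, hidden_size)

    def forward(self, input_ids):
        batch, seq_len, ncols = input_ids.shape
        x = self.token_embed(input_ids)               # [batch, seq_len, ncols, field]
        x = x.reshape(batch * seq_len, ncols, -1)     # every row is its own sequence
        x = self.field_encoder(x)                     # attention across columns
        x = x.reshape(batch, seq_len, -1)             # concatenate the column embeddings
        return self.project(x)                        # [batch, seq_len, hidden_size]


model = RowEmbeddingSketch(vocab_size=429, ncols=12,
                           field_hidden_size=64, hidden_size=768)
input_ids = torch.randint(0, 429, (2, 10, 12))
row_embeddings = model(input_ids)
print(row_embeddings.shape)  # torch.Size([2, 10, 768])
```

In the full model, a tensor of this shape would play the role of token embeddings for the row-level BertModel (Hugging Face's `BertModel` accepts precomputed embeddings through its `inputs_embeds` argument). Whether the project wires the two stages exactly this way is an assumption.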
    


In [None]:
## NO NEED TO RUN THIS CELL
from src.models.hierarchical import HierarchicalBertLM

model_mlm = HierarchicalBertLM(config=config,
                               vocab=vocab,
                               mode='mlm')

model_regression = HierarchicalBertLM(config=config,
                                      vocab=vocab,
                                      mode='regression')

---
# **TRAINING AND EVALUATION**

> ## **Weights & Biases**
>
> In order to log the training process and the metrics, we will use [Weights & Biases](https://wandb.ai/site).
>
> **_Important_:**
>
> **You can create a free account and log in from the notebook by running the following cell.**
>
> **While running the cell, you will be prompted to enter your API key. You can find your API key [here](https://wandb.ai/authorize).**



In [None]:
## RUN THIS CELL - NO CHANGES NEEDED
import wandb

wandb.login()

> ## **MLM Training Configuration**
>
> In this cell we define a dictionary containing the training parameters to train the Masked Language Model.
>
> The parameters are passed to the TrainingArguments class from the transformers library, which is used to instantiate the Trainer class, so make sure they are valid TrainingArguments parameters. The TrainingManager class also performs a check, but it is better to verify them beforehand.
>
> **_Important_:**
>
> **You can skip this cell only if you also skip the MLM training manager cell below, which reads `training_config_dict`. Pretrained models will be loaded in the next sections.**


In [None]:
## RUN THIS CELL IF YOU PLAN TO SET UP THE MLM TRAINING MANAGER (FOR TRAINING OR EVALUATION)
training_config_dict = {
    'per_device_train_batch_size': 256,
    'per_device_eval_batch_size': 256,
    'num_train_epochs': 50,
    'logging_strategy': 'steps',
    'logging_first_step': True,
    'logging_steps': 1,
    'save_strategy': 'steps',
    'save_steps': 150,
    'evaluation_strategy': 'steps',
    'eval_steps': 150,
    'load_best_model_at_end': True,
    'disable_tqdm': False,
    'seed': 2024,
    'learning_rate': 1e-4,
    'lr_scheduler_type': 'constant',
    'report_to':'wandb'}

> ## **Training Manager Class**
>
> The TrainingManager class is responsible for setting up the model, the data collator, the training arguments, and the HuggingFace Trainer.
>
> - **Class Initialization:**
>     - To initialize the class, we need to provide the model configuration dictionary, the training configuration dictionary, the training, validation, and test sets, the root directory, the project name, the model name, the mode (either 'mlm' or 'regression'), and the path to the pretrained model checkpoint (only required for 'regression' mode).
>     - The model configuration dictionary contains the model parameters to be logged.
>     - The training configuration dictionary contains the training parameters for the TrainingArguments class from HuggingFace. Be sure to check the [documentation](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments) for the list of available parameters. A check is performed to ensure that only valid parameters are provided.
>     - The training, validation, and test sets are instances of the PRSADataset class.
>     - The root directory is the directory where the output and logs directories will be created.
>     - The project name is the name of the project for logging on wandb.
>     - The model name is the name of the model for logging on wandb.
>     - The mode is either 'mlm' or 'regression'.
>     - The pretrained model path is the path to the pretrained model checkpoint.
>
> - **Directories and Logging:**
>     - The setup_directories() method sets up the checkpoints and logs directories.
>     - The setup_wandb() method sets up wandb for logging.
>
> - **Tokenizer and Collator:**
>     - The setup_tokenizer() method sets up the tokenizer needed in the data collator.
>     - The setup_collator() method sets up the data collator for training. If the mode is 'mlm', the data collator returns the labels for masked language modeling. If the mode is 'regression', the data collator returns the targets for regression.
>
> - **Model:**
>     - The setup_model() method sets up the model for training. If a pretrained model path is provided, the model is initialized from the pretrained checkpoint. If the mode is 'mlm', the model is trained with masked language modeling. If the mode is 'regression', the model is initialized from the pretrained checkpoint (after MLM training) and trained for regression. The model is frozen except for the last two layers.
>
> - **Training:**
>     - The setup_training() method sets up the training arguments and trainer.
>     - The train() method must be called to train the model.
>     - The evaluate() method can be used to evaluate the model on the validation or test set.
>
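
The parameter check mentioned above can be approximated by comparing the dictionary keys against the fields of the arguments dataclass. The sketch below uses a small stand-in dataclass so that it runs without the transformers library. The names `ArgsStandIn` and `invalid_keys` are hypothetical, and the TrainingManager may implement its check differently.

```python
from dataclasses import dataclass, fields


@dataclass
class ArgsStandIn:
    """Hypothetical stand-in for transformers.TrainingArguments (also a dataclass)."""
    learning_rate: float = 5e-5
    num_train_epochs: float = 3.0
    seed: int = 42


def invalid_keys(config, args_cls):
    """Return the keys of config that are not fields of args_cls."""
    valid = {f.name for f in fields(args_cls)}
    return sorted(set(config) - valid)


candidate = {"learning_rate": 1e-4, "num_train_epochs": 50, "lr": 1e-4}
print(invalid_keys(candidate, ArgsStandIn))  # ['lr']
```

Passing `TrainingArguments` instead of the stand-in should flag misspelled keys before the Trainer is built.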


> ## **MLM Training**
>
> **_Important_:**
>
> **Run this cell if you want to initialize the TrainingManager class for MLM training and run the train() method.**
>
> **A pretrained model checkpoint is provided to evaluate the model on the val/test set. If you prefer to train the model from scratch, set the pretrained_model_path to None.**


In [None]:
## RUN THIS CELL TO SET UP THE TRAINING MANAGER FOR MLM - NO CHANGES NEEDED
from src.train.manager import TrainingManager

## SET TO NONE TO TRAIN THE MODEL FROM SCRATCH
mlm_pretrained_model_path = os.path.join(ROOT_DIR, 'output/mlm/checkpoints/prsa-model-1/checkpoint-final')

training_manager_mlm = TrainingManager(model_config_dict=model_config_values,
                                       training_config_dict=training_config_dict,
                                       train_set=train_dataset,
                                       val_set=val_dataset,
                                       test_set=test_dataset,
                                       root_dir=ROOT_DIR,
                                       project_name='PRSATabBert',
                                       model_name='prsa-model-professor-test',
                                       mode='mlm',
                                       pretrained_model_path=mlm_pretrained_model_path)

In [None]:
## NO NEED TO RUN THIS CELL
training_manager_mlm.train()

> ## **MLM Evaluation**
>
> The following cell evaluates the model on the validation set.
> It returns the Cross Entropy Loss for MLM.
>
> **_Important_:**
>
> **If you ran the .train() cell without completing the training process, make sure to re-run the training manager cell with the provided checkpoint.**

In [None]:
## RUN THIS CELL
training_manager_mlm.evaluate()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 12.478158950805664,
 'eval_runtime': 6.4304,
 'eval_samples_per_second': 1306.3,
 'eval_steps_per_second': 5.132}

> ## **Regression Training Configuration**
>
> In this cell we define a dictionary containing the training parameters to train the model for the regression task.
>
> The parameters are passed to the TrainingArguments class from the transformers library, which is used to instantiate the Trainer class, so make sure they are valid TrainingArguments parameters. The TrainingManager class also performs a check, but it is better to verify them beforehand.
>
> **_Important_:**
>
> **Run this cell as it is.**


In [None]:
## RUN THIS CELL - NO CHANGES NEEDED
training_config_dict = {
    'per_device_train_batch_size': 512,
    'per_device_eval_batch_size': 512,
    'num_train_epochs': 100,
    'logging_strategy': 'steps',
    'logging_first_step': True,
    'logging_steps': 1,
    'save_strategy': 'steps',
    'save_steps': 100,
    'evaluation_strategy': 'steps',
    'eval_steps': 100,
    'load_best_model_at_end': True,
    'disable_tqdm': False,
    'seed': 2024,
    'learning_rate': 1e-4,
    'lr_scheduler_type': 'linear',
    'report_to':'wandb'}


> ## **Regression Training**
>
> **_Important_:**
>
> **Run this cell if you want to initialize the TrainingManager class for regression training and run the train() method.**
>
> **A pretrained model checkpoint is provided to evaluate the model on the val/test set. If you prefer to train the model from scratch (starting from pretrained mlm), set the pretrained_model_path to mlm_pretrained_model_path.**


In [None]:
## RUN THIS CELL TO SET UP THE TRAINING MANAGER FOR REGRESSION - NO CHANGES NEEDED
from src.train.manager import TrainingManager

## SET TO 'mlm_pretrained_model_path' TO TRAIN THE MODEL FROM THE PRETRAINED BERT (from mlm)
reg_pretrained_model_path = os.path.join(ROOT_DIR, 'output/regression/checkpoints/prsa-model-1-reg/checkpoint-final')

training_manager_reg = TrainingManager(model_config_dict=model_config_values,
                                       training_config_dict=training_config_dict,
                                       train_set=train_dataset,
                                       val_set=val_dataset,
                                       test_set=test_dataset,
                                       root_dir=ROOT_DIR,
                                       model_name='prsa-reg-professor-test',
                                       mode='regression',
                                       pretrained_model_path=reg_pretrained_model_path)

In [None]:
## NO NEED TO RUN THIS CELL
training_manager_reg.train()

> ## **Regression Evaluation**
>
> The following cell evaluates the model on the validation or test set, depending on the value of the `test` argument.
> The evaluate method returns the metrics, the predictions and the labels.
>
> The following metrics are computed:
> - **RMSE (Root Mean Squared Error)**: The square root of the mean squared difference between the predicted and the actual values.
> - **MAE (Mean Absolute Error)**: The mean of the absolute difference between the predicted and the actual values.
> - **R-Squared**: The proportion of the variance in the actual values that is explained by the predictions, used as a measure of goodness of fit.
>
> We achieve RMSE values on the validation set similar to those reported in the paper, suggesting that our model generalizes well to unseen data. However, a slight increase in RMSE on the test set suggests that the model does not perform as well there, possibly due to differences in data distribution or overfitting during training.
>
> **_Important_:**
>
> **If you ran the .train() cell without completing the training process, make sure to re-run the training manager cell with the provided checkpoint.**
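
The three metrics listed above can be computed from the returned predictions and labels as in the NumPy sketch below. The function name `regression_metrics` is hypothetical, and pooling both targets into one R-Squared value with per-column means is an assumption. The project may aggregate the two targets differently.

```python
import numpy as np


def regression_metrics(predictions, labels):
    """Compute RMSE, MAE and R-squared over all targets jointly."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    errors = predictions - labels
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mae = float(np.mean(np.abs(errors)))
    ss_res = float(np.sum(errors ** 2))
    # Total variation of each target around its own mean.
    ss_tot = float(np.sum((labels - labels.mean(axis=0)) ** 2))
    return {"rmse": rmse, "mae": mae, "r_squared": 1.0 - ss_res / ss_tot}


toy_predictions = [[2.0, 4.0], [4.0, 8.0]]
toy_labels = [[1.0, 4.0], [5.0, 8.0]]
print(regression_metrics(toy_predictions, toy_labels))
```

RMSE penalizes large errors more strongly than MAE, so a gap between the two, as in the output below, points to a subset of samples with large errors.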


In [None]:
## RUN THIS CELL
## SET TEST TO FALSE/TRUE TO EVALUATE THE MODEL ON THE VAL/TEST SET
training_manager_reg.evaluate(test=True)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


({'rmse': 50.781578, 'mae': 30.270535, 'r_squared': 0.7886199766742757},
 array([[120.41879 , 124.4578  ],
        [123.752716, 124.91998 ],
        [123.75899 , 123.92647 ],
        ...,
        [ 20.964413,  30.007421],
        [ 25.65062 ,  37.11924 ],
        [ 28.10255 ,  41.86554 ]], dtype=float32),
 array([[ 88., 122.],
        [ 91., 117.],
        [ 96., 110.],
        ...,
        [ 24.,  53.],
        [ 37.,  68.],
        [ 50.,  63.]], dtype=float32))

---
# **WANDB**

* [Link to wandb project](https://wandb.ai/neural-network-tab-bert/PRSATabBert?workspace=user-ferretti-2039579).

* MLM task reports including training/evaluation plots can be seen at the following [link](https://api.wandb.ai/links/neural-network-tab-bert/g3x9vuqu).

* Regression task reports including training/evaluation plots can be seen at the following [link](https://api.wandb.ai/links/neural-network-tab-bert/g0oldvef).





---
# **REFERENCES**

- Inkit Padhi, Yair Schiff, Igor Melnyk, Mattia Rigotti, Youssef Mroueh, Pierre Dognin, Jerret Ross, Ravi Nair, and Erik Altman. "Tabular Transformers for Modeling Multivariate Time Series". 2021. arXiv:2011.01843 [cs.LG].

