In [None]:
FIRST_NAME = "Jacob"
LAST_NAME = "Friedman"
STUDENT_ID = "801444589"

In [None]:
# For using Google Colab when the repo has been uploaded to your Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/MyDrive/CourseProject_JFriedman

In [None]:
# To get started locally quickly, create a Python 3.11 environment with conda using the following command:
# conda create -n fastai python=3.11
# conda activate fastai

# Set up torch for GPU with CUDA 12.4 and fastai. This also works for local setups as long as the CUDA Toolkit is installed.
# If working locally, this requires CUDA Toolkit 12.4 installed on your computer with a compatible GPU: https://developer.nvidia.com/cuda-gpus
# Ensure CUDA Toolkit 12.4 is downloaded and installed: https://developer.nvidia.com/cuda-12-4-0-download-archive
# The PATH environment variable must be set up properly; refer to the NVIDIA documentation
# At the time of writing, Torch 2.5.1 was the latest version supported by fastai, and CUDA 12.4 the latest version supported by Torch 2.5.1
# Also installs seaborn for data visualization and torchsummary for model summaries
%pip install -Uqq seaborn --upgrade-strategy only-if-needed
%pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124 --upgrade-strategy only-if-needed
%pip install torchsummary --upgrade-strategy only-if-needed
%pip install fastai --upgrade-strategy only-if-needed
%pip install nltk --upgrade-strategy only-if-needed
%pip install scikit-optimize --upgrade-strategy only-if-needed

## Project Goal 
#### The goal of this project is to reimplement the model from the paper **"A Context-Aware Approach for Detecting Worth-Checking Claims in Political Debates"**, replacing the original implementation with an RNN implementation and a Transformer implementation. We use Grid Search to find the best hyperparameters for each model. We follow the same pre-processing steps and use the same data so that accuracy can be compared with the paper's results.

---
#### Some code has been reused or refactored from the original code developed by the authors of the paper. They have requested that the following citation be included:

```bib
@InProceedings{RANLP2017:debates,
  author    = {Pepa Gencheva and Preslav Nakov and Llu\'{i}s M\`{a}rquez and Alberto Barr\'on-Cede\~no and Ivan Koychev},
  title     = {{A Context-Aware Approach for Detecting Worth-Checking Claims in Political Debates}},
  booktitle = {Proceedings of the 2017 International Conference on Recent Advances in Natural Language Processing},
  month     = {September},
  year      = {2017},
  address   = {Varna, Bulgaria},
  series    = {RANLP~'17}
}
```
---

## Author's Original Implementation Architecture
#### The authors implemented a **feed-forward neural network** as a multitask learning model that classifies check-worthy claims in political debates by combining shared and task-specific learning. The architecture begins with a **shared input layer** that processes the input features and learns general representations through a dense layer with 1000 units, LeakyReLU activation, and dropout for regularization. This shared layer captures common patterns across tasks.

The model then branches into two **task-specific layers**:
1. **General Check-Worthiness Branch (`pred_any`)**: This branch predicts whether a claim is check-worthy in a general sense, independent of specific criteria or datasets. It uses a dense layer with 500 units, LeakyReLU activation, dropout, and a final sigmoid output layer for binary classification.
2. **PolitiFact-Specific Branch (`pred_pf`)**: This branch specializes in identifying check-worthy claims based on **PolitiFact's annotations**, which reflect specific fact-checking criteria. It also uses a dense layer with 500 units, LeakyReLU activation, dropout, and a sigmoid output layer.

The model is trained using the **Stochastic Gradient Descent (SGD)** optimizer with **Nesterov momentum** (momentum = 0.9) and a learning rate of 0.006. The loss function for both tasks is **binary cross-entropy**, and the model uses **callbacks** such as **early stopping** (to prevent overfitting) and model checkpointing (to save the best weights based on validation accuracy). This multitask setup enables the model to learn shared representations in the shared layer while optimizing for two related but distinct tasks in the task-specific branches. This approach improves the model's ability to generalize across tasks while maintaining specialization for specific datasets or criteria.
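
The description above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the class name `MultiTaskFFN`, the placeholder `input_size` of 100, the dropout rate of 0.5, and the choice to sum the two task losses are assumptions made for the example. Only the layer widths, activations, optimizer settings, and loss type come from the description.

```python
import torch
from torch import nn

class MultiTaskFFN(nn.Module):
    """Hypothetical sketch of the shared layer plus two task branches."""
    def __init__(self, input_size, dropout=0.5):  # dropout rate is an assumption
        super().__init__()
        # Shared layer: 1000 units, LeakyReLU, dropout
        self.shared = nn.Sequential(
            nn.Linear(input_size, 1000), nn.LeakyReLU(), nn.Dropout(dropout)
        )
        # One branch per task: 500 units, LeakyReLU, dropout, sigmoid output
        self.pred_any = self._branch(dropout)
        self.pred_pf = self._branch(dropout)

    @staticmethod
    def _branch(dropout):
        return nn.Sequential(
            nn.Linear(1000, 500), nn.LeakyReLU(), nn.Dropout(dropout),
            nn.Linear(500, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.shared(x)
        return self.pred_any(h), self.pred_pf(h)

ffn = MultiTaskFFN(input_size=100)  # 100 is a placeholder feature count
ffn_optimizer = torch.optim.SGD(ffn.parameters(), lr=0.006, momentum=0.9, nesterov=True)
bce = nn.BCELoss()

# One illustrative training step on random data
x_demo = torch.randn(8, 100)
y_any_demo = torch.randint(0, 2, (8, 1)).float()
y_pf_demo = torch.randint(0, 2, (8, 1)).float()
out_any, out_pf = ffn(x_demo)
demo_loss = bce(out_any, y_any_demo) + bce(out_pf, y_pf_demo)  # assumption: losses are summed
ffn_optimizer.zero_grad()
demo_loss.backward()
ffn_optimizer.step()
```

Early stopping and checkpointing are left out of the sketch; in a fastai setup they would typically be handled by callbacks.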

---
## Setup Project Path in config.ini
### Before continuing any further you **must** set the **project_path** variable to the directory on your machine where you downloaded this project. The config.ini file is located at the top level of the project directory.
### **Failure to do this will result in the project failing to run**
---
## Package Imports

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import torch

# Set the device to be used for training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

import fastai
print(f"FastAI version: {fastai.__version__}")

# Set for CUDA to ensure that memory is properly allocated
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

## **1. Data Pre-Processing**
### This data pre-processing step has been slightly refactored from the original code developed by the authors. Given the complex nature of their feature set, I decided to use their code to avoid reimplementing features I may not fully understand, and to prevent possible issues when comparing my results to theirs. Some refactoring was done for modernization.
---
### **1.1 Ensure punkt and punkt_tab are Installed**
##### These NLTK resources are required by the authors' pre-processing code.

In [None]:
# Ensure punkt and punkt_tab are downloaded
# They are necessary for author's code to run
import nltk
from src.utils.config import get_config

config = get_config()

nltk.download('punkt', download_dir=config['data'])
nltk.download('punkt_tab', download_dir=config['data'])

# Add the directory where 'punkt' is located to NLTK's search paths
nltk.data.path.append(config['data'])

print(f"NLTK version: {nltk.__version__}")
print("NLTK data paths:", nltk.data.path)

# Verify that the 'punkt' and 'punkt_tab' resources are available
# Each resource is checked separately so the message and the retry match the one that is missing
for resource in ('punkt', 'punkt_tab'):
    try:
        nltk.data.find(f'tokenizers/{resource}')
        print(f"{resource} tokenizer is available.")
    except LookupError:
        print(f"{resource} tokenizer is NOT available. Attempting to download...")
        nltk.download(resource, download_dir=config['data'])

---
### **1.2 Continue With Pre-Processing Data**

#### **How the Authors Did Their Pre-Processing**
#### In summary, the data preprocessing involves loading the debate text, converting it into numerical features, and extracting the corresponding labels. The core of this process is the feature extraction pipeline, which transforms the raw text into a numerical representation that the machine learning model can use.

1.  **Loading the Debate Data:**

    * They use the `read_debates()` function (defined in `src/data/debates.py`) to load the debate data. This function reads the text from different debates and organizes it.
    * The data is divided into two sets: a training set and a validation set. The training set is used to train the model, and the validation set is used to evaluate its performance.
    * Essentially, this step involves gathering the raw text data and splitting it into training and validation sets.

2.  **Feature Extraction:**

    * This is a crucial step where the text data is converted into numerical features, which is necessary before the data can be used for machine learning.
    * They use a method called `get_serialized_pipeline()` (defined in `src/features/feature_sets.py`) to extract these features using a custom pipeline. This pipeline includes several components that extract different types of information from the text:
        * **Contextual features:** These features capture information about the surrounding text of a given claim.
        * **POS Tags:** Part-of-speech tags (e.g., noun, verb, adjective) are extracted to provide syntactic information.
        * **Stylistic features:** These features capture the writing style of the text.
    * The pipeline uses pre-computed features (serialized) for efficiency reasons. This means that the feature extraction process was done beforehand, and the code just loads the extracted features.
    * The `fit_transform()` method is used to extract features from the training data, and the `transform()` method is used to extract features from the validation data.
    * In essence, this step transforms the text data into a numerical matrix representation that we need for our model.

3.  **Label Extraction:**

    * To train the model, the authors need to provide it with the correct answers (labels). In this case, the labels indicate whether a claim is worth fact-checking.
    * The code extracts labels for different evaluation scenarios:
        * `y_train_any`, `y_val_any`: These are binary labels representing a general assessment of whether a claim is check-worthy from any source.
        * `y_train_pf`, `y_val_pf`: These labels are extracted from `s.labels[5]` and correspond to PolitiFact's assessment.
        * `y_train_wp`, `y_val_wp`: These labels are extracted from `s.labels[4]` and correspond to the Washington Post's assessment.
    * This step involves preparing the target variables for training and evaluating the model's performance against different fact-checking sources.
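
To make the three label sets concrete, here is a small stand-alone sketch. `ToySentence` is a hypothetical stand-in for the sentence objects returned by `read_debates()`. The nine annotation slots, the assumption that `label` counts the sources that flagged a sentence, and the annotation values themselves are invented for illustration. Only the indexing (`labels[5]` for PolitiFact, `labels[4]` for the Washington Post) and the `label > 0` rule come from the notebook.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ToySentence:
    labels: List[int]  # one 0/1 annotation per fact-checking source (hypothetical layout)

    @property
    def label(self) -> int:
        # Assumption: the aggregate label is the number of sources that flagged the sentence
        return sum(self.labels)

toy = [
    ToySentence([0, 0, 0, 0, 0, 0, 0, 0, 0]),  # flagged by nobody
    ToySentence([0, 0, 0, 0, 1, 0, 0, 0, 0]),  # flagged at index 4 only
    ToySentence([1, 0, 0, 0, 0, 1, 0, 0, 0]),  # flagged at index 0 and index 5
]

toy_any = [1 if s.label > 0 else 0 for s in toy]  # check-worthy according to any source
toy_pf = [s.labels[5] for s in toy]               # PolitiFact column
toy_wp = [s.labels[4] for s in toy]               # Washington Post column

print(toy_any)  # [0, 1, 1]
print(toy_pf)   # [0, 0, 1]
print(toy_wp)   # [0, 1, 0]
```

The second toy sentence shows why the `any` labels are a superset of each single-source label: it is positive for `any` and for the Washington Post, but negative for PolitiFact.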

In [None]:
from src.data.debates import read_debates, Debate
from src.features.feature_sets import get_cb_pipeline, get_serialized_pipeline
from src.stats.rank_metrics import average_precision, precision_at_n

# Step 1: Prepare Training and Validation Data
# Read debates for training and validation
train_sentences = (
    read_debates(Debate.FIRST) +
    read_debates(Debate.VP) +
    read_debates(Debate.SECOND)
)
val_sentences = read_debates(Debate.THIRD)

# Step 2: Initialize and Fit the Pipeline
# Use the serialized pipeline for feature extraction
pipeline = get_serialized_pipeline(train=train_sentences)

# Transform training and validation data
X_train = pipeline.fit_transform(train_sentences)
X_val = pipeline.transform(val_sentences)

# Step 3: Prepare Data for Various Test Cases
# Extract labels for different test cases: any, PolitiFact, and WP
y_train_any = np.array([1 if s.label > 0 else 0 for s in train_sentences]) # Binary labels
y_train_pf = np.array([s.labels[5] for s in train_sentences]) # PolitiFact labels
y_train_wp = np.array([s.labels[4] for s in train_sentences]) # WP labels

y_val_any = np.array([1 if s.label > 0 else 0 for s in val_sentences]) # Binary labels
y_val_pf = np.array([s.labels[5] for s in val_sentences]) # PolitiFact labels
y_val_wp = np.array([s.labels[4] for s in val_sentences]) # WP labels

# Step 4: Debugging and Output
# Print shapes and sample data for debugging
print(f"Training feature matrix shape: {X_train.shape}")
print(f"Sample labels for train_sentences[103]: {train_sentences[103].labels}")
print(f"Binary label for train_sentences[103]: {train_sentences[103].label}")
print(f"Shape of first training feature vector: {X_train[0].shape}")
print(f"Shape of y_train_any: {y_train_any.shape}")

---

## **2. Model Implementation**

---

### **2.1.1 RNN Model Definition**

#### Let's start by defining an RNN that follows the same multitask structure as the authors' original implementation, as described above in the section "**Author's Original Implementation Architecture**", utilizing scikit-learn, fastai, and torch.

#### **MultiTaskRNN Model**

This `MultiTaskRNN` model is a multi-task Recurrent Neural Network (RNN) designed for sequence processing tasks. It employs a shared LSTM layer to capture sequential dependencies in the input data, followed by task-specific output branches.

**Architecture:**

* **Shared LSTM Layer:**
    * The model uses a shared Long Short-Term Memory (LSTM) layer.
    * This layer processes the input sequence (`input_size` features per step) and generates hidden state representations.
    * Key parameters include:
        * `input_size`: Defines the number of expected features in the input.
        * `hidden_size`: Specifies the number of features in the hidden state.
        * `num_layers`: Indicates the number of recurrent layers.
        * `dropout`: Dropout rate for regularization.
        * `batch_first`: If `True`, the input and output tensors are provided as (batch, sequence, feature). The model fixes this to `True`.
        * `bidirectional`: If `True`, the input sequence is processed both forwards and backwards. The model fixes this to `True`, so the LSTM output has `hidden_size * 2` features.

* **Task-Specific Branches:**
    * The model includes two task-specific branches (`pred_any`, `pred_pf`).
    * Each branch is a sequential network consisting of:
        * A linear layer with 500 output units.
        * Leaky ReLU activation.
        * Dropout layer for regularization.
        * A linear layer with 1 output unit.
        * Sigmoid activation for binary classification.

**Forward Pass:**

1.  **Input Processing:**
    * The `forward` method takes an input tensor `x`.
    * It handles cases where `x` might be a tuple (input, targets) by extracting the input.
    * The input is converted to `float32` format.
    * Input dimensions are adjusted to match the LSTM's expected input shape (batch\_size, sequence\_length, input\_size).
2.  **Shared LSTM:**
    * The input is passed through the shared LSTM layer.
    * The LSTM outputs the hidden state sequence.
3.  **Task-Specific Predictions:**
    * The output of the last time step from the LSTM is used as input for both task-specific branches.
    * Each branch produces a single output, representing the prediction for its respective task.
4.  **Output:**
    * The `forward` method returns the predictions from both task-specific branches (`pred_any_scores`, `pred_pf_scores`).
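
The shape handling in the steps above can be traced with a few lines of PyTorch. This is a sketch with made-up sizes (`batch_size=4`, `input_size=16`, `hidden_size=8`), not the notebook's configuration. It shows why the branches take `hidden_size * 2` inputs when the LSTM is bidirectional.

```python
import torch
from torch import nn

batch_size, input_size, hidden_size = 4, 16, 8  # illustrative sizes only

lstm = nn.LSTM(input_size, hidden_size, num_layers=2,
               batch_first=True, dropout=0.2, bidirectional=True)

features = torch.randn(batch_size, input_size)  # one feature vector per sentence
sequence = features.unsqueeze(1)                # (4, 1, 16): sequence length of 1
outputs, _ = lstm(sequence)                     # (4, 1, 16): forward and backward states concatenated
last_step = outputs[:, -1, :]                   # (4, 16) == (batch_size, hidden_size * 2)

print(sequence.shape, outputs.shape, last_step.shape)
```

Note that a 2D feature matrix becomes a sequence of length 1, so in that case each sentence passes through the LSTM as a single step.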

**Relation to Author's Architecture:**

My model adopts a distinct architecture compared to Gencheva et al. (2017), who relied on Support Vector Machines (SVM) and Feed-Forward Neural Networks (FNN). A key similarity is the emphasis on contextual modeling: my model employs a shared LSTM layer to capture relationships within the input sequence and feeds the LSTM output to two task-specific branches (`pred_any` and `pred_pf`). These branches, like the multi-task approach in Gencheva et al. (2017), are designed to predict check-worthiness. The shared LSTM followed by task-specific branches allows the model to learn representations relevant to both tasks while still capturing task-specific nuances, reflecting the paper's focus on modeling the context of claims.

In [None]:
from torch import nn

class MultiTaskRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, dropout):
        super().__init__()

        # Shared RNN layer
        self.rnn = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, dropout=dropout, bidirectional=True)
        
        # Task-specific branches
        self.pred_any = nn.Sequential(
            nn.Linear(hidden_size * 2, 500), # Linear layer with 500 output units
            nn.LeakyReLU(), # Leaky ReLU activation
            nn.Dropout(dropout), # Dropout layer
            nn.Linear(500, 1), # Linear layer with 1 output unit
            nn.Sigmoid() # Sigmoid activation
        )
        
        self.pred_pf = nn.Sequential(
            nn.Linear(hidden_size * 2, 500), # Linear layer with 500 output units
            nn.LeakyReLU(), # Leaky ReLU activation
            nn.Dropout(dropout), # Dropout layer
            nn.Linear(500, 1), # Linear layer with 1 output unit
            nn.Sigmoid() # Sigmoid activation
        )

    # Forward pass through the network
    def forward(self, x):
        # Unpack inputs if x is a tuple (e.g., (inputs, targets))
        if isinstance(x, tuple):
            x = x[0]  # Extract the inputs tensor

        # Ensure input is in float32 format
        x = x.float()

        # Reshape input if necessary to match LSTM's expected input shape
        if x.dim() == 2:  # If input is (batch_size, input_size), add a sequence dimension
            # For debugging purposes
            # print("Reshaping input to account for a sequence dimension...")
            x = x.unsqueeze(1)  # Add a sequence length of 1: (batch_size, 1, input_size)

        # Pass input through the shared RNN layer
        rnn_scores, _ = self.rnn(x)

        # Use the output of the last time step for task-specific branches
        shared_scores = rnn_scores[:, -1, :]  # Take the last time step's output

        # Task-specific branches
        pred_any_scores = self.pred_any(shared_scores)
        pred_pf_scores = self.pred_pf(shared_scores)

        return pred_any_scores, pred_pf_scores

---
### **2.1.2 Test the Model**
#### Test the model to ensure the code is valid

In [None]:
# Test the model output
model = MultiTaskRNN(input_size=X_train.shape[1], hidden_size=64, num_layers=2, dropout=0.2)
model = model.to(device)

# Sample input tensor
sample_input = torch.randn(1, X_train.shape[1]).to(device)
output = model(sample_input)

# Print the model output
print(f"Model output: {output}")



---
### **2.2.1 Transformer Model Definition**

#### Next, let's define a Transformer that follows the same multitask structure as the authors' original implementation, as described above in the section "**Author's Original Implementation Architecture**", utilizing scikit-learn, fastai, and torch.

#### **MultiTaskTransformer**

The `MultiTaskTransformer` is a neural network model implemented using the PyTorch library. It is designed for multi-task learning, specifically for predicting check-worthiness of claims in political debates.

**Architecture:**

* **Shared Transformer Encoder:** The model uses a Transformer Encoder as its base. The Transformer Encoder consists of multiple `TransformerEncoderLayer` layers.
    *    `d_model`: The model dimension. Here it is set to the input feature size (`input_size`), and it must be divisible by `nhead`.
    *    `nhead`: Defines the number of attention heads.
    *    `dim_feedforward`: Sets the hidden size of the feedforward layers within the Transformer Encoder Layer.
    *    `dropout`: Applies dropout for regularization.
    *    `num_layers`: Determines the number of Transformer Encoder Layers.

* **Task-Specific Branches:** The model has two task-specific branches: `pred_any` and `pred_pf`. Both branches are sequential neural networks:
    *    They consist of linear layers (`nn.Linear`), Leaky ReLU activation (`nn.LeakyReLU`), Dropout layers (`nn.Dropout`), and a final linear layer to produce a single output.
    *    The final layer uses a Sigmoid activation function as these branches are designed for binary classification tasks, in our case, check-worthiness.

**Forward Pass:**

*    The `forward` function defines how data flows through the network.
*    It handles input tensors, ensuring they are in the correct format (float32) and shape for the Transformer. If the input is 2D, it adds a sequence dimension. The input is then transposed to fit the expected input shape of the Transformer.
*    The input is passed through the shared Transformer Encoder.
*    The output from the last time step of the Transformer Encoder is used as input for the task-specific branches.
*    Each task-specific branch (`pred_any`, `pred_pf`) processes this shared output to produce its own prediction.
*    The function returns the predictions from both task-specific branches.
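
The forward pass above can be traced shape by shape. The sizes below (`batch_size=4`, `d_model=16`, `nhead=4`, `dim_feedforward=64`) are illustrative assumptions, not the notebook's configuration.

```python
import torch
from torch import nn

batch_size, d_model, nhead = 4, 16, 4  # illustrative sizes only
assert d_model % nhead == 0            # multi-head attention needs d_model divisible by nhead

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                   dim_feedforward=64, dropout=0.2)
encoder = nn.TransformerEncoder(layer, num_layers=2)

inputs = torch.randn(batch_size, d_model).unsqueeze(1)  # (batch, seq, feature) = (4, 1, 16)
inputs = inputs.transpose(0, 1)                         # (seq, batch, feature) = (1, 4, 16)
encoded = encoder(inputs)                               # same shape as the input
final_step = encoded[-1, :, :]                          # (batch, feature) = (4, 16)

print(inputs.shape, encoded.shape, final_step.shape)
```

Because the encoder layer is created without `batch_first=True`, it expects the sequence dimension first, which is why the transpose is needed before the encoder call.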

**Relation to Author's Architecture:**

My code uses a Transformer-based architecture, a departure from the Support Vector Machines (SVM) and Feed-Forward Neural Networks (FNN) employed by Gencheva et al. (2017). However, similar to their work, my approach emphasizes contextual modeling. This is achieved through a Transformer Encoder, designed to capture relationships within a sequence of data. The `pred_any` and `pred_pf` task-specific branches mirror the paper's multi-task strategy: `pred_any` predicts check-worthiness according to at least one source, and `pred_pf` predicts it according to PolitiFact specifically. By using a shared Transformer Encoder followed by task-specific branches, the model learns general representations applicable to both tasks while also accommodating task-specific nuances. This architecture aligns with the paper's focus on modeling the context of claims, including the relationships between a claim and its surrounding debate.


In [None]:
from torch import nn

class MultiTaskTransformer(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_heads, dropout):
        super().__init__()

        # Shared Transformer Encoder
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=input_size,
                nhead=num_heads,
                dim_feedforward=hidden_size,
                dropout=dropout,
            ),
            num_layers=num_layers,
        )

        # Task-specific branches
        self.pred_any = nn.Sequential(
            nn.Linear(input_size, 500),  # Linear layer with 500 output units
            nn.LeakyReLU(),  # Leaky ReLU activation
            nn.Dropout(dropout),  # Dropout layer
            nn.Linear(500, 1),  # Linear layer with 1 output unit
            nn.Sigmoid()  # Sigmoid activation
        )

        self.pred_pf = nn.Sequential(
            nn.Linear(input_size, 500),  # Linear layer with 500 output units
            nn.LeakyReLU(),  # Leaky ReLU activation
            nn.Dropout(dropout),  # Dropout layer
            nn.Linear(500, 1),  # Linear layer with 1 output unit
            nn.Sigmoid()  # Sigmoid activation
        )

    # Forward pass through the network
    def forward(self, x):
        # Unpack inputs if x is a tuple (e.g., (inputs, targets))
        if isinstance(x, tuple):
            x = x[0]  # Extract the inputs tensor

        # Ensure input is in float32 format
        x = x.float()

        # Reshape input if necessary to match Transformer's expected input shape
        if x.dim() == 2:  # If input is (batch_size, input_size), add a sequence dimension
            # For Debugging
            # print("Reshaping input to account for a sequence dimension...")
            x = x.unsqueeze(1)  # Add a sequence length of 1: (batch_size, 1, input_size)

        # Transpose input to match Transformer's expected shape: (sequence_length, batch_size, input_size)
        x = x.transpose(0, 1)  # Shape becomes (sequence_length, batch_size, input_size)

        # Pass input through the shared Transformer Encoder
        transformer_output = self.transformer(x)  # Output shape: (sequence_length, batch_size, input_size)

        # Use the output of the last time step for task-specific branches
        shared_scores = transformer_output[-1, :, :]  # Take the last time step's output (batch_size, input_size)

        # Task-specific branches
        pred_any_scores = self.pred_any(shared_scores)
        pred_pf_scores = self.pred_pf(shared_scores)

        return pred_any_scores, pred_pf_scores

---
### **2.2.2 Test the Model**
#### Test the model to ensure the code is valid

In [None]:
# Test the Transformer model output
# Note: input_size is used as d_model, so it must be divisible by num_heads
model = MultiTaskTransformer(input_size=X_train.shape[1], hidden_size=64, num_layers=2, num_heads=4, dropout=0.2)
model = model.to(device)

# Create a sample input tensor
# The model's forward() expects (batch_size, sequence_length, input_size) and transposes it internally
sequence_length = 10
sample_input = torch.randn(1, sequence_length, X_train.shape[1]).to(device)  # (batch_size, sequence_length, input_size)

# Pass the sample input through the model
output = model(sample_input)

# Print the model output
print(f"Model output (Task: Any): {output[0]}")
print(f"Model output (Task: PF): {output[1]}")

---

### **Helper Classes and Methods**

#### The following classes and methods were developed to help the fastai learner deal with the multitask model. Let me walk you through some of the key components of my code:

### Core Classes and Functions

* **`Callback` (from `fastai.callback.core`)**:
    * This is the base class for callbacks in the fastai library.
    * I'm using it here to create `DebugCallback`, a custom callback that I use for debugging purposes. It simply prints the input, target, predictions, and loss after each batch. This helps me keep an eye on the training process.
* **`Metric` (from `fastai.metrics`)**:
    * This is another fastai class that I extend to create custom metrics.
    * I've defined two custom metric classes: `MultiTaskMetric` and `MultiTaskTotalMetric`. These are designed to handle the specific requirements of my multi-task learning setup.
* **`OptimWrapper` (from `fastai.optimizer`)**:
    * This class helps in wrapping PyTorch optimizers for use with fastai's training loop.
    * I'm using `partial` from `functools` along with `OptimWrapper` to create optimizer configurations (SGD, Adam, AdamW) with specific learning rates.
* **`binary_cross_entropy` (from `torch.nn.functional`)**:
    * This is the standard PyTorch function for calculating binary cross-entropy loss. Note that the code cell below rebinds this name to an `nn.BCELoss()` instance, which computes the same loss with default settings.
    * I'm using this as the core loss function within my custom multi-task loss functions.
* **Optimizers (from `torch.optim`)**:
    * I'm using standard PyTorch optimizers: `SGD`, `Adam`, and `AdamW`.
    * These are used to update the model's weights during training.
* **`confusion_matrix` (from `sklearn.metrics`)**:
    * This scikit-learn function is used to compute the confusion matrix, which is helpful for evaluating classification performance.
    * I use it within my evaluation function to calculate confusion matrices for each task.
### Custom Classes and Functions

* **`DebugCallback(Callback)`**:
    * As mentioned earlier, this is a simple callback for printing debug information during training.
    * It's useful for inspecting the data flow and model behavior.
* **`MultiTaskMetric(Metric)`**:
    * This custom metric calculates accuracy and ranking metrics (precision, MAP, R-Precision, F1-score) for individual tasks in my multi-task setup.
    * It accumulates predictions and targets for each task separately and then computes the metrics.
* **`MultiTaskTotalMetric(Metric)`**:
    * Similar to `MultiTaskMetric`, but this one calculates the *total* accuracy and ranking metrics across *all* tasks.
    * This gives me an overall performance view.
* **`MultiTaskDataset(torch.utils.data.Dataset)`**:
    * This is a custom PyTorch Dataset class that I use to handle datasets with multiple targets (one for each task).
    * It's essential for feeding data to my multi-task models.
* **`convert_to_tensor(...)`**:
    * This function is responsible for converting my input data (NumPy arrays) into PyTorch tensors.
    * It also includes checks for NaN and Inf values, which is crucial for numerical stability during training.
* **`evaluate_and_visualize(learn, ...)`**:
    * This is a comprehensive function for evaluating my trained model and generating visualizations of the results.
    * It calculates various metrics, confusion matrices, and generates plots to help me analyze the model's performance.
* **`get_optimizer(optimizer, lr)`**:
    * This utility function helps me get the appropriate optimizer (SGD, Adam, or AdamW) based on a string identifier.
    * It uses `partial` to pre-configure the optimizers with a learning rate.
* **`multitask_loss(preds, targets)`**:
    * This is one of my custom loss functions for multi-task learning.
    * It calculates the combined loss by summing the binary cross-entropy loss for each task.
* **`multitask_loss_auto_weighted_bce(preds, targets)`**:
    * Another custom loss function that extends `multitask_loss` by automatically weighting the binary cross-entropy loss for each task based on the class distribution.
    * This helps to address class imbalance.
* **`multitask_loss_dynamic_weighted(preds, targets)`**:
    * This loss function is similar to the previous one but uses dynamically updated weights during training.
    * The weights are adjusted based on the loss of each task in the previous step.
* **`multitask_loss_loss_proportional(preds, targets, epsilon=1e-8)`**:
    * This loss function weights the loss for each task proportionally to the inverse of the task's loss.
    * This is another strategy to balance the contribution of each task to the overall loss.
* **`multitask_splitter(model)`**:
    * This function is used to split the model's parameters into different groups for optimizer configuration (e.g., different learning rates for different parts of the model).
    * It's specific to my `MultiTaskRNN` and `MultiTaskTransformer` models.
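
As a rough illustration of the ranking metrics mentioned for `MultiTaskMetric`, the NumPy sketch below computes precision at k and average precision on a toy example. The function names `toy_precision_at_k` and `toy_average_precision` and the sample scores are hypothetical; they are not the notebook's methods or the authors' `rank_metrics` functions, only one common way to define these quantities.

```python
import numpy as np

def toy_precision_at_k(scores, targets, k):
    """Fraction of positives among the k highest-scored items."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return float(np.mean(targets[order[:k]]))

def toy_average_precision(scores, targets):
    """Mean of precision@i taken at each rank i that holds a positive."""
    order = np.argsort(scores)[::-1]
    ranked = targets[order]
    hits = np.cumsum(ranked)                # positives seen up to each rank
    ranks = np.arange(1, len(ranked) + 1)
    positives = ranked == 1
    if not positives.any():
        return 0.0
    return float(np.mean(hits[positives] / ranks[positives]))

toy_scores = np.array([0.9, 0.8, 0.4, 0.3, 0.1])
toy_targets = np.array([1, 0, 1, 0, 0])

print(toy_precision_at_k(toy_scores, toy_targets, k=2))  # 0.5
print(toy_average_precision(toy_scores, toy_targets))    # (1/1 + 2/3) / 2 = 0.8333...
```

Both functions rank sentences by predicted score, so they evaluate how well the model orders claims rather than how well a fixed 0.5 threshold separates them.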

In [None]:
from fastai.callback.core import Callback
from fastai.metrics import Metric
from fastai.optimizer import OptimWrapper
from functools import partial
from torch.nn.functional import binary_cross_entropy
from torch.optim import SGD, Adam, AdamW
from sklearn.metrics import confusion_matrix
import random

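# Define learning rates, the base loss, and initial task weights outside the loss functions
lr_large = 0.003
lr_small = 0.001
# Note: this rebinds the name imported from torch.nn.functional above to an nn.BCELoss() module instance
binary_cross_entropy = nn.BCELoss()
weight_any = torch.tensor(0.5, requires_grad=False)
weight_pf = torch.tensor(0.5, requires_grad=False)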

# Custom callback to print debug information
class DebugCallback(Callback):
    def after_batch(self):
        print(f"Input: {self.xb}\nTarget: {self.yb}\nPredictions: {self.pred}\nLoss: {self.loss}")

# Custom metric to compute accuracy and ranking metrics for multitask predictions
# Computes metrics for each task independently
# Args - task_idx: Index of the task in the predictions tuple
class MultiTaskMetric(Metric):
    def __init__(self, task_idx):
        self.task_idx = task_idx  # Index of the task in the predictions tuple
        self.epoch_metrics = []  # List to store metrics for each epoch
        self.reset()

    def reset(self):
        # Debugging: Print a message when resetting the predictions and targets
        # print(f"Resetting predictions and targets for task {self.task_idx}...")
        self.preds = np.empty((0, 1))
        self.targets = np.empty((0,))

    # Accumulate predictions and targets for each batch
    def accumulate(self, learn):
        # Extract predictions and targets for the specific task
        preds, targets = learn.pred[self.task_idx].cpu().numpy(), learn.yb[0][self.task_idx].cpu().numpy()

        # Concatenate predictions and targets
        self.preds = np.concatenate([self.preds, preds])
        self.targets = np.concatenate([self.targets, targets])

    # Compute and return the metric values
    @property
    def value(self):
        # Guard against computing metrics before any batch has been accumulated
        if self.preds.size == 0 or self.targets.size == 0:
            print("No predictions or targets accumulated. Ensure that the metric is being used correctly.")
            raise ValueError("No predictions or targets accumulated. Ensure that the metric is being used correctly.")

        # Calculate metrics and save them for later retrieval
        self.metrics = {
            'accuracy': self.calculate_accuracy(self.preds, self.targets),
            'precision@5': self.calculate_precision_at_k(self.preds, self.targets, k=5),
            'precision@10': self.calculate_precision_at_k(self.preds, self.targets, k=10),
            'map': self.calculate_map(self.preds, self.targets),
            'r_precision': self.calculate_r_precision(self.preds, self.targets),
            "f1_score": self.calculate_f1_score(self.preds, self.targets)
        }
    
        # Debugging: Log metrics
        # print(f"Metrics for task {self.task_idx}: {self.metrics}")

        # Save metrics for the current epoch
        self.epoch_metrics.append(self.metrics)
        print(f"Epoch Metrics: {self.epoch_metrics}")

        # Return the metrics dictionary
        return self.metrics

    # Validate that targets are binary (0 or 1)
    def validate_binary_targets(self, targets, threshold=0.5):
        # Check if targets are binary, and binarize if not
        if not np.all(np.isin(targets, [0, 1])):
            print("Non-binary values detected. Binarizing using threshold.")
            targets = (targets > threshold).astype(float)

        return targets

    # Calculate accuracy
    def calculate_accuracy(self, preds, targets):
        # Ensure targets are binary
        targets = self.validate_binary_targets(targets)

        # Convert predictions to binary
        preds_binary = (preds > 0.5).astype(float)

        # Calculate accuracy
        correct = (preds_binary.flatten() == targets.flatten()).sum()
        total = targets.size

        return float(correct) / float(total) if total > 0 else 0.0

    # Calculate precision at k
    def calculate_precision_at_k(self, preds, targets, k):
        # Ensure targets are binary
        targets = self.validate_binary_targets(targets)

        # Validate k
        if k > len(preds):
            print(f"k ({k}) cannot be greater than the number of predictions ({len(preds)}).")
            raise ValueError(f"k ({k}) cannot be greater than the number of predictions ({len(preds)}).")

        # Sort predictions in descending order
        relevant_indices = np.argsort(preds.flatten())[::-1]
        top_k_indices = relevant_indices[:k]

        # Get top-k targets
        top_k_targets = targets[top_k_indices]

        # Calculate precision at k
        correct_predictions = np.sum(top_k_targets)
        return float(correct_predictions) / float(k) if k > 0 else 0.0

    # Calculate mean average precision
    def calculate_map(self, preds, targets):
        # Ensure targets are binary
        targets = self.validate_binary_targets(targets)

        # Sort predictions in descending order
        relevant_indices = np.argsort(preds.flatten())[::-1]
        ranked_targets = targets[relevant_indices]

        # Calculate mean average precision
        precisions = []
        relevant_count = 0

        for i, target in enumerate(ranked_targets):
            if target == 1:
                relevant_count += 1
                precisions.append(float(relevant_count) / float(i + 1))

        return float(np.mean(precisions)) if precisions else 0.0

    # Calculate R-Precision
    def calculate_r_precision(self, preds, targets):
        # Ensure targets are binary
        targets = self.validate_binary_targets(targets)

        # Sort predictions in descending order
        relevant_indices = np.argsort(preds.flatten())[::-1]
        relevant_targets = targets[relevant_indices]

        # Calculate R-Precision
        R = np.sum(targets)  # Total number of relevant items

        if R == 0:
            return 0.0

        top_r_targets = relevant_targets[:int(R)]
        correct_predictions = np.sum(top_r_targets)

        return float(correct_predictions) / float(R)

    # Calculate F1-score
    def calculate_f1_score(self, preds, targets):
        # Ensure targets are binary
        targets = self.validate_binary_targets(targets)

        # Convert predictions to binary
        preds_binary = (preds > 0.5).astype(float)

        # Calculate True Positives, False Positives, and False Negatives
        true_positives = np.sum((preds_binary.flatten() == 1) & (targets.flatten() == 1))
        false_positives = np.sum((preds_binary.flatten() == 1) & (targets.flatten() == 0))
        false_negatives = np.sum((preds_binary.flatten() == 0) & (targets.flatten() == 1))

        # Calculate precision and recall
        precision = float(true_positives) / float(true_positives + false_positives) if (true_positives + false_positives) > 0 else 0.0
        recall = float(true_positives) / float(true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0.0

        # Calculate F1-score
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

        return f1
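
# Minimal sketch of the ranking metrics computed above, worked on a toy ranking.
# The helper name example_ranking_metrics and the toy scores are hypothetical and
# only illustrate the idea; the value reported under 'map' above is the average
# precision of a single ranked list, which is what this sketch reproduces.
import numpy as np

def example_ranking_metrics(scores, labels, k):
    # Flatten so (N, 1) scores and (N,) labels line up element-wise
    scores = np.asarray(scores, dtype=float).flatten()
    labels = np.asarray(labels, dtype=float).flatten()

    # Rank the labels by descending score
    order = np.argsort(scores)[::-1]
    ranked = labels[order]

    # Precision@k: fraction of relevant items among the top k
    precision_at_k = float(ranked[:k].sum()) / k

    # Average precision: mean of precision values taken at each relevant rank
    hits = np.cumsum(ranked)
    ranks = np.arange(1, len(ranked) + 1)
    relevant = ranked == 1
    average_precision = float((hits[relevant] / ranks[relevant]).mean()) if relevant.any() else 0.0

    return precision_at_k, average_precision

# Usage: example_ranking_metrics([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0], k=2)
# ranks the relevant items at positions 1 and 3.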

# Custom metric to compute total accuracy and ranking metrics for multitask predictions
class MultiTaskTotalMetric(Metric):
    def __init__(self):
        self.epoch_metrics = []  # List to store metrics for each epoch
        self.reset()

    def reset(self):
        # Debugging: Print a message when resetting the predictions and targets
        # print(f"Resetting predictions and targets for task total...")
        self.preds_any = np.empty((0, 1))
        self.preds_pf = np.empty((0, 1))
        self.targets_any = np.empty((0,))
        self.targets_pf = np.empty((0,))

    # Accumulate predictions and targets for each batch
    def accumulate(self, learn):
        # Extract predictions and targets for both tasks
        preds_any, preds_pf = learn.pred[0].cpu().numpy(), learn.pred[1].cpu().numpy()
        targets_any, targets_pf = learn.yb[0][0].cpu().numpy(), learn.yb[0][1].cpu().numpy()

        self.preds_any = np.concatenate([self.preds_any, preds_any])
        self.preds_pf = np.concatenate([self.preds_pf, preds_pf])
        self.targets_any = np.concatenate([self.targets_any, targets_any])
        self.targets_pf = np.concatenate([self.targets_pf, targets_pf])

    # Compute and return the metric values
    @property
    def value(self):
        # Guard against computing metrics before any batch has been accumulated
        if self.preds_any.size == 0 or self.preds_pf.size == 0 or self.targets_any.size == 0 or self.targets_pf.size == 0:
            raise ValueError("No predictions or targets accumulated. Ensure that the metric is being used correctly.")

        total_preds = np.concatenate([self.preds_any, self.preds_pf])
        total_targets = np.concatenate([self.targets_any, self.targets_pf])

        # Debugging: Log shapes of predictions and targets
        # print(f"Total predictions shape: {total_preds.shape}, Total targets shape: {total_targets.shape}")

        # Calculate metrics and save them for later retrieval
        self.metrics = {
            'accuracy': self.calculate_accuracy(total_preds, total_targets),
            'precision@5': self.calculate_precision_at_k(total_preds, total_targets, k=5),
            'precision@10': self.calculate_precision_at_k(total_preds, total_targets, k=10),
            'map': self.calculate_map(total_preds, total_targets),
            'r_precision': self.calculate_r_precision(total_preds, total_targets),
            "f1_score": self.calculate_f1_score(total_preds, total_targets), 
        }

        # Debugging: Log metrics
        # print(f"Metrics: {self.metrics}")

        # Save metrics for the current epoch
        self.epoch_metrics.append(self.metrics)
        print(f"Epoch Metrics: {self.epoch_metrics}")

        # Return the metrics dictionary
        return self.metrics

    # Validate that targets are binary (0 or 1)
    def validate_binary_targets(self, targets, threshold=0.5):
        # Check if targets are binary, and binarize if not
        if not np.all(np.isin(targets, [0, 1])):
            print("Non-binary values detected. Binarizing using threshold.")
            targets = (targets > threshold).astype(float)

        return targets

    # Calculate accuracy
    def calculate_accuracy(self, preds, targets):
        # Ensure targets are binary
        targets = self.validate_binary_targets(targets)

        # Convert predictions to binary
        preds_binary = (preds > 0.5).astype(float)

        # Calculate accuracy
        correct = (preds_binary.flatten() == targets.flatten()).sum()
        total = targets.size

        # Normalize accuracy
        return float(correct) / float(total) if total > 0 else 0.0

    # Calculate precision at k
    def calculate_precision_at_k(self, preds, targets, k):
        # Ensure targets are binary
        targets = self.validate_binary_targets(targets)

        # Validate k
        if k > len(preds):
            print(f"k ({k}) cannot be greater than the number of predictions ({len(preds)}).")
            raise ValueError(f"k ({k}) cannot be greater than the number of predictions ({len(preds)}).")

        # Sort predictions in descending order
        relevant_indices = np.argsort(preds.flatten())[::-1]
        top_k_indices = relevant_indices[:k]

        # Get top-k targets
        top_k_targets = targets[top_k_indices]

        # Calculate precision at k
        correct_predictions = np.sum(top_k_targets)
        return float(correct_predictions) / float(k) if k > 0 else 0.0

    # Calculate mean average precision
    def calculate_map(self, preds, targets):
        # Ensure targets are binary
        targets = self.validate_binary_targets(targets)

        # Sort predictions in descending order
        relevant_indices = np.argsort(preds.flatten())[::-1]
        ranked_targets = targets[relevant_indices]

        # Calculate mean average precision
        precisions = []
        relevant_count = 0

        for i, target in enumerate(ranked_targets):
            if target == 1:
                relevant_count += 1
                precisions.append(float(relevant_count) / float(i + 1))

        return float(np.mean(precisions)) if precisions else 0.0

    # Calculate R-Precision
    def calculate_r_precision(self, preds, targets):
        # Ensure targets are binary
        targets = self.validate_binary_targets(targets)

        # Sort predictions in descending order
        relevant_indices = np.argsort(preds.flatten())[::-1]
        relevant_targets = targets[relevant_indices]

        # Calculate R-Precision
        R = np.sum(targets)  # Total number of relevant items

        if R == 0:
            return 0.0

        top_r_targets = relevant_targets[:int(R)]
        correct_predictions = np.sum(top_r_targets)

        return float(correct_predictions) / float(R)
    
    # Calculate F1-score
    def calculate_f1_score(self, preds, targets):
        # Ensure targets are binary
        targets = self.validate_binary_targets(targets)

        # Convert predictions to binary
        preds_binary = (preds > 0.5).astype(float)

        # Calculate True Positives, False Positives, and False Negatives
        true_positives = np.sum((preds_binary.flatten() == 1) & (targets.flatten() == 1))
        false_positives = np.sum((preds_binary.flatten() == 1) & (targets.flatten() == 0))
        false_negatives = np.sum((preds_binary.flatten() == 0) & (targets.flatten() == 1))

        # Calculate precision and recall
        precision = float(true_positives) / float(true_positives + false_positives) if (true_positives + false_positives) > 0 else 0.0
        recall = float(true_positives) / float(true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0.0

        # Calculate F1-score
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

        return f1

# Custom dataset class to handle multiple targets
# This class is used to create a dataset that returns multiple targets for each input sample
# The dataset is used to train a model that predicts multiple targets simultaneously
class MultiTaskDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, targets_any, targets_pf):
        self.inputs = inputs
        self.targets_any = targets_any
        self.targets_pf = targets_pf

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], (self.targets_any[idx], self.targets_pf[idx])

# Convert the training and validation splits to float32 PyTorch tensors
# Logs NaN/Inf checks and tensor shapes for debugging
def convert_to_tensor(X_train, X_val, y_train_any, y_train_pf, y_val_any, y_val_pf):
    # Convert everything to float for PyTorch tensors
    y_train_any_numeric = y_train_any.astype(float)
    y_train_pf_numeric = y_train_pf.astype(float)
    y_val_any_numeric = y_val_any.astype(float)
    y_val_pf_numeric = y_val_pf.astype(float)

    # Convert data to PyTorch tensors
    X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
    y_train_any_tensor = torch.tensor(y_train_any_numeric, dtype=torch.float32)
    y_train_pf_tensor = torch.tensor(y_train_pf_numeric, dtype=torch.float32)

    X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
    y_val_any_tensor = torch.tensor(y_val_any_numeric, dtype=torch.float32)
    y_val_pf_tensor = torch.tensor(y_val_pf_numeric, dtype=torch.float32)

    # Check for NaN or Inf in the training and validation data
    print("Checking training data for NaN and inf...")
    X_train_nan = torch.isnan(X_train_tensor).any()
    y_train_any_nan = torch.isnan(y_train_any_tensor).any()
    y_train_pf_nan = torch.isnan(y_train_pf_tensor).any()

    if X_train_nan or y_train_any_nan or y_train_pf_nan:
        print(f"NaN detected in training data! X_train: {X_train_nan} | y_train_any: {y_train_any_nan} | y_train_pf {y_train_pf_nan}")
    
    X_train_inf = torch.isinf(X_train_tensor).any()
    y_train_any_inf = torch.isinf(y_train_any_tensor).any()
    y_train_pf_inf = torch.isinf(y_train_pf_tensor).any()

    if X_train_inf or y_train_any_inf or y_train_pf_inf:
        print(f"Inf detected in training data! X_train: {X_train_inf} | y_train_any: {y_train_any_inf} | y_train_pf {y_train_pf_inf}")

    print("Checking validation data for NaN and inf...")

    X_val_nan = torch.isnan(X_val_tensor).any()
    y_val_any_nan = torch.isnan(y_val_any_tensor).any()
    y_val_pf_nan = torch.isnan(y_val_pf_tensor).any()

    if X_val_nan or y_val_any_nan or y_val_pf_nan:
        print(f"NaN detected in validation data! X_val: {X_val_nan} | y_val_any: {y_val_any_nan} | y_val_pf {y_val_pf_nan}")

    X_val_inf = torch.isinf(X_val_tensor).any()
    y_val_any_inf = torch.isinf(y_val_any_tensor).any()
    y_val_pf_inf = torch.isinf(y_val_pf_tensor).any()

    if X_val_inf or y_val_any_inf or y_val_pf_inf:
        print(f"Inf detected in validation data! X_val: {X_val_inf} | y_val_any: {y_val_any_inf} | y_val_pf {y_val_pf_inf}")

    # Check shapes
    print(f"X_train shape: {X_train_tensor.shape} | y_train_any shape: {y_train_any_tensor.shape} | y_train_pf shape: {y_train_pf_tensor.shape}")
    print(f"X_val shape: {X_val_tensor.shape} | y_val_any shape: {y_val_any_tensor.shape} | y_val_pf shape: {y_val_pf_tensor.shape}")
    
    return X_train_tensor, X_val_tensor, y_train_any_tensor, y_train_pf_tensor, y_val_any_tensor, y_val_pf_tensor
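
# Minimal sketch of the NaN/Inf checks above, written once for any number of arrays.
# The helper name example_count_non_finite is hypothetical; it uses NumPy rather than
# torch so it can be run on the raw arrays before they are converted to tensors.
import numpy as np

def example_count_non_finite(named_arrays):
    # Map each array name to a (nan_count, inf_count) pair
    report = {}
    for name, values in named_arrays.items():
        arr = np.asarray(values, dtype=float)
        report[name] = (int(np.isnan(arr).sum()), int(np.isinf(arr).sum()))
    return report

# Usage: example_count_non_finite({"X_train": X_train, "y_train_any": y_train_any})
# returns a dictionary whose counts are all zero when the data is clean.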

# Custom function to evaluate and visualize the model
# This function evaluates the model on the validation set and visualizes the results
#
# Args:
#   learn (Learner): The FastAI Learner object containing the trained model.
#   task_names (tuple): Task names for multitask learning (default: ("Any", "PF")).
#   threshold (float): Threshold for converting probabilities to binary predictions (default: 0.5).
#   epochs (int): Currently unused; kept so existing calls that pass it keep working (default: 100).
def evaluate_and_visualize(learn, task_names=("Any", "PF"), threshold=0.5, epochs=100):
    # Get predictions and targets for the validation set
    vld_preds, vld_targets = learn.get_preds(ds_idx=1) 

    print(f"Validation predictions: {vld_preds} | Validation targets: {vld_targets}")
    print("-----------------------------------------------------------------------------------------")

    # Unpack predictions for each task
    pred_any, pred_pf = vld_preds
    target_any, target_pf = vld_targets

    # Combine the targets for all tasks into a single tensor
    total_targets = torch.cat([target_any.flatten(), target_pf.flatten()])
    total_preds = torch.cat([pred_any.flatten(), pred_pf.flatten()])

    # Check if the validation set is empty
    if len(target_any) == 0 or len(target_pf) == 0:
        print("Validation set is empty. Skipping evaluation.")
        return

    # Convert probabilities to binary predictions using the threshold
    # Flatten so (N, 1) predictions and (N,) targets compare element-wise instead of broadcasting to (N, N)
    pred_any_binary = (pred_any > threshold).float().flatten()
    pred_pf_binary = (pred_pf > threshold).float().flatten()
    target_any_binary = (target_any > threshold).float().flatten()
    target_pf_binary = (target_pf > threshold).float().flatten()

    # Find indices of incorrect and correct predictions for each task
    incorrect_any = torch.where(target_any_binary != pred_any_binary)[0]
    correct_any = torch.where(target_any_binary == pred_any_binary)[0]

    incorrect_pf = torch.where(target_pf_binary != pred_pf_binary)[0]
    correct_pf = torch.where(target_pf_binary == pred_pf_binary)[0]

    # Calculate validation loss and metrics
    vld_loss, *metrics = learn.validate()

    # Extract accuracy for each task
    task_accuracies = metrics[:len(task_names)]  # Extract accuracy for each task

    # Print results
    print("-----------------------------------------------------------------------------------------")
    print(f"Validation Loss: {vld_loss:.5f}")
    print("-----------------------------------------------------------------------------------------")
    print(f"Number of validation samples misclassified (Any Task): {len(incorrect_any):,}")
    print(f"Number of validation samples correctly classified (Any Task): {len(correct_any):,}")
    print(f"Number of validation samples misclassified (PF Task): {len(incorrect_pf):,}")
    print(f"Number of validation samples correctly classified (PF Task): {len(correct_pf):,}")
    print("-----------------------------------------------------------------------------------------")
    print(f"Shape of validation targets: {tuple(total_targets.shape)}")
    print(f"Total Number of validation targets: {len(total_targets):,}")
    print(f"Shape of validation predictions: {tuple(total_preds.shape)}")
    print(f"Total Number of validation predictions: {len(total_preds):,}")
    print("-----------------------------------------------------------------------------------------")

    # Prepare data for visualization
    # Compute confusion matrices
    target_any_binary_int = target_any_binary.int()
    pred_any_binary_int = pred_any_binary.int()
    conf_matrix_any = confusion_matrix(target_any_binary_int.cpu(), pred_any_binary_int.cpu())

    target_pf_binary_int = target_pf_binary.int()
    pred_pf_binary_int = pred_pf_binary.int()
    conf_matrix_pf = confusion_matrix(target_pf_binary_int.cpu(), pred_pf_binary_int.cpu())

    # Retrieve metrics from MultiTaskMetric and MultiTaskTotalMetric
    # Metrics for task 0 (Any)
    metrics_task_any = learn.metrics[0].metrics

    # Iterate over the metrics dictionary and print each metric
    for metric_name, metric_value in metrics_task_any.items():
        print(f"(Any {metric_name} Metric): {metric_value:.5f}")

    # Metrics for task 1 (PF)
    metrics_task_pf = learn.metrics[1].metrics

    # Iterate over the metrics dictionary and print each metric
    for metric_name, metric_value in metrics_task_pf.items():
        print(f"(PF {metric_name} Metric): {metric_value:.5f}")

    # Combined metrics for all tasks
    metrics_total = learn.metrics[2].metrics

    # Iterate over the metrics dictionary and print each metric
    for metric_name, metric_value in metrics_total.items():
        print(f"(Total {metric_name} Metric): {metric_value:.5f}")

    # Extract metrics for visualization
    accuracy_any = metrics_task_any['accuracy']
    accuracy_pf = metrics_task_pf['accuracy']
    accuracy_total = metrics_total['accuracy']

    precision_at_5_any = metrics_task_any['precision@5']
    precision_at_5_pf = metrics_task_pf['precision@5']
    precision_at_5_total = metrics_total['precision@5']

    precision_at_10_any = metrics_task_any['precision@10']
    precision_at_10_pf = metrics_task_pf['precision@10']
    precision_at_10_total = metrics_total['precision@10']

    map_any = metrics_task_any['map']
    map_pf = metrics_task_pf['map']
    map_total = metrics_total['map']

    r_precision_any = metrics_task_any['r_precision']
    r_precision_pf = metrics_task_pf['r_precision']
    r_precision_total = metrics_total['r_precision']

    # Print metrics
    print("-----------------------------------------------------------------------------------------")
    print(f"Learn.Recorder.Values: {learn.recorder.values}")
    print(f"Metrics for Task 0 (Any): {metrics_task_any}")
    print(f"Metrics for Task 1 (PF): {metrics_task_pf}")
    print(f"Combined Metrics: {metrics_total}")
    print("-----------------------------------------------------------------------------------------")

    fig, axes = plt.subplots(3, 3, figsize=(20, 15))
    fig.suptitle("Model Validation Visualizations", fontsize=24, y=1.02)

    # Define a color palette using 'twilight' for a purple-blue range
    palette = sns.color_palette("twilight", n_colors=4)

    # Plot 1: Correct vs. Incorrect Predictions (Both Tasks)
    # Prepare data for the grouped bar chart
    total_any = conf_matrix_any.sum()
    correct_any = (conf_matrix_any[0, 0] + conf_matrix_any[1, 1]) / total_any * 100
    incorrect_any = (conf_matrix_any[0, 1] + conf_matrix_any[1, 0]) / total_any * 100

    total_pf = conf_matrix_pf.sum()
    correct_pf = (conf_matrix_pf[0, 0] + conf_matrix_pf[1, 1]) / total_pf * 100
    incorrect_pf = (conf_matrix_pf[0, 1] + conf_matrix_pf[1, 0]) / total_pf * 100

    tasks = {
        'Two': ['Any', 'PF'],
        'Three': ['Any', 'PF', "Total"],
    }

    data = {
        'Task': ['Any', 'Any', 'PF', 'PF'],
        'Prediction': ['Correct', 'Incorrect', 'Correct', 'Incorrect'],
        'Percentage': [correct_any, incorrect_any, correct_pf, incorrect_pf]
    }

    df = pd.DataFrame(data)

    # Create the grouped bar chart
    sns.barplot(x='Task', y='Percentage', hue='Prediction', data=df, ax=axes[0, 0], palette=palette[:2])
    axes[0, 0].set_title("Correct vs. Incorrect Predictions (Any and PF Tasks)")
    axes[0, 0].set_ylabel("Percentage")
    axes[0, 0].legend(title='Prediction')  # Add a legend with a title

    # Plot 2: F1 Score Comparison
    f1_any = metrics_task_any['f1_score']
    f1_pf = metrics_task_pf['f1_score']
    f1_total = metrics_total['f1_score']

    # Data preparation for F1 score chart
    f1_data = {
        'Category': ['Any', 'PF', 'Total'],
        'F1 Score': [f1_any, f1_pf, f1_total]
    }

    df_f1 = pd.DataFrame(f1_data)

    # Create the F1 score bar chart
    sns.barplot(x='Category', y='F1 Score', hue='Category', data=df_f1, ax=axes[0, 1], palette=palette[:3])  
    axes[0, 1].set_title("F1 Score Comparison")
    axes[0, 1].set_ylabel("F1 Score")
    axes[0, 1].set_ylim(0, 1)  # Set y-axis limit between 0 and 1 for F1 score

    # Add labels to the bars
    for index, row in df_f1.iterrows():
        axes[0, 1].text(row.name, row['F1 Score'], round(row['F1 Score'], 2), color='black', ha="center", va="bottom")

    # Plot 3: Task-Specific Accuracy
    sns.barplot(x=tasks['Two'], y=[accuracy_any, accuracy_pf], hue=tasks['Two'], ax=axes[0, 2], palette=palette[:2])
    axes[0, 2].set_title("Task-Specific Accuracy Comparison")
    axes[0, 2].set_ylabel("Accuracy")

    # Plot 4: Confusion Matrix (Any Task)
    conf_matrix_any_normalized = conf_matrix_any.astype('float') / conf_matrix_any.sum(axis=1)[:, np.newaxis] * 100

    # Create annotations with both percentages and raw numbers
    annot_any = np.array([[f"{conf_matrix_any_normalized[i, j]:.1f}% ({conf_matrix_any[i, j]})" for j in range(conf_matrix_any.shape[1])] for i in range(conf_matrix_any.shape[0])])

    sns.heatmap(conf_matrix_any_normalized,
                annot=annot_any, fmt="", cmap=sns.blend_palette(["white", "purple"], as_cmap=True),  # Use custom annotations, fmt="" disables default formatting
                xticklabels=["Not Check-Worthy", "Check-Worthy"],
                yticklabels=["Not Check-Worthy", "Check-Worthy"],
                ax=axes[1, 0])
    axes[1, 0].set_title("Normalized Confusion Matrix (Any Task) [%]")
    axes[1, 0].set_xlabel("Predicted")
    axes[1, 0].set_ylabel("Actual")

    # Plot 5: Confusion Matrix (PF Task)
    conf_matrix_pf_normalized = conf_matrix_pf.astype('float') / conf_matrix_pf.sum(axis=1)[:, np.newaxis] * 100

    # Create annotations with both percentages and raw numbers
    annot_pf = np.array([[f"{conf_matrix_pf_normalized[i, j]:.1f}% ({conf_matrix_pf[i, j]})" for j in range(conf_matrix_pf.shape[1])] for i in range(conf_matrix_pf.shape[0])])

    sns.heatmap(conf_matrix_pf_normalized,
                annot=annot_pf, fmt="", cmap=sns.blend_palette(["white", "purple"], as_cmap=True),  # Use custom annotations, fmt="" disables default formatting
                xticklabels=["Not Check-Worthy", "Check-Worthy"],
                yticklabels=["Not Check-Worthy", "Check-Worthy"],
                ax=axes[1, 1])
    axes[1, 1].set_title("Normalized Confusion Matrix (PF Task) [%]")
    axes[1, 1].set_xlabel("Predicted")
    axes[1, 1].set_ylabel("Actual")

    # Plot 6: Precision@K Comparison
    # Prepare data for grouped bar chart
    data = {
        'Category': ['Any', 'PF', 'Total', 'Any', 'PF', 'Total'],
        'Precision': [precision_at_5_any, precision_at_5_pf, precision_at_5_total, precision_at_10_any, precision_at_10_pf, precision_at_10_total],
        'K': ['Precision@5', 'Precision@5', 'Precision@5', 'Precision@10', 'Precision@10', 'Precision@10']
    }

    df_pk = pd.DataFrame(data)

    sns.barplot(x='Category', y='Precision', hue='K', data=df_pk, ax=axes[1, 2], palette=sns.color_palette(['slateblue', 'darkblue']))
    axes[1, 2].set_title("Precision@K Comparison")
    axes[1, 2].set_ylabel("Precision")
    axes[1, 2].legend(title='K')

    # Plot 7: F1 Score Over Epochs
    # Extract F1 Score values over epochs - Adjust this based on how your 'learn' object stores epoch metrics
    task_any_f1 = [epoch['f1_score'] for epoch in learn.metrics[0].epoch_metrics]  # Task 0 (Any)
    task_pf_f1 = [epoch['f1_score'] for epoch in learn.metrics[1].epoch_metrics]  # Task 1 (PF)
    task_total_f1 = [epoch['f1_score'] for epoch in learn.metrics[2].epoch_metrics]  # Combined

    sns.lineplot(x=range(len(task_any_f1)), y=task_any_f1, label="Any F1 Score", ax=axes[2, 0], color='purple')
    sns.lineplot(x=range(len(task_pf_f1)), y=task_pf_f1, label="PF F1 Score", ax=axes[2, 0], color='blue')
    sns.lineplot(x=range(len(task_total_f1)), y=task_total_f1, label="Total F1 Score", ax=axes[2, 0], color='black')
    axes[2, 0].set_title("F1 Score Over Epochs")
    axes[2, 0].set_xlabel("Epochs")
    axes[2, 0].set_ylabel("F1 Score")
    axes[2, 0].legend()

    # Plot 8: R-Precision Comparison
    sns.barplot(x=tasks['Three'], y=[r_precision_any, r_precision_pf, r_precision_total], hue=tasks['Three'], ax=axes[2, 1], palette=palette[:3])
    axes[2, 1].set_title("R-Precision Comparison")
    axes[2, 1].set_ylabel("R-Precision")

    # Plot 9: Mean Average Precision (MAP) Comparison
    sns.barplot(x=tasks['Three'], y=[map_any, map_pf, map_total], hue=tasks['Three'], ax=axes[2, 2], palette=palette[:3])
    axes[2, 2].set_title("Mean Average Precision (MAP) Comparison")
    axes[2, 2].set_ylabel("MAP")

    # Adjust layout for better spacing
    plt.tight_layout()

    # Save the figure to a file
    model_name = str(learn.model).split('(')[0]
    model_number = random.randint(1, 999999) 
    plt.savefig(f'{model_name}_{model_number}.png')

    # Show the grid of plots
    plt.show()
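
# Minimal sketch of the row normalization used for the confusion-matrix heatmaps above,
# with a guard for a class that has no samples in the validation set (its row sum is 0,
# which would otherwise produce NaN percentages). The helper name example_row_normalize
# is hypothetical and is not called by evaluate_and_visualize.
import numpy as np

def example_row_normalize(conf_matrix):
    conf_matrix = np.asarray(conf_matrix, dtype=float)

    # Sum over predicted classes for each actual class
    row_sums = conf_matrix.sum(axis=1, keepdims=True)

    # Replace zero row sums by 1 so empty rows stay at 0% instead of becoming NaN
    safe_row_sums = np.where(row_sums == 0, 1.0, row_sums)

    return conf_matrix / safe_row_sums * 100

# Usage: example_row_normalize([[3, 1], [0, 0]]) keeps the second row at 0%.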

# Function to get the optimizer based on the optimizer name
# This function returns a partial function that can be used to create an optimizer for fastai learner
def get_optimizer(optimizer, lr):
    if optimizer == 'SGD':
        return partial(OptimWrapper, opt=SGD, lr=lr, momentum=0.9, nesterov=True)
    elif optimizer == 'Adam':
        return partial(OptimWrapper, opt=Adam, lr=lr)
    elif optimizer == 'AdamW':
        return partial(OptimWrapper, opt=AdamW, lr=lr)
    else:
        raise ValueError(f"Unsupported optimizer: {optimizer}")
    
# Custom loss function for multitask learning
# This function computes the combined loss for multiple tasks
# The loss for each task is computed using binary cross-entropy loss
def multitask_loss(preds, targets):
    # Debugging: Print predictions and targets
    # print(f"Preds: {preds}")
    # print(f"Targets: {targets}")

    # Unpack predictions and targets
    pred_any, pred_pf = preds  # Predictions for each task
    y_any, y_pf = targets      # Ground truth labels for each task

    # Reshape targets to match predictions
    y_any = y_any.view(-1, 1)
    y_pf = y_pf.view(-1, 1)

    # Debugging: Check for NaN and shape mismatches
    if torch.isnan(pred_any).any() or torch.isnan(pred_pf).any():
        print("NaN detected in predictions!")
    if torch.isnan(y_any).any() or torch.isnan(y_pf).any():
        print("NaN detected in targets!")
    if pred_any.shape != y_any.shape or pred_pf.shape != y_pf.shape:
        print(f"Shape mismatch: pred_any {pred_any.shape}, y_any {y_any.shape}, pred_pf {pred_pf.shape}, y_pf {y_pf.shape}")

    # Ensure predictions and targets are on the same device
    pred_any, pred_pf = pred_any.to(y_any.device), pred_pf.to(y_pf.device)

    # Compute binary cross-entropy loss for each task
    loss_any = binary_cross_entropy(pred_any, y_any)
    loss_pf = binary_cross_entropy(pred_pf, y_pf)

    # Combine the losses, giving each task equal weight
    combined_loss = loss_any + loss_pf

    print(f"Combined loss: {combined_loss}")
    return combined_loss

# Custom loss function for multitask learning using weighted binary cross-entropy loss
# Computes a combined loss for multitask learning using weighted binary cross-entropy with automatic weight calculation for each task.
def multitask_loss_auto_weighted_bce(preds, targets):
    # Unpack predictions and targets
    pred_any, pred_pf = preds
    y_any, y_pf = targets

    # Reshape targets to match predictions
    y_any = y_any.view(-1, 1)
    y_pf = y_pf.view(-1, 1)

    # Debugging: Check for NaN and shape mismatches
    if torch.isnan(pred_any).any() or torch.isnan(pred_pf).any():
        print("NaN detected in predictions!")
    if torch.isnan(y_any).any() or torch.isnan(y_pf).any():
        print("NaN detected in targets!")
    if pred_any.shape != y_any.shape or pred_pf.shape != y_pf.shape:
        print(f"Shape mismatch: pred_any {pred_any.shape}, y_any {y_any.shape}, pred_pf {pred_pf.shape}, y_pf {y_pf.shape}")

    # Ensure predictions and targets are on the same device
    pred_any, pred_pf = pred_any.to(y_any.device), pred_pf.to(y_pf.device)

    # Helper for the weighted binary cross-entropy loss
    # Computes per-sample binary cross-entropy and weights each sample by its class,
    # with the class weights derived automatically from the class distribution in the target
    def weighted_binary_cross_entropy(preds, target):
        with torch.no_grad():
            pos_count = torch.sum(target)
            neg_count = target.numel() - pos_count

            # Add a small epsilon to avoid division by zero
            pos_weight = neg_count / (pos_count + 1e-5)
            neg_weight = pos_count / (neg_count + 1e-5)

        # Keep one loss value per sample so the class weights can be applied element-wise
        # Assumes binary_cross_entropy is torch.nn.functional.binary_cross_entropy, whose default reduction='mean' returns a single scalar
        bce = binary_cross_entropy(preds, target, reduction='none')

        # Apply the class weights to account for class imbalance
        # Positive samples are scaled by pos_weight and negative samples by neg_weight
        weighted_bce = pos_weight * target * bce + neg_weight * (1 - target) * bce

        return torch.mean(weighted_bce)

    # Compute weighted BCE loss for each task with automatic weights
    loss_any = weighted_binary_cross_entropy(pred_any, y_any)
    loss_pf = weighted_binary_cross_entropy(pred_pf, y_pf)

    # Combine the two task losses with equal weight
    combined_loss = loss_any + loss_pf

    print(f"Combined loss: {combined_loss}")
    return combined_loss

# Custom loss function for multitask learning with dynamically updated task weights
# The loss for each task is computed using binary cross-entropy loss
# On every call the global weights (weight_any, weight_pf) shift toward the task with the larger loss
def multitask_loss_dynamic_weighted(preds, targets):
    global weight_any, weight_pf # use global weights
    # Debugging: Print predictions and targets
    # print(f"Preds: {preds}")
    # print(f"Targets: {targets}")

    # Unpack predictions and targets
    pred_any, pred_pf = preds  # Predictions for each task
    y_any, y_pf = targets      # Ground truth labels for each task

    # Reshape targets to match predictions
    y_any = y_any.view(-1, 1)
    y_pf = y_pf.view(-1, 1)

    # Debugging: Check for NaN values and shape mismatches
    if torch.isnan(pred_any).any() or torch.isnan(pred_pf).any():
        print("NaN detected in predictions!")
    if torch.isnan(y_any).any() or torch.isnan(y_pf).any():
        print("NaN detected in targets!")
    if pred_any.shape != y_any.shape or pred_pf.shape != y_pf.shape:
        print(f"Shape mismatch: pred_any {pred_any.shape}, y_any {y_any.shape}, pred_pf {pred_pf.shape}, y_pf {y_pf.shape}")

    # Ensure predictions and targets are on the same device
    pred_any, pred_pf = pred_any.to(y_any.device), pred_pf.to(y_pf.device)

    # Compute binary cross-entropy loss for each task
    loss_any = binary_cross_entropy(pred_any, y_any)
    loss_pf = binary_cross_entropy(pred_pf, y_pf)

    def update_weights(loss_any, loss_pf):
        global weight_any, weight_pf
        if loss_any > loss_pf:
            weight_any += lr_large
            weight_pf -= lr_large
        else:
            weight_any -= lr_small
            weight_pf += lr_small

        # Ensure weights stay within [0, 1] and normalize them
        weight_any = torch.clamp(weight_any, 0, 1)
        weight_pf = torch.clamp(weight_pf, 0, 1)

        # Normalize to make sure they sum to 1
        total_weight = weight_any + weight_pf
        weight_any = weight_any / total_weight
        weight_pf = weight_pf / total_weight

    # Update the weights
    update_weights(loss_any.detach(), loss_pf.detach())

    # Combine the losses using the updated task weights
    combined_loss = weight_any * loss_any + weight_pf * loss_pf

    print(f"Combined loss: {combined_loss}")
    return combined_loss

# Custom loss function for multitask learning using inverse-loss weighting
# The loss for each task is computed using binary cross-entropy loss
# Each task is weighted by the normalized inverse of its own loss, so the task with the smaller loss gets the larger weight
def multitask_loss_loss_proportional(preds, targets, epsilon=1e-8):
    pred_any, pred_pf = preds
    y_any, y_pf = targets

    # Reshape targets
    y_any = y_any.view(-1, 1)
    y_pf = y_pf.view(-1, 1)

    # Ensure predictions and targets are on the same device
    pred_any, pred_pf = pred_any.to(y_any.device), pred_pf.to(y_pf.device)

    loss_any = binary_cross_entropy(pred_any, y_any)
    loss_pf = binary_cross_entropy(pred_pf, y_pf)

    # Calculate weights
    weight_any = 1 / (loss_any.detach().item() + epsilon)
    weight_pf = 1 / (loss_pf.detach().item() + epsilon)

    # Normalize weights
    total_weight = weight_any + weight_pf
    weight_any = weight_any / total_weight
    weight_pf = weight_pf / total_weight

    combined_loss = weight_any * loss_any + weight_pf * loss_pf

    print(f"Combined loss: {combined_loss}")
    return combined_loss

# Custom splitter to define parameter groups for multitask learning
# This function splits the model parameters into three groups:
# 1. RNN shared parameters
# 2. Predictions for 'any' task
# 3. Predictions for 'pf' task
def multitask_splitter(model):
    if isinstance(model, MultiTaskRNN):
        return [list(model.rnn.parameters()), list(model.pred_any.parameters()), list(model.pred_pf.parameters())]
    elif isinstance(model, MultiTaskTransformer):
        return [list(model.transformer.parameters()), list(model.pred_any.parameters()), list(model.pred_pf.parameters())]
    else:
        raise ValueError(f"Unsupported model: {model}")

---

## **3. RNN Training and Validation**

In [None]:
# Set parameters needed for hyperparameter searches and model training
samples = 60
search_epoch = 20

---

### **3.1 RNN Grid Search**

#### Let's start by using this section to run a hyperparameter search over our RNN and find the best hyperparameters for our model to train with. Here's how I've set up my hyperparameter optimization using Bayesian Optimization:

---

### **Hyperparameter Search with Bayesian Optimization**

* I'm using `gp_minimize` from the `skopt` library to perform Bayesian Optimization. This allows for an efficient exploration of the hyperparameter space by balancing exploration and exploitation.
* The search space (`rnn_param_space`) is defined using `Dimension` objects from `skopt.space`. The hyperparameters being optimized include:
    * **`hidden_size`**: The size of the hidden layer in the LSTM, defined as an integer range.
    * **`num_layers`**: The number of LSTM layers, also defined as an integer range.
    * **`optimizer`**: A categorical choice between optimizers (`'SGD'`, `'Adam'`, and `'AdamW'`).
    * **`loss_func`**: A categorical choice between custom multi-task loss functions.
    * **`patience`**: The patience parameter for early stopping, defined as an integer range.
    * **`lr`**: The learning rate, defined as a real-valued range.

* To ensure robust evaluation, I'm using **k-fold cross-validation** (`k_folds = 5`) with the `KFold` class from `sklearn.model_selection`. This helps estimate the model's performance across different data splits.

### **Objective Function**

* The `objective` function is the core of the optimization process and is minimized by `gp_minimize`.
* It takes hyperparameter values as input (using the `@use_named_args` decorator for clean parameter handling).
* Inside the `objective` function:
    * The data is split into training and validation sets for each fold.
    * The `train_and_evaluate_rnn` function is called for each fold to train and evaluate the RNN model.
    * The average validation loss across all folds is calculated.
* The `objective` function returns the average validation loss, which `gp_minimize` tries to minimize.
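
As a rough illustration of the averaging step described above, here is a minimal, dependency-free sketch. `kfold_indices` and `evaluate_fold` are hypothetical stand-ins (for `KFold.split` and `train_and_evaluate_rnn` respectively), the toy loss is made up, and unlike the notebook's `KFold(shuffle=True)` the folds here are contiguous and unshuffled.

```python
# Hypothetical stand-ins for illustration only; these helpers are not part of the notebook.

def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds, yielding (train_idx, val_idx)."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def evaluate_fold(params, train_idx, val_idx):
    """Placeholder for real training: returns a made-up validation loss."""
    # Toy loss that is smallest when lr is close to 0.003.
    val_fraction = len(val_idx) / (len(train_idx) + len(val_idx))
    return abs(params["lr"] - 0.003) + 0.01 * val_fraction


def objective(params, n_samples=100, k=5):
    """Average the per-fold validation losses; this is the value to minimize."""
    fold_losses = [evaluate_fold(params, tr, va) for tr, va in kfold_indices(n_samples, k)]
    return sum(fold_losses) / len(fold_losses)


avg_loss = objective({"lr": 0.003})
print(f"Average validation loss: {avg_loss:.4f}")
```

In the notebook, the same kind of average is the value that `gp_minimize` receives for each sampled configuration.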

### **Training and Evaluation**

#### `train_and_evaluate_rnn`

* This function handles the training and evaluation of the RNN model for a given set of hyperparameters and data splits.
* Key steps include:
    1. **Extracting Hyperparameters**: The function extracts the hyperparameters passed to it.
    2. **Data Splitting**: The data is split into training and validation sets for the current fold.
    3. **Data Conversion**: The data is converted into PyTorch tensors using the `convert_to_tensor` function.
    4. **DataLoaders Creation**: Training and validation datasets are wrapped into `DataLoaders` using the `MultiTaskDataset` class.
    5. **Model Initialization**: The `MultiTaskRNN` model is instantiated with the given hyperparameters.
    6. **Learner Setup**: A `Learner` object (from `fastai.learner`) is created for training, with custom loss functions and metrics.
    7. **Early Stopping**: Early stopping is implemented to halt training if the validation loss does not improve for a specified number of epochs.
    8. **Model Saving**: The best model (based on validation loss) is saved and reloaded for final evaluation.
* The function returns the validation loss and F1 score for the current fold.
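
The early-stopping step (step 7) can be sketched in isolation as follows. This is a simplified illustration that assumes the validation losses are already available as a list; `early_stopping_epoch` is a hypothetical helper, and the real function also saves and reloads the model checkpoint at the best epoch.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch, best_loss) for a sequence of validation losses.

    Epochs are 1-indexed. Training stops once `patience` consecutive epochs
    fail to improve on the best loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    stop_epoch = len(val_losses)
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # A real training loop would save the model checkpoint here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            stop_epoch = epoch
            break
    return stop_epoch, best_epoch, best_loss


# Illustrative losses only, not results from this notebook.
losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.67, 0.64]
print(early_stopping_epoch(losses, patience=3))  # (6, 3, 0.65)
```

With `patience=3` the run stops at epoch 6 and never sees the later improvement at epoch 7, which is the trade-off a small patience value makes during the search.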

### **Bayesian Optimization Execution**

* The Bayesian Optimization process is executed using `gp_minimize`, which minimizes the `objective` function over the defined search space.
* Key parameters:
    * **`objective`**: The function to minimize (average validation loss).
    * **`rnn_param_space`**: The hyperparameter search space.
    * **`n_calls`**: The number of iterations for the optimization process.
* After the optimization:
    * The best hyperparameters are extracted and converted to their appropriate types (e.g., integers, floats, or categorical values).
    * The best validation loss is also recorded.
    
### **Summary**

This section efficiently searches for the best combination of hyperparameters for the RNN model using Bayesian Optimization. By leveraging k-fold cross-validation, the process ensures robust evaluation of each hyperparameter configuration. The use of early stopping further enhances the training process by preventing overfitting and saving computational resources. The result is an optimized RNN model with hyperparameters that minimize validation loss while maintaining strong performance across folds.

In [None]:
from skopt import gp_minimize
from skopt.space import Dimension, Categorical, Integer, Real as Real_skopt
from skopt.utils import use_named_args
from itertools import product
from fastai.data.all import *
from fastai.learner import *
from fastai.metrics import *
from sklearn.model_selection import KFold

# Fixed parameters
batch_size = 64

# Number of folds for cross-validation
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)

# Function to train and evaluate the RNN model on a single cross-validation fold
def train_and_evaluate_rnn(params, train_idx, val_idx, shared_weight_decay = 0.003, task_weight_decay = 0.001, dropout=0.4):
    # Extract parameters
    # Conditional unpacking based on the number of parameters
    if len(params) == 6:
        hidden_size, num_layers, optimizer, multitask_loss_func, patience_param, lr = params
    elif len(params) == 9:
        hidden_size, num_layers, optimizer, multitask_loss_func, patience_param, lr, shared_weight_decay, task_weight_decay, dropout = params
    else:
        raise ValueError(f"Unexpected number of parameters: {len(params)}")

    # Split the data into training and validation sets for the current fold
    X_train_fold, X_val_fold = X_train[train_idx], X_train[val_idx]
    y_train_any_fold, y_val_any_fold = y_train_any[train_idx], y_train_any[val_idx]
    y_train_pf_fold, y_val_pf_fold = y_train_pf[train_idx], y_train_pf[val_idx]

    # Convert data to PyTorch tensors
    X_train_tensor, X_val_tensor, y_train_any_tensor, y_train_pf_tensor, y_val_any_tensor, y_val_pf_tensor = convert_to_tensor(X_train_fold, X_val_fold, y_train_any_fold, y_train_pf_fold, y_val_any_fold, y_val_pf_fold)

    # Create training and validation datasets
    train_ds = MultiTaskDataset(X_train_tensor, y_train_any_tensor, y_train_pf_tensor)
    val_ds = MultiTaskDataset(X_val_tensor, y_val_any_tensor, y_val_pf_tensor)

    print(f"Train Dset: {train_ds}")
    print(f"Val Dset: {val_ds}")
    
    # Update DataLoaders with the current batch size
    dls = DataLoaders.from_dsets(train_ds, val_ds, bs=batch_size, device=device)

    # Debugging: Print the first batch of the training DataLoader
    # for batch in dls.train:
    #     print(f"Batch: {batch}")
    #     break
    
    # Instantiate the model with the current parameters
    # Since we have already fit and transformed the data into a flattened 2D array, we can use the input size directly
    model = MultiTaskRNN(input_size=X_train.shape[1], hidden_size=hidden_size, num_layers=num_layers, dropout=dropout)
    model = model.to(device)  # Move model to GPU if available
    
    # Create Learner
    rnn_learn = Learner(dls, model, loss_func=multitask_loss_func, opt_func=get_optimizer(optimizer, lr), splitter=multitask_splitter, metrics=[MultiTaskMetric(0), MultiTaskMetric(1), MultiTaskTotalMetric()])
    # rnn_learn.add_cb(DebugCallback()) # uncomment for debugging needs

    # Initialize the optimizer
    rnn_learn.create_opt()

    # Set weight decay for each parameter group
    rnn_learn.opt.set_hypers(wd=[shared_weight_decay, task_weight_decay, task_weight_decay])

    # Early stopping for the hyperparameter search
    patience = patience_param
    best_val_loss = float('inf')
    epochs_no_improve = 0

    for epoch in range(search_epoch):
        rnn_learn.fit(1, lr=lr)

        # Evaluate on validation set
        val_loss = rnn_learn.validate()[0]

        print(f"Epoch: {epoch + 1}, Validation Loss: {val_loss}")

        # Track the best validation loss within the search_epoch limit
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_no_improve = 0
            torch.save(rnn_learn.model.state_dict(), 'best_model.pth')  # Save the best model
        else:
            epochs_no_improve += 1

        # Stop after `patience` epochs without improvement, capped at 8 epochs
        if epochs_no_improve >= min(patience, 8):
            print("Early stopping triggered!")
            break

    # Load the best model
    rnn_learn.model.load_state_dict(torch.load('best_model.pth'))

    # Evaluate on the validation set one last time
    validation_loss = rnn_learn.validate()[0]

    # Get the F1 score from the MultiTaskTotalMetric
    f1_score = rnn_learn.metrics[2].metrics['f1_score']

    # Clear CUDA cache to prevent memory issues after each fold
    if device.type == 'cuda':
        torch.cuda.empty_cache()

    return validation_loss, f1_score

def run_bayesian_optimization(objective, samples, rnn_param_space):
    # Perform Bayesian Optimization for RNN with average validation loss objective
    n_calls = samples  # Number of times to sample the objective function
    result = gp_minimize(objective, rnn_param_space, n_calls=n_calls, random_state=42)

    # Ensure proper types for the results
    best_rnn_params = {}

    # Extract the best parameters and convert them to the correct types
    for param, value in zip(rnn_param_space, result.x):
        if isinstance(param, Integer):
            best_rnn_params[param.name] = int(value)
        elif isinstance(param, Real_skopt):
            best_rnn_params[param.name] = float(value)
        elif isinstance(param, Categorical):
            best_rnn_params[param.name] = value

    # Print the best parameters and loss
    print("Best RNN Parameters:", best_rnn_params)
    print("Best RNN Validation Loss:", result.fun)

    return best_rnn_params

# Run the best RNN model training with the best parameters found from Bayesian Optimization
def run_best_rnn_model_training(hidden_size, num_layers, optimizer, multitask_loss_func, patience, lr, shared_weight_decay, task_weight_decay, dropout, epochs=100):
    # Convert data to PyTorch tensors
    X_train_tensor, X_val_tensor, y_train_any_tensor, y_train_pf_tensor, y_val_any_tensor, y_val_pf_tensor = convert_to_tensor(X_train, X_val, y_train_any, y_train_pf, y_val_any, y_val_pf)

    # Create training and validation datasets
    train_ds = MultiTaskDataset(X_train_tensor, y_train_any_tensor, y_train_pf_tensor)
    val_ds = MultiTaskDataset(X_val_tensor, y_val_any_tensor, y_val_pf_tensor)

    # Update DataLoaders with the best batch size, on the same device used during the search
    dls = DataLoaders.from_dsets(train_ds, val_ds, bs=batch_size, device=device)

    # Instantiate the model with the best parameters
    best_rnn_model = MultiTaskRNN(input_size=X_train.shape[1], hidden_size=hidden_size, num_layers=num_layers, dropout=dropout)

    # Create Learner
    best_rnn_learn = Learner(dls, best_rnn_model, loss_func=multitask_loss_func, opt_func=get_optimizer(optimizer, lr), splitter=multitask_splitter, metrics=[MultiTaskMetric(0), MultiTaskMetric(1), MultiTaskTotalMetric()])
    # best_rnn_learn.add_cb(DebugCallback()) # uncomment for debugging needs

    # Initialize the optimizer
    best_rnn_learn.create_opt()

    # Set weight decay for each parameter group
    best_rnn_learn.opt.set_hypers(wd=[shared_weight_decay, task_weight_decay, task_weight_decay])

    # Early stopping for the final training run
    best_val_loss = float('inf')
    counter = 0

    for epoch in range(epochs): 
        best_rnn_learn.fit(1, lr=lr)

        # Evaluate on validation set
        val_loss = best_rnn_learn.validate()[0]

        print(f"Epoch: {epoch + 1}, Validation Loss: {val_loss}")

        # Early stopping check
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            counter = 0
            torch.save(best_rnn_learn.model.state_dict(), 'best_rnn_model.pth')  # Save the best model
        else:
            counter += 1
            if counter >= patience:
                print("Early stopping triggered in final training!")
                break

    # Load the best model
    best_rnn_learn.model.load_state_dict(torch.load('best_rnn_model.pth'))

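    # Evaluate on the validation set once and reuse the results
    # validate() returns the loss followed by the metrics in the order passed to the Learner
    validation_results = best_rnn_learn.validate()
    validation_loss = validation_results[0]
    combined_metrics = validation_results[3]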

    print(f"Validation Loss: {validation_loss}")
    print("-------------------------------------------------")
    print(f"Combined Metrics: {combined_metrics}")

    return best_rnn_learn

---
### **3.2 RNN Grid Search With Average Validation Loss All Params**

#### Let's run a hyperparameter search over all parameters with the Average Validation Loss Objective. Here's how I've set up my hyperparameter optimization using Bayesian Optimization:

#### This code implements **Bayesian Optimization** to tune hyperparameters for an RNN model by minimizing the **average validation loss** across folds. 

### **Hyperparameter Search Space**

The search space (`rnn_param_space_avg_loss`) includes:
- **`hidden_size`**: The size of the LSTM hidden layer (integer range: 128–512).
- **`num_layers`**: The number of LSTM layers (integer range: 2–8).
- **`optimizer`**: A categorical choice between `'SGD'`, `'Adam'`, and `'AdamW'`.
- **`loss_func`**: A categorical choice between custom multi-task loss functions.
- **`patience`**: The patience parameter for early stopping (integer range: 5–15).
- **`lr`**: The learning rate (real-valued range: 0.00001–0.006).
- **`shared_weight_decay`**: Weight decay for shared layers (real-valued range: 0.0001–0.01).
- **`task_weight_decay`**: Weight decay for task-specific layers (real-valued range: 0.0001–0.01).
- **`dropout`**: Dropout rate (real-valued range: 0.1–0.5).

### **Objective Function**

The `objective_avg_loss` function:
- Performs **k-fold cross-validation** (using `kf.split`) to evaluate the model's performance for each hyperparameter configuration.
- Calls `train_and_evaluate_rnn` for each fold to train and evaluate the RNN model.
- Computes the **average validation loss** across all folds, which is returned as the objective to minimize.

### **Bayesian Optimization Execution**

- The `run_bayesian_optimization` function is used to minimize the `objective_avg_loss` function over the defined search space.
- The best hyperparameters (`best_rnn_params_avg_loss`) are extracted after the optimization process.

### **Summary**

This section performs a comprehensive search for the optimal hyperparameters for the RNN model. By including additional parameters like `shared_weight_decay`, `task_weight_decay`, and `dropout`, the search aims to fine-tune both regularization and model architecture. The use of Bayesian Optimization and k-fold cross-validation ensures robust evaluation and efficient exploration of the hyperparameter space.


In [None]:
# Authors' recommended parameters
# dropout = 0.4
# batch_size = 64
# shared_weight_decay = 0.003
# task_weight_decay = 0.001

# Define the hyperparameter search space for Bayesian Optimization
rnn_param_space_avg_loss = [
    Integer(name='hidden_size', low=128, high=512),
    Integer(name='num_layers', low=2, high=8),
    Categorical(name='optimizer', categories=['SGD', 'Adam', 'AdamW']),
    Categorical(name='loss_func', categories=[multitask_loss, multitask_loss_loss_proportional, multitask_loss_auto_weighted_bce, multitask_loss_dynamic_weighted]),
    Integer(name='patience', low=5, high=15),
    Real_skopt(name='lr', low=0.00001, high=0.006),
    Real_skopt(name='shared_weight_decay', low=0.0001, high=0.01),
    Real_skopt(name='task_weight_decay', low=0.0001, high=0.01),
    Real_skopt(name='dropout', low=0.1, high=0.5)
]

# Objective function for Bayesian Optimization with average validation loss
@use_named_args(rnn_param_space_avg_loss)
def objective_avg_loss(hidden_size, num_layers, optimizer, loss_func, patience, lr, shared_weight_decay, task_weight_decay, dropout):
    fold_losses = []

    # Perform k-fold cross-validation
    for train_idx, val_idx in kf.split(X_train):
        # Train and evaluate the model on the current fold
        validation_loss, _ = train_and_evaluate_rnn((int(hidden_size), int(num_layers), optimizer, loss_func, int(patience), lr, shared_weight_decay, task_weight_decay, dropout), train_idx, val_idx)

        fold_losses.append(validation_loss)

    # Calculate the average validation loss across all folds
    avg_validation_loss = sum(fold_losses) / len(fold_losses)

    print(f"Params: {(hidden_size, num_layers, optimizer, loss_func.__name__, patience, lr)}, Avg Validation Loss: {avg_validation_loss}")

    return avg_validation_loss

# Perform Bayesian Optimization for RNN with the average validation loss objective
best_rnn_params_avg_loss = run_bayesian_optimization(objective_avg_loss, samples, rnn_param_space_avg_loss)

---

### **3.3 Train RNN with Best Parameters for Average Validation Loss Objective**

#### Now we will train the RNN with the best parameters found in our hyperparameter search for the Average Validation Loss Objective, saved in **best_rnn_params_avg_loss**
#### Afterward, we will print out a series of important metrics and visualizations to verify our results

In [None]:
from fastai.data.all import *
from fastai.learner import *
from fastai.metrics import *

# To run with best parameters from previous test
# Best RNN Parameters after first run of Bayesian Optimization with 60 samples and 20 epochs
# Best RNN Parameters: {'hidden_size': 128, 'num_layers': 2, 'optimizer': np.str_('AdamW'), 'loss_func': <function multitask_loss_loss_proportional at 0x0000024B235A9120>, 'patience': 15, 'lr': 0.006, 'shared_weight_decay': 0.01, 'task_weight_decay': 0.0001, 'dropout': 0.5}
# Best RNN Validation Loss: 0.17900913655757905
hidden_size = 128
num_layers = 2
optimizer = 'AdamW'
multitask_loss_func = multitask_loss_loss_proportional
lr = 0.006
patience = 15
dropout = 0.5
shared_weight_decay = 0.01
task_weight_decay = 0.0001

# Use Authors' recommended parameters for the baseline evaluation
epochs = 100
batch_size = 64

# print("Best RNN Parameters:", best_rnn_params_avg_loss)

# # Extract the best parameters
# dropout = best_rnn_params_avg_loss['dropout']
# hidden_size = int(best_rnn_params_avg_loss['hidden_size'])
# lr = best_rnn_params_avg_loss['lr']
# num_layers = int(best_rnn_params_avg_loss['num_layers'])
# multitask_loss_func = best_rnn_params_avg_loss['loss_func']
# optimizer = best_rnn_params_avg_loss['optimizer']
# patience = int(best_rnn_params_avg_loss['patience'])
# shared_weight_decay = best_rnn_params_avg_loss['shared_weight_decay']
# task_weight_decay = best_rnn_params_avg_loss['task_weight_decay']

# Run the best RNN model training with the average loss objective parameters
best_rnn_learn_avg_loss = run_best_rnn_model_training(hidden_size, num_layers, optimizer, multitask_loss_func, patience, lr, shared_weight_decay, task_weight_decay, dropout, epochs=epochs)

---

### **3.4 Visualize Our Results for Average Validation Loss Objective**

#### Let us visualize the results of our model training and evaluation for our Average Validation Loss Objective.

In [None]:
# Evaluate and visualize the model performance and results
evaluate_and_visualize(learn=best_rnn_learn_avg_loss, epochs=epochs)

---
### **3.5 RNN Grid Search With Combined Score Objective**

#### Let's run another hyperparameter search, this time with a Combined Score Objective, to see if we can find a better model. Here's how I've set up my hyperparameter optimization using Bayesian Optimization:

#### This code implements **Bayesian Optimization** to tune hyperparameters for an RNN model by optimizing a **combined objective function** that balances **F1 score** (to maximize) and **validation loss** (to minimize). 

### **Hyperparameter Search Space**

The search space (`rnn_param_space_combined_score`) includes:
- **`hidden_size`**: The size of the LSTM hidden layer (integer range: 128–512).
- **`num_layers`**: The number of LSTM layers (integer range: 2–8).
- **`optimizer`**: A categorical choice between `'SGD'`, `'Adam'`, and `'AdamW'`.
- **`loss_func`**: A categorical choice between custom multi-task loss functions.
- **`patience`**: The patience parameter for early stopping (integer range: 5–15).
- **`lr`**: The learning rate (real-valued range: 0.00001–0.006).
- **`shared_weight_decay`**: Weight decay for shared layers (real-valued range: 0.0001–0.01).
- **`task_weight_decay`**: Weight decay for task-specific layers (real-valued range: 0.0001–0.01).
- **`dropout`**: Dropout rate (real-valued range: 0.1–0.5).

### **Objective Function**

The `objective_combined` function:
- Performs **k-fold cross-validation** to evaluate the model's performance for each hyperparameter configuration.
- Computes the **average F1 score** (to maximize) and **average validation loss** (to minimize) across all folds.
- Combines these metrics into a single objective using weighted contributions:
  - **F1 score**: Weighted higher (e.g., `0.8`) to prioritize maximizing performance.
  - **Validation loss**: Weighted lower (e.g., `0.2`) so that configurations with a high validation loss are still penalized.
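
To make the weighting concrete, here is a small worked sketch of the combined score. The F1 and loss values are illustrative only, not results from this notebook, and `combined_objective` is a hypothetical helper that mirrors the formula used inside `objective_combined`.

```python
def combined_objective(avg_f1, avg_val_loss, weight_f1=0.8, weight_loss=0.2):
    """Score to minimize: a higher F1 or a lower loss both reduce the value."""
    return -weight_f1 * avg_f1 + weight_loss * avg_val_loss


# Illustrative numbers only.
config_a = combined_objective(avg_f1=0.60, avg_val_loss=0.30)  # -0.48 + 0.06 = -0.42
config_b = combined_objective(avg_f1=0.55, avg_val_loss=0.20)  # -0.44 + 0.04 = -0.40

# The optimizer prefers the configuration with the smaller (more negative) score.
best = min(("A", config_a), ("B", config_b), key=lambda item: item[1])
print(best[0])  # A
```

Configuration A wins despite its higher validation loss, because the F1 term carries four times the weight of the loss term.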

### **Bayesian Optimization Execution**

- `gp_minimize` (called through `run_bayesian_optimization`) minimizes the `objective_combined` function over the defined search space.
- The best hyperparameters (`best_rnn_params_combined_score`) are extracted and converted to their appropriate types (e.g., integers, floats, or categorical values).
- The best combined objective value is recorded.

### **Summary**

This section optimizes the RNN model's hyperparameters by balancing F1 score and validation loss using a combined objective. The use of Bayesian Optimization ensures efficient exploration of the hyperparameter space, while k-fold cross-validation provides robust evaluation. The result is a set of hyperparameters that maximize F1 score while maintaining low validation loss.

In [None]:
# Authors' recommended parameters
# dropout = 0.4
# batch_size = 64
# shared_weight_decay = 0.003
# task_weight_decay = 0.001

# Define the hyperparameter search space for Bayesian Optimization
rnn_param_space_combined_score = [
    Integer(name='hidden_size', low=128, high=512),
    Integer(name='num_layers', low=2, high=8),
    Categorical(name='optimizer', categories=['SGD', 'Adam', 'AdamW']),
    Categorical(name='loss_func', categories=[multitask_loss, multitask_loss_loss_proportional, multitask_loss_auto_weighted_bce, multitask_loss_dynamic_weighted]),
    Integer(name='patience', low=5, high=15),
    Real_skopt(name='lr', low=0.00001, high=0.006),
    Real_skopt(name='shared_weight_decay', low=0.0001, high=0.01),
    Real_skopt(name='task_weight_decay', low=0.0001, high=0.01),
    Real_skopt(name='dropout', low=0.1, high=0.5)
]

# Objective function for Bayesian Optimization with combined objective (f1 score and validation loss)
@use_named_args(rnn_param_space_combined_score)
def objective_combined(hidden_size, num_layers, optimizer, loss_func, patience, lr, shared_weight_decay, task_weight_decay, dropout):
    fold_f1_scores = []
    fold_validation_losses = []

    # Perform k-fold cross-validation
    for train_idx, val_idx in kf.split(X_train):
        # Train and evaluate the model on the current fold
        validation_loss, f1_score = train_and_evaluate_rnn((int(hidden_size), int(num_layers), optimizer, loss_func, int(patience), lr, shared_weight_decay, task_weight_decay, dropout), train_idx, val_idx)

        fold_f1_scores.append(f1_score)
        fold_validation_losses.append(validation_loss)

    # Calculate the average F1 score and validation loss across all folds
    avg_f1_score = sum(fold_f1_scores) / len(fold_f1_scores)
    avg_validation_loss = sum(fold_validation_losses) / len(fold_validation_losses)

    # Combine the two metrics into a single objective (adjust weights as needed)
    # Higher weight for F1 score (maximize) and lower weight for validation loss (minimize)
    weight_f1 = 0.8
    weight_loss = 0.2
    combined_objective = -weight_f1 * avg_f1_score + weight_loss * avg_validation_loss

    print(f"Params: {(hidden_size, num_layers, optimizer, loss_func.__name__, patience, lr)}\n"f"Avg F1 Score: {avg_f1_score} | Avg Validation Loss: {avg_validation_loss} | Combined Objective: {combined_objective}")

    return combined_objective

# Perform Bayesian Optimization for RNN with the combined score objective
# run_bayesian_optimization calls gp_minimize once and converts the best parameters to their proper types
best_rnn_params_combined_score = run_bayesian_optimization(objective_combined, samples, rnn_param_space_combined_score)

---

### **3.6 Train RNN with Best Parameters for Combined Score Objective**

#### Now we will train the RNN with the best parameters found in our hyperparameter search for the Combined Score Objective, saved in **best_rnn_params_combined_score**
#### Afterward, we will print out a series of important metrics and visualizations to verify our results

In [None]:
from fastai.data.all import *
from fastai.learner import *
from fastai.metrics import *

# To run with previous test best parameters
# Best RNN Parameters after first run of Bayesian Optimization with 60 samples and 20 epochs
# Best RNN Parameters: {'hidden_size': 200, 'num_layers': 2, 'optimizer': np.str_('AdamW'), 'loss_func': <function multitask_loss_dynamic_weighted at 0x00000182204FFBA0>, 'patience': 13, 'lr': 0.002786683553419411, 'shared_weight_decay': 0.0032307513112844535, 'task_weight_decay': 0.00046888866964757425, 'dropout': 0.4314006713537418}
# Best RNN Validation Loss: -0.46442934273667535
hidden_size = 200
num_layers = 2
optimizer = 'AdamW'
multitask_loss_func = multitask_loss_dynamic_weighted
lr = 0.002786683553419411
patience = 13

# Fixed training settings and the remaining best parameters
epochs = 100
dropout = 0.4314006713537418
batch_size = 64
shared_weight_decay = 0.0032307513112844535
task_weight_decay = 0.00046888866964757425

# print("Best RNN Parameters:", best_rnn_params_combined_score)

# # Extract parameters
# hidden_size = int(best_rnn_params_combined_score['hidden_size'])
# num_layers = int(best_rnn_params_combined_score['num_layers'])
# optimizer = best_rnn_params_combined_score['optimizer']
# multitask_loss_func = best_rnn_params_combined_score['loss_func']
# patience = int(best_rnn_params_combined_score['patience'])
# lr = best_rnn_params_combined_score['lr']

# Run the best RNN model training with the combined score objective parameters
best_rnn_learn_combined_score = run_best_rnn_model_training(hidden_size, num_layers, optimizer, multitask_loss_func, patience, lr, shared_weight_decay, task_weight_decay, dropout, epochs=epochs)

---

### **3.7 Visualize Our Results for Combined Score**
#### Let us visualize the results of our model training and evaluation for our Combined Score.

In [None]:
# Evaluate and visualize the model performance and results
evaluate_and_visualize(learn=best_rnn_learn_combined_score, epochs=epochs)

---

## **4. Transformer Implementation**

In [None]:
# Set parameters needed for hyperparameter searches and model training
samples = 60
search_epoch = 20

---

### **4.1 Transformer Grid Search**

#### Let's start by using this section to run a hyperparameter search over our Transformer model and find the best hyperparameters for our model to train with. Here's how I've set up my hyperparameter optimization using Bayesian Optimization:

### Hyperparameter Search with Bayesian Optimization

* I'm using `gp_minimize` from the `skopt` library to perform Bayesian Optimization. This helps me efficiently search the hyperparameter space.
* I've defined the search space (passed to `run_bayesian_optimization` as `transformer_param_space`) using `Integer`, `Real`, and `Categorical` dimensions from `skopt.space`. This space includes:
    * `hidden_size`: An integer range for the Transformer hidden size.
    * `num_layers`: An integer range for the number of Transformer layers.
    * `num_heads`: A categorical choice among valid numbers of attention heads, restricted to divisors of `input_size`.
    * `optimizer`: A categorical choice between 'SGD', 'Adam', and 'AdamW' optimizers.
    * `loss_func`: A categorical choice between my custom multi-task loss functions.
    * `patience`: An integer range for the early stopping patience parameter.
    * `lr`: A real-valued range for the learning rate.
    * `shared_weight_decay`, `task_weight_decay`, and `dropout`: Real-valued ranges for the regularization settings, included in the search spaces of sections 4.2 and 4.5.
* I'm employing k-fold cross-validation (with `k_folds = 5` and `KFold` from `sklearn.model_selection`) to get a robust estimate of the model's performance for each hyperparameter configuration.
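
The divisor constraint on `num_heads` can be illustrated in isolation. The sketch below is a minimal, standalone version of that filter; the helper name `valid_head_counts` and the example sizes are hypothetical and chosen only for illustration.

```python
def valid_head_counts(input_size, low=4, high=16):
    """Return every head count in [low, high] that divides input_size evenly."""
    return [h for h in range(low, high + 1) if input_size % h == 0]

# Illustrative sizes only; the notebook derives input_size from X_train.shape[1]
print(valid_head_counts(64))   # [4, 8, 16]
print(valid_head_counts(300))  # [4, 5, 6, 10, 12, 15]
```

If the list comes back empty (for example, when the feature count is prime), the categorical dimension would have no options, so a different head range would be needed.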

### Objective Function

* The `objective` function is what `gp_minimize` optimizes.
* It takes hyperparameter values as input (thanks to the `@use_named_args` decorator).
* Inside `objective`:
    * I perform the k-fold cross-validation.
    * For each fold, I train and evaluate my Transformer model using the `train_and_evaluate_transformer` function.
    * I calculate the average validation loss across all folds.
* The `objective` function returns this average validation loss, which `gp_minimize` tries to minimize.
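
To make the fold-averaging step concrete, here is a minimal pure-Python sketch of the same idea. It does not use `KFold` or the real training function: `kfold_indices`, `cross_validated_objective`, and `toy_loss` are hypothetical stand-ins, and the "loss" is a toy value rather than anything a model produced.

```python
import random
from statistics import mean

def kfold_indices(n_samples, k, seed=42):
    """Shuffle indices once and yield (train_idx, val_idx) pairs for k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread samples as evenly as possible: the first n_samples % k folds get one extra
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_validated_objective(evaluate_fold, n_samples, k=5):
    """Average a per-fold score; evaluate_fold stands in for training one model."""
    fold_losses = [evaluate_fold(train_idx, val_idx)
                   for train_idx, val_idx in kfold_indices(n_samples, k)]
    return mean(fold_losses)

# Toy stand-in: pretend the loss is the fraction of samples held out in the fold
def toy_loss(train_idx, val_idx):
    return len(val_idx) / (len(train_idx) + len(val_idx))

avg_loss = cross_validated_objective(toy_loss, n_samples=23, k=5)
print(avg_loss)
```

The optimizer only ever sees the single averaged number, which is why one unusually easy or hard fold has limited influence on which configuration is preferred.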

### Training and Evaluation

`train_and_evaluate_transformer`

* This function trains and evaluates my Transformer model for a given set of hyperparameters and data folds.
* It does the following:
    * Extracts hyperparameters.
    * Splits the data into training and validation sets for the current fold.
    * Converts the data to PyTorch tensors using `convert_to_tensor`.
    * Creates training and validation `DataLoaders` (using `MultiTaskDataset`).
    * Instantiates the `MultiTaskTransformer` model with the given hyperparameters.
    * Creates a `Learner` (from `fastai.learner`) for training.
    * Implements early stopping based on validation loss.
    * Finally, it returns the validation loss together with the F1 score.
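
The early stopping rule in the list above can be sketched without any model at all. The function below is a simplified illustration that replays a list of made-up validation losses; `early_stopping_run` is a hypothetical helper, not part of the notebook or of fastai.

```python
def early_stopping_run(val_losses, patience):
    """Replay per-epoch validation losses and report when training would stop.

    Returns (best_epoch, best_loss, stopped_epoch); stopped_epoch is None
    when the patience budget is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    counter = 0
    stopped_epoch = None
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                stopped_epoch = epoch
                break
    return best_epoch, best_loss, stopped_epoch

# Made-up losses: the best epoch is 2, and three epochs without improvement end the run
print(early_stopping_run([0.9, 0.7, 0.75, 0.72, 0.71, 0.6], patience=3))  # (2, 0.7, 5)
```

With `patience=3` the run stops at epoch 5 and never reaches the better loss at epoch 6, which shows why the patience range itself is worth searching over.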

### Bayesian Optimization Execution

* I run the Bayesian Optimization using `gp_minimize`, specifying the `objective` function and the search space.
* The `n_calls` parameter controls how many iterations the optimization runs for.
* After the optimization, I extract the best hyperparameters and the corresponding best validation loss.

In essence, this code efficiently searches for the best combination of hyperparameters for my Transformer model by using Bayesian Optimization and k-fold cross-validation to minimize the validation loss.


In [None]:
from skopt import gp_minimize
from skopt.space import Dimension, Categorical, Integer, Real as Real_skopt
from skopt.utils import use_named_args
from itertools import product
from fastai.data.all import *
from fastai.learner import *
from fastai.metrics import *
from sklearn.model_selection import KFold

# Ensure num_heads is a divisor of input_size and within the range we want to search, i.e. 4-16
# If num_heads is not a divisor of input_size, the model will throw an error
input_size = X_train.shape[1]
valid_num_heads = [h for h in range(4, 17) if input_size % h == 0]

# Number of folds for cross-validation
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)

# Train and evaluate the Transformer model on a single cross-validation fold
def train_and_evaluate_transformer(params, train_idx, val_idx, shared_weight_decay = 0.003, task_weight_decay = 0.001, dropout=0.4):
    # Extract parameters
    # Conditional unpacking based on the number of parameters
    if len(params) == 7:
        hidden_size, num_layers, num_heads, optimizer, multitask_loss_func, patience_param, lr = params
    elif len(params) == 10:
        hidden_size, num_layers, num_heads, optimizer, multitask_loss_func, patience_param, lr, shared_weight_decay, task_weight_decay, dropout = params
    else:
        raise ValueError(f"Unexpected number of parameters: {len(params)}")

    # Split the data into training and validation sets for the current fold
    X_train_fold, X_val_fold = X_train[train_idx], X_train[val_idx]
    y_train_any_fold, y_val_any_fold = y_train_any[train_idx], y_train_any[val_idx]
    y_train_pf_fold, y_val_pf_fold = y_train_pf[train_idx], y_train_pf[val_idx]

    # Convert data to PyTorch tensors
    X_train_tensor, X_val_tensor, y_train_any_tensor, y_train_pf_tensor, y_val_any_tensor, y_val_pf_tensor = convert_to_tensor(X_train_fold, X_val_fold, y_train_any_fold, y_train_pf_fold, y_val_any_fold, y_val_pf_fold)

    # Create training and validation datasets
    train_ds = MultiTaskDataset(X_train_tensor, y_train_any_tensor, y_train_pf_tensor)
    val_ds = MultiTaskDataset(X_val_tensor, y_val_any_tensor, y_val_pf_tensor)

    print(f"Train Dset: {train_ds}")
    print(f"Val Dset: {val_ds}")
    
    # Update DataLoaders with the current batch size
    dls = DataLoaders.from_dsets(train_ds, val_ds, bs=batch_size, device=device)

    # Debugging: Print the first batch of the training DataLoader
    # for batch in dls.train:
    #     print(f"Batch: {batch}")
    #     break
    
    # Instantiate the Transformer model with the current parameters
    model = MultiTaskTransformer(input_size=X_train.shape[1], hidden_size=hidden_size, num_layers=num_layers, num_heads=num_heads, dropout=dropout)
    model = model.to(device)  # Move model to GPU if available
    
    # Create Learner
    transformer_learn = Learner(dls, model, loss_func=multitask_loss_func, opt_func=get_optimizer(optimizer, lr), splitter=multitask_splitter,  metrics=[MultiTaskMetric(0), MultiTaskMetric(1), MultiTaskTotalMetric()])
    # transformer_learn.add_cb(DebugCallback()) # uncomment for debugging needs

    # Initialize the optimizer
    transformer_learn.create_opt()

    # Set weight decay for each parameter group
    transformer_learn.opt.set_hypers(wd=[shared_weight_decay, task_weight_decay, task_weight_decay])
    
    # Early stopping for the Transformer hyperparameter search
    patience = patience_param
    best_val_loss = float('inf')
    counter = 0
    best_model_state = None  # Snapshot of the best weights seen so far

    for epoch in range(search_epoch):
        transformer_learn.fit(1, lr=lr)

        val_loss = transformer_learn.validate()[0]  # Get validation loss

        print(f"Epoch: {epoch + 1}, Validation Loss: {val_loss}")

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            counter = 0
            # Clone the tensors: state_dict() returns references that later epochs would overwrite
            best_model_state = {k: v.detach().clone() for k, v in transformer_learn.model.state_dict().items()}
        else:
            counter += 1
            if counter >= patience:
                print("Early stopping triggered!")
                break

    # Restore the best weights (skipped if no epoch improved, e.g. when the loss is NaN)
    if best_model_state is not None:
        transformer_learn.model.load_state_dict(best_model_state)

    # Evaluate on the validation set
    validation_loss = transformer_learn.validate()[0]  

    # Get the F1 score from the MultiTaskTotalMetric
    f1_score = transformer_learn.metrics[2].metrics['f1_score']

    # Clear CUDA cache to prevent memory issues after each fold
    if device.type == 'cuda':
        torch.cuda.empty_cache()

    return validation_loss, f1_score

def run_bayesian_optimization(objective, samples, transformer_param_space):
    # Perform Bayesian Optimization for Transformer
    n_calls = samples  # Number of times to sample the objective function
    result = gp_minimize(objective, transformer_param_space, n_calls=n_calls, random_state=42)

    # Ensure proper types for the results
    best_transformer_params = {}

    # Extract the best parameters and convert them to the correct types
    for param, value in zip(transformer_param_space, result.x):
        if isinstance(param, Integer):
            best_transformer_params[param.name] = int(value)
        elif isinstance(param, Real_skopt):
            best_transformer_params[param.name] = float(value)
        elif isinstance(param, Categorical):
            best_transformer_params[param.name] = value

    # Print the best parameters and loss
    print("Best Transformer Parameters:", best_transformer_params)
    print("Best Transformer Validation Loss:", result.fun)

    return best_transformer_params

# Run the best Transformer model training with the best parameters found from Bayesian Optimization
def run_best_transformer_model_training(hidden_size, num_layers, num_heads, optimizer, multitask_loss_func, patience, lr, shared_weight_decay, task_weight_decay, dropout, epochs=100):
    # Convert data to PyTorch tensors
    X_train_tensor, X_val_tensor, y_train_any_tensor, y_train_pf_tensor, y_val_any_tensor, y_val_pf_tensor = convert_to_tensor(X_train, X_val, y_train_any, y_train_pf, y_val_any, y_val_pf)

    # Create training and validation datasets
    train_ds = MultiTaskDataset(X_train_tensor, y_train_any_tensor, y_train_pf_tensor)
    val_ds = MultiTaskDataset(X_val_tensor, y_val_any_tensor, y_val_pf_tensor)

    # Update DataLoaders with the best batch size
    dls = DataLoaders.from_dsets(train_ds, val_ds, bs=batch_size)

    # Instantiate the Transformer model with the best parameters
    best_transformer_model = MultiTaskTransformer(input_size=X_train.shape[1], hidden_size=hidden_size, num_layers=num_layers, num_heads=num_heads, dropout=dropout)

    # Create Learner
    best_transformer_learn = Learner(dls, best_transformer_model, loss_func=multitask_loss_func, opt_func=get_optimizer(optimizer, lr), splitter=multitask_splitter, metrics=[MultiTaskMetric(0), MultiTaskMetric(1), MultiTaskTotalMetric()])
    # best_transformer_learn.add_cb(DebugCallback()) # uncomment for debugging needs

    # Initialize the optimizer
    best_transformer_learn.create_opt()

    # Set weight decay for each parameter group
    best_transformer_learn.opt.set_hypers(wd=[shared_weight_decay, task_weight_decay, task_weight_decay])

    # Early stopping for the final Transformer training run
    best_val_loss = float('inf')
    counter = 0

    for epoch in range(epochs): 
        best_transformer_learn.fit(1, lr=lr)

        # Evaluate on validation set
        val_loss = best_transformer_learn.validate()[0]

        print(f"Epoch: {epoch + 1}, Validation Loss: {val_loss}")

        # Early stopping check
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            counter = 0
            # Use a Transformer-specific file name so the RNN checkpoint is not overwritten
            torch.save(best_transformer_learn.model.state_dict(), 'best_transformer_model.pth')  # Save the best model
        else:
            counter += 1
            if counter >= patience:
                print("Early stopping triggered in final training!")
                break

    # Load the best model
    best_transformer_learn.model.load_state_dict(torch.load('best_transformer_model.pth'))

    # Evaluate on validation set (validate once and reuse the results)
    validation_results = best_transformer_learn.validate()
    combined_score_loss = validation_results[0]
    combined_metrics = validation_results[3]

    print(f"Validation Loss: {combined_score_loss}")
    print("-------------------------------------------------")
    print(f"Combined Metrics: {combined_metrics}")

    return best_transformer_learn

---

### **4.2 Hyperparameter Search with Bayesian Optimization for Average Loss Objective**

#### Let's run the hyperparameter search with an Average Loss Objective to see if we can find a better model. Here's how I've set up my hyperparameter optimization using Bayesian Optimization:

#### This code implements **Bayesian Optimization** to tune hyperparameters for a Transformer model by minimizing the **average validation loss** across folds.

### **Hyperparameter Search Space**

The search space (`transformer_param_space_avg_loss`) includes:
- **`hidden_size`**: The size of the Transformer hidden layer (integer range: 256–1024).
- **`num_layers`**: The number of Transformer layers (integer range: 2–8).
- **`num_heads`**: The number of attention heads (categorical choice among the divisors of `input_size` between 4 and 16).
- **`optimizer`**: A categorical choice between `'SGD'`, `'Adam'`, and `'AdamW'`.
- **`loss_func`**: A categorical choice between custom multi-task loss functions.
- **`patience`**: The patience parameter for early stopping (integer range: 5–15).
- **`lr`**: The learning rate (real-valued range: 0.00001–0.006).
- **`shared_weight_decay`**: Weight decay for shared layers (real-valued range: 0.0001–0.01).
- **`task_weight_decay`**: Weight decay for task-specific layers (real-valued range: 0.0001–0.01).
- **`dropout`**: Dropout rate (real-valued range: 0.1–0.5).

### **Objective Function**

The `objective_avg_loss` function:
- Performs **k-fold cross-validation** to evaluate the model's performance for each hyperparameter configuration.
- Calls `train_and_evaluate_transformer` for each fold to train and evaluate the Transformer model.
- Computes the **average validation loss** across all folds, which is returned as the objective to minimize.

### **Bayesian Optimization Execution**

- The `run_bayesian_optimization` function is used to minimize the `objective_avg_loss` function over the defined search space.
- The best hyperparameters are returned after the optimization process and stored in `best_transformer_params_avg_loss`.

### **Summary**

This section performs a comprehensive search for the optimal hyperparameters for the Transformer model. By including additional parameters like `shared_weight_decay`, `task_weight_decay`, and `dropout`, the search aims to fine-tune both regularization and model architecture. The use of Bayesian Optimization and k-fold cross-validation ensures robust evaluation and efficient exploration of the hyperparameter space.

In [None]:
# Authors' recommended parameters
# dropout = 0.4
# batch_size = 64
# shared_weight_decay = 0.003
# task_weight_decay = 0.001

# Define the hyperparameter search space for Bayesian Optimization
transformer_param_space_avg_loss = [
    Integer(name='hidden_size', low=256, high=1024),
    Integer(name='num_layers', low=2, high=8),
    Categorical(name='num_heads', categories=valid_num_heads),  # Use only valid num_heads
    Categorical(name='optimizer', categories=['SGD', 'Adam', 'AdamW']),
    Categorical(name='loss_func', categories=[multitask_loss, multitask_loss_loss_proportional, multitask_loss_auto_weighted_bce, multitask_loss_dynamic_weighted]),
    Integer(name='patience', low=5, high=15),
    Real_skopt(name='lr', low=0.00001, high=0.006),
    Real_skopt(name='shared_weight_decay', low=0.0001, high=0.01),
    Real_skopt(name='task_weight_decay', low=0.0001, high=0.01),
    Real_skopt(name='dropout', low=0.1, high=0.5)
]

# Objective function for Bayesian Optimization with average validation loss
@use_named_args(transformer_param_space_avg_loss)
def objective_avg_loss(hidden_size, num_layers, num_heads, optimizer, loss_func, patience, lr, shared_weight_decay, task_weight_decay, dropout):
    fold_losses = []

    # Perform k-fold cross-validation
    for train_idx, val_idx in kf.split(X_train):
        # Train and evaluate the model on the current fold
        validation_loss, _ = train_and_evaluate_transformer((int(hidden_size), int(num_layers), int(num_heads), optimizer, loss_func, int(patience), lr, shared_weight_decay, task_weight_decay, dropout), train_idx, val_idx)

        fold_losses.append(validation_loss)

    # Calculate the average validation loss across all folds
    avg_validation_loss = sum(fold_losses) / len(fold_losses)

    print(f"Params: {(hidden_size, num_layers, num_heads, optimizer, loss_func.__name__, patience, lr)}, Avg Validation Loss: {avg_validation_loss}")

    return avg_validation_loss

# Perform Bayesian Optimization for the Transformer with the average validation loss objective
best_transformer_params_avg_loss = run_bayesian_optimization(objective_avg_loss, samples, transformer_param_space_avg_loss)

---

### **4.3 Train Transformer with Best Parameters**

#### Now we will train the Transformer with the best parameters found in our hyperparameter search and saved in **best_transformer_params_avg_loss**
#### Afterwards we will print out a series of important metrics and visualizations to verify our results


In [None]:
from fastai.data.all import *
from fastai.learner import *
from fastai.metrics import *

# To run with best parameters from previous runs
# First run full parameter search with 60 samples and 20 epochs each with 5 fold CV
# Best Transformer Parameters: {'hidden_size': 480, 'num_layers': 3, 'num_heads': np.int64(4), 'optimizer': np.str_('SGD'), 'loss_func': <function multitask_loss_loss_proportional at 0x000002D41F8F3D80>, 'patience': 15, 'lr': 0.006, 'shared_weight_decay': 0.00985175238373771, 'task_weight_decay': 0.00933427292776293, 'dropout': 0.1}
# Best Transformer Validation Loss: 0.12026419788599015
hidden_size = 480
num_layers = 3
num_heads = 4
optimizer = 'SGD'
multitask_loss_func = multitask_loss_loss_proportional
lr = 0.006
patience = 15
dropout = 0.1
epochs = 100
shared_weight_decay = 0.00985175238373771
task_weight_decay = 0.00933427292776293

batch_size = 64

# print("Best Transformer Parameters:", best_transformer_params_avg_loss)

# # Extract parameters
# dropout = best_transformer_params_avg_loss['dropout']
# hidden_size = int(best_transformer_params_avg_loss['hidden_size'])
# lr = best_transformer_params_avg_loss['lr']
# num_heads = int(best_transformer_params_avg_loss['num_heads'])
# num_layers = int(best_transformer_params_avg_loss['num_layers'])
# multitask_loss_func = best_transformer_params_avg_loss['loss_func']
# optimizer = best_transformer_params_avg_loss['optimizer']
# patience = int(best_transformer_params_avg_loss['patience'])
# shared_weight_decay = best_transformer_params_avg_loss['shared_weight_decay']
# task_weight_decay = best_transformer_params_avg_loss['task_weight_decay']

# Run the best Transformer model training with average validation loss objective parameters
best_transformer_learn_avg_loss = run_best_transformer_model_training(hidden_size, num_layers, num_heads, optimizer, multitask_loss_func, patience, lr, shared_weight_decay, task_weight_decay, dropout, epochs=epochs)

---

### **4.4 Visualize Our Results**

#### Let us visualize the results of our model training and evaluation.

In [None]:
# Evaluate and visualize the model performance and results
evaluate_and_visualize(learn=best_transformer_learn_avg_loss, epochs=epochs)

---
### **4.5 Transformer Grid Search With Combined Score Objective**

#### Let's run another hyperparameter search with a Combined Score Objective to see if we can find a better model. Here's how I've set up my hyperparameter optimization using Bayesian Optimization:

#### This code implements **Bayesian Optimization** to tune hyperparameters for a Transformer model by optimizing a **combined objective function** that balances **F1 score** (to maximize) and **validation loss** (to minimize).

### **Hyperparameter Search Space**

The search space (`transformer_param_space_combined_score`) includes:
- **`hidden_size`**: The size of the Transformer hidden layer (integer range: 256–1024).
- **`num_layers`**: The number of Transformer layers (integer range: 2–8).
- **`num_heads`**: The number of attention heads (categorical choice among the divisors of `input_size` between 4 and 16).
- **`optimizer`**: A categorical choice between `'SGD'`, `'Adam'`, and `'AdamW'`.
- **`loss_func`**: A categorical choice between custom multi-task loss functions.
- **`patience`**: The patience parameter for early stopping (integer range: 5–15).
- **`lr`**: The learning rate (real-valued range: 0.00001–0.006).
- **`shared_weight_decay`**: Weight decay for shared layers (real-valued range: 0.0001–0.01).
- **`task_weight_decay`**: Weight decay for task-specific layers (real-valued range: 0.0001–0.01).
- **`dropout`**: Dropout rate (real-valued range: 0.1–0.5).

### **Objective Function**

The `objective_combined` function:
- Performs **k-fold cross-validation** to evaluate the model's performance for each hyperparameter configuration.
- Calls `train_and_evaluate_transformer` for each fold to train and evaluate the Transformer model.
- Computes the **average F1 score** (to maximize) and **average validation loss** (to minimize) across all folds.
- Combines these metrics into a single objective to minimize, `-weight_f1 * avg_f1_score + weight_loss * avg_validation_loss`:
  - **F1 score**: Weighted higher (`0.8`) and negated, so that minimizing the objective maximizes F1.
  - **Validation loss**: Weighted lower (`0.2`), so that configurations with a poor loss are still penalized.
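
As a small worked example of this weighting, the sketch below evaluates the combined objective for two made-up configurations. The default weights mirror the `weight_f1 = 0.8` and `weight_loss = 0.2` values set in the notebook's code; the F1 and loss numbers are invented purely for illustration and are not results from this project.

```python
def combined_objective(avg_f1, avg_val_loss, weight_f1=0.8, weight_loss=0.2):
    """Lower is better: F1 enters negated so that minimizing rewards a high F1."""
    return -weight_f1 * avg_f1 + weight_loss * avg_val_loss

# Hypothetical configurations (not measured values)
config_a = combined_objective(avg_f1=0.60, avg_val_loss=0.10)  # about -0.46
config_b = combined_objective(avg_f1=0.70, avg_val_loss=0.40)  # about -0.48

# B has the worse loss but the better F1, and the 0.8 / 0.2 weighting prefers it
print(config_a, config_b, config_b < config_a)
```

Because F1 is bounded by 1 while the loss is not, the two terms are on different scales, so the weights express a preference rather than an exact trade-off rate.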

### **Bayesian Optimization Execution**

- The `run_bayesian_optimization` function minimizes the `objective_combined` function over the defined search space.
- The best hyperparameters are returned after the optimization process and stored in `best_transformer_params_combined_score`.

### **Summary**

This section optimizes the Transformer model's hyperparameters by balancing F1 score and validation loss using a combined objective. Bayesian Optimization explores the hyperparameter space efficiently, while k-fold cross-validation makes the evaluation of each configuration more robust. The result is a set of hyperparameters that favors a high F1 score while keeping validation loss low.

In [None]:
# Define the hyperparameter search space for Bayesian Optimization
transformer_param_space_combined_score = [
    Integer(name='hidden_size', low=256, high=1024),
    Integer(name='num_layers', low=2, high=8),
    Categorical(name='num_heads', categories=valid_num_heads),  # Use only valid num_heads
    Categorical(name='optimizer', categories=['SGD', 'Adam', 'AdamW']),
    Categorical(name='loss_func', categories=[multitask_loss, multitask_loss_loss_proportional, multitask_loss_auto_weighted_bce, multitask_loss_dynamic_weighted]),
    Integer(name='patience', low=5, high=15),
    Real_skopt(name='lr', low=0.00001, high=0.006),
    Real_skopt(name='shared_weight_decay', low=0.0001, high=0.01),
    Real_skopt(name='task_weight_decay', low=0.0001, high=0.01),
    Real_skopt(name='dropout', low=0.1, high=0.5)
]

# Objective function for Bayesian Optimization with combined objective (f1 score and validation loss)
@use_named_args(transformer_param_space_combined_score)
def objective_combined(hidden_size, num_layers, num_heads, optimizer, loss_func, patience, lr, shared_weight_decay, task_weight_decay, dropout):
    fold_f1_scores = []
    fold_validation_losses = []

    # Perform k-fold cross-validation
    for train_idx, val_idx in kf.split(X_train):
        validation_loss, f1_score = train_and_evaluate_transformer((int(hidden_size), int(num_layers), int(num_heads), optimizer, loss_func, int(patience), lr, shared_weight_decay, task_weight_decay, dropout), train_idx, val_idx)
        fold_f1_scores.append(f1_score)
        fold_validation_losses.append(validation_loss)

    # Calculate the average F1 score and validation loss across all folds
    avg_f1_score = sum(fold_f1_scores) / len(fold_f1_scores)
    avg_validation_loss = sum(fold_validation_losses) / len(fold_validation_losses)

    # Combine metrics into a single objective
    weight_f1 = 0.8  # Adjust weights as needed
    weight_loss = 0.2
    combined_objective = -weight_f1 * avg_f1_score + weight_loss * avg_validation_loss

    print(f"Params: {(hidden_size, num_layers, num_heads, optimizer, loss_func.__name__, patience, lr)}\nAvg F1 Score: {avg_f1_score} | Avg Validation Loss: {avg_validation_loss} | Combined Objective: {combined_objective}")

    return combined_objective

# Perform Bayesian Optimization for the Transformer with the combined score objective
best_transformer_params_combined_score = run_bayesian_optimization(objective_combined, samples, transformer_param_space_combined_score)

---

### **4.6 Train Transformer with Best Parameters for Combined Score**

#### Now we will train the Transformer with the best parameters for our Combined Score found in our hyperparameter search and saved in **best_transformer_params_combined_score**
#### Afterwards we will print out a series of important metrics and visualizations to verify our results

In [None]:
from fastai.data.all import *
from fastai.learner import *
from fastai.metrics import *

# To run with best parameters found during initial hyperparameter search with bayesian optimization and 60 samples and 20 epochs
# Best Transformer Parameters: {'hidden_size': 1024, 'num_layers': 7, 'num_heads': np.int64(4), 'optimizer': np.str_('Adam'), 'loss_func': <function multitask_loss_loss_proportional at 0x000002419F837CE0>, 'patience': 5, 'lr': 1e-05, 'shared_weight_decay': 0.0001, 'task_weight_decay': 0.0001, 'dropout': 0.1}
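# Best Transformer Validation Loss: -0.482519742495292
# Note: for this search the printed value is the combined objective (-0.8 * F1 + 0.2 * loss), not a raw validation loss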
hidden_size = 1024
num_layers = 7
num_heads = 4
optimizer = 'Adam'
multitask_loss_func = multitask_loss_loss_proportional
lr = 0.00001
patience = 5
dropout = 0.1
shared_weight_decay = 0.0001
task_weight_decay = 0.0001
epochs = 100
batch_size = 64

# print("Best Transformer Parameters:", best_transformer_params_combined_score)

# # Extract parameters
# dropout = best_transformer_params_combined_score['dropout']
# hidden_size = int(best_transformer_params_combined_score['hidden_size'])
# lr = best_transformer_params_combined_score['lr']
# num_heads = int(best_transformer_params_combined_score['num_heads'])
# num_layers = int(best_transformer_params_combined_score['num_layers'])
# multitask_loss_func = best_transformer_params_combined_score['loss_func']
# optimizer = best_transformer_params_combined_score['optimizer']
# patience = int(best_transformer_params_combined_score['patience'])
# shared_weight_decay = best_transformer_params_combined_score['shared_weight_decay']
# task_weight_decay = best_transformer_params_combined_score['task_weight_decay']

# Run the best Transformer model training with the combined score objective parameters
best_transformer_learn_combined_score = run_best_transformer_model_training(hidden_size, num_layers, num_heads, optimizer, multitask_loss_func, patience, lr, shared_weight_decay, task_weight_decay, dropout, epochs=epochs)

---

### **4.7 Visualize Our Results for Combined Score**
#### Let us visualize the results of our model training and evaluation for our Combined Score.

In [None]:
# Evaluate and visualize the model performance and results
evaluate_and_visualize(learn=best_transformer_learn_combined_score, epochs=epochs)



---

## **5. Summary of the Project**
**1. Data Pre-Processing**
  - We followed the authors' pre-processing code and used their serialized pipelines with limited refactoring.

**2. Model Definition**
  - We defined our `MultiTaskRNN` model for use with our experiments.
  - We defined our `MultiTaskTransformer` model for use with our experiments.
  - We defined a series of custom classes and helper methods, including `MultiTaskMetric`, `MultiTaskTotalMetric`, `MultiTaskDataset`, multiple implementations of `multitask_loss`, and `evaluate_and_visualize`, which provides a series of important metrics and visualizations, along with many other helper methods for convenience in later tasks.

**3. RNN Implementation**
  - Uses `nn.LSTM` for shared layers and two task-specific branches with specifications defined by the authors.
  - Performs a hyperparameter search with Bayesian Optimization over RNN-specific and optimizer-specific hyperparameters.
  - We implement two separate objectives for Bayesian Optimization:
    - One that focuses on average validation loss.
    - One that focuses on a combined weighted score of F1 score (which we are trying to maximize) and validation loss (which we are trying to minimize), giving more importance to F1 score.
  - After completing the hyperparameter search for each objective, we train and validate the model on the best set of parameters for that objective.
  - Finally, we visualize our results across a series of valuable statistics such as Correct vs Incorrect, F1 Score, Accuracy, Precision@K, R-Precision, and MAP.

**4. Transformer Implementation**
  - Uses `nn.TransformerEncoder` for shared layers.
  - Performs a hyperparameter search with Bayesian Optimization over Transformer-specific and optimizer-specific hyperparameters.
  - We implement two separate objectives for Bayesian Optimization:
    - One that focuses on average validation loss.
    - One that focuses on a combined weighted score of F1 score (which we are trying to maximize) and validation loss (which we are trying to minimize), giving more importance to F1 score.
  - After completing the hyperparameter search for each objective, we train and validate the model on the best set of parameters for that objective.
  - Finally, we visualize our results across a series of valuable statistics such as Correct vs Incorrect, F1 Score, Accuracy, Precision@K, R-Precision, and MAP.

#### In conclusion, both implementations are different takes on the work of **Gencheva et al. (2019)** in the paper "**A Context-Aware Approach for Detecting Worth-Checking Claims in Political Debates**". Overall, this made for a wonderful thought experiment at the intersection of political science and artificial intelligence.