# Put your Google Colab link here:
*your link here*

# COMFORT: A Continual Fine-Tuning Framework for Foundation Models Targeted at Consumer Healthcare

In this exercise, we will use PyTorch to recreate some of the experiments done in the [COMFORT](https://arxiv.org/abs/2409.09549) paper. We will first build a health foundation model through large-scale pre-training. Then, we will use parameter-efficient fine-tuning algorithms to continually fine-tune the health foundation model to learn downstream disease-detection tasks.

*   Navigate to the tabs above. Click `Runtime` -> `Change runtime type` -> select `GPU` as the hardware accelerator to enable GPU, which will allow you to train faster.

## Import useful libraries:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Add the current assignment folder to the system path
# to enable importing functions from utils.py located in this directory
import sys
sys.path.append('/content/drive/Shareddrives/ECE477 datasets/Assignment11')

In [None]:
import os
import time
import math
import torch
import argparse
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from utils import *
from torch import Tensor
from pathlib import Path
from sklearn.utils import shuffle
from sklearn.decomposition import PCA
from typing import Union, Dict, Optional
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

## Get access to a GPU:
To gain access to the GPUs on Colab, navigate to the `Runtime` tab above and select `Change runtime type`.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Part 1: Prepare the experimental datasets (1 pt)

### Experimental datasets:

For this exercise, we use the DiabDeep dataset collected for the [DiabDeep](https://ieeexplore.ieee.org/abstract/document/8935429) project and the MHDeep dataset collected for the [MHDeep](https://dl.acm.org/doi/full/10.1145/3527170) project.

*  **The DiabDeep Dataset:**
The DiabDeep dataset contains physiological signals and environmental information collected from 25 non-diabetic individuals, 14 Type-I diabetic patients, and 13 Type-II diabetic patients with a smartwatch and a smartphone.

*  **The MHDeep Dataset:**
The MHDeep dataset contains physiological signals and environmental information collected from 23 healthy participants, 23 participants with bipolar disorder, 10 participants with major depressive disorder, and 16 participants with schizoaffective disorder with a smartwatch and a smartphone.

The datasets used here are their streamlined versions. The DiabDeep dataset contains 20,957 data instances with 4,485 features. The MHDeep dataset includes 27,082 data instances with 4,485 features. Refer to Table 1 in [COMFORT](https://arxiv.org/abs/2409.09549) for the data features included in these datasets.

### Data preprocessing:

It is good practice to [preprocess](https://neptune.ai/blog/data-preprocessing-guide) your dataset before you jump into model training. Data preprocessing includes handling missing values, data normalization, feature selection, dimensionality reduction, etc. This part has been done for you. See Section 4.2 in [COMFORT](https://arxiv.org/abs/2409.09549) for details of the preprocessing that was applied.

### Prepare data for pre-training:

COMFORT proposes pre-training the health foundation model on a large amount of physiological data collected from healthy individuals with wearable medical sensors (WMSs). However, because such large datasets are difficult to access for this exercise, we will use only the data from the healthy participants in the DiabDeep and MHDeep datasets to pre-train the foundation model. These data are already packaged separately in `diabdeep_data_4pretrain.zip` and `mhdeep_data_4pretrain.zip` for you.

## Create a custom dataset class: (0.5 pt)
To prepare our datasets, we create a `CustomDataset` class that inherits PyTorch's [Dataset](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) class.
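
To see how a `Dataset` is consumed, here is a minimal sketch that uses PyTorch's built-in `TensorDataset` on made-up tensors. The shapes are illustrative assumptions, not the real assignment data. A `DataLoader` relies on the dataset's `__len__` and `__getitem__` methods to form batches, which is the same contract `CustomDataset` must satisfy.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 samples, each a (15, 128) sequence; shapes are illustrative only
toy_x = torch.randn(10, 15, 128)
toy_y = toy_x.clone()

# TensorDataset already implements __len__ and __getitem__
toy_set = TensorDataset(toy_x, toy_y)
toy_loader = DataLoader(toy_set, batch_size=4, shuffle=False)

# Collect the shape of the feature tensor in every batch
batch_shapes = [tuple(xb.shape) for xb, yb in toy_loader]
print(len(toy_set), batch_shapes)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others unless `drop_last=True` is passed to the `DataLoader`.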


In [None]:
class CustomDataset(Dataset):
    def __init__(self, x, y, device):
        # cast x and y to `device`
        """TO DO"""
        self.x =
        self.y =

    def __len__(self):
        # Return the len of the dataset
        """TO DO"""
        return

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

## Load the data for pre-training: (0.5 pt)

Use NumPy to load the data from both datasets and transform them to torch tensors. Then, concatenate them along the correct dimension.
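
As a rough illustration of the NumPy-to-tensor-to-concatenation pattern, the sketch below uses two made-up arrays in place of the loaded files. The array names and the `(num_samples, seq_length, num_features)` layout are assumptions for illustration; check the output of `os.listdir()` and the loaded array shapes for the real data.

```python
import numpy as np
import torch

# Stand-ins for arrays that np.load() would return; the layout
# (num_samples, seq_length, num_features) is assumed for illustration
arr_a = np.zeros((3, 15, 128), dtype=np.float32)
arr_b = np.ones((5, 15, 128), dtype=np.float32)

# Convert NumPy arrays to torch tensors (shares memory with the arrays)
t_a = torch.from_numpy(arr_a)
t_b = torch.from_numpy(arr_b)

# Stack the two sets of samples along the sample (first) dimension
combined = torch.cat([t_a, t_b], dim=0)
print(combined.shape)
```

Concatenating along the sample dimension keeps every sequence intact and simply yields a larger pool of sequences.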

In [None]:
# unzip the pre-training datasets
!unzip '/content/drive/Shareddrives/ECE477 datasets/Assignment11/diabdeep_data_4pretrain.zip' -d diabdeep_data_4pretrain
!unzip '/content/drive/Shareddrives/ECE477 datasets/Assignment11/mhdeep_data_4pretrain.zip' -d mhdeep_data_4pretrain

# check out the data
print('\n', os.listdir('diabdeep_data_4pretrain'))
print('\n', os.listdir('mhdeep_data_4pretrain'))

# load the data and transform them to torch tensors
"""TO DO"""
diab_x_4pretrain =
mh_x_4pretrain =
# concatenate the above data along the correct dimension
"""TO DO"""
x_pretrain =
print(x_pretrain.shape)

# Part 2: Create the building blocks for a Transformer (11 pts)

COMFORT targets classification applications in the WMS data domain, where both the input data and the output data are numerical. Therefore, we do not need a decoder for our Transformer model. In addition, we want our health foundation model to understand the bidirectional relationship between WMS data, so we employ the BERT_TINY [[1](https://arxiv.org/abs/1810.04805), [2](https://arxiv.org/abs/2110.01518), [3](https://arxiv.org/abs/1908.08962)] architecture and pre-training objectives to construct the health foundation model.

In the following cells, we will create the building blocks of an encoder-only Transformer model. Each cell comes with a blog link for further reading.

## Positional encoding: (2 pts)

[Positional encoding](https://medium.com/@hunterphillips419/positional-encoding-7a93db4109e6) is used to provide a relative position for each token in a sequence.

**Hint: There is an efficient and PyTorch-centric approach to implement this.**
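
For reference, the standard sinusoidal encoding for position $k$ and embedding index pair $(2i, 2i+1)$ can be written as follows. This matches the reference code used later in this notebook to encode the target values, with $n = 10000$.

```latex
\begin{aligned}
PE(k, 2i)   &= \sin\!\left(\frac{k}{n^{2i/d_{\text{model}}}}\right), \\
PE(k, 2i+1) &= \cos\!\left(\frac{k}{n^{2i/d_{\text{model}}}}\right),
\end{aligned}
\qquad n = 10000,\quad i = 0, 1, \dots, \tfrac{d_{\text{model}}}{2} - 1.
```

Since $1 / n^{2i/d_{\text{model}}} = \exp\!\left(-2i \ln n / d_{\text{model}}\right)$, the divisor can be computed for all $i$ at once with a single vectorized exponential.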

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1, max_length: int = 1000):
        """
        Args:
          d_model:          dimension of embeddings
          dropout:          probability of dropout occurring
          max_length:       max sequence length
        """
        # inherit from Module
        super().__init__()

        # initialize a dropout layer
        self.dropout = nn.Dropout(p=dropout)

        # create an empty tensor of 0s
        """TO DO"""
        pe =     # pe.shape = (max_length, d_model)

        # create a column matrix for all possible positions
        """TO DO"""
        k =     # k.shape = (max_length, 1)

        # calculate the divisor for positional encoding
        # use n = 10000.0
        """TO DO"""
        div_term =

        # calculate sine on even indices
        """TO DO"""
        pe[:, 0::2] =

        # calculate cosine on odd indices
        """TO DO"""
        pe[:, 1::2] =

        # add one dimension
        pe = pe.unsqueeze(0)

        # register the positional encoding pe to buffers
        # buffers are saved in state_dict but not trained by the optimizer
        self.register_buffer("pe", pe)

    def forward(self, x: Tensor):
        """
        Args:
          x:        input embeddings
        Returns:
          x:        input embeddings + positional encodings
        """
        # add positional encoding to the embeddings
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)

        # perform dropout
        x = self.dropout(x)
        return x

## Multi-head attention layers: (3 pts)

[Multi-head attention](https://medium.com/@hunter-j-phillips/multi-head-attention-7924371d477a) layers are the heart of a Transformer model.
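
The sketch below illustrates the core tensor manipulation on toy data: splitting the embedding dimension into heads, computing scaled dot-product attention per head, and merging the heads back. The random tensors stand in for the outputs of the query, key, and value linear layers, and the helper `split_heads` is a hypothetical name introduced only for this example.

```python
import math
import torch

batch_size, seq_length, d_model, n_heads = 2, 15, 128, 2
d_key = d_model // n_heads

# Toy projected tensors standing in for the outputs of the Q/K/V linear layers
Q = torch.randn(batch_size, seq_length, d_model)
K = torch.randn(batch_size, seq_length, d_model)
V = torch.randn(batch_size, seq_length, d_model)

def split_heads(t):
    # (batch, seq, d_model) -> (batch, n_heads, seq, d_key)
    return t.view(t.size(0), -1, n_heads, d_key).permute(0, 2, 1, 3)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

# Scaled dot-product attention, computed independently for each head
scores = torch.matmul(Qh, Kh.transpose(-2, -1)) / math.sqrt(d_key)
attn_probs = torch.softmax(scores, dim=-1)       # (batch, n_heads, seq, seq)
heads_out = torch.matmul(attn_probs, Vh)         # (batch, n_heads, seq, d_key)

# Merge heads back to (batch, seq, d_model)
merged = heads_out.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, d_model)
print(attn_probs.shape, merged.shape)
```

Each row of `attn_probs` sums to 1, so every output token is a weighted average of the value vectors within its head.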

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, dropout: float = 0.1):
        """
        Args:
          d_model:      dimension of embeddings
          n_heads:      number of self attention heads
          dropout:      probability of dropout occurring
        """
        super().__init__()
        assert d_model % n_heads == 0            # d_model must be divisible by n_heads
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_key = d_model // n_heads          # Assume d_value equals d_key

        # Use nn.Linear() to initialize the query layer weights
        """TO DO"""
        self.Wq =
        # Use nn.Linear() to initialize the key layer weights
        """TO DO"""
        self.Wk =
        # Use nn.Linear() to initialize the value layer weights
        """TO DO"""
        self.Wv =
        # Use nn.Linear() to initialize the output layer weights
        """TO DO"""
        self.Wo =

        # Initialize a dropout layer
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query: Tensor, key: Tensor, value: Tensor, mask: Tensor = None):
        """
        Args:
          query:         query vector       (batch_size, q_length, d_model)
          key:           key vector         (batch_size, k_length, d_model)
          value:         value vector       (batch_size, s_length, d_model)
          mask:          mask for decoder
        Returns:
          output:        attention values   (batch_size, q_length, d_model)
        """
        batch_size = key.size(0)

        # calculate query, key, and value tensors
        """TO DO"""
        Q =
        K =
        V =

        # split each tensor into n-heads to compute attention
        # query tensor
        Q = Q.view("""TO DO""").permute(0, 2, 1, 3)

        # key tensor
        K = K.view("""TO DO""").permute(0, 2, 1, 3)

        # value tensor
        V = V.view("""TO DO""").permute(0, 2, 1, 3)

        # computes attention
        # scaled dot product -> QK^{T}
        scaled_dot_prod = torch.matmul(Q, K.permute(0, 1, 3, 2)) / math.sqrt(self.d_key)

        # fill those positions of product as (-1e10) where mask positions are 0
        if mask is not None:
          scaled_dot_prod = scaled_dot_prod.masked_fill(mask == 0, -1e10)

        # apply softmax
        attn_probs = torch.softmax(scaled_dot_prod, dim=-1)

        # multiply by values to get attention
        A = torch.matmul(self.dropout(attn_probs), V)

        # reshape attention back
        A = A.permute(0, 2, 1, 3).contiguous()
        A = A.view(batch_size, -1, self.n_heads*self.d_key)

        # push through the final weight layer
        """TO DO"""
        output =

        return output

## Position-wise feed-forward network: (1 pt)

[Position-wise feed-forward network](https://medium.com/@hunter-j-phillips/position-wise-feed-forward-network-ffn-d4cc9e997b4c) is an expand-and-contract network that transforms each position in a sequence using the same dense layers.
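
To make "position-wise" concrete, the sketch below checks on toy data that applying two linear layers to a whole `(batch, seq_length, d_model)` tensor gives the same result as applying them to one token on its own. The layer names `expand` and `contract` and the small sizes are assumptions made for this example, and dropout is left out so the comparison is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ffn = 8, 32                     # small illustrative sizes
expand = nn.Linear(d_model, d_ffn)         # hypothetical stand-in for the expanding layer
contract = nn.Linear(d_ffn, d_model)       # hypothetical stand-in for the contracting layer

x = torch.randn(2, 5, d_model)             # (batch, seq_length, d_model)

# Apply the network to the whole tensor at once
full = contract(torch.relu(expand(x)))

# Apply the same network to a single token of a single sequence
single = contract(torch.relu(expand(x[0, 3])))

matches = torch.allclose(full[0, 3], single, atol=1e-5)
print(full.shape, matches)
```

Because `nn.Linear` acts only on the last dimension, no information is mixed across tokens here; mixing across tokens happens only in the attention sublayer.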

In [None]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, dropout: float = 0.1):
        """
        Args:
          d_model:      dimension of embeddings
          d_ffn:        dimension of feed-forward network
          dropout:      probability of dropout occurring
        """
        super().__init__()
        # Use nn.Linear() to initialize the w_1 layer weights
        """TO DO"""
        self.w_1 =

        # Use nn.Linear() to initialize the w_2 layer weights
        """TO DO"""
        self.w_2 =

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """
        Args:
          x:        Output from attention layer             (batch_size, seq_length, d_model)
        Returns:
          x:        Expanded-and-contracted representation  (batch_size, seq_length, d_model)
        """
        # pass x through self.w_1 and a relu() layer
        """TO DO"""
        x =
        x = self.dropout(x)
        # pass x through self.w_2
        x =
        return x

## Encoder layer and encoder stack:

The [encoder layer](https://medium.com/@hunter-j-phillips/the-encoder-f698b2c7afc0) is a wrapper for the sublayers created in the previous cells. The encoder stack is a wrapper to combine the encoder layers.

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ffn: int, dropout: float):
        """
        Args:
          d_model:      dimension of embeddings
          n_heads:      number of self-attention heads
          d_ffn:        dimension of feed-forward network
          dropout:      probability of dropout occurring
        """
        super().__init__()
        # multi-head attention sublayer
        self.attention = MultiHeadAttention(d_model, n_heads, dropout)
        # layer norm for multi-head attention
        self.attn_layer_norm = nn.LayerNorm(d_model)

        # position-wise feed-forward network
        self.positionwise_ffn = PositionwiseFeedForward(d_model, d_ffn, dropout)
        # layer norm for position-wise ffn
        self.ffn_layer_norm = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, src: Tensor, src_mask: Tensor):
        """
        Args:
          src:          positionally embedded sequences     (batch_size, seq_length, d_model)
          src_mask:     mask for the sequences              (batch_size, 1, 1, seq_length)
        Returns:
          src:          sequences after self-attention and feed-forward layers    (batch_size, seq_length, d_model)
        """
        # pass embeddings through multi-head attention
        _src = self.attention(src, src, src, src_mask)

        # residual add and norm
        src = self.attn_layer_norm(src + self.dropout(_src))

        # position-wise feed-forward network
        _src = self.positionwise_ffn(src)

        # residual add and norm
        src = self.ffn_layer_norm(src + self.dropout(_src))

        return src


class Encoder(nn.Module):
    def __init__(self, d_model: int, n_layers: int,
                 n_heads: int, d_ffn: int, dropout: float = 0.1):
        """
        Args:
          d_model:      dimension of embeddings
          n_layers:     number of encoder layers
          n_heads:      number of self-attention heads
          d_ffn:        dimension of feed-forward network
          dropout:      probability of dropout occurring
        """
        super().__init__()

        # create n_layers encoders
        self.layers = nn.ModuleList([EncoderLayer(d_model, n_heads, d_ffn, dropout)
                                     for layer in range(n_layers)])

        self.dropout = nn.Dropout(dropout)

    def forward(self, src: Tensor, src_mask: Tensor):
        """
        Args:
          src:          embedded sequences              (batch_size, seq_length, d_model)
          src_mask:     mask for the sequences          (batch_size, 1, 1, seq_length)

        Returns:
          src:          sequences after encoder layers  (batch_size, seq_length, d_model)
        """
        # pass the sequences through each encoder
        for layer in self.layers:
            src = layer(src, src_mask)
        return src

## Define the Transformer class for the foundation model:

Here, we define a Transformer class as the wrapper to combine the building blocks defined above.
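
As a small illustration of how the padding mask in this class works, the sketch below builds a toy batch in which padded tokens are filled with the value 100 (the default `pad_idx` in this notebook) and derives the mask from the first feature of each token. The tensor sizes are assumptions chosen for readability.

```python
import torch

PAD = 100                                   # padding fill value (notebook default)
src = torch.randn(2, 4, 3)                  # (batch, seq_length, features), toy sizes
src[1, 2:] = PAD                            # pretend the last two tokens of sample 1 are padding

# Use the first feature of every token to decide whether the token is padding
first_feature = src[:, :, 0]                # (batch, seq_length)

# True for real tokens, False for padding; add two broadcast dimensions
src_mask = (first_feature != PAD).unsqueeze(1).unsqueeze(2)   # (batch, 1, 1, seq_length)

print(src_mask.shape)
print(src_mask[1, 0, 0].tolist())
```

The two extra dimensions let the mask broadcast over the head and query dimensions of the `(batch, n_heads, q_length, k_length)` attention scores.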

In [None]:
class Transformer(nn.Module):
    def __init__(self, encoder: Encoder, pos_enc: PositionalEncoding, src_pad_idx: int, device):
        """
        Args:
          encoder:      encoder stack
          pos_enc:      positional encodings
          src_pad_idx:  padding index
          device:       device used for training
        """
        super().__init__()
        self.encoder     = encoder
        self.pos_enc     = pos_enc
        self.src_pad_idx = src_pad_idx
        self.device      = device

    def make_src_mask(self, src: Tensor):
        """
        Create a mask for source sequences.

        Args:
          src:          raw sequences with padding
        Returns:
          src_mask:     mask for each sequence
        """
        # assign 1 to tokens that need to be attended to and 0 to padding tokens, then add 2 dimensions
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
        return src_mask

    def forward(self, src: Tensor):
        """
        Pass the source sequences through the encoder.

        Args:
          src:          raw source sequences
        Returns:
          src:          sequences after encoder stack
        """
        # create source masks
        src_4_mask = src[:, :, 0]
        src_mask = self.make_src_mask(src_4_mask)

        # pass the src through the pos_enc and encoder layers
        src = self.pos_enc(src)
        src = self.encoder(src, src_mask)
        return src

## Make the foundation model:

Here, we define a function `make_foundation_model()` to instantiate a Transformer model as our foundation model.

In [None]:
def make_foundation_model(device, n_layers: int = 3, in_features: int = 4485, d_model: int = 512,
               d_ffn: int = 2048, n_heads: int = 8, dropout: float = 0.1,
               max_length: int = 1000, pad_idx: int = 100):
        """
        Construct a model when provided parameters.

        Args:
          device:       device used for training
          n_layers:     number of Transformer layers
          in_features:  dimension of input features
          d_model:      dimension of embeddings
          d_ffn:        dimension of feed-forward network
          n_heads:      number of heads
          dropout:      probability of dropout occurring
          max_length:   maximum sequence length for positional encodings
          pad_idx:      padding index
        Returns:
          model:        a Transformer model
        """
        # create the encoder
        encoder = Encoder(d_model, n_layers, n_heads, d_ffn, dropout)

        # create a positional encoding matrix
        pos_enc = PositionalEncoding(d_model, dropout, max_length)

        # create the Transformer model
        model = nn.Sequential(Transformer(encoder, pos_enc, pad_idx, device))

        # initialize parameters with Xavier/Glorot
        for p in model.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
        return model

## Define the pre-training function: (3 pts)

Here, we define the pre-training function `pretrain_model()` to pre-train our foundation model.
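
The masking step of this pre-training objective can be illustrated on a single toy sequence. The sketch below is one possible way to pick rows and columns without replacement (using `torch.randperm`); the all-zero sample, the variable names, and rounding 15% of 128 features down to 19 are assumptions made for this example.

```python
import torch

torch.manual_seed(0)
seq_length, n_features = 15, 128
n_rows_to_mask, mask_ratio = 5, 0.15

sample = torch.zeros(seq_length, n_features)      # one toy sequence of 15 tokens
n_cols_to_mask = int(mask_ratio * n_features)     # 19 of 128 features (rounded down)

# Choose token rows without replacement
rows = torch.randperm(seq_length)[:n_rows_to_mask].tolist()
for r in rows:
    # Choose feature columns without replacement, independently for each row
    cols = torch.randperm(n_features)[:n_cols_to_mask]
    # Overwrite the selected entries with standard normal noise
    sample[r, cols] = torch.randn(n_cols_to_mask)

# Count how many entries were corrupted in each row
changed_per_row = (sample != 0).sum(dim=1)
print(changed_per_row.tolist())
```

The model then sees the corrupted sequence as input and is trained to reconstruct the clean target values.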

In [None]:
def pretrain_model(model, device, iterator, optimizer, criterion, clip):
    """
    Train the model on the given data.

    Args:
        model:        the Transformer model to be trained
        device:       device used for training
        iterator:     data to be trained on
        optimizer:    optimizer for updating parameters
        criterion:    loss function for updating parameters
        clip:         value to help prevent exploding gradients
    Returns:
        avg_loss:     average loss for the epoch
    """
    # set the model to training mode
    model.train()

    epoch_loss = 0

    # loop through each batch in the iterator
    for batch in iterator:

        # set the source (data features) and target (target values) batches
        src, trg = batch
        for i in range(len(src)):
            # in src, randomly select 5 row indices out of the 15 rows (tokens) without replacement for masking
            """TO DO"""
            row_mask_idx =
            for j in range(len(row_mask_idx)):
                # in src, for each row_mask_idx, randomly select 15% of the data features without replacement for masking
                """TO DO"""
                column_mask_idx =
                # in each src, replace data at locations selected above with random numbers from the standard normal distribution
                """TO DO"""
                src[i, row_mask_idx[j], column_mask_idx] =

        # zero out the optimizer gradients
        """TO DO"""


        # get output logits for src
        """TO DO"""
        logits =
        logits = logits.contiguous().view(-1, logits.shape[-1])

        # expected output (target values)
        expected_output = trg.contiguous().view(-1, trg.shape[-1])

        # calculate the loss
        loss = criterion(logits, expected_output)

        # backpropagate loss
        """TO DO"""

        # clip the gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        # update the weights
        """TO DO"""

        # update the loss
        epoch_loss += loss.item()

    # return the average loss for the epoch
    avg_loss = epoch_loss / len(iterator)
    return avg_loss

## Initialize model parameters:

We initialize some hyperparameters here.

Typically, constructing a foundation model requires a significant amount of pre-training data and many training epochs. However, due to resource limitations, we will only briefly pre-train our health foundation model for **5 steps** of **20 epochs** each.

In [None]:
layers     = 2            # number of Transformer layers
features   = 128          # dimension of input features
d_model    = 128          # dimensions of embeddings
d_ffn      = d_model*4    # dimension of feed-forward network
heads      = 2            # number of heads
max_length = 15           # max sequence length
PAD_NUM    = 100          # filling number for padding

epochs     = 20           # briefly pre-train the foundation model for 20 epochs per step
steps      = 5            # briefly pre-train the foundation model for 5 steps
lr         = 5e-4         # initial learning rate
batch_size = 128          # batch size

## Pre-train the foundation model: (2 pts)

In [None]:
import random
SEED = 0

# Set a fixed random seed for reproducibility
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if device.type == 'cuda':
  torch.cuda.manual_seed(SEED)
  torch.cuda.manual_seed_all(SEED)
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False

# use make_foundation_model() to instantiate a Transformer model as the foundation model
"""TO DO"""
model =
# cast the model to `device`
"""TO DO"""


# initialize the Adam optimizer
"""TO DO"""
optimizer =
# set the loss criterion as the MSELoss()
"""TO DO"""
criterion =
# set the CLIP value to prevent exploding gradient
CLIP = 1

# training may take 20 min
for step in range(steps):
    # shuffle the pre-training data to inject randomness
    """TO DO"""
    x_pretrain =
    # create a copy of the pre-training data as their target values
    """TO DO"""
    y_pretrain =

    # apply positional encoding to the target values
    ##################################################################################
    pe = torch.zeros(max_length, d_model)
    k = torch.arange(0, max_length).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(k * div_term)
    pe[:, 1::2] = torch.cos(k * div_term)
    pe = pe.unsqueeze(0)
    y_pretrain = y_pretrain + pe[:, : y_pretrain.size(1)]
    ##################################################################################

    # use CustomDataset() to create the pre-training dataset pretrain_set
    """TO DO"""
    pretrain_set =
    # use PyTorch's DataLoader() to create the dataloader pretrain_loader
    """TO DO"""
    pretrain_loader =

    # loop through each epoch
    for epoch in range(epochs):
        # use pretrain_model() to calculate the train loss and update the parameters
        """TO DO"""
        train_loss =
        # save the model parameters
        torch.save(model.state_dict(), 'pretrain_comfort_mlm_exercise.pt')

        print(f'Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f}')

# Part 3: Continual parameter-efficient fine-tuning for downstream disease-detection tasks (8 pts)

Now that we have finished pre-training our health foundation model, we will learn downstream disease-detection tasks using parameter-efficient fine-tuning (PEFT) algorithms. We will use [Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685) to learn the DiabDeep detection task.

## PEFT with LoRA:

Here, we will define the building blocks for implementing LoRA.

## Position-wise feed-forward network with LoRA: (3 pts)

We will insert low-rank matrices into the position-wise feed-forward networks for PEFT.
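
Before applying LoRA to the feed-forward network, it may help to see the idea on a single linear layer. The sketch below is a minimal illustration under assumed sizes: the names `base`, `lora_B`, `lora_A`, and `lora_linear` are hypothetical, and merging the update into the weight before the matrix multiply is just one equivalent way of computing `W x + B A x`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out, rank, alpha = 128, 512, 8, 8

base = nn.Linear(d_in, d_out)                 # stands in for a pre-trained layer
base.weight.requires_grad = False             # freeze the pre-trained weight matrix

# Low-rank factors: B starts at zero so that delta_W = B @ A is zero initially
lora_B = nn.Parameter(torch.zeros(d_out, rank))
lora_A = nn.Parameter(torch.randn(rank, d_in))

def lora_linear(x):
    # Scaled low-rank update with the same shape as base.weight: (d_out, d_in)
    delta_w = (lora_B @ lora_A) * (alpha / rank)
    return F.linear(x, base.weight + delta_w, base.bias)

x = torch.randn(4, 15, d_in)

# At initialization the LoRA layer reproduces the frozen layer exactly
same_at_init = torch.allclose(lora_linear(x), base(x))
print(same_at_init)
```

Because `lora_B` is initialized to zeros, fine-tuning starts from the pre-trained behavior and only the two small factors receive gradient updates.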

In [None]:
class LoRAPositionwiseFeedForward(PositionwiseFeedForward):
    """
    Extends PositionwiseFeedForward module with Low-Rank Adaptation (LoRA).
    LoRA adds two matrices to the layer, allowing for efficient training of large models.
    """
    def __init__(self, rank: int, d_model:int, d_ffn: int, dropout: float = 0.1):
        """
        Args:
          rank:         rank of LoRA matrices
          d_model:      dimension of embeddings
          d_ffn:        dimension of feed-forward network
          dropout:      probability of dropout occurring
        """
        super().__init__(d_model, d_ffn, dropout)

        # Initialize LoRA matrices
        self.alpha = 8
        self.rank = rank
        # use nn.Parameter() to initialize the low-rank matrix B for w_1 in the position-wise feed-forward network
        # hint: zero matrix
        """TO DO"""
        self.lora_w_1_B =
        # use nn.Parameter() to initialize the low-rank matrix A for w_1 in the position-wise feed-forward network
        # hint: random values from the standard normal distribution
        """TO DO"""
        self.lora_w_1_A =
        # use nn.Parameter() to initialize the low-rank matrix B for w_2 in the position-wise feed-forward network
        # hint: zero matrix
        """TO DO"""
        self.lora_w_2_B =
        # use nn.Parameter() to initialize the low-rank matrix A for w_2 in the position-wise feed-forward network
        # hint: random values from the standard normal distribution
        """TO DO"""
        self.lora_w_2_A =

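        # Freeze the original weight matrices
        # (setting requires_grad on the nn.Linear module itself has no effect; set it on the weight tensor)
        self.w_1.weight.requires_grad = False
        self.w_2.weight.requires_grad = False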

    def forward(self, x: Tensor) -> Tensor:
        # Compute LoRA weight matrices (delta W)
        # hint: BA
        """TO DO"""
        lora_w_1_weights =
        lora_w_2_weights =
        # scale LoRA weight matrices
        lora_w_1_weights *= (self.alpha / self.rank)
        lora_w_2_weights *= (self.alpha / self.rank)
        # Apply the original and LoRA-adjusted linear transformations
        # pass x through w_1 layer with LoRA and a relu() activation
        # hint: w_1 * x + BA * x
        """TO DO"""
        x =
        x = self.dropout(x)
        # pass x through w_2 layer with LoRA
        # hint: w_2 * x + BA * x
        """TO DO"""
        x =
        return x

## Multi-head attention layers with LoRA: (3 pts)

We will insert low-rank matrices into the multi-head attention layers for PEFT. As in the original [LoRA](https://arxiv.org/abs/2106.09685) implementation, we will apply LoRA to the query and value layers only.
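
A quick back-of-the-envelope calculation shows why this is parameter-efficient. The sketch below uses `d_model = 128` and a LoRA rank of 8, which are the values that appear elsewhere in this notebook, and ignores bias terms for simplicity.

```python
d_model, rank = 128, 8          # values used elsewhere in this notebook

# Full weight matrices for the query and value projections (biases ignored)
full_params = 2 * d_model * d_model

# LoRA factors: B is (d_model, rank) and A is (rank, d_model) for each projection
lora_params = 2 * (d_model * rank + rank * d_model)

ratio = lora_params / full_params
print(full_params, lora_params, ratio)
```

Under these assumptions, the LoRA factors for one attention layer hold 4,096 trainable values, one eighth of the 32,768 values in the two full projection matrices.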

In [None]:
class LoRAMultiHeadAttention(MultiHeadAttention):
    """
    A module that inherits from the standard MultiHeadAttention.
    It initializes the LoRA (Low-Rank Adaptation of Large Language Models) matrices and adds them to the original MultiHeadAttention.
    In __init__, initialize the new matrices needed for LoRA.
    The parameter names should all start with "lora_", as we will check for this string later in the full model wrapper to identify the LoRA matrices.
    The LoRA paper concluded that updating only the query and value matrices is the most efficient choice.
    We override the forward method with the new LoRA logic.
    """
    def __init__(self, rank: int, **kwargs):
        """
        Args:
          rank:         rank of LoRA matrices
          d_model:      dimension of embeddings
          n_heads:      number of attention heads
          dropout:      dropout probability, randomly zeroes-out some of the input
        """
        super().__init__(**kwargs)
        self.alpha = 8
        self.rank = rank
        self.d_model = kwargs['d_model']
        self.n_heads = kwargs['n_heads']
        self.dropout = nn.Dropout(p=kwargs['dropout'])

        # Initialize trainable matrices for query and value vectors
        # B should be initialized as zeros, A as random gaussian, such that their product and
        # thus the weight delta is zero in the beginning

        # use nn.Parameter() to initialize the low-rank matrix B for query
        """TO DO"""
        self.lora_query_matrix_B =
        # use nn.Parameter() to initialize the low-rank matrix A for query
        """TO DO"""
        self.lora_query_matrix_A =
        # use nn.Parameter() to initialize the low-rank matrix B for value
        """TO DO"""
        self.lora_value_matrix_B =
        # use nn.Parameter() to initialize the low-rank matrix A for value
        """TO DO"""
        self.lora_value_matrix_A =

    def lora_query(self, x):
        """
        LoRA query logic. To fully work with only training the LoRA parameters the regular linear
        layer has to be frozen before initializing the optimizer.

        Args:
          x:        inputs

        Returns:
          output:   Wq(x) + delta_Wq(x)
        """
        # Compute LoRA weight matrices (delta W)
        # hint: BA
        """TO DO"""
        lora_full_query_weights =
        # scale LoRA weight matrix
        lora_full_query_weights *= (self.alpha / self.rank)
        # Apply the original and LoRA-adjusted linear transformations
        # pass x through Wq layer with LoRA
        # hint: Wq * x + BA * x
        """TO DO"""
        output =
        return output

    def lora_value(self, x):
        """
        LoRA value logic. To fully work with only training the LoRA parameters the regular linear
        layer has to be frozen before initializing the optimizer.

        Args:
          x:        inputs

        Returns:
          output:   Wv(x) + delta_Wv(x)
        """
        # Compute LoRA weight matrices (delta W)
        # hint: BA
        """TO DO"""
        lora_full_value_weights =
        # scale LoRA weight matrix
        lora_full_value_weights *= (self.alpha / self.rank)
        # Apply the original and LoRA-adjusted linear transformations
        # pass x through Wv layer with LoRA
        # hint: Wv * x + BA * x
        """TO DO"""
        output =
        return output

    def forward(self, query: Tensor, key: Tensor, value: Tensor, mask: Tensor = None):
        """
        Args:
           query:         query vector         (batch_size, q_length, d_model)
           key:           key vector           (batch_size, k_length, d_model)
           value:         value vector         (batch_size, s_length, d_model)
           mask:          mask for decoder

        Returns:
           output:        attention values     (batch_size, q_length, d_model)
        """
        batch_size = key.size(0)

        # calculate query, key, and value tensors
        # keep the same key logic
        K = self.Wk(key)
        # replace with LoRA query logic
        """TO DO"""
        Q =
        # replace with LoRA value logic
        """TO DO"""
        V =

        # split each tensor into n-heads to compute attention
        # query tensor
        Q = Q.view("""TO DO""").permute(0, 2, 1, 3)

        # key tensor
        K = K.view("""TO DO""").permute(0, 2, 1, 3)

        # value tensor
        V = V.view("""TO DO""").permute(0, 2, 1, 3)

        # computes attention
        # scaled dot product -> QK^{T}
        scaled_dot_prod = torch.matmul(Q, K.permute(0, 1, 3, 2)) / math.sqrt(self.d_key)

        # fill those positions of product as (-1e10) where mask positions are 0
        if mask is not None:
            scaled_dot_prod = scaled_dot_prod.masked_fill(mask == 0, -1e10)

        # apply softmax
        attn_probs = torch.softmax(scaled_dot_prod, dim=-1)

        # multiply by values to get attention
        A = torch.matmul(self.dropout(attn_probs), V)

        # reshape attention back
        A = A.permute(0, 2, 1, 3).contiguous()
        A = A.view(batch_size, -1, self.n_heads*self.d_key)

        # push through the final weight layer
        output = self.Wo(A)

        return output

## LoRA Wrapper:

Here, we define a wrapper to create a Transformer model with LoRA.
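
The wrapper below relies on two mechanics: swapping attention modules in place via `named_children()` and `setattr()`, and freezing every parameter whose name does not mark it as trainable. The following is a minimal sketch of both on a toy model. `ToyAttention`, `ToyLoRAAttention`, and `swap_attention` are hypothetical stand-ins for illustration, not the classes used in this notebook, and the zero initialization of `lora_B` is an assumption borrowed from common LoRA practice.

```python
import torch
import torch.nn as nn


class ToyAttention(nn.Module):
    """Hypothetical stand-in for an attention block with one projection."""
    def __init__(self, d_model: int):
        super().__init__()
        self.Wv = nn.Linear(d_model, d_model)

    def forward(self, x):
        return self.Wv(x)


class ToyLoRAAttention(ToyAttention):
    """Same projection plus two low-rank factors named with a `lora_` prefix."""
    def __init__(self, d_model: int, rank: int = 2, alpha: float = 1.0):
        super().__init__(d_model)
        self.scale = alpha / rank
        self.lora_A = nn.Parameter(torch.randn(rank, d_model) * 0.01)
        # assumption: B starts at zero so the swapped layer initially matches the original
        self.lora_B = nn.Parameter(torch.zeros(d_model, rank))

    def forward(self, x):
        delta_w = (self.lora_B @ self.lora_A) * self.scale  # (d_model, d_model)
        return self.Wv(x) + x @ delta_w.T


def swap_attention(module: nn.Module, d_model: int) -> int:
    """Recursively replace ToyAttention children and return how many were swapped."""
    replaced = 0
    for name, child in module.named_children():
        if isinstance(child, ToyAttention) and not isinstance(child, ToyLoRAAttention):
            new_layer = ToyLoRAAttention(d_model)
            # copy the pretrained weights; the lora_* keys are missing, hence strict=False
            new_layer.load_state_dict(child.state_dict(), strict=False)
            setattr(module, name, new_layer)
            replaced += 1
        else:
            replaced += swap_attention(child, d_model)
    return replaced


torch.manual_seed(0)
model = nn.Sequential(ToyAttention(8), nn.Sequential(nn.ReLU(), ToyAttention(8)))
x = torch.randn(4, 8)
before = model(x)

n_replaced = swap_attention(model, d_model=8)

# freeze everything except the LoRA factors
for name, param in model.named_parameters():
    param.requires_grad = "lora_" in name

after = model(x)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(n_replaced, trainable)
```

In this sketch the outputs before and after the swap agree because the low-rank update is zero at initialization, so fine-tuning starts from the pre-trained behavior.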

In [None]:
class LoRAWrapper(nn.Module):
    def __init__(self, task: str, encoder: Encoder, pos_enc: PositionalEncoding,
                 d_model: int, n_layers: int, n_heads: int, pad_idx: int,
                 device, in_features: int = 4485, d_ffn: int = 512,
                 dropout_rate: float = 0.1, lora_rank: int = 8,
                 max_length: int = 20, num_classes: int = None,
                 train_biases: bool = True, train_layer_norms: bool = True):
        """
        Initializes a LoRAWrapper instance, which is a wrapper around the Transformer incorporating
        Low-Rank Adaptation (LoRA) to efficiently retrain the model for different tasks.
        LoRA allows for effective adaptation of large pre-trained models with minimal updates.

        Args:
          task:                 type of task to configure the model for: {'diabdeep', 'mhdeep'}
                                for 'diabdeep', the number of classes is 3, and 4 for 'mhdeep'
          encoder:              encoder module for the Transformer
          pos_enc:              positionalEncoding module for input src
          d_model:              dimension of embeddings
          n_layers:             number of layers in the Transformer
          n_heads:              number of attention heads
          pad_idx:              padding index
          device:               device used for training
          in_features:          input feature dimensionality
          d_ffn:                dimensionality for PositionwiseFeedForward module
          dropout_rate:         dropout probability, randomly zeroes-out some of the input
          lora_rank:            rank of the LoRA matrices
          max_length:           max input sequence length
          num_classes:          number of classes for the classification head
          train_biases:         flag indicating whether to update bias parameters during training
          train_layer_norms:    flag indicating whether to update the layer norms during training, usually this is a good idea.
        """
        super().__init__()

        supported_tasks = ['diabdeep', 'mhdeep']
        assert isinstance(task, str) and task.lower() in supported_tasks, f"task has to be one of {supported_tasks}"
        # normalize once so the comparisons below agree with the case-insensitive check above
        task = task.lower()

        if task == "diabdeep":
            num_classes = 3
        elif task == "mhdeep":
            num_classes = 4

        # 1. Initialize the base model with parameters
        self.model = nn.Sequential(Transformer(encoder, pos_enc, pad_idx, device))
        # load the pretrained weights
        print("Loading pre-trained weights from pre-training...")
        self.model.load_state_dict(torch.load('pretrain_comfort_mlm_exercise.pt', weights_only=True), strict=False)

        self.base_model_param_count = count_parameters(self.model)

        self.lora_rank = lora_rank
        self.train_biases = train_biases
        self.train_layer_norms = train_layer_norms

        # 2. Save parameters and add the classifier head for the fine-tuning tasks
        self.task = task.lower()
        self.d_model = d_model
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.pad_idx = pad_idx
        self.device = device
        self.in_features = in_features
        self.d_ffn = d_ffn
        self.dropout = dropout_rate
        self.max_length = max_length
        self.num_classes = num_classes

        # 3. Define the additional classifier head
        self.finetune_head_fc1 = nn.Linear(d_model, 512)
        self.finetune_head_fc2 = nn.Linear(512, 128)
        self.finetune_head_fc3 = nn.Linear(128, num_classes)

        # 4. Set up the lora model for fine-tuning
        self.replace_multihead_attention()
        self.freeze_parameters_except_lora_and_bias()

    def replace_multihead_attention(self, verbose = True):
        """
        Replaces MultiHeadAttention with LoRAMultiHeadAttention in the model, which contains the LoRA logic and parameters.
        """
        self.multiheadattention_replaced_modules = 0
        self.positionwisefeedforward_replaced_modules = 0
        self.replace_multihead_attention_recursion(self.model)
        if verbose:
            print(f"Replaced {self.multiheadattention_replaced_modules} modules of MultiHeadAttention with LoRAMultiHeadAttention")
            print(f"Replaced {self.positionwisefeedforward_replaced_modules} modules of PositionwiseFeedForward with LoRAPositionwiseFeedForward")

    def replace_multihead_attention_recursion(self, model):
        """
        Recursively replaces MultiHeadAttention with LoRAMultiHeadAttention in the given model/module.
        If some components are wrapped in another class this function can recursively apply the replacement to
        find all instances of the Attention.
        """
        # Model can also be a module if it contains sub-components
        for name, module in model.named_children():
            if isinstance(module, MultiHeadAttention): # or isinstance(module, PositionwiseFeedForward):

                if isinstance(module, MultiHeadAttention):
                    # Create a new LoRAMultiheadAttention layer
                    new_layer = LoRAMultiHeadAttention(rank=self.lora_rank, d_model=self.d_model, n_heads=self.n_heads, dropout=self.dropout)
                    self.multiheadattention_replaced_modules += 1

                elif isinstance(module, PositionwiseFeedForward):
                    # Create a new LoRAPositionwiseFeedForward layer
                    new_layer = LoRAPositionwiseFeedForward(rank=self.lora_rank, d_model=self.d_model, d_ffn=self.d_ffn, dropout=self.dropout)
                    self.positionwisefeedforward_replaced_modules += 1

                # Get the state of the original layer
                state_dict_old = module.state_dict()

                # Load the state dict to the new layer
                new_layer.load_state_dict(state_dict_old, strict=False)

                # Get the state of the new layer
                state_dict_new = new_layer.state_dict()

                # Compare keys of both state dicts
                keys_old = set(k for k in state_dict_old.keys() if not k.startswith("lora_"))
                keys_new = set(k for k in state_dict_new.keys() if not k.startswith("lora_"))
                assert keys_old == keys_new, f"Keys of the state dictionaries don't match (ignoring lora parameters):\n\tExpected Parameters: {keys_old}\n\tNew Parameters (w.o. LoRA): {keys_new}"

                # Replace the original layer with the new layer
                setattr(model, name, new_layer)
            else:
                # Recurse on the child modules
                self.replace_multihead_attention_recursion(module)

    def freeze_parameters_except_lora_and_bias(self):
        """
        Freezes all parameters in the model except the LoRA parameters and, if enabled, the bias and layer-norm parameters.
        LoRA parameters are identified by having `lora_` in their name.
        """
        for name, param in self.model.named_parameters():
            if ("lora_" in name) or ("finetune_head_" in name) or (self.train_biases and "bias" in name) \
                or (self.train_layer_norms and "LayerNorm" in name):
                param.requires_grad = True
            else:
                param.requires_grad = False

    def make_src_mask(self, src: Tensor):
        """
        Args:
            src:          raw sequences with padding        (batch_size, seq_length)

        Returns:
            src_mask:     mask for each sequence            (batch_size, 1, 1, seq_length)
        """
        # assign 1 to tokens that need to be attended to and 0 to padding tokens, then add 2 dimensions
        src_mask = (src != self.pad_idx).unsqueeze(1).unsqueeze(2)

        return src_mask

    def forward(self, src: Tensor):
        """
        Args:
            src:          raw src sequences                 (batch_size, src_seq_length)

        Returns:
            src:          class logits from the head        (batch_size, num_classes)
        """
        # push the src through the model
        src = self.model(src)       # (batch_size, src_seq_length, d_model)
        src = torch.mean(src, 1)    # (batch_size, d_model)
        src = self.finetune_head_fc1(src).relu()
        src = self.finetune_head_fc2(src).relu()
        src = self.finetune_head_fc3(src)
        return src

    def save_lora_state_dict(self, lora_filepath: Optional[Union[str, Path]] = None) -> Optional[Dict]:
        """
        Save the trainable parameters of the model into a state dict.
        If a file path is provided, it saves the state dict to that file.
        If no file path is provided, it simply returns the state dict.

        Parameters
        ----------
        lora_filepath : Union[str, Path], optional
            The file path where to save the state dict. Can be a string or a pathlib.Path. If not provided, the function
            simply returns the state dict without saving it to a file.

        Returns
        -------
        Optional[Dict]
            If no file path was provided, it returns the state dict. If a file path was provided, it returns None after saving
            the state dict to the file.
        """
        # Create a state dict of the trainable parameters
        state_dict = {name: param for name, param in self.named_parameters() if param.requires_grad}

        # add additional parameters to state dict
        state_dict['task'] = self.task
        state_dict['d_model'] = self.d_model
        state_dict['n_layers'] = self.n_layers
        state_dict['n_heads'] = self.n_heads
        state_dict['pad_idx'] = self.pad_idx
        state_dict['device'] = self.device
        state_dict['in_features'] = self.in_features
        state_dict['d_ffn'] = self.d_ffn
        state_dict['lora_rank'] = self.lora_rank
        state_dict['dropout_rate'] = self.dropout
        state_dict['max_length'] = self.max_length
        state_dict['num_classes'] = self.num_classes

        if lora_filepath is not None:
            # Convert string to pathlib.Path if necessary
            if isinstance(lora_filepath, str):
                lora_filepath = Path(lora_filepath)

            # Save the state dict to the specified file
            torch.save(state_dict, lora_filepath)
        else:
            # Return the state dict if no file path was provided
            return state_dict


    @staticmethod
    def load_lora_state_dict(lora_parameters: Union[str, Path, Dict] = None):
        """
        Load a state dict into the model from a specified file path or a state dict directly.
        This is a staticmethod to be called on the class itself; it returns a fully initialized model with the LoRA weights loaded.

        Parameters
        ----------
        lora_parameters : Union[str, Path, Dict]
            Either the file path to the state dict (can be a string or pathlib.Path) or the state dict itself. If a file path
            is provided, the function will load the state dict from the file. If a state dict is provided directly, the function
            will use it as is.

        Returns
        -------
        LoRAWrapper object, initialized and with the LoRA weights loaded.
        """
        # Check if a filepath or state dict was provided
        if lora_parameters is not None:
            # Convert string to pathlib.Path if necessary
            if isinstance(lora_parameters, str):
                lora_parameters = Path(lora_parameters)

            # If the provided object is a Path, load the state dict from file
            if isinstance(lora_parameters, Path):
                state_dict = torch.load(lora_parameters, weights_only=True)
            else:
                # If it's not a Path, assume it's a state dict
                state_dict = lora_parameters
        else:
            raise ValueError("No filepath or state dict provided")

        encoder = Encoder(state_dict['d_model'], state_dict['n_layers'], state_dict['n_heads'], state_dict['d_ffn'], state_dict['dropout_rate'])
        pos_enc = PositionalEncoding(state_dict['d_model'], state_dict['dropout_rate'], state_dict['max_length'])

        print("\nLoad trained LoRA weights...")
        instance = LoRAWrapper(task=state_dict['task'], encoder=encoder,
                               pos_enc=pos_enc, d_model=state_dict['d_model'],
                               n_layers=state_dict['n_layers'], n_heads=state_dict['n_heads'],
                               pad_idx=state_dict['pad_idx'], device=state_dict['device'],
                               in_features=state_dict['in_features'], d_ffn=state_dict['d_ffn'],
                               dropout_rate=state_dict['dropout_rate'], lora_rank = state_dict['lora_rank'],
                               max_length=state_dict['max_length'], num_classes=state_dict['num_classes'])

        # Load the state dict into the model
        print(f"Loading LoRA state dict from {lora_parameters}...")
        instance.load_state_dict(state_dict, strict=False)

        return instance

## Make the LoRA model:

Here, we define a function `make_lora_model()` to instantiate a Transformer model with LoRA based on our health foundation model.
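
To get a feel for why LoRA is parameter-efficient at the defaults used by `make_lora_model()` (`d_model=512`, `lora_rank=8`), the sketch below compares the size of one full projection matrix with the size of its two low-rank factors. It assumes a square `d_model × d_model` projection and factors of shape `(d_out, rank)` and `(rank, d_in)`; `lora_param_count` is a hypothetical helper, not a function from `utils.py`.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Number of parameters in the two LoRA factors: B is (d_out, rank), A is (rank, d_in)."""
    return rank * (d_in + d_out)


d_model, rank = 512, 8
full = d_model * d_model                          # weights in one full projection matrix
extra = lora_param_count(d_model, d_model, rank)  # weights added by LoRA for that matrix
ratio = extra / full

print(f"full: {full}, LoRA: {extra}, ratio: {ratio:.4f}")
```

Under these assumptions, each adapted projection trains roughly 3% as many weights as the frozen matrix it modifies, not counting biases, layer norms, or the classifier head.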

In [None]:
def make_lora_model(task: str, device, d_model: int = 512,
                    n_layers: int = 3, n_heads: int = 8, d_ffn: int = 2048,
                    dropout: float = 0.1, in_features: int = 4485,
                    pad_idx: int = 100, lora_rank: int = 8, max_length: int = 20):
        """
        Construct a LoRA model from the provided parameters.

        Args:
          task:         type of task to configure the model for: {'diabdeep', 'mhdeep'}
          device:       device used for training
          d_model:      dimension of embeddings
          n_layers:     number of encoder layers
          n_heads:      number of attention heads
          d_ffn:        dimension of feed-forward network
          dropout:      probability of dropout occurring
          in_features:  input feature dimensionality
          pad_idx:      padding index
          lora_rank:    rank of the LoRA matrices
          max_length:   max input sequence length

        Returns:
          model:        the LoRA model
        """
        # create the encoder
        encoder = Encoder(d_model, n_layers, n_heads, d_ffn, dropout)

        # create a positional encoding matrix
        pos_enc = PositionalEncoding(d_model, dropout, max_length)

        # create the LoRA model
        model = LoRAWrapper(task, encoder, pos_enc, d_model, n_layers, n_heads,
                            pad_idx, device, in_features=in_features, d_ffn=d_ffn,
                            dropout_rate=dropout, lora_rank=lora_rank, max_length=max_length)
        return model

## Load data from the DiabDeep dataset and create the datasets and dataloaders:

The dataset has been processed into the training, validation, and test sets for you. You just need to load the corresponding experimental data.

First, we use NumPy to load the experimental data and convert it into torch tensors. Then, we create the training, validation, and test sets using the `CustomDataset` class. Finally, we instantiate dataloaders with PyTorch's [DataLoader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).
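
The NumPy-to-DataLoader pipeline described above can be sketched on synthetic data. The array shapes, the random data, and `ToyDataset` are assumptions for illustration only; in the exercise you load the real DiabDeep files and use the notebook's `CustomDataset`.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical minimal dataset that pairs feature rows with labels."""
    def __init__(self, x: torch.Tensor, y: torch.Tensor):
        self.x, self.y = x, y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


# synthetic stand-ins for arrays that would normally come from np.load(...)
rng = np.random.default_rng(0)
x_np = rng.normal(size=(300, 20)).astype(np.float32)
y_np = rng.integers(0, 3, size=300)

# NumPy arrays -> torch tensors (class labels as int64 for CrossEntropyLoss)
x = torch.from_numpy(x_np)
y = torch.from_numpy(y_np).long()

loader = DataLoader(ToyDataset(x, y), batch_size=128, shuffle=True)
batch_shapes = [tuple(xb.shape) for xb, _ in loader]
print(batch_shapes)
```

With 300 samples and a batch size of 128, the loader yields two full batches and one smaller final batch.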

In [None]:
# unzip the DiabDeep datasets
!unzip '/content/drive/Shareddrives/ECE477 datasets/Assignment11/diabdeep_data.zip' -d diabdeep_data

# check out the data
print('\n', os.listdir('diabdeep_data'))

# load the data and transform them to torch tensor
"""TO DO"""
x_train =
y_train =
x_valid =
y_valid =
x_test  =
y_test  =

print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
print('x_valid shape:', x_valid.shape, 'y_valid shape:', y_valid.shape)
print('x_test shape:', x_test.shape, 'y_test shape:', y_test.shape)

train_batch_size = 128    # batch size for the training set
test_batch_size = 128     # batch size for the validation and test sets

# instantiate the datasets with CustomDataset
"""TO DO"""
train_set =
valid_set =
test_set =
# instantiate the dataloaders, set shuffle=True
"""TO DO"""
train_loader =
valid_loader =
test_loader =

## Continually fine-tuning the health foundation model with LoRA: (2 pt)

Here, we fine-tune our health foundation model for the DiabDeep detection task with LoRA. The functions used here, `train()`, `evaluate()`, `test()`, and `report()`, are defined in `utils.py` in the shared folder.
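
As a rough picture of what one fine-tuning loop involves, the sketch below trains only the unfrozen part of a toy model, with gradient-norm clipping before each optimizer step. It is not the implementation in `utils.py`; the toy model, the synthetic data, the learning rate, and the choice to pass only trainable parameters to Adam are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy model: pretend the first layer is the frozen pre-trained backbone
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3))
for p in model[0].parameters():
    p.requires_grad = False

# synthetic batch with 3 classes
x = torch.randn(64, 10)
y = torch.randint(0, 3, (64,))

# hand only the trainable parameters to the optimizer
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-2)
criterion = nn.CrossEntropyLoss()
clip = 1.0

frozen_before = model[0].weight.clone()
losses = []
for _ in range(20):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # rescale gradients so their total norm is at most `clip`
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The frozen layer keeps its weights throughout, which is the same property the LoRA wrapper depends on: freezing has to happen before the optimizer is created.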

In [None]:
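# Set a fixed random seed for reproducibility
import random  # required by random.seed(); harmless if utils already exports it
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
# `device` is a torch.device, so compare its type rather than the object itself
if device.type == 'cuda':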
  torch.cuda.manual_seed(SEED)
  torch.cuda.manual_seed_all(SEED)
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False

task = 'diabdeep'
lora_rank = 8
num_classes = 3
epochs = 30

# use make_lora_model() to instantiate a Transformer model to be fine-tuned with LoRA
"""TO DO"""
model =
# cast the model to `device`
"""TO DO"""


# initialize the Adam optimizer
"""TO DO"""
optimizer =
# set the loss criterion as the CrossEntropyLoss()
"""TO DO"""
criterion =

CLIP = 1

best_valid_acc = float('-inf')

# loop through each epoch
for epoch in range(epochs):
    # use train() to calculate the train loss and update the parameters
    train_loss = train(model, train_loader, optimizer, criterion, CLIP)
    # use evaluate() to calculate the train accuracy
    _, train_acc = evaluate(model, train_loader, criterion, PAD_NUM)

    # use evaluate() to calculate the loss and accuracy on the validation set
    valid_loss, valid_acc = evaluate(model, valid_loader, criterion, PAD_NUM)

    # save the model when it matches or beats the best validation accuracy so far
    if valid_acc >= best_valid_acc:
        best_valid_acc = valid_acc
        model.save_lora_state_dict(f"comfort_{task}.pt")

    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc:.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc:.3f}')

# load the weights
model = LoRAWrapper.load_lora_state_dict(f'comfort_{task}.pt').to(device)

# use evaluate() to calculate the loss on the test set
test_loss, _ = evaluate(model, test_loader, criterion, PAD_NUM)
# use test() to calculate the test accuracy
test_acc = test(model, test_loader, PAD_NUM)

print(f'\nTask: {task}')
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc:.3f}\n')
report(model, task, device, x_test, y_test)