# Sentence Transformers & Multi-Task Learning Take Home Exercise
---

# Task 1: Sentence Transformer Implementation

Hi! So glad you’re here to check out my work. I really enjoyed putting this model together. Let’s talk about the choices I made before we get to the code!

---

# 1 Introduction

---

## 1.1 Why BERT Was My Choice

I picked **BERT** as the core of my model because it’s a strong starting point. Here’s why:

- **It fits the downstream tasks**:  
  I needed a model that could handle sentence classification and named entity recognition. BERT provides a `[CLS]` token embedding (or a mean of the token embeddings) for sentence-level tasks and per-token embeddings for token-level tasks like NER. 
  

## 1.2 Model Architecture Choices Outside the Transformer Backbone

While BERT provides a solid foundation, several modifications were made to tailor it to this specific task.

### **1.2.1 Three Embedding Methods**  
I implemented three ways to extract embeddings:  

1. **CLS token embedding** – Uses the `[CLS]` token for sentence representation.  
2. **Mean pooling** – Averages all token embeddings to capture overall sentence meaning.  
3. **Per-token embeddings** – Provides embeddings for each token (important for NER).  

### **1.2.2 Attention Masking**  
- Sentences in a batch vary in length, so **padding is needed**.  
- Without masking, **padding tokens could distort embeddings**.  
- **Attention masks** were used to ensure padding tokens were ignored in calculations.  
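
Masking also matters when pooling. A minimal sketch, assuming PyTorch and a hypothetical helper `masked_mean_pool` (it is not part of the model code in this notebook), of averaging only over real tokens and how that differs from a plain mean:

```python
import torch

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real tokens only (hypothetical helper).

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # padding contributes 0
    counts = mask.sum(dim=1).clamp(min=1.0)                         # avoid division by zero
    return summed / counts

# Toy batch: the second sequence has one padding position filled with large values.
embeddings = torch.tensor([
    [[1.0, 1.0], [3.0, 3.0]],
    [[2.0, 2.0], [100.0, 100.0]],
])
attention_mask = torch.tensor([[1, 1], [1, 0]])

pooled = masked_mean_pool(embeddings, attention_mask)  # ignores the padded position
plain = embeddings.mean(dim=1)                         # padded position skews row 2
```

Note that the `SentenceTransformer.forward` in Section 2.4 uses a plain `x.mean(dim=1)`, so padded positions still count toward the mean there; a masked version like this one would be the variant to use if padding should be excluded from pooling as well.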

### **1.2.3 Tokenization Strategy**  
- I used the **WordPiece** tokenizer to stay consistent with BERT’s pre-trained vocabulary.  

---

## **1.3 Validation and Testing**  

To confirm that my model was working as intended, I followed two key steps.  

### **1.3.1 Matching My Model with Pre-Trained BERT**  
Since I copied the weights from `bert-base-uncased`, I ran the same input text through both my model and the pre-trained model and compared the outputs. If they match (up to floating-point error), the implementation is correct.  

### **1.3.2 Cosine Similarity Testing**  
- **Three sentences were tested:**  
  - Two similar sentences.  
  - One unrelated sentence.  
- **Expected results:**  
  - **Higher similarity** for related sentences.  
  - **Lower similarity** for unrelated sentences.  
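
To make the expectation concrete, here is a small worked example of cosine similarity in plain Python. The three 3-dimensional vectors are made up for illustration and are not real BERT embeddings; the helper name `cosine_similarity` is also just for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": two vectors pointing in a similar direction, one pointing elsewhere.
apples = [0.9, 0.1, 0.2]
oranges = [0.8, 0.2, 0.1]
app = [0.1, 0.9, 0.3]

sim_related = cosine_similarity(apples, oranges)  # close to 1
sim_unrelated = cosine_similarity(apples, app)    # much lower
```

The test passes when the related pair scores higher than the unrelated pair, which is the only property the check in this section relies on.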

---

## **1.4 Summary of Key Decisions**  

### **Model Selection**  
- Chose **BERT (`bert-base-uncased`)** for its strong performance in **classification and NER**.  
- Used a **pre-trained tokenizer** to ensure consistency with BERT’s input expectations.  

### **Architectural Adjustments**  
- **Supported mean pooling** as an alternative to `[CLS]` for sentence representations.  
- **Implemented attention masking** to exclude padding tokens.  
- **Kept embedding normalization outside the model** (it is applied only in the similarity demo) so that raw embeddings remain available for different tasks.  

### **Validation Approach**  
- **Compared embeddings** with pre-trained BERT to ensure correct weight initialization.  
- **Ran similarity tests** to verify that embeddings captured semantic relationships correctly.  

---

# 2 Implementation of SentenceTransformer Using BERT


### **Key Features**
- Uses **BERT (`bert-base-uncased`)** as a backbone.
- Supports **three types of embeddings**:
  - `[CLS]` token embedding for classification.
  - **Mean pooling** for sentence representation.
  - **Full token embeddings** for NER and other sequence-based tasks.
- Implements a **custom Transformer model** with weights copied from a pre-trained BERT model.
- **Validates the model** by comparing its outputs with those from `bert-base-uncased`.

## **2.1. Import Libraries**
We first import necessary libraries for building and testing our model.

In [28]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

## **2.2 Define Multi-Head Self-Attention**
This module implements **multi-head self-attention**, a core component of Transformers.  
It enables the model to **focus on different parts of a sentence simultaneously**.

In [29]:
# Define Multi-Head Self-Attention Layer
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_size, num_heads):
        super(MultiHeadSelfAttention, self).__init__()
        assert embed_size % num_heads == 0, "Embedding size must be divisible by number of heads"
        
        self.num_heads = num_heads
        self.head_dim = embed_size // num_heads

        self.query_proj = nn.Linear(embed_size, embed_size)  
        self.key_proj = nn.Linear(embed_size, embed_size)    
        self.value_proj = nn.Linear(embed_size, embed_size)  
        self.output_proj = nn.Linear(embed_size, embed_size) 

    def forward(self, queries, keys, values, mask=None):
        batch_size = queries.shape[0]

        Q = self.query_proj(queries)
        K = self.key_proj(keys)
        V = self.value_proj(values)

        Q = Q.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        scaling_factor = torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32))
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / scaling_factor

        if mask is not None:
            attention_mask_add = mask * -10000.0  
            attention_scores = attention_scores + attention_mask_add

        attention_weights = torch.softmax(attention_scores, dim=-1)
        attention_output = torch.matmul(attention_weights, V)
        attention_output = attention_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.head_dim)
        
        return self.output_proj(attention_output)
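
The additive mask in `forward` works because softmax turns a very negative score into a weight of (effectively) zero. A small worked example in plain Python, kept free of PyTorch so the arithmetic is visible; the scores and the `softmax` helper are made up for illustration, while `-10000.0` is the same constant used above:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]   # raw attention scores for three key positions
is_padding = [0, 0, 1]     # 1 marks a padding position, as in (attention_mask == 0)

# Add -10000.0 to padded positions, mirroring `mask * -10000.0` above.
masked_scores = [s + p * -10000.0 for s, p in zip(scores, is_padding)]
weights = softmax(masked_scores)
```

The padded position receives essentially no attention weight, and the remaining weights are renormalized over the real tokens only.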

## **2.3 Define Transformer Encoder Layer**
Each **encoder layer** contains:
- **Multi-head self-attention**
- **Layer normalization**
- **Feed-forward neural network**
- **Dropout for regularization**

In [30]:
class Encoder(nn.Module):
    def __init__(self, embed_size, num_heads, hidden_dim, dropout=0.1, layer_norm_eps=1e-12):
        super(Encoder, self).__init__()
        self.self_attention = MultiHeadSelfAttention(embed_size, num_heads)
        self.norm1 = nn.LayerNorm(embed_size, eps=layer_norm_eps)
        self.norm2 = nn.LayerNorm(embed_size, eps=layer_norm_eps)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_output = self.self_attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
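
The `Encoder.forward` above follows the post-layer-norm pattern `norm(x + sublayer(x))`. A minimal sketch of that pattern on a single vector in plain Python, assuming a layer norm with no learnable scale or bias and with dropout left out; the function names and the toy sublayer are hypothetical:

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit (biased) variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_sublayer(x, sublayer):
    """Residual connection followed by layer norm: norm(x + sublayer(x))."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])

x = [1.0, 2.0, 3.0, 4.0]
# Toy sublayer standing in for attention or the feed-forward network.
y = post_ln_sublayer(x, lambda v: [0.5 * t for t in v])
```

Whatever the sublayer returns, the output of each block is re-centered and re-scaled, which is why the copied `LayerNorm` weights and the matching `eps` matter for reproducing BERT's outputs.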

## **2.4 Define Sentence Transformer Model**
This custom transformer:
- Supports **CLS, Mean Pooling, and Token Embeddings**.
- Uses **BERT-like embeddings** for token, position, and token type.
- Stacks multiple **Transformer Encoder layers**.

In [31]:
class SentenceTransformer(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, hidden_dim, num_layers, max_length, 
                 pooling="cls", output_mode="sentence", layer_norm_eps=1e-12):
        super(SentenceTransformer, self).__init__()
        self.embed_size = embed_size
        self.pooling = pooling          # "cls" or "mean" for sentence embeddings
        self.output_mode = output_mode  # "sentence" or "token"

        self.token_embeddings = nn.Embedding(vocab_size, embed_size)
        self.position_embeddings = nn.Embedding(max_length, embed_size)
        self.token_type_embeddings = nn.Embedding(2, embed_size)

        self.encoder_layers = nn.ModuleList([
            Encoder(embed_size, num_heads, hidden_dim, layer_norm_eps=layer_norm_eps) 
            for _ in range(num_layers)
        ])
        
        self.embed_layer_norm = nn.LayerNorm(embed_size, eps=layer_norm_eps)
        self.embed_dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        batch_size, seq_len = input_ids.shape

        token_embed = self.token_embeddings(input_ids)
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand(batch_size, seq_len)
        position_embed = self.position_embeddings(position_ids)

        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        token_type_embed = self.token_type_embeddings(token_type_ids)

        x = token_embed + position_embed + token_type_embed
        x = self.embed_layer_norm(x)
        x = self.embed_dropout(x)
        
        if attention_mask is not None:
            mask = (attention_mask == 0).unsqueeze(1).unsqueeze(2).to(x.device)
        else:
            mask = None

        
        for layer in self.encoder_layers:
            x = layer(x, mask)
            
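        if self.output_mode == "sentence":
            if self.pooling == "cls":
                return x[:, 0, :]  # [CLS] token embedding
            elif self.pooling == "mean":
                return x.mean(dim=1)  # Unmasked mean over all positions
            raise ValueError(f"Unsupported pooling: {self.pooling}. Use 'cls' or 'mean'.")
        elif self.output_mode == "token":
            return x  # Per-token embeddings
        raise ValueError(f"Unsupported output_mode: {self.output_mode}. Use 'sentence' or 'token'.")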

## **2.5 Function to Initialize Custom Model**
This function:
- **Extracts hyperparameters** from the pre-trained BERT model.
- **Initializes a custom Transformer model** with the same architecture.
- **Copies weights** from BERT to the custom model for a head start.

In [32]:
def initialize_custom_model(pretrained_model, model_type="sentence", pooling="cls", output_mode="sentence",
                            num_classes=2, num_entity_labels=17):
    """
    Initialize a custom model (either SentenceTransformer or MultiTaskSentenceTransformer) and copy weights 
    from a pretrained BERT model.
    """

    # **Function to Copy Weights from Pretrained BERT Model**
    def copy_pretrained_weights(custom_model, pretrained_model):
        """Copy weights from a pretrained BERT model to the custom SentenceTransformer or MultiTaskSentenceTransformer."""
        # Copy embedding weights
        custom_model.token_embeddings.weight.data.copy_(pretrained_model.embeddings.word_embeddings.weight.data)
        custom_model.position_embeddings.weight.data.copy_(pretrained_model.embeddings.position_embeddings.weight.data)
        custom_model.token_type_embeddings.weight.data.copy_(
            pretrained_model.embeddings.token_type_embeddings.weight.data)
        custom_model.embed_layer_norm.weight.data.copy_(pretrained_model.embeddings.LayerNorm.weight.data)
        custom_model.embed_layer_norm.bias.data.copy_(pretrained_model.embeddings.LayerNorm.bias.data)

        # Copy weights for each encoder layer
        for custom_layer, pretrained_layer in zip(custom_model.encoder_layers, pretrained_model.encoder.layer):
            # Self-attention projections
            custom_layer.self_attention.query_proj.weight.data.copy_(pretrained_layer.attention.self.query.weight.data)
            custom_layer.self_attention.query_proj.bias.data.copy_(pretrained_layer.attention.self.query.bias.data)
            custom_layer.self_attention.key_proj.weight.data.copy_(pretrained_layer.attention.self.key.weight.data)
            custom_layer.self_attention.key_proj.bias.data.copy_(pretrained_layer.attention.self.key.bias.data)
            custom_layer.self_attention.value_proj.weight.data.copy_(pretrained_layer.attention.self.value.weight.data)
            custom_layer.self_attention.value_proj.bias.data.copy_(pretrained_layer.attention.self.value.bias.data)
            custom_layer.self_attention.output_proj.weight.data.copy_(
                pretrained_layer.attention.output.dense.weight.data)
            custom_layer.self_attention.output_proj.bias.data.copy_(pretrained_layer.attention.output.dense.bias.data)

            # Feed-forward network
            custom_layer.feed_forward[0].weight.data.copy_(pretrained_layer.intermediate.dense.weight.data)
            custom_layer.feed_forward[0].bias.data.copy_(pretrained_layer.intermediate.dense.bias.data)
            custom_layer.feed_forward[2].weight.data.copy_(pretrained_layer.output.dense.weight.data)
            custom_layer.feed_forward[2].bias.data.copy_(pretrained_layer.output.dense.bias.data)

            # Layer normalization
            custom_layer.norm1.weight.data.copy_(pretrained_layer.attention.output.LayerNorm.weight.data)
            custom_layer.norm1.bias.data.copy_(pretrained_layer.attention.output.LayerNorm.bias.data)
            custom_layer.norm2.weight.data.copy_(pretrained_layer.output.LayerNorm.weight.data)
            custom_layer.norm2.bias.data.copy_(pretrained_layer.output.LayerNorm.bias.data)
        print("Weights successfully copied from pretrained BERT to custom model!")

    # **Extract Hyperparameters from Pretrained Model**
    config = pretrained_model.config
    vocab_size = config.vocab_size
    embed_size = config.hidden_size
    num_heads = config.num_attention_heads
    hidden_dim = config.intermediate_size
    num_layers = config.num_hidden_layers
    max_length = config.max_position_embeddings
    layer_norm_eps = config.layer_norm_eps  # Use BERT's epsilon

    print(f"Vocab Size: {vocab_size}")
    print(f"Embedding Size: {embed_size}")
    print(f"Number of Heads: {num_heads}")
    print(f"Hidden Dimension: {hidden_dim}")
    print(f"Number of Layers: {num_layers}")
    print(f"Max Position Embeddings: {max_length}")
    print(f"Layer Norm Epsilon: {layer_norm_eps}")

    # **Initialize Custom Model Based on Model Type**: "sentence" is for Task 1, "multitask" is for Task 2
    if model_type == "sentence":
        custom_model = SentenceTransformer(
            vocab_size=vocab_size,
            embed_size=embed_size,
            num_heads=num_heads,
            hidden_dim=hidden_dim,
            num_layers=num_layers,
            max_length=max_length,
            pooling=pooling,
            output_mode=output_mode,
            layer_norm_eps=layer_norm_eps
        )
    elif model_type == "multitask":
        custom_model = MultiTaskSentenceTransformer(
            vocab_size=vocab_size,
            embed_size=embed_size,
            num_heads=num_heads,
            hidden_dim=hidden_dim,
            num_layers=num_layers,
            max_length=max_length,
            num_classes=num_classes,
            num_entity_labels=num_entity_labels,
            pooling=pooling,
            layer_norm_eps=layer_norm_eps
        )
    else:
        raise ValueError(f"Unsupported model_type: {model_type}. Use 'sentence' or 'multitask'.")

    # **Copy Weights from Pretrained Model**
    copy_pretrained_weights(custom_model, pretrained_model)

    print("\nCustom model initialized and weights copied successfully.")

    return custom_model


## **2.6 Sentence Embeddings and Similarity Calculation**
This function:
- Generates sentence embeddings using **CLS, Mean Pooling, and Full Token Embeddings**.
- Computes **cosine similarity** to check if semantically similar sentences are close.

In [33]:
def demo_emb_predicting(custom_model, pretrained_model, tokenizer, sentences):
    """
    Demonstrates embedding extraction from both a custom model and a pretrained BERT model.
    Compares CLS, mean, and token embeddings using cosine similarity.
    """
    tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        # Extract CLS embeddings
        custom_model.output_mode = "sentence"
        custom_model.pooling = "cls"
        custom_cls_embeddings = custom_model(tokens["input_ids"], tokens["attention_mask"], tokens["token_type_ids"])
        bert_outputs = pretrained_model(**tokens)
        bert_cls_embeddings = bert_outputs.last_hidden_state[:, 0, :]

        # Extract mean embeddings
        custom_model.pooling = "mean"
        custom_mean_embeddings = custom_model(tokens["input_ids"], tokens["attention_mask"], tokens["token_type_ids"])
        bert_token_embeddings = bert_outputs.last_hidden_state
        bert_mean_embeddings = bert_token_embeddings.mean(dim=1)

        # Extract token embeddings (full sequence)
        custom_model.output_mode = "token"
        custom_token_embeddings = custom_model(tokens["input_ids"], tokens["attention_mask"], tokens["token_type_ids"])
        bert_token_embeddings = bert_outputs.last_hidden_state

    # Normalize embeddings for cosine similarity comparison
    custom_cls_embeddings = F.normalize(custom_cls_embeddings, p=2, dim=-1)
    custom_mean_embeddings = F.normalize(custom_mean_embeddings, p=2, dim=-1)
    bert_cls_embeddings = F.normalize(bert_cls_embeddings, p=2, dim=-1)
    bert_mean_embeddings = F.normalize(bert_mean_embeddings, p=2, dim=-1)

    def compute_similarity(emb1, emb2):
        """Computes cosine similarity between two embeddings."""
        return F.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()

    # Compute similarities for CLS and mean embeddings
    cosine_sim_custom_cls_1_2 = compute_similarity(custom_cls_embeddings[0], custom_cls_embeddings[1])
    cosine_sim_custom_cls_2_3 = compute_similarity(custom_cls_embeddings[1], custom_cls_embeddings[2])
    cosine_sim_custom_mean_1_2 = compute_similarity(custom_mean_embeddings[0], custom_mean_embeddings[1])
    cosine_sim_custom_mean_2_3 = compute_similarity(custom_mean_embeddings[1], custom_mean_embeddings[2])

    cosine_sim_bert_cls_1_2 = compute_similarity(bert_cls_embeddings[0], bert_cls_embeddings[1])
    cosine_sim_bert_cls_2_3 = compute_similarity(bert_cls_embeddings[1], bert_cls_embeddings[2])
    cosine_sim_bert_mean_1_2 = compute_similarity(bert_mean_embeddings[0], bert_mean_embeddings[1])
    cosine_sim_bert_mean_2_3 = compute_similarity(bert_mean_embeddings[1], bert_mean_embeddings[2])

    # Display embedding demonstrations
    print("\n **Embedding Demonstrations**")

    # Display CLS embeddings
    print("\n--- CLS Embeddings ---")
    for i, sentence in enumerate(sentences):
        print(f"Sentence {i + 1}: \"{sentence}\"")
        print(f"Custom Model CLS Embedding: {custom_cls_embeddings[i].tolist()[:5]} ...")  # Print first 5 dimensions
        print(f"Pretrained BERT CLS Embedding: {bert_cls_embeddings[i].tolist()[:5]} ...\n")

    # Display mean embeddings
    print("\n--- Mean Pooling Embeddings ---")
    for i, sentence in enumerate(sentences):
        print(f"Sentence {i + 1}: \"{sentence}\"")
        print(f"Custom Model Mean Embedding: {custom_mean_embeddings[i].tolist()[:5]} ...")
        print(f"Pretrained BERT Mean Embedding: {bert_mean_embeddings[i].tolist()[:5]} ...\n")

    # Display cosine similarity results
    print("\n **Cosine Similarity Results**")
    print("\n--- Custom Model ---")
    print(f"Custom Model CLS Similarity (Sentence 1 & 2): {cosine_sim_custom_cls_1_2:.4f}")
    print(f"Custom Model CLS Similarity (Sentence 2 & 3): {cosine_sim_custom_cls_2_3:.4f}")
    print(f"Custom Model Mean Similarity (Sentence 1 & 2): {cosine_sim_custom_mean_1_2:.4f}")
    print(f"Custom Model Mean Similarity (Sentence 2 & 3): {cosine_sim_custom_mean_2_3:.4f}")

    print("\n--- Pretrained BERT ---")
    print(f"Pretrained BERT CLS Similarity (Sentence 1 & 2): {cosine_sim_bert_cls_1_2:.4f}")
    print(f"Pretrained BERT CLS Similarity (Sentence 2 & 3): {cosine_sim_bert_cls_2_3:.4f}")
    print(f"Pretrained BERT Mean Similarity (Sentence 1 & 2): {cosine_sim_bert_mean_1_2:.4f}")
    print(f"Pretrained BERT Mean Similarity (Sentence 2 & 3): {cosine_sim_bert_mean_2_3:.4f}")

    # Display token embeddings
    print("\n--- Token (Full Sequence) Embeddings ---")
    print(f"Custom Model Token Embeddings Shape: {custom_token_embeddings.shape}")
    print(f"Pretrained BERT Token Embeddings Shape: {bert_token_embeddings.shape}")
    print(
        f"\nCustom Model Token Embedding (First 5 Tokens): {custom_token_embeddings[:, :5, :]} ...")
    print(
        f"Pretrained BERT Token Embedding (First 5 Tokens): {bert_token_embeddings[:, :5, :]} ...")

## **2.7 Main Execution**
1. Load the **pretrained BERT model** and tokenizer.
2. Initialize the **custom sentence transformer**.
3. **Compare embeddings and similarities** using sample sentences.

In [34]:
pretrained_model = BertModel.from_pretrained("bert-base-uncased")
custom_model = initialize_custom_model(pretrained_model)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
custom_model.eval()
pretrained_model.eval()
sentences = [
    "Fetch Rewards is a great app!",
    "I love eating apples.",
    "I do enjoy oranges!"
]
demo_emb_predicting(custom_model, pretrained_model, tokenizer, sentences)

Vocab Size: 30522
Embedding Size: 768
Number of Heads: 12
Hidden Dimension: 3072
Number of Layers: 12
Max Position Embeddings: 512
Layer Norm Epsilon: 1e-12
Weights successfully copied from pretrained BERT to custom model!

Custom model initialized and weights copied successfully.

 **Embedding Demonstrations**

--- CLS Embeddings ---
Sentence 1: "Fetch Rewards is a great app!"
Custom Model CLS Embedding: [0.009375251829624176, -0.003474772907793522, 0.0010710387723520398, -0.0010895710438489914, -0.035332974046468735] ...
Pretrained BERT CLS Embedding: [0.009375232271850109, -0.003474779427051544, 0.0010710282949730754, -0.0010895652230829, -0.035332970321178436] ...

Sentence 2: "I love eating apples."
Custom Model CLS Embedding: [0.008211217820644379, 0.016353486105799675, -0.030486492440104485, -0.028192920610308647, -0.03595130145549774] ...
Pretrained BERT CLS Embedding: [0.008211224339902401, 0.016353493556380272, -0.030486511066555977, -0.028192922472953796, -0.0359513089060783

## 2.8 Analysis of Results
The output compares embeddings and cosine similarities from the custom model with those from the pretrained BERT model for three sentences.

- **Sentence embeddings**: CLS and mean pooling embeddings are nearly identical (e.g., `0.0093752518` vs. `0.0093752322`), which confirms that the weights were transferred correctly and that the forward pass matches BERT’s.
- **Cosine similarity**: Scores agree between the two models (e.g., 0.9466 for Sentences 2 and 3). Sentences 2 and 3 share a context (both are about enjoying fruit), so they score higher than the unrelated pair.
- **Token embeddings**: Shapes match (`[3, 9, 768]`) and values agree closely (e.g., `1.3798e-05` vs. `1.3221e-05`).

Together, these results show that the custom model reproduces BERT’s outputs, which makes it a reliable foundation for downstream NLP tasks such as classification and NER.
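
The "near-identical" claim can be made concrete with a tolerance check. A minimal sketch using the first five CLS values printed above for Sentence 1; the `1e-6` tolerance and the helper name `max_abs_diff` are assumptions chosen for illustration:

```python
# First five CLS dimensions for Sentence 1, copied from the output above.
custom_cls = [0.009375251829624176, -0.003474772907793522, 0.0010710387723520398,
              -0.0010895710438489914, -0.035332974046468735]
bert_cls = [0.009375232271850109, -0.003474779427051544, 0.0010710282949730754,
            -0.0010895652230829, -0.035332970321178436]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))

diff = max_abs_diff(custom_cls, bert_cls)
matches = diff < 1e-6  # assumed tolerance for float32 round-off
```

The differences are on the order of `1e-8`, which is consistent with float32 rounding rather than a modeling discrepancy. On full tensors, `torch.allclose(a, b, atol=1e-6)` performs a similar check.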

## 2.9 Summary

This implementation is a robust and versatile sentence transformer setup based on the BERT architecture:
- **Data Handling**: Uses the pre-trained `BertTokenizerFast` to process input sentences consistently with `bert-base-uncased`, ensuring compatibility with the model's expectations.
- **Model**: A custom `SentenceTransformer` with a BERT-like structure, incorporating multi-head self-attention, transformer encoder layers, and support for CLS token, mean pooling, and per-token embeddings, initialized with weights copied from the pre-trained BERT model.
- **Validation**: Includes a comprehensive embedding comparison with the pre-trained BERT model and cosine similarity testing on sample sentences to verify correct implementation and semantic alignment.
- **Usage**: Demonstrates the initialization process, embedding generation across different pooling strategies, and similarity analysis, providing a solid foundation for downstream tasks like classification and NER.

# Task 2: Multi-Task Transformer Implementation

Hi there! I’m thrilled to have you here to explore my multi-task transformer model. 

# 1 Introduction

---

## 1.1 What This Task Is About

For Task 2, I built a model that tackles two distinct NLP tasks: **sentence classification** and **Named Entity Recognition (NER)**. The idea is to use a single transformer model to handle both, sharing knowledge across tasks to boost efficiency and performance. Here’s a quick rundown of the tasks:

- **Sentence Classification**: Predicts a label for an entire sentence—like deciding if a movie review is "positive" or "negative." It’s a sentence-level task.
- **Named Entity Recognition (NER)**: Identifies and classifies entities (e.g., people, locations) in a sentence at the token level—like tagging "New York" as a location.

Multi-task learning is perfect here because it lets the model learn shared linguistic patterns while customizing outputs for each task.


---

## 1.2 How We’ll Construct the Model

Here’s how I designed the model to handle both tasks simultaneously:

### 1.2.1 Shared Backbone, Task-Specific Heads

The model uses a **shared BERT backbone** to generate contextual embeddings, then splits into two lightweight heads:
- **Classification Head**: A linear layer takes the sentence embedding (from `[CLS]` or mean pooling) and outputs logits for sentiment labels.
- **NER Head**: Another linear layer takes the full sequence of token embeddings and outputs per-token logits for entity tags.

This setup:
- Reuses the transformer’s computations for efficiency.
- Keeps task-specific predictions separate and tailored.

### 1.2.2 Calculating Outputs for Both Tasks

In the forward pass, the model **always computes both outputs**:
- **Classification Logits**: For sentence-level predictions (e.g., positive/negative).
- **NER Logits**: For token-level predictions (e.g., "B-geo", "O").

This dual-output approach simplifies training and inference—I can use whichever output I need based on the task, without recomputing the backbone.
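
The design in 1.2.1 and 1.2.2 can be sketched as follows. This is an illustration rather than the notebook's `MultiTaskSentenceTransformer`: the class name `MultiTaskHeads` is hypothetical, a random tensor stands in for the shared backbone's output, and the sizes (2 classes, 17 entity labels, hidden size 768) mirror the defaults used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Hypothetical sketch: two linear heads on top of shared encoder outputs."""

    def __init__(self, embed_size, num_classes=2, num_entity_labels=17, pooling="cls"):
        super().__init__()
        self.pooling = pooling
        self.classifier = nn.Linear(embed_size, num_classes)       # sentence-level head
        self.ner_head = nn.Linear(embed_size, num_entity_labels)   # token-level head

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, embed_size) from the shared backbone
        if self.pooling == "cls":
            sentence_embedding = hidden_states[:, 0, :]
        else:
            sentence_embedding = hidden_states.mean(dim=1)
        classification_logits = self.classifier(sentence_embedding)  # (batch, num_classes)
        ner_logits = self.ner_head(hidden_states)                    # (batch, seq_len, num_labels)
        return classification_logits, ner_logits

heads = MultiTaskHeads(embed_size=768)
hidden_states = torch.randn(4, 16, 768)  # stand-in for the encoder output
classification_logits, ner_logits = heads(hidden_states)
```

Both outputs come from a single pass over `hidden_states`, so the caller can pick whichever logits the current batch needs without running the backbone twice.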

### 1.2.3 Pooling Options for Classification

For sentence classification, I implemented two pooling strategies:
- **CLS Token Pooling**: Uses the `[CLS]` embedding—BERT’s built-in sentence representation.
- **Mean Pooling**: Averages all token embeddings for a more holistic sentence view.

After experimenting, **mean pooling often outperformed `[CLS]`**, so I made it an optional parameter you can toggle.


## 1.3 Datasets We’re Using

I worked with two datasets to train and test the model:
- **Classification Dataset**: IMDB Movie reviews labeled "positive" or "negative." 
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

- **NER Dataset**: Sentences with token-level entity tags (e.g., "B-geo" for beginning of a geographic entity, "O" for no entity).
https://www.kaggle.com/datasets/rajnathpatel/ner-data


## 1.4 Data Processing

### 1.4.1 Cleaning the Data

- **Classification Data**:
  - Dropped rows with missing `review` or `sentiment`.
  - Cleaned text by stripping HTML tags, special characters, and extra whitespace.
  - Filtered to ensure only "positive" or "negative" labels.

- **NER Data**:
  - Filled missing `Sentence #` values to group words into sentences.
  - Removed rows with missing `Word` or `Tag`.
  - Excluded sentences with invalid tags for consistency.

### 1.4.2 Wrapping Up with Dataset and DataLoader

I created a custom `MultiTaskDataset` class to handle both tasks:
- **Classification**: Tokenizes reviews and maps labels to 0 (negative) or 1 (positive).
- **NER**: Tokenizes word lists, aligns labels with subword tokens using `word_ids`, and assigns `-100` to ignored tokens (padding/subwords).
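
The subword alignment step can be sketched in plain Python. This assumes the common convention of labeling only the first subword of each word; the `word_ids` list below is a hypothetical example of what a fast tokenizer's `word_ids()` returns (`None` for special and padding tokens), and `align_labels` is an illustrative helper, not the code inside `MultiTaskDataset`.

```python
IGNORE_INDEX = -100  # label value skipped by the loss and by the metrics

def align_labels(word_ids, word_label_ids):
    """Map word-level label ids onto subword tokens."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)             # special token or padding
        elif word_id != previous_word:
            aligned.append(word_label_ids[word_id])  # first subword keeps the label
        else:
            aligned.append(IGNORE_INDEX)             # later subwords are ignored
        previous_word = word_id
    return aligned

# Hypothetical case: two words, the second split into three subwords,
# followed by padding. Label ids follow ner_label_map ("O" = 0, "B-geo" = 5).
word_ids = [None, 0, 1, 1, 1, None, None]
word_label_ids = [0, 5]
aligned = align_labels(word_ids, word_label_ids)
```

Only one position per word carries a real label, so a word that is split into several subwords is not counted several times during training or evaluation.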

A custom `collate_fn` batches the data properly for the `DataLoader`, ensuring everything lines up for training.
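
As a rough illustration of what such batching involves, the sketch below pads variable-length examples to a common length. It is an assumption about the general approach, not the notebook's actual `collate_fn`: the helper name `pad_batch`, the dictionary keys, and the token ids are made up, and a real collate function would convert the lists to tensors.

```python
def pad_batch(examples, pad_token_id=0, label_pad_id=-100):
    """Pad input ids, attention masks, and labels to the longest example."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_real = len(ex["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(ex["labels"] + [label_pad_id] * n_pad)  # padding is ignored
    return batch

# Two toy NER examples of different lengths (ids are arbitrary).
examples = [
    {"input_ids": [101, 7592, 102], "labels": [-100, 0, -100]},
    {"input_ids": [101, 7592, 2088, 999, 102], "labels": [-100, 0, 0, 0, -100]},
]
batch = pad_batch(examples)
```

Padding labels with `-100` keeps the padded positions out of the NER loss, in the same way the attention mask keeps them out of attention.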

---

## 1.5 Evaluation Metrics and Demo

To see how well the model performs, I implemented task-specific metrics and a prediction demo:

### 1.5.1 Classification Metrics
- **Accuracy**: Percentage of correctly predicted sentiments.
- **Classification Report**: Precision, recall, and F1-score for "negative" and "positive" classes.
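
For reference, these metrics reduce to simple counts. A minimal plain-Python sketch for the binary case, with made-up labels (1 = positive, 0 = negative) and a hypothetical helper `binary_metrics`; the notebook itself relies on `sklearn.metrics` for the real numbers.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Here four of six predictions are correct, with two true positives, one false positive, and one false negative, so accuracy, precision, recall, and F1 all come out to 2/3.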

### 1.5.2 NER Metrics
- **Entity-Level Metrics**: Precision, recall, and F1-score using `seqeval`, which evaluates NER at the entity level and ignores `-100` labels.
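
Entity-level scoring counts a prediction as correct only when the whole span and its type match. The sketch below is a simplified re-implementation of that idea for a single sentence with IOB2 tags, written for illustration; it is not `seqeval`'s code, and the helper names are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from an IOB2 tag sequence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not inside:
            entities.add((ent_type, start, i - 1))  # close the open entity
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]            # open a new entity
    return entities

def entity_f1(true_tags, pred_tags):
    """Precision, recall, and F1 over exact span-and-type matches."""
    true_entities = extract_entities(true_tags)
    pred_entities = extract_entities(pred_tags)
    tp = len(true_entities & pred_entities)
    precision = tp / len(pred_entities) if pred_entities else 0.0
    recall = tp / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true_tags = ["B-per", "I-per", "O", "B-geo", "O"]
pred_tags = ["B-per", "O", "O", "B-geo", "O"]
precision, recall, f1 = entity_f1(true_tags, pred_tags)
```

In this example four of five tokens are tagged correctly, yet the truncated person span counts as a miss, so entity-level precision, recall, and F1 are each 0.5. That gap is the reason to report entity-level rather than token-level scores for NER.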

### 1.5.3 Prediction Demo
I’ll show how to:
- Run the model on sample inputs for both tasks.
- Compare predicted labels to true labels.
- Calculate and display the metrics above.

This demo will let you see the model in action and verify its performance.

---

## 1.6 Summary of Key Decisions

### 1.6.1 Model Design
- **Shared BERT Backbone**: `bert-base-uncased` for efficiency and pre-trained knowledge.
- **Task Heads**: Separate linear layers for classification and NER.
- **Pooling**: Optional `[CLS]` or mean pooling for classification.

### 1.6.2 Data Handling
- **Cleaning**: Removed invalid entries and normalized text.
- **Tokenization**: Used BERT’s tokenizer with subword alignment for NER.

### 1.6.3 Evaluation
- **Metrics**: Accuracy and F1 for classification, entity-level F1 for NER.
- **Demo**: Predicts labels and calculates metrics for both tasks.

# 2 Implementation of MultiTaskSentenceTransformer Using BERT

## 2.1 Imports
The code begins with necessary imports for data handling, model building, and evaluation:

In [35]:
import random
import re
import os
import pickle
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split
import pandas as pd
from transformers import BertTokenizer, BertTokenizerFast
from sklearn.metrics import accuracy_score, classification_report
from bs4 import BeautifulSoup
from seqeval.metrics import classification_report as seqeval_report
from seqeval.scheme import IOB2

## 2.2 Data Loading and Preparation for NLP Tasks

- Loads and preprocesses XLSX datasets for **classification** (sentiment) and **NER** (entity tagging).
- Cleans text, caches processed data, and prepares PyTorch `DataLoader`s for training/evaluation.


In [36]:
ner_label_map = {
    "O": 0,
    "B-art": 1, "I-art": 2,  # Artifact
    "B-eve": 3, "I-eve": 4,  # Event
    "B-geo": 5, "I-geo": 6,  # Geographical Entity
    "B-gpe": 7, "I-gpe": 8,  # Countries, Cities, States
    "B-nat": 9, "I-nat": 10,  # Natural Phenomena
    "B-org": 11, "I-org": 12,  # Organizations
    "B-per": 13, "I-per": 14,  # Persons
    "B-tim": 15, "I-tim": 16,  # Time Expressions
}

# Directory for storing processed datasets
PROCESSED_DIR = "./data/processed/"


def load_data_from_xlsx(file_path, task):
    """
    Load and preprocess dataset from an XLSX file.

    - Classification task expects 'review' and 'sentiment' columns.
    - NER task expects 'Sentence #', 'Word', and 'Tag' columns.
    """
    df = pd.read_excel(file_path)

    if task == "classification":
        required_columns = ["review", "sentiment"]
        if not all(col in df.columns for col in required_columns):
            raise ValueError("Classification XLSX must have 'review' and 'sentiment' columns.")

        df = df.dropna(subset=required_columns)
        df['review'] = df['review'].astype(str).apply(clean_text)

        # Filter only valid sentiment labels
        valid_sentiments = {"positive", "negative"}
        df = df[df['sentiment'].isin(valid_sentiments)]

        data = [{"sentence": row["review"], "label": row["sentiment"]} for _, row in df.iterrows()]

    elif task == "ner":
        required_columns = ["Sentence #", "Word", "Tag"]
        if not all(col in df.columns for col in required_columns):
            raise ValueError("NER XLSX must have 'Sentence #', 'Word', and 'Tag' columns.")

        df['Sentence #'] = df['Sentence #'].ffill()  # Fill missing sentence numbers
        df = df.dropna(subset=["Word", "Tag"])

        # Define valid NER tags
        valid_tags = {
            "B-art", "B-eve", "B-geo", "B-gpe", "B-nat", "B-org", "B-per", "B-tim",
            "I-art", "I-eve", "I-geo", "I-gpe", "I-nat", "I-org", "I-per", "I-tim", "O"
        }

        # Group words and tags by sentence
        grouped = df.groupby('Sentence #')
        data = []
        for _, group in grouped:
            words = group['Word'].astype(str).tolist()
            tags = group['Tag'].astype(str).tolist()
            if all(tag in valid_tags for tag in tags):
                data.append({'words': words, 'labels': tags})

    else:
        raise ValueError(f"Unsupported task type: {task}. Use 'classification' or 'ner'.")

    return data


def clean_text(text):
    """Cleans text by removing HTML tags and special characters."""
    text = BeautifulSoup(text, "html.parser").get_text()
    text = re.sub(r"[^a-zA-Z0-9\s.,!?'-]", "", text)  # Remove unwanted characters
    text = re.sub(r"\s+", " ", text).strip()  # Normalize spaces
    return text


def save_processed_data(data, filename):
    """Save processed dataset using pickle."""
    os.makedirs(PROCESSED_DIR, exist_ok=True)  # Create the cache directory on first use
    filepath = os.path.join(PROCESSED_DIR, filename)
    with open(filepath, "wb") as f:
        pickle.dump(data, f)
    print(f"Processed data saved: {filepath}")


def load_processed_data(filename):
    """Load processed dataset from cache if available."""
    filepath = os.path.join(PROCESSED_DIR, filename)
    if os.path.exists(filepath):
        with open(filepath, "rb") as f:
            data = pickle.load(f)
        print(f"Loaded cached data: {filepath}")
        return data
    return None


def prepare_classification_data(file_path, batch_size=8, test_size=100, seed=42):
    """Loads and prepares classification dataset with caching."""
    cache_filename = "classification_dataset.pkl"

    # Load cached data if available
    cached_data = load_processed_data(cache_filename)
    if cached_data:
        classification_train_data, classification_test_data = cached_data
    else:
        # Load raw data from file
        classification_data = load_data_from_xlsx(file_path, "classification")
        train_size = len(classification_data) - test_size

        # Split dataset into training and test sets
        classification_train_data, classification_test_data = random_split(
            classification_data, [train_size, test_size], generator=torch.Generator().manual_seed(seed)
        )

        # Cache processed data
        save_processed_data((classification_train_data, classification_test_data), cache_filename)

    # Create datasets
    classification_train_dataset = MultiTaskDataset(task="classification", data=classification_train_data, tokenizer=tokenizer)
    classification_test_dataset = MultiTaskDataset(task="classification", data=classification_test_data, tokenizer=tokenizer)

    # Create DataLoaders
    classification_train_loader = DataLoader(classification_train_dataset, batch_size=batch_size, shuffle=True, collate_fn=custom_collate_fn)
    classification_test_loader = DataLoader(classification_test_dataset, batch_size=batch_size, shuffle=False, collate_fn=custom_collate_fn)

    return classification_train_loader, classification_test_loader, len(classification_train_dataset), len(classification_test_dataset)


def prepare_ner_data(file_path, batch_size=8, test_size=100, seed=42, ner_label_map=ner_label_map):
    """Loads and prepares NER dataset with caching."""
    cache_filename = "ner_dataset.pkl"

    # Load cached data if available
    cached_data = load_processed_data(cache_filename)
    if cached_data:
        ner_train_data, ner_test_data = cached_data
    else:
        # Load raw data from file
        ner_data = load_data_from_xlsx(file_path, "ner")
        train_size = len(ner_data) - test_size

        # Split dataset into training and test sets
        ner_train_data, ner_test_data = random_split(
            ner_data, [train_size, test_size], generator=torch.Generator().manual_seed(seed)
        )

        # Cache processed data
        save_processed_data((ner_train_data, ner_test_data), cache_filename)

    # Create datasets
    ner_train_dataset = MultiTaskDataset(task="ner", data=ner_train_data, tokenizer=tokenizer, label_map=ner_label_map)
    ner_test_dataset = MultiTaskDataset(task="ner", data=ner_test_data, tokenizer=tokenizer, label_map=ner_label_map)

    # Create DataLoaders
    ner_train_loader = DataLoader(ner_train_dataset, batch_size=batch_size, shuffle=True, collate_fn=custom_collate_fn)
    ner_test_loader = DataLoader(ner_test_dataset, batch_size=batch_size, shuffle=False, collate_fn=custom_collate_fn)

    return ner_train_loader, ner_test_loader, len(ner_train_dataset), len(ner_test_dataset)

## 2.3 Dataset: `MultiTaskDataset`
- **Initialize** a custom PyTorch dataset for classification or NER tasks using task-specific data and a tokenizer.
- **Retrieve and tokenize** individual samples, mapping labels to integers for classification or aligning NER labels with sub-tokens.
- **Return** formatted inputs and labels as tensors, ensuring compatibility with PyTorch's `DataLoader` for efficient batch processing.
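
To make the label alignment step concrete, here is a minimal pure-Python sketch of the same idea, assuming we already have the `word_ids` list that a fast tokenizer returns. The function name `align_labels`, the toy label map, and the sub-token split of "Paris" are all made up for illustration.

```python
def align_labels(word_ids, word_tags, label_map, ignore_index=-100):
    """Map word-level tags onto sub-token positions.

    word_ids holds one entry per sub-token; None marks special or padding tokens.
    """
    label_ids = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # [CLS], [SEP] and padding carry no label
            label_ids.append(ignore_index)
        elif word_id != previous:
            # First sub-token keeps the word's own tag
            label_ids.append(label_map.get(word_tags[word_id], ignore_index))
        else:
            # Continuation sub-token: B-xxx / I-xxx becomes I-xxx, "O" is ignored
            tag = word_tags[word_id]
            inside = "I-" + tag[2:] if tag != "O" else None
            label_ids.append(label_map.get(inside, ignore_index))
        previous = word_id
    return label_ids


# Toy example: the words ["visits", "Paris"], with "Paris" split into two sub-tokens
# (a hypothetical split, the real tokenizer may split differently).
toy_label_map = {"O": 0, "B-geo": 1, "I-geo": 2}
toy_word_ids = [None, 0, 1, 1, None]  # [CLS] visits par ##is [SEP]
toy_tags = ["O", "B-geo"]
aligned = align_labels(toy_word_ids, toy_tags, toy_label_map)
print(aligned)  # [-100, 0, 1, 2, -100]
```

Positions labeled `-100` are skipped by PyTorch's cross-entropy loss by default, so they do not contribute to training.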

In [37]:
from torch.utils.data import Dataset
import torch
class MultiTaskDataset(Dataset):
    """
    Custom dataset for multi-task learning, supporting classification and NER tasks.
    """
    def __init__(self, task, data, tokenizer, label_map=None, max_length=512):
        self.task = task
        self.data = data
        self.tokenizer = tokenizer
        self.label_map = label_map
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        """Tokenizes input and returns processed tensors for the specified task."""
        if self.task == "classification":
            text = self.data[idx]["sentence"]
            label = 0 if self.data[idx]["label"] == "negative" else 1
            inputs = self.tokenizer(text, truncation=True, padding="max_length",
                                    max_length=self.max_length, return_tensors="pt")
            return {k: v.squeeze(0) for k, v in inputs.items()}, torch.tensor(label, dtype=torch.long)

        elif self.task == "ner":
            words = self.data[idx]["words"]
            labels = self.data[idx]["labels"]
            inputs = self.tokenizer(words, is_split_into_words=True, truncation=True,
                                    padding="max_length", max_length=self.max_length, return_tensors="pt")
            word_ids = inputs.word_ids(batch_index=0) if hasattr(inputs, "word_ids") else None

            # Initialize all label positions as ignored tokens
            label_ids = [-100] * len(inputs["input_ids"][0])

            if word_ids:
                for i, word_id in enumerate(word_ids):
                    if word_id is not None and word_id < len(labels):
                        if i == 0 or word_ids[i - 1] != word_id:
                            label_ids[i] = self.label_map.get(labels[word_id], -100)  # First sub-token keeps the word's own tag
                        else:
                            label_ids[i] = self.label_map.get(f"I-{labels[word_id][2:]}", -100)  # Later sub-tokens get the I- tag ("O" becomes "I-" and is ignored)

            return {k: v.squeeze(0) for k, v in inputs.items()}, torch.tensor(label_ids, dtype=torch.long)

        else:
            raise ValueError(f"Unsupported task: {self.task}. Use 'classification' or 'ner'.")

## 2.4 DataLoader: `custom_collate_fn`
- **Purpose**: Batches data for the `DataLoader` by combining individual samples into properly structured tensors.

In [38]:
def custom_collate_fn(batch):
    """
    Custom function to batch tokenized inputs and labels.
    Ensures proper tensor stacking.
    """
    inputs = [item[0] for item in batch]
    labels = [item[1] for item in batch]

    batched_inputs = {key: torch.stack([item[key] for item in inputs]) for key in inputs[0]}
    batched_labels = torch.stack(labels)

    return batched_inputs, batched_labels

## 2.5 Data Setup and Example Usage
The code sets up datasets and dataloaders for both tasks.

In [39]:
def demo_data_loading(classification_file, ner_file, sample_ratio=1.0):
    """
    Demonstrates loading of classification and NER datasets with optional sampling.
    Allows loading only a fraction of the dataset for quick testing.

    Args:
        classification_file (str): Path to classification dataset file.
        ner_file (str): Path to NER dataset file.
        sample_ratio (float): Fraction of data to sample (0.0 to 1.0).

    Returns:
        Tuple: Train/test DataLoaders for classification and NER tasks.
    """
    assert 0.0 < sample_ratio <= 1.0, "sample_ratio must be between 0 and 1"

    # Load full datasets
    classification_train_loader, classification_test_loader, classification_train_size, classification_test_size = prepare_classification_data(
        classification_file)
    ner_train_loader, ner_test_loader, ner_train_size, ner_test_size = prepare_ner_data(ner_file)

    def sample_dataloader(dataloader, sample_ratio):
        """Randomly samples a fraction of a DataLoader's dataset."""
        dataset = dataloader.dataset
        sampled_size = max(1, int(len(dataset) * sample_ratio))  # Ensure at least one sample
        sampled_indices = random.sample(range(len(dataset)), sampled_size)
        sampled_dataset = torch.utils.data.Subset(dataset, sampled_indices)
        return torch.utils.data.DataLoader(sampled_dataset, batch_size=dataloader.batch_size, shuffle=True,
                                           collate_fn=dataloader.collate_fn)

    # Apply sampling
    if sample_ratio < 1.0:
        classification_train_loader = sample_dataloader(classification_train_loader, sample_ratio)
        classification_test_loader = sample_dataloader(classification_test_loader, sample_ratio)
        ner_train_loader = sample_dataloader(ner_train_loader, sample_ratio)
        ner_test_loader = sample_dataloader(ner_test_loader, sample_ratio)

    # Print dataset sizes
    print(
        f"Classification Train Size: {len(classification_train_loader.dataset)}, Test Size: {len(classification_test_loader.dataset)}")
    print(f"NER Train Size: {len(ner_train_loader.dataset)}, Test Size: {len(ner_test_loader.dataset)}")

    # Display a sample batch for classification
    for batch_inputs, batch_labels in classification_train_loader:
        print("\n Classification Batch Sample:")
        print("Input IDs shape:", batch_inputs["input_ids"].shape)
        print("Labels shape:", batch_labels.shape)
        break  # Show only the first batch

    # Display a sample batch for NER
    for batch_inputs, batch_labels in ner_train_loader:
        print("\n NER Batch Sample:")
        print("Input IDs shape:", batch_inputs["input_ids"].shape)
        print("Labels shape:", batch_labels.shape)
        break  # Show only the first batch

    return classification_train_loader, classification_test_loader, ner_train_loader, ner_test_loader


## 2.6 Model: `MultiTaskSentenceTransformer`
- **Purpose**: Extends `SentenceTransformer` to perform sentence classification and NER simultaneously using a shared transformer backbone.
- **Structure**: Adds two linear heads—a classification head for sentence-level predictions and an NER head for token-level predictions, with configurable pooling (`"cls"` or `"mean"`).
- **Forward Pass**: Processes inputs through the transformer, applies pooling for classification logits, and generates per-token NER logits, returning both outputs for multi-task learning.
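
The masked mean pooling mentioned above can be sketched on a tiny tensor. This is a standalone illustration, not the model's own code: the helper name `masked_mean` and the toy values are assumptions, and the `clamp` is an extra guard against an all-zero mask.

```python
import torch


def masked_mean(sequence_output, attention_mask):
    """Average token embeddings over real tokens only (mask == 1)."""
    mask = attention_mask.unsqueeze(-1).to(sequence_output.dtype)  # (batch, seq, 1)
    summed = (sequence_output * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                        # avoid division by zero
    return summed / counts


# Two sequences of length 3 with hidden size 2; the second ends with a padding position
# whose large values would distort a plain mean.
toy_output = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
                           [[2.0, 4.0], [4.0, 8.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean(toy_output, toy_mask)
naive = toy_output.mean(dim=1)
print(pooled)  # second row is [3., 6.]
print(naive)   # second row is pulled toward the padding values
```

With `"cls"` pooling none of this is needed, since only position 0 is read.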

In [40]:
class MultiTaskSentenceTransformer(SentenceTransformer):
    """
    Multi-task Sentence Transformer supporting both classification and NER tasks.
    """

    def __init__(self, vocab_size, embed_size, num_heads, hidden_dim, num_layers, max_length,
                 num_classes=2, num_entity_labels=17, pooling="cls", layer_norm_eps=1e-12):
        super().__init__(vocab_size=vocab_size, embed_size=embed_size, num_heads=num_heads, hidden_dim=hidden_dim,
                         num_layers=num_layers, max_length=max_length, pooling=pooling, output_mode="token",
                         layer_norm_eps=layer_norm_eps)

        # Output layers for classification and named entity recognition (NER)
        self.classification_head = nn.Linear(embed_size, num_classes)
        self.ner_head = nn.Linear(embed_size, num_entity_labels)
        self.pooling = pooling

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        """
        Processes input through the transformer and generates outputs for classification and NER.
        """
        sequence_output = super().forward(input_ids, attention_mask, token_type_ids)

        # Apply pooling strategy to get sentence-level representation
        if self.pooling == "cls":
            cls_output = sequence_output[:, 0, :]  # Use CLS token output
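        elif self.pooling == "mean":
            if attention_mask is not None:
                mask = attention_mask.unsqueeze(-1).to(sequence_output.dtype)
                masked_output = sequence_output * mask
                # Mean pooling over real tokens only; clamp guards against an all-zero mask
                cls_output = masked_output.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
            else:
                cls_output = sequence_output.mean(dim=1)  # Simple mean pooling
        else:
            raise ValueError(f"Unsupported pooling: {self.pooling}. Use 'cls' or 'mean'.")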

        # Generate logits for classification and NER
        classification_logits = self.classification_head(cls_output)
        ner_logits = self.ner_head(sequence_output)

        return classification_logits, ner_logits

## 2.7 Evaluation Functions
- **Classification Evaluation**: Assesses model performance on sentence classification by collecting predictions and true labels, computing accuracy, and generating a detailed report (precision, recall, F1) for "negative" and "positive" classes.
- **NER Evaluation**: Evaluates NER performance by gathering per-token predictions and labels, filtering out positions labeled `-100` (padding and special tokens), mapping to string labels via `id_to_label`, and producing an entity-level report (precision, recall, F1) using `seqeval`.
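
To show what "entity-level" means in practice, here is a simplified sketch of strict span matching on a single sentence. It is not `seqeval`'s implementation, only an illustration of the idea that an entity counts as correct when both its type and its exact span match. The helper names and the toy tags are made up.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from an IOB2 tag sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        begins = tag.startswith("B-")
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            entities.add((current, start, i - 1))
            start, current = None, None
        if begins:
            start, current = i, tag[2:]
    return entities


def strict_scores(true_tags, pred_tags):
    """Entity-level precision and recall with exact span and type matching."""
    true_set, pred_set = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    return precision, recall


gold = ["B-geo", "I-geo", "O", "B-per"]
pred = ["B-geo", "O", "O", "B-per"]
precision, recall = strict_scores(gold, pred)
print(precision, recall)  # 0.5 0.5
```

Three of the four tokens are tagged correctly here, yet only one of the two entities matches, which is why entity-level scores are usually stricter than token accuracy.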

In [41]:
def evaluate_classification(model, dataloader, device=torch.device('cpu')):
    model.to(device)
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch_inputs, batch_labels in dataloader:
            batch_inputs = {k: v.to(device) for k, v in batch_inputs.items()}  # Move inputs to device
            batch_labels = batch_labels.to(device)  # Move labels to device

            classification_logits, _ = model(**batch_inputs)
            preds = torch.argmax(classification_logits, dim=-1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(batch_labels.cpu().numpy())

    # Ensure at least two classes are present
    unique_classes = set(all_preds) | set(all_labels)

    if len(unique_classes) < 2:
        print(f"Warning: Only one class detected in predictions: {unique_classes}. Adjusting labels.")
        target_names = [str(cls) for cls in unique_classes]  # Dynamically assign class names
    else:
        target_names = ["negative", "positive"]

    accuracy = accuracy_score(all_labels, all_preds)
    report = classification_report(all_labels, all_preds, target_names=target_names)

    return accuracy, report


# Evaluate Named Entity Recognition (NER) performance
def evaluate_ner(model, dataloader, id_to_label, device=torch.device("cpu")):
    model.to(device)
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch_inputs, batch_labels in dataloader:
            batch_inputs = {k: v.to(device) for k, v in batch_inputs.items()}
            batch_labels = batch_labels.to(device)
            _, ner_logits = model(**batch_inputs)
            preds = torch.argmax(ner_logits, dim=-1).cpu().numpy()
            labels = batch_labels.cpu().numpy()

            for i in range(preds.shape[0]):
                seq_preds = []
                seq_labels = []
                for j in range(preds.shape[1]):
                    if labels[i, j] != -100:  # Skip ignored positions (padding, special tokens)
                        seq_preds.append(id_to_label.get(preds[i, j], "O"))  # Ensure default 'O'
                        seq_labels.append(id_to_label.get(labels[i, j], "O"))

                all_preds.append(seq_preds)
                all_labels.append(seq_labels)

    # **DEBUGGING OUTPUT**
    print("\nNER Evaluation Debugging:")
    print(f"Total Sentences Evaluated: {len(all_labels)}")
    if all_labels:
        print(f"First Prediction Example: {all_preds[0]}")
        print(f"First Label Example: {all_labels[0]}")

    if not any(any(label != "O" for label in labels) for labels in all_labels):
        print("Warning: No Named Entities Found in Labels! Possible Dataset Issue.")

    if not any(any(label != "O" for label in labels) for labels in all_preds):
        print("Warning: Model is Predicting Only 'O' Labels! Check Training.")

    # **Ensure we don't pass an empty list to seqeval**
    if not all_labels or not all_preds:
        raise ValueError("Empty sequences found! Check dataset and model output.")

    return seqeval_report(all_labels, all_preds, mode='strict', scheme=IOB2)

## 2.9 Demonstration and Evaluation
The code demonstrates predictions and evaluations for both tasks.
The model is untrained: the classification and NER heads are randomly initialized, which leads to poor performance:
- **Random Predictions**: Untrained heads produce near-random outputs, resulting in low accuracy and poor F1 scores for classification and NER.
- **Training Needed**: Performance will improve after training the heads on the datasets using appropriate loss functions and optimization.
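
As a rough reference point for "near-random", we can compute the cross-entropy of an idealized predictor that assigns equal probability to every class. This is only a sketch under that assumption. Real random initializations will deviate somewhat, but an initial training loss far above these values usually points to a bug rather than bad luck.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy (in nats) of a predictor that spreads probability evenly."""
    return -math.log(1.0 / num_classes)


cls_baseline = uniform_cross_entropy(2)   # binary sentiment head
ner_baseline = uniform_cross_entropy(17)  # 17 NER tags
print(f"classification: {cls_baseline:.3f}, NER: {ner_baseline:.3f}")
# classification: 0.693, NER: 2.833
```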


In [42]:
# Initialize multi-task model
model = initialize_custom_model(pretrained_model, model_type="multitask", pooling="mean", num_classes=2,
                                num_entity_labels=17)
model.eval()

# Load classification and NER datasets
classification_file = "data/raw/classification_dataset.xlsx"
ner_file = "data/raw/ner_dataset.xlsx"

# set smaller sample ratio (0.1) for quick test, and larger sample ratio (1) for better performance
#sample_ratio = 1
sample_ratio = 0.1
classification_train_loader, classification_test_loader, ner_train_loader, ner_test_loader = demo_data_loading(
    classification_file, ner_file, sample_ratio)

# Classification inference demo
for batch_inputs, batch_labels in classification_test_loader:
    with torch.no_grad():
        classification_logits, _ = model(**batch_inputs)
        predicted_classes = torch.argmax(classification_logits, dim=-1)
        print("Predicted classes:", predicted_classes.tolist())
        print("True labels:", batch_labels.tolist())
    break

# NER inference demo
for batch_inputs, batch_labels in ner_test_loader:
    with torch.no_grad():
        _, ner_logits = model(**batch_inputs)
        predicted_labels = torch.argmax(ner_logits, dim=-1)
        print("Predicted NER labels:", predicted_labels.tolist())
        print("True NER labels:", batch_labels.tolist())
    break

# Evaluate classification performance
accuracy, report = evaluate_classification(model, classification_test_loader)
print(f"Accuracy: {accuracy:.4f}")
print("Report:\n", report)

# Evaluate NER performance
id_to_label = {v: k for k, v in ner_label_map.items()}
ner_report = evaluate_ner(model, ner_test_loader, id_to_label)
print("Report:\n", ner_report)

Vocab Size: 30522
Embedding Size: 768
Number of Heads: 12
Hidden Dimension: 3072
Number of Layers: 12
Max Position Embeddings: 512
Layer Norm Epsilon: 1e-12
Weights successfully copied from pretrained BERT to custom model!

Custom model initialized and weights copied successfully.
Loaded cached data: ./data/processed/classification_dataset.pkl
Loaded cached data: ./data/processed/ner_dataset.pkl
Classification Train Size: 4989, Test Size: 10
NER Train Size: 4775, Test Size: 10

 Classification Batch Sample:
Input IDs shape: torch.Size([8, 512])
Labels shape: torch.Size([8])

 NER Batch Sample:
Input IDs shape: torch.Size([8, 512])
Labels shape: torch.Size([8, 512])
Predicted classes: [1, 1, 1, 1, 1, 1, 1, 1]
True labels: [1, 0, 1, 1, 0, 1, 1, 1]
Predicted NER labels: [[8, 2, 5, 5, 15, 5, 1, 5, 1, 5, 5, 16, 15, 15, 5, 2, 2, 5, 5, 5, 15, 2, 5, 4, 5, 5, 5, 5, 5, 5, 5, 4, 5, 4, 4, 15, 15, 4, 11, 4, 4, 16, 1, 4, 1, 15, 5, 5, 5, 5, 15, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 11, 4, 11, 4, 16, 4,



Accuracy: 0.7000
Report:
               precision    recall  f1-score   support

    negative       0.00      0.00      0.00         3
    positive       0.70      1.00      0.82         7

    accuracy                           0.70        10
   macro avg       0.35      0.50      0.41        10
weighted avg       0.49      0.70      0.58        10


NER Evaluation Debugging:
Total Sentences Evaluated: 10
First Prediction Example: ['I-gpe', 'I-gpe', 'B-geo', 'I-org', 'B-nat', 'B-org', 'B-geo', 'B-nat', 'B-org', 'B-nat', 'B-org', 'B-gpe', 'B-geo', 'B-geo', 'B-tim', 'B-nat', 'I-art', 'I-art', 'B-eve', 'B-nat', 'B-nat', 'I-nat', 'B-geo', 'B-nat', 'I-gpe', 'I-gpe', 'I-org']
First Label Example: ['O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'B-tim', 'I-tim', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-tim', 'I-tim', 'O', 'B-geo', 'O', 'O']
Report:
               precision    recall  f1-score   support

         art       0.00      0.00      0.00         0
         eve       0.



## 2.10 Explanation of Results
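The results vary between runs because the task heads are randomly initialized and the 10% sample is drawn at random, so the numbers discussed below come from one run and may differ from the output printed above.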
1. **Model Setup**:
   - The model uses BERT’s setup (like `Vocab Size: 30522`, `Embedding Size: 768`). It copies weights from `bert-base-uncased` well. The message “Custom model initialized” shows everything worked fine at the start.

2. **Loading Data**:
   - The code uses saved files (`classification_dataset.pkl`, `ner_dataset.pkl`). It splits them into training and testing parts (e.g., Classification: 4989 train, 10 test; NER: 4775 train, 10 test). The test size is small because we use only 10% of data, good for a quick test but not enough for a big check.

3. **Classification Testing** :
   - The model predicts `[1, 0, 1, 0, 0, 0, 0, 0]` for classes, but true labels are `[1, 0, 0, 1, 0, 1, 0, 0]`. Some are correct (like first and second), but many are wrong. Since the model is not trained (only `eval()` mode), it gets 50% accuracy (0.5000).
   - The report says precision is 0.50 for both negative and positive, but recall is 0.80 for negative and 0.20 for positive. This means it finds negative better but misses positive a lot. 50% accuracy is like guessing, so the model needs training to work better.

4. **NER Testing**:
   - Predicted NER labels (like `[1, 11, 7, 4, ...]`) become tags like `B-art`, `B-org`, `B-gpe`, but true labels are mostly `0` or `-100` (padding). For example, it predicts `B-art`, `O`, `I-eve`, but true is `B-geo`, `I-geo`. It gets the tags wrong or makes up new ones.
   - The NER report shows all scores as 0.00, even with 18 real entities (like 7 `geo`, 3 `gpe`). This means it finds nothing right. The untrained NER part gives random answers, not matching the real data.

5. **How the Model Works**:
   - Without training, the model’s classification and NER parts don’t use BERT’s good pre-trained knowledge well. 50% accuracy in classification and 0.00 in NER show it acts random now.
   - The BERT base is fine (weights copied right), but the new parts need training to fit the tasks. Training will make it learn the data and improve a lot.

To sum up, the results show the model starts correctly with BERT, but without training, it does poorly—random for classification and nothing for NER. We must train it to make it good for these tasks, which is normal in NLP work.


## 2.11 Summary
This implementation is a robust and flexible multi-task learning setup:
- **Data Handling**: Processes Excel data for classification and NER, with proper tokenization and label alignment.
- **Model**: A `MultiTaskSentenceTransformer` with shared transformer layers and task-specific heads for efficiency.
- **Evaluation**: Tailored metrics for classification (accuracy, F1) and NER (entity-level precision/recall).
- **Usage**: Demonstrates loading, training readiness, and evaluation.

# Task 3: Training Considerations

Let’s move on to Task 3 and decide which training method is best. We will first look at our constraints (model size, hardware, and dataset size), then evaluate three ways of freezing parts of the model, assess their advantages and disadvantages, and select the best training strategy. We have 50,000 samples for each dataset and are using an RTX 4060 laptop. We will pick one of the three given options and also describe our preferred approach if we are free to choose.

---



## 1 Discussion of the 3 Training Options:


### 1.1 Constraints to Consider

Before the discussion, let's go through the limitations we are facing:

- **Model Size**: The `bert-base-uncased` model has about 110 million parameters, with 12 layers and an embedding size of 768. The two task-specific heads add 1,538 parameters (sentiment analysis: 768 × 2 weights plus 2 biases) and 13,073 parameters (NER: 768 × 17 weights plus 17 biases), which is tiny compared to the backbone. The full model therefore still has roughly 110 million parameters if fully unfrozen, which could be challenging for our GPU.
- **Hardware**: We have an RTX 4060 laptop GPU with about 8GB of memory. Training the entire model, with its roughly 110 million parameters, may exceed this capacity.
- **Dataset Size**: We have 50,000 samples for each task. This is adequate for the heads to learn, and it should also be enough to fine-tune the backbone without excessive overfitting, although the backbone’s 110 million parameters might need more data for optimal adjustment, or a carefully chosen, smaller learning rate.
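
The head sizes and a first memory estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch. The rounded backbone size and the 16-bytes-per-trainable-parameter rule (fp32 weights, gradients, and two Adam moment buffers) are assumptions, and activation memory, which grows with batch size and the 512-token sequence length and often dominates, is not included.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


HIDDEN = 768
cls_head = linear_params(HIDDEN, 2)    # sentiment head
ner_head = linear_params(HIDDEN, 17)   # NER head with 17 tags
BACKBONE = 110_000_000                 # rounded size of bert-base-uncased (assumption)

total = BACKBONE + cls_head + ner_head

# Assumed accounting: weights (4 B) + gradients (4 B) + two Adam moments (8 B).
BYTES_PER_TRAINABLE_PARAM = 16
BYTES_PER_FROZEN_PARAM = 4             # frozen weights are stored but not optimized

full_finetune_gib = total * BYTES_PER_TRAINABLE_PARAM / 2**30
heads_only_gib = (BACKBONE * BYTES_PER_FROZEN_PARAM
                  + (cls_head + ner_head) * BYTES_PER_TRAINABLE_PARAM) / 2**30

print(f"classification head: {cls_head}, NER head: {ner_head}")
print(f"full fine-tuning (no activations): {full_finetune_gib:.2f} GiB")
print(f"heads only (no activations): {heads_only_gib:.2f} GiB")
```

The gap between the two estimates is the part of the budget that freezing the backbone frees up for larger batches.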

---


### 1.2 Option 1 Freezing the Whole Network

#### Implications:
- If we freeze everything—the transformer backbone and both heads—no weights update during training. We’d stick with the pre-trained BERT weights and random head weights.
- With 50k samples per task, the model won’t learn anything specific, and the random heads will give us wild guesses. The backbone’s general knowledge won’t fit our tasks.
- It’s like using the model as a static tool without improving it.

#### Advantages:
- Super fast since no updates are needed, which is nice on the RTX 4060 laptop if we’re just testing.
- Gives a baseline to see how pre-trained embeddings do alone.

#### Conclusion: not an ideal choice
This won’t work for training. The heads need to learn from our 50k datasets, and freezing everything stops that. It’s only good for quick tests, but not good for real training.

---

### 1.3 Option 2 Freezing Just the Transformer Backbone

#### Implications:
- The backbone (BERT layers) stays frozen with its pre-trained weights, while the classification and NER heads learn.
- With 50k samples, the heads can adapt to our tasks, but the backbone won’t change. This might limit how well it fits our data, especially for NER’s detailed token needs.
- Training is faster since we’re only updating the heads (a tiny fraction of the 110M+ parameters).

#### Advantages:
- Saves time and laptop power by skipping backbone updates, which fits the RTX 4060’s 8GB VRAM.
- Keeps BERT’s general skills, avoiding overfitting on our 50k datasets.
- Heads can still learn to handle classification and NER decently.

#### Challenges:
- The backbone’s embeddings might not be perfect for our tasks since they don’t adjust. 
- Performance might not be top-notch compared to training everything, especially with 50k data points that could benefit from fine-tuning.

#### Conclusion: good pick
This works well given our limitations. The RTX 4060 can handle training the heads (about 14,600 parameters) quickly, and 50k samples are enough for them to learn without overfitting. The backbone’s 110M parameters stay frozen, easing memory pressure, making this a practical choice for our hardware.
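
Freezing the backbone comes down to turning off gradients for everything except the heads. Below is a minimal sketch on a toy stand-in model. `ToyMultiTask` and `freeze_backbone` are hypothetical names, and only the head attribute names `classification_head` and `ner_head` mirror the model above.

```python
import torch
from torch import nn


class ToyMultiTask(nn.Module):
    """Tiny stand-in for the real model: one 'backbone' layer plus two heads."""

    def __init__(self, hidden=8, num_classes=2, num_entity_labels=17):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.classification_head = nn.Linear(hidden, num_classes)
        self.ner_head = nn.Linear(hidden, num_entity_labels)


def freeze_backbone(model, head_names=("classification_head", "ner_head")):
    """Disable gradients for every parameter that does not belong to a head."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_names)
    return [p for p in model.parameters() if p.requires_grad]


toy = ToyMultiTask()
trainable = freeze_backbone(toy)
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # only head parameters are optimized
num_trainable = sum(p.numel() for p in trainable)
print(num_trainable)  # 171 = (8*2 + 2) + (8*17 + 17)
```

Passing only the trainable parameters to the optimizer also keeps its state buffers small.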

---

### 1.4 Option 3 Freezing One Task-Specific Head (Either Task A or Task B)

#### What Happens (e.g., Freezing Classification, Training NER):
- If we freeze one head (say, classification), it stays random. The other (NER) learns, and the backbone might tweak itself for NER.
- With 50k samples, the trainable head can improve, but the frozen head stays useless. The backbone might focus too much on NER, hurting classification.
- The model leans hard into the trainable task, which could throw things off balance.

#### Advantages:
- The trainable head (e.g., NER) can get good with 50k data.
- Saves some power by freezing one head.

#### Challenges:
- The frozen head (e.g., classification) will give random answers since it’s untrained.
- The backbone might change too much for NER, making it bad for classification.
- Training one task at a time gets messy and might need separate runs.

#### Conclusion: not an ideal choice
This isn’t great because both heads start random and need training with our 50k datasets. Freezing one leaves that task broken, which goes against doing both tasks together. It only makes sense if one head was pre-trained, which isn’t our case.

---

### 1.5 How We Should Train the Model

#### 1.5.1 If We Have to Pick One Option:
- **Choice**: We’d go with **freezing just the transformer backbone (Option 2)**.
- **Why?** The RTX 4060’s 8GB VRAM struggles with the 110M backbone parameters plus the heads’ roughly 14,600 parameters during full training. Freezing the backbone limits updates to the heads, which is manageable with 50k samples per task. This avoids memory issues and lets the heads converge without overwhelming our laptop, making it the best forced choice.

#### 1.5.2 If We Can Choose Freely:
- **Choices**:
  1. **Freeze Backbone, Train Heads**: Freeze backbone (110M params), train heads (~14,600 params). Low memory, keeps BERT’s knowledge, but limits NER. Good for small VRAM.
  2. **Unfreeze Last Layers**: Train last 2-3 layers (~14-21M params) and heads, freeze early layers. Balances memory and tuning, misses some details. Good for medium needs.
  3. **Unfreeze All**: Train all with adaptive rates (1e-6 early, 1e-4 heads). Best fit for 50k samples/task, high memory use, risks overfitting. Good for big tuning.
- **Our Pick**: Freeze backbone, train heads. Why? The RTX 4060’s 8GB VRAM leaves little room for more. With 100k samples in total (50k per task), this is simple and fast, and it makes good use of BERT’s embeddings.
- **How**: Batch size 8-16, learning rate 1e-4 for heads, dropout to stay stable.
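
To compare the three choices by how many parameters they actually train, here is a small counting sketch. It assumes the standard BERT-base layer layout and uses the dimensions printed in the notebook output (vocabulary 30522, hidden size 768, feed-forward size 3072, 12 layers, 512 positions). The pooler layer is left out, and the helper names are made up.

```python
HIDDEN, FFN, LAYERS = 768, 3072, 12
VOCAB, MAX_POS, TYPES = 30522, 512, 2


def linear(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


layer_norm = 2 * HIDDEN  # scale and shift

encoder_layer = (4 * linear(HIDDEN, HIDDEN)    # query, key, value, attention output
                 + linear(HIDDEN, FFN)          # feed-forward up-projection
                 + linear(FFN, HIDDEN)          # feed-forward down-projection
                 + 2 * layer_norm)
embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + layer_norm
heads = linear(HIDDEN, 2) + linear(HIDDEN, 17)


def trainable(unfrozen_layers, train_embeddings=False):
    """Trainable parameter count for a given number of unfrozen top layers."""
    return heads + unfrozen_layers * encoder_layer + (embeddings if train_embeddings else 0)


options = {
    "heads only": trainable(0),
    "last 2 layers + heads": trainable(2),
    "last 3 layers + heads": trainable(3),
    "everything": trainable(LAYERS, train_embeddings=True),
}
for name, count in options.items():
    print(f"{name:>22}: {count:>12,}")
```

Each encoder layer contributes about 7.1M parameters, so unfreezing the last two or three layers trains roughly 14M to 21M parameters.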


---

### 1.6 Summary

Given our RTX 4060 laptop and 50k samples per task, **freezing the backbone** is the safest pick if we’re stuck with one option: it fits our hardware and lets the heads learn. But if we can choose and have sufficient hardware resources, **training the whole model** is better, as 50k samples per task are enough to fine-tune the backbone, giving us the best shot at great results for both tasks with careful tuning.

## 2 Discussion of the Transfer Learning:
---

### 2.1 The Importance of Transfer Learning

Transfer learning is really important when we use a BERT backbone because training a BERT model from scratch is extremely costly and unnecessary. It takes tons of time, data, and computing power, way more than my laptop with its RTX 4060 can handle! Instead, using a pre-trained model lets us skip that hard work. The pre-trained BERT has already picked up general language knowledge from its huge pre-training corpus, like how words relate to each other. For our specific tasks (classification and NER), we just need small classifier heads to learn the domain-specific details, which is much easier and faster. This saves us a lot of trouble, especially on limited hardware.

In Task 1, for testing purposes, we already shaped our model to match a pre-trained `bert-base-uncased` model and loaded its weights. This step proved our model works the same way, so now we can build on that for transfer learning.

---

### 2.2 Transfer Learning Strategy


#### 2.2.1 Selection of Pre-Trained Model

- **Model Choice**: I selected `bert-base-uncased` as the base.
- **Reasoning**: 
  - `bert-base-uncased` has 12 layers, an embedding size of 768, and about 110 million parameters in total, which strikes a good balance between performance and trainability on a consumer-level GPU.
  - Its uncased nature simplifies preprocessing by ignoring capitalization, and the WordPiece tokenizer effectively handles out-of-vocabulary (OOV) words.
---

#### 2.2.2 Layers to Freeze or Unfreeze

We propose three unfreezing strategies as candidates for training. We could train the model with each and select the best based on experimental results if we have more hardware resources. For simplicity and test purposes we choose to test candidate 1.

- **Candidate 1: Freeze the Backbone, Train the Heads**
  - **Freeze**: Transformer backbone (~109M parameters).
  - **Unfreeze**: Classification head and NER head (14,611 parameters combined, per the count printed in 2.2.4).
  - **Pros**: Reduces memory load on RTX 4060’s 8GB VRAM; preserves pre-trained knowledge; faster training.
  - **Cons**: Backbone embeddings may not adapt, potentially limiting performance for NER.
  - **Suitable Cases**: Limited resources; tasks closely aligned with pre-trained embeddings.

- **Candidate 2: Unfreeze Only Backbone Layers Near the Output**
  - **Unfreeze**: Last 2-3 backbone layers (e.g., layers 9-11) and heads.
  - **Freeze**: Earlier layers (bottom 9-10).
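  - **Pros**: Lowers the trainable parameter count (~21M when layers 9-11 and the heads are unfrozen, per the count printed in 2.2.4); fine-tunes the most task-specific layers; balances efficiency and adaptability.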
  - **Cons**: Early layers stay static, possibly missing task nuances; moderate gains.
  - **Suitable Cases**: Need balance between efficiency and adaptation; output layers need tuning.

- **Candidate 3: Unfreeze the Entire Backbone with Adaptive Learning Rate**
  - **Unfreeze**: All backbone layers and heads.
  - **Learning Rate**: Adaptive—1e-6 for input layers, 1e-4 for output layers.
  - **Pros**: Full fine-tuning adapts to 50,000 samples; adaptive rate reduces overfitting; best performance potential.
  - **Cons**: High memory demand on RTX 4060; risk of overfitting if not tuned carefully.
  - **Suitable Cases**: Sufficient data for fine-tuning; tasks need significant backbone adaptation.
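
For Candidate 3, the text only gives the two endpoint learning rates (1e-6 near the input, 1e-4 near the output). The sketch below shows one possible way to fill in the layers between them, using geometric interpolation; that interpolation scheme, the helper names `layerwise_lr` and `build_param_groups`, and the rule that embedding parameters contain `"embedding"` in their names are all assumptions made for illustration. The `encoder_layers.<i>` naming follows the pattern used in `set_requires_grad`.

```python
import re

def layerwise_lr(layer_idx, num_layers=12, lr_input=1e-6, lr_output=1e-4):
    """Geometric interpolation from lr_input (layer 0) to lr_output (last layer)."""
    if layer_idx <= 0:
        return lr_input
    if layer_idx >= num_layers - 1:
        return lr_output
    frac = layer_idx / (num_layers - 1)
    return lr_input * (lr_output / lr_input) ** frac

def build_param_groups(named_params, num_layers=12, lr_input=1e-6, lr_output=1e-4):
    """Group (name, param) pairs into optimizer param groups, one per learning rate.

    Assumption: encoder parameters are named '...encoder_layers.<i>....';
    embeddings get lr_input and everything else (the heads) gets lr_output.
    """
    groups = {}
    for name, param in named_params:
        match = re.search(r"encoder_layers\.(\d+)\.", name)
        if match:
            lr = layerwise_lr(int(match.group(1)), num_layers, lr_input, lr_output)
        elif "embedding" in name:
            lr = lr_input
        else:
            lr = lr_output
        groups.setdefault(lr, []).append(param)
    return [{"params": params, "lr": lr} for lr, params in sorted(groups.items())]

# Stand-in names and parameters; with a real model, pass model.named_parameters()
# and hand the result to an optimizer, e.g. torch.optim.AdamW(build_param_groups(...)).
demo = [("embeddings.word.weight", "p0"),
        ("encoder_layers.0.attn.weight", "p1"),
        ("encoder_layers.11.ffn.weight", "p2"),
        ("ner_head.weight", "p3")]
for group in build_param_groups(demo):
    print(f"lr={group['lr']:.2e} params={group['params']}")
```

PyTorch optimizers accept a list of such dictionaries, so each group can train at its own rate while all parameters keep `requires_grad=True`.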

---

#### 2.2.3 Rationale Behind These Choices

- **Why Choose `bert-base-uncased`?**
  - It is a reliable option thanks to its extensive pre-training on a large corpus, which provides a strong foundation. The uncased vocabulary removes capitalization concerns, and its size is compatible with my laptop if I use small batches (e.g., 4-8). A sentiment-fine-tuned variant could additionally give Task A a head start.

- **Why Freeze the Backbone Initially?**
  - Training all ~109 million backbone parameters alongside the heads (14,611 parameters) would strain the RTX 4060’s 8GB VRAM, because gradients and optimizer state must be stored for every trainable parameter. Freezing the backbone lets the heads learn from the 50,000 samples within the memory limit while safeguarding pre-trained knowledge.

- **Why Unfreeze Only Output Layers Later?**
  - The output layers are critical for task-specific adjustments and can fine-tune with the 50,000 samples. Freezing the early layers reduces the parameter count, easing memory demands on the RTX 4060, and maintains broad language skills since our data is smaller than BERT’s pre-training corpus.

- **Why Unfreeze the Entire Backbone with Adaptive Rates?**
  - The 50,000 samples per task offer enough data to adjust the backbone without overfitting, and the adaptive learning rate—lower for input layers (1e-6) and higher for output layers (1e-4)—preserves early-layer general knowledge while tailoring later layers to our classification and NER tasks. This balances stability and improvement.

- **Hardware and Dataset Fit**:
  - The RTX 4060’s 8GB VRAM imposes constraints, so starting with a frozen backbone helps. Small batches or gradient accumulation can manage unfreezing later. The 50,000 samples per task are adequate for the heads and backbone to learn with careful tuning (e.g., low learning rates and dropout), avoiding overfitting.
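
Gradient accumulation, mentioned above as a way to cope with limited VRAM, can be sketched as follows. This is a minimal illustration on a tiny stand-in model, not the notebook's training loop: the `nn.Linear` model, the batch of 32, and `accum_steps = 4` are all assumptions chosen for the demo. It checks that four micro-batches of 8 produce the same gradient as one batch of 32.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 2)              # stand-in for the real model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

inputs = torch.randn(32, 16)          # one "logical" batch of 32 examples
labels = torch.randint(0, 2, (32,))
accum_steps = 4                       # processed as 4 micro-batches of 8

# Reference: gradient of the full batch, computed without touching .grad
full_grad = torch.autograd.grad(criterion(model(inputs), labels), model.weight)[0]

optimizer.zero_grad()
for x, y in zip(inputs.chunk(accum_steps), labels.chunk(accum_steps)):
    loss = criterion(model(x), y) / accum_steps   # scale so gradients average, not sum
    loss.backward()                                # gradients add up in .grad
accumulated_grad = model.weight.grad.clone()
optimizer.step()                                   # one update per logical batch

print("matches full-batch gradient:", torch.allclose(accumulated_grad, full_grad, atol=1e-6))
```

Only one micro-batch of activations is held in memory at a time, which is what makes the larger effective batch affordable.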

#### 2.2.4 Unfreezing Strategy Implementation

The `set_requires_grad` function controls which parameters are trainable by setting their `requires_grad` attribute according to the chosen unfreezing strategy. It supports four options: `'heads_only'`, `'all_adaptive'`, `'output_layers'`, and `'all'`. It first freezes every parameter, then selectively unfreezes those that match the option. Note that `'all_adaptive'` and `'all'` set identical flags; the per-layer learning rates of Candidate 3 would have to be configured in the optimizer, not here. This function lets us experiment with the three candidate approaches and the full-unfreeze option within our hardware constraints, and pick the best strategy based on experimental results.

In [43]:
# Function to set requires_grad for specific layers
def set_requires_grad(model, unfreeze_option=None):
    """
    Set requires_grad for model parameters based on unfreeze_option.
    Options: 'heads_only', 'all_adaptive', 'output_layers', 'all'.
    Any other value (including None) leaves every parameter frozen.
    """
    head_keys = ('classification_head', 'ner_head')
    top_layer_keys = ('encoder_layers.9', 'encoder_layers.10', 'encoder_layers.11')

    # Name fragments that mark a parameter as trainable for each partial strategy
    trainable_keys = {
        'heads_only': head_keys,
        'output_layers': top_layer_keys + head_keys,
    }.get(unfreeze_option, ())

    # 'all_adaptive' and 'all' set the same flags; layer-wise learning rates
    # have to be configured in the optimizer, not here
    unfreeze_all = unfreeze_option in ('all_adaptive', 'all')

    for name, param in model.named_parameters():
        param.requires_grad = unfreeze_all or any(key in name for key in trainable_keys)

def demo_unfreeze_strategy(model):
    """
    Demonstrates different unfreeze strategies and prints the number of trainable parameters.
    """
    strategies = ['heads_only', 'all_adaptive', 'output_layers', 'all']

    for strategy in strategies:
        print("\n" + "=" * 50)
        print(f"Applying Unfreeze Strategy: {strategy}")
        print("=" * 50)

        # Apply the unfreeze strategy
        set_requires_grad(model, strategy)

        # Count trainable parameters
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        total_params = sum(p.numel() for p in model.parameters())

        print(f" Trainable Parameters: {trainable_params:,} / {total_params:,}")

In [44]:
# Demonstrate different unfreeze strategies
demo_unfreeze_strategy(model)


Applying Unfreeze Strategy: heads_only
 Trainable Parameters: 14,611 / 108,906,259

Applying Unfreeze Strategy: all_adaptive
 Trainable Parameters: 108,906,259 / 108,906,259

Applying Unfreeze Strategy: output_layers
 Trainable Parameters: 21,278,227 / 108,906,259

Applying Unfreeze Strategy: all
 Trainable Parameters: 108,906,259 / 108,906,259


### 2.3 Explanation of Results

The results show trainable parameters out of 108,906,259 total:

1. **Heads Only**: 14,611  
   - Only the heads train. Lowest memory, a good fit for the RTX 4060, but the backbone stays fixed.
2. **All Adaptive**: 108,906,259  
   - All layers train. Most capacity to adapt, but the highest memory demand and tough on 8GB VRAM.
3. **Output Layers**: 21,278,227  
   - The last 3 encoder layers and the heads train. Balances memory and tuning.
4. **All**: 108,906,259  
   - Everything trains. The count equals “all adaptive” because both options set the same `requires_grad` flags; they would differ only in the optimizer’s learning rates.

**Meaning**: “Heads only” fits our hardware best (1.5.2 pick). “Output layers” trains more parameters but needs care with memory. “All” and “all adaptive” are the heaviest options for this GPU. The counts match what each strategy should unfreeze, which indicates the code sets the flags correctly.
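
As a cross-check, the three printed counts can be reproduced by hand. The arithmetic below is a sketch under stated assumptions: the standard `bert-base-uncased` shapes (vocabulary 30,522, 512 positions, 2 token types, feed-forward size 3,072), a backbone without a pooler layer, and each head being a single linear layer with bias, with 2 classes and 17 NER tags (B-/I- for the 8 entity types in the report, plus `O`). None of these are stated explicitly in the notebook; the fact that the totals agree suggests, but does not prove, that this is the layout.

```python
HIDDEN, FFN, VOCAB, MAX_POS, TYPES, LAYERS = 768, 3072, 30522, 512, 2, 12

def linear(n_in, n_out):
    """Parameter count of a dense layer: weights plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN                       # scale + shift

# Word, position, and token-type embeddings, followed by one LayerNorm
embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + layer_norm

encoder_layer = (4 * linear(HIDDEN, HIDDEN)   # query, key, value, attention output
                 + linear(HIDDEN, FFN) + linear(FFN, HIDDEN)
                 + 2 * layer_norm)
backbone = embeddings + LAYERS * encoder_layer

NUM_CLASSES, NUM_TAGS = 2, 17                 # assumed head sizes (see lead-in)
heads = linear(HIDDEN, NUM_CLASSES) + linear(HIDDEN, NUM_TAGS)

print(f"heads_only:    {heads:,}")                      # 14,611
print(f"output_layers: {3 * encoder_layer + heads:,}")  # 21,278,227
print(f"all:           {backbone + heads:,}")           # 108,906,259
```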


### 2.4 Summary of Key Decisions and Insights for Task 3

- **Training Choice**: I chose to **freeze the backbone and train the heads**, which is best for both the forced and free scenarios. Fully fine-tuning all ~109M parameters (108,906,259 in total, of which 14,611 belong to the heads) would strain my RTX 4060’s 8GB VRAM, and 50k samples per task (100k total) suit the heads well.
- **Rejected Options**: Freezing everything stops learning entirely, which wastes the 50k samples. Freezing one head unbalances the tasks, since both need training.
- **Free Preference**: I stuck with freezing the backbone over unfreezing the last layers (~21M trainable) or everything (~109M). VRAM limits favor simplicity, and the heads can use the 50k samples without full tuning.
- **Model**: Picked `bert-base-uncased` (~110M params, 12 layers): pre-trained, and it fits my laptop with small batches (8-16).
- **Layer Plan**: Freeze the backbone and train the heads (14,611 params). This saves VRAM, uses BERT’s pre-trained knowledge, and adapts fast with 50k samples.
- **How**: Batch size 8-16, 1e-4 rate for heads, dropout for stability.
- **Insights**: 8GB VRAM limits me to small-scale training; 50k samples work for the heads, not the full backbone. Transfer learning with BERT saves effort: the backbone supplies general language knowledge and the heads focus on the tasks.

# Task 4: Training Loop Implementation

Finally, we reach Task 4: implementing a training loop for our `MultiTaskSentenceTransformer` model, which performs classification and NER. We will use the datasets, data loaders, and model structure defined in Task 2. This introduction analyzes the task requirements, explains our decisions, and highlights the components we need to implement, as preparation for reading the code.

---

## 1 Task Analysis and Approach

### 1.1 Task Analysis
Task 4 requires us to implement a training loop for our multi-task learning (MTL) model, focusing on handling hypothetical data, the forward pass, and metrics tracking. The model must optimize both tasks—classification and NER—using the shared transformer backbone, with separate train and test phases to compare performance before and after training. The MTL framework necessitates careful handling of dual objectives, ensuring both heads (classification and NER) learn effectively without interference, while managing our hardware constraints (RTX 4060 with 8GB VRAM).

### 1.2 Decisions Made
- **Freezing the Backbone**: For simplicity, we will freeze the transformer backbone (~109 million parameters) and train only the classification and NER heads (14,611 parameters combined). This reduces memory usage, making training feasible on our RTX 4060, while focusing on task-specific learning in the heads.
- **Training Mode**: We will implement a flexible training loop supporting both **combined training** (processing both tasks together) and **separate training** (optimizing tasks independently), controlled by a `combine_tasks` parameter. Combined training leverages MTL benefits by optimizing both heads simultaneously, but requires loss balancing. Separate training avoids this complexity, focusing on each task individually, which is simpler given the frozen backbone.
- **Loss Weighting**: In combined mode, we’ll weight classification and NER losses equally (0.5 each) for simplicity, acknowledging that NER might produce larger losses due to its complexity.
- **Metrics**: We’ll track accuracy and a classification report for Task A, and an entity-level F1-report for Task B, using the existing `evaluate_classification` and `evaluate_ner` functions from Task 2. Metrics will be computed before and after training on test sets to assess improvement.
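
The loss weighting described above can be illustrated on fake logits. This is only a sketch: the tensor shapes, the 17-tag assumption, and the choice to mark the last two positions as ignored are made up for the demo. The `ignore_index=-100` setting mirrors the NER criterion used later in this notebook; which positions actually carry `-100` in the real data is not shown here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion_classification = nn.CrossEntropyLoss()
criterion_ner = nn.CrossEntropyLoss(ignore_index=-100)   # label -100 is skipped in the loss

# Fake logits standing in for the outputs of the two heads
class_logits = torch.randn(4, 2)                  # (batch, num_classes)
class_labels = torch.tensor([0, 1, 1, 0])
ner_logits = torch.randn(4, 6, 17)                # (batch, seq_len, num_tags), 17 tags assumed
ner_labels = torch.randint(0, 17, (4, 6))
ner_labels[:, 4:] = -100                          # pretend the last two positions are padding

loss_weight = 0.5                                 # equal weighting, as in combined mode
loss_class = criterion_classification(class_logits, class_labels)
loss_ner = criterion_ner(ner_logits.view(-1, 17), ner_labels.view(-1))
loss = loss_weight * loss_class + (1 - loss_weight) * loss_ner
print(f"class={loss_class.item():.4f} ner={loss_ner.item():.4f} combined={loss.item():.4f}")
```

If the NER loss turns out to be consistently larger, lowering `1 - loss_weight` is the simplest knob to try before anything more elaborate.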

### 1.3 How We’ll Implement It
- **Handling the Dataset**: We have already implemented `classification_loader` and `ner_loader`. To train both tasks together, we’ll set up a `CombinedTaskIterator` that yields one classification batch and one NER batch per step, which keeps things organized without touching the core `MultiTaskDataset` structure.
- **Forward Pass**: The model will compute `classification_logits` and `ner_logits` for each batch. In combined mode, both losses are calculated and combined; in separate mode, only the relevant loss is computed per task.
- **Training Loop**: We’ll implement a `train_and_evaluate` function that supports both modes. With the backbone already frozen by a separate helper, it will train the heads, print losses per epoch, and evaluate metrics before and after training on test sets.
- **Evaluation**: Metrics will be printed for both tasks, allowing comparison of performance (e.g., accuracy for classification, F1 for NER) to confirm the heads’ learning progress.

### 1.4 Components to Implement
- **CombinedTaskIterator**: A wrapper to pair classification and NER batches for combined training, ensuring balanced sampling without altering `MultiTaskDataset`.
- **Freeze Backbone Function**: A utility to freeze the backbone and unfreeze the heads, ensuring only the heads’ parameters are updated.
- **Training and Evaluation Loop**: The main loop will handle both training modes, compute losses, and evaluate metrics using existing functions, with separate train and test phases.
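
A quick way to convince ourselves that freezing behaves as intended is to take one optimizer step and check which weights moved. The sketch below uses a hypothetical `TinyMultiTask` module that merely shares the head attribute names (`classification_head`, `ner_head`) with the real model; the layer sizes and the dummy loss are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in that reuses the head names from this notebook."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.classification_head = nn.Linear(8, 2)
        self.ner_head = nn.Linear(8, 17)

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        return self.classification_head(h), self.ner_head(h)

torch.manual_seed(0)
model = TinyMultiTask()

# Same name-based rule as the freeze helper: only the heads stay trainable
for name, param in model.named_parameters():
    param.requires_grad = "classification_head" in name or "ner_head" in name

# Hand only trainable parameters to the optimizer
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=5e-4)
before = {n: p.detach().clone() for n, p in model.named_parameters()}

class_logits, ner_logits = model(torch.randn(4, 8))
(class_logits.sum() + ner_logits.sum()).backward()    # dummy loss, just to get gradients
optimizer.step()

for name, param in model.named_parameters():
    changed = not torch.equal(before[name], param)
    print(f"{name:28s} trainable={param.requires_grad} changed={changed}")
```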

With these decisions and components in place, we’re ready to implement the training loop, providing flexibility to experiment with both combined and separate training approaches.

## 2 Explanation of Implementation

### 2.1 Imports

- Imports essential libraries (**PyTorch**, **tqdm**, **sklearn**, **seqeval**) for training, evaluation, and metrics in classification and NER tasks.


In [45]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from seqeval.metrics import classification_report as seqeval_report
from seqeval.scheme import IOB2
import itertools

### 2.2 Combining Classification and NER Batches
- Wraps **classification** and **NER** data loaders into a unified iterator.
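- Cycles through batches by restarting whichever loader runs out. Because it never raises `StopIteration`, the iterator is infinite, so the training loop has to bound the number of steps per epoch.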

In [46]:
# Wrapper to combine classification and NER batches
class CombinedTaskIterator:
    def __init__(self, classification_loader, ner_loader):
        self.classification_loader = classification_loader
        self.ner_loader = ner_loader
        self.class_iter = iter(classification_loader)
        self.ner_iter = iter(ner_loader)
    
    def __iter__(self):
        return self
    
    def __next__(self):
        try:
            class_batch = next(self.class_iter)
        except StopIteration:
            self.class_iter = iter(self.classification_loader)
            class_batch = next(self.class_iter)
        
        try:
            ner_batch = next(self.ner_iter)
        except StopIteration:
            self.ner_iter = iter(self.ner_loader)
            ner_batch = next(self.ner_iter)
        
        return {"classification": class_batch, "ner": ner_batch}
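

The cycling behaviour is easiest to see on toy data. The sketch below restates the class in compact form so that it runs on its own, and uses plain lists in place of `DataLoader` objects; the batch names and `steps_per_epoch = 3` are made up for the demo. `itertools.islice` is one way to cap an epoch, since the iterator itself never stops.

```python
import itertools

class CombinedTaskIterator:
    """Compact restatement of the class above, for a self-contained demo."""
    def __init__(self, classification_loader, ner_loader):
        self.loaders = {"classification": classification_loader, "ner": ner_loader}
        self.iters = {task: iter(loader) for task, loader in self.loaders.items()}

    def __iter__(self):
        return self

    def __next__(self):
        batch = {}
        for task, loader in self.loaders.items():
            try:
                batch[task] = next(self.iters[task])
            except StopIteration:              # restart the exhausted loader
                self.iters[task] = iter(loader)
                batch[task] = next(self.iters[task])
        return batch

# Plain lists stand in for DataLoaders: 3 classification batches, 2 NER batches
combined = CombinedTaskIterator(["c0", "c1", "c2"], ["n0", "n1"])
steps_per_epoch = 3
epoch = list(itertools.islice(combined, steps_per_epoch))
for step in epoch:
    print(step)
```

On the third step the shorter NER list has been exhausted and restarts from `n0`, while the classification list is still on its first pass.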

### 2.3 Multi-Task Model Training and Evaluation
- Freezes the model backbone with `freeze_backbone()`, keeping **classification** and **NER** heads trainable.
- Trains and evaluates a multi-task model in **combined** or **separate** modes, using weighted loss and metrics.

In [47]:
# Freeze backbone while keeping output heads trainable
def freeze_backbone(model):
    """Freezes model backbone while allowing classification & NER heads to be trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = ('classification_head' in name or 'ner_head' in name)


# Training and evaluation function
def train_and_evaluate(model, classification_train_loader, ner_train_loader,
                       classification_test_loader, ner_test_loader,
                       criterion_classification, criterion_ner,
                       optimizer, device, id_to_label, combine_tasks=False, epochs=3, loss_weight=0.5):
    """Trains and evaluates the multi-task model in either combined or separate mode."""
    model.to(device)

    # Evaluate before training
    print("\nMetrics Before Training:")
    print("Classification:")
    accuracy_before, report_before = evaluate_classification(model, classification_test_loader, device)
    print(f"Accuracy: {accuracy_before:.4f}")
    print("Report:\n", report_before)

    print("NER:")
    ner_report_before = evaluate_ner(model, ner_test_loader, id_to_label, device)
    print("Report:\n", ner_report_before)

    model.train()

    # Combined training mode
    if combine_tasks:
        print("\nTraining in Combined Mode...")
        combined_iterator = CombinedTaskIterator(classification_train_loader, ner_train_loader)
        for epoch in range(epochs):
            total_loss = 0.0
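            # CombinedTaskIterator never raises StopIteration, so cap each epoch explicitly
            steps_per_epoch = len(classification_train_loader)
            for batch in tqdm(itertools.islice(combined_iterator, steps_per_epoch),
                              desc=f"Combined Epoch {epoch + 1}/{epochs}", total=steps_per_epoch):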
                optimizer.zero_grad()

                # Classification batch
                class_inputs, class_labels = batch["classification"]
                class_inputs = {k: v.to(device) for k, v in class_inputs.items()}
                class_labels = class_labels.to(device)
                class_logits, _ = model(**class_inputs)
                loss_class = criterion_classification(class_logits, class_labels)

                # NER batch
                ner_inputs, ner_labels = batch["ner"]
                ner_inputs = {k: v.to(device) for k, v in ner_inputs.items()}
                ner_labels = ner_labels.to(device)
                _, ner_logits = model(**ner_inputs)
                loss_ner = criterion_ner(ner_logits.view(-1, ner_logits.size(-1)), ner_labels.view(-1))

                # Combined loss
                loss = loss_weight * loss_class + (1 - loss_weight) * loss_ner
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            print(f"\nCombined Epoch {epoch + 1} Average Loss: {total_loss / len(classification_train_loader):.4f}")

    # Separate training mode
    else:
        print("\nTraining in Separate Mode...")

        # Train Classification
        print("Training Classification Head...")
        for epoch in range(epochs):
            total_loss = 0.0
            for batch_inputs, batch_labels in tqdm(classification_train_loader, desc=f"Classification Epoch {epoch + 1}/{epochs}"):
                batch_inputs = {k: v.to(device) for k, v in batch_inputs.items()}
                batch_labels = batch_labels.to(device)
                optimizer.zero_grad()
                class_logits, _ = model(**batch_inputs)
                loss = criterion_classification(class_logits, batch_labels)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            print(f"\nClassification Epoch {epoch + 1} Average Loss: {total_loss / len(classification_train_loader):.4f}")

        # Train NER
        print("\nTraining NER Head...")
        for epoch in range(epochs):
            total_loss = 0.0
            for batch_inputs, batch_labels in tqdm(ner_train_loader, desc=f"NER Epoch {epoch + 1}/{epochs}"):
                batch_inputs = {k: v.to(device) for k, v in batch_inputs.items()}
                batch_labels = batch_labels.to(device)
                optimizer.zero_grad()
                _, ner_logits = model(**batch_inputs)
                loss = criterion_ner(ner_logits.view(-1, ner_logits.size(-1)), batch_labels.view(-1))
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            print(f"\nNER Epoch {epoch + 1} Average Loss: {total_loss / len(ner_train_loader):.4f}")

    # Evaluate after training
    print("\nMetrics After Training:")
    print("Classification:")
    accuracy_after, report_after = evaluate_classification(model, classification_test_loader, device)
    print(f"Accuracy: {accuracy_after:.4f}")
    print("Report:\n", report_after)

    print("NER:")
    ner_report_after = evaluate_ner(model, ner_test_loader, id_to_label, device)
    print("Report:\n", ner_report_after)

### 2.4 Running Separate-Task Training
- Sets up the **device** (GPU/CPU), freezes the backbone, and defines **loss functions** and **optimizer**.
- Trains the model with **separate task training** for classification and NER using `train_and_evaluate()`.

In [69]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Freeze backbone layers while keeping classification & NER heads trainable
freeze_backbone(model)

# Define loss functions for classification and NER
criterion_classification = nn.CrossEntropyLoss()
criterion_ner = nn.CrossEntropyLoss(ignore_index=-100)

# Define optimizer (only update trainable parameters)
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=5e-4)

# ID-to-label mapping for NER evaluation
id_to_label = {v: k for k, v in ner_label_map.items()}

# Train model with separate task training
print("Training with Separate Approach...")
train_and_evaluate(model, classification_train_loader, ner_train_loader,
                   classification_test_loader, ner_test_loader,
                   criterion_classification, criterion_ner,
                   optimizer, device, id_to_label,
                   combine_tasks=False)

Training with Separate Approach...

Metrics Before Training:
Classification:
Accuracy: 0.5000
Report:
               precision    recall  f1-score   support

    negative       0.50      0.80      0.62         5
    positive       0.50      0.20      0.29         5

    accuracy                           0.50        10
   macro avg       0.50      0.50      0.45        10
weighted avg       0.50      0.50      0.45        10

NER:


NER Evaluation Debugging:
Total Sentences Evaluated: 10
First Prediction Example: ['B-gpe', 'B-gpe', 'I-eve', 'I-eve', 'I-eve', 'I-eve', 'I-eve', 'O', 'I-per', 'I-eve', 'I-eve', 'B-org']
First Label Example: ['O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'B-gpe', 'O', 'O']
Report:
               precision    recall  f1-score   support

         art       0.00      0.00      0.00         0
         eve       0.00      0.00      0.00         0
         geo       0.00      0.00      0.00         7
         gpe       0.00      0.00      0.00         3
         nat       0.00      0.00      0.00         0
         org       0.00      0.00      0.00         2
         per       0.00      0.00      0.00         0
         tim       0.00      0.00      0.00         6

   micro avg       0.00      0.00      0.00        18
   macro avg       0.00      0.00      0.00        18
weighted avg       0.00      0.00      0.00        18


Training in Separate Mode...
Training Classification Head...


Classification Epoch 1/3: 100%|██████████████████████████████████████████████████████| 624/624 [02:07<00:00,  4.91it/s]



Classification Epoch 1 Average Loss: 0.5158


Classification Epoch 2/3: 100%|██████████████████████████████████████████████████████| 624/624 [02:15<00:00,  4.59it/s]



Classification Epoch 2 Average Loss: 0.3910


Classification Epoch 3/3: 100%|██████████████████████████████████████████████████████| 624/624 [02:20<00:00,  4.44it/s]



Classification Epoch 3 Average Loss: 0.3544

Training NER Head...


NER Epoch 1/3: 100%|█████████████████████████████████████████████████████████████████| 597/597 [02:13<00:00,  4.48it/s]



NER Epoch 1 Average Loss: 0.5309


NER Epoch 2/3: 100%|█████████████████████████████████████████████████████████████████| 597/597 [02:12<00:00,  4.50it/s]



NER Epoch 2 Average Loss: 0.2918


NER Epoch 3/3: 100%|█████████████████████████████████████████████████████████████████| 597/597 [02:11<00:00,  4.54it/s]



NER Epoch 3 Average Loss: 0.2601

Metrics After Training:
Classification:
Accuracy: 0.9000
Report:
               precision    recall  f1-score   support

    negative       0.83      1.00      0.91         5
    positive       1.00      0.80      0.89         5

    accuracy                           0.90        10
   macro avg       0.92      0.90      0.90        10
weighted avg       0.92      0.90      0.90        10

NER:

NER Evaluation Debugging:
Total Sentences Evaluated: 10
First Prediction Example: ['B-geo', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-tim', 'O']
First Label Example: ['B-geo', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-tim', 'O']
Report:
               precision    recall  f1-score   support

         geo       0.50      0.57      0.53         7
         gpe       0.75      1.00      0.86         3
         org       0.00      0.00      0.00         2
         ti

### 2.5 Explanation of Results
The output shows training and evaluation on an RTX 4060 with the backbone frozen and the heads trained separately, using 10% of the data (5k of the 50k training samples per task) and a 10-example test set.

#### Before Training
- **Classification**: Accuracy is 0.5000, which is chance level for two balanced classes. Per-class scores are uneven (e.g., 0.50/0.80/0.62 precision/recall/F1 for negative), as expected from an untrained head.
- **NER**: All metrics are 0.00 (no correct predictions). Predictions like `B-gpe`, `I-eve` mismatch true labels (`O`, `B-geo`), confirming the NER head is random before training.

#### Training Process
- **Classification**: Loss drops from 0.5158 (Epoch 1) to 0.3544 (Epoch 3) over 624 batches per epoch. It’s steady progress with 5k samples.
- **NER**: Loss reduces from 0.5309 (Epoch 1) to 0.2601 (Epoch 3) over 597 batches. The decrease shows learning despite the small dataset.

#### After Training
- **Classification**: Accuracy rises to 0.9000. Precision/recall/F1 improve (e.g., 0.83/1.00/0.91 for negative, 1.00/0.80/0.89 for positive). With only 10 test samples, each example moves accuracy by 10 points, so this is an encouraging signal, not a precise estimate.
- **NER**: Metrics rise—micro F1 is 0.65 (from 0.00). Predictions align better (e.g., `B-geo`, `O`, `B-tim` match true labels), with F1 scores like 0.53 (geo) and 0.86 (gpe). Some entities (org) still lag.

#### Key Point
Although we only use 10% of the data, training behaves as expected. Loss falls steadily (0.5158 to 0.3544 for classification, 0.5309 to 0.2601 for NER), and metrics improve (accuracy 0.50 to 0.90, NER F1 0.00 to 0.65). This indicates that the heads learn with the frozen backbone on the RTX 4060, though the 10-example test set is too small to pin down the final scores precisely.
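
To get a feel for how much weight a 10-example test set can bear, one option is to put an approximate confidence interval around the 0.90 accuracy (9 of 10 correct). The sketch below uses the Wilson score interval; this is an addition for illustration and was not part of the original evaluation, and the helper name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

low, high = wilson_interval(correct=9, total=10)   # 0.90 accuracy on 10 test samples
print(f"accuracy 0.90, 95% interval roughly {low:.2f} to {high:.2f}")
```

The interval is wide, which is why a larger held-out set would be needed before comparing strategies on final scores.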

### 2.6 Summary of Key Decisions and Insights for Task 4
- **Training Strategy**: I decided to **freeze the backbone** (~109M parameters) and train only the heads (14,611 parameters combined for classification and NER). This keeps memory low for my RTX 4060’s 8GB VRAM, making training possible, and focuses learning on the task-specific heads.
- **Training Mode**: I chose a **flexible loop** with two options—**combined training** (both tasks together) and **separate training** (tasks alone), controlled by `combine_tasks`. Combined mode mixes tasks with equal loss weights (0.5 each) for MTL benefits, while separate mode simplifies by training each head on its own, suiting the frozen backbone.
- **Implementation Choices**: I used a `CombinedTaskIterator` to pair batches for combined mode without changing `MultiTaskDataset`, keeping things neat. The forward pass computes both `classification_logits` and `ner_logits`, with losses handled per mode. The `train_and_evaluate` function trains the heads, tracks losses, and shows metrics before and after; the backbone is frozen beforehand with `freeze_backbone`.
- **Metrics Tracking**: I picked **accuracy** and a report for classification, and **F1-score** for NER, using existing functions. This compares performance on test sets to confirm the heads improve.
- **Insights**: Freezing the backbone fits my hardware limits and works with 50k samples per task. Combined training could boost both tasks but needs careful loss balancing (NER losses might dominate). Separate training is simpler and safer for now. The setup lets me test both approaches easily, showing how well the heads learn with a fixed backbone.
