**Task 1: Sentence Transformer Implementation**

In [1]:
from transformers import BertTokenizer, BertModel
import torch

# Load the pre-trained BERT model and tokenizer
tokenizer_var = BertTokenizer.from_pretrained('bert-base-uncased')
model_var = BertModel.from_pretrained('bert-base-uncased')

# Sample sentences for performing our task
sentences_list = ["Cricket is an amazing sport.", "Indian cricket team is a champion.", "Virat Kohli is my favorite player.", "Dhoni is a superstar."]

# Tokenize and encode all the sentences
inputs_var = tokenizer_var(sentences_list, padding = True, truncation = True, return_tensors = "pt")

# Get sentence embeddings
with torch.no_grad():
    outputs_var = model_var(**inputs_var)
    sentence_embeddings = outputs_var.pooler_output

# Printing outputs/results
print(sentence_embeddings.shape)
print("\n\n")
print(sentence_embeddings)



torch.Size([4, 768])



tensor([[-0.7400, -0.6163, -0.9663,  ..., -0.9046, -0.7043,  0.7008],
        [-0.9473, -0.6311, -0.9113,  ..., -0.8447, -0.6211,  0.9265],
        [-0.8555, -0.4487, -0.8454,  ..., -0.6958, -0.6657,  0.9032],
        [-0.8796, -0.5841, -0.8118,  ..., -0.5250, -0.6784,  0.9095]])


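**No architectural changes were needed for this task: the pre-trained bert-base-uncased model is used as it is, as a feature extractor. The one design choice is the pooling strategy. Here the sentence embedding is BERT's pooler_output, which is the final hidden state of the [CLS] token passed through a dense layer with a tanh activation, giving one 768-dimensional vector per sentence. BERT is a transformer encoder that has been pre-trained on a large corpus of text, and it can be fine-tuned or used as a fixed feature extractor for various natural language processing tasks, including sentence embedding.**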
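
The pooler output is only one way to obtain a sentence vector. A commonly used alternative is mean pooling over the token vectors in last_hidden_state, with padding positions excluded through the attention mask. The sketch below illustrates the idea on stand-in random tensors rather than real BERT outputs; `mean_pool` is a hypothetical helper name, not a function from the transformers library.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors of each sentence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # (batch, 1)
    return summed / counts

# Stand-in tensors (assumption): 2 sentences, up to 4 tokens, hidden size 8.
torch.manual_seed(0)
hidden = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])  # second sentence has 2 padding tokens

embeddings = mean_pool(hidden, mask)
print(embeddings.shape)
```

With the notebook's variables, the equivalent call would be `mean_pool(outputs_var.last_hidden_state, inputs_var["attention_mask"])`.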

**Task 2: Multi-Task Learning Expansion**

In [2]:
from transformers import BertTokenizer, BertModel
import torch
import torch.nn as nn

# Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Define the multi tasking BERT model
class MultiTaskBertModel(nn.Module):
    def __init__(self, bert_model, num_classes_task_a, num_classes_task_b):
        super(MultiTaskBertModel, self).__init__()
        self.bert = bert_model
        self.task_a_head = nn.Linear(bert_model.config.hidden_size, num_classes_task_a)
        self.task_b_head = nn.Linear(bert_model.config.hidden_size, num_classes_task_b)

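    def forward(self, input_ids, attention_mask, token_type_ids=None):
        # Pass arguments by keyword so they cannot be mismatched positionally
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)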
        task_a_output = self.task_a_head(bert_output.pooler_output)
        task_b_output = self.task_b_head(bert_output.last_hidden_state)
        return task_a_output, task_b_output

# Instantiate the multitask models
num_classes_task_a = 5  # For Sentence Classification
num_classes_task_b = 10  # For Named Entity Recognition
model = MultiTaskBertModel(bert_model, num_classes_task_a, num_classes_task_b)

# Sample sentences for performing our task
sentences_list = ["Cricket is an amazing sport.", "Indian cricket team is a champion.", "Virat Kohli is my favorite player.", "Dhoni is a superstar."]

# Tokenize and encode all the sentences
inputs = tokenizer(sentences_list, padding = True, truncation = True, return_tensors = "pt")

# Get multitask outputs
task_a_output, task_b_output = model(**inputs)

# Define loss functions for each task
task_a_loss_fn = nn.CrossEntropyLoss()
task_b_loss_fn = nn.BCEWithLogitsLoss()

# Compute losses for each task
task_a_labels = torch.tensor([0, 1, 2, 3])

# Create target labels for Task B with the same shape as the models output
task_b_labels = torch.zeros_like(task_b_output, dtype = torch.float32)
task_b_labels[:, 0, 0] = 1
task_b_labels[:, 1, 1] = 1

task_a_loss = task_a_loss_fn(task_a_output, task_a_labels)
task_b_loss = task_b_loss_fn(task_b_output.view(-1, num_classes_task_b), task_b_labels.view(-1, num_classes_task_b))

# Combine losses for multi-task learning with equal weights
task_a_weight = 0.5
task_b_weight = 0.5
combined_loss = task_a_loss * task_a_weight + task_b_loss * task_b_weight

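# Optimize the combined loss (a single illustrative step)
optimizer = torch.optim.Adam(model.parameters(), lr = 0.001)
optimizer.zero_grad()  # clear any stale gradients before backpropagation
combined_loss.backward()
optimizer.step()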

# Print the multitask outputs
print("Task A Output (Sentence Classification):")
print(task_a_output)
print("\nTask B Output (Named Entity Recognition):")
print(task_b_output)

Task A Output (Sentence Classification):
tensor([[-0.1995,  0.4531,  0.0522,  0.0714, -0.0694],
        [-0.2850,  0.3624,  0.0745,  0.1833,  0.1519],
        [-0.2626,  0.3303, -0.0718,  0.1448,  0.1404],
        [-0.2447,  0.3235, -0.0451,  0.2304,  0.1812]],
       grad_fn=<AddmmBackward0>)

Task B Output (Named Entity Recognition):
tensor([[[-0.0782,  0.2481,  0.2469, -0.2666,  0.0767,  0.0966, -0.2244,
           0.1793, -0.3685, -0.1526],
         [ 0.2004, -0.0919, -0.4463,  0.1196,  0.2945,  0.2364, -0.1911,
           0.2368, -0.0732,  0.2632],
         [ 0.2033, -0.1546,  0.1038, -0.0460,  0.2551,  0.2414, -0.0045,
          -0.1665, -0.0410,  0.0941],
         [-0.1745, -0.2476,  0.1787, -0.2564,  0.3351,  0.2298, -0.0077,
          -0.0946, -0.1218,  0.1827],
         [ 0.1389, -0.2824, -0.0690, -0.0648,  0.0958,  0.2538,  0.0642,
           0.4035, -0.0128,  0.5688],
         [ 0.2461, -0.3082, -0.1450,  0.1179,  0.4300,  0.4292,  0.3234,
          -0.1333, -0.3076,  0.392

**The main architectural change is the addition of task-specific output heads on top of the pre-trained BERT model. A new class, MultiTaskBertModel, inherits from nn.Module, wraps the shared BERT encoder, and adds two separate heads: task_a_head for Sentence Classification and task_b_head for Named Entity Recognition.**



**Both heads are linear layers. The Task A head takes BERT's pooler output, one vector per sentence, and maps it to 5 sentence classes, giving logits of shape (batch, 5). The Task B head takes the last hidden state, one vector per token, and maps it to 10 entity tags, giving logits of shape (batch, sequence length, 10). In the forward method, the input passes through BERT once and then through both heads, so the two tasks share a single encoder pass.**



**For training, we compute a separate loss for each task: CrossEntropyLoss for Task A and BCEWithLogitsLoss for Task B. BCEWithLogitsLoss expects float targets with the same shape as the logits, so we create task_b_labels with torch.zeros_like(task_b_output) and set a few entries to 1. These labels, like the Task A labels, are placeholders that only demonstrate the training mechanics. Real NER training would need a gold tag for every token and would usually exclude padding positions from the loss.**
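
The Task B loss in the notebook treats each tag as an independent binary label. If every token carries exactly one tag, which is the usual NER formulation, a token-level cross-entropy is a possible alternative. The sketch below assumes stand-in random logits and hypothetical tag ids, and uses -100 (the default ignore_index of CrossEntropyLoss) to mark padding or special tokens that should not contribute to the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_tags = 10          # matches num_classes_task_b in the notebook
IGNORE_INDEX = -100    # default ignore_index of nn.CrossEntropyLoss

torch.manual_seed(0)
# Stand-in for task_b_output (assumption): batch of 2 sentences, 5 tokens each.
logits = torch.randn(2, 5, num_tags)

# Hypothetical tag ids per token; ignored positions are marked with IGNORE_INDEX.
tag_ids = torch.tensor([
    [0, 3, 3, 0, IGNORE_INDEX],
    [1, 0, IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX],
])

loss_fn = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)
# Flatten to (batch * seq_len, num_tags) logits and (batch * seq_len,) targets.
token_loss = loss_fn(logits.view(-1, num_tags), tag_ids.view(-1))
print(token_loss.item())
```

The loss is averaged over the six labeled tokens only, so sentences with more padding do not dilute it.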



**We then combine the two losses as a weighted sum (0.5 each here) and take one optimization step on the combined loss with Adam. Because both losses backpropagate through the same BERT encoder, the shared layers receive gradients from both tasks, while each head is updated only by its own task's loss. This lets the model learn representations that serve both NLP tasks at once.**
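
The cell above takes a single optimization step. A minimal sketch of how the same weighted-sum objective might be used over several steps is shown below. It assumes a tiny stand-in network (`TinyMultiTask`, hypothetical) in place of BERT so that it runs without downloading weights, and it uses random inputs and placeholder labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in for MultiTaskBertModel: one shared layer, two heads."""
    def __init__(self, hidden=16, num_a=5, num_b=10):
        super().__init__()
        self.shared = nn.Linear(8, hidden)
        self.task_a_head = nn.Linear(hidden, num_a)
        self.task_b_head = nn.Linear(hidden, num_b)

    def forward(self, x):                        # x: (batch, seq_len, 8)
        h = torch.tanh(self.shared(x))           # (batch, seq_len, hidden)
        # The first position plays the role of the [CLS] token for Task A.
        return self.task_a_head(h[:, 0]), self.task_b_head(h)

torch.manual_seed(0)
model = TinyMultiTask()
x = torch.randn(4, 6, 8)                                         # placeholder inputs
labels_a = torch.tensor([0, 1, 2, 3])                            # placeholder sentence labels
labels_b = F.one_hot(torch.randint(0, 10, (4, 6)), 10).float()   # placeholder token labels

loss_a_fn, loss_b_fn = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()
weight_a, weight_b = 0.5, 0.5
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(20):
    optimizer.zero_grad()                 # reset gradients from the previous step
    out_a, out_b = model(x)               # recompute the forward pass each step
    loss = weight_a * loss_a_fn(out_a, labels_a) + weight_b * loss_b_fn(out_b, labels_b)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The two points that matter here are that the forward pass is repeated inside the loop and that gradients are cleared before each backward pass.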

**Task 3: Training Considerations**

**Question:** Discuss the implications and advantages of each scenario and explain your rationale as to how the model should be trained given the following:
If the entire network should be frozen.
If only the transformer backbone should be frozen.
If only one of the task-specific heads (either for Task A or Task B) should be frozen.

**Answer:**

The decision to freeze or not freeze different parts of the multi-task model during training has significant implications on the model's performance and learning capabilities.

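If the entire network is frozen, including the pre-trained BERT model and the task-specific heads, no parameter is updated and no training takes place. This only makes sense for inference or evaluation with heads that have already been trained. In this notebook the heads are randomly initialized, so a fully frozen model would produce essentially arbitrary predictions. The advantage is cost: no gradients or optimizer state need to be stored.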

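Freezing only the transformer backbone (the pre-trained BERT model) while training the task-specific heads is a common approach when data or compute is limited. BERT then acts as a fixed feature extractor: training is fast, only the small linear heads are updated, and the pre-trained representations cannot be degraded by noisy gradients. The trade-off is that the features cannot adapt to the target tasks, so accuracy is usually lower than with full fine-tuning when enough labeled data is available.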

If one of the task-specific heads (either Task A or Task B) is frozen while the other head and the transformer backbone remain trainable, only the unfrozen head and the shared encoder adapt. This is useful when the frozen head is already well trained and should not change, for example when a new task is added to an existing model. The risk is that the backbone keeps changing underneath the frozen head: the features it was trained on drift, so performance on that task may degrade unless its loss is still included to constrain the backbone.
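
In PyTorch, freezing is typically expressed by setting requires_grad to False on the relevant parameters. The sketch below walks through the three scenarios on a small stand-in module that reuses the attribute names of MultiTaskBertModel (bert, task_a_head, task_b_head); the stand-in class and the two helper functions are hypothetical and only illustrate the mechanics.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def count_trainable(module: nn.Module) -> int:
    """Number of scalar parameters that would receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

class StandIn(nn.Module):
    """Hypothetical stand-in with the same attribute names as MultiTaskBertModel."""
    def __init__(self):
        super().__init__()
        self.bert = nn.Linear(8, 16)          # 8*16 + 16 = 144 parameters
        self.task_a_head = nn.Linear(16, 5)   # 16*5 + 5  = 85 parameters
        self.task_b_head = nn.Linear(16, 10)  # 16*10 + 10 = 170 parameters

model = StandIn()

# Backbone frozen, both heads trainable.
set_trainable(model.bert, False)
backbone_frozen = count_trainable(model)

# Backbone trainable again, only the Task A head frozen.
set_trainable(model.bert, True)
set_trainable(model.task_a_head, False)
head_a_frozen = count_trainable(model)

# Entire network frozen.
set_trainable(model, False)
all_frozen = count_trainable(model)

print(backbone_frozen, head_a_frozen, all_frozen)  # prints: 255 314 0
```

When part of the model is frozen, it is common to pass only the parameters with requires_grad set to True to the optimizer.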

**Question:** Consider a scenario where transfer learning can be beneficial. Explain how you would approach the transfer learning process, including:
The choice of a pre-trained model.
The layers you would freeze/unfreeze.
The rationale behind these choices.

**Answer:**

Consider a scenario where we have a small labeled dataset for a text classification task, such as classifying product reviews as positive or negative. Transfer learning helps here because a model trained from scratch on so little data would likely overfit, whereas a large pre-trained language model already encodes general knowledge of syntax and semantics. For this task, I would choose a pre-trained encoder such as BERT, since its representations transfer well to sentence-level classification.

When fine-tuning such a model, it is common to freeze the weights of the lower layers (the embedding layer and the first transformer layers) and fine-tune only the higher layers (the final few transformer layers and the classification head). The lower layers have learned general language representations that are useful for many tasks, so freezing them preserves this knowledge. The higher layers are more task-specific and need to be fine-tuned to adapt the model to our text classification task and dataset. With a small dataset, freezing the lower layers also reduces the number of trainable parameters and with it the risk of overfitting. This approach balances reuse of the pre-trained knowledge in the lower layers against specialization of the higher layers for the target task.
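
A minimal sketch of this partial freezing is given below. Hugging Face's BertModel exposes its embedding block as `embeddings` and its transformer blocks as the list `encoder.layer`; to keep the example runnable without downloading weights, a hypothetical stand-in (`TinyEncoder`) mirrors that layout with plain linear layers. The number of frozen layers (8 of 12) is an illustrative assumption, not a recommendation.

```python
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in mirroring BertModel's `embeddings` / `encoder.layer` layout."""
    def __init__(self, num_layers=12, hidden=16):
        super().__init__()
        self.embeddings = nn.Embedding(100, hidden)
        self.encoder = nn.Module()
        self.encoder.layer = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])

def freeze_lower_layers(backbone: nn.Module, num_frozen_layers: int) -> None:
    """Freeze the embeddings and the first `num_frozen_layers` encoder layers."""
    for param in backbone.embeddings.parameters():
        param.requires_grad = False
    for layer in backbone.encoder.layer[:num_frozen_layers]:
        for param in layer.parameters():
            param.requires_grad = False

backbone = TinyEncoder()
freeze_lower_layers(backbone, num_frozen_layers=8)  # illustrative split

# Indices of encoder layers that remain fully trainable.
trainable_layers = [
    index for index, layer in enumerate(backbone.encoder.layer)
    if all(p.requires_grad for p in layer.parameters())
]
print(trainable_layers)  # prints: [8, 9, 10, 11]
```

Under the same attribute-name assumption, the helper could be applied to the notebook's model as `freeze_lower_layers(model.bert, 8)`.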

**Task 4: Layer-wise Learning Rate Implementation (BONUS)**

In [3]:
from transformers import BertTokenizer, BertModel
import torch
import torch.nn as nn

# Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Define the multitask BERT model
class MultiTaskBertModel(nn.Module):
    def __init__(self, bert_model, num_classes_task_a, num_classes_task_b):
        super(MultiTaskBertModel, self).__init__()
        self.bert = bert_model
        self.task_a_head = nn.Linear(bert_model.config.hidden_size, num_classes_task_a)
        self.task_b_head = nn.Linear(bert_model.config.hidden_size, num_classes_task_b)

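    def forward(self, input_ids, attention_mask, token_type_ids=None):
        # Pass arguments by keyword so they cannot be mismatched positionally
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)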
        task_a_output = self.task_a_head(bert_output.pooler_output)
        task_b_output = self.task_b_head(bert_output.last_hidden_state)
        return task_a_output, task_b_output

# Instantiate the multitask model
num_classes_task_a = 5  # For Sentence Classification
num_classes_task_b = 10  # For Named Entity Recognition
model = MultiTaskBertModel(bert_model, num_classes_task_a, num_classes_task_b)

# Sample sentences for performing our task
sentences_list = ["Cricket is an amazing sport.", "Indian cricket team is a champion.", "Virat Kohli is my favorite player.", "Dhoni is a superstar."]

# Tokenize and encode all the sentences
inputs = tokenizer(sentences_list, padding = True, truncation = True, return_tensors = "pt")

# Get multi-task outputs
task_a_output, task_b_output = model(**inputs)

# Define loss functions for each task
task_a_loss_fn = nn.CrossEntropyLoss()
task_b_loss_fn = nn.BCEWithLogitsLoss()

# Compute losses for each task
task_a_labels = torch.tensor([0, 1, 2, 3])

# Create target labels for Task B with the same shape as the models output
task_b_labels = torch.zeros_like(task_b_output, dtype = torch.float32)
task_b_labels[:, 0, 0] = 1
task_b_labels[:, 1, 1] = 1

task_a_loss = task_a_loss_fn(task_a_output, task_a_labels)
task_b_loss = task_b_loss_fn(task_b_output.view(-1, num_classes_task_b), task_b_labels.view(-1, num_classes_task_b))

# Combine losses for multi-task learning
task_a_weight = 0.5
task_b_weight = 0.5
combined_loss = task_a_loss * task_a_weight + task_b_loss * task_b_weight

# Define layer wise learning rates
bert_lr = 1e-6  # Learning rate for the BERT layer
task_a_head_lr = 5e-3  # Learning rate for the Task A head
task_b_head_lr = 5e-3  # Learning rate for the Task B head

# Set up the optimizer with layer wise learning rates
param_groups = [
    {'params': model.bert.parameters(), 'lr': bert_lr},
    {'params': model.task_a_head.parameters(), 'lr': task_a_head_lr},
    {'params': model.task_b_head.parameters(), 'lr': task_b_head_lr}
]
optimizer = torch.optim.Adam(param_groups)

# Optimize the combined loss
optimizer.zero_grad()
combined_loss.backward()
optimizer.step()

# Print the multi-task outputs
print("Task A Output (Sentence Classification):")
print(task_a_output)
print("\nTask B Output (Named Entity Recognition):")
print(task_b_output)

Task A Output (Sentence Classification):
tensor([[-0.4516, -0.2711, -0.2930,  0.3932, -0.0454],
        [-0.6471, -0.1410, -0.3102,  0.4350,  0.0737],
        [-0.5945, -0.1222, -0.1726,  0.5016,  0.0890],
        [-0.6618, -0.0939, -0.2193,  0.4403,  0.1122]],
       grad_fn=<AddmmBackward0>)

Task B Output (Named Entity Recognition):
tensor([[[-2.5900e-01,  3.4105e-02, -2.9591e-01,  6.7089e-01, -1.4280e-01,
           1.2864e-01, -7.3645e-02, -7.1139e-01,  3.2045e-02,  1.7620e-01],
         [-3.0260e-01, -2.5108e-01,  2.5255e-01,  4.9917e-01, -3.4996e-01,
          -2.6063e-01, -6.3774e-02, -2.8200e-01, -1.0602e-01,  5.2566e-02],
         [ 1.4944e-01, -4.6727e-01, -1.1942e-01,  7.0476e-01,  1.3117e-02,
           8.2875e-02, -1.5034e-01, -5.0137e-01, -2.0650e-01,  1.2337e-01],
         [ 2.6011e-01, -5.7434e-01, -6.8963e-02,  4.9465e-01,  2.2418e-01,
           2.4502e-01,  8.7059e-02, -5.2728e-01,  2.2943e-02,  8.3204e-02],
         [ 4.6886e-01, -4.5517e-01, -9.3392e-02,  2.1924e-

In the given code, we set different learning rates for the BERT backbone and the task-specific heads (Task A and Task B heads) by giving the optimizer one parameter group per component. All BERT layers share a single, lower learning rate (bert_lr = 1e-6) because they are pre-trained on a massive corpus and have already learned rich language representations. A lower learning rate helps preserve this knowledge while still allowing some fine-tuning.

On the other hand, the task-specific heads have a higher learning rate (task_a_head_lr = task_b_head_lr = 5e-3) as they are randomly initialized and need to learn task-specific representations from scratch, requiring faster adaptation.

Using layer-wise learning rates can lead to faster convergence by allowing layers that need more learning to have higher rates. It can also improve generalization by preserving pre-trained knowledge in lower layers and stability by preventing large updates to well-initialized layers.

In the multi-task setting, layer-wise rates are particularly beneficial as they allow task-specific heads to quickly adapt to individual tasks while keeping the shared BERT layers relatively stable, enabling effective learning across multiple tasks simultaneously.
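
The code above uses one rate for the whole backbone. A finer-grained variant, sometimes called layer-wise learning rate decay, gives each encoder layer its own rate, with smaller rates for layers closer to the input. The sketch below is one possible way to build such parameter groups; the helper `layerwise_param_groups`, the decay factor 0.9, the top-layer rate 1e-5, and the linear stand-in layers are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

def layerwise_param_groups(layers, heads, head_lr=5e-3, top_layer_lr=1e-5, decay=0.9):
    """Build optimizer parameter groups with a separate learning rate per layer.

    Heads use head_lr. The top encoder layer uses top_layer_lr, and each layer
    below it is scaled by one more factor of `decay`.
    """
    groups = [{"params": list(head.parameters()), "lr": head_lr} for head in heads]
    num_layers = len(layers)
    for index, layer in enumerate(layers):
        depth_from_top = num_layers - 1 - index
        groups.append({
            "params": list(layer.parameters()),
            "lr": top_layer_lr * decay ** depth_from_top,
        })
    return groups

# Hypothetical stand-ins for BERT's encoder layers and the two task heads.
encoder_layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
heads = [nn.Linear(16, 5), nn.Linear(16, 10)]

optimizer = torch.optim.Adam(layerwise_param_groups(encoder_layers, heads))
rates = [group["lr"] for group in optimizer.param_groups]
print(rates)  # two head rates, then encoder rates from the bottom layer to the top
```

Assuming the Hugging Face layout, the notebook's model could supply `model.bert.encoder.layer` as the layers and `[model.task_a_head, model.task_b_head]` as the heads; the embeddings and pooler would then need their own groups.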

**For Task 3 and Task 4, besides the technical explanation, also provide a brief write-up summarizing your key decisions and insights.**

For Task 3, the key decision is how to balance preserving the pre-trained knowledge in BERT against adapting the model to the specific tasks. Freezing the entire network prevents any learning, so it is only suitable for inference. Freezing only the transformer backbone (BERT) and training the task-specific heads is cheap and keeps the pre-trained representations intact, at the cost of features that cannot adapt to the tasks. Freezing one task head can be useful when that head is already well trained and the other task needs more fine-tuning, but it may lead to suboptimal performance on the frozen head's task as the backbone changes. For transfer learning, choosing a pre-trained model like BERT and freezing the lower layers while fine-tuning the higher layers is a common approach, as it preserves the general language representations while allowing the model to specialize for the target task.

For Task 4, implementing layer-wise learning rates can lead to faster convergence, better generalization, and improved stability by assigning higher rates to layers that need more learning and lower rates to well-initialized or pre-trained layers. In the multi-task setting, this approach is particularly beneficial as it allows the task-specific heads to quickly adapt to individual tasks while keeping the shared BERT layers relatively stable, enabling effective learning across multiple tasks simultaneously.