## Sentence Transformers & Multi-Task Learning - ML Apprentice Project
### The goal of this exercise is to assess your ability to implement, train, and optimize neural network architectures, particularly focusing on transformers and multi-task learning extensions. Please explain any and all choices made in the course of this assessment.

## Task 1 - Sentence Transformer Implementation
#### Implement a sentence transformer model using any deep learning framework of your choice. This model should be able to encode input sentences into fixed-length embeddings. Test your implementation with a few sample sentences and showcase the obtained embeddings. Describe any choices you had to make regarding the model architecture outside of the transformer backbone.

I am using PyTorch for its flexibility, its ease of use in experimental projects like this one, its integration with Hugging Face's Transformers library, and its ease of debugging.

In [1]:
# making sure PyTorch and Hugging Face's transformers libraries are installed
!pip install torch transformers
!pip install sentence-transformers



In [2]:
# importing relevant libraries for Task 1
import torch
from transformers import AutoTokenizer, AutoModel

In [3]:
# I chose BERT-base as my transformer backbone since it's well-established for NLP and effective for building off of as the model gets more advanced
# I considered starting with Sentence-BERT but didn't want to pigeon-hole myself from the start

transformer_model = "bert-base-uncased"

# loading the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(transformer_model)
model = AutoModel.from_pretrained(transformer_model)




In [4]:
# creating a function that takes a batch of sentences as the input, tokenizes it, runs it through my model,
# and then pools the token embeddings into one fixed-length embedding per sentence

def get_sentence_embeddings(sentences):
    # batch tokenize the list of sentences
    inputs = tokenizer(sentences,
                      return_tensors="pt", # need to ensure the tokenizer returns the tokenized data as PyTorch tensors
                      truncation=True, # sentence will be truncated if larger than the specified max_length (128) to avoid errors or large computational load
                      padding = True, # ensuring all sentences are padded to the same length
                       max_length=128) # I chose 128 as my max length since I am controlling the length of the test sentences and they will be relatively small 
    
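    # forward pass through the model for the batch to obtain hidden states
    with torch.no_grad(): # inference only in this function, so gradients do not need to be tracked
        outputs = model(**inputs) # "**" unpacks the dictionary of inputs above into keyword arguments for the forward pass
                                  # the output we care most about is last_hidden_state since it holds the hidden representations for each token in the input sequence
    
    # applying mean pooling along the token dimension to get fixed-length sentence embeddings
    # the attention mask zeroes out padding tokens so they do not skew the average for shorter sentences
    mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
    mean_pool_embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    
    return mean_pool_embeddings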

I chose mean pooling as my embedding method since it's a reliable and straightforward choice for an initial baseline. It averages the token embeddings to produce a balanced representation of the sentence, reducing the influence of any single token (i.e. outliers) that might otherwise dominate the embedding, as can happen with max pooling.
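
To make the pooling trade-off concrete, here is a minimal sketch on a hand-made tensor rather than real BERT outputs. The helper names `masked_mean_pool`, `masked_max_pool`, and `cls_pool` are my own illustrations, not library functions, and `mask` stands in for the tokenizer's `attention_mask`. The third token position is treated as padding and given a deliberately large value, so its effect on an unmasked mean is easy to see.

```python
import torch

def masked_mean_pool(hidden, mask):
    # hidden: [batch, seq_len, dim]; mask: [batch, seq_len] with 1 = real token, 0 = padding
    m = mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)

def masked_max_pool(hidden, mask):
    # padding positions are set to -inf so they can never be selected as the maximum
    m = mask.unsqueeze(-1).bool()
    return hidden.masked_fill(~m, float("-inf")).max(dim=1).values

def cls_pool(hidden):
    # use the vector at the first position (the [CLS] token in BERT-style models)
    return hidden[:, 0]

# toy batch: 1 sentence, 3 token positions (the last one is padding), 2 dimensions
hidden = torch.tensor([[[1.0, 2.0], [3.0, 10.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

mean_vec = masked_mean_pool(hidden, mask)  # averages the two real tokens
max_vec = masked_max_pool(hidden, mask)    # dominated by the largest real token per dimension
cls_vec = cls_pool(hidden)                 # ignores every token but the first
naive_mean = hidden.mean(dim=1)            # unmasked mean: padding leaks into the result

print(mean_vec, max_vec, cls_vec, naive_mean)
```

In this toy case the masked mean lands between the two real tokens, while max pooling simply takes the larger value in each dimension. The unmasked mean is pulled far away by the padded position, which is why masking matters once sentences in a batch have different lengths.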

I initially had a function that could only process one sentence at a time, but I rewrote it to handle batches. Batch processing is more efficient, scales better, and keeps the interface consistent as I move to the next step of the project.

In [5]:
# time to test the function with sample sentences!

# list of test sentences
test_sentences = [
    "How much wood could a woodchuck chuck if a woodchuck could chuck wood?",
    "Prometheus stole fire from the gods and gave it to man. For this, he was chained to a rock and tortured for eternity.",
    "Life moves pretty fast. If you don't stop and look around once in a while, you could miss it.",
    "The mystery of life isn't a problem to solve, but a reality to experience.",
    "On Monday, California firefighters finally contained the spread of the recent wildfires."
    # having the sentences in a list makes it easy to add additional sentences in the future
]

# obtaining embeddings for all sentences at once
embeddings = get_sentence_embeddings(test_sentences)

# printing the shape for each embedding in the batch - assumes shape is [num_sentences, hidden_size]
for idx, emb in enumerate(embeddings):
    print(f"Embedding shape for sentence {idx+1}: {emb.shape}")


Embedding shape for sentence 1: torch.Size([768])
Embedding shape for sentence 2: torch.Size([768])
Embedding shape for sentence 3: torch.Size([768])
Embedding shape for sentence 4: torch.Size([768])
Embedding shape for sentence 5: torch.Size([768])


These results show that the sentence transformer produces output of the expected shape. Each sentence's embedding is a 768-dimensional vector, which matches the hidden size of BERT-base. Every sentence, regardless of its length, is now represented as a fixed-length vector and is ready for use in further tasks, like multi-task learning in Task 2.

In [6]:
# printing the overall shape of the batch to ensure I accounted for all data
print("Batch embedding shape:", embeddings.shape)

Batch embedding shape: torch.Size([5, 768])


The result shows 5 sentences in the batch, each represented by a 768-dimensional vector.
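
Shapes only confirm the plumbing. One quick way to look at the embeddings themselves is pairwise cosine similarity. The sketch below uses small made-up vectors standing in for the `embeddings` tensor, and `cosine_similarity_matrix` is a hypothetical helper name of my own.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(emb):
    # L2-normalize each row; a matrix product then gives every pairwise cosine similarity
    normed = F.normalize(emb, p=2, dim=1)
    return normed @ normed.T

# toy "embeddings": rows 0 and 1 point the same way, row 2 is orthogonal to both
toy = torch.tensor([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(sim)
```

Calling the same helper on the real `embeddings` tensor would give a 5 x 5 matrix with ones on the diagonal. Raw BERT embeddings that have not been fine-tuned for similarity tend to score fairly high for most sentence pairs, so I would treat the values as a rough sanity check rather than a semantic measure.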

## Task 2 - Multi-Task Learning Expansion

### Goal: Expand the sentence transformer to handle a multi-task learning setting.
#### Task A: Sentence Classification - Classify sentences into predefined classes (you can make these up).
#### Task B: [Choose another relevant NLP task such as Named Entity Recognition, Sentiment Analysis, etc.] (you can make the labels up)
#### Describe the changes made to the architecture to support multi-task learning.

In [7]:
# import relevant libraries. I already imported 'torch' in Step 1 so I just need its neural network module
import torch.nn as nn

In [8]:
# creating the multi-task model architecture and linear layers

class MultiTaskModel(nn.Module):
    def __init__(self, transformer_model, hidden_size, num_classes_taskA, num_classes_taskB):
        super().__init__() # initializing nn.Module
        self.transformer = transformer_model # storing the shared transformer backbone (BERT)
        
        # Task A - sentence classification head
        self.classifier = nn.Linear(hidden_size, num_classes_taskA)
        
        # Task B - I'm choosing Sentiment Analysis 
        self.sentiment = nn.Linear(hidden_size, num_classes_taskB)
        
    # mean pooling over the token dimension for the sentence embedding
    # the attention mask keeps padding tokens out of the average
    @staticmethod
    def pooling(last_hidden_state, attention_mask):
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
        return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    
    # defining the forward pass function
    
    def forward(self, inputs):
        outputs = self.transformer(**inputs) # forward pass through the transformer
        pooled_output = self.pooling(outputs.last_hidden_state, inputs["attention_mask"]) # pooling the outputs to get a single fixed-length vector per sentence
        
        # generating outputs for both tasks
        output_classification = self.classifier(pooled_output)
        output_sentiment = self.sentiment(pooled_output)
        
        return output_classification, output_sentiment
        

In [9]:
# first defining the model parameters then testing the multi-task model with examples after 
hidden_size = 768 # BERT-base hidden size
num_classes_taskA = 6 # my made up sentence categories 
num_classes_taskB = 3 # the sentiment analysis labels

# instantiating the model
multi_task_model = MultiTaskModel(model, hidden_size, num_classes_taskA, num_classes_taskB) # 'model' is the pre-trained transformer from Step 1

# tokenizing the test sentences from Task 1 for the multi-task model
inputs = tokenizer(test_sentences, return_tensors="pt", truncation=True, padding=True, max_length=128)

# passing the tokenized inputs to my multi-task model and printing the shape results
output_A, output_B = multi_task_model(inputs)
print("Task A (Classification) output shape:", output_A.shape)
print("Task B (Sentiment Analysis) output shape:", output_B.shape)


Task A (Classification) output shape: torch.Size([5, 6])
Task B (Sentiment Analysis) output shape: torch.Size([5, 3])


The output shapes match what I expected, meaning the multi-task model is wired together correctly. The heads are still untrained at this point, so only the shapes are meaningful, not the values.
- Task A processed a batch of 5 sentences, producing a 6-dimensional output per sentence, one score for each of my six made-up categories.
- Task B processed the same batch of 5 sentences, producing a 3-dimensional output per sentence, one score for each of my three sentiment labels.
- The dimensions show the shared transformer backbone and the two task-specific heads are working together properly.
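
The head outputs are raw logits, one score per class. Here is a minimal sketch of turning them into probabilities and label names, using made-up logits in place of `output_B`. The label mapping mirrors the sentiment labels I use for the dummy data in Task 4; `sentiment_names` is just an illustrative variable name.

```python
import torch

sentiment_names = {0: "Neutral", 1: "Positive", 2: "Negative"}

# made-up logits for 2 sentences x 3 sentiment classes (stand-in for output_B)
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])

probs = torch.softmax(logits, dim=1)  # each row becomes a probability distribution
pred_ids = probs.argmax(dim=1)        # index of the most likely class per sentence
pred_names = [sentiment_names[i] for i in pred_ids.tolist()]

print(pred_names)
```

With the untrained heads above, predictions produced this way would be essentially random. They only become meaningful once the heads have been trained.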

## Task 3 - Training Considerations

A) Implications & Advantages of Different Scenarios:
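1) If the entire network should be frozen
    - This means all layers of the model, including both task heads, are frozen and would not change in any way during training
    - Advantages:
        - There is essentially no training cost because no gradients are computed and no weights are updated
        - Increased stability because I am relying solely on the pre-trained BERT model's representations, which is useful for small datasets
    - Implication: the task heads in this project are newly initialized, so with everything frozen they would never learn. This scenario really only makes sense for inference or for extracting embeddings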
        
2) If only the transformer backbone should be frozen
    - This means the transformer backbone is fixed and its weights would not update, while only the task-specific heads are trained
    - Advantages:
        - Leverages proven and advanced pre-trained features
        - More efficient since the deep layers of the transformer are not touched and it focuses on the smaller heads
        - Reduces the risk of overfitting on small datasets

3) If only one of the task-specific heads should be frozen
    - This means only one task-specific head (Task A or Task B) is static while both the transformer backbone and the other task head are updated
    - Advantages:
        - I can concentrate updates on the more critical head if the other task already performs well or has larger amounts of data
        - If updating one head is causing interference in the model or a reduced performance, freezing this head could help combat these issues

Conclusion
- Based on this model's needs and parameters, I would freeze as much of the network as possible during training. We have a very limited dataset and fairly standard task heads that the BERT-base backbone is well-equipped to support. Because the two task heads are newly initialized and have to learn something, in practice this means freezing the backbone and training only the heads (scenario 2). Keeping things simple seems like the best option for this project's use case. I would only start unfreezing backbone layers or freezing an individual head if we had a larger dataset and wanted to see whether either route improved accuracy or loss.
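
The scenarios above can be sketched with PyTorch's `requires_grad` flag. This is a toy stand-in, not the real model: `TinyMultiTask` swaps the BERT backbone for a single linear layer so it runs instantly, but it reuses the attribute names `transformer`, `classifier`, and `sentiment` from `MultiTaskModel`. The helpers `set_trainable` and `count_trainable` are hypothetical names of my own.

```python
import torch.nn as nn
import torch.optim as optim

class TinyMultiTask(nn.Module):
    # stand-in for MultiTaskModel: one Linear layer plays the role of the BERT backbone
    def __init__(self):
        super().__init__()
        self.transformer = nn.Linear(8, 4)
        self.classifier = nn.Linear(4, 6)
        self.sentiment = nn.Linear(4, 3)

def set_trainable(module, trainable):
    # frozen parameters get no gradients and are skipped by the optimizer
    for p in module.parameters():
        p.requires_grad = trainable

def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

toy_model = TinyMultiTask()
total = count_trainable(toy_model)  # everything is trainable by default

# scenario 2: freeze the backbone, train only the two heads
set_trainable(toy_model.transformer, False)
heads_only = count_trainable(toy_model)
# hand only the trainable parameters to the optimizer
head_optimizer = optim.Adam((p for p in toy_model.parameters() if p.requires_grad), lr=1e-3)

# scenario 3: train the backbone and Task A head, freeze the Task B head
set_trainable(toy_model.transformer, True)
set_trainable(toy_model.sentiment, False)
one_head_frozen = count_trainable(toy_model)

print(total, heads_only, one_head_frozen)
```

Applied to the real model, the same calls would target `multi_task_model.transformer`, `multi_task_model.classifier`, and `multi_task_model.sentiment`. Freezing everything (scenario 1) would leave zero trainable parameters, and PyTorch optimizers raise an error when given an empty parameter list.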

B) Approaching the Transfer Learning Process in Different Scenarios:
- Transfer learning is useful when we are working with limited labeled data or when our target tasks are similar to those that a pre-trained model was exposed to.
1) The choice of a pre-trained model
    - When deciding on the pre-trained model, I need to consider how a given model relates to the dataset I am working with (the type of data and the size of the set) and to the model's use case. In this project, I am working with a very limited set of data that I created myself, and the tasks are natural language processing tasks, so I want a model that is relevant and known to capture language features. This is why I chose BERT-base: it is widely used in NLP and has been pre-trained on massive amounts of text, which gives my model a very strong starting point.

2) The layers to freeze/unfreeze
    - When deciding on the layers to freeze/unfreeze, I need to consider which layers are most relevant to my task, how efficient I want/need my model to be, the size of my dataset, the difficulty of a specific task relative to the other tasks, the risk of overfitting, and the tradeoff between flexibility for task-specific adaptation and maintaining the pre-trained model's knowledge.
    - Freezing the lower layers of the BERT-base model helps preserve more general features like common language structures without risking overfitting on small datasets (which is the case for this project). I could then fine-tune the upper layers and task-specific heads since they are more flexible and task-specific, which could improve performance on the classification and sentiment analysis tasks.
    - Another option I would explore is implementing layer-wise learning rate decay. I could assign a lower learning rate to the lower layers and a higher learning rate to the upper layers and task-specific heads. This would preserve the general language knowledge while allowing the task-specific layers to learn more aggressively.
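
A minimal sketch of layer-wise learning rate decay using optimizer parameter groups. The stack of small linear layers is a stand-in for the transformer's encoder layers, and `layerwise_lr_groups`, `base_lr`, and `decay` are names and values I am assuming for illustration. The decay factor of 0.5 is exaggerated to make the pattern visible; values closer to 1 are more typical.

```python
import torch.nn as nn
import torch.optim as optim

def layerwise_lr_groups(layers, head, base_lr=2e-5, decay=0.5):
    # the head trains at base_lr; each layer further from the head gets its rate multiplied by decay once more
    groups = [{"params": list(head.parameters()), "lr": base_lr}]
    num_layers = len(layers)
    for depth, layer in enumerate(layers):  # depth 0 is the lowest layer
        lr = base_lr * decay ** (num_layers - depth)
        groups.append({"params": list(layer.parameters()), "lr": lr})
    return groups

toy_layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])  # stand-in encoder layers
toy_head = nn.Linear(4, 3)                                       # stand-in task head

toy_optimizer = optim.AdamW(layerwise_lr_groups(toy_layers, toy_head))
lrs = [group["lr"] for group in toy_optimizer.param_groups]
print(lrs)  # head first, then layers from lowest to highest
```

For the Hugging Face BERT model, the per-layer modules live under `encoder.layer`, so the same helper could be given that list plus the task heads. The embedding layer would also need its own group, which this sketch leaves out.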

## Task 4 - Training Loop Implementation (BONUS)

In [10]:
# importing all relevant libraries (even though I already imported some above - just to reiterate what is needed)

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

In [11]:
# creating a hypothetical dataset class

class HypotheticalDataset(Dataset):
    def __init__(self, sentences, labels_taskA, labels_taskB):
        self.sentences = sentences
        self.labels_taskA = labels_taskA
        self.labels_taskB = labels_taskB
        
    def __len__(self):
        return len(self.sentences)
    
    def __getitem__(self, idx):
        sentence = self.sentences[idx]
        labelA = self.labels_taskA[idx]
        labelB = self.labels_taskB[idx]
        
        # tokenizing the sentence using the same tokenizer as task 1
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
        
        # removing the extra batch dimension added by the tokenizer so each tensor has the shape [seq_length]; the DataLoader adds the batch dimension back when it stacks samples
        inputs = {key: val.squeeze(0) for key, val in inputs.items()}
        
        return inputs, torch.tensor(labelA), torch.tensor(labelB)


In [12]:
# copying the list of test sentences from task 1 below for easy reference as hypothetical data
test_sentences = [
    "How much wood could a woodchuck chuck if a woodchuck could chuck wood?",
    "Prometheus stole fire from the gods and gave it to man. For this, he was chained to a rock and tortured for eternity.",
    "Life moves pretty fast. If you don't stop and look around once in a while, you could miss it.",
    "The mystery of life isn't a problem to solve, but a reality to experience.",
    "On Monday, California firefighters finally contained the spread of the recent wildfires."
]

# mapping my made up sentence categories and sentiment analysis labels
taskA_labels = {0: "Expression", 1: "Opinion", 2: "Quote", 3: "Riddle", 4: "News", 5: "Other"}

taskB_labels = {0: "Neutral", 1: "Positive", 2: "Negative"}

# defining dummy labels 
dummy_labels_taskA = [3, 2, 2, 2, 4]
dummy_labels_taskB = [0, 2, 0, 1, 1]


In [13]:
# creating the dataset and dataloader

dataset = HypotheticalDataset(test_sentences, dummy_labels_taskA, dummy_labels_taskB)

dataloader = DataLoader(dataset, batch_size=2, shuffle=True) # 2 samples are processed together in each training iteration. Would want to increase the batch_size for full-scale training to get more stable results

In [14]:
# defining the optimizer and loss function

optimizer = optim.Adam(multi_task_model.parameters(), lr=2e-5) # ensuring the Adam optimizer will update all parameters in the model - can adjust later if I want to freeze parts of the model
criterion = nn.CrossEntropyLoss()


In [15]:
# training loop example for demonstration only, since I am just fitting a handful of dummy-labeled sentences rather than training on real data

num_epochs = 5 

for epoch in range(num_epochs):
    multi_task_model.train() # setting the model to training mode
    epoch_loss = 0
    total_samples = 0
    correct_taskA = 0
    correct_taskB = 0
    
    for batch in dataloader:
        inputs, labels_taskA, labels_taskB = batch
 
        optimizer.zero_grad() # clear previous gradients
        
        output_A, output_B = multi_task_model(inputs) # forward pass thru the multi-task model so I get outputs for both task A and task B. Same forwarding method as in previous steps
        
        # computing the loss for each task using CrossEntropyLoss
        loss_A = criterion(output_A, labels_taskA)
        loss_B = criterion(output_B, labels_taskB)
        loss = loss_A + loss_B # combining the losses, but in a real-world model I might want to weigh them depending on how critical each task is 
        
        loss.backward() # backward pass on the total loss
        optimizer.step() # update model parameters
        
        # updating the metrics
        batch_size = labels_taskA.size(0)
        epoch_loss += loss.item() * batch_size
        total_samples += batch_size
        
        # calculating accuracy for Task A
        predictions_A = torch.argmax(output_A, dim=1)
        correct_taskA += (predictions_A == labels_taskA).sum().item()
        
        # calculating accuracy for Task B
        predictions_B = torch.argmax(output_B, dim=1)
        correct_taskB += (predictions_B == labels_taskB).sum().item()
    
    # computing average loss and accuracy for each epoch. Basic metrics for this example but I could also use F1-score or precision/recall
    avg_loss = epoch_loss / total_samples
    accuracy_taskA = correct_taskA / total_samples
    accuracy_taskB = correct_taskB / total_samples
    
    print(f"Epoch {epoch+1}/{num_epochs}: Loss = {avg_loss:.4f}, Task A Accuracy = {accuracy_taskA:.4f}, Task B Accuracy = {accuracy_taskB:.4f}")


Epoch 1/5: Loss = 2.9544, Task A Accuracy = 0.0000, Task B Accuracy = 0.2000
Epoch 2/5: Loss = 2.4422, Task A Accuracy = 0.8000, Task B Accuracy = 0.6000
Epoch 3/5: Loss = 2.0524, Task A Accuracy = 0.8000, Task B Accuracy = 0.8000
Epoch 4/5: Loss = 1.5894, Task A Accuracy = 1.0000, Task B Accuracy = 0.8000
Epoch 5/5: Loss = 1.1579, Task A Accuracy = 1.0000, Task B Accuracy = 0.8000


- These results show how well the model fits the dummy labels for both tasks. The scores can change each time I run the notebook, since shuffle=True changes the batch order and dropout is active in training mode. Most of the time, the accuracy for both tasks reaches 0.8 or 1.0 in the later epochs because five sentences are very easy to fit. This is accuracy on the training data itself, so in a real-world production model I would want a much larger dataset and a held-out validation set to see how the model performs in a more complex environment.
- The loss is gradually decreasing, which is great to see because it means the optimizer is effectively updating the model's parameters on my small dataset. A larger dataset would likely show more variability in the loss (and accuracy).
- If I had a larger dataset for training this model, I would experiment with different batch sizes, learning rates, epochs, and freezing/unfreezing layers as discussed in Task 3, and I would implement more advanced metrics like F1-score and precision/recall to better understand the effectiveness of the model. I would need to keep computational efficiency/demands and overfitting in mind as I experiment with these different variables.
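
As a sketch of the more advanced metrics mentioned above, per-class precision, recall, and F1 can be computed directly from predicted and true class indices. `per_class_prf` is a hypothetical helper of my own, the labels reuse my dummy Task B labels, and the predictions are made up for illustration (one mistake out of five).

```python
import torch

def per_class_prf(preds, labels, num_classes):
    # returns precision, recall and F1 per class, each as a tensor of shape [num_classes]
    precision = torch.zeros(num_classes)
    recall = torch.zeros(num_classes)
    f1 = torch.zeros(num_classes)
    for c in range(num_classes):
        tp = ((preds == c) & (labels == c)).sum().item()  # predicted c, truly c
        fp = ((preds == c) & (labels != c)).sum().item()  # predicted c, actually another class
        fn = ((preds != c) & (labels == c)).sum().item()  # missed a true c
        p = tp / (tp + fp) if tp + fp > 0 else 0.0
        r = tp / (tp + fn) if tp + fn > 0 else 0.0
        precision[c] = p
        recall[c] = r
        f1[c] = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return precision, recall, f1

labels = torch.tensor([0, 2, 0, 1, 1])  # dummy Task B labels
preds = torch.tensor([0, 2, 1, 1, 1])   # made-up predictions with one error

precision, recall, f1 = per_class_prf(preds, labels, num_classes=3)
macro_f1 = f1.mean().item()             # unweighted average over classes
accuracy = (preds == labels).float().mean().item()

print(f"accuracy = {accuracy:.4f}, macro F1 = {macro_f1:.4f}")
```

Accuracy only counts the single mistake, while the per-class scores show which class it hurt: recall drops for "Neutral" and precision drops for "Positive". This breakdown becomes more useful as the number of classes grows or the labels become imbalanced.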