# Objective:
The goal of this exercise is to assess your ability to implement, train, and optimize neural
network architectures, particularly focusing on transformers and multi-task learning extensions.
Please explain any and all choices made in the course of this assessment.
## Task 1: Sentence Transformer Implementation
Implement a sentence transformer model using any deep learning framework of your choice.
This model should be able to encode input sentences into fixed-length embeddings. Test your
implementation with a few sample sentences and showcase the obtained embeddings.
Describe any choices you had to make regarding the model architecture outside of the
transformer backbone.

In [10]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class SentenceTransformerModel(nn.Module):
    def __init__(self, model_name='distilbert-base-uncased'):
        super(SentenceTransformerModel, self).__init__()
        # Load the pre-trained transformer backbone
        self.transformer = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask):
        # Get the transformer outputs
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_states = outputs.last_hidden_state  # shape: (batch_size, seq_length, hidden_dim)

        # Masked Mean Pooling: Only average over non-padding tokens
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_states.size()).float()
        sum_embeddings = torch.sum(last_hidden_states * input_mask_expanded, dim=1)
        sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
        sentence_embeddings = sum_embeddings / sum_mask

        return sentence_embeddings

    def encode(self, sentences, max_length=128, device='cpu'):
        # Move to the target device and switch to eval mode so dropout is disabled.
        self.to(device)
        self.eval()
        # Tokenize the input sentences
        encoded_input = self.tokenizer(sentences, padding=True, truncation=True,
                                       max_length=max_length, return_tensors='pt')
        encoded_input = {key: val.to(device) for key, val in encoded_input.items()}
        with torch.no_grad():
            # Call the module itself (not .forward) so registered hooks run.
            embeddings = self(encoded_input['input_ids'], encoded_input['attention_mask'])
        return embeddings.cpu()

# Testing the model with sample sentences
if __name__ == '__main__':
    model = SentenceTransformerModel('distilbert-base-uncased')
    sentences = [
        "This is a test sentence.",
        "Here is another one.",
        "Sentence transformers are useful for NLP tasks."
    ]
    embeddings = model.encode(sentences)
    print("Sentence Embeddings:")
    print(embeddings)


Sentence Embeddings:
tensor([[ 0.0342, -0.2481, -0.1054,  ..., -0.0869, -0.1297,  0.2620],
        [ 0.0206, -0.2149, -0.0401,  ...,  0.1628,  0.1432,  0.0116],
        [ 0.0234, -0.1262, -0.1055,  ..., -0.1875, -0.5122, -0.1344]])
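
The pooling step is the main architectural choice outside the backbone, so it is worth checking in isolation. The sketch below is a minimal check on random tensors rather than real DistilBERT outputs. It assumes a standalone helper, `masked_mean_pool`, which is a name introduced here for illustration. It shows that padded positions do not influence the pooled vector and that two pooled embeddings can be compared with cosine similarity.

```python
import torch

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over positions where attention_mask == 1."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # (batch, 1)
    return summed / counts

# Toy batch: 2 "sentences", 4 token slots, hidden size 3 (random stand-in values).
torch.manual_seed(0)
hidden = torch.randn(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])  # second sentence has two padding slots

pooled = masked_mean_pool(hidden, mask)

# Overwrite the padded positions; the pooled vector should not change.
hidden_noisy = hidden.clone()
hidden_noisy[1, 2:] = 1000.0
pooled_noisy = masked_mean_pool(hidden_noisy, mask)

# Cosine similarity between the two pooled embeddings.
similarity = torch.nn.functional.cosine_similarity(pooled[0], pooled[1], dim=0)
print(pooled.shape, similarity.item())
```

Mean pooling over the mask, rather than taking the first token, is a design choice: DistilBERT has no pooler layer trained for sentence-level use, so averaging all valid tokens is a common default.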


## Task 2: Multi-Task Learning Expansion
Expand the sentence transformer to handle a multi-task learning setting.
1. Task A: Sentence Classification – Classify sentences into predefined classes (you can make these up).

2. Task B: [Choose another relevant NLP task such as Named Entity Recognition, Sentiment Analysis, etc.] (you can make the labels up)
Describe the changes made to the architecture to support multi-task learning.

In [11]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class MultiTaskSentenceTransformer(nn.Module):
    """
    Multi-Task Sentence Transformer Model

    This model uses a shared pre-trained transformer (e.g., DistilBERT) as a backbone to encode sentences into
    fixed-length embeddings. It includes two task-specific heads:
      - Task A: Sentence Classification head for predicting predefined classes.
      - Task B: Sentiment Analysis head for predicting sentiment classes.

    The shared sentence embeddings are computed using masked mean pooling, ensuring that only valid (non-padding)
    tokens contribute to the final representation.
    """
    def __init__(self, model_name='distilbert-base-uncased',
                 num_classification_classes=3, num_sentiment_classes=3):
        super(MultiTaskSentenceTransformer, self).__init__()
        # Load the pre-trained transformer backbone and its tokenizer.
        self.transformer = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Hidden size comes from the transformer model's configuration.
        self.hidden_size = self.transformer.config.hidden_size

        # Define the Sentence Classification Head (Task A).
        self.classification_head = nn.Linear(self.hidden_size, num_classification_classes)

        # Define the Sentiment Analysis Head (Task B).
        self.sentiment_head = nn.Linear(self.hidden_size, num_sentiment_classes)

    def forward(self, input_ids, attention_mask):
        """
        Forward pass of the model.

        Args:
            input_ids (torch.Tensor): Input token IDs of shape (batch_size, sequence_length).
            attention_mask (torch.Tensor): Attention mask to differentiate valid tokens from padding.

        Returns:
            dict: A dictionary containing:
                - 'sentence_embedding': The pooled sentence embeddings.
                - 'classification_logits': Logits from the classification head.
                - 'sentiment_logits': Logits from the sentiment analysis head.
        """
        # Obtain hidden states from the transformer.
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_states = outputs.last_hidden_state  # (batch_size, seq_length, hidden_dim)

        # Masked mean pooling: Average only over non-padding tokens.
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_states.size()).float()
        sum_embeddings = torch.sum(last_hidden_states * input_mask_expanded, dim=1)
        sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
        sentence_embeddings = sum_embeddings / sum_mask

        # Compute outputs for both tasks.
        classification_logits = self.classification_head(sentence_embeddings)
        sentiment_logits = self.sentiment_head(sentence_embeddings)

        return {
            'sentence_embedding': sentence_embeddings,
            'classification_logits': classification_logits,
            'sentiment_logits': sentiment_logits
        }

    def encode(self, sentences, max_length=128, device='cpu'):
        """
        Encodes a list of sentences to obtain the shared sentence embeddings along with
        task-specific outputs.

        Args:
            sentences (list[str]): List of input sentences.
            max_length (int): Maximum token length for each sentence.
            device (str): Device to perform computations on ('cpu' or 'cuda').

        Returns:
            dict: A dictionary with sentence embeddings and task-specific logits.
        """
        # Ensure model is on the desired device and in eval mode (disables dropout).
        self.to(device)
        self.eval()
        # Tokenize the input sentences.
        encoded_input = self.tokenizer(
            sentences, padding=True, truncation=True, max_length=max_length, return_tensors='pt'
        )
        # Move tokenized inputs to the device.
        encoded_input = {key: val.to(device) for key, val in encoded_input.items()}
        # Perform the forward pass without gradient tracking.
        with torch.no_grad():
            # Call the module itself (not .forward) so registered hooks run.
            outputs = self(encoded_input['input_ids'], encoded_input['attention_mask'])
        return outputs

def test_multitask_model():
    """
    Tests the MultiTaskSentenceTransformer by encoding a list of sample sentences.
    It prints out the sentence embeddings and task-specific logits for easy inspection.
    """
    # Create an instance of the multi-task model.
    model = MultiTaskSentenceTransformer('distilbert-base-uncased')

    # Define sample sentences to be encoded.
    sentences = [
        "This is a test sentence.",
        "I love machine learning and AI!",
        "Can you tell me the weather forecast?"
    ]

    # Obtain outputs from the model.
    outputs = model.encode(sentences)

    # Display the outputs with descriptive labels.
    print("=== Sentence Embeddings ===")
    print(outputs['sentence_embedding'])
    print("\n=== Sentence Classification Logits ===")
    print(outputs['classification_logits'])
    print("\n=== Sentiment Analysis Logits ===")
    print(outputs['sentiment_logits'])

if __name__ == '__main__':
    # Execute the test function when the script is run directly.
    test_multitask_model()


=== Sentence Embeddings ===
tensor([[ 0.0342, -0.2481, -0.1054,  ..., -0.0869, -0.1297,  0.2620],
        [ 0.1627,  0.2393, -0.0936,  ..., -0.0640,  0.1666,  0.0617],
        [ 0.1711, -0.2164,  0.1976,  ..., -0.0619,  0.2769,  0.1126]])

=== Sentence Classification Logits ===
tensor([[-0.1796, -0.1621,  0.2072],
        [-0.1913, -0.1343, -0.0164],
        [ 0.0315, -0.1127, -0.0348]])

=== Sentiment Analysis Logits ===
tensor([[-0.0830,  0.0332,  0.3141],
        [-0.2252, -0.1993,  0.1684],
        [-0.0978,  0.1167,  0.1811]])
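
The logits above come from untrained heads, so they are close to zero and carry no meaning yet. As a small illustration of how they would be read once the heads are trained, the sketch below converts the printed classification logits into probabilities and predicted labels. The class names are hypothetical; the exercise only says the labels can be made up.

```python
import torch

# Logits copied from the classification output printed above (untrained head).
classification_logits = torch.tensor([
    [-0.1796, -0.1621,  0.2072],
    [-0.1913, -0.1343, -0.0164],
    [ 0.0315, -0.1127, -0.0348],
])

# Hypothetical label names, introduced only for this illustration.
class_names = ["statement", "opinion", "question"]

probs = torch.softmax(classification_logits, dim=1)  # each row sums to 1
pred_ids = probs.argmax(dim=1)
pred_labels = [class_names[i] for i in pred_ids.tolist()]

for row, label in zip(probs, pred_labels):
    print(label, [round(p, 3) for p in row.tolist()])
```

Because the head is randomly initialized, every probability is close to 1/3, which is what one would expect before any training.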


# Scenario 1: Freezing the Entire Network
What It Means:
When you freeze every part of the model, including the transformer backbone and the task-specific heads, no parameters update during training. The model works as a fixed feature extractor, and the heads keep whatever weights they already have. Because the heads in Task 2 are randomly initialized, this only makes sense once the heads have been trained, or when only the embeddings are needed.

## Pros:

- Speed and Simplicity: With no weight updates there is no backward pass or optimizer state, so each pass costs only inference-time compute.
- Low Overfitting Risk: With a very small dataset, the model cannot overfit because its parameters never change.

## When to Use:
If you’re working in a domain very similar to what the model was originally trained on, or if you have very little new data, freezing the entire network can be a practical option. However, this approach doesn’t allow the model to learn any domain-specific nuances.
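
A minimal sketch of this scenario follows. To keep it runnable without downloading weights, it uses `TinyMultiTask`, a hypothetical stand-in that only mirrors the attribute names of `MultiTaskSentenceTransformer`; the freezing loop itself should carry over unchanged.

```python
import torch
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in with the same attribute names as the real model."""
    def __init__(self, hidden=8, n_cls=3, n_sent=3):
        super().__init__()
        self.transformer = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.classification_head = nn.Linear(hidden, n_cls)
        self.sentiment_head = nn.Linear(hidden, n_sent)

    def forward(self, x):
        h = self.transformer(x)
        return self.classification_head(h), self.sentiment_head(h)

model = TinyMultiTask()

# Scenario 1: freeze every parameter, backbone and heads alike.
for param in model.parameters():
    param.requires_grad = False
model.eval()  # in a real transformer this also disables dropout

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("trainable parameters:", n_trainable)

# Outputs carry no autograd graph, so there is nothing to backpropagate.
cls_logits, sent_logits = model(torch.randn(4, 8))
print("logits require grad:", cls_logits.requires_grad)
```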

# Scenario 2: Freezing Only the Transformer Backbone
What It Means:
In this case, the backbone (the pre-trained transformer) is kept unchanged, while the task-specific layers (the classification and sentiment heads) are fine-tuned on your new data.

## Pros:

- Balanced Adaptation: The rich language features captured by the transformer are preserved, while the heads learn to map these features to your specific labels.
- Efficient Learning: Since only a subset of the parameters is updated, training is faster and less prone to overfitting, which is particularly useful if your new dataset is limited in size.

## When to Use:
This method is ideal when the pre-trained model already provides strong general features that work well for your tasks, but you need some adaptation to capture the specifics of your new domain or task. It’s a common choice in transfer learning scenarios.
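
As a sketch of how this could look in code, the example below freezes only the backbone and hands the optimizer just the parameters that remain trainable. `TinyMultiTask` is again a hypothetical stand-in that mirrors the attribute names of the real model, and the inputs and labels are random placeholders.

```python
import torch
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in with the same attribute names as the real model."""
    def __init__(self, hidden=8, n_cls=3, n_sent=3):
        super().__init__()
        self.transformer = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.classification_head = nn.Linear(hidden, n_cls)
        self.sentiment_head = nn.Linear(hidden, n_sent)

    def forward(self, x):
        h = self.transformer(x)
        return self.classification_head(h), self.sentiment_head(h)

torch.manual_seed(0)
model = TinyMultiTask()

# Scenario 2: freeze the backbone only; both heads stay trainable.
for param in model.transformer.parameters():
    param.requires_grad = False

# Pass only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

backbone_before = model.transformer[0].weight.clone()
head_before = model.classification_head.weight.clone()

# One training step on placeholder data.
x = torch.randn(4, 8)
cls_logits, sent_logits = model(x)
loss = (nn.functional.cross_entropy(cls_logits, torch.tensor([0, 1, 2, 0]))
        + nn.functional.cross_entropy(sent_logits, torch.tensor([1, 1, 0, 2])))
loss.backward()
optimizer.step()

backbone_unchanged = torch.equal(backbone_before, model.transformer[0].weight)
head_changed = not torch.equal(head_before, model.classification_head.weight)
print(backbone_unchanged, head_changed)
```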

# Scenario 3: Freezing Only One Task-Specific Head
What It Means:
Here, you choose to freeze one of the task heads (for instance, the sentiment analysis head) while allowing the transformer backbone and the other head (e.g., sentence classification) to be updated.

## Pros:

- Targeted Stability: If one of the tasks already has a well-performing head (perhaps because you’ve already tuned it or it has plenty of data), freezing it keeps its weights fixed. Its outputs can still shift as the shared backbone updates, so its metrics are worth monitoring.
- Focused Improvement: Meanwhile, the other parts of the model can continue to learn and adapt, which is beneficial when the two tasks have different levels of maturity or data quality.

## When to Use:
This approach makes sense if you have one task that is already reliable and you want to maintain its performance, while still improving on the other task. It allows for selective fine-tuning without disrupting the strengths of the better-performing head.
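
The sketch below illustrates this scenario on a hypothetical stand-in model (`TinyMultiTask`, mirroring the real attribute names, with random placeholder data). It also shows one subtlety: a frozen head receives no gradient itself, but if its loss stays in the total, that loss still sends gradient into the shared backbone.

```python
import torch
import torch.nn as nn

class TinyMultiTask(nn.Module):
    """Hypothetical stand-in with the same attribute names as the real model."""
    def __init__(self, hidden=8, n_cls=3, n_sent=3):
        super().__init__()
        self.transformer = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.classification_head = nn.Linear(hidden, n_cls)
        self.sentiment_head = nn.Linear(hidden, n_sent)

    def forward(self, x):
        h = self.transformer(x)
        return self.classification_head(h), self.sentiment_head(h)

torch.manual_seed(0)
model = TinyMultiTask()

# Scenario 3: freeze the sentiment head; backbone and classification head train.
for param in model.sentiment_head.parameters():
    param.requires_grad = False

x = torch.randn(4, 8)
cls_logits, sent_logits = model(x)
cls_loss = nn.functional.cross_entropy(cls_logits, torch.tensor([0, 1, 2, 0]))
sent_loss = nn.functional.cross_entropy(sent_logits, torch.tensor([1, 1, 0, 2]))

# Gradient of the sentiment loss alone with respect to the backbone weights.
sent_to_backbone = torch.autograd.grad(
    sent_loss, model.transformer[0].weight, retain_graph=True
)[0]

(cls_loss + sent_loss).backward()

print("sentiment head grad:", model.sentiment_head.weight.grad)  # None: frozen
print("backbone grad norm from sentiment loss:", sent_to_backbone.norm().item())
```

If the goal is to leave the frozen task untouched as far as possible, one option is to drop its loss from the total; whether that helps or hurts depends on how much the two tasks share.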

In a scenario where you have a limited, domain-specific dataset, transfer learning can significantly boost performance by leveraging robust features from a model pre-trained on massive, diverse data. Here’s how I would approach the transfer learning process:

## 1. Choosing a Pre-Trained Model
I’d select a well-established transformer model such as DistilBERT, BERT, or RoBERTa. DistilBERT is a strong candidate because it is a distilled version of BERT: its authors report that it is about 40% smaller and 60% faster while retaining roughly 97% of BERT's language-understanding performance. These models have been trained on large, diverse corpora, so they capture language representations that transfer across many tasks.

## 2. Freezing and Unfreezing Layers
Initial Phase:

Freeze the Transformer Backbone:
Start by freezing most, if not all, of the transformer layers. In this phase, only the task-specific heads (e.g., classification or sentiment analysis layers) are updated. This approach leverages the pre-trained knowledge without risking overfitting on a small new dataset.
Gradual Unfreezing:

Unfreeze Later Layers Gradually:
Once the task-specific heads have adapted to your dataset, begin unfreezing the later layers of the transformer. These layers tend to capture more specialized, task-relevant features. The early layers, which extract very general features (such as syntax), often remain frozen longer because their representations are broadly transferable.
This method, usually called gradual unfreezing, allows the model to adjust to the specifics of your new domain step by step while preserving the general language patterns learned during pre-training. It is often paired with discriminative fine-tuning, which assigns smaller learning rates to earlier layers.
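
The schedule described above can be sketched as follows. This is an illustration under assumptions, not a tuned recipe: the six `nn.Linear` layers stand in for the backbone's encoder layers, and the helper names, the unfreezing schedule, and the decay factor of 0.8 are all hypothetical choices made for this example.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: 6 "encoder layers" plus one task head.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
head = nn.Linear(8, 3)

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def unfreeze_last_n(layers, n):
    """Freeze all layers, then unfreeze the last n (closest to the output)."""
    for i, layer in enumerate(layers):
        set_trainable(layer, i >= len(layers) - n)

def layerwise_param_groups(layers, head, base_lr=2e-5, decay=0.8):
    """Head gets base_lr; each layer further from the output gets a smaller rate."""
    groups = [{"params": list(head.parameters()), "lr": base_lr}]
    for depth, layer in enumerate(reversed(layers), start=1):
        groups.append({"params": list(layer.parameters()),
                       "lr": base_lr * decay ** depth})
    return groups

# Assumed schedule: epoch 0 trains the head only, then two more layers per epoch.
schedule = {0: 0, 1: 2, 2: 4, 3: 6}
trainable_per_epoch = []
for epoch, n in schedule.items():
    unfreeze_last_n(layers, n)
    n_unfrozen = sum(all(p.requires_grad for p in layer.parameters())
                     for layer in layers)
    trainable_per_epoch.append(n_unfrozen)
    # ... one epoch of training would run here ...

# Parameters frozen at a given epoch receive no gradient, so the optimizer skips them.
optimizer = torch.optim.AdamW(layerwise_param_groups(layers, head))
lrs = [g["lr"] for g in optimizer.param_groups]
print(trainable_per_epoch, lrs)
```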

## 3. Rationale Behind These Choices
Preserving Robust Representations:
The pre-trained model has already learned strong language features. Freezing the backbone initially helps retain this knowledge, ensuring that the model doesn’t "unlearn" these valuable representations during early fine-tuning.

Preventing Overfitting:
With a small or domain-specific dataset, training the entire network might lead to overfitting. By only training the task-specific heads at first, you reduce the risk of overfitting, as fewer parameters are updated.

Efficient Use of Resources:
Updating only the task-specific layers means fewer parameters are being tuned initially, which speeds up training and reduces computational demands.

Controlled Adaptation:
Gradually unfreezing the later layers allows the model to slowly adapt to the nuances of your new data. This controlled approach helps balance the need to incorporate domain-specific features with the risk of losing the general language understanding provided by the pre-trained layers.



## Task 4: Training Loop Implementation (BONUS)
If not already done, code the training loop for the Multi-Task Learning Expansion in Task 2.
Explain any assumptions or decisions made paying special attention to how training within a
MTL framework operates. Please note you need not actually train the model.
Things to focus on:
● Handling of hypothetical data
● Forward pass
● Metrics

In [13]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel

# -----------------------------------------------------------------------------
# Multi-Task Model Definition
# -----------------------------------------------------------------------------
class MultiTaskSentenceTransformer(nn.Module):
    """
    Multi-task Sentence Transformer Model that uses a shared transformer backbone
    and two task-specific heads: one for sentence classification and one for sentiment analysis.
    """
    def __init__(self, model_name='distilbert-base-uncased',
                 num_classification_classes=3, num_sentiment_classes=3):
        super(MultiTaskSentenceTransformer, self).__init__()
        # Load the pre-trained transformer backbone and its tokenizer.
        self.transformer = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.hidden_size = self.transformer.config.hidden_size

        # Task A: Sentence Classification Head.
        self.classification_head = nn.Linear(self.hidden_size, num_classification_classes)
        # Task B: Sentiment Analysis Head.
        self.sentiment_head = nn.Linear(self.hidden_size, num_sentiment_classes)

    def forward(self, input_ids, attention_mask):
        """
        Forward pass that returns shared embeddings along with task-specific logits.
        """
        # Obtain token-level outputs from the transformer.
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_states = outputs.last_hidden_state  # (batch_size, seq_length, hidden_dim)

        # Masked Mean Pooling: Only average non-padding tokens.
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_states.size()).float()
        sum_embeddings = torch.sum(last_hidden_states * input_mask_expanded, dim=1)
        sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
        sentence_embeddings = sum_embeddings / sum_mask

        # Task-specific logits.
        classification_logits = self.classification_head(sentence_embeddings)
        sentiment_logits = self.sentiment_head(sentence_embeddings)

        return {
            'sentence_embedding': sentence_embeddings,
            'classification_logits': classification_logits,
            'sentiment_logits': sentiment_logits
        }

    def encode(self, sentences, max_length=128, device='cpu'):
        """
        Encodes a list of sentences by tokenizing them and performing a forward pass.
        Note: This method is useful for inference where gradients are not needed.
        """
        self.to(device)
        self.eval()  # Disable dropout; train_model() switches back via model.train().
        encoded_input = self.tokenizer(
            sentences, padding=True, truncation=True, max_length=max_length, return_tensors='pt'
        )
        encoded_input = {key: val.to(device) for key, val in encoded_input.items()}
        with torch.no_grad():
            # Call the module itself (not .forward) so registered hooks run.
            outputs = self(encoded_input['input_ids'], encoded_input['attention_mask'])
        return outputs

# -----------------------------------------------------------------------------
# Dummy Dataset for Multi-Task Learning
# -----------------------------------------------------------------------------
class DummyMultiTaskDataset(Dataset):
    """
    A simple dummy dataset where each sample consists of a sentence,
    a classification label, and a sentiment label.
    """
    def __init__(self):
        # Hypothetical data with made-up labels.
        self.data = [
            ("This is a test sentence.", 0, 1),
            ("I love machine learning and AI!", 1, 0),
            ("Can you tell me the weather forecast?", 2, 2),
            ("What a wonderful day!", 1, 0),
            ("I don't like this movie.", 0, 2),
            ("How do I reset my password?", 2, 1)
        ]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sentence, class_label, sentiment_label = self.data[idx]
        return {
            "sentence": sentence,
            "classification_label": torch.tensor(class_label, dtype=torch.long),
            "sentiment_label": torch.tensor(sentiment_label, dtype=torch.long)
        }

# -----------------------------------------------------------------------------
# Training Loop
# -----------------------------------------------------------------------------
def train_model(model, dataloader, optimizer, criterion, device, num_epochs=50):
    """
    Train the multi-task model over a number of epochs.

    Args:
        model (nn.Module): The multi-task model.
        dataloader (DataLoader): DataLoader for the dataset.
        optimizer (torch.optim.Optimizer): Optimizer for model parameters.
        criterion (nn.Module): Loss function (e.g., CrossEntropyLoss).
        device (str): Device to run training on (e.g., 'cpu' or 'cuda').
        num_epochs (int): Number of epochs to train.
    """
    print("Starting Training Loop for Multi-Task Learning Model...\n")

    for epoch in range(num_epochs):
        model.train()  # Set model to training mode.
        epoch_loss = 0.0
        classification_correct = 0
        sentiment_correct = 0
        total_samples = 0

        for batch in dataloader:
            sentences = batch['sentence']
            class_labels = batch['classification_label'].to(device)
            sentiment_labels = batch['sentiment_label'].to(device)

            optimizer.zero_grad()  # Reset gradients for the batch.

            # Tokenize the sentences and move them to the correct device.
            encoded_input = model.tokenizer(
                sentences, padding=True, truncation=True, max_length=128, return_tensors='pt'
            )
            encoded_input = {key: val.to(device) for key, val in encoded_input.items()}

            # Forward pass: call the module (not .forward) so registered hooks run.
            outputs = model(encoded_input['input_ids'], encoded_input['attention_mask'])
            classification_logits = outputs['classification_logits']
            sentiment_logits = outputs['sentiment_logits']

            # Compute loss for each task.
            loss_classification = criterion(classification_logits, class_labels)
            loss_sentiment = criterion(sentiment_logits, sentiment_labels)

            # Total loss: sum of individual losses (equal weighting assumed).
            total_loss = loss_classification + loss_sentiment

            # Backpropagation.
            total_loss.backward()
            optimizer.step()

            # Accumulate loss and correct predictions.
            batch_size = len(sentences)
            epoch_loss += total_loss.item() * batch_size
            total_samples += batch_size

            _, predicted_class = torch.max(classification_logits, dim=1)
            _, predicted_sentiment = torch.max(sentiment_logits, dim=1)
            classification_correct += (predicted_class == class_labels).sum().item()
            sentiment_correct += (predicted_sentiment == sentiment_labels).sum().item()

        # Calculate average loss and accuracy for the epoch.
        avg_loss = epoch_loss / total_samples
        classification_accuracy = classification_correct / total_samples
        sentiment_accuracy = sentiment_correct / total_samples

        print(f"Epoch {epoch+1}/{num_epochs}")
        print(f"Average Loss: {avg_loss:.4f}")
        print(f"Classification Accuracy: {classification_accuracy:.4f}")
        print(f"Sentiment Accuracy: {sentiment_accuracy:.4f}")
        print("--------------------------------------------------")

# -----------------------------------------------------------------------------
# Main Function to Run Training
# -----------------------------------------------------------------------------
def main():
    """
    Sets up the dataset, model, optimizer, and initiates training.
    """
    # Determine the device (GPU if available, else CPU).
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Create the dummy dataset and DataLoader.
    dataset = DummyMultiTaskDataset()
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

    # Initialize the multi-task model.
    model = MultiTaskSentenceTransformer('distilbert-base-uncased', num_classification_classes=3, num_sentiment_classes=3)
    model.to(device)

    # Set up the optimizer and loss function.
    optimizer = optim.Adam(model.parameters(), lr=2e-5)
    criterion = nn.CrossEntropyLoss()

    # Run the training loop.
    train_model(model, dataloader, optimizer, criterion, device, num_epochs=50)

if __name__ == '__main__':
    main()


Starting Training Loop for Multi-Task Learning Model...

Epoch 1/50
Average Loss: 2.2785
Classification Accuracy: 0.1667
Sentiment Accuracy: 0.1667
--------------------------------------------------
Epoch 2/50
Average Loss: 2.0022
Classification Accuracy: 0.6667
Sentiment Accuracy: 0.5000
--------------------------------------------------
Epoch 3/50
Average Loss: 1.8470
Classification Accuracy: 1.0000
Sentiment Accuracy: 0.8333
--------------------------------------------------
Epoch 4/50
Average Loss: 1.6635
Classification Accuracy: 1.0000
Sentiment Accuracy: 0.8333
--------------------------------------------------
Epoch 5/50
Average Loss: 1.4878
Classification Accuracy: 1.0000
Sentiment Accuracy: 1.0000
--------------------------------------------------
Epoch 6/50
Average Loss: 1.2600
Classification Accuracy: 1.0000
Sentiment Accuracy: 1.0000
--------------------------------------------------
Epoch 7/50
Average Loss: 1.0454
Classification Accuracy: 1.0000
Sentiment Accuracy: 1.0000


# Brief Write-Up: Key Decisions and Insights

## Task 3
For Task 3, my training considerations revolved around striking a balance between leveraging pre-trained knowledge and adapting the model to new, task-specific data. I evaluated three freezing strategies:

### Freezing the Entire Network:
This approach treats the model as a fixed feature extractor, which is fast and minimizes overfitting with very little data but limits domain adaptation.

### Freezing Only the Transformer Backbone:
Here, the robust, pre-trained layers remain untouched while the task-specific heads are fine-tuned. This method preserves general language features and is efficient when data is limited, yet it still allows customization to your new tasks.

### Freezing Only One Task-Specific Head:
Selectively freezing one head protects the performance of a well-established task, allowing the rest of the network to adapt to improve the other task. This is useful when tasks vary in maturity or label quality.

For transfer learning, I would begin with a model like DistilBERT, which offers a good balance of speed and accuracy, freeze the backbone initially to preserve its learned representations, and then gradually unfreeze the later layers for fine-tuning. This strategy minimizes overfitting while ensuring that the model adapts smoothly to the new domain.

## Task 4
In Task 4, the training loop was designed to handle hypothetical data by creating a dummy dataset that simulates the multi-task scenario. Key decisions included:

### Handling Data:
Using a synthetic dataset to demonstrate how sentences and their associated labels can be batched and fed to the model.
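
One alternative to tokenizing inside the training loop is to move tokenization into a `collate_fn`, so that each batch arrives as padded tensors plus both label tensors. The sketch below shows that batching pattern. `toy_tokenize` is a hypothetical whitespace tokenizer standing in for `model.tokenizer`, so the example runs without downloading anything. Its ids are only consistent within one call, which is enough to show the batching.

```python
import torch
from torch.utils.data import DataLoader

# Samples in the same (sentence, class label, sentiment label) layout as the dummy dataset.
samples = [
    ("This is a test sentence.", 0, 1),
    ("What a wonderful day!", 1, 0),
    ("How do I reset my password?", 2, 1),
]

def toy_tokenize(sentences):
    """Hypothetical whitespace tokenizer standing in for model.tokenizer."""
    vocab = {}
    rows = [[vocab.setdefault(w, len(vocab) + 1) for w in s.lower().split()]
            for s in sentences]
    width = max(len(r) for r in rows)
    input_ids = torch.tensor([r + [0] * (width - len(r)) for r in rows])  # 0 = padding
    attention_mask = (input_ids != 0).long()
    return {"input_ids": input_ids, "attention_mask": attention_mask}

def collate(batch):
    """Turn a list of (sentence, cls, sent) tuples into one dict of tensors."""
    sentences, cls, sent = zip(*batch)
    encoded = toy_tokenize(list(sentences))
    encoded["classification_label"] = torch.tensor(cls, dtype=torch.long)
    encoded["sentiment_label"] = torch.tensor(sent, dtype=torch.long)
    return encoded

loader = DataLoader(samples, batch_size=3, shuffle=False, collate_fn=collate)
batch = next(iter(loader))
print({k: tuple(v.shape) for k, v in batch.items()})
```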

### Forward Pass and Loss Computation:
The model computes sentence embeddings via masked mean pooling and produces logits for each task. Losses for classification and sentiment analysis are computed using cross-entropy loss and summed to provide a single training signal.
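
The training loop sums the two losses with equal weight. If one task dominated, a weighted sum would be one simple adjustment. The sketch below assumes a helper, `multitask_loss`, with weights `w_cls` and `w_sent`; these names and the 0.5 weight are illustrative, and the logits are random placeholders rather than model outputs.

```python
import torch
import torch.nn as nn

def multitask_loss(cls_logits, sent_logits, cls_labels, sent_labels,
                   w_cls=1.0, w_sent=1.0):
    """Weighted sum of per-task cross-entropy; also returns each part for logging."""
    loss_cls = nn.functional.cross_entropy(cls_logits, cls_labels)
    loss_sent = nn.functional.cross_entropy(sent_logits, sent_labels)
    total = w_cls * loss_cls + w_sent * loss_sent
    parts = {"classification": loss_cls.item(), "sentiment": loss_sent.item()}
    return total, parts

torch.manual_seed(0)
cls_logits = torch.randn(4, 3, requires_grad=True)
sent_logits = torch.randn(4, 3, requires_grad=True)
cls_labels = torch.tensor([0, 1, 2, 1])
sent_labels = torch.tensor([1, 0, 2, 2])

# Equal weighting reproduces the plain sum used in the training loop.
equal_total, parts = multitask_loss(cls_logits, sent_logits, cls_labels, sent_labels)

# Down-weighting the sentiment task (hypothetical weight).
weighted_total, _ = multitask_loss(cls_logits, sent_logits, cls_labels, sent_labels,
                                   w_cls=1.0, w_sent=0.5)
weighted_total.backward()  # a single backward pass covers both tasks
print(parts, equal_total.item(), weighted_total.item())
```

Logging the per-task parts separately also makes it easier to see when one task improves at the expense of the other, which the summed loss alone would hide.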

### Metrics:
Accuracy for each task is tracked by comparing predictions with ground truth, which helps monitor training progress.
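
Accuracy alone can look healthy when one class dominates, so a per-class metric such as macro F1 is a possible complement. The sketch below computes both for a single task from logits and labels. The helper names and the toy values are assumptions for illustration; in the loop they would be applied to each task's logits separately, ideally on held-out data.

```python
import torch

def accuracy(logits, labels):
    """Fraction of samples whose argmax prediction matches the label."""
    return (logits.argmax(dim=1) == labels).float().mean().item()

def macro_f1(logits, labels, num_classes):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    preds = logits.argmax(dim=1)
    f1_scores = []
    for c in range(num_classes):
        tp = ((preds == c) & (labels == c)).sum().item()
        fp = ((preds == c) & (labels != c)).sum().item()
        fn = ((preds != c) & (labels == c)).sum().item()
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Toy logits: predictions are [0, 1, 2, 0] against labels [0, 1, 2, 1].
logits = torch.tensor([[2.0, 0.1, 0.1],
                       [0.1, 2.0, 0.1],
                       [0.1, 0.2, 2.0],
                       [2.0, 0.1, 0.3]])
labels = torch.tensor([0, 1, 2, 1])

acc = accuracy(logits, labels)
f1 = macro_f1(logits, labels, num_classes=3)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")
```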

This overall approach ensures that the model benefits from strong pre-trained features while being effectively fine-tuned for multi-task learning, balancing adaptation with the risk of overfitting.










