Evaluation Criteria:
- Quality, depth, and clarity of explanations. It’s essential for us to be able to understand
your thought process
- Clarity and structure of the code.
- Ability for us to easily run your code and replicate your results

## Step 1: Implement a Sentence Transformer Model
- Implement a sentence transformer model using any deep learning framework of your
choice. This model should be able to encode input sentences into fixed-length
embeddings.
- Test your implementation with a few sample sentences and showcase the obtained
embeddings.
- Discuss any choices you had to make regarding the model architecture outside of the
transformer backbone

### Implementation

In [1]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
# Pre-trained transformer model from Hugging Face
model_name = 'sentence-transformers/all-MiniLM-L6-v2'  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

### Test

In [3]:
# Sample sentences: actual fetch reviews hehe
sentences = [
    "It CAN take a while for the points to collect",
    "Fetch Rewards is a game-changer for shoppers!",
    "I guess over the years since more people started using Fetch, they make it hard to get gift cards."
]

# Tokenize the input sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

In [4]:
with torch.no_grad():
    outputs = model(**inputs)

# The last hidden state is a tensor of size [batch_size, sequence_length, hidden_size]
# mean-pool over the sequence length to create fixed-length sentence embeddings --> end shape (batch_size, hidden_size)
sentence_embeddings = outputs.last_hidden_state.mean(dim=1)

# Sanity check that they are the same length
for i, emb in enumerate(sentence_embeddings):
    print(f"Embedding for sentence {i+1}: {len(emb)}\n")

Embedding for sentence 1: 384

Embedding for sentence 2: 384

Embedding for sentence 3: 384



In [5]:
# Showcase embeddings
for i, emb in enumerate(sentence_embeddings):
    print(f"Embedding for sentence {i+1}: {emb}\n")

Embedding for sentence 1: tensor([-1.4879e-01, -3.2041e-01, -5.4127e-02,  1.6908e-01,  1.8378e-01,
         2.3295e-01,  3.3648e-03, -2.6190e-01, -4.2113e-02,  7.5660e-03,
         2.1651e-02, -5.0472e-02, -3.9413e-02,  8.1617e-02,  1.3642e-01,
         1.3107e-01, -5.5834e-03, -4.0379e-01,  1.1190e-01, -8.1360e-02,
        -1.9157e-01, -1.7363e-01, -3.1898e-01,  6.7866e-02,  2.7781e-01,
         1.9570e-01, -2.9567e-02,  1.5387e-01,  1.3636e-01, -2.1507e-01,
        -2.5315e-01, -6.5974e-03,  8.2477e-02,  2.0189e-01, -9.6518e-02,
        -7.1657e-02,  2.9166e-03,  1.5178e-01,  3.9197e-02, -2.0562e-01,
         2.4474e-01, -3.0083e-02,  6.2201e-02,  4.2204e-01,  7.8106e-02,
         1.7176e-01,  2.3724e-01, -2.0900e-01,  2.4517e-01,  1.3527e-01,
         1.3883e-01,  3.7395e-01, -4.1784e-02,  2.2985e-02, -1.2272e-01,
         1.0782e-01,  4.2287e-02,  2.4941e-02, -2.1965e-01, -1.4539e-01,
         7.3403e-02, -5.3891e-02, -2.7122e+00, -1.7283e-01,  1.0310e-01,
        -3.1633e-01,  6.1

### Discussion

1. **Model Selection:**
I chose the `sentence-transformers/all-MiniLM-L6-v2` model because it's lightweight and reliable. Since this is the first step of a multi-step take-home project that I'm running on my own computer, a lightweight model is a good option for balancing speed and accuracy. Additionally, this model is already fine-tuned for sentence-level tasks like semantic similarity, which reduces the need for additional fine-tuning. Some other pre-trained base encoder models I could have used are BERT, DistilBERT, ALBERT, RoBERTa, XLNet, etc.

2. **Pooling Strategy:**
I applied mean pooling across the token dimension of the last hidden state to create fixed-length sentence embeddings, since the request was to return "fixed-length embeddings". Mean pooling provides a robust representation of the whole sentence because every token contributes to the result. My other option would have been to just grab the first x token embeddings for each sentence, but this would not have captured the meaning of the whole sentence as well.

3. **Padding and Truncation:**
I set both `padding=True` and `truncation=True` when tokenizing the input sentences. Padding ensures that all sentences in the batch have the same length, allowing for batch processing in parallel. Truncation cuts off sentences that are longer than the model’s maximum input length to avoid errors.
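
The pooling and padding choices above interact: a plain `.mean(dim=1)` also averages over the hidden states at padded positions, so shorter sentences in a batch are slightly affected by padding. A common alternative is to weight the average by the attention mask. The following is a minimal sketch of that idea, not the pooling used for the outputs shown earlier; the function name `masked_mean_pool` and the dummy tensors are assumptions for illustration (in the notebook they would be `outputs.last_hidden_state` and `inputs["attention_mask"]`).

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings while ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1), cast to the embedding dtype
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per sentence
    return summed / counts

# Dummy data: batch of 2 sentences, 3 token positions, hidden size 2.
# The last position of the first sentence is padding (mask value 0).
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)  # padding is excluded from the average
naive = hidden.mean(dim=1)               # padding is included in the average
print(pooled)
print(naive)
```

For the first sentence, the masked version averages only the two real tokens, while the naive mean is pulled toward the padded position's values.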

## Step 2: Multi-Task Learning Expansion
Expand the sentence transformer model architecture to handle a multi-task learning setting.
- Task A: Sentence Classification
    - Implement a task-specific head for classifying sentences into predefined classes
    - Classify sentences into predefined classes (you can make these up).
- Task B: [Choose an Additional NLP Task]
    - Implement a second task-specific head for a different NLP task, such as Named Entity Recognition (NER) or Sentiment Analysis (you can make the labels up).
- Discuss the changes made to the architecture to support multi-task learning.
Note that it’s not required to actually train the multi-task learning model or implement a training
loop. The focus is on implementing a forward pass that can accept an input sentence and output
predictions for each task that you define.

### Sentence Classification and Sentiment Analysis 

In [6]:
# Define the base model, this is staying the same from step 1
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModel.from_pretrained(model_name)


In [7]:
# Define the Multi-Task Model
class MultiTaskModel(nn.Module):
    def __init__(self, base_model, num_classes_task_a, num_classes_task_b):
        super(MultiTaskModel, self).__init__()
        self.transformer = base_model
        hidden_size = self.transformer.config.hidden_size
        
        # Task A: Sentence Classification head
        self.classification_head = nn.Sequential(
            nn.Linear(hidden_size, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes_task_a)
        )
        
        # Task B: Sentiment Analysis head
        self.sentiment_head = nn.Sequential(
            nn.Linear(hidden_size, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes_task_b)
        )
        
    def forward(self, input_ids, attention_mask):
        # Pass through transformer model
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state

        # Use the first token ([CLS]) of the last hidden state as the sentence representation
        cls_token_rep = last_hidden_state[:, 0, :]
        
        # Task A: Sentence classification 
        task_a_logits = self.classification_head(cls_token_rep)
        
        # Task B: Sentiment Analysis
        task_b_logits = self.sentiment_head(cls_token_rep)
        
        return task_a_logits, task_b_logits

### Testing (Sanity check not required)

In [8]:
# Initialize the multi-task model
num_classes_task_a = 2  # like: happy, sad 
num_classes_task_b = 3  # like: positive, neutral, negative
model = MultiTaskModel(base_model, num_classes_task_a, num_classes_task_b)

# Test sentences
sentences = [
    "I love this product! It's amazing.",
    "The service was okay, not great.",
    "I'm disappointed with the quality of the item.",
]

# Tokenize sentences
encoding = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)

# Forward pass through the model
model.eval()
with torch.no_grad():
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']
    task_a_logits, task_b_logits = model(input_ids, attention_mask)
    
# Convert logits to probabilities and print
task_a_probs = torch.softmax(task_a_logits, dim=-1)
task_b_probs = torch.softmax(task_b_logits, dim=-1)

print("\nTask A (Sentence Classification) Probabilities:")
print(task_a_probs)
print("Task B (Sentiment Analysis) Probabilities:")
print(task_b_probs)


Task A (Sentence Classification) Probabilities:
tensor([[0.4653, 0.5347],
        [0.4676, 0.5324],
        [0.4469, 0.5531]])
Task B (Sentiment Analysis) Probabilities:
tensor([[0.3596, 0.3431, 0.2974],
        [0.3577, 0.3409, 0.3014],
        [0.3522, 0.3561, 0.2917]])


### Discussion

1. **Sentence Classification and Sentiment Analysis heads:**
For simplicity's sake, I implemented both heads the same way. I tried to balance model complexity, computational efficiency, and generalization. Here is my reasoning for the implementation of the head layers:
    - `nn.Sequential` chains the layers, so I don't need to code each step in the forward method.
    - The initial linear layer reduces the high-dimensional output from the transformer to a more manageable size (128 units), which helps focus on the relevant features.
    - ReLU introduces non-linearity, allowing the model to learn more complex patterns.
    - Dropout helps prevent overfitting by making the model robust to variations and noise in the data.
    - The second linear layer maps the reduced representation to the final output space, providing class logits for the classification task.  
2. **Shared layers:** The transformer backbone is shared by both tasks, while each head has its own 128-unit hidden layer. The shared backbone captures features common to both tasks, which can improve generalization and reduce overfitting.  
3. **The forward method:**
The forward pass runs both sentence classification and sentiment analysis for every input sentence. If both predictions are not needed every time, adding a `task_id` argument and an `if`/`else` branch would skip the unused head and save some computation, although the shared backbone still has to run.  
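
The optional task selection described in the last point could look roughly like the sketch below. It is an illustration under assumptions, not the model defined above: `TinyEncoder` is a hypothetical stand-in for the transformer backbone (so the example runs without downloading weights), the heads are single linear layers for brevity, and the `task_id` argument and the `"a"`/`"b"` keys are made-up names.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the transformer backbone."""
    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask):
        # Returns (batch, seq_len, hidden_size); the mask is unused in this toy encoder
        return self.embed(input_ids)

class SelectiveMultiTaskModel(nn.Module):
    def __init__(self, encoder, hidden_size, num_classes_task_a, num_classes_task_b):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict({
            "a": nn.Linear(hidden_size, num_classes_task_a),  # sentence classification
            "b": nn.Linear(hidden_size, num_classes_task_b),  # sentiment analysis
        })

    def forward(self, input_ids, attention_mask, task_id=None):
        hidden = self.encoder(input_ids, attention_mask)
        cls_rep = hidden[:, 0, :]  # first-token representation
        if task_id is None:
            # No task requested: run every head
            return {name: head(cls_rep) for name, head in self.heads.items()}
        # Run only the requested head
        return {task_id: self.heads[task_id](cls_rep)}

model = SelectiveMultiTaskModel(TinyEncoder(), hidden_size=16,
                                num_classes_task_a=2, num_classes_task_b=3)
model.eval()
input_ids = torch.randint(0, 100, (3, 5))
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    both = model(input_ids, attention_mask)
    only_b = model(input_ids, attention_mask, task_id="b")
```

Returning a dictionary keyed by task keeps the output shape of the call the same whether one head or both are run.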





## Step 3: Discussion Questions
1. Consider the scenario of training the multi-task sentence transformer that you implemented in Task 2. Specifically, discuss how you would decide which portions of the network to train and which parts to keep frozen.
For example,
    - When would it make sense to freeze the transformer backbone and only train the task specific layers?
    - When would it make sense to freeze one head while training the other?
2. Discuss how you would decide when to implement a multi-task model like the one in this
assignment and when it would make more sense to use two completely separate models
for each task.
3. When training the multi-task model, assume that Task A has abundant data, while Task B has limited data. Explain how you would handle this imbalance.

### Answers 
1. If I were to train the multi-task sentence transformer that I implemented in Step 2, I would freeze the transformer backbone and only train the task-specific layers:
    - if the data the transformer backbone was trained on is similar to my task data (e.g. the backbone model I used in this example was trained on sentence-level tasks, so I would probably freeze it during training)
    - if I had limited labeled data for the tasks (so that the model does not overfit the data)
    - to save on computational resources: a pre-trained transformer backbone does not *need* to be fine-tuned, so if I have limited resources, I would focus on the task layers

    I would freeze one head while training the other when: 
    - I've achieved my goal performance for one head but want to continue training and fine-tuning the other task head 
    - I need to transfer knowledge from one head to another (I would freeze the head that is performing well and continue to train the other head)

2. In general, I would implement a multi-task model like this one when the tasks are related and/or I don't have enough data to train individual models. The shared embedding architecture of both tasks and the similarity of their objectives make them great candidates for multi-task modeling. In my experience, however, individual models often perform better when you have enough data for each task. Having two separate models does increase complexity, so one should weigh that against the performance gain.
3. If Task A has abundant data and Task B has limited data, I would try out a few methods to handle this task imbalance and use whichever works best: 
    - undersample the Task A data
    - up-sample Task B data (could try SMOTE or other data augmentation techniques)
    - use task-specific loss weighting, giving more importance to Task B during training.
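
Two of the ideas in these answers, freezing the backbone (answer 1) and task-specific loss weighting (answer 3), can be sketched in a single training step. This is a rough illustration under assumptions rather than a tuned recipe: the small linear layers are hypothetical stand-ins for the transformer and the two heads, the data is random, and the loss weights `1.0` and `2.0` are placeholder values that would need to be chosen on validation data.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins so the sketch runs without downloading a model
backbone = nn.Linear(8, 16)   # plays the role of the transformer backbone
head_a = nn.Linear(16, 2)     # Task A head (abundant data)
head_b = nn.Linear(16, 3)     # Task B head (limited data)

# Answer 1: freeze the backbone so only the heads receive gradient updates
for param in backbone.parameters():
    param.requires_grad = False

trainable = [p for module in (head_a, head_b) for p in module.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

# Answer 3: give Task B's loss more weight (assumed values, not tuned)
loss_weights = {"a": 1.0, "b": 2.0}
criterion = nn.CrossEntropyLoss()

# One training step on random data
x = torch.randn(4, 8)
labels_a = torch.randint(0, 2, (4,))
labels_b = torch.randint(0, 3, (4,))

features = backbone(x)
loss = (loss_weights["a"] * criterion(head_a(features), labels_a)
        + loss_weights["b"] * criterion(head_b(features), labels_b))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With the `MultiTaskModel` from Step 2, the same freezing loop would iterate over `model.transformer.parameters()`. Freezing one head while training the other follows the same pattern: set `requires_grad = False` on that head's parameters and leave them out of the optimizer.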