### Homework 2 (option 0): Finetuning BERT

Your task today is to generate BERT embeddings, fine-tune existing models on new data, and see how transformers outperform earlier architectures (at the cost of heavier computation).

In [None]:
%pip install --upgrade transformers datasets accelerate deepspeed

import os
os.environ["WANDB_DISABLED"] = "true"

import torch
import transformers
import datasets
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification, Trainer, TrainingArguments

### Load data and model

Our dataset for today is **Quora Question Pairs (QQP)**.

The dataset consists of over 400,000 question pairs, and each pair is annotated with a binary value indicating whether the two questions are paraphrases of each other, i.e. semantically close. Read [here](https://paperswithcode.com/dataset/quora-question-pairs) if you want to know more.

In [None]:
qqp = datasets.load_dataset('SetFit/qqp')
print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])

In [None]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)


### Tokenize the data

The [datasets](https://huggingface.co/docs/datasets/en/index) library lets you apply a function to every example with `map`, as in functional-style programming.

What happens to the texts in `qqp_preprocessed`?

- The original `text1` and `text2` are tokenized and encoded into numerical representations using the tokenizer.
- Both texts are concatenated and either truncated (if combined length > 128 tokens) or padded (if combined length < 128 tokens).
- The `batched=True` argument ensures that the tokenizer processes multiple examples at once.
- The `qqp_preprocessed` dataset contains:
    - _Input IDs_: Numerical representations of the tokens.
    - _Attention Masks_: Binary masks with 1 for real tokens and 0 for padding.
    - _Token Type IDs_ (if used by the model): Distinguish between tokens from text1 and text2.

__Note:__ the attention mask is not the attention weights computed by the model. It is simply a padding mask that tells the self-attention layers which positions to ignore.
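
To make the three fields listed above concrete, here is a toy sketch of how a text pair is packed into one fixed-length sequence. It is only an illustration: `encode_pair` is a hypothetical helper that splits on whitespace, whereas the real tokenizer produces WordPiece ids and uses a smarter truncation strategy.

```python
# Toy illustration only: whitespace "tokens" instead of real BERT WordPiece ids.
def encode_pair(text1, text2, max_length=16):
    a, b = text1.split(), text2.split()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS] + text1 + [SEP]; segment 1 covers text2 + [SEP].
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    # This toy version simply cuts the tail when the pair is too long,
    # which is simpler than what the real tokenizer does.
    tokens, token_type_ids = tokens[:max_length], token_type_ids[:max_length]
    attention_mask = [1] * len(tokens)  # 1 = real token
    pad = max_length - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 = padding
        "token_type_ids": token_type_ids + [0] * pad,
    }

encoded = encode_pair("How old is she ?", "What is her age ?")
for name, values in encoded.items():
    print(f"{name:15s}", values)
```

Every field has the same length (`max_length`), which is what allows examples to be stacked into a batch tensor.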

In [None]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(preprocess_function, batched=True)


In [None]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Task 1: evaluation (4 points)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here is the interface to help you do that. First, a quick glimpse at the data:

In [None]:
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1, shuffle=False, collate_fn=transformers.default_data_collator
)

In [None]:
# Grab a single batch to inspect (no training happens here)
batch = next(iter(val_loader))
print("Sample batch:", batch)

with torch.no_grad():
    predicted = model(
        input_ids=batch['input_ids'],
        attention_mask=batch['attention_mask'],
        token_type_ids=batch['token_type_ids']
    )

print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).numpy())

Sample batch: {'labels': tensor([0]), 'idx': tensor([0]), 'input_ids': tensor([[  101,  2009,  1132,  2170,   118,  4038,  1177,  2712,   136,   102,
          2009,  1132,  1117, 10224,  4724,  1177,  2712,   136,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,   

Note that the model's classification head outputs two logits for binary classification (one per class) rather than a single one. This is, in fact, a matter of preference.
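
To see why the choice is only a matter of preference, the sketch below checks numerically that a softmax over two logits gives the same positive-class probability as a sigmoid applied to the difference of the logits. The logit values are made up, and the column order (index 1 meaning "duplicate") is an assumption for illustration.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up logits for three question pairs; columns: [not duplicate, duplicate].
logits = np.array([[2.0, -1.0], [0.3, 0.1], [-1.5, 2.5]])

p_two_outputs = softmax(logits)[:, 1]                 # two-logit formulation
p_one_output = sigmoid(logits[:, 1] - logits[:, 0])   # single-logit formulation
print(p_two_outputs)
print(p_one_output)
print("identical:", np.allclose(p_two_outputs, p_one_output))
```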

__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU under `torch.no_grad()`
- use a batch size larger than 1
- use an optimized data loader with `num_workers > 1`
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)


Note that even though the model computation runs on the GPU, loading the data from disk (or memory) and converting it into the format the model expects (e.g. tensors) is handled by the CPU.

Insufficient CPU resources can therefore bottleneck the whole process.
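
One detail worth keeping in mind once the batch size is larger than 1: the last batch is usually smaller than the others, so averaging per-batch accuracies can differ slightly from the true accuracy. A minimal sketch with synthetic logits and labels (random stand-ins, not real model outputs) that accumulates counts instead:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for model outputs: 10 examples, 2 classes.
logits = rng.normal(size=(10, 2))
labels = rng.integers(0, 2, size=10)

batch_size = 4  # the last batch will hold only 2 examples
correct, total, per_batch_acc = 0, 0, []
for start in range(0, len(labels), batch_size):
    batch_logits = logits[start:start + batch_size]
    batch_labels = labels[start:start + batch_size]
    hits = int((batch_logits.argmax(axis=1) == batch_labels).sum())
    correct += hits                 # running count of correct predictions
    total += len(batch_labels)      # running count of examples seen
    per_batch_acc.append(hits / len(batch_labels))

accuracy = correct / total
print(f"count-based accuracy:     {accuracy:.4f}")
print(f"mean of per-batch values: {np.mean(per_batch_acc):.4f}")
```

The count-based value always equals the accuracy over the full set, regardless of how the examples are split into batches.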

In [None]:
from tqdm import tqdm
import multiprocessing

cores = multiprocessing.cpu_count() # Count the number of cores in a computer
cores

2

In [None]:
# Move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Create a DataLoader for the validation set
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=16,  # Larger batch size for faster processing
    shuffle=False, collate_fn=transformers.default_data_collator,
    num_workers=cores  # Use multiple workers to load data faster
)

In [None]:
# Measure validation accuracy
model.eval()  # Set model to evaluation mode

<YOUR CODE HERE>

# (optional) Enable mixed precision for faster computation if a GPU is available.
# No GradScaler is needed here: it only matters when gradients are computed.
use_amp = device.type == "cuda"

with torch.no_grad():  # Disable gradient calculation
    for batch in tqdm(val_loader, desc="Evaluating"):
        # Move batch to GPU
        <YOUR CODE HERE>

        # Use mixed precision if available
        with torch.autocast(device_type=device.type, enabled=use_amp):
            outputs = <YOUR CODE HERE>

        # Get predictions and update accuracy
        <YOUR CODE HERE>

# Compute accuracy
accuracy = <YOUR CODE HERE> # Validation accuracy, between 0 and 1
print(f"Validation Accuracy: {accuracy:.4f}")

Evaluating: 100%|██████████| 2527/2527 [01:19<00:00, 31.63it/s]

Validation Accuracy: 0.9084





In [None]:
assert 0.9 < accuracy < 0.91

### Task 2: train the model (6 points)

For this task, you have to fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.

In [None]:
# Load your model, e.g. the DeBERTa-v3 tokenizer and model
model_name = <THE MODEL OF YOUR CHOICE HERE>
tokenizer = <YOUR CODE HERE>
model = <YOUR CODE HERE>  # Binary classification: num_labels=2. (With num_labels=1, transformers defaults to a regression loss.)

# Note that if the tokenizer of your model
# is different from the one we used above,
# you need to preprocess your data again.

# Preprocess the data
<YOUR CODE HERE>

# <If so, your code goes here>
qqp_preprocessed = <YOUR CODE HERE>

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
# Prepare the training and validation sets
train_set = qqp_preprocessed['train']
val_set = qqp_preprocessed['validation']

# Define a metric for evaluation. You can write your own if you prefer
from sklearn.metrics import accuracy_score

# If you are using transformers.Trainer, you may want to use a utility function below
def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for the model during training or evaluation.
    Args:
        eval_pred (tuple): A tuple containing:
            - logits (ndarray or torch.Tensor): The raw logits output by the model for each sample
              in the evaluation batch. Shape: (batch_size, num_classes).
            - labels (ndarray or torch.Tensor): The ground truth labels for each sample in the batch.
              Shape: (batch_size,).
    Returns:
        dict: A dictionary containing the computed metric(s):
            - "accuracy" (float): The proportion of correct predictions over the total number of samples.
    """
    <YOUR CODE HERE>
    return {"accuracy": accuracy}

# Feel free to skip transformers.Trainer and write the training loop manually if you want.
# A good starting learning rate is 2e-5.
# If you need to adjust it, change it by an order of magnitude at a time, e.g. 2e-4, 2e-3 etc.
# 3 training epochs are likely enough to fine-tune gently without the model forgetting what it learned before.
# Be sure to use weight_decay, i.e. regularisation. A good starting point is 1e-2. Feel free to experiment.
# Consider setting accuracy as the metric for selecting the best model.

# Define your training arguments without the 'device' argument since it is handled automatically.
training_args = TrainingArguments(
    <YOUR CODE HERE>
)

# Initialize the Trainer
trainer = Trainer(
    <YOUR CODE HERE>
)

# Fine-tune the model
<YOUR CODE HERE>

# Evaluate the model
<YOUR CODE HERE>
print(f"Validation Accuracy: {accuracy:.4f}")


In [None]:
# Evaluate the model
results = trainer.evaluate()
accuracy = results['eval_accuracy']
print(f"Validation Accuracy: {accuracy:.4f}")

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | No log | 0.311161 | 0.861885 |


Validation Accuracy: 0.8619


In [None]:
assert 0.9 < accuracy

AssertionError: 

### Get a taste of how BERT embeddings work

Regardless of how you did above, it is time to see how a BERT embedder can be used to find similar questions. Notice how much more semantic the search becomes compared to fastText.

__Note:__ You can use your own fine-tuned model instead of the placeholder if you want.

In [None]:
# Initialize the model and tokenizer
model_name = 'sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Preprocess only text1 from the QQP dataset
def preprocess_function(examples):
    # Tokenize the text1 column with padding and truncation to a fixed max length
    return tokenizer(examples['text1'], padding='max_length', truncation=True, max_length=128)
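
To illustrate why `mean_pooling` takes the attention mask into account, here is a small NumPy re-implementation of the same idea on a hand-made example. `mean_pooling_np` is a hypothetical helper written for this illustration, and the embedding values are made up; padding positions are given large values so their effect on a naive average is easy to see.

```python
import numpy as np

def mean_pooling_np(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)    # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# One "sentence" with two real tokens followed by two padding positions.
token_embeddings = np.array([[[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 0, 0]])

print("masked mean:", mean_pooling_np(token_embeddings, attention_mask))
print("naive mean: ", token_embeddings.mean(axis=1))
```

The masked version averages only the two real tokens, while the naive mean is pulled towards the padding vectors.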


In [None]:
# Collect all embeddings and texts
all_embeddings = []
all_texts = qqp['validation']['text1']  # Original texts from the QQP validation set

# Re-tokenize text1 with the new tokenizer. The earlier val_loader holds token ids
# produced by a different tokenizer, which do not match this model's vocabulary.
emb_set = qqp['validation'].map(preprocess_function, batched=True)
emb_loader = torch.utils.data.DataLoader(
    emb_set, batch_size=16, shuffle=False, collate_fn=transformers.default_data_collator
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)  # Move the model once, not on every iteration
model.eval()  # Set the model to evaluation mode
with torch.no_grad():
    for batch in tqdm(emb_loader, desc='Embedding'):
        # Move input tensors to the same device as the model
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        # Get model output
        model_output = model(input_ids=input_ids, attention_mask=attention_mask)

        # Perform mean pooling and move the result to CPU for storage
        embeddings = mean_pooling(model_output, attention_mask).cpu().numpy()

        # Collect embeddings
        all_embeddings.extend(embeddings)

# Convert embeddings to a numpy array for efficient querying
all_embeddings_np = np.array(all_embeddings)

# Save the embeddings and texts for later use
np.save("qqp_text1_embeddings.npy", all_embeddings_np)
np.save("qqp_text1_texts.npy", np.array(all_texts, dtype=object))


Embedding: 100%|██████████| 2527/2527 [02:31<00:00, 16.66it/s]


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Load the embeddings and texts
loaded_embeddings = np.load("qqp_text1_embeddings.npy")
loaded_texts = np.load("qqp_text1_texts.npy", allow_pickle=True)

# use cpu for similarity search
model = model.to('cpu')

def get_mean_pooled_embedding(text):
    encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    embedding = mean_pooling(model_output, encoded_input['attention_mask']).squeeze().numpy()
    return embedding

In [None]:
# Example query
query_text = "Are apples useful?"
query_embedding = get_mean_pooled_embedding(query_text)

# Compute cosine similarity
similarities = cosine_similarity(query_embedding.reshape(1, -1), loaded_embeddings)

# Find the index of the most similar sentence
most_similar_index = similarities.argmax()
print(f"Most similar text: {loaded_texts[most_similar_index]}")
print(f"Similarity score: {similarities[0, most_similar_index]}")

# Find the top-5 most similar queries
top_5_indices = similarities[0].argsort()[-5:][::-1]  # Sort similarities and get the top-5 indices

# Print the top-5 most similar queries and their similarity scores
print("Top 5 most similar queries:")
for idx in top_5_indices:
    print(f"Query: {loaded_texts[idx]} | Similarity: {similarities[0][idx]}")


Most similar text: Why are renewable resources important? How are they used?
Similarity score: 0.9217956066131592
Top 5 most similar queries:
Query: Why are renewable resources important? How are they used? | Similarity: 0.9217956066131592
Query: What are the possibility to make business in renewable energy ? | Similarity: 0.9086899757385254
Query: How do get notified if the user cancel an auto renewable subscription? I want to cancel this on my server as well. | Similarity: 0.9058151841163635
Query: What are the reasons behind nuclear energy being non-renewable? | Similarity: 0.9012100100517273
Query: What is the hardest natural stone? | Similarity: 0.8973220586776733
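
The ranking step above relies on `cosine_similarity` followed by `argsort`. As a rough sketch of what happens under the hood, the same top-k search can be written in plain NumPy: normalise the vectors, take dot products, and sort. `top_k_cosine` is a hypothetical helper, and the 2-D vectors below are made up for illustration.

```python
import numpy as np

def top_k_cosine(query, corpus, k=3):
    # After L2 normalisation, a dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    order = np.argsort(-scores)[:k]  # indices of the k largest scores, best first
    return order, scores[order]

# Four made-up "document" vectors and one query vector.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])

indices, scores = top_k_cosine(query, corpus, k=2)
for idx, score in zip(indices, scores):
    print(f"index {idx} | similarity {score:.4f}")
```

Because cosine similarity ignores vector length, the query `[2.0, 0.1]` is ranked closest to `[1.0, 0.0]` even though their norms differ.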
