### Finetuning BERT

Your task today is to play with BERT embedding generation, fine-tune existing models on new data, and see how transformers outperform previous architectures (at the expense of heavier computational costs).

In [None]:
%pip install --upgrade transformers datasets accelerate deepspeed

In [1]:
import os
os.environ["WANDB_DISABLED"] = "true"

import torch
import transformers
import datasets
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification, Trainer, TrainingArguments





### Load data and model

Our dataset for today is **Quora Question Pairs (QQP)**.

The dataset consists of over 400,000 question pairs, and each pair is annotated with a binary value indicating whether the two questions are paraphrases of each other, i.e. semantically close. Read [here](https://paperswithcode.com/dataset/quora-question-pairs) if you want to know more.

In [2]:
qqp = datasets.load_dataset('SetFit/qqp')
print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])





Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [3]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)



### Tokenize the data

The [datasets](https://huggingface.co/docs/datasets/en/index) library lets you apply a function to every example with `map`, in the style of functional programming.

What Happens to the Texts in `qqp_preprocessed`?

- The original `text1` and `text2` are tokenized into numerical ids using a relevant tokenizer.
- Both texts are joined with the `SEP` token and prefixed with the `CLS` token to match the input format the model expects. The resulting sequence is either truncated (if the combined length > 128 tokens) or padded (if the combined length < 128 tokens).
- The `qqp_preprocessed` dataset contains:
    - _Input IDs_: sequence of token ids.
    - _Attention Masks_: binary masks that are 1 for real tokens and 0 for padding.
    - _Token Type IDs_: distinguish between tokens from text1 and text2.

__!Note!__ Attention masks tell the model to ignore `PAD` tokens, so padding does not affect the representations of the real tokens.
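To make the layout described above concrete, here is a minimal toy sketch. It does not use the real tokenizer: `encode_pair` and `toy_vocab` are hypothetical names with a made-up whitespace vocabulary, and only the special ids (`101`, `102`, `0`) are taken from the outputs shown in this notebook. Truncation is done naively from the right, which is an assumption and not how the library decides what to cut.

```python
# Toy illustration only: a hypothetical whitespace "tokenizer" with a made-up vocabulary.
# Only the layout (CLS, SEP, padding, masks, token types) mirrors the real tokenizer.
PAD, CLS, SEP = 0, 101, 102  # special ids as they appear in this notebook's outputs


def encode_pair(text1, text2, vocab, max_length):
    ids1 = [vocab[w] for w in text1.lower().split()]
    ids2 = [vocab[w] for w in text2.lower().split()]

    # [CLS] text1 [SEP] text2 [SEP]
    input_ids = [CLS] + ids1 + [SEP] + ids2 + [SEP]
    # Segment 0 covers [CLS] text1 [SEP]; segment 1 covers text2 [SEP]
    token_type_ids = [0] * (len(ids1) + 2) + [1] * (len(ids2) + 1)

    # Naive right truncation (assumption; it may even cut the final SEP)
    input_ids = input_ids[:max_length]
    token_type_ids = token_type_ids[:max_length]

    # 1 for real tokens, 0 for padding
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


toy_vocab = {"what": 1, "after": 2, "mbbs": 3, "next": 4}
enc = encode_pair("what after mbbs", "what next", toy_vocab, max_length=10)
print(enc["input_ids"])       # [101, 1, 2, 3, 102, 1, 4, 102, 0, 0]
print(enc["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(enc["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

The real tokenizer additionally splits words into subword pieces, so actual sequences are usually longer than the word count suggests.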

In [8]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(preprocess_function, batched=True)

Map:   0%|          | 0/363846 [00:00<?, ? examples/s]

Map:   0%|          | 0/40430 [00:00<?, ? examples/s]

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

In [9]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Evaluation

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

First, a quick glimpse at our data:

In [12]:
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1, shuffle=False, collate_fn=transformers.default_data_collator
)

In [14]:
# Take a single batch for a quick look at the data format
batch = next(iter(val_loader))
print("Sample batch:", batch)

with torch.no_grad():
    predicted = model(
        input_ids=batch['input_ids'],
        attention_mask=batch['attention_mask'],
        token_type_ids=batch['token_type_ids']
    )

print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).data.numpy())

Sample batch: {'labels': tensor([0]), 'idx': tensor([0]), 'input_ids': tensor([[  101,  2009,  1132,  2170,   118,  4038,  1177,  2712,   136,   102,
          2009,  1132,  1117, 10224,  4724,  1177,  2712,   136,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,   

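Note that the model produces 2 output logits for binary classification (one for each class), not one. This is, in fact, a matter of preference: a softmax over two logits gives the same probabilities as a sigmoid applied to their difference.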
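The choice between two outputs and one can be checked numerically. The sketch below is a standalone illustration with arbitrary example logits (not values produced by the model): the softmax probability of class 1 equals a sigmoid of the logit difference.

```python
import math


def softmax2(z0, z1):
    """Softmax over two logits, shifted by the max for numerical stability."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


z0, z1 = 1.3, -0.4  # arbitrary example logits for classes 0 and 1
p0, p1 = softmax2(z0, z1)

# p1 = exp(z1) / (exp(z0) + exp(z1)) = 1 / (1 + exp(z0 - z1)) = sigmoid(z1 - z0)
print(p1, sigmoid(z1 - z0))
```

So a two-logit model carries one redundant degree of freedom; only the difference between the logits affects the predicted probabilities.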

__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU with `no_grad`
- use a batch size larger than 1
- use an optimized data loader with `num_workers` > 1
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)


Note that even though the model computation runs on the GPU, the process of loading data from disk (or memory) into the format required by the model (e.g., tensors) is handled by the CPU.

Insufficient CPU computation resources may result in bottlenecking the whole process.

In [16]:
from tqdm import tqdm
import multiprocessing

cores = multiprocessing.cpu_count() # Count the number of cores in a computer
cores

20

In [18]:
# Move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Create a DataLoader for the validation set
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=16,  # Larger batch size for faster processing
    shuffle=False, collate_fn=transformers.default_data_collator,
    num_workers=cores  # Use multiple workers to load data faster
)

In [28]:
# Measure validation accuracy
model.eval()  # Set model to evaluation mode

correct = 0
total = 0

# (optional) Use mixed precision on GPU. No GradScaler is needed here:
# it only matters for scaling gradients during training.
use_amp = device.type == "cuda"

with torch.no_grad():  # Disable gradient calculation
    for batch in tqdm(val_loader, desc="Evaluating"):
        # Move batch to GPU
        batch = {k: v.to(device) for k, v in batch.items()}

        # autocast does nothing when enabled=False (e.g. on CPU)
        with torch.autocast(device_type=device.type, enabled=use_amp):
            outputs = model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                token_type_ids=batch['token_type_ids']
            )

        # Get predictions and update accuracy
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        labels = batch['labels']
        correct += (preds == labels).sum().item()
        total += labels.size(0)

# Compute accuracy
accuracy = correct / total # Validation accuracy, between 0 and 1
print(f"Validation Accuracy: {accuracy:.4f}")

Evaluating: 100%|██████████| 2527/2527 [01:41<00:00, 24.92it/s]

Validation Accuracy: 0.9084





In [29]:
assert 0.9 < accuracy < 0.91

### Train the model

For this task, you have to fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.

In [32]:
# Load your model e.g. DeBERTa-v3 tokenizer and model
model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Binary classification. num_labels=1 if you prefer.

# Note that if the tokenizer of your model
# is different from the one we used above,
# you need to preprocess your data again.

# Preprocess the data
def preprocess(examples):
    enc = tokenizer(
        examples["text1"],
        examples["text2"],
        padding="max_length",
        max_length=128,
        truncation=True
    )
    enc["label"] = examples["label"]
    return enc

# <If so, your code goes here>
qqp_preprocessed = qqp.map(preprocess, batched=True)


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/363846 [00:00<?, ? examples/s]

Map:   0%|          | 0/40430 [00:00<?, ? examples/s]

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

In [40]:
# Prepare the training and validation sets
train_set = qqp_preprocessed['train']
val_set = qqp_preprocessed['validation']  


# Define a metric for evaluation. You can write your own if you prefer
from sklearn.metrics import accuracy_score
import numpy as np
# If you are using transformers.Trainer, you may want to use a utility function below
def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for the model during training or evaluation.
    Args:
        eval_pred (tuple): A tuple containing:
            - logits (ndarray or torch.Tensor): The raw logits output by the model for each sample
              in the evaluation batch. Shape: (batch_size, num_classes).
            - labels (ndarray or torch.Tensor): The ground truth labels for each sample in the batch.
              Shape: (batch_size,).
    Returns:
        dict: A dictionary containing the computed metric(s):
            - "accuracy" (float): The proportion of correct predictions over the total number of samples.
    """
    logits, labels = eval_pred
    if isinstance(logits, torch.Tensor):
        logits = logits.detach().cpu().numpy()
    preds = np.argmax(logits, axis=1)
    accuracy = accuracy_score(labels, preds)
    return {"accuracy": accuracy}

# Feel free not to use transformers.Trainer and write the code manually if you want
# A good starting learning rate is 2e-5.
# A step of an order of magnitude is a good way to adjust it if necessary e.g. 2e-4, 2e-3 etc.
# 3 train epochs is likely enough for gently finetuning the model without the model 'forgetting previous data'
# Be sure to use weight_decay i.e. regularisation. A good starting point is 1e-2. Feel free to experiment.
# Consider setting accuracy as the metric for the best model.

# Define your training arguments without the 'device' argument since it is handled automatically.
training_args = TrainingArguments(
    output_dir="./deberta-qqp-finetune",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    logging_steps=100,
    report_to="none",
    fp16=torch.cuda.is_available()
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=val_set,
    tokenizer=tokenizer,
    data_collator=transformers.default_data_collator,
    compute_metrics=compute_metrics
)

# Fine-tune the model
trainer.train()

# Evaluate the model
eval_metrics = trainer.evaluate()
accuracy = eval_metrics.get("eval_accuracy", 0.0)
print(f"Validation Accuracy: {accuracy:.4f}")


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.134         | 0.284959        | 0.911452 |
| 2     | 0.1298        | 0.244591        | 0.919664 |
| 3     | 0.0768        | 0.309788        | 0.923547 |


Validation Accuracy: 0.9235


In [41]:
assert 0.9 < accuracy

To be completely honest, we committed a small crime here. The validation part of the dataset is intended for tuning hyperparameters, but for the sake of simplicity we omitted that logic. You are free to pick the best hyperparameters and test the results on the `test` subsample if you feel like it.

### BONUS: Get a taste of how BERT embeddings work

It is time to shed light on how a BERT-based embedder can be leveraged in searching relevant information.

The problem with vanilla BERT and the like is that they are not directly trained with a contrastive or triplet loss that would pull the embeddings of similar texts closer together. Hence, to get the best results when building a search engine, it is preferable to pick a dedicated [sentence similarity](https://huggingface.co/models?pipeline_tag=sentence-similarity) model. Feel free to pick the one that best meets your requirements.

Similar to what we showcased in the first homework, your task is to construct a search engine:
1) _Prepare an embeddings database_: Since the Quora Question Pairs dataset contains, well, pairs of questions, we will only use the `text1` field of the `validation` subsample. Obtain embeddings using a model of your choice and store them in a `numpy.ndarray` for later use. Optionally, you can use a dedicated [Faiss](https://github.com/facebookresearch/faiss) index.
2) _Implement a way to search for questions similar to a given query_: You are expected to write a function or a class to streamline interactions with your database. __This part of the homework will be judged on the ability to coherently print the TOP 5 most similar Quora questions for a new arbitrary query.__

Hopefully, you can appreciate how the search has become more semantically profound as compared to our previous attempt.

In [42]:
# Initialize the model and its tokenizer
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

model.to(device)
model.eval()


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 384, padding_idx=0)
    (position_embeddings): Embedding(512, 384)
    (token_type_embeddings): Embedding(2, 384)
    (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-5): 6 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)


In [43]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state  # (bs, seq, hidden)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = torch.sum(token_embeddings * input_mask_expanded, dim=1)
    counts = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    mean_pooled = summed / counts
    return F.normalize(mean_pooled, p=2, dim=1)
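To see what `mean_pooling` computes, here is a small pure-Python sketch of the same idea on hand-made numbers. The helper `masked_mean_pool` and the toy vectors are illustrative assumptions, not part of the notebook: padded positions (mask 0) are excluded from the average, and the result is scaled to unit L2 norm.

```python
import math


def masked_mean_pool(token_vectors, mask):
    """Average only the vectors whose mask is 1, then L2-normalize the result."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    for vec, m in zip(token_vectors, mask):
        for j in range(dim):
            summed[j] += vec[j] * m  # padded tokens contribute nothing
    count = max(sum(mask), 1e-9)     # guard against division by zero
    mean = [s / count for s in summed]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # avoid dividing by a zero norm
    return [x / norm for x in mean]


# Two real tokens and one padding token with deliberately large values
tokens = [[3.0, 0.0], [1.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # both components equal 1/sqrt(2): the padding vector is ignored
```

Because every pooled vector has unit length, the dot product of two such vectors is their cosine similarity, which is what makes a plain matrix-vector product sufficient for search.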

In [44]:
val_texts = qqp["validation"]["text1"]  # raw texts for printing
batch_size = 256
emb_list = []

In [45]:
with torch.no_grad():
    for i in tqdm(range(0, len(val_texts), batch_size), desc="Building embedding DB"):
        batch_texts = val_texts[i:i+batch_size]
        enc = tokenizer(batch_texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
        enc = {k: v.to(device) for k, v in enc.items()}
        out = model(**enc)
        embs = mean_pooling(out, enc["attention_mask"])  # (bs, hidden)
        emb_list.append(embs.detach().cpu())

embeddings_storage_np = torch.cat(emb_list, dim=0).numpy()  # (N, hidden)

Building embedding DB: 100%|██████████| 158/158 [00:09<00:00, 16.18it/s]


In [46]:
def find_similar_questions(query, database, model, top_k = 5):
    """
    Finds and prints the top_k most similar questions for a query.

    This function encodes a query, compares it against a pre-computed
    embedding database using cosine similarity, and prints the most
    semantically similar questions.

    Args:
        query (str): The user's search query.
        database (np.ndarray): A 2D NumPy array containing the pre-computed
                               embeddings for the database of questions.
        model (transformers.PreTrainedModel): The encoder loaded with AutoModel,
                                     used together with the global `tokenizer`
                                     to encode the query.
        top_k (int): The number of top results to display.

    Returns:
        None. The function prints the results.
    """
    model.eval()
    with torch.no_grad():
        enc = tokenizer([query], padding=True, truncation=True, max_length=128, return_tensors="pt")
        enc = {k: v.to(device) for k, v in enc.items()}
        out = model(**enc)
        q_emb = mean_pooling(out, enc["attention_mask"]).detach().cpu().numpy()[0]  # (hidden,)

    # cosine similarity with L2-normalized rows -> just dot product
    sims = database @ q_emb  # (N,)
    top_idx = np.argsort(-sims)[:top_k]

    print(f"\nQuery: {query}\nTop {top_k} similar questions:")
    for rank, idx in enumerate(top_idx, 1):
        print(f"{rank}. (score={sims[idx]:.4f}) {val_texts[idx]}")      

In [47]:
find_similar_questions("How can I lose weight quickly?", embeddings_storage_np, model, top_k=5)


Query: How can I lose weight quickly?
Top 5 similar questions:
1. (score=0.9740) How can you lose weight quickly?
2. (score=0.9740) How can you lose weight quickly?
3. (score=0.9740) How can you lose weight quickly?
4. (score=0.9740) How can you lose weight quickly?
5. (score=0.9740) How can you lose weight quickly?
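The output above lists the same question five times, which suggests that the `text1` column contains repeated questions. One possible remedy, sketched below under that assumption, is to skip texts that have already been returned while walking down the ranking. The helper `top_k_unique` and the toy scores and texts are hypothetical and only illustrate the idea; in the notebook, `scores` would be `sims` and `texts` would be `val_texts`.

```python
def top_k_unique(scores, texts, top_k=5):
    """Return up to top_k (score, text) pairs, best first, skipping repeated texts."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    seen, results = set(), []
    for i in order:
        key = texts[i].strip().lower()  # treat case/whitespace variants as duplicates
        if key in seen:
            continue
        seen.add(key)
        results.append((scores[i], texts[i]))
        if len(results) == top_k:
            break
    return results


# Made-up scores and texts, for illustration only
toy_scores = [0.97, 0.97, 0.80, 0.97, 0.65]
toy_texts = [
    "How can you lose weight quickly?",
    "How can you lose weight quickly?",
    "What is the fastest way to lose weight?",
    "how can you lose weight quickly? ",
    "How do I gain muscle?",
]

hits = top_k_unique(toy_scores, toy_texts, top_k=3)
for rank, (score, text) in enumerate(hits, 1):
    print(f"{rank}. (score={score:.4f}) {text}")
```

An alternative with the same effect is to deduplicate the texts once before building the embedding database, which also makes the database smaller.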
