# Week 4: Transfer Learning, BERT (Homework)

## Question Search Engine

Embeddings are a good source of information for solving various tasks. For example, we can classify texts or find similar documents using their representations. We already know about word2vec, GloVe and fastText, but these assign each word a single static vector: they reflect the contexts seen in the training corpus, not the context of the text at hand.

Today we will use the full power of context-aware embeddings to find duplicate texts!

__Warning:__ this task assumes you have seen `seminar.ipynb`!

In [12]:
%pip install --upgrade transformers datasets accelerate deepspeed
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import datasets
import pandas as pd
import time
import os
from torch.nn.functional import cosine_similarity
from huggingface_hub import hf_hub_download



### Data Preparation

In [2]:
qqp = datasets.load_dataset("SetFit/qqp")
print("\n")
print("Sample[0]:", qqp["train"][0])
print("Sample[3]:", qqp["train"][3])

Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [3]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

In [4]:
MAX_LENGTH = 128

def preprocess_function(examples):
    result = tokenizer(
        examples["text1"],
        examples["text2"],
        padding="max_length",
        max_length=MAX_LENGTH,
        truncation=True,
    )

    result["label"] = examples["label"]

    return result

In [5]:
qqp_preprocessed = qqp.map(preprocess_function, batched=True)

In [6]:
print(repr(qqp_preprocessed["train"][0]["input_ids"])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Evaluation (1 point)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

In [7]:
val_set = qqp_preprocessed["validation"]
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1, shuffle=False, collate_fn=transformers.default_data_collator
)

In [8]:
for batch in val_loader:
    break  # take a single batch; your evaluation loop would go here
print("Sample batch:", batch)

with torch.no_grad():
    predicted = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        token_type_ids=batch["token_type_ids"],
    )

print("\nPrediction (probs):", torch.softmax(predicted.logits, dim=1).data.numpy())

Sample batch: {'labels': tensor([0]), 'idx': tensor([0]), 'input_ids': tensor([[  101,  2009,  1132,  2170,   118,  4038,  1177,  2712,   136,   102,
          2009,  1132,  1117, 10224,  4724,  1177,  2712,   136,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,   

**Task 1 (1 point)**

- Measure the validation accuracy of your model. Doing so naively may take several hours. Please make sure you use the following optimizations:
  - Run the model on GPU under `torch.no_grad()`
  - Use a batch size larger than 1
  - Use an optimized data loader with `num_workers > 1`
  - (Optional) Use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)


In [9]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

val_loader = torch.utils.data.DataLoader(
    qqp_preprocessed["validation"],
    batch_size=16,
    shuffle=False,
    collate_fn=transformers.default_data_collator,
    num_workers=2
)

correct = 0
total = 0

with torch.no_grad():
    for batch in val_loader:
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            token_type_ids=batch["token_type_ids"],
        )

        preds = outputs.logits.argmax(dim=1)
        correct += (preds == batch["labels"]).sum().item()
        total += batch["labels"].size(0)

accuracy = correct / total
print(accuracy)

0.9083848627256987


In [10]:
assert 0.9 < accuracy < 0.91

### Training (4 points)

For this task, you have two options:

__Option A:__ fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.


__Option B:__ compare at least 3 pre-finetuned models (in addition to the above BERT model). For each model, report (1) its accuracy, (2) its speed, measured in samples per second on your hardware setup, and (3) its size in megabytes. Please take care to compare the models in equal settings, e.g. the same CPU / GPU. Compile your results into a table and write a short report (about half a page, in addition to the table) summarizing your findings.

**Task 2 (4 points)**
- Choose Option A or Option B (only one will be graded)
- Follow all the instructions and restrictions
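
The speed and size measurements requested in Option B can be sketched in plain Python, independent of any particular model. This is only an illustration under stated assumptions: `size_mb`, `samples_per_second` and `dummy_predict` are hypothetical helper names, and the size is estimated from a parameter count (4 bytes per fp32 weight) rather than read from a checkpoint file.

```python
import time


def size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate size of the weights in MB (MiB), assuming fp32 parameters."""
    return num_params * bytes_per_param / 1024 / 1024


def samples_per_second(predict_fn, batches, warmup: int = 1) -> float:
    """Time predict_fn over the batches, excluding the first `warmup` batches."""
    for batch in batches[:warmup]:
        predict_fn(batch)  # warm-up calls are not timed
    n_samples = 0
    start = time.perf_counter()
    for batch in batches[warmup:]:
        predict_fn(batch)
        n_samples += len(batch)
    elapsed = time.perf_counter() - start
    return n_samples / max(elapsed, 1e-9)  # guard against a zero reading


def dummy_predict(batch):
    """Toy stand-in for a model: predicts class 0 for every sample."""
    return [0 for _ in batch]


batches = [["question pair"] * 16 for _ in range(11)]  # 11 batches of 16 samples
print(f"{size_mb(110_000_000):.1f} MB for a 110M-parameter fp32 model")
print(f"{samples_per_second(dummy_predict, batches):.0f} samples/s")
```

When timing a real model on a GPU, keep in mind that CUDA kernels are launched asynchronously, so the timer should only be read after the results have been synchronized back to the host.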

In [11]:
# Option B

# Display name -> Hugging Face Hub model id.
models = {
    "BERT": "gchhablani/bert-base-cased-finetuned-qqp",
    "RoBERTa": "cross-encoder/quora-roberta-base",
    "DeBERTa": "microsoft/deberta-v3-base",
    "DistilBERT": "textattack/distilbert-base-uncased-QQP",
}


def evaluate_model(model_name):
    """Return (accuracy, samples per second, checkpoint size in MB) on the QQP validation set."""
    # NOTE: this tokenizer is loaded but not used below. The loader reuses
    # `qqp_preprocessed`, which was tokenized with the BERT tokenizer above.
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
    model.to(device)
    model.eval()

    val_loader = torch.utils.data.DataLoader(
        qqp_preprocessed["validation"],
        batch_size=16,
        shuffle=False,
        collate_fn=transformers.default_data_collator,
        num_workers=2,
    )

    correct = 0
    total = 0
    start = time.time()

    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(device) for k, v in batch.items()}

            model_inputs = {
                "input_ids": batch["input_ids"],
                "attention_mask": batch["attention_mask"],
            }

            # Segment ids are passed only to the BERT checkpoint.
            if model_name == "gchhablani/bert-base-cased-finetuned-qqp":
                model_inputs["token_type_ids"] = batch["token_type_ids"]

            outputs = model(**model_inputs)

            # Predicted class = index of the largest logit.
            preds = outputs.logits.argmax(dim=1)
            correct += (preds == batch["labels"]).sum().item()
            total += batch["labels"].size(0)

    elapsed = time.time() - start
    accuracy = correct / total
    speed = total / elapsed  # samples per second, data loading included

    # Model size = size of the checkpoint file on disk, in MB.
    local_path = hf_hub_download(repo_id=model_name, filename="pytorch_model.bin")
    size_mb = os.path.getsize(local_path) / 1024 / 1024

    return accuracy, speed, size_mb


In [15]:
results = []

for name, model_id in models.items():
    acc, spd, size = evaluate_model(model_id)
    results.append({
        "Model": name,
        "Accuracy": acc,
        "Speed (samples/sec)": spd,
        "Size (MB)": size
    })

df_results = pd.DataFrame(results)
df_results


| | Model | Accuracy | Speed (samples/sec) | Size (MB) |
|---|---|---|---|---|
| 0 | BERT | 0.908385 | 132.208622 | 413.257001 |
| 1 | RoBERTa | 0.631833 | 135.406311 | 475.574655 |
| 2 | DeBERTa | 0.564061 | 111.240248 | 353.952611 |
| 3 | DistilBERT | 0.842023 | 265.156406 | 255.43704 |


BERT achieves the highest accuracy (0.908) because this checkpoint was fine-tuned specifically on QQP.  
DistilBERT scores somewhat lower (0.842): it is a lighter, distilled version of BERT, but it was also fine-tuned on QQP.  
RoBERTa and DeBERTa score low (0.632 and 0.564). `microsoft/deberta-v3-base` is a base checkpoint with no QQP fine-tuning, so its classification head is untrained. The RoBERTa figure needs more care: the checkpoint name suggests it was trained on Quora data, and 0.632 is close to what a constant "not duplicate" prediction would likely score, so the low accuracy probably reflects the evaluation setup rather than the model. In particular, all four models received inputs tokenized with the BERT tokenizer.

DistilBERT is the fastest model (265 samples/sec), which is explained by its smaller number of layers and parameters and its generally streamlined architecture.  
BERT is about 2 times slower because it has more layers and parameters.  
RoBERTa's speed is close to BERT's (135 vs. 132 samples/sec), as expected for models of the same depth and width; since the inputs were pre-tokenized, the small gap is unlikely to come from the tokenizer.  
DeBERTa is the slowest (111 samples/sec), plausibly because of its more complex architecture and attention layer.

RoBERTa is the heaviest model (476 MB), mainly because of its larger vocabulary and the correspondingly larger embedding weights.  
DistilBERT is the most compact model (255 MB), which makes it convenient when resources are limited.  

Without a classification head fine-tuned on QQP, a modern architecture alone does not yield high accuracy.  
BERT and DistilBERT, which were fine-tuned on QQP, show high accuracy.  

Depending on the task, BERT is the better choice when maximum accuracy matters, and DistilBERT when speed and memory matter more.  
For RoBERTa and DeBERTa, a checkpoint fine-tuned on QQP and evaluated with its own tokenizer is a prerequisite for competitive accuracy.
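
One sanity check worth running on a comparison like this concerns the `argmax(dim=1)` step in `evaluate_model`. Some sequence-classification checkpoints expose a single output score instead of two class logits; whether any of the models compared here does is an assumption to verify, for example via `model.config.num_labels`. The plain-Python sketch below, with made-up scores and hypothetical helper names, shows that `argmax` over a single column always predicts class 0, so the measured accuracy collapses to the share of non-duplicates.

```python
import math


def argmax(row):
    """Index of the largest value in a list (the first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def predict(logits_row, threshold=0.5):
    """Map one row of logits to a 0/1 label for either head layout."""
    if len(logits_row) == 1:
        # Single-score head: apply a sigmoid and threshold the probability.
        prob = 1.0 / (1.0 + math.exp(-logits_row[0]))
        return int(prob >= threshold)
    # Two-class head: the usual argmax.
    return argmax(logits_row)


labels = [0, 0, 1, 0, 1]
single_logits = [[-2.0], [-1.5], [3.0], [-0.2], [1.1]]  # made-up scores

naive = [argmax(row) for row in single_logits]   # argmax over one column
fixed = [predict(row) for row in single_logits]  # sigmoid + threshold

naive_acc = sum(p == y for p, y in zip(naive, labels)) / len(labels)
fixed_acc = sum(p == y for p, y in zip(fixed, labels)) / len(labels)
print(naive, naive_acc)
print(fixed, fixed_acc)
```

If a model in the comparison turns out to have a single output, a thresholded sigmoid like the one in `predict` would be one reasonable way to turn its score into a label.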

### Finding Duplicates (1 point)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds top-5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 5 examples.

**Task 3 (1 point)**
- Implement a function for finding duplicates
- Test it on several examples (at least 5)
- Check the suggested duplicates and draw a conclusion about the model's correctness
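
The retrieval step itself does not depend on the model: given one vector per question, finding the top-k duplicates amounts to ranking by cosine similarity. A minimal plain-Python sketch with toy 2-dimensional vectors follows; `normalize` and `top_k_similar` are hypothetical names, and real embeddings would come from the transformer.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_similar(query_vec, corpus_vecs, k=5):
    """Return (index, cosine similarity) pairs for the k closest corpus vectors."""
    q = normalize(query_vec)
    # For unit vectors the cosine similarity is just the dot product.
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in corpus_vecs]
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k_similar([1.0, 0.0], corpus, k=2))
```

Normalizing the corpus vectors once, ahead of time, avoids repeating that work for every query.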

In [16]:
def get_embeddings(texts, model, tokenizer, device, batch_size=32):
    """Encode texts into L2-normalized [CLS] embeddings, returned on the CPU."""
    embeddings = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            enc = tokenizer(
                batch_texts, padding=True, truncation=True, max_length=128, return_tensors="pt"
            ).to(device)
            # Run the encoder without the classification head (`model.bert` for BERT).
            out = model.base_model(**enc)
            cls_embeds = out.last_hidden_state[:, 0, :]  # hidden state of the [CLS] token
            cls_embeds = F.normalize(cls_embeds, p=2, dim=1)
            embeddings.append(cls_embeds.cpu())
    return torch.cat(embeddings)

In [17]:
# Each searchable entry is one training pair, "text1 text2"; the set drops exact repeats.
train_texts = list({item["text1"] + " " + item["text2"] for item in qqp_preprocessed["train"]})
train_embeddings = get_embeddings(train_texts, model, tokenizer, device)
train_embeddings = train_embeddings.to(device)

In [18]:
def find_duplicates(query, train_texts, train_embeddings, model, tokenizer, device, top_k=5):
    """Return the top_k entries of train_texts most similar to the query."""
    model.eval()
    with torch.no_grad():
        enc = tokenizer([query], padding=True, truncation=True, max_length=128, return_tensors="pt").to(device)
        out = model.base_model(**enc)  # encoder only, no classification head
        query_embed = out.last_hidden_state[:, 0, :]
        query_embed = F.normalize(query_embed, p=2, dim=1)

    # The (1, d) query is broadcast against (N, d): one similarity per training entry.
    sims = F.cosine_similarity(query_embed, train_embeddings, dim=1)

    k = min(top_k, sims.size(0))
    topk_idx = torch.topk(sims, k=k).indices.tolist()
    return [train_texts[i] for i in topk_idx]

In [19]:
examples = [
    "How can I lose weight quickly?",
    "What is the best way to learn Python?",
    "How do I fix a broken phone screen?",
    "What are some good movies to watch?",
    "How to start investing in stocks?"
]

for q in examples:
    print(f"\nQuery: {q}")
    duplicates = find_duplicates(q, train_texts, train_embeddings, model, tokenizer, device)
    for i, dup in enumerate(duplicates):
        print(f"{i+1}. {dup}")


Query: How can I lose weight quickly?
1. How can I lose weight ? How can I lose my weight quickly ?
2. How can I lose weight safely? How can I lose weight quickly?
3. Why is it so hard to lose weight? How can I lose weight quickly?
4. How can you lose weight quickly? How can I lose weight ?
5. How can I lose weight ? How can someone lose weight quickly?

Query: What is the best way to learn Python?
1. What's the best way to learn Python? How can you learn Python algorithms?
2. What's the best way to learn Python? How can I learn advanced Python?
3. How do I learn Python at home? What's the best way to learn Python?
4. What is a good way to learn the violin? How can I learn how to play violin?
5. How can I learn advanced Python? What is the best source to learn Python?

Query: How do I fix a broken phone screen?
1. How do you fix a laptop that has no sound? How can I fix this laptop problem?
2. Can you fix a broken iPad screen? If so, how? How do I fix my cracked iPad screen?
3. How do

### Bonus: Finding Duplicates Faster (0.5 point)

Try to find a way to run the function faster than just passing over all questions in a loop. For instance, you can form a short list of potential candidates using a cheaper method, and then run your transformer on that short list. If you opt for this solution, please keep both the original implementation and the optimized one, and briefly explain the difference between them.

**Bonus Task 1 (0.5 point)**
- Speed up your implementation from the "Finding Duplicates" part
- Measure the running time of both the old and the new implementation
- Describe your approach
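
The short-list idea can be sketched as a two-stage search: a cheap token-overlap filter over an inverted index, followed by an expensive scorer applied only to the survivors. Everything below is a hypothetical illustration in plain Python; in particular, `toy_score` merely stands in for the transformer-based similarity, and the function names are not part of the assignment.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercased set of whitespace-separated tokens, question marks removed."""
    return set(text.lower().replace("?", " ").split())


def build_index(corpus):
    """Inverted index: token -> set of ids of the questions containing it."""
    index = defaultdict(set)
    for i, text in enumerate(corpus):
        for tok in tokenize(text):
            index[tok].add(i)
    return index


def shortlist(query, index, corpus, size=50):
    """Cheap stage: rank questions sharing a token with the query by Jaccard overlap."""
    q = tokenize(query)
    candidates = set()
    for tok in q:
        candidates |= index.get(tok, set())

    def jaccard(i):
        d = tokenize(corpus[i])
        return len(q & d) / len(q | d)

    return sorted(candidates, key=jaccard, reverse=True)[:size]


def find_duplicates_fast(query, corpus, index, expensive_score, top_k=5, size=50):
    """Expensive stage: rerank only the short list with the costly scorer."""
    cands = shortlist(query, index, corpus, size)
    ranked = sorted(cands, key=lambda i: expensive_score(query, corpus[i]), reverse=True)
    return [corpus[i] for i in ranked[:top_k]]


def toy_score(a, b):
    """Stand-in for the transformer similarity: number of shared tokens."""
    return len(tokenize(a) & tokenize(b))


corpus = [
    "How can I lose weight quickly?",
    "What is the best way to learn Python?",
    "How do I fix a broken phone screen?",
    "How can I learn to play the violin?",
]
index = build_index(corpus)
print(find_duplicates_fast("How do I lose weight fast?", corpus, index, toy_score, top_k=1, size=2))
```

The trade-off is recall: a true duplicate that shares no tokens with the query never reaches the second stage, so the short-list size is worth tuning.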

In [None]:
<A whole lot of YOUR CODE HERE>

### Bonus: Finding Duplicates in Old-Fashioned way (1.5 points)

In this bonus task you are supposed to use pretrained embeddings (word2vec, GloVe or fastText) to solve the duplicates problem.

**Bonus Task 2 (1.5 points)**
- Solve the Finding Duplicates problem using the mentioned embeddings
- Compare the old-fashioned solution to the previous ones (quality, speed, etc.)
- Write a short report (up to 5 steps, results and conclusions) on the work done in this part
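
One common baseline for this kind of task is to average the static vectors of the words in a question and compare questions by cosine similarity. The sketch below uses a tiny hand-made vector table (`TOY_VECTORS`, purely hypothetical) so that it runs without downloads; real vectors would be loaded from a pretrained word2vec, GloVe or fastText model.

```python
import math

# Hypothetical toy vectors; real ones would come from word2vec, GloVe or fastText.
TOY_VECTORS = {
    "lose": [0.9, 0.1, 0.0],
    "weight": [0.8, 0.2, 0.1],
    "reduce": [0.85, 0.15, 0.05],
    "learn": [0.0, 0.9, 0.2],
    "python": [0.1, 0.8, 0.4],
}


def sentence_vector(text, vectors):
    """Average the vectors of the known words; None if no word is known."""
    words = [w.strip("?.,!").lower() for w in text.split()]
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


q = sentence_vector("How to lose weight?", TOY_VECTORS)
same = sentence_vector("How can I reduce weight?", TOY_VECTORS)
other = sentence_vector("How to learn Python?", TOY_VECTORS)
print(cosine(q, same) > cosine(q, other))
```

Averaging ignores word order and out-of-vocabulary words, which is one of the differences worth discussing when comparing this approach with the contextual model.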

In [None]:
<A whole lot of YOUR CODE HERE>