# Week 4: Transfer Learning, BERT (Seminar)

### Using pretrained transformers (for fun, profit and 1 point)

There are many toolkits that let you access pretrained transformer models (just as we used pretrained embeddings earlier), but the most powerful and convenient by far is 🤗 [`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pretrained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pretrained model, you can do that in one line of code using a `pipeline`. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question answering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. BERT fine-tuned for classification
* output post-processing
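
To make the last stage concrete, here is a minimal sketch of what output post-processing can look like for a classifier: raw logits go through a softmax and the most probable class is reported with its score. The function name `postprocess`, the logits and the label mapping are all made up for illustration; the real pipeline's internals may differ.

```python
import math

def postprocess(logits, id2label):
    """Turn raw classifier logits into a {"label", "score"} dict via softmax."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for a two-class sentiment model
example = postprocess([-1.2, 2.3], {0: "NEGATIVE", 1: "POSITIVE"})
print(example)
```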

Let's see it in action:

In [None]:
import transformers

In [None]:
sentiment_clf = transformers.pipeline(
    task="sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

sentiment_clf(["transformers library can be really useful!", "YSDA midterm is soon"])

In [None]:
transformers.pipelines.SUPPORTED_TASKS.keys()

But how do we find a model that suits a given task among so many available models?

Option 1: Use search and filters on the [web interface](https://huggingface.co/models) (user-friendly)

Option 2: Use the `huggingface_hub` library to access the API from Python (handy if you want to automate the process)


In [None]:
import huggingface_hub

In [None]:
some_model = next(huggingface_hub.list_models())

some_model

In [None]:
# Named `model_tags` rather than `filter` to avoid shadowing the Python builtin
model_tags = (
    "sentiment-analysis",
    "pytorch",
    "ru",
)

filtered_models = huggingface_hub.list_models(
    filter=model_tags,
    sort="downloads",
    limit=10,
)

print(f"Filtered by {model_tags}:")
for model in filtered_models:
    print(f"- https://huggingface.co/{model.id} ({model.downloads} downloads, {model.likes} likes)")

Imagine you have a long text to read and not enough time. Luckily, one of the pipelines can help! But which one?

**Task 1 (0.5 points)**
- Find a suitable pipeline and model for the text below
- Apply the model to the long text to get a short one
- Pretty-print the result and say whether the short text is any good



In [None]:
long_text = """
The widespread adoption of remote work, accelerated by global events in the early 2020s, has triggered a significant and likely permanent shift in how we think about the workplace. This transition away from the traditional central office is having profound and multifaceted effects on urban economies, reshaping everything from commercial real estate to local small businesses.

One of the most immediate and visible impacts has been on the commercial real estate sector. With companies downsizing their physical footprints or adopting fully remote models, demand for office space has plummeted. This has led to rising vacancy rates, downward pressure on commercial rent prices, and a re-evaluation of the financial viability of large office buildings. City governments, which often rely heavily on property taxes from these high-value commercial properties, are now facing substantial budget shortfalls.

Furthermore, the daily rhythm of city centers has changed dramatically. The decline in the number of commuters has had a ripple effect on local businesses that once thrived on their patronage. Lunchtime cafes, after-work bars, dry cleaners, and public transit systems have all experienced a significant drop in revenue. This "doughnut effect" describes a phenomenon where the economic activity hollows out in the city center and increases in suburban residential areas as people work from home and spend their money locally.

However, it's not all negative. This shift also presents new opportunities. Some urban planners see a chance to repurpose vacant office buildings into much-needed residential housing, which could help address housing shortages and revitalize neighborhoods by creating 24/7 communities. Additionally, the ability to work remotely has spurred a reversal of rural depopulation in some regions, as professionals seek a better quality of life outside of major metropolitan areas, potentially distributing economic growth more evenly.

In conclusion, the remote work revolution is fundamentally restructuring urban economies. While it presents serious challenges to established systems like commercial real estate and downtown commerce, it also opens the door to innovative urban renewal and a more geographically dispersed economic landscape. The long-term effects will depend on how effectively cities and businesses can adapt to this new, more flexible paradigm.
"""

short_text = ... # Pipeline magic goes here

In [None]:
assert len(long_text) / len(short_text) > 5, "Too long, didn't read"

One of the self-supervised tasks used in BERT pretraining is Masked Language Modeling (MLM), so our model has some text prediction capabilities!
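
As a rough illustration of how an MLM training example is built, the sketch below hides a random subset of tokens and remembers the originals as prediction targets. It is simplified on purpose: the helper `mask_tokens` and its parameters are hypothetical, and every selected token is replaced by `[MASK]`, whereas BERT's actual recipe also sometimes keeps the token or swaps in a random one.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, targets); targets hold the original token or None."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(token)  # the model must recover this token
        else:
            masked.append(token)
            targets.append(None)   # this position is not scored
    return masked, targets

tokens = "my name is slim shady and i like to rap".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

Only the masked positions contribute to the loss, so the model learns to predict a token from its left and right context.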



In [None]:
mlm_model = transformers.pipeline(
    task="fill-mask",
    model="bert-base-cased"
)

mlm_model("My name is [MASK] Shady!")

To make the result more readable, we can take just the top-1 prediction:

In [None]:
mlm_model("My name is [MASK] Shady!")[0]["sequence"]

**Task 2 (0.5 points)**
- Using BERT's ability to solve the MLM task, find answers to the following questions
- Perform some fact-checking, don't trust LLMs!

**Questions:**
- When was YSDA founded?
- Who was the first to invent radio?
- What is the fifth Fibonacci number?

---

### The building blocks of a pipeline

Huggingface also lets you access its pipelines at a lower level. There are two main abstractions:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a PyTorch `nn.Module` with pretrained weights

You can use such models as part of your regular PyTorch code: insert one as a layer in your model, apply it to a batch of data, backpropagate, optimize, etc.

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

In [None]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    "I have no idea what pneumonoultramicroscopicsilicovolcanoconiosis is."
]

tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")
print("Tokenized:")
print(tokens_info)

print("\nDetokenized:")
for i in range(len(lines)):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

You can see that some special tokens appeared alongside our original text. They usually carry additional information for the model, so the model treats them differently from ordinary tokens.

You can list all special tokens used by the tokenizer (you can also add your own special tokens, but make sure the model sees them during training):

In [None]:
tokenizer.special_tokens_map

In [None]:
tokenizer("First sentence", "Second sentence", return_token_type_ids=True)

It's inefficient to put every possible word in the vocabulary, but we still want to handle arbitrary text instead of putting `[UNK]` everywhere.

WordPiece tokenization is here to help!
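
The core idea can be sketched as a greedy longest-match-first split of each word, with `##` marking pieces that continue a word. This is a toy version with a hand-made vocabulary (`toy_vocab` and the function name are assumptions for illustration), not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the current word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"un", "##aff", "##able", "play", "##ing", "##s"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playings", toy_vocab))   # ['play', '##ing', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

With a realistic vocabulary that includes single characters, very few words fall back to `[UNK]`.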

In [None]:
reversed_vocab = {token_id: token for token, token_id in tokenizer.vocab.items()}

In [None]:
for token_id in tokens_info["input_ids"][2]:
    print(reversed_vocab[token_id.item()], end=' ')

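Now you can feed the tokenized data to the model.

Depending on your task, you can use different parts of the output. For example, a pooled `[CLS]` representation (for BERT, the final `[CLS]` hidden state passed through a linear layer and tanh) is available under the `pooler_output` key of the model output.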

In [None]:
import torch

In [None]:
with torch.no_grad():
    out = model(**tokens_info)

print(out['pooler_output'])

Transformers knowledge hub: https://huggingface.co/transformers/



---



### Visualizing BERT

Interpretability is one of the key factors in understanding a model's behaviour.

Neural networks are harder to interpret than classic ML models, but it's not impossible!

Remember the attention mechanism? It's a human-understandable concept: look closely at the tokens that matter most for the context of the current one.
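
As a reminder, the weights that attention visualizations display come from a softmax over scaled dot products between one token's query and every token's key. The sketch below computes them for a single query with plain Python lists; the vectors are made up, and a real model does this per head with learned projections.

```python
import math

def attention_weights(query, keys):
    """Softmax over scaled dot products between one query and all keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy key vectors for three tokens and a query aligned with the first axis
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights([1.0, 0.0], keys)
print(weights)
```

Each row of an attention map is one such list of weights: non-negative, summing to one, and larger for keys that align with the query.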

In [None]:
!pip install bertviz

In [None]:
from transformers import AutoTokenizer, AutoModel, utils
from bertviz import model_view, head_view

input_text = "Every time I try to interpret BERT model behaviour, I find new interesting patterns"
model = AutoModel.from_pretrained("bert-base-cased", output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model(inputs)
attention = outputs[-1]
tokens = tokenizer.convert_ids_to_tokens(inputs[0])

model_view(attention, tokens)

In [None]:
head_view(attention, tokens)

Another task used in BERT pretraining is Next Sentence Prediction.

How do BERT's heads look at tokens in that case?

In [None]:
inputs = tokenizer.encode("I'm waiting for important call", "I can't go out right now", return_tensors="pt")
outputs = model(inputs)
attention = outputs[-1]
tokens = tokenizer.convert_ids_to_tokens(inputs[0])

model_view(attention, tokens)

In [None]:
head_view(attention, tokens)

It looks interesting, doesn't it?

If you want to find out more about attention patterns, you can refer to a dedicated "field" of science: [BERTology](https://huggingface.co/docs/transformers/main/en/bertology).



---



### Tuning pretrained transformers (for your own task and 2 points)

An important benefit of big pretrained models is that they can be adapted to various tasks without spending the time and resources needed for full training.

You may have heard of backbone models in other ML tasks, where they are tuned on task-specific data.

It's possible to tune the model's weights directly, but you can also freeze the model, use its outputs as features, and extract the necessary information with a much smaller neural network.
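
The frozen-backbone option can be sketched in plain PyTorch as follows. The class name `FrozenBackboneClassifier` is hypothetical, and a tiny `nn.Sequential` stands in for BERT so that the example runs without downloading anything; with a real transformer you would pass its hidden states to the head instead.

```python
import torch
from torch import nn

class FrozenBackboneClassifier(nn.Module):
    """A frozen feature extractor followed by a small trainable linear head."""

    def __init__(self, backbone, hidden_size, num_classes):
        super().__init__()
        self.backbone = backbone
        for param in self.backbone.parameters():
            param.requires_grad = False  # freeze: the optimizer will not update these
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, features):
        with torch.no_grad():  # no gradients are needed through the frozen part
            hidden = self.backbone(features)
        return self.head(hidden)

toy_backbone = nn.Sequential(nn.Linear(16, 32), nn.Tanh())  # stand-in for BERT
clf = FrozenBackboneClassifier(toy_backbone, hidden_size=32, num_classes=3)

trainable = [name for name, p in clf.named_parameters() if p.requires_grad]
print(trainable)  # only the head's weight and bias remain trainable
logits = clf(torch.randn(4, 16))
print(logits.shape)
```

Only the head's parameters need to be passed to the optimizer, which keeps training cheap at the cost of not adapting the backbone to the task.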

#### Introduction

Here's an example of a BERT base model tuned for the Named Entity Recognition (NER) task:

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = transformers.AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

In [None]:
model

As you can see, there's an additional classifier on top of the original BERT layers. It predicts an NER class for each of BERT's token outputs.

BERT is well suited to tuning for different tasks, since it outputs per-token embeddings as well as an embedding of the whole input in the `[CLS]` token.

#### Data preparation

In [None]:
import datasets

In [None]:
dataset = datasets.load_dataset("lhoestq/conll2003")

In [None]:
dataset

In [None]:
dataset["train"][0]

Since BERT's tokenization differs from the dataset's, we need to reconcile the two.

**Task 3 (0.5 points)**
- Align the dataset's token labels to WordPiece tokens
- Handle special tokens as well
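
For orientation, here is a toy illustration of one common alignment convention, not necessarily the expected solution: the first piece of each word keeps the word's label, while later pieces and special tokens get `-100`, the value PyTorch's cross-entropy loss ignores by default. The `toy_word_ids` list is written by hand; for a real batch, fast tokenizers expose the same mapping through the `word_ids()` method of their output.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def align_labels(word_ids, word_labels):
    """Map word-level labels onto subword tokens for a single sentence."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:             # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:  # first piece of a word keeps its label
            aligned.append(word_labels[word_id])
        else:                           # later pieces of the same word are masked
            aligned.append(IGNORE_INDEX)
        previous_word = word_id
    return aligned

# Hypothetical word ids for the tokens: [CLS] Wash ##ington is big [SEP]
toy_word_ids = [None, 0, 0, 1, 2, None]
toy_labels = [5, 0, 0]  # made-up label ids, one per original word
print(align_labels(toy_word_ids, toy_labels))  # [-100, 5, -100, 0, 0, -100]
```

An alternative convention labels every piece of a word; whichever you choose, evaluation has to skip the same positions.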

In [None]:
from transformers import AutoTokenizer, DataCollatorForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(samples):
    tokenized_inputs = tokenizer(samples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, original_labels in enumerate(samples["ner_tags"]):
        ... # Label aligning goes here

    tokenized_inputs["labels"] = labels

    return tokenized_inputs

In [None]:
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

In [None]:
tokenized_dataset["train"][0]

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

Now the dataset is ready to be used by BERT.

#### Model preparation

For our task we can use `AutoModelForTokenClassification`, which already provides the required architecture with a token classification head (the classifier itself and the class outputs).

You can also handle this yourself: create a PyTorch model class, initialize a BERT model and a `Linear` layer for classification, then override the `forward` method and so on...

We chose `AutoModelForTokenClassification` for the sake of simplicity, but an ML engineer should still be able to build this by hand.

In [None]:
from transformers import AutoModelForTokenClassification

id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}
# `label_id` rather than `id`, to avoid shadowing the Python builtin
label2id = {label: label_id for label_id, label in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(id2label),  # 9 classes, kept in sync with the mapping above
    id2label=id2label,
    label2id=label2id,
)

#### Evaluation

Evaluation is crucial when writing papers or reporting your results. It can be tricky, and a home-grown implementation can be buggy, so it is usually preferable to compute metrics with an established library.
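
To see why NER scoring is tricky, note that metrics are usually computed over whole entities rather than individual tags: a predicted entity counts only if both its type and its exact boundaries match. The sketch below illustrates that idea for a single sentence with IOB2 tags. The helper names are hypothetical, and this is a simplified illustration rather than a replacement for a tested library.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from an IOB2 tag sequence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes the last span
        if tag.startswith("I-") and start is not None and tag[2:] == ent_type:
            continue  # the current entity keeps going
        if start is not None:
            entities.add((ent_type, start, i))  # close the open entity
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]  # open a new entity
    return entities

def entity_scores(y_true, y_pred):
    """Entity-level precision, recall and F1 for one sentence."""
    true_spans = extract_entities(y_true)
    pred_spans = extract_entities(y_pred)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["B-PER", "I-PER", "O", "B-LOC"]
y_pred = ["B-PER", "O", "O", "B-LOC"]
print(entity_scores(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

Three of the four predicted tags are correct here, yet the truncated person entity counts as a full miss, so token-level accuracy would paint a rosier picture than the entity-level scores.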

In [None]:
!pip install seqeval

Let's prepare a `compute_metrics` function for the training loop below:

In [None]:
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score
from seqeval.scheme import IOB2

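def compute_metrics(p):
    # p unpacks into (logits, label ids); logits have shape (batch, seq_len, num_labels)
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    y_true = []
    y_pred = []
    for sample_predictions, sample_labels in zip(predictions, labels):
        y_true_sample = []
        y_pred_sample = []
        for predicted_id, label_id in zip(sample_predictions, sample_labels):
            # -100 marks positions that must not be scored (e.g. padding added by the collator)
            if label_id == -100:
                continue

            y_true_sample.append(id2label[int(label_id)])
            y_pred_sample.append(id2label[int(predicted_id)])

        y_true.append(y_true_sample)
        y_pred.append(y_pred_sample)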

    return {
        "precision": precision_score(y_true, y_pred, mode="strict", scheme=IOB2),
        "recall": recall_score(y_true, y_pred, mode="strict", scheme=IOB2),
        "f1": f1_score(y_true, y_pred, mode="strict", scheme=IOB2),
    }

#### Training

**Task 4 (0.5 points)**
- Choose suitable hyperparameters for tuning the model
- Set up the HF `Trainer`
- Check correctness using the training results


In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./bert-ner",
    eval_strategy="steps",
    eval_steps=50,
    logging_steps=50,
    logging_dir="./logs",
    report_to="none",

    learning_rate=...,
    num_train_epochs=...,
    per_device_train_batch_size=...,
    per_device_eval_batch_size=...,
)

In [None]:
trainer = Trainer(
    ... # One step away from new level of fit-predict
)

Let's check the metrics before training:

In [None]:
results = trainer.evaluate(tokenized_dataset["test"])
print(results)

In [None]:
trainer.train()

In [None]:
results = trainer.evaluate(tokenized_dataset["test"])
print(results)

Compare test metrics before and after training. Did we succeed?

**Task 5 (1 point)**
- Compare our model's results with `dslim/bert-base-NER`
- Try to improve our model's quality. Choose any option:
  - Play with training hyperparameters (batch_size, lr, epochs, etc.)
  - Apply some training techniques (warm-up, lr-scheduling, etc.)
  - Perform error analysis and find the model's weak spots (this option doesn't require fixing them)
  - Your very own idea
- Write a small report (up to 5 steps, results and conclusions) on the work done in the "Tuning pretrained transformers" part
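
For the warm-up and lr-scheduling option, the sketch below shows the shape of one widely used schedule: the learning rate grows linearly from zero during warm-up and then decays linearly back to zero. The function name, step counts and base learning rate are made up for illustration. With the HF `Trainer` you would normally not code this by hand but configure it through `TrainingArguments` (for example `warmup_steps` and `lr_scheduler_type`).

```python
def lr_multiplier(step, warmup_steps, total_steps):
    """Linear warm-up from 0 to 1, then linear decay from 1 back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 5e-5  # hypothetical peak learning rate
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * lr_multiplier(step, warmup_steps=100, total_steps=1000))
```

Warm-up keeps the first updates small while the freshly initialized classification head is still producing noisy gradients.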