Soft deadline: `30.03.2022 23:59`

In this homework you will learn how the fine-tuning procedure works and get acquainted with the Hugging Face Datasets library.

In [1]:
# ! pip install datasets
# ! pip install transformers

For our purposes we will use the [Datasets](https://huggingface.co/docs/datasets/) library and the `yahoo_answers_topics` dataset. The task for this dataset is to classify documents into 10 topic categories. More detailed information can be found on the dataset [page](https://huggingface.co/datasets/viewer/).


In [2]:
from datasets import load_dataset
import numpy as np

## Import dataset

In [3]:
# the result is a dataset dictionary of train and test splits in this case
dataset = load_dataset('yahoo_answers_topics')
#dataset['train'] = dataset['train'].select(np.arange(140000))
#dataset['test'] = dataset['test'].select(np.arange(6000))


Reusing dataset yahoo_answers_topics (C:\Users\Jerzy\.cache\huggingface\datasets\yahoo_answers_topics\yahoo_answers_topics\1.0.0\b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902)
100%|██████████| 2/2 [00:00<00:00, 62.50it/s]


# Fine-tuning the model (20 points)

In [4]:
from transformers import (ElectraTokenizer, ElectraForSequenceClassification,
                          get_scheduler, pipeline, ElectraForMaskedLM, ElectraModel)

import torch
from torch.utils.data import DataLoader
from datasets import load_metric


The fine-tuning procedure for the end task consists of adding extra layers on top of the pre-trained model. The resulting model can be tuned fully (passing gradients through the whole model) or partially.

**Task**: 
- load tokenizer and model
- look at the predictions of the model as-is before any fine-tuning


```
- Why don't you ask [MASK]?
- What is [MASK]
- Let's talk about [MASK] physics
```

- convert `best_answer` to the input tokens (supporting function for dataset is provided below) 

```
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
```

- define the optimizer and, optionally, a scheduler
- fine-tune the model (write the training loop), plot the loss over training, and measure the results in terms of weighted F1 score
- get the masked-word predictions (for the sample sentences above) from the fine-tuned model; explain why the results look the way they do and what should be done to change that (write down your answer)
- tune the training hyperparameters (and write down your results)

**Tips**:
- The easiest way to get predictions is to use the `pipeline` function from transformers
- Do not forget to set the `num_labels` parameter when initializing the model
- To convert data to batches use `DataLoader`
- Even the `small` version of Electra can take a long time to train, so you can use a data sample (>= 5000 examples; set a seed for reproducibility)
- You may want to try freezing (not updating the pretrained model weights) all the layers except the ones for classification; in that case use:


```
for param in model.electra.parameters():
    param.requires_grad = False
```
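
The subsampling tip can be sketched without downloading anything: draw a reproducible set of row indices with a fixed seed, then pass them to the dataset. The helper name `sample_indices`, the seed 42, and the sample size 5000 are illustrative assumptions; the train-split size of 1,400,000 is the one reported for this dataset in this notebook.

```python
import random

def sample_indices(num_rows, sample_size, seed=42):
    """Return a sorted, reproducible sample of distinct row indices."""
    rng = random.Random(seed)  # local generator, global random state stays untouched
    return sorted(rng.sample(range(num_rows), sample_size))

train_indices = sample_indices(num_rows=1_400_000, sample_size=5_000)
print(len(train_indices), train_indices[:3])

# With the real dataset the indices would be applied as (not executed here):
# dataset["train"] = dataset["train"].select(train_indices)
```

Calling `dataset["train"].shuffle(seed=42).select(range(5000))` should give a comparable reproducible sample in one line.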


## Predictions before fine-tuning

In [5]:
MODEL_NAME = "google/electra-small-generator"
TOKENIZER_NAME = "google/electra-small-generator"
# create mask filler
fill_mask = pipeline("fill-mask", model=MODEL_NAME, tokenizer=TOKENIZER_NAME)


In [6]:
print(fill_mask(f"- Why don't you ask {fill_mask.tokenizer.mask_token}?"))


[{'score': 0.5620249509811401, 'token': 2033, 'token_str': 'me', 'sequence': "- why don't you ask me?"}, {'score': 0.0762922465801239, 'token': 3980, 'token_str': 'questions', 'sequence': "- why don't you ask questions?"}, {'score': 0.03575656935572624, 'token': 2068, 'token_str': 'them', 'sequence': "- why don't you ask them?"}, {'score': 0.030271142721176147, 'token': 2339, 'token_str': 'why', 'sequence': "- why don't you ask why?"}, {'score': 0.027905305847525597, 'token': 2023, 'token_str': 'this', 'sequence': "- why don't you ask this?"}]


In [7]:
print(fill_mask(f"- What is {fill_mask.tokenizer.mask_token}"))


[{'score': 0.8885705471038818, 'token': 1029, 'token_str': '?', 'sequence': '- what is?'}, {'score': 0.06915915757417679, 'token': 1012, 'token_str': '.', 'sequence': '- what is.'}, {'score': 0.03511178493499756, 'token': 999, 'token_str': '!', 'sequence': '- what is!'}, {'score': 0.006375017575919628, 'token': 1011, 'token_str': '-', 'sequence': '- what is -'}, {'score': 0.0001770858361851424, 'token': 1000, 'token_str': '"', 'sequence': '- what is "'}]


In [8]:
print(
    fill_mask(f"- Let's talk about {fill_mask.tokenizer.mask_token} physics"))


[{'score': 0.24689149856567383, 'token': 8559, 'token_str': 'quantum', 'sequence': "- let's talk about quantum physics"}, {'score': 0.21904081106185913, 'token': 9373, 'token_str': 'theoretical', 'sequence': "- let's talk about theoretical physics"}, {'score': 0.061686038970947266, 'token': 10811, 'token_str': 'particle', 'sequence': "- let's talk about particle physics"}, {'score': 0.037109434604644775, 'token': 2613, 'token_str': 'real', 'sequence': "- let's talk about real physics"}, {'score': 0.029618320986628532, 'token': 4517, 'token_str': 'nuclear', 'sequence': "- let's talk about nuclear physics"}]


The predictions of the pretrained model, before any fine-tuning, are plausible: the top candidates fit each sentence both grammatically and semantically.

## Tokenize Dataset

In [9]:
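# define tokenizer and base model
# note: the bare ElectraModel has no classification head;
# the classifier itself is created in the Fine-tuning section
tokenizer = ElectraTokenizer.from_pretrained(TOKENIZER_NAME)
model = ElectraModel.from_pretrained(MODEL_NAME, num_labels=10)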


Some weights of the model checkpoint at google/electra-small-generator were not used when initializing ElectraModel: ['generator_predictions.dense.weight', 'generator_predictions.dense.bias', 'generator_lm_head.weight', 'generator_lm_head.bias', 'generator_predictions.LayerNorm.weight', 'generator_predictions.LayerNorm.bias']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [10]:
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)


# tokenize dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets


Loading cached processed dataset at C:\Users\Jerzy\.cache\huggingface\datasets\yahoo_answers_topics\yahoo_answers_topics\1.0.0\b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902\cache-fe1db3e31fea470e.arrow
Loading cached processed dataset at C:\Users\Jerzy\.cache\huggingface\datasets\yahoo_answers_topics\yahoo_answers_topics\1.0.0\b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902\cache-fd461c566b612018.arrow


DatasetDict({
    train: Dataset({
        features: ['id', 'topic', 'question_title', 'question_content', 'best_answer', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1400000
    })
    test: Dataset({
        features: ['id', 'topic', 'question_title', 'question_content', 'best_answer', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 60000
    })
})

In [11]:
# rename topic into labels
tokenized_datasets = tokenized_datasets.rename_column('topic', 'labels')


In [None]:
# remove extra columns that are not needed anymore
tokenized_datasets = tokenized_datasets.remove_columns(
    ['id', 'question_title', 'question_content', 'best_answer'])


In [13]:
tokenized_datasets.set_format("torch")

In [14]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1400000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 60000
    })
})

In [15]:
# apply dataloader
dataloader_train = DataLoader(
    tokenized_datasets['train'], shuffle=True, batch_size=32)
dataloader_test = DataLoader(tokenized_datasets['test'], batch_size=32)
next(iter(dataloader_train))


{'labels': tensor([8, 0, 8, 2, 0, 4, 4, 8, 0, 7, 9, 5, 6, 4, 1, 9, 3, 2, 1, 4, 9, 5, 2, 4,
         3, 7, 5, 4, 4, 8, 9, 8]),
 'input_ids': tensor([[ 101, 2064, 1005,  ...,    0,    0,    0],
         [ 101, 2024, 2017,  ...,    0,    0,    0],
         [ 101, 1020, 2086,  ...,    0,    0,    0],
         ...,
         [ 101, 2672, 1996,  ...,    0,    0,    0],
         [ 101, 7479, 1012,  ...,    0,    0,    0],
         [ 101, 1045, 1005,  ...,    0,    0,    0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])}
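
The batch above shows what `padding="max_length"` and `truncation=True` produce: every row has the same length, padded positions hold token id 0, and `attention_mask` is 0 at those positions. A simplified pure-Python sketch of that bookkeeping follows. The helper `pad_and_mask` and the toy maximum length of 8 are assumptions made for illustration; the real tokenizer does this internally and handles special tokens as well.

```python
PAD_ID = 0  # the batch above is padded with id 0

def pad_and_mask(token_ids, max_length, pad_id=PAD_ID):
    """Truncate or right-pad one sequence and build its attention mask."""
    kept = token_ids[:max_length]               # truncation
    num_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * num_pad       # padding to a fixed length
    attention_mask = [1] * len(kept) + [0] * num_pad
    return input_ids, attention_mask

# Toy sequence; the ids are illustrative, not a row from the dataset.
ids, mask = pad_and_mask([101, 2064, 1005, 102], max_length=8)
print(ids)   # [101, 2064, 1005, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```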

## Fine-tuning

In [16]:
# model fine-tuning alg is borrowed from https://huggingface.co/docs/transformers/training#finetune-in-native-pytorch
model = ElectraForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=10)
model


Some weights of the model checkpoint at google/electra-small-generator were not used when initializing ElectraForSequenceClassification: ['generator_predictions.dense.weight', 'generator_predictions.dense.bias', 'generator_lm_head.weight', 'generator_lm_head.bias', 'generator_predictions.LayerNorm.weight', 'generator_predictions.LayerNorm.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-generator and are newly initializ

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (embeddings_project): Linear(in_features=128, out_features=256, bias=True)
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0): ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=256, out_features=256, bias=True)
              (key): Linear(in_features=256, out_features=256, bias=True)
              (value): Linear(in_features=256, out_features=256, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_

In [29]:
# freeze all layers except the classification head
# (left commented out here, so the whole model is fine-tuned)
# for param in model.electra.parameters():
#     param.requires_grad = False

In [18]:
from torch.optim import AdamW
# defining optimizer 
optimizer = AdamW(model.parameters(), lr=5e-5)

In [19]:
num_epochs = 3
num_training_steps = num_epochs * len(dataloader_train)
# define scheduler
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
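
The numbers behind this schedule can be checked by hand. With 1,400,000 training rows and a batch size of 32 there are 43,750 batches per epoch, so three epochs give 131,250 steps. The sketch below reproduces the shape of a linear schedule (linear warmup, then linear decay to zero) in plain Python; `linear_lr` is a hypothetical helper written for illustration and is not the library's implementation.

```python
import math

num_rows, batch_size, num_epochs = 1_400_000, 32, 3
# DataLoader keeps the last partial batch by default, hence the ceiling
steps_per_epoch = math.ceil(num_rows / batch_size)
num_training_steps = num_epochs * steps_per_epoch
print(num_training_steps)  # 131250

def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# learning rate at the start, the midpoint, and the end of training
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, base_lr=5e-5, total_steps=num_training_steps))
```

With `num_warmup_steps=0`, as used here, the learning rate starts at its full value of 5e-5 and reaches zero on the last step.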

In [20]:
import torch
# use the GPU when available to speed up training, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

In [21]:
# set progress bar
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

# fine-tune the whole model (no layers are frozen)
model.train()
for epoch in range(num_epochs):
    for batch in dataloader_train:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)


100%|██████████| 131250/131250 [6:37:27<00:00,  5.51it/s]  
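
The task also asks for a plot of the loss, which the loop above does not record. One way to get it, sketched under assumptions: append `loss.item()` to a list (hypothetically named `losses`) inside the loop, then smooth the per-batch values before plotting, since single-batch losses are noisy. The helper `moving_average` and the window size are illustrative choices, and the loss values below are made up for the demonstration.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (valid positions only)."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    smoothed = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]  # slide the window by one
        smoothed.append(running / window)
    return smoothed

# Made-up per-batch losses standing in for the recorded `losses` list.
losses = [2.30, 2.25, 2.31, 2.10, 2.05, 1.98, 2.02, 1.90]
smoothed = moving_average(losses, window=4)
print(len(smoothed))  # 5

# Plotting (not executed here), assuming matplotlib is installed:
# import matplotlib.pyplot as plt
# plt.plot(smoothed); plt.xlabel("step"); plt.ylabel("loss"); plt.show()
```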

In [22]:
from datasets import load_metric

# evaluate f1 score
metric = load_metric("f1")
model.eval()
for batch in dataloader_test:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
metric.compute(average='weighted')


{'f1': 0.5199885940065225}
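
To make the metric concrete: weighted F1 computes a per-class F1 score and averages the classes with weights proportional to how often each class occurs in the references. The sketch below is a plain-Python reimplementation for illustration; `weighted_f1` is a hypothetical helper, not the function that `load_metric("f1")` calls, and the labels are toy values.

```python
from collections import Counter

def weighted_f1(references, predictions):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(references)
    pairs = list(zip(references, predictions))
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count          # weight by the number of true examples
    return total / len(references)

refs  = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
print(round(weighted_f1(refs, preds), 4))  # 0.6556
```

In the toy example the per-class scores are 0.5, 0.8 and 2/3; each class has two reference examples, so the weighted average equals the plain mean.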

## Masked word prediction after fine-tuning

In [23]:
# move the model back to CPU after training
device = torch.device("cpu")
model.to(device)


In [24]:
# save the fine-tuned classifier and reload its encoder weights into a masked-LM model
model.save_pretrained('pretrained_model')
model = ElectraForMaskedLM.from_pretrained('pretrained_model')


Some weights of the model checkpoint at pretrained_model were not used when initializing ElectraForMaskedLM: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
- This IS expected if you are initializing ElectraForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForMaskedLM were not initialized from the model checkpoint at pretrained_model and are newly initialized: ['generator_predictions.dense.weight', 'generator_predictions.dense.bias', 'generator_lm_head.weight', 'generator_lm_head.bias', 'generator_predictions.LayerNorm.weight', 'generato
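
The two warnings above can be read as set differences between the parameter names stored in the checkpoint and the names the new model class expects. A small sketch using the head names printed in this notebook's warnings (the generator head names are taken from the loading warning in the Fine-tuning section). The shared encoder is abbreviated to a single placeholder entry, `electra.*`, which is an illustrative stand-in and not a real key.

```python
# Head parameters stored in the fine-tuned classifier checkpoint.
classifier_head = {
    "classifier.dense.weight", "classifier.dense.bias",
    "classifier.out_proj.weight", "classifier.out_proj.bias",
}
# Head parameters a masked-LM model expects.
mlm_head = {
    "generator_predictions.dense.weight", "generator_predictions.dense.bias",
    "generator_predictions.LayerNorm.weight", "generator_predictions.LayerNorm.bias",
    "generator_lm_head.weight", "generator_lm_head.bias",
}
encoder = {"electra.*"}  # placeholder for the shared encoder weights

checkpoint_keys = encoder | classifier_head
expected_keys = encoder | mlm_head

unused = checkpoint_keys - expected_keys             # dropped when loading
newly_initialized = expected_keys - checkpoint_keys  # not restored from the checkpoint
print(sorted(unused))
print(sorted(newly_initialized))
```

Only the encoder carries over. The masked-LM head is not restored from the saved checkpoint, which matters when interpreting the predictions that follow.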

In [25]:
# create a new pipeline with the reloaded model
tokenizer = ElectraTokenizer.from_pretrained(TOKENIZER_NAME)
fill_mask_new = pipeline("fill-mask", model=model, tokenizer=tokenizer)


In [26]:
print(fill_mask_new(
    f"- Why don't you ask {fill_mask_new.tokenizer.mask_token}?"))


[{'score': 0.00035911804297938943, 'token': 24577, 'token_str': 'c o m p l i a n t', 'sequence': "- why don't you ask compliant?"}, {'score': 0.00031316454987972975, 'token': 19616, 'token_str': 's o r t e d', 'sequence': "- why don't you ask sorted?"}, {'score': 0.00030794698977842927, 'token': 9954, 'token_str': '# # r o o m', 'sequence': "- why don't you askroom?"}, {'score': 0.0002968114276882261, 'token': 22380, 'token_str': 'i n t e g r a t i n g', 'sequence': "- why don't you ask integrating?"}, {'score': 0.00028943814686499536, 'token': 29020, 'token_str': '# # r o o m s', 'sequence': "- why don't you askrooms?"}]


In [27]:
print(fill_mask_new(f"- What is {fill_mask_new.tokenizer.mask_token}"))


[{'score': 0.0005720441113226116, 'token': 28551, 'token_str': '# # r u e d', 'sequence': '- what isrued'}, {'score': 0.0003827141772489995, 'token': 18796, 'token_str': '# # u c e', 'sequence': '- what isuce'}, {'score': 0.000380926561774686, 'token': 6254, 'token_str': 'd o c u m e n t', 'sequence': '- what is document'}, {'score': 0.0003573968424461782, 'token': 5491, 'token_str': 'd o c u m e n t s', 'sequence': '- what is documents'}, {'score': 0.0003235583717469126, 'token': 11137, 'token_str': '# # u r a l', 'sequence': '- what isural'}]


In [28]:
print(
    fill_mask_new(f"- Let's talk about {fill_mask_new.tokenizer.mask_token} physics"))


[{'score': 0.0004735870461445302, 'token': 24988, 'token_str': '# # r r a l', 'sequence': "- let's talk aboutrral physics"}, {'score': 0.0004635512304957956, 'token': 21579, 'token_str': '# # h e s i v e', 'sequence': "- let's talk abouthesive physics"}, {'score': 0.0004009050317108631, 'token': 27242, 'token_str': '# # e a r i n g', 'sequence': "- let's talk aboutearing physics"}, {'score': 0.0003980121691711247, 'token': 25153, 'token_str': '# # b a n g', 'sequence': "- let's talk aboutbang physics"}, {'score': 0.0003865938924718648, 'token': 18907, 'token_str': '# # p r o o f', 'sequence': "- let's talk aboutproof physics"}]
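
A quick arithmetic check puts these scores in perspective. The vocabulary has 30,522 entries (the size of `word_embeddings` in the model printout), so a perfectly uniform guess would assign about 3.3e-5 to each token. The top scores after fine-tuning are only roughly ten to twenty times that, whereas the pretrained model put more than half of the probability mass on a single token for the first sentence. The scores below are copied from the outputs in this notebook and rounded.

```python
vocab_size = 30522
uniform = 1 / vocab_size
print(f"uniform probability per token: {uniform:.2e}")

# Top-5 scores for "- Why don't you ask [MASK]?" (rounded from the outputs above).
before = [0.5620, 0.0763, 0.0358, 0.0303, 0.0279]
after = [0.000359, 0.000313, 0.000308, 0.000297, 0.000289]

print(f"top-5 mass before fine-tuning: {sum(before):.3f}")
print(f"top-5 mass after fine-tuning:  {sum(after):.5f}")
print(f"best score after / uniform:    {after[0] / uniform:.1f}x")
```

A distribution this close to flat suggests that the prediction head carries very little information about the context.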


## Conclusion

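- After fine-tuning, the masked-word predictions have become much worse: the top scores dropped from as high as 0.89 to below 0.001, and the suggested tokens no longer fit the sentences.
- One reason is that the whole model was trained on classification data through the classifier head, so the encoder weights were changed to serve topic classification rather than masked-word prediction.
- A second reason is visible in the loading warnings: the classification checkpoint does not contain the masked-LM head (`generator_predictions`, `generator_lm_head`), so `ElectraForMaskedLM` reports these weights as newly initialized instead of restoring them.

If we train only the classifier layer (freezing all the others), the encoder stays identical to the pretrained one; combined with the original masked-LM head, it should give the same predictions as the pretrained model.

So the conclusion at this point is that we should tune models carefully and keep in mind which layers should be trained, so as not to ruin the weights of the other layers.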