# Tutorial: Using Pretrained Transformer Architectures

In this notebook, you will learn how to use pretrained and finetuned language models for various tasks with the Hugging Face `transformers` library. The transformer architecture can be adapted to many NLP tasks, such as classification, named entity recognition, or translation, with only minor modifications, and the `transformers` library supports a wide range of these task-specific architectures. You can also finetune pretrained models for your own tasks, but first let's look at how to use models that were already finetuned by other people.

## Setup

Before we start, you'll need to install a few libraries, e.g. `torch`, the `transformers` library, and the `sentencepiece` library, which is used in the preprocessing for some models.

In [41]:
!pip install torch-scatter -f https://data.pyg.org/whl/torch-1.9.0+cu102.html # change to cu111 if running in colab
!pip install torch
!pip install transformers
!pip install sentencepiece
!pip install pandas

Looking in links: https://data.pyg.org/whl/torch-1.9.0+cu102.html


## Basic Usage of BERT

* We first use BERT to simply vectorize a given sentence
* BERT expects input data in a specific format
    * Token Ids
        * tokens need to be indexed according to a vocabulary
    * Special Tokens
        * [SEP] marks the end of a sentence or the separation between a pair of sentences
        * [CLS] is placed at the beginning of a sentence
        * other tokens from the used vocabulary, e.g., [UNK], [MASK], [PAD]
    * Mask Ids
        * indicates which elements in the preprocessed sequence are tokens and which are padding elements
    * Segment IDs
        * distinguish different sentences
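
To make the format above concrete, here is a minimal sketch that builds the three inputs by hand for a pair of sentences. It is not the tokenizer's actual implementation: `TOY_VOCAB` and `encode_pair` are hypothetical names, and every ID in the toy vocabulary is invented for illustration.

```python
# Toy vocabulary: every ID here is invented for illustration, not BERT's real vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "anna": 4, "likes": 5, "studying": 6, "she": 7, "is": 8}


def encode_pair(tokens_a, tokens_b, max_length):
    """Build BERT-style inputs for two already-tokenized sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment IDs: 0 for [CLS], sentence A and its [SEP]; 1 for sentence B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Token IDs: look each token up in the vocabulary, falling back to [UNK].
    input_ids = [TOY_VOCAB.get(t, TOY_VOCAB["[UNK]"]) for t in tokens]
    # Mask IDs: 1 for real tokens, 0 for the padding appended below.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("max_length is too small for this pair")
    input_ids += [TOY_VOCAB["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad
    segment_ids += [0] * n_pad
    return {"input_ids": input_ids,
            "token_type_ids": segment_ids,
            "attention_mask": attention_mask}


# "glad" is not in the toy vocabulary, so it is mapped to [UNK].
encoded = encode_pair(["anna", "likes", "studying"], ["she", "is", "glad"], max_length=10)
print(encoded)
```

In practice the tokenizer does all of this for us, so we never need to build these lists manually.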

In [45]:
# import pre-trained BERT model from transformers
from transformers import BertTokenizer, BertModel

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

loading file vocab.txt from cache at /Users/wangruijie/.cache/huggingface/hub/models--bert-base-uncased/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /Users/wangruijie/.cache/huggingface/hub/models--bert-base-uncased/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/tokenizer_config.json
loading configuration file config.json from cache at /Users/wangruijie/.cache/huggingface/hub/models--bert-base-uncased/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "la

In [46]:
# check the vocabulary size
bert_tokenizer.vocab_size

30522

This tells us that BERT's vocabulary contains 30,522 tokens for representing text, including some special tokens:

In [47]:
bert_tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

In [48]:
# tokenize the input sentence
sentence = 'Anna likes studying at UZH.'
tokenized_sentence = bert_tokenizer(sentence, return_tensors="pt")
tokenized_sentence

{'input_ids': tensor([[  101,  4698,  7777,  5702,  2012,  1057, 27922,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [51]:
# check the input ids and their corresponding tokens in the vocabulary
for token in tokenized_sentence['input_ids'].view(-1).tolist():
    print(token, bert_tokenizer.decode([token]))

101 [CLS]
4698 anna
7777 likes
5702 studying
2012 at
1057 u
27922 ##zh
1012 .
102 [SEP]
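
Note how "UZH" was split into `u` and `##zh`. BERT's WordPiece tokenizer is usually described as a greedy longest-match-first search over the vocabulary. The sketch below illustrates that idea under this assumption. `wordpiece` and `toy_vocab` are hypothetical names, and the toy vocabulary is a tiny stand-in for the real one.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercased word into subwords by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption; the real one has 30,522 entries).
toy_vocab = {"anna", "likes", "studying", "at", "u", "##zh", "."}
print(wordpiece("uzh", toy_vocab))
print(wordpiece("studying", toy_vocab))
```

Frequent words stay whole, while rare words like "uzh" fall back to smaller pieces that are in the vocabulary.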


In [52]:
bert_output = bert_model(**tokenized_sentence)
print(bert_output)
# last hidden state: sequence of hidden-states at the output of the last layer of the model.
print('\n size of last hidden state:', bert_output['last_hidden_state'].size())
# pooler_output: last-layer hidden state of the first token of the sentence, i.e., [CLS], passed through an additional linear layer and tanh activation
print('\n size of the pooler output:', bert_output['pooler_output'].size())

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0147, -0.1107,  0.1144,  ..., -0.3769,  0.5611,  0.1757],
         [-0.2149, -0.3584, -0.3037,  ..., -1.0536,  0.2202, -0.5198],
         [ 0.1458, -0.5347,  0.6045,  ..., -0.5411, -0.2079, -0.4417],
         ...,
         [ 0.4102,  0.2812, -0.3010,  ..., -0.5037,  0.4218,  0.0877],
         [-0.4441, -1.4007, -0.5559,  ...,  0.3225,  0.6334, -0.2757],
         [ 0.8111,  0.0069, -0.3102,  ...,  0.4673, -0.6839, -0.4513]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.8824, -0.5386, -0.9135,  0.8483,  0.8522, -0.1145,  0.8981,  0.5088,
         -0.7277, -1.0000, -0.4315,  0.9170,  0.9839,  0.5112,  0.9258, -0.7806,
         -0.1529, -0.6575,  0.3576, -0.6724,  0.6914,  1.0000,  0.0461,  0.4436,
          0.5427,  0.9909, -0.7854,  0.9331,  0.9407,  0.7060, -0.6987,  0.1810,
         -0.9909, -0.2104, -0.9374, -0.9925,  0.5665, -0.7680, -0.0418, -0.0860,
         -0.8962,  0.4141,  1.00
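
The `pooler_output` is one way to get a single vector per sentence. Another common option, which the cell above does not use, is to average the token vectors of `last_hidden_state` while skipping padded positions (masked mean pooling). The sketch below shows the idea with plain Python lists standing in for tensors, so it runs without a model. `masked_mean_pool` and the toy values are hypothetical.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors of one sequence, skipping padded positions."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:  # mask value 1 marks a real token, 0 marks padding
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    return [t / count for t in total]


# Three positions with 2-dimensional toy vectors; the last position is padding.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
sentence_vector = masked_mean_pool(toy_hidden, toy_mask)
print(sentence_vector)
```

Because the padded position is masked out, its large values do not distort the average.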

## Pipeline

Hugging Face pipelines can be created for any trained/finetuned model. They abstract away the model, take care of all necessary preprocessing steps and return cleaned up predictions for your inputs. They are especially useful to quickly test models on your own input data or to use as they are in your applications (if they are already finetuned towards your task of choice). We can use any model that was already published on the [Hugging Face Hub](https://huggingface.co/models).

<img src="images/pipeline.png" alt="Alt text that describes the graphic" title="Title text" width=800>

## Text Classification

One of the most common types of tasks in NLP is **text classification**. Text classification means that we train a model to predict a label for an entire input (e.g. a sentence or document). A typical example for this type of task is sentiment analysis, i.e., our model should predict whether a sentence is positive or negative.

For text classification, the model gets all the inputs and makes a single prediction as shown in the following example:

<img src="images/clf_arch.png" alt="Alt text that describes the graphic" title="Title text" width=600>

We can achieve this with Hugging Face by setting up a `pipeline` object which wraps a transformer model that was trained on our desired task of sentiment analysis:

In [44]:
from transformers import pipeline, set_seed
set_seed(111)

In [54]:
sentiment_pipeline = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')

loading configuration file config.json from cache at /Users/wangruijie/.cache/huggingface/hub/models--distilbert-base-uncased-finetuned-sst-2-english/snapshots/324d3097568e82724d53d7ac1d312aa719d48037/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.23.1",
  "vocab_size": 30522
}

loading configuration

Here we download the `distilbert-base-uncased-finetuned-sst-2-english` model. This is a smaller and more efficient BERT model finetuned on [SST-2](https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary) which is a sentiment analysis dataset.

The first time you execute this code snippet, you will notice that the model is downloaded from the Hugging Face Hub. The model will then be cached, so anytime after that you do not need to download it anymore.

Now we are ready to run an example through our pipeline and look at the model's prediction:

In [55]:
sentiment_pipeline('Anna likes studying at UZH.')

[{'label': 'POSITIVE', 'score': 0.9953606724739075}]

The model predicts that this sentence is positive with high confidence, which makes sense given our understanding of the sentence. You can see that the pipeline returns a list of dicts with the predictions. We can also pass several sentences at the same time (as a list), in which case we get one dict per sentence in the returned list.
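
To see where the `label` and `score` fields come from, here is a rough sketch of the postprocessing step. It assumes the pipeline applies a softmax to the model's raw scores (logits) and reports the most likely class. `softmax`, `postprocess` and the logit values are hypothetical, and only the label names are taken from the model config printed above.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(batch_logits, id2label):
    """Return one {'label', 'score'} dict per input, like the pipeline output."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
        results.append({"label": id2label[best], "score": probs[best]})
    return results


sst2_id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # as listed in the model config
toy_logits = [[-2.0, 3.0], [1.5, -0.5]]          # invented logits for two sentences
toy_predictions = postprocess(toy_logits, sst2_id2label)
print(toy_predictions)
```

Two inputs give two dicts, which mirrors what we get when passing a list of sentences to the pipeline.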

## Token Classification

Another type of classification task is token classification. Instead of just finding the overall sentiment, here we are interested in a prediction for each token in the sentence. For example, we can try to identify named entities such as organizations, locations, or persons in the text. This task is called named entity recognition (NER). 

The model gets the same input as before but now makes a prediction for each token:

<img src="images/ner_arch.png" alt="Alt text that describes the graphic" title="Title text" width=600>

Again, this is very easy to do with Hugging Face because there are already finetuned models available for this task. We load a pipeline for the NER task with `dbmdz/bert-large-cased-finetuned-conll03-english`, a BERT model that has been finetuned on the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset.

In [56]:
ner_pipeline = pipeline('ner', model='dbmdz/bert-large-cased-finetuned-conll03-english')

entities = ner_pipeline('Anna likes studying at UZH.', aggregation_strategy="simple")
print(entities)

loading configuration file config.json from cache at /Users/wangruijie/.cache/huggingface/hub/models--dbmdz--bert-large-cased-finetuned-conll03-english/snapshots/f2482bf01f5da0f0eb8e183ffd8cc3885aa90b14/config.json
Model config BertConfig {
  "_name_or_path": "dbmdz/bert-large-cased-finetuned-conll03-english",
  "_num_labels": 9,
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "O",
    "1": "B-MISC",
    "2": "I-MISC",
    "3": "B-PER",
    "4": "I-PER",
    "5": "B-ORG",
    "6": "I-ORG",
    "7": "B-LOC",
    "8": "I-LOC"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "B-LOC": 7,
    "B-MISC": 1,
    "B-ORG": 5,
    "B-PER": 3,
    "I-LOC": 8,
    "I-MISC": 2,
    "I-ORG": 6,
    "I-PER": 4,
    "O": 0
  },
  "layer_norm_eps": 1e-12,
  "max_p

[{'entity_group': 'PER', 'score': 0.9908325, 'word': 'Anna', 'start': 0, 'end': 4}, {'entity_group': 'ORG', 'score': 0.972512, 'word': 'UZH', 'start': 23, 'end': 26}]


When we pass our text through the model, we again get a list of dicts: each dict corresponds to one detected named entity. Since multiple tokens can correspond to a single entity, we can apply an aggregation strategy that merges tokens if the same class appears in consecutive tokens. Here this matters because "UZH" is split into two subwords, but it also extends to multi-word entities like "University of Zurich".
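
The merging idea can be sketched in a few lines. This is a simplified illustration and not the library's implementation. `merge_entities` is a hypothetical helper, and the per-token labels, scores and subword split in `toy_token_preds` are invented. Only the sentence and the character offsets of "Anna" and "UZH" match the output above.

```python
def merge_entities(text, token_preds):
    """Merge consecutive tokens that belong to the same entity into one span."""
    groups = []
    for pred in token_preds:
        if pred["label"] == "O":
            continue  # "O" marks tokens outside any named entity
        prefix, entity_type = pred["label"].split("-", 1)
        last = groups[-1] if groups else None
        continues_last = (
            last is not None
            and prefix == "I"
            and last["entity_group"] == entity_type
            and pred["index"] == last["last_index"] + 1
        )
        if continues_last:
            # Extend the current span and remember the score for averaging.
            last["end"] = pred["end"]
            last["last_index"] = pred["index"]
            last["scores"].append(pred["score"])
        else:
            groups.append({"entity_group": entity_type, "start": pred["start"],
                           "end": pred["end"], "last_index": pred["index"],
                           "scores": [pred["score"]]})
    return [{"entity_group": g["entity_group"],
             "score": sum(g["scores"]) / len(g["scores"]),
             "word": text[g["start"]:g["end"]],
             "start": g["start"], "end": g["end"]} for g in groups]


text = "Anna likes studying at UZH."
toy_token_preds = [  # invented per-token predictions
    {"index": 0, "label": "B-PER", "score": 0.99, "start": 0, "end": 4},
    {"index": 1, "label": "O", "score": 0.99, "start": 5, "end": 10},
    {"index": 2, "label": "O", "score": 0.99, "start": 11, "end": 19},
    {"index": 3, "label": "O", "score": 0.99, "start": 20, "end": 22},
    {"index": 4, "label": "B-ORG", "score": 0.98, "start": 23, "end": 24},
    {"index": 5, "label": "I-ORG", "score": 0.96, "start": 24, "end": 26},
    {"index": 6, "label": "O", "score": 0.99, "start": 26, "end": 27},
]
merged = merge_entities(text, toy_token_preds)
print(merged)
```

In this sketch the score of a merged entity is the average of its token scores, which is one plausible choice among several.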

Let's clean up the outputs a bit:

In [57]:
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")

Anna: PER (0.99)
UZH: ORG (0.97)


The model correctly predicted both that Anna is a person and that UZH is an organization!

## Text Generation

Next, we leave behind the natural language understanding (NLU) tasks that models like BERT are particularly good at. We will now focus on natural language generation (NLG). Remember that generation is more expensive since we have to generate the output one token after the other:

<img src="images/gen_steps.png" alt="Alt text that describes the graphic" title="Title text" width=300>

Having a model generate text based on an input does not require finetuning, since decoder-based pretrained language models like GPT are already trained towards this objective in the pretraining phase. Hugging Face again allows us to simply load a pipeline for the text generation task. Here we load the GPT-2 model.

In [58]:
generation_pipeline = pipeline("text-generation", model='gpt2')

loading configuration file config.json from cache at /Users/wangruijie/.cache/huggingface/hub/models--gpt2/snapshots/909a290700bd99135e67c64eefc166960b67cfd2/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "trans

Now, we can see what this model thinks would be a likely continuation of our sentence.

In [67]:
generated_text = generation_pipeline(text_inputs='Anna likes studying at UZH.', max_new_tokens=20)
generated_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Anna likes studying at UZH. She is also very curious about things and thinks about it a lot, especially while studying at UC Berkeley'}]

The model does generate some text related to studying so it is not far off. But the output may still sound a bit strange. You can also play around with other inputs that the model may have seen more often during pretraining like "Once upon a time", for example.
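
To make the token-by-token nature of generation concrete, here is a toy sketch of a greedy decoding loop. It is heavily simplified: `TOY_MODEL` is a hypothetical lookup table that only looks at the last token, whereas a real model like GPT-2 conditions on the whole prefix. Also note that the model config printed above sets `do_sample` to true, so the pipeline samples from the distribution instead of always picking the top token.

```python
# Hypothetical next-token table standing in for a real language model:
# it maps the last token to scores for candidate next tokens.
TOY_MODEL = {
    "Once": {"upon": 0.9, "more": 0.1},
    "upon": {"a": 0.95, "the": 0.05},
    "a": {"time": 0.8, "hill": 0.2},
    "time": {"<eos>": 1.0},
}


def generate_greedy(prompt_tokens, max_new_tokens):
    """Append the highest-scoring next token until <eos> or the token budget is hit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = TOY_MODEL.get(tokens[-1])
        if candidates is None:
            break  # the toy model has no continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # end-of-sequence token stops generation early
        tokens.append(next_token)  # the new token becomes part of the next input
    return tokens


print(" ".join(generate_greedy(["Once"], max_new_tokens=20)))
```

Each loop iteration corresponds to one forward pass of the model, which is why generating long outputs is expensive.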

## Sequence-to-sequence Tasks

You also learned about sequence-to-sequence tasks (seq2seq). These are tasks where we get a sequence as an input and expect a sequence as an output (that does not necessarily have the same length as the input). A typical seq2seq task is translation, where we receive an input in one language and generate a translation in another language.

This can also be done very easily with Hugging Face, as there are many translation models readily available, e.g. for English to German:

In [68]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

loading configuration file config.json from cache at /Users/wangruijie/.cache/huggingface/hub/models--Helsinki-NLP--opus-mt-en-de/snapshots/bb3e01d60e3cf1de7902fe2482d5909cf13abd5b/config.json
Model config MarianConfig {
  "_name_or_path": "Helsinki-NLP/opus-mt-en-de",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "swish",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bad_words_ids": [
    [
      58100
    ]
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 512,
  "decoder_attention_heads": 8,
  "decoder_ffn_dim": 2048,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 58100,
  "decoder_vocab_size": 58101,
  "dropout": 0.1,
  "encoder_attention_heads": 8,
  "encoder_ffn_dim": 2048,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "gradient_che

Let's translate our sentence to German:

In [69]:
outputs = translator('Anna likes studying at UZH.', clean_up_tokenization_spaces=True)
outputs[0]['translation_text']

'Anna studiert gerne an der UZH.'

That looks like an accurate translation!

## More pipelines

There are many more pipelines that you can experiment with. Look at the following list for an overview:

In [70]:
from transformers import pipelines

for task in pipelines.SUPPORTED_TASKS:
    print(task)

audio-classification
automatic-speech-recognition
feature-extraction
text-classification
token-classification
question-answering
table-question-answering
visual-question-answering
document-question-answering
fill-mask
summarization
translation
text2text-generation
text-generation
zero-shot-classification
zero-shot-image-classification
conversational
image-classification
image-segmentation
image-to-text
object-detection
zero-shot-object-detection


And don't forget to check out all the pretrained and finetuned models that are already available on the [Hugging Face Hub](https://huggingface.co/models)!

## Finetuning Your Own Model

Now, we'll take a look at an example of how you can finetune a BERT model for text classification. Similarly, you can also finetune a GPT model for a generation task or a BART model for a sequence-to-sequence task. In this toy example, we look at the task of identifying whether a text input is a question or a statement. This may be a useful classifier for your course project if you expect that the users also enter non-questions. In this case, your models do not need to provide an answer.

Note that to run the finetuning in a reasonable amount of time, it is recommended that you have access to a GPU. If you have a Google account, you may use [Google Colab](https://colab.research.google.com/) for this. The maximum amount of time you can use a GPU there is 12 hours, which is enough for many finetuning tasks. Simply upload this notebook and run the code in Google Colab.

### Data Preparation

We use a [Kaggle Dataset](https://www.kaggle.com/shahrukhkhan/questions-vs-statementsclassificationdataset) as our finetuning data. The dataset is already divided into a training, a development and a test set. First, we read all of the data from the respective CSV files. We use `pandas` for this:

In [20]:
import pandas as pd

train = pd.read_csv('train.csv', index_col=0)
dev = pd.read_csv('val.csv', index_col=0)
test = pd.read_csv('test.csv', index_col=0)

Let's look at what this dataset actually contains:

In [21]:
from IPython.display import display, HTML

sample = train.sample(n=5, random_state=10)
display(HTML(sample.to_html()))

| Unnamed: 0 | doc | target |
|---|---|---|
| 26153 | What did the defence functions of the Ministry of Aviation Supply merge into in 1964 | 1 |
| 42066 | The country relies heavily on rain to provide household water, but in the past 30 years average yearly precipitation has decreased | 0 |
| 71408 | What disciplines is Anthropolgy forced to confront? | 1 |
| 75960 | What blood clotting disease did Victorias oldest son have? | 1 |
| 32251 | The use of animal fur in clothing dates to prehistoric times. Using what for clothing has always been controversial | 1 |


You can see that the dataset is just a collection of segments that are either statements or questions. The text can be found in the "doc" column, whereas the label is in the "target" column. 1 stands for "question" and 0 for "statement".

Now, let's see how many examples we have per data split:

In [22]:
print(f'Train: {len(train)}')
print(f'Dev: {len(dev)}')
print(f'Test: {len(test)}')

Train: 126909
Dev: 42303
Test: 42303


For the purposes of this tutorial, we want to reduce the number of examples so that the finetuning runs faster:

In [23]:
train = train.sample(n=500, random_state=111)
dev = dev.sample(n=50, random_state=111)
test = test.sample(n=50, random_state=111)

Let's see if the labels are balanced, so we know what metric to use in the evaluation:

In [24]:
train['target'].value_counts()

1    298
0    202
Name: target, dtype: int64

During training, Hugging Face `transformers` expects the labels to be consecutive integers from 0 to N-1 for N classes. This is already given in our dataset with labels 0 and 1 since we only have two classes. But to make the output of our model a bit more readable, we create mappings between the label IDs and class names:

In [25]:
label_names = ["statement", "question"]
id2label = {idx:label for idx, label in enumerate(label_names)}
label2id = {v:k for k,v in id2label.items()}
id2label

{0: 'statement', 1: 'question'}

### Preprocessing

Each pretrained model comes with its own tokenizer, so to get started let's download the tokenizer from the Hub. Here we use DistilBERT, a smaller and more efficient variant of BERT:

In [26]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [27]:
encoded_str = tokenizer("Anna likes studying at UZH.")
encoded_str

{'input_ids': [101, 4698, 7777, 5702, 2012, 1057, 27922, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

The next thing to do is to tokenize all the segments in our data splits:

In [28]:
def tokenize_column(examples):
    return tokenizer(examples, truncation=True, max_length=180)

In [29]:
train['doc'] = train['doc'].apply(tokenize_column)
dev['doc'] = dev['doc'].apply(tokenize_column)
test['doc'] = test['doc'].apply(tokenize_column)

And again, we can look at what the actual tokens look like:

In [30]:
for token in train['doc'].iloc[0]["input_ids"]:
    print(token, tokenizer.decode([token]))

101 [CLS]
5262 indeed
1010 ,
1996 the
13282 qing
2231 government
2106 did
2521 far
2062 more
2000 to
8627 encourage
12969 mobility
2084 than
2000 to
28085 discourage
2009 it
102 [SEP]


Now, we need to wrap our dataset in a `Dataset` class. The resulting objects need to support the `__getitem__` and `__len__` methods so that they can be used by the `Trainer` class. If you use a dataset from the [Hugging Face Hub](https://huggingface.co/datasets) and import it via the `datasets` library, this is already done for you.

In [31]:
import torch

class QSDataset(torch.utils.data.Dataset):
    def __init__(self, table):
        self.table = table

    def __getitem__(self, idx):
        item = self.table['doc'].iloc[idx]
        item['labels'] = torch.tensor(self.table['target'].iloc[idx])
        return item

    def __len__(self):
        return len(self.table)

train_dataset = QSDataset(train)
dev_dataset = QSDataset(dev)
test_dataset = QSDataset(test)
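
The items returned by `__getitem__` have different lengths, so they must be padded before they can be stacked into a batch. When a tokenizer is passed to the `Trainer`, it should take care of this by padding each batch to its longest sequence. The sketch below shows roughly what such a padding step does. `pad_batch` is a hypothetical helper and not the library's collator, and the default `pad_token_id=0` is taken from the DistilBERT config printed earlier.

```python
def pad_batch(items, pad_token_id=0):
    """Pad a list of tokenized examples to the length of the longest one."""
    longest = max(len(item["input_ids"]) for item in items)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for item in items:
        n_pad = longest - len(item["input_ids"])
        # Padding positions get the pad token ID and a 0 in the attention mask.
        batch["input_ids"].append(item["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(item["attention_mask"] + [0] * n_pad)
        batch["labels"].append(item["labels"])
    return batch


# Two toy examples of different lengths (token IDs reused from the outputs above).
toy_items = [
    {"input_ids": [101, 4698, 7777, 102], "attention_mask": [1, 1, 1, 1], "labels": 0},
    {"input_ids": [101, 4698, 102], "attention_mask": [1, 1, 1], "labels": 1},
]
toy_batch = pad_batch(toy_items)
print(toy_batch)
```

Padding per batch instead of to a fixed global length keeps batches of short segments small.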

### Finetuning

Now, we are starting with the actual finetuning process. First, let's download the model parameters of our pretrained BERT model and initialize our model to finetune it. We use the `AutoModelForSequenceClassification` class because this automatically puts a randomly initialized dense layer on top of BERT which we can use for our question-vs-statement prediction. We also need to specify how many classes we have - in our case just two, questions and statements.

In [32]:
from transformers import AutoModelForSequenceClassification

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels, label2id=label2id, id2label=id2label)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

We set the training arguments such as number of epochs to train, learning rate, where to save the model etc. and we store them in a `TrainingArguments` object that we can pass to the `Trainer` class.

In [33]:
from transformers import TrainingArguments

model_name = model_checkpoint
batch_size = 16
num_train_epochs = 2
logging_steps = len(train_dataset) // (batch_size * num_train_epochs)

args = TrainingArguments(
    output_dir=f"{model_name}-question-vs-statement",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.01,
    logging_steps=logging_steps,
    push_to_hub=False,
)
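
As a quick sanity check, we can work out by hand how many optimization steps these settings produce. The sketch assumes a single device and no gradient accumulation. Its result matches the "Total optimization steps = 64" line and the `checkpoint-32` and `checkpoint-64` folders in the training log further down.

```python
import math

num_examples = 500     # size of the subsampled training set
batch_size = 16
num_train_epochs = 2

# The last batch of an epoch is smaller, so we round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
# Same formula as in the cell above.
logging_steps = num_examples // (batch_size * num_train_epochs)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimization steps: {total_steps}")
print(f"logging every {logging_steps} steps")
```

With one checkpoint saved per epoch, the checkpoints land at steps 32 and 64.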

Next, we need to define the metric for the evaluation. Since the classes are not perfectly balanced, let's use the F1-score as our evaluation metric. To integrate this in our finetuning process, we need to wrap the computation of the metric into a simple function. But first we need to install two more packages:

In [34]:
!pip install datasets
!pip install scikit-learn



In [35]:
import numpy as np
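# note: newer `datasets` releases deprecate `load_metric`; the separate `evaluate` library offers `evaluate.load("f1")` as the replacement
from datasets import load_metric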

metric = load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
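
To see what `compute_metrics` calculates, here is a dependency-free sketch of the same two steps: taking the argmax over the logits and computing the binary F1-score with class 1 ("question") as the positive class. The positive-class choice is our assumption about the metric's default. `argmax`, `binary_f1` and the toy values are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def binary_f1(predictions, references, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


toy_logits = [[0.2, 1.1], [2.0, -1.0], [0.1, 0.3], [-0.5, 0.4]]  # invented
toy_labels = [1, 0, 0, 1]
toy_preds = [argmax(row) for row in toy_logits]
toy_f1 = binary_f1(toy_preds, toy_labels)
print(toy_preds, round(toy_f1, 3))
```

Here one statement is misclassified as a question, which lowers precision to 2/3 while recall stays at 1.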


Great! Now, we have the preprocessed data, the training arguments, the metric computation function, and the initialized model. All that's left to do is to create a `Trainer` and finetune the model on our data:

In [36]:
from transformers import Trainer 

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [37]:
trainer.train()

***** Running training *****
  Num examples = 500
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 64
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | 0.4197 | 0.305242 | 0.967742 |
| 2 | 0.2036 | 0.165556 | 1.0 |


***** Running Evaluation *****
  Num examples = 50
  Batch size = 16
Saving model checkpoint to distilbert-base-uncased-question-vs-statement/checkpoint-32
Configuration saved in distilbert-base-uncased-question-vs-statement/checkpoint-32/config.json
Model weights saved in distilbert-base-uncased-question-vs-statement/checkpoint-32/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-question-vs-statement/checkpoint-32/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-question-vs-statement/checkpoint-32/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 50
  Batch size = 16
Saving model checkpoint to distilbert-base-uncased-question-vs-statement/checkpoint-64
Configuration saved in distilbert-base-uncased-question-vs-statement/checkpoint-64/config.json
Model weights saved in distilbert-base-uncased-question-vs-statement/checkpoint-64/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-question-vs-sta

TrainOutput(global_step=64, training_loss=0.38029260002076626, metrics={'train_runtime': 112.7424, 'train_samples_per_second': 8.87, 'train_steps_per_second': 0.568, 'total_flos': 26444839358256.0, 'train_loss': 0.38029260002076626, 'epoch': 2.0})

Now, the finetuning is finished and we can evaluate how well our model learned to predict whether segments are statements or questions. For this, we use our test set:

In [38]:
predictions = trainer.evaluate(test_dataset)
predictions

***** Running Evaluation *****
  Num examples = 50
  Batch size = 16


{'eval_loss': 0.17908574640750885,
 'eval_f1': 1.0,
 'eval_runtime': 2.2814,
 'eval_samples_per_second': 21.916,
 'eval_steps_per_second': 1.753,
 'epoch': 2.0}

It looks like this task is simple: the model has learned to differentiate between questions and statements from only 500 training examples. Keep in mind that the test set here contains just 50 examples, so the score is only a rough estimate.

We can now also load our finetuned model checkpoint into a pipeline and again abstract away the preprocessing and postprocessing, to test the model with new inputs:

In [39]:
finetuned_checkpoint = "./distilbert-base-uncased-question-vs-statement/checkpoint-64"
classifier = pipeline("text-classification", model=finetuned_checkpoint)

loading configuration file ./distilbert-base-uncased-question-vs-statement/checkpoint-64/config.json
Model config DistilBertConfig {
  "_name_or_path": "./distilbert-base-uncased-question-vs-statement/checkpoint-64",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "statement",
    "1": "question"
  },
  "initializer_range": 0.02,
  "label2id": {
    "question": 1,
    "statement": 0
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.23.1",
  "vocab_size": 30522
}

loading configuration file ./distilbert-base-uncased-question-vs-statement/checkpoint-64/conf

In [40]:
output = classifier(["What is the University of Zurich?", 
                     "The University of Zurich is a public research university."])
output

Disabling tokenizer parallelism, we're using DataLoader multithreading already


[{'label': 'question', 'score': 0.9602562189102173},
 {'label': 'statement', 'score': 0.7275270819664001}]

## Cache

Whenever we load a new model from the Hugging Face Hub, it is cached on the machine you are running on. If you run these examples on Colab, this is not an issue since the storage is wiped after your session anyway. However, if you run this notebook on your laptop, you might have just filled several GB of your hard drive. With the `transformers` version used here, the cache is saved in the folder `~/.cache/huggingface/hub` by default, as the loading logs above show. Make sure to clear it from time to time if your hard drive starts to fill up.
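
To check how much space the cache takes up, a small script like the following can help. It is a minimal sketch using only the standard library. `directory_size_bytes` is a hypothetical helper, and the path is the one that appears in the loading logs earlier in this notebook, so adjust it if your cache lives elsewhere.

```python
from pathlib import Path


def directory_size_bytes(root):
    """Sum the sizes of all regular files below root, skipping symlinks."""
    root = Path(root).expanduser()
    if not root.exists():
        return 0
    # Symlinks are skipped so that files linked from several places are counted once.
    return sum(p.stat().st_size
               for p in root.rglob("*")
               if not p.is_symlink() and p.is_file())


# Cache location taken from the loading logs earlier in this notebook.
cache_dir = Path("~/.cache/huggingface/hub")
print(f"{directory_size_bytes(cache_dir) / 1e9:.2f} GB")
```

If the number is larger than you would like, you can delete the folders of models you no longer need; they are simply downloaded again the next time you load them.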