<a href="https://colab.research.google.com/github/lzumta/ATAI/blob/main/tutorial_pretrained_lms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Using Pretrained Architectures

In this notebook, you will learn how to use pretrained and finetuned language models for various tasks with the Hugging Face `transformers` library. The transformer architecture can be adapted to many NLP tasks, such as classification, named entity recognition, or translation, with only minor modifications, and the `transformers` library supports a wide range of these task-specific architectures. You can also finetune pretrained models for your own tasks, but first let's look at how to use models that other people have already finetuned.

## Pipeline

The first useful Hugging Face feature we will work with is the high-level `pipeline` API. A pipeline can be created for any trained or finetuned model. It abstracts away the model, takes care of all necessary preprocessing steps, and returns cleaned-up predictions for your inputs. Pipelines are especially useful for quickly testing models on your own input data, or for using a model as is in your application if it is already finetuned for your task. We can use any model that has been published on the [Hugging Face Hub](https://huggingface.co/models).

<img src="images/pipeline.png" alt="Alt text that describes the graphic" title="Title text" width=800>

## Setup

Before we start, you'll need to install a few libraries: `torch`, `transformers`, and `sentencepiece`, which some models use for preprocessing. We also install `torch-scatter` and `pandas`, which are needed for the table question answering example later on.

In [1]:
!pip install torch-scatter -f https://data.pyg.org/whl/torch-1.9.0+cu102.html # change to cu111 if running in colab
!pip install torch
!pip install transformers
!pip install sentencepiece
!pip install pandas

Looking in links: https://data.pyg.org/whl/torch-1.9.0+cu102.html
Collecting torch-scatter
  Downloading https://data.pyg.org/whl/torch-1.9.0%2Bcu102/torch_scatter-2.0.9-cp37-cp37m-linux_x86_64.whl (8.0 MB)
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.0.9
Collecting transformers
  Downloading transformers-4.12.0-py3-none-any.whl (3.1 MB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)

In [2]:
from transformers import pipeline, set_seed
set_seed(111)

## Text Classification

One of the most common types of tasks in NLP is **text classification**. Text classification means that we train a model to predict a label for an entire input (e.g. a sentence or document). A typical example of this type of task is sentiment analysis, where the model predicts whether a sentence is positive or negative.

For text classification, the model receives the whole input sequence and makes a single prediction, as shown in the following example:

<img src="images/clf_arch.png" alt="Alt text that describes the graphic" title="Title text" width=600>

We can achieve this with Hugging Face by setting up a `pipeline` object which wraps a transformer model that was trained on our desired task of sentiment analysis:

In [3]:
sentiment_pipeline = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')


TAPAS models are not usable since `torch_scatter` can't be loaded. It seems you have `torch_scatter` installed with the wrong CUDA version. Please try to reinstall it following the instructions here: https://github.com/rusty1s/pytorch_scatter.


Here we download the `distilbert-base-uncased-finetuned-sst-2-english` model. This is DistilBERT, a smaller and more efficient BERT model, finetuned on [SST-2](https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary), a sentiment analysis dataset.

The first time you execute this code snippet, you will notice that the model is downloaded from the Hugging Face Hub. The model is then cached, so you do not need to download it again on later runs.

Now we are ready to run an example through our pipeline and look at the model's prediction:

In [4]:
sentiment_pipeline('Anna likes studying at UZH.')

[{'label': 'POSITIVE', 'score': 0.9953606724739075}]

The model predicts with high confidence that this sentence is positive, which matches our own reading of the sentence. You can see that the pipeline returns a list of dicts with the predictions. We can also pass several sentences at once (as a list), in which case the list contains one dict per sentence.
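To give an idea of the postprocessing that the pipeline hides, here is a minimal sketch of how raw classifier scores (logits) could be turned into the list of `label`/`score` dicts shown above. The logits and the `ID2LABEL` mapping are made up for illustration; the real pipeline reads the mapping from the model configuration and may differ in detail.

```python
import math

# Assumed label order for illustration; a real model stores this in its config.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(batch_logits):
    """Return one {'label', 'score'} dict per input sentence."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": ID2LABEL[best], "score": probs[best]})
    return results

# Made-up logits for two sentences, only to show the shape of the output.
example_logits = [[-2.5, 2.9], [1.7, -1.2]]
for result in postprocess(example_logits):
    print(result)
```

Each inner list of logits stands for one input sentence, so passing two sentences yields two dicts.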

## Token Classification

Another type of classification task is token classification. Instead of just finding the overall sentiment, here we are interested in a prediction for each token in the sentence. For example, we can try to identify named entities such as organizations, locations, or persons in the text. This task is called named entity recognition (NER). 

The model gets the same input as before but now makes a prediction for each token:

<img src="images/ner_arch.png" alt="Alt text that describes the graphic" title="Title text" width=600>

Again, this is very easy to do with Hugging Face because there are already finetuned models available for this task. We just load a pipeline for the NER task without specifying a model. This will load a default BERT model that has been trained on the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset.

In [5]:
ner_pipeline = pipeline('ner')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)



When we pass our text through the model, we again get a list of dicts, where each dict corresponds to one detected named entity. Since multiple tokens can belong to a single entity, we can apply an aggregation strategy that merges consecutive tokens that are assigned the same class. Here this is needed because "UZH" is split into two subwords, but it also applies to multi-word entities like "University of Zurich".

In [6]:
entities = ner_pipeline('Anna likes studying at UZH.', aggregation_strategy="simple")
print(entities)

[{'entity_group': 'PER', 'score': 0.9908325, 'word': 'Anna', 'start': 0, 'end': 4}, {'entity_group': 'ORG', 'score': 0.97251207, 'word': 'UZH', 'start': 23, 'end': 26}]


Let's clean up the outputs a bit:

In [7]:
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")

Anna: PER (0.99)
UZH: ORG (0.97)


The model correctly predicted both that Anna is a person and that UZH is an organization!
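The aggregation step can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: it merges neighbouring tokens with the same class, glues `##` subword pieces back together and averages the scores. The token-level predictions (including the split of "UZH" into `U` and `##ZH` and all scores) are hypothetical.

```python
def aggregate_simple(token_predictions):
    """Merge consecutive tokens that share an entity class into one entity."""
    entities = []
    current = None  # entity that is currently being extended
    for tok in token_predictions:
        label = tok["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        if label == "O":  # token outside any entity ends the current group
            current = None
            continue
        if current is not None and current["entity_group"] == label:
            piece = tok["word"]
            # Subword pieces are appended directly, whole words after a space.
            current["word"] += piece[2:] if piece.startswith("##") else " " + piece
            current["scores"].append(tok["score"])
            current["end"] = tok["end"]
        else:
            current = {"entity_group": label, "word": tok["word"],
                       "scores": [tok["score"]],
                       "start": tok["start"], "end": tok["end"]}
            entities.append(current)
    for entity in entities:
        scores = entity.pop("scores")
        entity["score"] = sum(scores) / len(scores)  # average over merged tokens
    return entities

# Hypothetical token-level predictions for 'Anna likes studying at UZH.'
tokens = [
    {"word": "Anna", "entity": "I-PER", "score": 0.99, "start": 0, "end": 4},
    {"word": "likes", "entity": "O", "score": 0.99, "start": 5, "end": 10},
    {"word": "studying", "entity": "O", "score": 0.99, "start": 11, "end": 19},
    {"word": "at", "entity": "O", "score": 0.99, "start": 20, "end": 22},
    {"word": "U", "entity": "I-ORG", "score": 0.98, "start": 23, "end": 24},
    {"word": "##ZH", "entity": "I-ORG", "score": 0.96, "start": 24, "end": 26},
    {"word": ".", "entity": "O", "score": 0.99, "start": 26, "end": 27},
]
for entity in aggregate_simple(tokens):
    print(entity["word"], entity["entity_group"], entity["start"], entity["end"])
```

The same loop would merge "University", "of" and "Zurich" into one entity if all three tokens were tagged as `ORG`.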

## Text Generation

Next, we leave behind the natural language understanding (NLU) tasks that models like BERT are particularly good at and turn to natural language generation (NLG). Remember that generation is more expensive, since we have to generate the output one token after the other:

<img src="images/gen_steps.png" alt="Alt text that describes the graphic" title="Title text" width=300>

Having a model generate text based on an input does not require finetuning, since decoder-based pretrained language models like GPT are already trained towards this objective in the pretraining phase. Hugging Face again allows us to simply load a pipeline for the text generation task. This will load the default GPT-2 model.

In [8]:
generation_pipeline = pipeline("text-generation")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)



Now, we can see what this model thinks would be a likely continuation of our sentence.

In [9]:
generated_text = generation_pipeline(text_inputs='Anna likes studying at UZH.')
generated_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Anna likes studying at UZH. She has also been studying music for her doctorate, and when there is a great song at UZH, she sings it. The first time I met her, she said that she and my kids could'}]

The model does generate some text related to studying so it is not far off. But the output may still sound a bit strange. You can also play around with other inputs that the model may have seen more often during pretraining like "Once upon a time", for example.
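To make the token-by-token generation described above more concrete, here is a toy sketch of a greedy decoding loop. The "model" is just a hand-written table of next-token probabilities (`NEXT_TOKEN_PROBS`, invented for this example) that only looks at the last token, whereas a real language model such as GPT-2 conditions on the whole prefix and can also sample instead of always picking the most likely token.

```python
# Hypothetical next-token probabilities, standing in for a language model.
NEXT_TOKEN_PROBS = {
    "Once": {"upon": 0.9, "more": 0.1},
    "upon": {"a": 0.95, "the": 0.05},
    "a": {"time": 0.8, "hill": 0.2},
    "time": {"<eos>": 0.6, "there": 0.4},
}

def generate(prompt_tokens, max_new_tokens=10):
    """Greedy decoding: repeatedly append the most likely next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if candidates is None:  # the toy model knows no continuation
            break
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":  # end-of-sequence token stops generation
            break
        tokens.append(next_token)
    return tokens

print(" ".join(generate(["Once"])))
```

Every new token requires another pass through the loop (and, for a real model, another forward pass), which is why generation is more expensive than classification.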

## Sequence-to-sequence Tasks

You also learned about sequence-to-sequence (seq2seq) tasks. These are tasks where we get a sequence as input and expect a sequence as output, which does not necessarily have the same length as the input. A typical seq2seq task is translation, where we receive an input in one language and generate a translation in another language.

This can also be done very easily with Hugging Face, as there are many translation models readily available, e.g. for English to German:

In [None]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")


Let's translate our sentence to German:

In [None]:
outputs = translator('Anna likes studying at UZH.', clean_up_tokenization_spaces=True)
outputs[0]['translation_text']

'Anna studiert gerne an der UZH.'

That looks like an accurate translation!

## More pipelines

There are many more pipelines that you can experiment with. Look at the following list for an overview:

In [None]:
from transformers import pipelines

for task in pipelines.SUPPORTED_TASKS:
    print(task)

audio-classification
automatic-speech-recognition
feature-extraction
text-classification
token-classification
question-answering
table-question-answering
fill-mask
summarization
translation
text2text-generation
text-generation
zero-shot-classification
conversational
image-classification
object-detection


And don't forget to check out all the pretrained and finetuned models that are already available on the [Hugging Face Hub](https://huggingface.co/models)!

### Table Question Answering

One pipeline that may be particularly interesting for you regarding your course projects is the `table-question-answering` pipeline.

For this, we already installed some libraries at the beginning. `torch-scatter` is required by the TAPAS implementation in the `transformers` library, and we need the `pandas` library to read and manipulate the tabular data and to pass the table as a `DataFrame` object to the pipeline.

With [TAPAS](https://huggingface.co/google/tapas-base-finetuned-wtq) you can do tabular question answering:

In [None]:
table_qa = pipeline("table-question-answering", model="google/tapas-base-finetuned-wtq")


In order for the model to be able to answer questions about a table, we need to give the model the table as part of the input. Here, we first convert the table to a `pandas DataFrame` and then pass that to the model:

In [None]:
import re
from io import StringIO
import pandas as pd

table = """
Repository   | Stars | Contributors | Programming language 
Transformers | 36542 | 651          | Python      
Datasets     | 4512  | 77           | Python    
Tokenizers   | 3934  | 34           | Rust, Python and NodeJS    
"""
table = re.sub(r'\s+\|\s+', '\t', table).strip() # convert to TSV format
table = StringIO(table) # pandas takes filepath or buffer as input
table = pd.read_csv(table, sep='\t')
table = table.astype(str) # all column types need to be of type string

Now, let's query the table with some questions!

In [None]:
output = table_qa(table=table, query="How many stars does the transformers repository have?")
output

{'answer': 'AVERAGE > 36542',
 'coordinates': [(0, 1)],
 'cells': ['36542'],
 'aggregator': 'AVERAGE'}

In [None]:
output = table_qa(table=table, query="How many people work on the libraries in total?")
output

{'answer': 'SUM > 651, 77, 34',
 'coordinates': [(0, 2), (1, 2), (2, 2)],
 'cells': ['651', '77', '34'],
 'aggregator': 'SUM'}

You can see that this also works with very flexible questions that do not specifically use the column names. In the example above, "contributors" is expressed as "people who work on" and "repository" is paraphrased as "library". But the model still extracts the correct answer!

The model can also correctly predict that we need to sum the values in order to get the total number of contributors.
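Note that the outputs above contain the selected cells and the predicted aggregator, but not the aggregated number itself. A minimal sketch of how one could compute it from such an output dict follows; `apply_aggregator` is a hypothetical helper and not part of the `transformers` library.

```python
def apply_aggregator(output):
    """Compute a final answer from the cells and aggregator of a TAPAS-style output."""
    aggregator = output["aggregator"]
    cells = output["cells"]
    if aggregator == "NONE":
        # No aggregation: return the selected cells as text.
        return ", ".join(cell.strip() for cell in cells)
    if aggregator == "COUNT":
        return len(cells)
    values = [float(cell) for cell in cells]  # SUM and AVERAGE need numbers
    if aggregator == "SUM":
        return sum(values)
    if aggregator == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"Unknown aggregator: {aggregator}")

# Outputs copied from the two queries above.
stars_output = {"answer": "AVERAGE > 36542", "cells": ["36542"],
                "aggregator": "AVERAGE"}
contributors_output = {"answer": "SUM > 651, 77, 34",
                       "cells": ["651", "77", "34"], "aggregator": "SUM"}

print(apply_aggregator(stars_output))         # 36542.0
print(apply_aggregator(contributors_output))  # 762.0
```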

Here are some more examples:

In [None]:
output = table_qa(table=table, query="Which is the most common programming language?")
output

{'answer': 'Python      ',
 'coordinates': [(0, 3)],
 'cells': ['Python      '],
 'aggregator': 'NONE'}

In [None]:
output = table_qa(table=table, query="Which libraries are supported?")
output

{'answer': 'Transformers, Datasets, Tokenizers',
 'coordinates': [(0, 0), (1, 0), (2, 0)],
 'cells': ['Transformers', 'Datasets', 'Tokenizers'],
 'aggregator': 'NONE'}

## Finetuning Your Own Model

Now, we'll take a look at an example of how you can finetune a BERT model for text classification. Similarly, you could finetune a GPT model for a generation task or a BART model for a sequence-to-sequence task. In this toy example, we look at the task of identifying whether a text input is a question or a statement. This may be a useful classifier for your course projects if you expect users to also enter non-questions, for which your models do not need to provide an answer.

Note that to run the finetuning in a reasonable amount of time, you should have access to a GPU. If you have a Google account, you may use [Google Colab](https://colab.research.google.com/) for this. The maximum amount of time you can use a GPU there is 12 hours, which is enough for many finetuning tasks. Simply upload this notebook and run the code in Google Colab.

### Data Preparation

We use a [Kaggle Dataset](https://www.kaggle.com/shahrukhkhan/questions-vs-statementsclassificationdataset) as our finetuning data. The dataset is already divided into a training, a development and a test set. First, we read all of the data from the respective CSV files. Again, we use `pandas` for this:

In [11]:
import pandas as pd

train = pd.read_csv('train.csv', index_col=0)
dev = pd.read_csv('val.csv', index_col=0)
test = pd.read_csv('test.csv', index_col=0)

Let's look at what this dataset actually contains:

In [12]:
from IPython.display import display, HTML

sample = train.sample(n=5, random_state=111)
display(HTML(sample.to_html()))

| | doc | target |
|---|---|---|
| 17637 | Indeed, the Qing government did far more to encourage mobility than to discourage it | 0 |
| 86809 | What is zinc chemically identical to | 1 |
| 120097 | Anxious to expand the company's broadcast and cable presence, longtime MCA head Lew Wasserman sought a rich partner. Who was the head of MCA in 1990 | 1 |
| 22760 | 93 in (100 mm) on July 29, 1958 | 0 |
| 39006 | Who was the founder of the Gelug school? | 1 |


You can see that the dataset is just a collection of segments that are either statements or contain questions. The text can be found in the "doc" column, whereas the label is in the "target" column. 1 stands for "question" and 0 for "statement".

Now, let's see how many examples we have per data split:

In [13]:
print(f'Train: {len(train)}')
print(f'Dev: {len(dev)}')
print(f'Test: {len(test)}')

Train: 126909
Dev: 42303
Test: 42303


For the purposes of this tutorial, we want to reduce the number of examples so that the finetuning runs faster:

In [14]:
train = train.sample(n=3000, random_state=111)
dev = dev.sample(n=300, random_state=111)
test = test.sample(n=10000, random_state=111)

Let's see if the labels are balanced, so we know what metric to use in the evaluation:

In [15]:
train['target'].value_counts()

1    1825
0    1175
Name: target, dtype: int64

During training, Hugging Face `transformers` expects the labels to be integer IDs from 0 to N-1, where N is the number of classes. Our dataset already satisfies this with the labels 0 and 1, since we only have two classes. But to make the output of our model a bit more readable, we create mappings between the label IDs and class names:

In [16]:
label_names = ["statement", "question"]
id2label = {idx:label for idx, label in enumerate(label_names)}
label2id = {v:k for k,v in id2label.items()}
id2label

{0: 'statement', 1: 'question'}

### Preprocessing

Like other machine learning models, transformers expect their inputs in the form of numbers (not strings) and so some form of preprocessing is required. For NLP, this preprocessing step is called tokenization. Tokenization converts strings into atomic chunks called tokens, and these tokens are subsequently encoded as numerical vectors. In our previous experiments, this preprocessing was abstracted into the pipeline. But now we have to do it ourselves.

Each pretrained model comes with its own tokenizer, so to get started let's download a tokenizer from the Hub. Here we use DistilBERT, a smaller and more efficient variant of BERT:

In [17]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


The tokenizer has a few interesting attributes such as the vocabulary size:

In [18]:
tokenizer.vocab_size

30522

This tells us that the tokenizer has a vocabulary of 30,522 tokens that it can use to represent text. Some of these are special tokens, which for example mark the boundary between sentences (`[SEP]`) or the masked position used in masked language modeling (`[MASK]`). Here's what the special tokens look like for BERT:

In [19]:
tokenizer.special_tokens_map

{'cls_token': '[CLS]',
 'mask_token': '[MASK]',
 'pad_token': '[PAD]',
 'sep_token': '[SEP]',
 'unk_token': '[UNK]'}

When you feed strings to the tokenizer, you'll get at least two fields (some models have more, depending on how they're trained):

* `input_ids`: These correspond to the numerical encodings that map each token to an integer
* `attention_mask`: This indicates to the model which tokens it should attend to (1) and which ones, such as padding tokens, it should ignore (0) when computing self-attention
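The role of the attention mask becomes clearer once several inputs of different lengths are batched together. The following sketch pads a batch by hand, assuming that the padding token has ID 0; `pad_batch` is a hypothetical helper, and in practice the tokenizer or a data collator does this for you.

```python
PAD_TOKEN_ID = 0  # assumption: ID of the [PAD] token in the vocabulary

def pad_batch(batch_input_ids, pad_token_id=PAD_TOKEN_ID):
    """Pad all sequences to the longest one and build the matching attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two illustrative sequences of token IDs with different lengths.
batch = pad_batch([[101, 2054, 2003, 102], [101, 102]])
print(batch["input_ids"])       # [[101, 2054, 2003, 102], [101, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```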

Let's see how this works with a simple example. First we encode the string:

In [20]:
encoded_str = tokenizer("Anna likes studying at UZH.")
encoded_str

{'input_ids': [101, 4698, 7777, 5702, 2012, 1057, 27922, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

and then decode the input IDs to see the mapping explicitly:

In [21]:
for token in encoded_str["input_ids"]:
    print(token, tokenizer.decode([token]))

101 [CLS]
4698 anna
7777 likes
5702 studying
2012 at
1057 u
27922 ##zh
1012 .
102 [SEP]
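The split of "UZH" into `u` and `##zh` comes from subword tokenization. As a rough sketch of the idea, the following toy tokenizer applies greedy longest-match-first splitting against a tiny, made-up vocabulary (`TOY_VOCAB`). The real tokenizer uses the full 30,522-entry vocabulary, handles punctuation and casing more carefully, and adds the special tokens.

```python
# Tiny made-up vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"anna", "likes", "studying", "at", "u", "##zh", ".", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:  # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no piece matches: the whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text):
    """Lowercase, split on whitespace and full stops, then apply wordpiece."""
    words = text.lower().replace(".", " . ").split()
    return [piece for word in words for piece in wordpiece(word)]

print(tokenize("Anna likes studying at UZH."))
```

With this toy vocabulary, the example sentence is split into the same pieces as in the output above, minus the `[CLS]` and `[SEP]` tokens.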


The next thing to do is to tokenize all the segments in our data splits:

In [22]:
def tokenize_column(examples):
    return tokenizer(examples, truncation=True, max_length=180)

In [23]:
train['doc'] = train['doc'].apply(tokenize_column)
dev['doc'] = dev['doc'].apply(tokenize_column)
test['doc'] = test['doc'].apply(tokenize_column)

And again, we can look at what the actual tokens look like:

In [24]:
for token in train['doc'].iloc[1]["input_ids"]:
    print(token, tokenizer.decode([token]))

101 [CLS]
2054 what
2003 is
15813 zinc
5072 chemical
2135 ##ly
7235 identical
2000 to
102 [SEP]


Now, we need to wrap our dataset in a `Dataset` class. The resulting objects need to support the `__getitem__` and `__len__` methods so that they can be used by the `Trainer` class. If you use a dataset from the [Hugging Face Hub](https://huggingface.co/datasets) and import it via the `datasets` library this is already done for you.

In [25]:
import torch

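class QSDataset(torch.utils.data.Dataset):
    """Wraps a DataFrame with a tokenized 'doc' column and an integer 'target' column."""

    def __init__(self, table):
        self.table = table

    def __getitem__(self, idx):
        # Copy the tokenizer output so that adding the label
        # does not modify the encoding stored in the DataFrame.
        item = dict(self.table['doc'].iloc[idx])
        item['labels'] = torch.tensor(self.table['target'].iloc[idx])
        return item

    def __len__(self):
        return len(self.table)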

train_dataset = QSDataset(train)
dev_dataset = QSDataset(dev)
test_dataset = QSDataset(test)

### Finetuning

Now we can start the actual finetuning process. First, let's download the parameters of our pretrained DistilBERT model and initialize the model we want to finetune. We use the `AutoModelForSequenceClassification` class because it automatically puts a randomly initialized classification head (dense layers) on top of the pretrained model, which we can use for our question-vs-statement prediction. We also need to specify how many classes we have, in our case just two: questions and statements.

In [26]:
from transformers import AutoModelForSequenceClassification

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels, label2id=label2id, id2label=id2label)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi
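The warning mentions newly initialized `pre_classifier` and classifier weights: these form the classification head on top of the pretrained encoder. As a rough sketch of what such a head computes, here is a tiny pure-Python version with toy dimensions. All sizes and values are made up, dropout is left out, and the real layers operate on 768-dimensional vectors.

```python
import random

def init_linear(n_out, n_in, rng, std=0.02):
    """Randomly initialized dense layer: small Gaussian weights, zero bias."""
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias

def linear(weights, bias, x):
    """Dense layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def classification_head(cls_vector, pre_classifier, classifier):
    """Map the vector of the first token to one logit per class."""
    hidden = linear(*pre_classifier, cls_vector)
    hidden = [max(0.0, h) for h in hidden]  # ReLU activation
    return linear(*classifier, hidden)

rng = random.Random(111)
dim, num_labels = 8, 2  # toy sizes for illustration
pre_classifier = init_linear(dim, dim, rng)
classifier = init_linear(num_labels, dim, rng)
cls_vector = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # stand-in for the encoder output

logits = classification_head(cls_vector, pre_classifier, classifier)
print(logits)  # two untrained logits, one per class
```

Because these weights start out random, the logits are meaningless until the head has been finetuned, which is why the warning recommends training the model on a downstream task.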

We set the training arguments such as number of epochs to train, learning rate, where to save the model etc. and we store them in a `TrainingArguments` object that we can pass to the `Trainer` class.

In [27]:
from transformers import TrainingArguments

model_name = model_checkpoint.split("/")[-1]
batch_size = 16
num_train_epochs = 2
logging_steps = len(train_dataset) // (batch_size * num_train_epochs)

args = TrainingArguments(
    output_dir=f"{model_name}-question-vs-statement",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.01,
    logging_steps=logging_steps,
    push_to_hub=False,
)
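It can help to work out by hand what these settings imply. The sketch below redoes the arithmetic for our 3,000 training examples, assuming a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept.

```python
import math

num_examples = 3000
batch_size = 16
num_train_epochs = 2

# The last batch of an epoch may be smaller, so we round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Same formula as in the cell above.
logging_steps = num_examples // (batch_size * num_train_epochs)

print(steps_per_epoch)  # 188
print(total_steps)      # 376
print(logging_steps)    # 93
```

These numbers match the training log later in this notebook, which reports 376 total optimization steps and saves checkpoints at steps 188 and 376, i.e. at the end of each epoch.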

Next, we need to define the metric for the evaluation. Since the classes are not perfectly balanced, let's use the F1-score as our evaluation metric. To integrate this in our finetuning process, we need to wrap the computation of the metric into a simple function. But first we need to install two more packages:

In [28]:
!pip install datasets
!pip install scikit-learn

Collecting datasets
  Downloading datasets-1.14.0-py3-none-any.whl (290 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2021.10.1-py3-none-any.whl (125 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting aiohttp
  Downloading aiohttp-3.7.4.post0-cp37-cp37m-manylinux2014_x86_64.whl (1.3 MB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-5.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (160 kB)
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.7.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (271 kB)

In [29]:
import numpy as np
from datasets import load_metric

metric = load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
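To see what the F1-score measures, here is a small pure-Python sketch of the binary F1 computation, treating label 1 ("question") as the positive class. It is meant as an illustration of the formula, not as a replacement for the library metric. As a sanity check, it also computes the score of a trivial baseline that labels every training example as a question, using the class counts from above (1825 questions, 1175 statements).

```python
def f1_score(predictions, references, positive_label=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Class counts of our training sample.
references = [1] * 1825 + [0] * 1175
always_question = [1] * len(references)

baseline_f1 = f1_score(always_question, references)
print(f"F1 of the always-question baseline: {baseline_f1:.3f}")
```

A finetuned model should score clearly above this baseline.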


Great! Now, we have the preprocessed data, the training arguments, the metric computation function and the initialized model. All that's left to do is create a `Trainer` and finetune the model on our data:

In [30]:
from transformers import Trainer 

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [31]:
trainer.train()

***** Running training *****
  Num examples = 3000
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 376


| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | 0.0134 | 0.028966 | 0.994595 |
| 2 | 0.0057 | 0.015294 | 0.994595 |


***** Running Evaluation *****
  Num examples = 300
  Batch size = 16
Saving model checkpoint to distilbert-base-uncased-question-vs-statement/checkpoint-188
Configuration saved in distilbert-base-uncased-question-vs-statement/checkpoint-188/config.json
Model weights saved in distilbert-base-uncased-question-vs-statement/checkpoint-188/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-question-vs-statement/checkpoint-188/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-question-vs-statement/checkpoint-188/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 300
  Batch size = 16
Saving model checkpoint to distilbert-base-uncased-question-vs-statement/checkpoint-376
Configuration saved in distilbert-base-uncased-question-vs-statement/checkpoint-376/config.json
Model weights saved in distilbert-base-uncased-question-vs-statement/checkpoint-376/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-quest

TrainOutput(global_step=376, training_loss=0.06640127122753914, metrics={'train_runtime': 121.0782, 'train_samples_per_second': 49.555, 'train_steps_per_second': 3.105, 'total_flos': 154320379828032.0, 'train_loss': 0.06640127122753914, 'epoch': 2.0})

Now, the finetuning is finished and we can evaluate how well our model learned to predict whether segments are statements or questions. For this, we use our test set:

In [32]:
predictions = trainer.evaluate(test_dataset)
predictions

***** Running Evaluation *****
  Num examples = 10000
  Batch size = 16


{'epoch': 2.0,
 'eval_f1': 0.9960294951786727,
 'eval_loss': 0.024219846352934837,
 'eval_runtime': 64.4492,
 'eval_samples_per_second': 155.161,
 'eval_steps_per_second': 9.698}

It looks like this task was very simple, and the model had no trouble learning to differentiate between questions and statements.

We can now also load our finetuned model checkpoint into a pipeline and again abstract away the preprocessing and postprocessing, to test the model with new inputs:

In [33]:
finetuned_checkpoint = "./distilbert-base-uncased-question-vs-statement/checkpoint-376"
classifier = pipeline("text-classification", model=finetuned_checkpoint)

loading configuration file ./distilbert-base-uncased-question-vs-statement/checkpoint-376/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "statement",
    "1": "question"
  },
  "initializer_range": 0.02,
  "label2id": {
    "question": 1,
    "statement": 0
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.12.0",
  "vocab_size": 30522
}


In [None]:
output = classifier(["Is this a question?", "How about this", "this is a statement."])
output



[{'label': 'question', 'score': 0.9987800717353821},
 {'label': 'question', 'score': 0.9990077614784241},
 {'label': 'statement', 'score': 0.9981380701065063}]

## Cache

Whenever we load a new model from the Hugging Face Hub, it is cached on the machine you are running on. If you run these examples on Colab, this is not an issue since the storage is wiped after your session anyway. However, if you run this notebook on your laptop, you may have just filled several GB of your hard drive. By default, the cache is saved in the folder `~/.cache/huggingface/transformers`. Make sure to clear it from time to time if your hard drive starts to fill up.
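To check how much space the cache currently takes up, you can sum the file sizes in that folder. Here is a minimal sketch using only the standard library; `directory_size_bytes` is a hypothetical helper, and the cache location may differ between library versions, so adjust the path if the folder does not exist on your machine.

```python
from pathlib import Path

def directory_size_bytes(directory):
    """Sum the sizes of all files below a directory (0 if it does not exist)."""
    directory = Path(directory).expanduser()
    if not directory.exists():
        return 0
    return sum(f.stat().st_size for f in directory.rglob("*") if f.is_file())

cache_dir = "~/.cache/huggingface/transformers"  # default location named above
size_gb = directory_size_bytes(cache_dir) / 1e9
print(f"{cache_dir}: {size_gb:.2f} GB")
```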