# Transformers: Fine-Tuning

In this notebook, you will load a language model that was pre-trained on a general text corpus and fine-tune it on a more specific corpus. This is common practice in NLP: the costly work of training a large language model from scratch is left to someone else, and the user only loads the pre-trained model and fine-tunes it for a few epochs on the text data of interest.

## Task

Follow the notebook and make sure you understand what is going on.

This exercise is based on https://huggingface.co/course/chapter7/3?fw=tf.

## 1. Load a pretrained language model

Load the pretrained checkpoint `distilbert-base-uncased` (https://huggingface.co/distilbert-base-uncased), a "distilled" (smaller and faster) version of the BERT base model. It was trained on the datasets [wikipedia](https://huggingface.co/datasets/wikipedia) and [bookcorpus](https://huggingface.co/datasets/bookcorpus), with the original BERT model acting as its teacher (details omitted).

In [1]:
from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import numpy as np
import tensorflow as tf
from functools import partial 
import TransformerUtility as tut

In [2]:
#browse https://huggingface.co/models to find more models
model_checkpoint = "distilbert-base-uncased"
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForMaskedLM: ['activation_13']
- This IS expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertForMaskedLM were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


If the previous cell prints a warning, this is expected and can be ignored.

In [3]:
f"The language model {model_checkpoint} has {round(model.count_params()/1e6)}M parameters."

'The language model distilbert-base-uncased has 67M parameters.'

## 2. Let the LM autocomplete a masked sentence

To convert a query from a string into a representation that can be consumed by a language model, we have to tokenize it. The tokenizer can also be used to convert a token representation of a text back to strings.

Usually, each model checkpoint at huggingface.co comes with its own tokenizer. We just have to load it like so:

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Running the model works like this:
1. Tokenize a string (which may contain [MASK] to indicate masked words). This generates a dictionary of arrays (token ids and attention mask) that will be fed to the model.
2. Pass the generated inputs to the model and receive logits. The last axis of the output tensor is usually very large, because the vocabulary contains many different tokens. The model generates predictions for all positions, not just the masked ones.
3. Select tokens with high probability based on the logits and use the tokenizer to decode them back to words.

In [5]:
query = ["This is a great [MASK].",
         "This [MASK] is a disappointment.", 
         "I like to [MASK] [MASK] with my friends.",
         "I would like to [MASK] this."]
lens = [len(s.split()) for s in query]

In [6]:
#1.
inputs = tokenizer(query, return_tensors="np", padding=True, truncation=True)
inputs

{'input_ids': array([[  101,  2023,  2003,  1037,  2307,   103,  1012,   102,     0,
            0,     0],
       [  101,  2023,   103,  2003,  1037, 10520,  1012,   102,     0,
            0,     0],
       [  101,  1045,  2066,  2000,   103,   103,  2007,  2026,  2814,
         1012,   102],
       [  101,  1045,  2052,  2066,  2000,   103,  2023,  1012,   102,
            0,     0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}

We can see that the strings have been converted to token ids. The meaning of these ids is encapsulated in the tokenizer. We also get an attention mask, in which a 0 marks a padding position that is not part of the actual input.
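To make the relation between padding and the attention mask concrete, here is a minimal sketch that rebuilds both arrays by hand for two of the queries. The helper `pad_batch` is hypothetical and not part of the tokenizer API; the pad id 0 is simply read off the output above.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq       # real tokens
        attention_mask[row, :len(seq)] = 1    # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Token ids of the first and third query, copied from the output above.
sequences = [
    [101, 2023, 2003, 1037, 2307, 103, 1012, 102],
    [101, 1045, 2066, 2000, 103, 103, 2007, 2026, 2814, 1012, 102],
]
input_ids, attention_mask = pad_batch(sequences)
print(input_ids)
print(attention_mask.sum(axis=1))  # number of real tokens per query
```

The row sums of the mask give the unpadded length of each query, which is how the model knows which positions to ignore.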

In [7]:
#2.
token_logits = model(**inputs).logits
token_logits[0,0], token_logits.shape

(<tf.Tensor: shape=(30522,), dtype=float32, numpy=
 array([-5.5882792, -5.5876465, -5.5964637, ..., -4.9451265, -4.817601 ,
        -2.9901636], dtype=float32)>,
 TensorShape([4, 11, 30522]))

Remember that the model outputs logits, not probabilities. If you want to see probabilities, use `tf.nn.softmax`.
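As an illustration of what the softmax does along the vocabulary axis, the following sketch implements it in plain NumPy on toy logits. It is a stand-in for `tf.nn.softmax`, not the notebook's own code, and the toy values are made up.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# Toy logits with shape (batch, positions, vocabulary) = (1, 2, 4)
toy_logits = np.array([[[2.0, 1.0, 0.1, -1.0],
                        [0.0, 0.0, 0.0, 0.0]]])
probs = softmax(toy_logits)
print(probs.sum(axis=-1))  # every position sums to 1
print(probs[0, 1])         # equal logits give equal probabilities
```

Softmax preserves the ranking of the logits, so picking the top tokens from logits or from probabilities gives the same result.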

In [8]:
#3.
#find the [MASK] positions and print the best_k most likely completions
def decode(logits, best_k = 5):
    mask_tokens = inputs["input_ids"] == tokenizer.mask_token_id
    mask_logits = logits[mask_tokens]
    #get the indices of the top logits
    top_tokens = np.argsort(-mask_logits, -1)[:,:best_k]
    for k in range(best_k):
        autocompleted_query = np.array(inputs.input_ids) #make a copy
        autocompleted_query[mask_tokens] = top_tokens[:,k]
        #remove start- and end-tokens and decode
        for i, l in enumerate(lens):
            print(tokenizer.decode(autocompleted_query[i,1:1+l]))
        print("")
        
decode(token_logits)

this is a great deal
this book is a disappointment
i like to spend fun with my friends
i would like to discuss this

this is a great success
this article is a disappointment
i like to enjoy friends with my friends
i would like to do this

this is a great adventure
this movie is a disappointment
i like to be out with my friends
i would like to hear this

this is a great idea
this episode is a disappointment
i like to play chat with my friends
i would like to repeat this

this is a great feat
this project is a disappointment
i like to share along with my friends
i would like to know this



Note that if a query contains multiple [MASK] tokens, it is naive to pick the most likely prediction for each of them independently. Often, the resulting sentences do not make much sense. A better way would be to replace one [MASK] at a time and rerun the model afterwards to fill the remaining [MASK]s (left as an exercise for home).
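One possible shape of such a one-at-a-time strategy is sketched below. Everything here is an assumption for illustration: `fill_masks_iteratively` and `toy_model` are hypothetical names, the toy model merely stands in for a real language model so that the example runs on its own, and filling the leftmost mask first is only one of several reasonable orders.

```python
import numpy as np

def fill_masks_iteratively(input_ids, mask_id, predict_logits):
    """Greedily fill one mask per model call.

    input_ids: 1-D sequence of token ids for a single query.
    predict_logits: hypothetical callable mapping ids of shape (seq_len,)
        to logits of shape (seq_len, vocab_size).
    """
    ids = np.array(input_ids)  # work on a copy
    while True:
        mask_positions = np.flatnonzero(ids == mask_id)
        if mask_positions.size == 0:
            return ids
        logits = predict_logits(ids)
        # Fill the leftmost mask, then rerun so later masks can see it.
        position = mask_positions[0]
        scores = np.array(logits[position], dtype=float)
        scores[mask_id] = -np.inf  # never predict the mask token itself
        ids[position] = int(np.argmax(scores))

def toy_model(ids):
    # Stand-in for a real model: always predicts "previous token id + 1".
    vocab_size = 10
    logits = np.zeros((len(ids), vocab_size))
    for pos in range(1, len(ids)):
        logits[pos, (ids[pos - 1] + 1) % vocab_size] = 1.0
    return logits

result = fill_masks_iteratively([3, 9, 9, 7], mask_id=9, predict_logits=toy_model)
print(result)  # [3 4 5 7]
```

In the toy example, the second mask is filled with 5 only because the first mask was already replaced by 4 when the model ran again. With the notebook's objects, `predict_logits` could presumably wrap a call to `model` on a single query, but that variant is not tested here.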

## 3. Prepare the fine-tuning dataset

We will use [imdb](https://huggingface.co/datasets/imdb) for fine-tuning. It contains pairs of movie reviews and ratings (neg./pos.). We will ignore the ratings and train on the text only.

You will learn that fine-tuning (as opposed to training from scratch) is fast. The resulting model retains the general language understanding capabilities of the pre-trained model, but the outputs will (hopefully) be more specific, reflecting the domain we are interested in (in this case movie reviews).

In [9]:
#huggingface.co provides an API to access their datasets
#we simply have to install their package called "datasets"
#feel free to roam huggingface.co for other interesting datasets than those used here
!pip install datasets

Defaulting to user installation because normal site-packages is not writeable


In [10]:
from datasets import load_dataset

#browse https://huggingface.co/datasets to find more datasets
imdb_dataset = load_dataset("imdb")
imdb_dataset

Found cached dataset imdb (/home/jovyan/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We can see that the dataset already comes preconfigured for training and testing. 

In [11]:
type(imdb_dataset["train"])

datasets.arrow_dataset.Dataset

Hugging Face uses a custom class for its datasets (which is fine, because the library also provides a pipeline for training). However, if we wish, we can also convert a dataset to a pandas DataFrame. This is a more general format that is compatible with many other APIs:

In [12]:
imdb_dataset["train"].to_pandas()

|  | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |
| ... | ... | ... |
| 24995 | A hit at the time but now better categorised a... | 1 |
| 24996 | I love this movie like no other. Another time ... | 1 |
| 24997 | This film and it's sequel Barry Mckenzie holds... | 1 |
| 24998 | 'The Adventures Of Barry McKenzie' started lif... | 1 |


Let's look at a few examples:

In [13]:
sample = imdb_dataset["train"].shuffle(seed=42).select([0,1,2])

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")

Loading cached shuffled indices for dataset at /home/jovyan/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-9c48ce5d173413c7.arrow



'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

The preprocessing works as follows:
1. Tokenize all reviews. We have already learned how to do this.
2. Concatenate everything and split it into equal-sized chunks, i.e. we glue all reviews together into one long token sequence and cut it into chunks of the same length.
3. Downsample the dataset to speed up training.

To keep this notebook simple, the details of these steps are defined in the TransformerUtility module. Feel free to take a look at the code there. We can use the dataset's `map` function to apply one of our own functions to all examples in the dataset.
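The concatenate-and-chunk idea from step 2 can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `concatenate_and_chunk` is hypothetical, the actual `tut.group_texts` may differ in its details (for example in how it treats the incomplete last chunk), and the toy ids are made up.

```python
def concatenate_and_chunk(tokenized_reviews, chunk_size):
    """Join all token id lists and cut them into equal-sized chunks."""
    all_ids = [token for review in tokenized_reviews for token in review]
    # Assumption: drop the tail that does not fill a whole chunk.
    usable_length = (len(all_ids) // chunk_size) * chunk_size
    return [all_ids[start:start + chunk_size]
            for start in range(0, usable_length, chunk_size)]

reviews = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = concatenate_and_chunk(reviews, chunk_size=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that chunks may span review boundaries. That is acceptable for language modeling, where the model only needs stretches of realistic text, and it guarantees that every training example has the same length.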

In [14]:
# 1.
tokenized_datasets = imdb_dataset.map(
    partial(tut.tokenize_function, tokenizer), 
    batched=True, remove_columns=["text", "label"])
# 2.
lm_datasets = tokenized_datasets.map(tut.group_texts, batched=True)
# 3. 
downsampled_train = 10_000
downsampled_test = int(0.1 * downsampled_train)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=downsampled_train, test_size=downsampled_test, seed=42)
downsampled_dataset

Loading cached processed dataset at /home/jovyan/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-90cc70ec80c747e4.arrow



Token indices sequence length is longer than the specified maximum sequence length for this model (532 > 512). Running this sequence through the model will result in indexing errors
Loading cached processed dataset at /home/jovyan/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-f582002f9c3c1ebf.arrow



Loading cached processed dataset at /home/jovyan/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-0e2554afe38ff41c.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

## 4. Training

In [15]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
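The collator created above randomly selects tokens with probability `mlm_probability=0.15` and turns them into prediction targets. The sketch below shows a simplified version of this idea in NumPy. It is not the library's implementation: `mask_tokens` is a hypothetical helper, the real collator is more elaborate (for instance, it does not always substitute `[MASK]` for a selected token), and the ids 101, 102 and 103 for the start, end and mask tokens are read off the tokenizer output earlier in this notebook. The label value -100 is the convention the library uses for positions that the loss should ignore.

```python
import numpy as np

def mask_tokens(input_ids, mask_id, special_ids, mlm_probability=0.15, seed=0):
    """Simplified random masking for masked language modeling."""
    rng = np.random.default_rng(seed)
    input_ids = np.array(input_ids)  # copy, the caller's array is untouched
    labels = input_ids.copy()
    # Pick positions at random, but never mask special tokens.
    candidates = ~np.isin(input_ids, special_ids)
    selected = (rng.random(input_ids.shape) < mlm_probability) & candidates
    labels[~selected] = -100       # positions ignored by the loss
    input_ids[selected] = mask_id  # the model has to recover these tokens
    return input_ids, labels

# Made-up token ids framed by the start (101) and end (102) tokens.
batch = np.array([[101] + list(range(1000, 1030)) + [102]])
masked, labels = mask_tokens(batch, mask_id=103, special_ids=[0, 101, 102])
print((masked == 103).sum(), "of", batch.size, "tokens masked")
print(labels)
```

Because the masking is random, the model sees different masked positions for the same chunk in every epoch.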

In [16]:
tf_train_dataset = model.prepare_tf_dataset(
    downsampled_dataset["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    downsampled_dataset["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [17]:
from transformers import create_optimizer

num_train_steps = len(tf_train_dataset)
optimizer, schedule = create_optimizer(
    init_lr=1e-4,
    num_warmup_steps=100,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0


In [18]:
import math

eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: '<' not supported between instances of 'Literal' and 'str'
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: '<' not supported between instances of 'Literal' and 'str'
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
Perplexity: 23.74


A common metric for evaluating language models is perplexity. Intuitively, it measures how surprised the model is by the ground truth, so lower perplexity indicates a better model. It is out of scope here to discuss the mathematical details of perplexity.

Note that accuracy is a poor metric here. A model could predict a synonym of the correct token every time and achieve zero accuracy despite actually being a good fit. Perplexity is computed directly from the cross-entropy loss, has an intuitive interpretation, and takes into account how much probability the model assigns to the correct token, not just whether that token is ranked first.
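The following sketch shows how perplexity follows from the cross-entropy loss (the same `exp(loss)` relation used in the evaluation cell) and why it can tell apart two models that both have zero accuracy. The probabilities are made-up toy values and `perplexity` is a hypothetical helper, not a library function.

```python
import numpy as np

def perplexity(probs, targets):
    """exp of the mean cross-entropy of the target tokens."""
    target_probs = probs[np.arange(len(targets)), targets]
    cross_entropy = -np.mean(np.log(target_probs))
    return float(np.exp(cross_entropy))

targets = np.array([0, 1])
# The correct token is never the top prediction, but always a close second.
close = np.array([[0.45, 0.55, 0.00],
                  [0.55, 0.45, 0.00]])
# The correct token is again never on top and gets little probability.
far = np.array([[0.05, 0.90, 0.05],
                [0.90, 0.05, 0.05]])
print(perplexity(close, targets))  # about 2.22
print(perplexity(far, targets))    # about 20
```

Both toy models would score zero accuracy, yet the first one is clearly the better fit, and its lower perplexity reflects that.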

In [19]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f250947e6d0>

In [20]:
eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 11.83


## 5. Repeat the example

If done correctly, we should see that the outputs of the fine-tuned model reflect the imdb corpus.

In [21]:
finetuned_token_logits = model(**inputs).logits
decode(finetuned_token_logits)

this is a great film
this movie is a disappointment
i like to spend friends with my friends
i would like to see this

this is a great movie
this film is a disappointment
i like to be up with my friends
i would like to know this

this is a great idea
this one is a disappointment
i like to go fun with my friends
i would like to enjoy this

this is a great story
this book is a disappointment
i like to have time with my friends
i would like to hear this

this is a great adventure
this episode is a disappointment
i like to share out with my friends
i would like to do this



In [22]:
#output before fine-tuning:
decode(token_logits)

this is a great deal
this book is a disappointment
i like to spend fun with my friends
i would like to discuss this

this is a great success
this article is a disappointment
i like to enjoy friends with my friends
i would like to do this

this is a great adventure
this movie is a disappointment
i like to be out with my friends
i would like to hear this

this is a great idea
this episode is a disappointment
i like to play chat with my friends
i would like to repeat this

this is a great feat
this project is a disappointment
i like to share along with my friends
i would like to know this

