# Fine-tuning a masked language model

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also log into Hugging Face.

In [2]:
from huggingface_hub import notebook_login

notebook_login()

For many NLP applications involving Transformer models, you can simply take a pretrained model from the Hugging Face Hub and fine-tune it directly on your data for the task at hand. Provided that the corpus used for pretraining is not too different from the corpus used for fine-tuning, transfer learning will usually produce good results.

However, there are a few cases where you'll want to first fine-tune the language models on your data, before training a task-specific head. For example, if your dataset contains legal contracts or scientific articles, a vanilla Transformer model like BERT will typically treat the domain-specific words in your corpus as rare tokens, and the resulting performance may be less than satisfactory. By fine-tuning the language model on in-domain data you can boost the performance of many downstream tasks, which means you usually only have to do this step once!

This process of fine-tuning a pretrained language model on in-domain data is usually called _domain adaptation_. It was popularized in 2018 by [ULMFiT](https://arxiv.org/abs/1801.06146), which was one of the first neural architectures (based on LSTMs) to make transfer learning really work for NLP. An example of domain adaptation with ULMFiT is shown in the image below; in this section we'll do something similar, but with a Transformer instead of an LSTM!

![ULMFiT](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/ulmfit.svg)

By the end of this section you'll have a [masked language model](https://huggingface.co/huggingface-course/distilbert-base-uncased-finetuned-imdb?text=This+is+a+great+%5BMASK%5D.) on the Hub that can autocomplete sentences.

Let's dive in!

> **TIP:** If the terms "masked language modeling" and "pretrained model" sound unfamiliar to you, go check out Chapter 1, where we explain all these core concepts, complete with videos!

## Picking a pretrained model for masked language modeling

To get started, let's pick a suitable pretrained model for masked language modeling. As shown in the following screenshot, you can find a list of candidates by applying the "Fill-Mask" filter on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads):

![Hub models](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/mlm-models.png)

Although the BERT and RoBERTa family of models are the most downloaded, we'll use a model called [DistilBERT](https://huggingface.co/distilbert-base-uncased) that can be trained much faster with little to no loss in downstream performance. This model was trained using a special technique called [_knowledge distillation_](https://en.wikipedia.org/wiki/Knowledge_distillation), where a large "teacher model" like BERT is used to guide the training of a "student model" that has far fewer parameters. An explanation of the details of knowledge distillation would take us too far afield in this section, but if you're interested you can read all about it in [_Natural Language Processing with Transformers_](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/) (colloquially known as the Transformers textbook).

Let's go ahead and download DistilBERT using the `AutoModelForMaskedLM` class:

In [4]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

We can see how many parameters this model has by calling the `num_parameters()` method:

In [5]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'


With around 67 million parameters, DistilBERT is approximately two times smaller than the BERT base model, which roughly translates into a two-fold speedup in training -- nice! Let's now see what kinds of tokens this model predicts are the most likely completions of a small sample of text:

In [6]:
text = "This is a great [MASK]."

As humans, we can imagine many possibilities for the `[MASK]` token, such as "day", "ride", or "painting". For pretrained models, the predictions depend on the corpus the model was trained on, since it learns to pick up the statistical patterns present in the data. Like BERT, DistilBERT was pretrained on the [English Wikipedia](https://huggingface.co/datasets/wikipedia) and [BookCorpus](https://huggingface.co/datasets/bookcorpus) datasets, so we expect the predictions for `[MASK]` to reflect these domains. To predict the mask we need DistilBERT's tokenizer to produce the inputs for the model, so let's download that from the Hub as well:

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

With a tokenizer and a model, we can now pass our text example to the model, extract the logits, and print out the top 5 candidates:

In [8]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'


We can see from the outputs that the model's predictions refer to everyday terms, which is perhaps not surprising given the foundation of English Wikipedia. Let's see how we can change this domain to something a bit more niche -- highly polarized movie reviews!

## The dataset

To demonstrate domain adaptation, we'll use the famous [Large Movie Review Dataset](https://huggingface.co/datasets/imdb) (or IMDb for short), which is a corpus of movie reviews that is often used to benchmark sentiment analysis models. By fine-tuning DistilBERT on this corpus, we expect the language model to adapt its vocabulary from the factual data of Wikipedia to the more subjective elements of movie reviews. We can grab the data from the Hugging Face Hub using the `load_dataset()` function from 🤗 Datasets:

In [10]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We can see that the `train` and `test` splits contain 25,000 reviews each, while there is an unlabeled set called `unsupervised` that contains 50,000 reviews. Let's have a look at a few samples to get a sense of the kind of text we're dealing with. As we've done in other sections of the course, we'll chain the `Dataset.shuffle()` and `Dataset.select()` functions to create a random sample:

In [11]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

Yes, these definitely look like movie reviews, and if you're old enough you might even understand the language of the last review 😜! Although we won't be needing the labels for language modeling, we can already see that `0` corresponds to a negative review, while `1` corresponds to a positive one.

> **NOTE:** As the [dataset card](https://huggingface.co/datasets/imdb) notes, the supervised split of the IMDb dataset has a balanced distribution of positive and negative reviews, so we don't need to worry about skewed labels.

Now that we've had a quick look at the raw data, let's prepare it for masked language modeling. As we've seen in other sections, the texts need to be tokenized into numerical IDs before a model can make sense of them, and language modeling adds a few extra preprocessing steps on top of that.

## Preprocessing the data

For both autoregressive and masked language models, one common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size. This is quite different from our usual approach, where we simply tokenize individual samples. Why concatenate everything together? The reason is that individual samples might get truncated if they're too long, and that would result in losing information that might be useful for the language modeling task! So, to get started, we will tokenize our corpus as usual, but _without_ setting the `truncation=True` option in our tokenizer. We'll also grab the word IDs if they're available (which they will be if we are using a fast tokenizer, as described in [Chapter 6](/course/chapter6/3)), as we'll need them later on to do whole word masking. We'll wrap this in a simple function, and while we're at it we'll remove the `text` and `label` columns since we don't need them anymore:

In [13]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

Since DistilBERT is a BERT-like model, we can see that the encoded texts include the `input_ids` and `attention_mask` that we've encountered in other chapters, as well as the `word_ids` that we've added.

Now that we've tokenized our movie reviews, the next step is to group them all together and split the result into chunks of a certain size. But how big should these chunks be? This will ultimately be determined by the amount of GPU memory you have available, but a good starting point is to see what the maximum model input size is. This can be inferred by looking at the `model_max_length` attribute of our tokenizer:

In [14]:
tokenizer.model_max_length

512

This value is derived from the tokenizer's associated configuration file and tells us the maximum input size for the model. In this case, we can see that the maximum context size is 512 tokens.

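> **NOTE:** Although DistilBERT accepts up to 512 tokens, we'll use much shorter chunks of 128 tokens below so that training fits in memory and runs a little faster. A small chunk size can hurt in real-world scenarios, so pick a size that matches the use case you have in mind.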

To group the texts together, we'll simply concatenate all the tokenized reviews into one long sequence. We don't need to insert a separator ourselves: the tokenizer has already wrapped every review in `[CLS]` and `[SEP]`, so those special tokens mark where one review ends and the next begins. This concatenate-then-chunk strategy is also described in [a great tutorial from the 🤗 Transformers documentation](https://huggingface.co/transformers/master/custom_datasets.html#language-modeling) about language modeling. Once we've concatenated the texts, we'll split them into chunks of size `chunk_size`, which we set to 128 tokens, well below the maximum context size:

In [15]:
chunk_size = 128

The next thing we need to figure out is how to generate the masks for masked language modeling. We have two strategies to choose from:

* _Per-token masking_: Randomly mask 15% of the tokens in each sequence by replacing them with `[MASK]` and then predict these masked tokens. This is the standard approach used to pretrain BERT and related models.
* _Whole word masking_: Instead of randomly masking individual tokens, we first pick tokens that start a word (i.e., tokens that do not begin with the special `##` continuation prefix). Once we've identified these "start of word" tokens, we mask each selected one together with any following `##` tokens that are part of the same word.

In this example, we'll use per-token masking, but feel free to experiment with whole word masking on your own. A nice thing about per-token masking is that the `DataCollatorForLanguageModeling` class from 🤗 Transformers can perform it on the fly during training. This is more efficient than masking the whole dataset up front, since we'd otherwise have to rerun preprocessing every time we change our hyperparameters or save several copies of the dataset with different masks. So let's create our data collator now:

In [16]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15
)

Here, we rely on the default `mlm=True` to enable masked language modeling and set `mlm_probability=0.15` to specify that we want to mask roughly 15% of the tokens in each batch.

> **NOTE:** Passing the `tokenizer` to `DataCollatorForLanguageModeling` is needed so that the data collator knows what the model's special tokens like `[CLS]`, `[SEP]`, and `[MASK]` are. If we set `mlm=False`, then the data collator will simply prepare the inputs for _causal_ language modeling by copying the input IDs into the labels; the shift by one position is then handled inside the model.
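
To make the collator's behavior more concrete, here is a minimal pure-Python sketch of per-token masking. It is not the library's implementation: the function name `mask_tokens`, its parameters, and the toy token IDs are assumptions made for illustration. It follows the BERT-style recipe in which a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise, and it gives every unselected position the label `-100` so that the loss ignores it.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(input_ids, special_ids, mask_token_id, vocab_size,
                mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one sequence of token IDs (illustrative helper)."""
    rng = random.Random(seed)
    masked_inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        # Never select special tokens such as [CLS] and [SEP]
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked_inputs[i] = mask_token_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked_inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return masked_inputs, labels


# Toy IDs: 101 = [CLS], 102 = [SEP], 103 = [MASK], as in BERT's uncased vocabulary
ids = [101, 2023, 2003, 1037, 2307, 3185, 102]
masked, labels = mask_tokens(ids, special_ids={101, 102}, mask_token_id=103,
                             vocab_size=30522, mlm_probability=0.5)
print(masked)
print(labels)
```

The masking probability is raised to 0.5 here only so that a seven-token example is likely to show at least one masked position; with the usual 0.15 most short sequences would come back unchanged.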

The thing that makes whole word masking a bit more involved than per-token masking is that we need a way to map tokens to the words they're part of. For this, we can use the `word_ids()` method that we grabbed from our tokenizer (this method is only available for fast tokenizers, so you'll need to use one if you want to use whole word masking). As we saw earlier, the `word_ids()` method returns a list where each element corresponds to the word ID of the corresponding token in the encoded sentence, so all we need to do is iterate through these lists and create a mask whenever we encounter a new word ID.
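
The idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `whole_word_mask`, the `wwm_probability` parameter, and the toy `input_ids` and `word_ids` are hypothetical, and a real collator would also need to pad and batch the results. The sketch starts a new word whenever the word ID changes and then masks all the tokens of each selected word together.

```python
import random


def whole_word_mask(input_ids, word_ids, mask_token_id, wwm_probability=0.2, seed=0):
    """Mask every token of each randomly selected word (illustrative helper)."""
    rng = random.Random(seed)
    # Collect the token positions that make up each word
    words = []
    previous = None
    for position, word_id in enumerate(word_ids):
        if word_id is None:  # special tokens have no word ID
            previous = None
            continue
        if word_id != previous:  # a new word ID starts a new word
            words.append([])
            previous = word_id
        words[-1].append(position)

    masked_inputs = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions ignored by the loss
    for positions in words:
        if rng.random() < wwm_probability:
            for position in positions:
                labels[position] = input_ids[position]
                masked_inputs[position] = mask_token_id
    return masked_inputs, labels


# Hypothetical encoding in which the third word is split into two tokens (IDs 30 and 31)
input_ids = [101, 10, 20, 30, 31, 40, 102]
word_ids = [None, 0, 1, 2, 2, 3, None]
masked, labels = whole_word_mask(input_ids, word_ids, mask_token_id=103, wwm_probability=0.5)
print(masked)
print(labels)
```

Whatever the random draw, the two tokens that share word ID 2 are either both masked or both left alone, which is the property that distinguishes this strategy from per-token masking.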

To test it out, let's grab a small batch from our training set and inspect what the collator does:

In [17]:
samples = [tokenized_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] i rented i am curious - yellow from my [MASK] store because of all the controversy that surrounded it when it was first released in 1967. i also heard that at first it [MASK] seized by u. s. customs if [MASK] ever tried to enter this country, [MASK] being a fan of [MASK] considered " controversial " i really had to see this for [MASK]. < br / > < br / > the plot is centered [MASK] a young swedish drama student named lena who wants to learn everything [MASK] can about life. in particular [MASK] [MASK] to focus her attentions to making some sort of documentary on what the [MASK] [MASK]ede [MASK] about certain [MASK] issues such [MASK] [MASK] vietnam war [MASK] race issues [MASK] the [MASK] states. in between [MASK] politicians and ordinary den [MASK]ns [MASK] stockholm about their opinions on politics, she has [unused944] with her drama teacher, classmates, and married men. < br / > < br / [MASK] what kills me about i am curious - yellow is that 40 years [MASK], this was cons

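Nice, we can see that `[MASK]` tokens have been randomly inserted at various locations in our text! Because this collator masks individual tokens, a word can end up only partly masked, as in `den [MASK]ns` (from "denizens"), and a few tokens are swapped for random ones such as `[unused944]` instead of `[MASK]`. Now that we have our data collator, the last thing we need to do is chunk the texts together. We'll write a small function for this and apply it to the whole corpus with `Dataset.map()`: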

In [18]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

Note that in the last line of `group_texts()` we create a new `labels` column that is a copy of the `input_ids` one. That's because in masked language modeling the objective is to predict the randomly masked tokens in the input, so the original input IDs serve as the ground truth. There is no shifting involved: the data collator sets the labels of all unmasked positions to `-100` so that the loss ignores them.
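
A toy run makes the chunking behavior easy to see. The sketch below uses a standalone variant of the function (named `group_texts_demo` here, with `chunk_size` passed as an argument) and three made-up "reviews"; both are assumptions for illustration, not part of the notebook's pipeline.

```python
def group_texts_demo(examples, chunk_size):
    # Concatenate every column across all examples
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start out as a copy of the inputs
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Three hypothetical tokenized reviews of lengths 3, 4 and 3 (10 tokens in total)
toy = {"input_ids": [[101, 7, 102], [101, 8, 9, 102], [101, 5, 102]]}
chunks = group_texts_demo(toy, chunk_size=4)
print(chunks["input_ids"])  # [[101, 7, 102, 101], [8, 9, 102, 101]]
print(chunks["labels"] == chunks["input_ids"])  # True
```

The ten tokens yield two full chunks of four, and the trailing two tokens are discarded. The first chunk also spans the boundary between the first and second review, which is why `[SEP] [CLS]` pairs show up inside chunks.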

Now let's apply `group_texts()` to our tokenized datasets:

In [19]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

You can see that grouping and then chunking the texts has produced many more examples than our original 25,000 in the training and test sets. That's because we now have examples involving _contiguous tokens_ that span across multiple examples from the original corpus. You can see this explicitly by looking for the special `[SEP]` and `[CLS]` tokens in one of the chunks:

In [20]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it ' s not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

```python
".... at the time, trying to catch a flight from lax to jfk that got canceled due to a [SEP] [CLS] the way i write. [SEP] [CLS] every time i watch this show, i think of spoofs of soap operas, particularly days of our lives spoofs. first of all, there are many attractive female characters. the episodes often have dramatic plot twists ending in cliff hangers or faux cliff hangers (... [SEP]"
```

In this example we can see three reviews concatenated together, with the special tokens denoting where one finishes and the next one starts. With our datasets in this shape, we're now ready for the fun part: training! But before we do that, we need to take care of a few loose ends. First, we don't actually want the `word_ids` column in the datasets we use for training, so let's get rid of it:

In [21]:
lm_datasets = lm_datasets.remove_columns(["word_ids"])

Second, fine-tuning on all of the roughly 61,000 training chunks can take a while on a single GPU. To speed things up, we can downsample the training set to 10,000 examples. We shuffle the chunks with a fixed seed before selecting them, so that we get a random but reproducible sample rather than just the chunks from the first few reviews:

In [22]:
from datasets import DatasetDict

# Downsample training set to 10,000 examples
downsampled_dataset = DatasetDict(
    {
        "train": lm_datasets["train"].shuffle(seed=42).select(range(10_000)),
        "test": lm_datasets["test"],
    }
)

This will still allow us to fine-tune the model and see decent results, and at the same time will speed up our training run considerably. Now that we've dealt with all the data processing, let's take a look at the model training!

## Fine-tuning DistilBERT with the Trainer API

Fine-tuning a masked language model is almost identical to fine-tuning a sequence classification model, like we did in [Chapter 3](/course/chapter3). The only difference is that we need a special data collator that applies the random masking as it builds each batch. We've already defined our data collator above, so we just need to define a few more things to kick off the training. First, we need a suitable metric for our task. In language modeling, it is common to use _perplexity_, which is just the exponential of the average cross-entropy loss:

$$\text{Perplexity} = \exp \left( \frac{1}{N} \sum_{i=1}^{N} \text{CrossEntropy}(y_i, \hat{y}_i) \right)$$
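
As a quick sanity check of the formula, here is a minimal sketch in plain Python. The `perplexity` helper and the example probabilities are made up for illustration; the only fact used is that the cross-entropy of a single prediction is the negative log of the probability assigned to the correct token.

```python
import math


def perplexity(token_probabilities):
    """Perplexity from the probabilities the model assigned to the correct tokens."""
    # Cross-entropy of one prediction is -log(p) of the correct token
    losses = [-math.log(p) for p in token_probabilities]
    return math.exp(sum(losses) / len(losses))


# Always giving the correct token probability 0.25 is as uncertain as
# choosing uniformly among 4 tokens
print(perplexity([0.25, 0.25, 0.25]))
# More confident predictions lower the perplexity
print(perplexity([0.9, 0.8, 0.95]))
```

Read this way, a perplexity of around 11 means the model is, on average, about as uncertain as if it were choosing uniformly among 11 candidate tokens for each masked position.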

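This might look a bit intimidating, but it is simply the exponential of the average loss. The 🤗 Evaluate library does provide a `perplexity` metric, but note that it is designed to score raw text strings with a causal language model loaded from a `model_id`, not token IDs coming from a masked language model. The `compute_metrics()` function below should therefore be treated as an experiment; further down, we compute the perplexity directly from the evaluation loss instead: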

In [23]:
import evaluate
import numpy as np

metric = evaluate.load("perplexity")


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # preds have the same shape as the labels, after the argmax(-1) has been calculated
    # by preprocess_logits_for_metrics
    labels = labels.reshape(-1)
    preds = preds.reshape(-1)
    mask = labels != -100
    labels = labels[mask]
    preds = preds[mask]
    return metric.compute(predictions=preds, references=labels)

Then we instantiate a `Trainer` like we did in previous sections, making sure to pass our `data_collator` so that the masking is applied during training:

In [26]:
from transformers import Trainer, TrainingArguments

batch_size = 32
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=8, # Reduced batch size
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
    report_to="none",  # Don't rely on wandb
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Notice that in `TrainingArguments`, we've set `push_to_hub=True` so that the fine-tuned model is automatically uploaded to the Hugging Face Hub each time a checkpoint is saved. You can specify the repository you want to push to with the `hub_model_id` argument (in particular, you will have to use this argument to push to an organization). For example, when we pushed the model to the [`huggingface-course` organization](https://huggingface.co/huggingface-course), we added `hub_model_id="huggingface-course/distilbert-base-uncased-finetuned-imdb"` to `TrainingArguments` (this is the model we linked to at the beginning of this section). By default, the repository used will be in your namespace and named after the output directory you set, so in our case it will be `"<your-username>/distilbert-base-uncased-finetuned-imdb"`.

> **TIP:** If the output directory you are using already exists, it needs to be a local clone of the repository you want to push to. If it isn't, you'll get an error when calling `trainer.train()` and will need to set a new name.

Finally, we just call `train()` to fine-tune the model:

In [27]:
# In this run, the call below failed with the CUDA out-of-memory error shown in the output
trainer.train()

Epoch,Training Loss,Validation Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 3.26 GiB. GPU 0 has a total capacity of 14.74 GiB of which 3.11 GiB is free. Process 17427 has 11.63 GiB memory in use. Of the allocated memory 10.94 GiB is allocated by PyTorch, and 565.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Note that while the training takes place, each time a checkpoint is saved (every 500 steps by default, unless you change `save_strategy`) it is uploaded to the Hub in the background. This way, you can resume your training on another machine if necessary. Once training is complete, we can compute the perplexity on the test set to make sure our model has adapted to the new domain:

In [28]:
import torch
import math
from torch.utils.data import DataLoader
from tqdm.auto import tqdm

model.eval()
eval_dataloader = DataLoader(
    downsampled_dataset["test"],
    batch_size=1,
    collate_fn=data_collator
)

total_loss = 0
num_batches = 0

with torch.no_grad():
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        batch = {k: v.to('cuda') for k, v in batch.items()}
        outputs = model(**batch)
        total_loss += outputs.loss.item()
        num_batches += 1

        # Clear batch from memory
        del batch, outputs
        torch.cuda.empty_cache()

avg_loss = total_loss / num_batches
perplexity = math.exp(avg_loss)
print(f">>> Perplexity: {perplexity:.2f}")

Evaluating:   0%|          | 0/59904 [00:00<?, ?it/s]

>>> Perplexity: 11.11


Nice -- that's a lot lower than the perplexity of around 21 that we started out with, which indicates the model has learned something useful about the movie review domain! Once fine-tuning is complete, we can push the final model to the Hub (this is actually redundant since we've already done this with `push_to_hub=True` in `TrainingArguments`, but it doesn't hurt to be explicit):

In [None]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/Axion004/distilbert-base-uncased-finetuned-imdb/commit/a4f9cb3bab089cb670ebc1b1d5953c4ffae93ed6', commit_message='End of training', commit_description='', oid='a4f9cb3bab089cb670ebc1b1d5953c4ffae93ed6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Axion004/distilbert-base-uncased-finetuned-imdb', endpoint='https://huggingface.co', repo_type='model', repo_id='Axion004/distilbert-base-uncased-finetuned-imdb'), pr_revision=None, pr_num=None)

> **TIP:** If you have access to a machine with a GPU, try running the training there and compare the time it takes relative to Google Colab.

Nice work -- we've fine-tuned our first masked language model! In the next section we'll explore a slightly more advanced topic: training language models with 🤗 Accelerate from scratch. But before we do that, let's take a quick look at what it takes to fine-tune a masked language model with a custom PyTorch training loop driven by 🤗 Accelerate.

## Fine-tuning DistilBERT with 🤗 Accelerate

Now let's take a look at how we can achieve the same fine-tuning results with 🤗 Accelerate, which allows us to customize every aspect of the training loop. As usual, we first need to apply `data_collator` to our training set so that the masking is applied during batch construction:

In [41]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
# The test set has not been masked yet, so it also needs the masking collator.
# The masks are re-sampled on every pass, so the perplexity varies slightly between runs.
eval_dataloader = DataLoader(
    downsampled_dataset["test"], batch_size=batch_size, collate_fn=data_collator
)

From here, we follow the standard steps with 🤗 Accelerate. The first order of business is to load a fresh version of the pretrained model:

In [42]:
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Then we need to specify the optimizer; we'll use the standard `AdamW`:

In [43]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

With these objects, we can now prepare everything for training with the `Accelerator` object:

In [44]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that our model, optimizer, and dataloaders are configured, we can specify the learning rate scheduler as follows:

In [45]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
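
To see what these numbers work out to, the sketch below reproduces the step count and the shape of a linear schedule in plain Python. The helper `linear_schedule_lr` is a hypothetical stand-in written for illustration, not the function from 🤗 Transformers, and it assumes a single process so that the dataloader length is simply the number of batches.

```python
import math


def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates under linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


num_examples, batch_size, num_train_epochs = 10_000, 64, 3
steps_per_epoch = math.ceil(num_examples / batch_size)  # the last partial batch is kept
num_training_steps = num_train_epochs * steps_per_epoch
print(steps_per_epoch, num_training_steps)  # 157 471

# Learning rate at the start, middle, and end of training
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_schedule_lr(step, 5e-5, num_training_steps))
```

With no warmup, the learning rate starts at its full value of `5e-5` and decays linearly to zero over the 471 updates.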

There is just one last thing to do before training: create a model repository on the Hugging Face Hub! We can use the 🤗 Hub library to first generate the full name of our repo:

In [46]:
from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'Axion004/distilbert-base-uncased-finetuned-imdb-accelerate'

Then we create the repository with `create_repo()` and clone it into a local folder using the `Repository` class from 🤗 Hub:

In [None]:
from huggingface_hub import create_repo

create_repo(repo_name, exist_ok=True)

RepoUrl('https://huggingface.co/Axion004/distilbert-base-uncased-finetuned-imdb-accelerate', endpoint='https://huggingface.co', repo_type='model', repo_id='Axion004/distilbert-base-uncased-finetuned-imdb-accelerate')

In [49]:
from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
/content/distilbert-base-uncased-finetuned-imdb-accelerate is already a clone of https://huggingface.co/Axion004/distilbert-base-uncased-finetuned-imdb-accelerate. Make sure you pull the latest changes with `repo.git_pull()`.


With that done, it's just a simple matter of writing out the full training and evaluation loop:

In [57]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(downsampled_dataset["test"])]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

  0%|          | 0/471 [00:00<?, ?it/s]

>>> Epoch 0: Perplexity: 11.24184978000042
>>> Epoch 1: Perplexity: 11.115141385669068
>>> Epoch 2: Perplexity: 11.131663503400288


Cool, we've been able to evaluate perplexity with each epoch and ensure that multiple training runs are reproducible!
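
The evaluation bookkeeping in the loop above (repeat each batch loss, truncate to the dataset size, then exponentiate the mean) can be sketched without PyTorch or 🤗 Accelerate. The function name `perplexity_from_batch_losses` and the toy inputs are hypothetical; the sketch assumes each entry of `batch_losses` is the mean loss of one evaluation batch.

```python
import math


def perplexity_from_batch_losses(batch_losses, batch_size, num_examples):
    """Plain-Python mirror of the perplexity computation in the evaluation loop."""
    # Repeat each batch's mean loss once per slot in the batch,
    # like loss.repeat(batch_size) before gathering.
    per_example = [loss for loss in batch_losses for _ in range(batch_size)]
    # Keep only as many entries as there are real examples,
    # like losses[: len(downsampled_dataset["test"])].
    per_example = per_example[:num_examples]
    mean_loss = sum(per_example) / len(per_example)
    try:
        # Perplexity is the exponential of the mean cross-entropy loss.
        return math.exp(mean_loss)
    except OverflowError:
        return float("inf")


# Toy check: a constant loss of log(11) corresponds to a perplexity of about 11.
print(perplexity_from_batch_losses([math.log(11.0)] * 3, batch_size=4, num_examples=10))
```

Repeating the loss gives every example the same weight in the mean, and the truncation discards surplus entries, for instance from a smaller final batch or from samples duplicated across processes.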

## Using our fine-tuned model

You can interact with your fine-tuned model either by using its widget on the Hub or locally with the `pipeline` from 🤗 Transformers. Let's use the latter, starting with the course's reference checkpoint, which we download through the `fill-mask` pipeline:

In [53]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="huggingface-course/distilbert-base-uncased-finetuned-imdb"
)

Device set to use cuda:0


We can then feed the pipeline our sample text of "This is a great [MASK]" and see what the top 5 predictions are:

In [54]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> this is a great film.
>>> this is a great movie.
>>> this is a great idea.
>>> this is a great deal.
>>> this is a great adventure.

Now let's run the same prompt through our own `Axion004/distilbert-base-uncased-finetuned-imdb` checkpoint:


In [55]:
mask_filler = pipeline(
    "fill-mask", model="Axion004/distilbert-base-uncased-finetuned-imdb"
)

Device set to use cuda:0


In [56]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> this is a great deal.
>>> this is a great success.
>>> this is a great adventure.
>>> this is a great idea.
>>> this is a great feat.


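Neat! The `huggingface-course` checkpoint ranks "film" and "movie" first, so it has clearly adapted its weights to predict words that are more strongly associated with movies. Our own checkpoint still puts generic completions like "deal" and "success" on top for this prompt, so a single example is not enough to judge how far its adaptation went.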
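
To compare the two sets of predictions above more systematically, the following sketch extracts the filled-in word from each printed sequence and checks which words the two checkpoints share. The sequences are copied from the outputs above; the helper `filled_words` is hypothetical, and in a live session you could read `pred["token_str"]` from the pipeline output instead of slicing strings.

```python
# Top-5 sequences copied from the two pipeline outputs above.
course_preds = [
    "this is a great film.",
    "this is a great movie.",
    "this is a great idea.",
    "this is a great deal.",
    "this is a great adventure.",
]
own_preds = [
    "this is a great deal.",
    "this is a great success.",
    "this is a great adventure.",
    "this is a great idea.",
    "this is a great feat.",
]


def filled_words(sequences, prefix="this is a great ", suffix="."):
    """Strip the fixed prompt text so only the predicted word remains."""
    return [seq[len(prefix):-len(suffix)] for seq in sequences]


course_words = filled_words(course_preds)
own_words = filled_words(own_preds)

# Words both checkpoints propose, and words unique to each one.
shared = [w for w in course_words if w in own_words]
only_course = [w for w in course_words if w not in own_words]
only_own = [w for w in own_words if w not in course_words]

print("shared:", shared)
print("only course checkpoint:", only_course)
print("only our checkpoint:", only_own)
```

Running the same comparison over several movie-related prompts would give a fairer picture than this single sentence.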

This wraps up our first experiment with training a language model. In section 6 you'll learn how to train an auto-regressive model like GPT-2 from scratch; head over there if you'd like to see how you can pretrain your very own Transformer model!

> **TIP:** ✏️ **Try it out!** To quantify the benefits of domain adaptation, fine-tune a classifier on the IMDb labels for both the pretrained and fine-tuned DistilBERT checkpoints. If you need a refresher on text classification, check out Chapter 3.

In [34]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import numpy as np
import evaluate

imdb = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized = imdb.map(tokenize, batched=True)
train_data = tokenized["train"].shuffle(seed=42).select(range(10000))
test_data = tokenized["test"].shuffle(seed=42).select(range(5000))

# Metrics
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return metric.compute(predictions=preds, references=labels)

In [35]:
# Train with the pretrained model
model1 = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
args = TrainingArguments("pretrained-clf", num_train_epochs=3, per_device_train_batch_size=16,
                         eval_strategy="epoch", report_to="none", push_to_hub=False)
trainer1 = Trainer(model1, args, train_dataset=train_data, eval_dataset=test_data, compute_metrics=compute_metrics)
trainer1.train()
results1 = trainer1.evaluate()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3575 | 0.233378 | 0.9172 |
| 2 | 0.1797 | 0.253993 | 0.9196 |
| 3 | 0.1056 | 0.360309 | 0.9202 |


In [36]:
# Train with the domain-adapted model

model2 = AutoModelForSequenceClassification.from_pretrained("Axion004/distilbert-base-uncased-finetuned-imdb", num_labels=2)
args2 = TrainingArguments("finetuned-clf", num_train_epochs=3, per_device_train_batch_size=16,
                          eval_strategy="epoch", report_to="none", push_to_hub=False)
trainer2 = Trainer(model2, args2, train_dataset=train_data, eval_dataset=test_data, compute_metrics=compute_metrics)
trainer2.train()
results2 = trainer2.evaluate()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at Axion004/distilbert-base-uncased-finetuned-imdb and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3577 | 0.243322 | 0.9144 |
| 2 | 0.1749 | 0.265119 | 0.9194 |
| 3 | 0.1058 | 0.372196 | 0.9176 |
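
To put the two training logs above side by side, here is a rough sketch that compares the per-epoch accuracies and relates each gap to the sampling noise expected from a 5,000-review test set. The accuracies are copied from the logs; the helper `accuracy_std_error` is hypothetical. The sketch uses a simple binomial approximation and treats the two evaluations as independent, which is a simplification because both models are scored on the same reviews.

```python
import math

# Per-epoch accuracies copied from the two training logs above.
pretrained_acc = [0.9172, 0.9196, 0.9202]
adapted_acc = [0.9144, 0.9194, 0.9176]
n_test = 5000  # size of the evaluation subset selected earlier


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


rows = []
for epoch, (base, adapted) in enumerate(zip(pretrained_acc, adapted_acc), start=1):
    diff = adapted - base
    # Standard error of the difference under the independence assumption.
    se = math.sqrt(
        accuracy_std_error(base, n_test) ** 2
        + accuracy_std_error(adapted, n_test) ** 2
    )
    rows.append((epoch, diff, se))
    print(f"epoch {epoch}: difference = {diff:+.4f}, approx. std. error = {se:.4f}")
```

Under these assumptions every gap is smaller than one standard error, so this run alone does not separate the two checkpoints. Repeating the training with several seeds, or using a paired test on the per-example predictions, would give a more reliable answer.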
