# Fine-tuning a masked language model

For many NLP applications involving Transformer models, you can simply take a pretrained model from the Hugging Face Hub and fine-tune it directly on your data for the task at hand. Provided that the corpus used for pretraining is not too different from the corpus used for fine-tuning, transfer learning will usually produce good results.

However, there are a few cases where you'll want to first fine-tune the language models on your data, before training a task-specific head. For example, if your dataset contains legal contracts or scientific articles, a vanilla Transformer model like BERT will typically treat the domain-specific words in your corpus as rare tokens, and the resulting performance may be less than satisfactory. By fine-tuning the language model on in-domain data you can boost the performance of many downstream tasks, which means you usually only have to do this step once!

This process of fine-tuning a pretrained language model on in-domain data is usually called _domain adaptation_. It was popularized in 2018 by [ULMFiT](https://arxiv.org/abs/1801.06146), which was one of the first neural architectures (based on LSTMs) to make transfer learning really work for NLP. An example of domain adaptation with ULMFiT is shown in the image below; in this section we'll do something similar, but with a Transformer instead of an LSTM!

![ULMFiT](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/ulmfit.svg)

By the end of this section you'll have a [masked language model](https://huggingface.co/huggingface-course/distilbert-base-uncased-finetuned-imdb?text=This+is+a+great+%5BMASK%5D.) on the Hub that can autocomplete sentences.

Let's dive in!

> **TIP:** If the terms "masked language modeling" and "pretrained model" sound unfamiliar to you, go check out Chapter 1, where we explain all these core concepts, complete with videos!

## Picking a pretrained model for masked language modeling

To get started, let's pick a suitable pretrained model for masked language modeling. As shown in the following screenshot, you can find a list of candidates by applying the "Fill-Mask" filter on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=fill-mask&sort=downloads):

![Hub models](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/mlm-models.png)

Although the BERT and RoBERTa family of models are the most downloaded, we'll use a model called [DistilBERT](https://huggingface.co/distilbert-base-uncased) that can be trained much faster with little to no loss in downstream performance. This model was trained using a special technique called [_knowledge distillation_](https://en.wikipedia.org/wiki/Knowledge_distillation), where a large "teacher model" like BERT is used to guide the training of a "student model" that has far fewer parameters. An explanation of the details of knowledge distillation would take us too far afield in this section, but if you're interested you can read all about it in [_Natural Language Processing with Transformers_](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/) (colloquially known as the Transformers textbook).

Let's go ahead and download DistilBERT using the `AutoModelForMaskedLM` class:

In [None]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

We can see how many parameters this model has by calling the `num_parameters()` method:

In [None]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

```python
'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'
```
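
As a rough cross-check of the 67M figure, the count can be reproduced by hand. The sketch below assumes the `distilbert-base-uncased` hyperparameters (a vocabulary of 30,522 tokens, 512 positions, hidden size 768, feed-forward size 3,072, and 6 layers) and a masked language modeling head whose output projection shares its weights with the input embeddings. The helper names are made up for illustration.

```python
# Back-of-the-envelope parameter count for DistilBERT with a masked-LM head.
# All sizes are assumptions taken from the distilbert-base-uncased configuration.
VOCAB, POSITIONS, DIM, FFN_DIM, LAYERS = 30_522, 512, 768, 3_072, 6


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


embeddings = VOCAB * DIM + POSITIONS * DIM + layer_norm(DIM)
attention = 4 * linear(DIM, DIM)  # query, key, value, and output projections
feed_forward = linear(DIM, FFN_DIM) + linear(FFN_DIM, DIM)
block = attention + layer_norm(DIM) + feed_forward + layer_norm(DIM)
# MLM head: dense layer + LayerNorm + output bias (projection weights are tied)
mlm_head = linear(DIM, DIM) + layer_norm(DIM) + VOCAB

total = embeddings + LAYERS * block + mlm_head
print(f"Estimated parameters: {total / 1_000_000:.1f}M")
```

Most of the budget goes to the six Transformer blocks, while the token embeddings alone account for roughly a third of the total.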

With around 67 million parameters, DistilBERT is about 40% smaller than the BERT base model, which translates into noticeably faster training -- nice! Let's now see what kinds of tokens this model predicts are the most likely completions of a small sample of text:

In [None]:
text = "This is a great [MASK]."

As humans, we can imagine many possibilities for the `[MASK]` token, such as "day", "ride", or "painting". For pretrained models, the predictions depend on the corpus the model was trained on, since it learns to pick up the statistical patterns present in the data. Like BERT, DistilBERT was pretrained on the [English Wikipedia](https://huggingface.co/datasets/wikipedia) and [BookCorpus](https://huggingface.co/datasets/bookcorpus) datasets, so we expect the predictions for `[MASK]` to reflect these domains. To predict the mask we need DistilBERT's tokenizer to produce the inputs for the model, so let's download that from the Hub as well:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

With a tokenizer and a model, we can now pass our text example to the model, extract the logits, and print out the top 5 candidates:

In [None]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

```python
'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'
```

We can see from the outputs that the model's predictions refer to everyday terms, which is perhaps not surprising given the foundation of English Wikipedia. Let's see how we can change this domain to something a bit more niche -- highly polarized movie reviews!
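
To make the step from logits to candidates concrete without downloading a model, here's a minimal sketch on a made-up five-word vocabulary. The words and logit values are invented for illustration; only the softmax and top-k logic mirrors what we did above with `torch.topk()`.

```python
import math

# Hypothetical logits for the [MASK] position over a toy vocabulary
vocab = ["deal", "movie", "idea", "banana", "success"]
mask_logits = [4.0, 1.0, 3.0, -2.0, 3.5]


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k):
    """Return the indices of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


probs = softmax(mask_logits)
for idx in top_k(mask_logits, 3):
    print(f"This is a great {vocab[idx]}. (p={probs[idx]:.2f})")
```

Because the softmax preserves the ordering of the logits, ranking by logits (as we did above) gives the same candidates as ranking by probabilities.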

## The dataset

To demonstrate domain adaptation, we'll use the famous [Large Movie Review Dataset](https://huggingface.co/datasets/imdb) (or IMDb for short), which is a corpus of movie reviews that is often used to benchmark sentiment analysis models. By fine-tuning DistilBERT on this corpus, we expect the language model to adapt its vocabulary from the factual data of Wikipedia to the more subjective elements of movie reviews. We can grab the data from the Hugging Face Hub using the `load_dataset()` function from 🤗 Datasets:

In [None]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

```python
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
```

We can see that the `train` and `test` splits contain 25,000 reviews each, while there is an unlabeled set called `unsupervised` that contains 50,000 reviews. Let's have a look at a few samples to get a sense of the kind of text we're dealing with. As we've done in other sections of the course, we'll chain the `Dataset.shuffle()` and `Dataset.select()` functions to create a random sample:

In [None]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")

```python
'>>> Review: This is your typical Priyadarshan movie--a bunch of loony characters out on some silly mission. His signature climax has the entire cast of the film coming together and fighting each other in some crazy moshpit over hidden money. Whether it is a winning lottery ticket in Malamaal Weekly, black money in Hera Pheri, "kodokoo" in Phir Hera Pheri, etc., etc., the director is becoming ridiculously predictable. Don\'t get me wrong; as clichéd and preposterous his movies may be, I usually end up enjoying the comedy. However, in most his previous movies there has actually been some good humor, (Hungama and Hera Pheri being noteworthy ones). Now, the hilarity of his films is fading as he is using the same formula over and over again.<br /><br />Songs are good. Tanushree Datta looks awesome. Rajpal Yadav is irritating, and Tusshar is not a whole lot better. Kunal Khemu is OK, and Sharman Joshi is the best.'
'>>> Label: 0'

'>>> Review: Okay, the story makes no sense, the characters lack any dimensionally, the best dialogue is ad-libs about the low quality of movie, the cinematography is dismal, and only editing saves a bit of the muddle, but Sam" Peckinpah directed the film. Somehow, his direction is not enough. For those who appreciate Peckinpah and his great work, this movie is a disappointment. Even a great cast cannot redeem the time the viewer wastes with this minimal effort.<br /><br />The proper response to the movie is the contempt that the director San Peckinpah, James Caan, Robert Duvall, Burt Young, Bo Hopkins, Arthur Hill, and even Gig Young bring to their work. Watch the great Peckinpah films. Skip this mess.'
'>>> Label: 0'

'>>> Review: I saw this movie at the theaters when I was about 6 or 7 years old. I loved it then, and have recently come to own a VHS version. <br /><br />My 4 and 6 year old children love this movie and have been asking again and again to watch it. <br /><br />I have enjoyed watching it again too. Though I have to admit it is not as good on a little TV.<br /><br />I do not have older children so I do not know what they would think of it. <br /><br />The songs are very cute. My daughter keeps singing them over and over.<br /><br />Hope this helps.'
'>>> Label: 1'
```

Yes, these definitely look like movie reviews, and if you're old enough you might even understand the language of the last review 😜! Although we won't be needing the labels for language modeling, we can already see that `0` corresponds to a negative review, while `1` corresponds to a positive one.

> **NOTE:** As the [dataset card](https://huggingface.co/datasets/imdb) notes, the supervised split of the IMDb dataset has a balanced distribution of positive and negative reviews, so we don't need to worry about skewed labels.

Now that we've had a cursory look at the raw data, let's dive a little deeper into preparing it for masked language modeling. As we've seen in other sections, the texts need to be encoded as numbers before a model can make sense of them. Tokenization takes care of this, but language modeling calls for a few extra preprocessing steps on top of it.

## Preprocessing the data

For both autoregressive and masked language models, one common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size. This is quite different from our usual approach, where we simply tokenize individual samples. Why concatenate everything together? The reason is that individual samples might get truncated if they're too long, and that would result in losing information that might be useful for the language modeling task! So, to get started, we will tokenize our corpus as usual, but _without_ setting the `truncation=True` option in our tokenizer. We'll also grab the word IDs if they're available (which they will be if we are using a fast tokenizer, as described in [Chapter 6](/course/chapter6/3)), as we'll need them later on to do whole word masking. We'll wrap this in a simple function, and while we're at it we'll remove the `text` and `label` columns since we don't need them anymore:

In [None]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

```python
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['attention_mask', 'input_ids', 'word_ids'],
        num_rows: 50000
    })
})
```

Since DistilBERT is a BERT-like model, we can see that the encoded texts include the `input_ids` and `attention_mask` that we've encountered in other chapters, as well as the `word_ids` that we've added.

Now that we've tokenized our movie reviews, the next step is to group them all together and split the result into chunks of a certain size. But how big should these chunks be? This will ultimately be determined by the amount of GPU memory you have available, but a good starting point is to see what the maximum model input size is. This can be inferred by looking at the `model_max_length` attribute of our tokenizer:

In [None]:
tokenizer.model_max_length

```python
512
```

This value is derived from the tokenizer's associated configuration file and tells us the maximum input size for the model. In this case, we can see that the maximum context size is 512 tokens. 

> **NOTE:** Although DistilBERT can handle up to 512 tokens, we'll use a smaller `chunk_size` of 128 so that batches fit comfortably in GPU memory and training runs a little faster.

To group the texts together, we'll use a simple trick: concatenate all the tokenized texts into one long sequence. We don't need to insert any extra separator, because the tokenizer has already wrapped each review in `[CLS]` and `[SEP]` tokens. It's the same strategy used in [a great tutorial from the 🤗 Transformers documentation](https://huggingface.co/transformers/master/custom_datasets.html#language-modeling) about language modeling. Once we've concatenated the texts, we'll split them into chunks of size `chunk_size`, which we set well below the maximum context size to keep memory usage down:

In [None]:
chunk_size = 128

The next thing we need to figure out is how to generate the masks for masked language modeling. We have two strategies to choose from:

* _Per-token masking_: Randomly mask 15% of the tokens in each sequence by replacing them with `[MASK]` and then predict these masked tokens. This is the standard approach used to pretrain BERT and related models.
* _Whole word masking_: Instead of randomly masking individual tokens, we first pick tokens that represent the start of a word (i.e., tokens that don't begin with the special `##` prefix). Once we've identified these "start of word" tokens, we then mask the first token and any adjacent tokens that are part of the same word.

In this example, we'll use per-token masking, but feel free to experiment with whole word masking on your own. A nice thing about per-token masking is that we can use the `DataCollatorForLanguageModeling` class from 🤗 Transformers to perform the masking on the fly during training. This is more convenient than masking the whole dataset ahead of time, since we'd otherwise have to rerun the preprocessing every time we change the masking hyperparameters or save several copies of the dataset with different masks. So let's create our data collator now:

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15
)

Here, we rely on the default `mlm=True` to enable masked language modeling, and we set `mlm_probability=0.15` so that each token has a 15% chance of being selected for masking.

> **NOTE:** Passing the `tokenizer` to `DataCollatorForLanguageModeling` is needed so that the data collator knows what the model's special tokens like `[CLS]`, `[SEP]`, and `[MASK]` are. If we set `mlm=False`, then the data collator will simply prepare the inputs for _causal_ language modeling by copying the inputs to the labels; the shift by one position then happens inside the model.
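
To get a feel for what the collator produces, here's a simplified, pure-Python sketch of per-token masking. It isn't the library's implementation: the function name, the toy token IDs, and the 50% probability in the demo are assumptions for illustration, and it always substitutes the mask token, whereas the real collator by default also leaves some selected tokens unchanged or swaps in random ones. What it shares with the collator is the key idea that labels are set to `-100` everywhere except at the selected positions, so the loss only counts those tokens.

```python
import random

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_tokens(input_ids, special_ids, mask_id, probability, seed=0):
    """Mask each non-special token independently with the given probability."""
    rng = random.Random(seed)
    masked_inputs, labels = [], []
    for token_id in input_ids:
        if token_id not in special_ids and rng.random() < probability:
            masked_inputs.append(mask_id)  # hide the token from the model
            labels.append(token_id)  # ...and ask the model to recover it
        else:
            masked_inputs.append(token_id)
            labels.append(IGNORE_INDEX)  # no loss at this position
    return masked_inputs, labels


# Toy IDs: 101 and 102 stand in for [CLS] and [SEP], 103 for [MASK]
ids = [101, 7, 8, 9, 10, 11, 12, 102]
masked, labels = mask_tokens(ids, special_ids={101, 102}, mask_id=103, probability=0.5)
print(masked)
print(labels)
```

Special tokens are never selected, which is one reason the collator needs to know about them.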

The thing that makes whole word masking a bit more involved than per-token masking is that we need a way to map tokens to the words they're part of. For this, we can use the word IDs that we stored in the `word_ids` column during tokenization (the `word_ids()` method is only available for fast tokenizers, so you'll need to use one if you want to use whole word masking). Each element of that list is the index of the word that the corresponding token belongs to, so a whole word masking routine only needs to iterate through the list, start a new group whenever the word ID changes, and then mask all the tokens of the selected groups together. Since `DataCollatorForLanguageModeling` doesn't expect a `word_ids` column, we'll drop it before passing samples to the collator.
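
A whole word masking routine along these lines might look like the following sketch. It's an illustration rather than the collator used in this section: the function name, the toy IDs, and the 50% probability are all assumptions, and it works on plain Python lists instead of tensors. The word IDs follow the convention of the `word_ids()` method, with `None` marking special tokens.

```python
import random

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def whole_word_mask(input_ids, word_ids, mask_id, probability, seed=0):
    """Mask all the tokens of a word together, choosing words at random."""
    rng = random.Random(seed)
    # Group token positions into words: a new word starts when the word ID changes
    words = []
    previous_word_id = None
    for position, word_id in enumerate(word_ids):
        if word_id is None:  # special token, never masked
            previous_word_id = None
            continue
        if word_id != previous_word_id:
            words.append([])
        words[-1].append(position)
        previous_word_id = word_id

    masked_inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for positions in words:
        if rng.random() < probability:
            for position in positions:
                labels[position] = input_ids[position]
                masked_inputs[position] = mask_id
    return masked_inputs, labels


# "[CLS] brom ##well high [SEP]": the first two word pieces share word ID 0
ids = [101, 11, 12, 13, 102]
word_ids = [None, 0, 0, 1, None]
masked, labels = whole_word_mask(ids, word_ids, mask_id=103, probability=0.5)
print(masked)
print(labels)
```

Whatever the random draw, the two pieces of the first word are either both masked or both left alone.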

To see what `data_collator` does, let's grab a small batch from our training set and inspect the result:

In [None]:
samples = [tokenized_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

```python
'>>> [CLS] bromwell [MASK] is a cartoon comedy. it ran at the same [MASK] as some other [MASK] about school life, [MASK] as " teachers ". [MASK] [MASK] [MASK] in the teaching [MASK] lead [MASK] to believe that bromwell high\'s [MASK] are far closer to reality than is " teachers ". the scramble [MASK] [MASK] financially, the [MASK]ful students whogn [MASK] right through [MASK] pathetic teachers\'pomp, the pettiness of the whole situation, distinction remind me of the schools i knew and their students. when i saw [MASK] episode in [MASK] a student repeatedly tried to burn down the school, [MASK] immediately recalled. [MASK]...... [MASK]...... high. a classic line inspector : i\'m here to [MASK] your school. student : welcome to a week in hell. i expect that many adults of my age think that [MASK]mwell [MASK] is [MASK] fetched. what a pity that it isn\'t! [SEP]'

'>>> [CLS] homelessness ( or houselessness as george carlin stated ) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. most people think of the homeless as just a lost cause while worrying about things such as racism, the war on iraq, pressuring kids to succeed [MASK] school, [MASK] the hell suffered by the homeless goes [MASK]ticed. you can\' t turn on the tv and find a [MASK] special on homelessness. most of the[MASK] are [MASK]ized by [MASK] and the media. remember how the media exploited a [MASK]titute when she was [MASK] by a hollywood actor? what did she [MASK] [MASK]? about 75 percent of her [MASK] : homelessness. hmm, a [MASK]titute who [MASK] rent or [MASK] is obviously going to be on the streets. everyone knows that the but no one does anything to help [MASK] get off the streets [MASK]quickly as possible. do they? most [SEP]'
```

Nice, we can see that the `[MASK]` token has been randomly inserted at various locations in our text, sometimes in the middle of a word, as in `[MASK]mwell`! Now that we have our data collator, the last thing we need to do is chunk the texts together. We'll write a small function for this and let the `Dataset.map()` method from 🤗 Datasets apply it to the whole corpus:

In [None]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

Note that in the last line of `group_texts()` we create a new `labels` column that is a copy of the `input_ids` one. As we'll see shortly, that's because in masked language modeling the objective is to predict the masked tokens in each batch, so the original input IDs serve as the ground truth: the data collator will mask some of the inputs and set the labels at all other positions to `-100` so that they're ignored by the loss.
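
To see the concatenate-then-chunk logic on something small enough to check by hand, here's a toy version that operates on a plain list of token ID lists. The function name and the numbers are invented for illustration, and the chunk size is set to 4 just to keep the output readable.

```python
from itertools import chain


def chunk_token_lists(token_lists, chunk_size):
    """Concatenate lists of token IDs and split them into equal-sized chunks."""
    concatenated = list(chain.from_iterable(token_lists))
    # Drop the remainder so that every chunk has exactly chunk_size tokens
    total_length = (len(concatenated) // chunk_size) * chunk_size
    return [
        concatenated[i : i + chunk_size] for i in range(0, total_length, chunk_size)
    ]


# Three "reviews" of different lengths (10 tokens in total)
reviews = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
chunks = chunk_token_lists(reviews, chunk_size=4)
print(chunks)
```

The first chunk spans the boundary between the first two reviews, and tokens 9 and 10 are dropped because they don't fill a whole chunk.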

Now let's apply `group_texts()` to our tokenized datasets:

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

```python
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 61289
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 59905
    })
    unsupervised: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 122963
    })
})
```

You can see that grouping and then chunking the texts has produced many more examples than our original 25,000 in the training and test sets. That's because we now have examples involving _contiguous tokens_ that span across multiple examples from the original corpus. You can see this explicitly by looking for the special `[SEP]` and `[CLS]` tokens in one of the chunks:

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

```python
".... at the time, trying to catch a flight from lax to jfk that got canceled due to a [SEP] [CLS] the way i write. [SEP] [CLS] every time i watch this show, i think of spoofs of soap operas, particularly days of our lives spoofs. first of all, there are many attractive female characters. the episodes often have dramatic plot twists ending in cliff hangers or faux cliff hangers (... [SEP]"
```

In this example we can see three reviews concatenated together, with the special tokens denoting where one finishes and the next one starts. With our datasets in this shape, we're now ready for the fun part: training! But before we do that, we need to take care of a few loose ends. First, we don't actually want the `word_ids` column in the datasets we use for training, so let's get rid of it:

In [None]:
lm_datasets = lm_datasets.remove_columns(["word_ids"])

Second, fine-tuning on the full training set of more than 60,000 chunks can take a while on a single GPU. To speed things up, we'll downsample the training set by shuffling it and selecting 10,000 examples:

In [None]:
from datasets import DatasetDict

# Downsample training set to 10,000 examples
downsampled_dataset = DatasetDict(
    {
        "train": lm_datasets["train"].shuffle(seed=42).select(range(10_000)),
        "test": lm_datasets["test"],
    }
)

This will still allow us to fine-tune the model and see decent results, and at the same time will speed up our training run considerably. Now that we've dealt with all the data processing, let's take a look at the model training!

## Fine-tuning DistilBERT with the Trainer API

Fine-tuning a masked language model is almost identical to fine-tuning a sequence classification model, like we did in [Chapter 3](/course/chapter3). The only difference is that we need to use a special data collator that can batch the masked inputs together. We've already defined our data collator above, so we just need to define a few more things to kick off the training. First, we need to specify a suitable compute metric for our task. In language modeling, it is common to use the _perplexity_ metric, which is just the exponentiated cross-entropy loss:

$$\text{Perplexity} = \exp \left( \frac{1}{N} \sum_{i=1}^{N} \text{CrossEntropy}(y_i, \hat{y}_i) \right)$$
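
As a quick sanity check of this formula, consider a model that spreads its probability uniformly over four candidate tokens: each prediction then costs $\log 4$ in cross-entropy, and the perplexity comes out as 4. The sketch below uses made-up loss values to show the computation.

```python
import math


def perplexity(token_losses):
    """Exponentiate the mean per-token cross-entropy (natural log)."""
    return math.exp(sum(token_losses) / len(token_losses))


# A model that is always torn between 4 equally likely tokens
uniform_losses = [math.log(4)] * 3
print(perplexity(uniform_losses))

# Lower losses mean the model is less "perplexed"
confident_losses = [0.1, 0.2, 0.3]
print(perplexity(confident_losses))
```

This is why perplexity is often read as the effective number of choices the model is hesitating between, with lower values being better.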

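The formula might look a bit intimidating, but the good news is that we don't need a dedicated metric for it: the `Trainer` reports the mean cross-entropy loss on the evaluation set, so we can obtain the perplexity by exponentiating that value. To get a second, more intuitive signal, we'll also track how often the model recovers the masked tokens exactly, using the `accuracy` metric from the 🤗 Evaluate library. Here's a pair of simple functions for this: `preprocess_logits_for_metrics()` reduces the logits to predicted token IDs, which keeps memory usage in check during evaluation, and `compute_metrics()` computes the accuracy over the masked positions:

In [None]:
import evaluate

metric = evaluate.load("accuracy")


def preprocess_logits_for_metrics(logits, labels):
    # Some models return a tuple; the logits always come first
    if isinstance(logits, tuple):
        logits = logits[0]
    # Keep only the most likely token ID at each position
    return logits.argmax(dim=-1)


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # preds have the same shape as the labels, after the argmax(-1) has been calculated
    # by preprocess_logits_for_metrics
    labels = labels.reshape(-1)
    preds = preds.reshape(-1)
    # Only the masked positions have a label different from -100
    mask = labels != -100
    labels = labels[mask]
    preds = preds[mask]
    return metric.compute(predictions=preds, references=labels)

Then we instantiate a `Trainer` like we did in previous sections, making sure to pass our `data_collator` so that the masking is applied during training, along with the two functions we just defined:

In [None]:
from transformers import Trainer, TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    eval_strategy="epoch",
    # Save (and therefore push to the Hub) at the end of every epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)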

Notice that in `TrainingArguments`, we've set `push_to_hub=True` so that the fine-tuned model is automatically uploaded to the Hugging Face Hub each time it is saved. You can specify the repository you want to push to with the `hub_model_id` argument (in particular, you will have to use this argument to push to an organization). For example, when we pushed the model to the [`huggingface-course` organization](https://huggingface.co/huggingface-course), we added `hub_model_id="huggingface-course/distilbert-base-uncased-finetuned-imdb"` to `TrainingArguments` (this is the model we linked to at the beginning of this section). By default, the repository used will be in your namespace and named after the output directory you set, so for a user called `sgugger` it would be `"sgugger/distilbert-base-uncased-finetuned-imdb"`.

> **TIP:** If the output directory you are using already exists, it needs to be a local clone of the repository you want to push to. If it isn't, you'll get an error when calling `trainer.train()` and will need to set a new name. 

Finally, we just call `train()` to fine-tune the model:

In [None]:
trainer.train()

Note that while the training takes place, each time the model is saved (here, every epoch) it is uploaded to the Hub in the background. This way, you can resume your training on another machine if necessary. Once training is complete, we can compute the perplexity on the test set to make sure our model has adapted to the new domain:

In [None]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

```python
>>> Perplexity: 11.32
```

Nice -- that's a lot lower than the perplexity of around 21 that the pretrained checkpoint gets on this test set (you can verify this by calling `trainer.evaluate()` before `trainer.train()`), which indicates the model has learned something useful about the movie review domain! Once fine-tuning is complete, we can push the final model to the Hub (this is largely redundant since the weights have already been uploaded thanks to `push_to_hub=True` in `TrainingArguments`, but it doesn't hurt to be explicit):

In [None]:
trainer.push_to_hub()

> **TIP:** If you have access to a machine with a GPU, try running the training there and compare the time it takes relative to Google Colab. 

Nice work -- we've trained our first language model! The `Trainer` took care of the training loop for us, but sometimes you'll need to implement custom logic. For those cases you can use 🤗 Accelerate, so let's take a quick look at what it takes to fine-tune a masked language model with a training loop written in native PyTorch.

## Fine-tuning DistilBERT with 🤗 Accelerate

Now let's take a look at how we can achieve the same fine-tuning results with 🤗 Accelerate, which allows us to customize every aspect of the training loop. As usual, we first need to create the dataloaders, passing `data_collator` as the collation function of both so that the masking is applied during batch construction. Keep in mind that the masks are drawn at random for every batch, so the evaluation perplexity will fluctuate a little from one run to the next:

In [None]:
from torch.utils.data import DataLoader

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
# Use the masking collator here too; otherwise the inputs would contain no [MASK] tokens
eval_dataloader = DataLoader(
    downsampled_dataset["test"], batch_size=batch_size, collate_fn=data_collator
)

From here, we follow the standard steps with 🤗 Accelerate. The first order of business is to load a fresh version of the pretrained model:

In [None]:
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Then we need to specify the optimizer; we'll use the standard `AdamW`:

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

With these objects, we can now prepare everything for training with the `Accelerator` object:

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that our model, optimizer, and dataloaders are configured, we can specify the learning rate scheduler as follows:

In [None]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
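
For intuition, the step count and the shape of the schedule can be worked out by hand. This sketch assumes a single process (so the dataloader isn't sharded), the 10,000 training examples and batch size of 64 from above, and a linear decay to zero with no warmup. `linear_lr()` is a hypothetical helper written for this example, not a library function.

```python
import math

num_examples, batch_size, num_train_epochs = 10_000, 64, 3
initial_lr = 5e-5

# The last batch of each epoch is smaller, so we round up
steps_per_epoch = math.ceil(num_examples / batch_size)
num_training_steps = num_train_epochs * steps_per_epoch


def linear_lr(step):
    """Learning rate after `step` updates with linear decay and no warmup."""
    remaining = max(0, num_training_steps - step)
    return initial_lr * remaining / num_training_steps


print(steps_per_epoch, num_training_steps)
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step))
```

The learning rate starts at its initial value, is roughly halved midway through training, and reaches zero on the final step.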

There is just one last thing to do before training: create a model repository on the Hugging Face Hub! We can use the 🤗 Hub library to first generate the full name of our repo:

In [None]:
from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

```python
'lewtun/distilbert-base-uncased-finetuned-imdb-accelerate'
```

Then we can create and clone the repository using the `Repository` class from 🤗 Hub:

In [None]:
from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

With that done, it's just a simple matter of writing out the full training and evaluation loop:

In [None]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(downsampled_dataset["test"])]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

```python
>>> Epoch 0: Perplexity: 11.397545307900472
>>> Epoch 1: Perplexity: 10.904909330983092
>>> Epoch 2: Perplexity: 10.729503505340409
```

Cool, we've been able to evaluate the perplexity after each epoch, and we can see that it keeps improving as training progresses!
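
One detail of the evaluation loop that's easy to miss is why each batch loss is repeated `batch_size` times and the result is then truncated to the size of the test set. The toy calculation below, with invented numbers and plain Python lists instead of tensors, suggests that on a single process this weights each batch by the number of examples it contains, so a smaller final batch doesn't skew the mean.

```python
batch_size = 4
num_examples = 10  # batches of 4, 4, and 2 examples
batch_losses = [1.0, 2.0, 4.0]  # mean loss of each batch (made-up values)

# Mimic loss.repeat(batch_size) followed by concatenation
repeated = []
for loss in batch_losses:
    repeated.extend([loss] * batch_size)

# Mimic the truncation to the dataset length
repeated = repeated[:num_examples]
weighted_mean = sum(repeated) / len(repeated)

# Averaging the batch losses directly would overweight the small last batch
naive_mean = sum(batch_losses) / len(batch_losses)
print(weighted_mean, naive_mean)
```

Here the truncation removes the two surplus copies of the last batch's loss, which brings its weight back in line with the two examples it actually contains.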

## Using our fine-tuned model

You can interact with your fine-tuned model either by using its widget on the Hub or locally with the `pipeline` from 🤗 Transformers. Let's use the latter to download our model using the `fill-mask` pipeline:

In [None]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="huggingface-course/distilbert-base-uncased-finetuned-imdb"
)

We can then feed the pipeline our sample text of "This is a great [MASK]" and see what the top 5 predictions are:

In [None]:
preds = mask_filler(text)

for pred in preds:
    print(f"'>>> {pred['sequence']}'")

```python
'>>> this is a great movie.'
'>>> this is a great film.'
'>>> this is a great story.'
'>>> this is a great movies.'
'>>> this is a great character.'
```

Neat -- our model has clearly adapted its weights to predict words that are more strongly associated with movies!

This wraps up our first experiment with training a language model. In section 6 you'll learn how to train an auto-regressive model like GPT-2 from scratch; head over there if you'd like to see how you can pretrain your very own Transformer model!

> **TIP:** ✏️ **Try it out!** To quantify the benefits of domain adaptation, fine-tune a classifier on the IMDb labels for both the pretrained and fine-tuned DistilBERT checkpoints. If you need a refresher on text classification, check out Chapter 3.