# Language Modelling on IPUs - Training

In this notebook, we'll see how to train a [🤗 Transformers](https://github.com/huggingface/transformers) model on a language modelling task. We will cover two types of language modelling tasks:

- Causal language modelling: The model has to predict the next token in the sentence (so the labels are the same as the inputs, shifted by one position). To make sure the model does not cheat, it gets an attention mask that will prevent it from accessing the tokens after token `i` when trying to predict token `i+1` in the sentence.

![Widget inference representing the causal language modelling task](images/causal_language_modeling.png)

- Masked language modelling: The model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the tokens that have been masked to predict their value.

![Widget inference representing the masked language modelling task](images/masked_language_modeling.png)
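
As a rough illustration of the causal task described above, the sketch below builds next-token targets and a causal attention mask for a toy sequence in plain Python. The token IDs are arbitrary placeholders, and this is not how the library constructs its masks internally; it only shows the idea.

```python
# Toy token IDs standing in for a tokenized sentence (arbitrary values).
tokens = [11, 22, 33, 44, 55]

# For next-token prediction, position i is trained to predict token i + 1.
inputs = tokens[:-1]
targets = tokens[1:]

# Causal attention mask: row i has a 1 for every position j <= i, so
# position i can attend only to itself and to earlier tokens.
causal_mask = [
    [1 if j <= i else 0 for j in range(len(inputs))]
    for i in range(len(inputs))
]

for row in causal_mask:
    print(row)
print(list(zip(inputs, targets)))
```

Each `(input, target)` pair printed at the end shows which token the model is asked to predict at that position.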

We will see how to easily load and preprocess the dataset for each of these tasks, and how to use the `IPUTrainer` API to train a model on it.

This notebook assumes you have trained a tokenizer on the corpus you are using (see the [How to train a tokenizer](https://github.com/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb) notebook for details).

|  Domain | Tasks | Model | Datasets | Workflow |   Number of IPUs   | Execution time |
|---------|-------|-------|----------|----------|--------------|--------------|
| Natural language processing | Causal language modelling and Masked language modelling | gpt2 and bert-base-cased | Wikitext 2 | Training | 4 or 16 | 28 min on POD4, 15 min on POD16 |

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)


In order to improve usability and support for future users, Graphcore would like to collect information about the
applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:

- User progression through the notebook
- Notebook details: number of cells, code being run and the output of the cells
- Environment details

You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell.

## Dependencies and configuration

Install the dependencies for this notebook.

In [1]:
%pip install "optimum-graphcore==0.7"
%pip install graphcore-cloud-tools[logger]@git+https://github.com/graphcore/graphcore-cloud-tools
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu:
Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu:
Collecting graphcore-cloud-tools[logger]@ git+https://github.com/graphcore/graphcore-cloud-tools
  Cloning https://github.com/graphcore/graphcore-cloud-tools to /tmp/pip-install-u_sdu62u/graphcore-cloud-tools_2a9128a5b2cd40c5850d2b1ee56a3e18
  Running command git clone --filter=blob:none -q https://github.com/graphcore/graphcore-cloud-tools /tmp/pip-install-u_sdu62u/graphcore-cloud-tools_2a9128a5b2cd40c5850d2b1ee56a3e18
  Resolved https://github.com/graphcore/graphcore-cloud-tools to commit bb3d3d26b6e11aeccc03d1917c98eaf2cf1ede6e
  Preparing metadata (setup.py) ... done
Note: you may need to restart the kernel to use updated packages.


In [None]:
import subprocess

# Run the pip list command and capture its output
result = subprocess.run(['pip', 'list'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

# Check if the command executed successfully
if result.returncode == 0:
    # Print the captured output
    print(result.stdout)
else:
    # Print error message if the command failed
    print("Error:", result.stderr)


The cache directories can be configured through environment variables or directly in the notebook:

In [2]:
import os

executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/") + "/language_modelling_from_scratch"

### Sharing your model with the community

You can share your model with the 🤗 community. You do this by completing the following steps:

1. Store your authentication token from the 🤗 website. [Sign up to 🤗](https://huggingface.co/join) if you haven't already.
2. Execute the following cell and input your authentication token:

In [3]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Then you need to install Git-LFS to manage large files:

In [4]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.


## Preparing the dataset

For each of the tasks, we will use the Wikitext 2 dataset as an example. You can load it easily with the 🤗 Datasets library.

In [5]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

You can replace the dataset above with any dataset hosted on [🤗 Datasets](https://huggingface.co/datasets).

You can also use your own data. Just uncomment the following cell and replace the paths shown with the paths to your files:

In [6]:
# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})

You can also load datasets from a CSV or a JSON file. See the Datasets documentation for [loading datasets from local files](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

To access an actual element, you need to select a split ("train" in the example) and specify an index:

In [7]:
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

We want to get a sense of what the data looks like, so we define the `show_random_elements` function to display some examples picked randomly from the dataset.

In [8]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [9]:
show_random_elements(datasets["train"])

Unnamed: 0,text
0,"True to the then @-@ ideal of Manifest Destiny , over 500 @,@ 000 people set out from the river town of Independence , Missouri to their various destinations in the American West from the 1830s to the 1860s . These people had many reasons to embark on this strenuous year @-@ long journey – economic crisis , and later gold strikes including the California Gold Rush , for example . For most , the route took them up the Missouri to Omaha , Nebraska , where they would set out along the Platte River , which flows from the Rocky Mountains in Wyoming and Colorado eastwards through the Great Plains . An early expedition led by Robert Stuart from 1812 to 1813 proved the Platte impossible to navigate by the dugout canoes they used , let alone the large sidewheelers and sternwheelers that would later ply the Missouri in increasing numbers . One explorer remarked that the Platte was "" too thick to drink , too thin to plow "" . Nevertheless , the Platte provided an abundant and reliable source of water for the pioneers as they headed west . Covered wagons , popularly referred to as prairie schooners , provided the primary means of transport until the beginning of regular boat service on the river in the 1850s . \n"
1,
2,
3,= = = = Success and fame = = = = \n
4,
5,"On D @-@ Day at Gold , naval bombardment got underway at 05 : 30 , and amphibious landings commenced at 07 : 25 . High winds made conditions difficult for the landing craft , and the amphibious DD tanks were released close to shore or directly on the beach instead of further out as planned . Three of the four guns in a large emplacement at the Longues @-@ sur @-@ Mer battery were disabled by direct hits from the cruisers Ajax and Argonaut at 06 : 20 . The fourth gun resumed firing intermittently in the afternoon , and its garrison surrendered on 7 June . Aerial attacks had failed to hit the Le Hamel strongpoint , which had its embrasure facing east to provide enfilade fire along the beach and had a thick concrete wall on the seaward side . Its 75 mm gun continued to do damage until 16 : 00 , when a modified Armoured Vehicle Royal Engineers ( AVRE ) tank fired a large petard charge into its rear entrance . A second casemated emplacement at La Rivière containing an 88 mm gun was neutralised by a tank at 07 : 30 . \n"
6,= = Colours and badge = = \n
7,= = Production = = \n
8,= = Biography = = \n
9,"On 15 January 1944 , an earthquake occurred in the town of San Juan , Argentina , killing some 10 @,@ 000 people . In response , Perón , who was then the Secretary of Labour , established a fund to raise money to aid the victims . He devised a plan to have an "" artistic festival "" as a fundraiser , and invited radio and film actors to participate . After a week of fundraising , all participants met at a gala held at Luna Park Stadium in Buenos Aires to benefit earthquake victims . It was at this gala , on 22 January 1944 , that Eva Duarte first met Colonel Juan Perón . Eva promptly became the colonel 's mistress . Eva referred to the day she met her future husband as her "" marvelous day "" . Fraser and Navarro write that Juan Perón and Eva left the gala together at around two in the morning . \n"


As we can see, some of the text samples are a full paragraph of a Wikipedia article while others are just titles or empty lines.

## Causal language modelling

For causal language modelling (CLM) we are going to take all the text samples in our dataset and concatenate them after they are tokenized. Then we will split the result into samples of a fixed sequence length. This means that the model will receive chunks of contiguous text that may look like:
```
part of text 1
```
or 
```
end of text 1 [BOS_TOKEN] beginning of text 2
```
depending on whether the samples span over several of the original text samples in the dataset or not. The labels will be the same as the inputs, shifted to the left.

We will use the [`gpt2`](https://huggingface.co/gpt2) architecture for this example. You can pick any of the [🤗 models for causal language modelling](https://huggingface.co/models?filter=causal-lm) as long as that model is supported by Optimum Graphcore. The IPU config files of the supported models are available in Graphcore's [🤗 account](https://huggingface.co/Graphcore). You can also create your own IPU config file locally. For the tokenizer, you can replace the checkpoint with the one you trained yourself.

In this notebook, we are using both data parallelism and pipeline parallelism (see the [tutorial on efficient data loading](https://github.com/graphcore/examples/tree/master/tutorials/tutorials/pytorch/efficient_data_loading) for more information). Therefore the global batch size, which is the actual number of samples used for the weight update, is determined from three factors:
- global batch size = micro batch size * gradient accumulation steps * replication factor

The replication factor is determined by the type of IPU Pod used, which will be used as a key to select the replication factor from a dictionary defined in the IPU config file. For example, the dictionary in the IPU config file [Graphcore/gpt2-small-ipu](https://huggingface.co/Graphcore/gpt2-small-ipu/blob/main/ipu_config.json) looks like this:
- "replication_factor": {"pod4": 1, "pod8": 2, "pod16": 4, "pod32": 8, "pod64": 16, "default": 1}

Depending on your model and the IPU Pod you are using, you might need to adjust these three batch-size-related arguments.
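
As a quick check of the formula above, the following sketch computes the global batch size for a few Pod types. It uses the micro batch size and gradient accumulation steps that this notebook sets for GPT-2 and the replication factors quoted from the example config; the `global_batch_size` helper is a hypothetical name introduced only for illustration.

```python
# Values this notebook uses for the GPT-2 example.
micro_batch_size = 1
gradient_accumulation_steps = 64

# Replication factors quoted from the Graphcore/gpt2-small-ipu example above.
replication_factor = {"pod4": 1, "pod8": 2, "pod16": 4, "pod32": 8, "pod64": 16, "default": 1}


def global_batch_size(pod_type):
    """Hypothetical helper: number of samples contributing to one weight update."""
    # Fall back to the "default" entry for an unknown Pod type.
    replicas = replication_factor.get(pod_type, replication_factor["default"])
    return micro_batch_size * gradient_accumulation_steps * replicas


print(global_batch_size("pod4"))   # 1 * 64 * 1 = 64
print(global_batch_size("pod16"))  # 1 * 64 * 4 = 256
```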

In [10]:
model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"

ipu_config_name = "Graphcore/gpt2-small-ipu"
micro_batch_size = 1
gradient_accumulation_steps = 64
dataloader_workers = 64

To tokenize all our text samples with the same vocabulary that was used when training the model, we have to download a pre-trained tokenizer. This is all done by the `AutoTokenizer` class:

In [11]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)

We can now call the tokenizer on all our text samples. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [12]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it.

In [13]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

If we now look at an element of our datasets, we will see the text has been replaced with `input_ids` that the model will need:

In [14]:
tokenized_datasets["train"][1]

{'input_ids': [238, 8576, 9441, 2987, 238, 252],
 'attention_mask': [1, 1, 1, 1, 1, 1]}

Now we need to concatenate all our text samples together then split the result into small chunks of a certain block size (`block_size`). To do this, we will use the `map` method again, with the option `batched=True`. This option lets us change the number of samples in the datasets by returning a different number of samples than we originally had. This means that we can create a new set of samples from an existing set of samples.

We can read the maximum length our model was pre-trained with from `tokenizer.model_max_length`, but since that value might be too big to fit in IPU memory, we set the block size to 128.

In [15]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our text samples:

In [16]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it.
    # You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

Note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library shift the labels by one position internally, so we don't need to do it manually.
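
To make the grouping concrete, here is a small self-contained sketch that runs the same logic on a hand-made batch. The toy token IDs and the tiny `block_size` of 4 are illustrative assumptions only; the notebook itself uses 128.

```python
block_size = 4  # tiny value for illustration only; the notebook uses 128


def group_texts(examples):
    # Concatenate every column (input_ids, attention_mask) across the batch.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so the length is a multiple of block_size.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three toy "tokenized texts" of lengths 3, 5 and 2 (10 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three input samples become two output samples, and the last two tokens (9 and 10) are dropped because they do not fill a complete block.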

Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized text samples a multiple of `block_size` every 1,000 examples. You can adjust this behaviour by passing a larger batch size (which will also take longer to be processed). You can also speed up the preprocessing by using multiprocessing:

In [17]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

We can check our datasets have changed. Now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original text samples.

In [18]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries, along with Valkyria Chronicles II director Takeshi Ozawa. A large'

To instantiate `IPUTrainer`, we will need to define:
* `IPUConfig`, which is a class that specifies attributes and configuration parameters to compile and put the model on the device.
* A model.
* `IPUTrainingArguments`, which is a class that contains all the attributes to customize the training.

We initialize `IPUConfig` with one config name or a path, which we set earlier. We also get the model configuration from the model name set earlier and initialize our model using that config.

In [19]:
from transformers import AutoConfig, AutoModelForCausalLM
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments

ipu_config = IPUConfig.from_pretrained(ipu_config_name, executable_cache_dir=executable_cache_dir)

config = AutoConfig.from_pretrained(model_checkpoint)
config.update({'activation_function':'gelu'})
model = AutoModelForCausalLM.from_config(config)

`replicated_tensor_sharding` is not used when `replication_factor=1`


`IPUTrainingArguments` requires one folder name, which will be used to save the checkpoints of the model. All other arguments are optional:

In [20]:
training_args = IPUTrainingArguments(
    f"{model_checkpoint}-wikitext2",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=10,
    loss_scaling=16384,
    n_ipu=4,
    warmup_ratio=0.1,
    dataloader_drop_last=True,
    dataloader_num_workers=dataloader_workers,
    logging_steps=10,
    push_to_hub=False,
    # hub_model_id=f"username-or-organization/{model_checkpoint}-wikitext2",
)

`push_to_hub` and `hub_model_id` in `IPUTrainingArguments` are necessary if we want to push the model to the [🤗 Models Hub](https://huggingface.co/models) regularly during training. You can remove them if you didn't follow the installation steps at the beginning of this notebook. If you want to save your model locally to a name that is different to the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/gpt-finetuned-wikitext2"` or `"huggingface/gpt-finetuned-wikitext2"`).

Finally, we pass along all of these to the `IPUTrainer` class:

In [21]:
from transformers import default_data_collator

trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

Overriding IPU config: gradient_accumulation_steps=64
-------------------- Device Allocation --------------------
Token Embedding     --> IPU 0
Position Embedding  --> IPU 0
Layer 0             --> IPU 1
Layer 1             --> IPU 1
Layer 2             --> IPU 1
Layer 3             --> IPU 1
Layer 4             --> IPU 2
Layer 5             --> IPU 2
Layer 6             --> IPU 2
Layer 7             --> IPU 2
Layer 8             --> IPU 3
Layer 9             --> IPU 3
Layer 10            --> IPU 3
Layer 11            --> IPU 3
Head                --> IPU 0
-----------------------------------------------------------
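
The layer placement printed above is consistent with a per-IPU layer count of `[0, 4, 4, 4]`: no transformer layers on IPU 0 (which holds the embeddings and the head) and four layers on each of the other three IPUs. The sketch below reproduces that mapping; `layers_per_ipu` and `assign_layers` are names chosen here for illustration and this is not the library's own allocation code.

```python
# Assumed layer counts per IPU, matching the allocation printed above.
layers_per_ipu = [0, 4, 4, 4]


def assign_layers(layers_per_ipu):
    """Map consecutive layer indices to IPUs according to the per-IPU counts."""
    mapping = {}
    layer = 0
    for ipu, count in enumerate(layers_per_ipu):
        for _ in range(count):
            mapping[layer] = ipu
            layer += 1
    return mapping


allocation = assign_layers(layers_per_ipu)
for layer, ipu in allocation.items():
    print(f"Layer {layer:<2} --> IPU {ipu}")
```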


We are now ready to train our model:

In [22]:
# trainer.train()

Once the training is complete, we can evaluate our model and get its perplexity on the validation set like this:

In [23]:
import math
# eval_results = trainer.evaluate()
# print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

The perplexity is still quite high since we only trained on a small dataset for a small number of epochs. For a real language model training, you would need a larger dataset and more epochs.
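
Perplexity is the exponential of the average cross-entropy loss, which is what the evaluation cell above computes. The short sketch below shows how the two quantities relate; the loss values are made up for illustration and are not results from this notebook.

```python
import math


def perplexity(loss):
    """Convert an average cross-entropy loss (natural log) into perplexity."""
    return math.exp(loss)


# Hypothetical loss values, for illustration only.
for loss in (2.0, 4.0, 7.0):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.2f}")
```

Because of the exponential, a small reduction in loss corresponds to a large reduction in perplexity.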

You can now upload the result of the training to the 🤗 Hub:

In [24]:
# trainer.push_to_hub()

You can now share this model and other users can load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sgugger/my-awesome-model")
```

## Masked language modelling

For masked language modelling (MLM) we are going to use the same preprocessing as before for our dataset with one additional step: we will randomly mask some tokens (by replacing them with `[MASK]`) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens). If you use a tokenizer you trained yourself, make sure the `[MASK]` token is among the special tokens you passed during training!

We will use the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model for this example. You can pick any of the [🤗 models for masked language modelling](https://huggingface.co/models?filter=masked-lm) as long as that model is supported by Optimum Graphcore. The IPU config files of the supported models are available in Graphcore's [🤗 account](https://huggingface.co/Graphcore). You can also create your own IPU config file locally. For the tokenizer, replace the checkpoint with the one you trained.

In [25]:
model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "sgugger/bert-like-tokenizer"

ipu_config_name = "Graphcore/bert-base-ipu"

We can apply the same tokenization function as before. We just need to update our tokenizer to use the tokenizer checkpoint we just picked:

In [26]:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

As with the causal language modelling example, we group the text samples together and create chunks of length `block_size`. You can skip that step if your dataset is composed of individual sentences.

In [27]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

The rest is very similar to what we did for causal language modelling, with two exceptions:
* We need a model suitable for masked language modelling.
* We need a special data collator.

First, we use a model suitable for masked language modelling:

In [28]:
from transformers import AutoConfig, AutoModelForMaskedLM
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments

ipu_config = IPUConfig.from_pretrained(ipu_config_name, executable_cache_dir=executable_cache_dir)

config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_config(config)

loading configuration file ipu_config.json from cache at /tmp/huggingface_caches/checkpoints/models--Graphcore--bert-base-ipu/snapshots/0bbecd4f090aa0bd3b916b6914bed1b08be854f3/ipu_config.json
`replicated_tensor_sharding` is not used when `replication_factor=1`
IPUConfig {
  "auto_loss_scaling": false,
  "device_iterations": 1,
  "embedding_serialization_factor": 1,
  "enable_half_partials": true,
  "executable_cache_dir": "/tmp/exe_cache/3.3.0/language_modelling_from_scratch",
  "execute_encoder_on_cpu_for_generation": false,
  "gradient_accumulation_steps": 16,
  "inference_device_iterations": 5,
  "inference_embedding_serialization_factor": 1,
  "inference_ipus_per_replica": 4,
  "inference_layers_per_ipu": [
    0,
    4,
    4,
    4
  ],
  "inference_matmul_proportion": 0.25,
  "inference_projection_serialization_factor": 1,
  "inference_replication_factor": 1,
  "inference_serialized_embedding_splits_per_ipu": null,
  "inference_serialized_projection_splits_per_ipu": null,
  "ip

We create a new `IPUTrainingArguments` instance for this task:

In [29]:
training_args = IPUTrainingArguments(
    f"{model_checkpoint}-wikitext2-test-mlm",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=10,
    dataloader_drop_last=True,
    dataloader_num_workers=dataloader_workers,
    warmup_ratio=0.1,
    logging_steps=10,
    n_ipu=4,
    push_to_hub=False,
    # hub_model_id=f"username-or-organization/{model_checkpoint}-wikitext2-test-mlm",
)

Like before, the last two arguments in `IPUTrainingArguments` are needed if we want to push the model to the [🤗 Models Hub](https://huggingface.co/models) at the end of training. Remove these two arguments if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally to a name that is different to the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/gpt-finetuned-wikitext2"` or `"huggingface/gpt-finetuned-wikitext2"`).
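
The `learning_rate` and `warmup_ratio` arguments together determine the learning rate at each step. The sketch below assumes the default linear schedule (linear warmup followed by linear decay to zero) and a total of 2,930 optimizer steps, which is the figure shown by the progress bar in this notebook's training log. The function name `linear_schedule_lr` is hypothetical, and this is a simplified reimplementation rather than the scheduler code itself.

```python
import math

base_lr = 2e-5       # learning_rate passed to IPUTrainingArguments
warmup_ratio = 0.1   # warmup_ratio passed to IPUTrainingArguments
total_steps = 2930   # total steps shown by the training progress bar

# Assumption: the number of warmup steps is the ratio times the total, rounded up.
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step):
    """Learning rate at a given step under linear warmup then linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


for step in (10, warmup_steps, 500, total_steps):
    print(step, linear_schedule_lr(step))
```

Under these assumptions, the values at steps 10 and 500 agree with the `learning_rate` entries logged at epochs 0.03 and 1.71 in the training output.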

Finally, we use a special data collator. The data collator is a function that is responsible for taking samples and batching them into tensors. In the causal language modelling example, we didn't need anything special, so we just used the default data collator. Here we want to randomly mask the data. We could do it as a pre-processing step (like with the tokenization) but then the tokens would always be masked the same way at each epoch. By doing this step inside the data collator, we ensure this random masking is done in a new way each time we go over the data.

To do this masking, we use `DataCollatorForLanguageModeling`, which lets us adjust the probability of the masking:

In [30]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
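
To illustrate why masking inside the collator gives a different result on every pass, here is a simplified plain-Python sketch of dynamic masking. It is not the library's implementation: `MASK_ID` and `VOCAB_SIZE` are hypothetical values, and the 80/10/10 split between `[MASK]`, a random token and the unchanged token is assumed to mirror the collator's default behaviour. Labels for positions that were not selected are set to -100 so the loss ignores them.

```python
import random

MASK_ID = 103        # hypothetical ID of the [MASK] token
VOCAB_SIZE = 1000    # hypothetical vocabulary size
IGNORE_INDEX = -100  # label value that the loss function ignores


def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) with a fresh random mask on every call."""
    masked, labels = [], []
    for token in input_ids:
        if rng.random() < mlm_probability:
            # Selected position: the model must predict the original token.
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_ID)                    # replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.randrange(VOCAB_SIZE))  # replace with a random token
            else:
                masked.append(token)                      # keep the original token
        else:
            # Unselected position: input unchanged, label ignored by the loss.
            labels.append(IGNORE_INDEX)
            masked.append(token)
    return masked, labels


sample = list(range(200, 232))  # toy "tokenized" sequence of 32 tokens
for epoch in range(2):
    masked, labels = mask_tokens(sample)
    selected = [i for i, label in enumerate(labels) if label != IGNORE_INDEX]
    print(f"epoch {epoch}: positions selected for prediction = {selected}")
```

Running the loop shows that the selected positions generally differ between passes over the same sample, which is the effect the data collator provides during training.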

Then we just have to pass everything to `IPUTrainer` and begin training:

In [31]:
print("Memory usage before training:")
!free -h

Memory usage before training:
              total        used        free      shared  buff/cache   available
Mem:          503Gi        32Gi       216Gi       9.0Mi       255Gi       468Gi
Swap:            0B          0B          0B


In [32]:
trainer = IPUTrainer(
    model=model,
    args=training_args,
    ipu_config=ipu_config,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

Overriding IPU config: gradient_accumulation_steps=64
-------------------- Device Allocation --------------------
Embedding  --> IPU 0
Encoder 0  --> IPU 1
Encoder 1  --> IPU 1
Encoder 2  --> IPU 1
Encoder 3  --> IPU 1
Encoder 4  --> IPU 2
Encoder 5  --> IPU 2
Encoder 6  --> IPU 2
Encoder 7  --> IPU 2
Encoder 8  --> IPU 3
Encoder 9  --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Classifier --> IPU 0
-----------------------------------------------------------


In [33]:
trainer.train()

-------------------- Device Allocation --------------------
Embedding  --> IPU 0
Encoder 0  --> IPU 1
Encoder 1  --> IPU 1
Encoder 2  --> IPU 1
Encoder 3  --> IPU 1
Encoder 4  --> IPU 2
Encoder 5  --> IPU 2
Encoder 6  --> IPU 2
Encoder 7  --> IPU 2
Encoder 8  --> IPU 3
Encoder 9  --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Classifier --> IPU 0
-----------------------------------------------------------
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

  0%|          | 0/2930 [00:00<?, ?it/s]

{'loss': 10.1508, 'learning_rate': 6.825938566552902e-07, 'epoch': 0.03}
{'loss': 9.9695, 'learning_rate': 1.3651877133105804e-06, 'epoch': 0.07}
{'loss': 9.8, 'learning_rate': 2.0477815699658705e-06, 'epoch': 0.1}
{'loss': 9.7805, 'learning_rate': 2.7303754266211608e-06, 'epoch': 0.14}
{'loss': 9.7844, 'learning_rate': 3.412969283276451e-06, 'epoch': 0.17}
{'loss': 9.5734, 'learning_rate': 4.095563139931741e-06, 'epoch': 0.2}
{'loss': 9.5633, 'learning_rate': 4.778156996587031e-06, 'epoch': 0.24}
{'loss': 9.2664, 'learning_rate': 5.4607508532423215e-06, 'epoch': 0.27}
{'loss': 9.3281, 'learning_rate': 6.143344709897611e-06, 'epoch': 0.31}
{'loss': 9.1023, 'learning_rate': 6.825938566552902e-06, 'epoch': 0.34}
{'loss': 9.2648, 'learning_rate': 7.508532423208191e-06, 'epoch': 0.38}
{'loss': 9.0184, 'learning_rate': 8.191126279863482e-06, 'epoch': 0.41}
{'loss': 8.9555, 'learning_rate': 8.873720136518773e-06, 'epoch': 0.44}
{'loss': 9.2836, 'learning_rate': 9.556313993174062e-06, 'epoch'

Saving model checkpoint to bert-base-cased-wikitext2-test-mlm/checkpoint-500


{'loss': 7.3082, 'learning_rate': 1.8430034129692834e-05, 'epoch': 1.71}


-------------------- Device Allocation --------------------
Embedding  --> IPU 0
Encoder 0  --> IPU 1
Encoder 1  --> IPU 1
Encoder 2  --> IPU 1
Encoder 3  --> IPU 1
Encoder 4  --> IPU 2
Encoder 5  --> IPU 2
Encoder 6  --> IPU 2
Encoder 7  --> IPU 2
Encoder 8  --> IPU 3
Encoder 9  --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Classifier --> IPU 0
-----------------------------------------------------------
Configuration saved in bert-base-cased-wikitext2-test-mlm/checkpoint-500/ipu_config.json


{'loss': 7.1047, 'learning_rate': 1.8354190367842245e-05, 'epoch': 1.74}
{'loss': 7.6059, 'learning_rate': 1.827834660599166e-05, 'epoch': 1.77}
{'loss': 7.3566, 'learning_rate': 1.820250284414107e-05, 'epoch': 1.81}
{'loss': 6.6348, 'learning_rate': 1.8126659082290482e-05, 'epoch': 1.84}
{'loss': 7.4977, 'learning_rate': 1.8050815320439897e-05, 'epoch': 1.88}
{'loss': 7.3086, 'learning_rate': 1.7974971558589308e-05, 'epoch': 1.91}
{'loss': 6.8828, 'learning_rate': 1.789912779673872e-05, 'epoch': 1.95}
{'loss': 7.5281, 'learning_rate': 1.782328403488813e-05, 'epoch': 1.98}
{'loss': 7.1922, 'learning_rate': 1.7747440273037545e-05, 'epoch': 2.01}
{'loss': 7.1633, 'learning_rate': 1.7671596511186956e-05, 'epoch': 2.05}
{'loss': 6.9426, 'learning_rate': 1.7595752749336368e-05, 'epoch': 2.08}
{'loss': 7.3242, 'learning_rate': 1.7519908987485782e-05, 'epoch': 2.12}
{'loss': 7.1488, 'learning_rate': 1.7444065225635193e-05, 'epoch': 2.15}
{'loss': 7.0549, 'learning_rate': 1.7368221463784605e-0

Saving model checkpoint to bert-base-cased-wikitext2-test-mlm/checkpoint-1000


{'loss': 7.3633, 'learning_rate': 1.4637846037163443e-05, 'epoch': 3.41}


-------------------- Device Allocation --------------------
Embedding  --> IPU 0
Encoder 0  --> IPU 1
Encoder 1  --> IPU 1
Encoder 2  --> IPU 1
Encoder 3  --> IPU 1
Encoder 4  --> IPU 2
Encoder 5  --> IPU 2
Encoder 6  --> IPU 2
Encoder 7  --> IPU 2
Encoder 8  --> IPU 3
Encoder 9  --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Classifier --> IPU 0
-----------------------------------------------------------
Configuration saved in bert-base-cased-wikitext2-test-mlm/checkpoint-1000/ipu_config.json


{'loss': 6.5918, 'learning_rate': 1.4562002275312856e-05, 'epoch': 3.45}
{'loss': 6.9902, 'learning_rate': 1.4486158513462269e-05, 'epoch': 3.48}
{'loss': 7.0469, 'learning_rate': 1.441031475161168e-05, 'epoch': 3.52}
{'loss': 6.5711, 'learning_rate': 1.4334470989761093e-05, 'epoch': 3.55}
{'loss': 6.6801, 'learning_rate': 1.4258627227910506e-05, 'epoch': 3.58}
{'loss': 6.8258, 'learning_rate': 1.4182783466059917e-05, 'epoch': 3.62}
{'loss': 7.1676, 'learning_rate': 1.410693970420933e-05, 'epoch': 3.65}
{'loss': 6.7625, 'learning_rate': 1.4031095942358741e-05, 'epoch': 3.69}
{'loss': 6.9035, 'learning_rate': 1.3955252180508154e-05, 'epoch': 3.72}
{'loss': 6.9738, 'learning_rate': 1.3879408418657567e-05, 'epoch': 3.75}
{'loss': 6.8918, 'learning_rate': 1.3803564656806978e-05, 'epoch': 3.79}
{'loss': 6.7277, 'learning_rate': 1.3727720894956391e-05, 'epoch': 3.82}
{'loss': 6.5203, 'learning_rate': 1.3651877133105804e-05, 'epoch': 3.86}
{'loss': 7.123, 'learning_rate': 1.3576033371255215e-

Saving model checkpoint to bert-base-cased-wikitext2-test-mlm/checkpoint-1500


{'loss': 7.2738, 'learning_rate': 1.0845657944634054e-05, 'epoch': 5.12}


-------------------- Device Allocation --------------------
Embedding  --> IPU 0
Encoder 0  --> IPU 1
Encoder 1  --> IPU 1
Encoder 2  --> IPU 1
Encoder 3  --> IPU 1
Encoder 4  --> IPU 2
Encoder 5  --> IPU 2
Encoder 6  --> IPU 2
Encoder 7  --> IPU 2
Encoder 8  --> IPU 3
Encoder 9  --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Classifier --> IPU 0
-----------------------------------------------------------
Configuration saved in bert-base-cased-wikitext2-test-mlm/checkpoint-1500/ipu_config.json


{'loss': 6.5547, 'learning_rate': 1.0769814182783467e-05, 'epoch': 5.15}
{'loss': 7.8496, 'learning_rate': 1.069397042093288e-05, 'epoch': 5.19}
{'loss': 6.6168, 'learning_rate': 1.0618126659082291e-05, 'epoch': 5.22}
{'loss': 7.2145, 'learning_rate': 1.0542282897231704e-05, 'epoch': 5.26}
{'loss': 6.7852, 'learning_rate': 1.0466439135381117e-05, 'epoch': 5.29}
{'loss': 6.4328, 'learning_rate': 1.0390595373530528e-05, 'epoch': 5.32}
{'loss': 6.9766, 'learning_rate': 1.0314751611679941e-05, 'epoch': 5.36}
{'loss': 6.7152, 'learning_rate': 1.0238907849829352e-05, 'epoch': 5.39}
{'loss': 6.4781, 'learning_rate': 1.0163064087978765e-05, 'epoch': 5.43}
{'loss': 7.1348, 'learning_rate': 1.0087220326128178e-05, 'epoch': 5.46}
{'loss': 7.0094, 'learning_rate': 1.0011376564277588e-05, 'epoch': 5.49}
{'loss': 6.948, 'learning_rate': 9.935532802427002e-06, 'epoch': 5.53}
{'loss': 7.1437, 'learning_rate': 9.859689040576413e-06, 'epoch': 5.56}
{'loss': 6.8359, 'learning_rate': 9.783845278725825e-06

Saving model checkpoint to bert-base-cased-wikitext2-test-mlm/checkpoint-2000


{'loss': 7.0633, 'learning_rate': 7.053469852104665e-06, 'epoch': 6.83}


-------------------- Device Allocation --------------------
Embedding  --> IPU 0
Encoder 0  --> IPU 1
Encoder 1  --> IPU 1
Encoder 2  --> IPU 1
Encoder 3  --> IPU 1
Encoder 4  --> IPU 2
Encoder 5  --> IPU 2
Encoder 6  --> IPU 2
Encoder 7  --> IPU 2
Encoder 8  --> IPU 3
Encoder 9  --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Classifier --> IPU 0
-----------------------------------------------------------
Configuration saved in bert-base-cased-wikitext2-test-mlm/checkpoint-2000/ipu_config.json


{'loss': 7.2629, 'learning_rate': 6.977626090254077e-06, 'epoch': 6.86}
{'loss': 7.091, 'learning_rate': 6.901782328403489e-06, 'epoch': 6.89}
{'loss': 6.5539, 'learning_rate': 6.825938566552902e-06, 'epoch': 6.93}
{'loss': 7.1129, 'learning_rate': 6.750094804702314e-06, 'epoch': 6.96}
{'loss': 7.0457, 'learning_rate': 6.674251042851726e-06, 'epoch': 7.0}
{'loss': 6.7781, 'learning_rate': 6.598407281001138e-06, 'epoch': 7.03}
{'loss': 6.8355, 'learning_rate': 6.5225635191505495e-06, 'epoch': 7.06}
{'loss': 6.3609, 'learning_rate': 6.446719757299963e-06, 'epoch': 7.1}
{'loss': 7.0031, 'learning_rate': 6.370875995449375e-06, 'epoch': 7.13}
{'loss': 6.8844, 'learning_rate': 6.2950322335987864e-06, 'epoch': 7.17}
{'loss': 7.4488, 'learning_rate': 6.2191884717481985e-06, 'epoch': 7.2}
{'loss': 6.9766, 'learning_rate': 6.143344709897611e-06, 'epoch': 7.24}
{'loss': 6.5871, 'learning_rate': 6.0675009480470234e-06, 'epoch': 7.27}
{'loss': 6.8508, 'learning_rate': 5.9916571861964355e-06, 'epoch

Saving model checkpoint to bert-base-cased-wikitext2-test-mlm/checkpoint-2500


{'loss': 6.6656, 'learning_rate': 3.2612817595752747e-06, 'epoch': 8.53}


-------------------- Device Allocation --------------------
Embedding  --> IPU 0
Encoder 0  --> IPU 1
Encoder 1  --> IPU 1
Encoder 2  --> IPU 1
Encoder 3  --> IPU 1
Encoder 4  --> IPU 2
Encoder 5  --> IPU 2
Encoder 6  --> IPU 2
Encoder 7  --> IPU 2
Encoder 8  --> IPU 3
Encoder 9  --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Classifier --> IPU 0
-----------------------------------------------------------
Configuration saved in bert-base-cased-wikitext2-test-mlm/checkpoint-2500/ipu_config.json


{'loss': 6.7668, 'learning_rate': 3.1854379977246876e-06, 'epoch': 8.57}
{'loss': 6.9492, 'learning_rate': 3.1095942358740992e-06, 'epoch': 8.6}
{'loss': 7.0773, 'learning_rate': 3.0337504740235117e-06, 'epoch': 8.63}
{'loss': 6.6926, 'learning_rate': 2.9579067121729238e-06, 'epoch': 8.67}
{'loss': 6.7227, 'learning_rate': 2.8820629503223362e-06, 'epoch': 8.7}
{'loss': 7.1309, 'learning_rate': 2.8062191884717483e-06, 'epoch': 8.74}
{'loss': 6.6289, 'learning_rate': 2.7303754266211608e-06, 'epoch': 8.77}
{'loss': 6.918, 'learning_rate': 2.654531664770573e-06, 'epoch': 8.81}
{'loss': 6.8152, 'learning_rate': 2.5786879029199853e-06, 'epoch': 8.84}
{'loss': 6.6137, 'learning_rate': 2.502844141069397e-06, 'epoch': 8.87}
{'loss': 6.993, 'learning_rate': 2.4270003792188094e-06, 'epoch': 8.91}
{'loss': 6.8094, 'learning_rate': 2.3511566173682214e-06, 'epoch': 8.94}
{'loss': 6.7105, 'learning_rate': 2.275312855517634e-06, 'epoch': 8.98}
{'loss': 6.6066, 'learning_rate': 2.199469093667046e-06, '



Training completed. Do not forget to share your model on huggingface.co/models =)




{'loss': 6.9695, 'learning_rate': 0.0, 'epoch': 10.0}
{'train_runtime': 422.8342, 'train_samples_per_second': 443.484, 'train_steps_per_second': 6.929, 'train_loss': 7.169850549274744, 'epoch': 10.0}


TrainOutput(global_step=2930, training_loss=7.169850549274744, metrics={'train_runtime': 422.8342, 'train_samples_per_second': 443.484, 'train_steps_per_second': 6.929, 'train_loss': 7.169850549274744, 'epoch': 10.0})
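
The `learning_rate` values in the log above are consistent with a linear warmup followed by a linear decay to zero. The sketch below reproduces two of the logged values. It assumes a peak rate of 2e-5 and 293 warmup steps out of 2930 total steps; the peak and warmup figures are inferred from the logged numbers, not read from the notebook's training arguments, and `linear_schedule_lr` is a hypothetical helper rather than a library function.

```python
def linear_schedule_lr(step, peak_lr=2e-5, warmup_steps=293, total_steps=2930):
    """Linear warmup to peak_lr, then linear decay to zero.

    The default values are assumptions inferred from the training log,
    not values taken from the notebook's configuration.
    """
    if step < warmup_steps:
        # Warmup phase: the rate grows proportionally to the step count.
        return peak_lr * step / warmup_steps
    # Decay phase: the rate shrinks linearly and reaches 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Compare against two logged values (step numbers inferred from the
# epoch column and the checkpoint names).
for step, logged in [(30, 2.0477815699658705e-06), (500, 1.8430034129692834e-05)]:
    print(step, linear_schedule_lr(step), logged)
```

With these assumed settings, the rate at the final step is zero, which matches the last logged line at epoch 10.0.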

Like before, we can evaluate our model on the validation set. The perplexity is much lower than for the CLM objective because for the MLM objective, we only have to make predictions for the masked tokens (which represent 15% of the total here) while having access to the rest of the tokens. It's thus an easier task for the model.
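
To illustrate why only a fraction of positions contribute to the MLM loss, here is a simplified sketch of the masking step. It is not the data collator used in this notebook: it always substitutes `[MASK]`, whereas the usual BERT recipe also leaves some selected tokens unchanged or swaps in random ones. The `mask_tokens` helper and the toy sentence are hypothetical; the `-100` label follows the common convention for positions the loss should ignore.

```python
import random

IGNORE_INDEX = -100   # label value conventionally skipped by the loss
MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mlm_probability=0.15, seed=0):
    """Mask each token independently with probability mlm_probability.

    Returns the masked inputs and per-position labels. Labels hold the
    original token at masked positions and IGNORE_INDEX everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            inputs.append(MASK_TOKEN)
            labels.append(tok)           # the model must recover this token
        else:
            inputs.append(tok)           # visible context, not scored
            labels.append(IGNORE_INDEX)
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split() * 20
inputs, labels = mask_tokens(tokens)
scored = [lab for lab in labels if lab != IGNORE_INDEX]
print(f"{len(scored)} of {len(tokens)} positions contribute to the loss")
```

Every unmasked token stays visible to the model as context on both sides of a masked position, which is what distinguishes this objective from the causal one.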

In [34]:
print("rrr executing free -h before evaluate")
!free -h

rrr executing free -h before evaluate
              total        used        free      shared  buff/cache   available
Mem:          503Gi        44Gi       201Gi        10Mi       257Gi       456Gi
Swap:            0B          0B          0B


In [35]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

-------------------- Device Allocation --------------------
Embedding  --> IPU 0
Encoder 0  --> IPU 1
Encoder 1  --> IPU 1
Encoder 2  --> IPU 1
Encoder 3  --> IPU 1
Encoder 4  --> IPU 2
Encoder 5  --> IPU 2
Encoder 6  --> IPU 2
Encoder 7  --> IPU 2
Encoder 8  --> IPU 3
Encoder 9  --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Classifier --> IPU 0
-----------------------------------------------------------
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

  0%|          | 0/401 [00:00<?, ?it/s]

Perplexity: 975.37


The perplexity is still quite high because we only trained on a small dataset for a small number of epochs. To train a real language model, you would need a larger dataset and more epochs.
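
As a rough guide to reading this number, perplexity is the exponential of the mean evaluation loss, so the reported value can be converted back to a loss and compared with a uniform-guess baseline. The sketch below assumes a vocabulary of 28,996 tokens, which is the size of the published `bert-base-cased` vocabulary; the tokenizer used in this run may differ. The `perplexity` helper is hypothetical.

```python
import math


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_loss)


reported_perplexity = 975.37                   # value printed by the cell above
implied_loss = math.log(reported_perplexity)   # inverse of the exp() in the cell
print(f"implied eval_loss: {implied_loss:.2f}")

# A model that guesses uniformly over V tokens has loss log(V) and
# therefore perplexity V. V here is an assumption, see the text above.
assumed_vocab_size = 28996
print(f"uniform-guess perplexity: {perplexity(math.log(assumed_vocab_size)):.0f}")
```

The implied loss of roughly 6.88 is in the same range as the training losses logged near the end of the run.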

You can now upload the result of the training to the 🤗 Hub:

In [36]:
# trainer.push_to_hub()

Once the model is shared, other users can load it with the identifier `"your-username/the-name-you-picked"`, for instance:

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("sgugger/my-awesome-model")
```

## Next steps

Check out the full list of [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to get more of a feel for how IPUs perform on other tasks.

In [39]:
!pip list

Package                  Version
------------------------ ------------
accelerate               0.21.0
agate                    1.7.1
agate-dbf                0.2.2
agate-excel              0.2.5
agate-sql                0.5.9
aiofiles                 22.1.0
aiohttp                  3.8.5
aiosignal                1.3.1
aiosqlite                0.19.0
anyio                    3.7.0
argon2-cffi              21.3.0
argon2-cffi-bindings     21.2.0
arrow                    1.2.3
asttokens                2.2.1
async-timeout            4.0.3
attrs                    23.1.0
awscli                   1.27.165
Babel                    2.12.1
backcall                 0.2.0
beautifulsoup4           4.12.2
bleach                   6.0.0
boto3                    1.26.165
botocore                 1.29.165
certifi                  2023.5.7
cffi                     1.15.1
charset-normalizer       3.1.0
cmake                    3.26.3
colorama                 0.4.4
coloredlogs              15.0.1
comm   