First of all, ensure your environment has the latest version of  [🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore) installed:

In [None]:
%pip install git+https://github.com/huggingface/optimum-graphcore.git

Next, ensure all required packages for this notebook are installed.

In [None]:
%pip install datasets
%pip install evaluate
%pip install tokenizers
%pip install matplotlib
%pip install scipy
%pip install --force-reinstall huggingface_hub==0.11.1;

Let's start by importing the `transformers` and `optimum.graphcore` libraries, and printing the versions we are using.

In [None]:
import transformers
import optimum.graphcore

print(transformers.__version__)
print(optimum.graphcore.__version__)

At the end of this notebook, you can share your model with the community and access it easily through Hugging Face. To enable uploading your checkpoint to the Hugging Face Hub, there are some short set-up steps to follow first.

First, create an authentication token on the Hugging Face website ([sign up here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter your token when prompted:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Git-lfs must also be installed to enable large file storage when pushing to the hub:

In [None]:
! apt install git-lfs

# Faster question-answering with SQuAD using PackedBERT

This notebook describes how to fine-tune BERT from [🤗 Transformers](https://github.com/huggingface/transformers) for question-answering on the SQuAD (v1) dataset using [packing](https://towardsdatascience.com/introducing-packed-bert-for-2x-faster-training-in-natural-language-processing-eadb749962b1), an optimisation method originally used for 2x faster BERT pre-training, which can now also provide large throughput increases for fine-tuning and batched inference!

**So, what *is* packing?** The basic idea of 'packing' a dataset is to utilise the requirement for constant-shaped inputs into a model. Instead of padding it with empty, unused space, we can recycle this unused space and fill it with more inputs! The architecture of transformer models like BERT supports this, and lets us optimally use this space to process multiple sequences within one input.

**And here is why you might want to use it:** When a single input contains multiple sequences, those sequences are processed in parallel in a single pass. In many cases this increases the 'effective' batch size of the model by a considerable factor and, most importantly, significantly increases model throughput for training and batched inference.

The process of training and validating the `BertForQuestionAnswering` model requires some adaptations to accommodate a packed dataset, and this notebook aims to introduce these on top of the [existing process](https://github.com/huggingface/optimum-graphcore/blob/main/notebooks/question_answering.ipynb) for fine-tuning BERT on the SQuAD dataset using an unmodified dataset.

Let's initialise our training configurations. 

Note here that we define a 'micro' batch size, which is the local batch size that would be passed into the model on the CPU. In this notebook, we are using both data parallelism and pipeline parallelism (see this [tutorial](https://github.com/graphcore/tutorials/tree/master/tutorials/pytorch/efficient_data_loading/walkthrough.ipynb) for more), so the 'global' batch size, i.e. the number of data elements passed for one gradient calculation on the IPU, is calculated for training using the `device_iterations`, `gradient_accumulation_steps` and `replication_factor`. Packing adds a further factor, `max_seq_per_pack` (the maximum number of sequences in a pack), which is discussed after the formula. Without packing:

```
global_training_batch_size = micro_batch_size * device_iterations * gradient_accumulation_steps * replication_factor
```

Depending on your model and the Pod machine you are using, you might need to adjust these three batch-size-related arguments.

`max_seq_per_pack` highlights the benefit of packing multiple sequences into one input sequence given there is enough space for them. It shows that multiple sequences are processed effectively in parallel within the model, using up space that would essentially be padding if one sequence were passed at a time. This is a much more efficient way to send inputs into the model, and improves the global batch size to a best-case-scenario of:

```
global_training_batch_size = micro_batch_size * device_iterations * gradient_accumulation_steps * replication_factor * max_seq_per_pack
```

Realistically, the global batch size will not always be multiplied by the *maximum* number of sequences in a packed sequence, but rather the *average* number of sequences in a packed sequence, and will depend on the sequence length distribution within any given dataset.
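
As a quick sanity check of the formulas above, the following sketch plugs in values used in this notebook: `micro_batch_size=2`, `device_iterations=32`, `gradient_accumulation_steps=32`, a `replication_factor` of 1, `max_seq_per_pack=6` and the average packing factor of roughly 2.2 reported for the SQuAD training set. The helper name `global_batch_size` is ours for illustration and is not part of any library.

```python
def global_batch_size(micro_batch_size, device_iterations,
                      gradient_accumulation_steps, replication_factor,
                      sequences_per_pack=1.0):
    """Number of sequences contributing to one training step."""
    # Number of (packed) inputs per step, as in the first formula
    packs = (micro_batch_size * device_iterations
             * gradient_accumulation_steps * replication_factor)
    # Each packed input carries several sequences
    return packs * sequences_per_pack


unpacked = global_batch_size(2, 32, 32, 1)                           # no packing
best_case = global_batch_size(2, 32, 32, 1, sequences_per_pack=6)    # every pack is full
realistic = global_batch_size(2, 32, 32, 1, sequences_per_pack=2.2)  # average packing factor

print(unpacked, best_case, realistic)
```

The gap between the best-case and realistic values shows why the average packing factor, rather than `max_seq_per_pack`, is the number to keep in mind when tuning hyperparameters.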

In [None]:
model_checkpoint = "bert-base-uncased"  # Default uncased pre-trained BERT checkpoint
ipu_config_name = "Graphcore/bert-base-uncased"  # Default Graphcore IPU config initialisation for pre-trained BERT
max_seq_length = 384  # The maximum sequence length allowed for sequences in the model.
gradient_accumulation_steps = 32  # Gradient accumulation steps for training the model on the IPU.
device_iterations = 32  # Number of iterations the IPU runs before returning control to the host.
micro_batch_size = 2  # Local ('micro') batch size passed into the model.
model_task = "squad"  # Name of the dataset and metric to load.

Gradients are not calculated during validation, so gradient accumulation is not applicable, and the global batch size for validation can be defined separately as:

```
global_validation_batch_size=micro_batch_size*device_iterations*replication_factor*max_seq_per_pack
```

In Optimum, we can define inference-specific `device iterations` and `replication factor`, which can be adjusted to create larger batches to compensate for the lack of a gradient accumulation factor.

Values for machine size and cache directories can be configured through environment variables or directly in the notebook:

In [None]:
import os

pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod4")
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache/") + "/packed_bert_squad/"

## Loading the dataset

The next step is to use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the dataset from the hub, and to use the  [🤗 Evaluate](https://github.com/huggingface/evaluate) library to load the evaluation metrics for the SQuAD model. This will allow easy performance metric analysis during validation.

In [None]:
from datasets import load_dataset
import evaluate

dataset = load_dataset(model_task)  # Load dataset
metric = evaluate.load(model_task)  # Load metric for dataset

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for each split of the dataset (for SQuAD, the training and validation sets):

In [None]:
dataset

To access an actual element, you need to select a split first, then provide an index:

In [None]:
dataset["train"][0]

In the SQuAD dataset, we have a `question`, its `context` (an excerpt of text which includes the answer as well as surrounding context) and the `answers` key, which holds the start position of the answer in the context as well as the answer text itself. For a different or custom question-answering dataset, these fields may have different names but serve the same purpose, so pre-defining them is useful.

We define a configuration naming these keys in the raw dataset, which needs to be pre-processed and tokenised before being passed into the model. The key names may change for custom datasets, but their usage generally stays the same for a similar fine-tuning task.

In [None]:
question_key="question"
context_key="context"
answer_key="answers"
train = True
validate = True

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

The `Dataset` class is also imported, which allows us to convert our modified and tokenized columns from dictionary form into a dataset.

In [None]:
from transformers import AutoTokenizer
from datasets import Dataset 

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

For SQuAD, we define a custom function to handle the overflows and offset mapping created by generating tokenised inputs from sequences, as well as the start and end positions of the answers which need to be translated from positions of characters to positions of tokens.

The first step is to tokenize the dataset using the tokenizer. Note here that for packing, it is important to **not** pad the dataset, so `padding` should be set to `False`. If we pad, we will have to un-pad when packing sequences into a packed sequence, which is inefficient.

For more information on the preprocessing function, see [the original (unpacked) question-answering notebook](../natural-language-processing/other-use-cases/question_answering.ipynb). In this case, we can import the preprocessing directly from `utils.packing`, ready *without* padding for PackedBERT.

In [None]:
from utils.packing.qa_utils import preprocess_packed_qa

raw_train_dataset = dataset['train']

tokenized_training_dataset = preprocess_packed_qa(
    dataset=raw_train_dataset,
    tokenizer=tokenizer,
    question_key=question_key,
    context_key=context_key,
    answer_key=answer_key,
    sequence_length=max_seq_length,
    padding=False,
    train=True
)


raw_validation_dataset = dataset['validation']

tokenized_validation_dataset = preprocess_packed_qa(
    dataset=raw_validation_dataset,
    tokenizer=tokenizer,
    question_key=question_key,
    context_key=context_key,
    answer_key=answer_key,
    sequence_length=max_seq_length,
    padding=False,
    train=False
)

## Packing the dataset

To implement packing, we need to pack our dataset first. Each new element will be a "pack" containing at most `max_seq_per_pack` sequences.

In [None]:
max_seq_per_pack = 6

We also define the number of labels in our dataset. SQuAD is not a classification task, so here this means the number of outputs, i.e. the positions returned by the model. It is set to 2, corresponding to the start and end positions.

In [None]:
num_labels = 2
problem_type = 'question_answering'

### Packing algorithm

In order to pack efficiently, we will use a histogram-based algorithm: shortest-pack-first histogram packing (SPFHP) presented in the [blog post](https://www.graphcore.ai/posts/introducing-packed-bert-for-2x-faster-training-in-natural-language-processing) adapted from the [blog code](https://github.com/graphcore/tutorials/tree/master/blogs_code/packedBERT). The full process of packing the dataset consists of four steps:

1. Create a histogram of the sequence lengths of the dataset.
2. Generate the 'strategy' for the dataset using one of the state-of-the-art packing algorithms, which maps out the order and indices of the sequences that need to be packed together.
3. Use this strategy to create the actual dataset, concatenating the tokenized features together for each column in the dataset, including the labels.
4. Finally, pass these new columns into a custom PyTorch dataset, ready to be passed to the PopTorch dataloader!
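
To make steps 1 and 2 more concrete, here is a minimal sketch of a packing strategy. It is deliberately *not* SPFHP: it uses a simpler greedy first-fit-decreasing heuristic, and the function name `pack_first_fit_decreasing` and the example lengths are hypothetical. It only illustrates how a histogram and a list of packs (groups of sequence indices) can be derived from sequence lengths.

```python
from collections import Counter


def pack_first_fit_decreasing(lengths, max_seq_length, max_seq_per_pack):
    """Greedy packing sketch: returns (histogram, packs of sequence indices)."""
    # Step 1: histogram of sequence lengths (length -> number of sequences)
    histogram = Counter(lengths)

    # Step 2: visit sequences from longest to shortest and place each one in
    # the first pack that has enough free tokens and free sequence slots
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs = []       # each pack is a list of sequence indices
    free_space = []  # remaining tokens in each pack
    for idx in order:
        length = lengths[idx]
        if length > max_seq_length:
            raise ValueError(f"Sequence {idx} is longer than max_seq_length")
        for p, pack in enumerate(packs):
            if free_space[p] >= length and len(pack) < max_seq_per_pack:
                pack.append(idx)
                free_space[p] -= length
                break
        else:
            # No existing pack can take this sequence, so open a new one
            packs.append([idx])
            free_space.append(max_seq_length - length)
    return histogram, packs


example_lengths = [100, 120, 384, 50, 200, 30, 180]  # hypothetical token counts
histogram, packs = pack_first_fit_decreasing(example_lengths, 384, 3)
print(packs)  # [[2], [4, 6], [1, 0, 3], [5]]
print(len(example_lengths) / len(packs))  # average packing factor: 1.75
```

Histogram-based algorithms such as SPFHP operate on the length histogram rather than on individual sequences, which is what keeps them fast on large datasets; the sketch above trades that efficiency for readability.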

These steps have been simplified through the easy-to-use `utils.packing` available in Graphcore Optimum. You can simply generate the packed dataset after the usual tokenization and preprocessing by passing all necessary packing configuration to the `PackedDatasetCreator` class, and generate the ready-to-use PyTorch dataset with `.create()`.

Within the function, there are some column names used by default. The expected default columns for question-answering include:
* `input_ids`
* `attention_mask`
* `token_type_ids`
* `start_positions`
* `end_positions`

These should all be generated automatically when tokenizing the SQuAD dataset for BERT.

The `PackedDatasetCreator` requires different instantiations for different datasets, so it must be called separately for each of our dataset splits. We can set either `training`, `validation` or `inference` to `True` as needed.

In [None]:
from utils.packing.dataset_creator import PackedDatasetCreator

train_data_packer = PackedDatasetCreator(
    tokenized_dataset = tokenized_training_dataset,
    max_sequence_length = max_seq_length,
    max_sequences_per_pack = max_seq_per_pack,
    training = True,
    num_labels = num_labels,
    problem_type = problem_type,
    algorithm = 'SPFHP'
)

val_data_packer = PackedDatasetCreator(
    tokenized_dataset = tokenized_validation_dataset,
    max_sequence_length = max_seq_length,
    max_sequences_per_pack = max_seq_per_pack,
    validation = True,
    num_labels = num_labels,
    problem_type = problem_type,
    algorithm = 'SPFHP'
)

This will create the strategy and initialise the necessary parameters for packing the dataset. The output reports an ideal speed-up of approximately 2.2x over the unpacked dataset, which corresponds directly to the average packing factor: the average number of sequences within one pack.

The `PackedDatasetCreator` class also has some other features we do not use here for training, such as `pad_to_global_batch_size`. This feature is useful for performing batched inference on a large number of samples when we do not want to lose any of them when creating data iterators using the `poptorch.DataLoader`. It applies 'vertical' padding to the dataset, adding filler rows to bring the dataset size up to a value divisible by the global batch size, which allows the largest possible batch sizes to be used without any loss of data.
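
The arithmetic behind this 'vertical' padding can be sketched as follows. This is only an illustration of the idea, not the code used inside `PackedDatasetCreator`; the function name `filler_rows_needed` and the example sizes are hypothetical.

```python
def filler_rows_needed(num_packs, global_batch_size):
    """Rows to append so that num_packs becomes divisible by global_batch_size."""
    return (-num_packs) % global_batch_size


# Hypothetical example: 1000 packs with a global batch size of 512
extra = filler_rows_needed(1000, 512)
print(extra)           # 24
print((1000 + extra))  # 1024, i.e. exactly two full batches
```

Without the filler rows, a dataloader that drops incomplete batches would discard the last 488 packs in this example.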

You can also view the histogram generated in the first step of the packing process, to observe whether the distribution of sequence lengths in the dataset will benefit from packing. As a general rule, as long as the average length of the sequences in the dataset is 50% or less of the maximum sequence length, packing will offer at least a 2x throughput benefit. In other words: `throughput_increase ≈ max_seq_len/mean_seq_len`
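
As a rough illustration of this rule of thumb, the estimate can be computed directly from a list of sequence lengths. The lengths below are made up for the example, and the function name is ours. The estimate is optimistic: it ignores the `max_seq_per_pack` cap and any space left over by imperfect packing.

```python
def estimated_packing_speedup(sequence_lengths, max_seq_len):
    """Approximate throughput increase: max_seq_len / mean_seq_len."""
    mean_seq_len = sum(sequence_lengths) / len(sequence_lengths)
    return max_seq_len / mean_seq_len


lengths = [96, 128, 160, 192, 384]  # hypothetical token counts, mean = 192
speedup = estimated_packing_speedup(lengths, max_seq_len=384)
print(speedup)  # 2.0
```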

Many datasets have distributions with much smaller average lengths, and will benefit much more. We can easily observe this distribution by retrieving and plotting the histogram from the data class:

In [None]:
import matplotlib.pyplot as plt

train_histogram = train_data_packer.histogram

plt.hist(train_histogram, bins = [k for k in range(0,max_seq_length,10)]) 
plt.title("Sequence length histogram") 
plt.xlabel('Sequence lengths')
plt.ylabel('Frequency')
plt.show()

Now we need to create the actual packed dataset. This is the third step of the packing process outlined above.

In this stage, we take the strategy for mapping the sequences by size into 'packs' that was generated by the packing algorithm, and use this to extract the sequences from the tokenized dataset, inserting them into packs for each column in the dataset. Any remaining space in a pack after the sequences have been concatenated is padded to bring all sequences up to the maximum sequence length.

**Some key features unique to packed datasets are worth mentioning here**:

- A specific `attention_mask` is generated: It contains a unique index for each sequence of the pack and `0` for the remaining padding tokens. This, essentially, tells the model where to "look" from the perspective of a single token, ignoring any encoded information (such as a different sequence) that is not relevant to that token.
    - Example of 3 sequences in a pack: `attention_mask = [1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,0,0,0]`
    - Compared to a single sequence in an unpacked input `attention_mask = [1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0]`
    

- The `position_ids` of a pack contain the concatenated `position_ids` of each sequence.
    - For instance, given 3 sequences: `[0,1,2,3,4] + [0,1,2,3] + [0,1,2] -> [1,2,3,4,1,2,3,1,2,...,0,0,0]` (note: the CLS tokens, with position ID `0`, are also moved to the end of the pack)
    
    
- For SQuAD, during training, answers are determined using a start position and end position within the sequence. During preprocessing, these were converted from character positions to token positions. Now, during packing, as tokenized sequences are effectively being concatenated along the same dimension, the positions of the answer will change for any sequence that is not starting at index 0 within a pack. For example, in a pack with 2 sequences:
    - Answer positions before packing:
    ```
    Length of sequence 1: 100 tokens (index 0 to 99)   , start position: 30, end position: 35
    Length of sequence 2: 120 tokens (index 0 to 119)  , start position: 15, end position: 25
    ```
    - Answer positions after packing:
    ```
    Length of sequence 1 in pack 1: 100 tokens (index 0 to 99)   , start position: 30, end position: 35
    Length of sequence 2 in pack 1: 120 tokens (index 100 to 219), start position: 115, end position: 125 
    ```

    - The positions have been shifted by the total length of preceding sequences in the pack. We call this the `positions_offset`.
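
The attention mask and answer offset described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the code in `utils.packing`: the function name `pack_sequences` is hypothetical, and the sketch leaves out `position_ids` and the relocation of the CLS tokens.

```python
def pack_sequences(seq_lengths, answer_spans, max_seq_length):
    """Build a packed attention mask and shift each answer span by its offset."""
    if sum(seq_lengths) > max_seq_length:
        raise ValueError("Sequences do not fit in one pack")

    attention_mask = []
    shifted_spans = []
    positions_offset = 0  # total length of the preceding sequences in the pack
    pairs = zip(seq_lengths, answer_spans)
    for seq_index, (length, (start, end)) in enumerate(pairs, start=1):
        # Every token of a sequence carries that sequence's index (1, 2, 3, ...)
        attention_mask.extend([seq_index] * length)
        shifted_spans.append((start + positions_offset, end + positions_offset))
        positions_offset += length

    # Remaining space is padding, marked with 0
    attention_mask.extend([0] * (max_seq_length - positions_offset))
    return attention_mask, shifted_spans


# The two-sequence example from the text
mask, spans = pack_sequences([100, 120], [(30, 35), (15, 25)], max_seq_length=384)
print(spans)                                # [(30, 35), (115, 125)]
print(mask[99], mask[100], mask[219], mask[220])  # 1 2 2 0
```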


To create a dataloader-ready packed dataset, all you need to do is call the `create()` method:

In [None]:
packed_train_dataset = train_data_packer.create()
packed_val_dataset = val_data_packer.create()

Let's visualize one sample of the new `packed_train_dataset`:

In [None]:
packed_train_dataset[133]

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it.

### Implement Packed BERT

Some model modifications are required to make packing work with BERT. For SQuAD, we create a custom output class to separate the logits according to each of the sequences within the pack and calculate the loss. The existing class `BertForQuestionAnswering` is extended to `PipelinedPackedBertForQuestionAnswering` which incorporates the required modifications to the model. The crux of these changes is to introduce the new attention mask, and modify the hidden layer output of the model to mask any padded inputs from the logits.

First let's load a default BERT configuration using `AutoConfig`. The config includes a new parameter we must set, `max_sequences_per_pack`, which informs the model of the maximum number of sequences it will need to 'unpack' in the model output. It also allows us to clearly define the `num_labels` and `problem_type` for this model.

In [None]:
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_checkpoint)
config.max_sequences_per_pack = max_seq_per_pack
config.num_labels = num_labels
config.problem_type = problem_type

Now we can instantiate the model class with the config, loading the weights from the model checkpoint. For SQuAD, we can determine the number of "labels" as the two output types that will determine whether answers are correct or not, i.e., the start and end position.

In [None]:
import torch
import numpy as np
torch.manual_seed(43)
np.random.seed(43)
 
from models.modeling_bert_packed import PipelinedPackedBertForQuestionAnswering

model = PipelinedPackedBertForQuestionAnswering.from_pretrained(model_checkpoint, config=config)

The warning is telling us we are throwing away some weights and randomly initializing others. This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for question answering, for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

We can first test the model on CPU.

In [None]:
# test the model on CPU
from transformers.data.data_collator import default_data_collator

loader = torch.utils.data.DataLoader(packed_train_dataset,
                             batch_size=2,
                             shuffle=True,
                             drop_last=True,
                             collate_fn=default_data_collator)
data = next(iter(loader))
o = model(**data)
print("Model output:", o)

Now, let's prepare the model for IPU.

First, we set the model in half precision:

In [None]:
model.half()

### Define validation metrics for SQuAD

Before training and evaluating, a custom postprocessing function needs to be defined for SQuAD. This is because we need to map the predictions of the model back to parts of the context in terms of the character positions in the original untokenized samples. The model predicts logits for the start and end token position of the answer.

The purpose of the function is to identify each of the tokenized features according to their `example_ids` and map the start and end token positions for the output, taking the top-*n* logit indices and discarding all invalid solutions. It then uses the `offset_mapping` to map the start and end token-level positions back to character-level positions within the context, and generates a text answer using the original context. This text prediction can then be used to calculate accuracy metrics and compared to the target answer present in the dataset.
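
The core of that selection step, for a single feature, can be sketched as below. This is an illustrative simplification rather than the implementation in `utils.packing`: the function name `best_answer_span` and the default values for `n_best` and `max_answer_length` are assumptions, and the mapping back to characters via `offset_mapping` is left out.

```python
def best_answer_span(start_logits, end_logits, n_best=20, max_answer_length=30):
    """Return (score, start, end) for the best valid span, or None if none is valid."""
    # Indices of the n_best highest start and end logits
    start_candidates = sorted(
        range(len(start_logits)), key=lambda i: start_logits[i], reverse=True
    )[:n_best]
    end_candidates = sorted(
        range(len(end_logits)), key=lambda i: end_logits[i], reverse=True
    )[:n_best]

    best = None
    for start in start_candidates:
        for end in end_candidates:
            # Discard invalid spans: end before start, or answer too long
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


# Tiny made-up example with four tokens
result = best_answer_span([0.1, 5.0, 0.2, 0.3], [4.0, 0.1, 3.0, 0.2], n_best=2)
print(result)  # (8.0, 1, 2)
```

Note that the highest end logit (index 0) is rejected here because it lies before the best start position, which is exactly the kind of invalid solution the postprocessing discards.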

The `postprocess_packed_qa_predictions()` function is adapted for packing from the `postprocess_qa_predictions()` function in the existing [tutorial for SQuAD finetuning for the IPU](https://github.com/huggingface/optimum-graphcore/blob/main/notebooks/question_answering.ipynb) for an unpacked dataset. That tutorial gives a full description of how the function is used.

The main changes to the function for packing include: 
* Instead of iterating through all the features in the tokenized dataset, and obtaining the `example_id` field created during tokenization of the validation dataset, this function iterates through each feature within each pack, obtaining the corresponding `example_id` for each feature within the pack. 

* It saves the index of the pack in the dataset, **as well as the index of the feature within the pack**, to allow the function to easily and linearly obtain the features to perform validation on.

This postprocessing is available ready-to-use from the packing utils (`utils.packing`) and can simply be imported.

In [None]:
from utils.packing.qa_utils import postprocess_packed_qa_predictions

Finally, a `compute_validation_metrics` function is created to take in the postprocessed predictions. This obtains the answers from the dataset, maps them according to the `example_id` to the corresponding prediction, and uses `metric` from the 🤗 Evaluate library to compute the relevant metrics for SQuAD, including an "exact match" accuracy, as well as F1 score, for each answer. 

In [None]:
def compute_validation_metrics(predictions, raw_validation_dataset, packed_validation_dataset_unformatted, metric):
    
    target_answers = [
        {"id": ex["id"], "answers": ex["answers"]} for ex in raw_validation_dataset
    ]
    
    final_predictions = postprocess_packed_qa_predictions(
        raw_validation_dataset, packed_validation_dataset_unformatted, predictions
    )

    formatted_predictions = [
        {"id": k, "prediction_text": v} for k, v in final_predictions.items()
    ]

    metrics = metric.compute(predictions=formatted_predictions, references=target_answers)
    
    return metrics
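
For intuition about what the two metrics measure, here is a small self-contained sketch of exact match and token-level F1 for a single prediction and reference. It follows the normalisation commonly used for SQuAD v1.1 (lowercasing, removing punctuation and articles, collapsing whitespace) to the best of our knowledge; the 🤗 Evaluate implementation remains the authoritative one, and it additionally takes the maximum over all reference answers and reports averaged percentages.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Count tokens shared between prediction and reference (with multiplicity)
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))       # 1.0
print(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"))  # 2/3: precision 0.5, recall 1.0
```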


### Train and validate the model using the 🤗 Optimum Graphcore `Trainer`

Now let's prepare the model for IPU, instantiate the options and machine configurations and create an IPU Trainer to efficiently and easily perform training on the IPU in just a few lines.

We need to define the `IPUConfig`, which is a class that specifies attributes and configuration parameters to compile and put the model on the device. We initialize it with the config name or path we set earlier, and later pass it to the `IPUTrainer`.

As we are using a pre-trained checkpoint, we can use the existing IPU configuration for `"Graphcore/bert-base-uncased"` for the custom model. This should require no changes: even though the model has been modified to be compatible with a packed dataset, the pipelining stages and IPU options remain the same.

Some of the options are specified explicitly when defining the `ipu_config` to highlight the global batch size. These use the configurations defined at the beginning of this notebook. Note that we can also define inference-specific device iterations and replication factors for performing validation on the model, to modify the validation global batch size.

In [None]:
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments

ipu_config = IPUConfig.from_pretrained(
    ipu_config_name,
    executable_cache_dir = executable_cache_dir,
    gradient_accumulation_steps=gradient_accumulation_steps,
    device_iterations=device_iterations,
    replication_factor=1,
    embedding_serialization_factor=1,
    inference_device_iterations= 64,
    inference_replication_factor=1,
)

To instantiate an `IPUTrainer`, we will need to define `IPUTrainingArguments`, which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
training_args = IPUTrainingArguments(
    output_dir=f"./{model_checkpoint}-{model_task}",
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    learning_rate=9e-05,
    loss_scaling=64.0,
    weight_decay=0.01,
    warmup_ratio=0.25,
    lr_scheduler_type='cosine',
    pod_type=pod_type,
    gradient_accumulation_steps=gradient_accumulation_steps,
    dataloader_drop_last=True,
    dataloader_num_workers=64,
    logging_steps=5
)

**Note that we do not set evaluation to be performed during the training process for SQuAD**. This is due to the custom postprocessing steps required to extract text-level answers for SQuAD: they need multiple inputs, such as the tokenized and raw datasets, whereas the `preprocess_logits_for_metrics` argument of the `IPUTrainer` can only operate on the logits. Therefore, validation is done after training.

We will need a data collator that will batch our processed examples together. Here we will use the default data collator imported from the Transformers library, which is passed to the `IPUTrainer` class.

Then we just need to pass all of this along with our datasets to the IPUTrainer:

In [None]:
from transformers import default_data_collator

trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=packed_train_dataset,
    data_collator=default_data_collator
)


We can now finetune our model by just calling the train method:

In [None]:
train_run_metrics = trainer.train()

If you successfully logged in at the beginning of this notebook, you can now upload the result of the training to the Hub by executing this instruction:


In [None]:
trainer.push_to_hub()

Then save the model with the model checkpoint name.

In [None]:
trainer.save_model(f"./{model_checkpoint}-{model_task}")

We can then perform the evaluation by using the `IPUTrainer`'s `predict` functionality. This provides all of the raw predictions for the packed inputs for validation. By default, this uses the global batch size defined specifically for inference through the `IPUConfig` and `IPUTrainingArguments`.

In [None]:
raw_predictions = trainer.predict(packed_val_dataset)

Once the predictions have been obtained, the validation metrics can be computed by passing them into the `compute_validation_metrics` function. This, as described previously, performs the necessary postprocessing on the logits and obtains text answers, then computes the accuracy metrics (exact match and F1 score) for SQuAD finetuning.

In [None]:
val_metrics = compute_validation_metrics(
    raw_predictions.predictions, raw_validation_dataset, packed_val_dataset, metric)

print(val_metrics)

## Faster Inference

When training, the packing factor affects the convergence and hyperparameters in a similar way to a large increase in batch size. However, for inference-only runs, we are free to use a bigger packing factor to speed it up. Let's try it on SQuAD with max_seq_per_pack = 12, and sequence length set to 512.

In [None]:
max_seq_per_pack = 12

In [None]:
dataset = load_dataset("squad")
raw_train_dataset = dataset['train']
max_seq_length = 512

# Let's use the train dataset to have more features to infer over
tokenized_inference_dataset = preprocess_packed_qa(
    dataset=raw_train_dataset,
    tokenizer=tokenizer,
    question_key=question_key,
    context_key=context_key,
    answer_key=answer_key,
    sequence_length=max_seq_length,
    padding=False,
    train=False
)

packed_inference_dataset = PackedDatasetCreator(
    tokenized_dataset = tokenized_inference_dataset,
    max_sequence_length = max_seq_length,
    max_sequences_per_pack = max_seq_per_pack,
    inference=True,
    problem_type = problem_type,
).create()

We can see that the average packing factor has improved from 2.2 to 2.95, allowing an approximate 3x throughput speed-up from the base unpacked model. This is not nearly as much as the maximum sequences per pack limit, due to the larger sequence lengths in the SQuAD dataset, but still allows a 3x speedup for inference!

Let's also modify the configuration of the model for inference. To speed things up, we can replicate a one-IPU run (`ipus_per_replica = 1`) over four IPUs by changing the `inference_replication_factor`. After this, we can re-initialise the model and the `IPUTrainer` with the existing arguments.

In [None]:
ipu_config.layers_per_ipu = [12]
ipu_config.inference_device_iterations = 32
ipu_config.inference_replication_factor = 4
ipu_config.ipus_per_replica = 1

To test inference throughput, we can just use a default checkpoint to run the model and evaluate the speed of packing. If you saved the model locally or on the hub earlier, you can replace `model_checkpoint` with the path to your model to perform inference on the fine-tuned weights!

In [None]:
model_checkpoint = "bert-base-uncased"

# Load checkpoint locally:
# model_checkpoint = f"./{model_checkpoint}-{model_task}"

# Load from Hugging Face Hub instead:
# model_checkpoint = f"<your_username>/{model_checkpoint}-{model_task}"

model = PipelinedPackedBertForQuestionAnswering.from_pretrained(
    model_checkpoint, config=config)

In [None]:
args = IPUTrainingArguments(
    "/tmp/"+f"{model_checkpoint}-{model_task}-fast-inf",
    per_device_eval_batch_size=8,
    dataloader_mode="async_rebatched",
    dataloader_drop_last=True,
    logging_steps=10,
    pod_type=pod_type
)

trainer = IPUTrainer(
    model,
    ipu_config,
    args,
    eval_dataset=packed_inference_dataset
)

In [None]:
trainer.evaluate()

Using these simple optimisations and the increase in maximum sequences per pack, we can see a throughput increase to approximately **8000 sequences per second** - remember that to obtain the actual throughput we multiply the packed samples/s by the average packing factor - highlighting the benefits of using packing! 