# Faster Multi-label Text Classification using PackedBERT on IPUs

This notebook builds on the process in the notebook on [fine-tuning BERT on a text classification task](text_classification.ipynb). Here we show how to implement packing for BERT for multi-label classification. [Packing](https://www.graphcore.ai/posts/introducing-packed-bert-for-2x-faster-training-in-natural-language-processing) is an optimisation method originally used for 2x faster BERT pre-training, which can now also provide massive throughput increases for **fine-tuning** and **batched inference**! 

**So, what *is* packing?** The basic idea of "packing" a dataset is to utilise the requirement for constant-shaped inputs into a model. Instead of padding it with empty, unused space, we can recycle this unused space and fill it with more inputs! The architecture of transformer models like BERT supports this, and lets us optimally use this space to process multiple sequences within one input.

**And here is why you might want to use it:** A single input that contains multiple sequences lets the model process those sequences in parallel in a single pass. In many cases this increases the "effective" batch size of the model by a considerable factor and, most importantly, significantly increases model throughput for training and batched inference.

We will fine-tune BERT on the [GoEmotions](https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html) dataset using packing. This notebook outlines how to easily enable packing for BERT when performing fine-tuning and inference on a text-classification task in 🤗 Graphcore Optimum, resulting in an impressive 5-9x faster training and inference run-time for the dataset.

You can read more about packing in the original paper on [efficient sequence packing without cross-contamination](https://arxiv.org/abs/2107.02027).

![GoEmotions dataset (Source: GoogleBlog)](../images/go_emotions.png)

The dataset consists of 58k comments labelled for 27 different emotion categories (and a 28th "neutral" category). This dataset is used for multi-label, multi-class classification. The dataset format and categories can be viewed on the [Hugging Face Hub](https://huggingface.co/datasets/go_emotions).

|  Domain | Tasks | Model | Datasets | Workflow |   Number of IPUs   | Execution time |
|---------|-------|-------|----------|----------|--------------|--------------|
|  | multi-label text classification | PackedBERT | GoEmotions | fine-tuning, inference | recommended: 16XX (min: 4X) | 20Xmn (X1h20mn)   |

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

## Dependencies and configuration


Also make sure all the packages required for this notebook are installed.

In [None]:
%pip install scikit-learn;
%pip install datasets
%pip install evaluate
%pip install tokenizers
%pip install matplotlib
%pip install scipy
%pip install --force-reinstall huggingface_hub==0.11.1;

We use both data parallelism and pipeline parallelism (see this [tutorial on efficient data loading](https://github.com/graphcore/tutorials/blob/master/tutorials/pytorch/efficient_data_loading/walkthrough.ipynb) for more information). Therefore the global batch size, which is the actual number of samples used for the weight update, is determined using four factors:

    global batch size = micro_batch_size * gradient_accumulation_steps * device_iterations * replication_factor

Replication factor is determined by `pod_type`, the type of IPU Pod. `pod_type` is used as a key to select the replication factor from a dictionary defined in the IPU config file. For example, the dictionary in the IPU config file `Graphcore/roberta-base-ipu` looks like this:

    "replication_factor": {"pod4": 1, "pod8": 2, "pod16": 4, "pod32": 8, "pod64": 16, "default": 1}

Depending on your model and the IPU Pod machine you are using, you might need to adjust these batch-size-related arguments.

By default this notebook is configured to run on 4 IPUs.

We also define `max_seq_length`, which is the maximum length a sequence can be; all sequences will be padded to this length. Therefore, `max_seq_length` should not be larger than the maximum length of the model. Given the short sequences in `go_emotions`, we can reduce the model maximum input size to `max_seq_length = 256`.

Set these parameters and the rest of the notebook should run smoothly:

In [None]:
model_checkpoint = "bert-base-uncased" # Default uncased pre-trained BERT checkpoint
ipu_config_name = "Graphcore/bert-base-uncased" # Default Graphcore IPU config initialisation for pre-trained BERT
max_seq_length = 256 # The maximum sequence length allowed for sequences in the model.
micro_batch_size = 2 
gradient_accumulation_steps = 39
device_iterations = 32
model_task = 'go_emotions'
num_labels = 28

Gradients are not calculated during validation, so gradient accumulation is not applicable, and the global batch size for validation can be defined separately as:
```
global_validation_batch_size=device_iterations*replication_factor*max_seq_per_pack
```
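
As a quick sanity check, the two formulas above can be evaluated with plain Python. This is only an illustrative sketch: the helper function names are hypothetical (they are not part of Optimum Graphcore), the replication-factor dictionary is the example one shown earlier, and `max_seq_per_pack = 6` is the value this notebook sets later in the packing section.

```python
# Hypothetical helpers that evaluate the batch-size formulas from this section.

def global_batch_size(micro_batch_size, gradient_accumulation_steps,
                      device_iterations, replication_factor):
    """Number of model inputs consumed per weight update."""
    return (micro_batch_size * gradient_accumulation_steps
            * device_iterations * replication_factor)


def global_validation_batch_size(device_iterations, replication_factor,
                                 max_seq_per_pack):
    """Validation formula from this section (no gradient accumulation)."""
    return device_iterations * replication_factor * max_seq_per_pack


# Example dictionary from the IPU config shown above, keyed by pod type.
replication_factors = {"pod4": 1, "pod8": 2, "pod16": 4,
                       "pod32": 8, "pod64": 16, "default": 1}
replication_factor = replication_factors.get("pod4", replication_factors["default"])

# Values defined in this notebook: micro_batch_size=2,
# gradient_accumulation_steps=39, device_iterations=32.
train_gbs = global_batch_size(2, 39, 32, replication_factor)

# Assumption for illustration: 32 device iterations and max_seq_per_pack=6.
val_gbs = global_validation_batch_size(32, replication_factor, 6)

print(train_gbs, val_gbs)
```

Changing `pod_type` only changes the looked-up replication factor, so the global batch size scales linearly with the number of replicas.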


Values for machine size and cache directories can be configured through environment variables or directly in the notebook:

In [None]:
import os

pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod4")
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache/") + "/packed_bert_mlseqcls/"

### Sharing your model with the community

You can share your model with the 🤗 community. You do this by completing the following steps:

1. Store your authentication token from the 🤗 website. [Sign up to 🤗](https://huggingface.co/join) if you haven't already.
2. Execute the following cell and input your username and authentication token.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Git-LFS must also be installed to manage large files:

In [None]:
!apt install git-lfs

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and the 🤗 Evaluate library to get the metric we will use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `evaluate.load`.

In [None]:
from datasets import load_dataset
import evaluate

In [None]:
dataset = load_dataset(model_task)
metric = evaluate.load("roc_auc", "multilabel")

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets.

In [None]:
dataset

To access an actual element, you need to select a split ("train" in the example) and then specify an index:

In [None]:
dataset["train"][0]

To get a sense of what the data looks like, the following function will show some samples picked randomly from the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"])

The metric is an instance of `evaluate.EvaluationModule`, the 🤗 Evaluate counterpart of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric

You can call its `compute` method with your predictions and labels directly and it will return a dictionary containing the metric value(s).

## Preprocessing the data

Before we can feed the text samples to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pre-trained vocabulary), putting them into a format the model expects, as well as generating the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- We get a tokenizer that corresponds to the model architecture we want to use.
- We download the vocabulary used when pre-training this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We pass `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Fast tokenizers are available for almost all models, but if you get an error with the previous call then simply set `use_fast` to False.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [None]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter for what we're doing here, but they are required by the model we will instantiate later. You can learn more about keys in this [tutorial on preprocessing](https://huggingface.co/transformers/preprocessing.html).

To preprocess our dataset, we will need the names of the columns containing the sentence(s). In this case, the column is called `'text'` and it is indexed as such in the tokenization function.

We can then write the function that will preprocess our samples. We feed the samples to the `tokenizer` with two additional arguments. `truncation=True` will ensure that an input longer than the maximum length will be truncated to the maximum length. `max_length=max_seq_length` sets the maximum length of a sequence.

**Note that since we use packing later, we don't set any padding in the tokenizer.**

In [None]:
# no padding for packing
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=max_seq_length)

For multi-label classification, we also need to convert our labels from integer values indicating a category to an N-hot binary format (where N is the maximum number of labels). This makes sure we have constant-sized labels, and all of our labels (one input can have multiple target labels) are present for training. The conversion looks something like this:

```python
unprocessed_labels = [3, 21]  # where 3 and 21 are label categories
# N-hot encoding over the 28 label categories: ones at indices 3 and 21
preprocessed_labels = [0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0]
```


The following function processes one example and converts it to N-hot. The `.map()` functionality available in the `datasets` library allows the function to be applied easily to the entire dataset.

In [None]:
import numpy as np

def id_to_N_hot(example):
    indexes = example['labels']
    label = np.zeros((num_labels,), dtype=int)
    for idx in indexes:
        label[idx] = 1
    example['labels'] = label
    return example

To apply this function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of our `dataset` object we created earlier. This will apply the function to all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in a single command.

In [None]:
encoded_dataset = dataset.map(id_to_N_hot)
encoded_dataset = encoded_dataset.map(preprocess_function, batched=True)

len(encoded_dataset['validation'])

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is able to detect when the function you pass to `map` has changed (and thus to not use the cached data). For instance, it will detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files. You can pass `load_from_cache_file=False` in the call to `map` to not use the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the text samples together into batches. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the text samples in a batch concurrently.

## Packing the dataset

To implement packing, we need to pack our dataset first. Each new element will be a pack containing at most `max_seq_per_pack` sequences.

In [None]:
max_seq_per_pack = 6

The problem type for this task is `multi_label_classification`, which also needs to be defined for the packed model to work.

In [None]:
problem_type = 'multi_label_classification'

### Packing algorithm

In order to pack efficiently, we will use a histogram-based algorithm. The shortest-pack-first histogram packing (SPFHP) was presented in the Graphcore blog post [introducing Packed BERT for a training speedup in natural language processing](https://www.graphcore.ai/posts/introducing-packed-bert-for-2x-faster-training-in-natural-language-processing). We have adapted the [code](https://github.com/graphcore/tutorials/tree/master/blogs_code/packedBERT) from the blog post for this notebook. The full process of packing the dataset consists of four steps:

1. Create a histogram of the sequence lengths of the dataset.
2. Generate the "strategy" for the dataset using one of the state-of-the-art packing algorithms. The strategy maps out the order and indices of the sequences that need to be packed together.
3. Use this strategy to create the actual dataset, concatenating the tokenized features together for each column in the dataset, including the labels.
4. Finally, pass these new columns into a custom PyTorch dataset, ready to be passed to the PopTorch dataloader!
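
To make steps 1 and 2 more concrete, here is a minimal sketch of a greedy, histogram-driven packing strategy. It is written in the spirit of shortest-pack-first packing but is a simplified illustration, not the SPFHP implementation used by `utils.packing`. The function name `greedy_pack` and the toy lengths are hypothetical, and the sketch assumes every sequence has already been truncated to at most `max_seq_length` tokens.

```python
from collections import Counter


def greedy_pack(sequence_lengths, max_seq_length, max_seq_per_pack):
    """Group sequence lengths into packs that fit within max_seq_length."""
    # Step 1: histogram of sequence lengths (length -> number of sequences).
    histogram = Counter(sequence_lengths)

    # Step 2: build the strategy, visiting lengths from longest to shortest.
    packs = []  # each pack is a list of the sequence lengths it contains
    for length in sorted(histogram, reverse=True):
        for _ in range(histogram[length]):
            # Packs that still have room for this sequence.
            candidates = [
                pack for pack in packs
                if sum(pack) + length <= max_seq_length
                and len(pack) < max_seq_per_pack
            ]
            if candidates:
                # Fill the pack that currently holds the fewest tokens.
                min(candidates, key=sum).append(length)
            else:
                packs.append([length])
    return packs


lengths = [200, 120, 100, 60, 50, 30, 20]  # toy data, not GoEmotions
packs = greedy_pack(lengths, max_seq_length=256, max_seq_per_pack=3)
packing_factor = len(lengths) / len(packs)  # average sequences per pack
print(packs, packing_factor)
```

A real strategy would also record which dataset indices go into each pack, so that step 3 can concatenate the matching tokenized features and labels.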

These steps have been simplified through the easy-to-use `utils.packing` package available in Graphcore Optimum. You can simply generate the packed dataset after the usual tokenization and preprocessing by passing all necessary packing configuration to the `PackedDatasetCreator` class, and generate the ready-to-use PyTorch dataset with `.create()`.

Within the function, there are some column names used by default. The expected default columns for text classification include:
* `input_ids`
* `attention_mask`
* `token_type_ids`
* `labels`

These should all be generated automatically when tokenizing any classification dataset for BERT. However, the labels column is not generated by the tokenizer, so it may have a different name in other datasets. For this dataset, the label column is already called `labels`, which is the name the dataset creator expects; we still pass it explicitly through the `custom_label_key` argument so the class can find our labels.

The `PackedDatasetCreator` requires different instantiations for different datasets, so it must be called separately for each of our dataset splits. We can set either `training`, `validation` or `inference` to `True` as needed.

In [None]:
from utils.packing.dataset_creator import PackedDatasetCreator

train_data_packer = PackedDatasetCreator(
    tokenized_dataset = encoded_dataset['train'],
    max_sequence_length = max_seq_length,
    max_sequences_per_pack = max_seq_per_pack,
    training = True,
    num_labels = num_labels,
    problem_type = problem_type,
    algorithm = 'SPFHP',
    custom_label_key = 'labels'
)

val_data_packer = PackedDatasetCreator(
    tokenized_dataset = encoded_dataset['validation'],
    max_sequence_length = max_seq_length,
    max_sequences_per_pack = max_seq_per_pack,
    validation = True,
    num_labels = num_labels,
    problem_type = problem_type,
    algorithm = 'SPFHP',
    custom_label_key = 'labels'
)

This will create the strategy and initialise the necessary parameters for packing the dataset. The output shows that the ideal speed-up is approximately 5.7x compared with the unpacked dataset. This corresponds directly to the average packing factor: the average number of sequences within one pack.

The `PackedDatasetCreator` class also has some other features we do not use here for training, such as `pad_to_global_batch_size`, a feature useful for performing batched inference on a large number of samples when we do not want to lose any of them. When data iterators are created with `poptorch.DataLoader`, this feature applies 'vertical' padding to the dataset, adding filler rows to bring the dataset size up to a value divisible by the global batch size. This allows the largest possible batch sizes to be used without any loss of data.

You can also view the histogram generated in the first step of the packing process, to observe whether the distribution of sequence lengths in the dataset will benefit from packing. As a general rule, as long as the average length of the sequences in the dataset is 50% or less of the maximum sequence length, packing will offer at least a 2x throughput benefit. In other words: `throughput_increase ≈ max_seq_len/mean_seq_len`.

Many datasets have distributions with much smaller average lengths, and will benefit much more. We can easily observe this distribution by retrieving and plotting the histogram from the data class:

In [None]:
import matplotlib.pyplot as plt

train_histogram = train_data_packer.histogram

plt.hist(train_histogram, bins = [k for k in range(0,max_seq_length,10)]) 
plt.title("Sequence length histogram") 
plt.xlabel('Sequence lengths')
plt.ylabel('Frequency')
plt.show()
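
The rule of thumb `throughput_increase ≈ max_seq_len/mean_seq_len` can be turned into a rough estimate before packing anything. The sketch below is an illustration only: the function name is hypothetical and the lengths are made-up toy values. It also caps the estimate at `max_seq_per_pack`, since a pack can never hold more sequences than that.

```python
def estimated_packing_speedup(sequence_lengths, max_seq_length, max_seq_per_pack=None):
    """Upper-bound estimate of the packing factor from the mean sequence length."""
    mean_length = sum(sequence_lengths) / len(sequence_lengths)
    speedup = max_seq_length / mean_length
    if max_seq_per_pack is not None:
        # The average number of sequences per pack cannot exceed the pack limit.
        speedup = min(speedup, max_seq_per_pack)
    return speedup


# Mean length is exactly half of max_seq_length, so the estimate is 2x.
half_full = estimated_packing_speedup([128, 128], max_seq_length=256)

# Very short sequences: the raw ratio exceeds the limit of 6 sequences per pack.
capped = estimated_packing_speedup([16, 32], max_seq_length=256, max_seq_per_pack=6)

print(half_full, capped)
```

The value reported by the packing algorithm is usually somewhat lower than this estimate, because real packs cannot always be filled completely.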

Now we need to create the actual packed dataset (step 3 in the packing process outlined above).

In this stage, we take the strategy generated by the packing algorithm, which maps the sequences by size into packs, and use it to extract the sequences from the tokenized dataset, inserting them into packs for each column in the dataset. Any remaining space in a pack after the sequences have been concatenated is padded to bring every pack up to the maximum sequence length.

Some key features unique to packed datasets are worth mentioning here:

- The specific attention mask (`attention_mask`) that is generated contains a unique index for each sequence of the pack and `0` for the remaining padding tokens. This, essentially, tells the model where to look from the perspective of a single token, ignoring any encoded information (such as a different sequence) that is not relevant to that token.
    - Example of 3 sequences: `attention_mask = [1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,0,...,0,1,2,3]`


- The [CLS] tokens of each sequence must be moved to the end of the pack.
    - For instance: `[CLS,a,b,c] + [CLS, d,e,f] + [CLS, g,h,i] -> [a,b,c,d,e,f,g,h,i,...,CLS,CLS,CLS]`
    

- `position_ids` for a pack contains the concatenated `position_ids` of each sequence.
    - For instance, given 3 sequences: `[0,1,2,3,4] + [0,1,2,3] + [0,1,2] -> [1,2,3,4,1,2,3,1,2,...,0,0,0]` (note: the position ID `0` of each CLS token is also moved to the end of the pack)
    
- `labels` and `token_type_ids` are also packed to correspond to the `input_ids` pack.
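
The layout described in the list above can be reproduced on toy data. The following is a simplified sketch, not the code used by `PackedDatasetCreator`: the function name `pack_sequences` is hypothetical, the token IDs other than `[CLS]` and `[PAD]` are arbitrary, and using position ID `0` for padding tokens is an assumption made for illustration.

```python
CLS_ID = 101  # [CLS] token ID in the bert-base-uncased vocabulary
PAD_ID = 0    # [PAD] token ID in the bert-base-uncased vocabulary


def pack_sequences(sequences, max_seq_length):
    """Concatenate tokenized sequences (each starting with [CLS]) into one pack."""
    input_ids, attention_mask, position_ids = [], [], []
    for index, sequence in enumerate(sequences, start=1):
        body = sequence[1:]  # drop [CLS]; it is re-appended at the end of the pack
        input_ids += body
        attention_mask += [index] * len(body)          # unique index per sequence
        position_ids += list(range(1, len(sequence)))  # positions restart per sequence

    num_sequences = len(sequences)
    padding = max_seq_length - len(input_ids) - num_sequences
    if padding < 0:
        raise ValueError("Sequences do not fit into one pack")

    # Padding first, then one [CLS] token per sequence at the very end.
    input_ids += [PAD_ID] * padding + [CLS_ID] * num_sequences
    attention_mask += [0] * padding + list(range(1, num_sequences + 1))
    position_ids += [0] * padding + [0] * num_sequences
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "position_ids": position_ids}


toy_sequences = [[CLS_ID, 7, 8, 9, 10], [CLS_ID, 11, 12, 13], [CLS_ID, 14, 15]]
pack = pack_sequences(toy_sequences, max_seq_length=14)
for name, values in pack.items():
    print(name, values)
```

Each `[CLS]` token at the end of the pack carries the same mask index as the sequence it belongs to, which is what lets the model produce one classification output per packed sequence.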


To create a dataloader-ready packed dataset, all you need to do is call the `create()` method:

In [None]:
packed_train_dataset = train_data_packer.create()
packed_val_dataset = val_data_packer.create()

Let's visualize one sample of the new `packed_train_dataset`:

In [None]:
packed_train_dataset[133]

## Fine-tuning the model

Now that our data is ready, we can download the pre-trained model and fine-tune it.

### Implementing Packed BERT

A few model modifications are required to make packing work with BERT.
We extend the existing `BertForSequenceClassification` class to `PipelinedPackedBertForSequenceClassification` which incorporates the required changes to the pooler and the model output. The crux of these changes is to modify the generic sequence classification model to handle unpacking multiple sequences in the output stage, treating them as a larger batch size for classification, as well as masking any padding created by packing.

First let's load a default BERT configuration using `AutoConfig`. The config includes a new parameter we must set, `max_sequences_per_pack`, which informs the model of the maximum number of sequences it will need to unpack in the model output. It also allows us to clearly define the `num_labels` and `problem_type` for this model.

It is essential we define the problem type here, as switching between the methods used by different types of classification requires that it be defined within the custom model.

In [None]:
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_checkpoint)
config.max_sequences_per_pack = max_seq_per_pack
config.num_labels = num_labels
config.problem_type = problem_type

Now we can instantiate the model class with the config, loading the weights from the model checkpoint.

In [None]:
import torch
import numpy as np
torch.manual_seed(43)
np.random.seed(43)

from models.modeling_bert_packed import PipelinedPackedBertForSequenceClassification


model = PipelinedPackedBertForSequenceClassification.from_pretrained(
   model_checkpoint, config=config)

The warning tells us we are throwing away some weights and randomly initializing others. This is normal in this case, because we are removing the head used to pre-train the model on a masked language modelling objective and replacing it with a new head for sequence classification, which we don't have pre-trained weights for, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

We can first test the model on a CPU and observe that the output logits have the size `[batch_size, max_seq_per_pack, num_labels] = [1, 6, 28]` with this notebook's default values, where 28 is the number of labels in the dataset. The logits are reshaped into this form in the model output, to be the same shape as the labels, for ease of post-processing.

In [None]:
# test the model on CPU
from transformers.data.data_collator import default_data_collator

loader = torch.utils.data.DataLoader(packed_train_dataset,
                             batch_size=1,
                             shuffle=True,
                             drop_last=True,
                             collate_fn=default_data_collator)
data = next(iter(loader))
labels = data['labels']

print('labels: ', labels.shape)
o = model(**data)
print('outputs (loss, logits): ', o[0], o[1].shape)

Now, let's prepare the model for an IPU.

First, we set the model to half-precision:

In [None]:
model.half()

For validation, we need to define a function to compute the metrics from the predictions, using the `metric` we loaded earlier. Preprocessing here involves a step that uses a boolean mask to remove the labels and predictions we are not using, which were set to `-100` when creating the dataset. Then, the predictions are passed into a `softmax` function to convert the logits into a score for each class.

These scores and labels are passed into the metric function to compute the ROC AUC during evaluation.

In [None]:
model_name = model_checkpoint.split("/")[-1]
from scipy.special import softmax
from tqdm import tqdm
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    
    labels = labels.reshape(-1, labels.shape[-1])
    predictions = predictions.reshape(-1, predictions.shape[-1])
    
    # Remove the padding labels
    mask = (labels != -100)[:,0]
    
    labels = labels[mask,:]
    predictions = predictions[mask,:]
    pred_scores = softmax(predictions.astype("float32"), axis=1)    

    auc = metric.compute(
        prediction_scores=pred_scores, references=labels, multi_class="ovr"
    )["roc_auc"]

    return {"roc_auc": auc}
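
To see what the masking step does, here is a small pure-Python sketch on toy data. It mirrors the reshape and boolean-mask logic of `compute_metrics`, but it is only an illustration: the function name `drop_unused_slots` and the toy values are hypothetical, and the notebook's function uses NumPy indexing for the same purpose.

```python
IGNORE_VALUE = -100  # value used for label slots that hold no sequence


def drop_unused_slots(predictions, labels):
    """Flatten [batch, max_seq_per_pack, num_labels] and keep only real sequences."""
    flat_predictions = [row for pack in predictions for row in pack]
    flat_labels = [row for pack in labels for row in pack]
    # A slot is unused when its label row is filled with the ignore value,
    # so checking the first entry of each row is enough.
    kept = [(p, l) for p, l in zip(flat_predictions, flat_labels)
            if l[0] != IGNORE_VALUE]
    return [p for p, _ in kept], [l for _, l in kept]


# One pack with three slots and two labels; the last slot is unused.
toy_predictions = [[[0.2, 1.5], [-0.3, 0.8], [0.0, 0.0]]]
toy_labels = [[[0, 1], [1, 0], [-100, -100]]]

kept_predictions, kept_labels = drop_unused_slots(toy_predictions, toy_labels)
print(kept_predictions, kept_labels)
```

Only the rows that survive this filter are scored, so partially filled packs do not distort the metric.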

Next, we need to define `IPUConfig`, which is a class that specifies attributes and configuration parameters to compile and put the model on the device. We initialize it with a config name or path, which we set earlier. It is later passed to `IPUTrainer` together with the model.

In [None]:
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments

ipu_config = IPUConfig.from_pretrained(
    ipu_config_name,
    executable_cache_dir = executable_cache_dir,
    gradient_accumulation_steps=gradient_accumulation_steps,
    device_iterations = device_iterations,
    replication_factor=1,
    inference_device_iterations = 64,
    inference_replication_factor = 1
)

`IPUTrainingArguments` define any custom parameter modification we want to do, such as the initial learning rate for the model. It also allows other options, such as dataloader parameters, micro batch sizes and an automatic push to the Hugging Face Hub (if credentials were set up earlier) to happen at given intervals.

These arguments are passed to `IPUTrainer` which wraps the model training and evaluation process into a simple single-line process, doing all of the heavy lifting for us, for example training and evaluation loops, device assignment, optimiser definition and data-loading.

Note that only limited hyperparameter tuning was performed for this task. Other tasks and datasets may require further tuning to get the best results.

In [None]:
from transformers import default_data_collator
metric_name = "roc_auc"

args = IPUTrainingArguments(
    "./"+f"{model_name}-{model_task}",
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    learning_rate=2e-4,
    adam_epsilon=1e-6,
    loss_scaling=16.0,
    warmup_ratio=0.1,
    weight_decay=0,
    lr_scheduler_type = "cosine",
    metric_for_best_model=metric_name,
    dataloader_drop_last=True,
    dataloader_mode="async_rebatched",
    logging_steps=1,
    pod_type=pod_type,
    gradient_accumulation_steps=gradient_accumulation_steps,
    push_to_hub=True    
)

trainer = IPUTrainer(
    model,
    ipu_config,
    args,
    train_dataset=packed_train_dataset,
    eval_dataset=packed_val_dataset,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics
)

Then, to train the model we can simply call the `train()` method:

In [None]:
trainer.train()

***About the performance:*** `IPUTrainer` doesn't take into account that we have packed data samples when computing the speed metrics. It treats a "sample" as a single input to the model, which means one **pack**.

So the actual throughput estimation can be obtained by multiplying `samples_per_second` by the average packing factor (the average number of samples per pack) of the dataset. These were obtained in the "Packing algorithm" section: `5.68` for the `go_emotions` training set and `5.83` for the validation set.
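
As a small worked example of this correction, the sketch below scales a reported value by the packing factor. The function name is hypothetical and the `samples_per_second` value of `1000.0` is made up purely for illustration; only the packing factor `5.68` comes from this notebook.

```python
def effective_throughput(reported_samples_per_second, packing_factor):
    """Convert packs per second (as reported by the trainer) to sequences per second."""
    return reported_samples_per_second * packing_factor


# 1000.0 is a made-up reported value; 5.68 is the training-set packing factor.
train_throughput = effective_throughput(1000.0, 5.68)
print(f"{train_throughput:.0f} sequences/s")
```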

Next, we can evaluate the model by simply calling the `evaluate()` method:

In [None]:
trainer.evaluate()

You can now upload the result of the training to the Hugging Face Hub if you successfully authenticated at the beginning of this notebook:

In [None]:
trainer.push_to_hub()

You can also save the model locally:

In [None]:
trainer.save_model("./"+f"{model_name}-{model_task}")

You have now successfully fine-tuned and evaluated your speed-optimised model for text classification using packing!

## Fast batched inference

Packing can also be used for inference, particularly for performing batched inference on large workloads. This section demonstrates how to perform faster, batched inference with a large number of samples using a super-easy custom pipeline which batches and packs your input data, performs inference and returns post-processed predictions.

For the pipeline, we need to import it, and initialise a few essential parameters.

The `model` is the model checkpoint and we are going to use the locally saved checkpoint generated from training `go_emotions`. `executable_cache_dir`, `problem_type`, `max_seq_length` must be specified. To return predictions organised by class names, the class names for your output must be passed to `label_categories`. 

The pipeline will automatically determine your model's IPU config, given that the checkpoint was trained using Optimum Graphcore, which will be the case for the model fine-tuned in this notebook.

In this example, we pre-load `IPUConfig` and modify some of the default parameters to get the best performance out of inference and leverage the benefits of IPU parallelism. The micro-batch size can also be specified, for which the default is 1.

When training, the packing factor affects the convergence in the same way as a large increase in batch size would do. However, for inference, we are free to use a bigger packing factor to speed it up. Let's try it with `max_seq_per_pack = 12`.

**Note:** Packing brings huge benefits for performing inference on large amounts of data. For small-scale inference tasks, such as sequential inference on a single un-batched input, the generic Optimum Graphcore `TextClassificationPipeline` class may be preferred. This won't affect fine-tuning and the weights generated from fine-tuning using packing will work just the same!

Let's list the class names for the GoEmotions dataset.

In [None]:
class_names = [
    "admiration",
    "amusement",
    "anger",
    "annoyance",
    "approval",
    "caring",
    "confusion",
    "curiosity",
    "desire",
    "disappointment",
    "disapproval",
    "disgust",
    "embarrassment",
    "excitement",
    "fear",
    "gratitude",
    "grief",
    "joy",
    "love",
    "nervousness",
    "optimism",
    "pride",
    "realization",
    "relief",
    "remorse",
    "sadness",
    "surprise",
    "neutral",
]

Let's initialise the `PackedBertTextClassificationPipeline`.

In [None]:
from pipeline.packed_bert import PackedBertTextClassificationPipeline

from optimum.graphcore import IPUConfig

model = "./"+f"{model_name}-{model_task}"
# model = 'your_username/{model_name}-{task}' # to load from Hugging Face Hub

inference_boosted_ipu_config = IPUConfig.from_pretrained(model, 
        inference_device_iterations=32,
        inference_replication_factor=4,
        ipus_per_replica=1,
        layers_per_ipu=[12]
    )

pipeline = PackedBertTextClassificationPipeline(
    model = model,
    executable_cache_dir = executable_cache_dir,
    problem_type='multi_label_classification',
    max_seq_per_pack=12,
    max_seq_length=max_seq_length,
    ipu_config=inference_boosted_ipu_config,
    micro_batch_size=8,
    label_categories=class_names
)

The pipeline expects a **list of strings** directly passed to it. There is no need to tokenize, preprocess, pack or post-process the data to use the inference pipeline.

As a test, we can load the entire `go_emotions` dataset and perform packed inference using `.predict()` on the text column to generate predictions.

Datasets with multiple sentences can simply be passed as `predict(<sentences_1>,<sentences_2>)`.

In [None]:
import datasets
dataset = datasets.load_dataset('go_emotions','simplified')
preds = pipeline.predict(dataset['train']['text'])

print(preds.keys())
print(f"Number of predictions: {len(preds['predictions'])}")
print(f"Preprocessing time: {preds['preprocessing_time']}s")
print(f"Postprocessing time: {preds['postprocessing_time']}s")
print(f"Throughput: {preds['throughput']} samples/s")

There is minimal overhead from tokenizing and packing the dataset, but the speed benefits are evident. After increasing the maximum sequences to 12, we can observe a much higher packing factor of 9.14.

Running the above pipeline, we achieve a throughput of approximately 45,000 samples per second, demonstrating the huge time benefit you can achieve by using packing!

## Next steps

Check out the full list of [Optimum Graphcore notebooks](https://github.com/huggingface/optimum-graphcore/tree/main/notebooks) to get a feel for how IPUs perform on other tasks.

* [Single-label text classification](https://github.com/huggingface/optimum-graphcore/blob/main/notebooks/packed_bert/packedBERT_single_label_text_classification.ipynb)
* [Question answering](https://github.com/huggingface/optimum-graphcore/blob/main/notebooks/packed_bert/packedBERT_question_answering.ipynb)