First of all, make sure your environment has installed the latest version of [🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore).

In [None]:
! pip install git+https://github.com/huggingface/optimum-graphcore.git;

Also make sure all the packages required for text classification are installed.

In [None]:
! pip install scikit-learn;
! pip install matplotlib;
! pip install tokenizers==0.11.1

Let's print out the versions of Transformers and Optimum Graphcore:

In [None]:
import transformers
import optimum.graphcore

print(transformers.__version__)
print(optimum.graphcore.__version__)

# Fine-tuning BERT on a text classification task using packing

This notebook is an alternative to [Fine-tuning BERT on a text classification task](text_classification.ipynb). It shows how to implement packing for BERT step by step and use it for fine-tuning on the `GLUE/sst2` text classification task. This includes packing the dataset and adapting an existing BERT model. Packing consists of concatenating several input sequences into one to increase computational efficiency. More details about packing can be found in the [blog post](https://www.graphcore.ai/posts/introducing-packed-bert-for-2x-faster-training-in-natural-language-processing) and the original [paper](https://arxiv.org/abs/2107.02027).

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a text classification task from the [GLUE Benchmark](https://gluebenchmark.com/).

![Widget inference on a text classification task](images/text_classification.png)

The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences which are:

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.)
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases from one another or not.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs) Determine if two questions are semantically equivalent or not.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 1 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)

We will see how to easily load the dataset for each one of those tasks and use packed BERT to fine-tune a model on it. Each task is named by its acronym, with `mnli-mm` standing for the mismatched version of MNLI (so same training set as `mnli` but different validation and test sets):

In [None]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

**For this packed BERT demo, we will cover (single-label) sequence classification on the `sst2` dataset. `task` can be changed to run the other `GLUE` tasks, but the training hyperparameters may need some tuning for them.**

In this notebook, we are using both data parallelism and pipeline parallelism (see this [tutorial](https://github.com/graphcore/tutorials/tree/master/tutorials/pytorch/tut2_efficient_data_loading) for more). Therefore the global batch size, which is the actual number of samples used for each weight update, is determined by three factors:
- global batch size = micro_batch_size * gradient accumulation steps * replication factor

The replication factor is determined by `pod_type`, which is used as a key to select the replication factor from a dictionary defined in the IPU config file. For example, the dictionary in the IPU config file [Graphcore/roberta-base-ipu](https://huggingface.co/Graphcore/roberta-base-ipu/blob/main/ipu_config.json) looks like this:
- "replication_factor": {"pod4": 1, "pod8": 2, "pod16": 4, "pod32": 8, "pod64": 16, "default": 1}

Depending on your model and the pod machine you are using, you might need to adjust these three batch-size-related arguments.
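As a quick sanity check, the global batch size formula above can be evaluated by hand. The sketch below is illustrative only: it hard-codes the example replication-factor dictionary quoted above instead of reading a real IPU config file, and the helper name `global_batch_size` is made up for this example.

```python
# Illustrative only: this dictionary is copied from the example above,
# not loaded from an actual IPU config file.
replication_factors = {"pod4": 1, "pod8": 2, "pod16": 4, "pod32": 8, "pod64": 16, "default": 1}


def global_batch_size(micro_batch_size, gradient_accumulation_steps, pod_type):
    """Number of samples contributing to one weight update."""
    replication_factor = replication_factors.get(pod_type, replication_factors["default"])
    return micro_batch_size * gradient_accumulation_steps * replication_factor


print(global_batch_size(2, 32, "pod4"))   # 2 * 32 * 1
print(global_batch_size(2, 32, "pod16"))  # 2 * 32 * 4
```

Keeping the product constant when moving to a larger pod means lowering `gradient_accumulation_steps` as the replication factor grows.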

By default this notebook is configured to run on 4 IPUs.

Finally, `max_seq_length` is the length we are going to pad the sentences to, so it should not be larger than the maximum length of the model. Set those seven parameters, then the rest of the notebook should run smoothly:

Given the small size of the sequences in `sst2`, we can reduce the model input size to `max_seq_length = 256`.

In [None]:
task = "sst2"
model_checkpoint = "bert-base-uncased"
ipu_config_name = "Graphcore/bert-base-uncased"
micro_batch_size = 2
gradient_accumulation_steps = 32
pod_type = "pod4"
max_seq_length = 256

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric, Dataset

Apart from `mnli-mm` being a special code, we can directly pass our task name to those functions. `load_dataset` will cache the dataset to avoid downloading it again the next time you run this cell.

In [None]:
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets (with more keys for the mismatched validation and test sets in the special case of `mnli`).

In [None]:
dataset

To access an actual element, you need to select a split first, then give an index:

In [None]:
dataset["train"][0]

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"])

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric

You can call its `compute` method with your predictions and labels directly and it will return a dictionary with the metric(s) value:

In [None]:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

Note that `load_metric` has loaded the proper metric associated to your task, which is:

- for CoLA: [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient)
- for MNLI (matched or mismatched): Accuracy
- for MRPC: Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score)
- for QNLI: Accuracy
- for QQP: Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score)
- for RTE: Accuracy
- for SST-2: Accuracy
- for STS-B: [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and [Spearman's Rank Correlation Coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
- for WNLI: Accuracy

so the metric object only computes the one(s) needed for your task.

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [None]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). The following dictionary keeps track of the correspondence task to column names:

In [None]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

We can double check it does work on our current dataset:

In [None]:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with two extra arguments: `truncation=True` ensures that an input longer than the maximum length is truncated to the maximum length, and `max_length=max_seq_length` sets that maximum length.

**Note: since we will use packing later, we don't want to perform any padding in the tokenizer.**

In [None]:
# no padding for packing
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True, max_length=max_seq_length)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True, max_length=max_seq_length)

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
preprocess_function(dataset['train'][:5])

To apply this function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of our `dataset` object we created earlier. This will apply the function on all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in one single command.

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True)
len(encoded_dataset['train'])

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.

### Packing the dataset

To implement packing, we need to pack our dataset first. Each new element will be a "pack" containing at most `max_seq_per_pack` sequences.

In [None]:
max_seq_per_pack = 6

### Packing algorithm

In order to pack efficiently, we will use a histogram-based algorithm, shortest-pack-first histogram-packing (SPFHP), presented in the [blog post](https://www.graphcore.ai/posts/introducing-packed-bert-for-2x-faster-training-in-natural-language-processing) (code: https://github.com/graphcore/tutorials/tree/master/blogs_code/packedBERT). First, we need to generate the histogram of the sequence lengths in our dataset:

In [None]:
def generate_histogram(unpadded_input_ids, max_seq_len):
    dataset_seq_lens = np.array([len(seq) for seq in unpadded_input_ids])
    histogram = np.zeros(max_seq_len, dtype=np.int64)
    seq_lens, counts = np.unique(dataset_seq_lens, return_counts=True)
    histogram[seq_lens - 1] = counts

    return histogram

In [None]:
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"

train_dataset = encoded_dataset['train']
val_dataset = encoded_dataset[validation_key]

train_hist = generate_histogram(train_dataset['input_ids'], max_seq_length )
val_hist = generate_histogram(val_dataset['input_ids'], max_seq_length )

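import matplotlib.pyplot as plt
# train_hist already holds the number of sequences for each length (index i -> length i + 1),
# so we draw it directly as a bar chart instead of re-binning it
plt.bar(np.arange(1, max_seq_length + 1), train_hist)
plt.title("Sequence length histogram")
plt.show()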

Now we apply the `Shortest pack first histogram packing` algorithm to generate a packing strategy from the histogram.

In [None]:
import time
import numpy as np
from collections import defaultdict

def add_pack(pack, count, tmp, final, limit, offset, max_sequence_length=512):
    """Filter out packs that reached maximum length or number of components."""
    if len(pack) == limit or offset == 0:
        final[offset].append((count, pack))
    else:
        tmp[offset].append((count, pack))


#^SPFHP - Shortest pack first histogram packing
def SPFHP(histogram, max_sequence_length, max_sequences_per_pack):
    """Shortest-pack-first histogram-packing."""
    start = time.time()
    reversed_histogram = np.flip(histogram)
    # Initialize main strategy data dictionary.
    # The key indicates how many tokens are left for full length.
    # The value is a list of tuples, consisting of counts and respective packs.
    # A pack is a (sorted) list of sequence length values that get concatenated.
    tmp_strategies_per_length = defaultdict(list)
    strategies_per_length = defaultdict(list)
    # Index i indicates here, how much space is left, due to reversed histogram
    for i in range(max_sequence_length):
        n_sequences_to_bin = reversed_histogram[i]
        length_to_bin = max_sequence_length - i
        offset = i + 1  # largest possible offset
        while n_sequences_to_bin > 0:
            if (length_to_bin + offset) in tmp_strategies_per_length:
                # extract shortest pack that will get modified
                n_sequences_to_pack, pack = tmp_strategies_per_length[
                    length_to_bin + offset].pop()
                new_pack = pack + [length_to_bin]
                count = min(n_sequences_to_pack, n_sequences_to_bin)
                if n_sequences_to_pack > n_sequences_to_bin:
                    # old pack gets reduced
                    n_sequences_to_pack -= n_sequences_to_bin
                    tmp_strategies_per_length[length_to_bin + offset].append(
                        (n_sequences_to_pack, pack))
                    n_sequences_to_bin = 0
                else:
                    n_sequences_to_bin -= n_sequences_to_pack
                add_pack(new_pack, count,
                         tmp_strategies_per_length, strategies_per_length,
                         max_sequences_per_pack, offset)
                # clean up to speed up main key search
                if not tmp_strategies_per_length[length_to_bin + offset]:
                    tmp_strategies_per_length.pop(length_to_bin + offset)
            else:
                offset -= 1
            # Does not fit anywhere. Create new pack.
            if offset < 0:
                add_pack([length_to_bin], n_sequences_to_bin,
                         tmp_strategies_per_length, strategies_per_length,
                         max_sequences_per_pack, i)
                n_sequences_to_bin = 0
    # merge all strategies
    for key in tmp_strategies_per_length:
        strategies_per_length[key].extend(tmp_strategies_per_length[key])
    # flatten strategies dictionary
    strategy_set = []
    strategy_repeat_count = []
    for key in strategies_per_length:
        for count, pack in strategies_per_length[key]:
            pack.reverse()
            strategy_set.append(pack)
            strategy_repeat_count.append(count)

    # Summarize efficiency of solution
    duration = time.time() - start
    sequence_lengths = np.arange(1, max_sequence_length + 1)
    strategy_repeat_count = np.array(strategy_repeat_count)
    n_strategies = len(strategy_set)
    old_number_of_samples = histogram.sum()
    new_number_of_samples = strategy_repeat_count.sum()
    sequences = sum([count*len(pack) for count, pack in
                     zip(strategy_repeat_count, strategy_set)])
    total_tokens = max_sequence_length * new_number_of_samples
    empty_tokens = sum([count*(max_sequence_length-sum(pack)) for count, pack
                        in zip(strategy_repeat_count, strategy_set)])
    efficiency = 100 - empty_tokens / total_tokens * 100
    speedup_upper_bound = 1.0 / (1 - (histogram*(1 - sequence_lengths / max_sequence_length)).sum() / old_number_of_samples)
    packing_factor = sequences/sum(strategy_repeat_count)
    
    print(f"Packing efficiency (fraction of real tokens): {efficiency:3.4f}\n",
          f"Speed-up theoretical limit: {speedup_upper_bound:3.4f}\n",
          f"Achieved speed-up over un-packed dataset: {old_number_of_samples/new_number_of_samples:3.5f}\n",
          f"Runtime: Packed {old_number_of_samples} sequences in {duration:3.3f} seconds\n",
          f"Average packing factor: {packing_factor}")
    

    return strategy_set, np.array(strategy_repeat_count)

`strategy_set` is a list of lists containing the sequence lengths we can pack together.

`strategy_repeat_count` gives the corresponding number of times we can create each pack of `strategy_set`.
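To make these two return values concrete, here is a small hand-written example. The strategy below is made up for illustration (it is not the output of `SPFHP` on `sst2`); it shows how the number of packs, the number of packed sequences and the packing efficiency follow from the pair.

```python
max_sequence_length = 16  # toy value, much smaller than the notebook's 256

# Hypothetical strategy: each inner list holds the sequence lengths sharing one pack.
strategy_set = [[16], [10, 6], [5, 5, 4]]
# Hypothetical counts: how many packs follow each pattern.
strategy_repeat_count = [3, 2, 4]

pairs = list(zip(strategy_repeat_count, strategy_set))
n_packs = sum(strategy_repeat_count)
n_sequences = sum(count * len(pack) for count, pack in pairs)
real_tokens = sum(count * sum(pack) for count, pack in pairs)
# Fraction of token slots holding real tokens rather than padding.
efficiency = real_tokens / (n_packs * max_sequence_length)

print(n_packs, n_sequences, round(efficiency, 4))
```

In this toy case 19 sequences fit into 9 packs, so the packing factor is 19 / 9, a little over 2 sequences per pack.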

In [None]:
train_strategy = SPFHP(train_hist, max_seq_length, max_seq_per_pack)
val_strategy = SPFHP(val_hist, max_seq_length, max_seq_per_pack)

Now we need to create the actual packed dataset object.
We pick the sequences and pack them based on their length, following the strategy we just generated. Once they are packed, we also need to pad the packs to `max_seq_length` to maintain a constant input size.

Notes:
- A specific `attention_mask` is generated: it contains a unique index for each sequence of the pack and `0` for the remaining padding tokens.
    - Example with 3 sequences: `attention_mask = [1,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,0,...,0,1,2,3]`

- The `[CLS]` tokens of each sequence are moved to the end of the pack.
    - For instance: `[CLS,a,b,c] + [CLS,d,e,f] + [CLS,g,h,i] -> [a,b,c,d,e,f,g,h,i,...,CLS,CLS,CLS]`

- The `position_ids` of a pack contain the concatenated `position_ids` of each sequence.
    - For instance, given 3 sequences: `[0,1,2,3,4] + [0,1,2,3] + [0,1,2] -> [1,2,3,4,1,2,3,1,2,...,0,0,0]` (note: the position ID `0` of each `[CLS]` token is also moved to the end of the pack)

- `labels` and `token_type_ids` are also packed to match the `input_ids` pack.
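The layout described in these notes can be reproduced on a toy example with plain Python lists. This is a simplified sketch, not the function used in the next cell: the helper name `pack_sequences` and the token ids are chosen for illustration, `101` is used as the `[CLS]` id, and `token_type_ids` and `labels` are left out.

```python
CLS = 101  # [CLS] id, treated here as a plain constant for illustration


def pack_sequences(sequences, max_seq_len, max_seq_per_pack):
    """Pack tokenized sequences (each starting with CLS) into one fixed-size example."""
    n_body_tokens = sum(len(seq) - 1 for seq in sequences)
    if len(sequences) > max_seq_per_pack or n_body_tokens > max_seq_len - max_seq_per_pack:
        raise ValueError("sequences do not fit in one pack")

    input_ids = [0] * max_seq_len
    attention_mask = [0] * max_seq_len
    position_ids = [0] * max_seq_len

    cursor = 0
    for seq_index, seq in enumerate(sequences, start=1):
        # Drop the leading [CLS]; it is re-added at the end of the pack.
        for offset, token in enumerate(seq[1:], start=1):
            input_ids[cursor] = token
            attention_mask[cursor] = seq_index  # unique index per sequence
            position_ids[cursor] = offset       # positions restart at 1 for each sequence
            cursor += 1

    # The last max_seq_per_pack slots hold the [CLS] tokens (position id 0).
    for slot in range(max_seq_per_pack):
        pos = max_seq_len - max_seq_per_pack + slot
        input_ids[pos] = CLS
        attention_mask[pos] = slot + 1
    return input_ids, attention_mask, position_ids


seqs = [[CLS, 11, 12, 13], [CLS, 21, 22], [CLS, 31]]
ids, mask, pos = pack_sequences(seqs, max_seq_len=12, max_seq_per_pack=3)
print(ids)   # [11, 12, 13, 21, 22, 31, 0, 0, 0, 101, 101, 101]
print(mask)  # [1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 2, 3]
print(pos)   # [1, 2, 3, 1, 2, 1, 0, 0, 0, 0, 0, 0]
```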

In [None]:
import itertools
def create_dataset_from_strategy(data, strategy_set, strategy_repeat_count, max_seq_len, max_seq_per_pack):
    total_num_packs:int = np.sum(strategy_repeat_count)
        

    # Sort the sequences by length
    dataset_seq_lens = np.array([len(seq) for seq in data['input_ids']])
    len_sorted_seq_idxs = np.argsort(dataset_seq_lens)
    len_sorted_seq_lens = dataset_seq_lens[len_sorted_seq_idxs]
    sorted_seqs = np.stack((len_sorted_seq_lens, len_sorted_seq_idxs))


    # Get the data from the tokenised dataset
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    labels = data['label']
    
    # Prepare the manually padded constant sized data
    packed_input_ids = np.zeros((total_num_packs, max_seq_len), dtype=int)
    packed_attention_mask = np.zeros((total_num_packs, max_seq_len), dtype=int)
    packed_token_type_ids = np.zeros((total_num_packs, max_seq_len), dtype=int)
    packed_position_ids = np.zeros((total_num_packs, max_seq_len), dtype=int)
    packed_labels = -100 * np.ones((total_num_packs, max_seq_per_pack), dtype=int)
    
    # Pack the data using the developed strategies
    pack_index = 0
    for i in range(len(strategy_repeat_count)):
        strategy = strategy_set[i]
        # Build one pack per repetition of this strategy
        for _ in range(strategy_repeat_count[i]):

            '''Key terms in loop:

            * sorted_seqs: (shape [2, dataset])
                - index 0: sorted lengths of each sequence in dataset
                    -- e.g. sorted_seqs[0,12] gives the length of the sequence at dataset position at index: sorted_seqs[1,12]
                - index 1: index of corresponding lengths in the dataset
                    -- e.g. dataset[sorted_seqs[1,12]] returns dataset sequence at index: sorted_seqs[1,12]

            * ref_inds: (shape [strategy_set])
                - the indices of the [length, dataset index] pair in sorted_seqs (this is used to remove/clear sorted_seqs as data is packed).
                    -- e.g sorted_seqs[0, ref_inds] = -1 will nullify the sequence length at positions in [array] ref_inds such that they cannot be called to pull data from those indices again.

            * inds: (shape [strategy_set])
                - the indices in the actual dataset, called using the indices of sorted_seqs retrieved from ref_inds.
                    --e.g. > inds = sorted_seqs[1, ref_inds]
                           > packed data = concatenate(dataset[inds])
            '''

            ref_inds = []
            for x in strategy:
                ref_ind = np.argwhere(sorted_seqs[0] == x)[-1]
                sorted_seqs[0, ref_ind] = -1
                ref_inds.append(ref_ind)

            inds = sorted_seqs[1, ref_inds].ravel()

            # Exclude the CLS tokens to put them at the end later
            input_id_pack = list(itertools.chain(*[input_ids[x][1:] for x in inds]))
            attention_mask_pack = list(itertools.chain(*[itertools.repeat(n+1, len(attention_mask[v])-1) for n,v in enumerate(inds)]))
            token_type_ids_pack = list(itertools.chain(*[token_type_ids[x][1:] for x in inds]))
            position_ids_pack = list(itertools.chain(*[range(1, len(attention_mask[v])) for n,v in enumerate(inds)]))

            # Create the equivalent tokenised packed dataset
            packed_input_ids[pack_index, :len(input_id_pack)] = input_id_pack
            packed_attention_mask[pack_index, :len(attention_mask_pack)] = attention_mask_pack
            packed_token_type_ids[pack_index, :len(token_type_ids_pack)] = token_type_ids_pack
            packed_position_ids[pack_index, :len(position_ids_pack)] = position_ids_pack
            labels_pack = [labels[x] for x in inds]
            packed_labels[pack_index, :len(labels_pack)] = labels_pack

            # Now add the CLS tokens and their masks at the end of the pack
            packed_input_ids[pack_index, -max_seq_per_pack:] = [input_ids[0][0] for _ in range(max_seq_per_pack)]
            packed_attention_mask[pack_index, -max_seq_per_pack:] = list(range(1, max_seq_per_pack+1))

            pack_index += 1
            
    new_dataset = Dataset.from_dict({ "input_ids": packed_input_ids,
                                      "attention_mask": packed_attention_mask,
                                      "token_type_ids": packed_token_type_ids,
                                      "position_ids": packed_position_ids,
                                      "labels": packed_labels
                                })
    new_dataset.set_format(type='torch', columns=new_dataset.features)
    return new_dataset

In [None]:
packed_train_dataset = create_dataset_from_strategy(train_dataset, train_strategy[0], train_strategy[1], max_seq_length, max_seq_per_pack)
packed_val_dataset = create_dataset_from_strategy(val_dataset, val_strategy[0], val_strategy[1], max_seq_length, max_seq_per_pack)

print(packed_train_dataset)

Let's visualize one sample of the new `packed_train_dataset`:

In [None]:
packed_train_dataset[3020]

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. The number of labels will be required.

In [None]:
from transformers import AutoModelForSequenceClassification, default_data_collator
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2

### Implement Packed BERT

A few model modifications are required to make packing work with BERT.
We will extend the existing class `BertForSequenceClassification`.

First let's load a default BERT configuration using `AutoConfig`.

In [None]:
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_checkpoint)
config.max_position_embeddings = max_seq_length
print(config)

Packing sequences increases the number of elements per batch.
In order to reuse the classification heads from the `transformers` library, we need a special pooler. Instead of pooling the hidden states of a single sequence, it pools several (up to the maximum number of sequences in the pack) and orders them along the batch dimension. So the output size of the pooler is `[batch_size x max_sequences_per_pack, hidden_size]`.

From the point of view of the loss, everything appears as if the batch size were larger (`batch_size x max_sequences_per_pack`).
When the number of sequences in the pack is lower than `max_sequences_per_pack`, the unused slots are ignored by giving them the default `ignore_index` (-100) of the loss as a special label (this was already done in the dataset preprocessing, see *Packing the dataset*).

![pooling](images/pooling.png)
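To see why the unused label slots do not affect training, the following pure-Python sketch mimics a mean cross-entropy loss with an `ignore_index`. It is a simplified stand-in written only for illustration: the helper `masked_cross_entropy` and the toy logits are assumptions, not part of `transformers` or PyTorch.

```python
import math

IGNORE_INDEX = -100  # same value used to fill unused label slots in the packed dataset


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the rows whose label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # empty slot in the pack: contributes nothing to the loss
        log_sum_exp = math.log(sum(math.exp(x) for x in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses) if losses else float("nan")


# One pack with max_sequences_per_pack = 3 but only 2 real sequences.
# The last row stands for the pooled output of an unused [CLS] slot.
pooled_logits = [[2.0, 0.0], [0.0, 1.0], [5.0, -5.0]]
labels = [0, 1, IGNORE_INDEX]

loss_packed = masked_cross_entropy(pooled_logits, labels)
loss_real_only = masked_cross_entropy(pooled_logits[:2], labels[:2])
print(loss_packed == loss_real_only)  # True: the unused slot is skipped
```

Whatever the pooler produces for an unused slot, its row never enters the average, so only real sequences drive the gradient.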

In [None]:
import torch
import torch.nn as nn

class PackedBertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.max_sequences_per_pack = config.max_sequences_per_pack
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden states corresponding
        # to the last max_sequences_per_pack tokens. Note that the [CLS] tokens
        # are always located at the end of the pack. When the actual number of
        # sequences is lower than max_sequences_per_pack, we still slice out
        # the last max_sequences_per_pack tokens, but we will not use all of
        # them during loss calculation.
        sh = hidden_states.shape
        last_tokens_tensors = hidden_states[:, -self.max_sequences_per_pack:]
        last_reshape = last_tokens_tensors.reshape(sh[0]*self.max_sequences_per_pack, sh[2])
        # output size: [bs x max_sequences_per_pack, hidden_size]
        output = self.dense(last_reshape)
        output = self.activation(output)
        return output

##### Attention mask
The attention mask should be used in a specific way in packed-BERT.
We will create a 2D attention mask like in the following example.
By doing so, the self-attention will treat each sequence of the pack separately (and it will also ignore the padding).
![attn-mask](images/attention-mask.png)

To get a better intuition here is an example showing how to transform the 1D attention mask:

In [None]:
# 1: Flat attention mask generated by the dataset. Each sequence has a different index. 0 is padding.
attention_mask = torch.tensor([[1,1,2,2,3,3,3,4,4,4,4,0,0,0,0,1,2,3,4]])
# 2: Generate the boolean 2D attention mask
attention_mask = attention_mask[:, None, :].repeat(1, attention_mask.shape[1], 1)
attention_mask = (attention_mask == attention_mask.transpose(1, 2)) * (attention_mask != 0)
# Notice that the mask is always False for the padding tokens.
print(attention_mask.to(int))

Now let's integrate this idea to the input of packed BERT.

By inheriting from `BertPipelineMixin`, the `parallelize()` method is already implemented for the BERT body. We override it to also place the classifier on the last IPU.

In [None]:
import poptorch
from optimum.graphcore.models.bert.modeling_bert import BertPipelineMixin
from transformers import BertForSequenceClassification


class PackedBertForSequenceClassification(BertForSequenceClassification, BertPipelineMixin):
    
    def __init__(self, config):
        super().__init__(config)
        self.config.max_sequences_per_pack = max_seq_per_pack
        self.bert.pooler = PackedBertPooler(config)
        
    def parallelize(self):
        super().parallelize()
        last_ipu = self.ipu_config.ipus_per_replica - 1
        self.classifier = poptorch.BeginBlock(self.classifier, "Classifier Output", ipu_id=last_ipu)
        return self
    
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, labels=None):
        # Expand the flat per-token sequence indices into a 2D mask so that each token
        # only attends to tokens of its own sequence, and never to padding (index 0).
        seq_len = input_ids.shape[1]
        attention_mask = attention_mask[:, None, :].repeat(1, seq_len, 1)
        attention_mask = (attention_mask == attention_mask.transpose(1, 2)) * (attention_mask != 0)

        output = super().forward(input_ids=input_ids,
                                 attention_mask=attention_mask,
                                 token_type_ids=token_type_ids,
                                 position_ids=position_ids,
                                 labels=labels)

        # For validation: output should keep the same batch dimension as the original input
        if not self.training:
            output.logits = output.logits.reshape([-1, max_seq_per_pack, num_labels])

        return output
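
The validation-time reshape can be illustrated in isolation. The following is a minimal NumPy sketch with illustrative values for `batch_size`, `max_seq_per_pack` and `num_labels` (they are assumptions, not necessarily the notebook defaults); it only shows how the flat logits map back to one row of slots per pack.

```python
import numpy as np

# Illustrative values only, chosen so that the flat tensor has 12 rows.
batch_size, max_seq_per_pack, num_labels = 2, 6, 2

# Flat logits: one row per sequence slot, shape (batch_size * max_seq_per_pack, num_labels).
flat_logits = np.arange(batch_size * max_seq_per_pack * num_labels, dtype=np.float32)
flat_logits = flat_logits.reshape(batch_size * max_seq_per_pack, num_labels)

# Same reshape as in forward(): recover the pack (batch) dimension.
packed_logits = flat_logits.reshape(-1, max_seq_per_pack, num_labels)

print(flat_logits.shape)    # (12, 2)
print(packed_logits.shape)  # (2, 6, 2)

# Slot j of pack i corresponds to row i * max_seq_per_pack + j of the flat tensor.
same_row = bool((packed_logits[1, 3] == flat_logits[1 * max_seq_per_pack + 3]).all())
print(same_row)             # True
```

With this layout, the first dimension of the logits matches the number of packs in the batch, which is also the first dimension of the packed labels.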

In [None]:
model = PackedBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels).train()

The warning is telling us we are throwing away some weights and randomly initializing others. This is absolutely normal in this case: we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

We can first test the model on CPU and observe that the output logits now have the shape [batch_size x max_seq_per_pack, 2] = [12, 2] with this notebook's default values.

In [None]:
from transformers.data.data_collator import default_data_collator


loader = torch.utils.data.DataLoader(packed_train_dataset,
                                     batch_size=micro_batch_size,
                                     shuffle=True,
                                     drop_last=True,
                                     collate_fn=default_data_collator)
# Use the built-in next(); iterators have no .next() method in Python 3
data = next(iter(loader))
outputs = model(**data)
print("logits shape: ", outputs.logits.shape)

Now let's prepare the model for IPU.

First, we set the model in half precision:

In [None]:
model.half()

We need to define the `IPUConfig`, a class that specifies the attributes and configuration parameters used to compile the model and put it on the device. We initialize it with a config name or path, which we set earlier. Then we use it to set the model attribute `model.ipu_config`.

In [None]:
ipu_config = IPUConfig.from_pretrained(
    ipu_config_name,
    executable_cache_dir="/tmp/exe_cache/",
    replication_factor=1,
    gradient_accumulation_steps=gradient_accumulation_steps,
    device_iterations=32,
    inference_replication_factor=1
)

For validation, we need to define a function to compute the metrics from the predictions, which will just use the `metric` we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits (or just squeeze the last axis in the case of STS-B). To ignore the `-100` labels from incomplete packs, we use a boolean mask.

In [None]:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
model_name = model_checkpoint.split("/")[-1]

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    
    # Remove the padding labels (-100 marks unused slots in incomplete packs)
    mask = (labels != -100)
    labels = labels[mask]
    predictions = predictions[mask]
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
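
To see what the boolean mask does to the shapes, here is a small NumPy sketch. The arrays are made-up values (2 packs, up to 3 sequences per pack, 2 classes), not outputs of the model; the sketch only mirrors the masking and argmax steps of `compute_metrics`.

```python
import numpy as np

# Hypothetical evaluation outputs: 2 packs, max 3 sequences per pack, 2 classes.
predictions = np.array([
    [[0.1, 0.9], [0.8, 0.2], [0.0, 0.0]],  # third slot of this pack is unused
    [[0.3, 0.7], [0.0, 0.0], [0.0, 0.0]],  # only one sequence in this pack
])
labels = np.array([
    [1, 0, -100],
    [0, -100, -100],
])

mask = labels != -100             # shape (2, 3), True for real sequences
valid_labels = labels[mask]       # shape (3,): unused slots are dropped
valid_logits = predictions[mask]  # shape (3, 2): one row per real sequence
predicted_classes = np.argmax(valid_logits, axis=1)

print(valid_labels)       # [1 0 0]
print(predicted_classes)  # [1 0 1]
```

Indexing with the 2D mask flattens the pack and slot dimensions together, so the metric only ever sees one prediction per real sequence.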

In [None]:
args = IPUTrainingArguments(
    f"/tmp/{model_name}-finetuned-{task}",
    learning_rate=0.00009,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0,
    metric_for_best_model=metric_name,
    dataloader_drop_last=True,
    dataloader_mode="async_rebatched",
    logging_steps=1,
    pod_type=pod_type,
    gradient_accumulation_steps=gradient_accumulation_steps,
    push_to_hub=False,
)


trainer = IPUTrainer(
    model,
    ipu_config,
    args,
    train_dataset=packed_train_dataset,
    eval_dataset=packed_val_dataset,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

***About performance:*** `IPUTrainer` doesn't take into account that we have packed data samples when computing the speed metrics. The actual throughput can therefore be estimated by multiplying `samples_per_second` by the average packing factor of the dataset. (These were obtained in the `packing_algorithm` section: `5.15` for the `sst2` training set and `5.77` for the validation set.)
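
As a rough illustration of this correction, the sketch below multiplies a reported speed by the packing factor. The `reported_samples_per_second` value is a made-up placeholder, and `packed_throughput` is a hypothetical helper, not part of Optimum Graphcore; only the `5.15` packing factor comes from this notebook.

```python
def packed_throughput(samples_per_second: float, packing_factor: float) -> float:
    """Estimate sequences per second from packs per second.

    The trainer counts each pack as one sample, so the reported speed
    is scaled by the average number of sequences per pack.
    """
    return samples_per_second * packing_factor


# Placeholder value: replace with the `samples_per_second` reported by the trainer.
reported_samples_per_second = 1000.0
train_packing_factor = 5.15  # average packing factor of the sst2 training set

estimated = packed_throughput(reported_samples_per_second, train_packing_factor)
print(f"Estimated throughput: {estimated:.1f} sequences/s")
```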

In [None]:
trainer.evaluate()

To see how your model fared you can compare it to the [GLUE Benchmark leaderboard](https://gluebenchmark.com/leaderboard).

You can now upload the result of the training to the Hub, just execute this instruction:

In [None]:
# trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("sgugger/my-awesome-model")
```

## Faster inference

During training, the packing factor affects convergence in the same way a larger batch size would. For inference, however, we are free to use a bigger packing factor to speed things up.
Let's try it on `sst2` with `max_seq_per_pack = 12`.

In [None]:
max_seq_per_pack = 12

To have enough examples, we will reuse the training set.

In [None]:
dataset = load_dataset("glue", "sst2")
encoded_dataset = dataset.map(preprocess_function, batched=True)
inference_dataset = encoded_dataset['train']  # train set again, to have enough examples
infer_strategy = SPFHP(train_hist, max_seq_length, max_seq_per_pack)
packed_dataset = create_dataset_from_strategy(inference_dataset, infer_strategy[0], infer_strategy[1], max_seq_length, max_seq_per_pack)

We can see that the average packing factor `6.7` is not close to the new maximum (12), but it is still an improvement over the previous `5.7`.
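
The average packing factor can be recomputed from a packing strategy. The sketch below assumes the strategy is a pair of lists, where the first lists the sequence lengths sharing each kind of pack and the second gives how many times each kind of pack is used. That layout is an assumption about what `infer_strategy[0]` and `infer_strategy[1]` contain, and the numbers are invented for illustration.

```python
def average_packing_factor(strategy_set, strategy_repeat_count):
    """Average number of sequences per pack for a given strategy.

    strategy_set: list of packs, each a list of the sequence lengths it holds (assumed layout).
    strategy_repeat_count: how many times each pack pattern is used (assumed layout).
    """
    total_packs = sum(strategy_repeat_count)
    total_sequences = sum(
        len(pack) * count for pack, count in zip(strategy_set, strategy_repeat_count)
    )
    return total_sequences / total_packs


# Invented example: 10 packs with one sequence, 20 with two, 5 with four.
example_set = [[128], [64, 64], [32, 32, 32, 32]]
example_counts = [10, 20, 5]

factor = average_packing_factor(example_set, example_counts)
print(factor)  # 70 sequences over 35 packs -> 2.0
```

A factor well below `max_seq_per_pack` simply means that most packs fill up on total token length before they reach the sequence-count limit.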

Let's also modify the configuration of the model for inference. To speed things up, we can use a single IPU and 4 replicas by changing `layers_per_ipu`, `inference_replication_factor` and `ipus_per_replica`, and also use a larger batch size.

In [None]:
ipu_config.layers_per_ipu = [12]
ipu_config.inference_device_iterations = 32
ipu_config.inference_replication_factor = 4
ipu_config.ipus_per_replica = 1

In [None]:
model = PackedBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)

In [None]:
args = IPUTrainingArguments(
    f"/tmp/{model_name}-finetuned-{task}-fast-inference",
    per_device_eval_batch_size=8,
    dataloader_mode="async_rebatched",
    dataloader_drop_last=True,
    logging_steps=10,
    pod_type=pod_type
)

trainer = IPUTrainer(
    model,
    ipu_config,
    args,
    eval_dataset=packed_dataset,
    compute_metrics=compute_metrics
)

In [None]:
trainer.evaluate()

As before, to get a correct throughput estimation we need to multiply `eval_samples_per_second` by the average packing factor (6.72).