In [None]:
%load_ext autoreload
%autoreload 2

# Training

> Efficient training tricks like Sequence Packing


In [None]:
# | hide
from nbdev.showdoc import *

In [None]:
from dart_math.train import *

## Accelerating Several Times with Sequence Packing in 4 Lines of Code


Our interfaces can be integrated with the [HuggingFace `datasets`](https://huggingface.co/docs/datasets/en/index) in 4 lines of code:

```python
from dart_math.train import monkey_patch4pack, make_supervised_dset
# ...
monkey_patch4pack(model)
pack_dset = make_supervised_dset(tokenizer=tokenizer, data_path=data_args.data_path, pack_len=training_args.model_max_length, query_field=data_args.query_field, resp_field=data_args.resp_field, prompt_template=data_args.prompt_template)
trainer = Trainer(model=model, tokenizer=tokenizer, train_dataset=pack_dset)
```

`monkey_patch4pack` monkey-patches the model's `_get_unpad_data` method.

`make_supervised_dset` will

1. load, tokenize and cache the dataset;
2. pack the data points into computation sequences.

For a more detailed usage example, please refer to our [training script for DART-Math](https://github.com/hkust-nlp/dart-math/blob/main/pipeline/train.py).

In addition, for a general dataset object of the form `[{"input_ids": [...], "labels": [...], "attention_mask": [...]}, ...]`, you can wrap it with `PackedDataset` to apply sequence packing:

```python
from dart_math.train import PackedDataset
# ...
dset = PackedDataset(dataset=dset, tokenizer=tokenizer, pack_len=4096)
```


In [None]:
show_doc(monkey_patch4pack, title_level=3)

In [None]:
show_doc(make_supervised_dset, title_level=3)

## Sequence Packing


### Sequence Packing Is 6-8x Faster than Simple Batching


**Simple batching**, which pads every data sequence to the maximum training length, wastes a lot of computation and memory on padding tokens, especially when data sequences are short and the maximum training length is long.

For example, if the maximum training length is 4096 (the context length of many base models such as Mistral-7B, and the length of the longest data sequences in some datasets such as MATH) and data sequences are ~512 tokens long on average (as in most math SFT datasets), we **waste almost 1 - 1/8 = 7/8 of the computation and memory on padding tokens**.

**Sequence packing can eliminate this waste almost completely without affecting the training dynamics** (for most models nowadays), except for the number of data sequences in one batch.

In the example above, sequence packing gives a **speedup of about 6-8x**.
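The arithmetic behind this estimate can be checked with a short sketch. The numbers are the rough figures from the example above, and the resulting 8x is an idealized upper bound that assumes every pack is filled exactly; real packs are rarely full, which is presumably why the practical figure is quoted as a range.

```python
# Rough figures from the example above (assumptions, not measurements).
max_len = 4096  # maximum training length
avg_len = 512   # average data sequence length

useful_fraction = avg_len / max_len    # share of tokens that are real data
wasted_fraction = 1 - useful_fraction  # share spent on padding tokens
ideal_speedup = max_len / avg_len      # upper bound with perfectly full packs

print(f"wasted on padding: {wasted_fraction:.3f}")  # 0.875 = 7/8
print(f"ideal speedup: {ideal_speedup:.0f}x")       # 8x
```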


### Basic Idea of Sequence Packing


The basic idea of sequence packing is

- to **merge/pack short data sequences into a single computation sequence as long as the maximum training length** to **eliminate most of the waste on padding tokens**,
- while trying our best to **not affect the training dynamics** by
  - manipulating **attention masks** to avoid cross-contamination between different data sequences,
  - working with **relative positional encoding** to avoid the positional information mismatch for the non-first data sequences in the packed computation sequence.
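The merging step can be illustrated with a simple greedy first-fit strategy. This is only a minimal sketch for intuition: `pack_first_fit` is a hypothetical helper, not part of `dart_math`, and the library's actual packing strategy may differ.

```python
from typing import List


def pack_first_fit(lengths: List[int], pack_len: int) -> List[List[int]]:
    """Assign each sequence (by index) to the first pack with enough room.

    Hypothetical helper for illustration; not the dart_math implementation.
    """
    packs: List[List[int]] = []  # sequence indices in each pack
    free: List[int] = []         # remaining token budget of each pack
    for idx, n in enumerate(lengths):
        if n > pack_len:
            raise ValueError(f"sequence {idx} ({n} tokens) exceeds pack_len={pack_len}")
        for p, room in enumerate(free):
            if n <= room:
                packs[p].append(idx)
                free[p] -= n
                break
        else:  # no existing pack has room: open a new one
            packs.append([idx])
            free.append(pack_len - n)
    return packs


lengths = [5, 3, 4, 2, 6]  # toy token counts
packs = pack_first_fit(lengths, pack_len=8)
print(packs)  # [[0, 1], [2, 3], [4]]
```

Here five sequences that would occupy five padded rows under simple batching fit into three computation sequences of length 8.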


#### Manipulating Attention Masks to Avoid Cross-Contamination


<style>
    .container {
        display: flex;
        align-items: center;
    }
    .container img {
        height: 200px; /* Set the desired height */
        object-fit: cover; /* Maintains aspect ratio */
    }
    .caption {
        text-align: center;
        font-size: small;
        margin-top: 10px;
    }
</style>
<div class="container">
<img src="https://github.com/MeetKai/functionary/blob/main/functionary/train/packing/assets/cross_contamination.png?raw=true">
<img src="https://github.com/MeetKai/functionary/blob/main/functionary/train/packing/assets/correct_packing_attention.png?raw=true">
</div>

> Concretely, when we pack inputs, the attention should be only within individual sequences. For example, assume that we are packing 2 inputs: packed input = [input 1] [input 2]. Tokens from **input 1** only attend to tokens from **input 1** and tokens from **input 2** only attend to tokens from **input 2**
>
> Examples of packing 2 input sequences: "good morning my name is John" and "This is a dog". The first one is the attention matrix of packing with cross-contamination, the second one is the correct attention matrix of packing.
>
> c.f. https://github.com/MeetKai/functionary/tree/main/functionary/train/packing
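As a rough illustration of the correct attention matrix, the sketch below builds a dense block-diagonal causal mask from the lengths of the packed sequences. The helper `packed_causal_mask` is hypothetical and the lengths are toy values. Note that `monkey_patch4pack` works by patching `_get_unpad_data` instead of materializing such a dense mask, so this is for intuition only.

```python
from typing import List


def packed_causal_mask(seq_lens: List[int]) -> List[List[int]]:
    """Return mask[i][j] == 1 iff token i may attend to token j.

    Hypothetical helper: a token attends only to earlier-or-equal positions
    that belong to the same data sequence.
    """
    # Segment id of every token in the packed computation sequence.
    seg_ids = [seg for seg, n in enumerate(seq_lens) for _ in range(n)]
    total = len(seg_ids)
    return [
        [1 if seg_ids[i] == seg_ids[j] and j <= i else 0 for j in range(total)]
        for i in range(total)
    ]


mask = packed_causal_mask([3, 2])  # pack a 3-token and a 2-token sequence
for row in mask:
    print(row)
# [1, 0, 0, 0, 0]
# [1, 1, 0, 0, 0]
# [1, 1, 1, 0, 0]
# [0, 0, 0, 1, 0]
# [0, 0, 0, 1, 1]
```

The zero block in the lower left is what prevents tokens of the second sequence from attending to the first.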


#### Relative Positional Encoding Works Perfectly with Sequence Packing


At first glance, sequence packing introduces another problem: **the positional encodings of the non-first data sequences in one computation sequence are not the same as in the vanilla non-packing setting**.

This is indeed a problem for absolute positional encoding, but in practice it **does not matter for relative positional encodings** like [RoPE](https://arxiv.org/abs/2104.09864), which are close to the de facto standard nowadays.
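The claim about RoPE can be checked numerically: the attention score between a rotated query and key depends only on their position offset, so shifting a whole data sequence to a later place in the pack leaves its scores unchanged. The following is a minimal pure-Python sketch with toy vectors; `rope` and `dot` are hypothetical helpers written for this illustration, not library functions.

```python
import math
from typing import List


def rope(vec: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE)."""
    d = len(vec)
    out: List[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # angle for this 2-D pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]  # toy query vector
k = [1.0, 0.4, -0.6, 0.9]  # toy key vector

# Same offset (query 2 positions after key), different absolute positions.
score_unpacked = dot(rope(q, 5), rope(k, 3))    # sequence starts at position 0
score_packed = dot(rope(q, 105), rope(k, 103))  # same sequence shifted by 100

print(math.isclose(score_unpacked, score_packed, abs_tol=1e-9))  # True
```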


## API Reference


In [None]:
show_doc(PackedDataset, title_level=3)

In [None]:
show_doc(PackedDataset.stat, title_level=4)

In [None]:
show_doc(TokenizedSupervisedDataset, title_level=3)

In [None]:
show_doc(TokenizedSupervisedDataset.load_from_raw_dset, title_level=4)

In [None]:
show_doc(TokenizedSupervisedDataset.__getitem__, title_level=4)

In [None]:
show_doc(TokenizedSupervisedDataset.concat, title_level=4)

In [None]:
show_doc(TokenizedSupervisedDataset.shuffle, title_level=4)

In [None]:
show_doc(TokenizedSupervisedDataset.pad, title_level=4)

## Acknowledgements


Thanks to https://github.com/MeetKai/functionary/tree/main/functionary/train/packing. The code for sequence packing is largely based on it.
