# Using HF (Hugging Face) based datasets with infctx trainer

The infctx trainer shifts heavily towards an HF-focused dataset parser, which comes with several pros and cons. This notebook aims to cover the common use cases for dataset handling and processing that are supported by the trainer code.

Because there are multiple possible strategies for parsing a dataset, they are evaluated in the following order by default:

- multi_column_keys (used if any column matches)
- prompt & completion (used if both columns exist)
- text (default baseline)

We will go through how each of the above dataset processing strategies works, starting with text (the default baseline).
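
The evaluation order above can be sketched in a few lines of Python. This is only an illustration of the documented order, not the trainer's actual code: the function name `pick_parsing_strategy` and the returned labels are hypothetical, and the default keys are taken from the `multi_column_keys` default described later in this notebook.

```python
from typing import Iterable, Sequence

# Default taken from the documented `data.multi_column_keys` default
DEFAULT_MULTI_COLUMN_KEYS = ("instruction", "input", "output")


def pick_parsing_strategy(
    columns: Iterable[str],
    multi_column_keys: Sequence[str] = DEFAULT_MULTI_COLUMN_KEYS,
) -> str:
    """Return which strategy would apply, following the documented order."""
    available = set(columns)

    # 1. multi_column_keys: used if ANY configured key is present
    if any(key in available for key in multi_column_keys):
        return "multi_column_keys"

    # 2. prompt & completion: used only if BOTH columns are present
    if {"prompt", "completion"} <= available:
        return "prompt_completion"

    # 3. text: the default baseline
    return "text"


print(pick_parsing_strategy(["category", "instruction", "input", "output"]))
print(pick_parsing_strategy(["prompt", "completion"]))
print(pick_parsing_strategy(["text", "prompt"]))
```

Note that a dataset with only a `prompt` column (and no `completion`) falls through to the text strategy in this sketch.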

> Important note: These examples focus only on how to configure your dataset, and do not properly perform checkpointing. For trainer configurations, refer to the training notebooks.

## Initial setup

Before we go into the dataset setup, let's perform an initial setup for all the folders we need, and create a small toy model which we will use throughout the various examples within this notebook.

In [None]:
# Setup the folders we will need
!mkdir -p ../../model/
!mkdir -p ../../datapath/
!mkdir -p ../../checkpoint/

# Initialize a simple L6-D512 model, for both the v4 neox (50277) tokenizer
!cd ../../RWKV-v5/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size neox --skip-if-exists ../model/L6-D512-neox-init.pth

# and rwkv world (65529) tokenizers
!cd ../../RWKV-v5/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size world --skip-if-exists ../model/L6-D512-world-init.pth

# If you have a custom vocab size, you can indicate accordingly as well with an int
!cd ../../RWKV-v5/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size 20259 --skip-if-exists ../model/L6-D512-V20259-init.pth

## Training using a text dataset

The following are the `example-hf-enwiki.yaml` settings for using a textual dataset via Hugging Face, with most of the comments removed:

---
```yaml
trainer:
  max_steps: 10
  target_batch_size: 32
model:
  load_model: ../model/L6-D512-neox-init.pth
  ctx_len: 1024
  lr_init: 3e-4
  bptt_learning: true
  bptt_learning_range: -1
data:
  data_path: ../datapath/enwiki_10k_neox_1024/

  source: "teven/enwiki_10k"
  # # Additional source dataset params, used to grab subsets
  # source_dataset_params:
  #   language: en

  tokenizer: neox
  min_token_size: 64
  max_token_size: -1
  text_rechunk_size: 2048
  text_rechunk_force: true
  custom_text_key: 'text'
  test_split: 0.01
  test_split_shuffle: false
```
---

### Understanding the `data` config, for textual datasets

Let's go through each of the data parameter settings and what they mean.

**data.data_path** 

This is where the HF datapath is saved, when used against existing HF data sources. This is a required parameter.

**data.source** 

This can be configured either as a Hugging Face dataset (eg. `teven/enwiki_10k`) or, if intended to be used with local files, as one of `text / json / csv / pandas` or the respective file path (you can point to a single `.txt/.jsonl` file and it should 'work', see the local file examples for more details).

**data.source_dataset_params** 

Additional params to configure the Hugging Face `load_dataset` command. This is only useful for larger datasets which support such parameters, to select specific subsets (defaults to an empty object).

**data.tokenizer**

The tokenizer to use for the dataset. Use either `neox` or `world` for the respective RWKV models. For a custom HF tokenizer, refer to the custom tokenizer example below. (defaults to neox)

**data.min_token_size/max_token_size**

Scans the given dataset, and skips data samples that fail to meet the given criteria. This is mostly useful for filtering out small, low quality data samples in large datasets, or data samples larger than what you intend to support. (This is done before rechunking, if enabled. Defaults to -1, which accepts all sizes)
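
As a rough illustration of the filtering rule, the sketch below keeps or skips a sample based on its token count. The function name `within_token_limits` is hypothetical, and treating both bounds as inclusive is an assumption; only the `-1` "no limit" behaviour comes from the description above.

```python
def within_token_limits(
    token_count: int,
    min_token_size: int = -1,
    max_token_size: int = -1,
) -> bool:
    """Return True if a sample with `token_count` tokens should be kept."""
    # A limit of -1 means "no limit" (the documented default)
    if min_token_size > 0 and token_count < min_token_size:
        return False
    if max_token_size > 0 and token_count > max_token_size:
        return False
    return True


# Settings from the example config: min_token_size 64, max_token_size -1
token_counts = [10, 64, 500, 5000]
kept = [n for n in token_counts if within_token_limits(n, min_token_size=64)]
print(kept)
```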

**data.text_rechunk_force**

Enables text rechunking. This means all the filtered data samples will be merged together, with a new line between them, before being split again by the rechunk size. This is mostly useful for a large corpus of raw text data, and is consistent with how existing foundation models are trained from raw text files. It also allows a more efficient training process (tokens/second), as each data sample will have the exact same token count. (Disabled by default, unless your source is literally a ".txt" file)

**data.text_rechunk_size** 

Number of tokens each data sample should have after rechunking. The recommended size is the context size you intend the model to support (ie. 2048, 4096, etc).
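
To make the merge-then-split idea concrete, here is a minimal sketch that works on lists of token ids. It is not the trainer's implementation: `NEWLINE_TOKEN` is a placeholder id standing in for the new line, and dropping the trailing partial chunk is an assumption made so that every chunk has the same length.

```python
from typing import List

NEWLINE_TOKEN = 0  # hypothetical token id standing in for "\n"


def rechunk(samples: List[List[int]], chunk_size: int) -> List[List[int]]:
    """Merge all samples (newline between them), then split into equal chunks."""
    merged: List[int] = []
    for index, tokens in enumerate(samples):
        if index > 0:
            merged.append(NEWLINE_TOKEN)
        merged.extend(tokens)

    # Assumption: the trailing partial chunk is dropped so all chunks match
    full_chunks = len(merged) // chunk_size
    return [
        merged[i * chunk_size:(i + 1) * chunk_size]
        for i in range(full_chunks)
    ]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = rechunk(samples, chunk_size=4)
print(chunks)
```

With real data the chunk size would be the configured `text_rechunk_size` (eg. 2048) rather than 4.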

**data.custom_text_key** 

For Hugging Face datasets (or json/csv), by default we use the `text` column if it is available. However, some datasets store their text in a different column (eg. `code`). This allows you to choose which column the text should be taken from. Note that for more complicated instruct/input/output examples, you will want to see the 'multi_column' guide/examples instead.

**data.test_split**

Important note: this is ignored if the HF dataset has an inbuilt test split.

If configured as a floating point number between 0.0 and 1.0, it is the fraction of the dataset (0.1 is 10%) that is used for test validation.

If configured as an int, it is the number of samples.

Due to some limitations in the current trainer code, even if it is set to 0, a single data sample will be used for the test split.

This defaults to 0.01 or 1%
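
The rules above can be summarised with a small helper. This is a sketch of the described behaviour only: the name `resolve_test_size` is hypothetical, and the rounding of fractional results is an assumption, as the exact rounding used by the trainer is not stated here.

```python
from typing import Union


def resolve_test_size(test_split: Union[int, float], total_samples: int) -> int:
    """Return how many samples would end up in the test split."""
    if isinstance(test_split, int):
        # An int is interpreted as an absolute number of samples
        count = test_split
    else:
        # A float between 0.0 and 1.0 is a fraction of the dataset
        # (assumption: rounded to the nearest whole sample)
        count = round(total_samples * test_split)

    # At least one sample is always used, even when configured as 0
    return min(max(count, 1), total_samples)


print(resolve_test_size(0.01, 10_000))
print(resolve_test_size(25, 10_000))
print(resolve_test_size(0, 10_000))
```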

**data.test_split_shuffle**

Performs a dataset shuffle before the test split. This defaults to false.

Note that this is not a truly random shuffle, but a deterministic shuffle, to ensure a consistent result.
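
A deterministic shuffle simply means the same input always produces the same order. The sketch below shows the idea with a fixed seed; the seed value and the helper name `deterministic_shuffle` are assumptions for illustration, not the values used by the trainer.

```python
import random
from typing import Iterable, List


def deterministic_shuffle(items: Iterable[int], seed: int = 42) -> List[int]:
    """Shuffle a copy of `items` with a fixed seed, so reruns give the same order."""
    shuffled = list(items)
    # A dedicated Random instance avoids touching the global random state
    random.Random(seed).shuffle(shuffled)
    return shuffled


first_run = deterministic_shuffle(range(10))
second_run = deterministic_shuffle(range(10))
print(first_run == second_run)
```

Because the order is reproducible, the same samples land in the test split every time the datapath is rebuilt.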

### Optimizing the `model.bptt_*` mode according to your dataset config ....

**model.load_model** 

This is the `model.pth` file you start the training process from. This can be an existing model you are finetuning from, or a new model that you initialized with the `init_model.py` script.

**model.ctx_len** 

This is the training context length used in the training process. For the infctx trainer, your data samples can be larger than the training context length. If so, the data sample is split into chunks accordingly, and trained in parts (with bptt_learning enabled, which it is by default).
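
The chunking described above can be pictured with the following sketch, which splits a list of token ids into pieces of at most `ctx_len` tokens. The helper name `split_into_chunks` is hypothetical, and the real trainer works on tensors and carries model state between chunks, which is not shown here.

```python
from typing import List


def split_into_chunks(tokens: List[int], ctx_len: int) -> List[List[int]]:
    """Split one data sample into consecutive chunks of at most ctx_len tokens."""
    return [tokens[start:start + ctx_len] for start in range(0, len(tokens), ctx_len)]


sample = list(range(2500))  # a 2500 token data sample
chunks = split_into_chunks(sample, ctx_len=1024)
print([len(chunk) for chunk in chunks])  # [1024, 1024, 452]
```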

This is ultimately a tradeoff between VRAM usage and GPU compute usage. While you can save on VRAM, this comes at an increased compute cost, as the first few chunks will need to be recalculated multiple times for each subsequent chunk. A minor loss learning penalty has also been recorded, especially for small context sizes.

As such, it is always recommended to configure this to be as large as your GPU can support, in powers of 2 (1024, 2048, 4096, ...), with some healthy VRAM buffer for checkpoints and gradients, and up to your dataset sample size (as it is pointless to go beyond that).

Typically this is 2048, 4096, or 8192 for ML training GPUs (24GB VRAM and above). For consumer GPUs, anything less than 512 is not recommended, due to the compounded loss learning penalty involved when used with large data samples.

**model.bptt_learning**

Enabled by default, this is the core feature of infctx trainer. If your training ctx length is equal to your dataset context length, you can disable bptt_learning for an insignificant speed boost (barely measurable).

In most cases it is better to just set bptt_learning_range to 1 instead of switching it off.

**model.bptt_learning_range**

`bptt_learning_range: -1` will work by default for all use cases on a single GPU.

However, when training across multiple GPUs, `bptt_learning_range: -1` has a small performance penalty, as it needs to synchronize the number of chunks across the GPUs.

This is especially an issue when training with mixed dataset sizes. If a single GPU is stuck with a significantly larger document with many chunks, all the other GPUs may be stuck waiting for that one GPU to complete.

In most cases this would be an acceptable compromise with a mixed sized dataset. However, if your dataset is of a fixed size, especially with 'rechunking' enabled, you can optimize multi-GPU training by configuring the learning range to be exactly equal to the number of chunks (eg: bptt_learning_range = 4, for a data size of 4096 and a training ctx_len of 1024).

You can also configure the range to be less than the number of chunks in a data sample, in which case the learning process will only happen for the last X configured chunks. This is not as bad as it sounds, and has its use cases (which will be documented separately).
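
The relationship between data size, `ctx_len`, and the learning range can be worked through with a short sketch. Both helper names are hypothetical, and treating any negative range as "all chunks" is an assumption based on the `-1` default; the real trainer applies this inside its training step.

```python
import math
from typing import List


def chunk_count(sample_len: int, ctx_len: int) -> int:
    """Number of chunks a data sample is split into."""
    return math.ceil(sample_len / ctx_len)


def learned_chunk_indices(num_chunks: int, bptt_learning_range: int = -1) -> List[int]:
    """Indices of the chunks where learning happens."""
    # Assumption: a negative range (the -1 default) means learn on every chunk
    if bptt_learning_range < 0 or bptt_learning_range >= num_chunks:
        return list(range(num_chunks))
    # Otherwise only the last `bptt_learning_range` chunks are learned from
    return list(range(num_chunks - bptt_learning_range, num_chunks))


num_chunks = chunk_count(4096, 1024)
print(num_chunks)                            # 4
print(learned_chunk_indices(num_chunks, -1))  # [0, 1, 2, 3]
print(learned_chunk_indices(num_chunks, 2))   # [2, 3]
```

For a rechunked dataset of 4096 tokens per sample and a `ctx_len` of 1024, this gives 4 chunks, matching the `bptt_learning_range` of 4 suggested above for multi-GPU training.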

---

### Download and preload the datapath from huggingface

In [None]:
# Let's preload the required dataset
!cd ../../RWKV-v5 && python3 preload_datapath.py ../notebook/dataset-config/example-hf-enwiki.yaml

In [None]:
# Validate the dataset is working, by doing a quick training run
!cd ../../RWKV-v5 && python3 lightning_trainer.py fit -c ../notebook/dataset-config/example-hf-enwiki.yaml

## Training using prompt completion pair dataset

Beyond foundation model training, one common format for finetuning is the `prompt` and `completion` pair. This is supported out of the box.

An example of a prompt/completion pair is as follows:

```json
{
  "prompt": "What is the dominant emotion of the user? I am happy. Output:",
  "completion": " Happy<|endoftext|>"
}
```

Setting this up is as simple as the following:

---
```yaml
trainer:
  max_steps: 10
  target_batch_size: 32
model:
  load_model: ../model/L6-D512-neox-init.pth
  ctx_len: 1024
  lr_init: 1e-4
  bptt_learning: true
  bptt_learning_range: -1
data:
  data_path: ../datapath/self-instruct-base/
  source: "eastwind/self-instruct-base"
  tokenizer: neox
  disable_prompt_completion_mask: false
```
---

**data.disable_prompt_completion_mask**

If the dataset uses the prompt/completion data layout, it is used by default in place of the text column. Typically, no additional configuration is required.

By default, the text in the prompt half has the "learning mask" disabled, while the text in the completion half has the "learning mask" enabled.

In practice, the model will not learn how to generate the prompt as an output, and the learning is focused on the completion half. As the name suggests, setting `disable_prompt_completion_mask: true` disables this masking, so the prompt half is trained on as well.
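
The masking behaviour can be illustrated with a toy example. The sketch below uses whitespace splitting as a stand-in for the real tokenizer, and the helper name `build_masked_sample` is hypothetical; a mask value of 1 marks a token the model learns from, and 0 marks a token that is ignored.

```python
from typing import List, Tuple


def build_masked_sample(
    prompt: str,
    completion: str,
    disable_prompt_completion_mask: bool = False,
) -> Tuple[List[str], List[int]]:
    """Return (tokens, learning mask) for a prompt/completion pair."""
    # Whitespace split is only a stand-in for the neox / world tokenizer
    prompt_tokens = prompt.split()
    completion_tokens = completion.split()
    tokens = prompt_tokens + completion_tokens

    if disable_prompt_completion_mask:
        # Masking disabled: learn from every token
        mask = [1] * len(tokens)
    else:
        # Default: ignore the prompt half, learn from the completion half
        mask = [0] * len(prompt_tokens) + [1] * len(completion_tokens)
    return tokens, mask


tokens, mask = build_masked_sample(
    "What is the dominant emotion of the user? I am happy. Output:",
    " Happy<|endoftext|>",
)
print(list(zip(tokens, mask)))
```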

---

### Preload the dataset and train it

In [None]:
# Let's preload the required dataset
!cd ../../RWKV-v5 && python3 preload_datapath.py ../notebook/dataset-config/example-hf-prompt-completion.yaml

In [None]:
# Validate the dataset is working, by doing a quick training run
!cd ../../RWKV-v5 && python3 lightning_trainer.py fit -c ../notebook/dataset-config/example-hf-prompt-completion.yaml

## Training using multi column keys

What if you want an alternative format with a more complicated layout, like instruction/input/output (or other formats)?

You can use `multi_column_keys`, which gives you precise control over how each data sample is processed.

For example, the following is a data record from the dolly instruction set (simplified):

```json
{
    "category": "closed_qa",
    "instruction": "When did Virgin Australia start operating?",
    "input": "Virgin Australia, the trading ....",
    "output": "Virgin Australia commenced services on ..."
}
```

If using the default settings as shown below, this will get converted into the following training text (ignoring masking)

```
Instruction:
When did Virgin Australia start operating?

Input:
Virgin Australia, the trading ....

Output:
Virgin Australia commenced services on ...
```

We can support the above dataset with the following settings:

---
```yaml
trainer:
  max_steps: 10
  target_batch_size: 32
model:
  load_model: ../model/L6-D512-neox-init.pth
  ctx_len: 1024
  lr_init: 1e-4
  bptt_learning: true
  bptt_learning_range: -1
data:
  data_path: ../datapath/self-instruct-base/
  source: "c-s-ale/dolly-15k-instruction-alpaca-format"
  tokenizer: neox
  multi_column_keys: ['instruction', 'input', 'output']
  multi_column_prefix: ['Instruction:\n', 'Input:\n', 'Output:\n']
  multi_column_train_mask: [true, false, true]
  multi_column_separator: '\n\n'
```
---

**data.multi_column_keys**

Defaults to: `['instruction', 'input', 'output']`

List of keys to detect, and use for your text data training. Requires at least one column to exist; all other columns will be ignored. Columns are matched in the given order.

**data.multi_column_prefix**

Defaults to `['Instruction:\n', 'Input:\n', 'Output:\n']`

For each matching column found, add the following string as a prefix, using the array position that matches `multi_column_keys`.

**data.multi_column_train_mask**

Defaults to `[true, false, true]`

For each matching column found, either apply the training mask where the model will learn from (true), or to ignore in the learning process (false).

**data.multi_column_separator**

Defaults to: `\n\n`

String to insert in between each matching column.

> Important note: As it is very common to use `\n` or the escape character `\` in multi column settings, ensure such strings are wrapped in single quotes (ie. `'\n'`), otherwise the backslash will get escaped into a double backslash.
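
Putting the four `multi_column_*` settings together, the following sketch shows how a record could be turned into training text. It is an illustration under stated assumptions rather than the trainer's code: the helper name `build_multi_column_sample` is hypothetical, columns missing from the record are simply skipped, and how the separator itself is masked is not covered.

```python
from typing import Dict, List, Sequence, Tuple


def build_multi_column_sample(
    record: Dict[str, str],
    keys: Sequence[str] = ("instruction", "input", "output"),
    prefixes: Sequence[str] = ("Instruction:\n", "Input:\n", "Output:\n"),
    train_mask: Sequence[bool] = (True, False, True),
    separator: str = "\n\n",
) -> Tuple[str, List[Tuple[str, bool]]]:
    """Return the joined text, and each segment with its learning flag."""
    segments: List[Tuple[str, bool]] = []
    for key, prefix, learn in zip(keys, prefixes, train_mask):
        if key not in record:
            continue  # assumption: columns missing from the record are skipped
        segments.append((prefix + record[key], learn))

    text = separator.join(segment for segment, _ in segments)
    return text, segments


record = {
    "category": "closed_qa",  # not in `keys`, so it is ignored
    "instruction": "When did Virgin Australia start operating?",
    "input": "Virgin Australia, the trading ....",
    "output": "Virgin Australia commenced services on ...",
}
text, segments = build_multi_column_sample(record)
print(text)
```

With the default settings, the printed text matches the training text layout shown earlier in this section, and the `input` segment is the only one flagged as not learned from.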

---

### Preload the dataset and train it

In [None]:
# Let's preload the required dataset
!cd ../../RWKV-v5 && python3 preload_datapath.py ../notebook/dataset-config/example-hf-multi-column-keys.yaml

In [None]:
# Validate the dataset is working, by doing a quick training run
!cd ../../RWKV-v5 && python3 lightning_trainer.py fit -c ../notebook/dataset-config/example-hf-multi-column-keys.yaml

## Training with a custom HF tokenizer

How about taking it to another use case altogether, with a custom tokenizer, like training RWKV for music generation?
You can load an existing HF tokenizer by simply changing the tokenizer value accordingly.

---
```yaml
...
data:
  # dataset_path for the prebuilt dataset, using HF `load_from_disk()`
  data_path: ../datapath/musnet/

  # @Breadlicker45 music dataset for musnet, and the tokenizer
  source: "breadlicker45/musenet-encoders-40k"
  
  # For huggingface tokenizer, just indicate the tokenizer project path respectively
  tokenizer: "breadlicker45/muse-tokenizer2"

  # Test split settings
  test_split: 0.005
  test_split_shuffle: true

  # Minimum / Maximum token size of the dataset to use
  min_token_size: -1
  max_token_size: -1

  # Custom text key, specific to the dataset
  custom_text_key: 'bing'

```
---

The key thing to note is the `tokenizer` value which will be passed to HF tokenizer implementation. 

### Preload the dataset and train it

In [None]:
# Let's preload the required dataset
!cd ../../RWKV-v5 && python3 preload_datapath.py ../notebook/dataset-config/example-hf-music-tokenizer.yaml

In [None]:
# Validate the dataset is working, by doing a quick training run
!cd ../../RWKV-v5 && python3 lightning_trainer.py fit -c ../notebook/dataset-config/example-hf-music-tokenizer.yaml