# Training a Tokenizer

## Configuration

[forgather_config.yaml](forgather_config.yaml)  
[forgather_demo/paths.yaml](forgather_demo/paths.yaml)  
[../templates/common/tokenizers/causal_bpe.yaml](../templates/common/tokenizers/causal_bpe.yaml)  
[../templates/common/tokenizers/tiny_2k_bpe.yaml](../templates/common/tokenizers/tiny_2k_bpe.yaml)  
[../templates/common/tokenizers/whitelist.yaml](../templates/common/tokenizers/whitelist.yaml)  

### See Also
[Configuration Loader](forgather.ipynb)  
[TokenizerTrainer implementation](../aiws/tokenizer_trainer.py)  
[forgather.config](../forgather/config.py)  

### Dataset

dataset-id: 'roneneldan/TinyStories'

Huggingface dataset link:  
[https://huggingface.co/datasets/roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)  

Dataset Paper:  
[https://arxiv.org/abs/2305.07759](https://arxiv.org/abs/2305.07759)

### Tokenizers  
Rather than working with the raw ASCII/Unicode text from the dataset, we will be "tokenizing" the data. A tokenizer is a statistical model that aggregates individual characters into sub-word units, where the most frequent strings of characters are replaced by unique symbols.

https://en.wikipedia.org/wiki/Large_language_model#Probabilistic_tokenization

For this tutorial, we will create a Byte Pair Encoding (BPE) tokenizer, which starts with all of the symbols from the ASCII character set, then creates tokens for the most common pairs of ASCII characters. These pairs are further aggregated into larger symbols, and the process repeats until a set of symbols matching the target vocabulary size has been created.

By starting with the ASCII character set, it is possible to represent any combination of letters, including those which were not observed when the tokenizer was created.
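
The merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not the Hugging Face implementation: it works on characters rather than bytes, it ignores pre-tokenization, and the names `train_bpe` and `merge_pair` are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts out as a tuple of single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # Nothing left to merge.
        # Pick the most frequent pair; ties go to the lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the new, larger symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


words = ["hug"] * 5 + ["pug"] * 3 + ["hugs"] * 2
merges = train_bpe(words, num_merges=2)
print(merges)
```

In this toy corpus, 'u' followed by 'g' is the most frequent pair, so it is merged first; the new symbol 'ug' then pairs most often with 'h'.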

In [1]:
# Import some things we will need and get the meta-config for the tutorials.
import sys
if '..' not in sys.path: sys.path.insert(0, '..')
import os

import pprint
import tokenizers

from aiws.dotdict import DotDict
from aiws.tokenizer_trainer import TokenizerTrainer
from forgather.config import(
    pconfig,
    preprocess_config,
    load_config,
    materialize_config,
    load_whitelist_as_set,
)

# Load meta-configuration
metacfg = DotDict(load_config('forgather_config.yaml').config)
pconfig(metacfg)

project_templates: 'forgather_demo'
templates: '../templates'
tokenizer_dir: '../tokenizers'
datasets_dir: '../datasets'
assets_dir: '..'
search_paths:
  - 'forgather_demo'
  - '../templates'
  - '../model_zoo'
whitelist_path: 'forgather_demo/whitelist.yaml'
model_src_dir: '../model_zoo'
script_dir: '../scripts'
train_script_path: '../scripts/train_script.py'
models_dir: 'forgather_demo/output_models'
dataset_id: 'roneneldan/TinyStories'
tokenizer_def: '../templates/common/tokenizers/tiny_2k_bpe.yaml'
tokenizer_path: '../tokenizers/tiny_stories_2k'
tokenizers_whitelist: '../templates/common/tokenizers/whitelist.yaml'


## Tokenizer Definition Template

We will use a tokenizer definition template to make things easier.  
[../templates/common/tokenizers/causal_bpe.yaml](../templates/common/tokenizers/causal_bpe.yaml)  

This template takes four arguments:
- dataset_id: The dataset to train the tokenizer on.
- dataset_split: Which split to train on -- datasets are often divided into multiple splits, like 'train' and 'validate'
- model_max_length: The maximum sequence length of the model the tokenizer will be used with. This is meta-data, which is stored in the tokenizer.
- vocab_size: The number of unique tokens in the tokenizer, excluding 'special tokens'

We will first preprocess the template with Jinja2 and see what it looks like.

In [4]:
bpe_tokenizer_template_path = os.path.join(metacfg.templates, 'common', 'tokenizers', 'causal_bpe.yaml')
pp_config = preprocess_config(
    bpe_tokenizer_template_path,
    dataset_id = 'roneneldan/TinyStories',
    dataset_split = 'train',
    model_max_length = 2048,
    vocab_size = 2000,
)
pconfig(pp_config)

     1: # BPE Tokenizer Definition for Causal Model
     2: # 2024-07-08 08:45:11
     3: # dataset_id: 'roneneldan/TinyStories'
     4: # dataset_split: 'train
     5: # model_max_length: '2048'
     6: # vocab_size: '2000'
     7: 
     8: special_tokens_map: &special_tokens_map
     9:     bos: "<|BOS|>" # Beginning of Sequence; the first token in a sequence
    10:     pad: "<|PAD|>" # Padding, used to pad out samples in a batch.
    11:     eos: "<|EOS|>" # End of Sequence; typically is used to stop generation.
    12:     unk: "<|UNK|>" # Unknown; used when a symbol can't be represented.
    13: 
    14: # TokenizerTrainer args
    15: # aiws.tokenizer_trainer.TokenizerTrainer
    16: trainer_args: &trainer_args
    17:     # https://huggingface.co/docs/tokenizers/api/trainers#tokenizers.trainers.BpeTrainer
    18:     model: !callable:tokenizers:models.BPE
    19:         cache_capacity: 16
    20:         unk_token: "<|UNK|>"
    21:         byte_fallback: True
    22: 
    23:

We are manually injecting the template arguments above, but there is a pre-defined template for this configuration, as this tokenizer is used for the other tutorials.

As for the base template, you can experiment with modifying it or, better yet, sub-class it using [template inheritance](https://jinja.palletsprojects.com/en/3.1.x/templates/#template-inheritance).

In [82]:
with open(metacfg.tokenizer_def) as f:
    print(f.read())

-- set dataset_id = 'roneneldan/TinyStories'
-- set model_max_length = 2048
-- set vocab_size = 2000
-- set dataset_split = 'train'

-- include 'common/tokenizers/causal_bpe.yaml'
tokenizer_name: "tiny_stories_2k"


### Materialize the Configuration

This will instantiate concrete objects, corresponding to the config definition.

As we have already preprocessed the config, we will use that, but we could have skipped that step.
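
The general idea of deferred construction, where the configuration records what to build and a later step builds it after checking a whitelist, can be sketched with the standard library alone. This is not forgather's implementation; the `Latent` class and `materialize` function below are hypothetical stand-ins, and the `'module:attribute'` target format is only assumed from the printed output in this tutorial.

```python
import importlib
from dataclasses import dataclass, field


@dataclass
class Latent:
    """A deferred call: a 'module:attribute' target plus the arguments to call it with."""
    target: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)


def materialize(node, whitelist):
    """Recursively replace Latent nodes with the objects they describe."""
    if isinstance(node, Latent):
        # Refuse to import or call anything that has not been explicitly allowed.
        if node.target not in whitelist:
            raise PermissionError(f"{node.target!r} is not whitelisted")
        module_name, _, qualname = node.target.partition(":")
        obj = importlib.import_module(module_name)
        for part in qualname.split("."):
            obj = getattr(obj, part)
        args = [materialize(a, whitelist) for a in node.args]
        kwargs = {k: materialize(v, whitelist) for k, v in node.kwargs.items()}
        return obj(*args, **kwargs)
    if isinstance(node, dict):
        return {k: materialize(v, whitelist) for k, v in node.items()}
    if isinstance(node, list):
        return [materialize(v, whitelist) for v in node]
    return node  # Plain values pass through unchanged.


config = {
    "ratio": Latent("fractions:Fraction", (3, 4)),
    "counts": Latent("collections:Counter", (["a", "b", "a"],)),
    "name": "demo",
}
objects = materialize(config, whitelist={"fractions:Fraction", "collections:Counter"})
print(objects)
```

Keeping the description separate from construction means a configuration can be inspected, printed, or rejected before any code it names is imported.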

In [5]:
latent_config = load_config(pp_config, preprocess=False, load_method="from_string")
print(latent_config.config)

{'special_tokens_map': {'bos': '<|BOS|>', 'pad': '<|PAD|>', 'eos': '<|EOS|>', 'unk': '<|UNK|>'}, 'trainer_args': {'model': Latent('tokenizers:models.BPE', *[], **{'cache_capacity': 16, 'unk_token': '<|UNK|>', 'byte_fallback': True}), 'normalizer': Latent('tokenizers:normalizers.NFC', *[], **{}), 'pre_tokenizer': Latent('tokenizers:pre_tokenizers.ByteLevel', *[], **{}), 'decoder': Latent('tokenizers:decoders.ByteLevel', *[], **{}), 'post_processor': Latent('tokenizers:processors.TemplateProcessing', *[], **{'single': '<bos> $A', 'special_tokens': [('<bos>', 0)]}), 'trainer': Latent('tokenizers.trainers:BpeTrainer', *[], **{'vocab_size': 2000, 'initial_alphabet': Latent('tokenizers:pre_tokenizers.ByteLevel.alphabet', *[], **{}), 'special_tokens': Latent('forgather.construct:values', *[{'bos': '<|BOS|>', 'pad': '<|PAD|>', 'eos': '<|EOS|>', 'unk': '<|UNK|>'}], **{}), 'show_progress': False}), 'dataset': Latent('forgather.construct:get_item', *[Latent('datasets:load_dataset', *['roneneldan/

In [6]:
config_out = materialize_config(
    pp_config,
    # We already did this
    preprocess=False,
    whitelist=load_whitelist_as_set(metacfg.tokenizers_whitelist).config,
    search_path=metacfg.search_paths,
    # The input is text, not a file path.
    load_method="from_string"
)

# Assign to a DotDict for easy access
config = DotDict(config_out.config)
pconfig(config)

Repo card metadata block was not found. Setting CardData to empty.


special_tokens_map:
  bos: '<|BOS|>'
  pad: '<|PAD|>'
  eos: '<|EOS|>'
  unk: '<|UNK|>'
trainer_args:
  model: <tokenizers.models.BPE object at 0x7fc6dd4777d0>
  normalizer: <tokenizers.normalizers.NFC object at 0x7fc6952eb7b0>
  pre_tokenizer: <tokenizers.pre_tokenizers.ByteLevel object at 0x7fc6952eb430>
  decoder: <tokenizers.decoders.ByteLevel object at 0x7fc6dd3dfdb0>
  post_processor: <tokenizers.processors.TemplateProcessing object at 0x7fc6dd3ddec0>
  trainer: <tokenizers.trainers.BpeTrainer object at 0x7fc6953dcc50>
  dataset:
    Dataset({
        features: ['text'],
        num_rows: 2119719
    })
pretrained_tokenizer_fast_args:
  bos_token: '<|BOS|>'
  eos_token: '<|EOS|>'
  unk_token: '<|UNK|>'
  pad_token: '<|PAD|>'
  return_special_tokens_mask: False
  model_max_length: 2048
  padding_side: 'right'
  truncation_side: 'right'


## Dataset
Implementation: [datasets.py](../tutorial_code/datasets.py)  
See Also: [dataset.ipynb](./dataset.ipynb)

For experimentation, we will want a bit of sample text to work with. 
Grab the first example in the dataset and show it.

In [7]:
sample_text = config.trainer_args['dataset'][0]['text']
print(sample_text)

One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.


### Special Tokens Map

The special tokens map assigns consistent names to what are otherwise configurable representations for the special tokens.

In [8]:
pconfig(config.special_tokens_map)

bos: '<|BOS|>'
pad: '<|PAD|>'
eos: '<|EOS|>'
unk: '<|UNK|>'


### Pre-tokenizer

The pre-tokenizer breaks the input text into sub-strings via a regular expression. For example, a simple pre-tokenizer could split the input on spaces and punctuation.

We will be using the "ByteLevel" pre-tokenizer, which uses a GPT-2 specific regex for splitting the words and replaces spaces with the 'Ġ' character. Newlines show up as 'Ċ' in the output.

In [9]:
pre_tokenizer = config.trainer_args['pre_tokenizer']

def test_pretokenizer(pre_tokenizer, sample_text):
    tokens = pre_tokenizer.pre_tokenize_str(sample_text)
    for token in tokens:
        print(f"'{token[0]}'", end=" ")
    print("\n")

test_pretokenizer(pre_tokenizer, sample_text)

'ĠOne' 'Ġday' ',' 'Ġa' 'Ġlittle' 'Ġgirl' 'Ġnamed' 'ĠLily' 'Ġfound' 'Ġa' 'Ġneedle' 'Ġin' 'Ġher' 'Ġroom' '.' 'ĠShe' 'Ġknew' 'Ġit' 'Ġwas' 'Ġdifficult' 'Ġto' 'Ġplay' 'Ġwith' 'Ġit' 'Ġbecause' 'Ġit' 'Ġwas' 'Ġsharp' '.' 'ĠLily' 'Ġwanted' 'Ġto' 'Ġshare' 'Ġthe' 'Ġneedle' 'Ġwith' 'Ġher' 'Ġmom' ',' 'Ġso' 'Ġshe' 'Ġcould' 'Ġsew' 'Ġa' 'Ġbutton' 'Ġon' 'Ġher' 'Ġshirt' '.' 'Ċ' 'Ċ' 'Lily' 'Ġwent' 'Ġto' 'Ġher' 'Ġmom' 'Ġand' 'Ġsaid' ',' 'Ġ"' 'Mom' ',' 'ĠI' 'Ġfound' 'Ġthis' 'Ġneedle' '.' 'ĠCan' 'Ġyou' 'Ġshare' 'Ġit' 'Ġwith' 'Ġme' 'Ġand' 'Ġsew' 'Ġmy' 'Ġshirt' '?"' 'ĠHer' 'Ġmom' 'Ġsmiled' 'Ġand' 'Ġsaid' ',' 'Ġ"' 'Yes' ',' 'ĠLily' ',' 'Ġwe' 'Ġcan' 'Ġshare' 'Ġthe' 'Ġneedle' 'Ġand' 'Ġfix' 'Ġyour' 'Ġshirt' '."' 'Ċ' 'Ċ' 'Together' ',' 'Ġthey' 'Ġshared' 'Ġthe' 'Ġneedle' 'Ġand' 'Ġsewed' 'Ġthe' 'Ġbutton' 'Ġon' 'ĠLily' ''s' 'Ġshirt' '.' 'ĠIt' 'Ġwas' 'Ġnot' 'Ġdifficult' 'Ġfor' 'Ġthem' 'Ġbecause' 'Ġthey' 'Ġwere' 'Ġsharing' 'Ġand' 'Ġhelping' 'Ġeach' 'Ġother' '.' 'ĠAfter' 'Ġthey' 'Ġfinished' ',' 'ĠLily' 'Ġthanked' 'Ġher'
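
To make the 'Ġ' and 'Ċ' symbols less mysterious, here is a rough, standard-library approximation of byte-level pre-tokenization. The byte-to-character table follows the scheme published with GPT-2; the split pattern is an ASCII-only simplification of the GPT-2 regex (the real one uses Unicode character classes), and `pre_tokenize` is a name invented for this sketch, not part of the Tokenizers API.

```python
import re


def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character (GPT-2 scheme)."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes (space, newline, controls) are shifted to code points >= 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# ASCII-only simplification of the GPT-2 split pattern.
SPLIT = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")


def pre_tokenize(text, add_prefix_space=True):
    """Split text into pieces, then rewrite each piece with the byte-level alphabet."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    table = bytes_to_unicode()
    return ["".join(table[b] for b in piece.encode("utf-8")) for piece in SPLIT.findall(text)]


pieces = pre_tokenize("Lily's mom.\n\nHi")
print(pieces)
```

The space byte (32) lands on 'Ġ' and the newline byte (10) on 'Ċ', which is why those characters appear wherever the original text had a space or a line break.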

TODO: Explain the other tokenizer modules.

## Train the Tokenizer

When we 'train' the tokenizer, it builds a statistical model of the vocabulary from the dataset. In the case of the BPE tokenizer, it starts with a vocabulary consisting of the 256 byte-level symbols, which include the ASCII character set (plus special tokens), and begins by finding the most common pairs of symbols; these become the next tokens in the set.

The original base alphabet and the pairs can then be combined into pairs to create larger strings of characters. The process is repeated until the target vocabulary size is reached.

With the same configuration and dataset, it will always produce the same set of tokens, but a different dataset, with a different distribution of symbols, would produce a different optimal set of tokens.

The process is fairly CPU intensive and can take a bit of time.

We will be using my [TokenizerTrainer implementation](../aiws/tokenizer_trainer.py), as this makes things easier, but feel free to look at the implementation.

In [10]:
trainer = TokenizerTrainer(**config.trainer_args)
trainer.train()

**** Training Tokenizer ****
total_samples: 2119719
batch_size: 1000
steps: 2119


  0%|                                                                                                         …

**** Training Completed ****
runtime: 26.4
samples_per_second: 80292.39


The base "[Tokenizers](https://huggingface.co/docs/tokenizers/index)" API is relatively low level. We would like to wrap the tokenizer object in the higher-level [Transformers Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) class.

The TokenizerTrainer can do this for us, but we will need some additional arguments. The configuration has already generated these, so we can go ahead and wrap the tokenizer.

If you need to modify the wrapped tokenizer, it can be found as the 'backend_tokenizer' attribute.

In [11]:
pconfig(config.pretrained_tokenizer_fast_args)
print('*' * 40)
tokenizer = trainer.as_pretrained_tokenizer_fast(**config.pretrained_tokenizer_fast_args)
print(tokenizer)

bos_token: '<|BOS|>'
eos_token: '<|EOS|>'
unk_token: '<|UNK|>'
pad_token: '<|PAD|>'
return_special_tokens_mask: False
model_max_length: 2048
padding_side: 'right'
truncation_side: 'right'
****************************************
PreTrainedTokenizerFast(name_or_path='', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|PAD|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


### Test the tokenizer

We can use the new tokenizer to tokenize text via the object's `__call__` method, like this:

In [12]:
input_ids = tokenizer(sample_text)['input_ids']
print(input_ids)

# We can convert these to their symbolic representations like this.
# Note the 'Ġ' symbols. The tokenizer has folded spaces into the tokens, where this symbol represents the space.
# A consequence of this encoding is that tokens may exist for the same word, both with and without a space.
# For example, "she" and " she" would be represented as separate tokens.
for ids in [input_ids]:
    print(tokenizer.convert_ids_to_tokens(ids))

[0, 490, 359, 15, 262, 402, 449, 504, 361, 597, 262, 791, 310, 319, 312, 762, 17, 316, 708, 307, 285, 1034, 73, 474, 1388, 87, 269, 364, 345, 307, 790, 307, 285, 384, 290, 83, 17, 361, 447, 269, 951, 266, 791, 310, 345, 312, 369, 15, 353, 341, 463, 441, 90, 262, 1841, 306, 348, 312, 384, 315, 87, 17, 202, 202, 600, 472, 269, 312, 369, 268, 330, 15, 331, 780, 15, 338, 597, 746, 791, 310, 17, 1282, 349, 951, 307, 345, 521, 268, 441, 90, 655, 384, 315, 87, 480, 868, 369, 502, 268, 330, 15, 331, 835, 15, 361, 15, 368, 476, 951, 266, 791, 310, 268, 1306, 632, 384, 315, 87, 419, 202, 202, 55, 82, 557, 15, 367, 1658, 266, 791, 310, 268, 441, 90, 267, 266, 1841, 306, 348, 361, 375, 384, 315, 87, 17, 412, 285, 389, 1034, 73, 474, 1388, 87, 371, 451, 790, 367, 432, 384, 1397, 268, 1766, 760, 575, 17, 1456, 367, 1446, 15, 361, 861, 312, 369, 371, 384, 1397, 266, 791, 310, 268, 1306, 292, 312, 384, 315, 87, 17, 323, 900, 516, 408, 790, 367, 365, 1658, 268, 1371, 569, 17]
['<|BOS|>', 'ĠOne', 'Ġday'
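
How a word turns into tokens can be sketched by replaying merge rules in the order they were learned. This is a simplified model of BPE encoding, not the Tokenizers library's algorithm, and the three merge rules below are invented for illustration rather than taken from the trained tokenizer.

```python
def encode_word(word, merges):
    """Split `word` into characters, then apply each merge rule in learned order."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                # Fuse the pair into a single symbol and re-check from the same position.
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols


# Hypothetical merge rules; 'Ġ' stands for a leading space.
merges = [("h", "e"), ("s", "he"), ("Ġ", "she")]

print(encode_word("she", merges))
print(encode_word("Ġshe", merges))
print(encode_word("Ġhe", merges))
```

With these rules, "she" and "Ġshe" each become a single, different token, while "Ġhe" stays split because no rule joins the space marker to "he".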

In [13]:
# We can decode token ids with decode() or batch_decode()
decoded_tokens = tokenizer.batch_decode([input_ids], skip_special_tokens=False, clean_up_tokenization_spaces=True)
for s in decoded_tokens:
    print(f"\"{s}\"")

"<|BOS|> One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together."


---
We can dump the vocabulary of the tokenizer. The first part will contain our special tokens and the ASCII character set. After this, the number of characters in each token grows, with the largest tokens at the end.

In [14]:
# Dump a range of the tokenizer's vocabulary
def show_vocabulary(tokenizer, token_range):
    for i, token in zip(token_range, tokenizer.batch_decode([i for i in token_range], skip_special_tokens=False)):
        print(f"'{i}: {token}'", end=" ")
    print("\n")

# Show the first and last 64 tokens.
show_vocabulary(tokenizer, range(64))
show_vocabulary(tokenizer, range(tokenizer.vocab_size - 64, tokenizer.vocab_size))

# Show full vocab.
#show_vocabulary(tokenizer, range(tokenizer.vocab_size))

'0: <|BOS|>' '1: <|PAD|>' '2: <|EOS|>' '3: <|UNK|>' '4: !' '5: "' '6: #' '7: $' '8: %' '9: &' '10: '' '11: (' '12: )' '13: *' '14: +' '15: ,' '16: -' '17: .' '18: /' '19: 0' '20: 1' '21: 2' '22: 3' '23: 4' '24: 5' '25: 6' '26: 7' '27: 8' '28: 9' '29: :' '30: ;' '31: <' '32: =' '33: >' '34: ?' '35: @' '36: A' '37: B' '38: C' '39: D' '40: E' '41: F' '42: G' '43: H' '44: I' '45: J' '46: K' '47: L' '48: M' '49: N' '50: O' '51: P' '52: Q' '53: R' '54: S' '55: T' '56: U' '57: V' '58: W' '59: X' '60: Y' '61: Z' '62: [' '63: \' 

'1936:  shap' '1937:  shook' '1938:  exploring' '1939:  moved' '1940:  purp' '1941:  year' '1942: aughty' '1943:  nearby' '1944:  naughty' '1945:  star' '1946:  soup' '1947:  shop' '1948:  wise' '1949:  stars' '1950:  owl' '1951:  bring' '1952: fused' '1953:  jar' '1954: bow' '1955: Do' '1956: ocked' '1957:  inv' '1958:  exp' '1959:  whe' '1960: yard' '1961:  caught' '1962:  su' '1963: ward' '1964:  Emma' '1965:  backyard' '1966:  seemed' '1967: ail' '1968:  es' '1969
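
The start of the vocabulary can be reproduced by hand. The sketch below assumes the layout suggested by the output above: the four special tokens first, then the byte-level alphabet sorted by code point. That ordering is inferred from the printed ids, not taken from the library's documentation, and `byte_level_alphabet` is a helper written for this example.

```python
SPECIAL_TOKENS = ["<|BOS|>", "<|PAD|>", "<|EOS|>", "<|UNK|>"]


def byte_level_alphabet():
    """Return the 256 byte-level symbols (GPT-2 scheme), sorted by code point."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    # The remaining byte values (space, newline, controls) map to code points from 256 up.
    remapped = range(256, 256 + (256 - len(printable)))
    return sorted(chr(c) for c in [*printable, *remapped])


vocab = SPECIAL_TOKENS + byte_level_alphabet()
token_to_id = {token: i for i, token in enumerate(vocab)}

for token in ["!", ",", ".", "Ċ", "Ġ"]:
    print(repr(token), token_to_id[token])
print(len(vocab))
```

Under this assumption ',' gets id 15, '.' gets id 17 and the newline symbol 'Ċ' gets id 202, which matches the ids seen in the encoded sample text. Learned merges would then occupy the ids from 260 upward.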

### Save tokenizer
We can save the tokenizer, which will allow us to skip recreating it from scratch next time.

In [15]:
tokenizer.save_pretrained(metacfg.tokenizer_path)

('../tokenizers/tiny_stories_2k/tokenizer_config.json',
 '../tokenizers/tiny_stories_2k/special_tokens_map.json',
 '../tokenizers/tiny_stories_2k/tokenizer.json')

### Load tokenizer
We can load our saved tokenizer -- or the tokenizer from any Huggingface model -- with this interface.

In [16]:
from transformers import AutoTokenizer

# Load a tokenizer from a local path -- or from a Huggingface model name.
# Rather than starting from scratch, you could replace the path below with the path or name of an existing model and use its tokenizer.
tokenizer = AutoTokenizer.from_pretrained(metacfg.tokenizer_path)
print(tokenizer)

PreTrainedTokenizerFast(name_or_path='../tokenizers/tiny_stories_2k', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|PAD|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


## Quick Build
This function is roughly equivalent to the tutorial.

It checks whether the tokenizer has already been built and, if so, skips the build. To rebuild anyway, pass 'force=True'.

[source](../tutorial_code/tokenizer.py)
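
The skip-if-already-built behaviour can be sketched as below. This is not the code of `make_project_tokenizer`; `build_if_missing` and `fake_build` are hypothetical helpers, and the sketch assumes that "already built" simply means the output directory exists.

```python
import os
import tempfile


def build_if_missing(output_dir, build_fn, force=False):
    """Run build_fn(output_dir) unless output_dir already exists. Returns True if it ran."""
    if os.path.isdir(output_dir) and not force:
        return False  # Already built; nothing to do.
    os.makedirs(output_dir, exist_ok=True)
    build_fn(output_dir)
    return True


def fake_build(output_dir):
    """Stand-in for training and saving a tokenizer."""
    with open(os.path.join(output_dir, "tokenizer.json"), "w") as f:
        f.write("{}")


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "tiny_stories_2k")
    results = [
        build_if_missing(target, fake_build),              # first call builds
        build_if_missing(target, fake_build),              # second call is skipped
        build_if_missing(target, fake_build, force=True),  # force rebuilds
    ]
print(results)
```

A directory check is the simplest possible test; a more careful version might look for specific files such as 'tokenizer.json' before deciding to skip.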

In [2]:
import sys
if '..' not in sys.path: sys.path.insert(0, '..')
from tutorial_code.tokenizer import make_project_tokenizer
make_project_tokenizer(force=False)

Repo card metadata block was not found. Setting CardData to empty.


**** Training Tokenizer ****
total_samples: 2119719
batch_size: 1000
steps: 2119


  0%|                                                                                                         …

**** Training Completed ****
runtime: 26.54
samples_per_second: 79868.84
