## Configuration

[forgather_config.yaml](forgather_config.yaml)  
[forgather_demo/paths.yaml](forgather_demo/paths.yaml)  
[../templates/common/tokenizers/causal_bpe.yaml](../templates/common/tokenizers/causal_bpe.yaml)  
[../templates/common/tokenizers/tiny_2k_bpe.yaml](../templates/common/tokenizers/tiny_2k_bpe.yaml)  
[../templates/common/tokenizers/whitelist.yaml](../templates/common/tokenizers/whitelist.yaml)  

### See Also
[forgather.ipynb](forgather.ipynb)  
[aiws.tokenizer_trainer.TokenizerTrainer](../aiws/tokenizer_trainer.py)  


In [25]:
import sys
if '..' not in sys.path: sys.path.insert(0, '..')
import os

import pprint
import tokenizers
from tokenizers.processors import TemplateProcessing
from datasets import load_dataset

from aiws.dotdict import DotDict
from forgather.config import load_config
from pprint import pp, pformat

# Load meta-configuration
dirs = DotDict(load_config('forgather_config.yaml').config)
print(pformat(dirs))

{'assets_dir': '..',
 'dataset_id': 'roneneldan/TinyStories',
 'datasets_dir': '../datasets',
 'model_src_dir': '../model_zoo',
 'models_dir': 'forgather_demo/output_models',
 'project_templates': 'forgather_demo',
 'script_dir': '../scripts',
 'search_paths': ['forgather_demo', '../templates', '../model_zoo'],
 'templates': '../templates',
 'tokenizer_def': '../templates/common/tokenizers/tiny_2k_bpe.yaml',
 'tokenizer_dir': '../tokenizers',
 'tokenizer_path': '../tokenizers/tiny_stories_2k',
 'tokenizers_whitelist': '../templates/common/tokenizers/whitelist.yaml',
 'train_script_path': '../scripts/train_script.py',
 'whitelist_path': 'forgather_demo/whitelist.yaml'}


In [27]:
from forgather.config import materialize_config, fconfig, preprocess_config

config_out = materialize_config(dirs.tokenizer_def, whitelist=dirs.tokenizers_whitelist, search_path=dirs.search_paths)
print(fconfig(config_out.pp_config))
config = DotDict(config_out.config)

Repo card metadata block was not found. Setting CardData to empty.


     1: # BPE Tokenizer Definition for Causal Model
     2: # 2024-07-08 05:15:58
     3: # dataset_id: 'roneneldan/TinyStories'
     4: # dataset_split: 'train
     5: # model_max_length: '2048'
     6: # vocab_size: '2000'
     7: 
     8: trainer_args: &trainer_args
     9:     model: !callable:tokenizers:models.BPE
    10:         kwargs:
    11:             cache_capacity: 16
    12:             unk_token: "<|UNK|>"
    13:             byte_fallback: True
    14:     normalizer: !callable:tokenizers:normalizers.NFC []
    15:     pre_tokenizer: !callable:tokenizers:pre_tokenizers.ByteLevel []
    16:     decoder: !callable:tokenizers:decoders.ByteLevel []
    17: 
    18:     # Automatically add bos token to sequence start
    19:     post_processor: !callable:tokenizers:processors.TemplateProcessing
    20:         single: "<|BOS|> $A"
    21:         special_tokens: [[ "<|BOS|>", 0 ]]
    22:     trainer: !callable:tokenizers.trainers:BpeTrainer
    23:         kwargs:
    24:  

In [29]:
print(fconfig(config))

trainer_args:
  model: <tokenizers.models.BPE object at 0x7f5dcfcbd9d0>
  normalizer: <tokenizers.normalizers.NFC object at 0x7f5e7fea06f0>
  pre_tokenizer: <tokenizers.pre_tokenizers.ByteLevel object at 0x7f5dcd34e8b0>
  decoder: <tokenizers.decoders.ByteLevel object at 0x7f5e80f00ed0>
  post_processor: <tokenizers.processors.TemplateProcessing object at 0x7f5dcfcad320>
  trainer: <tokenizers.trainers.BpeTrainer object at 0x7f5dcfcbc770>
  dataset:
    Dataset({
        features: ['text'],
        num_rows: 2119719
    })
trainer: <aiws.tokenizer_trainer.TokenizerTrainer object at 0x7f5e80e848e0>
pretrained_tokenizer_fast_args:
  bos_token: '<|BOS|>'
  eos_token: '<|EOS|>'
  unk_token: '<|UNK|>'
  pad_token: '<|PAD|>'
  return_special_tokens_mask: False
  model_max_length: 2048
  padding_side: 'right'
  truncation_side: 'right'


In [30]:
config.trainer.train()

**** Training Tokenizer ****
total_samples: 2119719
batch_size: 1000
steps: 2119


**** Training Completed ****
runtime: 26.812073469161987
samples_per_second: 79058.377


In [45]:
tokenizer = config.trainer.as_pretrained_tokenizer_fast(**config.pretrained_tokenizer_fast_args)
print(tokenizer)

PreTrainedTokenizerFast(name_or_path='', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|PAD|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


## Dataset
Implementation: [datasets.py](../tutorial_code/datasets.py)  
See Also: [dataset.ipynb](./dataset.ipynb)

In [2]:
dataset_dict = load_dataset(dirs.dataset_id)

print(dataset_dict)
train_dataset = dataset_dict['train']

# For experimentation, we will want a bit of sample text to work with. 
# This will grab the first 500 characters from the first record of the training dataset.
sample_text = train_dataset['text'][0][:500]
print(sample_text)

Repo card metadata block was not found. Setting CardData to empty.


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})
One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them b


## Tokenizer
Rather than working with the raw ASCII/Unicode text from the dataset, we will be "tokenizing" the data. A tokenizer is a statistical model which aggregates individual characters into sub-word units, where the most frequent strings of characters are replaced by unique symbols.

https://en.wikipedia.org/wiki/Large_language_model#Probabilistic_tokenization

For this tutorial, we will create a Byte Pair Encoding (BPE) tokenizer, which starts with all of the symbols from the ASCII character set, then creates tokens for the most common pairs of adjacent characters. These pairs are further aggregated into larger symbols and the process repeats until a set of symbols matching the target vocabulary size has been created.

By starting with the ASCII character set, it is possible to represent any combination of letters, including those which were not observed when the tokenizer was created.
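
The merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration that operates on characters within whole words, not the byte-level implementation in the `tokenizers` library, and `toy_bpe_merges` is a hypothetical name used only here.

```python
from collections import Counter

def toy_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy sketch)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # Every word is already a single symbol.
        # The most frequent pair becomes a new, larger symbol.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = toy_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(vocab))
```

Each merge adds one symbol to the vocabulary, so the number of merges is what controls the final vocabulary size.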

### Special Tokens

Define the special tokens map.

The map simply associates the name of each special token with its representation in text.

In [None]:
special_tokens={
    "pad": "<|PAD|>",   # Used to pad unused positions in a sequence.
    "mask": "<|MASK|>", # Used with masked-language-modeling to mark a position as having been masked.
    "bos": "<|BOS|>",   # Beginning of Sequence
    "eos": "<|EOS|>",   # End of Sequence
    "unk": "<|UNK|>",   # Unknown
}
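
The order of this map matters later: the trainer is given `list(special_tokens.values())`, and the vocabulary dump further down shows these tokens occupying ids 0 through 4 in the same order. A minimal sketch of the resulting mapping, assuming ids simply follow list order (`special_token_ids` is an illustrative name):

```python
special_tokens = {
    "pad": "<|PAD|>",
    "mask": "<|MASK|>",
    "bos": "<|BOS|>",
    "eos": "<|EOS|>",
    "unk": "<|UNK|>",
}

# Assumption: special tokens receive ids in the order they are listed.
special_token_ids = {
    token: index for index, token in enumerate(special_tokens.values())
}
print(special_token_ids)
```

This is why the post-processor below refers to the BOS token by id 2.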

### Pre-tokenizer

Create a [pre-tokenizer](https://huggingface.co/docs/tokenizers/api/pre-tokenizers), which breaks the input text into sub-strings, typically via a regular expression. For example, a simple pre-tokenizer could split the input on spaces and punctuation.
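
As a rough illustration of that simple case, the sketch below splits text into words and individual punctuation marks with a regular expression. It is not the pattern any real pre-tokenizer uses, and `simple_pre_tokenize` is a hypothetical name.

```python
import re

def simple_pre_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    # Whitespace is dropped rather than kept as part of a token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_pre_tokenize("One day, a little girl named Lily found a needle."))
```

Unlike this sketch, the pre-tokenizer used in this tutorial keeps the space attached to the word that follows it, so the original text can be reconstructed exactly.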

We will be using the "ByteLevel" pre-tokenizer, which uses a GPT-2 specific regex for splitting the words and replaces spaces with the 'Ġ' character.

In [8]:
pre_tokenizer = tokenizers.pre_tokenizers.ByteLevel()

def test_pretokenizer(pre_tokenizer, sample_text):
    tokens = pre_tokenizer.pre_tokenize_str(sample_text)
    for token in tokens:
        print(f"'{token[0]}'", end=" ")
    print("\n")

test_pretokenizer(pre_tokenizer, sample_text)

'ĠOne' 'Ġday' ',' 'Ġa' 'Ġlittle' 'Ġgirl' 'Ġnamed' 'ĠLily' 'Ġfound' 'Ġa' 'Ġneedle' 'Ġin' 'Ġher' 'Ġroom' '.' 'ĠShe' 'Ġknew' 'Ġit' 'Ġwas' 'Ġdifficult' 'Ġto' 'Ġplay' 'Ġwith' 'Ġit' 'Ġbecause' 'Ġit' 'Ġwas' 'Ġsharp' '.' 'ĠLily' 'Ġwanted' 'Ġto' 'Ġshare' 'Ġthe' 'Ġneedle' 'Ġwith' 'Ġher' 'Ġmom' ',' 'Ġso' 'Ġshe' 'Ġcould' 'Ġsew' 'Ġa' 'Ġbutton' 'Ġon' 'Ġher' 'Ġshirt' '.' 'Ċ' 'Ċ' 'Lily' 'Ġwent' 'Ġto' 'Ġher' 'Ġmom' 'Ġand' 'Ġsaid' ',' 'Ġ"' 'Mom' ',' 'ĠI' 'Ġfound' 'Ġthis' 'Ġneedle' '.' 'ĠCan' 'Ġyou' 'Ġshare' 'Ġit' 'Ġwith' 'Ġme' 'Ġand' 'Ġsew' 'Ġmy' 'Ġshirt' '?"' 'ĠHer' 'Ġmom' 'Ġsmiled' 'Ġand' 'Ġsaid' ',' 'Ġ"' 'Yes' ',' 'ĠLily' ',' 'Ġwe' 'Ġcan' 'Ġshare' 'Ġthe' 'Ġneedle' 'Ġand' 'Ġfix' 'Ġyour' 'Ġshirt' '."' 'Ċ' 'Ċ' 'Together' ',' 'Ġthey' 'Ġshared' 'Ġthe' 'Ġneedle' 'Ġand' 'Ġsewed' 'Ġthe' 'Ġbutton' 'Ġon' 'ĠLily' ''s' 'Ġshirt' '.' 'ĠIt' 'Ġwas' 'Ġnot' 'Ġdifficult' 'Ġfor' 'Ġthem' 'Ġb' 



### Create a BPE Tokenizer Model

[Tokenizer Models](https://huggingface.co/docs/tokenizers/en/api/models)

For extra credit, try the other models at the link above.

In [9]:
tokenizer_model = tokenizers.models.BPE(
    cache_capacity=16,
    unk_token=special_tokens['unk'],
    byte_fallback=True
)

In [10]:
normalizer = tokenizers.normalizers.NFC()

In [11]:
# The decoder is applied when converting tokens back into text; the ByteLevel decoder
# is responsible for replacing the 'Ġ' character with spaces.
decoder = tokenizers.decoders.ByteLevel()
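
The 'Ġ' and 'Ċ' symbols come from a byte-to-character table. The following is a minimal sketch of the table used by GPT-2's reference encoder, which the ByteLevel components are assumed to follow here. The names `bytes_to_unicode`, `byte_level_encode`, and `byte_level_decode` are illustrative and not part of the `tokenizers` API.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already correspond to printable characters keep them.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # Remaining bytes (space, newline, control bytes, ...) are shifted to
    # code points starting at 256, in ascending byte order.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def byte_level_encode(text):
    # Encode to UTF-8 bytes, then replace each byte with its printable stand-in.
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def byte_level_decode(symbols):
    # Reverse the mapping and decode the bytes back into text.
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")

print(byte_level_encode(" she"))
print(byte_level_decode(byte_level_encode(" she")))
```

Because every byte has a stand-in character, the mapping is reversible, which is what lets the decoder restore spaces and newlines.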

In [12]:
# Automatically add the Beginning of Sequence (BOS) token to the output when 'add_special_tokens' is True.
# This is relevant to causal models, which predict the next token in a sequence. As the first real token lacks
# a preceding token, this allows the model to identify where the sequence actually begins.
#
# Note: A causal model can still function without a BOS token and the need to include it is debatable.
post_processor = TemplateProcessing(
    # Use the same BOS string as the special tokens map; id 2 is its position in that map.
    single=f"{special_tokens['bos']} $A",
    special_tokens=[
        (special_tokens['bos'], 2),
    ],
)

### Define a Constructor
The tokenizer's constructor is rather awkward to use, as most components are set as attributes rather than being passed to the constructor. This just makes construction act more like one would expect.
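
The definition of `construct_tokenizer` is not shown here. A minimal sketch of what such a helper could look like, assuming it takes the model plus the optional components and assigns them to a `tokenizers.Tokenizer`; the `post_processor` keyword is an assumption added for illustration:

```python
import tokenizers

def construct_tokenizer(model, normalizer=None, pre_tokenizer=None,
                        decoder=None, post_processor=None):
    """Build a tokenizers.Tokenizer from its components in a single call."""
    # Only the model is accepted by the Tokenizer constructor itself.
    tokenizer = tokenizers.Tokenizer(model)
    # The remaining components are plain attributes; set those that were given.
    if normalizer is not None:
        tokenizer.normalizer = normalizer
    if pre_tokenizer is not None:
        tokenizer.pre_tokenizer = pre_tokenizer
    if decoder is not None:
        tokenizer.decoder = decoder
    if post_processor is not None:
        tokenizer.post_processor = post_processor
    return tokenizer
```

With this sketch, a post-processor could be attached by passing `post_processor=post_processor` as a keyword argument.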

### Construct the Tokenizer

In [7]:
pretrained_tokenizer = construct_tokenizer(
    tokenizer_model,
    normalizer,
    pre_tokenizer,
    decoder
)

### Train the tokenizer

In [6]:
# Create a BPE trainer, which is used to build an optimal set of tokens from
# a given dataset.
tok_trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=config.tokenizer.vocab_size,
    initial_alphabet=tokenizers.pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=list(special_tokens.values()),
    show_progress=True,
)

# This abstraction is needed for the trainer to iterate over our dataset
def batch_iterator(dataset, batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]['text']

# Train the tokenizer on the dataset
# Be patient! This will take a bit of time to complete...
pretrained_tokenizer.train_from_iterator(batch_iterator(train_dataset), trainer=tok_trainer, length=len(train_dataset))

### Wrap the tokenizer

The BPE tokenizer class can be wrapped in a Huggingface [PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast) class, which makes working with the tokenizer easier.

In [31]:
from transformers import PreTrainedTokenizerFast

# This wraps the tokenizer in a Huggingface transformer tokenizer, which
# is a higher level abstraction
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=pretrained_tokenizer,
    # This should match the model's input length limit, which depends upon the architecture.
    # If no limit is specified, the default will be a VERY LARGE value.
    model_max_length=config.model.max_sequence_len,
    pad_token=special_tokens['eos'],
    mask_token=special_tokens['mask'],
    bos_token=special_tokens['bos'],
    eos_token=special_tokens['eos'],
    unk_token=special_tokens['unk'],
    return_special_tokens_mask=False,
)

print(tokenizer)

NameError: name 'config' is not defined

### Test the tokenizer

In [8]:
# We can use the new tokenizer to tokenize text via the object's __call__ method, like this:

input_ids = tokenizer(sample_text)['input_ids']
print(input_ids)

# We can convert these to their symbolic representations like this.
# Note the 'Ġ' symbols. The tokenizer has folded spaces into the tokens, where this symbol represents the space.
# A consequence of this encoding is that tokens may exist for the same word, both with and without a space.
# For example, "she" and " she" would be represented as separate tokens.
for ids in [input_ids]:
    print(tokenizer.convert_ids_to_tokens(ids))

[2, 491, 360, 16, 263, 403, 450, 505, 362, 598, 263, 792, 311, 320, 313, 763, 18, 317, 709, 308, 286, 1035, 74, 475, 1389, 88, 270, 365, 346, 308, 791, 308, 286, 385, 291, 84, 18, 362, 448, 270, 952, 267, 792, 311, 346, 313, 370, 16, 354, 342, 464, 442, 91, 263, 1842, 307, 349, 313, 385, 316, 88, 18, 203, 203, 601, 473, 270, 313, 370, 269, 331, 16, 332, 781, 16, 339, 598, 747, 792, 311, 18, 1283, 350, 952, 308, 346, 522, 269, 442, 91, 656, 385, 316, 88, 481, 869, 370, 503, 269, 331, 16, 332, 836, 16, 362, 16, 369, 477, 952, 267, 792, 311, 269, 1307, 633, 385, 316, 88, 420, 203, 203, 56, 83, 558, 16, 368, 1659, 267, 792, 311, 269, 442, 91, 268, 267, 1842, 307, 349, 362, 376, 385, 316, 88, 18, 413, 286, 390, 1035, 74, 475, 1389, 88, 372, 452, 271]
['<|BOS|>', 'ĠOne', 'Ġday', ',', 'Ġa', 'Ġlittle', 'Ġgirl', 'Ġnamed', 'ĠLily', 'Ġfound', 'Ġa', 'Ġneed', 'le', 'Ġin', 'Ġher', 'Ġroom', '.', 'ĠShe', 'Ġknew', 'Ġit', 'Ġwas', 'Ġdif', 'f', 'ic', 'ul', 't', 'Ġto', 'Ġplay', 'Ġwith', 'Ġit', 'Ġbecause', 

In [9]:
# We can decode token ids with decode() or batch_decode()
decoded_tokens = tokenizer.batch_decode([input_ids], skip_special_tokens=False, clean_up_tokenization_spaces=True)
for s in decoded_tokens:
    print(f"\"{s}\"")

"<|BOS|> One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them b"


---
We can dump the vocabulary of the tokenizer. The first part will contain our special tokens and the ASCII character set. After this, the number of characters in each token grows, with the largest tokens at the end.

In [12]:
# Dump a range of the tokenizer's vocabulary
def show_vocabulary(tokenizer, token_range):
    for i, token in zip(token_range, tokenizer.batch_decode([i for i in token_range], skip_special_tokens=False)):
        print(f"'{i}: {token}'", end=" ")
    print("\n")

# Show the first and last 64 tokens.
show_vocabulary(tokenizer, range(64))
show_vocabulary(tokenizer, range(tokenizer.vocab_size - 64, tokenizer.vocab_size))

# Show full vocab.
#show_vocabulary(tokenizer, range(tokenizer.vocab_size))

'0: <|PAD|>' '1: <|MASK|>' '2: <|BOS|>' '3: <|EOS|>' '4: <|UNK|>' '5: !' '6: "' '7: #' '8: $' '9: %' '10: &' '11: '' '12: (' '13: )' '14: *' '15: +' '16: ,' '17: -' '18: .' '19: /' '20: 0' '21: 1' '22: 2' '23: 3' '24: 4' '25: 5' '26: 6' '27: 7' '28: 8' '29: 9' '30: :' '31: ;' '32: <' '33: =' '34: >' '35: ?' '36: @' '37: A' '38: B' '39: C' '40: D' '41: E' '42: F' '43: G' '44: H' '45: I' '46: J' '47: K' '48: L' '49: M' '50: N' '51: O' '52: P' '53: Q' '54: R' '55: S' '56: T' '57: U' '58: V' '59: W' '60: X' '61: Y' '62: Z' '63: [' 

'1936: ek' '1937:  shap' '1938:  shook' '1939:  exploring' '1940:  moved' '1941:  purp' '1942:  year' '1943: aughty' '1944:  nearby' '1945:  naughty' '1946:  star' '1947:  soup' '1948:  shop' '1949:  wise' '1950:  stars' '1951:  owl' '1952:  bring' '1953: fused' '1954:  jar' '1955: bow' '1956: Do' '1957: ocked' '1958:  inv' '1959:  exp' '1960:  whe' '1961: yard' '1962:  caught' '1963:  su' '1964: ward' '1965:  Emma' '1966:  backyard' '1967:  seemed' '1968: ail'

### Save tokenizer
We can save the tokenizer, which will allow us to skip recreating it from scratch next time.

In [10]:
tokenizer.save_pretrained(config.model_path)

('/home/dinalt/ai_assets/models/tiny/tokenizer_config.json',
 '/home/dinalt/ai_assets/models/tiny/special_tokens_map.json',
 '/home/dinalt/ai_assets/models/tiny/tokenizer.json')

### Load tokenizer
We can load our saved tokenizer -- or the tokenizer from any Huggingface model -- with this interface.

In [11]:
from transformers import AutoTokenizer

# Load a tokenizer from a local path -- or from a Huggingface model name.
# Rather than starting from scratch, you could replace 'model_path' with the path of an existing model and use its tokenizer.
tokenizer = AutoTokenizer.from_pretrained(config.model_path)
print(tokenizer)

PreTrainedTokenizerFast(name_or_path='/home/dinalt/ai_assets/models/tiny', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|EOS|>', 'mask_token': '<|MASK|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|MASK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


## Quick Build
This function performs the same steps as the tutorial above.  
[source](../tutorial_code/tokenizer.py)

In [None]:
tokenizer = train_bpe_tokenizer(config, train_dataset)