# Project Index

[Custom Model Notebook](../../../notebooks/custom_model.ipynb)  
[Training Notebook](../../../notebooks/train.ipynb)  
[Project Config Notebook](../../../notebooks/project_config.ipynb)  
[Forgather Notebook](../../../notebooks/forgather.ipynb)  

import forgather.ml.notebooks as nb

nb.display_project_index(config_template="", materialize=True, pp_first=False)

## Tiny Stories BPE Tokenizer

An example BPE tokenizer trained on the Tiny Stories dataset.

### 2k
- vocabulary_size: 2000
- model_max_length: 2048

### 8k
- vocabulary_size: 8000
- model_max_length: 2048

## Meta Config
Project Directory: /home/dinalt/ai_assets/forgather/examples/tokenizers/tiny_stories_bpe

Meta Config: [/home/dinalt/ai_assets/forgather/examples/tokenizers/tiny_stories_bpe/meta.yaml](meta.yaml)

- [meta.yaml](meta.yaml)

Template Search Paths:
- [/home/dinalt/ai_assets/forgather/examples/tokenizers/tiny_stories_bpe/templates](templates)
- [/home/dinalt/ai_assets/forgather/templates](../../../templates)

## Available Configurations
- [256.yaml](templates/configs/256.yaml)
- [2k.yaml](templates/configs/2k.yaml)
- [8k.yaml](templates/configs/8k.yaml)

Default Configuration: 2k.yaml

Active Configuration: 2k.yaml

## Available Templates
- [project.yaml](templates/project.yaml)
- [configs/256.yaml](templates/configs/256.yaml)
- [configs/2k.yaml](templates/configs/2k.yaml)
- [configs/8k.yaml](templates/configs/8k.yaml)
- [trainers/accel_trainer.yaml](../../../templates/trainers/accel_trainer.yaml)
- [trainers/trainer.yaml](../../../templates/trainers/trainer.yaml)
- [trainers/hf_trainer.yaml](../../../templates/trainers/hf_trainer.yaml)
- [trainers/base_trainer.yaml](../../../templates/trainers/base_trainer.yaml)
- [model_ctor/args.yaml](../../../templates/model_ctor/args.yaml)
- [projects/tiny.yaml](../../../templates/projects/tiny.yaml)
- [datasets/abstract/pretokenized_dataset.yaml](../../../templates/datasets/abstract/pretokenized_dataset.yaml)
- [datasets/abstract/base_datasets.yaml](../../../templates/datasets/abstract/base_datasets.yaml)
- [datasets/tiny/tiny_stories.yaml](../../../templates/datasets/tiny/tiny_stories.yaml)
- [datasets/tiny/tiny_stories_abridged.yaml](../../../templates/datasets/tiny/tiny_stories_abridged.yaml)
- [models/dynamic_lm.yaml](../../../templates/models/dynamic_lm.yaml)
- [models/causal_transformer.yaml](../../../templates/models/causal_transformer.yaml)
- [models/gpt2.yaml](../../../templates/models/gpt2.yaml)
- [models/llama.yaml](../../../templates/models/llama.yaml)
- [models/abstract/causal_lm_from_config.yaml](../../../templates/models/abstract/causal_lm_from_config.yaml)
- [models/abstract/base_language_model.yaml](../../../templates/models/abstract/base_language_model.yaml)
- [models/abstract/custom_causal_lm.yaml](../../../templates/models/abstract/custom_causal_lm.yaml)
- [models/abstract/causal_lm_from_pretrained.yaml](../../../templates/models/abstract/causal_lm_from_pretrained.yaml)
- [models/abstract/load_model.yaml](../../../templates/models/abstract/load_model.yaml)
- [models/tiny/tiny_causal.yaml](../../../templates/models/tiny/tiny_causal.yaml)
- [models/tiny/tiny_gpt2.yaml](../../../templates/models/tiny/tiny_gpt2.yaml)
- [models/tiny/tiny_llama.yaml](../../../templates/models/tiny/tiny_llama.yaml)
- [models/tiny/tiny_d128_l2.yaml](../../../templates/models/tiny/tiny_d128_l2.yaml)
- [prompts/tiny_stories.yaml](../../../templates/prompts/tiny_stories.yaml)
- [callbacks/base_callbacks.yaml](../../../templates/callbacks/base_callbacks.yaml)
- [callbacks/loggers.yaml](../../../templates/callbacks/loggers.yaml)
- [types/meta_template.yaml](../../../templates/types/meta_template.yaml)
- [types/type.yaml](../../../templates/types/type.yaml)
- [types/tokenizer/tokenizer.yaml](../../../templates/types/tokenizer/tokenizer.yaml)
- [types/tokenizer/bpe/bpe.yaml](../../../templates/types/tokenizer/bpe/bpe.yaml)
- [types/model/model_type.yaml](../../../templates/types/model/model_type.yaml)
- [types/training_script/training_script.yaml](../../../templates/types/training_script/training_script.yaml)
- [types/training_script/causal_lm/causal_lm.yaml](../../../templates/types/training_script/causal_lm/causal_lm.yaml)
- [paths/example_paths.yaml](../../../templates/paths/example_paths.yaml)
- [tokenizers/tiny_2k.yaml](../../../templates/tokenizers/tiny_2k.yaml)
- [tokenizers/tiny_8k.yaml](../../../templates/tokenizers/tiny_8k.yaml)

## Included Templates
- [configs/2k.yaml](templates/configs/2k.yaml)
    - [project.yaml](templates/project.yaml)
        - [paths/example_paths.yaml](../../../templates/paths/example_paths.yaml)
        - [types/tokenizer/bpe/bpe.yaml](../../../templates/types/tokenizer/bpe/bpe.yaml)
            - [types/tokenizer/tokenizer.yaml](../../../templates/types/tokenizer/tokenizer.yaml)
                - [types/type.yaml](../../../templates/types/type.yaml)
                - [inc/formatting.jinja](../../../templates/inc/formatting.jinja)

### Config Metadata

```python
{'config_description': 'BPE tokenizer trained on Tiny Stories dataset w/ 2K '
                       'tokens',
 'config_name': 'Tiny Stories 2K',
 'datasets_dir': '../../../datasets',
 'model_max_length': '2048',
 'models_dir': 'output_models',
 'output_dir': '../../../tokenizers/tiny_stories_2k',
 'project_dir': '.',
 'tokenizer_name': 'tiny_stories_2k',
 'tokenizers_dir': '../../../tokenizers',
 'vocab_size': '2000'}

```

## Modules
## Preprocessed Config

```yaml
#---------------------------------------
#             Tiny Stories 2K            
#---------------------------------------
# 2024-08-08T01:00:01
# Description: BPE tokenizer trained on Tiny Stories dataset w/ 2K tokens
# Project Dir: .
# Current Working Dir: "/home/dinalt/ai_assets/forgather/examples/tokenizers/tiny_stories_bpe"
# Forgather Config Dir: "/home/dinalt/.config/forgather"

############# Config Vars ##############

# ns.models_dir: "output_models"
# ns.tokenizers_dir: "../../../tokenizers"
# ns.datasets_dir: "../../../datasets"
# tokenizer_name: 'tiny_stories_2k'
# output_dir: '../../../tokenizers/tiny_stories_2k'
# model_max_length: '2048'
# vocab_size: '2000'
# dataset_id: 'roneneldan/TinyStories'
# dataset_split: 'train'

########## Special Tokens Map ##########

.define: &special_tokens_map !dict:@special_tokens_map
    bos: "<|BOS|>" # Beginning of Sequence; the first token in a sequence
    pad: "<|PAD|>" # Padding, used to pad out samples in a batch.
    eos: "<|EOS|>" # End of Sequence; typically is used to stop generation.
    unk: "<|UNK|>" # Unknown; used when a symbol can't be represented.

#### Pretrained Tokenizer Fast Args ####

.define: &tokenizer_args !dict:@tokenizer_args
    bos_token: "<|BOS|>"
    eos_token: "<|EOS|>"
    unk_token: "<|UNK|>"
    pad_token: "<|PAD|>"
    return_special_tokens_mask: True
    model_max_length: 2048
    padding_side: "right"
    truncation_side: "right"

###### Tokenizer Training Dataset ######

.define: &tokenizer_dataset !singleton:operator:getitem@tokenizer_dataset
    - !singleton:datasets:load_dataset [ "roneneldan/TinyStories" ]
    - "train"

########## Tokenizer Trainer ###########

.define: &tokenizer_trainer !lambda:forgather.ml.tokenizer:train_tokenizer@tokenizer_trainer
    output_dir: "../../../tokenizers/tiny_stories_2k"
    dataset: *tokenizer_dataset
    args: *tokenizer_args

    model: !singleton:tokenizers:models.BPE
        cache_capacity: 16
        unk_token: "<|UNK|>"
        byte_fallback: True
    normalizer: !singleton:tokenizers:normalizers.NFC []
    pre_tokenizer: !singleton:tokenizers:pre_tokenizers.ByteLevel []
    decoder: !singleton:tokenizers:decoders.ByteLevel []
    # Automatically add bos token to sequence start
    post_processor: !singleton:tokenizers:processors.TemplateProcessing
        single: "<bos> $A"
        special_tokens: [ !tuple [ "<bos>", 0 ] ]
    trainer: !singleton:tokenizers.trainers:BpeTrainer
        vocab_size: 2000
        initial_alphabet: !singleton:tokenizers:pre_tokenizers.ByteLevel.alphabet []
        special_tokens: !singleton:list [!singleton:values [*special_tokens_map]]
        show_progress: False

#---------------------------------------
#          Configuration Output          
#---------------------------------------
meta: &meta_output !dict:@meta
    config_name: "Tiny Stories 2K"
    config_description: "BPE tokenizer trained on Tiny Stories dataset w/ 2K tokens"
    project_dir: "."
    models_dir: "output_models"
    tokenizers_dir: "../../../tokenizers"
    datasets_dir: "../../../datasets"
    tokenizer_name: "tiny_stories_2k"
    output_dir: "../../../tokenizers/tiny_stories_2k"
    vocab_size: "2000"
    model_max_length: "2048"

main: !singleton:forgather.ml.construct:build_rule
    target: "../../../tokenizers/tiny_stories_2k/tokenizer.json"
    recipe: *tokenizer_trainer
    loader: !lambda:transformers:AutoTokenizer.from_pretrained
        - "../../../tokenizers/tiny_stories_2k"

```

## Loaded Configuration to YAML

```yaml
.define: &meta !singleton:named_dict@meta
    config_name: 'Tiny Stories 2K'
    config_description: 'BPE tokenizer trained on Tiny Stories dataset w/ 2K tokens'
    project_dir: '.'
    models_dir: 'output_models'
    tokenizers_dir: '../../../tokenizers'
    datasets_dir: '../../../datasets'
    tokenizer_name: 'tiny_stories_2k'
    output_dir: '../../../tokenizers/tiny_stories_2k'
    vocab_size: '2000'
    model_max_length: '2048'

.define: &tokenizer_dataset !singleton:operator:getitem@tokenizer_dataset
    - !singleton:datasets:load_dataset
        - 'roneneldan/TinyStories'
    - 'train'

.define: &tokenizer_args !singleton:named_dict@tokenizer_args
    bos_token: '<|BOS|>'
    eos_token: '<|EOS|>'
    unk_token: '<|UNK|>'
    pad_token: '<|PAD|>'
    return_special_tokens_mask: True
    model_max_length: 2048
    padding_side: 'right'
    truncation_side: 'right'

.define: &special_tokens_map !singleton:named_dict@special_tokens_map
    bos: '<|BOS|>'
    pad: '<|PAD|>'
    eos: '<|EOS|>'
    unk: '<|UNK|>'

.define: &tokenizer_trainer !lambda:forgather.ml.tokenizer:train_tokenizer@tokenizer_trainer
    output_dir: '../../../tokenizers/tiny_stories_2k'
    dataset: *tokenizer_dataset
    args: *tokenizer_args
    model: !singleton:tokenizers:models.BPE
        cache_capacity: 16
        unk_token: '<|UNK|>'
        byte_fallback: True
    normalizer: !singleton:tokenizers:normalizers.NFC []
    pre_tokenizer: !singleton:tokenizers:pre_tokenizers.ByteLevel []
    decoder: !singleton:tokenizers:decoders.ByteLevel []
    post_processor: !singleton:tokenizers:processors.TemplateProcessing
        single: '<bos> $A'
        special_tokens: 
            - !singleton:named_tuple!tuple
                - '<bos>'
                - 0
    trainer: !singleton:tokenizers.trainers:BpeTrainer
        vocab_size: 2000
        initial_alphabet: !singleton:tokenizers:pre_tokenizers.ByteLevel.alphabet []
        special_tokens: !singleton:list
            - !singleton:values
                - *special_tokens_map
        show_progress: False


meta: *meta
main: !singleton:forgather.ml.construct:build_rule
    target: '../../../tokenizers/tiny_stories_2k/tokenizer.json'
    recipe: *tokenizer_trainer
    loader: !lambda:transformers:AutoTokenizer.from_pretrained
        - '../../../tokenizers/tiny_stories_2k'

```
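
The two tags used most in the YAML above appear to differ in when they run: the generated code further down renders `!singleton` nodes as direct calls and `!lambda` nodes as `lambda:` expressions. The sketch below illustrates that difference with the standard library only. `load_dataset_stub` is a hypothetical stand-in for `datasets.load_dataset`, and the "evaluated once" reading of `!singleton` is an assumption taken from the tag name.

```python
import operator

calls = []


def load_dataset_stub(name):
    """Hypothetical stand-in for datasets.load_dataset; records each call."""
    calls.append(name)
    return {"train": ["Once upon a time"], "validation": []}


# Like `!singleton:operator:getitem [ <dataset>, "train" ]`:
# evaluated right away, and the result is shared by every alias.
tokenizer_dataset = operator.getitem(
    load_dataset_stub("roneneldan/TinyStories"), "train"
)

# Like `!lambda:...train_tokenizer`: nothing runs until it is called.
tokenizer_trainer = lambda: {"dataset": tokenizer_dataset, "vocab_size": 2000}

print(calls)                # the dataset stub has already been called once
print(tokenizer_trainer())  # the deferred node runs only now
```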

### Generated Source Code

```python
from forgather.ml.tokenizer import train_tokenizer
from forgather.ml.construct import build_rule
from transformers import AutoTokenizer
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors
from tokenizers.trainers import BpeTrainer
from datasets import load_dataset

def construct():
    meta = {
        'config_name': 'Tiny Stories 2K',
        'config_description': 'BPE tokenizer trained on Tiny Stories dataset w/ 2K tokens',
        'project_dir': '.',
        'models_dir': 'output_models',
        'tokenizers_dir': '../../../tokenizers',
        'datasets_dir': '../../../datasets',
        'tokenizer_name': 'tiny_stories_2k',
        'output_dir': '../../../tokenizers/tiny_stories_2k',
        'vocab_size': '2000',
        'model_max_length': '2048',
    }

    tokenizer_dataset = load_dataset('roneneldan/TinyStories')['train']

    tokenizer_args = {
        'bos_token': '<|BOS|>',
        'eos_token': '<|EOS|>',
        'unk_token': '<|UNK|>',
        'pad_token': '<|PAD|>',
        'return_special_tokens_mask': True,
        'model_max_length': 2048,
        'padding_side': 'right',
        'truncation_side': 'right',
    }

    special_tokens_map = {
        'bos': '<|BOS|>',
        'pad': '<|PAD|>',
        'eos': '<|EOS|>',
        'unk': '<|UNK|>',
    }

    tokenizer_trainer = lambda: train_tokenizer(
        output_dir='../../../tokenizers/tiny_stories_2k',
        dataset=tokenizer_dataset,
        args=tokenizer_args,
        model=models.BPE(
            cache_capacity=16,
            unk_token='<|UNK|>',
            byte_fallback=True,
        ),
        normalizer=normalizers.NFC(),
        pre_tokenizer=pre_tokenizers.ByteLevel(),
        decoder=decoders.ByteLevel(),
        post_processor=processors.TemplateProcessing(
            single='<bos> $A',
            special_tokens=[
                (
                    '<bos>',
                    0,
                ),
            ],
        ),
        trainer=BpeTrainer(
            vocab_size=2000,
            initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
            special_tokens=list(
                special_tokens_map.values(),
            ),
            show_progress=False,
        ),
    )
    
    return {
        'meta': meta,
        'main': build_rule(
            target='../../../tokenizers/tiny_stories_2k/tokenizer.json',
            recipe=tokenizer_trainer,
            loader=lambda: AutoTokenizer.from_pretrained(
                '../../../tokenizers/tiny_stories_2k',
            ),
        ),
    }

```
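
The `main` entry passes a `target`, a `recipe`, and a `loader` to `build_rule`. This document does not show how `forgather.ml.construct.build_rule` is implemented, so the following is only a sketch of one plausible reading of those parameter names: a Make-style rule that runs the recipe when the target file is missing and then returns whatever the loader produces. The name `build_rule_sketch` and the JSON stand-ins are hypothetical.

```python
import json
import os
import tempfile


def build_rule_sketch(target, recipe, loader):
    """Assumed behavior: run `recipe` only if `target` is missing, then load."""
    if not os.path.exists(target):
        recipe()
    return loader()


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "tokenizer.json")
    calls = []

    def recipe():
        # Stand-in for train_tokenizer(...): writes the target file.
        calls.append("train")
        with open(target, "w") as f:
            json.dump({"vocab_size": 2000}, f)

    def loader():
        # Stand-in for AutoTokenizer.from_pretrained(...).
        with open(target) as f:
            return json.load(f)

    first = build_rule_sketch(target, recipe, loader)
    second = build_rule_sketch(target, recipe, loader)
    print(first, calls)
```

Under this reading, training happens on the first construction only, and later constructions load the saved tokenizer from disk.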

## Constructed Project

```python
{'main': PreTrainedTokenizerFast(name_or_path='../../../tokenizers/tiny_stories_2k', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|PAD|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
},
 'meta': {'config_description': 'BPE tokenizer trained on Tiny Stories dataset '
                                'w/ 2K tokens',
          'config_name': 'Tiny Stories 2K',
          'datasets_dir': '../../../datasets',
          'model_max_length': '2048',
          'models_dir': 'output_models',
          'output_dir': '../../../tokenizers/tiny_stories_2k',
          'project_dir': '.',
          'tokenizer_name': 'tiny_stories_2k',
          'tokenizers_dir': '../../../tokenizers',
          'vocab_size': '2000'}}

```
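
The `added_tokens_decoder` output above lists the special tokens with ids 0 to 3, in the same order as `special_tokens_map`. That matches the template `"<bos> $A"`, which maps `<bos>` to id 0. The sketch below reproduces the mapping in plain Python. It assumes that the trainer assigns ids to special tokens in list order starting at 0, which is consistent with the output above but is not stated in this document, and `prepend_bos` is a hypothetical helper, not a library function.

```python
special_tokens_map = {
    "bos": "<|BOS|>",
    "pad": "<|PAD|>",
    "eos": "<|EOS|>",
    "unk": "<|UNK|>",
}

# Same expression as in the generated code; dicts keep insertion order.
special_tokens = list(special_tokens_map.values())

# Assumption: special tokens receive ids in list order, starting at 0.
token_ids = {token: idx for idx, token in enumerate(special_tokens)}


def prepend_bos(ids, bos_id=token_ids["<|BOS|>"]):
    """Mimic the "<bos> $A" template: put the BOS id before the sequence."""
    return [bos_id, *ids]


print(token_ids)
print(prepend_bos([17, 42]))
```

If the order of `special_tokens_map` changed, the hard-coded `0` in the template would no longer point at the BOS token under this assumption, so the two need to stay in sync.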

