## Configuration
### Source
[aiws.yamldict](../aiws/yamldict.py)  
[tutorial_code.datasets](../tutorial_code/datasets.py)  
[tutorial_code.tokenizer](../tutorial_code/tokenizer.py)  
[tutorial_code.model_utils](../tutorial_code/model_utils.py)  
[tutorial_code.train](../tutorial_code/train.py)  
[tutorial_code.bigram_model](../tutorial_code/bigram_model.py)  
[tutorial_code.inference](../tutorial_code/inference.py)  

### See Also
[dataset.ipynb](dataset.ipynb)  
[tokenizer.ipynb](tokenizer.ipynb)  
[my_first_transformer.ipynb](my_first_transformer.ipynb)  

### Config
[config.yaml](config/config.yaml)  
[paths.yaml](config/paths.yaml)  
[dataset.yaml](config/dataset.yaml)  
[tokenizer.yaml](config/tokenizer.yaml)  
[training.yaml](config/training.yaml)   
[model.yaml](config/model.yaml)  

In [1]:
import os
import sys; sys.path.insert(0, '..')
import pprint
import torch
import transformers

from aiws.yamldict import load_yaml_dict
from tutorial_code.datasets import load_dataset_from_config, tokenize_datasetdict
from tutorial_code.tokenizer import train_bpe_tokenizer
from tutorial_code.model_utils import print_model_size, test_model_forward
from tutorial_code.bigram_model import BigramLM

config = load_yaml_dict("config/config.yaml")
pprint.pp(config)
config.model_path = os.path.join(config.paths.models_dir, config.model.model_id)

{'paths': {'models_dir': '/home/dinalt/ai_assets/models',
           'datasets_dir': '/home/dinalt/ai_assets/datasets'},
 'tokenizer': {'vocab_size': 2000},
 'train': {'per_device_train_batch_size': 64,
           'per_device_eval_batch_size': 128,
           'learning_rate': 0.001,
           'num_train_epochs': 1.0,
           'eval_steps': 1000,
           'num_warmup_steps': 0,
           'lr_scheduler_name': 'constant'},
 'dataset': {'dataset_id': 'roneneldan-TinyStories',
             'tokenized_dataset_path': './tiny_stories_tokenized',
             'train_select': 0.1,
             'validate_select': 0.1},
 'model': {'model_id': 'tiny',
           'max_sequence_len': 2048,
           'd_model': 128,
           'd_feedforward': 512,
           'num_attention_heads': 1,
           'num_hidden_layers': 2},
 'device': 'cuda'}
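
The cell above reads nested settings with attribute syntax (`config.paths.models_dir`) and assigns a new key the same way (`config.model_path = ...`). The following is a minimal sketch of how such a dictionary could behave. `AttrDict` and `demo_config` are hypothetical names used only for illustration; the actual implementation in `aiws.yamldict` may differ.

```python
import os


class AttrDict(dict):
    """Hypothetical dict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            value = self[name]
        except KeyError as err:
            raise AttributeError(name) from err
        # Wrap nested plain dicts on access so chained lookups work.
        if isinstance(value, dict) and not isinstance(value, AttrDict):
            value = AttrDict(value)
            self[name] = value
        return value

    def __setattr__(self, name, value):
        # Attribute assignment stores a regular dictionary key.
        self[name] = value


demo_config = AttrDict({
    "paths": {"models_dir": "/home/dinalt/ai_assets/models"},
    "model": {"model_id": "tiny"},
})

# Same pattern as the notebook cell: derive the model path from two settings.
demo_config.model_path = os.path.join(
    demo_config.paths.models_dir, demo_config.model.model_id
)
print(demo_config.model_path)
```

With the values shown in the printed configuration, the derived path is the `models/tiny` directory that the tokenizer is later saved to and loaded from.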


### Reload Module
Useful if you make changes to a module and don't want to restart the notebook.  
Otherwise, skip this cell.

In [24]:
import importlib

# If the module has not been imported, we first import it.
import tutorial_code.textgen

# Trigger a reload of the module.
importlib.reload(tutorial_code.textgen)

# If the symbol was imported with 'from,' reimport the symbol by running the cell with 'from' again.

# If something new was added and can't be found...
#importlib.invalidate_caches()

<module 'tutorial_code.textgen' from '/home/dinalt/ai_assets/aiworkspace/tutorial/../tutorial_code/textgen.py'>
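
The comment about symbols imported with `from` is easy to miss, so here is a small self-contained demonstration. It writes a throwaway module named `reload_demo` (a hypothetical name, unrelated to the tutorial code) to a temporary directory, edits it on disk, and reloads it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid writing bytecode caches so the demo always re-reads the source file.
previous_flag = sys.dont_write_bytecode
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp_dir:
    module_file = Path(tmp_dir) / "reload_demo.py"
    module_file.write_text("def version():\n    return 'old'\n")
    sys.path.insert(0, tmp_dir)
    try:
        importlib.invalidate_caches()
        import reload_demo
        from reload_demo import version  # binds the current function object

        # Edit the module on disk, as you would in an editor.
        module_file.write_text("def version():\n    return 'edited'\n")
        importlib.reload(reload_demo)

        via_module = reload_demo.version()  # the reloaded module sees the edit
        via_from_import = version()         # the earlier 'from' binding is stale

        from reload_demo import version     # re-run the 'from' import
        after_reimport = version()
    finally:
        # Clean up so the demo leaves no trace in the session.
        sys.path.remove(tmp_dir)
        sys.modules.pop("reload_demo", None)
        sys.dont_write_bytecode = previous_flag

print(via_module, via_from_import, after_reimport)
```

Access through the module object picks up the change immediately, while a name bound with `from ... import` keeps pointing at the old object until the import statement is run again.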

## Quick Load
If you have built a tokenizer and tokenized dataset already, you can just load them here and skip to training.

In [2]:
from transformers import AutoTokenizer
import datasets
dataset = load_dataset_from_config(config)
tokenizer = AutoTokenizer.from_pretrained(config.model_path)
tokenized_dataset = datasets.load_from_disk(config.dataset.tokenized_dataset_path)
sample_text = dataset['train']['text'][0][:500]
model = BigramLM(config.model.d_model, tokenizer.vocab_size)

## Dataset
We need data to train our model on. For this tutorial, we will use "TinyStories," a synthetic dataset generated by GPT-3.5 and GPT-4 and designed for training very small language models to produce coherent output. This is made possible by limiting the stories to words and concepts that a 3- to 4-year-old child would understand, with a total vocabulary of about 1500 words.

Huggingface dataset link:  
https://huggingface.co/datasets/roneneldan/TinyStories  

The paper describing the dataset:  
https://arxiv.org/abs/2305.07759

The first time this is run, it will download the dataset to your cache, which may take a few minutes. After that, the dataset will be loaded from your cache.

source: [tutorial_code.datasets.load_dataset_from_config()](../tutorial_code/datasets.py)

In [2]:
dataset = load_dataset_from_config(config)

print(dataset)
train_dataset = dataset['train']

# For experimentation, we will want a bit of sample text to work with. 
# This will grab the first 500 characters from the first record of the training dataset.
sample_text = train_dataset['text'][0][:500]
print(sample_text)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})
One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them b


## Tokenizer
Rather than working with the raw ASCII/Unicode from the dataset, we will be "tokenizing" the data. A tokenizer is a statistical model which aggregates individual characters into sub-word units, where the most frequent strings of characters are replaced by unique symbols.

https://en.wikipedia.org/wiki/Large_language_model#Probabilistic_tokenization

For this tutorial, we will create a Byte Pair Encoding (BPE) tokenizer, which starts with all of the symbols from the ASCII character set, then creates tokens for the most common pairs of ASCII characters. These pairs are further aggregated into larger symbols and the process repeats until a set of symbols matching the target vocabulary size has been created.

By starting with the ASCII character set, it is possible to represent any combination of letters, including those which were not observed when the tokenizer was created.
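
The merge procedure described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the code behind `train_bpe_tokenizer`: the function name `train_toy_bpe` is hypothetical, the base alphabet is just the characters present in the input rather than the full ASCII set, and it stops after a fixed number of merges instead of at a target vocabulary size.

```python
from collections import Counter


def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` pair merges, starting from single characters."""
    symbols = list(text)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols in the current sequence.
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so merging gains nothing
        merges.append(best_pair)

        # Replace each occurrence of the best pair with one merged symbol.
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols


toy_text = "aaabdaaabac"
toy_merges, toy_symbols = train_toy_bpe(toy_text, num_merges=3)
print(toy_merges)
print(toy_symbols)
```

Each merge shortens the sequence while the concatenation of the symbols still reproduces the original text, which is why unseen character combinations remain representable: anything that was never merged falls back to its individual characters.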

If you have not created a tokenizer, you can follow the [tokenizer tutorial](./tokenizer.ipynb) or just run the "Build Tokenizer" cell.

### Load tokenizer
We can load our saved tokenizer -- or the tokenizer from any Huggingface model -- with this interface.

In [3]:
from transformers import AutoTokenizer

# Load a tokenizer from a local path -- or from a Huggingface model name.
# Rather than starting from scratch, you could replace 'model_path' with the path of an existing model and use its tokenizer.
tokenizer = AutoTokenizer.from_pretrained(config.model_path)
print(tokenizer)

PreTrainedTokenizerFast(name_or_path='/home/dinalt/ai_assets/models/tiny', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|EOS|>', 'mask_token': '<|MASK|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|MASK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


### Build Tokenizer
If you have not built the tokenizer first, follow the linked tutorial...

[Tokenizer Notebook](tokenizer.ipynb)

...or just run this cell to build and save it.  
Building it can take a moment or three. Be patient!

In [3]:
tokenizer = train_bpe_tokenizer(config, dataset['train'])
print(tokenizer)
tokenizer.save_pretrained(config.model_path)




Completed training
PreTrainedTokenizerFast(name_or_path='', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|EOS|>', 'mask_token': '<|MASK|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|MASK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


('/home/dinalt/ai_assets/models/tiny/tokenizer_config.json',
 '/home/dinalt/ai_assets/models/tiny/special_tokens_map.json',
 '/home/dinalt/ai_assets/models/tiny/tokenizer.json')

## Tokenize dataset
Before training the model, we need to convert the text in the dataset to the token-ids used by the model.

This function is a fairly simple implementation of this functionality. It will:
- Split the dataset into a subset of the total, if 'select' is less than 1.0.
- Take each example from the dataset, in batches, and convert the text to the corresponding tokens.
- Truncate sequences longer than the model can process.
- Add padding tokens where the sequences in a batch differ in length.
- Remove unused columns from the data.
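
The steps listed above can be sketched with plain Python lists. The helper names below are hypothetical and `toy_encode` is a stand-in that maps each character to an id; it is not the BPE tokenizer, and `tokenize_datasetdict()` may be implemented differently. Truncating (rather than rounding) the selected row count is also an assumption, although it happens to reproduce the row counts of the tokenized dataset shown in this notebook.

```python
def select_fraction(rows, fraction):
    """Keep the first `fraction` of the rows (assumed: truncate, do not round)."""
    return rows[: int(len(rows) * fraction)]


def toy_encode(text):
    """Stand-in tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def tokenize_batch(texts, max_len, pad_id):
    """Encode a batch, truncate to `max_len`, and pad to the longest sequence."""
    encoded = [toy_encode(text)[:max_len] for text in texts]
    longest = max(len(ids) for ids in encoded)
    return [ids + [pad_id] * (longest - len(ids)) for ids in encoded]


rows = ["hi", "hello!", "not selected", "not selected"]
subset = select_fraction(rows, 0.5)
batch = tokenize_batch(subset, max_len=4, pad_id=0)
print(batch)

# Row counts for 'train_select' and 'validate_select' of 0.1 from the config.
train_rows = int(2119719 * 0.1)
validation_rows = int(21990 * 0.1)
print(train_rows, validation_rows)
```

Only the token ids are returned, which mirrors the final step of dropping the original text column.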

See Also: [dataset.ipynb](./dataset.ipynb)

### Load Tokenized Dataset

In [6]:
import datasets
tokenized_dataset = datasets.load_from_disk(config.dataset.tokenized_dataset_path)
print(tokenized_dataset)

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 211971
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 2199
    })
})


### Build and Save Tokenized Dataset
If you have not built the tokenized dataset, you can do so now.

[tokenize_datasetdict()](../tutorial_code/datasets.py)

In [4]:
tokenized_dataset = tokenize_datasetdict(dataset, tokenizer, config)

#### Save Tokenized Dataset

In [5]:
tokenized_dataset.save_to_disk(config.dataset.tokenized_dataset_path)

Saving the dataset (0/1 shards):   0%|          | 0/211971 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2199 [00:00<?, ? examples/s]

## Create a simple causal language model
A "causal" language model is one which makes predictions about future tokens based upon past tokens. We will start with a simple model which predicts the next token, given only the immediately preceding token.

Source: [BigramLM](../tutorial_code/bigram_model.py)

### Instantiate model
This will create an instance of our model with a hidden-dimension (d_model) of 128 and a vocabulary size matching that of the tokenizer.

In [12]:
model = BigramLM(config.model.d_model, tokenizer.vocab_size)
print_model_size(model)
print(model)

Model size: 0.5M parameters
BigramLM(
  (embedding): Embedding(2000, 128)
  (output_projection): Linear(in_features=128, out_features=2000, bias=True)
)
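
The printed structure suggests an embedding lookup followed by a linear projection back to the vocabulary. The following is a minimal sketch of a model with that shape, assuming PyTorch is installed. `TinyBigramLM` and `sketch_model` are hypothetical names, and the real `BigramLM` may differ, for example in how it computes and returns the loss.

```python
import torch
from torch import nn


class TinyBigramLM(nn.Module):
    """Sketch: predict the next token from the current token alone."""

    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.output_projection = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        # (batch, seq) -> (batch, seq, d_model) -> (batch, seq, vocab_size)
        # Each position is processed independently of all other positions.
        return self.output_projection(self.embedding(input_ids))


sketch_model = TinyBigramLM(d_model=128, vocab_size=2000)
param_count = sum(p.numel() for p in sketch_model.parameters())
logits = sketch_model(torch.tensor([[2, 491, 360]]))
print(param_count, tuple(logits.shape))
```

The parameter count is 2000 × 128 for the embedding, 128 × 2000 for the projection weights, and 2000 for the bias, which is consistent with the reported size of about 0.5M parameters.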


### Enable Torch Compile [optional]

https://pytorch.org/docs/stable/torch.compiler.html

Optionally compile the model. This may not work with all versions of Python and PyTorch.
Using "torch.compile()" is especially effective at speeding up small models.

In [None]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
torch.set_float32_matmul_precision('high')
model.compile()
os.environ["TOKENIZERS_PARALLELISM"] = "true"

### Test model forward
This will tokenize our sample text and feed it through the forward method of the model to ensure that the code does not "fall over."

If the model has not been trained, the loss is expected to be around 7 - 8; lower, if it has been trained.
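
The expected range follows from the vocabulary size. As a rough sanity check, assume an untrained model spreads its probability evenly over all 2000 tokens; the cross-entropy loss is then the negative log of that uniform probability. Real initial weights are random rather than exactly uniform, so the measured value will only be in this neighborhood.

```python
import math

vocab_size = 2000  # from the tokenizer configuration

# Cross-entropy of a uniform guess: -ln(1 / vocab_size) = ln(vocab_size).
uniform_probability = 1.0 / vocab_size
expected_initial_loss = -math.log(uniform_probability)
print(f"{expected_initial_loss:.2f}")  # prints 7.60
```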

[test_model_forward()](../tutorial_code/model_utils.py)

In [14]:
test_model_forward(model, tokenizer, sample_text)

input_ids:
 tensor([[   2,  491,  360,   16,  263,  403,  450,  505,  362,  598,  263,  792,
          311,  320,  313,  763,   18,  317,  709,  308,  286, 1035,   74,  475,
         1389,   88,  270,  365,  346,  308,  791,  308,  286,  385,  291,   84,
           18,  362,  448,  270,  952,  267,  792,  311,  346,  313,  370,   16,
          354,  342,  464,  442,   91,  263, 1842,  307,  349,  313,  385,  316,
           88,   18,  203,  203,  601,  473,  270,  313,  370,  269,  331,   16,
          332,  781,   16,  339,  598,  747,  792,  311,   18, 1283,  350,  952,
          308,  346,  522,  269,  442,   91,  656,  385,  316,   88,  481,  869,
          370,  503,  269,  331,   16,  332,  836,   16,  362,   16,  369,  477,
          952,  267,  792,  311,  269, 1307,  633,  385,  316,   88,  420,  203,
          203,   56,   83,  558,   16,  368, 1659,  267,  792,  311,  269,  442,
           91,  268,  267, 1842,  307,  349,  362,  376,  385,  316,   88,   18,
          413,  

## Train Model
This is an example training-loop implementation.

Example code is based upon examples here:
https://huggingface.co/learn/nlp-course/en/chapter3/4

In [17]:
from tutorial_code.train import CausalTrainer

# This provides a place to configure the training parameters.
def do_train():
    CausalTrainer(
        model,
        tokenizer,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=tokenized_dataset['validation'],
        per_device_train_batch_size=config.train.per_device_train_batch_size,
        per_device_eval_batch_size=config.train.per_device_eval_batch_size,
        learning_rate=config.train.learning_rate,
        num_train_epochs=config.train.num_train_epochs,
        eval_steps=config.train.eval_steps,
        optimizer_factory=lambda params, lr: torch.optim.AdamW(params, lr=lr),
        lr_scheduler_factory=lambda opt, steps: transformers.get_scheduler(
            config.train.lr_scheduler_name,
            opt,
            config.train.num_warmup_steps,
            steps,
        ),
        device = config.device,
    ).train()

In [18]:
do_train()

Training for 3313 steps


  0%|          | 0/3313 [00:00<?, ?it/s]

  0%|          | 0/18 [00:00<?, ?it/s]

loss=3.7317506869633994
Global step: 1000


  0%|          | 0/18 [00:00<?, ?it/s]

loss=3.642031921280755
Global step: 2000


  0%|          | 0/18 [00:00<?, ?it/s]

loss=3.6106786727905273
Global step: 3000


  0%|          | 0/18 [00:00<?, ?it/s]

loss=3.604231927129957


### Accelerate Training Loop

This is the same code, but modified to run on multiple GPUs within a notebook using the [Accelerate](https://huggingface.co/docs/accelerate/v0.11.0/en/index) library.

Note: For small models, this may actually be slower than the basic training loop.

In [3]:
from accelerate import notebook_launcher
from tutorial_code.train import CausalAccelerateTrainer

def train_function():
    trainer = CausalAccelerateTrainer(
        model,
        tokenizer,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=tokenized_dataset['validation'],
        per_device_train_batch_size=config.train.per_device_train_batch_size,
        per_device_eval_batch_size=config.train.per_device_eval_batch_size,
        learning_rate=config.train.learning_rate,
        num_train_epochs=config.train.num_train_epochs,
        eval_steps=config.train.eval_steps,
        optimizer_factory=lambda params, lr: torch.optim.AdamW(params, lr=lr),
        lr_scheduler_factory=lambda opt, steps: transformers.get_scheduler(
            config.train.lr_scheduler_name,
            opt,
            config.train.num_warmup_steps,
            steps,
        ),
    )
    trainer.train()

def do_train():
    notebook_launcher(train_function, num_processes=torch.cuda.device_count())

In [4]:
do_train()

Launching training on 6 GPUs.
Training for 553 steps


  0%|          | 0/553 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

loss=3.7510127226511636
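
The step counts reported by the two training loops can be checked with a little arithmetic. This sketch assumes one epoch, no gradient accumulation, and that a final partial batch still counts as a step; `steps_per_epoch` is a hypothetical helper, not part of the tutorial code.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_processes=1):
    """Number of batches needed to cover the dataset once."""
    effective_batch_size = per_device_batch_size * num_processes
    return math.ceil(num_examples / effective_batch_size)


# Single device: 211971 training rows, 2199 validation rows.
single_train_steps = steps_per_epoch(211971, 64)
single_eval_steps = steps_per_epoch(2199, 128)

# Six processes, one per GPU, each with the same per-device batch size.
multi_train_steps = steps_per_epoch(211971, 64, num_processes=6)
multi_eval_steps = steps_per_epoch(2199, 128, num_processes=6)

print(single_train_steps, single_eval_steps)
print(multi_train_steps, multi_eval_steps)
```

These values are consistent with the progress bars above: 3313 training steps and 18 evaluation batches on one device, and 553 training steps and 3 evaluation batches across 6 GPUs. Fewer steps per epoch with the same learning rate means fewer optimizer updates, which is worth keeping in mind when comparing the final losses.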


### Huggingface Trainer Example
This illustrates how to use the HF trainer class, with the functionality being similar
to the above code.

Within the context of a notebook and multiple GPUs, the trainer will train the model using torch [Data Parallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). This is not ideal for performance.

The same code, launched using Accelerate in a script will perform much better.

UNTESTED

In [None]:
from accelerate import notebook_launcher
from transformers import TrainingArguments
from tutorial_code.train import hf_trainer

def do_train():
    train_causal_model(
        model,tokenizer,
        tokenized_dataset['train'],
        tokenized_dataset['validation'],
        training_args = TrainingArguments(
            per_device_train_batch_size=config.train.per_device_train_batch_size,
            per_device_eval_batch_size=config.train.per_device_eval_batch_size,
            output_dir="test_trainer",
            evaluation_strategy="steps",
            eval_steps=config.train.eval_steps,
            num_train_epochs=config.train.num_train_epochs,
    
            # If set too high, your GPU may run out of memory.
            #per_device_train_batch_size=8,
            #per_device_eval_batch_size=16,
            
            # The learning rate will need to be reduced as model size grows. If the rate is set too high, the
            # loss will become unstable, possibly increasing.
            learning_rate=config.train.learning_rate,
            # Set for better diagnostics
            #use_cpu=True,
        ),
    )

#notebook_launcher(do_train, num_processes=torch.cuda.device_count())
do_train()

## Evaluate predictions

### Predict Tokens
This will take the input text and have the model make predictions for the next token for each token in the sequence.

The color coding indicates the loss for each individual token, with darker colors being more accurate and brighter colors being less so.

If you hover over a token, you can see the top-10 predictions for the next token in the sequence.
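
The per-token loss and the top predictions can both be derived from the logits the model emits at one position. The following is a minimal sketch in plain Python, using a hypothetical helper `token_loss_and_top_k` and a made-up four-token vocabulary; `show_predictions` itself may compute these differently.

```python
import math


def token_loss_and_top_k(logits, target_id, k):
    """Return the cross-entropy loss for `target_id` and the k most likely ids."""
    # Softmax with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    loss = -math.log(probs[target_id])
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return loss, ranked[:k]


# Made-up logits for a vocabulary of four tokens; the true next token is id 0.
example_loss, example_top = token_loss_and_top_k([2.0, 0.5, 1.0, -1.0], target_id=0, k=2)
print(round(example_loss, 3), example_top)

# A uniform prediction gives a loss of ln(vocabulary size).
uniform_loss, _ = token_loss_and_top_k([0.0, 0.0, 0.0, 0.0], target_id=1, k=1)
print(round(uniform_loss, 3))
```

A low loss means the model assigned a high probability to the token that actually came next, which corresponds to the darker colors in the visualization.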

[tutorial_code.inference](../tutorial_code/inference.py)  

In [21]:
from tutorial_code.inference import show_predictions

show_predictions(model, tokenizer, device="cuda", text=[sample_text])

line: One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them b
Metric 'Causal Loss': n=154, min=6.398907661437988, max=9.475622177124023, mean=7.742974758148193, range=(0, None)






### Simple Text Gen
This is a very simple text generator implementation.

[tutorial_code.textgen](../tutorial_code/textgen.py)  

In [25]:
from tutorial_code.textgen import TextGenerator

# Test text generation.
# Don't expect too much from this model, as the only input to each prediction is the previous token.
text_gen = TextGenerator(model, tokenizer, 'cuda', do_sample=True, seed=42)
text = text_gen.prompt("One day, a little girl", max_new_tokens=50)
print(repr(text))

'<|BOS|> One day, a little girlark towards M ride rem r makes When outside willn clothesimeavy hcoam played poinend smart thoughtmaher\x1f get peopleeyock lessee strongudden fillress wavedaringside voice Rorrow roll wind grateoredMia its are N thanked'
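
The generation loop behind output like this can be sketched as repeated sampling from a next-token distribution. The example below uses a hand-written probability table in place of a trained model; `sample_tokens` and `toy_probs` are hypothetical, and `TextGenerator` may be implemented differently.

```python
import random


def sample_tokens(next_token_probs, prompt_ids, max_new_tokens, seed):
    """Extend `prompt_ids` by sampling each new token given only the last one."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        distribution = next_token_probs[token_ids[-1]]
        candidates = list(distribution)
        weights = [distribution[candidate] for candidate in candidates]
        token_ids.append(rng.choices(candidates, weights=weights, k=1)[0])
    return token_ids


# Toy bigram table: for each token id, the probabilities of the next token id.
toy_probs = {
    0: {1: 0.5, 2: 0.5},
    1: {0: 1.0},
    2: {0: 0.1, 2: 0.9},
}

first_run = sample_tokens(toy_probs, prompt_ids=[0], max_new_tokens=10, seed=42)
second_run = sample_tokens(toy_probs, prompt_ids=[0], max_new_tokens=10, seed=42)
print(first_run)
print(first_run == second_run)
```

Because each choice depends only on the previous token, the result is locally plausible at best, which matches the incoherent text produced by the bigram model above.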
