## Configuration
### Source
[aiws.config](../aiws/config.py)  
[tutorial_code.datasets](../tutorial_code/datasets.py)  
[tutorial_code.tokenizer](../tutorial_code/tokenizer.py)  
[tutorial_code.model_utils](../tutorial_code/model_utils.py)  
[tutorial_code.train](../tutorial_code/train.py)  
[tutorial_code.vanilla_transformer](../tutorial_code/vanilla_transformer.py)  
[tutorial_code.inference](../tutorial_code/inference.py)  

### See Also
[dataset.ipynb](dataset.ipynb)  
[tokenizer.ipynb](tokenizer.ipynb)  
[simple_lm.ipynb](simple_lm.ipynb)  
[train_script.py](train_script.py)  

### Config
[config.yaml](config/config.yaml)  
[paths.yaml](config/paths.yaml)  
[dataset.yaml](config/dataset.yaml)  
[tokenizer.yaml](config/tokenizer.yaml)  
[training.yaml](config/training.yaml)  
[model.yaml](config/model.yaml)  

There is also a train-script adaptation of this notebook: [train_script.py](train_script.py)  
```
accelerate launch ./train_script.py
```

In [6]:
import os
import sys; sys.path.insert(0, '..')
import time
import datetime

from pprint import pp
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import datasets
import transformers
from transformers import set_seed

from aiws.config import load_config
from aiws.dotdict import DotDict
from tutorial_code.datasets import tokenize_datasetdict
from tutorial_code.tokenizer import train_bpe_tokenizer
from tutorial_code.model_utils import print_model_size, test_model_forward
from tutorial_code.models.vanilla_transformer import VanillaTransformerConfig, VanillaTransformer

config = DotDict(load_config("my_first_transformer.yaml", search_path=["../config", "config"]))['config']
pp(config)

{'experiment_name': 'My First Transformer',
 'experiment_description': 'A basic introduction to training a transformer '
                           'from scratch.',
 'output_dir': './models/my_first_transformer/',
 'tokenizer': {'tokenizer_path': '../assets/tokenizers/tiny_stories_2k',
               'vocab_size': 2000,
               'max_sequence_length': 2048},
 'dataset': {'tokenized_dataset_path': '',
             'dataset_id': 'roneneldan/TinyStories',
             'tokenized_dataset': '../assets/datasets/tiny_stories_tokenized',
             'train_select': 0.1,
             'validate_select': 0.1},
 'model': {'args': {'vocab_size': 2000,
                    'max_sequence_length': 2048,
                    'hidden_size': 128,
                    'dim_feedforward': 512,
                    'num_attention_heads': 1,
                    'num_hidden_layers': 2}},
 'training_args': {'output_dir': './models/my_first_transformer/',
                   'overwrite_output_dir': True,
     

### Reload Module
Useful if you change a module and don't want to restart the kernel.

In [7]:
import importlib

# If the module has not been imported, we first import it.
import tutorial_code.train

# Trigger a reload of the module.
importlib.reload(tutorial_code.train)

<module 'tutorial_code.train' from '/home/dinalt/ai_assets/aiworkshop/tutorial/../tutorial_code/train.py'>
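
One caveat with reloading is worth seeing in isolation: names bound earlier with `from module import name` keep pointing at the old objects. The sketch below shows this with a throwaway module written to a temporary directory; the module name `reload_demo_module` is hypothetical and has nothing to do with `tutorial_code`.

```python
import importlib
import pathlib
import sys
import tempfile

# Skip bytecode caching so the edited source is always re-read on reload.
sys.dont_write_bytecode = True

# Write a throwaway module (hypothetical name) to a temporary directory.
tmp_dir = tempfile.mkdtemp()
module_path = pathlib.Path(tmp_dir) / "reload_demo_module.py"
module_path.write_text("def version():\n    return 'original'\n")
sys.path.insert(0, tmp_dir)

import reload_demo_module
from reload_demo_module import version as old_version

# Simulate editing the file, then reload the module in place.
module_path.write_text("def version():\n    return 'edited'\n")
importlib.reload(reload_demo_module)

print(reload_demo_module.version())  # attribute lookup sees the reloaded code
print(old_version())                 # the earlier from-import is still the old function
```

In practice this means re-running any `from ... import ...` lines after a reload, or accessing names through the module object.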

## Build Assets
If you have not built a tokenizer or tokenized the dataset, this will quickly get you up and running.  
This may take a moment or three. Be patient!

If you would like to dive deeper into the details, see:

[dataset.ipynb](./dataset.ipynb)  
[tokenizer.ipynb](./tokenizer.ipynb)  

These should also be rebuilt if you have made changes to the tokenizer or dataset parameters.

### Load Dataset

In [8]:
print("Downloading Dataset...")
dataset = datasets.load_dataset(config.dataset.dataset_id)
print(dataset)

Downloading Dataset...


Repo card metadata block was not found. Setting CardData to empty.


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})


In [13]:
print(dataset['train'][2]['text'])

One day, a little fish named Fin was swimming near the shore. He saw a big crab and wanted to be friends. "Hi, I am Fin. Do you want to play?" asked the little fish. The crab looked at Fin and said, "No, I don't want to play. I am cold and I don't feel fine."

Fin felt sad but wanted to help the crab feel better. He swam away and thought of a plan. He remembered that the sun could make things warm. So, Fin swam to the top of the water and called to the sun, "Please, sun, help my new friend feel fine and not freeze!"

The sun heard Fin's call and shone its warm light on the shore. The crab started to feel better and not so cold. He saw Fin and said, "Thank you, little fish, for making me feel fine. I don't feel like I will freeze now. Let's play together!" And so, Fin and the crab played and became good friends.


### Train Tokenizer

In [14]:
print("Training Tokenizer...")
tokenizer = train_bpe_tokenizer(config.tokenizer, dataset['train'])
print(tokenizer)
tokenizer.save_pretrained(config.tokenizer.tokenizer_path)

Training Tokenizer...



Completed training
PreTrainedTokenizerFast(name_or_path='', vocab_size=2000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|EOS|>', 'mask_token': '<|MASK|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|MASK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


('../assets/tokenizers/tiny_stories_2k/tokenizer_config.json',
 '../assets/tokenizers/tiny_stories_2k/special_tokens_map.json',
 '../assets/tokenizers/tiny_stories_2k/tokenizer.json')
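
For intuition about what a BPE tokenizer learns, here is a toy sketch of the core merge step: count adjacent symbol pairs, merge the most frequent pair, repeat. It is an illustration only and is not the implementation behind `train_bpe_tokenizer`; the corpus and function names are made up for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters with a frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("slow"): 3}
merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

A real tokenizer keeps merging until the vocabulary reaches the target size (2000 in this notebook's config).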

### Tokenize Dataset

In [3]:
print("Tokenizing Dataset...")
tokenized_dataset = tokenize_datasetdict(dataset, tokenizer, config)
print(tokenized_dataset)
tokenized_dataset.save_to_disk(config.dataset.tokenized_dataset_path)

Downloading Dataset...




Training Tokenizer...



Completed training
PreTrainedTokenizerFast(name_or_path='', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|EOS|>', 'mask_token': '<|MASK|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|MASK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
Tokenizing Dataset...


DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 423943
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 21990
    })
})



## Load Tokenizer and Dataset
If you have already built the tokenizer and dataset, you can load them from this cell.

In [2]:
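# Load the raw dataset the same way as in the "Load Dataset" cell.
dataset = datasets.load_dataset(config.dataset.dataset_id)
# The tokenizer was saved to config.tokenizer.tokenizer_path in the "Train Tokenizer" cell.
tokenizer = AutoTokenizer.from_pretrained(config.tokenizer.tokenizer_path)
tokenized_dataset = datasets.load_from_disk(config.dataset.tokenized_dataset_path)
# Index the row first so only one story is read, not the whole text column.
sample_text = dataset['train'][0]['text'][:500]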

Repo card metadata block was not found. Setting CardData to empty.


## Vanilla transformer model

[tutorial_code.vanilla_transformer](../tutorial_code/vanilla_transformer.py)  

### Instantiate model

In [3]:
# Create model configuration
# Key names follow the config printed above, where model settings live under model.args.
model_config = VanillaTransformerConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=config.model.args.hidden_size,
    dim_feedforward=config.model.args.dim_feedforward,
    max_sequence_length=tokenizer.model_max_length,
    num_attention_heads=config.model.args.num_attention_heads,
    num_hidden_layers=config.model.args.num_hidden_layers,
)

# A config can also be instantiated from a json file like this:
#model_config = AutoConfig.from_pretrained("path-to-config")

# Make model weights deterministic
set_seed(42)
# Instantiate the model
model = AutoModelForCausalLM.from_config(model_config)
print_model_size(model)
print(model)

Model size: 0.9M parameters
VanillaTransformer(
  (embedding): Embedding(2000, 128)
  (positional_encoder): PositionalEncoder()
  (layers): ModuleList(
    (0-1): 2 x TransformerLayer(
      (attention): MultiheadAttention(
        (query_linear): Linear(in_features=128, out_features=128, bias=True)
        (key_linear): Linear(in_features=128, out_features=128, bias=True)
        (value_linear): Linear(in_features=128, out_features=128, bias=True)
      )
      (feedforward): FeedforwardLayer(
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (activation): ReLU()
        (linear2): Linear(in_features=512, out_features=128, bias=True)
      )
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    )
  )
  (output_projection): Linear(in_features=128, out_features=2000, bias=True)
)
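
The reported size can be checked by hand from the layer shapes printed above. The sketch below assumes that `PositionalEncoder` holds no trainable parameters and that the output projection does not share weights with the embedding; neither is confirmed by the printout, so treat the result as a sanity check.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

vocab_size, hidden_size, dim_feedforward, num_layers = 2000, 128, 512, 2

embedding = vocab_size * hidden_size
attention = 3 * linear_params(hidden_size, hidden_size)  # query, key, value
feedforward = (linear_params(hidden_size, dim_feedforward)
               + linear_params(dim_feedforward, hidden_size))
layer_norms = 2 * (2 * hidden_size)  # scale and shift for norm1 and norm2
per_layer = attention + feedforward + layer_norms
output_projection = linear_params(hidden_size, vocab_size)

total = embedding + num_layers * per_layer + output_projection
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the total is 877,520, which rounds to the 0.9M shown by `print_model_size`. Note that the embedding and output projection together account for more than half of the parameters.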


## Train Model
This is an example training-loop implementation.

Example code is based upon examples here:
https://huggingface.co/learn/nlp-course/en/chapter3/4

In [6]:
import importlib
from transformers.utils.notebook import NotebookProgressCallback

# If the module has not been imported, we first import it.
import tutorial_code.train

# Trigger a reload of the module.
importlib.reload(tutorial_code.train)

# Import the classes after the reload so the names refer to the reloaded code.
from tutorial_code.train import Trainer, AccelTrainer, TrainingArguments

# This provides a place to configure the training parameters.
def do_train():
    start_time = time.perf_counter()
    AccelTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=tokenized_dataset['validation'],
        callbacks=[ NotebookProgressCallback() ],
        # The printed config stores the training arguments under 'training_args'.
        training_arguments=TrainingArguments(**config.training_args),
    ).train()
    elapsed = time.perf_counter() - start_time
    print(f"Elapsed {datetime.timedelta(seconds=elapsed)}")

In [None]:
# Batch size 32
do_train()

In [None]:
# Create basic accelerate configuration
import os
from accelerate.utils import write_basic_config

write_basic_config()  # Write a default accelerate config file
os._exit(0)  # Exit the kernel process so the notebook restarts with the new config

In [4]:
from accelerate import notebook_launcher, DataLoaderConfiguration
from tutorial_code.trainer_callback import ProgressCallback, InfoCallback
from tutorial_code.train import Trainer, AccelTrainer, TrainingArguments
from transformers.utils.notebook import NotebookProgressCallback

#transformers.utils.logging.set_verbosity_debug()

def accel_train_function():
    # See: https://github.com/huggingface/accelerate/blob/v0.13.2/src/accelerate/accelerator.py
    accelerator_args=dict(
        #mixed_precision='bf16',
        # DataLoaderConfiguration
        dataloader_config=DataLoaderConfiguration(
            dispatch_batches=False,
            # Note: This requires that your batch size be a multiple of the number of GPUs.
            split_batches=True
        )
    )
    
    trainer = AccelTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=tokenized_dataset['validation'],
        callbacks=[InfoCallback(), NotebookProgressCallback()],
        accelerator_args=accelerator_args,
        training_arguments=TrainingArguments(**config.training_args),
    )
    trainer.train()

    # The model is not moved back to the notebook process automatically.
    # Save the model and then reload it to move it.
    # Note: Be careful that this does not clobber a saved model.
    # If you don't want this to persist, create a temporary path to save the model to.
    trainer.accelerator.wait_for_everyone()
    if trainer.accelerator.is_main_process:
        model.save_pretrained(
            save_directory=config.model_path,
            safe_serialization=True,
        )


def do_train():
    start_time = time.perf_counter()
    notebook_launcher(accel_train_function, num_processes=torch.cuda.device_count())
    elapsed = time.perf_counter() - start_time
    print(f"Elapsed {datetime.timedelta(seconds=elapsed)}")

    # Reload the trained weights that the main worker process saved to disk.
    model = AutoModelForCausalLM.from_pretrained(
        config.model_path,
        local_files_only=True,
    )
    return model

model = do_train()
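
The comment on `split_batches=True` says the batch size must be a multiple of the number of GPUs. The sketch below illustrates why: one global batch is cut into equal shards, one per process. It is a plain-Python illustration, not accelerate's implementation, and `split_batch` is a hypothetical helper.

```python
def split_batch(batch, num_processes):
    """Split one global batch into equal per-process shards."""
    if len(batch) % num_processes != 0:
        raise ValueError(
            f"batch size {len(batch)} is not a multiple of {num_processes} processes"
        )
    shard_size = len(batch) // num_processes
    return [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_processes)]

global_batch = list(range(32))  # stand-in for a batch of 32 examples
shards = split_batch(global_batch, num_processes=4)
print([len(shard) for shard in shards])
```

With a batch size of 32 and 4 processes, each process sees 8 examples per step, so the effective batch size stays at 32 regardless of the GPU count.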

## Evaluate predictions

### Predict Tokens
This feeds the input text to the model and, at each position in the sequence, has it predict the next token.

The color coding indicates the loss for each individual token: darker colors mark lower loss (more accurate predictions) and brighter colors mark higher loss.

If you hover over a token, you can see the top-10 predictions for the next token in the sequence.

[tutorial_code.inference](../tutorial_code/inference.py)  

In [6]:
from tutorial_code.inference import show_predictions

show_predictions(model, tokenizer, device="cuda", text=[sample_text])

line: One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them b
Metric 'Causal Loss': n=154, min=0.016235284507274628, max=9.593510627746582, mean=2.398885488510132, range=(0, None)
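
The per-token loss and the top-k predictions described above can be sketched in plain Python. This is an illustration with made-up logits over a four-token vocabulary, not the code inside `show_predictions`; all function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def per_token_loss(logits_per_position, token_ids):
    """Loss at position t is -log p(token[t + 1] | tokens up to t)."""
    losses = []
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[t])
        losses.append(-log_probs[token_ids[t + 1]])
    return losses

def top_k(logits, k):
    """Indices of the k highest-scoring vocabulary entries."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

# Toy example: vocabulary of 4 tokens, sequence of 3 tokens (made-up logits).
token_ids = [0, 2, 1]
logits = [
    [0.1, 0.2, 3.0, 0.1],  # after token 0: confident in token 2, the actual next token
    [2.0, 0.5, 0.1, 0.3],  # after token 2: prefers token 0, but the next token is 1
    [0.0, 0.0, 0.0, 0.0],  # last position has no target
]
losses = per_token_loss(logits, token_ids)
print([round(x, 3) for x in losses])
print(top_k(logits[0], k=2))
```

The first position gets a low loss because the model's favorite matches the actual next token; the second gets a higher loss because it does not. The mean of these values corresponds to the kind of summary reported as 'Causal Loss'.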






### Simple Text Gen
This is a very simple text generator implementation.

[tutorial_code.textgen](../tutorial_code/textgen.py)  

In [7]:
from tutorial_code.textgen import TextGenerator

# Test text generation.
# Don't expect too much from this model: it is tiny (about 0.9M parameters).
text_gen = TextGenerator(model, tokenizer, config.device, do_sample=True, seed=42)
text = text_gen.prompt("One day, a little girl", max_new_tokens=50)
print(repr(text))

'<|BOS|> One day, a little girl named Sue was feeling sad. She Yesternessi, Sue! She played together in the sunglass She learned to take her friends. She always remembered how she got to go home and play together.\n\nOne day,'
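
The loop behind a simple text generator can be sketched without a real model. Below, `toy_next_token_probs` is a hypothetical stand-in that only looks at the last token (a transformer would attend to the whole sequence); the point is the loop itself, which appends one token at a time and feeds the growing sequence back in. This is not the code in `tutorial_code.textgen`.

```python
import random

def toy_next_token_probs(tokens, vocab_size=5):
    """Hypothetical stand-in for a model: favors (last token + 1) mod vocab_size."""
    favored = (tokens[-1] + 1) % vocab_size
    return [0.6 if i == favored else 0.1 for i in range(vocab_size)]

def generate(prompt_tokens, max_new_tokens, do_sample=True, seed=42):
    """Append one token at a time, feeding the growing sequence back in."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)
        if do_sample:
            # Sample from the distribution; a fixed seed makes this repeatable.
            next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        else:
            # Greedy decoding: always take the most likely token.
            next_token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_token)
    return tokens

print(generate([0], max_new_tokens=6, do_sample=False))
print(generate([0], max_new_tokens=6, do_sample=True))
```

Greedy decoding always follows the favored token, while sampling occasionally picks a less likely one, which is where both the variety and the oddities in generated stories come from.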


## Huggingface Text Gen
This is a simple wrapper around the Hugging Face model's `generate()` method.

In [5]:
# https://huggingface.co/docs/transformers/v4.34.1/en/generation_strategies

class TextGen():
    def __init__(self, model, tokenizer, device):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device

    def generate(self, prompt, do_sample=True, top_k=50, top_p=0.9, max_new_tokens=500):
        self.model.to(self.device)
        input_ids = self.tokenizer(prompt, return_tensors='pt')['input_ids'].to(self.device)
        # Use the instance's own model and tokenizer rather than the notebook globals.
        outputs = self.model.generate(input_ids, do_sample=do_sample, top_k=top_k, top_p=top_p, max_new_tokens=max_new_tokens)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

In [6]:
gen = TextGen(model, tokenizer, device=config.device)
print(gen.generate(sample_text))

 One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them burn. Lily was so proud of Lily and thanked her for their help. From that day on, she was careful not to try new things she had to play together. Lily learned ach and that it's important to share and that they needed to make it better."

From that day on, Lily played with the next day. She loved to play with the on her friends on her bed. It was very pretty and pretty. And always made sure to share her new shirt, she loved to enjoy. The end. Lily's face felt warm and happy to have fun in the wor
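
The `top_k` and `top_p` arguments passed to `generate()` restrict which tokens can be sampled. The sketch below is a simplified plain-Python version of that idea, assuming a list of probabilities as input; it is not the Hugging Face implementation, and `filter_top_k_top_p` is a hypothetical name.

```python
def filter_top_k_top_p(probs, top_k=50, top_p=0.9):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalize what is left."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.25, 0.15, 0.07, 0.03]
filtered = filter_top_k_top_p(probs, top_k=4, top_p=0.8)
print(filtered)
```

Here `top_k=4` drops the least likely token, and `top_p=0.8` then keeps only tokens 0, 1 and 2, since their combined probability is the first to reach 0.8. Sampling from the renormalized result avoids very unlikely tokens while still allowing some variety.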

## Save and Load model
Use these cells to save the model to disk and restore it later.

### Save

In [10]:
model.save_pretrained(
    save_directory=config.model_path,
    safe_serialization=True,
)

### Load

In [5]:
model, load_info = AutoModelForCausalLM.from_pretrained(
    config.model_path,
    output_loading_info=True,
    local_files_only=True,
)
print(load_info)
print(model)

{'missing_keys': [], 'unexpected_keys': [], 'mismatched_keys': [], 'error_msgs': []}
VanillaTransformer(
  (embedding): Embedding(2000, 128)
  (positional_encoder): PositionalEncoder()
  (layers): ModuleList(
    (0-1): 2 x TransformerLayer(
      (attention): MultiheadAttention(
        (query_linear): Linear(in_features=128, out_features=128, bias=True)
        (key_linear): Linear(in_features=128, out_features=128, bias=True)
        (value_linear): Linear(in_features=128, out_features=128, bias=True)
      )
      (feedforward): FeedforwardLayer(
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (activation): ReLU()
        (linear2): Linear(in_features=512, out_features=128, bias=True)
      )
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    )
  )
  (output_projection): Linear(in_features=128, out_features=2000, bias=True)
)
