## Configuration
### Source
[aiws.yamldict](../aiws/yamldict.py)  
[tutorial_code.datasets](../tutorial_code/datasets.py)  
[tutorial_code.tokenizer](../tutorial_code/tokenizer.py)

### See Also
[tokenizer.ipynb](tokenizer.ipynb)

### Config
[config.yaml](config/config.yaml)  
[paths.yaml](config/paths.yaml)  
[tokenizer.yaml](config/tokenizer.yaml)  
[dataset.yaml](config/dataset.yaml)  

In [12]:
import sys
if '..' not in sys.path: sys.path.insert(0, '..')
from aiws.dotdict import DotDict
from forgather.config import load_config
from pprint import pp, pformat

# Load meta-configuration
dirs = DotDict(load_config('forgather_config.yaml').config)
print(pformat(dirs))

{'assets_dir': '..',
 'dataset_id': 'roneneldan/TinyStories',
 'datasets_dir': '../datasets',
 'model_src_dir': '../model_zoo',
 'models_dir': 'forgather_demo/output_models',
 'script_dir': '../scripts',
 'search_paths': ['forgather_demo', '../templates', '../model_zoo'],
 'templates_dir': 'forgather_demo',
 'tokenizer_dir': '../tokenizers',
 'train_script_path': '../scripts/train_script.py',
 'whitelist_path': 'forgather_demo/whitelist.yaml'}


## Dataset
We need some data to train our model on. For this tutorial, we will use a dataset named "TinyStories," a synthetic dataset generated by GPT-3.5 and GPT-4 and designed for training very small language models to produce coherent output. This is made possible by limiting the stories to things a 3 to 4-year-old child would be able to understand, with a total vocabulary of about 1500 basic words.

Huggingface dataset link:  
https://huggingface.co/datasets/roneneldan/TinyStories  

The paper describing the dataset:  
https://arxiv.org/abs/2305.07759

The first time this is run, it will download the dataset to your cache, which may take a few minutes. After that, the dataset will be loaded from your cache.

source: [tutorial_code.datasets.load_dataset_from_config()](../tutorial_code/datasets.py)

The dataset is split into two sections, "train" and "validation." The examples in the validation split do not appear in the training split, which allows one to test the model on data it has never seen and so confirm that the model is learning to generalize and not just memorizing the training data. As such, the model should never be trained on the validation split.
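
To make the idea of held-out data concrete, here is a minimal sketch that checks whether any validation example also appears verbatim in the training split. The lists and the `find_leaked_examples` helper are made up for illustration and are not part of the tutorial code; the check only catches exact duplicates, not near-duplicates.

```python
# Toy stand-ins for the two splits (hypothetical data, not from TinyStories).
train_texts = [
    "Once upon a time, there was a little bird.",
    "One day, a boy named Tom found a map.",
    "The cat sat on the mat.",
]
validation_texts = [
    "Lily liked to play in the garden.",
    "The dog ran to the park.",
]


def find_leaked_examples(train, validation):
    """Return the validation examples that also appear verbatim in train."""
    train_set = set(train)  # set membership keeps the check fast on large splits
    return [text for text in validation if text in train_set]


leaked = find_leaked_examples(train_texts, validation_texts)
print(f"{len(leaked)} of {len(validation_texts)} validation examples also appear in train")
```

If a check like this reported any leaked examples, a validation score would overstate how well the model generalizes.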

In [14]:
from datasets import load_dataset
dataset_dict = load_dataset("roneneldan/TinyStories")
print(dataset_dict)
train_dataset = dataset_dict['train']
print('*' * 40)
pp(train_dataset.info)
print('*' * 40)
pp(train_dataset.features)

Repo card metadata block was not found. Setting CardData to empty.


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})
****************************************
DatasetInfo(description='',
            citation='',
            homepage='',
            license='',
            features={'text': Value(dtype='string', id=None)},
            post_processed=None,
            supervised_keys=None,
            task_templates=None,
            builder_name='parquet',
            dataset_name='tiny_stories',
            config_name='default',
            version=0.0.0,
            splits={'train': SplitInfo(name='train',
                                       num_bytes=1911420483,
                                       num_examples=2119719,
                                       shard_lengths=[559930,
                                                      559930,
                                                      559930,
          

We can take a look at a random sampling of examples from the training dataset like this:

In [15]:
def print_sample_records(dataset, section="text", n_records=3, max_length=500):
    """Print the first `max_length` characters of `n_records` random records."""
    print(f"Showing {n_records} random records from dataset...")
    # shuffle() reorders the dataset; slicing returns a dict of column lists.
    for record in dataset.shuffle()[:n_records][section]:
        print("============================================================================================\n")
        print(record[:max_length])

print_sample_records(train_dataset)

Showing 3 random records from dataset...

Once upon a time, there was a little bird named Tweety. Tweety loved to fly around and explore nature. One day, Tweety saw a big tree with lots of colorful leaves. Tweety wanted to see the leaves up close, so he flew down to the ground.

As Tweety landed on the ground, he saw a broken branch. The branch was very sad because it couldn't reach the sky anymore. Tweety felt bad for the broken branch, so he decided to cover it with some leaves.

Tweety picked up some leaves and carefully placed them 

One day, a boy named Tom found a map. The map showed a big park. Tom wanted to go to the park. He asked his friend, Sam, to go with him. Sam was very happy to go.

At the park, Tom and Sam played with a ball. They had a lot of fun. But then, the ball went into a tree. Tom and Sam were sad. They tried to get the ball, but it remained in the tree.

Tom said, "I am sorry, Sam. I should not have kicked the ball so hard." Sam said, "It is okay, Tom. We can 

## Tokenize dataset
Before training the model, we need to convert the text in the dataset to the token-ids used by the model.
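
As a rough illustration of what "converting text to token-ids" means, the sketch below maps whole words to integer ids with a tiny hand-written vocabulary. Everything here is an assumption made for illustration: the word list, the `encode` and `decode` helpers, and the whitespace splitting are hypothetical, and the special-token ids simply mirror the ones printed by the tokenizer in this notebook. The tutorial's real tokenizer is a trained BPE tokenizer that works on sub-word units, so its ids will differ.

```python
# Special-token ids mirroring the tokenizer printout; the words are made up.
SPECIAL_TOKENS = {"<|PAD|>": 0, "<|MASK|>": 1, "<|BOS|>": 2, "<|EOS|>": 3, "<|UNK|>": 4}
WORDS = ["once", "upon", "a", "time", "there", "was", "little", "bird"]

# Word ids start right after the special tokens.
VOCAB = {**SPECIAL_TOKENS, **{w: i + len(SPECIAL_TOKENS) for i, w in enumerate(WORDS)}}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text):
    """Map lower-cased, whitespace-separated words to ids; unknown words become <|UNK|>."""
    return [VOCAB.get(word, VOCAB["<|UNK|>"]) for word in text.lower().split()]


def decode(ids):
    """Map ids back to their token strings, joined with spaces."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode("Once upon a time there was a little dragon")
print(ids)
print(decode(ids))
```

The word "dragon" is not in the toy vocabulary, so it is encoded as the `<|UNK|>` id and cannot be recovered when decoding.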

### Load tokenizer
We will need a tokenizer to tokenize the dataset.
We can load our saved tokenizer -- or the tokenizer from any Hugging Face model -- with `AutoTokenizer.from_pretrained()`.

In [5]:
from transformers import AutoTokenizer

# Load a tokenizer from a local path -- or from a Huggingface model name.
# Rather than starting from scratch, you could replace 'model_path' with the path of an existing model and use its tokenizer.
tokenizer = AutoTokenizer.from_pretrained(config.model_path)
print(tokenizer)

PreTrainedTokenizerFast(name_or_path='/home/dinalt/ai_assets/models/tiny', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|EOS|>', 'mask_token': '<|MASK|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|MASK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


### Build Tokenizer
If you have not built the tokenizer yet, follow the linked tutorial...

[Tokenizer Notebook](tokenizer.ipynb)

...or just run this cell to build and save it.  
Building it can take a moment or three. Be patient!

In [3]:
tokenizer = train_bpe_tokenizer(config, dataset['train'])
print(tokenizer)
tokenizer.save_pretrained(config.model_path)




Completed training
PreTrainedTokenizerFast(name_or_path='', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|EOS|>', 'mask_token': '<|MASK|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|MASK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


('/home/dinalt/ai_assets/models/tiny/tokenizer_config.json',
 '/home/dinalt/ai_assets/models/tiny/special_tokens_map.json',
 '/home/dinalt/ai_assets/models/tiny/tokenizer.json')

### Tokenize dataset
Before training the model, we need to convert the text in the dataset to the token-ids used by the model.

The function linked below is a fairly simple implementation of this step. It will:
- Select a subset of the dataset, if 'select' is less than 1.0.
- Take each example from the dataset, in batches, and convert the text to the corresponding tokens.
- Truncate sequences longer than the model can process.
- Add padding tokens where the sequences in a batch differ in length.
- Remove unused columns from the data.

[tokenize_datasetdict()](../tutorial_code/tokenizer.py)
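
The steps above can be sketched in plain Python. This is not the code of `tokenize_datasetdict()`; it is a simplified stand-in under stated assumptions: a toy whitespace tokenizer, a padding id of 0, an unknown-word id of 1, and a subset taken from the front of the data. All function and parameter names below are hypothetical.

```python
PAD_ID = 0  # assumed padding id for this sketch
UNK_ID = 1  # assumed id for words missing from the toy vocabulary


def toy_tokenize(text, vocab):
    """Stand-in tokenizer: one id per lower-cased, whitespace-separated word."""
    return [vocab.get(word, UNK_ID) for word in text.lower().split()]


def tokenize_batch(texts, vocab, max_length):
    """Tokenize, truncate to max_length, then right-pad to the longest sequence."""
    sequences = [toy_tokenize(text, vocab)[:max_length] for text in texts]
    batch_length = max(len(seq) for seq in sequences)
    return [seq + [PAD_ID] * (batch_length - len(seq)) for seq in sequences]


def tokenize_records(records, vocab, select=1.0, max_length=8, batch_size=2):
    """Apply the listed steps to a list of dicts that have a 'text' column."""
    if select < 1.0:
        # Keep only a fraction of the records (taken from the front here).
        records = records[: max(1, int(len(records) * select))]
    output = []
    for start in range(0, len(records), batch_size):
        batch = records[start : start + batch_size]
        batch_ids = tokenize_batch([r["text"] for r in batch], vocab, max_length)
        # Keep only the token ids; every other column is dropped.
        output.extend({"input_ids": seq} for seq in batch_ids)
    return output


vocab = {"the": 2, "cat": 3, "sat": 4, "on": 5, "mat": 6}
records = [
    {"text": "The cat sat", "source": "toy"},
    {"text": "The cat sat on the mat on the mat", "source": "toy"},
    {"text": "The dog", "source": "toy"},
    {"text": "Cat", "source": "toy"},
]
tokenized = tokenize_records(records, vocab, select=0.5, max_length=5)
print(tokenized)
```

With `select=0.5`, only the first two records are kept. The longer one is truncated to five ids, and the shorter one is padded with `PAD_ID` to match it. Padding per batch rather than to a global maximum keeps short batches small.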

In [4]:
tokenized_dataset = tokenize_datasetdict(dataset, tokenizer, config)

Map:   0%|          | 0/2199 [00:00<?, ? examples/s]

Map:   0%|          | 0/211971 [00:00<?, ? examples/s]

#### Save Tokenized Dataset
Optional: You can save the datasets in pre-tokenized form.
Note: The Datasets library is fairly good about caching, so this may be redundant.

In [5]:
tokenized_dataset.save_to_disk(config.dataset.tokenized_dataset_path)

Saving the dataset (0/1 shards):   0%|          | 0/2199 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/211971 [00:00<?, ? examples/s]

#### Load Tokenized Dataset

In [6]:
import datasets

# Reload the pre-tokenized DatasetDict saved in the previous step.
tokenized_dataset = datasets.load_from_disk(config.dataset.tokenized_dataset_path)
print(tokenized_dataset)

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 2199
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 211971
    })
})
