## Configuration

In [1]:
import os

# This is where we will save and load our model and tokenizer.
model_path = "/home/dinalt/ai_assets/models/causallm"

# The Huggingface dataset-id to use
dataset_id = "roneneldan/TinyStories"

# Alternative to above: A path to an on-disk dataset to load.
dataset_path = "/home/dinalt/ai_assets/datasets/roneneldan-TinyStories"

# Where to save/load the pretokenized dataset
tokenized_dataset_path = "/home/dinalt/ai_assets/datasets/causallm_tinystories_tokenized"

## Training parameters

# 'cuda', 'cpu', 'cuda:1', etc.
device = 'cuda'
per_device_train_batch_size = 64
per_device_eval_batch_size = 128
learning_rate = 1e-3
num_train_epochs = 1.0
eval_steps = 1000
num_warmup_steps = 0

# See: https://huggingface.co/docs/transformers/en/main_classes/optimizer_schedules#schedules
# linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, inverse_sqrt, reduce_lr_on_plateau
lr_scheduler_name = "constant"

# If multiple GPUs, this can be used to select a sub-set of GPUs to use.
# Using fewer GPUs can actually be faster for small models.
#os.environ["CUDA_VISIBLE_DEVICES"] = "0"

## Quick Load
If you have already created and saved the tokenizer and tokenized datasets, executing this code
provides a shortcut to restoring these assets. You can then skip to model creation and training.

Otherwise, skip to the next section.

In [2]:
import datasets
from transformers import AutoTokenizer

if dataset_path is not None:
    dataset = datasets.load_from_disk(dataset_path)
else:
    dataset = datasets.load_dataset(dataset_id)
train_dataset = dataset['train']
sample_text = train_dataset['text'][0][:500]

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenized_dataset = datasets.load_from_disk(tokenized_dataset_path)
tok_train_dataset = tokenized_dataset["train"]
tok_val_dataset = tokenized_dataset["validation"]

## Dataset
We will need some data to train our model on. For this tutorial, we will use a dataset named "TinyStories," a synthetic dataset generated by GPT-3.5 and GPT-4 and designed for training very small language models to produce coherent output. This is made possible by limiting the examples to words that a typical 3 to 4-year-old child would understand, with a total vocabulary of about 1500 words.

Huggingface dataset link:  
https://huggingface.co/datasets/roneneldan/TinyStories  

The paper describing the dataset:  
https://arxiv.org/abs/2305.07759

The first time this is run, it will download the dataset to your cache, which may take a few minutes. After that, the dataset will be loaded from your cache.

In [2]:
import datasets

# Load dataset from either HF Hub or from disk.
# Unfortunately, there is not a unified API which works for both.
if dataset_path is not None:
    dataset = datasets.load_from_disk(dataset_path)
else:
    dataset = datasets.load_dataset(dataset_id)

print(dataset)
train_dataset = dataset['train']

# For experimentation, we will want a bit of sample text to work with. 
# This will grab the first 500 characters from the first record of the training dataset.
sample_text = train_dataset['text'][0][:500]
print(sample_text)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})
One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them b


The dataset is split into two sections, "train" and "validation." The examples in the validation set do not appear in the training set, which allows one to test the model on data it has never seen and so confirm that the model is learning to generalize rather than just memorizing the data. As such, the model should never be trained on the validation dataset.

In [4]:
def print_dataset_info(dataset):
    print("dataset:\n\n", dataset)
    print("dataset.info:\n\n", dataset.info)
    print("dataset.features:\n\n", dataset.features)

print_dataset_info(train_dataset)

dataset:

 Dataset({
    features: ['text'],
    num_rows: 2119719
})
dataset.info:

 DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='tiny_stories', config_name='default', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=1911420483, num_examples=2119719, shard_lengths=[559930, 559930, 559930, 439929], dataset_name='tiny_stories'), 'validation': SplitInfo(name='validation', num_bytes=19306310, num_examples=21990, shard_lengths=None, dataset_name='tiny_stories')}, download_checksums={'hf://datasets/roneneldan/TinyStories@691b0d9bd48ade766778c940011ca1c549f6359b/data/train-00000-of-00004-2d5a1467fff1081b.parquet': {'num_bytes': 248731111, 'checksum': None}, 'hf://datasets/roneneldan/TinyStories@691b0d9bd48ade766778c940011ca1c549f6359b/data/train-00001-of-00004-5852b56a2bd28fd9.parquet': {'num_bytes': 248

We can take a look at a random sampling of examples from the training dataset like this:

In [5]:
def print_sample_records(dataset, section="text", n_records=3, max_length=500):
    print(f"Showing {n_records} random records from dataset...")
    for record in dataset.shuffle()[:n_records][section]:
        print("============================================================================================\n")
        print(record[:max_length])

print_sample_records(train_dataset)

Showing 3 random records from dataset...

Once upon a time, there was a green tiger. He was walking in the jungle, feeling excited. He said to himself, "I want to find an adventure!"

Suddenly he heard some noise. He looked around and saw a little boy. The boy was crying.

The tiger asked, "Why are you crying?"

The little boy replied, "I'm lost. I can't find my way home."

The tiger said, "Don't worry. I will help you find your way home. Come, follow me."

The tiger held the little boy's hand and walked through the jungle. After a whil

Once upon a time, there was a very long hoop lying in the grass. One day a little boy named Jack saw it and decided to pick it up.

Jack called to his dad: "Look Dad! I found a hoop!" His dad smiled when he saw the hoop and said: "That is great! Let's see what we can do with it."

Jack was very excited and asked: "What should we do?" His dad replied: "Let's try and set the hoop up."

They both started to set the hoop up by placing it on the ground and 

## Tokenizer
Rather than working with the raw ASCII/Unicode from the dataset, we will be "tokenizing" the data. A tokenizer is a statistical model which aggregates individual characters into sub-word units, where the most frequent strings of characters are replaced by unique symbols.

https://en.wikipedia.org/wiki/Large_language_model#Probabilistic_tokenization

For this tutorial, we will create a Byte Pair Encoding (BPE) tokenizer, which starts with all of the symbols from the ASCII character set, then creates tokens for the most common pairs of ASCII characters. These pairs are further aggregated into larger symbols and the process repeats until a set of symbols matching the target vocabulary size has been created.

By starting with the ASCII character set, it is possible to represent any combination of letters, including those which were not observed when the tokenizer was created.
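
To make the merge procedure concrete, here is a minimal pure-Python sketch of BPE training on a handful of words. It is only an illustration of the idea: the function name `train_toy_bpe` and the toy word list are made up for this example, and the real `tokenizers` implementation differs in many details (byte-level alphabet, pre-tokenization, tie-breaking, and speed).

```python
from collections import Counter

def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace each occurrence of the pair with a single merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

toy_words = ["low", "low", "lower", "lowest", "new", "newer"]
toy_merges, toy_corpus = train_toy_bpe(toy_words, num_merges=3)
print(toy_merges)
print(list(toy_corpus))
```

Each merge adds one new symbol to the vocabulary, so the number of merges controls the final vocabulary size. Words that were never seen during training can still be encoded, because they fall back to the single-character symbols.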



### Create a pre-tokenizer

The first step will be to specify a "[pre-tokenizer](https://huggingface.co/docs/tokenizers/api/pre-tokenizers)," which breaks the input text into sub-strings via a regular expression. For example, a simple pre-tokenizer could split the input on spaces and punctuation.

We will be using the "ByteLevel" pre-tokenizer, which uses a GPT-2 specific regex for splitting the words and replaces spaces with the 'Ġ' character.
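
As a rough illustration of what a regex-based pre-tokenizer does, the sketch below splits text into words and punctuation with Python's `re` module and marks folded-in spaces with 'Ġ'. The function `simple_pre_tokenize` and its pattern are assumptions made for this example; this is not the actual GPT-2 regex, and unlike the ByteLevel pre-tokenizer it neither adds a leading space nor remaps bytes.

```python
import re

def simple_pre_tokenize(text):
    """Split text into words and punctuation, folding a leading space into each piece."""
    # ' ?\w+'      : a word, optionally preceded by one space
    # ' ?[^\w\s]+' : a run of punctuation, optionally preceded by one space
    # '\s'         : any remaining whitespace character (e.g. a newline)
    pieces = re.findall(r" ?\w+| ?[^\w\s]+|\s", text)
    # Make the folded spaces visible, in the spirit of the ByteLevel pre-tokenizer.
    return [piece.replace(" ", "Ġ") for piece in pieces]

toy_pieces = simple_pre_tokenize('Lily said, "Mom, I found this needle."')
print(toy_pieces)
```

Because the space is attached to the word that follows it, the original text can be recovered by concatenating the pieces and mapping 'Ġ' back to a space.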

In [3]:
import tokenizers

# A pre-tokenizer is responsible for splitting the input text into words.
# The ByteLevel pretokenizer uses a regular expression for splitting on word boundaries and
# substitutes the space character with 'Ġ'
pre_tokenizer = tokenizers.pre_tokenizers.ByteLevel()

def test_pretokenizer(pre_tokenizer, sample_text):
    tokens = pre_tokenizer.pre_tokenize_str(sample_text)
    for token in tokens:
        print(f"'{token[0]}'", end=" ")
    print("\n")

test_pretokenizer(pre_tokenizer, sample_text)

'ĠOne' 'Ġday' ',' 'Ġa' 'Ġlittle' 'Ġgirl' 'Ġnamed' 'ĠLily' 'Ġfound' 'Ġa' 'Ġneedle' 'Ġin' 'Ġher' 'Ġroom' '.' 'ĠShe' 'Ġknew' 'Ġit' 'Ġwas' 'Ġdifficult' 'Ġto' 'Ġplay' 'Ġwith' 'Ġit' 'Ġbecause' 'Ġit' 'Ġwas' 'Ġsharp' '.' 'ĠLily' 'Ġwanted' 'Ġto' 'Ġshare' 'Ġthe' 'Ġneedle' 'Ġwith' 'Ġher' 'Ġmom' ',' 'Ġso' 'Ġshe' 'Ġcould' 'Ġsew' 'Ġa' 'Ġbutton' 'Ġon' 'Ġher' 'Ġshirt' '.' 'Ċ' 'Ċ' 'Lily' 'Ġwent' 'Ġto' 'Ġher' 'Ġmom' 'Ġand' 'Ġsaid' ',' 'Ġ"' 'Mom' ',' 'ĠI' 'Ġfound' 'Ġthis' 'Ġneedle' '.' 'ĠCan' 'Ġyou' 'Ġshare' 'Ġit' 'Ġwith' 'Ġme' 'Ġand' 'Ġsew' 'Ġmy' 'Ġshirt' '?"' 'ĠHer' 'Ġmom' 'Ġsmiled' 'Ġand' 'Ġsaid' ',' 'Ġ"' 'Yes' ',' 'ĠLily' ',' 'Ġwe' 'Ġcan' 'Ġshare' 'Ġthe' 'Ġneedle' 'Ġand' 'Ġfix' 'Ġyour' 'Ġshirt' '."' 'Ċ' 'Ċ' 'Together' ',' 'Ġthey' 'Ġshared' 'Ġthe' 'Ġneedle' 'Ġand' 'Ġsewed' 'Ġthe' 'Ġbutton' 'Ġon' 'ĠLily' ''s' 'Ġshirt' '.' 'ĠIt' 'Ġwas' 'Ġnot' 'Ġdifficult' 'Ġfor' 'Ġthem' 'Ġb' 



### Create a new BPE tokenizer

[Tokenizer Models](https://huggingface.co/docs/tokenizers/en/api/models)

For extra credit, try the other models at the link above.

In [4]:
from tokenizers.processors import TemplateProcessing
# We can add tokens with special meanings to the tokenizer
special_tokens={
    "pad": "<|PAD|>",   # Used to pad unused positions in a sequence.
    "mask": "<|MASK|>", # Used with masked-language-modeling to mark a position as having been masked.
    "bos": "<|BOS|>",   # Beginning of Sequence
    "eos": "<|EOS|>",   # End of Sequence
    "unk": "<|UNK|>",   # Unknown
}

# Define the size of the vocabulary.
# A small vocabulary will result in many fragmented words, but is useful for
# minimizing training time.
#
# Consider using values in the range of 8K - 50K when not
# working with the smallest of toy models.
vocab_size = 2000

# Create a new BPE tokenizer.
pretrained_tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE(
    cache_capacity=16,
    unk_token=special_tokens['unk'],
    byte_fallback=True,
))

# Attach the pre-tokenizer, created above
pretrained_tokenizer.pre_tokenizer = pre_tokenizer

# The 'normalizer' can be used to transform the characters. For example, they can convert everything to
# lowercase, remove accent marks, and translate Unicode. We will use the NFC Unicode normalizer, with the 
# details explained here: https://unicode.org/reports/tr15/
pretrained_tokenizer.normalizer = tokenizers.normalizers.NFC()

# The decoder is applied when converting tokens back into text and the ByteLevel decoder
# is responsible for replacing the 'Ġ' character with spaces.
pretrained_tokenizer.decoder = tokenizers.decoders.ByteLevel()

# Automatically add Begin Of Sequence (BOS) token to output when 'add_special_tokens' is True
# This has relevance to causal models, which predict the next token in a sequence. As the first real token lacks
# a preceding token, this allows the model to identify where the sequence actually begins.
#
# Note: A causal model can still function without a BOS token and the need to include it is debatable.
pretrained_tokenizer.post_processor = TemplateProcessing(
    single="<BOS> $A",
    special_tokens=[
        ("<BOS>", 2),
    ],
)
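
The effect of the NFC normalizer configured above can be seen with Python's standard `unicodedata` module. This is a small stand-alone sketch, not part of the tokenizer pipeline; the example string is an assumption chosen for illustration. NFC composes a base character and a following combining mark into a single code point where one exists, so visually identical strings end up with identical code points before tokenization.

```python
import unicodedata

# 'e' followed by a combining acute accent: renders like 'é' but is two code points.
decomposed = "cafe\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(composed == "caf\u00e9")
```

Without normalization, the two spellings would be tokenized differently even though they look the same on screen.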

### Train the tokenizer

In [5]:
# Create a BPE trainer, which is used to build an optimal set of tokens from
# a given dataset.
tok_trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=vocab_size,
    initial_alphabet=tokenizers.pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=list(special_tokens.values()),
    show_progress=True,
)

# This abstraction is needed for the trainer to iterate over our dataset
def batch_iterator(dataset, batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]['text']

# Train the tokenizer on the dataset
# Be patient! This will take a bit of time to complete...
pretrained_tokenizer.train_from_iterator(batch_iterator(train_dataset), trainer=tok_trainer, length=len(train_dataset))






### Wrap the tokenizer

The BPE tokenizer class can be wrapped in a Huggingface [PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast) class, which makes working with the tokenizer easier.

In [6]:
from transformers import PreTrainedTokenizerFast

# This wraps the tokenizer in a Huggingface transformer tokenizer, which
# is a higher level abstraction
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=pretrained_tokenizer,
    # This should match the model's input length limit, which depends upon the architecture.
    # If no limit is specified, the default will be a VERY LARGE value.
    model_max_length=2048,
    pad_token=special_tokens['eos'],
    mask_token=special_tokens['mask'],
    bos_token=special_tokens['bos'],
    eos_token=special_tokens['eos'],
    unk_token=special_tokens['unk'],
    return_special_tokens_mask=False,
)

print(tokenizer)

PreTrainedTokenizerFast(name_or_path='', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|EOS|>', 'mask_token': '<|MASK|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|MASK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


### Test the tokenizer

In [7]:
# We can use the new tokenizer to tokenize text via the object's __call__ method, like this:

input_ids = tokenizer(sample_text)['input_ids']
print(input_ids)

# We can convert these to their symbolic representations like this.
# Note the 'Ġ' symbols. The tokenizer has folded spaces into the tokens, where this symbol represents the space.
# A consequence of this encoding is that tokens may exist for the same word, both with and without a space.
# For example, "she" and " she" would be represented as separate tokens.
for ids in [input_ids]:
    print(tokenizer.convert_ids_to_tokens(ids))

[2, 491, 360, 16, 263, 403, 450, 505, 362, 598, 263, 792, 311, 320, 313, 763, 18, 317, 709, 308, 286, 1035, 74, 475, 1389, 88, 270, 365, 346, 308, 791, 308, 286, 385, 291, 84, 18, 362, 448, 270, 952, 267, 792, 311, 346, 313, 370, 16, 354, 342, 464, 442, 91, 263, 1842, 307, 349, 313, 385, 316, 88, 18, 203, 203, 601, 473, 270, 313, 370, 269, 331, 16, 332, 781, 16, 339, 598, 747, 792, 311, 18, 1283, 350, 952, 308, 346, 522, 269, 442, 91, 656, 385, 316, 88, 481, 869, 370, 503, 269, 331, 16, 332, 836, 16, 362, 16, 369, 477, 952, 267, 792, 311, 269, 1307, 633, 385, 316, 88, 420, 203, 203, 56, 83, 558, 16, 368, 1659, 267, 792, 311, 269, 442, 91, 268, 267, 1842, 307, 349, 362, 376, 385, 316, 88, 18, 413, 286, 390, 1035, 74, 475, 1389, 88, 372, 452, 271]
['<|BOS|>', 'ĠOne', 'Ġday', ',', 'Ġa', 'Ġlittle', 'Ġgirl', 'Ġnamed', 'ĠLily', 'Ġfound', 'Ġa', 'Ġneed', 'le', 'Ġin', 'Ġher', 'Ġroom', '.', 'ĠShe', 'Ġknew', 'Ġit', 'Ġwas', 'Ġdif', 'f', 'ic', 'ul', 't', 'Ġto', 'Ġplay', 'Ġwith', 'Ġit', 'Ġbecause', 

In [9]:
# We can decode token ids with decode() or batch_decode()
decoded_tokens = tokenizer.batch_decode([input_ids], skip_special_tokens=False, clean_up_tokenization_spaces=True)
for s in decoded_tokens:
    print(f"\"{s}\"")

"<|BOS|> One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them b"


---
We can dump the vocabulary of the tokenizer. The first part will contain our special tokens and the ASCII character-set. After this, the number of characters in each token grows, with the largest tokens at the end.

In [10]:
# Dump a range of the tokenizer's vocabulary
def show_vocabulary(tokenizer, token_range):
    for i, token in zip(token_range, tokenizer.batch_decode([i for i in token_range], skip_special_tokens=False)):
        print(f"'{i}: {token}'", end=" ")
    print("\n")

# Show the first and last 64 tokens.
show_vocabulary(tokenizer, range(64))
show_vocabulary(tokenizer, range(tokenizer.vocab_size - 64, tokenizer.vocab_size))

# Show full vocab.
#show_vocabulary(tokenizer, range(tokenizer.vocab_size))

'0: <|PAD|>' '1: <|MASK|>' '2: <|BOS|>' '3: <|EOS|>' '4: <|UNK|>' '5: !' '6: "' '7: #' '8: $' '9: %' '10: &' '11: '' '12: (' '13: )' '14: *' '15: +' '16: ,' '17: -' '18: .' '19: /' '20: 0' '21: 1' '22: 2' '23: 3' '24: 4' '25: 5' '26: 6' '27: 7' '28: 8' '29: 9' '30: :' '31: ;' '32: <' '33: =' '34: >' '35: ?' '36: @' '37: A' '38: B' '39: C' '40: D' '41: E' '42: F' '43: G' '44: H' '45: I' '46: J' '47: K' '48: L' '49: M' '50: N' '51: O' '52: P' '53: Q' '54: R' '55: S' '56: T' '57: U' '58: V' '59: W' '60: X' '61: Y' '62: Z' '63: [' 

'1936: ek' '1937:  shap' '1938:  shook' '1939:  exploring' '1940:  moved' '1941:  purp' '1942:  year' '1943: aughty' '1944:  nearby' '1945:  naughty' '1946:  star' '1947:  soup' '1948:  shop' '1949:  wise' '1950:  stars' '1951:  owl' '1952:  bring' '1953: fused' '1954:  jar' '1955: bow' '1956: Do' '1957: ocked' '1958:  inv' '1959:  exp' '1960:  whe' '1961: yard' '1962:  caught' '1963:  su' '1964: ward' '1965:  Emma' '1966:  backyard' '1967:  seemed' '1968: ail'

### Save tokenizer
We can save the tokenizer, which will allow us to skip recreating it from scratch next time.

In [8]:
tokenizer.save_pretrained(model_path)

('/home/dinalt/ai_assets/models/causallm/tokenizer_config.json',
 '/home/dinalt/ai_assets/models/causallm/special_tokens_map.json',
 '/home/dinalt/ai_assets/models/causallm/tokenizer.json')

### Load tokenizer
We can load our saved tokenizer -- or the tokenizer from any Huggingface model -- with this interface.

In [3]:
from transformers import AutoTokenizer

# Load a tokenizer from a local path -- or from a Huggingface model name.
# Rather than starting from scratch, you could replace 'model_path' with the path of an existing model and use its tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_path)
print(tokenizer)

PreTrainedTokenizerFast(name_or_path='/home/dinalt/ai_assets/models/causallm', vocab_size=2000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|BOS|>', 'eos_token': '<|EOS|>', 'unk_token': '<|UNK|>', 'pad_token': '<|EOS|>', 'mask_token': '<|MASK|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<|PAD|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|MASK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|BOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|EOS|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<|UNK|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


## Tokenize dataset
Before training the model, we need to convert the text in the dataset to the token-ids used by the model.

This function is a fairly simple implementation of this functionality. It will:
- Select a subset of the dataset, if 'select' is less than 1.0.
- Take each example from the dataset, in batches, and convert the text to the corresponding token-ids.
- Truncate sequences longer than the model can process.
- Remove unused columns from the data.

Padding is not applied at this stage (the `padding` argument is left commented out); sequences in a batch are padded to a common length later, by the data collator created in the trainer.
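
As an illustration of what truncation and padding do to a batch of token-id sequences, here is a minimal pure-Python sketch. The helper `truncate_and_pad`, the toy ids, and the choice of `pad_id=3` are assumptions for this example only; in the notebook these steps are handled by the tokenizer and the data collator, not by code like this.

```python
def truncate_and_pad(sequences, max_length, pad_id):
    """Truncate each sequence to max_length, then right-pad to the longest remaining length."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding.
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return padded, mask

toy_batch = [[2, 491, 360, 16], [2, 263], [2, 403, 450, 505, 362, 598]]
toy_padded, toy_mask = truncate_and_pad(toy_batch, max_length=5, pad_id=3)
for row, row_mask in zip(toy_padded, toy_mask):
    print(row, row_mask)
```

Padding is what lets sequences of different lengths be stacked into one rectangular tensor; the mask records which positions hold real tokens so that padded positions can be excluded from the loss.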

In [9]:
import os
# select: the ratio of the total to tokenize. e.g. 0.1 = 10%
def tokenize_dataset(dataset, tokenizer, select=1.0, shuffle=False):
    def map_fn(element, tokenizer):
        outputs = tokenizer(
            element["text"],
            truncation=True,

            # Other common arguments, for reference.
            #padding=True,
            #return_tensors='pt',

            # This can be used to limit the block-size to less than what the model can handle.
            # max_length=block_size,

            # If set to True and a sequence is truncated, the 'overflowing' tokens are
            # returned as additional sequences in the output, rather than being discarded.
            # return_overflowing_tokens=True,

            # This can be used to get the length of the returned sequences, allowing one to
            # discard short sequences, if desired.
            # return_length=True,
        )
        return {"input_ids": outputs["input_ids"]}

    if shuffle:
        dataset = dataset.shuffle()
    
    if select < 1.0:
        dataset = dataset.select(range(0, int(len(dataset) * select)))
    
    # https://stackoverflow.com/questions/62691279/how-to-disable-tokenizers-parallelism-true-false-warning
    #os.environ["TOKENIZERS_PARALLELISM"] = "true"
    tokenized_data = dataset.map(
        map_fn,
        batched=True,
        remove_columns=dataset.column_names,
        fn_kwargs=dict(tokenizer=tokenizer)
    )
    #os.environ["TOKENIZERS_PARALLELISM"] = "false"
    return tokenized_data

#### Tokenize Datasets

In [10]:
# Note the small "select" sizes. We are only using 10% of the validation and train
# examples. This is to keep the training time reasonable for the examples.
tok_val_dataset = tokenize_dataset(dataset["validation"], tokenizer, select=0.1)
tok_train_dataset = tokenize_dataset(dataset["train"], tokenizer, select=0.1)

Map:   0%|          | 0/2199 [00:00<?, ? examples/s]

#### Save Tokenized Dataset
Optional: You can save the datasets in pre-tokenized form.
Note: The Datasets library is fairly good about caching, so this may be redundant.

In [11]:
def save_tokenized_dataset(path, train, validate):
    dataset_dict = datasets.DatasetDict()
    dataset_dict["train"] = train
    dataset_dict["validation"] = validate
    print(dataset_dict)
    dataset_dict.save_to_disk(path)

save_tokenized_dataset(tokenized_dataset_path, tok_train_dataset, tok_val_dataset)

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 211971
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 2199
    })
})


Saving the dataset (0/1 shards):   0%|          | 0/211971 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2199 [00:00<?, ? examples/s]

#### Load Tokenized Dataset

In [7]:
def load_tokenized_dataset(dataset_path):
    tokenized_dataset = datasets.load_from_disk(dataset_path)
    print(tokenized_dataset)
    return tokenized_dataset["train"], tokenized_dataset["validation"]

tok_train_dataset, tok_val_dataset = load_tokenized_dataset(tokenized_dataset_path)

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 211971
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 2199
    })
})


## Create a simple causal language model
A "causal" language model is one which makes predictions about future tokens based upon past tokens. We will start with a simple model which predicts the next token, given only the immediately preceding token.
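
The idea of predicting the next token from only the preceding one can be sketched without a neural network at all, by counting which token follows which. The functions `fit_bigram_counts` and `predict_next` and the toy id sequence below are assumptions for illustration; the model defined in this section learns a comparable mapping with an embedding and a linear layer rather than with explicit counts.

```python
from collections import Counter, defaultdict

def fit_bigram_counts(token_ids):
    """Count how often each token-id follows each other token-id."""
    counts = defaultdict(Counter)
    for prev_id, next_id in zip(token_ids, token_ids[1:]):
        counts[prev_id][next_id] += 1
    return counts

def predict_next(counts, prev_id):
    """Return the most frequently observed successor of prev_id, or None if unseen."""
    if prev_id not in counts:
        return None
    return counts[prev_id].most_common(1)[0][0]

toy_ids = [2, 10, 11, 12, 10, 11, 13, 10, 11, 12]
bigram_counts = fit_bigram_counts(toy_ids)
print(predict_next(bigram_counts, 10))
print(predict_next(bigram_counts, 11))
print(predict_next(bigram_counts, 99))
```

A model of this kind has no memory beyond a single token, which is why its output tends to be locally plausible but globally incoherent.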

### Model definition

In [3]:
import torch
from torch import nn, Tensor
import torch.nn.init as init
from torch.nn import functional as F

class CausalLM(nn.Module):
    def __init__(
        self,
        # d_model is the number of features in the model's embeddings, where a feature is a single floating-point scalar and
        # the embeddings are vectors, each of size d_model. This parameter is sometimes referred to as the model's
        # "hidden" dimension.
        d_model,

        # This is the vocabulary size of the model, which should match the size of the model's tokenizer.
        vocab_size
    ):
        super().__init__()
        self.vocab_size = vocab_size
        self.d_model = d_model

        # The class contains an array of embeddings, with each token-id in the vocabulary corresponding to the element at that index.
        # For example, self.embedding.weight[token_id] would refer to the features at the index 'token_id.'
        self.embedding = nn.Embedding(self.vocab_size, self.d_model)

        # We will use a linear layer to convert embeddings into a probability distribution across all of the token-ids, thus
        # it has an input size of d_model and an output size of vocab_size.
        self.output_projection = nn.Linear(self.d_model, self.vocab_size)

    def forward(self, input_ids: Tensor, labels: Tensor=None, attention_mask: Tensor=None):
        # input_ids (batch_size, seq_len):
        #    This contains batches of sequences of token-ids, representing the input text.
        # labels (batch_size, seq_len): If given these are the ground-truth targets the model is striving to predict.
        #    For a causal model, these are identical to the input-ids, with a special value of -100
        #    reserved for padding, which are not scored.
        # attention_mask: The Huggingface APIs pass this in, although we don't use it.
        
        # Convert input_ids to embeddings.
        x = self.embedding(input_ids)

        # Convert embeddings to logits (unnormalized log-probabilities) for the next token-id.
        # We could convert the logits to probabilities (0.0 to 1.0) with
        # torch.softmax(logits, dim=-1)
        logits = self.output_projection(x)

        # If we are passed labels, we will compute loss and return both loss and logits.
        if labels is not None:
            loss = self.loss(logits, labels)
            return (loss, logits)
        # Otherwise, we only return the logits.
        else:
            return logits

    def loss(self, logits, labels):
        # Shift so that tokens < n predict n
        # To achieve this, we slice off the last prediction, as we don't have a label
        # corresponding to it and slice off the first label, as nothing precedes it for which
        # a prediction could be made.
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

        # The loss measures the error between the ground-truth (labels) and what the model predicted (logits).
        # If the model makes a perfect prediction, the loss will be zero, otherwise, it will be a positive
        # log-scaled measure of the error.
        #
        # For each label, the model makes a prediction for every token in the vocabulary, with the logits being
        # an unnormalized log-probability distribution of the prediction. Cross-entropy loss compares the model's
        # predicted distribution with a "one-hot" distribution -- that is, a probability distribution with 1.0
        # where the label is and 0.0 for all other elements in the vocabulary.
        loss = torch.nn.functional.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            # labels with this value are ignored when computing loss
            ignore_index=-100,
            reduction='mean',
        )

        # Allowing the model to return NaN can cause problems, so we convert these values to a number.
        return loss.nan_to_num()
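
The shift-and-score logic in `loss()` can be followed step by step on plain Python lists. The sketch below is a hand-written stand-in, written for illustration under the assumption of a single sequence; `shifted_cross_entropy` and the toy inputs are not part of the notebook, and the model itself uses `torch.nn.functional.cross_entropy`.

```python
import math

IGNORE_INDEX = -100

def shifted_cross_entropy(logits, labels):
    """Mean next-token cross-entropy for one sequence, using plain Python lists.

    logits: one list of vocab_size scores per position.
    labels: one token-id per position (IGNORE_INDEX marks positions to skip).
    """
    # Drop the last prediction and the first label, so position i is scored against token i+1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]

    total, count = 0.0, 0
    for scores, label in zip(shift_logits, shift_labels):
        if label == IGNORE_INDEX:
            continue
        # Negative log-softmax of the target: log of the summed exponentials minus the target score.
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[label]
        count += 1
    # With nothing to score, return 0.0 (the notebook's loss maps the resulting NaN to a number).
    return total / count if count else 0.0

toy_vocab = 4
toy_labels = [2, 0, 3, IGNORE_INDEX]
uniform_logits = [[0.0] * toy_vocab for _ in toy_labels]
toy_loss = shifted_cross_entropy(uniform_logits, toy_labels)
print(toy_loss)
```

With all-zero logits every token is equally likely, so each scored position contributes the log of the vocabulary size, and the ignored position contributes nothing.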

### Instantiate model
This will create an instance of our model with a hidden-dimension (d_model) of 128 and a vocabulary size matching that of the tokenizer.

In [4]:
def print_model_size(model):
    model_size = sum(t.numel() for t in model.parameters())
    print(f"Model size: {model_size/1000**2:.1f}M parameters")

# Model hidden dimension
d_model = 128

model = CausalLM(d_model, tokenizer.vocab_size)
print_model_size(model)
print(model)

Model size: 0.5M parameters
CausalLM(
  (embedding): Embedding(2000, 128)
  (output_projection): Linear(in_features=128, out_features=2000, bias=True)
)


### Enable Torch Compile [optional]

https://pytorch.org/docs/stable/torch.compiler.html

Optionally compile the model. This may not work with all versions of Python and PyTorch.
Using "torch.compile()" is especially effective at speeding up small models.

In [None]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
torch.set_float32_matmul_precision('high')
model.compile()
os.environ["TOKENIZERS_PARALLELISM"] = "true"

### Test model forward
This will tokenize our sample text and feed it through the forward method of the model to ensure that the code does not "fall over."

If the model has not been trained, the loss is expected to be around 7 - 8; lower, if it has been trained.
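
The expected range for an untrained model can be sanity-checked with a short calculation: a model that guesses uniformly over the vocabulary has a cross-entropy equal to the natural log of the vocabulary size. This is a sketch under that uniform-guess assumption; a randomly initialized model is not exactly uniform, so its loss will only be roughly this value. The helper name `uniform_cross_entropy` is made up for this example.

```python
import math

def uniform_cross_entropy(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over vocab_size tokens."""
    return -math.log(1.0 / vocab_size)

baseline_loss = uniform_cross_entropy(2000)
print(f"{baseline_loss:.2f}")
```

A loss well below this baseline indicates that the model has learned something about which tokens tend to follow which.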

In [11]:
def test_model_forward(model, tokenizer, text, device='cpu'):
    model.train()
    model = model.to(device=device)
    
    input_ids = tokenizer(
        text,
        truncation=True,
        return_tensors='pt',
    )['input_ids'].to(device=device)

    print("input_ids:\n", input_ids)
    labels = input_ids

    loss, logits = model(input_ids=input_ids, labels=labels)
    print(loss)
    
    # Compute gradient
    loss.backward()

    # Reset model gradients
    model.zero_grad()

test_model_forward(model, tokenizer, sample_text)

input_ids:
 tensor([[   2,  491,  360,   16,  263,  403,  450,  505,  362,  598,  263,  792,
          311,  320,  313,  763,   18,  317,  709,  308,  286, 1035,   74,  475,
         1389,   88,  270,  365,  346,  308,  791,  308,  286,  385,  291,   84,
           18,  362,  448,  270,  952,  267,  792,  311,  346,  313,  370,   16,
          354,  342,  464,  442,   91,  263, 1842,  307,  349,  313,  385,  316,
           88,   18,  203,  203,  601,  473,  270,  313,  370,  269,  331,   16,
          332,  781,   16,  339,  598,  747,  792,  311,   18, 1283,  350,  952,
          308,  346,  522,  269,  442,   91,  656,  385,  316,   88,  481,  869,
          370,  503,  269,  331,   16,  332,  836,   16,  362,   16,  369,  477,
          952,  267,  792,  311,  269, 1307,  633,  385,  316,   88,  420,  203,
          203,   56,   83,  558,   16,  368, 1659,  267,  792,  311,  269,  442,
           91,  268,  267, 1842,  307,  349,  362,  376,  385,  316,   88,   18,
          413,  

## Train Model
This is an example training-loop implementation.

Example code is based upon examples here:
https://huggingface.co/learn/nlp-course/en/chapter3/4

In [6]:
import torch
from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling
from tqdm.auto import tqdm
import evaluate
import numpy as np
import transformers

class CausalTrainer:
    def __init__(
        self,
        model,
        tokenizer,
        train_dataset,
        eval_dataset,
        per_device_train_batch_size,
        per_device_eval_batch_size,
        eval_steps,
        lr,
        num_train_epochs,
        optimizer_factory,
        lr_scheduler_factory,
        device,
    ):
        self.model = model
        self.tokenizer = tokenizer
        self.eval_steps = eval_steps
        self.device = device
        self.num_train_epochs = num_train_epochs
        self.model = self.model.to(self.device) 
        
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=tokenizer,
            mlm=False,
        )
        
        self.train_dataloader = DataLoader(
            train_dataset,
            shuffle=True,
            batch_size=per_device_train_batch_size,
            collate_fn=data_collator,
        )
        
        self.eval_dataloader = DataLoader(
            eval_dataset,
            batch_size=per_device_eval_batch_size,
            collate_fn=data_collator
        )

        self.num_train_steps = int(self.num_train_epochs * len(self.train_dataloader))
        # Use the 'lr' argument passed to the constructor, not the notebook-level global.
        self.optimizer = optimizer_factory(model.parameters(), lr)
        self.lr_scheduler = lr_scheduler_factory(self.optimizer, self.num_train_steps)

    def train(self):
        self._train_loop()
        self.eval()

    def _train_loop(self):
        self.model.train()
        epoch = 0
        global_step = 0
        eval_step = 0
        
        
        print(f"Training for {self.num_train_steps} steps")
        progress_bar = tqdm(range(self.num_train_steps))
        while True:
            for batch in self.train_dataloader:
                batch = {k: v.to(self.device) for k, v in batch.items()}
                outputs = self.model(**batch)
                loss = outputs[0]
                loss.backward()
                self.optimizer.step()
                self.lr_scheduler.step()
                self.optimizer.zero_grad()

                progress_bar.update(1)
                global_step += 1
                if global_step >= self.num_train_steps:
                    return
                eval_step += 1
                # Evaluate every 'eval_steps'
                if eval_step == self.eval_steps:
                    self.eval()
                    print(f"Global step: {global_step}")
                    self.model.train()
                    eval_step = 0
            
            epoch += 1
            print(f"Epoch: {epoch}")

    @torch.no_grad()
    def eval(self):
        eval_steps = len(self.eval_dataloader)
        #metric = evaluate.load("accuracy")
        self.model.eval()
        progress_bar = tqdm(range(eval_steps))
        total_loss = 0
        for batch in self.eval_dataloader:
            batch = {k: v.to(self.device) for k, v in batch.items()}
            outputs = self.model(**batch)
            logits = outputs[1]
            loss = outputs[0]
            total_loss += loss.item()
            progress_bar.update(1)
        print(f"loss={total_loss / eval_steps}")

# This provides a place to configure the training parameters.
def do_train():
    CausalTrainer(
        model,
        tokenizer,
        train_dataset=tok_train_dataset,
        eval_dataset=tok_val_dataset,
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        lr=learning_rate,
        num_train_epochs=num_train_epochs,
        eval_steps=eval_steps,
        optimizer_factory=lambda params, lr: torch.optim.AdamW(params, lr=lr),
        lr_scheduler_factory=lambda opt, steps: transformers.get_scheduler(lr_scheduler_name, opt, num_warmup_steps, steps),
        device=device,
    ).train()

In [None]:
do_train()
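
The step accounting in the loop above can be previewed with plain arithmetic. This is a minimal sketch, not part of the trainer: the helper names are hypothetical, and 3313 is the step count printed by a training run later in this notebook, treated here as the number of batches in a single epoch.

```python
def count_train_steps(num_batches, num_train_epochs):
    # Mirrors int(self.num_train_epochs * len(self.train_dataloader)).
    return int(num_train_epochs * num_batches)

def eval_points(num_train_steps, eval_steps):
    # The loop returns as soon as global_step reaches num_train_steps, before
    # the periodic check runs, so only multiples strictly below the total fire.
    return list(range(eval_steps, num_train_steps, eval_steps))

steps = count_train_steps(num_batches=3313, num_train_epochs=1.0)
print(steps, eval_points(steps, 1000))
```

The final evaluation is not in this list because `train()` calls `self.eval()` once more after the loop returns.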

### Accelerate Training Loop

This is the same code, but modified to run on multiple GPUs within a notebook using the [Accelerate](https://huggingface.co/docs/accelerate/v0.11.0/en/index) library.

Note: For small models, this may actually be slower than the basic training loop.

In [14]:
from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling
from tqdm.auto import tqdm
import evaluate
import numpy as np
from accelerate import Accelerator
from accelerate import notebook_launcher
import transformers

class CausalAccelerateTrainer:
    def __init__(
        self,
        model,
        tokenizer,
        train_dataset,
        eval_dataset,
        per_device_train_batch_size,
        per_device_eval_batch_size,
        eval_steps,
        lr,
        num_train_epochs,
        optimizer_factory,
        lr_scheduler_factory,
    ):
        self.model = model
        self.tokenizer = tokenizer
        self.eval_steps = eval_steps
        self.num_train_epochs = num_train_epochs
        self.accelerator = Accelerator()
        
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=tokenizer,
            mlm=False,
        )
        
        self.train_dataloader = DataLoader(
            train_dataset,
            shuffle=True,
            batch_size=per_device_train_batch_size,
            collate_fn=data_collator,
        )
        
        self.eval_dataloader = DataLoader(
            eval_dataset,
            batch_size=per_device_eval_batch_size,
            collate_fn=data_collator
        )

        # Use the 'lr' argument passed to the constructor, not the notebook-level global.
        self.optimizer = optimizer_factory(model.parameters(), lr)

        self.train_dataloader, self.eval_dataloader, self.model, self.optimizer = self.accelerator.prepare(
            self.train_dataloader,
            self.eval_dataloader,
            self.model,
            self.optimizer,
        )

        self.num_train_steps = int(self.num_train_epochs * len(self.train_dataloader))
        self.lr_scheduler = lr_scheduler_factory(self.optimizer, self.num_train_steps)
        self.num_train_epochs = num_train_epochs

    def train(self):
        self._train_loop()
        self.eval()

    def _train_loop(self):
        self.model.train()
        epoch = 0
        global_step = 0
        eval_step = 0

        # Note: Getting this attribute appears to be expensive, so cache it!
        is_main_process = self.accelerator.is_main_process
        if is_main_process:
            print(f"Training for {self.num_train_steps} steps")
            progress_bar = tqdm(range(self.num_train_steps))
        while True:
            for batch in self.train_dataloader:
                outputs = self.model(**batch)
                loss = outputs[0]
                self.accelerator.backward(loss)
                self.optimizer.step()
                self.lr_scheduler.step()
                self.optimizer.zero_grad()

                if is_main_process:
                    progress_bar.update(1)
                global_step += 1
                if global_step >= self.num_train_steps:
                    return
                eval_step += 1
                # Evaluate every 'eval_steps'
                if eval_step == self.eval_steps:
                    self.eval()
                    if is_main_process:
                        print(f"Global step: {global_step}")
                    self.model.train()
                    eval_step = 0
            
            epoch += 1
            if is_main_process:
                print(f"Epoch: {epoch}")

    @torch.no_grad()
    def eval(self):
        eval_steps = len(self.eval_dataloader)
        is_main_process = self.accelerator.is_main_process
        #metric = evaluate.load("accuracy")
        self.model.eval()
        if is_main_process:
            progress_bar = tqdm(range(eval_steps))
        total_loss = 0
        for batch in self.eval_dataloader:
            outputs = self.model(**batch)
            logits = outputs[1]
            loss = outputs[0]
            total_loss += self.accelerator.reduce(loss.detach(), "mean").item()
            if is_main_process:
                progress_bar.update(1)
        if is_main_process:
            print(f"loss={total_loss / eval_steps}")

def train_function():
    trainer = CausalAccelerateTrainer(
        model,
        tokenizer,
        train_dataset=tok_train_dataset,
        eval_dataset=tok_val_dataset,
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        lr=learning_rate,
        num_train_epochs=num_train_epochs,
        eval_steps=eval_steps,
        optimizer_factory=lambda params, lr: torch.optim.AdamW(params, lr=lr),
        lr_scheduler_factory=lambda opt, steps: transformers.get_scheduler(lr_scheduler_name, opt, num_warmup_steps, steps),
    )
    trainer.train()

def do_train():
    notebook_launcher(train_function, num_processes=torch.cuda.device_count())

In [15]:
do_train()

Launching training on 6 GPUs.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Training for 553 steps


  0%|          | 0/553 [00:00<?, ?it/s]

KeyboardInterrupt: 

### Huggingface Trainer Example
This illustrates how to use the HF trainer class, with the functionality being similar
to the above code.

Within the context of a notebook and multiple GPUs, the trainer will train the model using torch [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). This is not ideal for performance.

The same code, launched from a script using Accelerate, will perform much better.

In [5]:
from transformers import Trainer, TrainingArguments
import evaluate
import numpy as np
from transformers import DataCollatorForLanguageModeling

def train_causal_model(
    model,tokenizer,
    train_dataset,
    eval_dataset,
    training_args,
):
    def preprocess_logits_for_metrics(logits, labels):
        if isinstance(logits, tuple):
            # Depending on the model and config, logits may contain extra tensors,
            # like past_key_values, but logits always come first
            logits = logits[0]
        return logits.argmax(dim=-1)
    
    metric = evaluate.load("accuracy")
    
    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        # preds have the same shape as the labels, after the argmax(-1) has been calculated
        # by preprocess_logits_for_metrics but we need to shift the labels
        labels = labels[:, 1:].reshape(-1)
        preds = preds[:, :-1].reshape(-1)
        return metric.compute(predictions=preds, references=labels)
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        #compute_metrics=compute_metrics,
        #preprocess_logits_for_metrics=preprocess_logits_for_metrics,
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=tokenizer,
            mlm=False,
        )
    )

    trainer.train()
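
The label shift in `compute_metrics` above can be illustrated with plain lists. This is a small sketch rather than the notebook's metric: `shifted_accuracy` is a hypothetical helper, and skipping labels equal to -100 (the value used for ignored positions elsewhere in this notebook) is an added assumption that the original function does not make.

```python
IGNORE_INDEX = -100  # label value treated as "ignore" (assumption: padding)

def shifted_accuracy(preds, labels):
    # preds[i] is the model's guess for the token at position i + 1,
    # so drop the last prediction and the first label before comparing.
    pairs = [
        (p, t)
        for p, t in zip(preds[:-1], labels[1:])
        if t != IGNORE_INDEX  # assumption: skip ignored positions
    ]
    correct = sum(1 for p, t in pairs if p == t)
    return correct / len(pairs) if pairs else 0.0

labels = [2, 7, 5, 9, -100]
preds = [7, 5, 4, 1, 3]  # argmax token id at each position
print(shifted_accuracy(preds, labels))
```

Here two of the three scored positions match, since the last label is ignored and the last prediction has no target.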

In [None]:
from accelerate import notebook_launcher

def do_train():
    train_causal_model(
        model,tokenizer,
        tok_train_dataset,
        tok_val_dataset,
        training_args = TrainingArguments(
            per_device_train_batch_size=64,
            per_device_eval_batch_size=128,
            output_dir="test_trainer",
            evaluation_strategy="steps",
            eval_steps=500,
            num_train_epochs=1,
    
            # If set too high, your GPU may run out of memory.
            #per_device_train_batch_size=8,
            #per_device_eval_batch_size=16,
            
            # The learning rate will need to be reduced as model size grows. If the rate is set too high, the
            # loss will become unstable, possibly increasing.
            learning_rate=1e-3,
            # Set for better diagnostics
            #use_cpu=True,
        ),
    )

#notebook_launcher(do_train, num_processes=torch.cuda.device_count())
do_train()

## Evaluate predictions

### Implementation
You don't need to look all that closely at this code -- we use it for evaluating the model's prediction in the next section.

In [9]:
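import math
import torch
# 'F' is used by causal_loss_metric() at the end of this cell.
from torch.nn import functional as F
import matplotlib.pyplot as plt
from IPython.display import display, HTML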

def show_predictions(model, tokenizer, device, text):
    for i, line in enumerate(text):
        print(f"line: {line}")
        model = model.to(device)
        logits, label_ids = predict_text(model, tokenizer, line, device=device)
            
        show_colorized_tokens(
            tokenizer=tokenizer,
            logits=logits,
            label_ids=label_ids,
            
            # the metric function to use
            metric_fn=causal_loss_metric,
        
            # how to translate metric to colors
            color_encoder=ColorEncoder(
                is_relative=True,
                #lower_bound=0,
                #upper_bound=10,
                cmap='plasma',
            ),
            top_k = 10,
            
            # Filter threshold on metric
            #threshold=0,
            pad_lines=15
        )
        print("\n")

def predict_text(model, tokenizer, input_text, device):
    batch_encoding = tokenizer(
        input_text,
        truncation=True,
        return_tensors='pt',
        verbose=True,
    )
    input_ids = batch_encoding['input_ids'].to(device)
    model.eval()
    model.to(device)
    logits = model(input_ids=input_ids)
    return logits.cpu().detach().float(), input_ids.cpu().detach()

def show_colorized_tokens(tokenizer, logits, label_ids, metric_fn, color_encoder, top_k, threshold=None, pad_lines=20):
    metrics, metric_min, metric_max, metric_label = metric_fn(logits=logits, label_ids=label_ids)
    value_min, value_max = metrics.aminmax()
    
    print(f"Metric '{metric_label}': n={metrics.numel()}, min={value_min}, max={value_max}, mean={metrics.mean()}, range=({metric_min}, {metric_max})")

    # Causal models only: pad the left with a zero so each metric aligns with the token it scores.
    metrics = torch.cat((torch.zeros(metrics.size(0), 1, device=metrics.device, dtype=metrics.dtype), metrics), dim=-1)
    if metrics.size(-1) > label_ids.size(-1):
        metrics = metrics.narrow(-1, 0, label_ids.size(-1))
    colors = color_encoder(metrics, value_min, value_max, metric_min=metric_min, metric_max=metric_max)
    html_text = tooltip_style

    label_ids = restore_pad_ids(tokenizer, label_ids)
    for i in range(label_ids.size(0)):
        metric_mean = metrics[i].mean()
        info = f"Metric[{i}] '{metric_label}': n={metrics[i].numel()}, min={metrics[i].min()}, max={metrics[i].max()}, mean={metric_mean}<br>"
        html_text += info
        if threshold is not None and metric_mean < threshold:
            continue
        token_seq = tokenizer.batch_decode(label_ids[i], skip_special_tokens=True)
        color_seq = colors[i]
        token_info_list = generate_token_info_list(tokenizer, token_seq, metric_label, metrics[i], logits[i], top_k)
        text = color_encode_html_tokens(token_seq, color_seq, token_info_list)
        html_text += text + "<br>"

    for _ in range(pad_lines):
        html_text += "<br>"
    
    display(HTML(html_text))
    
# Replace '-100' label ids with the tokenizer's EOS token id (used here in place of a 'pad' token) so they can be decoded.
def restore_pad_ids(tokenizer, label_ids):
    return torch.where(label_ids == -100, tokenizer.eos_token_id, label_ids) 

tooltip_style = """
<style>
/* Tooltip container class */
.token {
  position: relative;
  display: inline-block;
}

/* Tooltip text */
.token .tooltip {
  visibility: hidden;
  width: 300px;
  background-color: black;
  color: #fff;
  text-align: left;
  padding: 5px 0;
  border-radius: 6px;
 
  /* Position the tooltip text - see examples below! */
  position: absolute;
  z-index: 1;
}

/* Show the tooltip text when you mouse over the tooltip container */
.token:hover .tooltip {
  visibility: visible;
}
</style>
"""

def escape_html_token(token):
    match token:
        case '\n':
            return '<br>'
        case '<':
            return '&lt;'
        case '>':
            return '&gt;'
        case '"':
            return '&quot;'
        case "'":
            return '&#39;'
        case '&':
            return '&amp;'
        case _:
            return token

def html_color(color):
    return "#{:02x}{:02x}{:02x}".format(int(255*color[0]), int(255*color[1]), int(255*color[2]))

def color_encode_html_tokens(token_seq, color_seq, info_seq):
    text = ""
    for token, color, info in zip(token_seq, color_seq, info_seq):
        if token == '\n':
            text += "<br>"
        else:
            # HTML will eat your space tokens if you don't do this!
            if len(token) > 0 and token[0] == ' ':
                token = "&nbsp;" + token[1:]
            text += f"<span class='token' style='color: {html_color(color)}'>{escape_html_token(token)}<span class='tooltip'>{info}</span></span>"
    return text

class ColorEncoder:
    def __init__(self, is_relative=True, upper_bound=math.inf, lower_bound=-math.inf, cmap='viridis'):
        self.is_relative = is_relative
        self.upper_bound = upper_bound
        self.lower_bound = lower_bound
        self.colormap = plt.get_cmap(cmap)

    def __call__(self, metrics, value_min, value_max, metric_min=None, metric_max=None):
        if self.is_relative or metric_min is None and metric_max is None:
            minimum, maximum = value_min, value_max
        elif metric_min is None:
            minimum, maximum = value_min, metric_max
        elif metric_max is None:
            minimum, maximum = metric_min, value_max
        else:
            minimum, maximum = metric_min, metric_max
            
        minimum = max(minimum, self.lower_bound)
        maximum = min(maximum, self.upper_bound)
        return self.colormap(self.normalize_metric(metrics, minimum, maximum))
        
    def normalize_metric(self, metric, minimum, maximum):
        return torch.clamp(input=(metric - minimum) / (maximum - minimum), min=0.0, max=1.0)

def topk_predicted_tokens(tokenizer, logits, top_k=5):
    top_prob, top_indices = torch.topk(torch.softmax(logits, dim=-1), top_k, dim=-1)
    top_tokens = tokenizer.batch_decode(top_indices.flatten(), skip_special_tokens=True)
        
    return top_prob.flatten(), top_tokens
        
def generate_token_info_list(tokenizer, token_seq, metric_label, metrics, logits, top_k=5):
    top_prob, top_tokens = topk_predicted_tokens(tokenizer, logits, top_k)
    info_list = []
    for j in range(len(metrics)):
        text = f"Token: '{token_seq[j]}'<br>{metric_label}: {'%.5f' % metrics[j]}<br>---------------<br>"
        if j != 0: # And is causal!
            start = (j - 1) * top_k
            for k in range(start, start + top_k):
                top_token = top_tokens[k]
                if top_token == '\n':
                    top_token = '\\n'
                text += f"{'%.2f' % top_prob[k]} : '{escape_html_token(top_token)}'<br>"
        info_list.append(text)
    return info_list

#### Logits and Labels Metric Functions
def causal_loss_metric(logits, label_ids, reduction='none'):
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = label_ids[..., 1:].contiguous()
    
    loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1), reduction=reduction)\
        .view(label_ids.size(0), label_ids.size(1) - 1)
    
    return loss, 0, None, "Causal Loss"

### Predict
This will take the input text and have the model make predictions for the next token for each token in the sequence.

The color coding indicates the loss for each individual token, with darker colors being more accurate and brighter colors being less so.

If you hover over a token, you can see the top-10 tokens the model predicted for that position in the sequence.
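
The per-token loss behind the colors can be sketched without torch. This is an illustrative, dependency-free version of the shifted cross-entropy, with a hypothetical 3-token vocabulary and made-up logits; it is not the notebook's `causal_loss_metric`, only the same arithmetic on plain lists.

```python
import math

def token_losses(logits, token_ids):
    """Cross-entropy of each token given the logits from the position before it."""
    losses = []
    for step_logits, target in zip(logits[:-1], token_ids[1:]):
        # log-sum-exp with the max subtracted for numerical stability
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in step_logits))
        losses.append(log_z - step_logits[target])
    return losses

# Toy vocabulary of 3 token ids; one row of logits per input position.
logits = [[0.0, 5.0, 0.0], [1.0, 1.0, 1.0], [9.0, 0.0, 0.0]]
token_ids = [0, 1, 2]
losses = token_losses(logits, token_ids)
print([round(x, 3) for x in losses])
```

The first loss is small because the logits at position 0 strongly favor the actual next token, while the uniform logits at position 1 give a loss of ln(3). The first token has no loss of its own, which is why the display pads the left side with a zero.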

In [10]:
show_predictions(model, tokenizer, device="cuda", text=[sample_text])

line: One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them b
Metric 'Causal Loss': n=154, min=0.004118770360946655, max=9.195035934448242, mean=3.3656723499298096, range=(0, None)






### Simple Text Gen
This is a very simple text generator implementation.

In [12]:
class TextGenerator:
    def __init__(self, model, tokenizer, device, temperature=1.0, do_sample=False, seed=None):
        self.model = model
        self.tokenizer = tokenizer
        self.temperature = temperature
        self.device = device
        self.do_sample = do_sample
        self.rand_generator = torch.Generator(device=device)
        self.set_seed(seed)

    @torch.no_grad()
    def generate(self, input_ids, max_new_tokens=20):
        for _ in range(max_new_tokens):
            logits = self.model(input_ids=input_ids)
            logits = logits[:, -1, :] / self.temperature
            probabilities = torch.softmax(logits, dim=-1)
            if self.do_sample:
                next_token_id = torch.multinomial(
                    probabilities, num_samples=1, generator=self.rand_generator)
            else:
                _, next_token_id = torch.topk(probabilities, k=1, dim=-1)
            input_ids = torch.cat((input_ids, next_token_id), dim=1)
        return input_ids

    def set_seed(self, seed):
        if seed is None:
            self.rand_generator.seed()
        else:
            self.rand_generator.manual_seed(seed)
    
    # Lazy generation pipeline for simple inference.
    def prompt(self, input_text, max_new_tokens=20):
        input_ids = self.tokenizer(input_text, return_tensors='pt')['input_ids']
        model_output = self.generate(
            input_ids.to(self.device),
            max_new_tokens=max_new_tokens,
        )
        return self.tokenizer.decode(model_output.to('cpu')[0])
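
The two choices made inside `generate()` above, temperature scaling and greedy versus sampled selection, can be tried on a single toy distribution. This is a sketch in plain Python with hypothetical helper names and made-up logits; it stands in for `torch.softmax`, `torch.multinomial`, and `torch.topk` rather than reproducing them.

```python
import math
import random

def next_token_probs(logits, temperature=1.0):
    # Dividing by the temperature before softmax sharpens the
    # distribution (temperature < 1) or flattens it (temperature > 1).
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(probs, do_sample, rng):
    if do_sample:
        # Draw one id in proportion to its probability.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy: always take the single most probable id.
    return max(range(len(probs)), key=probs.__getitem__)

logits = [2.0, 1.0, 0.1]
rng = random.Random(42)  # hypothetical seed, for repeatable sampling
print(next_token_probs(logits, temperature=1.0))
print(next_token_probs(logits, temperature=0.5))
print(pick_token(next_token_probs(logits), do_sample=False, rng=rng))
```

Greedy selection always returns the same token for the same input, which is why sampling is usually preferred for generating varied text.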

In [14]:
# Test text generation.
# Don't expect too much from this model, as the only input to each prediction is the previous token.
text_gen = TextGenerator(model, tokenizer, 'cuda', do_sample=True, seed=42)
text = text_gen.prompt("One day, a little girl", max_new_tokens=50)
print(repr(text))

"<|BOS|> One day, a little girl named Timmy was very sad. She put her for a ship, hiam, pointing at home little bird was playing in the park and he didn't want tooth, Lily gave Rosaf It was a cold and had many"


## Create a Huggingface causal model
Many of the Huggingface APIs (text generation, for example) require that the model conform to the Huggingface model API.

Here, we will take the model from the last exercise and "Huggify" it, while adding a new "Feedforward" layer to improve performance.

The Feedforward layer (aka Multilayer Perceptron (MLP)) acts like a key-value store for the model, increasing its capacity and capabilities.
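
The key-value view can be made concrete with a hand-built example. This is a minimal sketch under stated assumptions: the keys, biases, and values are invented so the lookup is easy to follow, and the function is a plain-Python stand-in for two linear layers with a ReLU between them, not the notebook's implementation.

```python
def feedforward(x, keys, key_biases, values, value_bias):
    # First layer: score the input against every key (dot product + bias),
    # then ReLU keeps only the keys that matched (score > 0).
    scores = [
        max(0.0, sum(k_i * x_i for k_i, x_i in zip(key, x)) + b)
        for key, b in zip(keys, key_biases)
    ]
    # Second layer: add each value vector, weighted by its key's score.
    out = list(value_bias)
    for score, value in zip(scores, values):
        for d in range(len(out)):
            out[d] += score * value[d]
    return scores, out

keys = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical keys (rows of the first layer)
key_biases = [-0.5, -0.5]
values = [[10.0, 0.0], [0.0, 20.0]]  # hypothetical values (columns of the second layer)
scores, out = feedforward([1.0, 0.0], keys, key_biases, values, [0.0, 0.0])
print(scores, out)
```

Only the first key matches the input, so only its value contributes to the output. A trained layer learns thousands of such overlapping key-value pairs instead of two clean ones.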

### Model implementation

In [15]:
from typing import Optional, Tuple, Union
import math
from torch import nn, Tensor
import torch.nn.init as init
from torch.nn import functional as F
from transformers.modeling_outputs import CausalLMOutput
from transformers import (
    PreTrainedModel,
    PretrainedConfig,
    AutoConfig,
    AutoModelForCausalLM,
)

# We will abstract out the causal loss function for reuse.
def causal_loss(logits, labels):
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    
    loss = torch.nn.functional.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        # labels with this value are ignored when computing loss
        ignore_index=-100,
        reduction='mean',
    )
    
    return loss.nan_to_num()

# An MLP consists of two or more linear layers, each separated by
# a non-linear activation function. Here, we use ReLU, which changes values
# less than zero to zero and passes values greater than zero unchanged.
#
# This is a 2-layer MLP, common to Transformer models. The rows in the first layer activate
# the corresponding columns in the second layer. Each row is matched against the input by
# dot-product similarity -- this produces an unnormalized score, which is large and positive for
# a close match and negative for an opposing input. This score is then offset by the corresponding bias
# parameter, with the ReLU layer blocking all values less than or equal to zero.
#
# In the case where the resulting value is non-zero, the corresponding column is added to the output, in proportion
# to the magnitude of the signal.
class FeedforwardNet(nn.Module):
    def __init__(self, d_model, d_feedforward):
        super().__init__()
        self.d_model = d_model
        self.d_feedforward = d_feedforward

        self.linear1 = nn.Linear(self.d_model, self.d_feedforward)
        self.activation = nn.ReLU()
        self.linear2 = nn.Linear(self.d_feedforward, self.d_model)

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        return x

# Huggingface model type string
# This is a unique identifier for a model type, which allows the API to find the
# implementation for the type.
model_type = "simple-causal2"

# Huggingface config class.
#
# Huggingface 'PreTrainedModel' objects are passed a derivative of this class
# when constructed. This is required, if your model will derive from PreTrainedModel.
class CausalLM2Config(PretrainedConfig):
    model_type = model_type
    
    def __init__(
        # All of these MUST have defaults, even if unused.
        self,
        vocab_size=8000,
        hidden_size=256,
        max_sequence_length=2048,
        dim_feedforward = 512,
        
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.max_sequence_length = max_sequence_length
        self.dim_feedforward = dim_feedforward
        
        super().__init__(**kwargs)

# The forward method of this model is designed to be compatible with the HuggingFace Trainer and Tokenizer classes.
# This is essentially a wrapper for a Pytorch transformer model, which implements the HF API.
class CausalLM2(PreTrainedModel):
    config_class = CausalLM2Config
    model_type = 'Transformer'
    
    def __init__(self, config):
        super().__init__(config)
        self.vocab_size = config.vocab_size
        self.d_model = config.hidden_size
        
        self.embedding = nn.Embedding(self.vocab_size, self.d_model)
        self.feedforward = FeedforwardNet(self.d_model, config.dim_feedforward)
        self.output_projection = nn.Linear(self.d_model, self.vocab_size)
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs,
    ) -> (Tensor, dict[str, Tensor]):

        # Convert input_ids to embeddings.
        x = self.embedding(input_ids)

        # Pass the input through the feedforward network.
        x = self.feedforward(x)
        
        # Convert embeddings to logits (unnormalized scores) for the next token-id
        logits = self.output_projection(x)
        
        # Compute loss.
        if labels is not None:
            loss = causal_loss(logits, labels)
        else:
            loss = None
        
        if return_dict:
            return CausalLMOutput(loss=loss, logits=logits)
        elif loss is not None:
            return (loss, logits)
        else:
            return logits

    # This is needed for the Huggingface text generation APIs.
    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        attention_mask = kwargs.get("attention_mask", None)
        model_inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
        return model_inputs

AutoConfig.register(model_type, CausalLM2Config)
AutoModelForCausalLM.register(CausalLM2Config, CausalLM2)

### Instantiate model

In [16]:
# Create model configuration
config = CausalLM2Config(
    vocab_size = tokenizer.vocab_size,
    hidden_size = 128,
    max_sequence_length = 2048,
    dim_feedforward = 2048,
)

# A config can also be instantiated from a json file like this:
#config = AutoConfig.from_pretrained("path-to-config")

# Instantiate the model
model = AutoModelForCausalLM.from_config(config)
print_model_size(model)
print(model)

Model size: 1.0M parameters
CausalLM2(
  (embedding): Embedding(2000, 128)
  (feedforward): FeedforwardNet(
    (linear1): Linear(in_features=128, out_features=2048, bias=True)
    (activation): ReLU()
    (linear2): Linear(in_features=2048, out_features=128, bias=True)
  )
  (output_projection): Linear(in_features=128, out_features=2000, bias=True)
)
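
The reported size can be checked by hand from the layer shapes printed above. This is a short worked calculation; `linear_params` is a hypothetical helper, and the dimensions are read directly from the printed model.

```python
def linear_params(in_features, out_features, bias=True):
    # One weight per input/output pair, plus one bias per output feature.
    return in_features * out_features + (out_features if bias else 0)

vocab_size, hidden_size, dim_feedforward = 2000, 128, 2048

counts = {
    "embedding": vocab_size * hidden_size,
    "linear1": linear_params(hidden_size, dim_feedforward),
    "linear2": linear_params(dim_feedforward, hidden_size),
    "output_projection": linear_params(hidden_size, vocab_size),
}
total = sum(counts.values())
print(counts, total)
```

The total of 1,040,464 parameters agrees with the "1.0M" shown above, with the four components contributing roughly a quarter each.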


### Train model

In [17]:
do_train()

Training for 3313 steps


  0%|          | 0/3313 [00:00<?, ?it/s]

  0%|          | 0/18 [00:00<?, ?it/s]

loss=3.6502459711498685
Global step: 1000


  0%|          | 0/18 [00:00<?, ?it/s]

loss=3.6118180884255304
Global step: 2000


  0%|          | 0/18 [00:00<?, ?it/s]

loss=3.604002899593777
Global step: 3000


  0%|          | 0/18 [00:00<?, ?it/s]

loss=3.6064381731881037


## Generate Text
This is a simple wrapper around the Huggingface text-generation API (model.generate).

In [13]:
# https://huggingface.co/docs/transformers/v4.34.1/en/generation_strategies

class TextGen():
    def __init__(self, model, tokenizer, device):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device

    def generate(self, prompt, do_sample=True, top_k=50, top_p=0.9, max_new_tokens=500):
        self.model.to(self.device)
        input_ids = self.tokenizer(prompt, return_tensors='pt')['input_ids'].to(self.device)
        # Use the instance's model and tokenizer rather than the notebook-level globals.
        outputs = self.model.generate(input_ids, do_sample=do_sample, top_k=top_k, top_p=top_p, max_new_tokens=max_new_tokens)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
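
The `top_k` and `top_p` arguments passed above restrict which tokens may be sampled. The following is a simplified sketch of the idea on a made-up probability list; the function name is hypothetical, and the library itself operates on logits, so its exact ordering and renormalization may differ from this version.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # top-k: keep at most k candidates.
    ranked = ranked[:top_k]
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the survivors so they sum to 1 again.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
print(filter_top_k_top_p(probs, top_k=4, top_p=0.75))
```

With these numbers, only the two most probable tokens survive, because together they already cover more than 75% of the probability mass. Sampling then draws from that reduced set, which trims the unlikely tail that tends to produce incoherent text.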

In [19]:
gen = TextGen(model, tokenizer, device='cuda')
print(gen.generate(sample_text))

 One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them balouses of the tree to get some cookies and the villie loved playing a bigging the ped to his friend and had a time, "I'm not your body to look for Lily asked her. You have to their mom and his head. She said "Arella.
Lily learned to play with excite the mad and said. She looked at the dog was a few of the big box with the stretches of the radder and went home.
When they saw her back!" Lily felt guilence. He took her to play with the day, but her head and a very good," said, Ben. It's a big red

## Save and Load model
Use the following cells if you want to save the trained model to disk and restore it later.

### Save

In [144]:
model.save_pretrained(
    save_directory=model_path,
    safe_serialization=True,
)

### Load

In [41]:
model, load_info = AutoModelForCausalLM.from_pretrained(
    model_path,
    output_loading_info=True,
    local_files_only=True,
)
print(load_info)
print(model)

{'missing_keys': [], 'unexpected_keys': [], 'mismatched_keys': [], 'error_msgs': []}
CausalLM2(
  (embedding): Embedding(2000, 128)
  (feedforward): FeedforwardNet(
    (linear1): Linear(in_features=128, out_features=512, bias=True)
    (activation): ReLU()
    (linear2): Linear(in_features=512, out_features=128, bias=True)
  )
  (output_projection): Linear(in_features=128, out_features=2000, bias=True)
)


## Vanilla transformer model

In [12]:
from typing import Optional, Tuple, Union
import math
import torch
from torch import nn, Tensor
import torch.nn.init as init
from torch.nn import functional as F
from transformers.modeling_outputs import CausalLMOutput
from transformers import (
    PreTrainedModel,
    PretrainedConfig,
    AutoConfig,
    AutoModelForCausalLM,
)

# We will abstract out the causal loss function for reuse.
def causal_loss(logits, labels):
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    
    loss = torch.nn.functional.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        # labels with this value are ignored when computing loss
        ignore_index=-100,
        reduction='mean',
    )
    
    return loss.nan_to_num()

class FeedforwardLayer(nn.Module):
    def __init__(self, d_model, d_feedforward):
        super().__init__()
        self.d_model = d_model
        self.d_feedforward = d_feedforward

        self.linear1 = nn.Linear(self.d_model, self.d_feedforward)
        self.activation = nn.ReLU()
        self.linear2 = nn.Linear(self.d_feedforward, self.d_model)

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        return x

class MultiheadAttention(nn.Module):
    def __init__(
        self,
        d_model,
        num_heads,
    ):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        
        assert d_model % num_heads == 0, "d_model must be evenly divisible by num_heads"

        # The dimension of each head.
        self.d_head = d_model // num_heads

        # We scale the attention scores by the inverse-square-root of the head dimension;
        # this adjusts the temperature of the softmax.
        self.dot_product_scale = 1.0 / math.sqrt(self.d_head)

        # Input projection matrices: Q, K, V
        self.query_linear = nn.Linear(self.d_model, self.d_model)
        self.key_linear = nn.Linear(self.d_model, self.d_model)
        self.value_linear = nn.Linear(self.d_model, self.d_model)

        # Output projection matrix:
        # The input and output matrices only make sense with multi-head
        # Don't bother with the output matrix, with a single head.
        if self.num_heads != 1:
            self.output_linear = nn.Linear(self.d_model, self.d_model)

    def forward(self, qkv):
        # qkv: (batch_size, seq_len, d_qkv)
        batch_size, seq_len, d_qkv = qkv.shape
        
        # Feed the inputs through the K, Q, V matrices.
        query, key, value = self.query_linear(qkv), self.key_linear(qkv), self.value_linear(qkv)

        # Split projections into multiple heads and swap position of sequence / heads dimension
        query = query.view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        key = key.view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        value = value.view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        
        # Compute attention scores
        scores = torch.matmul(query, key.transpose(-2, -1)) * self.dot_product_scale

        # Mask future positions from the past
        causal_mask = torch.triu(torch.full((seq_len, seq_len), True, device=qkv.device), diagonal=1)
        scores.masked_fill_(causal_mask, float('-inf'))
        
        # Calculate the attention weights; avoid NANs that might emerge from zeros in softmax's denominator
        attention_weights = torch.softmax(scores, dim=-1).clamp(min=1e-10)
        
        # Use the attention weights to get a weighted combination of value vectors
        attended_values = torch.matmul(attention_weights, value)
        
        # Concatenate attention heads and project to original embedding size using the output linear layer
        attended_values = attended_values.transpose(1, 2).contiguous().view(batch_size, seq_len, d_qkv)

        # Project the concatenated output through the output matrix.
        if self.num_heads != 1:
            output = self.output_linear(attended_values)
        else:
            output = attended_values
        
        return output

# Standard transformer layer, from original paper.
class TransformerLayer(nn.Module):
    def __init__(
        self,
        d_model,
        attention,
        feedforward,
    ):
        super().__init__()
        self.d_model = d_model
        self.attention = attention
        self.feedforward = feedforward
        self.norm1 = nn.LayerNorm(self.d_model)
        self.norm2 = nn.LayerNorm(self.d_model)

    def forward(self, x):
        # Keep input as residual
        residual = x

        # Compute attention
        x = self.attention(x)

        # Add attention with residual and normalize.
        x = self.norm1(residual + x)

        # Keep output as next residual.
        residual = x

        # Pass through feedforward network.
        x = self.feedforward(x)

        # Combine residual and ff output, then normalize again.
        x = self.norm2(residual + x)
        
        return x

# A vanilla positional encoder
class PositionalEncoder(nn.Module):
    def __init__(self, d_embed, max_seq):
        super().__init__()
        self.d_embed = d_embed
        self.max_seq = max_seq
        
        weight = torch.zeros(max_seq, d_embed)
        position = torch.arange(0, max_seq, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_embed, 2).float() * (-math.log(10000.0) / d_embed))
        weight[:, 0::2] = torch.sin(position * div_term)
        weight[:, 1::2] = torch.cos(position * div_term)
        weight = weight.unsqueeze(0)
        self.register_buffer('weight', weight)

    def forward(self, x):
        seq_len = x.size(-2)
        return x + self.weight[:, :seq_len]

# Huggingface model type string
model_type = "simple-causal-transformer"

# Huggingface config class.
# Huggingface 'PreTrainedModel' objects are passed a derivative of this class
# when constructed. This is required, if your model will derive from PreTrainedModel.
class CausalTransformerConfig(PretrainedConfig):
    model_type = model_type
    
    def __init__(
        # All of these MUST have defaults, even if unused.
        self,
        vocab_size=2000,
        hidden_size=256,
        max_sequence_length=2048,
        dim_feedforward=512,
        num_attention_heads=1,
        num_hidden_layers = 4,
        
        **kwargs,
    ):
        # These are the canonical names used by Huggingface
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.max_sequence_length = max_sequence_length
        self.dim_feedforward = dim_feedforward
        self.num_attention_heads = num_attention_heads
        self.num_hidden_layers = num_hidden_layers
        
        super().__init__(**kwargs)

# The forward method of this model is designed to be compatible with the HuggingFace Trainer and Tokenizer classes.
# This is essentially a wrapper for a PyTorch transformer model, which implements the HF API.
class CausalTransformer(PreTrainedModel):
    config_class = CausalTransformerConfig
    model_type = 'Transformer'
    
    def __init__(self, config):
        super().__init__(config)
        self.vocab_size = config.vocab_size
        self.d_model = config.hidden_size
        
        self.embedding = nn.Embedding(self.vocab_size, self.d_model)
        self.positional_encoder = PositionalEncoder(d_embed=config.hidden_size, max_seq=config.max_sequence_length)
        self.layers = nn.ModuleList([
            TransformerLayer(
                d_model=config.hidden_size,
                attention=MultiheadAttention(
                    d_model=config.hidden_size,
                    num_heads=config.num_attention_heads,
                ),
                feedforward=FeedforwardLayer(
                    d_model=config.hidden_size,
                    d_feedforward=config.dim_feedforward,
                ),
            ) for _ in range(config.num_hidden_layers)
        ])
        self.output_projection = nn.Linear(config.hidden_size, config.vocab_size)
        self.reset_parameters()
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs,
    ) -> (Tensor, dict[str, Tensor]):

        # Convert input_ids to embeddings and add positional information.
        x = self.positional_encoder(self.embedding(input_ids) * self.d_model**0.5)

        # Pass the input through each of the layers.
        for layer in self.layers:
            x = layer(x)
        
        # Convert embeddings to logits (unnormalized log-probabilities) over the next token-id
        logits = self.output_projection(x)
        
        # Compute loss.
        if labels is not None:
            loss = causal_loss(logits, labels)
        else:
            loss = None
        
        if return_dict:
            return CausalLMOutput(loss=loss, logits=logits)
        elif loss is not None:
            return (loss, logits)
        else:
            return logits

    def reset_parameters(self):
        # Init the embedding weights as per original design.
        init.normal_(self.embedding.weight, std=self.d_model**-0.5)

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        attention_mask = kwargs.get("attention_mask", None)
        model_inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
        return model_inputs

AutoConfig.register(model_type, CausalTransformerConfig)
AutoModelForCausalLM.register(CausalTransformerConfig, CausalTransformer)
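
The "shift" inside `causal_loss` is easy to misread, so here is a minimal plain-Python sketch of the same idea. It assumes `logits` is a list of per-position score lists and `labels` a list of token ids for a single sequence; the name `causal_loss_reference` and the toy inputs are illustrative and are not part of the model code above.

```python
import math

IGNORE_INDEX = -100  # same sentinel value that causal_loss passes as ignore_index

def causal_loss_reference(logits, labels):
    """Mean next-token cross-entropy for one sequence, in nats."""
    # The scores at position t are compared with the token at position t + 1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]

    losses = []
    for scores, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue  # ignored positions (e.g. padding) contribute nothing
        # Negative log-softmax of the target class, with the max subtracted for stability.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[target])

    # Mirror nan_to_num(): report 0.0 when every label is ignored.
    return sum(losses) / len(losses) if losses else 0.0

# Three positions, vocabulary of three tokens, uniform scores everywhere.
logits = [[0.0, 0.0, 0.0]] * 3
labels = [0, 2, IGNORE_INDEX]
loss = causal_loss_reference(logits, labels)
print(loss)
```

With uniform scores over three tokens, the single counted position yields a loss of ln(3). The first label is never used as a target, because nothing precedes it to predict it from.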

### Instantiate model

In [13]:
def print_model_size(model):
    model_size = sum(t.numel() for t in model.parameters())
    print(f"Model size: {model_size/1000**2:.1f}M parameters")

# Create model configuration
config = CausalTransformerConfig(
    vocab_size = tokenizer.vocab_size,
    hidden_size = 256,
    dim_feedforward = 1024,
    max_sequence_length = tokenizer.model_max_length,
    num_attention_heads=4,
    num_hidden_layers= 2
)

# A config can also be instantiated from a json file like this:
#config = AutoConfig.from_pretrained("path-to-config")

# Instantiate the model
model = AutoModelForCausalLM.from_config(config)
print_model_size(model)
print(model)

Model size: 2.6M parameters
CausalTransformer(
  (embedding): Embedding(2000, 256)
  (positional_encoder): PositionalEncoder()
  (layers): ModuleList(
    (0-1): 2 x TransformerLayer(
      (attention): MultiheadAttention(
        (query_linear): Linear(in_features=256, out_features=256, bias=True)
        (key_linear): Linear(in_features=256, out_features=256, bias=True)
        (value_linear): Linear(in_features=256, out_features=256, bias=True)
        (output_linear): Linear(in_features=256, out_features=256, bias=True)
      )
      (feedforward): FeedforwardLayer(
        (linear1): Linear(in_features=256, out_features=1024, bias=True)
        (activation): ReLU()
        (linear2): Linear(in_features=1024, out_features=256, bias=True)
      )
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    )
  )
  (output_projection): Linear(in_features=256, out_features=2000, bias=True)
)
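
The reported model size can be checked by hand from the module tree printed above. The sketch below is just arithmetic in plain Python; it assumes the vocabulary size of 2000 shown in the Embedding line and that PositionalEncoder contributes no trainable parameters, since it only registers a buffer. The helper name `linear_params` is made up for this example.

```python
def linear_params(n_in, n_out):
    # Weight matrix plus bias vector of an nn.Linear layer.
    return n_in * n_out + n_out

vocab, d_model, d_ff, n_layers = 2000, 256, 1024, 2

embedding = vocab * d_model
attention = 4 * linear_params(d_model, d_model)  # query, key, value, output
feedforward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
layer_norms = 2 * (2 * d_model)                  # weight and bias for norm1 and norm2
output_projection = linear_params(d_model, vocab)

total = embedding + n_layers * (attention + feedforward + layer_norms) + output_projection
print(f"{total:,} parameters ({total / 1000**2:.1f}M)")
```

Note that the embedding and output projection together account for roughly 40% of the parameters at this scale.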


### Train model

In [32]:
do_train()

Training for 3313 steps


  0%|          | 0/3313 [00:00<?, ?it/s]

  0%|          | 0/18 [00:00<?, ?it/s]

loss=2.320766806602478
Global step: 1000


  0%|          | 0/18 [00:00<?, ?it/s]

loss=2.0744397242863974
Global step: 2000


  0%|          | 0/18 [00:00<?, ?it/s]

loss=1.9775431354840596
Global step: 3000


  0%|          | 0/18 [00:00<?, ?it/s]

loss=1.9802577826711867
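
To put these loss values in perspective, they can be converted to perplexity. This is a small sketch that assumes the printed loss is the mean cross-entropy per token in nats, which is what `causal_loss` computes; the variable names are illustrative.

```python
import math

# Final evaluation losses copied from the two training runs shown in this notebook.
earlier_loss = 3.6064381731881037      # last loss of the earlier run
transformer_loss = 1.9802577826711867  # last loss of the vanilla transformer run

def perplexity(loss):
    # Perplexity is the exponential of the mean per-token cross-entropy.
    return math.exp(loss)

print(f"earlier run: {perplexity(earlier_loss):.1f}")
print(f"transformer: {perplexity(transformer_loss):.1f}")
```

Under that assumption the perplexity drops from roughly 37 to roughly 7. Informally, the transformer is about as uncertain as if it were choosing uniformly among about seven tokens at each step.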


In [33]:
gen = TextGen(model, tokenizer, device='cuda')
print(gen.generate(sample_text))

 One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."

Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them bleed. Lily was happy because it was very expensive. She said, "Thank you, mom. I love you." And they went to her room and saw the needle and seweds on their shirt.

Lily said, "Thank you, Mommy." Her mom said, "I feel good." She said, "That is nice and pretty. You are a kind girl who can use your help with the needle. But you have to be careful with how to sew and sew sews. And it is also very pretty with what she wanted."

Lily nodded and said, "Thank you, mom. I'm happy I still have an advent

In [16]:
# Compare the output to that of the simple generator.
text_gen = TextGenerator(model, tokenizer, 'cuda', do_sample=True, seed=42)
text = text_gen.prompt("One day, a little girl", max_new_tokens=500)
print(text)

<|BOS|> One day, a little girl named Sue was feeling dreamless. She wanted to take her home, but she wasn't send to take her home. She said to her family with.

The family was grown-up and windy creative was always scared. He thought this sound had just books that the garden was delicate! Lucy brought it to her bedroom and lifted it off. Sue felt so brave, she started to flow on the pink cover. She started to shine as if she waits for her. 

Sue felt an exiving back and hugged the gianting. It went to her little girl and said, "ning are a special surprise adventure!" She smiled and hopped out of the play. A new isn't charming. This is mine, in a game!

The girl felt so happy and knowing that the gift would visit. The father rare time and never had to be fleking on the car again. The prevent quite last time.k for the day, it reminded her the bow. She wanted to take the tray and if it was a safe of fly. 

Sure, the second sharpy came and the tray, to his family left the cover. Snow thank

## Improved transformer model

In [22]:
from typing import Optional, Tuple, Union
import math

import torch
from torch import nn, Tensor
import torch.nn.init as init
from torch.nn import functional as F

from transformers.modeling_outputs import CausalLMOutput
from transformers import (
    PreTrainedModel,
    PretrainedConfig,
    AutoConfig,
    AutoModelForCausalLM,
)

from flash_attn import flash_attn_qkvpacked_func, flash_attn_func

def causal_loss(logits, labels):
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    
    loss = torch.nn.functional.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        # labels with this value are ignored when computing loss
        ignore_index=-100,
        reduction='mean',
    )
    
    return loss.nan_to_num().unsqueeze(0)

def causal_alpha(n_layers):
    return (2.0 * n_layers) ** 0.25

def causal_beta(n_layers):
    return (8.0 * n_layers) ** -0.25

class FeedforwardLayer(nn.Module):
    def __init__(self, d_model, d_feedforward, n_layers, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.d_feedforward = d_feedforward
        self.beta = causal_beta(n_layers)

        self.linear1 = nn.Linear(self.d_model, self.d_feedforward)
        self.activation = nn.SiLU()
        self.linear2 = nn.Linear(self.d_feedforward, self.d_model)
        self.dropout = nn.Dropout(dropout)
        self.reset_parameters()

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.linear2(x)
        return x

    def reset_parameters(self):
        # Deepnet initialization
        # https://arxiv.org/pdf/2203.00555.pdf
        init.xavier_uniform_(self.linear1.weight, gain=self.beta)
        init.constant_(self.linear1.bias, 0.)
        init.xavier_uniform_(self.linear2.weight, gain=self.beta)
        init.constant_(self.linear2.bias, 0.)

def alibi_biases(query_len, key_len, device='cpu'):
    x = torch.arange(key_len, device=device)[None, :]
    y = torch.arange(query_len, device=device)[:, None]
    return x - y

class MultiheadAttention(nn.Module):
    def __init__(
        self,
        d_model,
        num_heads,
        n_layers,
        dropout=0.1,
        # Set to True to enable Flash-Attention-2
        flash_attention=False,
    ):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.beta = causal_beta(n_layers)
        self.flash_attention = flash_attention
        
        assert d_model % num_heads == 0, "d_model must be evenly divisible by num_heads"

        # The dimension of each head.
        self.d_head = d_model // num_heads

        # We scale the attention scores by the inverse-square-root of the head dimension;
        # this adjusts the temperature of the softmax.
        self.dot_product_scale = 1.0 / math.sqrt(self.d_head)

        self.in_proj = nn.Parameter(torch.zeros(3 * self.d_model, self.d_model))
        self.in_proj_bias = nn.Parameter(torch.zeros(3 * self.d_model))
        self.output_linear = nn.Linear(self.d_model, self.d_model)

        self.dropout = nn.Dropout(dropout)

        # Use ALiBi relative positional encoding
        # https://arxiv.org/pdf/2108.12409.pdf
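        # Slopes run geometrically from 2^-1 to 2^-8 across the heads, which matches
        # the original ALiBi slopes when num_heads == 8.
        alibi_slopes = 1.0 / torch.logspace(1, 8, self.num_heads, base=2, dtype=torch.float)
        # Stored as a Parameter (visible to the optimizer); swap in the
        # register_buffer line to keep the slopes fixed.
        self.alibi_slopes = nn.Parameter(alibi_slopes)
        #self.register_buffer('alibi_slopes', alibi_slopes)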
        self.reset_parameters()

    def project_input(self, qkv):
        proj = F.linear(qkv, self.in_proj, self.in_proj_bias)
        return proj.chunk(chunks=3, dim=-1)
    
    def forward(self, qkv):
        if self.flash_attention:
            return self.flash_forward(qkv)
        # qkv: (batch_size, seq_len, d_qkv)
        batch_size, seq_len, d_qkv = qkv.shape
        
        # Feed the inputs through the K, Q, V matrices.
        query, key, value = self.project_input(qkv)

        # Split projections into multiple heads and swap position of sequence / heads dimension
        query = query.view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        key = key.view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        value = value.view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        
        # Compute attention scores
        scores = torch.matmul(query, key.transpose(-2, -1)) * self.dot_product_scale

        # Apply Alibi relative positional weights.
        scores += alibi_biases(scores.shape[-2], scores.shape[-1], device=scores.device) * self.alibi_slopes.view(-1, 1, 1)
        
        # Mask future positions from the past
        causal_mask = torch.triu(torch.full((seq_len, seq_len), True, device=qkv.device), diagonal=1)
        scores.masked_fill_(causal_mask, float('-inf'))
        
        # Calculate the attention weights; avoid NANs that might emerge from zeros in softmax's denominator
        attention_weights = self.dropout(torch.softmax(scores, dim=-1).clamp(min=1e-10))
        
        # Use the attention weights to get a weighted combination of value vectors
        attended_values = torch.matmul(attention_weights, value)
        
        # Concatenate attention heads and project to original embedding size using the output linear layer
        attended_values = attended_values.transpose(1, 2).contiguous().view(batch_size, seq_len, d_qkv)

        # Project the concatenated output through the output matrix.
        output = self.output_linear(attended_values)
        
        return output
        
    def flash_forward(self, qkv):
        batch_size, seq_len, d_embed = qkv.shape
        
        # Feed the inputs through the K, Q, V matrices.
        # input qkv : (batch_size, seq_len, d_model)
        # packed qkv: (batch_size, seq_len, 3, num_heads, d_head)
        qkv = F.linear(
            qkv,
            self.in_proj,
            self.in_proj_bias
        ).unflatten(
            -1,
            (3, self.num_heads, self.d_head)
        )

        attended_values = flash_attn_qkvpacked_func(
            qkv,
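            # flash-attn does not know the module's mode, so disable dropout outside of training.
            dropout_p=self.dropout.p if self.training else 0.0,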
            softmax_scale=self.dot_product_scale,
            causal=True,
            alibi_slopes=self.alibi_slopes.float(),
        )
        del qkv
        # attended_values: (batch_size, seqlen, nheads, headdim)

        # Concatenate heads back into d_embed
        attended_values = attended_values.view(batch_size, seq_len, d_embed)

        # Munge the concatenated values
        output = self.output_linear(attended_values)
        
        return output

    def reset_parameters(self):
        # Deepnet initialization
        # https://arxiv.org/pdf/2203.00555.pdf
        
        q, k, v = self.in_proj.chunk(3)
        init.xavier_uniform_(q, gain=1.0)
        init.xavier_uniform_(k, gain=1.0)
        init.xavier_uniform_(v, gain=self.beta)
        init.constant_(self.in_proj_bias, 0.)
        init.xavier_uniform_(self.output_linear.weight, gain=self.beta)
        init.constant_(self.output_linear.bias, 0.)

class ScaleAttention(nn.Module):
    def __init__(
        self,
        d_model,
        num_heads,
        n_layers,
        dropout=0.1,
        # Set to True to enable Flash-Attention-2
        flash_attention=False,
    ):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.beta = causal_beta(n_layers)
        self.flash_attention = flash_attention
        
        assert d_model % num_heads == 0, "d_model must be evenly divisible by num_heads"

        # The dimension of each head.
        self.d_head = d_model // num_heads

        # We scale the attention scores by the inverse-square-root of the head dimension;
        # this adjusts the temperature of the softmax.
        self.dot_product_scale = 1.0 / math.sqrt(self.d_head)

        self.query = nn.Parameter(torch.empty(d_model))
        #self.query_linear = nn.Linear(self.d_model, self.d_model)
        self.key = nn.Parameter(torch.empty(d_model))
        self.value = nn.Parameter(torch.empty(d_model))
        self.output = nn.Parameter(torch.empty(d_model))
        #self.output_linear = nn.Linear(self.d_model, self.d_model)

        self.dropout = nn.Dropout(dropout)

        # Use ALiBi relative positional encoding
        # https://arxiv.org/pdf/2108.12409.pdf
        # Slopes run geometrically from 2^-1 to 2^-8 across the heads, which matches
        # the original ALiBi slopes when num_heads == 8.
        alibi_slopes = 1.0 / torch.logspace(1, 8, self.num_heads, base=2, dtype=torch.float)
        self.alibi_slopes = nn.Parameter(alibi_slopes)
        #self.register_buffer('alibi_slopes', alibi_slopes)
        self.reset_parameters()

    def project_input(self, qkv):
        query = qkv * self.query
        #query = self.query_linear(qkv)
        key = (qkv * self.key).roll(shifts=self.d_head // 2, dims=-1)
        value = qkv * self.value
        
        return query, key, value
    
    def forward(self, qkv):
        if self.flash_attention:
            return self.flash_forward(qkv)
        # qkv: (batch_size, seq_len, d_qkv)
        batch_size, seq_len, d_qkv = qkv.shape
        
        # Feed the inputs through the K, Q, V matrices.
        query, key, value = self.project_input(qkv)

        # Split projections into multiple heads and swap position of sequence / heads dimension
        query = query.view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        key = key.view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        value = value.view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        
        # Compute attention scores
        scores = torch.matmul(query, key.transpose(-2, -1)) * self.dot_product_scale

        # Apply Alibi relative positional weights.
        scores += alibi_biases(scores.shape[-2], scores.shape[-1], device=scores.device) * self.alibi_slopes.view(-1, 1, 1)
        
        # Mask future positions from the past
        causal_mask = torch.triu(torch.full((seq_len, seq_len), True, device=qkv.device), diagonal=1)
        scores.masked_fill_(causal_mask, float('-inf'))
        
        # Calculate the attention weights; avoid NANs that might emerge from zeros in softmax's denominator
        attention_weights = self.dropout(torch.softmax(scores, dim=-1).clamp(min=1e-10))
        
        # Use the attention weights to get a weighted combination of value vectors
        attended_values = torch.matmul(attention_weights, value)
        
        # Concatenate attention heads and project to original embedding size using the output linear layer
        attended_values = attended_values.transpose(1, 2).contiguous().view(batch_size, seq_len, d_qkv)

        # Project the concatenated output through the output matrix.
        #output = self.output_linear(attended_values)
        output = attended_values * self.output
        
        return output
        
    def flash_forward(self, qkv):
        batch_size, seq_len, d_embed = qkv.shape
        
        query, key, value = self.project_input(qkv)
        query = query.view(batch_size, seq_len, self.num_heads, self.d_head)
        key = key.view(batch_size, seq_len, self.num_heads, self.d_head)
        value = value.view(batch_size, seq_len, self.num_heads, self.d_head)

        attended_values = flash_attn_func(
            q=query,
            k=key,
            v=value,
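            # flash-attn does not know the module's mode, so disable dropout outside of training.
            dropout_p=self.dropout.p if self.training else 0.0,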
            softmax_scale=self.dot_product_scale,
            causal=True,
            alibi_slopes=self.alibi_slopes.float(),
        )
        del qkv
        # attended_values: (batch_size, seqlen, nheads, headdim)

        # Concatenate heads back into d_embed
        attended_values = attended_values.view(batch_size, seq_len, d_embed)

        # Scale the concatenated values element-wise, as in forward();
        # this class defines no output_linear layer.
        output = attended_values * self.output
        
        return output

    def reset_parameters(self):
        init.normal_(self.query)
        #init.xavier_uniform_(self.query_linear.weight, gain=self.beta)
        #init.constant_(self.query_linear.bias, 0.)
        
        init.normal_(self.key)
        init.normal_(self.value)
        init.normal_(self.output)
        
        #init.xavier_uniform_(self.output_linear.weight, gain=self.beta)
        #init.constant_(self.output_linear.bias, 0.)

# Deepnet transformer layer
class TransformerLayer(nn.Module):
    def __init__(
        self,
        d_model,
        attention,
        feedforward,
        n_layers,
        dropout=0.1,
    ):
        super().__init__()
        self.d_model = d_model
        self.attention = attention
        self.feedforward = feedforward
        self.norm1 = nn.LayerNorm(self.d_model)
        self.norm2 = nn.LayerNorm(self.d_model)
        self.dropout = nn.Dropout(dropout)
        # Deepnet alpha https://arxiv.org/pdf/2203.00555.pdf
        self.alpha = causal_alpha(n_layers)

    def forward(self, x):
        # Keep input as residual
        residual = x * self.alpha

        # Compute attention
        x = self.attention(x)

        # Add attention with residual and normalize.
        x = self.norm1(residual + self.dropout(x))

        # Keep output as next residual.
        residual = x * self.alpha

        # Pass through feedforward network.
        x = self.feedforward(x)

        # Combine residual and ff output, then normalize again.
        x = self.norm2(residual + self.dropout(x))
        
        return x

# Huggingface model type string
model_type = "dinalt-causal-transformer"

# Huggingface config class.
# Huggingface 'PreTrainedModel' objects are passed a derivative of this class
# when constructed. This is required, if your model will derive from PreTrainedModel.
class CausalTransformerConfig(PretrainedConfig):
    model_type = model_type
    
    def __init__(
        # All of these MUST have defaults, even if unused.
        self,
        vocab_size=2000,
        hidden_size=256,
        max_sequence_length=2048,
        dim_feedforward=512,
        num_attention_heads=1,
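        num_hidden_layers = 4,
        # Give flash_attention a default so config.flash_attention always exists.
        flash_attention=False,
        
        **kwargs,
    ):
        # These are the canonical names used by Huggingface
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.max_sequence_length = max_sequence_length
        self.dim_feedforward = dim_feedforward
        self.num_attention_heads = num_attention_heads
        self.num_hidden_layers = num_hidden_layers
        # Not a Huggingface name; selects the Flash-Attention-2 code path.
        self.flash_attention = flash_attention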
        
        super().__init__(**kwargs)

# The forward method of this model is designed to be compatible with the HuggingFace Trainer and Tokenizer classes.
# This is essentially a wrapper for a PyTorch transformer model, which implements the HF API.
class CausalTransformer(PreTrainedModel):
    config_class = CausalTransformerConfig
    model_type = 'Transformer'
    
    def __init__(self, config):
        super().__init__(config)
        self.vocab_size = config.vocab_size
        self.d_model = config.hidden_size
        
        self.embedding = nn.Embedding(self.vocab_size, self.d_model)
        self.layers = nn.ModuleList([
            TransformerLayer(
                d_model=config.hidden_size,
                attention=MultiheadAttention(
                    d_model=config.hidden_size,
                    num_heads=config.num_attention_heads,
                    n_layers=config.num_hidden_layers,
                    flash_attention=config.flash_attention,
                ),
                feedforward=FeedforwardLayer(
                    d_model=config.hidden_size,
                    d_feedforward=config.dim_feedforward,
                    n_layers=config.num_hidden_layers,
                ),
                n_layers=config.num_hidden_layers,
            ) for _ in range(config.num_hidden_layers)
        ])
        self.output_projection = nn.Linear(config.hidden_size, config.vocab_size)
        self.reset_parameters()
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs,
    ) -> (Tensor, dict[str, Tensor]):

        # Convert input_ids to embeddings. No positional encoding is added here;
        # position information comes from the ALiBi biases inside attention.
        x = self.embedding(input_ids) * self.d_model**0.5

        # Pass the input through each of the layers.
        for layer in self.layers:
            x = layer(x)
        
        # Project the hidden states to unnormalized next-token logits over the vocabulary.
        logits = self.output_projection(x)
        
        # Compute loss.
        if labels is not None:
            loss = causal_loss(logits, labels)
        else:
            loss = None
        
        if return_dict:
            return CausalLMOutput(loss=loss, logits=logits)
        elif loss is not None:
            return (loss, logits)
        else:
            return logits

    def reset_parameters(self):
        # Initialize the embedding weights with std = d_model**-0.5, as per the original design.
        # forward() rescales the embeddings by d_model**0.5.
        init.normal_(self.embedding.weight, std=self.d_model**-0.5)

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        attention_mask = kwargs.get("attention_mask", None)
        model_inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
        return model_inputs

AutoConfig.register(model_type, CausalTransformerConfig)
AutoModelForCausalLM.register(CausalTransformerConfig, CausalTransformer)
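
The embedding initialization and the scaling in `forward()` are two halves of one idea, which a small experiment may make clearer. The following is a minimal sketch, not part of the model above: it assumes the same `vocab_size` and `d_model` values used later in this notebook, and the variable names are illustrative only. Weights drawn with a standard deviation of `d_model**-0.5` and then multiplied by `d_model**0.5` should come out with a standard deviation close to 1.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes; these mirror the small configuration used in this notebook.
vocab_size, d_model = 2000, 128

# Same initialization as reset_parameters(): std = d_model**-0.5
embedding = nn.Embedding(vocab_size, d_model)
nn.init.normal_(embedding.weight, std=d_model**-0.5)

# A random batch of token ids: (batch_size, sequence_length)
input_ids = torch.randint(0, vocab_size, (4, 16))

# Same scaling as forward(): multiply by sqrt(d_model)
raw = embedding(input_ids)
scaled = raw * d_model**0.5

print(f"std before scaling: {raw.std().item():.3f}")
print(f"std after scaling:  {scaled.std().item():.3f}")
```

Under these assumptions, the activations entering the first layer have roughly unit scale regardless of the chosen `d_model`.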

In [25]:
# Make a somewhat bigger model

def print_model_size(model):
    model_size = sum(t.numel() for t in model.parameters())
    print(f"Model size: {model_size/1000**2:.1f}M parameters")

# Create model configuration
config = CausalTransformerConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=128,
    dim_feedforward=512,
    max_sequence_length=tokenizer.model_max_length,
    num_attention_heads=2,
    num_hidden_layers=2,
    flash_attention=True,
)

# A config can also be instantiated from a json file like this:
#config = AutoConfig.from_pretrained("path-to-config")

# Instantiate the model
model = AutoModelForCausalLM.from_config(config)
print_model_size(model)
print(model)

Model size: 0.9M parameters
CausalTransformer(
  (embedding): Embedding(2000, 128)
  (layers): ModuleList(
    (0-1): 2 x TransformerLayer(
      (attention): MultiheadAttention(
        (output_linear): Linear(in_features=128, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (feedforward): FeedforwardLayer(
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (activation): SiLU()
        (linear2): Linear(in_features=512, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (output_projection): Linear(in_features=128, out_features=2000, bias=True)
)
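
As a sanity check on the reported size, the parameter count can be reproduced by hand from the shapes printed above. This is a rough sketch: the submodules shown in the printout are counted exactly, but the attention input projection does not appear as a submodule, so its size here (a fused, bias-free query/key/value matrix of `3 * d_model * d_model` weights) is an assumption rather than something read from the model.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

# Shapes taken from the printed model summary.
vocab_size, d_model, d_feedforward, n_layers = 2000, 128, 512, 2

embedding = vocab_size * d_model                          # Embedding(2000, 128)
output_projection = linear_params(d_model, vocab_size)   # Linear(128 -> 2000)

attn_out = linear_params(d_model, d_model)                # output_linear
feedforward = (linear_params(d_model, d_feedforward)      # linear1
               + linear_params(d_feedforward, d_model))   # linear2
norms = 2 * (2 * d_model)                                 # norm1, norm2: weight + bias each

# Parameters that are visible as submodules in the printout.
visible = embedding + output_projection + n_layers * (attn_out + feedforward + norms)

# ASSUMPTION: query/key/value projections stored as one bias-free 3*d_model x d_model matrix.
attn_in_assumed = 3 * d_model * d_model
total_assumed = visible + n_layers * attn_in_assumed

print(f"Visible submodules: {visible:,} ({visible/1000**2:.1f}M)")
print(f"With assumed QKV:   {total_assumed:,} ({total_assumed/1000**2:.1f}M)")
```

The visible submodules alone fall short of the reported 0.9M, which suggests the attention layer holds additional parameters directly; the assumed projection size is one plausible way to account for the difference. Note that the embedding and the output projection together make up more than half of the total at this model size.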


In [26]:
# Run this to tokenize the complete dataset
#tok_train_dataset = tokenize_dataset(dataset["train"], tokenizer, select=0.1)

In [27]:
model = model.bfloat16()
do_train()

Training for 3313 steps


loss=2.8541666666666665
Global step: 1000

loss=2.6067708333333335
Global step: 2000

loss=2.5052083333333335
Global step: 3000

loss=2.4887152777777777
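
Raw loss values can be hard to interpret, so it may help to convert them to perplexity. The sketch below assumes the printed values are mean per-token cross-entropy losses measured in nats (the usual convention for a PyTorch cross-entropy loss, but not confirmed by the output itself). The `eval_losses` list is simply the four values copied from the output above, and the label "final" for the last one is an assumption, since no step number was printed for it.

```python
import math

# Loss values copied from the training output above.
# ASSUMPTION: each is a mean per-token cross-entropy in nats.
eval_losses = [
    ("step 1000", 2.8541666666666665),
    ("step 2000", 2.6067708333333335),
    ("step 3000", 2.5052083333333335),
    ("final", 2.4887152777777777),
]

# Perplexity is the exponential of the mean cross-entropy.
perplexities = [(label, math.exp(loss)) for label, loss in eval_losses]

for label, ppl in perplexities:
    print(f"{label:>10}: perplexity = {ppl:.2f}")
```

A perplexity of N can be read loosely as the model being as uncertain as if it were choosing uniformly among N tokens at each position, which puts these numbers in context against a vocabulary of 2000 tokens.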
