# Chapter 10. Training Transformers from Scratch 


In [None]:
!pip install transformers 
!pip install datasets

## Dataset 

Training a language model from scratch requires a large dataset. 

Large datasets are hard to control. They are usually created with a high degree of automation, which makes their contents harder to inspect. 

Let's look at how the training data affects text generation by comparing GPT and GPT-2. GPT was mostly trained on BookCorpus, while GPT-2 was trained on web pages. 

In [None]:
from transformers import pipeline, set_seed

generation_gpt = pipeline("text-generation", model="openai-gpt")
generation_gpt2 = pipeline("text-generation", model="gpt2")

In [None]:
def model_size(model):
    return sum(t.numel() for t in model.parameters())

print(f"GPT  size: {model_size(generation_gpt.model)/1000**2:.1f}M parameters")
print(f"GPT2 size: {model_size(generation_gpt2.model)/1000**2:.1f}M parameters")

GPT  size: 116.5M parameters
GPT2 size: 124.4M parameters


In [None]:
# Generating completions from the models 
def enum_pipeline_ouputs(pipe, prompt, num_return_sequences):
    out = pipe(prompt, num_return_sequences=num_return_sequences,
               clean_up_tokenization_spaces=True)
    return "\n".join(f"{i+1}." + s["generated_text"] for i, s in enumerate(out))

prompt = "\nWhen they came back"
print("GPT completions:\n" + enum_pipeline_ouputs(generation_gpt, prompt, 3))
print("")
print("GPT-2 completions:\n" + enum_pipeline_ouputs(generation_gpt2, prompt, 3))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT completions:
1.
When they came back.'
'no thanks, i'm not much of a fan of them myself.'
'well, let's move on then, let's get out of here before somebody changes their mind.'
 they strolled through
2.
When they came back down to the cave. " let's get some breakfast first. " 
 they ate in silence, then sat and discussed the mission - the first time they 'd been separated, the one in which she 'd been captured. she tried
3.
When they came back upstairs. i hadn't spoken with my dad in days. my mom was away on a business trip and dad was in the hospital. so i walked down to talk to him. " i paused, remembering. " i was in

GPT-2 completions:
1.
When they came back, they told him the name of President Assad, and when they went after us, they told [them], "We're coming." But, they started running away, and we got no contact with them. And when they
2.
When they came back to the office, Mr. Furlong said that many are unhappy, particularly in the business community. They say that Mr. Furlo

## Building custom dataset 

In this section, we build a custom dataset containing Python code from GitHub. The data can be collected with Google BigQuery.

I skip the data collection steps with Google BigQuery and use the dataset repository directly.

In [None]:
!git clone https://huggingface.co/datasets/transformersbook/codeparrot

### Working with large datasets 

The dataset in this section is about 50 GB compressed and 200 GB uncompressed, too large to load into memory all at once. We will use the Datasets library from Hugging Face to work with it. 

**Memory mapping** is a feature of Datasets that overcomes RAM limitations. Datasets opens a read-only pointer to the dataset on disk and reads from it on demand instead of loading the whole dataset into RAM. 
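
A minimal sketch of the idea, using Python's standard `mmap` module rather than Datasets itself. The file, the fixed-width record format, and the `read_record` helper are made up for illustration.

```python
import mmap
import os
import tempfile

# Write a throwaway file standing in for a large on-disk cache (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "cache.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(f"record-{i:04d}\n".encode())

record_size = len(b"record-0000\n")  # every record has the same width

with open(path, "rb") as f:
    # ACCESS_READ gives a read-only view; bytes are paged in only when touched.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        def read_record(idx):
            start = idx * record_size
            return mm[start:start + record_size].decode().strip()

        first = read_record(0)
        last = read_record(999)

print(first, last)
```

Only the slices that are actually read need to be brought into memory, which is why the process can index into a file much larger than the available RAM.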

The `load_dataset` call below loads the dataset, which requires about 180 GB of disk space.

Datasets extracts and reads all the compressed files and loads them into a single optimized cache file. 

In [None]:
from datasets import load_dataset, DownloadConfig

download_config = DownloadConfig(delete_extracted=True)
dataset = load_dataset("./codeparrot", split="train",
                       download_config=download_config)


In [None]:
import os
import psutil

print(f"Number of Python files in dataset : {len(dataset)}")
ds_size = sum(os.stat(f["filename"]).st_size for f in dataset.cache_files)
# os.stat.st_size is expressed in bytes, so we convert to GB
print(f"Dataset size (cache file) : {ds_size / 2**30:.2f} GB")
# Process.memory_info is expressed in bytes, so we convert to MB
print(f"RAM used: {psutil.Process(os.getpid()).memory_info().rss >> 20} MB")

**Streaming.** When a dataset is too large to fit even on the hard disk, we can stream it instead. 

The compressed JSON files in the dataset are opened and read on the fly. The result is an `IterableDataset`: we cannot access an arbitrary sample by index and must process the samples in order. 
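
The on-the-fly reading described above can be sketched with a plain Python generator. This is an illustration under assumptions, not how Datasets implements streaming: the file name, its contents, and the `stream_examples` helper are hypothetical.

```python
import gzip
import json
import os
import tempfile

# Create a small gzip-compressed JSON Lines file (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"content": f"print({i})"}) + "\n")

def stream_examples(file_path):
    """Yield one decoded example at a time without loading the whole file."""
    with gzip.open(file_path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

stream = stream_examples(path)
# Samples can only be consumed in order; there is no stream[3].
first_two = [next(stream)["content"] for _ in range(2)]
print(first_two)
```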

In [None]:
streamed_dataset = load_dataset('./codeparrot', split="train", streaming=True)


In [None]:
# The samples from the streamed and non-streamed datasets are the same
iterator = iter(streamed_dataset)

print(dataset[0] == next(iterator))
print(dataset[1] == next(iterator))

We can even reference the dataset on the Hub directly, without downloading the data locally.

In [2]:
from datasets import load_dataset
remote_dataset = load_dataset('transformersbook/codeparrot', split="train",streaming=True)





### Uploading the dataset to Hub

We upload the training and validation splits to the Hub as separate dataset repositories. 



In [None]:
!huggingface-cli login

In [None]:
!huggingface-cli repo create --type dataset --organization transformersbook \
codeparrot-train

In [None]:
!huggingface-cli repo create --type dataset --organization transformersbook \
codeparrot-valid

Clone the repositories from the Hub to the local machine. 

In [None]:
!git clone https://huggingface.co/datasets/transformersbook/codeparrot-train
!git clone https://huggingface.co/datasets/transformersbook/codeparrot-valid

## Building a Tokenizer 

The tokenization pipeline consists of four steps: normalization, pretokenization, the tokenizer model, and postprocessing. Of these, the tokenizer model is the part that is trained on the data. 

How well a tokenizer fits a corpus can be measured with several metrics: 

* Subword fertility is the average number of subwords produced per tokenized word.
* Proportion of continued words is the proportion of tokenized words in a corpus that are split into at least two subtokens.
* Coverage metrics measure the proportion of unknown words in a tokenized corpus. 
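
As a rough illustration of the first two metrics, the sketch below computes them for a toy tokenizer. `toy_tokenize` is a hypothetical stand-in that simply cuts words into chunks of at most four characters; a real tokenizer would be plugged in at the same place.

```python
def toy_tokenize(word, max_len=4):
    """Hypothetical tokenizer: split a word into chunks of at most max_len characters."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]

def tokenizer_metrics(words, tokenize):
    pieces_per_word = [len(tokenize(w)) for w in words]
    # Subword fertility: average number of subwords per word
    fertility = sum(pieces_per_word) / len(words)
    # Proportion of continued words: share of words split into two or more pieces
    continued = sum(n >= 2 for n in pieces_per_word) / len(words)
    return fertility, continued

words = ["def", "print", "return", "tokenizer"]
fertility, continued = tokenizer_metrics(words, toy_tokenize)
print(f"fertility={fertility:.2f}, continued={continued:.2f}")
```

Lower values of both metrics generally mean the vocabulary matches the corpus more closely, since fewer words need to be broken apart.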

The BPE algorithm progressively constructs a vocabulary of a predefined size: starting from a base vocabulary, it creates new tokens by iteratively merging the most frequent co-occurring pair of tokens.
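
The merge loop can be sketched in a few lines of plain Python. This is a simplified character-level version for illustration only: it works on whole words rather than bytes, and the `train_bpe` function and the tiny corpus are assumptions, not the Tokenizers library implementation.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Each word starts as a tuple of single characters, weighted by its frequency.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

corpus = ["low", "low", "lower", "lowest", "new", "newer"]
merges = train_bpe(corpus, num_merges=2)
print(merges)
```

Each merge adds one token to the vocabulary, so the number of merges controls the final vocabulary size.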

In [3]:
from transformers import AutoTokenizer

python_code = r"""def say_hello():
    print("Hello, World!")

# Print it
say_hello()
"""
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer(python_code).tokens())

['def', 'Ġsay', '_', 'hello', '():', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġprint', '("', 'Hello', ',', 'ĠWorld', '!"', ')', 'Ċ', 'Ċ', '#', 'ĠPrint', 'Ġit', 'Ċ', 'say', '_', 'hello', '()', 'Ċ']


In [None]:
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

byte_to_unicode_map = bytes_to_unicode()
unicode_to_byte_map = dict((v, k) for k, v in byte_to_unicode_map.items())
base_vocab = list(unicode_to_byte_map.keys())

print(f'Size of our base vocabulary: {len(base_vocab)}')
print(f'First element: `{base_vocab[0]}`, last element: `{base_vocab[-1]}`')

Size of our base vocabulary: 256
First element: `!`, last element: `Ń`


In [None]:
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(python_code))

[('def', (0, 3)), ('Ġsay', (3, 7)), ('_', (7, 8)), ('hello', (8, 13)), ('():', (13, 16)), ('ĊĠĠĠ', (16, 20)), ('Ġprint', (20, 26)), ('("', (26, 28)), ('Hello', (28, 33)), (',', (33, 34)), ('ĠWorld', (34, 40)), ('!")', (40, 43)), ('Ċ', (43, 44)), ('Ċ', (44, 45)), ('#', (45, 46)), ('ĠPrint', (46, 52)), ('Ġit', (52, 55)), ('Ċ', (55, 56)), ('say', (56, 59)), ('_', (59, 60)), ('hello', (60, 65)), ('()', (65, 67)), ('Ċ', (67, 68))]


In [None]:
print(f"Size of the vocabulary: {len(tokenizer)}")

Size of the vocabulary: 50257


In [None]:
print(tokenizer(python_code).tokens())

['def', 'Ġsay', '_', 'hello', '():', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġprint', '("', 'Hello', ',', 'ĠWorld', '!"', ')', 'Ċ', 'Ċ', '#', 'ĠPrint', 'Ġit', 'Ċ', 'say', '_', 'hello', '()', 'Ċ']


## Training a Tokenizer 

Training a tokenizer means finding which character combinations are most frequent in the corpus. 

Let's train a tokenizer on about 1-2 GB of data, around 100,000 documents.

In [None]:
from tqdm.auto import tqdm
from datasets import load_dataset
length = 100000
dataset_name = 'transformersbook/codeparrot-train'
dataset = load_dataset(dataset_name, split="train", streaming=True)
iter_dataset = iter(dataset)

def batch_iterator(batch_size=10):
    for _ in tqdm(range(0, length, batch_size)):
        yield [next(iter_dataset)['content'] for _ in range(batch_size)]

# Train a new tokenizer; the cells below rely on `new_tokenizer`
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(),
                                                  vocab_size=12500,
                                                  initial_alphabet=base_vocab)




In [None]:
tokens = sorted(new_tokenizer.vocab.items(), key=lambda x: x[1], reverse=False)
print([f'{tokenizer.convert_tokens_to_string(t)}' for t, _ in tokens[257:280]]);

In [None]:
print([f'{new_tokenizer.convert_tokens_to_string(t)}' for t,_ in tokens[-12:]]);

In [None]:
import keyword

print(f'There are in total {len(keyword.kwlist)} Python keywords.')
for keyw in keyword.kwlist:
    if keyw not in new_tokenizer.vocab:
        print(f'No, keyword `{keyw}` is not in the vocabulary')

In [None]:
# Building a larger tokenizer
length = 200000
new_tokenizer_larger = tokenizer.train_new_from_iterator(batch_iterator(),
    vocab_size=32768, initial_alphabet=base_vocab)

In [None]:
tokens = sorted(new_tokenizer_larger.vocab.items(), key=lambda x: x[1],
                reverse=False)
print([f'{tokenizer.convert_tokens_to_string(t)}' for t, _ in tokens[-12:]]);

In [None]:
print(new_tokenizer_larger(python_code).tokens())

In [None]:
# Investigating common words 
for keyw in keyword.kwlist:
    if keyw not in new_tokenizer_larger.vocab:
        print(f'No, keyword `{keyw}` is not in the vocabulary')

### Save the trained tokenizer 



In [None]:
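model_ckpt = "codeparrot"
org = "transformersbook"
# Push to the organization's namespace so the tokenizer can be reloaded
# from "transformersbook/codeparrot"
new_tokenizer_larger.push_to_hub(f"{org}/{model_ckpt}")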

In [4]:
# Loading the tokenizer from hub
from transformers import AutoTokenizer

model_ckpt = "transformersbook/codeparrot"
reloaded_tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
print(reloaded_tokenizer(python_code).tokens())

['def', 'Ġsay', '_', 'hello', '():', 'ĊĠĠĠ', 'Ġprint', '("', 'Hello', ',', 'ĠWorld', '!")', 'Ċ', 'Ċ', '#', 'ĠPrint', 'Ġit', 'Ċ', 'say', '_', 'hello', '()', 'Ċ']


## Training a Model from Scratch 

Several pretraining objectives are available:

* Causal LM: We provide the beginning of a sample to the model and ask it to complete it. This is a self-supervised training objective, and a decoder-only architecture such as GPT can be used. 

* Masked LM: We provide a noisy input (for example, with some tokens masked) to the model and ask it to reconstruct the original clean sample. BERT and XLM models are pretrained with the MLM objective. 

* Sequence-to-sequence training: We provide an input sequence to an encoder-decoder model and ask it to generate a related output sequence for the task. T5, BART, and PEGASUS are seq2seq models. 
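
To make the difference between the first two objectives concrete, here is a small sketch of how inputs and targets could be built from one tokenized sample. The token list, the `[MASK]` string, and the fixed mask positions are illustrative assumptions; real pipelines work on token IDs and sample the masked positions at random.

```python
tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]

# Causal LM: predict each token from the tokens before it,
# so the targets are the inputs shifted by one position.
causal_inputs, causal_targets = tokens[:-1], tokens[1:]

# Masked LM: hide some positions and ask the model to recover them.
mask_positions = {1, 5}  # fixed here for reproducibility
masked_inputs = ["[MASK]" if i in mask_positions else t
                 for i, t in enumerate(tokens)]
masked_targets = {i: tokens[i] for i in sorted(mask_positions)}

print(causal_inputs, causal_targets)
print(masked_inputs, masked_targets)
```

In both cases the targets come from the text itself, which is why no manual labels are needed.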




In [6]:
# Loading the model config 
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
config = AutoConfig.from_pretrained("gpt2", vocab_size=len(tokenizer))
model = AutoModelForCausalLM.from_config(config)

In [9]:
def model_size(model):
    return sum(t.numel() for t in model.parameters())

print(f'GPT-2 size: {model_size(model)/1000**2:.1f}M parameters')

GPT-2 size: 111.0M parameters


In [None]:
# save and push to the hub 
model.save_pretrained("models/" + model_ckpt, push_to_hub=True,
                      organization=org)

### Dataloader 

Instead of padding each sample separately, we concatenate several tokenized examples, with a special token between them, and split the result into fixed-length chunks.
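
The packing step on its own can be sketched with plain lists of token IDs. The `pack_sequences` helper, the example IDs, and the use of `0` as the separator are assumptions for illustration.

```python
def pack_sequences(tokenized_examples, seq_length, eos_id):
    """Concatenate examples with a separator and cut into equal-sized chunks."""
    all_ids = []
    for ids in tokenized_examples:
        all_ids.extend(ids + [eos_id])
    chunks = [all_ids[i:i + seq_length]
              for i in range(0, len(all_ids), seq_length)]
    # Drop the last chunk if it is shorter than seq_length
    return [c for c in chunks if len(c) == seq_length]

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(examples, seq_length=5, eos_id=0)
print(packed)
```

Every returned chunk has exactly `seq_length` tokens, so no padding is needed; the cost is that a few trailing tokens are discarded and that a chunk can span more than one original example.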

In [None]:
!pip install tqdm

In [12]:
from tqdm import tqdm

In [13]:
examples, total_characters, total_tokens = 500, 0, 0
dataset = load_dataset('transformersbook/codeparrot-train', split='train',
                       streaming=True)

for _, example in tqdm(zip(range(examples), iter(dataset)), total=examples):
    total_characters += len(example['content'])
    total_tokens += len(tokenizer(example['content']).tokens())

characters_per_token = total_characters / total_tokens

  0%|          | 1/500 [00:01<10:03,  1.21s/it]Token indices sequence length is longer than the specified maximum sequence length for this model (2605 > 1024). Running this sequence through the model will result in indexing errors
100%|██████████| 500/500 [00:17<00:00, 28.48it/s]


In [14]:
print(characters_per_token)

3.6233025034779565


In [16]:
import torch
from torch.utils.data import IterableDataset

class ConstantLengthDataset(IterableDataset):

    def __init__(self, tokenizer, dataset, seq_length=1024,
                 num_of_sequences=1024, chars_per_token=3.6):
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.eos_token_id
        self.dataset = dataset
        self.seq_length = seq_length
        self.input_characters = seq_length * chars_per_token * num_of_sequences 
    
    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
        while more_examples:
            buffer, buffer_len = [], 0
            while True:
                if buffer_len >= self.input_characters:
                    m=f"Buffer full: {buffer_len}>={self.input_characters:.0f}"
                    print(m)
                    break
                try:
                    m=f"Fill buffer: {buffer_len}<{self.input_characters:.0f}"
                    print(m)
                    buffer.append(next(iterator)["content"])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    iterator = iter(self.dataset)
            all_token_ids = []
            tokenized_inputs = self.tokenizer(buffer, truncation=False)
            # Append the separator token after every tokenized example
            for tokenized_input in tokenized_inputs["input_ids"]:
                all_token_ids.extend(tokenized_input + [self.concat_token_id])

            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    yield torch.tensor(input_ids)

Since all inputs to the model have the same length, we don't need padding or attention masks.

In [None]:
shuffled_dataset = dataset.shuffle(buffer_size=100)
constant_length_dataset = ConstantLengthDataset(tokenizer, shuffled_dataset,
                                                num_of_sequences=10)
dataset_iterator = iter(constant_length_dataset)

lengths = [len(b) for _, b in zip(range(5), dataset_iterator)]
print(f"Lengths of the sequences: {lengths}")

### Training the model 

We will use Accelerate to parallelize GPT training across multiple devices with only small changes to a plain PyTorch training loop. 

In [None]:
import torch
import torch.nn.functional as F
from datasets import load_dataset
from accelerate import Accelerator

# Accelerator handles device placement, so no explicit `device` is needed
accelerator = Accelerator()

model = torch.nn.Transformer()
optimizer = torch.optim.Adam(model.parameters())
dataset = load_dataset('my_dataset')  # placeholder dataset name
data = torch.utils.data.DataLoader(dataset, shuffle=True)
# prepare() wraps the objects for the current (possibly distributed) setup
model, optimizer, data = accelerator.prepare(model, optimizer, data)

model.train()
for epoch in range(10):
    for source, targets in data:
        optimizer.zero_grad()
        output = model(source)
        loss = F.cross_entropy(output, targets)
        # Used in place of loss.backward()
        accelerator.backward(loss)
        optimizer.step()

In [None]:
# Defining training arguments
from argparse import Namespace

# Commented parameters correspond to the small model
config = {"train_batch_size": 2, # 12
          "valid_batch_size": 2, # 12
          "weight_decay": 0.1,
          "shuffle_buffer": 1000,
          "learning_rate": 2e-4, # 5e-4
          "lr_scheduler_type": "cosine",
          "num_warmup_steps": 750, # 2000
          "gradient_accumulation_steps": 16, # 1
          "max_train_steps": 50000, # 150000
          "max_eval_steps": -1,
          "seq_length": 1024,
          "seed": 1,
          "save_checkpoint_steps": 50000} # 15000

args = Namespace(**config)

In [None]:
from torch.utils.tensorboard import SummaryWriter
import logging
import datasets
import transformers
import wandb

def setup_logging(project_name):
    logger = logging.getLogger(__name__)
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO, handlers=[
        logging.FileHandler(f"log/debug_{accelerator.process_index}.log"),
        logging.StreamHandler()])
    if accelerator.is_main_process: # We only want to set up logging once
        wandb.init(project=project_name, config=args)
        run_name = wandb.run.name
        tb_writer = SummaryWriter()
        tb_writer.add_hparams(vars(args), {'0': 0})
        logger.setLevel(logging.INFO)
        datasets.utils.logging.set_verbosity_debug()
        transformers.utils.logging.set_verbosity_info()
    else:
        tb_writer = None
        run_name = ''
        logger.setLevel(logging.ERROR)
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()
    return logger, tb_writer, run_name


In [None]:
# Log metrics to TensorBoard and Weights & Biases
def log_metrics(step, metrics):
    logger.info(f"Step {step}: {metrics}")
    if accelerator.is_main_process:
        wandb.log(metrics)
        for k, v in metrics.items():
            tb_writer.add_scalar(k, v, step)

In [None]:
# Creating data loaders for training and validation 
from torch.utils.data.dataloader import DataLoader

def create_dataloaders(dataset_name):
    train_data = load_dataset(dataset_name+'-train', split="train",
                              streaming=True)
    train_data = train_data.shuffle(buffer_size=args.shuffle_buffer,
                                    seed=args.seed)
    valid_data = load_dataset(dataset_name+'-valid', split="validation",
                              streaming=True)

    train_dataset = ConstantLengthDataset(tokenizer, train_data,
                                          seq_length=args.seq_length)
    valid_dataset = ConstantLengthDataset(tokenizer, valid_data,
                                          seq_length=args.seq_length)

    train_dataloader=DataLoader(train_dataset, batch_size=args.train_batch_size)
    eval_dataloader=DataLoader(valid_dataset, batch_size=args.valid_batch_size)
    return train_dataloader, eval_dataloader

In [18]:
def get_grouped_params(model, no_decay=["bias", "LayerNorm.weight"]):
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [{'params': params_with_wd, 'weight_decay': args.weight_decay},
            {'params': params_without_wd, 'weight_decay': 0.0}]

In [None]:
def evaluate():
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(batch, labels=batch)
        loss = outputs.loss.repeat(args.valid_batch_size)
        losses.append(accelerator.gather(loss))
        if args.max_eval_steps > 0 and step >= args.max_eval_steps: break
    loss = torch.mean(torch.cat(losses))
    try:
        perplexity = torch.exp(loss)
    except OverflowError:
        perplexity = torch.tensor(float("inf"))
    return loss.item(), perplexity.item()

In [None]:
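from accelerate import Accelerator
from huggingface_hub import Repository
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer, get_scheduler,
                          set_seed)

# Assumed names, inferred from the repositories used earlier in this chapter
project_name = "transformersbook/codeparrot"
# create_dataloaders appends "-train" and "-valid" to this name
dataset_name = "transformersbook/codeparrot"

set_seed(args.seed)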

# Accelerator
accelerator = Accelerator()
samples_per_step = accelerator.state.num_processes * args.train_batch_size

# Logging
logger, tb_writer, run_name = setup_logging(project_name.split("/")[1])
logger.info(accelerator.state)

# Load model and tokenizer
if accelerator.is_main_process:
    hf_repo = Repository("./", clone_from=project_name, revision=run_name)
model = AutoModelForCausalLM.from_pretrained("./", gradient_checkpointing=True)
tokenizer = AutoTokenizer.from_pretrained("./")

# Load dataset and dataloader
train_dataloader, eval_dataloader = create_dataloaders(dataset_name)

# Prepare the optimizer and learning rate scheduler
optimizer = AdamW(get_grouped_params(model), lr=args.learning_rate)
lr_scheduler = get_scheduler(name=args.lr_scheduler_type, optimizer=optimizer,
                             num_warmup_steps=args.num_warmup_steps,
                             num_training_steps=args.max_train_steps,)
def get_lr():
    return optimizer.param_groups[0]['lr']

# Prepare everything with our `accelerator` (order of args is not important)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader)

# Train model
model.train()
completed_steps = 0
for step, batch in enumerate(train_dataloader, start=1):
    loss = model(batch, labels=batch).loss
    log_metrics(step, {'lr': get_lr(), 'samples': step*samples_per_step,
                       'steps': completed_steps, 'loss/train': loss.item()})
    loss = loss / args.gradient_accumulation_steps
    accelerator.backward(loss)
    if step % args.gradient_accumulation_steps == 0:
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        completed_steps += 1
    if step % args.save_checkpoint_steps == 0:
        logger.info('Evaluating and saving model checkpoint')
        eval_loss, perplexity = evaluate()
        log_metrics(step, {'loss/eval': eval_loss, 'perplexity': perplexity})
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        if accelerator.is_main_process:
            unwrapped_model.save_pretrained("./")
            hf_repo.push_to_hub(commit_message=f'step {step}')
        model.train()
    if completed_steps >= args.max_train_steps:
        break

# Evaluate and save the last checkpoint
logger.info('Evaluating and saving model after training')
eval_loss, perplexity = evaluate()
log_metrics(step, {'loss/eval': eval_loss, 'perplexity': perplexity})
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    unwrapped_model.save_pretrained("./")
    hf_repo.push_to_hub(commit_message=f'final model')