# Training a BERT model from scratch

This demo illustrates how to train a BERT-style (RoBERTa) model from scratch. The model is trained on a small dataset of source code from wolfSSL, an SSL library used in embedded software development.

The notebook is based on [an existing tutorial from Hugging Face](https://huggingface.co/blog/how-to-train).


# Training a tokenizer

The first step in training the model is to train the tokenizer. 

In [1]:
# We use the standard BPE tokenizer for this workbook
# it was described in the previous chapter of the book
# when we discussed feature extraction
from tokenizers import ByteLevelBPETokenizer

paths = ['source_code_wolf_ssl.txt']

print(f'Found {len(paths)} files')
print(f'First file: {paths[0]}')

Found 1 files
First file: source_code_wolf_ssl.txt


In [2]:
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

print('Training tokenizer...')

# Customize training
# we use a large vocabulary size, but we could also do with ca. 10_000
tokenizer.train(files=paths, 
                vocab_size=52_000, 
                min_frequency=2, 
                special_tokens=["<s>","<pad>","</s>","<unk>","<mask>",])

Training tokenizer...
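
To make the `vocab_size` and `min_frequency` arguments more concrete, here is a minimal sketch of the merge loop behind BPE training. It is a simplified, character-level illustration and not the implementation used by `ByteLevelBPETokenizer`, which works on bytes. The function name `train_toy_bpe` and the toy word list are assumptions made for this example only.

```python
from collections import Counter


def train_toy_bpe(words, num_merges, min_frequency=2):
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Each word starts as a tuple of single characters, with its count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best_pair, best_count = pair_counts.most_common(1)[0]
        # Stop when no pair is frequent enough to be worth a merge.
        if best_count < min_frequency:
            break
        merges.append(best_pair)
        # Rewrite the corpus with the best pair fused into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += count
        corpus = new_corpus
    return merges


# A made-up miniature corpus, only for illustration.
toy_words = ["int", "int", "int", "init", "main"]
toy_merges = train_toy_bpe(toy_words, num_merges=10)
print(toy_merges)
```

In this sketch, the number of merges plays the role of the vocabulary size: every merge adds one new symbol. The `min_frequency` threshold stops training once the remaining pairs are too rare to be useful.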





In [3]:
import os

# we give this model a catchy name - wolfBERTa
# because it is a RoBERTa model trained on the wolfSSL source code
token_dir = 'wolfBERTa'

# create the output directory if it does not exist yet
os.makedirs(token_dir, exist_ok=True)

tokenizer.save_model(token_dir)

['wolfBERTa/vocab.json', 'wolfBERTa/merges.txt']

In [4]:
# finally, we can test the tokenizer
tokenizer.encode("int main(int argc, void **argv)").tokens

['int', 'Ġmain', '(', 'int', 'Ġargc', ',', 'Ġvoid', 'Ġ**', 'argv', ')']
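
The `Ġ` character in the output is how a byte-level BPE tokenizer writes a leading space, so `Ġmain` stands for " main". The following sketch shows how the original text can be recovered from the tokens. The helper `tokens_to_text` is a hypothetical name, and the simple replacement only covers plain ASCII input such as this example.

```python
# Byte-level BPE stores a space as the visible character "Ġ" (U+0120).
SPACE_MARKER = "\u0120"


def tokens_to_text(tokens):
    """Join byte-level BPE tokens and restore spaces (ASCII input only)."""
    return "".join(tokens).replace(SPACE_MARKER, " ")


tokens = ["int", "Ġmain", "(", "int", "Ġargc", ",", "Ġvoid", "Ġ**", "argv", ")"]
restored = tokens_to_text(tokens)
print(restored)
```

For real use, the tokenizer's own `decode` method is the safer choice, because it also handles non-ASCII bytes.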

# Training the model

Now, we can start preparing to train the model. 

In [5]:
from tokenizers.processors import BertProcessing

# let's make sure that the tokenizer does not produce more tokens than the model accepts
# every sequence is wrapped in <s> ... </s> and capped at 512 tokens,
# which leaves room for 510 tokens of actual content
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
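
The relation between the 510 content tokens and the limit of 512 can be sketched in plain Python. This is only an illustration of the arithmetic, not the code of `BertProcessing`. The ids 0 and 2 follow from the order of `special_tokens` used when training the tokenizer; the helper `add_special_tokens` and the content ids are made up for this example.

```python
BOS_ID = 0  # "<s>", first entry in special_tokens above
EOS_ID = 2  # "</s>", third entry in special_tokens above
MAX_LENGTH = 512


def add_special_tokens(content_ids, max_length=MAX_LENGTH):
    """Truncate content so that <s> ... </s> fits into max_length."""
    budget = max_length - 2  # two positions are reserved for <s> and </s>
    return [BOS_ID] + list(content_ids[:budget]) + [EOS_ID]


# Hypothetical token ids: one short and one overly long sequence.
short = add_special_tokens(range(10, 15))
long = add_special_tokens(range(10, 1010))
print(len(short), len(long))
```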

In [6]:
# import the RoBERTa configuration
from transformers import RobertaConfig

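# initialize the configuration
# please note that the vocab size is the same as the one in the tokenizer.
# if it is not, we could get exceptions that the model and the tokenizer are not compatible
# max_position_embeddings is 514 rather than 512, because RoBERTa starts
# counting positions after the padding index, which takes up two extra slots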
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)



In [7]:
# let's print the configuration
# please note that there are more parameters than the ones we configured
# this is because the remaining parameters use their default values
print(config)

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.30.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}



In [8]:
# Initializing a Model From Scratch
from transformers import RobertaForMaskedLM

# initialize the model
model = RobertaForMaskedLM(config=config)

# let's print the number of parameters in the model
print(model.num_parameters())

# let's print the model
print(model)

83504416
RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (Layer
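
The parameter count printed above (83504416) can be cross-checked by hand from the configuration values. The sketch below assumes the usual RoBERTa layout in the transformers library: no pooling layer in the masked-LM variant, and an output projection that shares its weights with the word embeddings. The function name `roberta_mlm_parameters` is made up for this example.

```python
def roberta_mlm_parameters(vocab_size=52_000, hidden=768, layers=6,
                           intermediate=3072, max_positions=514,
                           type_vocab=1):
    """Count trainable parameters of a RoBERTa masked-LM with tied embeddings."""
    layer_norm = 2 * hidden  # weight and bias

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output
    feed_forward = (linear(hidden, intermediate)
                    + linear(intermediate, hidden) + layer_norm)
    encoder = layers * (attention + feed_forward)
    # LM head: dense + LayerNorm + output bias; the decoder weight matrix is
    # shared with the word embeddings, so it is not counted twice.
    lm_head = linear(hidden, hidden) + layer_norm + vocab_size
    return embeddings + encoder + lm_head


total = roberta_mlm_parameters()
print(total)
```

Almost half of the parameters sit in the word embeddings (52000 x 768), which is why a smaller vocabulary would shrink the model considerably.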

# Prepare the dataset for training

We use the datasets library from Hugging Face to load the dataset. It allows us to work with larger datasets more efficiently.

In [9]:
# before we actually train the model, we need to reload the tokenizer
# that we trained above as a RobertaTokenizer,
# which is the tokenizer class that the transformers library expects
# so we read it from the saved files
from transformers import RobertaTokenizer

# initialize the tokenizer from the file
tokenizer = RobertaTokenizer.from_pretrained("./wolfBERTa", max_length=512)

# please note that if we kept using the ByteLevelBPETokenizer object
# that we trained above, we would get an exception
# that the BPE tokenizer is not callable

In [10]:
# we load the source code with the datasets library, one example per line of text
from datasets import load_dataset

new_dataset = load_dataset("text", data_files='./source_code_wolf_ssl.txt')

Found cached dataset text (/home/miroslaw/.cache/huggingface/datasets/text/default-bb9f0226a4741661/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)
100%|██████████| 1/1 [00:00<00:00, 36.90it/s]


In [11]:
# now, let's tokenize the dataset

# num_proc sets the number of parallel worker processes (8 here)
tokenized_dataset = new_dataset.map(lambda x: tokenizer(x["text"]), num_proc=8)

Loading cached processed dataset at /home/miroslaw/.cache/huggingface/datasets/text/default-bb9f0226a4741661/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2/cache-c2662a868155bd2a_*_of_00008.arrow


In [12]:
# training of the model requires a data collator
# which creates a random set of tokens to mask
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
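
To illustrate what the collator does with `mlm_probability=0.15`, here is a simplified pure-Python sketch of the usual BERT masking rule: about 15% of the tokens are selected for prediction, and of those 80% are replaced by `<mask>`, 10% by a random token, and 10% are left unchanged. This is a sketch under those assumptions, not the code of `DataCollatorForLanguageModeling`. The function `mask_tokens` and the random input ids are made up for this example.

```python
import random

MASK_ID = 4                # "<mask>", fifth entry in special_tokens above
SPECIAL_IDS = {0, 1, 2}    # <s>, <pad>, </s> are never masked
IGNORE_INDEX = -100        # label value that the loss function skips


def mask_tokens(input_ids, vocab_size, mlm_probability=0.15, rng=random):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    masked, labels = list(input_ids), []
    for i, token in enumerate(input_ids):
        if token in SPECIAL_IDS or rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)  # position is not predicted
            continue
        labels.append(token)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # 80%: replace with <mask>
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


rng = random.Random(0)  # fixed seed so that the example is repeatable
ids = [0] + [rng.randrange(5, 52_000) for _ in range(200)] + [2]
masked_ids, labels = mask_tokens(ids, vocab_size=52_000, rng=rng)
print(sum(label != IGNORE_INDEX for label in labels), "tokens selected")
```

The loss is computed only at the selected positions, which is why all other labels are set to -100.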

In [13]:
# now, we can train the model
# by creating the trainer
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./wolfBERTa",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset['train'],
)
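
The number of training steps follows from the dataset size, the batch size, and the number of epochs. A minimal sketch of that arithmetic is shown below, assuming a single device and no gradient accumulation. The function name and the dataset size of 20,000 lines are made up for illustration; the actual number of lines in the wolfSSL file is not stated here.

```python
import math


def total_training_steps(num_examples, batch_size, epochs, num_devices=1):
    """Number of optimizer steps without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    return steps_per_epoch * epochs


# 20_000 is a made-up dataset size, chosen only for illustration.
steps = total_training_steps(num_examples=20_000, batch_size=8, epochs=2)
print(steps)
```

With `save_steps=10_000`, a run shorter than 10,000 steps writes no intermediate checkpoint, so saving the model explicitly at the end matters.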

In [14]:
# just checking if CUDA is available on this computer
import torch

torch.cuda.is_available()

True

In [15]:
# here is where we train the model
# which corresponds to the model.fit() method in Keras
# which we used in the previous chapters
trainer.train()



| Step | Training Loss |
|------|---------------|
| 500  | 6.0093 |
| 1000 | 4.7568 |
| 1500 | 4.3578 |
| 2000 | 4.0055 |
| 2500 | 4.0363 |
| 3000 | 3.7907 |
| 3500 | 3.6728 |
| 4000 | 3.5835 |
| 4500 | 3.584 |
| 5000 | 3.5107 |
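
One way to read these loss values is to convert them to perplexity, which is the exponential of the cross-entropy loss. The sketch below does this for the logged values. Please note that this is computed on the training loss over masked tokens only, so it is a rough indicator of progress and not an evaluation result.

```python
import math

# (step, training loss) pairs copied from the log above
training_log = [
    (500, 6.0093), (1000, 4.7568), (1500, 4.3578), (2000, 4.0055),
    (2500, 4.0363), (3000, 3.7907), (3500, 3.6728), (4000, 3.5835),
    (4500, 3.584), (5000, 3.5107),
]

for step, loss in training_log:
    # exp(cross-entropy) is the perplexity over the masked positions
    print(f"step {step:>5}: loss {loss:.4f}, perplexity {math.exp(loss):.1f}")

first_ppl = math.exp(training_log[0][1])
last_ppl = math.exp(training_log[-1][1])
```

The loss still decreases at the last logged step, which suggests that more than two epochs could improve the model further.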


# Save the final model to hard drive

Finally, we save the model to the hard drive.

In [None]:
trainer.save_model("./wolfBERTa")

# Summary

The code above trained a RoBERTa model from scratch on the wolfSSL source code. The model is saved to the hard drive and can be used for further analysis.

In chapter 9, we learned how to use such a model, so please take a look at that code to see how to use this one.