**Training a BERT Model from Scratch**


We will train our own BERT model from scratch. The model, myBERT, is small: 6 layers, 12 attention heads, and about 83 million parameters.

myBERT will be a DistilBERT-like model, as it has the same architecture of 6 layers and 12 heads.

myBERT will use a byte-level byte-pair encoding (BPE) tokenizer, the kind used by GPT-2, whereas BERT models mostly use a WordPiece tokenizer. There will be no token type IDs; segments are separated by the separator token </s>.

We'll use the complete works of William Shakespeare as our dataset, train a tokenizer, train the transformer, save it, and run it on masked language modeling examples.

Let's dive in!

**1) Loading the Dataset**

Let's download 'The Complete Works of William Shakespeare' as a single text file from Project Gutenberg. We can use snippets from this file as the training data for the model.

In [None]:
!wget --show-progress --continue -O /content/shakespeare.txt http://www.gutenberg.org/files/100/100-0.txt

--2021-06-08 13:48:57--  http://www.gutenberg.org/files/100/100-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/files/100/100-0.txt [following]
--2021-06-08 13:48:58--  https://www.gutenberg.org/files/100/100-0.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5757108 (5.5M) [text/plain]
Saving to: ‘/content/shakespeare.txt’


2021-06-08 13:49:00 (3.59 MB/s) - ‘/content/shakespeare.txt’ saved [5757108/5757108]



In [None]:
# Get a glimpse of the dataset
!head -n5 /content/shakespeare.txt
!echo "..."
!shuf -n5 /content/shakespeare.txt

﻿The Project Gutenberg eBook of The Complete Works of William Shakespeare, by William Shakespeare

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
...
    was put down, and the worser allow'd by order of law a furr'd
    I'll steal away.
MESSENGER.
    Grace!
What doth our cousin lay to Mowbray’s charge?


In [None]:
#SHAKESPEARE_TXT = '/content/shakespeare.txt'

**2) Installing Hugging Face Transformers**

We need the Hugging Face Transformers and Tokenizers libraries; however, we don't need TensorFlow for our task.

In [None]:
!pip uninstall -y tensorflow  # TensorFlow isn't needed here



In [None]:
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-gal6mr23
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-gal6mr23
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ... done
  Created wheel for transformers: filename=transformers-4.7.0.dev0-cp37-none-any.whl size=2351635 sha256=7aeabe6a2ae494f6a9edcb44a7dc618a5cc2622f79ec5bafb669a109e56b7838
  Stored in directory: /tmp/pip-ephem-wheel-cache-cm_ea8p7/wheels/70/d3/52/b3fa4f8b8ef04167ac62e5bb2accb62ae764db2a378247490e
Successfully built transformers


In [None]:
# Check the versions
!pip list | grep -E 'transformers|tokenizers'

tokenizers                    0.10.3             
transformers                  4.7.0.dev0         


**3) Training a Tokenizer**

We'll also train a tokenizer from scratch. We'll be using a byte-level BPE tokenizer, which splits each word into sub-word units by repeatedly merging the most frequent pairs of adjacent symbols.
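To make the idea concrete, here is a toy sketch of how BPE learns its merge rules. It is a simplification under stated assumptions: it works on characters rather than bytes, it ignores `min_frequency`, and the function name `learn_bpe_merges` is made up for this illustration; it is not how the `tokenizers` library is implemented.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Represent each distinct word as a tuple of symbols with its frequency
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties go to the pair counted first
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["thee", "then", "the", "thou", "that"], num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

After two merges, the frequent prefix "the" has become a single symbol, while rarer words such as "thou" are still split into smaller pieces.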

In [None]:
# Import the byte-level BPE tokenizer
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

In [None]:
%%time

# Select the files
paths = [str(x) for x in Path('.').glob('**/*.txt')]

# Select vocab size
vocab_size = 52_000

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths,
                vocab_size=vocab_size, # target size of the tokenizer's vocabulary
                min_frequency=2, # a pair must occur at least twice to be merged
                special_tokens=[
                                '<s>',    # a start token
                                '<pad>',  # a padding token
                                '</s>',   # an end token
                                '<unk>',  # an unknown token
                                '<mask>', # the mask token for language modeling
                ])

CPU times: user 4.36 s, sys: 337 ms, total: 4.7 s
Wall time: 2.53 s


**4) Saving the files to disk**

The tokenizer generates two files after training:
1) merges.txt: the ordered list of merge rules (pairs of sub-strings to fuse) learned during training
2) vocab.json: the mapping from each sub-string (token) to its integer index
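The sketch below illustrates how the two files work together at encoding time: the merge rules are applied in the order they were learned, and the resulting sub-strings are looked up in the vocabulary. The merge list, the vocabulary, and the function name `bpe_encode` are hypothetical stand-ins, not the real contents of our files or the library's code.

```python
def bpe_encode(word, merges, vocab):
    """Split `word` into sub-words using ranked merge rules, then map them to IDs."""
    # Lower rank = learned earlier = applied first
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        # Find the adjacent pair with the best (lowest) rank
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left
        # Fuse every occurrence of that pair, scanning left to right
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, [vocab[s] for s in symbols]

# Hypothetical stand-ins for the contents of merges.txt and vocab.json
toy_merges = [("t", "h"), ("th", "e"), ("e", "r")]
toy_vocab = {"t": 0, "h": 1, "e": 2, "r": 3, "th": 4, "the": 5, "er": 6}

tokens, ids = bpe_encode("there", toy_merges, toy_vocab)
print(tokens, ids)  # ['the', 'r', 'e'] [5, 3, 2]
```

Note that the ("e", "r") rule never fires here: the earlier rules have already absorbed that "e" into "the", which is why the order of merges.txt matters.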

Let's create a 'myBERT' directory and save the two files.

In [None]:
import os

token_dir = '/content/myBERT'

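# Create the directory if it doesn't exist yet
os.makedirs(token_dir, exist_ok=True)

# Relative path: resolves to token_dir when the working directory is /content
tokenizer.save_model('myBERT')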

['myBERT/vocab.json', 'myBERT/merges.txt']

**5) Loading the Trained Tokenizer Files**

Since we have trained our own tokenizer, let's load the files.

In [None]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer=ByteLevelBPETokenizer(
    './myBERT/vocab.json',
    './myBERT/merges.txt'
    )

In [None]:
# Let's encode a sample sequence (no post-processing yet)
tokenizer.encode('nference is inference without bias')

Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [None]:
# Check how the tokenizer works
tokenizer.encode('nference is inference without bias').tokens

['n', 'ference', 'Ġis', 'Ġin', 'ference', 'Ġwithout', 'Ġbias']

Now let's configure the tokenizer to post-process its output to fit the BERT-style model. The post-processor wraps every sequence in a start token (<s>) and an end token (</s>).

In [None]:
tokenizer._tokenizer.post_processor=BertProcessing(
    ('</s>', tokenizer.token_to_id('</s>')),
    ('<s>', tokenizer.token_to_id('<s>'))
)

# Truncate sequences to the model's maximum input length (in tokens)
max_length = 512

tokenizer.enable_truncation(max_length=max_length)

In [None]:
# Let's encode a post-processed sequence
tokenizer.encode('nference is inference without bias')

Encoding(num_tokens=9, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

After post-processing, the number of tokens has increased from 7 to 9 because the tokenizer added the start and end tokens (below).

In [None]:
# Check how the tokenizer works
tokenizer.encode('nference is inference without bias').tokens

['<s>', 'n', 'ference', 'Ġis', 'Ġin', 'ference', 'Ġwithout', 'Ġbias', '</s>']
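As a rough sketch of what the post-processor and truncation do together, the function below wraps a list of token IDs in the start and end tokens while keeping the total within the maximum length. The function `post_process` and the sample IDs are hypothetical; the special-token IDs assume that the tokenizer numbered them in the order we passed to `special_tokens`, and the truncation behavior shown is an assumption about how the two settings interact.

```python
def post_process(token_ids, bos_id, eos_id, max_length):
    """Wrap a sequence in start/end tokens, truncating so the total fits `max_length`."""
    # Reserve two positions for the special tokens
    room = max_length - 2
    return [bos_id] + token_ids[:room] + [eos_id]

# Assumed IDs, following the training order: <s>=0, <pad>=1, </s>=2, <unk>=3, <mask>=4
BOS_ID, EOS_ID = 0, 2

ids = [10, 11, 12, 13, 14, 15, 16]  # hypothetical IDs for a 7-token sentence
wrapped = post_process(ids, BOS_ID, EOS_ID, max_length=512)
print(len(wrapped))  # 9
```

The jump from 7 to 9 tokens mirrors what we observed with the real tokenizer.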

**6) Checking the GPU and CUDA**

In [None]:
#Let's see if an NVIDIA GPU card is present
!nvidia-smi

Thu Jun  3 16:34:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   77C    P0    34W /  70W |  11344MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

The output shows the GPU model along with the driver and CUDA versions. We'll now check whether PyTorch can see CUDA.

In [None]:
# Import PyTorch
import torch

# If there's a GPU available...
if torch.cuda.is_available():

    # Tell PyTorch to use the GPU
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


**7) Defining the configuration of the Model**

We'll pretrain a RoBERTa-type transformer using the same number of layers and heads as DistilBERT.

In [None]:
# Import Roberta configurations
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [None]:
config

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.7.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

**8) Re-creating the Tokenizer in Transformers**

Now let's load our trained tokenizer.

In [None]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('./myBERT', max_length=512)

Since we have loaded our trained tokenizer, let's initialize a RoBERTa model from scratch.

**9) Initializing a Model From Scratch**

In [None]:
# Import RoBERTa masked model
from transformers import RobertaForMaskedLM

# Initiate the model with defined configurations
model = RobertaForMaskedLM(config=config)

# Print the model and see its building blocks
model

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(512, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [None]:
# Check num of parameters
model.num_parameters()

83502880

It's a small model, with only about 83.5 million parameters.
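The parameter count can be reproduced by hand from the configuration values printed earlier. The breakdown below is a sketch based on the standard transformer encoder layout; it assumes the output projection of the masked-language-modeling head shares its weights with the word embeddings, so that matrix is counted only once.

```python
# Values taken from the RobertaConfig printed earlier
vocab_size, hidden, max_pos, type_vocab = 52_000, 768, 512, 1
layers, intermediate = 6, 3_072

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Word, position, and token-type embeddings, followed by a layer norm
embeddings = (vocab_size + max_pos + type_vocab) * hidden + layer_norm

per_layer = (
    3 * linear(hidden, hidden)      # query, key, value projections
    + linear(hidden, hidden)        # attention output projection
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward expansion
    + linear(intermediate, hidden)  # feed-forward contraction
    + layer_norm
)

# MLM head: dense layer + layer norm + output bias (output weights assumed tied)
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + layers * per_layer + lm_head
print(f"{total:,}")  # 83,502,880
```

Nearly half of the parameters (about 40 million) sit in the embeddings, a consequence of pairing a 52,000-token vocabulary with a shallow 6-layer encoder.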

**10) Building the Dataset**

Let's load the dataset line by line, so that each non-empty line becomes one training example, truncated to a block size of 128 tokens.

In [None]:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='./shakespeare.txt',
    block_size=128,
)



**11) Defining a Data Collator**

Before initializing the trainer, we need to define a data collator, which takes samples from the dataset and collates them into batches (dictionary-like objects). For masked language modeling, it also randomly masks a proportion of the tokens in each batch.

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,             # pretraining for masked language modeling
    mlm_probability=0.15  # proportion of masked tokens
)

print('Done...')

Done...
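To illustrate what masking with `mlm_probability=0.15` means, here is a pure-Python sketch of the commonly used BERT-style scheme: each non-special token is selected with probability 0.15, and a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function `mask_tokens` and the sample IDs are hypothetical, and the real collator works on padded tensors rather than Python lists.

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one sequence of token IDs."""
    rng = random.Random(seed)
    inputs, labels = list(input_ids), []
    for i, tok in enumerate(input_ids):
        # Special tokens are never selected; other tokens are selected with probability p
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # -100 marks positions the loss should ignore
            continue
        labels.append(tok)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels

# Hypothetical sequence: <s>=0, 100 ordinary tokens, </s>=2
sample = [0] + list(range(5, 105)) + [2]
masked, labels = mask_tokens(sample, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2, 3, 4})
print(sum(label != -100 for label in labels), "tokens selected for prediction")
```

Only the selected positions contribute to the training loss, so the model learns to reconstruct tokens from their surrounding context.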


**12) Initializing the Trainer**

In [None]:
from transformers import Trainer, TrainingArguments

# Set training arguments
training_args=TrainingArguments(
    output_dir='./myBERT',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2
)

print('Done...')

Done...


In [None]:
# Set the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

print('The model is ready for training...')

The model is ready for training...


**13) Pre-training the Model**

All set; now launch the trainer.

In [None]:
%%time
trainer.train()
# done

Step,Training Loss
500,7.0201
1000,6.1425
1500,5.8941
2000,5.7191


CPU times: user 8min 52s, sys: 9.67 s, total: 9min 2s
Wall time: 9min 6s


TrainOutput(global_step=2206, training_loss=6.147931677367832, metrics={'train_runtime': 546.5641, 'train_samples_per_second': 258.283, 'train_steps_per_second': 4.036, 'total_flos': 1518795807006720.0, 'train_loss': 6.147931677367832, 'epoch': 1.0})

We can see the training process in real time, including the loss, learning rate, epoch, and steps.
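The step count in the `TrainOutput` can be sanity-checked with a little arithmetic. This is an estimate, assuming a single GPU, no gradient accumulation, and that the final partial batch still counts as a step; the number of samples is inferred from the reported throughput rather than read from the dataset.

```python
import math

# Figures reported in the TrainOutput above
train_runtime = 546.5641       # seconds
samples_per_second = 258.283
batch_size = 64                # per_device_train_batch_size on one GPU
epochs = 1

# Approximate number of training examples processed
num_samples = round(train_runtime * samples_per_second)

# One optimizer step per batch; the last, partial batch still counts
steps = math.ceil(num_samples / batch_size) * epochs
print(num_samples, steps)  # 141168 2206
```

The result agrees with `global_step=2206`, which suggests the dataset holds roughly 141,000 non-empty lines.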

**14) Saving the Final Model**

The model is trained. It's time to save the model and configurations.

In [None]:
trainer.save_model('./myBERT')

Inside the 'myBERT' folder, we can now see the config.json, pytorch_model.bin, and training_args.bin files alongside vocab.json and merges.txt, which hold the tokenizer trained on our dataset.

**We have built our BERT model from scratch!!!**

**15) Language Modeling with the FillMaskPipeline**

Let's perform masked language modeling using the 'fill-mask' pipeline.

In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./myBERT",
    tokenizer="./myBERT"
)

In [None]:
# Let's ask our model to think like Shakespeare :)
fill_mask("Everything that glitters is not<mask>.")

[{'score': 0.017595890909433365,
  'sequence': 'Everything that glitters is not it.',
  'token': 352,
  'token_str': ' it'},
 {'score': 0.016499634832143784,
  'sequence': 'Everything that glitters is not me.',
  'token': 330,
  'token_str': ' me'},
 {'score': 0.00985712744295597,
  'sequence': 'Everything that glitters is not him.',
  'token': 368,
  'token_str': ' him'},
 {'score': 0.008378772996366024,
  'sequence': 'Everything that glitters is not you.',
  'token': 298,
  'token_str': ' you'},
 {'score': 0.006181748118251562,
  'sequence': 'Everything that glitters is not thee.',
  'token': 448,
  'token_str': ' thee'}]

In [None]:
# Once more: ask our model to think like Shakespeare :)
fill_mask("To be, or not to<mask>.")

[{'score': 0.014091857708990574,
  'sequence': 'To be, or not to you.',
  'token': 298,
  'token_str': ' you'},
 {'score': 0.01212971843779087,
  'sequence': 'To be, or not to me.',
  'token': 330,
  'token_str': ' me'},
 {'score': 0.01062318030744791,
  'sequence': 'To be, or not to it.',
  'token': 352,
  'token_str': ' it'},
 {'score': 0.009581162594258785,
  'sequence': 'To be, or not to sir.',
  'token': 548,
  'token_str': ' sir'},
 {'score': 0.008413019590079784,
  'sequence': 'To be, or not to lord.',
  'token': 491,
  'token_str': ' lord'}]

The output may vary in each run as we're pretraining our model from scratch with limited data (only 5.5 MB).
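The `score` values in the outputs above are probabilities over the whole vocabulary for the masked position. As a minimal sketch of that last step, the function below applies a softmax to raw scores and keeps the top candidates. The function `top_k_predictions`, the logits, and the four-token vocabulary are hypothetical and only illustrate the idea.

```python
import math

def top_k_predictions(logits, id_to_token, k=5):
    """Turn raw scores for the masked position into the k most likely tokens."""
    # Softmax, subtracting the maximum first for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token IDs by probability, highest first
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [{"token": i, "token_str": id_to_token[i], "score": probs[i]} for i in ranked]

# Hypothetical logits over a four-token vocabulary
toy_logits = [2.0, 1.0, 0.5, -1.0]
toy_vocab = {0: " gold", 1: " it", 2: " me", 3: " thee"}

for row in top_k_predictions(toy_logits, toy_vocab, k=2):
    print(row)
```

For comparison, a uniform guess over 52,000 tokens would give each token a score of about 0.00002, so top scores near 0.017 show that the model has learned something after one epoch but is still far from confident.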