# Training a BERT model from scratch

This demo illustrates the training of a BERT model from scratch. The model is trained on the text in war_and_peace.txt.

The notebook is based on an existing [tutorial from Hugging Face](https://huggingface.co/blog/how-to-train).


# Training a tokenizer

The first step in training the model is to train the tokenizer. 

In [33]:
# just checking if CUDA is available on this computer
import torch

torch.cuda.is_available()

True

In [34]:
# We use the standard BPE tokenizer for this workbook
# it was described in the previous chapter of the book
# when we discussed feature extraction
from tokenizers import ByteLevelBPETokenizer

paths = ['war_and_peace.txt']

In [35]:
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

print('Training tokenizer...')

# Customize training
# we use a large vocabulary size, but we could also do with ca. 10_000
tokenizer.train(files=paths, 
                vocab_size=52_000, 
                min_frequency=2, 
                special_tokens=["<s>","<pad>","</s>","<unk>","<mask>",])

Training tokenizer...
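
To make the roles of min_frequency and the merge process more concrete, here is a minimal sketch of byte-pair encoding on a toy corpus. It is not the implementation used by ByteLevelBPETokenizer: it works on characters instead of bytes, ignores the vocab_size limit, and the corpus and helper names (most_frequent_pair, merge_pair) are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# hypothetical toy corpus: word -> number of occurrences
corpus = {"peace": 4, "pear": 1, "war": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

min_frequency = 2  # plays the same role as the parameter passed to tokenizer.train
merges = []
while True:
    best = most_frequent_pair(words)
    # stop when no pair is left or the best pair is too rare to merge
    if best is None or best[1] < min_frequency:
        break
    pair, count = best
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)       # learned merge rules, in the order they were found
print(list(words))  # how each word is split after training
```

In this sketch the frequent words end up as single tokens, while the rare word "pear" stays split because its last pair occurs only once, which is below min_frequency.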





In [36]:
# check how the tokenizer works for the string "Santa Claus is coming to town"
encoded = tokenizer.encode("Santa Claus is coming to town")

print(encoded.ids)

[55, 401, 69, 4628, 1448, 384, 1739, 283, 1891]


In [37]:
import os

# we give this model a catchy name - brownieBERTa
# it is a RoBERTa model trained on the text in war_and_peace.txt
token_dir = './brownieBERTa'

if not os.path.exists(token_dir):
  os.makedirs(token_dir)

tokenizer.save_model('brownieBERTa')

['brownieBERTa/vocab.json', 'brownieBERTa/merges.txt']

# Training the model

Now, we can start preparing to train the model. 

In [38]:
from tokenizers.processors import BertProcessing

# let's make sure that the tokenizer does not provide more tokens than we expect
# we truncate every sequence to at most 148 tokens, which is well below
# the max_position_embeddings=512 that we configure for the model
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=148)

In [39]:
# import the RoBERTa configuration
from transformers import RobertaConfig

# initialize the configuration
# please note that the vocab size is the same as the one in the tokenizer. 
# if it is not, we could get exceptions that the model and the tokenizer are not compatible
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
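
To get a feeling for the size of the network that this configuration describes, the number of trainable parameters can be estimated by hand. The sketch below assumes the library defaults hidden_size=768 and intermediate_size=3072 (they are not set in the notebook) and assumes that the output layer of the masked-LM head shares its weights with the token embeddings. Treat the result as an estimate; the value reported by model.num_parameters() is the one to trust.

```python
# values taken from the RobertaConfig call above
vocab_size = 52_000
max_position_embeddings = 512
num_hidden_layers = 6
type_vocab_size = 1

# ASSUMPTION: library defaults, not set explicitly in the notebook
hidden_size = 768
intermediate_size = 3072

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden_size  # scale and shift vectors

# token, position and token-type embeddings, followed by a layer norm
embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm

# one encoder layer: query, key, value and output projections, then the feed-forward block
attention = 4 * linear(hidden_size, hidden_size) + layer_norm
feed_forward = (linear(hidden_size, intermediate_size)
                + linear(intermediate_size, hidden_size)
                + layer_norm)
encoder = num_hidden_layers * (attention + feed_forward)

# masked-LM head: dense layer, layer norm and an output bias
# ASSUMPTION: the output weights are tied to the token embeddings, so they are not counted again
lm_head = linear(hidden_size, hidden_size) + layer_norm + vocab_size

total = embeddings + encoder + lm_head
print(f"embeddings: {embeddings:,}")
print(f"encoder:    {encoder:,}")
print(f"lm head:    {lm_head:,}")
print(f"total:      {total:,}")
```

Under these assumptions, almost half of the parameters sit in the token embeddings, which is why the vocabulary size has such a large effect on the size of the model.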

In [40]:
# Initializing a Model From Scratch
from transformers import RobertaForMaskedLM

# initialize the model
model = RobertaForMaskedLM(config=config)

# Prepare the dataset for training

We use the datasets library from Hugging Face to load the dataset. It lets us work with larger datasets more efficiently.

In [41]:
# but before we actually train the model
# we need to change the tokenizer to the one that we trained
# and to make it compatible with the tokenizer that is expected by the model
# so we read it from the file under a different tokenizer
from transformers import RobertaTokenizer

# initialize the tokenizer from the file
tokenizer = RobertaTokenizer.from_pretrained("./brownieBERTa", max_length=510)

# please note that if we use the tokenizer that we trained before,
# i.e. the vanilla version of the BPE tokenizer, we will get an exception
# that the BPE tokenizer is not callable

In [42]:
# we use the datasets library to load the text file, one example per line
from datasets import load_dataset

new_dataset = load_dataset("text", data_files='./war_and_peace.txt')

Generating train split: 0 examples [00:00, ? examples/s]

In [43]:
# now, let's tokenize the dataset

# num_proc is the number of processes that tokenize the dataset in parallel
tokenized_dataset = new_dataset.map(lambda x: tokenizer(x["text"]), num_proc=32)

Map (num_proc=32):   0%|          | 0/49767 [00:00<?, ? examples/s]

In [44]:
# training of the model requires a data collator
# which creates a random set of tokens to mask
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
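
The following sketch illustrates what such a collator does to one sequence. It is a simplified re-implementation for illustration, not the code of DataCollatorForLanguageModeling. It assumes the commonly described 80/10/10 rule (of the selected tokens, 80% become <mask>, 10% become a random token, 10% stay unchanged), the label -100 for positions that the loss ignores, and special-token ids that follow the order of the special_tokens list used when training the tokenizer.

```python
import random

# ASSUMPTION: special-token ids follow the order of the special_tokens list
# passed to tokenizer.train, i.e. <s>=0, <pad>=1, </s>=2, <unk>=3, <mask>=4
BOS_ID, PAD_ID, EOS_ID, UNK_ID, MASK_ID = 0, 1, 2, 3, 4
SPECIAL_IDS = {BOS_ID, PAD_ID, EOS_ID, UNK_ID, MASK_ID}
VOCAB_SIZE = 52_000
IGNORE_INDEX = -100  # label value for positions that do not contribute to the loss

def mask_tokens(ids, mlm_probability=0.15, rng=random):
    """Return (inputs, labels) for one sequence of token ids."""
    inputs = list(ids)
    labels = [IGNORE_INDEX] * len(ids)
    for i, token_id in enumerate(ids):
        # never select special tokens; select the others with probability mlm_probability
        if token_id in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id  # the model has to recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID  # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return inputs, labels

# token ids of "Santa Claus is coming to town" from the tokenizer cell, wrapped in <s> ... </s>
ids = [BOS_ID, 55, 401, 69, 4628, 1448, 384, 1739, 283, 1891, EOS_ID]
inputs, labels = mask_tokens(ids, rng=random.Random(0))

print(inputs)
print(labels)
```

Because the selection is random, every epoch sees a different set of masked positions for the same line, which matters when we train for many epochs on a small text.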

In [45]:
# now, we can train the model
# by creating the trainer
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./brownieBERTa",
    overwrite_output_dir=True,
    num_train_epochs=50,
    per_device_train_batch_size=256,
    save_steps=1_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset['train'],
)

# start the training process by calling the train method
trainer.train()



| Step | Training Loss |
|------|---------------|
| 500  | 6.821  |
| 1000 | 5.7148 |
| 1500 | 5.1688 |
| 2000 | 4.8126 |
| 2500 | 4.5676 |
| 3000 | 4.4126 |
| 3500 | 4.2795 |
| 4000 | 4.1688 |
| 4500 | 4.0595 |
| 5000 | 3.9763 |


TrainOutput(global_step=9750, training_loss=4.277055413661859, metrics={'train_runtime': 2103.8946, 'train_samples_per_second': 1182.735, 'train_steps_per_second': 4.634, 'total_flos': 1.5609333240384e+16, 'train_loss': 4.277055413661859, 'epoch': 50.0})
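
The numbers in TrainOutput can be checked with a bit of arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is configured in the notebook) and takes the number of examples from the progress bar of the tokenization step.

```python
import math

num_examples = 49_767      # lines in the dataset, from the Map progress bar
batch_size = 256           # per_device_train_batch_size
epochs = 50                # num_train_epochs
train_runtime = 2103.8946  # seconds, from TrainOutput

# the last batch of an epoch is smaller, so we round up
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = num_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 1), round(steps_per_second, 3))
```

The total of 9750 steps matches global_step in the output, which also tells us that the loss values listed above cover only roughly the first half of the training.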

# Save the final model to hard drive

Finally, we save the model to the hard drive.

In [46]:
trainer.save_model("./brownieBERTa")

# Testing the model

Now, let's test the model by predicting one masked token.

In [47]:
# make a prediction
from transformers import pipeline
from pprint import pprint

fill_mask = pipeline(
    "fill-mask",
    model="./brownieBERTa",
    tokenizer="./brownieBERTa"
)

strPredicted = fill_mask("Santa Claus <mask>", top_k=10)

pprint(strPredicted)

[{'score': 0.3422628343105316,
  'sequence': 'Santa Claus.”',
  'token': 415,
  'token_str': '.”'},
 {'score': 0.23653315007686615,
  'sequence': 'Santa Claus.',
  'token': 18,
  'token_str': '.'},
 {'score': 0.17522436380386353,
  'sequence': 'Santa Claus!”',
  'token': 419,
  'token_str': '!”'},
 {'score': 0.04776066541671753,
  'sequence': 'Santa Claus?”',
  'token': 460,
  'token_str': '?”'},
 {'score': 0.04432617872953415,
  'sequence': 'Santa Claus....”',
  'token': 1550,
  'token_str': '....”'},
 {'score': 0.03472210839390755,
  'sequence': 'Santa Claus...”',
  'token': 799,
  'token_str': '...”'},
 {'score': 0.015623221173882484,
  'sequence': 'Santa Claus!...”',
  'token': 2791,
  'token_str': '!...”'},
 {'score': 0.011783414520323277,
  'sequence': 'Santa Claus:',
  'token': 30,
  'token_str': ':'},
 {'score': 0.008496532216668129,
  'sequence': 'Santa Claus?...”',
  'token': 3404,
  'token_str': '?...”'},
 {'score': 0.0035278876312077045,
  'sequence': 'Santa Claus.)',
  'to

## Feature extraction - embeddings

Here, we extract the embeddings from the model for each of the lines that we have in the training set. We also add the line that we used to test the tokenizer, "Santa Claus is coming to town", to the list of lines.

Then we visualize it using t-SNE.

In [48]:
from transformers import pipeline
from pprint import pprint

feature_extraction = pipeline(
    "feature-extraction",
    model="./brownieBERTa",
    tokenizer="./brownieBERTa"
)

strPredicted = feature_extraction("Brownie is good.")

pprint(strPredicted[0][0])

Some weights of RobertaModel were not initialized from the model checkpoint at ./brownieBERTa and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[1.1033674478530884,
 1.2807064056396484,
 -0.7393246293067932,
 -0.75388103723526,
 0.37746304273605347,
 0.5539442896842957,
 -1.6205332279205322,
 0.2173890918493271,
 -0.6585490107536316,
 1.2766588926315308,
 0.23506832122802734,
 -0.861558735370636,
 0.40749630331993103,
 2.3658220767974854,
 -0.3706724941730499,
 0.7339725494384766,
 0.022288169711828232,
 0.45131033658981323,
 -0.363322913646698,
 -1.8899754285812378,
 1.5935488939285278,
 -1.404867172241211,
 -1.4151560068130493,
 0.14845840632915497,
 0.4643068015575409,
 -1.8416754007339478,
 -1.6816915273666382,
 1.4302928447723389,
 0.06566373258829117,
 -0.3594614565372467,
 2.1859748363494873,
 1.8484517335891724,
 -0.6307582855224609,
 -1.6817268133163452,
 -0.06582394242286682,
 -1.5709772109985352,
 -0.7745420932769775,
 0.049157850444316864,
 -0.5926477909088135,
 -0.5225352644920349,
 -1.5673067569732666,
 0.8782007098197937,
 0.24394680559635162,
 -0.4026496708393097,
 -0.25952962040901184,
 -2.838500738143921,
 0.
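
The indexing [0][0] used above deserves a short explanation. The feature-extraction pipeline returns a nested list with one entry per input text, then one vector per token, so [0][0] picks the vector of the first token of the first text. The sketch below uses a small made-up nested list in place of the real pipeline output and also shows mean pooling over all tokens, which is a common alternative and not what this notebook uses.

```python
# hypothetical stand-in for the pipeline output: 1 input text, 3 tokens, 4 dimensions
output = [[[1.0, 2.0, 3.0, 4.0],
           [0.0, 2.0, 0.0, 2.0],
           [2.0, 2.0, 0.0, 0.0]]]

token_vectors = output[0]       # all token vectors of the first (and only) text
first_token = token_vectors[0]  # what the notebook uses: output[0][0]

# alternative: average every dimension over all tokens
mean_pooled = [sum(dim) / len(token_vectors) for dim in zip(*token_vectors)]

print(len(token_vectors), len(first_token))
print(first_token)
print(mean_pooled)
```

With the real model, the first token is the start-of-sequence token, so its vector acts as a summary of the whole line.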

In [2]:
# now, go through the file war_and_peace.txt and extract features for each of the lines

# we will use the same tokenizer as before
# but we will use the feature extraction pipeline
# to extract features for each of the lines in the file
from transformers import pipeline

feature_extraction = pipeline(
    "feature-extraction",
    model="./brownieBERTa",
    tokenizer="./brownieBERTa"
)

# read the file with all the lines used to train the model
with open('./war_and_peace.txt', 'r') as file:
    data = file.readlines()

data.append("Santa Claus is coming to town")

# extract features for each of the lines
from tqdm import tqdm

features = [feature_extraction(line)[0][0] for line in tqdm(data)]

Some weights of RobertaModel were not initialized from the model checkpoint at ./brownieBERTa and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 49768/49768 [10:30<00:00, 78.87it/s] 


In [3]:
# convert features to a dataframe
import pandas as pd 

df = pd.DataFrame(features)

df['type'] = 0

df.index = data

# if the index is "Santa Claus is coming to town" then the type is 1
df.loc["Santa Claus is coming to town", 'type'] = 1


In [None]:
# keep only the embedding columns; columns= already selects the axis
dfFeatures = df.drop(columns=['type'])
dfFeatures.dropna(inplace=True)

dfFeatures.to_csv('features.csv')

In [None]:
# now visualize the features
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(perplexity):
    # initialize the t-SNE
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)

    # fit the t-SNE
    X_embedded = tsne.fit_transform(dfFeatures)

    # Create a color map based on the values in the 'type' column
    colors = ['blue' if val == 0.0 else 'red' for val in df['type']]

    # plot the t-SNE
    plt.figure(figsize=(18, 12))
    plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=colors)
    #plt.xlabel('Component 1', fontsize=20)
    #plt.ylabel('Component 2', fontsize=20)
    #plt.xticks(fontsize=16)
    #plt.yticks(fontsize=16)

    # Format the tick labels to 3 decimal places
    #plt.gca().xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.3f}'))
    #plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.3f}'))

    plt.savefig(f'./img/war_and_peace_tsne_{perplexity}.png')

In [None]:
from tqdm import tqdm

for i in tqdm(range(1, 5)):
    visualize(i)



  0%|          | 0/4 [00:00<?, ?it/s]

## Fine tuning

In this part of the tutorial, we use the embeddings from our model to predict whether a line mentions Santa. We use a kNN classifier for that purpose.

There are two steps that we need to do:
1) update the column type of the dataframe df with the information about whether a line contains the word "Santa"
2) train and validate the kNN classifier for that task

In [53]:
# now update the type column of df
# if the index contains "Santa" then the type is 2.

df.loc[df.index.str.contains("Santa"), 'type'] = 2

In [54]:
# now train kNN to predict the type of the line
from sklearn.neighbors import KNeighborsClassifier

# initialize the kNN
knn = KNeighborsClassifier(n_neighbors=3)

# fit the kNN
knn.fit(dfFeatures, df['type'])

# predict the type of the line
knn.predict(dfFeatures)

# now validate the predictions
from sklearn.metrics import accuracy_score

fAccuracy = accuracy_score(df["type"], knn.predict(dfFeatures))

print(f'Accuracy of the training set: {fAccuracy}')

# now predict the type of the line "Is he coming? "

# extract features for the line
oneLineFeatures = feature_extraction("Is he coming? ")[0][0]

# convert features to a dataframe
dfLine = pd.DataFrame([oneLineFeatures])

# predict the type of the line
iType = knn.predict(dfLine)

print(f'The type of the line is: {iType}')



Accuracy of the training set: 0.9999799067674008
The type of the line is: [0]
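
The accuracy above is measured on the same lines that the classifier was fitted on, and almost all lines have type 0, so the number is hard to interpret. It is consistent with a single misclassified line out of 49768. A more informative check holds out part of the data and compares the result with a baseline that always predicts the most frequent label. The sketch below shows the idea with a tiny hand-written kNN and made-up 2-D points instead of the real embeddings; the function knn_predict and the data are illustrative only.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# hypothetical 2-D stand-ins for the embeddings
# label 0 = any other line, label 2 = line that contains "Santa"
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2), (0.0, 0.3),
     (2.0, 2.1), (2.2, 1.9), (1.9, 2.2), (2.1, 2.0)]
y = [0, 0, 0, 0, 0, 0, 2, 2, 2, 2]

# hold out three points for validation; the rest is used for fitting
test_idx = [2, 5, 8]
train_X = [x for i, x in enumerate(X) if i not in test_idx]
train_y = [label for i, label in enumerate(y) if i not in test_idx]
test_X = [X[i] for i in test_idx]
test_y = [y[i] for i in test_idx]

predictions = [knn_predict(train_X, train_y, x) for x in test_X]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)

# baseline: always predict the most frequent training label
majority = Counter(train_y).most_common(1)[0][0]
baseline = sum(t == majority for t in test_y) / len(test_y)

print(f'Accuracy on held-out points: {accuracy}')
print(f'Majority-class baseline: {baseline:.3f}')
```

With the real data, the same comparison could be done with the train/test utilities of scikit-learn; the point is that the classifier is only useful if it beats the baseline on lines it has not seen.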


# Summary

In this tutorial, we've learned how to train a simple transformer network. We took the RoBERTa architecture from the Hugging Face transformers library, then we used the text in war_and_peace.txt to train it.

The result of this training is a network that can suggest a missing token in a sentence and provide embeddings for downstream tasks.

If you want to dive deeper into this topic, please do the following exercises:
1. For the fill-mask pipeline, change the predicted string to int main(int argc, <mask> *argv); observe whether the network can predict it; reflect on the result.
2. Reduce the number of training epochs to 1, train the network and check the suggestions; do the same for 10 epochs, 20, etc. Observe and reflect upon the quality of the suggestions.
3. Go through the training set and add declarations that you are interested in, e.g., more variations of int main(....); train the network with 20-40 epochs and test your predictions. Observe how the suggestions change when the network has similar data in the training set and when it does not.
4. For the tokenizer and the model configuration, I used a vocabulary size of 52_000; please change that to 100, train the network and observe what happens. You can even go further and change this vocabulary size to 10 and do the same.
5. When training the tokenizer, I used the parameter min_frequency=2, please change that to 10, then go through the entire process and see how that impacts results. Do the same for min_frequency=1. Which value gives the best prediction?
6. Go through the code for visualization. Change the perplexity parameter of the t-SNE. Plot a few diagrams, reflect on how this parameter helps in understanding the data. 