# Training your own model

In this notebook, we train our own model. As training data, we use an example file, `singletons_local.txt`, which contains autogenerated source code of singletons. The content of the singletons is not useful in itself, but their structure is.

After successful training, we should be able to use the fill-mask pipeline to fill in the structure of a singleton.

The training process is done in the following steps:
1. Training the tokenizer
2. Creating a vanilla (untrained) RoBERTa model from a configuration
3. Training the model on the singletons
4. Saving the model


In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
# just checking if CUDA is available on this computer
import torch

torch.cuda.is_available()

In [None]:
# We use the standard BPE tokenizer for this workbook
# it was described in the previous chapter of the book
# when we discussed feature extraction
from tokenizers import ByteLevelBPETokenizer

paths = ['./singletons_local.txt']

In [None]:
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

print('Training tokenizer...')

# Customize training
# we use a large vocabulary size, although about 10_000 would also be enough
tokenizer.train(files=paths, 
                vocab_size=52_000, 
                min_frequency=2, 
                special_tokens=["<s>","<pad>","</s>","<unk>","<mask>",])
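
To make the `min_frequency` argument more concrete, here is a toy sketch of a single BPE merge step in plain Python. It is not the implementation used by the `tokenizers` library; the function names and the tiny corpus are hypothetical and serve only as an illustration.

```python
from collections import Counter


def most_frequent_pair(words, min_frequency=2):
    """Return the most frequent adjacent symbol pair, or None if it is too rare."""
    pairs = Counter()
    for symbols, count in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += count
    if not pairs:
        return None
    pair, frequency = pairs.most_common(1)[0]
    return pair if frequency >= min_frequency else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, count in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + count
    return merged


# Hypothetical toy corpus: words split into characters, with their counts.
corpus = {("g", "e", "t"): 4, ("s", "e", "t"): 3, ("i", "n", "t"): 1}
pair = most_frequent_pair(corpus)              # the pair ("e", "t") occurs 7 times
corpus_after_merge = merge_pair(corpus, pair)  # "e" and "t" become the symbol "et"
```

Training repeats such merge steps until the vocabulary reaches `vocab_size` or no pair is frequent enough, which is why a small corpus may end up with far fewer than 52,000 tokens.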

In [None]:
import os

# we name this model singletonBERT
# because it is a RoBERTa-style model trained on the singleton source code
token_dir = './singletonBERT'

# create the output directory if it does not exist yet
os.makedirs(token_dir, exist_ok=True)

# save the vocabulary and the merges of the trained tokenizer
tokenizer.save_model(token_dir)

In [None]:
from tokenizers.processors import BertProcessing

# let's make sure that the tokenizer does not provide more tokens than we expect
# we expect at most 510 content tokens, because the model accepts 512 positions
# and two of them are taken by the special tokens <s> and </s>
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
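
The relation between the 512-token limit and the 510 content tokens can be sketched in plain Python. This is only an illustration of the arithmetic, not the code of the `tokenizers` library. The function name is hypothetical, and the ids 0 for `<s>` and 2 for `</s>` are an assumption based on the order of `special_tokens` in the training call above.

```python
def add_special_tokens_and_truncate(token_ids, max_length=512, bos_id=0, eos_id=2):
    """Keep at most max_length - 2 content tokens and wrap them in <s> ... </s>."""
    budget = max_length - 2  # two positions are reserved for <s> and </s>
    return [bos_id] + list(token_ids[:budget]) + [eos_id]


# 600 hypothetical content token ids, more than the model can accept
long_input = list(range(100, 700))
encoded = add_special_tokens_and_truncate(long_input)

# a short input is not truncated, it only gets the two special tokens
short_encoded = add_special_tokens_and_truncate([100, 101, 102])
```

Here `encoded` has exactly 512 entries, of which 510 come from the input.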

In [None]:
# import the RoBERTa configuration
from transformers import RobertaConfig

# initialize the configuration
# please note that the vocab size is the same as the one in the tokenizer. 
# if it is not, we could get exceptions that the model and the tokenizer are not compatible
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
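
The value `max_position_embeddings=514` may look odd next to the 512-token limit of the tokenizer. The sketch below shows where the two extra positions come from, assuming that position ids are assigned as in the Hugging Face RoBERTa implementation: padding tokens keep the padding index, and real tokens are numbered from the padding index plus one. The function name is hypothetical, and the padding id 1 is an assumption based on the order of our special tokens.

```python
def roberta_style_position_ids(input_ids, padding_idx=1):
    """Number non-padding tokens from padding_idx + 1; padding keeps padding_idx."""
    position_ids = []
    seen = 0  # number of non-padding tokens seen so far
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


# <s>, 510 content tokens (hypothetical id 10), </s>: a full 512-token input
full_sequence = [0] + [10] * 510 + [2]
positions = roberta_style_position_ids(full_sequence)

largest_position = max(positions)
required_embeddings = largest_position + 1  # position ids are zero-based indices
```

For a full input, the largest position id is 513, so the embedding table needs 514 rows.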

In [None]:
# Initializing a Model From Scratch
from transformers import RobertaForMaskedLM

# initialize the model
model = RobertaForMaskedLM(config=config)

In [None]:
# before we actually train the model,
# we need to wrap the tokenizer that we trained
# in the tokenizer class that the model expects,
# so we load it from the saved files as a RobertaTokenizer
from transformers import RobertaTokenizer

# initialize the tokenizer from the files saved above
tokenizer = RobertaTokenizer.from_pretrained("./singletonBERT", model_max_length=512)

# please note that if we used the ByteLevelBPETokenizer that we trained before
# (the vanilla version) instead, we would get an exception
# that the BPE tokenizer is not callable

In [None]:
# we load the training data with the datasets library;
# the "text" loader treats every line of the file as one example
from datasets import load_dataset

new_dataset = load_dataset("text", data_files='singletons_local.txt')

In [None]:
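# now, let's tokenize the dataset
# truncation keeps every example within the 512 positions that the model supports
# (map() also accepts the num_proc argument to spread the work over several cores)
tokenized_dataset = new_dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512)
)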

In [None]:
# training of the model requires a data collator,
# which randomly selects the tokens to mask in every batch (15% of them here)
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
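
What the data collator does to each example can be sketched in plain Python. This is a simplified illustration, not the code of `DataCollatorForLanguageModeling`. It assumes the usual BERT-style scheme in which 80% of the selected tokens are replaced with `<mask>`, 10% with a random token, and 10% are left unchanged. The function name, the `seed` parameter, and the token ids are hypothetical.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, seed=0):
    """Return (masked input, labels) for one example of masked language modeling."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions ignored by the loss
    for i, token_id in enumerate(input_ids):
        # special tokens are never masked; other tokens are selected at random
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


# <s>, 998 hypothetical content tokens, </s>
example = [0] + [50 + (i % 200) for i in range(998)] + [2]
masked, labels = mask_tokens(example, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2, 3, 4})
num_selected = sum(label != -100 for label in labels)
```

Because the selection is random, every epoch sees a different set of masked positions for the same line of code.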

In [None]:
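# now, we can train the model
# by creating the trainer
# (the Trainer relies on the accelerate package,
# so importing it here fails early if it is not installed)
import accelerate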
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./singletonBERT",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=32,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset['train'],
)
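
To get a feeling for how long the training runs and how many checkpoints it writes, we can estimate the number of optimization steps from the arguments above. The sketch assumes a single device, no gradient accumulation, and that the last incomplete batch is kept. The function name and the dataset sizes are hypothetical.

```python
import math


def training_schedule(num_examples, batch_size=32, epochs=10, save_steps=10_000):
    """Estimate steps per epoch, total steps, and the number of periodic checkpoints."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    checkpoints = total_steps // save_steps  # one checkpoint every save_steps steps
    return steps_per_epoch, total_steps, checkpoints


# hypothetical dataset sizes, chosen only for illustration
large = training_schedule(num_examples=100_000)
small = training_schedule(num_examples=1_000)
```

With only 1,000 training lines, the run finishes long before the first 10,000 steps, so no intermediate checkpoint is written. With `save_total_limit=2`, at most the two most recent checkpoints are kept on disk.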

In [None]:
# here is where we train the model
# which corresponds to the model.fit() method in Keras
# which we used in the previous chapters
trainer.train()

In [None]:
trainer.save_model("./singletonBERT")

In [None]:
from transformers import pipeline

unmasker = pipeline('fill-mask', 
                    model=model, 
                    tokenizer=tokenizer)

unmasker("int x = <mask>")
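
The fill-mask pipeline returns a list of dictionaries, one per candidate token, with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to turn such a list into a short summary. The function name is hypothetical, and the example values are invented to show the structure only; they are not results produced by the model trained above.

```python
def summarize_predictions(predictions, top_k=3):
    """Format the top_k candidates of a fill-mask result as 'token: score' strings."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [f"{p['token_str'].strip()!r}: {p['score']:.2f}" for p in ranked[:top_k]]


# HYPOTHETICAL output, invented for illustration; not produced by our model
example_output = [
    {"score": 0.10, "token": 11, "token_str": " 1", "sequence": "int x = 1"},
    {"score": 0.60, "token": 12, "token_str": " 0", "sequence": "int x = 0"},
    {"score": 0.05, "token": 13, "token_str": " new", "sequence": "int x = new"},
]
summary = summarize_predictions(example_output, top_k=2)
```

In the notebook, the same function could be applied to the list returned by `unmasker(...)`.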

## Where to go from here

So, this is how you can train a model. 

What you can do next:
1. Train your model on your own source code -- dump the code to a single file and train your model on it
    a. Remember that the larger the training set, the better the model, but also the longer it takes to train it
2. Use the model to fill the structure of your code
    a. You can use the fill-mask pipeline to fill the structure of your code
    b. For example, try it on test code: write the structure of a test case and let the model fill in the content of the assert() statement.
3. Train a different model from Hugging Face
    a. Take a look at the file `Models for Programming languages.md` and choose the right one
    b. Remember to choose a masked language model (MLM), not XLM (a cross-lingual model)
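
The first suggestion above, dumping your own code to a single file, can be sketched as follows. This is a minimal example, assuming that the code lives in one directory tree and that one line of code per training example is acceptable, as it is for the `text` loader used in this notebook. The function name and the default file pattern are hypothetical.

```python
from pathlib import Path


def dump_sources(source_dir, output_file, pattern="*.py"):
    """Write all non-empty lines of the matching files into one training file."""
    written = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).rglob(pattern)):
            text = path.read_text(encoding="utf-8", errors="ignore")
            for line in text.splitlines():
                if line.strip():  # skip empty lines, they carry no information
                    out.write(line.rstrip() + "\n")
                    written += 1
    return written  # number of training lines written
```

For example, `dump_sources("./my_project", "my_code.txt", pattern="*.java")` would collect all Java files of a hypothetical project; the resulting file can then replace `singletons_local.txt` in the cells above.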