# Masked Language Modeling

This notebook shows how to pre-train your own AntiBERTa model with the HuggingFace framework. As a demo, we've included the tokenizer we used and 1% of the sequences from the training, validation, and test sets described in the paper.

## Setup

In [1]:
# Imports
from transformers import (
    RobertaConfig,
    RobertaTokenizer,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import os

In [2]:
# Initialise the tokeniser
tokenizer = RobertaTokenizer.from_pretrained(
    "../antiberta/antibody-tokenizer"
)

# Initialise the data collator, which is necessary for batching
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizer'.
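
To make the collator's role concrete, here is a minimal pure-Python sketch of the masking rule that `DataCollatorForLanguageModeling` applies with `mlm=True`: each non-special token is selected with probability `mlm_probability`; of the selected tokens, 80% become the mask token, 10% become a random token, and 10% stay unchanged. The helper `mask_tokens` and the token ids below are hypothetical and exist only for illustration. The real collator works on batched tensors and takes the special-token ids from the tokenizer.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def mask_tokens(input_ids, special_mask, mask_id, vocab_size,
                mlm_probability=0.15, rng=random):
    """Return (corrupted_ids, labels) for one tokenized sequence."""
    corrupted = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, (tok, is_special) in enumerate(zip(input_ids, special_mask)):
        # Special tokens (start, end, padding) are never selected
        if is_special or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return corrupted, labels


# Hypothetical ids: 0 = start, 2 = end, 1 = padding, 4 = mask
ids = [0, 7, 12, 9, 21, 15, 8, 2, 1, 1]
special = [1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
corrupted, labels = mask_tokens(ids, special, mask_id=4, vocab_size=25,
                                rng=random.Random(0))
print(corrupted)
print(labels)
```

Positions whose label is `-100` do not contribute to the loss, so the model is only scored on the tokens that were selected for masking.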


## Text data preprocessing

In [3]:
text_datasets = {
    "train": ['../antiberta/assets/train-slice.txt'],
    "eval": ['../antiberta/assets/val-slice.txt'],
    "test": ['../antiberta/assets/test-slice.txt']
}

dataset = load_dataset("text", data_files=text_datasets)
tokenized_dataset = dataset.map(
    lambda z: tokenizer(
        z["text"],
        padding="max_length",
        truncation=True,
        max_length=150,
        return_special_tokens_mask=True,
    ),
    batched=True,
    num_proc=1,
    remove_columns=["text"],
)

Using custom data configuration default-4099dca2205c4257
Found cached dataset text (/Users/joseph/.cache/huggingface/datasets/text/default-4099dca2205c4257/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)



Loading cached processed dataset at /Users/joseph/.cache/huggingface/datasets/text/default-4099dca2205c4257/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2/cache-5c6d18c0ea3c8c1f.arrow
Loading cached processed dataset at /Users/joseph/.cache/huggingface/datasets/text/default-4099dca2205c4257/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2/cache-85b0fa99574ce78c.arrow
Loading cached processed dataset at /Users/joseph/.cache/huggingface/datasets/text/default-4099dca2205c4257/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2/cache-42f591da6c4e4415.arrow


## Model configuration

In [4]:
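import torch
# Prefer Apple's MPS backend; fall back to the CPU on machines without it
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")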

In [5]:
# These are the configurations we used for pre-training.
antiberta_config = {
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_size": 768,
    "d_ff": 3072,
    "vocab_size": 25,
    "max_len": 150,
    "max_position_embeddings": 152,
    "batch_size": 96,
    "max_steps": 225000,
    "weight_decay": 0.01,
    "peak_learning_rate": 0.0001,
}
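
One plausible reading of why `max_position_embeddings` is `max_len + 2`: RoBERTa-style models number non-padding tokens starting at `padding_idx + 1` and reserve position `padding_idx` for padding. The sketch below mimics that numbering in pure Python. It assumes the default `RobertaConfig` value `pad_token_id=1`, which this notebook does not override; `roberta_position_ids` is a hypothetical helper, not a library function.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Mimic how RoBERTa-style models assign position ids.

    Non-padding tokens are numbered from padding_idx + 1 upwards;
    every padding token shares position padding_idx.
    """
    position_ids, seen = [], 0
    for tok in input_ids:
        if tok == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


max_len = 150
full_sequence = [5] * max_len  # hypothetical token ids, no padding
positions = roberta_position_ids(full_sequence)
print(min(positions), max(positions))  # 2 151
print(max(positions) + 1)              # 152 embedding rows are needed
```

Under that assumption, a sequence that fills all 150 slots uses position ids up to 151, so the position embedding table needs 152 rows.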

In [6]:
# Initialise the model
model_config = RobertaConfig(
    vocab_size=antiberta_config.get("vocab_size"),
    hidden_size=antiberta_config.get("hidden_size"),
    # d_ff is the feed-forward width; 3072 is also the RobertaConfig default
    intermediate_size=antiberta_config.get("d_ff", 3072),
    max_position_embeddings=antiberta_config.get("max_position_embeddings"),
    num_hidden_layers=antiberta_config.get("num_hidden_layers", 12),
    num_attention_heads=antiberta_config.get("num_attention_heads", 12),
    type_vocab_size=1,
)
model = RobertaForMaskedLM(model_config).to(device)
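
As a sanity check on the configuration, the parameter count can be worked out by hand. The sketch below assumes the standard `RobertaForMaskedLM` layout: no pooler, and an LM head whose decoder weights are tied to the word embeddings, so only its bias adds parameters. The function `roberta_mlm_param_count` is a hypothetical helper written for this illustration.

```python
def roberta_mlm_param_count(vocab_size, hidden_size, num_layers,
                            intermediate_size, max_position_embeddings,
                            type_vocab_size=1):
    """Count trainable parameters of a RoBERTa masked-LM by hand."""
    h, ff = hidden_size, intermediate_size
    layer_norm = 2 * h  # scale and shift vectors

    # Word, position and token-type embeddings, followed by a LayerNorm
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm

    # Query, key, value and output projections (weights + biases), then LayerNorm
    attention = 4 * (h * h + h) + layer_norm
    # Two feed-forward projections (weights + biases), then LayerNorm
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    encoder = num_layers * (attention + feed_forward)

    # LM head: dense layer, LayerNorm and output bias (decoder weights are tied)
    lm_head = (h * h + h) + layer_norm + vocab_size
    return embeddings + encoder + lm_head


total = roberta_mlm_param_count(
    vocab_size=25, hidden_size=768, num_layers=12,
    intermediate_size=3072, max_position_embeddings=152,
)
print(total)  # 85784857
```

The result agrees with the `Number of trainable parameters = 85784857` line in the training log. The number of attention heads does not enter the count, because the heads only partition the same projection matrices.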

In [7]:
# Construct the training arguments
# HuggingFace uses a default seed of 42
args = TrainingArguments(
    output_dir="test",
    overwrite_output_dir=True,
    per_device_train_batch_size=antiberta_config.get("batch_size", 32),
    per_device_eval_batch_size=antiberta_config.get("batch_size", 32),
    max_steps=225000,
    save_steps=2500,
    logging_steps=2500,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    weight_decay=0.01,
    warmup_steps=10000,
    learning_rate=1e-4,
    gradient_accumulation_steps=antiberta_config.get("gradient_accumulation_steps", 1),
    # fp16=True,
    evaluation_strategy="steps",
    seed=42
)
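
The arguments above do not set a scheduler, so the `Trainer` falls back to its default linear schedule: the learning rate ramps up linearly over `warmup_steps` and then decays linearly to zero at `max_steps`. Here is a minimal sketch of that curve, with `linear_warmup_decay` as a hypothetical helper rather than a library function:

```python
def linear_warmup_decay(step, peak_lr=1e-4, warmup_steps=10_000, max_steps=225_000):
    """Learning rate after `step` optimizer steps under linear warmup and decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / max(1, max_steps - warmup_steps)


for step in (5_000, 10_000, 217_500, 220_000, 222_500, 225_000):
    print(step, linear_warmup_decay(step))
```

The values at steps 217500, 220000 and 222500 come out at roughly 3.49e-06, 2.33e-06 and 1.16e-06, which is consistent with the `learning_rate` entries in the training log further down.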

## Setup of the HuggingFace Trainer

In [8]:
trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["eval"]
)

max_steps is given, it will override any value given in num_train_epochs


In [9]:
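# Resumes from the latest checkpoint in output_dir; call trainer.train() on a first run, when no checkpoint exists yet
trainer.train(resume_from_checkpoint=True)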

Loading model from test/checkpoint-215000.
The following columns in the training set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3
  Num Epochs = 225000
  Instantaneous batch size per device = 96
  Total train batch size (w. parallel, distributed & accumulation) = 96
  Gradient Accumulation steps = 1
  Total optimization steps = 225000
  Number of trainable parameters = 85784857
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 215000
  Continuing training from global step 215000
  Will skip the first 215000 epochs then the first 0 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your 


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 3.488372093023256e-06, 'epoch': 217500.0}



Saving model checkpoint to test/checkpoint-217500
Configuration saved in test/checkpoint-217500/config.json


{'eval_loss': nan, 'eval_runtime': 0.2258, 'eval_samples_per_second': 13.285, 'eval_steps_per_second': 4.428, 'epoch': 217500.0}


Model weights saved in test/checkpoint-217500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 2.325581395348837e-06, 'epoch': 220000.0}



Saving model checkpoint to test/checkpoint-220000
Configuration saved in test/checkpoint-220000/config.json


{'eval_loss': nan, 'eval_runtime': 0.2666, 'eval_samples_per_second': 11.251, 'eval_steps_per_second': 3.75, 'epoch': 220000.0}


Model weights saved in test/checkpoint-220000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 1.1627906976744186e-06, 'epoch': 222500.0}



Saving model checkpoint to test/checkpoint-222500
Configuration saved in test/checkpoint-222500/config.json


{'eval_loss': nan, 'eval_runtime': 0.2304, 'eval_samples_per_second': 13.02, 'eval_steps_per_second': 4.34, 'epoch': 222500.0}


Model weights saved in test/checkpoint-222500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 225000.0}



Saving model checkpoint to test/checkpoint-225000
Configuration saved in test/checkpoint-225000/config.json


{'eval_loss': nan, 'eval_runtime': 0.2206, 'eval_samples_per_second': 13.596, 'eval_steps_per_second': 4.532, 'epoch': 225000.0}


Model weights saved in test/checkpoint-225000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 7336.9904, 'train_samples_per_second': 2943.986, 'train_steps_per_second': 30.667, 'train_loss': 0.0, 'epoch': 225000.0}


TrainOutput(global_step=225000, training_loss=0.0, metrics={'train_runtime': 7336.9904, 'train_samples_per_second': 2943.986, 'train_steps_per_second': 30.667, 'train_loss': 0.0, 'epoch': 225000.0})

In [10]:
trainer.save_model('../antiberta/saved_model')

Saving model checkpoint to ../antiberta/saved_model
Configuration saved in ../antiberta/saved_model/config.json
Model weights saved in ../antiberta/saved_model/pytorch_model.bin


In [11]:
# Predict MLM performance on the test dataset
out = trainer.predict(tokenized_dataset['test'])

The following columns in the test set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3
  Batch size = 96



In [12]:
print(out)

PredictionOutput(predictions=array([[[-0.44890258,  0.        ,  0.39918557, ..., -0.67570436,
         -0.19572093, -0.6968133 ],
        [-0.21400794,  0.        ,  0.10926522, ..., -1.2135879 ,
          0.14994998, -0.8182367 ],
        [-0.2807524 ,  0.        ,  0.37963933, ..., -1.0091581 ,
          0.21263681, -0.83665437],
        ...,
        [-0.5694531 ,  0.        ,  0.06545924, ..., -0.8455439 ,
          0.14736086, -0.55978304],
        [-0.5694531 ,  0.        ,  0.06545924, ..., -0.8455439 ,
          0.14736086, -0.55978304],
        [-0.5694531 ,  0.        ,  0.06545924, ..., -0.8455439 ,
          0.14736086, -0.55978304]],

       [[-0.47246504,  0.        ,  0.4012822 , ..., -0.6577634 ,
         -0.21133558, -0.71489227],
        [-0.24963681,  0.        ,  0.10322255, ..., -1.1950003 ,
          0.12808968, -0.84429336],
        [-0.31693763,  0.        ,  0.39187625, ..., -0.9945083 ,
          0.17764995, -0.8599656 ],
        ...,
        [-0.59228003,  0.
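
The printed `PredictionOutput` holds raw logits of shape (sequences, positions, vocabulary), which are hard to read directly. One way to summarise them is the accuracy over masked positions only, skipping every position labelled `-100`. The helper `masked_accuracy` below is a hypothetical sketch in plain Python, and the toy logits are invented for illustration:

```python
IGNORE_INDEX = -100  # label used for positions that were not masked


def masked_accuracy(logits, label_ids):
    """Fraction of masked positions where the top-scoring token is the label.

    logits:    nested sequence indexed as [sequence][position][vocab]
    label_ids: nested sequence indexed as [sequence][position]
    """
    correct = total = 0
    for seq_logits, seq_labels in zip(logits, label_ids):
        for token_logits, label in zip(seq_logits, seq_labels):
            if label == IGNORE_INDEX:
                continue  # unmasked position, not scored
            predicted = max(range(len(token_logits)), key=lambda j: token_logits[j])
            correct += int(predicted == label)
            total += 1
    return correct / total if total else float("nan")


# Toy example: one sequence, three positions, vocabulary of three tokens
toy_logits = [[[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 3.0]]]
toy_labels = [[1, IGNORE_INDEX, 0]]
print(masked_accuracy(toy_logits, toy_labels))  # 0.5
```

With the objects from this notebook, the call would be `masked_accuracy(out.predictions, out.label_ids)`, assuming `label_ids` was returned alongside the predictions. Because the collator draws a fresh random mask on every pass, the figure varies slightly between runs.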
