# Masked Language Modeling

This notebook shows how to pre-train your own AntiBERTa model with the HuggingFace framework. As a demo, we include the tokenizer we used, along with 1% of the sequences from the training, validation, and test sets used in the paper.

## Setup of all the things we need

In [1]:
# Some imports 
from transformers import (
    RobertaConfig,
    RobertaTokenizer,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import os

In [2]:
# Initialise the tokeniser
tokenizer = RobertaTokenizer.from_pretrained(
    "../antiberta/antibody-tokenizer"
)

# Initialise the data collator, which is necessary for batching
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizer'.
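
To make the role of the collator concrete, here is a minimal pure-Python sketch of the masking scheme that `DataCollatorForLanguageModeling` applies with `mlm=True`: each non-special token is selected with probability `mlm_probability`, and a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The token ids and the `mask_tokens` helper are hypothetical and only illustrate the idea; the real collator works on batched tensors.

```python
import random

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_tokens(input_ids, special_tokens_mask, mask_token_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one tokenized sequence."""
    rng = rng or random.Random()
    masked, labels = list(input_ids), []
    for i, (token, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        # Special tokens (padding, start, end) are never selected.
        if is_special or rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # Remaining 10%: keep the original token.
    return masked, labels


# Hypothetical ids: 0 = padding, 1 = start, 2 = end, 4 = mask, 5 and above = residues.
ids = [1, 7, 12, 9, 20, 15, 2, 0, 0]
special = [1, 0, 0, 0, 0, 0, 1, 1, 1]
masked_ids, labels = mask_tokens(ids, special, mask_token_id=4, vocab_size=25,
                                 rng=random.Random(0))
```

Only positions whose label is not `-100` contribute to the loss, which is why the tokenizer is asked for `special_tokens_mask` during preprocessing.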


## Text Data preprocessing

In [3]:
text_datasets = {
    "train": ['../antiberta/assets/train-slice.txt'],
    "eval": ['../antiberta/assets/val-slice.txt'],
    "test": ['../antiberta/assets/test-slice.txt']
}

dataset = load_dataset("text", data_files=text_datasets)
tokenized_dataset = dataset.map(
    lambda z: tokenizer(
        z["text"],
        padding="max_length",
        truncation=True,
        max_length=150,
        return_special_tokens_mask=True,
    ),
    batched=True,
    num_proc=1,
    remove_columns=["text"],
)

Using custom data configuration default-eb16e347bd72efe0
Found cached dataset text (/Users/joseph/.cache/huggingface/datasets/text/default-eb16e347bd72efe0/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /Users/joseph/.cache/huggingface/datasets/text/default-eb16e347bd72efe0/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2/cache-ceb8406901057484.arrow


  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]
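
As a rough illustration of what the `tokenizer(...)` call above returns for each sequence, the sketch below truncates and pads a single antibody sequence to `max_length` and builds the matching `attention_mask` and `special_tokens_mask`. The vocabulary (five special tokens followed by the 20 amino acids), its ordering, and the `encode` helper are assumptions made for this example; the tokenizer shipped with the notebook may assign different ids.

```python
def encode(sequence, vocab, max_length=150):
    """Tokenize one residue per token, then truncate and pad to max_length."""
    # Reserve two positions for the start and end tokens.
    residues = list(sequence)[: max_length - 2]
    ids = (
        [vocab["[CLS]"]]
        + [vocab.get(r, vocab["[UNK]"]) for r in residues]
        + [vocab["[SEP]"]]
    )
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [vocab["[PAD]"]] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        # Start, end, and padding positions are flagged so that they are never masked.
        "special_tokens_mask": [1] + [0] * len(residues) + [1] + [1] * n_pad,
    }


# Hypothetical vocabulary of 25 tokens: 5 special tokens plus the 20 amino acids.
SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
vocab = {tok: i for i, tok in enumerate(SPECIAL + list("ACDEFGHIKLMNPQRSTVWY"))}

example = encode("EVQLVESGGGLVQPGG", vocab)
```

Every field has length 150 regardless of the input length, so the examples can be stacked into fixed-size batches.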

## Model configuration

In [17]:
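import torch
# Prefer Apple's MPS backend, then CUDA, and fall back to the CPU
device_name = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device(device_name)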

In [18]:
# These are the configurations we've used for pre-training.
antiberta_config = {
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "hidden_size": 768,
    "d_ff": 3072,
    "vocab_size": 25,
    "max_len": 150,
    "max_position_embeddings": 152,
    "batch_size": 96,
    "max_steps": 225000,
    "weight_decay": 0.01,
    "peak_learning_rate": 0.0001,
}
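
A note on why `max_position_embeddings` is 152 while `max_len` is 150: RoBERTa-style models in HuggingFace number positions starting from `padding_idx + 1`, so a full-length sequence needs `max_len + padding_idx + 1` position embeddings. The sketch below re-implements that numbering in plain Python, assuming the `RobertaConfig` default of `pad_token_id=1`; the function name is ours and this is not the library code itself.

```python
def create_position_ids(input_ids, padding_idx=1):
    """Number non-padding tokens from padding_idx + 1; padding keeps padding_idx."""
    position_ids, seen = [], 0
    for token in input_ids:
        if token == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


# A sequence of 150 non-padding tokens (the id 5 is arbitrary).
full_length = create_position_ids([5] * 150)
required_embeddings = max(full_length) + 1  # positions are zero-indexed
```

Here `required_embeddings` comes out as 152, matching the value in `antiberta_config`.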

In [19]:
# Initialise the model
model_config = RobertaConfig(
    vocab_size=antiberta_config.get("vocab_size"),
    hidden_size=antiberta_config.get("hidden_size"),
    # Pass the feed-forward width explicitly; 3072 is also the RobertaConfig default
    intermediate_size=antiberta_config.get("d_ff", 3072),
    max_position_embeddings=antiberta_config.get("max_position_embeddings"),
    num_hidden_layers=antiberta_config.get("num_hidden_layers", 12),
    num_attention_heads=antiberta_config.get("num_attention_heads", 12),
    type_vocab_size=1,
)
model = RobertaForMaskedLM(model_config).to(device)
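
As a sanity check on the model size, the parameter count can be worked out by hand from the configuration. The sketch below assumes the usual `RobertaForMaskedLM` layout: no pooling layer, and an LM head whose decoder weight is tied to the word embeddings, so only its bias is counted. The `count_parameters` helper is ours, not part of `transformers`.

```python
def count_parameters(vocab_size, hidden_size, max_position_embeddings,
                     num_hidden_layers, intermediate_size, type_vocab_size=1):
    """Count trainable parameters of a RoBERTa-style masked language model."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    layer_norm = 2 * hidden_size  # scale and shift

    # Word, position, and token-type embeddings, followed by a LayerNorm.
    embeddings = (
        (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
        + layer_norm
    )
    # Query, key, value, and output projections, followed by a LayerNorm.
    attention = 4 * linear(hidden_size, hidden_size) + layer_norm
    # Two-layer feed-forward block, followed by a LayerNorm.
    feed_forward = (
        linear(hidden_size, intermediate_size)
        + linear(intermediate_size, hidden_size)
        + layer_norm
    )
    encoder = num_hidden_layers * (attention + feed_forward)
    # LM head: dense layer, LayerNorm, and a per-token bias (decoder weight is tied).
    lm_head = linear(hidden_size, hidden_size) + layer_norm + vocab_size
    return embeddings + encoder + lm_head


n_params = count_parameters(
    vocab_size=25,
    hidden_size=768,
    max_position_embeddings=152,
    num_hidden_layers=12,
    intermediate_size=3072,
)
```

With these values `n_params` is 85,784,857, which agrees with the `Number of trainable parameters` line printed by the Trainer in this notebook.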

In [20]:
# Construct the training arguments
# HuggingFace uses a default seed of 42
args = TrainingArguments(
    output_dir="test",
    overwrite_output_dir=True,
    per_device_train_batch_size=antiberta_config.get("batch_size", 32),
    per_device_eval_batch_size=antiberta_config.get("batch_size", 32),
    max_steps=225000,
    save_steps=2500,
    logging_steps=2500,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    weight_decay=0.01,
    warmup_steps=10000,
    learning_rate=1e-4,
    gradient_accumulation_steps=antiberta_config.get("gradient_accumulation_steps", 1),
    # fp16=True,
    evaluation_strategy="steps",
    seed=42
)

using `logging_steps` to initialize `eval_steps` to 2500
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
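
The arguments above do not set `lr_scheduler_type`, so we assume the `TrainingArguments` default of a linear schedule: the learning rate ramps up linearly over `warmup_steps` and then decays linearly to zero at `max_steps`. A minimal sketch of that schedule, with a function name of our own choosing:

```python
def linear_schedule_lr(step, peak_lr=1e-4, warmup_steps=10_000, max_steps=225_000):
    """Learning rate after `step` optimizer updates, with linear warmup and decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max_steps - step
    return peak_lr * max(0.0, remaining / max(1, max_steps - warmup_steps))


# Evaluate the schedule at a few logging steps.
schedule = {step: linear_schedule_lr(step) for step in (2_500, 10_000, 12_500, 225_000)}
```

The values at steps 2,500, 10,000, and 12,500 match the `learning_rate` entries in the training log of this notebook (2.5e-05, 0.0001, and roughly 9.88e-05).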


## Setup of the HuggingFace Trainer

In [21]:
trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["eval"]
)



max_steps is given, it will override any value given in num_train_epochs


In [22]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3
  Num Epochs = 225000
  Instantaneous batch size per device = 96
  Total train batch size (w. parallel, distributed & accumulation) = 96
  Gradient Accumulation steps = 1
  Total optimization steps = 225000
  Number of trainable parameters = 85784857


  0%|          | 0/225000 [00:00<?, ?it/s]

The following columns in the evaluation set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 2.5e-05, 'epoch': 2500.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-2500
Configuration saved in test/checkpoint-2500/config.json


{'eval_loss': nan, 'eval_runtime': 0.1764, 'eval_samples_per_second': 17.01, 'eval_steps_per_second': 5.67, 'epoch': 2500.0}


Model weights saved in test/checkpoint-2500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 5e-05, 'epoch': 5000.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-5000
Configuration saved in test/checkpoint-5000/config.json


{'eval_loss': nan, 'eval_runtime': 0.2, 'eval_samples_per_second': 15.0, 'eval_steps_per_second': 5.0, 'epoch': 5000.0}


Model weights saved in test/checkpoint-5000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 7.500000000000001e-05, 'epoch': 7500.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-7500
Configuration saved in test/checkpoint-7500/config.json


{'eval_loss': nan, 'eval_runtime': 0.1649, 'eval_samples_per_second': 18.197, 'eval_steps_per_second': 6.066, 'epoch': 7500.0}


Model weights saved in test/checkpoint-7500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 0.0001, 'epoch': 10000.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-10000
Configuration saved in test/checkpoint-10000/config.json


{'eval_loss': nan, 'eval_runtime': 0.174, 'eval_samples_per_second': 17.244, 'eval_steps_per_second': 5.748, 'epoch': 10000.0}


Model weights saved in test/checkpoint-10000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 9.883720930232558e-05, 'epoch': 12500.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-12500
Configuration saved in test/checkpoint-12500/config.json


{'eval_loss': nan, 'eval_runtime': 0.2129, 'eval_samples_per_second': 14.092, 'eval_steps_per_second': 4.697, 'epoch': 12500.0}


Model weights saved in test/checkpoint-12500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 9.767441860465116e-05, 'epoch': 15000.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-15000
Configuration saved in test/checkpoint-15000/config.json


{'eval_loss': nan, 'eval_runtime': 0.2517, 'eval_samples_per_second': 11.917, 'eval_steps_per_second': 3.972, 'epoch': 15000.0}


Model weights saved in test/checkpoint-15000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 9.651162790697675e-05, 'epoch': 17500.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-17500
Configuration saved in test/checkpoint-17500/config.json


{'eval_loss': nan, 'eval_runtime': 0.1718, 'eval_samples_per_second': 17.458, 'eval_steps_per_second': 5.819, 'epoch': 17500.0}


Model weights saved in test/checkpoint-17500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 9.534883720930233e-05, 'epoch': 20000.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-20000
Configuration saved in test/checkpoint-20000/config.json


{'eval_loss': nan, 'eval_runtime': 0.1744, 'eval_samples_per_second': 17.202, 'eval_steps_per_second': 5.734, 'epoch': 20000.0}


Model weights saved in test/checkpoint-20000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 9.418604651162792e-05, 'epoch': 22500.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-22500
Configuration saved in test/checkpoint-22500/config.json


{'eval_loss': nan, 'eval_runtime': 0.1976, 'eval_samples_per_second': 15.183, 'eval_steps_per_second': 5.061, 'epoch': 22500.0}


Model weights saved in test/checkpoint-22500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 9.30232558139535e-05, 'epoch': 25000.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-25000
Configuration saved in test/checkpoint-25000/config.json


{'eval_loss': nan, 'eval_runtime': 0.1999, 'eval_samples_per_second': 15.01, 'eval_steps_per_second': 5.003, 'epoch': 25000.0}


Model weights saved in test/checkpoint-25000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 9.186046511627907e-05, 'epoch': 27500.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-27500
Configuration saved in test/checkpoint-27500/config.json


{'eval_loss': nan, 'eval_runtime': 0.227, 'eval_samples_per_second': 13.216, 'eval_steps_per_second': 4.405, 'epoch': 27500.0}


Model weights saved in test/checkpoint-27500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 9.069767441860465e-05, 'epoch': 30000.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-30000
Configuration saved in test/checkpoint-30000/config.json


{'eval_loss': nan, 'eval_runtime': 0.2118, 'eval_samples_per_second': 14.167, 'eval_steps_per_second': 4.722, 'epoch': 30000.0}


Model weights saved in test/checkpoint-30000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3
  Batch size = 96


{'loss': 0.0, 'learning_rate': 8.953488372093024e-05, 'epoch': 32500.0}


  0%|          | 0/1 [00:00<?, ?it/s]

Saving model checkpoint to test/checkpoint-32500
Configuration saved in test/checkpoint-32500/config.json


{'eval_loss': nan, 'eval_runtime': 0.2076, 'eval_samples_per_second': 14.452, 'eval_steps_per_second': 4.817, 'epoch': 32500.0}


Model weights saved in test/checkpoint-32500/pytorch_model.bin
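
The epoch counts in the log above look large because the Trainer reports only 3 training examples for this run, so a single batch of 96 covers the whole dataset and every optimizer step is one epoch. The sketch below reproduces that arithmetic; the helper names are ours, and the rounding is our reading of how the Trainer derives update steps per epoch.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size,
                    gradient_accumulation_steps=1, num_devices=1):
    """Optimizer updates needed for one pass over the training set."""
    batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return max(1, batches // gradient_accumulation_steps)


def epoch_at_step(step, **dataset_kwargs):
    """Fractional epoch reached after `step` optimizer updates."""
    return step / steps_per_epoch(**dataset_kwargs)


updates = steps_per_epoch(num_examples=3, per_device_batch_size=96)
total_epochs = epoch_at_step(225_000, num_examples=3, per_device_batch_size=96)
```

With 3 examples and a batch size of 96, `updates` is 1 and `total_epochs` is 225,000, in line with the `Num Epochs = 225000` line in the log.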


In [None]:
# Save the final model to the output directory set in TrainingArguments
trainer.save_model(args.output_dir)

In [None]:
# Predict MLM performance on the test dataset
out = trainer.predict(tokenized_dataset['test'])
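
One common way to summarise MLM performance is perplexity, the exponential of the mean cross-entropy loss over the masked positions. The sketch below is a small helper of our own. It assumes the loss is available as a plain float, for example under a key such as `test_loss` in `out.metrics`, which is our assumption about the metric name rather than something shown in this notebook.

```python
import math


def perplexity(mean_loss):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    try:
        return math.exp(mean_loss)
    except OverflowError:
        # Very large losses overflow a float; report them as infinite perplexity.
        return float("inf")


# Illustrative value only: the loss of a model that predicts uniformly over 20 residues.
uniform_loss = math.log(20)
uniform_perplexity = perplexity(uniform_loss)
```

A loss that is `nan` gives a perplexity of `nan`, so it is worth checking the loss itself before interpreting this number.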