There are two types of language modeling: causal and masked.

Causal language models predict the next token in a sequence and can only attend to the tokens on the left, which is why they are frequently used for text generation. These models can power creative applications such as a choose-your-own text adventure, or an intelligent coding assistant like Copilot or CodeParrot.

Masked language models predict a masked token in a sequence and can attend to tokens bidirectionally, meaning the model has full access to the tokens on both the left and the right. Masked language modeling is well suited to tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
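To make the difference in visibility concrete, here is a minimal sketch in plain Python (no model involved). The helper names `causal_mask` and `bidirectional_mask` are hypothetical and exist only for this illustration; a `1` means the position in that row may attend to the position in that column.

```python
def causal_mask(n):
    # Row i is the position doing the attending; column j is the position being looked at.
    # A causal model lets position i see only positions j <= i (itself and the left context).
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    # A masked language model lets every position see every other position.
    return [[1] * n for _ in range(n)]


tokens = ["The", "sun", "is", "<mask>", "star"]
n = len(tokens)
for name, mask in [("causal", causal_mask(n)), ("bidirectional", bidirectional_mask(n))]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>7}: {row}")
```

In the causal case the `<mask>` row cannot see `star`, whereas in the bidirectional case it can, which is the extra context a masked language model uses to fill in the blank.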

This guide illustrates how to:
1. Finetune DistilRoBERTa on the r/askscience subset of the ELI5 dataset.
2. Use the finetuned model for inference.

# Libraries

In [None]:
pip install transformers datasets evaluate

In [None]:
import math
import torch
from datasets import load_dataset
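from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    pipeline,
)

# Use Apple's Metal (MPS) backend when it is available; otherwise fall back to the CPU
mps_device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")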

# Load Data

In [None]:
# Load a small slice of the ELI5 data (the first 5,000 training examples) from the 🤗 Datasets library
# Start small to check that everything works before spending more time training on the full dataset
eli5 = load_dataset("eli5_category", split="train[:5000]")

# Split the dataset into train and test sets
eli5 = eli5.train_test_split(test_size=0.2)

# Inspect an example
# NB: the output may look like a lot, but we're only really interested in the text field
# This is a self-supervised task: no separate labels are required because the original
# tokens at the masked positions serve as the labels.
eli5["train"][0]

# Preprocessing

In [None]:
# Load the DistilRoBERTa tokenizer to process the text subfield
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

In [None]:
# The text field is actually nested inside answers
# Extract the text subfield from its nested structure with the flatten method
eli5 = eli5.flatten()
eli5["train"][0]

In [None]:
# Each example's answers.text field is a list of strings
# Join each list into a single string so that all answers for an example are tokenized together
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,  # number of worker processes; adjust to your CPU core count
    remove_columns=eli5["train"].column_names,
)

In [None]:
# Some of these are longer than the maximum input length for the model, which we'll correct below
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder; we could pad instead of dropping if the model supported it.
    # Customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result 

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
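As a quick sanity check of the concatenate-then-chunk logic, the same steps can be run on a toy batch. This is only a sketch: `group_texts_toy`, the block size of 4, and the made-up token ids are assumptions for illustration (the guide itself uses `block_size = 128`).

```python
toy_block_size = 4  # hypothetical small value so the result is easy to read


def group_texts_toy(examples, block_size=toy_block_size):
    # Concatenate the per-example lists into one long list per column.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Round down to a multiple of block_size, dropping the remainder.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Slice each column into consecutive chunks of block_size.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


# Three "tokenized" examples of lengths 3, 5 and 2 (10 tokens in total).
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
chunks = group_texts_toy(batch)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that chunk boundaries ignore the original example boundaries, and the last two tokens (9 and 10) are dropped because they do not fill a whole block.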

In [None]:
# Create a batch of examples using DataCollatorForLanguageModeling
# Dynamically pad the sentences to the longest length in a batch during collation
# Use the end-of-sequence token as the padding token
# Set mlm_probability so that 15% of tokens are randomly selected for masking,
# with a fresh selection each time the data is iterated over
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
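To show roughly what happens to one sequence during collation, here is a simplified sketch of the masking rule in plain Python. It is not the library's implementation: `mask_tokens` and its arguments are hypothetical, special tokens are not excluded, and the 80/10/10 split (mask token / random token / unchanged) is the collator's default behaviour as I understand it.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, rng=None):
    rng = rng or random.Random()
    masked = list(input_ids)
    # -100 marks positions that the loss should ignore (PyTorch's default ignore_index).
    labels = [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected: input unchanged, no loss computed here
        labels[i] = tok  # the original token is the prediction target
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token as it is
    return masked, labels


# Made-up token ids; mask_id and vocab_size are placeholders, not real tokenizer values.
ids = list(range(100, 120))
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=1000, rng=random.Random(0))
print(masked)
print(labels)
```

Because the selection is redrawn on every call, the model sees different masked positions for the same text in each epoch.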

# Training

In [None]:
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
model.to(mps_device)

In [None]:
# Define training hyperparameters in TrainingArguments
training_args = TrainingArguments(
    output_dir="masked_language_model",
    evaluation_strategy="epoch",  # recent transformers releases rename this argument to eval_strategy
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Pass the training arguments to Trainer along with the model, datasets, and data collator
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

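# Call train() to finetune the model
trainer.train()

# Save the final model and tokenizer to output_dir so the inference cells can load them by path
trainer.save_model()
tokenizer.save_pretrained(training_args.output_dir)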

# Evaluation

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
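Perplexity is the exponential of the average cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. The toy calculation below, using a hypothetical `perplexity` helper and made-up probabilities, shows the relationship: if the model assigns probability 0.25 to every correct token, it is as uncertain as a uniform choice among four options.

```python
import math


def perplexity(token_probs):
    # token_probs: the probability the model assigned to each correct token.
    nll = [-math.log(p) for p in token_probs]  # per-token negative log-likelihood
    mean_loss = sum(nll) / len(nll)            # this is what eval_loss reports
    return math.exp(mean_loss)


print(f"{perplexity([0.25, 0.25, 0.25, 0.25]):.2f}")  # 4.00
print(f"{perplexity([0.9, 0.8, 0.95]):.2f}")
```

For a masked language model the loss is computed only on the masked positions, and those positions are drawn at random, so the reported perplexity can vary slightly between evaluation runs.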

# Inference

In [None]:
# Come up with some text you’d like the model to fill in the blank with
# use the special <mask> token to indicate the blank
text = "The sun is our <mask> star."

In [None]:
# --> Try finetuned model for inference in a pipeline()
# use fill-mask with model, and pass your text to it
# use the top_k parameter to specify how many predictions to return
mask_filler = pipeline("fill-mask", model="masked_language_model")
mask_filler(text, top_k=3)

In [None]:
# --> Inference with PyTorch objects  
# Tokenize the text and return the input_ids as PyTorch tensors
# Specify the position of the <mask> token
tokenizer = AutoTokenizer.from_pretrained("masked_language_model")
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

# Pass inputs to the model and return the logits of the masked token
model = AutoModelForMaskedLM.from_pretrained("masked_language_model")
with torch.no_grad():  # gradients are not needed for inference
    logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]

# Take the three candidate tokens with the highest scores and print the completed sentences
top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

for token in top_3_tokens:
    # Decoded tokens may carry a leading space, so strip it before substituting
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token]).strip()))
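The step from logits to ranked candidates can be sketched without a model. In the toy example below the vocabulary, the logits, and the helper names `softmax` and `top_k` are all made up for illustration; the point is only that softmax turns logits into probabilities without changing their order, so taking the top-k logits and the top-k probabilities selects the same tokens.

```python
import math


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    # Indices of the k largest scores, highest first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


toy_vocab = ["nearest", "closest", "only", "banana"]  # hypothetical 4-token vocabulary
toy_logits = [2.0, 1.5, 0.5, -3.0]                    # hypothetical scores at the <mask> position
probs = softmax(toy_logits)

for idx in top_k(toy_logits, 3):
    print(f"The sun is our {toy_vocab[idx]} star.  (p={probs[idx]:.3f})")
```

The `fill-mask` pipeline reports a score alongside each candidate, whereas the manual PyTorch cell above ranks raw logits; the ordering is the same either way.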