There are two types of language modeling: causal and masked.

Causal language models predict the next token in a sequence and can only attend to tokens on the left, which is why they are frequently used for text generation. These models can be used for creative applications like choose-your-own-adventure text games or for an intelligent coding assistant like Copilot or CodeParrot. Masked language models predict a masked token in a sequence and can attend to tokens bidirectionally, so the model has full access to the tokens on both the left and the right of the mask. Masked language modeling is well suited to tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
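
The difference between the two attention patterns can be sketched in plain Python. This is a minimal illustration, not how any library builds its masks: `causal_mask` and `bidirectional_mask` are hypothetical helper names, and a 1 at row i, column j means that token i may attend to token j.

```python
# Illustrative only: which positions each token is allowed to attend to.

def causal_mask(n):
    """Row i marks what token i can see: itself and everything to its left."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token can see every position, to the left and to the right."""
    return [[1] * n for _ in range(n)]


tokens = ["The", "sky", "is", "blue"]
for name, mask in [
    ("causal", causal_mask(len(tokens))),
    ("bidirectional", bidirectional_mask(len(tokens))),
]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>4}: {row}")
```

In the causal case the first token sees only itself, while in the bidirectional case every row is all ones, which is what lets a masked language model use context from both sides of a masked position.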

This guide illustrates how to:
1. Finetune DistilRoBERTa on a subset of the ELI5-Category dataset (`eli5_category`).
2. Use your finetuned model for inference.

# Libraries

In [None]:
pip install transformers datasets evaluate

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Load Data

In [None]:
# Load the first 5,000 training examples of the eli5_category dataset from the 🤗 Datasets library
# Start with a small subset to make sure everything works before spending more time training on the full dataset
eli5 = load_dataset("eli5_category", split="train[:5000]")

# Split the dataset into train and test sets
eli5 = eli5.train_test_split(test_size=0.2)

# Inspect an example
# NB: the output may look like a lot, but we’re only really interested in the text field
# This is an unsupervised task: no separate labels are required because the masked tokens themselves are the labels.
eli5["train"][0]

# Preprocessing

In [None]:
# load a DistilRoBERTa tokenizer to process the text subfield
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

In [None]:
# The text field is actually nested inside answers
# Extract the text subfield from its nested structure with the flatten method
eli5 = eli5.flatten()
eli5["train"][0]
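
To make the effect of flattening concrete, here is a small pure-Python sketch of the idea: nested fields become top-level columns whose names are joined with a dot. The record below is made up for illustration and `flatten_record` is a hypothetical helper; the real `flatten` method works on the dataset's columns rather than on individual dictionaries.

```python
def flatten_record(record, prefix=""):
    """Recursively lift nested dict fields to the top level with dotted names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


# Made-up example record with a nested "answers" field
example = {
    "title": "Why is the sky blue?",
    "answers": {"text": ["Short answer.", "Longer answer."], "score": [5, 2]},
}

flat = flatten_record(example)
print(sorted(flat))  # ['answers.score', 'answers.text', 'title']
```

After flattening, the answer texts are reachable under the single key `answers.text`, which is the column name the preprocessing step relies on.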

In [None]:
# After flattening, answers.text holds a list of strings for each example
# Join each list into a single string so the answers are tokenized together
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,  # number of worker processes; adjust to the CPU cores you have available
    remove_columns=eli5["train"].column_names,
)

In [None]:
# Some tokenized examples are longer than the model's maximum input length,
# so concatenate all tokens and split them into fixed-size blocks of block_size
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the final partial block; padding it instead is an alternative if the model supports it.
    # Customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result 

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
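
A toy run makes the grouping step easier to follow. The sketch below repeats the same logic with `block_size = 4` and made-up token ids so the chunks are short enough to read; the batch contents are illustrative assumptions, not real tokenizer output.

```python
block_size = 4  # tiny value for readability; the guide itself uses 128


def group_texts(examples):
    # Concatenate the per-example lists into one long list per column
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Round down to a multiple of block_size, dropping the remainder
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split each column into chunks of block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


# Two made-up tokenized examples of length 4 and 6 (10 tokens in total)
toy_batch = {
    "input_ids": [[0, 11, 12, 2], [0, 21, 22, 23, 24, 2]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1]],
}

chunks = group_texts(toy_batch)
print(chunks["input_ids"])  # [[0, 11, 12, 2], [0, 21, 22, 23]]
```

The 10 tokens are cut into two blocks of 4 and the last 2 tokens are dropped. Note that when a batch contains fewer than `block_size` tokens in total, nothing is dropped and the batch yields a single short block.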

In [None]:
# Create batches of examples using DataCollatorForLanguageModeling
# The collator dynamically pads the sequences to the longest length in each batch
# Use the end-of-sequence token as the padding token
# Set mlm_probability so that a fresh random 15% of tokens is masked each time a batch is built
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
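
What the collator does to each sequence can be approximated in plain Python. The following is a simplified sketch of the default masking scheme, not the library's implementation: each non-special token is selected with probability `mlm_probability`; a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time; every unselected position gets the label -100 so the loss ignores it. `mask_tokens` is a hypothetical helper, and the token ids passed to it below are illustrative values.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Never mask special tokens; otherwise select with mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


rng = random.Random(0)  # seeded so repeated runs give the same masking
ids = [0] + list(range(100, 120)) + [2]  # illustrative ids with start/end tokens
inputs, labels = mask_tokens(
    ids, mask_id=50264, vocab_size=50265, special_ids={0, 1, 2}, rng=rng
)
print(inputs)
print(labels)
```

Because the masking is drawn at collation time rather than stored in the dataset, the same sequence is masked differently on each pass over the data.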