There are two types of language modeling: causal and masked.

Causal language models predict the next token in a sequence and can only attend to the tokens on the left. They are frequently used for text generation, for example in creative applications such as choose-your-own-adventure text games or in intelligent coding assistants like Copilot or CodeParrot.

Masked language models predict a masked token in a sequence and can attend to tokens bidirectionally, meaning the model has full access to the tokens on both the left and the right of the mask. This makes masked language modeling well suited to tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
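
To make the difference in attention concrete, here is a minimal sketch in plain Python. It involves no model: it only builds the two visibility patterns (left-only versus bidirectional) for a toy sequence. The token list and the helper names `attention_mask` and `visible_tokens` are illustrative assumptions, not part of any library.

```python
# Illustrative sketch only: which positions may each token attend to?
# No model or tokenizer is involved; the sequence below is a made-up example.

def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix where entry [i][j] is 1
    if position i is allowed to attend to position j, else 0."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

tokens = ["The", "cat", "<mask>", "on", "the", "mat"]  # hypothetical toy sequence

causal_mask = attention_mask(len(tokens), causal=True)          # left context only
bidirectional_mask = attention_mask(len(tokens), causal=False)  # left and right context

def visible_tokens(mask_row):
    """List the tokens a position can see, given its row of the mask."""
    return [tok for tok, allowed in zip(tokens, mask_row) if allowed]

mask_position = tokens.index("<mask>")
causal_context = visible_tokens(causal_mask[mask_position])
bidirectional_context = visible_tokens(bidirectional_mask[mask_position])

print("causal context:       ", causal_context)
print("bidirectional context:", bidirectional_context)
```

With the causal pattern, the masked position sees only itself and the tokens before it. With the bidirectional pattern, it sees the whole sequence, which is why a masked language model can use the words after the gap as well as those before it.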

This guide illustrates how to:

1. Finetune DistilRoBERTa on a subset of the ELI5 dataset.
2. Use your finetuned model for inference.

# Libraries

In [1]:
pip install transformers datasets evaluate

Note: you may need to restart the kernel to use updated packages.


In [2]:
from datasets import load_dataset

# Load Data

In [3]:
# Load the first 5,000 training examples of the ELI5-Category dataset (a categorized variant of ELI5) from the 🤗 Datasets library
# Starting small lets you check that everything works before spending more time training on the full dataset
eli5 = load_dataset("eli5_category", split="train[:5000]")

# Split the dataset into train and test sets
eli5 = eli5.train_test_split(test_size=0.2)

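# Inspect an example
# NB: the output contains many fields, but only the text field is needed for language modeling
# No manually annotated labels are required (this is often called an unsupervised task): the labels come from the text itself.
# For masked language modeling, the original tokens at the masked positions are the labels.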
eli5["train"][0]
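
The idea that the labels come from the text itself can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the masking procedure used by any particular library: the token IDs, `MASK_ID`, and the helper `mask_tokens` are hypothetical, and production data collators typically apply extra rules (such as occasionally keeping or randomly replacing a selected token). The value `-100` is the label that PyTorch's cross-entropy loss ignores by default, so only masked positions contribute to the loss.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Simplified masking sketch: replace a random subset of tokens with mask_id.

    Returns (inputs, labels). At masked positions the label is the original
    token; everywhere else the label is IGNORE_INDEX.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)   # the model sees the mask token...
            labels.append(tok)       # ...and must predict the original token
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)
    return inputs, labels

token_ids = [11, 42, 7, 93, 58, 12]  # hypothetical token IDs
MASK_ID = 999                        # hypothetical ID of the mask token

# A high mask probability is used here only so the toy example is likely to mask something.
inputs, labels = mask_tokens(token_ids, MASK_ID, mask_prob=0.5)
print("inputs:", inputs)
print("labels:", labels)
```

No human annotation appears anywhere in this sketch: both the inputs and the labels are derived from the same token sequence.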

# Preprocessing