There are two types of language modeling: causal and masked. Causal language models are frequently used for text generation, for example in creative applications such as a choose-your-own-adventure text game or in an intelligent coding assistant like Copilot or CodeParrot.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
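To make the "left-only" attention concrete, here is a minimal sketch of a causal mask in plain Python. It is only an illustration: the helper name `causal_mask` and the example tokens are made up for this guide, and real models build the equivalent mask as a tensor inside the attention layers.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len mask where mask[i][j] is True
    if position i is allowed to attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Hypothetical example tokens, used only to label the rows of the mask
tokens = ["The", "sky", "is", "blue"]
mask = causal_mask(len(tokens))

# Show which tokens each position can see: itself and everything to its left
for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r} can attend to {visible}")
```

Each row allows a position to see itself and earlier positions only, which is why the model cannot use future tokens when predicting the next one.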

This guide illustrates how to:
1. Finetune DistilGPT2 on the r/askscience subset of the ELI5 dataset.
2. Use the finetuned model for inference.

# Libraries

In [None]:
%pip install transformers datasets evaluate

In [None]:
from datasets import load_dataset

# Load Data

In [None]:
# Load the first 5,000 examples of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library
# Experimenting on a small sample confirms everything works before spending more time training on the full dataset
eli5 = load_dataset("eli5_category", split="train[:5000]")

# Split the dataset into train and test sets
eli5 = eli5.train_test_split(test_size=0.2)

# Inspect an example
# NB: the output contains many fields, but only the text field matters here
# This is an unsupervised task: no separate labels are required because the next token serves as the label
eli5["train"][0]
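The idea that "the next word is the label" can be sketched with a toy example. The token ids below are invented for illustration and `next_token_pairs` is a hypothetical helper, not part of any library. Training libraries typically handle this shift for you, so treat this as a picture of the objective, not as a required preprocessing step.

```python
def next_token_pairs(token_ids):
    """Split a token sequence into model inputs and next-token labels.
    The label at position i is the token that follows input position i."""
    inputs = token_ids[:-1]   # every token except the last
    labels = token_ids[1:]    # the same sequence shifted left by one
    return inputs, labels


# Made-up token ids, for illustration only
token_ids = [464, 6766, 318, 4171]
inputs, labels = next_token_pairs(token_ids)

# Pair each input token with the token the model should predict next
for inp, lab in zip(inputs, labels):
    print(f"input {inp} -> label {lab}")
```

Because the labels are just the text shifted by one position, any plain text corpus can be used for training without manual annotation.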