There are two types of language modeling: causal and masked. Causal language models are frequently used for text generation, in creative applications such as a choose-your-own text adventure or in intelligent coding assistants such as Copilot or CodeParrot.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
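
To make "can only attend to tokens on the left" concrete, here is a minimal sketch of a causal attention mask in plain Python. The function name `causal_mask` is hypothetical and only for illustration; real models build an equivalent mask internally as a tensor.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix of 0/1 flags.

    Entry [i][j] is 1 when position i may attend to position j,
    which under causal masking means j <= i (itself and the left).
    """
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


# Each row is one query position; the zeros are the hidden future tokens.
for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The lower-triangular shape is the whole idea: the first token sees only itself, and the last token sees the entire sequence.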

This guide illustrates how to:
1. Finetune DistilGPT2 on a small subset of the ELI5-Category dataset (eli5_category).
2. Use the finetuned model for inference.

# Libraries

In [1]:
pip install transformers datasets evaluate

Note: you may need to restart the kernel to use updated packages.


In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load Data

In [3]:
# Load a small slice (the first 5,000 training examples) of the ELI5-Category dataset from the 🤗 Datasets library
# Experiment on this slice to make sure everything works before spending more time training on the full dataset
eli5 = load_dataset("eli5_category", split="train[:5000]")

# Split the dataset into train and test sets
eli5 = eli5.train_test_split(test_size=0.2)

# Inspect an example
# NB: the output may look like a lot, but we're only really interested in the text field
# This is a self-supervised task: no separate labels are required because the next token serves as the label.
eli5["train"][0]

{'q_id': '7414mj',
 'title': 'Voices usually sound tired early in the morning. What\'s the best way to "wake up" our voices quickly?',
 'selftext': '',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dnuxftq', 'dnuntk5'],
  'text': ['Vocal chords, much like muscles, need to be warmed up to be used at max efficiency. As a singer, every morning I do vocal warm-up exercises to make sure I\'m not destroying my voice when I sing. I also work at a call center, so preparing my voice for the day makes it much more manageable to speak on the phone for hours on a daily basis. However, to more directly answer your question, there isn\'t a "trick" that will wake up your voice immediately. Doing some quick exercises can help your voice warm up more quickly, though. There are a lot of YouTube videos on these, but my favorite are "bubbles". I would provide links but I\'m at work and on mobile.',
   'So I\'m not sure there is a scientific way, but I\'d say practice ma
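
The idea that the next token is the label can be sketched in a few lines. This is an illustration only: it splits on whole words for readability, whereas the DistilGPT2 tokenizer works on subword tokens, and the helper name `next_token_pairs` is hypothetical.

```python
def next_token_pairs(tokens):
    """Pair every token with the token that follows it.

    The inputs are all tokens except the last, and the labels are the
    same sequence shifted one position to the left.
    """
    inputs = tokens[:-1]
    labels = tokens[1:]
    return list(zip(inputs, labels))


# Word-level stand-in for a tokenized answer (real tokens are subwords).
tokens = ["Vocal", "chords", "need", "to", "be", "warmed", "up"]
for current, target in next_token_pairs(tokens):
    print(f"{current!r} -> {target!r}")
```

Because the targets come from the text itself, no annotation step is needed.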

# Preprocessing

In [4]:
# Load DistilGPT2 tokenizer to process the 'text' subfield
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

In [5]:
# Notice that the 'text' subfield is nested inside 'answers'.
# Use the flatten method to extract it from the nested structure
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': '7414mj',
 'title': 'Voices usually sound tired early in the morning. What\'s the best way to "wake up" our voices quickly?',
 'selftext': '',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dnuxftq', 'dnuntk5'],
 'answers.text': ['Vocal chords, much like muscles, need to be warmed up to be used at max efficiency. As a singer, every morning I do vocal warm-up exercises to make sure I\'m not destroying my voice when I sing. I also work at a call center, so preparing my voice for the day makes it much more manageable to speak on the phone for hours on a daily basis. However, to more directly answer your question, there isn\'t a "trick" that will wake up your voice immediately. Doing some quick exercises can help your voice warm up more quickly, though. There are a lot of YouTube videos on these, but my favorite are "bubbles". I would provide links but I\'m at work and on mobile.',
  'So I\'m not sure there is a scientific way, but I\'d say practice 
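
As a rough mental model of what flattening does to one record, the sketch below turns nested dictionaries into dotted keys. It is not the 🤗 Datasets implementation; `flatten_record` is a hypothetical helper, and the answer strings are placeholders rather than real data.

```python
def flatten_record(record, parent_key=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that {'answers': {'text': ...}} becomes 'answers.text'.
            flat.update(flatten_record(value, full_key))
        else:
            flat[full_key] = value
    return flat


record = {
    "q_id": "7414mj",
    "answers": {
        "a_id": ["dnuxftq", "dnuntk5"],
        "text": ["first answer (placeholder)", "second answer (placeholder)"],
    },
}
flat = flatten_record(record)
print(list(flat))  # ['q_id', 'answers.a_id', 'answers.text']
```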

In [6]:
# After flattening, the text is its own field: answers.text
# Each example's answers.text is a list of strings; join it into a single string so the answers are tokenized together rather than separately
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (1144 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1071 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1049 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1793 > 1024). Running this sequence through the model will result in indexing errors

Token indices sequence length is longer than the specified maximum sequence length for this model (1592 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1282 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1330 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2066 > 1024). Running this sequence through the model will result in indexing errors
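
The warnings above show that some joined answers exceed the model's 1024-token limit. One common way to handle this is to concatenate the tokenized sequences and cut them into fixed-size blocks. The sketch below is one possible version of that step, not necessarily what this notebook does next: the name `group_texts` and the block size of 128 are assumptions, and the toy batch stands in for real tokenizer output.

```python
from itertools import chain

BLOCK_SIZE = 128  # assumed value; it must not exceed the model's 1024-token limit


def group_texts(examples, block_size=BLOCK_SIZE):
    """Concatenate a batch of token lists and split it into equal blocks."""
    # Join every column (input_ids, attention_mask, ...) across the batch.
    concatenated = {k: list(chain.from_iterable(examples[k])) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the trailing remainder so every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [tokens[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }
    # For causal language modeling the labels start as a copy of the inputs.
    result["labels"] = [block.copy() for block in result["input_ids"]]
    return result


# Toy batch with a tiny block size so the effect is easy to see.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
blocks = group_texts(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With a real dataset, a function like this could be applied through `tokenized_eli5.map(group_texts, batched=True)`. Note that the last two tokens of the toy batch are discarded; with a larger block size more text is lost at the end of each batch, which is the trade-off for uniform block lengths.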
