In [1]:
!pip install transformers datasets evaluate


# Causal language modeling

There are two types of language modeling: causal and masked. This guide illustrates causal language modeling. Causal language models are frequently used for text generation. You can use them for creative applications such as a choose-your-own text adventure, or for an intelligent coding assistant like Copilot or CodeParrot.
Causal language modeling predicts the next token in a sequence, and the model can only attend to tokens on the left, so it cannot see future tokens. GPT-2 is an example of a causal language model.
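
To make "can only attend to tokens on the left" concrete, here is a minimal pure-Python sketch of a causal attention mask. The helper name `causal_mask` and the toy token list are illustrative assumptions, not part of Transformers; real models build an equivalent lower-triangular mask internally.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is True
    when position i may attend to position j, i.e. when j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


tokens = ["The", "cat", "sat", "down"]  # toy, already-tokenized input (hypothetical)
mask = causal_mask(len(tokens))

for i, row in enumerate(mask):
    # Collect the tokens that position i is allowed to look at.
    visible = [tok for tok, allowed in zip(tokens, row) if allowed]
    print(f"{tokens[i]!r:>7} can see {visible}")
```

Each position sees itself and everything before it, and nothing after it, which is why the model can be trained to predict the next token without leaking the answer.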

This guide will show you how to:

1. Finetune DistilGPT2 on a subset of the ELI5 dataset (the first 5,000 training examples of eli5_category).
2. Use your finetuned model for inference.


In [16]:
from huggingface_hub import notebook_login

notebook_login()


# Load ELI5 dataset

In [17]:
from datasets import load_dataset

eli5 = load_dataset("eli5_category", split="train[:5000]")

Split the dataset’s train split into a train and test set with the train_test_split method:

In [18]:
eli5 = eli5.train_test_split(test_size=0.2)
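
As a rough illustration of what a shuffled 80/20 split does, the sketch below uses only the standard library. The function `shuffle_split` and its rounding rule are assumptions made for illustration and are not how 🤗 Datasets implements the method; the point is only that 5,000 rows with test_size=0.2 yield 4,000 training rows and 1,000 test rows.

```python
import random


def shuffle_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices, then use the first `test_size` fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(5000))  # stand-in for the 5,000 loaded examples
train_rows, test_rows = shuffle_split(rows)
print(len(train_rows), len(test_rows))
```

Because the real split is also shuffled, the example printed from eli5["train"][0] will generally differ between runs unless a seed is passed to train_test_split.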

In [19]:
eli5["train"][0]

{'q_id': '74wntd',
 'title': 'Why does cold water help burns?',
 'selftext': 'You would intuitively think that the burn damage is irreversible and not affected by the temperature of the skin after the burn.',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['do1mrrq', 'do1nu7x'],
  'text': ["There's a few reasons that cold water is recommended for burns. First, depending on the burn and what caused it, there can be ongoing damage occurring- kind of like how scrambled eggs keep cooking for a bit even after you take them out of the pan; residual heat can still be doing harm after you've moved away from the source of the damage. If you were burned by something other than direct heat, like an acid, chemical or other damaging material, the water can help remove any traces remaining, **DEPENDING ON THE CHEMICAL**. You may want to *avoid* water in some situations, but that's usually going to be in scenarios where you should be trained and wearing safety gear. A

# Preprocess


In [20]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")



In [21]:
eli5 = eli5.flatten()

In [22]:
eli5["train"][0]

{'q_id': '74wntd',
 'title': 'Why does cold water help burns?',
 'selftext': 'You would intuitively think that the burn damage is irreversible and not affected by the temperature of the skin after the burn.',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['do1mrrq', 'do1nu7x'],
 'answers.text': ["There's a few reasons that cold water is recommended for burns. First, depending on the burn and what caused it, there can be ongoing damage occurring- kind of like how scrambled eggs keep cooking for a bit even after you take them out of the pan; residual heat can still be doing harm after you've moved away from the source of the damage. If you were burned by something other than direct heat, like an acid, chemical or other damaging material, the water can help remove any traces remaining, **DEPENDING ON THE CHEMICAL**. You may want to *avoid* water in some situations, but that's usually going to be in scenarios where you should be trained and wearing safety gear
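
Comparing the two printed examples, flatten has lifted the nested answers field into top-level columns named answers.a_id and answers.text. A minimal sketch of that renaming on a plain dictionary might look like the following; `flatten_record` and the placeholder answer strings are hypothetical and stand in for the library's own implementation.

```python
def flatten_record(record, parent_key=""):
    """Recursively lift nested dict fields to the top level, joining names with a dot."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


example = {
    "q_id": "74wntd",
    "answers": {
        "a_id": ["do1mrrq", "do1nu7x"],
        "text": ["first answer (placeholder)", "second answer (placeholder)"],
    },
}
flat_example = flatten_record(example)
print(list(flat_example))
```

After flattening, the answer texts can be addressed directly by the column name answers.text, which is what the preprocessing function relies on.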

In [23]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])
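
Each row's answers.text is a list of strings, so the function joins them into a single string per row before tokenizing. The sketch below shows only that joining step on a hand-made batch; the batch contents are invented for illustration, and the tokenizer call is left out so the snippet runs without any downloads.

```python
# A batched call receives a dict mapping column names to lists of values,
# one entry per row. Here, two rows with two answers and one answer.
batch = {
    "answers.text": [
        ["Cold water removes residual heat.", "It also eases pain."],
        ["Only one answer here."],
    ]
}

# One joined string per row; this list is what would be passed to the tokenizer.
joined = [" ".join(answers) for answers in batch["answers.text"]]

for text in joined:
    print(repr(text))
```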

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and increasing the number of processes with num_proc. Remove any columns you don’t need:

In [24]:
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1212 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1312 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1100 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1793 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1576 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1197 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1710 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2727 > 1024). Running this sequence through the model will result in indexing errors


We can now use a second preprocessing function to:

- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM.


In [25]:
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so that every chunk has exactly block_size tokens. You could pad
    # the last chunk instead if the model supported it; customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
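
To see the concatenate-then-chunk behavior on something small, the sketch below applies the same logic to a toy batch with a block size of 4. The name `chunk_batch` and the toy token ids are assumptions for illustration; the only change from group_texts is that block_size is passed as an argument.

```python
def chunk_batch(examples, block_size):
    """Concatenate every column, then cut it into block_size-long pieces."""
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of block_size, dropping the leftover tokens.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels start out as a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result


toy = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
chunks = chunk_batch(toy, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Ten tokens become two chunks of four, and tokens 9 and 10 are dropped. Note that chunk boundaries ignore the original row boundaries, so one chunk can contain the end of one text and the start of the next.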

In [26]:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Now create a batch of examples using DataCollatorForLanguageModeling. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [27]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
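
The following is a simplified pure-Python sketch of dynamic padding for causal language modeling, written under stated assumptions rather than copied from the library: sequences are right-padded to the longest one in the batch, labels start as a copy of the inputs, and pad positions in the labels are replaced by -100 so the loss ignores them. The function `collate_causal` and the token ids are hypothetical. The one-token shift between inputs and targets is not done here; it happens inside the model when it computes the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_causal(batch_input_ids, pad_token_id):
    """Right-pad every sequence to the longest one in the batch and build labels."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; every pad-token position is excluded from the loss.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal([[11, 12, 13], [21, 22]], pad_token_id=0)
print(batch["input_ids"])
print(batch["labels"])
```

One side effect of this sketch is worth knowing: when the pad token is the same as the end-of-sequence token, as in the cell above, any genuine end-of-sequence token in the inputs is masked out of the labels in the same way as padding.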

In [28]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

In [37]:
training_args = TrainingArguments(
    output_dir="my_awesome_eli5_clm-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to=["none"]
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.9425 | 3.847351 |
| 2 | 3.8469 | 3.837938 |
| 3 | 3.8172 | 3.837145 |


TrainOutput(global_step=3924, training_loss=3.8751679244512935, metrics={'train_runtime': 17947.1366, 'train_samples_per_second': 1.749, 'train_steps_per_second': 0.219, 'total_flos': 1025230463041536.0, 'train_loss': 3.8751679244512935, 'epoch': 3.0})
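
The global_step value in this output can be sanity-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the TrainingArguments defaults of a per-device batch size of 8 and 3 epochs, since the cell above does not override them. The chunk count of 10,460 is a hypothetical figure chosen for illustration; the notebook does not print the real size of lm_dataset["train"].

```python
import math


def total_training_steps(num_examples, batch_size=8, num_epochs=3):
    """Optimizer steps for a full run; the last partial batch of each epoch still counts."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


assumed_num_chunks = 10_460  # hypothetical; any value from 10,457 to 10,464 gives the same result
print(total_training_steps(assumed_num_chunks))  # 3924
```

Under these assumptions the run has 1,308 steps per epoch, which is consistent with the 3,924 total steps reported above.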

In [38]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
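
Perplexity is the exponential of the average negative log-likelihood per token, which is why the cell above takes math.exp of the evaluation loss. The sketch below works through that definition on made-up probabilities; the function `perplexity` and its inputs are illustrative assumptions, not part of the Trainer API.

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-likelihood) of the probabilities assigned to the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that always spreads its belief evenly over 4 candidates has perplexity 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(round(uniform, 6))

# Applying the same exponential to the final validation loss from the training table.
print(round(math.exp(3.837145), 2))
```

Read this way, a validation loss of about 3.84 corresponds to a perplexity of roughly 46, meaning the model is about as uncertain as if it were choosing evenly among 46 tokens at each step.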