# Causal language modeling

- https://huggingface.co/docs/transformers/en/tasks/language_modeling

## Load ELI5 dataset

In [1]:
from datasets import load_dataset

eli5 = load_dataset("eli5_category", split="train[:5000]")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.

In [2]:
eli5 = eli5.train_test_split(test_size=0.2)

In [3]:
eli5["train"][0]

{'q_id': '74elki',
 'title': 'If cabin pressure during a flight is controlled, why do our ears do the thing?',
 'selftext': '',
 'category': 'Physics',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dnxo6rk', 'dnxqu5p'],
  'text': ["Cabin pressure is equalized at about 7000 feet it also changes slower than the actual speed of the ascent. So from 0-7000 and 7000-0 you still experience changes in pressure (but less quickly than if the cabin wasn't pressurized at all) Edit - [image I found while searching for the same thing a while back]( URL_0 )",
   'Friend of mine flys a medical jet that keeps the pressure even lower, 2-3000ft I think. Then there are some outrageously expensive flights.'],
  'score': [5, 3],
  'text_urls': [['https://i.stack.imgur.com/BWHWi.jpg'], []]},
 'title_urls': ['url'],
 'selftext_urls': ['url']}

## Preprocess

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

In [5]:
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': '74elki',
 'title': 'If cabin pressure during a flight is controlled, why do our ears do the thing?',
 'selftext': '',
 'category': 'Physics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dnxo6rk', 'dnxqu5p'],
 'answers.text': ["Cabin pressure is equalized at about 7000 feet it also changes slower than the actual speed of the ascent. So from 0-7000 and 7000-0 you still experience changes in pressure (but less quickly than if the cabin wasn't pressurized at all) Edit - [image I found while searching for the same thing a while back]( URL_0 )",
  'Friend of mine flys a medical jet that keeps the pressure even lower, 2-3000ft I think. Then there are some outrageously expensive flights.'],
 'answers.score': [5, 3],
 'answers.text_urls': [['https://i.stack.imgur.com/BWHWi.jpg'], []],
 'title_urls': ['url'],
 'selftext_urls': ['url']}

In [6]:
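def preprocess_function(examples):
    # Each example holds a list of answers; join them into one string before tokenizing.
    # No truncation here: group_texts later splits the token stream into fixed-size blocks.
    return tokenizer([" ".join(x) for x in examples["answers.text"]])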

In [7]:
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (1091 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1389 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1372 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2111 > 1024). Running this sequence through the model will result in indexing errors
Map (num_proc=4): 100%|██████████| 4000/4000 [00:01<00:00, 2694.26 examples/s]

In [8]:
block_size = 128

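def group_texts(examples):
    # Concatenate every field (input_ids, attention_mask) across the batch into one long list.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder so the length is a multiple of block_size.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split each field into chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # For causal LM the labels are a copy of the inputs; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result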
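
To see what the grouping step does to a batch, here is a minimal sketch on toy data. It assumes a hypothetical helper `group_texts_toy` that takes `block_size` as a parameter (4 here instead of 128) and uses made-up token ids; the logic otherwise mirrors `group_texts` above.

```python
def group_texts_toy(examples, block_size=4):
    # Hypothetical stand-in for group_texts, with block_size passed in for illustration.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    if total_length >= block_size:
        # Truncate to a multiple of block_size; the remainder is discarded.
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three "tokenized answers" of lengths 3, 5 and 2 (10 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts_toy(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that block boundaries ignore the original example boundaries, and the last two tokens (9 and 10) are dropped because they do not fill a complete block.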

In [9]:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Map (num_proc=4): 100%|██████████| 4000/4000 [00:02<00:00, 1553.73 examples/s]
Map (num_proc=4): 100%|██████████| 1000/1000 [00:00<00:00, 3562.14 examples/s]


In [10]:
from transformers import DataCollatorForLanguageModeling

# The GPT-2 tokenizer has no padding token, so reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
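
As a rough illustration of what a causal-LM collator (`mlm=False`) produces, the sketch below pads a toy batch on the right and builds labels as a copy of the inputs with padding positions set to -100, the index ignored by the loss. `collate_causal_toy` and the token ids are hypothetical; this is a plain-Python approximation of the behavior, not the library's implementation. In this notebook every block already has length 128, so no padding should actually be needed.

```python
def collate_causal_toy(batch, pad_id):
    """Pad sequences on the right and derive labels for causal language modeling."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels are a copy of the inputs; every pad_id position is replaced by -100.
        # The next-token shift is left to the model, so labels are not shifted here.
        labels.append([-100 if token == pad_id else token for token in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


toy = collate_causal_toy([[5, 6, 7], [8]], pad_id=0)
print(toy["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(toy["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(toy["labels"])          # [[5, 6, 7], [8, -100, -100]]
```

One consequence of masking by token id: when the padding token is the same as the end-of-sequence token, genuine end-of-sequence tokens are masked out of the loss as well.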

## Train

In [11]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

In [12]:
training_args = TrainingArguments(
    output_dir="my_awesome_eli5_clm-model",
    evaluation_strategy="epoch",  # newer transformers releases rename this argument to eval_strategy
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

{'loss': 3.9982, 'grad_norm': 4.384872913360596, 'learning_rate': 1.74120082815735e-05, 'epoch': 0.39}
{'loss': 3.9498, 'grad_norm': 4.060859680175781, 'learning_rate': 1.4824016563146998e-05, 'epoch': 0.78}
{'eval_loss': 3.8219423294067383, 'eval_runtime': 111.9973, 'eval_samples_per_second': 24.867, 'eval_steps_per_second': 3.116, 'epoch': 1.0}
{'loss': 3.8997, 'grad_norm': 3.799984931945801, 'learning_rate': 1.2236024844720498e-05, 'epoch': 1.16}
{'loss': 3.8546, 'grad_norm': 3.95866060256958, 'learning_rate': 9.648033126293997e-06, 'epoch': 1.55}
{'loss': 3.8638, 'grad_norm': 4.06355619430542, 'learning_rate': 7.060041407867495e-06, 'epoch': 1.94}
{'eval_loss': 3.81364107131958, 'eval_runtime': 116.452, 'eval_samples_per_second': 23.915, 'eval_steps_per_second': 2.997, 'epoch': 2.0}
{'loss': 3.8212, 'grad_norm': 4.162864685058594, 'learning_rate': 4.472049689440994e-06, 'epoch': 2.33}
{'loss': 3.819, 'grad_norm': 4.005804061889648, 'learning_rate': 1.884057971014493e-06, 'epoch': 2.72}
100%|██████████| 3864/3864 [2:00:33<00:00,  1.87s/it]
{'eval_loss': 3.8118786811828613, 'eval_runtime': 117.648, 'eval_samples_per_second': 23.672, 'eval_steps_per_second': 2.966, 'epoch': 3.0}
{'train_runtime': 7233.4878, 'train_samples_per_second': 4.271, 'train_steps_per_second': 0.534, 'train_loss': 3.8797500010109345, 'epoch': 3.0}

TrainOutput(global_step=3864, training_loss=3.8797500010109345, metrics={'train_runtime': 7233.4878, 'train_samples_per_second': 4.271, 'train_steps_per_second': 0.534, 'total_flos': 1009062726598656.0, 'train_loss': 3.8797500010109345, 'epoch': 3.0})
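
The `learning_rate` values in the training log are consistent with a linear decay from 2e-5 to 0 over the 3864 steps with no warmup. The sketch below reproduces them under that assumption; `linear_lr` is a hypothetical helper, not a library function, and the schedule type is inferred from the logged numbers rather than stated in the notebook.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 3864                 # global_step reported by TrainOutput
steps_per_epoch = total_steps // 3 # 1288, matching the evaluation points in the log

for step in (500, 1000, 3500):
    print(step, linear_lr(step, total_steps))
```

At step 500 this gives about 1.7412e-05, and at step 3500 about 1.8841e-06, in line with the logged values.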

In [13]:
import math

eval_results = trainer.evaluate()
print(f"perplexity: {math.exp(eval_results['eval_loss']):.2f}")

100%|██████████| 349/349 [01:43<00:00,  3.37it/s]

perplexity: 45.24
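
Perplexity is the exponential of the mean cross-entropy loss, so the per-epoch `eval_loss` values from the training log can be converted the same way. A small sketch using only the numbers logged above:

```python
import math

# eval_loss values copied from the training log, keyed by epoch.
eval_losses = {
    1: 3.8219423294067383,
    2: 3.81364107131958,
    3: 3.8118786811828613,
}

perplexities = {epoch: math.exp(loss) for epoch, loss in eval_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")
```

The epoch 3 value matches the 45.24 printed above, and the small drop between epochs mirrors the slow decrease in evaluation loss.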





## Inference

In [14]:
prompt = "Somatic hypermutation allows the immune system to"

In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
inputs = tokenizer(prompt, return_tensors="pt").input_ids

In [20]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model/checkpoint-3500")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
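
The `generate` call above samples with `top_k=50` and `top_p=0.95`. As a simplified illustration of what these two filters do to a next-token distribution, the sketch below applies top-k and then top-p (nucleus) filtering to a toy distribution in plain Python. `top_k_top_p_filter` and the toy values are hypothetical; this is not the library's code, only the general idea.

```python
import math


def top_k_top_p_filter(logits, top_k, top_p):
    """Return {token_index: probability} after top-k, then top-p filtering."""
    # Top-k: keep the k highest-scoring tokens, in descending order.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k]

    # Softmax over the surviving tokens only.
    max_logit = max(logits[i] for i in kept)
    weights = {i: math.exp(logits[i] - max_logit) for i in kept}
    norm = sum(weights.values())
    probs = {i: w / norm for i, w in weights.items()}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the remaining probabilities sum to 1 before sampling.
    total = sum(probs[i] for i in nucleus)
    return {i: probs[i] / total for i in nucleus}


# Toy distribution over four tokens with probabilities 0.5, 0.3, 0.15, 0.05.
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
filtered = top_k_top_p_filter(toy_logits, top_k=3, top_p=0.8)
print(filtered)
```

With these toy settings only tokens 0 and 1 survive, with renormalized probabilities of roughly 0.625 and 0.375; sampling then draws from that reduced set.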


In [21]:
tokenizer.batch_decode(outputs, skip_special_tokens=True)

["Somatic hypermutation allows the immune system to control this kind of damage. The immune system has no reason to control your own body. Instead, the immune system is essentially a small group of small cells in your body that are actually being damaged by your immune system. Because your body has no such cells to deal with each attack, this immune system is able to do so by changing the immune system's machinery. For example, if an echidna dies of the attacks, it will die of the same type of disease as the case of a"]