In [None]:
!pip install -q transformers datasets evaluate

# Causal Language Modeling

Causal language models are frequently used for text generation. We can use these models for creative applications such as text adventures or coding assistants.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens.
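
The "left only" attention rule can be pictured as a lower-triangular mask. The snippet below is a minimal sketch in plain Python, not code from any library; `causal_mask` is a hypothetical helper introduced for illustration.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len mask: entry [i][j] is 1 if position i may attend to position j."""
    # Position i can see itself and everything to its left (j <= i), nothing to its right.
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
```

Each row has ones up to and including its own position and zeros afterwards, so no position receives information from future tokens.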

## Load ELI5 dataset

In [None]:
from datasets import load_dataset

eli5 = load_dataset('eli5_category', split='train[:5000]')

The repository for eli5_category contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/eli5_category.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/91772 [00:00<?, ? examples/s]

Generating validation1 split:   0%|          | 0/5446 [00:00<?, ? examples/s]

Generating validation2 split:   0%|          | 0/2375 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5411 [00:00<?, ? examples/s]

In [None]:
eli5 = eli5.train_test_split(test_size=0.2)

In [None]:
eli5['train'][0]

{'q_id': '5nbpg0',
 'title': 'What is the downside to taking out lots of loans, or credit cards and declaring bankruptcy? (UK)',
 'selftext': "I should preface this by saying this is a hypothetical, and i'm not actually going to do this. I have a good job and house(mortgaged). My Experian credit score is 999 out of 999, I am already approved for 10's of thousands in credit. As i understand it, credit cards and personal loans are unsecured debt, meaning they can't take my house if i fail to pay. What stops me from buying £50,000 worth of gold on my credit cards, and not paying them back?",
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dca6nqs', 'dca87a0', 'dca8ei9'],
  'text': ["That's not quite how unsecured debt works. All it means is they can't automatically claim ownership of your house if you miss three payments. But they can still sue you in small claims court, and a bailiff will confiscate your property, sell it, and give the money to your cr

What we are really interested in is the `text` field. Language modeling is an unsupervised (more precisely, self-supervised) task: no separate labels are needed because the next token in the sequence is the label.
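
To make "the next token is the label" concrete, here is a small sketch. It assumes whitespace-separated words stand in for real subword tokens, and `next_token_examples` is a hypothetical helper, not part of `transformers` or `datasets`.

```python
def next_token_examples(tokens):
    """Pair every left context with the token that follows it."""
    # For a sequence of n tokens there are n - 1 (context, label) pairs.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


examples = next_token_examples(["The", "cat", "sat", "down"])
for context, label in examples:
    print(context, "->", label)
```

A single sequence therefore yields many training signals without any manual annotation.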

## Preprocess

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert/distilgpt2')

From the example above, we can see that the `text` field is nested inside the `answers` field, so we need to extract the `text` subfield from its nested structure with the `flatten` method:

In [None]:
eli5 = eli5.flatten()

In [None]:
eli5['train'][0]

{'q_id': '5nbpg0',
 'title': 'What is the downside to taking out lots of loans, or credit cards and declaring bankruptcy? (UK)',
 'selftext': "I should preface this by saying this is a hypothetical, and i'm not actually going to do this. I have a good job and house(mortgaged). My Experian credit score is 999 out of 999, I am already approved for 10's of thousands in credit. As i understand it, credit cards and personal loans are unsecured debt, meaning they can't take my house if i fail to pay. What stops me from buying £50,000 worth of gold on my credit cards, and not paying them back?",
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dca6nqs', 'dca87a0', 'dca8ei9'],
 'answers.text': ["That's not quite how unsecured debt works. All it means is they can't automatically claim ownership of your house if you miss three payments. But they can still sue you in small claims court, and a bailiff will confiscate your property, sell it, and give the money to your

Each subfield is now a separate column, as indicated by the `answers` prefix, and `answers.text` is now a list of strings. Instead of tokenizing each answer separately, join the list into a single string so the answers are tokenized together.
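
The effect of flattening and joining can be illustrated on a single made-up record. This is only a sketch of the idea: `flatten_record` is a hypothetical helper and the record is invented, so it does not reflect how `datasets` implements `flatten` internally.

```python
def flatten_record(record, sep="."):
    """Lift one level of nested dict keys into dotted top-level keys."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Nested columns become "parent.child" columns.
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Invented record with the same shape as an ELI5 example.
record = {
    "q_id": "abc123",
    "answers": {"a_id": ["a1", "a2"], "text": ["First answer.", "Second answer."]},
}
flat = flatten_record(record)
joined = " ".join(flat["answers.text"])  # one string per example, ready for the tokenizer
print(sorted(flat))
print(joined)
```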

In [None]:
def preprocess_function(examples):
    return tokenizer([' '.join(x) for x in examples['answers.text']])

In [None]:
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5['train'].column_names,
)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1667 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2145 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1646 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1204 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1153 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1823 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1985 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1120 > 1024). Running this sequence through the model will result in indexing errors


The tokenized dataset now contains the token sequences but, as the warnings above show, some of them are longer than the model's maximum input length (1024 tokens).

We need a second preprocessing function to
* concatenate all the sequences
* split the concatenated sequences into shorter chunks of `block_size` tokens, which should be no longer than the maximum input length and short enough to fit in our GPU RAM.

In [None]:
block_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # Drop the smaller remainder. We could add padding if the model supported it instead of dropping
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    # Split by chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result['labels'] = result['input_ids'].copy()

    return result

In [None]:
lm_dataset = tokenized_eli5.map(
    group_texts,
    batched=True,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]
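
To see what the grouping step does, the same concatenate-then-chunk logic can be run on a tiny batch. This is a sketch with invented token ids and a deliberately small block size; `group_into_blocks` is a hypothetical stand-in that takes `block_size` as a parameter.

```python
def group_into_blocks(batch, block_size):
    """Concatenate each column's sequences, then cut them into blocks of block_size."""
    concatenated = {k: [tok for seq in seqs for tok in seq] for k, seqs in batch.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels start out as a copy of the inputs.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten toy tokens yield two full blocks of four, and the two leftover tokens are dropped. Block boundaries ignore the original example boundaries.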

Now create a batch of examples using `DataCollatorForLanguageModeling`. It is more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

Use the end-of-sequence token as the padding token and set `mlm=False`. With this setting the collator copies the inputs to use as labels; the model then shifts them by one position when computing the loss, so each token is predicted from the tokens to its left:

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
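
Dynamic padding can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the collator's actual implementation: `collate` is a hypothetical function, the token ids are invented, and it masks only the positions it pads. The value -100 is the label index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def collate(sequences, pad_token_id):
    """Pad every sequence to the longest one in this batch and build matching labels."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch)
```

Padding only up to the longest sequence in each batch keeps short batches short, which is where the efficiency gain comes from.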

## Train

In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained('distilbert/distilgpt2')

In [None]:
training_args = TrainingArguments(
    output_dir='my_eli5_clm-model',
    eval_strategy='epoch',
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

## Evaluate

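We can use the `.evaluate()` method to evaluate our model. It returns the average cross-entropy loss on the evaluation set as `eval_loss`, and exponentiating that loss gives the perplexity: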

In [None]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
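
The relationship between loss and perplexity can be checked on a toy case. The sketch below assumes we already know the probability the model assigned to each correct next token; `perplexity` is a hypothetical helper and the probabilities are invented.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities assigned to each correct next token."""
    # Cross-entropy loss is the mean negative log-probability of the correct tokens.
    mean_loss = sum(-math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_loss)


uniform = perplexity([0.25, 0.25, 0.25, 0.25])
certain = perplexity([1.0, 1.0])
print(round(uniform, 2), round(certain, 2))
```

A model that always spreads its probability evenly over four options has a perplexity of 4, and a model that is always certain and correct has a perplexity of 1, so lower is better.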

## Inference

In [None]:
prompt = "Somatic hypermutation allows the immune system to"

In [None]:
from transformers import pipeline

generator = pipeline('text-generation', model='stevhliu/my_awesome_eli5_clm-model')

In [None]:
generator(prompt)

[{'generated_text': "Somatic hypermutation allows the immune system to develop a new immune response. This is called immunopositivity syndrome, based on the idea that this innate immunity is actually in part responsible for the damage that's caused to living cells such as the"}]

Manually replicate the `pipeline` results:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('stevhliu/my_awesome_eli5_clm-model')
model = AutoModelForCausalLM.from_pretrained('stevhliu/my_awesome_eli5_clm-model')



In [None]:
# Pass the full tokenizer output so generate() also receives the attention mask
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    **inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # no dedicated pad token, so reuse EOS
)

tokenizer.batch_decode(outputs, skip_special_tokens=True)


['Somatic hypermutation allows the immune system to be able to stop the cells from taking action. In addition to this being a very useful way of inhibiting a person\'s immune system by activating cells (the "dysfunctional", as such). For example, people can "predict" a disease when it affects a person\'s immune system by the fact that they have immune systems activated and this is how they detect the virus. Edit: Added note: When someone has the option of cancelling the medication for some reason, they don\'t have']
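
The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The following is a rough sketch of that idea on an invented four-word distribution. It is not the implementation used by `generate`, and `normalize` and `top_k_top_p_filter` are hypothetical helpers.

```python
import random


def normalize(pairs):
    """Rescale (token, probability) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(tok, p / total) for tok, p in pairs]


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    ranked = normalize(ranked[:top_k])  # top-k: discard everything outside the k best
    kept, cumulative = [], 0.0
    for token, p in ranked:  # top-p: stop once enough probability mass is covered
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return dict(normalize(kept))


probs = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.05}  # invented next-token distribution
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.7)
rng = random.Random(0)
sampled = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered, sampled)
```

Only "the" and "a" survive both filters here, so unlikely tokens cannot be drawn even though sampling stays random.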