In [None]:
%%capture
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1
!pip install "accelerate>=0.20.1"
!pip install wandb


A causal language model, in natural language processing (NLP), is an autoregressive model: it predicts the next token in a sequence using only the tokens that come before it. "Causal" refers to this left-to-right ordering, in which a prediction may depend on earlier positions but never on later ones. It does not mean that the model reasons about cause-and-effect relationships in the text. Next-token prediction is the training objective behind text generation, and models trained this way are also applied to tasks such as summarization and question answering.

Causal models are usually built on transformer-based architectures, which are designed to process sequential data like text effectively. The transformer architecture introduced self-attention, which lets the model draw on context from other positions when making a prediction for each token. In a causal model, a mask restricts that attention to the current and earlier positions, so the model captures dependencies on the left context only. Masked language models such as BERT, by contrast, attend in both directions.

OpenAI's GPT (Generative Pre-trained Transformer) models, including GPT-3, are examples of causal language models. These models are pre-trained on large amounts of diverse text data and can generate coherent and contextually relevant text based on a given prompt.

For example, consider the following prompt:

Prompt: "Because it was raining, he decided to take"
Given this prompt, a causal language model assigns a probability to every token in its vocabulary as the next token, conditioned only on the words seen so far. A well-trained model gives high probability to a continuation such as "an umbrella". Generation repeats this step, appending one token at a time.

Causal language models are useful for a wide range of NLP applications, including text generation, summarization, and completing sentences or paragraphs from a given context. In this notebook we fine-tune one such model, DistilGPT2, on a subset of the ELI5 dataset.
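In a causal language model, each position can attend only to itself and to earlier positions, which can be pictured as a lower-triangular attention mask. The following is a minimal sketch in plain Python, not the code any particular library uses; the helper name `causal_mask` is hypothetical.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is 1 if position i
    may attend to position j (that is, j <= i) and 0 otherwise."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


tokens = ["Because", "it", "was", "raining"]
mask = causal_mask(len(tokens))

# Show which tokens each position is allowed to see.
for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r:>10} attends to {visible}")
```

The first token sees only itself, while the last token sees the whole prefix; no position sees anything to its right.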

In [None]:
import os
from huggingface_hub import login

In [None]:
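# Read the access token from the environment instead of hard-coding it in the notebook.
# Assumes the HF_TOKEN environment variable has been set beforehand.
login(token=os.environ["HF_TOKEN"])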

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
from datasets import load_dataset

eli5= load_dataset("eli5", split="train_asks[:5000]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
eli5

Dataset({
    features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers', 'title_urls', 'selftext_urls', 'answers_urls'],
    num_rows: 5000
})

In [None]:
eli5=eli5.train_test_split(test_size=0.2)
eli5

DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers', 'title_urls', 'selftext_urls', 'answers_urls'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers', 'title_urls', 'selftext_urls', 'answers_urls'],
        num_rows: 1000
    })
})

In [None]:
eli5["train"][0]

{'q_id': '10xyjt',
 'title': 'A question regarding Aqua Regia',
 'selftext': '_URL_0_\nIn this video People try to destroy an iPhone 5 by putting it it in Aqua Regia, however it reacts very slowly. However they then dipped it in Hydrofluric (to which it did not appear to react any better with) and returned it to the aqua regia and it then reacted very vigorously. Why is this?',
 'document': '',
 'subreddit': 'askscience',
 'answers': {'a_id': ['c6hluqo', 'c6hnh86'],
  'text': ["If there is ever an example of how _not_ to handle dangerous chemicals, this would be it. Short sleeves and shorts, wrong respirator, wrong gloves, they're just asking for trouble.\n\nI notice that they didn't dip it in hydrofluoric acid; they dipped it in hydrofluoric acid with a whole bunch of organic materials. When that is transferred back to the aqua regia, there is also a large transfer of organic compounds. The initial bubbling is likely the organic compounds, and if anything, the increased agitation just

In [None]:
# Flatten first so the nested `answers` field is exposed as the `answers.text` column.
train_df=eli5["train"].flatten().to_pandas()

In [None]:
train_df["answers.text"][0]

array(["If there is ever an example of how _not_ to handle dangerous chemicals, this would be it. Short sleeves and shorts, wrong respirator, wrong gloves, they're just asking for trouble.\n\nI notice that they didn't dip it in hydrofluoric acid; they dipped it in hydrofluoric acid with a whole bunch of organic materials. When that is transferred back to the aqua regia, there is also a large transfer of organic compounds. The initial bubbling is likely the organic compounds, and if anything, the increased agitation just speeds up the reaction between aqua regia and the metal.",
       "There could have been a coating on the iphone (maybe some type of enamel) which was resistant to aqua regia.  HF is pretty good at attacking stuff which most acids won't (like glass), so if there was a coating which was resistant to AR but not HF, the HF would take it off and then allow the AR to start dissolving the phone.\n\nThat's my best guess."],
      dtype=object)

In [None]:
from transformers import AutoTokenizer

model_name="distilgpt2"
tokenizer=AutoTokenizer.from_pretrained(model_name)

In [None]:
eli5=eli5.flatten()
eli5["train"][0]

{'q_id': '10xyjt',
 'title': 'A question regarding Aqua Regia',
 'selftext': '_URL_0_\nIn this video People try to destroy an iPhone 5 by putting it it in Aqua Regia, however it reacts very slowly. However they then dipped it in Hydrofluric (to which it did not appear to react any better with) and returned it to the aqua regia and it then reacted very vigorously. Why is this?',
 'document': '',
 'subreddit': 'askscience',
 'answers.a_id': ['c6hluqo', 'c6hnh86'],
 'answers.text': ["If there is ever an example of how _not_ to handle dangerous chemicals, this would be it. Short sleeves and shorts, wrong respirator, wrong gloves, they're just asking for trouble.\n\nI notice that they didn't dip it in hydrofluoric acid; they dipped it in hydrofluoric acid with a whole bunch of organic materials. When that is transferred back to the aqua regia, there is also a large transfer of organic compounds. The initial bubbling is likely the organic compounds, and if anything, the increased agitation j

In [None]:
eli5["train"].column_names

['q_id',
 'title',
 'selftext',
 'document',
 'subreddit',
 'answers.a_id',
 'answers.text',
 'answers.score',
 'title_urls.url',
 'selftext_urls.url',
 'answers_urls.url']

In [None]:
def preprocess_function(example):
  return tokenizer([" ".join(x) for x in example["answers.text"]])
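Each `answers.text` entry is a list of answer strings, so the function joins them into a single string per question before tokenizing. A toy illustration of just the joining step, using made-up data and no tokenizer:

```python
# Made-up batch in the same shape as the flattened `answers.text` column:
# one list of answer strings per question.
batch = {"answers.text": [["First answer.", "Second answer."], ["Only answer."]]}

# One joined string per question, ready to be passed to a tokenizer.
joined = [" ".join(answers) for answers in batch["answers.text"]]
print(joined)
```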

In [None]:
tokenized_eli5=eli5.map(preprocess_function,
                        batched=True,
                        num_proc=4,
                        remove_columns=eli5["train"].column_names)

Token indices sequence length is longer than the specified maximum sequence length for this model (1723 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1088 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2103 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3180 > 1024). Running this sequence through the model will result in indexing errors


Token indices sequence length is longer than the specified maximum sequence length for this model (1391 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1075 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1076 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1645 > 1024). Running this sequence through the model will result in indexing errors


In [None]:
tokenized_eli5

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [None]:
tokenized_eli5.keys()

dict_keys(['train', 'test'])

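The warnings above show that some joined answers exceed the model's maximum input length of 1,024 tokens. Rather than truncating them, we concatenate all token sequences and split the result into fixed-size blocks of `block_size` tokens, dropping the small remainder at the end of each batch that does not fill a whole block. Because the blocks then have the same length, no padding is needed at this stage.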

In [None]:
block_size=128

def group_texts(examples):
    concatenated_examples={k: sum(examples[k], []) for k in examples.keys()}
    total_length=len(concatenated_examples[list(examples.keys())[0]])
    if total_length>=block_size:
        total_length=(total_length//block_size)* block_size
    # Split by chunks of block size
    result={
        k: [t[i: i+block_size] for i in range(0, total_length, block_size)]
        for k,t in concatenated_examples.items()
    }

    result["labels"]=result["input_ids"].copy()
    return result
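To see what the grouping step does, here is a small sketch that applies the same logic to made-up token ids with a block size of 4. The function name `group_into_blocks` and the toy data are illustrative assumptions; the logic mirrors `group_texts` above with `block_size` passed as a parameter.

```python
def group_into_blocks(examples, block_size):
    """Concatenate every column's lists, drop the remainder that does not fill
    a whole block, and split what is left into blocks of `block_size`."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [v[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }
    # For causal language modeling the labels start out as a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result


# Three made-up tokenized answers with 3, 4, and 3 tokens (10 in total).
toy = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
blocks = group_into_blocks(toy, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]; tokens 9 and 10 are dropped
```

Block boundaries ignore the original answer boundaries, which is why the number of rows after grouping differs from the number of questions.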

In [None]:
lm_dataset=tokenized_eli5.map(group_texts, batched=True, num_proc=4)

In [None]:
lm_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 9158
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2118
    })
})

Next, we use a data collator that dynamically pads the sequences in each batch to the length of the longest one during collation, instead of padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token=tokenizer.eos_token
# Use the end-of-sequence token as the padding token and set `mlm=False`.
# The collator then copies `input_ids` into `labels` (padding positions become -100);
# the model shifts them by one position internally when it computes the loss.
data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
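With `mlm=False`, the labels are a copy of the inputs, and the one-position shift is applied when the loss is computed. A plain-Python sketch of that alignment, using made-up token ids; the helper `next_token_pairs` is hypothetical, and `-100` is the label value that the loss ignores.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def next_token_pairs(input_ids, labels):
    """Pair each position's input token with the label of the following
    position, skipping positions whose target is IGNORE_INDEX."""
    pairs = []
    for position in range(len(input_ids) - 1):
        target = labels[position + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[position], target))
    return pairs


input_ids = [11, 22, 33, 44]
labels = [11, 22, 33, IGNORE_INDEX]  # last position treated as padding
print(next_token_pairs(input_ids, labels))  # [(11, 22), (22, 33)]
```

Each pair reads as "given this token (and everything before it), predict that token", which is the next-token objective the model is trained on.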

In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model=AutoModelForCausalLM.from_pretrained("distilgpt2")

In [None]:
training_args=TrainingArguments(
    output_dir="content/result",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    report_to="wandb",
    push_to_hub=False,
)

trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,3.9145,3.749817
2,3.8121,3.734173
3,3.767,3.724835
4,3.7279,3.722605
5,3.7153,3.722373


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=2865, training_loss=3.776197439463351, metrics={'train_runtime': 1232.7868, 'train_samples_per_second': 37.143, 'train_steps_per_second': 2.324, 'total_flos': 1495597276200960.0, 'train_loss': 3.776197439463351, 'epoch': 5.0})

In [None]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 41.36
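Perplexity is the exponential of the average per-token cross-entropy loss, so the same conversion can be applied to each epoch's validation loss. A short sketch using the validation losses from the training log above:

```python
import math

# Validation losses copied from the training log above, keyed by epoch.
validation_loss = {1: 3.749817, 2: 3.734173, 3: 3.724835, 4: 3.722605, 5: 3.722373}

for epoch, loss in validation_loss.items():
    print(f"epoch {epoch}: perplexity = {math.exp(loss):.2f}")
```

Because the exponential is monotonic, a lower loss always means a lower perplexity; the small loss improvements in later epochs translate into correspondingly small perplexity gains.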


In [None]:
model.save_pretrained("saved_model")
tokenizer.save_pretrained("saved_model")

('saved_model/tokenizer_config.json',
 'saved_model/special_tokens_map.json',
 'saved_model/vocab.json',
 'saved_model/merges.txt',
 'saved_model/added_tokens.json',
 'saved_model/tokenizer.json')

In [None]:
causal_model=AutoModelForCausalLM.from_pretrained("saved_model")
causal_tokenizer=AutoTokenizer.from_pretrained("saved_model")

In [None]:
from transformers import pipeline

prompt="Somatic hypermutation allows the immune system to"

generator=pipeline("text-generation" , model=causal_model , tokenizer=causal_tokenizer)

generator(prompt)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Somatic hypermutation allows the immune system to respond appropriately effectively to various physical conditions such as skin infection and infection.\n\nBecause immunological responses to diseases like malaria and malaria are largely unknown, I should say that these types of responses are very'}]

In [None]:
inputs=causal_tokenizer(prompt, return_tensors="pt")

# Pass the attention mask together with the input ids, and set the pad token explicitly.
outputs=causal_model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95, pad_token_id=causal_tokenizer.eos_token_id)

causal_tokenizer.batch_decode(outputs, skip_special_tokens=True)


["Somatic hypermutation allows the immune system to use a selective promoter, so when an attacker can't suppress the immune response, the immune system can suppress the activity of the brain and, indeed, it can also suppress the activity of the other brain, causing a negative reaction.\n\nThis means that a person with a specific disease can potentially get one type of treatment in a single pill (which can be taken separately from a particular drug) or a pill with different pharmacological and pharmacological properties (which can potentially include multiple factors), allowing them"]
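The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The following is a simplified sketch of the idea on a made-up next-token distribution. The function `top_k_top_p_filter` is hypothetical, and it works on probabilities for readability; the library applies its filters to logits, so details may differ.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of those
    whose renormalized cumulative probability reaches top_p, and renormalize
    the survivors so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    top_k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / top_k_total
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    return {token: p / kept_total for token, p in kept}


# Made-up next-token probabilities over a four-token vocabulary.
next_token_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.8)
print(filtered)  # only "a" and "b" survive, with probabilities of about 0.625 and 0.375

# Sample the next token from the filtered distribution.
rng = random.Random(0)
sampled = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(sampled)
```

Lower values of `top_k` or `top_p` make the output more conservative, while higher values allow less likely tokens and therefore more varied text.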