# Masked language modeling
Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
https://huggingface.co/docs/transformers/tasks/masked_language_modeling

Finetune DistilRoBERTa on the r/askscience subset of the ELI5 dataset: https://huggingface.co/datasets/eli5

In [1]:
from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:5000]")

Found cached dataset eli5 (C:/Users/lkk68/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa)


In [2]:
eli5 = eli5.train_test_split(test_size=0.2)

In [3]:
eli5["train"][0]

{'q_id': '2nea7d',
 'title': 'Is there a psychological disorder that defines a person who is very curious?',
 'selftext': 'I know a person (lets call him L) who is very curious and I started to think that it may be a psychological disorder. L always tries to search for details of something, even when this thing is irrelevant in the context of a conversation. L is always asking for more details until he has all the information that he wants to completely understand the situation.\n\nIs there a psychological disorder that defines this kind of behavior?',
 'document': '',
 'subreddit': 'askscience',
 'answers': {'a_id': ['cmcwppx', 'cmdt7rp'],
  'text': ["Does the behaviour exhibited by L cause harm/distress to him or others? Does it prevent him from having a job or getting an education?\n\nIf a behaviour is not harmful to an individual then it's not going to qualify for a diagnosis of a mental illness. I'm unaware of any disorder listed in the DSM that has excessive curiosity as a diagno

You’re only really interested in the text field (nested inside answers). What’s cool about language modeling tasks is that you don’t need labels (this is also known as an unsupervised task) because the labels come from the text itself: for masked language modeling, the masked tokens are the labels.

In [4]:
# Extract the text subfield from its nested structure with the flatten method:
eli5 = eli5.flatten()
eli5

DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers.a_id', 'answers.text', 'answers.score', 'title_urls.url', 'selftext_urls.url', 'answers_urls.url'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers.a_id', 'answers.text', 'answers.score', 'title_urls.url', 'selftext_urls.url', 'answers_urls.url'],
        num_rows: 1000
    })
})

Each subfield is now a separate column, as indicated by the answers prefix, and the answers.text field is now a list of strings.
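The effect of flattening can be imitated on a plain dictionary. This is a minimal sketch, not the datasets implementation; `flatten_record` and the sample row are hypothetical names and values made up for illustration.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts and keep the parent name as a prefix.
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical row shaped like an ELI5 record (values shortened for illustration).
row = {
    "q_id": "abc",
    "answers": {"a_id": ["x1", "x2"], "text": ["first answer", "second answer"]},
}
flat_row = flatten_record(row)
print(flat_row)
```

The nested answers dictionary becomes the two top-level keys answers.a_id and answers.text, which is why the later cells index the column as "answers.text".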

In [5]:
eli5["train"][0]

{'q_id': '2nea7d',
 'title': 'Is there a psychological disorder that defines a person who is very curious?',
 'selftext': 'I know a person (lets call him L) who is very curious and I started to think that it may be a psychological disorder. L always tries to search for details of something, even when this thing is irrelevant in the context of a conversation. L is always asking for more details until he has all the information that he wants to completely understand the situation.\n\nIs there a psychological disorder that defines this kind of behavior?',
 'document': '',
 'subreddit': 'askscience',
 'answers.a_id': ['cmcwppx', 'cmdt7rp'],
 'answers.text': ["Does the behaviour exhibited by L cause harm/distress to him or others? Does it prevent him from having a job or getting an education?\n\nIf a behaviour is not harmful to an individual then it's not going to qualify for a diagnosis of a mental illness. I'm unaware of any disorder listed in the DSM that has excessive curiosity as a dia

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

In [8]:
tokenizer(eli5["train"][0]['answers.text'])

{'input_ids': [[0, 25589, 4395, 75, 18533, 55, 1335, 77, 6740, 16, 355, 6, 11, 754, 24, 197, 185, 1181, 7, 1338, 10, 18533, 4, 8957, 16, 4212, 355, 77, 27513, 514, 13, 6836, 36, 3733, 4, 18236, 43, 13, 5840, 6, 8, 13, 2188, 1330, 7, 1021, 9426, 13510, 1164, 4, 50118, 50118, 1620, 13, 110, 200, 864, 6, 1271, 6740, 7, 2480, 514, 3035, 29, 24, 874, 321, 12938, 347, 142, 24, 1572, 5, 2480, 7, 20147, 4, 4448, 2577, 5, 2480, 3441, 41, 8135, 9, 1007, 7, 1108, 5, 3554, 227, 514, 20237, 6, 61, 32, 182, 670, 11, 5, 2705, 194, 36, 9226, 16, 373, 5, 2859, 73, 36837, 17163, 9, 24904, 322, 152, 8135, 9, 1007, 606, 31, 5, 17210, 1007, 9, 5, 514, 11824, 6, 4881, 63, 5181, 4, 96, 1285, 6, 103, 1007, 16, 67, 551, 7, 1108, 5, 1007, 9, 5, 7300, 12, 11428, 2175, 25, 5, 6740, 14863, 34190, 6, 959, 42, 16, 540, 9, 10, 5883, 87, 5, 3554, 11, 5, 2480, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [35]:
examples=eli5["train"]

In [36]:
len(examples)

4000

In [37]:
listexamples = [" ".join(x) for x in examples["answers.text"]]

In [42]:
len(listexamples)

4000

In [46]:
token_train=tokenizer(listexamples)

Token indices sequence length is longer than the specified maximum sequence length for this model (604 > 512). Running this sequence through the model will result in indexing errors


In [9]:
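class TokenizerWrapper:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    
    def tokenize_function(self, examples):
        # Join each example's list of answers into one string, then tokenize the batch.
        # padding="max_length" pads every example to the model maximum (512 tokens here),
        # so the chunks built in the grouping step also contain padding tokens.
        return self.tokenizer(
            [" ".join(x) for x in examples["answers.text"]],
            padding="max_length",
            truncation=True,
        )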

In [10]:
tokenizer_wrapper = TokenizerWrapper(tokenizer)

In [11]:
tokenized_dataset = eli5.map(tokenizer_wrapper.tokenize_function, batched=True, num_proc=3, remove_columns=eli5["train"].column_names)

Map (num_proc=3):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=3):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [12]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1000
    })
})

Concatenate all the sequences.
Split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough to fit in your GPU memory.

In [21]:
def group_texts(examples):
    block_size = 128
    # Concatenate all texts.
    #print(examples.keys())
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    #print('total_length:', total_length)
    # Drop the small remainder. You could pad instead if the model supports it;
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result
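To see what the grouping step does, it helps to run the same logic on a tiny batch. The sketch below restates the function with block_size as a parameter; `group_texts_demo` and `toy_batch` are hypothetical names used only for this illustration.

```python
def group_texts_demo(examples, block_size):
    # Concatenate the per-example lists of each column into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Slice every column into consecutive blocks of block_size items.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
chunks = group_texts_demo(toy_batch, block_size=4)
print(chunks["input_ids"])
```

The ten toy tokens yield two blocks of four, and the last two tokens are dropped. Note that block boundaries ignore example boundaries, so one block can contain the end of one answer and the start of the next.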

In [20]:
lm_dataset = tokenized_dataset.map(group_texts, batched=True, num_proc=4)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [22]:
lm_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 16000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4000
    })
})
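The row counts above can be checked by hand. The sketch assumes that padding="max_length" pads (or truncates) every example to 512 tokens, the limit named in the tokenizer warning earlier, so each example contributes a fixed number of blocks.

```python
max_length = 512   # assumed padded length per example (limit from the tokenizer warning)
block_size = 128   # value used in group_texts

blocks_per_example = max_length // block_size
train_rows = 4000 * blocks_per_example
test_rows = 1000 * blocks_per_example
print(blocks_per_example, train_rows, test_rows)
```

Under this assumption each example becomes exactly four blocks, which matches the 16000 and 4000 rows reported. It also suggests that blocks coming from short answers consist largely of padding tokens.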

Use the end-of-sequence token as the padding token and specify mlm_probability to randomly mask tokens each time you iterate over the data:

In [59]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
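The random masking can be pictured with a simplified sketch. It follows the usual BERT-style recipe (a selected token becomes the mask token 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time) and marks unselected positions with the label -100 so the loss ignores them. `mask_tokens_demo` is a hypothetical helper, the mask id 50264 is an assumed value, and the real collator additionally avoids masking special tokens.

```python
import random


def mask_tokens_demo(input_ids, mask_token_id, vocab_size, mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    masked_inputs, labels = [], []
    for token_id in input_ids:
        if rng.random() < mlm_probability:
            # Selected position: the original token is the prediction target.
            labels.append(token_id)
            roll = rng.random()
            if roll < 0.8:
                masked_inputs.append(mask_token_id)              # replace with the mask token
            elif roll < 0.9:
                masked_inputs.append(rng.randrange(vocab_size))  # replace with a random token
            else:
                masked_inputs.append(token_id)                   # keep the original token
        else:
            # Unselected position: ignored by the loss.
            labels.append(-100)
            masked_inputs.append(token_id)
    return masked_inputs, labels


toy_ids = list(range(100, 140))  # made-up token ids
masked, labels = mask_tokens_demo(toy_ids, mask_token_id=50264, vocab_size=50265)
print(sum(label != -100 for label in labels), "of", len(toy_ids), "positions selected")
```

Because the selection is redrawn every time a batch is built, the model sees a different masking pattern for the same text on every epoch.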

In [60]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

In [63]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="my_awesome_eli5_mlm_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

In [65]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

Cloning https://huggingface.co/lkk688/my_awesome_eli5_mlm_model into local empty directory.


In [66]:
trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,0.992,0.954211
2,0.9878,0.944883
3,0.9639,0.919972


TrainOutput(global_step=6000, training_loss=1.0057990061442057, metrics={'train_runtime': 516.3088, 'train_samples_per_second': 92.968, 'train_steps_per_second': 11.621, 'total_flos': 1591461679104000.0, 'train_loss': 1.0057990061442057, 'epoch': 3.0})
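The reported global_step can be reproduced with a short calculation. It assumes the default per-device batch size of 8, a single GPU, and no gradient accumulation, none of which are set explicitly in the cells above.

```python
import math

train_rows = 16000         # rows in lm_dataset["train"]
per_device_batch_size = 8  # assumed TrainingArguments default
num_epochs = 3             # num_train_epochs from training_args

steps_per_epoch = math.ceil(train_rows / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

With these assumptions the result matches global_step=6000 in the training output.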

In [67]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 2.52


In [68]:
text = "The Milky Way is a <mask> galaxy."

In [69]:
inputs = tokenizer(text, return_tensors="pt")

In [71]:
import torch
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_index

tensor([6])

In [72]:
model.device

device(type='cuda', index=0)

In [74]:
inputs=inputs.to('cuda')

In [75]:
logits = model(**inputs).logits

In [76]:
logits

tensor([[[ 3.5724,  4.5370,  4.9301,  ..., -2.8385, -1.0650,  5.4275],
         [ 2.4924,  1.4412, 11.5969,  ..., -1.1933,  0.2912,  6.0798],
         [-2.7016,  1.9855,  1.9075,  ..., -3.1717, -2.3135,  1.6989],
         ...,
         [-4.3938,  1.0695,  1.9510,  ..., -4.3306, -2.3445,  1.6941],
         [-4.5013,  1.1263,  8.4404,  ..., -3.8394, -2.0046, -0.3957],
         [ 1.5742, 12.4126,  8.0504,  ...,  0.6924,  0.8871,  6.6534]]],
       device='cuda:0', grad_fn=<ViewBackward0>)

In [77]:
logits.shape

torch.Size([1, 10, 50265])

In [78]:
mask_token_logits = logits[0, mask_token_index, :]
mask_token_logits

tensor([[-3.2886, -1.1666,  3.3591,  ..., -2.2179, -3.5258,  1.5552]],
       device='cuda:0', grad_fn=<IndexBackward0>)

In [79]:
mask_token_logits.shape

torch.Size([1, 50265])

Then take the three candidate tokens with the highest scores at the masked position and print them out:

In [80]:
top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

In [81]:
top_3_tokens

[21300, 2232, 30794]

In [82]:
for token in top_3_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

The Milky Way is a  spiral galaxy.
The Milky Way is a  massive galaxy.
The Milky Way is a  dwarf galaxy.
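The top-3 selection works directly on logits because softmax preserves their order, so the highest logits are also the highest probabilities. Here is a small pure-Python sketch of that idea; the helper names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k):
    # Rank indices by logit and pair each index with its softmax probability.
    probs = softmax(logits)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(i, probs[i]) for i in ranked]


toy_logits = [0.5, 3.0, -1.0, 2.0, 1.0]  # made-up scores over a 5-token vocabulary
best = top_k(toy_logits, 3)
for index, prob in best:
    print(index, round(prob, 3))
```

In the notebook the same ranking is done by torch.topk over the 50265 vocabulary entries at the masked position.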


# Causal language modeling
Causal language models are frequently used for text generation. Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
https://huggingface.co/docs/transformers/tasks/language_modeling
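The difference between attending only to the left and attending in both directions can be written as an attention mask. The following is an illustrative sketch with hypothetical helper names, where entry [i][j] is 1 if position i may attend to position j.

```python
def causal_mask(seq_len):
    # Position i may attend to positions 0..i only (lower-triangular matrix).
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len):
    # Every position may attend to every other position.
    return [[1] * seq_len for _ in range(seq_len)]


for row in causal_mask(4):
    print(row)
```

A causal model uses the first pattern, so the prediction at each position cannot depend on later tokens; a masked language model uses the second.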

Finetune DistilGPT2 on the r/askscience subset of the ELI5 dataset: https://huggingface.co/datasets/eli5

In [6]:
from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:5000]")
eli5 = eli5.train_test_split(test_size=0.2)

Found cached dataset eli5 (C:/Users/lkk68/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa)


In [7]:
eli5 = eli5.flatten()

In [8]:
eli5["train"][0]

{'q_id': '7i6w9r',
 'title': "What's the purpose of delayed-release Naproxen?",
 'selftext': '',
 'document': '',
 'subreddit': 'askscience',
 'answers.a_id': ['dqydn2w', 'dqy120t'],
 'answers.text': ['The above answer is not truly correct, and more so described extended release tablets. Delayed release tablets are similar to instant release tablets, except they have an enteric coating which delays drug release in the stomach. The enteric coating has several uses. The primary use is to avoid toxicity to the stomach lining. Naproxen is an NSAID that can lead to ulcers (damage to stomach lining).. The chemistry of the enteric coating resists the stomach acid, but once it arrives in the small intestine the drug releases and is absorbed. \n\nIf a tablet is ingested without this enteric coating, the release is not delayed and drug dissolution begins in the stomach. This is not an issue for most people, but those with ulcers, gastrointestinal issues, or those take a lot of NSAIDs will probab

In [21]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")

In [25]:
tokenizer.pad_token = tokenizer.eos_token

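In [15]:
from transformers import AutoTokenizer

# Alternative way to load the tokenizer. AutoTokenizer usually returns the fast variant (GPT2TokenizerFast).
# Per the cell numbers, this cell ran before In [21]; if you run it last instead, set tokenizer.pad_token again before padding.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")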

In [22]:
class TokenizerWrapper:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    
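    def tokenize_function(self, examples):
        # Join each example's list of answers into one string, then tokenize the batch.
        # padding="max_length" pads every example to the model maximum (1024 tokens for GPT-2),
        # which is consistent with the 32000 training blocks of 128 tokens produced below.
        return self.tokenizer(
            [" ".join(x) for x in examples["answers.text"]],
            padding="max_length",
            truncation=True,
        )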

In [23]:
tokenizer_wrapper = TokenizerWrapper(tokenizer)

In [26]:
tokenized_dataset = eli5.map(tokenizer_wrapper.tokenize_function, batched=True, num_proc=3, remove_columns=eli5["train"].column_names)

Map (num_proc=3):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=3):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [27]:
def group_texts(examples):
    block_size = 128
    # Concatenate all texts.
    #print(examples.keys())
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    #print('total_length:', total_length)
    # Drop the small remainder. You could pad instead if the model supports it;
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result

In [28]:
lm_dataset = tokenized_dataset.map(group_texts, batched=True, num_proc=4)

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

In [29]:
# Same preprocessing steps as in the masked LM section, applied here with the GPT-2 tokenizer
lm_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 32000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 8000
    })
})

In [30]:
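def addlabels(examples):
    # Copy input_ids into labels. The collator with mlm=False also builds labels
    # from input_ids, so this step is likely redundant with the setup below.
    examples["labels"] = examples["input_ids"].copy()
    return examples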

In [31]:
lm_datasetlabels = lm_dataset.map(addlabels)

Map:   0%|          | 0/32000 [00:00<?, ? examples/s]

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

In [32]:
lm_datasetlabels

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 32000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 8000
    })
})

Use the end-of-sequence token as the padding token and set mlm=False. The collator then copies the inputs as labels, and the model shifts them by one position when computing the loss, so each position is trained to predict the next token:

In [33]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
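The one-position shift between inputs and labels can be illustrated on a short list of token ids. This is a conceptual sketch with a hypothetical helper name, not the library code.

```python
def next_token_pairs(input_ids):
    labels = list(input_ids)         # labels start as a copy of the inputs
    contexts = input_ids[:-1]        # the last position has no next token to predict
    targets = labels[1:]             # labels shifted by one position
    return list(zip(contexts, targets))


toy_ids = [10, 11, 12, 13]  # made-up token ids
for current, target in next_token_pairs(toy_ids):
    print(f"after token {current} predict {target}")
```

Each pair lines up the token at one position with the token that follows it, which is the target the loss is computed against.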

In [42]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model_gpt2 = AutoModelForCausalLM.from_pretrained("distilgpt2")

In [43]:
training_args = TrainingArguments(
    output_dir="./output/my_awesome_eli5_clm-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    num_train_epochs=3
)

trainer = Trainer(
    model=model_gpt2,
    args=training_args,
    train_dataset=lm_datasetlabels["train"],
    eval_dataset=lm_datasetlabels["test"],
    data_collator=data_collator,
)

In [44]:

trainer.train()



Epoch,Training Loss,Validation Loss
1,3.8081,3.726445
2,3.6842,3.708932
3,3.6509,3.707046


TrainOutput(global_step=12000, training_loss=3.72333686319987, metrics={'train_runtime': 746.6018, 'train_samples_per_second': 128.583, 'train_steps_per_second': 16.073, 'total_flos': 3135561007104000.0, 'train_loss': 3.72333686319987, 'epoch': 3.0})

In [45]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 40.73
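Perplexity is the exponential of the average cross-entropy loss, so it can be recomputed from the validation losses in the training table above. The snippet only reuses those recorded numbers.

```python
import math

# Validation losses per epoch, copied from the training table above.
validation_losses = {1: 3.726445, 2: 3.708932, 3: 3.707046}

perplexities = {epoch: math.exp(loss) for epoch, loss in validation_losses.items()}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")
```

The value for epoch 3 agrees with the perplexity printed by the evaluation cell, and the small drop between epochs shows how little the validation loss moved after the first epoch.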


trainer.push_to_hub()

In [46]:
prompt = "Somatic hypermutation allows the immune system to"

In [53]:
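from transformers import pipeline

# Note: model="distilgpt2" loads the base checkpoint from the Hub, not the fine-tuned model_gpt2 trained above.
generator = pipeline("text-generation", model="distilgpt2")
generator(prompt)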

A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Somatic hypermutation allows the immune system to distinguish different pathogens such as leukaemia from leukaemia in humans. However, we did not investigate any of the possible pathogenic diseases in the immune system because of these issues.\n\n\n'}]

In [47]:
inputs = tokenizer(prompt, return_tensors="pt").input_ids

In [48]:
inputs=inputs.to('cuda')

In [49]:
inputs.device

device(type='cuda', index=0)

In [50]:
model_gpt2.device

device(type='cuda', index=0)

In [51]:
outputs = model_gpt2.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
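The top_k and top_p arguments restrict which tokens can be sampled at each step. Below is a simplified sketch of that filtering on a toy distribution; `top_k_top_p_filter` is a hypothetical helper, and the library's exact handling of ties and thresholds may differ.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    # Keep the top_k most probable token indices.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)
    # Within those, keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break
    # Renormalize the surviving probabilities so they sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # made-up next-token distribution
filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.8)
rng = random.Random(0)
sampled = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(sorted(filtered), sampled)
```

Lower values of top_k or top_p make generation more conservative, while do_sample=True is what makes the output vary between runs.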


In [52]:
outputs

tensor([[   50, 13730,  8718,    76,  7094,  3578,   262, 10900,  1080,   284,
           779,   606,   284, 41229,   572, 38366,   422,  4369,    13,   632,
           635,  1724,   326,   611,   345,   804,   379,   262, 35757,  2974,
           287,   262,  3632,    11,   345,   460,   766,   326,   530,   318,
          1682,  1016,   284,   307,  4047, 18290,   284,   477,  6982,   286,
         38366,    13,   198,   198,  6943, 20547,  4327,   284,   307,   366,
         38345,     1,   290,  4143,   691,  7580,  1728,  3354,   286,   262,
          1692,  1767,    13,   632,   338,   407,   281,  3489,  3572,    11,
           780,   981,   345,   460,   787,   257,  1256,   286, 30869,   416,
          2045,   329,   606,    11,   340,  2331,  5340,   284,   466,  2279,
           290,   484,  1183,  4143,   307,   517, 17769,   621,   262, 20547]],
       device='cuda:0')

In [53]:
tokenizer.batch_decode(outputs, skip_special_tokens=True)

['Somatic hypermutation allows the immune system to use them to fend off pathogens from disease. It also means that if you look at the antibody levels in the brain, you can see that one is actually going to be highly resistant to all kinds of pathogens.\n\nMost viruses tend to be "immune" and generally only infect certain parts of the human body. It\'s not an obvious choice, because while you can make a lot of antibodies by looking for them, it seems impossible to do everything and they\'ll generally be more lethal than the viruses']