In [None]:
!pip install -q transformers datasets evaluate

# Masked Language Modeling

Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence.
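
To make "bidirectional" concrete, here is a toy sketch in plain Python that compares which tokens are visible at the masked position under a left-to-right (causal) attention mask and under a bidirectional one. The helper names `causal_mask` and `bidirectional_mask` are made up for this illustration and are not part of any library; real models build these masks as tensors.

```python
def causal_mask(n):
    """Row i marks the positions that position i may attend to: 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


tokens = ["The", "Milky", "Way", "is", "a", "<mask>", "galaxy", "."]
n = len(tokens)
mask_pos = tokens.index("<mask>")

# Tokens visible from the masked position under each scheme
causal_visible = [t for t, m in zip(tokens, causal_mask(n)[mask_pos]) if m]
bidir_visible = [t for t, m in zip(tokens, bidirectional_mask(n)[mask_pos]) if m]

print("causal:       ", causal_visible)
print("bidirectional:", bidir_visible)
```

Under the causal mask the word "galaxy" is hidden from the masked position, while the bidirectional mask exposes it, which is the extra context a masked language model can use.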

## Load ELI5 dataset

In [None]:
from datasets import load_dataset

eli5 = load_dataset("eli5_category", split="train[:5000]")
eli5 = eli5.train_test_split(test_size=0.2)

The repository for eli5_category contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/eli5_category.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

In [None]:
eli5["train"][0]

{'q_id': '5mswoe',
 'title': 'Why do far away things appear small?',
 'selftext': 'This might be the dumbest question I\'ve ever asked in my entire life, but I don\'t have a real answer to it other than "shut up that\'s a stupid question." If something is 10 feet tall, why can\'t it appear to be literally 10 feet tall everywhere? Why does it look 5 feet tall further away? It\'s not like the photons that are reflecting off of it are narrowing their scope the further they travel away, rightv',
 'category': 'Physics',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dc658ir', 'dc62nf8', 'dc62obx'],
  'text': ['Draw a line from each rod and cone in your eye, through your lens and out into the world. Your field of view is composed of anything that intercepts those lines. These lines expand outward in a cone shape (because your eye\'s lens is convex), so objects that are closer are "hit" by more of these view-lines than they would be if they were distant. The lines represent how lig

## Preprocess

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert/distilroberta-base')

Since the `text` field is nested inside the `answers` field of the `eli5` dataset, we need to extract the `text` subfield from its nested structure with the `flatten` method:

In [None]:
eli5 = eli5.flatten()
eli5['train'][0]

{'q_id': '5mswoe',
 'title': 'Why do far away things appear small?',
 'selftext': 'This might be the dumbest question I\'ve ever asked in my entire life, but I don\'t have a real answer to it other than "shut up that\'s a stupid question." If something is 10 feet tall, why can\'t it appear to be literally 10 feet tall everywhere? Why does it look 5 feet tall further away? It\'s not like the photons that are reflecting off of it are narrowing their scope the further they travel away, rightv',
 'category': 'Physics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dc658ir', 'dc62nf8', 'dc62obx'],
 'answers.text': ['Draw a line from each rod and cone in your eye, through your lens and out into the world. Your field of view is composed of anything that intercepts those lines. These lines expand outward in a cone shape (because your eye\'s lens is convex), so objects that are closer are "hit" by more of these view-lines than they would be if they were distant. The lines represent how 
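
The effect of flattening can be sketched in plain Python. The helper `flatten_record` below is hypothetical and only mimics the dotted column names seen in the output above; it is not how the `datasets` library implements `flatten`, and the record contents are shortened placeholders.

```python
def flatten_record(record, parent_key=""):
    """Turn nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested fields such as `answers`
            flat.update(flatten_record(value, full_key))
        else:
            flat[full_key] = value
    return flat


# Placeholder record with the same shape as an ELI5 example
example = {
    "q_id": "5mswoe",
    "answers": {
        "a_id": ["dc658ir", "dc62nf8"],
        "text": ["first answer", "second answer"],
    },
}

flat = flatten_record(example)
print(sorted(flat))
```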

Instead of tokenizing each answer separately, join each example's list of answers into a single string so we can tokenize them jointly.

In [None]:
def preprocess_function(examples):
    return tokenizer([' '.join(x) for x in examples['answers.text']])
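
As a quick illustration of the joining step on its own, the sketch below applies the same `' '.join(...)` expression to a made-up batch (the answer strings are placeholders, not dataset content). Each example's list of answers becomes one string, so the tokenizer later receives one text per example.

```python
# Placeholder batch in the shape that `map(..., batched=True)` passes in
batch = {
    "answers.text": [
        ["Light spreads out.", "Angles get smaller."],
        ["Only one answer here."],
    ]
}

# One joined string per example
joined = [" ".join(answers) for answers in batch["answers.text"]]
print(joined)
```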

In [None]:
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5['train'].column_names,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (2439 > 512). Running this sequence through the model will result in indexing errors

The dataset now contains token sequences, but some of them are longer than the model's maximum input length. We need a second preprocessing function that:
* concatenates all the sequences
* splits the concatenated sequences into shorter chunks of `block_size` tokens, where `block_size` is both below the maximum input length and small enough to fit in GPU memory.

In [None]:
block_size = 128

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder; we could pad it instead if the model supported that
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }

    return result
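
To see what the concatenate-and-chunk step does, here is the same logic run on a tiny made-up batch. The names `toy_group_texts` and `toy_batch` are illustrative, and the block size of 4 is chosen only to keep the output readable.

```python
def toy_group_texts(examples, block_size):
    """Same logic as `group_texts`, with the block size passed explicitly."""
    # Concatenate every column into one long list
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

chunks = toy_group_texts(toy_batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten input tokens yield two blocks of four, the last two tokens (9 and 10) are dropped, and block boundaries no longer line up with the original example boundaries.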

In [None]:
lm_dataset = tokenized_eli5.map(
    group_texts,
    batched=True,
    num_proc=4,
)

Now create a batch of examples using `DataCollatorForLanguageModeling`. It is more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
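
Dynamic padding can be sketched in a few lines of plain Python. The helper `pad_batch` and the padding id `0` are assumptions made for this illustration; the real collator works on tensors and takes the padding id from the tokenizer.

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8, 9], [10]], pad_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is computed per batch, a batch of short sequences never pays for the longest sequence in the whole dataset.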

Use the end-of-sequence token as the padding token and specify `mlm_probability` to randomly mask tokens each time we iterate over the data:

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)
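
For intuition about what random masking involves, the sketch below implements the commonly described BERT-style rule in plain Python: each token is selected with probability `mlm_probability`; a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. Unselected positions get the label `-100`, which the loss ignores. This is a simplified stand-in, not the collator's actual code: the function `mask_tokens`, the mask id `4`, and the vocabulary size `1000` are assumptions, and the real collator's handling of special tokens and its exact replacement ratios may differ by version.

```python
import random

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # not selected: input unchanged, label ignored
        labels[i] = token  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in place
    return masked, labels


ids = list(range(100, 200))  # placeholder token ids
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=1000)
print(sum(label != IGNORE_INDEX for label in labels), "positions selected")
```

Since the selection is redrawn whenever the collator is called, the model sees a different masking pattern for the same text on each pass over the data.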

## Train

In [None]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained('distilbert/distilroberta-base')

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='my_eli5_mlm_model',
    eval_strategy='epoch',
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

## Evaluate

In [None]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
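
Perplexity here is the exponential of the mean cross-entropy loss, so lower is better. A small sanity check in plain Python, assuming the vocabulary size of 50265 that appears in the logits shape later in this notebook: a model that guesses uniformly over the vocabulary has a perplexity equal to the vocabulary size, and a model that is always certain and correct has a perplexity of 1. The helper `perplexity` is defined only for this sketch.

```python
import math


def perplexity(token_losses):
    """Exponential of the mean per-token cross-entropy (natural log)."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


vocab_size = 50265
uniform_loss = math.log(vocab_size)  # loss of a uniform guess over the vocabulary

uniform_ppl = perplexity([uniform_loss] * 10)
perfect_ppl = perplexity([0.0, 0.0])

print(f"uniform guess: {uniform_ppl:.0f}")
print(f"perfect model: {perfect_ppl:.1f}")
```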

## Inference

Use the special `<mask>` token to indicate the blank:

In [None]:
text = "The Milky Way is a <mask> galaxy."

In [None]:
from transformers import pipeline

mask_filler = pipeline('fill-mask', model='stevhliu/my_awesome_eli5_mlm_model')

In [None]:
mask_filler(text, top_k=5)

[{'score': 0.5213990807533264,
  'token': 21300,
  'token_str': ' spiral',
  'sequence': 'The Milky Way is a spiral galaxy.'},
 {'score': 0.0711224228143692,
  'token': 2232,
  'token_str': ' massive',
  'sequence': 'The Milky Way is a massive galaxy.'},
 {'score': 0.05867781862616539,
  'token': 650,
  'token_str': ' small',
  'sequence': 'The Milky Way is a small galaxy.'},
 {'score': 0.0491192564368248,
  'token': 30794,
  'token_str': ' dwarf',
  'sequence': 'The Milky Way is a dwarf galaxy.'},
 {'score': 0.037515927106142044,
  'token': 3065,
  'token_str': ' giant',
  'sequence': 'The Milky Way is a giant galaxy.'}]

Manually replicate the results:

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained('stevhliu/my_awesome_eli5_mlm_model')
model = AutoModelForMaskedLM.from_pretrained('stevhliu/my_awesome_eli5_mlm_model')



Tokenize the text, locate the position of the `<mask>` token, and take the model's logits at that position:

In [None]:
inputs = tokenizer(text, return_tensors='pt')
mask_token_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]

logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]

mask_token_logits.shape

torch.Size([1, 50265])

Take the five candidate tokens with the highest scores at the masked position and print each completed sentence:

In [None]:
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

The Milky Way is a  spiral galaxy.
The Milky Way is a  massive galaxy.
The Milky Way is a  small galaxy.
The Milky Way is a  dwarf galaxy.
The Milky Way is a  giant galaxy.
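
The manual version prints only the completed sentences, whereas the pipeline output also carries a `score` for each candidate. Assuming those scores are softmax probabilities over the vocabulary logits, the conversion can be sketched in plain Python on a toy logit vector. The helpers `softmax` and `top_k` and the logit values are made up for illustration; with real tensors, `torch.softmax` and `torch.topk` would play the same roles.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, probability) pairs for the k most likely entries."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


toy_logits = [2.0, 0.5, -1.0, 3.0]  # stand-in for the logits at the mask position
probs = softmax(toy_logits)
best = top_k(probs, 2)

for index, prob in best:
    print(f"token index {index}: probability {prob:.3f}")
```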
