# Masked Language Models

In [1]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

In [2]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'


In [3]:
text = "This is a great [MASK]."

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [5]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'


In [6]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

Downloading builder script:   0%|          | 0.00/4.31k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.59k [00:00<?, ?B/s]

Downloading and preparing dataset imdb/plain_text to C:/Users/lkk68/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...


Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to C:/Users/lkk68/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [7]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

In [8]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

In [9]:
# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

In [10]:
tokenizer.model_max_length

512

In [11]:
chunk_size = 128 #chunk the token to save the GPU memory

In [12]:
tokenized_datasets["train"][0]

{'input_ids': [101,
  1045,
  12524,
  1045,
  2572,
  8025,
  1011,
  3756,
  2013,
  2026,
  2678,
  3573,
  2138,
  1997,
  2035,
  1996,
  6704,
  2008,
  5129,
  2009,
  2043,
  2009,
  2001,
  2034,
  2207,
  1999,
  3476,
  1012,
  1045,
  2036,
  2657,
  2008,
  2012,
  2034,
  2009,
  2001,
  8243,
  2011,
  1057,
  1012,
  1055,
  1012,
  8205,
  2065,
  2009,
  2412,
  2699,
  2000,
  4607,
  2023,
  2406,
  1010,
  3568,
  2108,
  1037,
  5470,
  1997,
  3152,
  2641,
  1000,
  6801,
  1000,
  1045,
  2428,
  2018,
  2000,
  2156,
  2023,
  2005,
  2870,
  1012,
  1026,
  7987,
  1013,
  1028,
  1026,
  7987,
  1013,
  1028,
  1996,
  5436,
  2003,
  8857,
  2105,
  1037,
  2402,
  4467,
  3689,
  3076,
  2315,
  14229,
  2040,
  4122,
  2000,
  4553,
  2673,
  2016,
  2064,
  2055,
  2166,
  1012,
  1999,
  3327,
  2016,
  4122,
  2000,
  3579,
  2014,
  3086,
  2015,
  2000,
  2437,
  2070,
  4066,
  1997,
  4516,
  2006,
  2054,
  1996,
  2779,
  25430,
  14728,
  2245,


In [13]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'") #length of the input_ids

'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'


In [14]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 800'


800 is 363+304+133, concate the three list (three reviews) together

In [15]:
tokenized_samples.keys()

dict_keys(['input_ids', 'attention_mask', 'word_ids'])

In [16]:
tokenized_samples['input_ids']

[[101,
  1045,
  12524,
  1045,
  2572,
  8025,
  1011,
  3756,
  2013,
  2026,
  2678,
  3573,
  2138,
  1997,
  2035,
  1996,
  6704,
  2008,
  5129,
  2009,
  2043,
  2009,
  2001,
  2034,
  2207,
  1999,
  3476,
  1012,
  1045,
  2036,
  2657,
  2008,
  2012,
  2034,
  2009,
  2001,
  8243,
  2011,
  1057,
  1012,
  1055,
  1012,
  8205,
  2065,
  2009,
  2412,
  2699,
  2000,
  4607,
  2023,
  2406,
  1010,
  3568,
  2108,
  1037,
  5470,
  1997,
  3152,
  2641,
  1000,
  6801,
  1000,
  1045,
  2428,
  2018,
  2000,
  2156,
  2023,
  2005,
  2870,
  1012,
  1026,
  7987,
  1013,
  1028,
  1026,
  7987,
  1013,
  1028,
  1996,
  5436,
  2003,
  8857,
  2105,
  1037,
  2402,
  4467,
  3689,
  3076,
  2315,
  14229,
  2040,
  4122,
  2000,
  4553,
  2673,
  2016,
  2064,
  2055,
  2166,
  1012,
  1999,
  3327,
  2016,
  4122,
  2000,
  3579,
  2014,
  3086,
  2015,
  2000,
  2437,
  2070,
  4066,
  1997,
  4516,
  2006,
  2054,
  1996,
  2779,
  25430,
  14728,
  2245,
  2055,
  305

In [17]:
sum(tokenized_samples['input_ids'], [])

[101,
 1045,
 12524,
 1045,
 2572,
 8025,
 1011,
 3756,
 2013,
 2026,
 2678,
 3573,
 2138,
 1997,
 2035,
 1996,
 6704,
 2008,
 5129,
 2009,
 2043,
 2009,
 2001,
 2034,
 2207,
 1999,
 3476,
 1012,
 1045,
 2036,
 2657,
 2008,
 2012,
 2034,
 2009,
 2001,
 8243,
 2011,
 1057,
 1012,
 1055,
 1012,
 8205,
 2065,
 2009,
 2412,
 2699,
 2000,
 4607,
 2023,
 2406,
 1010,
 3568,
 2108,
 1037,
 5470,
 1997,
 3152,
 2641,
 1000,
 6801,
 1000,
 1045,
 2428,
 2018,
 2000,
 2156,
 2023,
 2005,
 2870,
 1012,
 1026,
 7987,
 1013,
 1028,
 1026,
 7987,
 1013,
 1028,
 1996,
 5436,
 2003,
 8857,
 2105,
 1037,
 2402,
 4467,
 3689,
 3076,
 2315,
 14229,
 2040,
 4122,
 2000,
 4553,
 2673,
 2016,
 2064,
 2055,
 2166,
 1012,
 1999,
 3327,
 2016,
 4122,
 2000,
 3579,
 2014,
 3086,
 2015,
 2000,
 2437,
 2070,
 4066,
 1997,
 4516,
 2006,
 2054,
 1996,
 2779,
 25430,
 14728,
 2245,
 2055,
 3056,
 2576,
 3314,
 2107,
 2004,
 1996,
 5148,
 2162,
 1998,
 2679,
 3314,
 1999,
 1996,
 2142,
 2163,
 1012,
 1999,
 2090,
 48

Cut the concatenated_examples into equal size chunk

In [18]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 32'


Group all these into one function "group_texts", drop the last chunk

In [19]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

In [20]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

## Token Mask Data Collator

In [21]:
from transformers import DataCollatorForLanguageModeling
#mlm_probability means the percentage of [MASK]
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15) 

In [22]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

In [23]:
for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



'>>> [CLS] i [MASK] i am curious - yellow from my video store because [MASK] all [MASK] controversy [MASK] surrounded it when it was first released in [MASK]. i also heard that at first it was seized by [MASK]. s. customs [MASK] [MASK] ever tried [MASK] enter this country, therefore being a fan of films considered [MASK] controversial " i really edison to see this for myself. < br [MASK] > < br / > [MASK] plot is centered around a [MASK] swedish drama student named lena who wants to learn [MASK] she can about life [MASK] in [MASK] she wants to focus her attention [MASK] to making [MASK] sort of [MASK] on what the [MASK] swede thought about certain political issues such'

'>>> as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about [MASK] opinions on politics, she has sex with her drama [MASK], [MASK], [MASK] married men. < br [MASK] > < br / > what kills me about i am [MASK] - yellow is that 40 years ago, this [MA

In [24]:
for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.convert_ids_to_tokens(chunk)}'")


'>>> ['[CLS]', 'i', 'rented', '[MASK]', 'am', 'curious', '-', '[MASK]', 'from', 'my', 'video', 'store', '[MASK]', 'of', 'all', 'the', 'controversy', 'that', '[MASK]', 'it', 'when', '[MASK]', 'was', '[MASK]', 'released', 'in', '[MASK]', '[MASK]', 'i', 'also', '[MASK]', 'that', 'at', 'first', 'it', 'was', 'seized', 'by', 'u', '.', 's', '.', 'customs', 'if', 'it', '[MASK]', 'tried', '[MASK]', 'enter', 'this', 'country', ',', 'therefore', 'being', 'a', 'fan', 'of', 'films', 'considered', '"', 'controversial', '"', 'i', 'really', 'had', 'to', 'see', 'this', 'for', 'myself', '.', '<', 'br', '[MASK]', '>', '<', 'br', '/', '>', 'the', 'plot', 'is', '[MASK]', 'around', 'a', 'young', '[MASK]', 'drama', 'student', 'named', 'lena', 'who', '[MASK]', 'to', 'learn', 'everything', '[MASK]', 'can', 'about', '[MASK]', '.', 'in', 'particular', 'she', 'wants', 'to', 'focus', 'her', 'attention', '##s', 'to', 'making', 'some', 'sort', 'of', 'documentary', 'on', 'what', 'the', '[MASK]', 'sw', '##ede', 'thou

[MASK] token are randomly inserted into the text

In [25]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [26]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"./output/{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=False,
    fp16=True,
    logging_steps=logging_steps,
)

In [27]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
)

In [28]:
trainer.train()



  0%|          | 0/471 [00:00<?, ?it/s]

{'loss': 2.6977, 'learning_rate': 1.3503184713375796e-05, 'epoch': 0.99}


  0%|          | 0/16 [00:00<?, ?it/s]

{'eval_loss': 2.5231194496154785, 'eval_runtime': 0.6522, 'eval_samples_per_second': 1533.243, 'eval_steps_per_second': 24.532, 'epoch': 1.0}
{'loss': 2.5618, 'learning_rate': 6.878980891719745e-06, 'epoch': 1.99}


  0%|          | 0/16 [00:00<?, ?it/s]

{'eval_loss': 2.474677085876465, 'eval_runtime': 0.6521, 'eval_samples_per_second': 1533.41, 'eval_steps_per_second': 24.535, 'epoch': 2.0}
{'loss': 2.5276, 'learning_rate': 2.547770700636943e-07, 'epoch': 2.98}


  0%|          | 0/16 [00:00<?, ?it/s]

{'eval_loss': 2.4369473457336426, 'eval_runtime': 0.6596, 'eval_samples_per_second': 1516.04, 'eval_steps_per_second': 24.257, 'epoch': 3.0}
{'train_runtime': 68.397, 'train_samples_per_second': 438.616, 'train_steps_per_second': 6.886, 'train_loss': 2.59651494481761, 'epoch': 3.0}


TrainOutput(global_step=471, training_loss=2.59651494481761, metrics={'train_runtime': 68.397, 'train_samples_per_second': 438.616, 'train_steps_per_second': 6.886, 'train_loss': 2.59651494481761, 'epoch': 3.0})

In [29]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

  0%|          | 0/16 [00:00<?, ?it/s]

>>> Perplexity: 11.41


Perplexity lower is better

## Use Whole Word Masking

In [30]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id

    return default_data_collator(features)

In [31]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] i [MASK] i am curious - yellow [MASK] my video store [MASK] [MASK] all the controversy that surrounded it when it was first released in 1967. i also heard that at first [MASK] [MASK] [MASK] by u. s. customs if [MASK] ever tried to enter this country [MASK] therefore [MASK] [MASK] fan [MASK] films considered " controversial " i [MASK] had to see this [MASK] myself [MASK] < br / [MASK] < br / > the plot [MASK] centered around a young swedish drama student [MASK] lena who wants to learn everything she can about life. in particular she wants [MASK] focus her attentions to making some sort [MASK] [MASK] on [MASK] [MASK] average [MASK] [MASK] thought [MASK] [MASK] political issues such'

'>>> as [MASK] [MASK] war and race issues in the [MASK] states. in between [MASK] [MASK] and ordinary denizens of stockholm about their opinions on [MASK], she has [MASK] with her drama teacher, classmates, and married men [MASK] < br / > < br / > what kills me about [MASK] am curious - [MASK] [M

In [32]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

Loading cached split indices for dataset at C:\Users\lkk68\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-1de8ceb78de95865.arrow and C:\Users\lkk68\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-7a495c3653c6aa44.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [33]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"./output/{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=False,
    fp16=True,
    logging_steps=logging_steps,
    remove_unused_columns=False, #do not remove word_ids, needed for Whole word masking
)

In [34]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=whole_word_masking_data_collator, #use whole word maksing data collator
)

In [35]:
trainer.train()

  0%|          | 0/471 [00:00<?, ?it/s]

{'loss': 0.7367, 'learning_rate': 1.337579617834395e-05, 'epoch': 0.99}


  0%|          | 0/16 [00:00<?, ?it/s]

{'eval_loss': 0.6717138290405273, 'eval_runtime': 0.6619, 'eval_samples_per_second': 1510.791, 'eval_steps_per_second': 24.173, 'epoch': 1.0}
{'loss': 0.6734, 'learning_rate': 6.751592356687898e-06, 'epoch': 1.99}


  0%|          | 0/16 [00:00<?, ?it/s]

{'eval_loss': 0.6662740707397461, 'eval_runtime': 0.6607, 'eval_samples_per_second': 1513.63, 'eval_steps_per_second': 24.218, 'epoch': 2.0}
{'loss': 0.6689, 'learning_rate': 1.2738853503184715e-07, 'epoch': 2.98}


  0%|          | 0/16 [00:00<?, ?it/s]

{'eval_loss': 0.6505001783370972, 'eval_runtime': 0.6659, 'eval_samples_per_second': 1501.657, 'eval_steps_per_second': 24.027, 'epoch': 3.0}
{'train_runtime': 65.5165, 'train_samples_per_second': 457.9, 'train_steps_per_second': 7.189, 'train_loss': 0.6928917070862594, 'epoch': 3.0}


TrainOutput(global_step=471, training_loss=0.6928917070862594, metrics={'train_runtime': 65.5165, 'train_samples_per_second': 457.9, 'train_steps_per_second': 7.189, 'train_loss': 0.6928917070862594, 'epoch': 3.0})

In [36]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

  0%|          | 0/16 [00:00<?, ?it/s]

>>> Perplexity: 1.93


In [37]:
model.device

device(type='cuda', index=0)

In [38]:
text = "This is a great [MASK]."
inputs = tokenizer(text, return_tensors="pt")
inputs = inputs.to('cuda')
token_logits = model(**inputs).logits

In [39]:
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great film.'
'>>> This is a great movie.'
'>>> This is a great idea.'
'>>> This is a great adventure.'
'>>> This is a great one.'


# Change data_collator for data evaluation

In [40]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [41]:
#insert MASK into the original dataset for evaluator
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}

In [42]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

Loading cached split indices for dataset at C:\Users\lkk68\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-1de8ceb78de95865.arrow and C:\Users\lkk68\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-7a495c3653c6aa44.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [43]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [44]:
eval_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 1000
})

In [45]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

In [46]:
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [47]:
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

In [48]:
from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [49]:
from transformers import get_scheduler
num_train_epochs = 8
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [50]:
output_dir='./output'

In [51]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    # if accelerator.is_main_process:
    #     tokenizer.save_pretrained(output_dir)
    #     repo.push_to_hub(
    #         commit_message=f"Training in progress epoch {epoch}", blocking=False
    #     )

  0%|          | 0/1256 [00:00<?, ?it/s]

>>> Epoch 0: Perplexity: 11.221171807718862
>>> Epoch 1: Perplexity: 10.768010645279727
>>> Epoch 2: Perplexity: 10.55645597665742
>>> Epoch 3: Perplexity: 10.342456958531864
>>> Epoch 4: Perplexity: 10.151993669983593
>>> Epoch 5: Perplexity: 10.03863898173546
>>> Epoch 6: Perplexity: 9.959069167523797
>>> Epoch 7: Perplexity: 9.911696030015712


In [52]:
text = "This is a great [MASK]"
inputs = tokenizer(text, return_tensors="pt")

In [53]:
import torch
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_index

tensor([5])

In [54]:
inputs=inputs.to('cuda')

In [55]:
logits = model(**inputs).logits

In [56]:
logits.shape

torch.Size([1, 7, 30522])

In [57]:
mask_token_logits = logits[0, mask_token_index, :]
mask_token_logits

tensor([[-11.4297, -10.2422, -10.4297,  ...,  -8.8047, -11.2344,  -3.4688]],
       device='cuda:0', grad_fn=<IndexBackward0>)

In [58]:
mask_token_logits.shape

torch.Size([1, 30522])

In [59]:
top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

In [60]:
top_3_tokens

[999, 1012, 2143]

In [61]:
for token in top_3_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

This is a great !
This is a great .
This is a great film
