# MASKED-MODELS Base

Here we adapt `distilbert-base-uncased` to our dataset. Adapting this masked language model (MLM) to the domain of our data should give an MLM that makes better use of vocabulary semantically close to the one present in our dataset.

## Data Loading

In [1]:
import pandas as pd
import numpy as np

dataset = pd.read_csv("data/dataset.csv", encoding="utf-8")

dataset.head()

Unnamed: 0,Date_published,Headline,Synopsis,Full_text,Final Status
0,2022-06-21,"Banks holding on to subsidy share, say payment...",The companies have written to the National Pay...,ReutersPayments companies and banks are at log...,Negative
1,2022-04-19,Digitally ready Bank of Baroda aims to click o...,"At present, 50% of the bank's retail loans are...",AgenciesThe bank presently has 20 million acti...,Positive
2,2022-05-27,Karnataka attracted investment commitment of R...,Karnataka is at the forefront in attracting in...,PTIKarnataka Chief Minister Basavaraj Bommai.K...,Positive
3,2022-04-06,Splitting of provident fund accounts may be de...,The EPFO is likely to split accounts only at t...,Getty ImagesThe budget for FY22 had imposed in...,Negative
4,2022-06-14,Irdai weighs proposal to privatise Insurance I...,"Set up in 2009 as an advisory body, IIB collec...",AgenciesThere is a view in the insurance indus...,Positive


## Data Cleaning

Next, we apply the cleaning steps needed before tokenization: label encoding, filling a missing synopsis, and text normalization.

In [2]:
# Manually set the label of row 97, then convert labels to binary (Positive=1, Negative=0)
dataset.loc[97, "Final Status"] = "Positive"
dataset["Final Status"] = dataset["Final Status"].map({"Positive": 1, "Negative": 0})

In [3]:
# Find rows with an empty Synopsis
dataset[dataset["Synopsis"].isna()].index

Index([56], dtype='int64')

In [4]:
dataset.loc[56, "Synopsis"] = " "

In [5]:
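import re


# Replace contractions with their expanded forms
def decontracted(phrase):
    # Irregular negations first, then the general "n't" case
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    # Note: this also rewrites possessives ("India's" becomes "India is")
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase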

In [6]:
import re


def preprocess_text(text):
    text = decontracted(str(text))
    # Replace every character outside the allowed set with a space
    # (this also removes newlines, tabs and carriage returns, so no
    # separate escaping of those characters is needed afterwards)
    text = re.sub("[^a-zA-Z0-9.,!?$/ ]", " ", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text
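
To see what these cleaning steps do in practice, the following self-contained sketch applies the same ideas (contraction expansion, character filtering, whitespace collapsing) to a made-up sentence. The function name `clean_sketch` and the sample string are illustrative and not part of the notebook, and only one contraction rule is included for brevity.

```python
import re


def clean_sketch(text: str) -> str:
    """Illustrative stand-in for the notebook's cleaning functions."""
    # Expand one contraction (the notebook handles several more).
    text = re.sub(r"'re", " are", text)
    # Replace every character outside the allowed set with a space.
    # Newlines, tabs and carriage returns are outside the set too.
    text = re.sub(r"[^a-zA-Z0-9.,!?$/ ]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text)


sample = "Profits rose 12%;\n\tthey're up\r\nagain!"
cleaned = clean_sketch(sample)
print(cleaned)  # Profits rose 12 they are up again!
```

Note that the percent sign is dropped along with the punctuation, which is why figures such as "8.4" appear without a unit in the processed articles.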

In [7]:
dataset["processed_article"] = (
    dataset["Headline"] + " " + dataset["Synopsis"] + " " + dataset["Full_text"]
)

# Clean each combined article and keep the results as a plain list as well
dataset["processed_article"] = dataset["processed_article"].apply(preprocess_text)
corpus = dataset["processed_article"].tolist()

In [8]:
# Keep only the cleaned text and the label
dataset = dataset.rename(columns={"Final Status": "label"})

column_order = ["processed_article", "label"]
dataset = dataset[column_order]
dataset.head()

Unnamed: 0,processed_article,label
0,"Banks holding on to subsidy share, say payment...",0
1,Digitally ready Bank of Baroda aims to click o...,1
2,Karnataka attracted investment commitment of R...,1
3,Splitting of provident fund accounts may be de...,0
4,Irdai weighs proposal to privatise Insurance I...,1


For ease of use with Transformer models, we convert the dataset into a Hugging Face `Dataset` and split it into train, test and unsupervised sets (25%, 25% and 50% of the rows, respectively).

In [9]:
from datasets import Dataset

dataset_hf = Dataset.from_pandas(dataset)

In [10]:
from datasets import DatasetDict

supervised_unsupervised = dataset_hf.train_test_split(test_size=0.5)

train_test = supervised_unsupervised["train"].train_test_split(test_size=0.5)

train_test_unsupervised_dataset = DatasetDict(
    {
        "train": train_test["train"],
        "test": train_test["test"],
        "unsupervised": supervised_unsupervised["test"],
    }
)

In [11]:
train_test_unsupervised_dataset

DatasetDict({
    train: Dataset({
        features: ['processed_article', 'label'],
        num_rows: 100
    })
    test: Dataset({
        features: ['processed_article', 'label'],
        num_rows: 100
    })
    unsupervised: Dataset({
        features: ['processed_article', 'label'],
        num_rows: 200
    })
})

In [12]:
train_test_unsupervised_dataset_copy = train_test_unsupervised_dataset.copy()

train_test_unsupervised_dataset["unsupervised"] = (
    train_test_unsupervised_dataset_copy["unsupervised"].map(
        lambda example: {"processed_article": example["processed_article"], "label": -1}
    )
)

In [13]:
sample = (
    train_test_unsupervised_dataset["train"].shuffle(seed=42).select(range(3))
)

for row in sample:
    print(f"\n'>>> Article: {row['processed_article']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Article: Stock market update Stocks that hit 52 week highs on NSE Gallantt Ispat, GSS Infotech, Swaraj Suiting Ltd., Sumitomo Chemical and Cupid Ltd, hit their fresh 52 week highs at 10 34AM. Shutterstock.comRSI has turned north from the 60 level, confirming bullishness.NEW DELHI Shares of Gallantt Ispat, GSS Infotech, Swaraj Suiting Ltd., Sumitomo Chemical and Cupid Ltd, hit their fresh 52 week highs at 10 34AM IST on NSE. Benchmark NSE Nifty index fell 100.6 points to 16968.5 amid selling in frontline bluechip stocks. However, stocks such as Future Retail, Karda Const, Zee Learn, Future Enterprises DVR and Future Enterprises, touched their fresh 52 week low. Overall, 14 shares traded in the green in Nifty50 index, while 36 traded in the red. In the Nifty 50 index, ONGC, Britannia, Power Grid, NTPC and UPL were among top gainers, while Apollo Hospital, Titan Company, Dr. Reddys, Cipla and Sun Pharma traded in the red. The BSE Sensex was trading 314.04 points down at 56661.95 at 

In [14]:
sample = train_test_unsupervised_dataset["unsupervised"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Article: {row['processed_article']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Article: Eco recovery, improving business confidence to help banks in FY23 Moody is Global credit rating agency Moodys expects India is banking sector to stablise this year riding on a gradual economic recovery, improving consumer and business confidence, decline in bad loan provisions and better margins, despite the uncertainties posed by the Russia Ukraine conflict. AgenciesGlobal credit rating agency Moodys expects India is banking sector to stablise this year riding on a gradual economic recovery, improving consumer and business confidence, decline in bad loan provisions and better margins, despite the uncertainties posed by the Russia Ukraine conflict. Fundamentals for the sector will improve especially due to India is continuing economic recovery which Moodys expects will grow at 8.4 in the fiscal ended March 2023 down from 9.3 in the year ended March 2022. Increasing corporate earnings and easing funding constraints for non bank finance companies, which are significant bor

## Masked Model

In [15]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [16]:
text = "This is a great [MASK]."

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [18]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'
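
The selection step can be illustrated without a model. The sketch below uses a made-up five-word vocabulary and hand-picked logits (both hypothetical) to show that ranking by logits gives the same candidates as ranking by softmax probabilities, since softmax preserves ordering.

```python
import math

# Hypothetical vocabulary and logits for the [MASK] position.
vocab = ["deal", "idea", "car", "success", "feat"]
logits = [5.0, 3.5, 0.5, 4.5, 3.0]

# Softmax turns logits into probabilities without changing their order.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Indices of the 3 largest values, highest first.
top_3 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:3]
top_3_by_prob = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:3]

for i in top_3:
    print(f"{vocab[i]}: logit={logits[i]:.1f}, prob={probs[i]:.3f}")
```

This is why the cell above can call `torch.topk` on the raw logits directly and skip the softmax.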


In [19]:
def tokenize_function(examples):
    result = tokenizer(examples["processed_article"])
    if tokenizer.is_fast:
        result["word_ids"] = [
            result.word_ids(i) for i in range(len(result["input_ids"]))
        ]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = train_test_unsupervised_dataset.map(
    tokenize_function, batched=True, remove_columns=["processed_article", "label"]
)
tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (1176 > 512). Running this sequence through the model will result in indexing errors

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 100
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 200
    })
})

In [20]:
chunk_size = 128

In [21]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Article {idx} length: {len(sample)}'")

'>>> Article 0 length: 391'
'>>> Article 1 length: 1176'
'>>> Article 2 length: 977'


In [22]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated articles length: {total_length}'")

'>>> Concatenated articles length: 2544'


In [23]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 112'
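
The chunk counts above follow directly from the article lengths. As a quick check, using only the numbers printed earlier:

```python
# Token counts of the three sample articles, as printed above.
lengths = [391, 1176, 977]
chunk_size = 128

total_length = sum(lengths)
full_chunks, remainder = divmod(total_length, chunk_size)

print(total_length)  # 2544
print(full_chunks)   # 19
print(remainder)     # 112
```

The 112-token remainder is the short final chunk. The `group_texts` function defined next drops it, so every training example has exactly `chunk_size` tokens.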


In [24]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

In [25]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 414
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 439
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 861
    })
})

In [26]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

'pharma science down 4. 89 per cent, alembic pharmaceuticals down 3. 28 per cent, cipla down 2. 68 per cent, divis laboratories down 2. 51 per cent and granules india down 2. 45 per cent finished as the top losers of the day. the nifty pharma index closed 1. 42 per cent down at 13522. 45. benchmark nse nifty50 index ended down 215. 0 points at 16958. 65, while the bse sensex stood down 703. 59 points at 56463. 15. among the 50 stocks in the'

In [27]:
tokenizer.decode(lm_datasets["train"][1]["labels"])

'pharma science down 4. 89 per cent, alembic pharmaceuticals down 3. 28 per cent, cipla down 2. 68 per cent, divis laboratories down 2. 51 per cent and granules india down 2. 45 per cent finished as the top losers of the day. the nifty pharma index closed 1. 42 per cent down at 13522. 45. benchmark nse nifty50 index ended down 215. 0 points at 16958. 65, while the bse sensex stood down 703. 59 points at 56463. 15. among the 50 stocks in the'

### Masking

In [28]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15
)


In [29]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] stock market update nifty pharma index falls 1. 42 the nifty pharma index closed 1. 42 [MASK] [MASK] [MASK] [MASK] 13522. 45. getty imagesmacd is known for signaling trend reversals in traded securities or indices. it is the [MASK] between the 26 day and [MASK] day exponential moving averages [MASK] [MASK] delhi the nifty pharma index closed on a negative note on tuesday. shares [MASK] natco ph [MASK]a up 0. 11 [MASK] cent and gland pharma up 0. 03 shimmering [MASK] ended the day as top gainers in the pack. on [MASK] other hand, strides'

'>>> pharma science down 4. 89 per cent, [MASK]mbic pharmaceuticals down 3. 28 per cent, cipla down 2. 68 per cent, divis laboratories down [MASK]. 51 [MASK] cent and granules india down 2. 45 per cent finished as the top losers of [MASK] day [MASK] the nifty ph [MASK]a index closed 1. 42 per cent [MASK] at [MASK]22. 45. benchmarkllan [MASK] nifty50ovo ended down 215. 0 points at 16958. 65, while the bse sensex stood blooded 703 [MASK] 59 
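
The stray words in the output above (for example `shimmering` or `blooded`) are expected. By default, `DataCollatorForLanguageModeling` does not replace every selected token with `[MASK]`: of the tokens selected for prediction, roughly 80% become `[MASK]`, 10% are swapped for a random vocabulary token, and 10% are left unchanged. The following pure-Python sketch mimics that rule on a toy sequence. The function name and the toy token IDs are made up for illustration; the real collator works on tensors and never selects special tokens.

```python
import random

MASK_ID = 103        # [MASK] id in the BERT uncased vocabulary
VOCAB_SIZE = 30522   # vocabulary size of distilbert-base-uncased
IGNORE = -100        # label value ignored by the loss


def mask_tokens_sketch(input_ids, mlm_probability=0.15, rng=None):
    """Toy version of the 80/10/10 masking rule (illustrative only)."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE] * len(input_ids)
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # not selected: no loss is computed at this position
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


tokens = list(range(1000, 1128))  # a toy 128-token chunk
masked, labels = mask_tokens_sketch(tokens, rng=random.Random(0))
selected = sum(label != IGNORE for label in labels)
print(f"{selected} of {len(tokens)} tokens selected for prediction")
```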

### Whole Word Masking

In [30]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
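
The word-to-token mapping is the core of this collator. The sketch below runs just that step on a hand-written `word_ids` list. The list is hypothetical, but it has the form a fast tokenizer returns: `None` for special tokens and a repeated index for the sub-word pieces of one word.

```python
import collections

# Hypothetical word_ids: special token, a one-piece word, a three-piece
# word, another one-piece word, special token.
word_ids = [None, 0, 1, 1, 1, 2, None]

mapping = collections.defaultdict(list)
current_word_index = -1
current_word = None
for idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue  # special tokens never enter the mapping
    if word_id != current_word:
        current_word = word_id
        current_word_index += 1
    mapping[current_word_index].append(idx)

print(dict(mapping))  # {0: [1], 1: [2, 3, 4], 2: [5]}
```

Masking word 1 therefore masks token positions 2, 3 and 4 together. Note that the training cells in this notebook pass `data_collator` (token-level masking) to the `Trainer`. To train with `whole_word_masking_data_collator` instead, the `word_ids` column would have to be kept by setting `remove_unused_columns=False` in `TrainingArguments`.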

In [31]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] stock [MASK] [MASK] nifty pharma index falls [MASK] [MASK] 42 [MASK] nifty pharma index closed 1. 42 per cent down at [MASK] [MASK]. 45. getty imagesmacd [MASK] known for signaling trend reversals in traded securities or indices [MASK] it is [MASK] [MASK] [MASK] the [MASK] day and 12 day exponential [MASK] [MASK]. new delhi the [MASK] [MASK] [MASK] pharma index closed on a negative [MASK] on tuesday. shares [MASK] natco pharma up 0. 11 per cent and [MASK] pharma up 0. 03 [MASK] cent ended [MASK] day as top [MASK] [MASK] in the pack. on the other [MASK], strides'

'>>> pharma science down [MASK]. [MASK] per cent, [MASK] [MASK] [MASK] pharmaceuticals [MASK] 3. [MASK] per cent [MASK] [MASK] [MASK] [MASK] down 2. 68 per cent, divis laboratories down 2 [MASK] 51 per [MASK] and granules india down 2. 45 per cent finished as the top losers of the [MASK] [MASK] the [MASK] [MASK] [MASK] pharma [MASK] closed 1. 42 per [MASK] down at 13522. [MASK]. benchmark nse nifty50 index ended do

In [32]:
from transformers import TrainingArguments

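batch_size = 64
# logging_steps is derived from batch_size, but the batch-size arguments below
# are commented out, so the Trainer uses its default per-device batch size of 8
# and the training loss is logged several times per epoch rather than once
logging_steps = len(lm_datasets["train"]) // batch_size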
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-nlp_assignment2",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    num_train_epochs=10,
    learning_rate=2e-5,
    weight_decay=0.01,
    # per_device_train_batch_size=batch_size,
    # per_device_eval_batch_size=batch_size,
    push_to_hub=False,
    logging_steps=logging_steps,
)
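
Because the batch-size arguments are commented out, the `Trainer` falls back to its default per-device batch size of 8, so `logging_steps` does not correspond to one epoch. A quick check of the arithmetic follows, assuming a single device and no gradient accumulation. Both are assumptions, although they agree with the 520 total steps reported in the training output.

```python
import math

num_train_chunks = 414    # rows in lm_datasets["train"]
batch_size = 64           # only used to compute logging_steps
default_batch_size = 8    # TrainingArguments default per-device batch size
num_train_epochs = 10

logging_steps = num_train_chunks // batch_size
steps_per_epoch = math.ceil(num_train_chunks / default_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(logging_steps)    # 6
print(steps_per_epoch)  # 52
print(total_steps)      # 520
```

With these values the loss is logged every 6 steps, about eight times per epoch. Uncommenting the two batch-size arguments would bring the logging interval to roughly one epoch.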

In [33]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)


In [34]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 43.83


In [35]:
trainer.train()

{'loss': 3.7748, 'grad_norm': 11.045541763305664, 'learning_rate': 1.976923076923077e-05, 'epoch': 0.12}
{'loss': 3.2041, 'grad_norm': 11.184574127197266, 'learning_rate': 1.953846153846154e-05, 'epoch': 0.23}
{'loss': 3.5838, 'grad_norm': 11.514263153076172, 'learning_rate': 1.930769230769231e-05, 'epoch': 0.35}
{'loss': 3.5685, 'grad_norm': 11.957366943359375, 'learning_rate': 1.907692307692308e-05, 'epoch': 0.46}
{'loss': 3.265, 'grad_norm': 11.121075630187988, 'learning_rate': 1.8846153846153846e-05, 'epoch': 0.58}
{'loss': 3.3804, 'grad_norm': 12.087769508361816, 'learning_rate': 1.8615384615384616e-05, 'epoch': 0.69}
{'loss': 3.1829, 'grad_norm': 12.318615913391113, 'learning_rate': 1.8384615384615386e-05, 'epoch': 0.81}
{'loss': 3.6178, 'grad_norm': 12.5647554397583, 'learning_rate': 1.8153846153846155e-05, 'epoch': 0.92}


{'eval_loss': 3.120351791381836, 'eval_runtime': 4.6717, 'eval_samples_per_second': 93.97, 'eval_steps_per_second': 11.773, 'epoch': 1.0}
{'loss': 3.1056, 'grad_norm': 11.398429870605469, 'learning_rate': 1.7923076923076925e-05, 'epoch': 1.04}
{'loss': 3.0324, 'grad_norm': 10.65538501739502, 'learning_rate': 1.7692307692307694e-05, 'epoch': 1.15}
{'loss': 3.0677, 'grad_norm': 12.129164695739746, 'learning_rate': 1.7461538461538464e-05, 'epoch': 1.27}
{'loss': 2.8571, 'grad_norm': 10.017070770263672, 'learning_rate': 1.7230769230769234e-05, 'epoch': 1.38}
{'loss': 3.1222, 'grad_norm': 11.482268333435059, 'learning_rate': 1.7e-05, 'epoch': 1.5}
{'loss': 3.2489, 'grad_norm': 13.565450668334961, 'learning_rate': 1.676923076923077e-05, 'epoch': 1.62}
{'loss': 2.9621, 'grad_norm': 10.946194648742676, 'learning_rate': 1.653846153846154e-05, 'epoch': 1.73}
{'loss': 2.9243, 'grad_norm': 11.496365547180176, 'learning_rate': 1.630769230769231e-05, 'epoch': 1.85}
{'loss': 3.1309, 'grad_norm': 11.7

{'eval_loss': 3.031648635864258, 'eval_runtime': 4.6747, 'eval_samples_per_second': 93.911, 'eval_steps_per_second': 11.766, 'epoch': 2.0}
{'loss': 3.0699, 'grad_norm': 12.976603507995605, 'learning_rate': 1.5846153846153848e-05, 'epoch': 2.08}
{'loss': 2.9583, 'grad_norm': 12.268256187438965, 'learning_rate': 1.5615384615384618e-05, 'epoch': 2.19}
{'loss': 2.9452, 'grad_norm': 11.631314277648926, 'learning_rate': 1.5384615384615387e-05, 'epoch': 2.31}
{'loss': 2.9354, 'grad_norm': 9.781465530395508, 'learning_rate': 1.5153846153846155e-05, 'epoch': 2.42}
{'loss': 2.937, 'grad_norm': 11.739794731140137, 'learning_rate': 1.4923076923076925e-05, 'epoch': 2.54}
{'loss': 2.8965, 'grad_norm': 11.418217658996582, 'learning_rate': 1.4692307692307694e-05, 'epoch': 2.65}
{'loss': 2.7277, 'grad_norm': 11.223549842834473, 'learning_rate': 1.4461538461538462e-05, 'epoch': 2.77}
{'loss': 2.9961, 'grad_norm': 12.376062393188477, 'learning_rate': 1.4230769230769232e-05, 'epoch': 2.88}
{'loss': 2.898,

{'eval_loss': 2.9473958015441895, 'eval_runtime': 4.6582, 'eval_samples_per_second': 94.243, 'eval_steps_per_second': 11.807, 'epoch': 3.0}
{'loss': 2.8731, 'grad_norm': 10.015551567077637, 'learning_rate': 1.3769230769230771e-05, 'epoch': 3.12}
{'loss': 2.817, 'grad_norm': 11.802018165588379, 'learning_rate': 1.353846153846154e-05, 'epoch': 3.23}
{'loss': 2.7567, 'grad_norm': 10.124642372131348, 'learning_rate': 1.3307692307692309e-05, 'epoch': 3.35}
{'loss': 2.728, 'grad_norm': 10.135851860046387, 'learning_rate': 1.3076923076923078e-05, 'epoch': 3.46}
{'loss': 2.8587, 'grad_norm': 11.181451797485352, 'learning_rate': 1.2846153846153848e-05, 'epoch': 3.58}
{'loss': 3.0278, 'grad_norm': 11.209908485412598, 'learning_rate': 1.2615384615384616e-05, 'epoch': 3.69}
{'loss': 2.8286, 'grad_norm': 10.430642127990723, 'learning_rate': 1.2384615384615385e-05, 'epoch': 3.81}
{'loss': 2.7455, 'grad_norm': 11.305782318115234, 'learning_rate': 1.2153846153846153e-05, 'epoch': 3.92}


{'eval_loss': 2.884331226348877, 'eval_runtime': 4.7312, 'eval_samples_per_second': 92.788, 'eval_steps_per_second': 11.625, 'epoch': 4.0}
{'loss': 2.7686, 'grad_norm': 11.361594200134277, 'learning_rate': 1.1923076923076925e-05, 'epoch': 4.04}
{'loss': 2.7994, 'grad_norm': 11.459810256958008, 'learning_rate': 1.1692307692307694e-05, 'epoch': 4.15}
{'loss': 2.9249, 'grad_norm': 9.435851097106934, 'learning_rate': 1.1461538461538462e-05, 'epoch': 4.27}
{'loss': 2.622, 'grad_norm': 12.010369300842285, 'learning_rate': 1.1230769230769232e-05, 'epoch': 4.38}
{'loss': 2.6781, 'grad_norm': 11.401501655578613, 'learning_rate': 1.1000000000000001e-05, 'epoch': 4.5}
{'loss': 2.6239, 'grad_norm': 11.522988319396973, 'learning_rate': 1.076923076923077e-05, 'epoch': 4.62}
{'loss': 2.7405, 'grad_norm': 10.83653736114502, 'learning_rate': 1.0538461538461539e-05, 'epoch': 4.73}
{'loss': 2.8062, 'grad_norm': 11.49121379852295, 'learning_rate': 1.0307692307692307e-05, 'epoch': 4.85}
{'loss': 2.6735, 'g

{'eval_loss': 2.8566370010375977, 'eval_runtime': 4.7162, 'eval_samples_per_second': 93.083, 'eval_steps_per_second': 11.662, 'epoch': 5.0}
{'loss': 2.6677, 'grad_norm': 10.404434204101562, 'learning_rate': 9.846153846153848e-06, 'epoch': 5.08}
{'loss': 2.7761, 'grad_norm': 10.985610008239746, 'learning_rate': 9.615384615384616e-06, 'epoch': 5.19}
{'loss': 2.7762, 'grad_norm': 11.599176406860352, 'learning_rate': 9.384615384615385e-06, 'epoch': 5.31}
{'loss': 2.6412, 'grad_norm': 11.777934074401855, 'learning_rate': 9.153846153846155e-06, 'epoch': 5.42}
{'loss': 2.6034, 'grad_norm': 11.622098922729492, 'learning_rate': 8.923076923076925e-06, 'epoch': 5.54}
{'loss': 2.5691, 'grad_norm': 11.240935325622559, 'learning_rate': 8.692307692307692e-06, 'epoch': 5.65}
{'loss': 2.5078, 'grad_norm': 10.085518836975098, 'learning_rate': 8.461538461538462e-06, 'epoch': 5.77}
{'loss': 2.7996, 'grad_norm': 11.165902137756348, 'learning_rate': 8.230769230769232e-06, 'epoch': 5.88}
{'loss': 2.8925, 'gr

{'eval_loss': 2.832407236099243, 'eval_runtime': 4.6661, 'eval_samples_per_second': 94.083, 'eval_steps_per_second': 11.787, 'epoch': 6.0}
{'loss': 2.599, 'grad_norm': 10.404022216796875, 'learning_rate': 7.76923076923077e-06, 'epoch': 6.12}
{'loss': 2.8963, 'grad_norm': 11.093927383422852, 'learning_rate': 7.538461538461539e-06, 'epoch': 6.23}
{'loss': 2.8349, 'grad_norm': 11.277181625366211, 'learning_rate': 7.307692307692308e-06, 'epoch': 6.35}
{'loss': 2.2454, 'grad_norm': 8.743915557861328, 'learning_rate': 7.076923076923078e-06, 'epoch': 6.46}
{'loss': 2.5137, 'grad_norm': 11.87866497039795, 'learning_rate': 6.846153846153847e-06, 'epoch': 6.58}
{'loss': 2.7495, 'grad_norm': 10.663759231567383, 'learning_rate': 6.615384615384616e-06, 'epoch': 6.69}
{'loss': 2.3858, 'grad_norm': 10.980155944824219, 'learning_rate': 6.384615384615384e-06, 'epoch': 6.81}
{'loss': 2.5304, 'grad_norm': 11.34186840057373, 'learning_rate': 6.153846153846155e-06, 'epoch': 6.92}


{'eval_loss': 2.7898738384246826, 'eval_runtime': 4.7243, 'eval_samples_per_second': 92.924, 'eval_steps_per_second': 11.642, 'epoch': 7.0}
{'loss': 2.6298, 'grad_norm': 11.176348686218262, 'learning_rate': 5.923076923076924e-06, 'epoch': 7.04}
{'loss': 2.3515, 'grad_norm': 11.243046760559082, 'learning_rate': 5.692307692307692e-06, 'epoch': 7.15}
{'loss': 2.767, 'grad_norm': 12.827641487121582, 'learning_rate': 5.461538461538461e-06, 'epoch': 7.27}
{'loss': 2.6269, 'grad_norm': 10.588417053222656, 'learning_rate': 5.230769230769232e-06, 'epoch': 7.38}
{'loss': 2.4749, 'grad_norm': 11.69060230255127, 'learning_rate': 5e-06, 'epoch': 7.5}
{'loss': 2.7564, 'grad_norm': 10.927276611328125, 'learning_rate': 4.76923076923077e-06, 'epoch': 7.62}
{'loss': 2.6726, 'grad_norm': 10.261858940124512, 'learning_rate': 4.538461538461539e-06, 'epoch': 7.73}
{'loss': 2.4457, 'grad_norm': 9.987847328186035, 'learning_rate': 4.307692307692308e-06, 'epoch': 7.85}
{'loss': 2.7842, 'grad_norm': 10.43880462

{'eval_loss': 2.7709856033325195, 'eval_runtime': 4.7181, 'eval_samples_per_second': 93.046, 'eval_steps_per_second': 11.657, 'epoch': 8.0}
{'loss': 2.6164, 'grad_norm': 10.383829116821289, 'learning_rate': 3.846153846153847e-06, 'epoch': 8.08}
{'loss': 2.509, 'grad_norm': 10.324491500854492, 'learning_rate': 3.6153846153846156e-06, 'epoch': 8.19}
{'loss': 2.6959, 'grad_norm': 12.091205596923828, 'learning_rate': 3.384615384615385e-06, 'epoch': 8.31}
{'loss': 2.4444, 'grad_norm': 9.374394416809082, 'learning_rate': 3.153846153846154e-06, 'epoch': 8.42}
{'loss': 2.6149, 'grad_norm': 10.644925117492676, 'learning_rate': 2.9230769230769236e-06, 'epoch': 8.54}
{'loss': 2.528, 'grad_norm': 12.275322914123535, 'learning_rate': 2.6923076923076923e-06, 'epoch': 8.65}
{'loss': 2.7305, 'grad_norm': 10.704662322998047, 'learning_rate': 2.461538461538462e-06, 'epoch': 8.77}
{'loss': 2.308, 'grad_norm': 10.829948425292969, 'learning_rate': 2.230769230769231e-06, 'epoch': 8.88}
{'loss': 2.4268, 'gra

{'eval_loss': 2.7497873306274414, 'eval_runtime': 4.6998, 'eval_samples_per_second': 93.409, 'eval_steps_per_second': 11.703, 'epoch': 9.0}
{'loss': 2.5293, 'grad_norm': 10.335525512695312, 'learning_rate': 1.7692307692307695e-06, 'epoch': 9.12}
{'loss': 2.4495, 'grad_norm': 10.087556838989258, 'learning_rate': 1.5384615384615387e-06, 'epoch': 9.23}
{'loss': 2.4067, 'grad_norm': 9.233561515808105, 'learning_rate': 1.307692307692308e-06, 'epoch': 9.35}
{'loss': 2.3243, 'grad_norm': 10.976149559020996, 'learning_rate': 1.076923076923077e-06, 'epoch': 9.46}
{'loss': 2.7107, 'grad_norm': 11.977532386779785, 'learning_rate': 8.461538461538463e-07, 'epoch': 9.58}
{'loss': 2.6467, 'grad_norm': 10.579895973205566, 'learning_rate': 6.153846153846155e-07, 'epoch': 9.69}
{'loss': 2.7577, 'grad_norm': 10.052021980285645, 'learning_rate': 3.846153846153847e-07, 'epoch': 9.81}
{'loss': 2.6732, 'grad_norm': 11.025419235229492, 'learning_rate': 1.5384615384615387e-07, 'epoch': 9.92}


{'eval_loss': 2.770171880722046, 'eval_runtime': 4.7171, 'eval_samples_per_second': 93.066, 'eval_steps_per_second': 11.66, 'epoch': 10.0}
{'train_runtime': 235.5278, 'train_samples_per_second': 17.578, 'train_steps_per_second': 2.208, 'train_loss': 2.7943473962637095, 'epoch': 10.0}


TrainOutput(global_step=520, training_loss=2.7943473962637095, metrics={'train_runtime': 235.5278, 'train_samples_per_second': 17.578, 'train_steps_per_second': 2.208, 'train_loss': 2.7943473962637095, 'epoch': 10.0})

In [36]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 15.96
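
Perplexity here is simply the exponential of the mean cross-entropy loss, so the before and after figures can be related back to the evaluation losses. The sketch below uses only numbers printed in this notebook:

```python
import math

final_eval_loss = 2.770171880722046  # eval_loss reported at epoch 10
initial_perplexity = 43.83           # perplexity before fine-tuning

final_perplexity = math.exp(final_eval_loss)
initial_eval_loss = math.log(initial_perplexity)

print(f"final perplexity:  {final_perplexity:.2f}")
print(f"initial eval loss: {initial_eval_loss:.2f}")
```

Because the collator draws a fresh random mask on every evaluation pass, the loss varies slightly between runs, so small differences such as the one between epochs 9 and 10 may be partly noise.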


#### Saving the model

Save the fine-tuned model to disk so that it can be reloaded later.

In [37]:
trainer.save_model("./nlp_assignment2_mlm")

#### Loading and using a saved model

In [42]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer2 = AutoTokenizer.from_pretrained("./nlp_assignment2_mlm")
model2 = AutoModelForMaskedLM.from_pretrained("./nlp_assignment2_mlm")

In [43]:
model2.push_to_hub(repo_id="nlp-assignment2-mlm", private=True)
tokenizer2.push_to_hub(repo_id="nlp-assignment2-mlm", private=True)

CommitInfo(commit_url='https://huggingface.co/ricardoinacio/nlp-assignment2-mlm/commit/8f04d87ac21ea742e4cd2f59ec03233d041a90da', commit_message='Upload tokenizer', commit_description='', oid='8f04d87ac21ea742e4cd2f59ec03233d041a90da', pr_url=None, pr_revision=None, pr_num=None)

In [44]:
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="./nlp_assignment2_mlm")

In [45]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> this is a great deal.
>>> this is a great idea.
>>> this is a great mistake.
>>> this is a great job.
>>> this is a great book.


For comparison, the base (untuned) model predicted:

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'
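
The two top-5 lists can also be compared programmatically. A small sketch, with the predicted words copied by hand from the outputs above:

```python
# Final word of each top-5 prediction, copied from the notebook outputs.
base = ["deal", "success", "adventure", "idea", "feat"]
tuned = ["deal", "idea", "mistake", "job", "book"]

kept = [w for w in tuned if w in base]
new = [w for w in tuned if w not in base]
dropped = [w for w in base if w not in tuned]

print(kept)     # ['deal', 'idea']
print(new)      # ['mistake', 'job', 'book']
print(dropped)  # ['success', 'adventure', 'feat']
```

On this generic prompt the shift is modest. A prompt drawn from the financial news domain would probably be a more telling test of the adaptation.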