# Training the masked language model (MLM) for one period

In [1]:
with open('../input/articles/2015_articles.txt', 'r') as f:
    articles = f.readlines()
    print(f'Found {len(articles)} articles')

Found 100585 articles


In [2]:
articles[0]

"''american success,'' which like ''winter kills'' is a supremely playful and visually witty movie, finds the same handsome stars - jeff bridges and belinda bauer - in even more farfetched circumstances. in ''winter kills,'' mr. bridges played the heir to a wicked old entrepreneur's colossal fortune, and miss bauer played a spy and temptress with a part-time job at a women's magazine. this time, they're cast as a married couple, harry and sarah flowers, who have problems. harry's problem is sarah, and sarah's problem is that she spends her days dressed as a ballerina with butterfly wings. she is meant to be a fairy-tale figure, an enchanted princess who can't possibly meet the demands of everyday life. when harry interrupts her reveries to try to make love to her, sarah complains bitterly that he's breaking their conjugal rules. he can't try this. it isn't friday. the story follows harry's efforts to transform himself into sarah's idea of a real man, and to rob her financier father in 

### Training and testing data

In [3]:
sum(' office ' in article for article in articles)

7735

In [4]:
sum(' head ' in article for article in articles)

5486

In [5]:
sum(' success ' in article for article in articles)

2375
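
The counts above match the substring `' office '` with a space on both sides, so occurrences followed by punctuation (`office,` or `office.`) are not counted. A minimal sketch of a whole-word count using a regex word boundary; the helper name `count_articles_with_word` and the sample sentences are made up for illustration and are not part of the original notebook.

```python
import re

def count_articles_with_word(articles, word):
    """Count articles containing `word` as a whole word, whatever punctuation surrounds it."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return sum(1 for article in articles if pattern.search(article))

# Made-up sample, not taken from the 2015 articles file
sample = [
    "he left office, then returned.",
    "the head office is closed.",
    "officers arrived at noon.",
]
print(count_articles_with_word(sample, "office"))  # 2
print(sum(" office " in a for a in sample))        # 1
```

Switching to this criterion would change which articles are selected, so the train and test sizes reported below would no longer apply.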

In [6]:
def train_test(articles: list, key_words: tuple = (' office ', ' head ', ' success ')):
    # Keep only the articles that mention at least one target word
    target_articles = [
        article for article in articles
        if any(key_word in article for key_word in key_words)
    ]

    # First 80% (in file order) for training, remaining 20% for testing
    split = int(len(target_articles) * 0.8)
    train = target_articles[:split]
    test = target_articles[split:]

    return train, test

In [7]:
train, test = train_test(articles)
print(f'Train: {len(train)} articles')
print(f'Test: {len(test)} articles')

Train: 11201 articles
Test: 2801 articles
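
`train_test` cuts the matching articles at the 80% mark in file order. If the file happens to be ordered (for example by date or section, which is not known here), the test set may differ systematically from the training set. A hedged sketch of a seeded shuffle before the cut; `shuffled_split` and its `seed` parameter are hypothetical names, and this is not what the notebook ran.

```python
import random

def shuffled_split(items, train_fraction=0.8, seed=42):
    """Return (train, test) after shuffling a copy of `items` with a fixed seed."""
    shuffled = list(items)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # local RNG keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

demo_train, demo_test = shuffled_split([f"article {i}\n" for i in range(10)])
print(len(demo_train), len(demo_test))  # 8 2
```

The split sizes stay the same as with the ordered cut; only the membership of each split changes.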


In [8]:
with open('../input/data_2015/train.txt', 'w') as f:
    f.writelines(train)

with open('../input/data_2015/test.txt', 'w') as f:
    f.writelines(test)

# Training the MLM

In [9]:
%env PYTORCH_ENABLE_MPS_FALLBACK=1

env: PYTORCH_ENABLE_MPS_FALLBACK=1


In [35]:
from datasets import load_dataset

In [36]:
data_files = {"train": "train.txt", "test": "test.txt"}
data = load_dataset('../input/data_2015', data_files=data_files)
data

Downloading and preparing dataset text/data_2015 to /Users/imenekolli/.cache/huggingface/datasets/text/data_2015-5fda990804bfed91/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Dataset text downloaded and prepared to /Users/imenekolli/.cache/huggingface/datasets/text/data_2015-5fda990804bfed91/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 11201
    })
    test: Dataset({
        features: ['text'],
        num_rows: 2801
    })
})

In [37]:
data['train'][0]

{'text': "''american success,'' which like ''winter kills'' is a supremely playful and visually witty movie, finds the same handsome stars - jeff bridges and belinda bauer - in even more farfetched circumstances. in ''winter kills,'' mr. bridges played the heir to a wicked old entrepreneur's colossal fortune, and miss bauer played a spy and temptress with a part-time job at a women's magazine. this time, they're cast as a married couple, harry and sarah flowers, who have problems. harry's problem is sarah, and sarah's problem is that she spends her days dressed as a ballerina with butterfly wings. she is meant to be a fairy-tale figure, an enchanted princess who can't possibly meet the demands of everyday life. when harry interrupts her reveries to try to make love to her, sarah complains bitterly that he's breaking their conjugal rules. he can't try this. it isn't friday. the story follows harry's efforts to transform himself into sarah's idea of a real man, and to rob her financier f

### Preprocess

In [38]:
from transformers import AutoTokenizer

In [39]:
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

In [40]:
max_sequence_length = 512  # Maximum sequence length allowed by the model

def preprocess_function(examples):
    # Tokenize each article; tokens beyond max_sequence_length are truncated (dropped), not split into extra examples
    tokenized_examples = tokenizer(examples["text"], truncation=True, max_length=max_sequence_length)
    return tokenized_examples
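
With `truncation=True`, everything past the first 512 tokens of an article is discarded. An alternative, similar to the pattern used in Hugging Face's language-modeling examples, is to tokenize without truncation, concatenate the results, and cut them into fixed-size blocks. The sketch below shows only the chunking step on a toy batch; `group_texts`, `block_size`, and the toy ids are illustrative assumptions, not code from this notebook.

```python
from itertools import chain

def group_texts(examples, block_size=128):
    """Concatenate tokenized examples and cut them into blocks of `block_size` tokens."""
    concatenated = {key: list(chain.from_iterable(values)) for key, values in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder so every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    return {
        key: [tokens[i:i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }

# Toy batch in the same column layout the tokenizer produces
toy_batch = {
    "input_ids": [[0, 11, 12, 2], [0, 21, 22, 23, 24, 2]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1]],
}
print(group_texts(toy_batch, block_size=4))
```

On a `DatasetDict` this would typically be applied with `map(group_texts, batched=True)` after a tokenization pass that does not truncate. Blocks can then span article boundaries, which is a trade-off against losing the tail of long articles.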

In [41]:
tokenized_data = data.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=["text"],)

In [42]:
tokenized_data['train'][0]

{'input_ids': [0,
  17809,
  8015,
  12657,
  1282,
  10559,
  61,
  101,
  12801,
  31421,
  10469,
  17809,
  16,
  10,
  15835,
  352,
  23317,
  8,
  21545,
  33228,
  1569,
  6,
  5684,
  5,
  276,
  19222,
  2690,
  111,
  4112,
  3145,
  11879,
  8,
  12138,
  8865,
  741,
  9994,
  111,
  11,
  190,
  55,
  444,
  37606,
  4215,
  4,
  11,
  12801,
  31421,
  10469,
  10559,
  475,
  338,
  4,
  11879,
  702,
  5,
  24482,
  7,
  10,
  28418,
  793,
  11777,
  18,
  33568,
  13016,
  6,
  8,
  2649,
  741,
  9994,
  702,
  10,
  10258,
  8,
  34919,
  5224,
  19,
  10,
  233,
  12,
  958,
  633,
  23,
  10,
  390,
  18,
  4320,
  4,
  42,
  86,
  6,
  51,
  214,
  2471,
  25,
  10,
  2997,
  891,
  6,
  12280,
  1506,
  8,
  579,
  36000,
  7716,
  6,
  54,
  33,
  1272,
  4,
  12280,
  1506,
  18,
  936,
  16,
  579,
  36000,
  6,
  8,
  579,
  36000,
  18,
  936,
  16,
  14,
  79,
  12500,
  69,
  360,
  7001,
  25,
  10,
  1011,
  254,
  1243,
  19,
  24317,
  11954,
  4,
  

In [43]:
lm_dataset = tokenized_data

In [44]:
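from transformers import DataCollatorForLanguageModeling

# distilroberta-base already defines a pad token (<pad>), so this override is not strictly required
tokenizer.pad_token = tokenizer.eos_token
# Pads each batch dynamically and selects 15% of the tokens as prediction targets
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)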
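
To make `mlm_probability=0.15` concrete, here is a plain-Python sketch of the BERT-style masking rule the collator applies by default: each non-special token is selected with probability 0.15, and a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. This is an illustration, not the library's implementation; `mask_tokens` and its parameters are hypothetical names.

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, special_ids=(), mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels); labels are -100 wherever no prediction is scored."""
    rng = random.Random(seed)
    masked, labels = list(input_ids), [-100] * len(input_ids)
    for i, token in enumerate(input_ids):
        if token in special_ids or rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels

# Illustrative ids: 0 and 2 stand in for the start and end tokens, 50264 for <mask>
ids = [0] + list(range(100, 140)) + [2]
masked, labels = mask_tokens(ids, mask_id=50264, vocab_size=50265, special_ids={0, 2})
print(sum(label != -100 for label in labels), "of", len(ids), "positions selected")
```

The loss is computed only at positions whose label is not -100, which is why the training loss reflects masked positions rather than every token.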

In [45]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
from transformers import TrainingArguments
from transformers import Trainer
import torch

In [47]:
training_args = TrainingArguments(
    output_dir="../model/MLM_2015",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 0.2443, 'learning_rate': 1.7620747085415183e-05, 'epoch': 0.36}
{'loss': 0.2354, 'learning_rate': 1.524149417083036e-05, 'epoch': 0.71}


{'eval_loss': 0.2029215693473816, 'eval_runtime': 19489.7455, 'eval_samples_per_second': 0.144, 'eval_steps_per_second': 0.018, 'epoch': 1.0}
{'loss': 0.2282, 'learning_rate': 1.286224125624554e-05, 'epoch': 1.07}
{'loss': 0.2264, 'learning_rate': 1.048298834166072e-05, 'epoch': 1.43}
{'loss': 0.2227, 'learning_rate': 8.103735427075898e-06, 'epoch': 1.78}


{'eval_loss': 0.20161357522010803, 'eval_runtime': 2214.6406, 'eval_samples_per_second': 1.265, 'eval_steps_per_second': 0.158, 'epoch': 2.0}
{'loss': 0.219, 'learning_rate': 5.7244825124910784e-06, 'epoch': 2.14}
{'loss': 0.2204, 'learning_rate': 3.3452295979062578e-06, 'epoch': 2.5}
{'loss': 0.2186, 'learning_rate': 9.659766833214373e-07, 'epoch': 2.86}


{'eval_loss': 0.19864130020141602, 'eval_runtime': 36536.9869, 'eval_samples_per_second': 0.077, 'eval_steps_per_second': 0.01, 'epoch': 3.0}
{'train_runtime': 209287.4986, 'train_samples_per_second': 0.161, 'train_steps_per_second': 0.02, 'train_loss': 0.2265650676370603, 'epoch': 3.0}


TrainOutput(global_step=4203, training_loss=0.2265650676370603, metrics={'train_runtime': 209287.4986, 'train_samples_per_second': 0.161, 'train_steps_per_second': 0.02, 'train_loss': 0.2265650676370603, 'epoch': 3.0})

In [48]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 1.22
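
The perplexity reported here is the exponential of the mean evaluation loss. For a masked language model that loss is averaged over the masked positions only, so the figure is not directly comparable to the perplexity of a causal language model. A small worked example using the three `eval_loss` values from the training log above:

```python
import math

# eval_loss values copied from the training log above, one per epoch
eval_losses = [0.2029215693473816, 0.20161357522010803, 0.19864130020141602]

for epoch, loss in enumerate(eval_losses, start=1):
    # perplexity = exp(mean cross-entropy loss)
    print(f"epoch {epoch}: perplexity = {math.exp(loss):.4f}")
```

The last value rounds to the 1.22 printed by the cell above.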


In [ ]:
trainer.push_to_hub()

In [ ]:
tokenizer.push_to_hub('imene-kolli/my_awesome_eli5_mlm_model')

### Inference

In [10]:
text = "The president held the <mask> for 10 years."

In [11]:
from transformers import pipeline

mask_filler = pipeline("fill-mask", "my_awesome_eli5_mlm_model")
mask_filler(text, top_k=3)

[{'score': 0.4038034975528717,
  'token': 558,
  'token_str': ' office',
  'sequence': 'The president held the office for 10 years.'},
 {'score': 0.34403306245803833,
  'token': 737,
  'token_str': ' position',
  'sequence': 'The president held the position for 10 years.'},
 {'score': 0.0841364786028862,
  'token': 633,
  'token_str': ' job',
  'sequence': 'The president held the job for 10 years.'}]

In [80]:
from transformers import pipeline

mask_filler = pipeline("fill-mask", "my_awesome_eli5_mlm_model")
mask_filler('welcome to my <mask>, I am the manager here.', top_k=3)

[{'score': 0.09414557367563248,
  'token': 558,
  'token_str': ' office',
  'sequence': 'welcome to my office, I am the manager here.'},
 {'score': 0.053980667144060135,
  'token': 165,
  'token_str': ' team',
  'sequence': 'welcome to my team, I am the manager here.'},
 {'score': 0.03739987686276436,
  'token': 1082,
  'token_str': ' site',
  'sequence': 'welcome to my site, I am the manager here.'}]

In [81]:
from transformers import pipeline

mask_filler = pipeline("fill-mask", "my_awesome_eli5_mlm_model")
mask_filler('I recently bought the Microsoft <mask> software.', top_k=3)

[{'score': 0.5165475010871887,
  'token': 1387,
  'token_str': ' Office',
  'sequence': 'I recently bought the Microsoft Office software.'},
 {'score': 0.06393763422966003,
  'token': 12591,
  'token_str': ' Edge',
  'sequence': 'I recently bought the Microsoft Edge software.'},
 {'score': 0.0584854893386364,
  'token': 27241,
  'token_str': ' Excel',
  'sequence': 'I recently bought the Microsoft Excel software.'}]

In [75]:
# Find the first training article with 'office' among words 20-29 and mask it
for example in data['train']:
    words = example['text'].split()
    if 'office' in words[20:30]:
        snippet = ' '.join(words[:30])
        print(snippet)
        e = snippet.replace(' office ', ' <mask> ')
        break

''the political shop will probably be getting more and more involved in issues,'' mr. nofziger predicted as he prepared to leave office for the less hectic and more lucrative life


In [76]:
from transformers import pipeline

mask_filler = pipeline("fill-mask", "my_awesome_eli5_mlm_model")
mask_filler(e, top_k=3)

[{'score': 0.22669820487499237,
  'token': 2302,
  'token_str': ' politics',
  'sequence': "''the political shop will probably be getting more and more involved in issues,'' mr. nofziger predicted as he prepared to leave politics for the less hectic and more lucrative life"},
 {'score': 0.16870960593223572,
  'token': 558,
  'token_str': ' office',
  'sequence': "''the political shop will probably be getting more and more involved in issues,'' mr. nofziger predicted as he prepared to leave office for the less hectic and more lucrative life"},
 {'score': 0.07311130315065384,
  'token': 30017,
  'token_str': ' academia',
  'sequence': "''the political shop will probably be getting more and more involved in issues,'' mr. nofziger predicted as he prepared to leave academia for the less hectic and more lucrative life"}]

In [12]:
text

'The president held the <mask> for 10 years.'

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_mlm_model")
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

In [17]:
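from transformers import AutoModelForMaskedLM

# Note: this loads a different checkpoint ("stevhliu/...") than the pipeline cells above
# ("my_awesome_eli5_mlm_model"), which may explain why the ranking below differs from In [11].
model = AutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]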

In [18]:
top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

for token in top_3_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

The president held the  position for 10 years.
The president held the  office for 10 years.
The president held the  job for 10 years.
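
The manual cells above rank candidates by raw logits, while the `fill-mask` pipeline reports a `score` between 0 and 1. That score appears to be the softmax probability over the vocabulary at the mask position, so both approaches should produce the same ordering for the same model. A self-contained sketch of the logits-to-score step on made-up numbers; `softmax` and `top_k` here are small illustrative helpers, not library functions.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def top_k(logits, k=3):
    """Return the k (index, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_logits = [2.0, 0.5, 3.0, -1.0, 1.0]  # made-up values, not model output
for index, prob in top_k(toy_logits):
    print(index, round(prob, 3))
```

With the tensors from the cells above, the equivalent step would be `mask_token_logits.softmax(dim=-1)` followed by `torch.topk`.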
