<a href="https://colab.research.google.com/github/nepomucenoc/MLM/blob/main/mlm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Masked Language Model
 In machine learning, MLM stands for "Masked Language Model," which is a technique used in Natural Language Processing (NLP) to train language models by randomly masking certain words within a sentence and then asking the model to predict what those masked words are based on the surrounding context; this helps the model learn deep semantic relationships between words and understand the overall meaning of a sentence, making it particularly useful for tasks like text classification, question answering, and text generation.

In [1]:
# Install dependencies
!pip install transformers datasets

import torch
from transformers import BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments
from transformers import pipeline
from datasets import load_dataset

# Loading the Tokenizer and the BERT Pre-Trained Model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Loading data - using a simple HuggingFace dataset for NLP
dataset = load_dataset('imdb', split='train[:5%]')  # Apenas uma amostra para simplificação

# Tokenizing the texts
def tokenize_function(examples):
    # Tokenizing and Preparing Data for Masked Language Modeling
    encodings = tokenizer(examples['text'], padding="max_length", truncation=True, return_tensors="pt")
    encodings['labels'] = encodings['input_ids'].detach().clone()  # Using input_ids themselves as labels
    # Applying a random mask to the texts for MLM
    # Creating a mask at random positions
    # Setting 15% of tokens to be masked
    for i in range(len(encodings['input_ids'])):
        # Creating a random mask (just as a simple example)
        rand = torch.rand(encodings['input_ids'].shape[1])
        mask_idx = rand < 0.15
        encodings['input_ids'][i, mask_idx] = tokenizer.mask_token_id  # Replaces token with mask
    return encodings

# Apply tokenization to the entire dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Now tokenized dataset should contain the columns: 'input_ids', 'attention_mask', 'labels'

# Defining training parameters
training_args = TrainingArguments(
    output_dir='./results',          # where the results are save
    evaluation_strategy="epoch",     # evaluate each epoch
    learning_rate=2e-5,              # Learning rate
    per_device_train_batch_size=4,   # Training batch size (reduced)
    num_train_epochs=3,              # Number of training epochs
    weight_decay=0.01,               # Weight decay
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,  # Here we pass the tokenized dataset
    eval_dataset=tokenized_datasets    # For the evaluation, we also used the tokenized dataset
)

# Start train
trainer.train()

# Text Generation Pipeline with Fine-Tuning
text_generator = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Testing the model with an example sentence
input_text = "The quick brown fox jumps over the [MASK] dog."
output = text_generator(input_text)

# Displaying the generated output
print(f"Generated Text: {output[0]['sequence']}")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture clas

Epoch,Training Loss,Validation Loss
1,No log,0.15389
2,0.528100,0.126949
3,0.528100,0.118201


Device set to use cpu


Generated Text: the quick brown fox jumps over the little dog.


## Examples test

In [2]:
# Testando o modelo com uma frase de exemplo
input_text = "The hair of that girl is [MASK]."
output = text_generator(input_text)

# Exibindo a saída gerada
print(f"Generated Text: {output[0]['sequence']}")

Generated Text: the hair of that girl is black.


In [22]:
# Testando o modelo com outra frase de exemplo
input_text = "I love [MASK]."
output = text_generator(input_text)

# Exibindo a saída gerada
print(f"Generated Text: {output[0]['sequence']}")

Generated Text: i love you.


## 1. Save Model

In [4]:
model.save_pretrained('./fine_tuned_model')

# save tokenizer
tokenizer.save_pretrained('./fine_tuned_model')

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json')

## 2. Load the model and tokenizer later:

In [6]:
from transformers import BertForMaskedLM, BertTokenizer

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained('./fine_tuned_model')
tokenizer = BertTokenizer.from_pretrained('./fine_tuned_model')

## 3. Save checkpoints during training:

In [7]:
training_args = TrainingArguments(
    output_dir='./results',          # Onde os resultados são salvos
    save_steps=500,                  # Salvar checkpoints a cada 500 passos
    save_total_limit=2,              # Manter apenas 2 checkpoints mais recentes
    evaluation_strategy="epoch",     # Avaliação a cada época
    learning_rate=2e-5,              # Taxa de aprendizado
    per_device_train_batch_size=4,   # Tamanho do lote de treinamento
    num_train_epochs=3,              # Número de épocas de treinamento
    weight_decay=0.01,               # Decaimento de peso
)



## 4. Save the template to Google Drive (optional):

In [8]:
from google.colab import drive
drive.mount('/content/drive')

# Save in Google Drive
model.save_pretrained('/content/drive/MyDrive/DataScience/MLM/fine_tuned_model')
tokenizer.save_pretrained('/content/drive/MyDrive/DataScience/MLM/fine_tuned_model')

Mounted at /content/drive


('/content/drive/MyDrive/DataScience/MLM/fine_tuned_model/tokenizer_config.json',
 '/content/drive/MyDrive/DataScience/MLM/fine_tuned_model/special_tokens_map.json',
 '/content/drive/MyDrive/DataScience/MLM/fine_tuned_model/vocab.txt',
 '/content/drive/MyDrive/DataScience/MLM/fine_tuned_model/added_tokens.json')