<a href="https://colab.research.google.com/github/orlandxrf/curso-dl/blob/main/notebooks/10a_BERT_Masked_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Masked Language Model

More about [Fill-Mask](https://huggingface.co/tasks/fill-mask)

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml

## Method 1

In [None]:
from transformers import pipeline

> Change the pre-trained model to one for the Spanish language.

[Hugging Face models](https://huggingface.co/models)

Filter the models by `Fill-Mask` and `Language`.

In [None]:
model_checkpoint = "bert-base-uncased"
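
As one possible way to follow the note above, the checkpoint name could be swapped for a Spanish model. The sketch below assumes the `dccuchile/bert-base-spanish-wwm-uncased` checkpoint (BETO) from the Hugging Face Hub, which is not part of the original notebook. The names `model_checkpoint_es`, `mask_es` and `frase_es` and the example sentence are illustrative only.

```python
# Assumption: a Spanish BERT checkpoint from the Hugging Face Hub (BETO).
# Any model listed under the Fill-Mask task with the Spanish language filter
# could be used instead.
model_checkpoint_es = "dccuchile/bert-base-spanish-wwm-uncased"

# Assumption: BERT-style checkpoints use "[MASK]" as the mask token. Once a
# tokenizer is loaded, it is safer to read it from `tokenizer.mask_token`.
mask_es = "[MASK]"
frase_es = f"¿cuál es {mask_es} nombre?"

print(model_checkpoint_es, "->", frase_es)
```

Passing this name as `model=` to `pipeline('fill-mask', ...)` would download a different model, so the outputs recorded in the rest of this notebook (which use `bert-base-uncased`) would no longer match.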

In [None]:
# Initialize MLM pipeline
mlm = pipeline('fill-mask', model=model_checkpoint)

# Get mask token
mask = mlm.tokenizer.mask_token


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
frase = f'what is {mask} name?'

result = mlm(frase)

for x in result:
    print (x)

{'score': 0.5362831950187683, 'token': 2115, 'token_str': 'your', 'sequence': 'what is your name?'}
{'score': 0.2603796124458313, 'token': 2014, 'token_str': 'her', 'sequence': 'what is her name?'}
{'score': 0.14665333926677704, 'token': 2010, 'token_str': 'his', 'sequence': 'what is his name?'}
{'score': 0.03641762584447861, 'token': 2026, 'token_str': 'my', 'sequence': 'what is my name?'}
{'score': 0.004835779312998056, 'token': 2049, 'token_str': 'its', 'sequence': 'what is its name?'}
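
Each prediction is a dictionary holding the probability (`score`), the vocabulary id (`token`), the decoded token (`token_str`) and the completed sentence (`sequence`). As a small sketch, the output above can be pasted back in as a literal list (so no model is needed) to pick the best candidate and to check how much probability mass the five candidates cover. The scores are rounded to four decimals here.

```python
# Predictions copied from the output above, with scores rounded to 4 decimals.
result = [
    {"score": 0.5363, "token": 2115, "token_str": "your", "sequence": "what is your name?"},
    {"score": 0.2604, "token": 2014, "token_str": "her", "sequence": "what is her name?"},
    {"score": 0.1467, "token": 2010, "token_str": "his", "sequence": "what is his name?"},
    {"score": 0.0364, "token": 2026, "token_str": "my", "sequence": "what is my name?"},
    {"score": 0.0048, "token": 2049, "token_str": "its", "sequence": "what is its name?"},
]

# The candidate with the highest probability.
best = max(result, key=lambda pred: pred["score"])
print(best["sequence"])

# Probability mass covered by the five listed candidates.
covered = sum(pred["score"] for pred in result)
print(f"{covered:.4f}")
```

The five candidates cover roughly 0.98 of the probability; the remainder is spread over the rest of the vocabulary.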


## Method 2

In [None]:
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)



In [None]:
# Show the number of parameters of this checkpoint (model)
parameters_number = model.num_parameters() / 1_000_000

print(f">>> {model_checkpoint}: {round(parameters_number)}M")



>>> bert-base-uncased: 110M


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
import torch

# Get the mask token from the tokenizer loaded above (no dependency on the Method 1 pipeline)
mask = tokenizer.mask_token

frase = f'data science {mask} will not pass the course'

inputs = tokenizer(frase, return_tensors="pt")

print (f"inputs:\t{inputs}\n")
print (f"tokenizer.decode([token]):\t{tokenizer.decode(inputs.input_ids[0])}\n")

print (f"tokenizer.mask_token:\t{tokenizer.mask_token}")

token_logits = model(**inputs).logits

print (f"token_logits:\t{token_logits}\n")

print (f"tokenizer.mask_token_id:\t{tokenizer.mask_token_id}\n")

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

print (f"mask_token_index:\t{mask_token_index}\n")
print (f"mask_token_logits:\t{mask_token_logits}\n")

# Pick the [MASK] candidates with the highest logits
top_10_tokens = torch.topk(mask_token_logits, 10, dim=1).indices[0].tolist()

for token in top_10_tokens:
    print(f">>> {frase.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

inputs:	{'input_ids': tensor([[ 101, 2951, 2671,  103, 2097, 2025, 3413, 1996, 2607,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

tokenizer.decode([token]):	[CLS] data science [MASK] will not pass the course [SEP]

tokenizer.mask_token:	[MASK]
token_logits:	tensor([[[ -6.7515,  -6.7122,  -6.7156,  ...,  -6.0969,  -5.8909,  -4.1559],
         [ -9.3213,  -9.2264,  -9.1887,  ...,  -8.6386,  -7.7959,  -6.7946],
         [ -9.2814,  -8.8226,  -8.9900,  ...,  -8.5409,  -7.7992, -10.2986],
         ...,
         [-14.2551, -14.1695, -13.6868,  ..., -11.5396, -10.7572, -12.0141],
         [ -9.0199,  -8.9663,  -8.8720,  ...,  -8.1527,  -7.9353,  -7.4043],
         [-13.0369, -12.8171, -13.0187,  ..., -10.1752, -11.7309,  -8.2591]]],
       grad_fn=<AddBackward0>)

tokenizer.mask_token_id:	103

mask_token_index:	tensor([3])

mask_token_logits:	tensor([[-5.7791, -5.6853, -5.7119,  ..., -5.7060, -5.5290, -5.8119
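
The logits above are unnormalized scores over the whole vocabulary, whereas the `score` values in Method 1 are probabilities, which is what a softmax over the logits at the mask position gives. The following is a minimal dependency-free sketch of that step followed by a top-k selection. The toy vocabulary and logit values are hypothetical, not model output, and the helpers `softmax` and `top_k` are plain-Python stand-ins for the corresponding tensor operations.

```python
import math

# Hypothetical toy vocabulary and logits for a single [MASK] position.
vocab = ["students", "majors", "teachers", "fiction", "course"]
logits = [2.0, 1.0, 0.5, -1.0, -3.0]


def softmax(values):
    """Convert logits to probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Return the indices of the k largest values, best first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


probs = softmax(logits)
for idx in top_k(logits, 3):
    print(f">>> {vocab[idx]}: {probs[idx]:.3f}")
```

Softmax preserves the ordering of its inputs, so ranking by raw logits (as the cell above does with `torch.topk`) selects the same tokens as ranking by probability.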

## Multiple masks

In [None]:
from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer


model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



In [None]:
# Get the mask token from the tokenizer loaded above (no dependency on the Method 1 pipeline)
mask = tokenizer.mask_token

# frase = f'{mask} science {mask} will not pass the course'
frase = f"Using a Transformer {mask} is {mask}"

inputs = tokenizer(frase, return_tensors="pt")

tokenized_text = tokenizer.tokenize(frase)
print (tokenized_text)

print (inputs)

['using', 'a', 'transform', '##er', '[MASK]', 'is', '[MASK]']
{'input_ids': tensor([[  101,  2478,  1037, 10938,  2121,   103,  2003,   103,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
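
The `input_ids` above contain two special tokens, `[CLS]` (101) and `[SEP]` (102), that `tokenizer.tokenize` does not return, so positions in `input_ids` are shifted by one relative to `tokenized_text`. A short sketch, using the values printed above as plain lists, locates every mask in both views:

```python
# Values copied from the output above.
input_ids = [101, 2478, 1037, 10938, 2121, 103, 2003, 103, 102]
tokenized_text = ["using", "a", "transform", "##er", "[MASK]", "is", "[MASK]"]
mask_token_id = 103  # tokenizer.mask_token_id for bert-base-uncased

# Positions of every [MASK] in the model input (special tokens included).
mask_positions = [i for i, tok_id in enumerate(input_ids) if tok_id == mask_token_id]
print(mask_positions)  # [5, 7]

# The same masks in tokenized_text sit one position earlier (no [CLS]).
text_positions = [i for i, tok in enumerate(tokenized_text) if tok == "[MASK]"]
print(text_positions)  # [4, 6]
```

The model's logits are indexed like `input_ids`, so predictions for the two masks live at positions 5 and 7.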


In [None]:
outputs = model(**inputs)
predictions = outputs[0]

print (predictions)

tensor([[[ -6.7083,  -6.6438,  -6.6441,  ...,  -6.0651,  -5.7368,  -3.8747],
         [ -6.6019,  -6.3780,  -6.6740,  ...,  -6.5999,  -5.0723,  -4.1171],
         [-13.4614, -13.3045, -13.3661,  ..., -12.4056,  -9.6751,  -9.4832],
         ...,
         [ -9.6787,  -9.4650,  -9.4854,  ...,  -9.3639,  -6.1835,  -8.5192],
         [ -6.7329,  -6.4565,  -6.7158,  ...,  -6.6708,  -6.1550,  -2.9264],
         [-10.4209, -10.0972, -10.2875,  ...,  -8.0917,  -8.6419,  -7.6760]]],
       grad_fn=<AddBackward0>)


In [None]:
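# Sort the vocabulary logits at every position, highest first
sorted_preds, sorted_idx = predictions[0].sort(dim=-1, descending=True)

# Print the k-th ranked token for each position (one row per rank).
# Position 0 is [CLS], so the columns start at position 1. Note that
# tokenized_text has no special tokens, so the range stops at position
# len(tokenized_text) - 1 and the final [MASK] (position 7) is not shown.
for k in range(10):
    predicted_index = [sorted_idx[i, k].item() for i in range(0, len(tokenized_text))]
    predicted_token = [tokenizer.convert_ids_to_tokens([predicted_index[x]])[0] for x in range(1, len(tokenized_text))]
    print(predicted_token)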

['using', 'a', 'transform', '##er', 'that', 'is']
['for', 'the', 'transforms', '-', ',', 'be']
['with', 'an', 'convert', '##ian', 'it', ':']
['use', 'any', 'sum', ':', 'where', 'for']
['applying', 'this', 'radical', 'kernel', 'this', ',']
['so', 'is', 'mean', 'transform', 'there', 'as']
['taking', 'its', 'fourier', 'operator', ':', 'then']
['then', 'with', 'grid', 'in', 'which', 'becomes']
['in', 'another', 'transformation', ',', '.', 'and']
['choosing', 'in', 'auxiliary', '*', 'function', 'if']
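
When a sentence has several masks, `str.replace` as used in Method 2 would substitute every occurrence of the mask token with the same candidate. One possible way around this is to fill the masks left to right, one token each. The helper name `fill_masks` and the candidate tokens below are hypothetical and are not model predictions.

```python
def fill_masks(text, tokens, mask_token="[MASK]"):
    """Replace each mask in `text`, left to right, with the next token."""
    if text.count(mask_token) != len(tokens):
        raise ValueError("need exactly one token per mask")
    for token in tokens:
        # count=1 replaces only the first remaining mask
        text = text.replace(mask_token, token, 1)
    return text


frase = "Using a Transformer [MASK] is [MASK]"

# Hypothetical candidates, one per mask (not actual model predictions).
print(fill_masks(frase, ["model", "easy"]))
# Using a Transformer model is easy
```

In the notebook, the tokens would come from the top-ranked ids at each mask position, decoded with `tokenizer.decode` or `tokenizer.convert_ids_to_tokens`.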
