# Exploring tokenization in MLM

## Conclusion
To evaluate a model's ability to recover a masked word properly, we should take the word's tokenization into account and give the model one `<mask>` token per subword token of that word.

For example, consider the word "aileron". In context, `roberta-base` splits it into three tokens: `'Ġa', 'iler', 'on'`. If we replace the word with a single `<mask>` token, the model can never predict "aileron", because its vocabulary has no single token for the full word. If we instead recognize that "aileron" spans three tokens and replace it with `<mask> <mask> <mask>`, the model can recover the word.
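
The masking step described above can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: `mask_word` is a hypothetical helper, and the subword count is passed in by hand (in practice it would come from something like `len(tokenizer.tokenize(word))`). The count of 3 for "aileron" is taken from the tokenization quoted above.

```python
def mask_word(text, word, n_subword_tokens, mask_token="<mask>"):
    """Replace the first occurrence of `word` with one mask token per subword token."""
    if word not in text:
        raise ValueError(f"{word!r} does not occur in the text")
    masks = " ".join([mask_token] * n_subword_tokens)
    return text.replace(word, masks, 1)


sentence = "Replaced the right aileron, which is a movable part of an airplane wing."

# One mask for the whole word versus one mask per subword token.
single_mask = mask_word(sentence, "aileron", 1)
triple_mask = mask_word(sentence, "aileron", 3)

print(single_mask)
print(triple_mask)
```

Only the first occurrence is replaced, so any later mention of the word stays visible to the model as context.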

## Notes 
- `tokenizer(text)` and `tokenizer.tokenize(text)` differ: the former returns the `input_ids` and `attention_mask`, the latter returns only the token strings.
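
One way to see how the two calls relate: for a tokenizer that wraps the sequence in exactly one start and one end token, as `roberta-base` does in the output further down, the ids from `tokenizer(text)` with the first and last entries removed should match `tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))`. The sketch below is a hedged check of that relationship. `inner_ids_match` is a hypothetical helper, and the one-special-token-per-end assumption does not hold for every tokenizer.

```python
def inner_ids_match(tokenizer, text):
    """Compare tokenizer(text) against tokenize() followed by convert_tokens_to_ids()."""
    encoding = tokenizer(text)          # dict-like: input_ids, attention_mask
    tokens = tokenizer.tokenize(text)   # subword strings only, no special tokens
    ids_from_tokens = tokenizer.convert_tokens_to_ids(tokens)
    # Assumption: exactly one special token at each end (<s> ... </s> for RoBERTa).
    inner_ids = list(encoding["input_ids"])[1:-1]
    return inner_ids == list(ids_from_tokens)
```

With the tokenizer loaded in this notebook, the call would look like `inner_ids_match(tokenizer, "The number 90210 is a zip code")`.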

In [1]:
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM


In [2]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

In [3]:
model = RobertaForMaskedLM.from_pretrained('roberta-base')

In [4]:
model

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): 

In [5]:
texts_and_masked_words = {
    0: ("The element with atomic number 70 is Ytterbium. Ytterbium starts with the letter Y and ends with M. It is named after Ytterby, the village in Sweden", "Ytterbium"), 
    1: ("My friend lives in a neighbourhood called Mississauga. It is in the west of Toronto, and starts with the letter M", "Mississauga"),
    2: ("Replaced the right aileron, which is a movable part of an airplane wing. The aileron replacement is part of regular maintenance", "aileron"), 
    3: ("The number 90210 is a zip code in Beverly Hills, California, USA. 90210 is also the name of an American teen drama television series", "90210")
}

text_selected = 3

In [6]:
text = texts_and_masked_words[text_selected][0]
word = texts_and_masked_words[text_selected][1]
print(text)

The number 90210 is a zip code in Beverly Hills, California, USA. 90210 is also the name of an American teen drama television series


In [7]:
print(word)

90210


In [8]:
encoding = tokenizer(text, return_tensors='pt', padding=False)

In [9]:
encoding

{'input_ids': tensor([[    0,   133,   346,   361,  4197,   698,    16,    10, 23595,  3260,
            11, 13130,  4962,     6,   886,     6,  2805,     4,   361,  4197,
           698,    16,    67,     5,   766,     9,    41,   470,  6066,  4149,
          2384,   651,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [10]:
num_words = len(text.split())
num_words

24

In [11]:
num_tokens_without_start_and_stop = len(encoding['input_ids'][0]) - 2  # subtract the start and stop tokens 
num_tokens_without_start_and_stop

31

In [12]:
tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])

['<s>',
 'The',
 'Ġnumber',
 'Ġ9',
 '02',
 '10',
 'Ġis',
 'Ġa',
 'Ġzip',
 'Ġcode',
 'Ġin',
 'ĠBeverly',
 'ĠHills',
 ',',
 'ĠCalifornia',
 ',',
 'ĠUSA',
 '.',
 'Ġ9',
 '02',
 '10',
 'Ġis',
 'Ġalso',
 'Ġthe',
 'Ġname',
 'Ġof',
 'Ġan',
 'ĠAmerican',
 'Ġteen',
 'Ġdrama',
 'Ġtelevision',
 'Ġseries',
 '</s>']

## Approach 1: a single `<mask>` token for the whole word

In [13]:
masked_sentence_01 = text.replace(word, "<mask>", 1)
masked_sentence_01

'The number <mask> is a zip code in Beverly Hills, California, USA. 90210 is also the name of an American teen drama television series'

In [14]:
inputs = tokenizer(masked_sentence_01, return_tensors='pt')

num_tokens = len(inputs['input_ids'][0])
num_tokens

31

In [15]:
tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

['<s>',
 'The',
 'Ġnumber',
 '<mask>',
 'Ġis',
 'Ġa',
 'Ġzip',
 'Ġcode',
 'Ġin',
 'ĠBeverly',
 'ĠHills',
 ',',
 'ĠCalifornia',
 ',',
 'ĠUSA',
 '.',
 'Ġ9',
 '02',
 '10',
 'Ġis',
 'Ġalso',
 'Ġthe',
 'Ġname',
 'Ġof',
 'Ġan',
 'ĠAmerican',
 'Ġteen',
 'Ġdrama',
 'Ġtelevision',
 'Ġseries',
 '</s>']

In [16]:
# model predictions: 
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

In [17]:
# 1 sentence, 31 tokens, scores for each of the 50,265 tokens in the vocab
predictions.shape

torch.Size([1, 31, 50265])

In [18]:
predicted_ids = torch.argmax(predictions, dim=2)
predicted_ids.shape

torch.Size([1, 31])

In [19]:
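# predicted_ids has shape (batch, seq_len), so iterating over it yields one row per
# sentence; each decode call returns the full predicted sequence as a single string.
predicted_tokens = [tokenizer.decode(token_id) for token_id in predicted_ids]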
predicted_tokens

['<s>The number 9 is a zip code in Beverly Hills, California, USA</s> 90210 is also the name of an American teen drama television series</s>']
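
Decoding the whole sequence makes it hard to see what the model predicted for the masked slot itself. A small sketch of reading off only the masked positions follows. `predictions_at_masks` is a hypothetical helper that works on plain Python lists (tensors can be converted with `.tolist()`), and 50264 is assumed to be the id of `<mask>` in `roberta-base`; reading it from `tokenizer.mask_token_id` is safer. The ids are a shortened sample based on the encoding printed earlier, and the predicted id at the masked slot is illustrative, chosen to match the " 9" seen in the decoded output.

```python
MASK_ID = 50264  # assumed id of <mask> for roberta-base; prefer tokenizer.mask_token_id


def predictions_at_masks(input_ids, predicted_ids, mask_id=MASK_ID):
    """Return (position, predicted_id) pairs for every masked input position."""
    if len(input_ids) != len(predicted_ids):
        raise ValueError("input_ids and predicted_ids must have the same length")
    return [
        (pos, pred)
        for pos, (tok, pred) in enumerate(zip(input_ids, predicted_ids))
        if tok == mask_id
    ]


# Shortened example: <s> The Ġnumber <mask> Ġis </s>
input_ids = [0, 133, 346, MASK_ID, 16, 2]
predicted_ids = [0, 133, 346, 361, 16, 2]  # illustrative argmax output; 361 is 'Ġ9'

masked_predictions = predictions_at_masks(input_ids, predicted_ids)
print(masked_predictions)
```

With a single mask there is only one slot to fill, so at most one subword token of "90210" can come back.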

## Approach 2: one `<mask>` token per subword token

In [20]:
word_tokens = tokenizer.tokenize(word)
word_tokens

['9', '02', '10']

In [21]:
num_tokens = len(word_tokens)
num_tokens

3
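
Note that `tokenizer.tokenize(word)` tokenizes the word in isolation, so the first piece has no leading `Ġ`, whereas in the sentence it is `Ġ9`. The count is 3 either way here, but byte-level BPE can split a word differently with and without a preceding space, so one option is to count tokens from the in-context tokenization instead. The sketch below does that on a list of token strings. `find_word_span` is a hypothetical helper, it assumes plain ASCII words, and it could in principle match a span that starts in the middle of a longer word.

```python
def find_word_span(tokens, word, space_marker="Ġ"):
    """Return (start, end) of the first token span that spells `word`, or None."""
    for start in range(len(tokens)):
        spelled = ""
        for end in range(start, len(tokens)):
            piece = tokens[end]
            if end == start:
                # Ignore the leading-space marker on the first piece of the word.
                piece = piece.lstrip(space_marker)
            spelled += piece
            if spelled == word:
                return start, end + 1
            if not word.startswith(spelled):
                break  # this start position cannot spell the word
    return None


# First part of the in-context tokenization printed earlier.
context_tokens = ["<s>", "The", "Ġnumber", "Ġ9", "02", "10", "Ġis", "Ġa", "Ġzip", "Ġcode"]

span = find_word_span(context_tokens, "90210")
n_masks = span[1] - span[0]
print(span, n_masks)
```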

In [22]:
masked_sentence_02 = text.replace(word, " ".join(["<mask>"] * num_tokens), 1)
masked_sentence_02

'The number <mask> <mask> <mask> is a zip code in Beverly Hills, California, USA. 90210 is also the name of an American teen drama television series'

In [23]:
inputs_02 = tokenizer(masked_sentence_02, return_tensors='pt')

In [24]:
tokenizer.convert_ids_to_tokens(inputs_02['input_ids'][0])

['<s>',
 'The',
 'Ġnumber',
 '<mask>',
 '<mask>',
 '<mask>',
 'Ġis',
 'Ġa',
 'Ġzip',
 'Ġcode',
 'Ġin',
 'ĠBeverly',
 'ĠHills',
 ',',
 'ĠCalifornia',
 ',',
 'ĠUSA',
 '.',
 'Ġ9',
 '02',
 '10',
 'Ġis',
 'Ġalso',
 'Ġthe',
 'Ġname',
 'Ġof',
 'Ġan',
 'ĠAmerican',
 'Ġteen',
 'Ġdrama',
 'Ġtelevision',
 'Ġseries',
 '</s>']

In [25]:
num_tokens = len(inputs_02['input_ids'][0])
num_tokens

33

In [26]:
# predictions 
with torch.no_grad(): 
    outputs = model(**inputs_02)
    predictions = outputs.logits 
    
# 1 sentence, 33 tokens, scores for each of the 50,265 tokens in the vocab
predictions.shape

torch.Size([1, 33, 50265])

In [27]:
predicted_ids = torch.argmax(predictions, dim=2)
predicted_ids.shape

torch.Size([1, 33])

In [28]:
# As in Approach 1, each decode call returns the full predicted sequence for one sentence.
predicted_tokens = [tokenizer.decode(token_id) for token_id in predicted_ids]
predicted_tokens

['<s>The number 90210 is a zip code in Beverly Hills, California, USA</s> 90210 is also the name of an American teen drama television series</s>']
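
To turn the comparison of the two approaches into a check, the tokens predicted at the masked positions can be joined and compared with the target word. The sketch below assumes the token strings use `Ġ` to mark a leading space, as in the token lists above, and only handles plain ASCII; `tokenizer.convert_tokens_to_string` is the more general route. `join_subwords` and `recovers_word` are hypothetical names, and the token lists are illustrative, chosen to be consistent with the decoded outputs of the two approaches.

```python
def join_subwords(tokens, space_marker="Ġ"):
    """Concatenate byte-level BPE token strings, turning the space marker into a space."""
    return "".join(tokens).replace(space_marker, " ")


def recovers_word(predicted_tokens, word):
    """True if the predicted subword tokens spell exactly the target word."""
    return join_subwords(predicted_tokens).strip() == word


# Approach 1: one mask, so only one token can be predicted for the word.
approach_1 = recovers_word(["Ġ9"], "90210")
# Approach 2: three masks, one per subword token of the word.
approach_2 = recovers_word(["Ġ9", "02", "10"], "90210")

print(approach_1, approach_2)
```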