# Train Using MLM

In this notebook we'll take a look at the logic behind fine-tuning a model using masked-language modelling (MLM).

First we'll import all we need.

In [1]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

In [2]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

text = ("After Abraham Lincoln won the November 1860 presidential [MASK] on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces [MASK] Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")

inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
outputs.keys()

odict_keys(['logits'])

Because we passed no labels, this returns just our MLM output logits.

In [4]:
outputs.logits.shape

torch.Size([1, 62, 30522])

To identify the positions of the **\[MASK\]** tokens we can check the `inputs` tensor for token IDs equal to *103*, the ID of the MASK token (also available as `tokenizer.mask_token_id`).

In [5]:
mask_pos = torch.flatten((inputs.input_ids[0] == 103).nonzero()).tolist()
mask_pos

[9, 43]
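
Once the mask positions are known, the model's top guess for each one can be read from the logits by taking the argmax over the vocabulary dimension. The sketch below is a minimal illustration that uses a toy `input_ids` tensor and hand-built `logits` (both assumptions, standing in for the real tokenizer and model outputs) so that it runs without downloading anything.

```python
import torch

MASK_ID = 103  # ID of [MASK] in bert-base-uncased, as used above

# Toy stand-ins (assumptions): a 6-token sequence and a 10-entry vocabulary
input_ids = torch.tensor([[101, 7, MASK_ID, 5, MASK_ID, 102]])
logits = torch.zeros(1, 6, 10)
logits[0, 2, 3] = 5.0  # pretend the model favours token 3 at position 2
logits[0, 4, 8] = 5.0  # pretend the model favours token 8 at position 4

# positions of the [MASK] tokens in the first (only) sequence
mask_pos = (input_ids[0] == MASK_ID).nonzero(as_tuple=True)[0]

# highest-scoring vocabulary entry at each masked position
predicted_ids = logits[0, mask_pos].argmax(dim=-1)
```

With the real model, `tokenizer.convert_ids_to_tokens(predicted_ids.tolist())` would turn these IDs back into readable tokens.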

These are the two positions for which we must calculate the loss when training our model. How does that work? We compare the true token at each of those positions with the predicted `outputs` at the same position, with the true token converted to a one-hot encoding and the prediction converted to a probability distribution.

To convert a token to a one-hot encoding we need the vocabulary size: we create a vector of zeros of that length and place a **1** at the index given by the token ID.

In [6]:
vocab_size = len(tokenizer.get_vocab())
vocab_size

30522

This matches the last dimension of `outputs.logits`, which holds one score per vocabulary entry for each of the 62 token positions:

In [7]:
outputs.logits.shape

torch.Size([1, 62, 30522])
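
The comparison described above (a one-hot target against a predicted probability distribution) is a cross-entropy. The following sketch works it through by hand on a toy 5-entry vocabulary and checks the result against PyTorch's built-in function. The tensors `logits` and `targets` are made-up values for illustration, not outputs of the model above.

```python
import torch
import torch.nn.functional as F

vocab_size = 5  # toy vocabulary (assumption); BERT's is 30522

# made-up scores for two masked positions, one row per position
logits = torch.tensor([[2.0, 0.5, 0.1, 0.1, 0.3],
                       [0.2, 0.1, 3.0, 0.4, 0.1]])
# the true token ID at each of those positions
targets = torch.tensor([0, 2])

# true tokens -> one-hot vectors of length vocab_size
one_hot = F.one_hot(targets, num_classes=vocab_size).float()
# predicted scores -> probability distributions
probs = F.softmax(logits, dim=-1)

# cross-entropy: negative log-probability assigned to the true token, averaged
manual_loss = -(one_hot * probs.log()).sum(dim=-1).mean()

# the built-in function does the softmax and the comparison in one step
library_loss = F.cross_entropy(logits, targets)
```

The two values agree, which is why in practice the labels are passed as plain token IDs rather than as explicit one-hot vectors.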

In practice, we must mask our tokens randomly. We then feed the original (unmasked) *input_ids* into the model as *labels*, and keep the new masked tensor as *input_ids*.

To test this, let's unmask our text first.

In [8]:
text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces attacked Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")

And now convert our text using the `tokenizer`:

In [9]:
inputs = tokenizer(text, return_tensors='pt')

In [10]:
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
         22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,  3792,
          1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055, 17331,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Now we want to mask a number of tokens in the *input_ids* tensor. The original BERT model was pretrained using a masking probability of 15%, alongside a few additional rules. We will create a similar but simpler implementation that keeps only the primary 15% masking step.
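
For reference, the additional rules in BERT pretraining are that, of the tokens selected for masking, roughly 80% are replaced with MASK, 10% are replaced with a random token, and 10% are left unchanged. The sketch below is one possible way to write that. The function name `bert_style_mask` and its parameters are hypothetical, and the rest of this notebook keeps to the simpler version.

```python
import torch

def bert_style_mask(input_ids, vocab_size, mask_id=103, special_ids=(101, 102),
                    mask_prob=0.15, generator=None):
    """Hypothetical helper sketching the 80/10/10 masking rule."""
    labels = input_ids.clone()

    # first draw: which tokens are selected at all (never special tokens)
    selected = torch.rand(input_ids.shape, generator=generator) < mask_prob
    for special in special_ids:
        selected &= input_ids != special

    # second draw: what happens to each selected token
    action = torch.rand(input_ids.shape, generator=generator)
    to_mask = selected & (action < 0.8)                      # ~80%: [MASK]
    to_random = selected & (action >= 0.8) & (action < 0.9)  # ~10%: random token
    # the remaining ~10% of selected tokens are left unchanged

    corrupted = input_ids.clone()
    corrupted[to_mask] = mask_id
    random_ids = torch.randint(0, vocab_size, input_ids.shape, generator=generator)
    corrupted[to_random] = random_ids[to_random]
    return corrupted, labels, selected

# toy sequence (assumption): CLS, 198 arbitrary token IDs, SEP
demo_ids = torch.arange(1000, 1200).unsqueeze(0)
demo_ids[0, 0], demo_ids[0, -1] = 101, 102
gen = torch.Generator().manual_seed(0)
demo_corrupted, demo_labels, demo_selected = bert_style_mask(
    demo_ids, vocab_size=30522, generator=gen)
```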

In [11]:
# create random array of floats in equal dimension to input_ids
rand = torch.rand(inputs.input_ids.shape)
rand

tensor([[0.1212, 0.8009, 0.4602, 0.7828, 0.3292, 0.5099, 0.6377, 0.8134, 0.3472,
         0.8093, 0.4983, 0.8700, 0.4231, 0.2130, 0.7116, 0.1163, 0.8053, 0.0981,
         0.6704, 0.0806, 0.7791, 0.6578, 0.5987, 0.9692, 0.2289, 0.5254, 0.4258,
         0.8538, 0.7290, 0.7587, 0.4773, 0.7447, 0.9237, 0.9244, 0.9657, 0.9437,
         0.2099, 0.9151, 0.5392, 0.1769, 0.6366, 0.6294, 0.5757, 0.7823, 0.5347,
         0.9884, 0.4199, 0.0152, 0.7966, 0.0618, 0.6426, 0.6808, 0.3977, 0.2001,
         0.7492, 0.8114, 0.0641, 0.9057, 0.7281, 0.9557, 0.2229, 0.7956]])

In [12]:
# where the random array is less than 0.15, we set true, these will be our MASK tokens
mask_arr = rand < 0.15
mask_arr

tensor([[ True, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False,  True, False,  True, False,  True,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False,  True, False,  True,
         False, False, False, False, False, False,  True, False, False, False,
         False, False]])

In [13]:
# extract the True value index positions
selection = torch.flatten((mask_arr[0]).nonzero()).tolist()
selection

[0, 15, 17, 19, 47, 49, 56]

Next we set the *input_ids* values at these *True* positions to *103*. First, though, we need to copy the original tensor into a new one called *labels*:

In [14]:
inputs['labels'] = inputs.input_ids.detach().clone()
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
         22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,  3792,
          1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055, 17331,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([

And now use `mask_arr` to mask ~15% of our values:

In [15]:
inputs.input_ids[0, selection] = 103
inputs

{'input_ids': tensor([[  103,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,   103,  1010,   103,  3988,   103,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
         22965,  2923,  2749,  4457,  3481,  7680,  3334,   103,  2148,   103,
          1010,  2074,  2058,  1037,  3204,  2044,   103,  1005,  1055, 17331,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([

Notice that we have a slight problem here: our CLS token has been replaced by a MASK token. We don't want this, nor do we want to replace any other special tokens. In this case we have only two special tokens, CLS and SEP, at the beginning and end of our sequence respectively, so the solution is easy: we add another condition when creating our Boolean mask tensor `mask_arr`. This condition checks the original `inputs.input_ids` and is *True* only where a value is **not** equal to *101* or *102* (*CLS* and *SEP* respectively):

In [17]:
# reinitialize inputs.input_ids first (to remove mask)
inputs = tokenizer(text, return_tensors='pt')
# and copy to labels tensor
inputs['labels'] = inputs.input_ids.detach().clone()
# now test the new conditional logic
(inputs.input_ids != 101) * (inputs.input_ids != 102)

tensor([[False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True, False]])

And add this logic to our `mask_arr`:

In [18]:
# where the random array is less than 0.15, we set true, these will be our MASK tokens
# BUT - we must also add our new conditions too
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102)
mask_arr

tensor([[False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False,  True, False,  True, False,  True,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False,  True, False,  True,
         False, False, False, False, False, False,  True, False, False, False,
         False, False]])

In [19]:
inputs.input_ids

tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
         22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,  3792,
          1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055, 17331,
          1012,   102]])

And run through the same steps again.

In [20]:
selection = torch.flatten((mask_arr[0]).nonzero()).tolist()  # create selection from mask_arr
inputs.input_ids[0, selection] = 103  # apply selection index to inputs.input_ids, adding MASK tokens
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,   103,  1010,   103,  3988,   103,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
         22965,  2923,  2749,  4457,  3481,  7680,  3334,   103,  2148,   103,
          1010,  2074,  2058,  1037,  3204,  2044,   103,  1005,  1055, 17331,
          1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([
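
The steps above can be gathered into a single function. The sketch below is one way to do that. The name `mask_tokens` and its parameters are hypothetical, and it uses `&` and Boolean indexing in place of `*` and the `selection` list so that it handles a batch with more than one sequence.

```python
import torch

def mask_tokens(input_ids, mask_id=103, special_ids=(101, 102),
                mask_prob=0.15, generator=None):
    """Hypothetical helper: return (masked input_ids, labels, mask_arr)."""
    # labels are an untouched copy of the original token IDs
    labels = input_ids.detach().clone()

    # random floats in the same shape as input_ids
    rand = torch.rand(input_ids.shape, generator=generator)
    mask_arr = rand < mask_prob

    # never mask special tokens such as CLS (101) and SEP (102)
    for special in special_ids:
        mask_arr &= input_ids != special

    # Boolean indexing applies the mask to every row of the batch
    masked_ids = input_ids.clone()
    masked_ids[mask_arr] = mask_id
    return masked_ids, labels, mask_arr

# toy input (assumption): the first few token IDs from above, plus SEP
ids = torch.tensor([[101, 2044, 8181, 5367, 2180, 1996, 102]])
masked_ids, labels, mask_arr = mask_tokens(ids, mask_prob=0.5)
```

Returning a new tensor rather than editing `input_ids` in place means the original IDs stay available, which avoids having to re-run the tokenizer as in the cell above.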

Now we're ready. We pass these inputs into our model.

In [21]:
outputs = model(**inputs)

And we'll find that we now have an additional output tensor, the *loss* tensor.

In [22]:
outputs.keys()

odict_keys(['loss', 'logits'])

In [23]:
outputs.loss

tensor(0.7898, grad_fn=<NllLossBackward>)
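
One detail worth checking: because `labels` here is a full copy of the original `input_ids`, the loss above averages over every position rather than only the masked ones. A common convention, which the Hugging Face masked-LM heads follow through PyTorch's default `ignore_index`, is to set the label to -100 wherever no loss should be computed. The sketch below shows the effect using PyTorch alone. The toy tensors and random logits are assumptions standing in for real model outputs.

```python
import torch
import torch.nn.functional as F

vocab_size = 30522

# toy masked input and its original token IDs (assumptions)
input_ids = torch.tensor([[101, 2044, 103, 5367, 103, 102]])
labels = torch.tensor([[101, 2044, 8181, 5367, 2180, 102]])
mask_arr = input_ids == 103

# keep the true token only at masked positions, -100 everywhere else
masked_only_labels = labels.clone()
masked_only_labels[~mask_arr] = -100

# random scores standing in for model logits
torch.manual_seed(0)
logits = torch.randn(1, 6, vocab_size)
flat_logits = logits.reshape(-1, vocab_size)

# loss over every position (what full-copy labels give)
loss_all = F.cross_entropy(flat_logits, labels.reshape(-1))
# loss over masked positions only: -100 is ignored by default
loss_masked = F.cross_entropy(flat_logits, masked_only_labels.reshape(-1))
# the same thing computed by selecting the masked positions explicitly
loss_manual = F.cross_entropy(logits[mask_arr], labels[mask_arr])
```

Here `loss_masked` and `loss_manual` agree, confirming that the -100 labels restrict the loss to the two masked positions.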

It is this loss that we will be training on.
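
Training on this loss follows the usual PyTorch pattern: zero the gradients, run a forward pass, call `backward()` on the loss, and step the optimizer. The sketch below shows that loop on a tiny stand-in model (an embedding followed by a linear layer, an assumption made so the example runs quickly without downloading BERT). With the real model, the loss would come from `outputs.loss` rather than from a separate loss function.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, hidden = 50, 16  # toy sizes (assumptions)

# stand-in for BERT: token embedding followed by a vocabulary projection
toy_model = nn.Sequential(nn.Embedding(vocab_size, hidden),
                          nn.Linear(hidden, vocab_size))
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()  # ignores labels equal to -100 by default

# token 3 plays the role of [MASK]; labels are set only at masked positions
input_ids = torch.tensor([[1, 7, 3, 9, 3, 2]])
labels = torch.tensor([[-100, -100, 12, -100, 30, -100]])

losses = []
for _ in range(20):
    optimizer.zero_grad()                  # clear gradients from the last step
    logits = toy_model(input_ids)          # shape [1, 6, vocab_size]
    loss = loss_fn(logits.reshape(-1, vocab_size), labels.reshape(-1))
    loss.backward()                        # backpropagate the MLM loss
    optimizer.step()                       # update the weights
    losses.append(loss.item())
```

The recorded `losses` should fall over the 20 steps, which is the behaviour we want when fine-tuning the full model on many masked sequences.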