# BERT for Pre-training 
`BertForPreTraining` has two output heads: NSP (Next Sentence Prediction) and MLM (Masked Language Modelling). These two heads are used to pre-train the model (or to continue pre-training it on our own text); once that's done, we can drop the two heads and use BERT as is.

In [1]:
from transformers import BertTokenizer, BertForPreTraining
import torch

In [2]:
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForPreTraining.from_pretrained(model_name)

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
text = ("After Abraham Lincoln won the November 1860 presidential [MASK] on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces [MASK] Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")

In [4]:
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)

In [5]:
outputs.keys()

odict_keys(['prediction_logits', 'seq_relationship_logits'])

In [7]:
outputs['prediction_logits'].shape, outputs['seq_relationship_logits'].shape

(torch.Size([1, 62, 30522]), torch.Size([1, 2]))

- **prediction_logits** - Shape [1, 62, 30522]. This is the MLM head output: 1 is the batch size, 62 is the number of tokens in the input (including the special [CLS] and [SEP] tokens), and 30522 is the vocabulary size of BERT, so each token position gets one logit per vocabulary entry.
- **seq_relationship_logits** - Shape [1, 2]. This is the NSP head output: two logits scoring whether the second sentence is or is not the continuation of the first.
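To make the two shapes concrete, here is a minimal sketch that reproduces them with plain linear layers. It is a simplification under stated assumptions: the hidden size of 768 is the `bert-base` value, the encoder outputs are random stand-ins, and the real MLM head also applies an extra transform layer before the final projection.

```python
import torch

# bert-base-uncased sizes; seq_len matches the 62-token input above.
hidden_size, vocab_size, seq_len = 768, 30522, 62

# Random stand-ins for the encoder outputs (illustration only).
sequence_output = torch.randn(1, seq_len, hidden_size)  # one vector per token
pooled_output = torch.randn(1, hidden_size)             # one vector per sequence

# Simplified heads: a single linear projection each.
mlm_head = torch.nn.Linear(hidden_size, vocab_size)  # token vector -> vocab logits
nsp_head = torch.nn.Linear(hidden_size, 2)           # sequence vector -> 2 logits

with torch.no_grad():
    prediction_logits = mlm_head(sequence_output)
    seq_relationship_logits = nsp_head(pooled_output)

print(prediction_logits.shape)        # torch.Size([1, 62, 30522])
print(seq_relationship_logits.shape)  # torch.Size([1, 2])
```

The MLM head works per token, which is why its output keeps the sequence dimension, while the NSP head works on a single vector for the whole input.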

## MLM 

In [9]:
vocab = tokenizer.get_vocab()
vocab["artificial"]


7976

In [10]:
# invert the vocab (token -> id) into id -> token pairs
idx_vocab = {value:key for key, value in vocab.items()}
idx_vocab[7976]

'artificial'

In [11]:
# check that the MLM output at one token position has the same size as the vocab
outputs['prediction_logits'][0][2].shape, len(idx_vocab)

(torch.Size([30522]), 30522)

In [12]:
# STEP 1 - SOFTMAX to convert the logits into probability scores
softmax = torch.nn.functional.softmax(input = outputs['prediction_logits'][0][2], dim=-1)

# STEP 2 - ARGMAX to get the token id with maximum probability
argmax = torch.argmax(softmax)

In [13]:
argmax

tensor(8181)

In [14]:
idx_vocab[argmax.item()]

'abraham'

For the given input text with [MASK] tokens:

"After Abraham Lincoln won the November 1860 presidential [MASK] on an "
"anti-slavery platform, an initial seven slave states declared their "
"secession from the country to form the Confederacy. War broke out in "
"April 1861 when secessionist forces [MASK] Fort Sumter in South "
"Carolina, just over a month after Lincoln's inauguration."

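The token at index 2 of the sequence (index 0 is [CLS] and index 1 is 'after') was predicted as 'abraham', which matches the input token at that position.

For the positions that were not masked, the model generally returns the id of the input token itself or of a close match, not always an exact one.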
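Since the interesting predictions are the ones at the [MASK] positions, it can help to look only there. The sketch below uses toy stand-ins (a 6-token input and a hypothetical 10-id vocabulary with random logits); the only value taken from the real model is 103, the [MASK] id in `bert-base-uncased`.

```python
import torch

MASK_ID = 103  # [MASK] id in bert-base-uncased (tokenizer.mask_token_id)

# Toy stand-ins (hypothetical values): 6 tokens, vocabulary of 10 ids.
input_ids = torch.tensor([[101, 7, 103, 8, 103, 102]])
torch.manual_seed(0)
logits = torch.randn(1, 6, 10)

# Sequence positions where the input token is [MASK].
mask_positions = (input_ids[0] == MASK_ID).nonzero(as_tuple=True)[0]

# Probabilities over the vocabulary at the masked positions only,
# then the 3 most probable ids for each of them.
probs = torch.softmax(logits[0, mask_positions], dim=-1)
top_probs, top_ids = probs.topk(3, dim=-1)

for pos, ids in zip(mask_positions.tolist(), top_ids.tolist()):
    print(pos, ids)
```

With the real model, `input_ids` would be `inputs['input_ids']`, `logits` would be `outputs['prediction_logits']`, and the ids could be mapped back to tokens through `idx_vocab`.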

In [23]:

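# NOTE: dim=0 normalises each vocab column across the 62 token positions.
# Use dim=-1 (as in In [12]) for per-token probabilities over the vocabulary;
# the outputs shown below were produced with dim=0.
softmax = torch.nn.functional.softmax(input = outputs['prediction_logits'][0], dim=0)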
softmax


tensor([[8.6535e-04, 9.2138e-04, 9.4259e-04,  ..., 1.0541e-03, 2.1833e-03,
         5.1356e-03],
        [6.2549e-06, 7.3302e-06, 6.1164e-06,  ..., 6.1201e-06, 1.9717e-05,
         5.6888e-05],
        [3.4552e-03, 3.0115e-03, 5.1767e-03,  ..., 1.9010e-03, 3.4777e-03,
         3.3927e-03],
        ...,
        [1.8583e-01, 2.0567e-01, 2.1530e-01,  ..., 1.5572e-01, 5.1572e-01,
         2.0563e-04],
        [1.1240e-06, 1.0609e-06, 1.2606e-06,  ..., 9.1067e-06, 1.4953e-05,
         1.7173e-05],
        [1.7728e-05, 1.0281e-05, 1.7399e-05,  ..., 1.0896e-05, 1.9261e-04,
         5.3675e-05]], grad_fn=<SoftmaxBackward0>)

In [24]:
argmax = torch.argmax(softmax, dim= 1)
argmax

tensor([28191,  2348,  8181, 16628,  2180,  3882,  2281,  7313,  4883, 27419,
         2006,  2010,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
         8914,  2163, 13520,  2037,  4336,  2013,  1996,  2406,  2000,  2433,
        28775, 18179, 16363,  2162,  3631,  2041,  1999,  2258,  6863,  2043,
        18232,  2923,  2749,  4548,  3481,  7680,  5017,  2005,  2148,  3792,
        24901,  2074,  2058,  1037,  3204,  2077,  3946,  1005,  1055, 17331,
         1025, 25656])

In [27]:
import pprint

In [28]:
predicted_text = ""
for tk in argmax:
    predicted_text += idx_vocab[tk.item()] + " "

pprint.pprint(predicted_text)

('##ecin although abraham lincolnshire won 1948 november 1860 presidential '
 'primaries on his anti - slavery platform , an initial seven tributary states '
 'declare their independence from the country to form ##ici confederacy ##yre '
 'war broke out in april 1861 when ##oya ##ist forces occupied fort sum ##mer '
 "for south carolina ##trip just over a month before grant ' s inauguration ; "
 '##tson ')
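The `##` prefix in the output marks a WordPiece continuation, i.e. a piece that belongs to the previous token. A minimal sketch of merging such pieces back into words is shown below; `join_wordpieces` is a hypothetical helper written for illustration, not part of `transformers`.

```python
def join_wordpieces(tokens):
    """Merge WordPiece continuation tokens ('##xyz') into the preceding word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += token[2:]
        else:
            # Start of a new word (or a stray '##' piece with nothing before it).
            words.append(token)
    return " ".join(words)

print(join_wordpieces(["fort", "sum", "##mer", "for", "south", "carolina"]))
# fort summer for south carolina
```

The tokenizer's own `convert_tokens_to_string` method does a similar merge and is probably the better choice in practice.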


In [33]:
pprint.pprint(text)

('After Abraham Lincoln won the November 1860 presidential [MASK] on an '
 'anti-slavery platform, an initial seven slave states declared their '
 'secession from the country to form the Confederacy. War broke out in April '
 '1861 when secessionist forces [MASK] Fort Sumter in South Carolina, just '
 "over a month after Lincoln's inauguration.")


As seen above, the predicted text stays close to the original text, though it is not identical.
For the masked token predictions:
- 1st mask is predicted as *primaries* (truth: election)
- 2nd mask is predicted as *occupied* (truth: attacked)
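One detail worth checking in the full-sequence step is the softmax axis: In [23] uses `dim=0`, which normalises across token positions, whereas In [12] normalises across the vocabulary. The toy example below (hypothetical logits, 3 positions and 4 vocabulary ids) sketches how the two choices can lead to different argmax results.

```python
import torch

# Toy logits (hypothetical values): 3 token positions x 4 vocabulary ids.
logits = torch.tensor([[2.0, 1.0, 0.0, 0.0],
                       [5.0, 4.9, 0.0, 0.0],
                       [0.0, 3.0, 1.0, 0.0]])

over_vocab = torch.softmax(logits, dim=-1)  # each row (token position) sums to 1
over_tokens = torch.softmax(logits, dim=0)  # each column (vocab id) sums to 1

print(over_vocab.sum(dim=-1))   # all ones
print(over_tokens.sum(dim=0))   # all ones

# Most probable vocab id per position under each normalisation.
print(over_vocab.argmax(dim=-1))   # tensor([0, 0, 1])
print(over_tokens.argmax(dim=-1))  # tensor([3, 0, 2])
```

Normalising over the vocabulary never changes the per-position argmax of the raw logits, so `torch.argmax(outputs['prediction_logits'][0], dim=-1)` would give the per-token predictions directly.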

## NSP

For NSP to work, the input must be a pair of sentences, so that the model can predict whether the second one follows the first.

In [35]:
text = ("After Abraham Lincoln won the November 1860 presidential [MASK] on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy.")
text2 = ("War broke out in April 1861 when secessionist forces [MASK] Fort "
         "Sumter in South Carolina, just over a month after Lincoln's "
         "inauguration.")

In [36]:
inputs = tokenizer(text, text2, return_tensors='pt')

In [37]:
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,   103,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,   102,  2162,  3631,  2041,  1999,  2258,  6863,
          2043, 22965,  2923,  2749,   103,  3481,  7680,  3334,  1999,  2148,
          3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055,
         17331,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

- Within the tokenizer output dictionary, 'input_ids' contains the token 102 ([SEP]), which separates the first sentence from the second and also closes the sequence.
- 'token_type_ids' is 0 for the tokens of the first sentence and 1 for those of the second. This is the input that tells BERT which sentence each token belongs to, which is vital for NSP.
- We see no difference in 'attention_mask': it is all 1s because the input has no padding.
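The layout described in the points above can be sketched by hand. `build_pair_inputs` is a hypothetical helper for illustration only (the tokenizer does all of this itself); the ids 101 and 102 are the [CLS] and [SEP] ids visible in the output above, and the example token ids are the first three of each sentence.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids seen in the tokenizer output

def build_pair_inputs(ids_a, ids_b):
    """Hypothetical helper: lay out [CLS] A [SEP] B [SEP] with segment ids."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding, so every position is attended to.
    attention_mask = [1] * len(input_ids)
    return input_ids, token_type_ids, attention_mask

ids, segments, mask = build_pair_inputs([2044, 8181, 5367], [2162, 3631, 2041])
print(ids)       # [101, 2044, 8181, 5367, 102, 2162, 3631, 2041, 102]
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 1, 1]
```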

In [38]:
outputs = model(**inputs)

In [40]:
outputs.seq_relationship_logits

tensor([[ 6.0844, -5.6813]], grad_fn=<AddmmBackward0>)

In [41]:
argmax = torch.argmax(input= outputs.seq_relationship_logits)
argmax

tensor(0)

Index 0 - is next sentence

Index 1 - is not next sentence
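The argmax only gives the winning class. To see how confident the model is, the same logits can be passed through a softmax; the sketch below reuses the two logit values printed above, and the label strings are just the index meanings listed here.

```python
import torch

# Logits copied from the seq_relationship_logits output above.
logits = torch.tensor([[6.0844, -5.6813]])

# Normalise over the two classes to get probabilities.
probs = torch.softmax(logits, dim=-1)

labels = ["is next sentence", "is not next sentence"]  # index 0, index 1
predicted = labels[probs.argmax(dim=-1).item()]

print(probs)
print(predicted)  # is next sentence
```

With a gap of almost 12 between the two logits, nearly all of the probability mass lands on index 0.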