# NSP Training Logic

Next sentence prediction (NSP) is the other half of pretraining for BERT, alongside MLM. It takes two sentences, A and B, and trains a binary classifier to predict whether sentence B comes after sentence A.

So, where MLM encourages BERT to build up a contextual understanding between words, NSP encourages BERT to learn longer-range contextual relationships between sentences.

Let's take a look at how this works in code. First, we import and initialize everything we need.

In [18]:
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch

In [19]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy. War broke out in "
        "April 1861 when secessionist forces attacked Fort Sumter in South "
        "Carolina, just over a month after Lincoln's inauguration.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Now, if we tokenize this text and process it through our model as is, we'll get the *logits* tensor as output:

In [21]:
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
outputs.keys()

odict_keys(['logits'])

The *logits* tensor is our NSP output prediction, which looks like:

In [22]:
outputs.logits

tensor([[ 4.4646, -3.6635]], grad_fn=<AddmmBackward>)

Then we apply softmax to convert these logits into a probability distribution.

In [23]:
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
probs

tensor([[9.9970e-01, 2.9502e-04]], grad_fn=<SoftmaxBackward>)
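To see where these probabilities come from, here is a minimal sketch of the same softmax computed by hand with the Python standard library. It assumes only the two logit values printed above; the variable names are illustrative and not part of the `transformers` API.

```python
import math

# Logits copied from the model output above (index 0 and index 1).
logits = [4.4646, -3.6635]

# Subtract the maximum before exponentiating for numerical stability;
# this shifts both values equally and does not change the result.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
total = sum(exps)

# Normalise so the two values sum to 1.
probs = [e / total for e in exps]
print(probs)
```

The two values should agree with the tensor above up to rounding of the printed logits.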

And finally, take the argmax to get our prediction:

In [24]:
torch.argmax(probs)

tensor(0)

We get **0**, which is `IsNextSentence`. However, we haven't actually specified two sentences (the whole text was tokenized as a single segment), so this prediction is meaningless.

There are two parts we're missing. We need to specify two sentences in our *input_ids* tensor - and we need to create the *labels* tensor too.

Let's start by splitting the two sentences.

In [35]:
text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy.")
text2 = ("War broke out in April 1861 when secessionist forces attacked Fort "
         "Sumter in South Carolina, just over a month after Lincoln's "
         "inauguration.")

Then we tokenize.

In [36]:
inputs = tokenizer(text, text2, return_tensors='pt')

In [37]:
inputs

{'input_ids': tensor([[  101,  2044,  8181,  5367,  2180,  1996,  2281,  7313,  4883,  2602,
          2006,  2019,  3424,  1011,  8864,  4132,  1010,  2019,  3988,  2698,
          6658,  2163,  4161,  2037, 22965,  2013,  1996,  2406,  2000,  2433,
          1996, 18179,  1012,   102,  2162,  3631,  2041,  1999,  2258,  6863,
          2043, 22965,  2923,  2749,  4457,  3481,  7680,  3334,  1999,  2148,
          3792,  1010,  2074,  2058,  1037,  3204,  2044,  5367,  1005,  1055,
         17331,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Because we passed the two sentences to the tokenizer as separate arguments, it deals with them a little differently. First, we can see a *SEP* token (token id **102**) in the middle of the *input_ids* tensor - this marks the boundary between sentence A and sentence B.

Second, we can distinguish between the sentences using *token_type_ids* - sentence A tokens (together with the leading *CLS* token and the first *SEP* token) are assigned **0**, whereas sentence B tokens (together with the final *SEP* token) are assigned **1**.
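The relationship between the two tensors can be checked by hand. The sketch below copies the *input_ids* printed above and rebuilds the segment ids from the position of the first *SEP* token. It illustrates the layout seen in this output and is not the tokenizer's actual implementation; the names `SEP_ID` and `first_sep` are made up for the example.

```python
# input_ids copied from the tokenizer output shown above.
input_ids = [
    101, 2044, 8181, 5367, 2180, 1996, 2281, 7313, 4883, 2602,
    2006, 2019, 3424, 1011, 8864, 4132, 1010, 2019, 3988, 2698,
    6658, 2163, 4161, 2037, 22965, 2013, 1996, 2406, 2000, 2433,
    1996, 18179, 1012, 102, 2162, 3631, 2041, 1999, 2258, 6863,
    2043, 22965, 2923, 2749, 4457, 3481, 7680, 3334, 1999, 2148,
    3792, 1010, 2074, 2058, 1037, 3204, 2044, 5367, 1005, 1055,
    17331, 1012, 102,
]

SEP_ID = 102  # id of the SEP token in this output

# Position of the SEP token that closes sentence A.
first_sep = input_ids.index(SEP_ID)

# Everything up to and including the first SEP belongs to segment 0;
# everything after it, including the final SEP, belongs to segment 1.
token_type_ids = [0 if i <= first_sep else 1 for i in range(len(input_ids))]

print(first_sep)                                         # 33
print(token_type_ids.count(0), token_type_ids.count(1))  # 34 29
```

The counts match the 34 zeros and 29 ones in the *token_type_ids* tensor above.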

Now, we still need to add our *labels* tensor - but how should it be formatted? It must be a `torch.LongTensor` containing a single value for our sentence pair: `[0]` if sentence B **is** the next sentence, and `[1]` otherwise. Here, sentence B is the next sentence, so we set it to `[0]`.
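When building a training set, each sentence pair needs one of these labels. The following is a rough sketch of how such pairs might be assembled from an ordered list of sentences, using the 0/1 convention described above. The function `make_nsp_pairs` is hypothetical, and for simplicity it draws the "not next" sentence from the same list; a real pretraining pipeline would typically sample from elsewhere in the corpus.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, label) triples (hypothetical helper).

    label 0: sentence_b really follows sentence_a
    label 1: sentence_b was picked from elsewhere in the list
    Needs at least three sentences so a non-adjacent one always exists.
    """
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: keep the true next sentence.
            pairs.append((sentences[i], sentences[i + 1], 0))
        else:
            # Negative example: any sentence except A and its true successor.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], 1))
    return pairs

toy_sentences = ["s0", "s1", "s2", "s3", "s4"]
for a, b, label in make_nsp_pairs(toy_sentences, random.Random(0)):
    print(a, b, label)
```

Each label would then be wrapped in a `torch.LongTensor`, as in the cell below.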

In [75]:
labels = torch.LongTensor([0])
labels

tensor([0])

Now we process everything as we did before, including our *labels* tensor.

In [76]:
outputs = model(**inputs, labels=labels)
outputs.keys()

odict_keys(['loss', 'logits'])

In [77]:
outputs.loss

tensor(0.0002, grad_fn=<NllLossBackward>)

The model now also returns the *loss* tensor, which is the quantity we minimize when training our model with NSP.
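The `NllLossBackward` in the output suggests a standard cross-entropy between the two logits and the label. As a sketch under that assumption, the per-example loss can be written in plain Python. The logits below are the ones printed earlier for the single-string input, reused purely for illustration; they are not the logits of the sentence-pair call, so the result will not exactly match the loss shown above. The function name `nsp_cross_entropy` is made up for the example.

```python
import math

def nsp_cross_entropy(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)
    # log-sum-exp, computed stably by factoring out the maximum logit.
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Illustrative logits only (taken from the earlier single-string example).
illustrative_logits = [4.4646, -3.6635]

loss_is_next = nsp_cross_entropy(illustrative_logits, 0)   # label agrees with the prediction
loss_not_next = nsp_cross_entropy(illustrative_logits, 1)  # label disagrees with the prediction

print(loss_is_next, loss_not_next)
```

The loss is close to zero when the label agrees with a confident prediction and large when it disagrees, which is the signal training uses to adjust the weights.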