***Next Sentence Prediction Task***

Masked language modelling teaches BERT to understand language at the token level. The next sentence prediction (NSP) task was formulated to teach it whether, given two sentences, the second comes directly after the first.

Consider the example below, where the first sentence is sentence A and the second is sentence B:

- "Istanbul is a great city to visit"
- "I was just there"

Did sentence B come directly after sentence A?
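
To make the question concrete, here is a minimal sketch of how labelled sentence pairs for this task could be built from an ordered list of sentences. The helper `make_nsp_pairs` and the extra example sentences are hypothetical. Negatives are drawn from the same list for brevity, whereas a real pre-training pipeline would normally sample them from a different document. The label convention (0 = B follows A, 1 = B does not follow A) matches the one used later in this notebook.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) triples from an ordered list.

    label 0: sentence_b is the true next sentence.
    label 1: sentence_b is a randomly chosen other sentence.
    Assumes at least three sentences so a negative can always be drawn.
    """
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: keep the real continuation
            pairs.append((sentences[i], sentences[i + 1], 0))
        else:
            # Negative example: any sentence except A itself and its true successor
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), 1))
    return pairs

# Hypothetical mini-document; only the first two sentences come from the example above
doc = [
    "Istanbul is a great city to visit",
    "I was just there",
    "The food was excellent",
    "Transformers rely on self-attention",
]
pairs = make_nsp_pairs(doc)
for sentence_a, sentence_b, label in pairs:
    print(label, "|", sentence_a, "->", sentence_b)
```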

**The pooler**

The pooler sits on top of the contextualized representation of the [CLS] token and maps it to a single vector that downstream tasks can use.

This is how the [CLS] token is trained to represent the whole context of the sequence.

The pooled vector is then passed through a feed-forward layer and a softmax function. There are only two possible outcomes: yes, sentence B comes after sentence A, or no, it does not.
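
The flow described above can be sketched in a few lines of PyTorch. This is an illustrative stand-in, not the library's own code: the class name `NSPHeadSketch` is hypothetical, the weights are randomly initialised, and the encoder output is replaced by random numbers. It assumes the pooler is a dense layer followed by tanh applied to the first ([CLS]) position, followed by a linear layer with two outputs.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size = 768  # hidden width of bert-base-uncased

class NSPHeadSketch(nn.Module):
    """Hypothetical pooler + two-class classifier on top of the encoder output."""

    def __init__(self, hidden_size):
        super().__init__()
        self.pooler_dense = nn.Linear(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, sequence_output):
        # Take the hidden state at position 0, which belongs to [CLS]
        cls_state = sequence_output[:, 0]
        # Pooler: dense layer followed by tanh
        pooled = torch.tanh(self.pooler_dense(cls_state))
        # Two scores: index 0 = "B follows A", index 1 = "B does not follow A"
        return self.classifier(pooled)

# Random stand-in for the encoder output: (batch, sequence length, hidden size)
sequence_output = torch.randn(1, 39, hidden_size)

head = NSPHeadSketch(hidden_size)
logits = head(sequence_output)
probs = torch.softmax(logits, dim=-1)  # turn the two scores into probabilities

print(logits.shape)
print(probs)
```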

In [2]:
# imports

import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

In [6]:
# Loading model and tokenizer

bert_nsp = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [8]:
# Looking at the model

bert_nsp

BertForNextSentencePrediction(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [10]:
# Loading text

text0 = 'Deliver huge improvements in your machine learning pipelines without spending hours fine-tuning parameters!'
text1 = "This book's practical case-studies reveal feature engineering techniques that upgrade your data wrangling"

In [11]:
# Tokenization of the two sentences

inputs = tokenizer(text0, text1, return_tensors='pt')

In [12]:
inputs

{'input_ids': tensor([[  101,  8116,  4121,  8377,  1999,  2115,  3698,  4083, 13117,  2015,
          2302,  5938,  2847,  2986,  1011, 17372, 11709,   999,   102,  2023,
          2338,  1005,  1055,  6742,  2553,  1011,  2913,  7487,  3444,  3330,
          5461,  2008, 12200,  2115,  2951, 23277,  5654,  2989,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

The output contains the token ids for both sentences (`input_ids`), the segment ids that tell the two sentences apart (`token_type_ids`), and the attention mask (`attention_mask`).
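
As a quick check of how the segment ids line up with the token ids, the sketch below rebuilds `token_type_ids` by hand from the `input_ids` printed above. It assumes the layout `[CLS] A [SEP] B [SEP]`, with id 101 for `[CLS]` and id 102 for `[SEP]`, which is what the printed tensors suggest. The variable names are illustrative.

```python
# input_ids copied from the tokenizer output above
input_ids = [
    101, 8116, 4121, 8377, 1999, 2115, 3698, 4083, 13117, 2015,
    2302, 5938, 2847, 2986, 1011, 17372, 11709, 999, 102, 2023,
    2338, 1005, 1055, 6742, 2553, 1011, 2913, 7487, 3444, 3330,
    5461, 2008, 12200, 2115, 2951, 23277, 5654, 2989, 102,
]

SEP_ID = 102  # id of the [SEP] token in this vocabulary

# Everything up to and including the first [SEP] belongs to sentence A (segment 0);
# everything after it belongs to sentence B (segment 1)
first_sep = input_ids.index(SEP_ID)
token_type_ids = [0 if i <= first_sep else 1 for i in range(len(input_ids))]

# No padding was added, so every position is attended to
attention_mask = [1] * len(input_ids)

print(first_sep)
print(token_type_ids)
```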

In [13]:
inputs.input_ids

tensor([[  101,  8116,  4121,  8377,  1999,  2115,  3698,  4083, 13117,  2015,
          2302,  5938,  2847,  2986,  1011, 17372, 11709,   999,   102,  2023,
          2338,  1005,  1055,  6742,  2553,  1011,  2913,  7487,  3444,  3330,
          5461,  2008, 12200,  2115,  2951, 23277,  5654,  2989,   102]])

In [14]:
inputs.token_type_ids

tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [15]:
inputs.attention_mask

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [16]:
# Passing the tokens to our nsp model

outputs = bert_nsp(**inputs)

outputs

NextSentencePredictorOutput(loss=None, logits=tensor([[ 6.1182, -5.7342]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

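The logits are the raw scores for the two classes: index 0 means sentence B follows sentence A, and index 1 means it does not. Here the score for class 0 (6.1182) is far higher than the score for class 1 (-5.7342), so the model predicts that the second sentence follows the first.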
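
To see how strong that preference is, the logits can be converted into probabilities with a softmax. The sketch below does this in plain Python using the two values printed above. The helper `softmax` is written out here for illustration and is not part of the notebook.

```python
import math

# Logits copied from the model output above: [class 0, class 1]
logits = [6.1182, -5.7342]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
predicted_class = probs.index(max(probs))  # 0 = B follows A, 1 = it does not

print(probs)
print(predicted_class)
```

Almost all of the probability mass lands on class 0, which is why the prediction is so decisive.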

In [17]:
# Calculating the loss by passing label 0 (sentence B follows sentence A)

outputs = bert_nsp(**inputs, labels=torch.LongTensor([0]))

outputs

NextSentencePredictorOutput(loss=tensor(7.1525e-06, grad_fn=<NllLossBackward0>), logits=tensor([[ 6.1182, -5.7342]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The loss is very small, which means the model is very confident that the second sentence does come directly after the first.

In [19]:
# Calculating the loss by passing label 1 (sentence B does not follow sentence A)

outputs = bert_nsp(**inputs, labels=torch.LongTensor([1]))

outputs

NextSentencePredictorOutput(loss=tensor(11.8524, grad_fn=<NllLossBackward0>), logits=tensor([[ 6.1182, -5.7342]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

As expected, the loss is high, because label 1 contradicts the model's confident prediction.
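
Both loss values can be recovered from the logits alone. The sketch below assumes the reported loss is the standard cross-entropy, that is, the negative log of the softmax probability assigned to the given label. The helper `cross_entropy` is hypothetical and written in plain Python.

```python
import math

# Logits copied from the model output above: [class 0, class 1]
logits = [6.1182, -5.7342]

def cross_entropy(scores, label):
    """Negative log-softmax of the score at index `label`."""
    largest = max(scores)
    log_sum_exp = largest + math.log(sum(math.exp(s - largest) for s in scores))
    return log_sum_exp - scores[label]

loss_label_0 = cross_entropy(logits, 0)  # label agrees with the prediction
loss_label_1 = cross_entropy(logits, 1)  # label contradicts the prediction

print(loss_label_0)
print(loss_label_1)
```

The loss for label 1 comes out very close to 11.8524, which is the gap between the two logits. The loss for label 0 is on the order of 7e-06. Small differences from the values printed by the model are expected from floating-point rounding.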

**Conclusion**

This task allows BERT to learn how sentences relate to one another across a large corpus.