In [1]:
!pip install transformers



In [2]:
from transformers import BertTokenizer
from torch.nn import functional as F
import torch

# Next Sentence Prediction
The Next Sentence Prediction (NSP) task is formulated as a binary classification task: the model is trained to distinguish the sentence that actually follows from a sentence chosen at random from the corpus. It proved helpful in multiple NLP tasks, especially inference tasks.

Here is an example of doing next sentence prediction using a model and a tokenizer. The process is the following:

1. Instantiate the tokenizer and the model from the checkpoint name. The checkpoint is identified as a BERT model, and the model is loaded with the weights stored in it.
2. Define two sentences, where the second may or may not follow the first.
3. Encode the two sentences into a list of IDs, with the correct model-specific separators, token type IDs, and attention mask.
4. Pass this sequence through the model. This outputs one score (logit) for each of the two classes, 'isNextSent' and 'isNotNextSent'. The model gives the higher score to the more probable class.
5. Retrieve the most probable class.

In [3]:
from transformers import BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

In [4]:
prompt = "The child came home from school."
next_sentence = "He played soccer after school."

# prompt = "The child came home from school."
# next_sentence = "The candidate won the election."

In [5]:
encoding = tokenizer(prompt, next_sentence, return_tensors='pt')
encoding

{'input_ids': tensor([[ 101, 1996, 2775, 2234, 2188, 2013, 2082, 1012,  102, 2002, 2209, 4715,
         2044, 2082, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [6]:
tokenizer.convert_ids_to_tokens(encoding['input_ids'].squeeze(0))

['[CLS]',
 'the',
 'child',
 'came',
 'home',
 'from',
 'school',
 '.',
 '[SEP]',
 'he',
 'played',
 'soccer',
 'after',
 'school',
 '.',
 '[SEP]']
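
The layout of the encoding above can be reproduced by hand. The following is a minimal sketch, assuming the sentences are already split into WordPiece tokens; `encode_pair` is a hypothetical helper written for illustration, not a function of the `transformers` library, and it works on token strings rather than vocabulary IDs.

```python
def encode_pair(tokens_a, tokens_b):
    """Lay out two tokenized sentences the way BERT expects a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding is used here, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


# Tokens copied from the tokenizer output shown above.
tokens_a = ["the", "child", "came", "home", "from", "school", "."]
tokens_b = ["he", "played", "soccer", "after", "school", "."]

tokens, token_type_ids, attention_mask = encode_pair(tokens_a, tokens_b)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

The nine zeros followed by seven ones match the `token_type_ids` tensor printed by the tokenizer for this pair.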

In [7]:
outputs = model(**encoding)
outputs

NextSentencePredictorOutput(loss=None, logits=tensor([[ 3.2739, -2.0896]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [8]:
softmax = F.softmax(outputs.logits, dim = 1)
class_id = torch.argmax(softmax)
print(class_id)

tensor(0)
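
To read this result: in `BertForNextSentencePrediction`, class 0 means the second sentence is a continuation of the first, and class 1 means it is a random sentence. The sketch below redoes the softmax and argmax in plain Python on the logits printed above, so it runs without downloading a model. The `labels` strings are illustrative names chosen here, not identifiers from the library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output shown above.
logits = [3.2739, -2.0896]

# Index 0: sentence B follows sentence A; index 1: sentence B is random.
labels = ["is next sentence", "is not next sentence"]

probs = softmax(logits)
class_id = max(range(len(probs)), key=probs.__getitem__)
print(labels[class_id], round(probs[class_id], 4))
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits would select the same class; the softmax is only needed when a probability is wanted.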


# Masked Language Modeling
Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens to the right of the mask) and the left context (tokens to the left of the mask). Such training creates a strong basis for downstream tasks requiring bidirectional context.
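
As a rough illustration of how training examples for this objective can be built, here is a minimal sketch. It assumes a masking rate of 15% as the default; `mask_tokens` is a hypothetical helper, and it simplifies BERT's actual pre-training recipe, which is more elaborate (for instance, some selected tokens are left unchanged or swapped for a random token, and special tokens are not masked).

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace each token by the mask token with probability `mask_prob`.

    Returns the masked sequence and a list of labels: the original token at
    masked positions, None elsewhere (positions that would not be scored).
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)  # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


sentence = ["the", "capital", "of", "france", "is", "paris", "."]
masked, labels = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(labels)
```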

Here is an example of doing masked language modeling using a model and a tokenizer. The process is the following:

1. Instantiate the tokenizer and the model from the checkpoint name. The checkpoint is identified as a BERT model, and the model is loaded with the weights stored in it.
2. Define a sequence with a masked token, placing the tokenizer.mask_token where a word would be.
3. Encode that sequence into a list of IDs and find the position of the masked token in that list.
4. Pass the sequence through the model and retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and its values are the scores attributed to each token. The model gives higher scores to tokens it deems probable in that context.
5. Retrieve the top 10 tokens using the PyTorch topk function.
6. Replace the mask token with each of these tokens and print the results.
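
The selection logic in steps 3 to 6 can be sketched without a model. The vocabulary and scores below are toy values invented for illustration, and `fill_mask` is a hypothetical helper; in the real example the scores come from the model and cover the whole vocabulary.

```python
import heapq

MASK = "[MASK]"


def fill_mask(tokens, scores_at_mask, vocab, k=3):
    """Return the k most likely completions of the single mask in `tokens`."""
    mask_index = tokens.index(MASK)  # step 3: locate the mask
    # Steps 4-5: rank vocabulary ids by their score at the mask position.
    top_ids = heapq.nlargest(k, range(len(vocab)), key=scores_at_mask.__getitem__)
    # Step 6: substitute each candidate for the mask.
    return [tokens[:mask_index] + [vocab[i]] + tokens[mask_index + 1:] for i in top_ids]


# Toy vocabulary and scores, invented for illustration only.
vocab = ["paris", "lyon", "banana", "nice"]
scores_at_mask = [9.1, 5.4, -3.0, 4.2]
tokens = ["the", "capital", "of", "france", ",", MASK, "."]

for filled in fill_mask(tokens, scores_at_mask, vocab):
    print(" ".join(filled))
```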

In [9]:
from transformers import BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [10]:
text = "The capital of France, " + tokenizer.mask_token + ", contains the Eiffel Tower."
text

'The capital of France, [MASK], contains the Eiffel Tower.'

In [11]:
input = tokenizer(text, return_tensors="pt")
input

{'input_ids': tensor([[  101,  1996,  3007,  1997,  2605,  1010,   103,  1010,  3397,  1996,
          1041, 13355,  2884,  3578,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [12]:
tokenizer.convert_ids_to_tokens(input['input_ids'].squeeze(0))

['[CLS]',
 'the',
 'capital',
 'of',
 'france',
 ',',
 '[MASK]',
 ',',
 'contains',
 'the',
 'e',
 '##iff',
 '##el',
 'tower',
 '.',
 '[SEP]']

In [13]:
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
mask_index

(tensor([6]),)

In [14]:
output = model(**input)
output

MaskedLMOutput(loss=None, logits=tensor([[[ -6.6462,  -6.6775,  -6.6606,  ...,  -5.9660,  -5.7844,  -4.1951],
         [-14.7222, -15.2151, -15.0513,  ..., -13.5289, -11.3960, -14.5610],
         [-10.1223, -10.7297, -10.1163,  ...,  -9.2822,  -7.6954, -15.4930],
         ...,
         [-10.7090, -11.2617, -10.9946,  ...,  -8.4995,  -9.6521, -14.2806],
         [-12.2987, -12.0131, -12.5270,  ..., -10.8341, -11.2091,  -5.0134],
         [-12.7292, -13.4996, -13.1655,  ..., -13.2183, -10.6310, -12.8908]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [15]:
logits = output.logits
logits.shape

torch.Size([1, 16, 30522])

In [16]:
softmax = F.softmax(logits, dim=-1)
softmax.sum(axis=-1)

tensor([[1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]],
       grad_fn=<SumBackward1>)

In [17]:
mask_word = softmax[0, mask_index, :]
mask_word

tensor([[4.2686e-10, 3.5966e-10, 6.0383e-10,  ..., 1.0400e-09, 8.8831e-10,
         1.1838e-10]], grad_fn=<IndexBackward0>)

In [18]:
top_word = torch.argmax(mask_word, dim=1)
print(tokenizer.decode(top_word))

paris


In [19]:
# torch.topk returns (values, indices); keep the token ids of the 10 best candidates
top_10 = torch.topk(mask_word, 10, dim=1).indices[0]
for token in top_10:
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)

The capital of France, paris, contains the Eiffel Tower.
The capital of France, lyon, contains the Eiffel Tower.
The capital of France, lille, contains the Eiffel Tower.
The capital of France, toulouse, contains the Eiffel Tower.
The capital of France, marseille, contains the Eiffel Tower.
The capital of France, orleans, contains the Eiffel Tower.
The capital of France, strasbourg, contains the Eiffel Tower.
The capital of France, nice, contains the Eiffel Tower.
The capital of France, cannes, contains the Eiffel Tower.
The capital of France, versailles, contains the Eiffel Tower.


# Extractive Question Answering
Extractive Question Answering is the task of extracting an answer from a text given a question.

Here is an example of question answering using a BERT model fine-tuned on that task. The process is the following:

1. Instantiate the tokenizer and the model from the checkpoint name. The checkpoint is identified as a BERT model, and the model is loaded with the weights stored in it.
2. Define a text and a few questions.
3. Select a question and build a sequence from the text and the current question, with the correct model-specific separators, token type IDs, and attention mask.
4. Pass this sequence through the model. This outputs a score for every token in the sequence (question and text), for both the start and end positions.
5. Compute the softmax of the result to get probabilities over the tokens. (The code below takes the argmax of the raw scores directly, which selects the same positions.)
6. Fetch the tokens between the identified start and end positions and convert them to a string.
7. Print the results.

In [20]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [21]:
text = r"""
  🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
  architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
  Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
  TensorFlow 2.0 and PyTorch.
  """

questions = [
  "How many pretrained models are available in 🤗 Transformers?",
  "What does 🤗 Transformers provide?",
  "🤗 Transformers provides interoperability between which frameworks?",
]

question = questions[0]

In [22]:
inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  2129,  2116,  3653, 23654,  2098,  4275,  2024,  2800,  1999,
           100, 19081,  1029,   102,   100, 19081,  1006,  3839,  2124,  2004,
          1052, 22123,  2953,  2818,  1011, 19081,  1998,  1052, 22123,  2953,
          2818,  1011,  3653, 23654,  2098,  1011, 14324,  1007,  3640,  2236,
          1011,  3800,  4294,  2015,  1006, 14324,  1010, 14246,  2102,  1011,
          1016,  1010, 23455,  1010, 28712,  2213,  1010,  4487, 16643, 23373,
          1010, 28712,  7159,  1529,  1007,  2005,  3019,  2653,  4824,  1006,
         17953,  2226,  1007,  1998,  3019,  2653,  4245,  1006, 17953,  2290,
          1007,  2007,  2058,  3590,  1009,  3653, 23654,  2098,  4275,  1999,
          2531,  1009,  4155,  1998,  2784,  6970, 25918,  8010,  2090, 23435,
         12314,  1016,  1012,  1014,  1998,  1052, 22123,  2953,  2818,  1012,
           102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [23]:
input_ids = inputs["input_ids"].tolist()[0]
tokenizer.convert_ids_to_tokens(input_ids)

['[CLS]',
 'how',
 'many',
 'pre',
 '##train',
 '##ed',
 'models',
 'are',
 'available',
 'in',
 '[UNK]',
 'transformers',
 '?',
 '[SEP]',
 '[UNK]',
 'transformers',
 '(',
 'formerly',
 'known',
 'as',
 'p',
 '##yt',
 '##or',
 '##ch',
 '-',
 'transformers',
 'and',
 'p',
 '##yt',
 '##or',
 '##ch',
 '-',
 'pre',
 '##train',
 '##ed',
 '-',
 'bert',
 ')',
 'provides',
 'general',
 '-',
 'purpose',
 'architecture',
 '##s',
 '(',
 'bert',
 ',',
 'gp',
 '##t',
 '-',
 '2',
 ',',
 'roberta',
 ',',
 'xl',
 '##m',
 ',',
 'di',
 '##sti',
 '##lbert',
 ',',
 'xl',
 '##net',
 '…',
 ')',
 'for',
 'natural',
 'language',
 'understanding',
 '(',
 'nl',
 '##u',
 ')',
 'and',
 'natural',
 'language',
 'generation',
 '(',
 'nl',
 '##g',
 ')',
 'with',
 'over',
 '32',
 '+',
 'pre',
 '##train',
 '##ed',
 'models',
 'in',
 '100',
 '+',
 'languages',
 'and',
 'deep',
 'inter',
 '##oper',
 '##ability',
 'between',
 'tensor',
 '##flow',
 '2',
 '.',
 '0',
 'and',
 'p',
 '##yt',
 '##or',
 '##ch',
 '.',
 '[SEP]']

In [24]:
outputs = model(**inputs)
outputs

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-6.5990, -6.1387, -7.8761, -7.8292, -8.7167, -8.9216, -8.6420, -8.2637,
         -8.1355, -7.8485, -7.9702, -8.3272, -9.3328, -6.5990, -3.5472, -3.5201,
         -7.6366, -6.9593, -8.0762, -8.4253, -6.1195, -7.9624, -8.3828, -8.0024,
         -8.3195, -5.4944, -8.2405, -6.5540, -8.1549, -8.4309, -8.2269, -8.3963,
         -6.7172, -8.1222, -8.3750, -8.2482, -5.6149, -6.8511, -6.3145, -6.1016,
         -7.6912, -7.6393, -5.7812, -7.5353, -7.2942, -5.4125, -8.2601, -6.2903,
         -8.0463, -8.2036, -7.1411, -8.3341, -6.6303, -8.4391, -6.5964, -7.9467,
         -8.5371, -7.0665, -8.0596, -7.9908, -8.5030, -7.0008, -7.7140, -6.2754,
         -6.6136, -7.0534, -5.5944, -6.5917, -6.9854, -8.0204, -6.4408, -7.9721,
         -7.5704, -8.2496, -5.2606, -6.5214, -6.1437, -7.8759, -5.9586, -7.1834,
         -5.0404, -2.2078,  5.1718,  4.9945, -3.1125, -4.7365, -6.2348, -6.1286,
         -3.5514, -5.0355, -1.5301, -6.8052, -5.3237, -7

In [25]:
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

In [26]:
# Get the most likely beginning of answer with the argmax of the score
answer_start = torch.argmax(answer_start_scores)
# Get the most likely end of answer with the argmax of the score
answer_end = torch.argmax(answer_end_scores) + 1
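
Taking the two argmaxes independently works for this example, but nothing prevents the end position from landing before the start position, which would give an empty slice. A common refinement, sketched here as an assumption and not as what this notebook does, is to search for the pair with the highest combined score subject to start <= end. `best_span` and `max_answer_len` are hypothetical names, and the scores are toy values; in practice the search would also be restricted to the context tokens.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end), end inclusive, maximizing start + end score
    under the constraints start <= end and span length <= max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy scores, invented for illustration: the independent argmaxes would
# give start=3 and end=1, which is not a valid span.
start_scores = [0.1, 0.2, 1.0, 4.0, 0.3]
end_scores = [0.0, 5.0, 0.5, 1.5, 3.0]

start, end = best_span(start_scores, end_scores)
print(start, end + 1)  # slice bounds, with the end exclusive
```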

In [27]:
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

In [28]:
print(f"Question: {question}")
print(f"Answer: {answer}")

Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
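
The space before the "+" in the answer comes from how WordPiece tokens are turned back into text: tokens are joined with spaces, and only the "##" continuation pieces are glued to the previous token. The sketch below approximates that behaviour in plain Python; `wordpiece_to_string` is a hypothetical stand-in for illustration, not the library's function.

```python
def wordpiece_to_string(tokens):
    """Join WordPiece tokens with spaces, attaching '##' pieces to the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()


print(wordpiece_to_string(["over", "32", "+"]))                   # over 32 +
print(wordpiece_to_string(["pre", "##train", "##ed", "models"]))  # pretrained models
```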
