<a href="https://colab.research.google.com/github/plaban1981/Hugging_Face_transformers_topics/blob/main/Performing_question_answering_with_fine_tuned_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install nlp

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlp
  Downloading nlp-0.4.0-py3-none-any.whl (1.7 MB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Installing collected packages: xxhash, nlp
Successfully installed nlp-0.4.0 xxhash-3.0.0


In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.8.1 tokenizers-0.12.1 transformers-4.21.1


## Import Libraries

In [3]:
from transformers import BertForQuestionAnswering,BertTokenizer
import numpy as np
import pandas as pd

* We use the bert-large-uncased-whole-word-masking-finetuned-squad model.

* This model is fine-tuned on the Stanford Question Answering Dataset (SQuAD).

In [10]:
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
from google.colab import output
output.disable_custom_widget_manager()

## Download the Model

In [7]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Downloading config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.25G [00:00<?, ?B/s]

## Download the Tokenizer

In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

## Preprocessing the input

In [41]:
question = "What is the immune system?"
paragraph = '''The immune system is a system of many biological structures
and processes within an organism that protects against disease. To function
properly, an immune system must detect a wide variety of agents, known as
pathogens, from viruses to parasitic worms, and distinguish them from the
organism's own healthy tissue.'''

* Add a [CLS] token to the beginning of the question and a [SEP] token to the end of both the question and the paragraph.

In [22]:
question = '[CLS] ' + question + '[SEP]'
paragraph = paragraph + '[SEP]'

## Tokenize the question and paragraph

In [90]:
question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)

In [24]:
print(question_tokens)

['[CLS]', 'what', 'is', 'the', 'immune', 'system', '?', '[SEP]']


## Combine the question and paragraph tokens and convert them to input_ids

In [25]:
tokens = question_tokens + paragraph_tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)

In [16]:
print(input_ids)

[101, 2054, 2003, 1996, 11311, 2291, 1029, 102, 1996, 11311, 2291, 2003, 1037, 2291, 1997, 2116, 6897, 5090, 1998, 6194, 2306, 2019, 15923, 2008, 18227, 2114, 4295, 1012, 2000, 3853, 7919, 1010, 2019, 11311, 2291, 2442, 11487, 1037, 2898, 3528, 1997, 6074, 1010, 2124, 2004, 26835, 2015, 1010, 2013, 18191, 2000, 26045, 16253, 1010, 1998, 10782, 2068, 2013, 1996, 15923, 1005, 1055, 2219, 7965, 8153, 1012, 102]
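Conceptually, converting tokens to ids is a vocabulary lookup. The sketch below illustrates that idea with a toy vocabulary that holds only the eight question tokens, with ids copied from the output above. The names `toy_vocab` and `tokens_to_ids` are hypothetical, and the fallback to the `[UNK]` id is an assumption of this sketch; the real tokenizer reads the full vocab.txt.

```python
# Toy vocabulary: ids copied from the printed input_ids for the question tokens.
toy_vocab = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "what": 2054, "is": 2003, "the": 1996,
    "immune": 11311, "system": 2291, "?": 1029,
}

def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its id, falling back to the unknown-token id."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]

toy_question_tokens = ["[CLS]", "what", "is", "the", "immune", "system", "?", "[SEP]"]
toy_ids = tokens_to_ids(toy_question_tokens, toy_vocab)
print(toy_ids)
```

The first eight ids printed by the notebook are the same lookup applied to the question tokens.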


In [17]:
print(tokens)

['[CLS]', 'what', 'is', 'the', 'immune', 'system', '?', '[SEP]', 'the', 'immune', 'system', 'is', 'a', 'system', 'of', 'many', 'biological', 'structures', 'and', 'processes', 'within', 'an', 'organism', 'that', 'protects', 'against', 'disease', '.', 'to', 'function', 'properly', ',', 'an', 'immune', 'system', 'must', 'detect', 'a', 'wide', 'variety', 'of', 'agents', ',', 'known', 'as', 'pathogen', '##s', ',', 'from', 'viruses', 'to', 'parasitic', 'worms', ',', 'and', 'distinguish', 'them', 'from', 'the', 'organism', "'", 's', 'own', 'healthy', 'tissue', '.', '[SEP]']


In [91]:
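# question and paragraph already contain [CLS]/[SEP], so encode() must not add them again.
input_ids = tokenizer.encode(question, paragraph, add_special_tokens=False)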
print(input_ids)

[101, 2054, 2003, 1996, 11311, 2291, 1029, 102, 1996, 11311, 2291, 2003, 1037, 2291, 1997, 2116, 6897, 5090, 1998, 6194, 2306, 2019, 15923, 2008, 18227, 2114, 4295, 1012, 2000, 3853, 7919, 1010, 2019, 11311, 2291, 2442, 11487, 1037, 2898, 3528, 1997, 6074, 1010, 2124, 2004, 26835, 2015, 1010, 2013, 18191, 2000, 26045, 16253, 1010, 1998, 10782, 2068, 2013, 1996, 15923, 1005, 1055, 2219, 7965, 8153, 1012, 102]


In [76]:
type(input_ids)

list

In [72]:
tokenizer.convert_tokens_to_string(tokens)

"[CLS] what is the immune system ? [SEP] the immune system is a system of many biological structures and processes within an organism that protects against disease . to function properly , an immune system must detect a wide variety of agents , known as pathogens , from viruses to parasitic worms , and distinguish them from the organism ' s own healthy tissue . [SEP]"

In [78]:
tokens = tokenizer.convert_ids_to_tokens(ids=input_ids)  # input_ids must still be a plain list here, not a tensor
print(tokens)

['[CLS]', 'what', 'is', 'the', 'immune', 'system', '?', '[SEP]', 'the', 'immune', 'system', 'is', 'a', 'system', 'of', 'many', 'biological', 'structures', 'and', 'processes', 'within', 'an', 'organism', 'that', 'protects', 'against', 'disease', '.', 'to', 'function', 'properly', ',', 'an', 'immune', 'system', 'must', 'detect', 'a', 'wide', 'variety', 'of', 'agents', ',', 'known', 'as', 'pathogen', '##s', ',', 'from', 'viruses', 'to', 'parasitic', 'worms', ',', 'and', 'distinguish', 'them', 'from', 'the', 'organism', "'", 's', 'own', 'healthy', 'tissue', '.', '[SEP]']


## Define segment_ids
* segment_ids will be 0 for all the tokens of the question, and
* segment_ids will be 1 for all the tokens of the paragraph.

In [79]:
segment_ids = [0] * len(question_tokens)
segment_ids = segment_ids + [1] * len(paragraph_tokens)

In [30]:
print(segment_ids)

[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [80]:
#Search the input_ids for the first instance of the `[SEP]` token.
sep_index = input_ids.index(tokenizer.sep_token_id)

In [81]:
print(sep_index)

7


In [82]:
print(len(input_ids))

67


In [83]:
# The number of segment A tokens includes the [SEP] token itself.
num_seg_a = sep_index + 1
print(num_seg_a)
# The remainder are segment B.
num_seg_b = len(input_ids) - num_seg_a
print(num_seg_b)

# Construct the list of 0s and 1s.
segment_ids = [0]*num_seg_a + [1]*num_seg_b
print(segment_ids)
# There should be a segment_id for every input token.
assert len(segment_ids) == len(input_ids)

8
59
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
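The segment-id construction above can be wrapped in a small helper. This is a minimal, dependency-free sketch that works on token strings rather than ids; the function name `build_segment_ids` and the shortened token list are illustrative assumptions, not part of the notebook.

```python
def build_segment_ids(tokens, sep_token="[SEP]"):
    """Return 0 for every token up to and including the first [SEP], 1 afterwards."""
    first_sep = tokens.index(sep_token)  # raises ValueError if there is no separator
    num_seg_a = first_sep + 1            # segment A includes the [SEP] itself
    return [0] * num_seg_a + [1] * (len(tokens) - num_seg_a)

# Shortened example: the question tokens followed by a three-token "paragraph".
example_tokens = ["[CLS]", "what", "is", "the", "immune", "system", "?", "[SEP]",
                  "the", "immune", "system", "[SEP]"]
example_segment_ids = build_segment_ids(example_tokens)
print(example_segment_ids)
```

As in the cell above, the result has exactly one segment id per input token.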


## Convert input_ids and segment_ids to tensors

In [84]:
import torch
input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])

## Get the answer
* Feed input_ids and segment_ids to the model.
* The model returns a start score and an end score (logit) for every token.
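The scores the model returns are raw logits, not probabilities. If probabilities are wanted, a softmax over the token dimension converts them; the sketch below shows this in plain Python on made-up values. `toy_start_logits` is illustrative and not model output, and the notebook itself does not apply a softmax.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_start_logits = [-6.26, 2.38, 7.04, 3.16]  # illustrative values only
probs = softmax(toy_start_logits)
most_likely = max(range(len(probs)), key=probs.__getitem__)
print(most_likely, probs[most_likely])
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits directly selects the same token.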

In [63]:
model(input_ids, token_type_ids = segment_ids)

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-6.2588, -4.6880, -6.7744, -6.3712, -5.8096, -8.4909, -9.0369, -6.2588,
          2.3760, -0.8670, -4.0859,  2.1112,  7.0353,  3.1633, -2.0016,  1.8844,
          2.4239, -0.8321, -4.7245, -0.6628, -0.9607, -1.5406, -0.9789, -1.5247,
          1.5805, -3.6135, -1.7062, -6.2587, -4.3460, -5.7781, -6.2772, -7.2236,
         -2.5216, -2.8306, -5.5702, -4.4567, -3.9796, -6.1513, -5.8940, -6.4212,
         -7.3876, -5.6694, -7.7685, -4.6375, -6.5613, -3.7148, -7.0651, -8.1083,
         -5.4551, -4.3829, -7.9004, -4.8883, -5.8361, -7.9597, -6.8583, -4.6028,
         -7.3392, -7.3848, -6.5887, -5.8965, -5.8692, -7.9263, -6.7758, -5.4052,
         -5.2147, -7.6892, -6.2588]], grad_fn=<CloneBackward0>), end_logits=tensor([[-0.9886, -4.8667, -6.2903, -7.4715, -6.2248, -4.1261, -6.7786, -0.9885,
         -4.3361, -4.3006, -0.9956, -4.1916, -2.0522,  0.1625, -3.4849, -3.4991,
         -0.9854,  1.9729, -4.0725,  4.2137, -1.7283, -3.0363

In [85]:
# Run the model once and read both sets of logits from the same output.
outputs = model(input_ids, token_type_ids=segment_ids)
start_scores, end_scores = outputs.start_logits, outputs.end_logits

* Select the start_index, which is the index of the token with the highest start score.
* Select the end_index, which is the index of the token with the highest end score.
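Taking the two argmax values independently can, in general, yield an end index that comes before the start index. One common safeguard is to search for the best pair jointly. The sketch below is a plain-Python illustration under that assumption; `best_span` and `max_answer_len` are hypothetical names, and this is not what the notebook does.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start_scores[start] + end_scores[end]
    subject to start <= end < start + max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# Toy scores where independent argmax gives start=1, end=0 (an invalid span).
toy_start = [0.1, 5.0, 0.3, 0.2]
toy_end = [4.0, 0.2, 3.0, 0.1]
toy_span = best_span(toy_start, toy_end)
print(toy_span)
```

With real model output, the score tensors would first be flattened to Python lists, for example with `start_scores[0].tolist()`.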

In [86]:
type(start_scores)

torch.Tensor

In [87]:
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)

In [92]:
start_index,end_index

(tensor(12), tensor(26))

In [88]:
print(tokens)

['[CLS]', 'what', 'is', 'the', 'immune', 'system', '?', '[SEP]', 'the', 'immune', 'system', 'is', 'a', 'system', 'of', 'many', 'biological', 'structures', 'and', 'processes', 'within', 'an', 'organism', 'that', 'protects', 'against', 'disease', '.', 'to', 'function', 'properly', ',', 'an', 'immune', 'system', 'must', 'detect', 'a', 'wide', 'variety', 'of', 'agents', ',', 'known', 'as', 'pathogen', '##s', ',', 'from', 'viruses', 'to', 'parasitic', 'worms', ',', 'and', 'distinguish', 'them', 'from', 'the', 'organism', "'", 's', 'own', 'healthy', 'tissue', '.', '[SEP]']


In [89]:
print(' '.join(tokens[start_index:end_index+1]))

a system of many biological structures and processes within an organism that protects against disease
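Joining tokens with spaces works here because the answer span contains no WordPiece continuation tokens. A span that includes pieces such as 'pathogen', '##s' would need them merged first. `tokenizer.convert_tokens_to_string`, used earlier in this notebook, handles that. The sketch below is a dependency-free illustration of the same idea; `merge_wordpieces` is a hypothetical helper name.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, attaching '##' continuation pieces to the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and glue onto the last word
        else:
            words.append(token)
    return " ".join(words)

# A slice of the paragraph tokens that contains a continuation piece.
span_tokens = ["known", "as", "pathogen", "##s", ",", "from", "viruses"]
merged = merge_wordpieces(span_tokens)
print(merged)
```

Punctuation still appears as separate tokens in the result, matching the spacing seen in the `convert_tokens_to_string` output earlier.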
