In [1]:
%%capture
!pip install transformers==3.5.1

In [2]:
!pip install torch==1.4.0.

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.4.0.
  Downloading torch-1.4.0-cp37-cp37m-manylinux1_x86_64.whl (753.4 MB)
     |████████████████████████████████| 753.4 MB 6.9 kB/s
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.12.0+cu113
    Uninstalling torch-1.12.0+cu113:
      Successfully uninstalled torch-1.12.0+cu113
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.13.0+cu113 requires torch==1.12.0, but you have torch 1.4.0 which is incompatible.
torchtext 0.13.0 requires torch==1.12.0, but you have torch 1.4.0 which is incompatible.
torchaudio 0.12.0+cu113 requires torch==1.12.0, but you have torch 1.4.0 which is incompatible.
fastai 2.7.6 requires torch<1.13,>=1.7, but you have torch 1.4.0 which

In [3]:
!pip install nlp==0.4.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlp==0.4.0
  Downloading nlp-0.4.0-py3-none-any.whl (1.7 MB)
     |████████████████████████████████| 1.7 MB 35.6 MB/s
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
     |████████████████████████████████| 212 kB 343 kB/s
Installing collected packages: xxhash, nlp
Successfully installed nlp-0.4.0 xxhash-3.0.0


In this project we will learn how to perform question answering with a BERT model that has already been fine-tuned for question answering.

In [4]:
# Import the question-answering model class and its tokenizer
from transformers import BertForQuestionAnswering, BertTokenizer

In [5]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [6]:
# Download and load the tokenizer that matches the model
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [7]:
question = "What is the immune system?"
paragraph = "The immune system is a system of many biological structures and processes within an organism that protects against disease. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism's own healthy tissue."

Next we add a [CLS] token to the beginning of the question and a [SEP] token to the end of both the question and the paragraph.

In [8]:
question = '[CLS] ' + question + '[SEP]'
paragraph = paragraph + '[SEP]'
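
To make the layout concrete, here is a minimal sketch of the sequence the model expects. It uses a plain whitespace split as a stand-in for the real WordPiece tokenizer, and the `build_pair` helper and the toy sentences are hypothetical, introduced only for illustration.

```python
def build_pair(question_tokens, paragraph_tokens):
    """Return the tokens laid out as [CLS] question [SEP] paragraph [SEP]."""
    return ["[CLS]"] + question_tokens + ["[SEP]"] + paragraph_tokens + ["[SEP]"]


# Toy inputs, split on whitespace (the real tokenizer produces WordPiece tokens)
toy_question = "what is the immune system ?".split()
toy_paragraph = "the immune system protects against disease .".split()

pair = build_pair(toy_question, toy_paragraph)
print(pair)
```

The single [CLS] token opens the sequence, and each [SEP] marks the end of one segment.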

In [16]:
# Tokenize the question and the paragraph
question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)

In [17]:
# Combine the question and paragraph tokens and convert them to input_ids
tokens = question_tokens + paragraph_tokens 
input_ids = tokenizer.convert_tokens_to_ids(tokens)

In [18]:
# Define segment_ids: 0 for every token of the question (including [CLS] and its [SEP])
# and 1 for every token of the paragraph (including its [SEP])
segment_ids = [0] * len(question_tokens)
segment_ids += [1] * len(paragraph_tokens)
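
As a quick illustration of how the segment ids line up with the tokens, the sketch below pairs each token with its id. The token lists are made-up toy data, not output from the real tokenizer.

```python
# Toy token lists (hypothetical, already including the special tokens)
question_toks = ["[CLS]", "what", "is", "it", "?", "[SEP]"]
paragraph_toks = ["it", "is", "a", "test", ".", "[SEP]"]

# 0 marks the question segment, 1 marks the paragraph segment
seg = [0] * len(question_toks) + [1] * len(paragraph_toks)

# Print each token next to its segment id
for tok, seg_id in zip(question_toks + paragraph_toks, seg):
    print(f"{tok:>6}  {seg_id}")
```

The list of segment ids must always have the same length as the combined token list.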

In [19]:
import torch
# Convert input_ids and segment_ids to tensors; the extra brackets add a batch dimension of size 1
input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])

Now that we have processed the input, let's feed it to the model and get the result.
We pass input_ids and segment_ids to the model, which returns a start score and an end score for every token.

In [20]:
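# With transformers 3.x (installed above) the model returns a tuple of (start scores, end scores).
# Note: transformers 4.x returns an output object by default, so there you would read
# outputs.start_logits and outputs.end_logits instead of unpacking a tuple.
start_scores, end_scores = model(input_ids, token_type_ids=segment_ids)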
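
The scores returned here are unnormalized logits, one per token. If probabilities are easier to read, a softmax turns them into a distribution over positions. The sketch below is a plain-Python version on toy numbers; with the real tensors, `torch.softmax(start_scores, dim=1)` should do the same job.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores, not real model output
probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

The ordering of the tokens is unchanged by softmax, so the argmax of the probabilities is the same as the argmax of the raw scores.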

Next we select start_index, the index of the token with the highest start score, and end_index, the index of the token with the highest end score.

In [21]:
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)
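
Taking the two argmax values independently works for this example, but in general nothing prevents end_index from landing before start_index. One possible safeguard, sketched below on plain Python lists, is to search for the pair with the highest combined score subject to start <= end. The `best_span` helper, its `max_len` parameter, and the toy scores are assumptions for illustration and are not part of the transformers library.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) with the highest summed score and start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy scores where independent argmax would give start=2, end=0 (an invalid span)
toy_start = [0.1, 0.2, 5.0, 0.3, 0.1]
toy_end = [4.0, 0.1, 0.2, 3.5, 0.4]
print(best_span(toy_start, toy_end))  # (2, 3)
```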

In [22]:
# Print the tokens from start_index to end_index (inclusive) as the answer
print(' '.join(tokens[start_index:end_index+1]))

a system of many biological structures and processes within an organism that protects against disease
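
Joining tokens with spaces reads cleanly here because every word in the answer is a single token. When an answer contains words that the tokenizer has split into WordPiece fragments (marked with a leading `##`), the pieces need to be glued back together. The sketch below shows one way to do that. The `merge_wordpieces` helper and the example split are hypothetical; the tokenizer's own `convert_tokens_to_string` method is likely the more convenient choice in practice.

```python
def merge_wordpieces(tokens):
    """Join tokens into text, attaching '##' continuation pieces to the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the '##' marker and append to the last word
        else:
            words.append(tok)
    return " ".join(words)


# Hypothetical split, chosen only to show the merging behaviour
print(merge_wordpieces(["known", "as", "path", "##ogen", "##s"]))  # known as pathogens
```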


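We have now learned how to perform question answering with a fine-tuned BERT model: we prepared the input, ran it through the model, and extracted the answer span from the start and end scores.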