# **Perform Question Answering with a Pretrained BERT Model**

To perform this task, we will use a pretrained model from the **transformers** package. We can install it with the following command.

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.0


Import the necessary libraries.

In [6]:
# Tokenizer and model classes for extractive question answering
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

### We use the pretrained BertForQuestionAnswering checkpoint 'bert-large-uncased-whole-word-masking-finetuned-squad'. More information: https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad

In [7]:
tokenizer = AutoTokenizer.from_pretrained(
    "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad"
)
model = AutoModelForQuestionAnswering.from_pretrained(
    "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad"
)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of the model checkpoint at google-bert/bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Define the functions to perform the question answering task

First, we need to tokenize the input text and the question. Then we can feed the tokenized input to the model to get the answer.

## **1. Implement the function to tokenize the input text and the question**

In [23]:
def tokenize_question_answer(question, answer_text):
    """
    Tokenize the question and answer text into input IDs.

    Args:
        question (string): The question.
        answer_text (string): The paragraph containing the answer.

    Returns:
        tuple: input_ids, segment_ids
    """
    input_ids = tokenizer.encode(question, answer_text)

    sep_index = input_ids.index(tokenizer.sep_token_id)
    num_seg_a = sep_index + 1
    num_seg_b = len(input_ids) - num_seg_a
    segment_ids = [0] * num_seg_a + [1] * num_seg_b

    assert len(segment_ids) == len(input_ids)

    return input_ids, segment_ids
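
The segment-ID logic above can be checked without loading the model. The sketch below is a minimal illustration on toy IDs: `build_segment_ids` is a hypothetical helper name, 101 and 102 are assumed to be the `[CLS]` and `[SEP]` IDs of the uncased BERT vocabulary, and the remaining numbers are placeholders rather than real vocabulary entries.

```python
# Assumed special-token IDs for the uncased BERT vocabulary; other IDs are placeholders.
CLS_ID, SEP_ID = 101, 102


def build_segment_ids(input_ids, sep_id=SEP_ID):
    """Return 0 for every token up to and including the first [SEP], 1 afterwards."""
    first_sep = input_ids.index(sep_id)
    num_seg_a = first_sep + 1
    return [0] * num_seg_a + [1] * (len(input_ids) - num_seg_a)


# Layout: [CLS] question tokens [SEP] paragraph tokens [SEP]
toy_input_ids = [CLS_ID, 11, 12, 13, SEP_ID, 21, 22, 23, 24, SEP_ID]
toy_segment_ids = build_segment_ids(toy_input_ids)
print(toy_segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The question, including `[CLS]` and the first `[SEP]`, is marked as segment 0, and the paragraph with its closing `[SEP]` as segment 1.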

## **2. Implement the function to evaluate the answer**

Next, we run the model on the tokenized input. For every token, it returns one score for being the start of the answer and one score for being the end.

In [27]:
def evaluate_model(input_ids, segment_ids):
    """
    Use the model to predict start and end logits.

    Args:
        input_ids (list): Tokenized input IDs.
        segment_ids (list): Segment IDs distinguishing question and answer text.

    Returns:
        tuple: start_scores, end_scores
    """
    input_ids_tensor = torch.tensor([input_ids])
    segment_ids_tensor = torch.tensor([segment_ids])

    # Inference only: skip gradient tracking to save memory and time.
    with torch.no_grad():
        outputs = model(input_ids=input_ids_tensor, token_type_ids=segment_ids_tensor)
    return outputs.start_logits, outputs.end_logits
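
The returned logits are unnormalized scores, so they are hard to read as confidences. One common way to interpret them is to apply a softmax over the token positions. The sketch below is a plain-Python stand-in on made-up numbers, not output from the model; `softmax` and `toy_logits` are illustrative names.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [1.0, 3.0, 0.5]  # made-up start scores for three tokens
probs = softmax(toy_logits)
most_likely = probs.index(max(probs))
print(most_likely, [round(p, 3) for p in probs])
```

Softmax preserves the ordering of the scores, so the most likely position is the same one that `argmax` would pick on the raw logits.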

The scores refer to positions in the tokenized input, not in the original text. We take the highest-scoring start and end positions and join the tokens between them, merging WordPiece continuation pieces (those prefixed with `##`) back into whole words. Because the answer is rebuilt from the tokens of an uncased model, it comes back in lowercase.

## **3. Implement the function to reconstruct the answer from the tokenized input**

In [28]:
def reconstruct_answer(start_scores, end_scores, input_ids):
    """
    Reconstruct the answer from start and end scores and the input tokens.

    Args:
        start_scores (Tensor): Start logits, one score per input token.
        end_scores (Tensor): End logits, one score per input token.
        input_ids (list): The tokenized input IDs.

    Returns:
        string: The predicted answer.
    """
    # Convert the 0-d tensors to plain ints so they can be used as list indices.
    answer_start = int(torch.argmax(start_scores))
    answer_end = int(torch.argmax(end_scores))

    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    answer = tokens[answer_start]

    for i in range(answer_start + 1, answer_end + 1):
        if tokens[i][0:2] == "##":
            answer += tokens[i][2:]
        else:
            answer += " " + tokens[i]

    return answer

Now let's define the function to perform the question answering task.

## **4. Implement the function to perform the question answering task**

In [29]:
def answer_question(question, answer_text):
    """
    Answer the given question based on the answer text.

    Args:
        question (string): The question.
        answer_text (string): The paragraph containing the answer.

    Returns:
        None
    """
    input_ids, segment_ids = tokenize_question_answer(question, answer_text)
    start_scores, end_scores = evaluate_model(input_ids, segment_ids)
    answer = reconstruct_answer(start_scores, end_scores, input_ids)

    print('Question: "' + question + '"')
    print('Answer: "' + answer + '"')

## **5. Test the function with some examples**

Give it a try with a simple example.

In [30]:
question = "what is my dog name?"
paragraph = "I have a dog. It's name is Ricky. I get it at my 15th birthday, when it was a puppy."

answer_question(question, paragraph)

Question: "what is my dog name?"
Answer: "ricky"


Looks good! The model was able to find the correct answer to the question.

Let's try another example.

In [18]:
question = "when Leonhard Euler was born?"
paragraph = "Leonhard Euler: 15 April 1707 – 18 September 1783 was a Swiss mathematician, \
physicist, astronomer, geographer, logician and engineer who made important and influential discoveries in many branches of mathematics, \
such as infinitesimal calculus and graph theory, \
while also making pioneering contributions to several branches such as topology and analytic number theory. \
He also introduced much of the modern mathematical terminology and notation, \
particularly for mathematical analysis, such as the notion of a mathematical function.[4] He is also known for his work in mechanics, fluid dynamics, optics, astronomy and music theory"

answer_question(question, paragraph)

Question: "when Leonhard Euler was born?"
Answer: "15 april 1707"


Now let's try with a more complex example.

In [None]:
paragraph = "Picasso was born at 23:15 on 25 October 1881, in the city of Málaga, Andalusia, in southern Spain. \
He was the first child of Don José Ruiz y Blasco (1838–1913) and María Picasso y López. Picasso's family was of middle-class background. \
His father was a painter who specialized in naturalistic depictions of birds and other game. For most of his life, Ruiz was a professor of art at the School of Crafts and a curator of a local museum. \
Ruiz's ancestors were minor aristocrats."

Question 1: What is Picasso's father's job?

In [19]:
question = "what is Picasso's father job"
answer_question(question, paragraph)

Question: "what is Picasso's father job"
Answer: "a painter"


Question 2: What is the occupation of Picasso's father?

In [20]:
question = "what is the occupation of Picasso's father"
answer_question(question, paragraph)

Question: "what is the occupation of Picasso's father"
Answer: "painter"


Question 3: What is his mother's name?

In [21]:
question = "what is his mother's name"
answer_question(question, paragraph)

Question: "what is his mother's name"
Answer: "maria picasso y lopez"


Question 4: What is Picasso's family like?

In [22]:
question = "what is Picasso's family like"
answer_question(question, paragraph)

Question: "what is Picasso's family like"
Answer: "middle - class background"
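
The spaces around the hyphen in the last answer come from the reconstruction step: the BERT tokenizer splits punctuation into separate tokens, and `reconstruct_answer` joins every non-`##` token with a space. A minimal sketch of a joiner that closes the gap around hyphens is shown below. `join_wordpieces` is a hypothetical helper, and the token list is an assumed tokenization that matches the output above.

```python
def join_wordpieces(tokens):
    """Merge '##' continuation pieces, then remove the spaces around hyphens."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the continuation onto the previous word
        else:
            words.append(token)
    text = " ".join(words)
    # Simplification: this also merges hyphens that were spaced in the original text.
    return text.replace(" - ", "-")


print(join_wordpieces(["middle", "-", "class", "background"]))
# middle-class background
```

This only handles hyphens. Other punctuation, such as commas or apostrophes, would need similar rules.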
