Question answering tasks return an answer given a question. If you have ever asked a virtual assistant like Alexa, Siri or Google what the weather is, you have already used a question answering model.

There are two common types of question answering tasks:
1. Extractive: extract the answer from the given context.
2. Abstractive: generate an answer from the context that correctly answers the question.

In this Hugging Face-based notebook we:
A) Fine-tune DistilBERT on the SQuAD dataset for **extractive** question answering.
B) Use the fine-tuned model for inference.

# Libraries

In [1]:
!pip install transformers datasets evaluate
!pip install ipywidgets



In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer

In [3]:
# Log in to Hugging Face to share the model with the community
#from huggingface_hub import notebook_login
#notebook_login()

# Load Data

In [4]:
# load a subset of the SQuAD dataset from the 🤗 Datasets library for experimentation
squad = load_dataset("squad", split="train[:5000]")

In [5]:
# split into train and test sets
squad = squad.train_test_split(test_size=0.2)

squad['train'][10]

{'id': '56d614dd1c85041400946f08',
 'title': '2008_Sichuan_earthquake',
 'context': 'Francis Marcus of the International Federation of the Red Cross praised the Chinese rescue effort as "swift and very efficient" in Beijing on Tuesday. But he added the scale of the disaster was such that "we can\'t expect that the government can do everything and handle every aspect of the needs". The Economist noted that China reacted to the disaster "rapidly and with uncharacteristic openness", contrasting it with Burma\'s secretive response to Cyclone Nargis, which devastated that country 10 days before the earthquake.',
 'question': 'What kind of attitude did Burma display in response to a cyclone a few days earlier?',
 'answers': {'text': ['secretive'], 'answer_start': [427]}}
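
In the record above, `answer_start` is a character index into `context`, not a token index. The following minimal sketch illustrates that relationship on a shortened, made-up context string (not the SQuAD record itself); the index is located with `str.find` rather than hard-coded.

```python
# Shortened, illustrative context (an assumption, not the dataset record).
context = (
    "The Economist contrasted China's response with Burma's secretive "
    "response to Cyclone Nargis."
)
answer_text = "secretive"

# SQuAD stores the character offset where the answer begins in the context.
answer_start = context.find(answer_text)
answer_end = answer_start + len(answer_text)  # exclusive end index

# Slicing the context with these offsets recovers the answer text.
extracted = context[answer_start:answer_end]
print(answer_start, repr(extracted))
```

Because the answer is always a slice of the context, extractive question answering reduces to predicting where that slice starts and ends.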

# Preprocessing

In [6]:
# load a DistilBERT tokenizer to process the question and context fields
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

There are a few preprocessing steps particular to question answering tasks you should be aware of:

1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting truncation="only_second".
2. Next, map the start and end character positions of the answer to tokens in the original context by setting return_offsets_mapping=True.
3. With the mapping in hand, you can now find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offset mapping corresponds to the question and which corresponds to the context.
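
The steps above can be sketched without any library. The toy example below assumes a whitespace "tokenizer" (`toy_offsets`, a hypothetical helper) in place of the real DistilBERT tokenizer, so the token boundaries differ from what the notebook produces; it only illustrates how character offsets are turned into token labels and why a truncated answer is labeled (0, 0).

```python
import re

def toy_offsets(text):
    """Character (start, end) offsets of whitespace-separated tokens.

    Hypothetical stand-in for the tokenizer's offset mapping; a real
    subword tokenizer would split the text differently.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def label_answer(offsets, start_char, end_char):
    """Return inclusive (start_token, end_token) for the answer span,
    or (0, 0) if the answer is not fully covered by the offsets."""
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return 0, 0
    # First token that ends after the answer begins.
    start_token = next(i for i, (_, end) in enumerate(offsets) if end > start_char)
    # Last token that begins before the answer ends.
    end_token = max(i for i, (start, _) in enumerate(offsets) if start < end_char)
    return start_token, end_token

context = "Burma gave a secretive response to Cyclone Nargis"
answer = "secretive response"
start_char = context.find(answer)
end_char = start_char + len(answer)

offsets = toy_offsets(context)
full = label_answer(offsets, start_char, end_char)

# Simulate truncation by keeping only the first four tokens of the context.
truncated = label_answer(offsets[:4], start_char, end_char)

# Map the token span back to text as a sanity check on the labels.
recovered = context[offsets[full[0]][0]:offsets[full[1]][1]]
print(full, truncated, repr(recovered))
```

In the notebook itself, position 0 holds the `[CLS]` token, so a (0, 0) label points at a token that can never be part of a real answer.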


In [7]:
# function to truncate and map the start and end tokens of the answer to the context
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

# apply the preprocessing function over the entire dataset using 🤗 Datasets map function. 
# speed up the map function by setting batched=True to process multiple elements of the dataset at once 
# remove any columns you don’t need
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
