Question answering tasks return an answer given a question. If you have ever asked a virtual assistant like Alexa, Siri, or Google what the weather is, you have used a question answering model.

There are two common types of question answering tasks:
1. Extractive: extract the answer from the given context.
2. Abstractive: generate an answer from the context that correctly answers the question.

In this Hugging Face notebook we:
A) Finetune DistilBERT on the SQuAD dataset for **extractive** question answering.
B) Use the finetuned model for inference.

# Libraries

In [1]:
!pip install transformers datasets evaluate
!pip install ipywidgets



In [2]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    DefaultDataCollator,
    Trainer,
    TrainingArguments,
    pipeline,
)



In [3]:
# Optional: log in to Hugging Face to share the model with the community
#from huggingface_hub import notebook_login
#notebook_login()

# Load Data

In [4]:
# load a subset of the SQuAD dataset from the 🤗 Datasets library for experimentation
squad = load_dataset("squad", split="train[:5000]")

In [5]:
# split into train and test sets
squad = squad.train_test_split(test_size=0.2)

squad['train'][10]

{'id': '57341835d058e614000b693f',
 'title': 'Montana',
 'context': "The Yellowstone River rises on the continental divide near Younts Peak in Wyoming's Teton Wilderness. It flows north through Yellowstone National Park, enters Montana near Gardiner, and passes through the Paradise Valley to Livingston. It then flows northeasterly across the state through Billings, Miles City, Glendive, and Sidney. The Yellowstone joins the Missouri in North Dakota just east of Fort Union. It is the longest undammed, free-flowing river in the contiguous United States, and drains about a quarter of Montana (36,000 square miles (93,000 km2)).",
 'question': 'Where does the Yellowstone meet the Missouri river?',
 'answers': {'text': ['North Dakota'], 'answer_start': [371]}}
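
In a SQuAD record, `answer_start` is a character offset into `context`, so the answer text can be recovered by slicing. The sketch below illustrates this on a shortened, hypothetical excerpt of the context above. The offset is recomputed with `str.find` for the excerpt and is not the `371` stored in the full record.

```python
# Shortened excerpt of the context shown above (an assumption for illustration).
context = "The Yellowstone joins the Missouri in North Dakota just east of Fort Union."
answer_text = "North Dakota"

# In the real dataset this offset is stored in answers["answer_start"][0].
answer_start = context.find(answer_text)
answer_end = answer_start + len(answer_text)  # exclusive end offset

# Slicing the context with the character span recovers the answer text.
recovered = context[answer_start:answer_end]
print(answer_start, answer_end, recovered)
```

The preprocessing step later in the notebook converts this character span into token positions, because the model predicts token indices rather than character offsets.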

# Preprocessing

In [6]:
# load a DistilBERT tokenizer to process the question and context fields
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

There are a few preprocessing steps particular to question answering tasks you should be aware of:

1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original context by setting `return_offsets_mapping=True`.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offset mapping corresponds to the question and which corresponds to the context.
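
To make the offset-mapping idea concrete, here is a minimal sketch that uses a toy whitespace tokenizer instead of the real DistilBERT tokenizer. The helper names `toy_offsets` and `char_span_to_token_span` are hypothetical and exist only for illustration. The search logic mirrors the approach used in the preprocessing function: walk the offsets to find the tokens that cover the answer's character span, and fall back to `(0, 0)` when the span was cut off by truncation.

```python
import re

def toy_offsets(text):
    """Return (start, end) character offsets for each whitespace-delimited token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive token indices, or (0, 0) if not covered."""
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return 0, 0
    # Last token whose start offset is <= start_char.
    start_tok = 0
    while start_tok < len(offsets) and offsets[start_tok][0] <= start_char:
        start_tok += 1
    start_tok -= 1
    # First token whose end offset is >= end_char.
    end_tok = len(offsets) - 1
    while end_tok >= 0 and offsets[end_tok][1] >= end_char:
        end_tok -= 1
    end_tok += 1
    return start_tok, end_tok

context = "The Yellowstone joins the Missouri in North Dakota just east of Fort Union."
answer = "North Dakota"
start_char = context.find(answer)
end_char = start_char + len(answer)

offsets = toy_offsets(context)
tokens = context.split()
start_tok, end_tok = char_span_to_token_span(offsets, start_char, end_char)
print(tokens[start_tok:end_tok + 1])

# Simulate truncation that cuts the context off in the middle of the answer.
truncated_span = char_span_to_token_span(offsets[:7], start_char, end_char)
print(truncated_span)
```

A real subword tokenizer produces more tokens per word and adds special tokens, which is why the notebook also needs `sequence_ids` to locate where the context begins and ends.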


In [7]:
# function to truncate and map the start and end tokens of the answer to the context
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the (possibly truncated) context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

# apply the preprocessing function over the entire dataset using 🤗 Datasets map function. 
# speed up the map function by setting batched=True to process multiple elements of the dataset at once 
# remove any columns you don’t need
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [8]:
# create a batch of examples using DefaultDataCollator
# Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply any additional preprocessing such as padding
data_collator = DefaultDataCollator()
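
Because the collator does not pad, every example must already have the same length, which is why the preprocessing function used `padding="max_length"`. The sketch below shows the idea conceptually with plain Python lists instead of tensors. `collate_without_padding` is a hypothetical helper, not the library's implementation.

```python
def collate_without_padding(features):
    """Group per-example dicts into one batch dict without padding anything."""
    lengths = {len(f["input_ids"]) for f in features}
    if len(lengths) != 1:
        # A non-padding collator cannot batch ragged examples.
        raise ValueError("examples must already share one length")
    return {key: [f[key] for f in features] for key in features[0]}

# Two toy examples that were already padded to length 4 (token ids are made up).
examples = [
    {"input_ids": [101, 7, 102, 0], "start_positions": 1, "end_positions": 1},
    {"input_ids": [101, 8, 9, 102], "start_positions": 1, "end_positions": 2},
]
batch = collate_without_padding(examples)
print(batch["input_ids"])
print(batch["start_positions"], batch["end_positions"])
```

If the examples were left unpadded during preprocessing, a padding-aware collator would be needed instead.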

# Training

In [9]:
# Load DistilBERT with AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
# Fine-tune the model with Trainer.
# If you are logged in, set push_to_hub=True in TrainingArguments() and call
# trainer.push_to_hub() after training to upload the model to the Hub.
training_args = TrainingArguments(
    output_dir="question_answering_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 2.11915         |
| 2     | 2.655600      | 1.611291        |
| 3     | 2.655600      | 1.515877        |


TrainOutput(global_step=750, training_loss=2.228500691731771, metrics={'train_runtime': 446.0319, 'train_samples_per_second': 26.904, 'train_steps_per_second': 1.681, 'total_flos': 1175877900288000.0, 'train_loss': 2.228500691731771, 'epoch': 3.0})
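
The step count in the output above can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` defaults of `logging_steps=500` and `save_steps=500`, none of which are set explicitly in this notebook. Under those assumptions it also suggests why epoch 1 shows "No log" for the training loss and why a `checkpoint-500` directory exists.

```python
import math

num_examples = 4000   # training rows after the 80/20 split of 5000 examples
batch_size = 16       # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
logging_steps = 500   # assumed Trainer default
save_steps = 500      # assumed Trainer default

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# The training loss is first logged at step 500, which falls in this epoch.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

# Checkpoints written at multiples of save_steps during training.
checkpoints = [f"checkpoint-{s}" for s in range(save_steps, total_steps + 1, save_steps)]

print(steps_per_epoch, total_steps, first_logged_epoch, checkpoints)
```

With these numbers there are 250 steps per epoch and 750 in total, matching `global_step=750`.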

In [11]:
# share your model to the Hub with the push_to_hub() method so everyone can use your model
#trainer.push_to_hub()

In [12]:
# The below line can be used to manually generate a config.json file
# model.config.to_json_file("question_answering_model/config.json")

In [13]:
# Evaluation placeholder: a full evaluation is expensive and is skipped here.
# The validation loss reported by the Trainer gives a rough view of model performance.

# Inference

In [14]:
# Instantiate a pipeline for question answering
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

question_answerer = pipeline("question-answering", model="question_answering_model/checkpoint-500")
question_answerer(question=question, context=context)

{'score': 0.14135223627090454,
 'start': 10,
 'end': 95,
 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}

In [15]:
# Manually reproduce the pipeline steps (the context is worded differently from the previous cell)
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 13 programming languages and 46 natural languages."

# tokenize the text and return PyTorch tensors:
tokenizer = AutoTokenizer.from_pretrained("question_answering_model/checkpoint-500")
inputs = tokenizer(question, context, return_tensors="pt")

# pass inputs to the model and return the logits
model = AutoModelForQuestionAnswering.from_pretrained("question_answering_model/checkpoint-500")
with torch.no_grad():
    outputs = model(**inputs)

# take the token indices with the highest start and end logits
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

# decode the predicted tokens to get the answer
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'176 billion parameters and can generate text in 13 programming languages and 46'
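
Taking the start and end `argmax` independently can yield an overly long span, as in the output above, or even an end index that precedes the start. One common remedy is to score start/end pairs jointly and restrict the span length. The sketch below shows that idea on made-up logits in pure Python. `best_span` and `max_answer_len` are hypothetical names, and this is a simplified illustration rather than the pipeline's actual post-processing.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return the (start, end) pair with the highest summed logit,
    subject to start <= end and a span of at most max_answer_len tokens."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

# Toy logits (made up): independent argmax would pick start=1 and end=4.
start_logits = [0.1, 2.0, 0.3, 1.5, 0.2]
end_logits = [0.0, 0.1, 0.4, 0.2, 3.0]

unconstrained, _ = best_span(start_logits, end_logits, max_answer_len=5)
constrained, constrained_score = best_span(start_logits, end_logits, max_answer_len=2)
print(unconstrained, constrained, constrained_score)
```

With a length limit of two tokens, the search prefers the shorter span starting at index 3 over the four-token span that independent argmax would select.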