Question answering tasks return an answer to a given question. If you have ever asked a virtual assistant such as Alexa, Siri, or Google what the weather is, you have already used a question answering model.

There are two common types of question answering tasks:
1. Extractive: extract the answer from the given context.
2. Abstractive: generate an answer from the context that correctly answers the question.

In this Hugging Face-based notebook we:
A) Fine-tune DistilBERT on the SQuAD dataset for **extractive** question answering.
B) Use the fine-tuned model for inference.

# Libraries

In [1]:
!pip install transformers datasets evaluate
!pip install ipywidgets


In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer

In [3]:
# Log in to Hugging Face to share the model with the community (optional)
#from huggingface_hub import notebook_login
#notebook_login()

# Load Data

In [4]:
# load a subset of the SQuAD dataset from the 🤗 Datasets library for experimentation
squad = load_dataset("squad", split="train[:5000]")

In [5]:
# split into train and test sets
squad = squad.train_test_split(test_size=0.2)

squad['train'][10]

{'id': '56d1073817492d1400aab787',
 'title': 'To_Kill_a_Mockingbird',
 'context': "To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature. The plot and characters are loosely based on the author's observations of her family and neighbors, as well as on an event that occurred near her hometown in 1936, when she was 10 years old.",
 'question': 'What year was To Kill a Mockingbird first published?',
 'answers': {'text': ['1960'], 'answer_start': [60]}}
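In the record above, `answer_start` is a character offset into `context`, which is what makes the task extractive: the answer is a literal substring of the passage. The following minimal sketch checks this for the sample shown, using only the first sentence of the context. The helper name `extract_span` is hypothetical and not part of the 🤗 Datasets API.

```python
# First sentence of the sample context shown above (shortened for brevity).
context = "To Kill a Mockingbird is a novel by Harper Lee published in 1960."
answers = {"text": ["1960"], "answer_start": [60]}


def extract_span(context, answers, i=0):
    """Hypothetical helper: slice the i-th answer out of the context by character offset."""
    start = answers["answer_start"][i]
    end = start + len(answers["text"][i])
    return context[start:end]


span = extract_span(context, answers)

# The slice recovered from the context should match the stored answer text.
assert span == answers["text"][0]
print(span)
```

A check like this is a cheap way to spot misaligned annotations before training.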

# Preprocessing

In [None]:
# load a DistilBERT tokenizer to process the question and context fields
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
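
The model predicts token positions, while SQuAD stores the answer as a character offset, so preprocessing has to map one onto the other. The sketch below illustrates that idea only: `tokenize_with_offsets` is a hypothetical regex-based stand-in, not DistilBERT's WordPiece tokenizer, and `char_span_to_token_span` is likewise an illustrative helper rather than a library function.

```python
import re


def tokenize_with_offsets(text):
    """Hypothetical stand-in tokenizer: words and punctuation with (start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(tokens, start_char, end_char):
    """Return (first, last) indices of the tokens overlapping the half-open range [start_char, end_char)."""
    covered = [i for i, (_, s, e) in enumerate(tokens) if s < end_char and e > start_char]
    return covered[0], covered[-1]


# First sentence of the sample context, with its answer annotation.
context = "To Kill a Mockingbird is a novel by Harper Lee published in 1960."
answer_text, answer_start = "1960", 60

tokens = tokenize_with_offsets(context)
start_tok, end_tok = char_span_to_token_span(
    tokens, answer_start, answer_start + len(answer_text)
)

# Tokens between the two indices (inclusive) spell out the answer.
answer_tokens = [tok for tok, _, _ in tokens[start_tok:end_tok + 1]]
print(start_tok, end_tok, answer_tokens)
```

With a real subword tokenizer the answer may span several tokens, and the question is encoded ahead of the context, so the indices would differ from this toy case.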