Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Question Answering on the SQuAD Dataset using Transformers Models


# Before You Start

The running times shown in this notebook were measured on a Standard_NC24rs_v3 Azure Data Science Virtual Machine with 4 NVIDIA Tesla V100 GPUs.
> **Tip**: To run through the notebook quickly, set the **`QUICK_RUN`** flag in the cell below to **`True`**. The notebook then runs on a small subset of the data for fewer epochs.

The table below lists reference running times for BERT on different machine configurations.

|QUICK_RUN|Machine Configurations|Running time|
|:---------|:----------------------|:------------|
|True|4 **CPU**s, 14GB memory| ~ 10 minutes |
|True|1 NVIDIA Tesla K80 GPU, 12GB GPU memory| ~ 3 minutes |
|False|4 NVIDIA Tesla K80 GPUs, 48GB GPU memory| ~ 18 hours |
|False|4 NVIDIA Tesla V100 GPUs, 64GB GPU memory, without RDMA (NC24s)| ~ 7 hours|
|False|4 NVIDIA Tesla V100 GPUs, 64GB GPU memory, with RDMA (NC24**r**s)| ~ 4 hours|

If you run into a CUDA out-of-memory error, try reducing `PER_GPU_BATCH_SIZE` and increasing `GRADIENT_ACCUMULATION_STEPS`. As long as the product `PER_GPU_BATCH_SIZE` * `GRADIENT_ACCUMULATION_STEPS` stays the same, the effective **per-GPU** batch size is unchanged: a smaller `PER_GPU_BATCH_SIZE` with more accumulation steps is equivalent to a larger `PER_GPU_BATCH_SIZE` with fewer steps, but needs less GPU memory.
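
The trade-off above can be checked with a little arithmetic. The sketch below is illustrative only: `effective_batch_sizes` and `mean_grad` are hypothetical helpers, not part of `utils_nlp`, and it assumes that each micro-batch loss is averaged and scaled by `1 / GRADIENT_ACCUMULATION_STEPS`, which is a common convention but is not confirmed here for `AnswerExtractor.fit`.

```python
def effective_batch_sizes(per_gpu_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Return (per-GPU, global) number of examples contributing to one optimizer step."""
    per_gpu = per_gpu_batch_size * gradient_accumulation_steps
    # Assumption: with no GPU (num_gpus == 0) everything runs on a single CPU device.
    return per_gpu, per_gpu * max(num_gpus, 1)


def mean_grad(w, batch):
    """Gradient w.r.t. w of the mean squared error of the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Same effective batch size, but the second setting holds fewer examples in memory at once.
print(effective_batch_sizes(4, 1, num_gpus=4))
print(effective_batch_sizes(1, 4, num_gpus=4))

# One step on 4 examples vs. 2 accumulated micro-batches of 2 examples each.
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
steps = 2
full_grad = mean_grad(0.5, batch)
accumulated_grad = sum(
    mean_grad(0.5, batch[i:i + 2]) / steps for i in range(0, len(batch), 2)
)
print(full_grad, accumulated_grad)
```

Under these assumptions the accumulated gradient matches the full-batch gradient, which is why the two settings are interchangeable apart from memory use and speed.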

In [1]:
# Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.
QUICK_RUN = False

## Summary
This notebook demonstrates how to fine-tune [pre-trained transformer models](https://github.com/huggingface/transformers) for the extractive question answering task. Utility functions and classes in the NLP Best Practices repo facilitate data preprocessing, model training, model scoring, result postprocessing, and model evaluation.

The following models are currently supported:
* BERT: [Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
* XLNet: [Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf)
* DistilBERT: [A small, fast, cheap and light Transformer model based on the BERT architecture](https://medium.com/huggingface/distilbert-8cf3380435b5)

In [None]:
import os
import sys

import scrapbook as sb
import torch

from utils_nlp.common.pytorch_utils import dataloader_from_dataset
from utils_nlp.common.timer import Timer
from utils_nlp.dataset.squad import load_pandas_df
from utils_nlp.eval.question_answering import evaluate_qa
from utils_nlp.models.transformers.datasets import QADataset
from utils_nlp.models.transformers.question_answering import (
    AnswerExtractor,
    QAProcessor,
)

## Configurations

To get all the transformer models supporting question answering, call `AnswerExtractor.list_supported_models()`.

In [2]:
AnswerExtractor.list_supported_models()

['bert-base-uncased',
 'bert-large-uncased',
 'bert-base-cased',
 'bert-large-cased',
 'bert-base-multilingual-uncased',
 'bert-base-multilingual-cased',
 'bert-base-chinese',
 'bert-base-german-cased',
 'bert-large-uncased-whole-word-masking',
 'bert-large-cased-whole-word-masking',
 'bert-large-uncased-whole-word-masking-finetuned-squad',
 'bert-large-cased-whole-word-masking-finetuned-squad',
 'bert-base-cased-finetuned-mrpc',
 'xlnet-base-cased',
 'xlnet-large-cased',
 'distilbert-base-uncased',
 'distilbert-base-uncased-distilled-squad']

In [12]:
MODEL_NAME = "bert-large-cased-whole-word-masking"
DO_LOWER_CASE = False

# MODEL_NAME = "xlnet-large-cased"
# DO_LOWER_CASE = False

# MODEL_NAME = "distilbert-base-uncased"
# DO_LOWER_CASE = True

TRAIN_DATA_USED_PERCENT = 1
DEV_DATA_USED_PERCENT = 1
NUM_EPOCHS = 2

MAX_SEQ_LENGTH = 384
DOC_STRIDE = 128
PER_GPU_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 1
NUM_GPUS = torch.cuda.device_count()

if QUICK_RUN:
    TRAIN_DATA_USED_PERCENT = 0.001
    DEV_DATA_USED_PERCENT = 0.01
    NUM_EPOCHS = 1
    
    MAX_SEQ_LENGTH = 128
    DOC_STRIDE = 64
    PER_GPU_BATCH_SIZE = 1

print("Max sequence length: {}".format(MAX_SEQ_LENGTH))
print("Document stride: {}".format(DOC_STRIDE))
print("Per gpu batch size: {}".format(PER_GPU_BATCH_SIZE))

RANDOM_SEED = 42
SQUAD_VERSION = "v1.1" 
CACHE_DIR = "./temp"

MAX_QUESTION_LENGTH = 64
LEARNING_RATE = 3e-5

DOC_TEXT_COL = "doc_text"
QUESTION_TEXT_COL = "question_text"
ANSWER_START_COL = "answer_start"
ANSWER_TEXT_COL = "answer_text"
QA_ID_COL = "qa_id"
IS_IMPOSSIBLE_COL = "is_impossible"

Max sequence length: 384
Document stride: 128
Per gpu batch size: 4


## Load Data

### The SQuAD Dataset
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. [\[1, 2\]](#References)

<img src="https://nlpbp.blob.core.windows.net/images/squad.png">

There are two versions of the SQuAD dataset. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles. SQuAD 2.0 adds over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. Both versions are available at [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/). Each version comes with a training set and a development set.


The utility function `load_pandas_df` downloads the dataset specified by `squad_version` and `file_split` to `local_cache_path` if it doesn't exist already.

In [5]:
train_df = load_pandas_df(local_cache_path=CACHE_DIR, squad_version=SQUAD_VERSION, file_split="train")
dev_df = load_pandas_df(local_cache_path=CACHE_DIR, squad_version=SQUAD_VERSION, file_split="dev")

100%|██████████| 7.82k/7.82k [00:00<00:00, 20.6kKB/s]
100%|██████████| 1.02k/1.02k [00:00<00:00, 19.9kKB/s]


In [6]:
train_df.head()

Unnamed: 0,doc_text,question_text,answer_start,answer_text,qa_id,is_impossible
0,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,515,Saint Bernadette Soubirous,5733be284776f41900661182,False
1,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,188,a copper statue of Christ,5733be284776f4190066117f,False
2,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,279,the Main Building,5733be284776f41900661180,False
3,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,381,a Marian place of prayer and reflection,5733be284776f41900661181,False
4,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,92,a golden statue of the Virgin Mary,5733be284776f4190066117e,False


In [7]:
dev_df.head()

Unnamed: 0,doc_text,question_text,answer_start,answer_text,qa_id,is_impossible
0,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"[177, 177, 177]","[Denver Broncos, Denver Broncos, Denver Broncos]",56be4db0acb8001400a502ec,False
1,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"[249, 249, 249]","[Carolina Panthers, Carolina Panthers, Carolin...",56be4db0acb8001400a502ed,False
2,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"[403, 355, 355]","[Santa Clara, California, Levi's Stadium, Levi...",56be4db0acb8001400a502ee,False
3,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"[177, 177, 177]","[Denver Broncos, Denver Broncos, Denver Broncos]",56be4db0acb8001400a502ef,False
4,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"[488, 488, 521]","[gold, gold, gold]",56be4db0acb8001400a502f0,False
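
In the SQuAD format, `answer_start` is a character offset into `doc_text`, and the development set carries several reference answers per question, which is why its `answer_start` and `answer_text` columns hold lists. The sketch below illustrates that relationship on a made-up passage; `answer_matches_span` is a hypothetical helper, not part of `utils_nlp`.

```python
def answer_matches_span(doc_text, answer_start, answer_text):
    """Check that answer_text sits at character offset answer_start in doc_text."""
    return doc_text[answer_start:answer_start + len(answer_text)] == answer_text


# Toy passage for illustration only (not an actual row of the dataset).
doc_text = "The game was played at Levi's Stadium in Santa Clara, California."
answer_texts = ["Santa Clara, California", "Levi's Stadium", "Levi's Stadium"]
answer_starts = [doc_text.index(a) for a in answer_texts]

print(answer_starts)
print(all(
    answer_matches_span(doc_text, start, text)
    for start, text in zip(answer_starts, answer_texts)
))
```

A check like this can be a quick sanity test after loading or subsampling the data, since a misaligned offset would silently produce wrong training labels.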


In [8]:
train_df = train_df.sample(frac=TRAIN_DATA_USED_PERCENT).reset_index(drop=True)
dev_df = dev_df.sample(frac=DEV_DATA_USED_PERCENT).reset_index(drop=True)

`QADataset` wraps a DataFrame, together with the names of its relevant columns, in a standard question answering dataset format used by the downstream processing steps.

In [9]:
train_dataset = QADataset(
    df=train_df,
    doc_text_col=DOC_TEXT_COL,
    question_text_col=QUESTION_TEXT_COL,
    qa_id_col=QA_ID_COL,
    is_impossible_col=IS_IMPOSSIBLE_COL,
    answer_start_col=ANSWER_START_COL,
    answer_text_col=ANSWER_TEXT_COL
)
dev_dataset = QADataset(
    df=dev_df,
    doc_text_col=DOC_TEXT_COL,
    question_text_col=QUESTION_TEXT_COL,
    qa_id_col=QA_ID_COL,
    is_impossible_col=IS_IMPOSSIBLE_COL,
    answer_start_col=ANSWER_START_COL,
    answer_text_col=ANSWER_TEXT_COL
)

## Tokenize and Preprocess Data

`QAProcessor.preprocess` tokenizes the input paragraph, question, and answer texts and converts them into the format required by pre-trained transformer models. It involves the following steps:
* Tokenization.
* Convert character-based answer span indices to token-based indices.
* Truncate the question token list if it's longer than `max_question_length`.
* Split the paragraph into multiple segments if it's longer than `max_seq_length` - `max_question_length` - 3. (The "-3" is for the special [CLS] token and two [SEP] tokens.)
* Add the special tokens [CLS] and [SEP].
* Pad the concatenated token sequence to `max_seq_length` if it's shorter.
* Convert the tokens into token indices corresponding to the tokenizer's vocabulary.
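
The steps above can be sketched in plain Python. This is a simplified illustration, not the code behind `QAProcessor.preprocess`: it assumes whitespace tokenization instead of the model's subword tokenizer, it skips the final conversion to vocabulary indices, and the helper names (`char_span_to_token_span`, `split_into_windows`, `build_input`) are hypothetical.

```python
def char_span_to_token_span(text, answer_start, answer_text):
    """Map a character-level answer span to whitespace-token indices."""
    tokens, char_to_token = [], []
    prev_is_space = True
    for ch in text:
        if ch.isspace():
            prev_is_space = True
        else:
            if prev_is_space:
                tokens.append(ch)      # first character of a new token
            else:
                tokens[-1] += ch       # continue the current token
            prev_is_space = False
        char_to_token.append(len(tokens) - 1)
    start_token = char_to_token[answer_start]
    end_token = char_to_token[answer_start + len(answer_text) - 1]
    return tokens, start_token, end_token


def split_into_windows(num_doc_tokens, max_doc_tokens, doc_stride):
    """Return (start, length) windows that cover the document with overlap."""
    windows, start = [], 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_doc_tokens)
        windows.append((start, length))
        if start + length == num_doc_tokens:
            break
        start += min(length, doc_stride)
    return windows


def build_input(question_tokens, doc_tokens, max_seq_length):
    """Add [CLS]/[SEP] and pad with [PAD] up to max_seq_length."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    return tokens + ["[PAD]"] * (max_seq_length - len(tokens))


doc = "Super Bowl 50 was won by the Denver Broncos in 2016"
answer = "Denver Broncos"
max_seq_length, max_question_length, doc_stride = 12, 4, 3

question_tokens = "Who won ?".split()[:max_question_length]  # truncate the question
doc_tokens, start_token, end_token = char_span_to_token_span(doc, doc.index(answer), answer)
# The "- 3" accounts for one [CLS] and two [SEP] tokens.
max_doc_tokens = max_seq_length - len(question_tokens) - 3
windows = split_into_windows(len(doc_tokens), max_doc_tokens, doc_stride)
inputs = [build_input(question_tokens, doc_tokens[s:s + n], max_seq_length) for s, n in windows]

print(start_token, end_token)
print(windows)
for tokens in inputs:
    print(tokens)
```

The overlap between consecutive windows, controlled by the stride, gives an answer that straddles a window boundary a chance to appear whole in at least one segment.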

`QAProcessor.preprocess` returns a PyTorch `Dataset`. By default, it saves `cached_examples_train/test.jsonl` and `cached_features_train/test.jsonl` to `./cached_qa_features`. These files are needed later to postprocess the predicted answer start and end indices into the final answer text. You can change the default directory by specifying `feature_cache_dir`.

In [10]:
qa_processor = QAProcessor(model_name=MODEL_NAME, to_lower=DO_LOWER_CASE)
train_dataset = qa_processor.preprocess(
    train_dataset,
    is_training=True,
    max_question_length=MAX_QUESTION_LENGTH,
    max_seq_length=MAX_SEQ_LENGTH,
    doc_stride=DOC_STRIDE,
)

# we keep a copy of the original dev_dataset as it is needed for evaluation
dev_dataset_processed = qa_processor.preprocess(
    dev_dataset,
    is_training=False,
    max_question_length=MAX_QUESTION_LENGTH,
    max_seq_length=MAX_SEQ_LENGTH,
    doc_stride=DOC_STRIDE,
)

train_dataloader = dataloader_from_dataset(
    train_dataset, batch_size=PER_GPU_BATCH_SIZE, num_gpus=NUM_GPUS, shuffle=True
)
dev_dataloader = dataloader_from_dataset(
    dev_dataset_processed, batch_size=PER_GPU_BATCH_SIZE, num_gpus=NUM_GPUS, shuffle=False
)

100%|██████████| 213450/213450 [00:00<00:00, 2918674.41B/s]


## Fine-tune AnswerExtractor

In [11]:
qa_extractor = AnswerExtractor(model_name=MODEL_NAME, cache_dir=CACHE_DIR)

In [None]:
with Timer() as t:
    qa_extractor.fit(train_dataloader,
                     num_epochs=NUM_EPOCHS,
                     learning_rate=LEARNING_RATE,
                     gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
                     seed=RANDOM_SEED,
                     cache_model=True)
print("Training time : {:.3f} hrs".format(t.interval / 3600)) 

## Predict
Note that `AnswerExtractor.predict` only outputs the probabilities of each token being the start and end of the answer span. `postprocess_bert_answer` and `postprocess_xlnet_answer` are two helper functions that postprocess these probabilities and generate the final answers.

In [14]:
qa_results = qa_extractor.predict(dev_dataloader)

Evaluating: 100%|██████████| 661/661 [04:42<00:00,  2.64it/s]


## Postprocess and Generate the Final Answers

In [15]:
final_answers, answer_probs, nbest_answers = qa_processor.postprocess(
    qa_results,
    examples_file="./cached_qa_features/cached_examples_test.jsonl",
    features_file="./cached_qa_features/cached_features_test.jsonl")
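
As a rough illustration of what this kind of postprocessing has to do, the sketch below picks the highest-scoring valid answer span from per-token start and end scores. It is a minimal sketch under stated assumptions, not the implementation of `qa_processor.postprocess`: `best_span`, `n_best`, and `max_answer_length` are hypothetical names, and the real step also has to merge overlapping segments and map tokens back to the original text.

```python
def best_span(start_scores, end_scores, n_best=20, max_answer_length=30):
    """Return (start, end, score) of the best span with start <= end."""
    def top_indices(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return order[:n_best]

    best = None
    for start in top_indices(start_scores):
        for end in top_indices(end_scores):
            # Skip spans that end before they start or that are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_scores[start] + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Toy scores for a five-token segment (illustrative values only).
start_scores = [0.1, 0.2, 5.0, 0.3, 0.1]
end_scores = [0.0, 4.0, 0.2, 3.5, 0.1]
span = best_span(start_scores, end_scores)
print(span)
```

Note that simply taking the best start and the best end independently would pair token 2 with token 1 here, which is not a valid span; searching over pairs avoids that.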

In [16]:
for i in [0, 10, 100]:
    print('Paragraph:')
    print(dev_df.iloc[i]['doc_text'])
    print()
    print('Question:')
    print(dev_df.iloc[i]['question_text'])
    print()
    print('Ground truth answers:')
    print(dev_df.iloc[i]['answer_text'])
    print()
    print('Predicted answer:')
    print(final_answers[dev_df.iloc[i]['qa_id']])
    print()
    print('Top N best answers')
    print(nbest_answers[dev_df.iloc[i]['qa_id']])
    print('-------------------------------------------------------------------------------------------------------------------')

Paragraph:
In August 1999, ABC premiered a special series event, Who Wants to Be a Millionaire, a game show based on the British program of the same title. Hosted throughout its ABC tenure by Regis Philbin, the program became a major ratings success throughout its initial summer run, which led ABC to renew Millionaire as a regular series, returning on January 18, 2000. At its peak, the program aired as much as six nights a week. Buoyed by Millionaire, during the 1999–2000 season, ABC became the first network to move from third to first place in the ratings during a single television season. Millionaire ended its run on the network's primetime lineup after three years in 2002, with Buena Vista Television relaunching the show as a syndicated program (under that incarnation's original host Meredith Vieira) in September of that year.

Question:
Who originally hosted Who Wants to Be a Millionaire for ABC?

Ground truth answers:
['Regis Philbin', 'Regis Philbin', 'Regis Philbin']

Predicted 

## Evaluate

Question answering is usually evaluated on two metrics: exact match (EM) and F1 score.   
Exact match is computed by first applying some simple normalization (e.g. removing punctuation and converting to lower case) to the ground truth and predicted answers, and then checking whether they match exactly.   
The F1 score is computed from token-level precision and recall by comparing the ground truth and predicted answers. 
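
The two metrics can be sketched as follows. The normalization follows the steps of the official SQuAD v1.1 evaluation script (lowercasing, stripping punctuation and English articles, collapsing whitespace); whether `evaluate_qa` applies exactly the same steps is an assumption, and the function names here are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_ground_truths(metric, prediction, ground_truths):
    """With several reference answers, score against the best-matching one."""
    return max(metric(prediction, truth) for truth in ground_truths)


print(exact_match("The Denver Broncos!", "Denver Broncos"))
print(f1_score("Denver Broncos team", "Denver Broncos"))
print(best_over_ground_truths(
    f1_score, "Levi's Stadium", ["Santa Clara, California", "Levi's Stadium"]
))
```

Per-question scores are then averaged over the dataset and reported as percentages.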

In [17]:
evaluation_result = evaluate_qa(actual_dataset=dev_dataset,
                                preds=final_answers)

{
  "exact": 86.6414380321665,
  "f1": 92.68221713064221,
  "total": 10570,
  "HasAns_exact": 86.6414380321665,
  "HasAns_f1": 92.68221713064221,
  "HasAns_total": 10570
}


The table below compares the running time and model performance of BERT, XLNet, and DistilBERT on a Standard_NC24rs_v3 DSVM.

|Model name|Training time|Scoring time|Exact Match (EM)|F1 score|
|:---------|:------------|:-----------|:--------------|--------|
|bert-large-cased-whole-word-masking| 3.4 hrs| ~ 5 mins|86.64|92.68|
|xlnet-large-cased|5.2 hrs| ~ 10 mins|84.67|91.69|
|distilbert-base-uncased|0.66 hr| ~ 1 min|76.62|84.71|

In [None]:
sb.glue("exact", evaluation_result["exact"])
sb.glue("f1", evaluation_result["f1"])

## References

1. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang, [*SQuAD: 100,000+ Questions for Machine Comprehension of Text*](https://arxiv.org/abs/1606.05250), EMNLP, 2016.
2. Pranav Rajpurkar, Robin Jia, Percy Liang, [*Know What You Don't Know: Unanswerable Questions for SQuAD*](https://arxiv.org/abs/1806.03822), ACL, 2018.