**Fine-Tuning of BERT Models for Question Answering**

This notebook fine-tunes pretrained Hugging Face BERT models for question answering. It is based on the example code that Hugging Face provides for fine-tuning a question-answering model [[Hugging Face Question Answering]](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=HFASsisvIrIb). The code has been modified to fine-tune on the SQuAD 2.0 and Google NQ datasets.

In [None]:
!pip install datasets transformers

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/54/90/43b396481a8298c6010afb93b3c1e71d4ba6f8c10797a7da8eb005e45081/datasets-1.5.0-py3-none-any.whl (192kB)
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/e7/27/1c0b37c53a7852f1c190ba5039404d27b3ae96a55f48203a74259f8213c9/xxhash-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl (243kB)
Collecting huggingface-hub<0.1.0
  Downloading https://files.pythonhosted.org/packages/af/07/bf95f398e6598202d878332280f36e589512174882536eb20d792532a57d/huggingface_hub-0.0.7-py3-none-any.whl
Collecting fsspec
  Downloading htt

In [None]:
########### INPUT ###########
# Input the base model, batch size, and whether or not the dataset contains non-answerable questions (squad_v2 = True)
model_checkpoint = "bert-base-uncased"
batch_size = 16
squad_v2 = True

Load Data

In [None]:
# import the load_dataset and load_metric for loading and evaluation of datasets
from datasets import load_dataset, load_metric

In [None]:
########### INPUT ###########
# Load the dataset to finetune model
import json
import pandas as pd
#datasets = load_dataset('json', data_files='/content/drive/MyDrive/ColabNotebooks/data/nq_train.jsonl')
datasets = load_dataset('json', data_files={'train': '/content/drive/MyDrive/ColabNotebooks/data/nq-qg_train.jsonl',
                                            'validation': '/content/drive/MyDrive/ColabNotebooks/data/nq-qg_validation.jsonl'})

Using custom data configuration default-4bc2bcae376ac672


Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-4bc2bcae376ac672/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02...


Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-4bc2bcae376ac672/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02. Subsequent calls will reuse this data.


In [None]:
# view dataset object
datasets

DatasetDict({
    train: Dataset({
        features: ['answers', 'context', 'id', 'question'],
        num_rows: 8486
    })
    validation: Dataset({
        features: ['answers', 'context', 'id', 'question'],
        num_rows: 2122
    })
})

In [None]:
# view example from dataset
datasets["train"][0]

{'answers': {'answer_start': [115], 'text': ['Natalie Portman']},
 'context': " Jackie is a 2016 biographical drama film directed by Pablo Larraín and written by Noah Oppenheim . The film stars Natalie Portman as Jackie Kennedy and tells the story of her life after the 1963 assassination of her husband John F. Kennedy . Peter Sarsgaard , Greta Gerwig , Billy Crudup , and John Hurt also star ; it was Hurt 's final film released before his death in January 2017 . ",
 'id': '16632414047311151991',
 'question': 'who played jackie kennedy in the film jackie'}
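The record above follows a SQuAD-style layout: `answers` holds parallel `answer_start` and `text` lists, and both lists are empty for an unanswerable question. Because the preprocessing later slices the context by character offset, it can help to check that each answer text really sits at its stated offset. The sketch below is one way to do that; the helper name `answer_spans_match` and the toy records are illustrative assumptions, not part of the original notebook.

```python
def answer_spans_match(record):
    """Return True if every answer text equals the context slice at its start offset."""
    context = record["context"]
    answers = record["answers"]
    # The two lists must be parallel: one start offset per answer text.
    if len(answers["answer_start"]) != len(answers["text"]):
        return False
    for start, text in zip(answers["answer_start"], answers["text"]):
        if context[start:start + len(text)] != text:
            return False
    return True


# Toy records (hypothetical, modelled on the example shown above).
answerable = {
    "context": "The film stars Natalie Portman as Jackie Kennedy .",
    "answers": {"answer_start": [15], "text": ["Natalie Portman"]},
}
unanswerable = {
    "context": "The film stars Natalie Portman as Jackie Kennedy .",
    "answers": {"answer_start": [], "text": []},
}
misaligned = {
    "context": "The film stars Natalie Portman as Jackie Kennedy .",
    "answers": {"answer_start": [14], "text": ["Natalie Portman"]},
}

print(answer_spans_match(answerable))
print(answer_spans_match(unanswerable))
print(answer_spans_match(misaligned))
```

On the real data, the same check could be run over `datasets["train"]` to count records whose offsets do not line up before training.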

In [None]:
# code to view random examples in pandas
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=3):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(datasets["train"])

Unnamed: 0,answers,context,id,question
0,"{'answer_start': [], 'text': []}","These north doors would serve as a votive offering to celebrate the sparing of Florence from relatively recent scourges such as the Black Death in 1348 . Many artists competed for this commission and a jury selected seven semifinalists . These finalists include Lorenzo Ghiberti , Filippo Brunelleschi , Donatello and Jacopo della Quercia , with 21 - year - old Ghiberti winning the commission . At the time of judging , only Ghiberti and Brunelleschi were finalists , and when the judges could not decide , they were assigned to work together on them . Brunelleschi 's pride got in the way , and he went to Rome to study architecture leaving Ghiberti to work on the doors himself . Ghiberti 's autobiography , however , claimed that he had won , `` without a single dissenting voice . '' The original designs of The Sacrifice of Isaac by Ghiberti and Brunelleschi are on display in the museum of the Bargello .",13296814297406529690,who won the competition in 1401 to design a set of doors for the florence baptistery
1,"{'answer_start': [1], 'text': ['The Pentagon is the headquarters of the United States Department of Defense']}","The Pentagon is the headquarters of the United States Department of Defense , located in Arlington County , Virginia , across the Potomac River from Washington , D.C. As a symbol of the U.S. military , The Pentagon is often used metonymically to refer to the U.S. Department of Defense .",16830472586737870505,what is the main purpose of the pentagon
2,"{'answer_start': [252], 'text': ['barnyard']}","After her father spares the life of a piglet from culling it as runt of the litter , a little girl named Fern Arable nurtures the piglet lovingly , naming him Wilbur . On greater maturity , Wilbur is sold to Fern 's uncle , Homer Zuckerman , in whose barnyard he is left yearning for companionship but is snubbed by other barn animals , until befriended by a barn spider named Charlotte , living on a web overlooking Wilbur 's enclosure . Upon Wilbur 's discovery that he is intended for slaughter , she promises to hatch a plan guaranteed to spare his life . Accordingly , she secretly weaves praise of him into her web , attracting publicity among Zuckerman 's neighbors who attribute the praise to divine intervention . As time passes , more inscriptions appear on Charlotte 's webs , increasing his renown . Therefore , Wilbur is entered in the county fair , accompanied by Charlotte and the rat Templeton , whom she employs in gathering inspiration for her messages . There , Charlotte spins an egg sac containing her 514 unborn children , and Wilbur , despite winning no prizes , is later celebrated by the fair 's staff and visitors ( thus made too prestigious alive to justify killing him ) . Exhausted apparently by laying eggs , Charlotte remains at the fair and dies shortly following Wilbur 's departure . Having returned to Zuckerman 's farm , Wilbur guards Charlotte 's egg sac and is saddened further when the new spiders depart shortly after hatching . The three smallest remain , however , and take up residence in the doorway where Charlotte used to live . Pleased at finding new friends , Wilbur names one of them Nellie , while the remaining two name themselves Joy and Aranea . The book then concludes by mentioning that more generations of spiders kept him company in subsequent years .",7181411033362897666,where does the story charlotte web take place


Preprocess Training Data

In [None]:
# import the correct tokenizer for the model architecture
from transformers import AutoTokenizer    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
# verify that the tokenizer is a fast tokenizer
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [None]:
# run a check to verify that the tokenizer works on an example question and answer
tokenizer("Is this tokenizer working?", "Yes, the tokenizer is working correctly")

{'input_ids': [101, 2003, 2023, 19204, 17629, 2551, 1029, 102, 2748, 1010, 1996, 19204, 17629, 2003, 2551, 11178, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
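The output above shows how the tokenizer packs a pair of texts into one sequence: `[CLS] first [SEP] second [SEP]`, with `token_type_ids` equal to 0 up to and including the first `[SEP]` and 1 afterwards. The sketch below reproduces that layout from the token counts alone (6 tokens in the first text and 8 in the second, as counted from the `input_ids` above). `pair_layout` is a hypothetical helper written for illustration; it does not call the tokenizer.

```python
def pair_layout(n_first, n_second):
    """Build token_type_ids and attention_mask for [CLS] first [SEP] second [SEP]."""
    # Segment 0 covers [CLS], the first text, and the first [SEP].
    # Segment 1 covers the second text and the final [SEP].
    token_type_ids = [0] * (n_first + 2) + [1] * (n_second + 1)
    # Without padding, every position is attended to.
    attention_mask = [1] * len(token_type_ids)
    return token_type_ids, attention_mask


token_type_ids, attention_mask = pair_layout(6, 8)
print(token_type_ids)
print(attention_mask)
```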

In [None]:
########### INPUT ###########
# Length will be truncated to handle long contexts
# Set the max length (questions and context) and stride (context overlap)
max_length = 384
doc_stride = 128 

In [None]:
# find an example that is longer than max_length so that truncation can be verified
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break
example = datasets["train"][i]

In [None]:
# check to see what the length of the example is without truncation (should be greater than max_length)
len(tokenizer(example["question"], example["context"])["input_ids"])

398

In [None]:
# the tokenizer should always return the question plus a truncated context
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

In [None]:
# verify the lengths of the multiple features produced for the tokenized example
[len(x) for x in tokenized_example["input_ids"]]

[384, 154]
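The two lengths can be reproduced by hand. The untruncated example has 398 tokens: 9 question tokens, 3 special tokens (`[CLS]` and two `[SEP]`), and therefore 386 context tokens. Each feature can hold 384 - 12 = 372 context tokens, and each new feature starts 128 tokens before the end of the previous one. The sketch below models that sliding window in plain Python. It is a simplified model under those assumptions, not the tokenizer's actual implementation, and `window_lengths` is a hypothetical name.

```python
def window_lengths(n_question, n_context, max_length=384, stride=128, n_special=3):
    """Return the total token length of each overlapping feature."""
    overhead = n_question + n_special
    capacity = max_length - overhead  # context tokens that fit in one feature
    if capacity <= stride:
        raise ValueError("stride must be smaller than the context capacity")
    lengths = []
    start = 0
    while True:
        end = min(start + capacity, n_context)
        lengths.append(overhead + (end - start))
        if end == n_context:
            break
        # The next window re-reads the last `stride` context tokens.
        start = end - stride
    return lengths


print(window_lengths(n_question=9, n_context=386))
```

With the counts from this example, the first feature is full (372 context tokens) and the second holds the remaining 14 context tokens plus the 128-token overlap.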

In [None]:
# decode the tokenized example to verify that we have a question plus the truncated context
for x in tokenized_example["input_ids"][:]:
    print(tokenizer.decode(x))

[CLS] where is a good year filmed with russell crowe [SEP] the film was shot throughout nine weeks in 2005, mostly in locations scott described as ` ` eight minutes from my house''. french locations were filmed at bonnieux, cucuron and gordes in vaucluse, marseille provence airport, and the rail station in avignon. london locations included albion riverside in battersea, broadgate, the bluebird cafe on king's road in chelsea, and criterion restaurant in piccadilly circus. the scene with the tennis match between max and duflot was added on the set, replacing an argument at the vines to provide ` ` a battle scene''. as the swimming pool on chateau la canorgue did not fit the one scott had envisioned from the scene, only the scenes outside the pool were filmed there. the one after max had fallen was dug and concreted nearby, and the original one had its bottom replaced digitally to match. the production team could not film the wine cave from la canorgue as they shot during the period wher

In [None]:
# use the tokenizer to map the offset for locating the answer
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 5), (6, 8), (9, 10), (11, 15), (16, 20), (21, 27), (28, 32), (33, 40), (41, 46), (0, 0), (1, 4), (5, 9), (10, 13), (14, 18), (19, 29), (30, 34), (35, 40), (41, 43), (44, 48), (49, 50), (51, 57), (58, 60), (61, 70), (71, 76), (77, 86), (87, 89), (90, 91), (91, 92), (93, 98), (99, 106), (107, 111), (112, 114), (115, 120), (121, 122), (122, 123), (124, 125), (126, 132), (133, 142), (143, 147), (148, 154), (155, 157), (158, 164), (164, 166), (167, 168), (169, 171), (171, 173), (173, 176), (177, 180), (181, 183), (183, 187), (188, 190), (191, 193), (193, 195), (195, 198), (198, 199), (200, 201), (202, 211), (212, 220), (221, 228), (229, 230), (231, 234), (235, 238), (239, 243), (244, 251), (252, 254), (255, 257), (257, 262), (263, 264), (265, 271), (272, 281), (282, 290), (291, 297), (298, 307), (308, 310), (311, 317), (317, 320), (321, 322), (323, 328), (328, 332), (333, 334), (335, 338), (339, 343), (343, 347), (348, 352), (353, 355), (356, 360), (361, 362), (362, 363), (364,

In [None]:
# verify that the mapping is working correctly
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

where where


In [None]:
# use the tokenizer to find sequence ids for locating the position of the question and answer
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [None]:
# answers = example["answers"]
# start_char = answers["answer_start"]
# start_char
example

{'answers': {'answer_start': [], 'text': []},
 'context': " The film was shot throughout nine weeks in 2005 , mostly in locations Scott described as `` eight minutes from my house '' . French locations were filmed at Bonnieux , Cucuron and Gordes in Vaucluse , Marseille Provence Airport , and the rail station in Avignon . London locations included Albion Riverside in Battersea , Broadgate , the Bluebird Cafe on King 's Road in Chelsea , and Criterion Restaurant in Piccadilly Circus . The scene with the tennis match between Max and Duflot was added on the set , replacing an argument at the vines to provide `` a battle scene '' . As the swimming pool on Chateau La Canorgue did not fit the one Scott had envisioned from the scene , only the scenes outside the pool were filmed there . The one after Max had fallen was dug and concreted nearby , and the original one had its bottom replaced digitally to match . The production team could not film the wine cave from La Canorgue as they shot duri

In [None]:
# identify the first and last token of the answer in the context or return no answer
# the example found above has no answer (its answer_start list is empty), so check for that before indexing

answers = example["answers"]
start_position = end_position = None

if len(answers["answer_start"]) == 0:
    print("This example has no answer.")
else:
    # locate the start and end character of answer
    start_char = answers["answer_start"][0]
    end_char = start_char + len(answers["text"][0])

    # Start token index of the current span in the text.
    token_start_index = 0
    while sequence_ids[token_start_index] != 1:
        token_start_index += 1

    # End token index of the current span in the text.
    token_end_index = len(tokenized_example["input_ids"][0]) - 1
    while sequence_ids[token_end_index] != 1:
        token_end_index -= 1

    # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
    offsets = tokenized_example["offset_mapping"][0]
    if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
        # Move the token_start_index and token_end_index to the two ends of the answer.
        # Note: we could go after the last offset if the answer is the last word (edge case).
        while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
            token_start_index += 1
        start_position = token_start_index - 1
        while offsets[token_end_index][1] >= end_char:
            token_end_index -= 1
        end_position = token_end_index + 1
        print(start_position, end_position)
    else:
        print("The answer is not in this feature.")

This example has no answer.

In [None]:
# verify that the start and end tokens produced are the correct answer
# skip the check when no answer span was found in the previous cell
if start_position is not None:
    print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
    print(answers["text"][0])
else:
    print("No answer span to verify for this feature.")

No answer span to verify for this feature.

In [None]:
# To make this notebook generalizable to any model, we account for the special case where the model expects padding on the left
pad_on_right = tokenizer.padding_side == "right"

In [None]:
# This function combines the above methods by tokenizing each example with truncation and padding

def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. As a result,
    # one example can produce several features when its context is long, and each of those features has a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
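The core of the labelling step is converting a character span in the context into a token span by using the offset mapping. The sketch below isolates that step on toy data so it can be checked by hand. It is a variant of the loop above rather than a copy: it restricts the search to context tokens, so it never walks onto special or padding tokens whose offsets are `(0, 0)`. The function name `char_span_to_token_span` and the toy offsets are assumptions made for illustration.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char, context_id=1):
    """Map a character span in the context to (start_token, end_token), or None if it is not covered."""
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == context_id]
    if not context_tokens:
        return None
    first, last = context_tokens[0], context_tokens[-1]
    # The answer must lie entirely inside this feature's slice of the context.
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return None
    # Advance to the last context token that starts at or before start_char.
    start_token = first
    while start_token < last and offsets[start_token + 1][0] <= start_char:
        start_token += 1
    # Retreat to the first context token that ends at or after end_char.
    end_token = last
    while end_token > first and offsets[end_token - 1][1] >= end_char:
        end_token -= 1
    return start_token, end_token


# Toy feature: [CLS] who [SEP] natalie portman stars [SEP]
# Context string: "Natalie Portman stars"
toy_offsets = [(0, 0), (0, 3), (0, 0), (0, 7), (8, 15), (16, 21), (0, 0)]
toy_sequence_ids = [None, 0, None, 1, 1, 1, None]

print(char_span_to_token_span(toy_offsets, toy_sequence_ids, 8, 15))   # "Portman"
print(char_span_to_token_span(toy_offsets, toy_sequence_ids, 16, 21))  # "stars", the last word
print(char_span_to_token_span(toy_offsets, toy_sequence_ids, 30, 35))  # outside this feature
```

A `None` result corresponds to the case that `prepare_train_features` labels with the CLS index.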

In [None]:
# the function can work on multiple features. Verify that the tokenization is working correctly
features = prepare_train_features(datasets['train'][:2])
features

{'input_ids': [[101, 2040, 2209, 9901, 5817, 1999, 1996, 2143, 9901, 102, 9901, 2003, 1037, 2355, 16747, 3689, 2143, 2856, 2011, 11623, 2474, 11335, 2378, 1998, 2517, 2011, 7240, 6728, 11837, 8049, 1012, 1996, 2143, 3340, 10829, 3417, 2386, 2004, 9901, 5817, 1998, 4136, 1996, 2466, 1997, 2014, 2166, 2044, 1996, 3699, 10102, 1997, 2014, 3129, 2198, 1042, 1012, 5817, 1012, 2848, 18906, 28745, 26526, 2094, 1010, 26111, 16216, 2099, 16279, 1010, 5006, 13675, 6784, 6279, 1010, 1998, 2198, 3480, 2036, 2732, 1025, 2009, 2001, 3480, 1005, 1055, 2345, 2143, 2207, 2077, 2010, 2331, 1999, 2254, 2418, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
# apply the function to every example in each split of the dataset (here, training and validation)
# remove the old columns since the preprocessing changes the number of samples
# results are cached. Pass "load_from_cache_file=False" to force the preprocessing to be applied again
tokenized_datasets = datasets.map(
    prepare_train_features, 
    batched=True, 
    remove_columns=datasets["train"].column_names
)

Fine-Tune the Model

In [None]:
# import Pytorch pretrained model for question answering
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

# from_pretrained method downloads and caches the model
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# the warning about unused and newly initialized weights is expected: the pretraining
# heads of the checkpoint are discarded and replaced with a question-answering head whose
# weights are randomly initialized, so the model must be fine-tuned before it is useful

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

In [None]:
########### INPUT ###########
# TrainingArguments is a class that contains the attributes used to customize training
# set the output folder name ("model-dataset"), which will be used to save checkpoints
# set the learning_rate, number of epochs, and weight_decay
# batch_size has been set at the beginning of the notebook
args = TrainingArguments(
    "bert-nq-qg",
    evaluation_strategy = "epoch",
    learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    num_train_epochs = 3,
    weight_decay = 0.01,
)

In [None]:
# import a default data collator
from transformers import default_data_collator

# set the data_collator to the default data collator
data_collator = default_data_collator

In [None]:
# pass all of the training arguments and datasets to the trainer
trainer = Trainer(
    model,
    args,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["validation"],
    data_collator = data_collator,
    tokenizer = tokenizer,
)

In [None]:
# finetune the model by calling train method
# running this cell will take time.
trainer.train()

Epoch,Training Loss,Validation Loss,Runtime,Samples Per Second
1,2.5705,1.892981,27.7902,76.862
2,1.6879,1.727196,27.7791,76.892
3,1.3459,1.764431,27.7789,76.893


TrainOutput(global_step=1608, training_loss=1.8292019307909912, metrics={'train_runtime': 1125.45, 'train_samples_per_second': 1.429, 'total_flos': 6453390021792768.0, 'epoch': 3.0, 'init_mem_cpu_alloc_delta': 338434, 'init_mem_gpu_alloc_delta': 436709888, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 684408, 'train_mem_gpu_alloc_delta': 1308423168, 'train_mem_cpu_peaked_delta': 95174331, 'train_mem_gpu_peaked_delta': 8289920000})
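The `global_step=1608` reported above can be related to the training arguments. Assuming a single device and no gradient accumulation (neither is stated in the notebook), the number of optimizer steps is `ceil(features / batch_size) * epochs`. With `batch_size = 16` and 3 epochs, 1608 steps means 536 steps per epoch, which implies between 8561 and 8576 training features after overflow splitting. The sketch below checks that arithmetic; the feature count 8570 is a hypothetical value inside that range, since the notebook does not print the real one.

```python
import math


def total_training_steps(n_features, batch_size, num_epochs):
    """Optimizer steps for one device with no gradient accumulation (assumed setup)."""
    steps_per_epoch = math.ceil(n_features / batch_size)
    return steps_per_epoch * num_epochs


# 8570 is a hypothetical feature count, chosen within the range implied by 1608 steps.
print(total_training_steps(n_features=8570, batch_size=16, num_epochs=3))
```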

In [None]:
########### INPUT ###########
# save the model. input the model name ("model-dataset-trained")
trainer.save_model("bert-nq-qg-trained")

Evaluation

In [None]:
# the validation set is preprocessed in the same way as the training set, but without answer labels
# post-processing must check that a predicted span lies inside the context (and not in the question)
# and must recover the answer text, so the example ids and offset mappings are kept

def prepare_validation_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. As a result,
    # one example can produce several features when its context is long, and each of those features has a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

In [None]:
# apply the function to validation set 
# remove the old columns since the preprocessing changes the number of samples
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

In [None]:
# extract the predictions for all features using method trainer.predict
raw_predictions = trainer.predict(validation_features)

In [None]:
# trainer hides columns not used by the model. the columns needed for post-processing are set back 
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

In [None]:
########### INPUT ###########
# to classify answers, we use the score obtained by adding the start and end logits
# limit the number of possible answers by setting n_best_size
# limit the length of the answer by setting max_answer_length
n_best_size = 20
max_answer_length = 30
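The scoring rule described in the comments above can be tried on toy logits without a trained model. The sketch below ranks candidate spans by `start_logit + end_logit`, keeps only the `n_best_size` highest start and end positions, and discards spans that fall outside the context, run backwards, or exceed `max_answer_length`. The function name `rank_spans`, the boolean `context_mask`, and the toy logits are assumptions made for illustration.

```python
def rank_spans(start_logits, end_logits, context_mask, n_best_size=20, max_answer_length=30):
    """Return (score, start, end) tuples sorted from best to worst."""
    def top_indexes(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    candidates = []
    for start in top_indexes(start_logits):
        for end in top_indexes(end_logits):
            # Both ends must be context tokens (not question, special, or padding tokens).
            if not (context_mask[start] and context_mask[end]):
                continue
            # Skip spans with negative length or more than max_answer_length tokens.
            if end < start or end - start + 1 > max_answer_length:
                continue
            candidates.append((start_logits[start] + end_logits[end], start, end))
    return sorted(candidates, reverse=True)


# Toy feature with 7 tokens; only positions 3 to 5 belong to the context.
toy_start = [0.1, 0.2, 0.0, 3.0, 1.0, 0.5, 0.0]
toy_end = [0.1, 0.0, 0.0, 0.5, 2.5, 1.0, 0.0]
toy_mask = [False, False, False, True, True, True, False]

ranked = rank_spans(toy_start, toy_end, toy_mask)
print(ranked[0])
```

Since `prepare_train_features` labels unanswerable questions with the CLS index, the score at position 0 (`toy_start[0] + toy_end[0]`) can be read as the "no answer" score and compared against the best span when `squad_v2` is true.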

In [None]:
# get the output logits from trainer
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

In [None]:
# code to verify the score and corresponding text are working correctly
import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to match the example_id to
# an example index
context = datasets["validation"][0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char: end_char]
                }
            )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

[{'score': 6.4018435,
  'text': 'French table service and that of much of the English - speaking world ( apart from the United States and parts of Canada'},
 {'score': 5.960499,
  'text': 'French table service and that of much of the English - speaking world'},
 {'score': 5.6395655, 'text': 'French table service'},
 {'score': 4.8673234, 'text': 'French'},
 {'score': 4.5820923,
  'text': 'in modern French table service and that of much of the English - speaking world ( apart from the United States and parts of Canada'},
 {'score': 4.451668,
  'text': 'French table service and that of much of the English - speaking world ( apart from the United States and parts of Canada )'},
 {'score': 4.292549, 'text': 'a dish served before the main course of a meal'},
 {'score': 4.1407475,
  'text': 'in modern French table service and that of much of the English - speaking world'},
 {'score': 4.138939,
  'text': 'French table service and that of much of the English - speaking world ( apart from the Un

In [None]:
# view the actual answer
datasets["validation"][0]["answers"]

{'answer_start': [73],
 'text': ['modern French table service and that of much of the English - speaking world ( apart from the United States and parts of Canada )']}
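The verification cell above turns a pair of token indices into text through `offset_mapping`. A minimal sketch of that lookup follows, assuming (as the `is None` checks suggest) that each entry is a `(start_char, end_char)` pair and that tokens outside the context carry `None`. The context, the offsets, and the helper `span_text` are hypothetical and written by hand.

```python
# Hypothetical context and per-token character offsets, written by hand.
toy_context = "Paris is the capital of France."
toy_offsets = [None, (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), (30, 31), None]

def span_text(context, offsets, start_index, end_index):
    """Return the context substring covered by tokens start_index..end_index, or None."""
    # Reject spans that touch a token outside the context (question or special tokens).
    if offsets[start_index] is None or offsets[end_index] is None:
        return None
    # Reject spans that end before they start.
    if end_index < start_index:
        return None
    start_char = offsets[start_index][0]  # first character of the start token
    end_char = offsets[end_index][1]      # one past the last character of the end token
    return context[start_char:end_char]

print(span_text(toy_context, toy_offsets, 3, 6))  # the capital of France
print(span_text(toy_context, toy_offsets, 0, 2))  # None
```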

In [None]:
# apply the process above to all features by mapping between examples and their corresponding features
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
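A long context can be split into several features, so one example may own more than one feature index. The sketch below repeats the grouping on hand-written ids; all `toy_` names and values are hypothetical.

```python
import collections

# Hypothetical ids: example "q2" was split into two features.
toy_example_ids = ["q1", "q2", "q3"]
toy_feature_example_ids = ["q1", "q2", "q2", "q3"]

# Map each example id to its position in the examples list.
toy_id_to_index = {k: i for i, k in enumerate(toy_example_ids)}

# Collect, for each example index, the indices of the features built from it.
toy_features_per_example = collections.defaultdict(list)
for i, example_id in enumerate(toy_feature_example_ids):
    toy_features_per_example[toy_id_to_index[example_id]].append(i)

print(dict(toy_features_per_example))  # {0: [0], 1: [1, 2], 2: [3]}
```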

In [None]:
# to handle the non-answerable questions, we need to extract the score for the impossible (null) answer
# the null score is the minimum, over all features generated by the example, of the CLS token score
# (start logit + end logit at the CLS position)
# the question is predicted as not answerable when the null score is greater than the highest answerable score
import collections

from tqdm.auto import tqdm
import numpy as np

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map from each example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # These are the indices of the features associated with the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # The offset mapping lets us map positions in the logits back to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update the minimum null score: keep the smallest CLS score seen across this example's features.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score > feature_null_score:
                min_null_score = feature_null_score

            # Go through all combinations of the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where there is not a single non-null prediction, we create a fake
            # prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
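The final choice between the best span and the empty answer can be isolated in a few lines. This is a minimal sketch that follows the rule stated in the cell's header comment (null score = minimum CLS score over the example's features); `pick_answer` and the scores are hypothetical and not part of the notebook's pipeline.

```python
def pick_answer(best_text, best_score, feature_null_scores):
    """Return the best span text, or "" when the null answer scores at least as high."""
    # One CLS score (start logit + end logit) per feature of the example.
    min_null_score = min(feature_null_scores)
    return best_text if best_score > min_null_score else ""

# Made-up scores: the example was split into two features.
print(repr(pick_answer("French table service", 5.0, [6.0, 3.0])))  # 'French table service'
print(repr(pick_answer("French table service", 2.0, [6.0, 3.0])))  # ''
```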

In [None]:
# apply the postprocessing function to the raw predictions
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Post-processing 2122 example predictions split into 2136 features.


In [None]:
########### INPUT ###########
# load the metric from the datasets library
metric = load_metric("squad_v2")

In [None]:
# format predictions and labels
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'HasAns_exact': 44.652908067542214,
 'HasAns_f1': 55.02339978974004,
 'HasAns_total': 1599,
 'NoAns_exact': 61.376673040152966,
 'NoAns_f1': 61.376673040152966,
 'NoAns_total': 523,
 'best_exact': 48.82186616399623,
 'best_exact_thresh': 0.0,
 'best_f1': 56.63638843722632,
 'best_f1_thresh': 0.0,
 'exact': 48.77474081055608,
 'f1': 56.589263083786214,
 'total': 2122}
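To make the `exact` and `f1` numbers easier to read, here is a simplified sketch of SQuAD-style scoring for a single prediction against a single reference. It is an approximation written for illustration, not the code the `squad_v2` metric runs. The helper names are assumptions, and the real metric also takes the best score over several reference answers.

```python
import collections
import re
import string

def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))

def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # If either side is empty (no-answer case), score 1 only when both are empty.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("French table service", "modern French table service"))  # 0
print(round(token_f1("French table service", "modern French table service"), 3))  # 0.857
```

A partially correct span such as the one above scores 0 on exact match but still earns F1 credit, which is why `HasAns_f1` is higher than `HasAns_exact` in the results.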