**Fine-Tuning of BERT Models for Question Answering**

This notebook uses pretrained Hugging Face BERT models and finetunes the models for question answering tasks. This notebook uses example code provided by Hugging Face for finetuning a model for question answering [[Hugging Face Question Answering]](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=HFASsisvIrIb). The code has been modified to finetune on the SQuAD 2.0 and Google NQ datasets.

In [1]:
!pip install datasets transformers

Collecting datasets
[?25l  Downloading https://files.pythonhosted.org/packages/54/90/43b396481a8298c6010afb93b3c1e71d4ba6f8c10797a7da8eb005e45081/datasets-1.5.0-py3-none-any.whl (192kB)
[K     |████████████████████████████████| 194kB 5.8MB/s 
[?25hCollecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
[K     |████████████████████████████████| 2.0MB 34.6MB/s 
[?25hCollecting fsspec
[?25l  Downloading https://files.pythonhosted.org/packages/62/11/f7689b996f85e45f718745c899f6747ee5edb4878cadac0a41ab146828fa/fsspec-0.9.0-py3-none-any.whl (107kB)
[K     |████████████████████████████████| 112kB 40.6MB/s 
[?25hCollecting huggingface-hub<0.1.0
  Downloading https://files.pythonhosted.org/packages/af/07/bf95f398e6598202d878332280f36e589512174882536eb20d792532a57d/huggingface_hub-0.0.7-py3-none-any.whl
Collecting xxhash
[?25l  Downloading https://fi

In [2]:
########### INPUT ###########
# Input the base model, batch size, and whether or not the dataset contains non-answerable questions (squad_v2 = True)
model_checkpoint = "bert-base-uncased"
batch_size = 16
squad_v2 = True

Load Data

In [3]:
# import the load_dataset and load_metric for loading and evaluation of datasets
from datasets import load_dataset, load_metric

In [4]:
########### INPUT ###########
# Load the dataset to finetune model
import json
import pandas as pd
#datasets = load_dataset('json', data_files='/content/drive/MyDrive/ColabNotebooks/data/nq_train.jsonl')
datasets = load_dataset('json', data_files={'train': '/content/drive/MyDrive/ColabNotebooks/data/nq-cl-reduce_train.jsonl',
                                              'validation': '/content/drive/MyDrive/ColabNotebooks/data/nq-cl-reduce_validation.jsonl'})

Using custom data configuration default-ad5edf2efffd1c1c


Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-ad5edf2efffd1c1c/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02...


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))



HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-ad5edf2efffd1c1c/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02. Subsequent calls will reuse this data.


In [5]:
# from google.colab import drive
# drive.mount('/content/drive')

In [6]:
# view dataset object
datasets

DatasetDict({
    train: Dataset({
        features: ['answers', 'context', 'id', 'question'],
        num_rows: 7856
    })
    validation: Dataset({
        features: ['answers', 'context', 'id', 'question'],
        num_rows: 1964
    })
})

In [7]:
# view example from dataset
datasets["train"][0]

{'answers': {'answer_start': [4311], 'text': ['late twentieth century']},
 'context': " Automobile air conditioning ( also called A / C ) systems use air conditioning to cool the air in a vehicle .       A company in New York City in the United States first offered installation of air conditioning for cars in 1933 . Most of their customers operated limousines and luxury cars .   In 1939 , Packard became the first automobile manufacturer to offer an air conditioning unit in its cars . These were manufactured by Bishop and Babcock Co , of Cleveland , Ohio . The `` Bishop and Babcock Weather Conditioner '' also incorporated a heater . Cars ordered with the new `` Weather Conditioner '' were shipped from Packard 's East Grand Boulevard facility to the B&B factory where the conversion was performed . Once complete , the car was shipped to a local dealer where the customer would take delivery .   Packard fully warranted and supported this conversion , and marketed it well . However , it was 

In [8]:
# code to view random examples in pandas
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=3):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [9]:
show_random_elements(datasets["train"])

Unnamed: 0,answers,context,id,question
0,"{'answer_start': [113], 'text': ['3 November 2017']}","The Thrill of It All is the second studio album by English singer and songwriter Sam Smith . It was released on 3 November 2017 through Capitol Records . On 6 October 2017 , Smith announced via Twitter that his second album , titled `` The Thrill of It All '' , was to be released on 3 November 2017 . It is Smith 's second full album of material after his hugely successful debut album In the Lonely Hour ( 2014 ) , which has sold 12 million copies worldwide . Speaking to Billboard about the album , Smith said : `` I went through , like , this vortex , came out , I feel like I 've rebuilt myself as a stronger thing and I 'm just gon na go into the vortex again , '' he says in a preview that features a montage of studio sessions . `` I was n't trying to make a big pop record when I made this album . I was actually just trying to make something personal and like a diary . '' `` Too Good at Goodbyes '' was released as the album 's lead single on 8 September 2017 . It topped the UK Singles Chart and made the top five on the US Billboard Hot 100 . `` One Last Song '' was sent to radio in the United Kingdom on 3 November 2017 , on the day of album release as its second single . On 6 October 2017 , Smith released `` Pray '' , a gospel - tinged ballad in collaboration with Timbaland , prompted by time spent in Iraq with the charity War Child as a promotional single from the album . Another promotional single , `` Burning '' , was released on 27 October 2017 . The Thrill of It All received generally positive reviews from music critics . On Metacritic , which assigns a normalised rating out of 100 to reviews from mainstream publications , the album received an average score of 72 based on 16 reviews . Neil McCormick of The Daily Telegraph gave the album four stars , and was highly positive about it and Smith 's vocals , calling them `` supernatural '' and saying : `` The Thrill of It All does n't just wallow in love 's misery , it practically drowns in the stuff . Its 10 songs are almost unrelentingly miserable , self - absorbed and self - pitying , verging on the lachrymose and sentimental ( as lovers in the midst of a break - up often are ) . The instrumentation is understated piano and strings blended with just the occasional hint of contemporary hip - hop effects . At times , Smith 's lyrics display a slightly clunking prosaicness . There 's not much poetry in lines such as ' real love is never a waste of time ' or ' there ' ' no insurance to pay for the damage ' . Yet it all hits home , because Smith makes every note sound like a matter of life and death . ' Him ' is the album 's centrepiece , a gospel drama addressed to a judgmental ' father ' , insisting on Smith 's right to love whom he chooses . It is a kind of hymn to Him , and as the choir powers up it gains a righteous glory . '' Andy Gill from The Independent also gave a four - star review , and shared in the positivity about t",51171585290469459,when did the thrill of it all come out
1,"{'answer_start': [98], 'text': ['1979']}","A `` Happy Meal '' is a form of kids ' meal sold at the fast - food chain McDonald 's since June 1979 . A toy is included with the food , both of which are usually contained in a box or paper bag with the McDonald 's logo . The packaging and toy are frequently part of a marketing tie - in to an existing television show , film , or toy brand . The Happy Meal contains a main item ( typically a hamburger , cheeseburger , or small serving of Chicken McNuggets ) , a side item ( french fries , apple slices , or a salad in some areas ) , and a drink ( milk , juice , or a soft drink ) . The choice of items changes from country to country , and may depend on the size of the restaurant . In some countries , the choices have been expanded to include items such as a grilled cheese sandwich ( known as a `` Fry Kid '' ) , or more healthy options such as apple slices , a mini snack wrap , salads , or pasta , as one or more of the options . In most countries , McDonald 's has introduced a `` healthy option '' to the Happy Meal . Children have always been able to choose milk with their Happy Meal and the chain added fruit juice drink instead of a soft drink , and bags of dried fruit ( or a whole piece of fruit such as an apple or carrot sticks ) in place of fries . In some regions different names are used . In French Canada , it is called `` Joyeux Festin '' ( literally meaning Happy Feast in Canadian French ) . In Latin America and Puerto Rico ( not so in Spain ) it is known as Cajita Feliz ( Happy little box in Latin Spanish ) . In Brazil it is known as McLanche Feliz ( Happy McSnack in Brazilian Portuguese ) . In Japan , it was called Oko sama Lunch from 1987 to 1988 , then Okosama Set from 1988 to 1995 ( Okosama is a polite word for `` child '' ) , before being renamed to Happy Set . In Germany , it was known as Juniortüte ( Bag for Juniors in German ) until 1999 . In the mid-1970s , Yolanda Fernández de Cofiño began working with her husband operating McDonald 's restaurants in Guatemala . She created what she called the `` Menu Ronald '' ( Ronald menu ) , which offered a hamburger , small fries and a small sundae to help mothers feed their children more effectively while at McDonald 's restaurants . The concept was eventually brought to the attention of McDonald 's management in Chicago . The company gave the development of the product to Bob Bernstein , founder and CEO of Bernstein - Rein , an agency that has counted McDonald 's as a key client since 1967 . Bernstein came up with the Happy Meal . In 1977 , the McDonald 's restaurant owner clients who regularly met with Bernstein were looking for ways to create a better experience for families with kids . Bernstein reasoned that if kids could get a packaged meal all their own instead of just picking at their parent 's food , everybody would be happier . He had often noticed his young son at the breakfast table poring over the various items on cereal boxes and thought , `` Why not do that for McDonald 's ? The package is the key ! '' He then called in his creative team and had them mock up some paperboard boxes fashioned to resemble lunch pails with the McDonald 's Golden Arches for handles . They called in nationally known children 's illustrators and offered them the blank slate of filling the box 's sides and tops with their own colorful ideas from art to jokes to games to comic strips to stories to fantasy : whatever they thought might appeal to kids , at least 8 items per box . Inside the box would be a burger , small fries , packet of cookies and a surprise gift . A small drink would accompany these items . Bernstein named it The Happy Meal and it was successfully introduced with television and radio spots and in - store posters in the Kansas City market in October 1977 . Other markets followed and the national roll - out hap",2449200124862142011,when did mcdonald's start giving out toys
2,"{'answer_start': [], 'text': []}","Fight Club is a 1996 novel by Chuck Palahniuk . It follows the experiences of an unnamed protagonist struggling with insomnia . Inspired by his doctor 's exasperated remark that insomnia is not suffering , the protagonist finds relief by impersonating a seriously ill person in several support groups . Then he meets a mysterious man named Tyler Durden and establishes an underground fighting club as radical psychotherapy . In 1999 , director David Fincher adapted the novel into a film of the same name , starring Brad Pitt and Edward Norton . The film acquired a cult following despite lower than expected box - office results . The film 's prominence heightened the profile of the novel and that of Palahniuk . The sequel Fight Club 2 was released in comic book form in May 2015 . Fight Club centers on an anonymous narrator , who works as a product recall specialist for an unnamed car company . Because of the stress of his job and the jet lag brought upon by frequent business trips , he begins to suffer from recurring insomnia . When he seeks treatment , his doctor advises him to visit a support group victims to `` see what real suffering is like '' . He finds that sharing the problems of others -- despite not having testicular cancer himself -- alleviates his insomnia . The narrator 's unique treatment works until he meets Marla Singer , another `` tourist '' who visits the support group under false pretenses . The possibly disturbed Marla reminds the narrator that he is a faker who does not belong there . He begins to hate Marla for keeping him from crying , and , therefore , from sleeping . After a confrontation , the two agree to attend separate support group meetings to avoid each other . The truce is uneasy , and the narrator 's insomnia returns . While on a nude beach , the narrator meets Tyler Durden , a charismatic extremist of mysterious means . After an explosion destroys the narrator 's condominium , he asks to stay at Tyler 's house . Tyler agrees , but asks for something in return : `` I want you to hit me as hard as you can . '' Both men find that they enjoy the ensuing fistfight . They subsequently move in together and establish a `` fight club '' , drawing numerous men with similar temperaments into bare - knuckle fighting matches , set to the following rules : Later in the book , a mechanic tells the narrator about two new rules of the fight club : nobody is the center of the fight club except for the two men fighting , and the fight club will always be free . Marla , noticing that the narrator has not recently attended his support groups , calls him to claim that she has overdosed on Xanax in a half - hearted suicide attempt . Tyler returns from work , picks up the phone to Marla 's drug - induced rambling , and rescues her . Tyler and Marla embark on an uneasy affair that confounds the narrator and confuses Marla . Throughout this affair , Marla is unaware both of fight club 's existence and the interaction between Tyler and the narrator . Because Tyler and Marla are never seen at the same time , the narrator wonders whether Tyler and Marla are the same person . As fight club attains a nationwide presence , Tyler uses it to spread his anti-consumerist ideas , recruiting fight club 's members to participate in increasingly elaborate pranks on corporate America . He eventually gathers the most devoted fight club members and forms `` Project Mayhem '' , a cult - like organization that trains itself as an army to bring down modern civilization . This organization , like fight club , is controlled by a set of rules : While initially a loyal participant in Project Mayhem , the narrator becomes uncomfortable with the increasing destructiveness of its activities . He resolves to stop Tyler and his followers when Bob , a friend from the testicular cancer support group , is killed during one of Project Mayhem 's sabotage operations . The narrator then learns that he himself is Tyler Durden . As the narrator 's mental state deteriorated , his mind formed a new personality that was able to escape from the problems of his life . Marla inadvertently reveals to the narrator that he and Tyler are the same person . Tyler 's affair with Marla -- whom the narrator professes to dislike -- was the narrator 's own affair with Marla . The narrator 's bouts of insomnia had been Tyler 's personality surfacing ; Tyler was active whenever the narrator was `` sleeping '' . The Tyler personality not only created fight club , he also blew up the Narrator 's condo . Tyler plans to blow up a skyscraper using homemade bombs created by Project Mayhem ; the target of the explosion is the nearby national museum . Tyler plans to die as a martyr during this event , taking the narrator 's life as well . Realizing this , the narrator sets out to stop Tyler , although Tyler is always thinking ahead of him . The narrator makes his way to the roof of the building , where Tyler holds him at gunpoint . When Marla comes to the roof with one of the support groups , Tyler vanishes , as Tyler `` was his hallucination , not hers . '' With Tyler gone , the narrator waits for the bomb to explode and kill him . The bomb malfunctions because Tyler mixed paraffin into the explosives . Still alive and holding Tyler 's gun , the narrator makes the first decision that is truly his own : he puts the gun in his mouth and shoots himself . Some time later , he awakens in a mental hospital , believing he is in Heaven , and imagines an argument with God over human nature . The book ends with the narrator being approached by hospital employees who reveal themselves to be Project members . They tell him their plans still continue , and that they are expecting Tyler to come back . Palahniuk once had an altercation while camping , and though he returned to work bruised and swollen , his co-workers avoided asking him what had happened on the camping trip . Their reluctance to know what happened in his private life inspired him to write Fight Club . In 1995 , Palahniuk joined a Portland - based writing group that practiced a technique called `` dangerous writing '' . This technique , developed by American author Tom Spanbauer , emphasizes the use of minimalist prose , and the use of painful , personal experiences for inspiration . Under Spanbauer 's influence , Palahniuk produced an early draft of what would later become his novel Invisible Monsters ( 1999 ) , but it was rejected by all publishers he submitted it to . Palahniuk then decided to write an even darker novel , by expanding upon his short story , `` Fight Club '' . Initially Fight Club was published as a seven - page short story in the compilation Pursuit of Happiness ( 1995 ) , but Palahniuk expanded it to novel length ( in which the original short story became chapter six ) ; Fight Club : A Novel was published in 1996 . Fight Club : A Novel was re-issued in 1999 and 2004 ; the latter edition includes the author 's introduction about the conception and popularity of the novel and movie , in which Palahniuk states : ... bookstores were full of books like The Joy Luck Club and The Divine Secrets of the Ya - Ya Sisterhood and How to Make an American Quilt . These were all novels that presented a social model for women to be together . But there was no novel that presented a new social model for men to share their lives . He later explains : Really , what I was writing was just The Great Gatsby updated a little . It was ' apostolic ' fiction -- where a surviving apostle tells the story of his hero . There are two men and a woman . And one man , the hero , is shot to death . One critic has noted that this essay can be seen as Palahniuk 's way of interpreting his own novel . According to this critic , Palahniuk 's essay emphasizes the communicative and romantic elements of the novel while it deemphasizes its transgressive elements . In interviews , the writer has said he is still approached by people wanting to know the location of the nearest fight club . Palahniuk insists there is no such real organization . He has heard of real fight clubs , some said to have existed before the novel . Project Mayhem is lightly based on The Cacophony Society , of which Palahniuk is a member , and other events derived from stories told to him . Fight Club 's cultural impact is evidenced by the establishment of fight clubs by teenagers and `` techies '' in the United States . Pranks , such as food - tampering , have been repeated by fans of the book , documented in Palahniuk 's essay `` Monkey Think , Monkey Do '' , in the book Stranger Than Fiction : True Stories ( 2004 ) and in the introduction to the 2004 re-issue of Fight Club . Other fans have been inspired to prosocial activity , telling Palahniuk the novel had inspired them to return to college . In addition to the feature film , a stage adaptation by Dylan Yates has been performed in Seattle and in Charlotte , North Carolina . In 2004 , work began on a musical theater adaptation by Palahniuk , Fincher , and Trent Reznor , to premiere on the film 's 10th anniversary . In 2015 the project was still in development , with Julie Taymor having been added to the creative team . A modern - day everyman figure as well as an employee specializing in recalls for an unnamed car company , the Narrator -- who remains unnamed throughout the novel -- is extremely depressed and suffers from insomnia . Some readers call him `` Joe '' , because of his constant use of the name in such statements as , `` I am Joe 's boiling point '' . The quotes , `` I am Joe 's ( blank ) '' , refer to the Narrator 's reading old Reader 's Digest articles in which human organs write about themselves in the first person , with titles such as `` I Am Joe 's Liver '' . The film adaptation replaces `` Joe '' with `` Jack '' , inspiring some fans to call the Narrator `` Jack '' . In the novel and film , the Narrator uses various aliases in the support groups . His subconscious is in need of a sense of freedom , he inevitably feels trapped within his own body , and when introduced to Tyler Durden , he begins to see all of the qualities he lacks in himself : `` I love everything about Tyler Durden , his courage , his smarts , and his nerve . Tyler is funny and forceful and independent , and men look up to him and expect him to change their world . Tyler is capable and free , and I am not . '' In the official sequel comic book series also penned by Palahniuk ( with art by Cameron Stewart ) , Fight Club 2 , it is revealed that the Narrator has chosen to be identified by the name of Sebastian . `` Because of his nature '' , Tyler works night jobs where he sabotages companies and harms clients . He also steals left - over drained human fat from liposuction clinics to supplement his income through soap making and to create the ingredients for bomb manufacturing , which will be put to work later with his fight club . He is the co-founder of Fight Club , as it was his idea to instigate the fight that led to it . He later launches Project Mayhem , from which he and the members commit various attacks on consumerism. Tyler is blond , according to the Narrator 's comment `` in his everything - blond way '' . The unhinged but magnetic Tyler becomes the `` villain '' of the novel later in the story . The Narrator refers to Tyler as a free spirit who says , `` Let that which does not matter truly slide . '' A woman whom the Narrator meets during a support group . The Narrator no longer receives the same release from the groups when he realizes Marla is faking her problems just as he is . After he leaves the groups , he meets her again when she becomes Tyler 's lover . Marla is shown to be extremely unkempt , uncaring , and sometimes even suicidal . At times , she shows a softer , more caring side . The Narrator meets Bob at a support group for testicular cancer . A former bodybuilder , Bob lost his testicles to cancer caused by the steroids he used to bulk up his muscles . He had to undergo testosterone injections , resulting in increased estrogen . The increased estrogen levels caused him to grow large breasts and to develop a softer voice .",1258918401649366361,what was the goal of project mayhem in fight club


Preprocess Training Data

In [10]:
# import the correct tokenizer for the model architecture
from transformers import AutoTokenizer    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=433.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466062.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=28.0, style=ProgressStyle(description_w…




In [11]:
# verify that the tokenizer is a fast tokenizer
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [12]:
# run a check to verify that the tokenizer is working for an example questions and answer
tokenizer("Is this tokenizer working?", "Yes, the tokenizer is working correctly")

{'input_ids': [101, 2003, 2023, 19204, 17629, 2551, 1029, 102, 2748, 1010, 1996, 19204, 17629, 2003, 2551, 11178, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [13]:
########### INPUT ###########
# Length will be truncated to handle long contexts
# Set the max length (questions and context) and stride (context overlap)
max_length = 384
doc_stride = 128 

In [14]:
# verify that the truncation is working correctly by finding an example
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break
example = datasets["train"][i]

Token indices sequence length is longer than the specified maximum sequence length for this model (1043 > 512). Running this sequence through the model will result in indexing errors


In [15]:
# check to see what the length of the example is without truncation (should be greater than max_length)
len(tokenizer(example["question"], example["context"])["input_ids"])

1043

In [16]:
# tokenizer should return always return the question plus truncated contexts
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

In [17]:
# verify the length for the mulitple examples are provided for tokenized example
[len(x) for x in tokenized_example["input_ids"]]

[384, 384, 384, 308]

In [18]:
# decode the tokenized example to verify that we have a question plus the truncated context
for x in tokenized_example["input_ids"][:]:
    print(tokenizer.decode(x))

[CLS] when did air conditioning became standard in cars [SEP] automobile air conditioning ( also called a / c ) systems use air conditioning to cool the air in a vehicle. a company in new york city in the united states first offered installation of air conditioning for cars in 1933. most of their customers operated limousines and luxury cars. in 1939, packard became the first automobile manufacturer to offer an air conditioning unit in its cars. these were manufactured by bishop and babcock co, of cleveland, ohio. the ` ` bishop and babcock weather conditioner'' also incorporated a heater. cars ordered with the new ` ` weather conditioner'' were shipped from packard's east grand boulevard facility to the b & b factory where the conversion was performed. once complete, the car was shipped to a local dealer where the customer would take delivery. packard fully warranted and supported this conversion, and marketed it well. however, it was not commercially successful for a number of reason

In [19]:
# use the tokenizer to map the offset for locating the answer
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 4), (5, 8), (9, 12), (13, 25), (26, 32), (33, 41), (42, 44), (45, 49), (0, 0), (1, 11), (12, 15), (16, 28), (29, 30), (31, 35), (36, 42), (43, 44), (45, 46), (47, 48), (49, 50), (51, 58), (59, 62), (63, 66), (67, 79), (80, 82), (83, 87), (88, 91), (92, 95), (96, 98), (99, 100), (101, 108), (109, 110), (117, 118), (119, 126), (127, 129), (130, 133), (134, 138), (139, 143), (144, 146), (147, 150), (151, 157), (158, 164), (165, 170), (171, 178), (179, 191), (192, 194), (195, 198), (199, 211), (212, 215), (216, 220), (221, 223), (224, 228), (229, 230), (231, 235), (236, 238), (239, 244), (245, 254), (255, 263), (264, 273), (273, 274), (275, 278), (279, 285), (286, 290), (291, 292), (295, 297), (298, 302), (303, 304), (305, 312), (313, 319), (320, 323), (324, 329), (330, 340), (341, 353), (354, 356), (357, 362), (363, 365), (366, 369), (370, 382), (383, 387), (388, 390), (391, 394), (395, 399), (400, 401), (402, 407), (408, 412), (413, 425), (426, 428), (429, 435), (436, 439), 

In [20]:
# verify that the mapping is working correctly
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

when when


In [21]:
# use the tokenizer to find sequence ids for locating the position of the question and answer
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [22]:
# answers = example["answers"]
# start_char = answers["answer_start"]
# start_char
example

{'answers': {'answer_start': [4311], 'text': ['late twentieth century']},
 'context': " Automobile air conditioning ( also called A / C ) systems use air conditioning to cool the air in a vehicle .       A company in New York City in the United States first offered installation of air conditioning for cars in 1933 . Most of their customers operated limousines and luxury cars .   In 1939 , Packard became the first automobile manufacturer to offer an air conditioning unit in its cars . These were manufactured by Bishop and Babcock Co , of Cleveland , Ohio . The `` Bishop and Babcock Weather Conditioner '' also incorporated a heater . Cars ordered with the new `` Weather Conditioner '' were shipped from Packard 's East Grand Boulevard facility to the B&B factory where the conversion was performed . Once complete , the car was shipped to a local dealer where the customer would take delivery .   Packard fully warranted and supported this conversion , and marketed it well . However , it was 

In [23]:
list(map(lambda i:i, example["context"]))[4311:]

['l',
 'a',
 't',
 'e',
 ' ',
 't',
 'w',
 'e',
 'n',
 't',
 'i',
 'e',
 't',
 'h',
 ' ',
 'c',
 'e',
 'n',
 't',
 'u',
 'r',
 'y',
 ' ',
 '.',
 ' ',
 'A',
 'l',
 't',
 'h',
 'o',
 'u',
 'g',
 'h',
 ' ',
 'a',
 'i',
 'r',
 ' ',
 'c',
 'o',
 'n',
 'd',
 'i',
 't',
 'i',
 'o',
 'n',
 'e',
 'r',
 's',
 ' ',
 'u',
 's',
 'e',
 ' ',
 's',
 'i',
 'g',
 'n',
 'i',
 'f',
 'i',
 'c',
 'a',
 'n',
 't',
 ' ',
 'p',
 'o',
 'w',
 'e',
 'r',
 ' ',
 ';',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'd',
 'r',
 'a',
 'g',
 ' ',
 'o',
 'f',
 ' ',
 'a',
 ' ',
 'c',
 'a',
 'r',
 ' ',
 'w',
 'i',
 't',
 'h',
 ' ',
 'c',
 'l',
 'o',
 's',
 'e',
 'd',
 ' ',
 'w',
 'i',
 'n',
 'd',
 'o',
 'w',
 's',
 ' ',
 'i',
 's',
 ' ',
 'l',
 'e',
 's',
 's',
 ' ',
 't',
 'h',
 'a',
 'n',
 ' ',
 'i',
 'f',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'w',
 'i',
 'n',
 'd',
 'o',
 'w',
 's',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 'o',
 'p',
 'e',
 'n',
 ' ',
 't',
 'o',
 ' ',
 'c',
 'o',
 'o',
 'l',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'o',
 'c',
 'c',
 'u',
 'p'

In [24]:
# identify the first and last token of the answer in the context or return no answer

# locate the start and end character of answer
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")

The answer is not in this feature.


In [25]:
print(token_start_index)

10


In [27]:
# verify that the start and end tokens produced are the correct answer
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
print(answers["text"][0])

NameError: ignored

In [51]:
# To make this notebook generalizable to any model, we account for the special case where the model expects padding on the left
pad_on_right = tokenizer.padding_side == "right"

In [52]:
# This function combines the above methods by tokenizing each example with truncation and padding

def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [53]:
# the function can work on multiple features. Verify that the tokenization is working correctly
features = prepare_train_features(datasets['train'][:2])
features

{'input_ids': [[101, 2043, 2106, 2250, 14372, 2150, 3115, 1999, 3765, 102, 9935, 2250, 14372, 1006, 2036, 2170, 1037, 1013, 1039, 1007, 3001, 2224, 2250, 14372, 2000, 4658, 1996, 2250, 1999, 1037, 4316, 1012, 1037, 2194, 1999, 2047, 2259, 2103, 1999, 1996, 2142, 2163, 2034, 3253, 8272, 1997, 2250, 14372, 2005, 3765, 1999, 4537, 1012, 2087, 1997, 2037, 6304, 3498, 28012, 2015, 1998, 9542, 3765, 1012, 1999, 3912, 1010, 24100, 2150, 1996, 2034, 9935, 7751, 2000, 3749, 2019, 2250, 14372, 3131, 1999, 2049, 3765, 1012, 2122, 2020, 7609, 2011, 3387, 1998, 8670, 9818, 7432, 2522, 1010, 1997, 6044, 1010, 4058, 1012, 1996, 1036, 1036, 3387, 1998, 8670, 9818, 7432, 4633, 4650, 2121, 1005, 1005, 2036, 5100, 1037, 3684, 2121, 1012, 3765, 3641, 2007, 1996, 2047, 1036, 1036, 4633, 4650, 2121, 1005, 1005, 2020, 12057, 2013, 24100, 1005, 1055, 2264, 2882, 8459, 4322, 2000, 1996, 1038, 1004, 1038, 4713, 2073, 1996, 7584, 2001, 2864, 1012, 2320, 3143, 1010, 1996, 2482, 2001, 12057, 2000, 1037, 2334, 1103

In [31]:
# apply the function on all elements of all the splits in the dataset including training, validation, and testing data
# remove the old columns since the preprocessing changes the number of samples
# results are cached. Pass "load_from_cache_file=False" to force the preprocessing to be applied again
tokenized_datasets = datasets.map(
    prepare_train_features, 
    batched=True, 
    remove_columns=datasets["train"].column_names
)

HBox(children=(FloatProgress(value=0.0, max=8.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




Fine-Tune the Model

In [32]:
# import Pytorch pretrained model for question answering
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

# from_pretrained method downloads and caches the model
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# warning regarding not using weights and layers is normal. we are removing the 
# masked language modeling head to pretrain the model on the QA task for which
# we do not have pretrained weights and requires fine-tuning

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=440473133.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

In [33]:
########### INPUT ###########
# training arguments is a class that contains the attributes to customize training
# set the folder name f"model-dataset", which will be used to save checkpoints
# set the learning_rate, number of epochs, and weight_decay
# batch_size has been set at the beginning of the notebook 
args = TrainingArguments(
    f"bert-nq-qg",
    evaluation_strategy = "epoch",
    learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    num_train_epochs = 3,
    weight_decay = 0.01,
)

In [34]:
# import a default data collator
from transformers import default_data_collator

# set the data_collator to the default data collator
data_collator = default_data_collator

In [35]:
# pass all of the training arguments and datasets to the trainer
trainer = Trainer(
    model,
    args,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["validation"],
    data_collator = data_collator,
    tokenizer = tokenizer,
)

In [36]:
# finetune the model by calling train method
# running this cell will take time.
trainer.train()

Epoch,Training Loss,Validation Loss,Runtime,Samples Per Second
1,0.4874,0.435194,204.0855,77.232
2,0.3323,0.431457,204.0424,77.249
3,0.2155,0.488653,203.6933,77.381


TrainOutput(global_step=11655, training_loss=0.3767050469397271, metrics={'train_runtime': 8362.3106, 'train_samples_per_second': 1.394, 'total_flos': 4.67761639473239e+16, 'epoch': 3.0, 'init_mem_cpu_alloc_delta': 334895, 'init_mem_gpu_alloc_delta': 436709888, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 1176410, 'train_mem_gpu_alloc_delta': 1308321792, 'train_mem_cpu_peaked_delta': 97198401, 'train_mem_gpu_peaked_delta': 8290021376})

In [37]:
########### INPUT ###########
# save the model. input the model name ("model-dataset-trained")
trainer.save_model("bert-nq-cl-reduce-trained")

Evaluation

In [38]:
# the validation features will need to be re-processed similar to the training features
# the processing will also need to check if the output span is inside the context (and not in the question)
# it will also need to retrieve the text inside

def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

In [39]:
# apply the function to validation set 
# remove the old columns since the preprocessing changes the number of samples
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




In [40]:
# extract the predictions for all features using method trainer.predict
raw_predictions = trainer.predict(validation_features)

In [41]:
# trainer hides columns not used by the model. the columns needed for post-processing are set back 
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

In [42]:
########### INPUT ###########
# to classify answers, we use the score obtained by adding the start and end logits
# limit the number of possible answers by setting n_best_size
# limit the length of the answer by setting max_answer_length
n_best_size = 20
max_answer_length = 30

In [43]:
# get the output logits from trainer
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

In [44]:
# code to verify the score and corresponding text are working correctly
import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to be match the example_id to
# an example index
context = datasets["validation"][0]["context"]

# Gather the indices the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char: end_char]
                }
            )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

[{'score': 15.683541, 'text': 'W. Winter'},
 {'score': 10.421932, 'text': 'W. Winter -- known as Wint'},
 {'score': 9.84499, 'text': 'Winter'},
 {'score': 9.077557, 'text': 'W. Winter -- known as Wint -- and Charles Kidd'},
 {'score': 7.584131, 'text': '. Winter'},
 {'score': 7.092781, 'text': 'W'},
 {'score': 6.3153315, 'text': 'W. Winter -- known as Wint --'},
 {'score': 4.5833807, 'text': 'Winter -- known as Wint'},
 {'score': 4.355462, 'text': 'W. Winter --'},
 {'score': 3.8281991, 'text': 'W.'},
 {'score': 3.621228, 'text': 'Wint'},
 {'score': 3.2390056, 'text': 'Winter -- known as Wint -- and Charles Kidd'},
 {'score': 2.8396778, 'text': 'Bruce Glover'},
 {'score': 2.3225212, 'text': '. Winter -- known as Wint'},
 {'score': 2.276853, 'text': 'Wint -- and Charles Kidd'},
 {'score': 1.5263276,
  'text': 'Bruce Glover and Kidd by bespectacled jazz musician Putter Smith'},
 {'score': 1.0772647, 'text': 'Putter Smith'},
 {'score': 0.9781464, 'text': '. Winter -- known as Wint -- and C

In [45]:
# view the actual answer
datasets["validation"][0]["answers"]

{'answer_start': [631], 'text': ['bespectacled jazz musician Putter Smith']}

In [46]:
# apply the process above to all features by mapping between examples and their corresponding features
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

In [47]:
# to handle the non-answerable questions, we need to extract the score for the impossible answer
# the score is collected from minimum of the scores from the CLS token for each feature generated by the example
# the question is not answerable when that score is greater than the highest answerable score
from tqdm.auto import tqdm
import numpy as np

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

In [48]:
# apply the postprocessing function to the raw predictions
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Post-processing 1964 example predictions split into 15762 features.


HBox(children=(FloatProgress(value=0.0, max=1964.0), HTML(value='')))




In [49]:
########### INPUT ###########
# load the metric from the datasets library
metric = load_metric("squad_v2")

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=2264.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=3182.0, style=ProgressStyle(description…




In [50]:
# format predictions and labels
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'HasAns_exact': 18.930041152263374,
 'HasAns_f1': 21.948244579483777,
 'HasAns_total': 1458,
 'NoAns_exact': 85.7707509881423,
 'NoAns_f1': 85.7707509881423,
 'NoAns_total': 506,
 'best_exact': 36.15071283095723,
 'best_exact_thresh': 0.0,
 'best_f1': 38.391313949535274,
 'best_f1_thresh': 0.0,
 'exact': 36.15071283095723,
 'f1': 38.39131394953531,
 'total': 1964}