Implement an extractive question answering (QA) system using the classifier described in Section 16.6.1. We will discuss this approach in class. Train the classifier using the training partition of the SQuAD 1.1 dataset, and report the Exact Match score on the development partition. Include the scores you obtain in the repository notebook. How do your scores compare with the ones reported in Table 5 in the 2016 SQuAD 1.1 paper?
 
Note: the SQuAD 1.1 dataset and paper are available here: https://rajpurkar.github.io/SQuAD-explorer/

Most of today’s extractive QA approaches follow the architecture in- troduced in (Devlin et al., 2018), which is based on Figure 12.4 from Chapter 12. In particular, extractive QA methods concatenate the in- put question and supporting passage into a single sequence, where the two texts are separated by [SEP]. For example, the input corresponding to the third question from Figure 16.12 is:10
[CLS] Where do water droplets collide with ice crystals to form precipitation? [SEP] In meteorology, precipitation is [...] smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers.” [SEP]
Note that the transformer library will handle the generation of the po- sition and segment embeddings shown in Figure 12.4.

In [1]:
import random
import torch
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

# enable tqdm in pandas
tqdm.pandas()

# set to True to use the gpu (if there is one available)
use_gpu = True

# select device
device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
print(f'device: {device.type}')

# random seed
seed = 2024

# set random seed
if seed is not None:
    print(f'random seed: {seed}')
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

device: cuda
random seed: 2024


In [2]:
from transformers import BertTokenizerFast, BertForQuestionAnswering
import torch

model_name = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model = model.to(device)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
from datasets import load_dataset
dataset = load_dataset("squad")

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [5]:
dataset['validation'].to_pandas()

Unnamed: 0,id,title,context,question,answers
0,56be4db0acb8001400a502ec,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the AFC at Super Bo...,"{'text': ['Denver Broncos', 'Denver Broncos', ..."
1,56be4db0acb8001400a502ed,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team represented the NFC at Super Bo...,"{'text': ['Carolina Panthers', 'Carolina Panth..."
2,56be4db0acb8001400a502ee,Super_Bowl_50,Super Bowl 50 was an American football game to...,Where did Super Bowl 50 take place?,"{'text': ['Santa Clara, California', 'Levi's S..."
3,56be4db0acb8001400a502ef,Super_Bowl_50,Super Bowl 50 was an American football game to...,Which NFL team won Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', ..."
4,56be4db0acb8001400a502f0,Super_Bowl_50,Super Bowl 50 was an American football game to...,What color was used to emphasize the 50th anni...,"{'text': ['gold', 'gold', 'gold'], 'answer_sta..."
...,...,...,...,...,...
10565,5737aafd1c456719005744fb,Force,"The pound-force has a metric counterpart, less...",What is the metric term less used than the New...,"{'text': ['kilogram-force', 'pound-force', 'ki..."
10566,5737aafd1c456719005744fc,Force,"The pound-force has a metric counterpart, less...",What is the kilogram-force sometimes reffered ...,"{'text': ['kilopond', 'kilopond', 'kilopond', ..."
10567,5737aafd1c456719005744fd,Force,"The pound-force has a metric counterpart, less...",What is a very seldom used unit of mass in the...,"{'text': ['slug', 'metric slug', 'metric slug'..."
10568,5737aafd1c456719005744fe,Force,"The pound-force has a metric counterpart, less...",What seldom used term of a unit of force equal...,"{'text': ['kip', 'kip', 'kip', 'kip', 'kip'], ..."


In [6]:
dataset['train'].to_pandas()

Unnamed: 0,id,title,context,question,answers
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"{'text': ['Saint Bernadette Soubirous'], 'answ..."
1,5733be284776f4190066117f,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,"{'text': ['a copper statue of Christ'], 'answe..."
2,5733be284776f41900661180,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,"{'text': ['the Main Building'], 'answer_start'..."
3,5733be284776f41900661181,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,{'text': ['a Marian place of prayer and reflec...
4,5733be284776f4190066117e,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,{'text': ['a golden statue of the Virgin Mary'...
...,...,...,...,...,...
87594,5735d259012e2f140011a09d,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",In what US state did Kathmandu first establish...,"{'text': ['Oregon'], 'answer_start': [229]}"
87595,5735d259012e2f140011a09e,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",What was Yangon previously known as?,"{'text': ['Rangoon'], 'answer_start': [414]}"
87596,5735d259012e2f140011a09f,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",With what Belorussian city does Kathmandu have...,"{'text': ['Minsk'], 'answer_start': [476]}"
87597,5735d259012e2f140011a0a0,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",In what year did Kathmandu create its initial ...,"{'text': ['1975'], 'answer_start': [199]}"


In [7]:
dataset['train'][54321]

{'id': '572a2a293f37b31900478766',
 'title': 'Digimon',
 'context': "The second Digimon series is direct continuation of the first one, and began airing on April 2, 2000. Three years later, with most of the original DigiDestined now in high school at age fourteen, the Digital World was supposedly secure and peaceful. However, a new evil has appeared in the form of the Digimon Emperor (Digimon Kaiser) who as opposed to previous enemies is a human just like the DigiDestined. The Digimon Emperor has been enslaving Digimon with Dark Rings and Control Spires and has somehow made regular Digivolution impossible. However, five set Digi-Eggs with engraved emblems had been appointed to three new DigiDestined along with T.K. and Kari, two of the DigiDestined from the previous series. This new evolutionary process, dubbed Armor Digivolution helps the new DigiDestined to defeat evil lurking in the Digital World. Eventually, the DigiDestined defeat the Digimon Emperor, more commonly known as Ken Ic

In [8]:
# https://huggingface.co/learn/nlp-course/en/chapter7/7
dataset["train"].filter(lambda x: len(x["answers"]["text"]) != 1)

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

In [9]:
max_length = 384
stride = 128

def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs.to(device)



def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        
        # Handle case where no answer exists
        if len(answer["answer_start"]) == 0:
            start_positions.append(0)
            end_positions.append(0)
            continue

        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # Check if answer is fully inside the context
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Find start position
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            # Find end position
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    # Add start and end positions to inputs
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    
    return inputs

In [10]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    offset_mapping = inputs.pop("offset_mapping")
    
    inputs["example_id"] = []
    # inputs["offset_mapping"] = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        inputs["example_id"].append(examples["id"][sample_idx])
        
        # filter offset mapping
        # sequence_ids = inputs.sequence_ids(i)
        # inputs["offset_mapping"].append([
        #     o if sequence_ids[k] == 1 else None for k, o in enumerate(offset_mapping[i])
        # ])

    return inputs


# Note
Due to limited resource, I chose only 1000 of the original dataset

In [11]:
dataset.set_format("torch")

In [12]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

# preprocess datasets
train_dataset = dataset["train"].select(range(1000)).map(
    preprocess_training_examples,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

validation_dataset = dataset["validation"].select(range(1000)).map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)

# Create dataloaders
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    # collate_fn=default_data_collator,
    batch_size=8  # Adjust based on your GPU memory
)

eval_dataloader = DataLoader(
    validation_dataset,
    # collate_fn=default_data_collator,
    batch_size=8  # Same as training batch size
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [13]:
type(train_dataset['input_ids'])

torch.Tensor

In [14]:
train_dataset.to_pandas()

Unnamed: 0,input_ids,token_type_ids,attention_mask,start_positions,end_positions
0,"[101, 2000, 3183, 2106, 1996, 6261, 2984, 9382...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",130,137
1,"[101, 2054, 2003, 1999, 2392, 1997, 1996, 1028...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",52,56
2,"[101, 1996, 13546, 1997, 1996, 6730, 2540, 201...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",81,83
3,"[101, 2054, 2003, 1996, 24665, 23052, 2012, 10...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",95,101
4,"[101, 2054, 7719, 2006, 2327, 1997, 1996, 2364...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",33,39
...,...,...,...,...,...
1027,"[101, 2054, 2828, 1997, 5929, 2515, 1996, 2329...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",25,26
1028,"[101, 2054, 2120, 7071, 3303, 20773, 2000, 344...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",14,15
1029,"[101, 2129, 2172, 5356, 2106, 20773, 2404, 204...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",49,51
1030,"[101, 2054, 7064, 2086, 2101, 2044, 16864, 210...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",70,70


No longer using trainer

In [15]:
# train_dataset = dataset["train"].map(
#     preprocess_training_examples,
#     batched=True,
#     remove_columns=dataset["train"].column_names,
# )

In [16]:
model_save_path = "model/baseline"

In [17]:
# from transformers import TrainingArguments, Trainer

# args = TrainingArguments(
#     model_save_path,
#     evaluation_strategy="no",
#     save_strategy="epoch",
#     learning_rate=2e-5,
#     num_train_epochs=3,
#     weight_decay=0.01,
#     fp16=True,
# )

In [18]:
# trainer = Trainer(
#     model=model,
#     args=args,
#     train_dataset=subset_train_dataset,
#     tokenizer=tokenizer,
# )
# trainer.train()

In [19]:
# validation_dataset.to_pandas()

In [20]:
# validation_dataset = validation_dataset.remove_columns(['token_type_ids', 'offset_mapping'])

In [21]:
eval_dataloader.dataset.to_pandas()

Unnamed: 0,input_ids,token_type_ids,attention_mask,example_id
0,"[101, 2029, 5088, 2136, 3421, 1996, 10511, 201...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",56be4db0acb8001400a502ec
1,"[101, 2029, 5088, 2136, 3421, 1996, 22309, 201...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",56be4db0acb8001400a502ed
2,"[101, 2073, 2106, 3565, 4605, 2753, 2202, 2173...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",56be4db0acb8001400a502ee
3,"[101, 2029, 5088, 2136, 2180, 3565, 4605, 2753...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",56be4db0acb8001400a502ef
4,"[101, 2054, 3609, 2001, 2109, 2000, 17902, 199...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",56be4db0acb8001400a502f0
...,...,...,...,...
1015,"[101, 2054, 2001, 1996, 2171, 1997, 1996, 1442...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",5733647e4776f419006609af
1016,"[101, 2054, 23050, 2001, 2328, 1999, 1996, 370...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",5733647e4776f419006609b0
1017,"[101, 2040, 2515, 1996, 6231, 1997, 2210, 1602...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",5733647e4776f419006609b1
1018,"[101, 2054, 6104, 2003, 1999, 3638, 1997, 1996...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",5733647e4776f419006609b2


# Note
Due to limited resource, I limited epochs to 3 for now.

In [22]:
epochs = 3
lr = 2e-5

for i in range(epochs):
    model.train()
    total_loss = 0
    for batch in tqdm(train_dataloader):
        # Move to device
        batch['input_ids'] = batch['input_ids'].to(device)
        batch['attention_mask'] = batch['attention_mask'].to(device)
        
        batch = {k: v.to(device) for k, v in batch.items() 
                 if k in ['input_ids', 'attention_mask', 'start_positions', 'end_positions']}
        
        outputs = model(**batch)

        loss = outputs.loss
        total_loss += loss.item()
    
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print(f"Epoch {i + 1}, Loss: {total_loss / len(train_dataloader)}")


  0%|          | 0/129 [00:00<?, ?it/s]

Epoch 1, Loss: 4.1854470940523365


  0%|          | 0/129 [00:00<?, ?it/s]

Epoch 2, Loss: 2.5369425891905792


  0%|          | 0/129 [00:00<?, ?it/s]

Epoch 3, Loss: 1.3582624113836954


In [23]:
from evaluate import load as load_metric
metric = load_metric('squad')

model.eval()
all_predictions = []
all_references = []

for batch in tqdm(eval_dataloader):
    with torch.no_grad():
        batch['input_ids'] = batch['input_ids'].to(device)
        batch['attention_mask'] = batch['attention_mask'].to(device)
        
        outputs = model(**{k: batch[k] for k in ['input_ids', 'attention_mask']})
        
        # process predictions
        for i in range(batch['input_ids'].size(0)):
            # Get token indices for prediction
            start_pred = torch.argmax(outputs.start_logits[i]).item()
            end_pred = torch.argmax(outputs.end_logits[i]).item()
            
            # decode prediction
            input_ids = batch['input_ids'][i].tolist()
            predicted_answer = tokenizer.decode(
                input_ids[start_pred:end_pred+1], 
                skip_special_tokens=True
            )
            
            # ground truth!!
            example = dataset['validation'][i]
            references = example['answers']['text']
            
            all_predictions.append(predicted_answer)
            all_references.append(references)
     
        

  0%|          | 0/128 [00:00<?, ?it/s]

In [24]:
# would have to change to handle it somewhere else
example_ids = dataset["validation"].select(range(1000))["id"]

In [None]:
# there was an issue with the predictions and references list, so I had to reformat it
# not sure if this is doing it correctly

In [25]:
formatted_predictions = [
    {"id": example_id, "prediction_text": pred_text}
    for example_id, pred_text in zip(example_ids, all_predictions)
]

formatted_references = [
    {"id": example_id, "answers": {"text": ref_texts, "answer_start": [0] * len(ref_texts)}}
    for example_id, ref_texts in zip(example_ids, all_references)
]


In [26]:
results = metric.compute(predictions=formatted_predictions, references=formatted_references)

In [27]:
print(f"Exact Match (EM) Score: {results['exact_match']:.2f}")
print(f"F1 Score: {results['f1']:.2f}")

Exact Match (EM) Score: 0.40
F1 Score: 0.65
