# Fine-Tuning BERT on SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD) is a collection of crowd-sourced question/answer pairs. Each example consists of a question and a Wikipedia passage that contains the answer. The task is to predict the span of text in the passage that answers the question.

## 0. Configuration

In [1]:
import os
# Store the huggingface data in a shared group folder on the provided JupyterLab instance.
os.environ['HF_HOME'] = '../../groups/192.039-2024W/bert/huggingface/cache'

In [2]:
from pathlib import Path
from transformers import set_seed

# RANDOMNESS SEED
SEED = 42
set_seed(SEED)

# Which dataset to load
DATASET_NAME = "squad"
MODEL_VERSION = "single"

TRAIN_OUTPUT_DIR = (
    Path("../../groups/192.039-2024W/bert") / "training" / f"{DATASET_NAME}-{MODEL_VERSION}"
)

BATCH_SIZE = 32  # The original paper reports a batch size of 32 for SQuAD fine-tuning
NUM_EPOCHS = 3  # The original paper reports 3 epochs for SQuAD fine-tuning

In [3]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    device_count = torch.cuda.device_count()
    device_name = torch.cuda.get_device_name(0)

    print(f"There are {device_count} GPU(s) available.")
    print(f"GPU used: {device_name}")
    ! nvidia-smi -q --display=MEMORY,COMPUTE

else:
    print("No GPU available, using CPU.")
    device = torch.device("cpu")

There are 1 GPU(s) available.
GPU used: NVIDIA A40


Timestamp                                 : Tue Jan 28 07:47:36 2025
Driver Version                            : 550.90.07
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:05:00.0
    FB Memory Usage
        Total                             : 46068 MiB
        Reserved                          : 665 MiB
        Used                              : 4 MiB
        Free                              : 45401 MiB
    BAR1 Memory Usage
        Total                             : 65536 MiB
        Used                              : 2 MiB
        Free                              : 65534 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default



## 1. Dataset

In [4]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset(DATASET_NAME)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Each example in the training set has exactly one reference answer. We can verify this by filtering for entries whose answers field does not contain exactly one text; the filter returns no rows.

In [5]:
dataset["train"].filter(lambda x: len(x["answers"]["text"]) != 1)

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

The validation set, in contrast, provides several reference answers per example. These may be identical or differ from one another.

In [6]:
dataset["validation"][0]["answers"]

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],
 'answer_start': [177, 177, 177]}

In [7]:
dataset["validation"][2]["answers"]

{'text': ['Santa Clara, California',
  "Levi's Stadium",
  "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."],
 'answer_start': [403, 355, 355]}
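
Because a validation example can carry several reference answers, SQuAD-style scoring compares a prediction against each reference and keeps the best score. The sketch below is a simplified re-implementation modeled on the official SQuAD v1.1 evaluation logic, not the code that the metric loaded later in this notebook actually runs. The helper names (normalize_answer, best_over_references) are our own.

```python
import collections
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(score_fn, prediction, references):
    """Score against every reference answer and keep the maximum."""
    return max(score_fn(prediction, ref) for ref in references)


# Reference answers of validation example 2, copied from the output above.
references = [
    "Santa Clara, California",
    "Levi's Stadium",
    "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.",
]

em_full = best_over_references(exact_match, "Levi's Stadium", references)
f1_partial = best_over_references(f1_score, "Santa Clara", references)
em_partial = best_over_references(exact_match, "Santa Clara", references)
print(em_full, em_partial, round(f1_partial, 2))
```

A prediction only has to match one of the references to count as an exact match, which is why validation scores are not directly comparable to a single-reference setting.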

## 2. BERT-base

In [8]:
PRE_TRAINED_CHECKPOINT = "google-bert/bert-base-uncased"

### 2.1 Tokenization

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_CHECKPOINT)

#### 2.1.1 Preprocessing training dataset

These preprocessing steps are essential for question answering. Some examples have a context that, together with the question, exceeds the maximum input length of the model, so the context has to be truncated; here it is split into overlapping windows so that no part of it is lost. For each window we then need token-level labels: the character positions of the answer in the original context are mapped to the start and end tokens of the answer. To do this, we use the offset mapping and first determine which tokens belong to the question and which belong to the context.
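
The mapping from a character span to token positions can be tried out in isolation. The following is a minimal sketch that uses whitespace tokens instead of the BERT tokenizer, so the offsets are only illustrative; char_span_to_token_span is a hypothetical helper that mirrors the two while loops in the preprocessing function below.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices.

    `offsets` holds one (char_start, char_end) pair per context token.
    Returns None if the span is not fully covered by the tokens.
    """
    if offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None

    # Walk forward past every token that starts at or before the answer start.
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward past every token that ends at or after the answer end.
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


context = "The game was played at Levi's Stadium in Santa Clara."

# Toy tokenization: split on whitespace and record character offsets.
offsets = []
pos = 0
for word in context.split():
    start = context.index(word, pos)
    offsets.append((start, start + len(word)))
    pos = start + len(word)

answer = "Levi's Stadium"
start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
recovered = context[offsets[span[0]][0] : offsets[span[1]][1]]
print(span, repr(recovered))
```

Slicing the context with the offsets of the two returned tokens gives back the answer text, which is the same round trip the evaluation code relies on later.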

In [10]:
def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [11]:
train_dataset = dataset["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

In [12]:
len(dataset["train"]), len(train_dataset)

(87599, 88524)
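
The feature count is larger than the example count because long contexts are split into several overlapping windows. The sketch below approximates that splitting on token counts only; it is not the tokenizer's internal code, and the window size of 366 context tokens is a made-up figure (384 minus a hypothetical question length and the special tokens).

```python
def sliding_windows(num_context_tokens, window, stride):
    """Return (start, end) token ranges; neighbouring windows share `stride` tokens."""
    if stride >= window:
        raise ValueError("stride must be smaller than the window size")
    spans = []
    start = 0
    while True:
        end = min(start + window, num_context_tokens)
        spans.append((start, end))
        if end == num_context_tokens:
            break
        # Advance so that the next window overlaps the current one by `stride`.
        start += window - stride
    return spans


short_context = sliding_windows(100, window=366, stride=128)
long_context = sliding_windows(500, window=366, stride=128)
print(short_context)
print(long_context)
```

A context that fits into one window yields a single feature, while the 500-token context yields two features that overlap by 128 tokens, so an answer near a window boundary is still fully contained in at least one of them.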

In [13]:
import pandas as pd

with pd.option_context('display.max_colwidth', 400):
    display(pd.DataFrame(train_dataset[:1]).transpose())

field,0
input_ids,"[101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 692..."
token_type_ids,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
attention_mask,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
start_positions,130
end_positions,137


#### 2.1.2 Preprocessing validation dataset

Next, we preprocess the validation dataset. This differs slightly from the training preprocessing because no labels are generated. Labels would only be needed to compute a validation loss, and since the loss says little about answer quality compared with exact match and F1, we do not compute it. Instead, we keep the example ID of every feature and the offsets of its context tokens, so that predicted token spans can be mapped back to text in the original context.

In [14]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

In [15]:
validation_dataset = dataset["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)

In [16]:
len(dataset["validation"]), len(validation_dataset)

(10570, 10784)
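
The validation preprocessing sets the offsets of all non-context tokens to None. A small sketch with hand-written values shows why: question offsets are relative to the question string, so they would point at unrelated text if used to slice the context. The sequence IDs and offsets below are invented for illustration and do not come from the BERT tokenizer.

```python
question = "Who won?"
context = "The Broncos won."

# Toy feature layout: [CLS] who won? [SEP] The Broncos won. [SEP]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 3), (4, 8), (0, 0), (0, 3), (4, 11), (12, 16), (0, 0)]

# Keep offsets only for context tokens (sequence id 1), as in the cell above.
masked = [o if sequence_ids[k] == 1 else None for k, o in enumerate(offsets)]

# Positions that may be used as the start or end of a predicted answer.
valid_positions = [k for k, o in enumerate(masked) if o is not None]

# The unmasked question offset (0, 3) would wrongly select "The" from the context.
wrong_text = context[offsets[1][0] : offsets[1][1]]
right_text = context[masked[5][0] : masked[5][1]]
print(valid_positions, repr(wrong_text), repr(right_text))
```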

#### 2.1.3 Dealing with Padding

Because every feature was already padded to the maximum length during preprocessing (padding="max_length"), all inputs have the same shape and we do not need to define a data collator.

### 2.2 Metrics

In [17]:
import evaluate

metric = evaluate.load(DATASET_NAME)
metric

EvaluationModule(name: "squad", module_type: "metric", features: {'predictions': {'id': Value(dtype='string', id=None), 'prediction_text': Value(dtype='string', id=None)}, 'references': {'id': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}}, usage: """
Computes SQuAD scores (F1 and EM).
Args:
    predictions: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answers dictionaries with the following key-values:
        - 'id': id of the question-answer pair (see above),
        - 'answers': a Dict in the SQuAD dataset format
            {
                'text': list of possible texts for the answer, as a list of strings
                'answer_start': list of start positions for 

In [18]:
import collections
import numpy as np
from tqdm.auto import tqdm

n_best = 20
max_answer_length = 30

def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)
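
The span search inside compute_metrics can be tried on toy logits. The sketch below isolates that step for a single feature; best_span is a hypothetical helper and the logits are invented. It shows that taking the highest start and end logits independently can give an invalid span, which is why the n-best candidates are combined and filtered.

```python
import numpy as np


def best_span(start_logit, end_logit, n_best=20, max_answer_length=30):
    """Return (start, end, score) of the best valid span, or None if there is none."""
    # Indices of the n_best largest logits, highest first.
    start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
    end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()

    best = None
    for start_index in start_indexes:
        for end_index in end_indexes:
            # Skip spans that end before they start or that are too long.
            if end_index < start_index:
                continue
            if end_index - start_index + 1 > max_answer_length:
                continue
            score = float(start_logit[start_index] + end_logit[end_index])
            if best is None or score > best[2]:
                best = (start_index, end_index, score)
    return best


start_logit = np.array([0.1, 0.2, 3.0, 0.3, 2.5, 0.0])
end_logit = np.array([0.0, 4.0, 0.1, 2.0, 0.2, 1.0])

naive = (int(np.argmax(start_logit)), int(np.argmax(end_logit)))
chosen = best_span(start_logit, end_logit, n_best=2)
print(naive, chosen)
```

Here the independent argmax picks start 2 and end 1, which is not a valid span, while the n-best search falls back to the best valid combination.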

### 2.3 Training

In [19]:
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(PRE_TRAINED_CHECKPOINT)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir=(TRAIN_OUTPUT_DIR / PRE_TRAINED_CHECKPOINT.replace("/", "_")).resolve(),
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    eval_strategy="no",
    save_strategy="epoch",
    logging_strategy="epoch",
    num_train_epochs=NUM_EPOCHS,
    learning_rate=2e-5,  # Original paper uses 5e-5
    weight_decay=0.01,
    save_total_limit=3,
)

In [21]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    processing_class=tokenizer,
)

In [22]:
torch.cuda.empty_cache()

print(f"--- {training_arguments.output_dir=}")
training_summary_bert_base = trainer.train()

--- training_arguments.output_dir='/home/e12433721/groups/192.039-2024W/bert/training/squad-single/google-bert_bert-base-uncased'


Step,Training Loss
2767,1.4143
5534,0.8735
8301,0.6896


In [23]:
training_summary_bert_base

TrainOutput(global_step=8301, training_loss=0.9924698455005421, metrics={'train_runtime': 4149.3898, 'train_samples_per_second': 64.003, 'train_steps_per_second': 2.001, 'total_flos': 5.204482670991974e+16, 'train_loss': 0.9924698455005421, 'epoch': 3.0})
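
The step counts in the training log can be reproduced from the dataset size. This is a quick sanity check that assumes a single GPU and no gradient accumulation, which matches the configuration shown above as far as we can tell.

```python
import math

num_features = 88524      # len(train_dataset) after preprocessing
batch_size = 32           # BATCH_SIZE
num_epochs = 3            # NUM_EPOCHS
train_runtime = 4149.3898  # seconds, from the TrainOutput above

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_features / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_features * num_epochs / train_runtime

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

The results agree with the logged steps (2767 per epoch, 8301 in total) and with the reported train_samples_per_second.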

### 2.4 Evaluation

In [24]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
scores_bert_base = compute_metrics(start_logits, end_logits, validation_dataset, dataset["validation"])


In [25]:
scores_bert_base

{'exact_match': 79.91485335856197, 'f1': 87.69099229972197}

In [26]:
training_history_bert_base = pd.DataFrame(trainer.state.log_history)
training_history_bert_base.epoch = training_history_bert_base.epoch.astype(int)
training_history_bert_base.groupby("epoch").first()

epoch,loss,grad_norm,learning_rate,step,train_runtime,train_samples_per_second,train_steps_per_second,total_flos,train_loss
1,1.4143,20.969122,1.3e-05,2767,,,,,
2,0.8735,17.271847,7e-06,5534,,,,,
3,0.6896,19.913282,0.0,8301,4149.3898,64.003,2.001,5.204483e+16,0.99247
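
The learning_rate column is consistent with a linear decay from the initial value to zero without warmup, which we assume is the schedule the Trainer applied here since no other scheduler was configured. The function below is our own minimal sketch of such a schedule, not the library's implementation.

```python
def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup (optional) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 8301
base_lr = 2e-5
logged_steps = [2767, 5534, 8301]
schedule = [linear_lr(step, total_steps, base_lr) for step in logged_steps]
print([f"{lr:.1e}" for lr in schedule])
```

After one and two epochs the rate has dropped to two thirds and one third of 2e-5, which matches the logged 1.3e-05 and 7e-06 after rounding.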


In [27]:
from IPython.display import Markdown, display

display(Markdown(f"### Model performance:"))
results = pd.DataFrame(
    [scores_bert_base["exact_match"]] + [scores_bert_base["f1"]],
    index=["EM"] + ["F1"],
    columns=["our BERT_BASE"],
)

# Achieved scores from original BERT paper:
results["original BERT_BASE"] = [80.8,88.5]

print(
    '"BERT_BASE" performance on the SQuAD dataset as reported in the original paper.'
)
results

### Model performance:

"BERT_BASE" performance on the SQuAD dataset as reported in the original paper.


metric,our BERT_BASE,original BERT_BASE
EM,79.914853,80.8
F1,87.690992,88.5


## 3. BERT-large

In [28]:
PRE_TRAINED_CHECKPOINT = "google-bert/bert-large-uncased"

### 3.1 Tokenization

In [29]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_CHECKPOINT)

#### 3.1.1 Preprocessing training dataset

In [30]:
def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [31]:
train_dataset = dataset["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

#### 3.1.2 Preprocessing validation dataset

In [32]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

In [33]:
validation_dataset = dataset["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)

### 3.2 Training

In [34]:
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(PRE_TRAINED_CHECKPOINT)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at google-bert/bert-large-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [35]:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir=(TRAIN_OUTPUT_DIR / PRE_TRAINED_CHECKPOINT.replace("/", "_")).resolve(),
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    eval_strategy="no",
    save_strategy="epoch",
    logging_strategy="epoch",
    num_train_epochs=NUM_EPOCHS,
    learning_rate=2e-5,  # Original paper uses 5e-5
    weight_decay=0.01,
    save_total_limit=3,
)

In [36]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    processing_class=tokenizer,
)

In [37]:
torch.cuda.empty_cache()

print(f"--- {training_arguments.output_dir=}")
training_summary_bert_large = trainer.train()

--- training_arguments.output_dir='/home/e12433721/groups/192.039-2024W/bert/training/squad-single/google-bert_bert-large-uncased'


Step,Training Loss
2767,1.1582
5534,0.6734
8301,0.4684


In [38]:
training_summary_bert_large

TrainOutput(global_step=8301, training_loss=0.7666577816066965, metrics={'train_runtime': 12760.915, 'train_samples_per_second': 20.811, 'train_steps_per_second': 0.651, 'total_flos': 1.849789299850629e+17, 'train_loss': 0.7666577816066965, 'epoch': 3.0})

### 3.3 Evaluation

In [39]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
scores_bert_large = compute_metrics(start_logits, end_logits, validation_dataset, dataset["validation"])


In [40]:
training_history_bert_large = pd.DataFrame(trainer.state.log_history)
training_history_bert_large.epoch = training_history_bert_large.epoch.astype(int)
training_history_bert_large.groupby("epoch").first()

epoch,loss,grad_norm,learning_rate,step,train_runtime,train_samples_per_second,train_steps_per_second,total_flos,train_loss
1,1.1582,13.103781,1.3e-05,2767,,,,,
2,0.6734,15.960748,7e-06,5534,,,,,
3,0.4684,20.557493,0.0,8301,12760.915,20.811,0.651,1.849789e+17,0.766658


In [41]:
from IPython.display import Markdown, display

display(Markdown(f"### Model performance:"))
results["our BERT_LARGE"] = [
    scores_bert_large["exact_match"],
    scores_bert_large["f1"],
]
# Achieved scores from original BERT paper:
results["original BERT_LARGE"] = [84.1,90.9]

results = results[
    [
        "our BERT_BASE",
        "original BERT_BASE",
        "our BERT_LARGE",
        "original BERT_LARGE",
    ]
]
print(
    '"BERT_BASE" and "BERT_LARGE" performance on the SQuAD dataset as reported in the original paper.'
)
results

### Model performance:

"BERT_BASE" and "BERT_LARGE" performance on the SQuAD dataset as reported in the original paper.


metric,our BERT_BASE,original BERT_BASE,our BERT_LARGE,original BERT_LARGE
EM,79.914853,80.8,83.263955,84.1
F1,87.690992,88.5,90.457076,90.9
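
To make the comparison easier to read, the gap to the paper's scores can be computed directly from the table. The numbers below are copied from the table above; the dictionary layout is only for illustration.

```python
# (EM, F1) pairs copied from the results table.
ours = {
    "BERT_BASE": (79.914853, 87.690992),
    "BERT_LARGE": (83.263955, 90.457076),
}
paper = {
    "BERT_BASE": (80.8, 88.5),
    "BERT_LARGE": (84.1, 90.9),
}

# Gap in percentage points: paper score minus our score.
gaps = {
    name: tuple(round(p - o, 2) for o, p in zip(ours[name], paper[name]))
    for name in ours
}
for name, (em_gap, f1_gap) in gaps.items():
    print(f"{name}: EM gap {em_gap:.2f}, F1 gap {f1_gap:.2f}")
```

Both models end up within one point of the reported scores. One known deviation from the paper's setup is the learning rate (2e-5 here instead of 5e-5); whether that accounts for the remaining gap is not tested in this notebook.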
