# 📝 1) Introduction (QA vs. Classification)


**Question Answering (extractive QA)** differs from text classification in both objective and output structure. In classification, the model maps an input (text or pair) to a fixed set of labels (e.g., sentiment: positive/negative). In **extractive QA**, the model receives a question and a context and predicts the start and end token positions that mark the answer span inside the context. Thus, the head is trained to output two distributions over token positions (start/end), not a single softmax over labels. Evaluation also differs: **QA uses Exact Match (EM) and F1 on answer strings**, while classification typically uses accuracy, precision/recall/F1 on labels.
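To make the two QA metrics concrete, here is a minimal pure-Python sketch of EM and token-level F1 in the style of the SQuAD v1.1 scoring rules (lowercase, strip punctuation and English articles, then compare). The helper names `normalize_answer`, `exact_match`, and `f1_score` are illustrative choices for this sketch; the `evaluate` library's `squad` metric used later is the reference implementation.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower!"))  # 1.0
print(f1_score("Albert Einstein", "Einstein"))           # 2/3: precision 0.5, recall 1.0
```

A prediction that is close but not identical scores 0 on EM and a partial score on F1, which is why F1 is usually the higher of the two numbers.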



# Installing required libraries

In [1]:
!pip install -q transformers datasets evaluate accelerate

# Importing the libraries

In [2]:
import numpy as np
import torch
import evaluate
from datasets import load_dataset
# Only the QA classes are needed; the sequence-classification imports were unused
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

In [3]:
from datasets import load_dataset

# Load full SQuAD v1.1
dataset = load_dataset("squad")

# Shuffle for randomness
dataset = dataset.shuffle(seed=42)

# Custom, non-overlapping splits, all drawn from the shuffled SQuAD train split
train_dataset = dataset["train"].select(range(5000))
valid_dataset = dataset["train"].select(range(5000, 6000))
test_dataset  = dataset["train"].select(range(6000, 8000))

print("Train size:", len(train_dataset))
print("Validation size:", len(valid_dataset))
print("Test size:", len(test_dataset))


Train size: 5000
Validation size: 1000
Test size: 2000


# Tokenizing

In [5]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


In [6]:
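max_length = 384  # maximum tokens per feature (question + context window + special tokens)
doc_stride = 128  # number of tokens shared by consecutive context windows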

def prepare_train_features(examples):
    questions = [q.strip() for q in examples["question"]]

    tokenized = tokenizer(
        questions,
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    start_positions = []
    end_positions = []

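    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized["input_ids"][i]
        # Index of [CLS]; used as the label when this window has no answer
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # One example can yield several windows; find the example this one came from
        sample_idx = sample_mapping[i]
        answers = examples["answers"][sample_idx]
        if len(answers["answer_start"]) == 0:
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            # Character span of the first reference answer in the original context
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # First and last token indices that belong to the context (sequence id 1)
            sequence_ids = tokenized.sequence_ids(i)
            context_start = sequence_ids.index(1)
            context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

            # Answer not fully inside this window: label the window with [CLS]
            if not (offsets[context_start][0] <= start_char and offsets[context_end][1] >= end_char):
                start_positions.append(cls_index)
                end_positions.append(cls_index)
            else:
                # Walk forward past every token starting at or before start_char, then step back one
                idx = context_start
                while idx <= context_end and offsets[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)

                # Walk backward past every token ending at or after end_char, then step forward one
                idx = context_end
                while idx >= context_start and offsets[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)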

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

train_dataset_proc = train_dataset.map(
    prepare_train_features, batched=True, remove_columns=train_dataset.column_names
)

valid_dataset_proc = valid_dataset.map(
    prepare_train_features, batched=True, remove_columns=valid_dataset.column_names
)
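
The effect of `max_length` and `doc_stride` in the cell above can be pictured with token indices alone: a context that does not fit in one feature is cut into windows that overlap, so an answer near a boundary still appears whole in at least one window. The following is a toy sketch under simplifying assumptions; `window_spans` is a hypothetical helper, not a tokenizer API, and the real tokenizer also reserves room for the question and special tokens in every window.

```python
def window_spans(n_tokens, window, overlap):
    """Return (start, end) token index pairs for overlapping windows.

    Consecutive windows share `overlap` tokens, so each new window
    starts `window - overlap` tokens after the previous one.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += window - overlap
    return spans


# A 10-token context, windows of 4 tokens, 2 tokens of overlap
print(window_spans(10, 4, 2))  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

This is also why the number of features after `map` can exceed the number of examples: each long context contributes one feature per window.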


In [8]:
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
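
The two newly initialized parameters named in the warning, `qa_outputs.weight` and `qa_outputs.bias`, form the span head: one linear layer that maps each token's hidden vector to two logits, one for "answer starts here" and one for "answer ends here". A minimal pure-Python sketch of that projection follows; `qa_head` and the toy numbers are illustrative only and are not the library's code.

```python
def qa_head(hidden_states, weight, bias):
    """Project each token vector to a (start, end) logit pair.

    hidden_states: list of per-token vectors of size H
    weight: two rows of size H (row 0 -> start, row 1 -> end)
    bias: two scalars (start, end)
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    start_logits = [dot(weight[0], h) + bias[0] for h in hidden_states]
    end_logits = [dot(weight[1], h) + bias[1] for h in hidden_states]
    return start_logits, end_logits


# Toy input: 3 tokens with hidden size 2 (BERT-base uses hidden size 768)
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[2.0, 0.0], [0.0, 3.0]]
bias = [0.5, -0.5]

start_logits, end_logits = qa_head(hidden, weight, bias)
print(start_logits)  # [2.5, 0.5, 2.5]
print(end_logits)    # [-0.5, 2.5, 2.5]
```

Because only this small layer starts from random values while the encoder is pretrained, a few epochs of fine-tuning are often enough to get usable spans.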


# Training

In [9]:
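from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
import evaluate

# Features are already padded to max_length, so this collator only batches and tensorizes them
data_collator = DataCollatorWithPadding(tokenizer)
metric = evaluate.load("squad")  # not used during training; EM/F1 are computed on the test set later

def compute_metrics(eval_preds):
    # Placeholder: EM/F1 need the span post-processing done in a later cell,
    # so during training only the evaluation loss is reported.
    return {}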

training_args = TrainingArguments(
    output_dir="./qa-bert-squad-custom",
    eval_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=50,
    save_strategy="no",
    report_to="none"
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_proc,
    eval_dataset=valid_dataset_proc,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.7848        | 1.618154        |
| 2     | 1.3669        | 1.512801        |


TrainOutput(global_step=632, training_loss=2.0090128077736384, metrics={'train_runtime': 761.7035, 'train_samples_per_second': 13.262, 'train_steps_per_second': 0.83, 'total_flos': 1979714877410304.0, 'train_loss': 2.0090128077736384, 'epoch': 2.0})

In [10]:
metrics = trainer.evaluate()
print(metrics)


{'eval_loss': 1.5128014087677002, 'eval_runtime': 23.2334, 'eval_samples_per_second': 43.386, 'eval_steps_per_second': 2.712, 'epoch': 2.0}
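
For context on the loss values reported above: the QA model's loss is the average of two cross-entropies, one over start positions and one over end positions. The sketch below reproduces that arithmetic in pure Python for a single feature; `cross_entropy` and `qa_loss` are illustrative helper names, not library functions.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def qa_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start and end cross-entropy losses."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2


# Uniform logits over 4 positions: the model is guessing, so the loss is ln(4)
print(qa_loss([0.0] * 4, [0.0] * 4, 1, 2))  # ln(4) ≈ 1.386
```

By the same arithmetic, guessing uniformly over 384 token positions would give a loss of ln(384) ≈ 5.95, which gives a rough scale for reading the validation loss.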


# Evaluating EM and F1 on the test set

In [11]:
import collections
import numpy as np

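# Prepare test features (keep offsets, don't create labels)
def prepare_test_features(examples):
    questions = [q.strip() for q in examples["question"]]
    tokenized = tokenizer(
        questions,
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    tokenized["example_id"] = []
    for i in range(len(tokenized["input_ids"])):
        sample_idx = sample_mapping[i]
        tokenized["example_id"].append(examples["id"][sample_idx])
        # Keep offsets only for context tokens; question, special, and padding
        # tokens get None so post-processing can skip them
        sequence_ids = tokenized.sequence_ids(i)
        tokenized["offset_mapping"][i] = [
            offset if sequence_ids[k] == 1 else None
            for k, offset in enumerate(tokenized["offset_mapping"][i])
        ]
    return tokenized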

test_features = test_dataset.map(
    prepare_test_features,
    batched=True,
    remove_columns=test_dataset.column_names,
)

# Get raw predictions (start/end logits)
raw_predictions = trainer.predict(test_features)

# Post-process predictions into answer strings
def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size=20, max_answer_length=30):
    all_start_logits, all_end_logits = raw_predictions.predictions
    features_per_example = collections.defaultdict(list)
    for i, f in enumerate(features):
        features_per_example[f["example_id"]].append(i)

    predictions = {}

    for example in examples:
        example_id = example["id"]
        context = example["context"]
        feature_indices = features_per_example[example_id]

        valid_answers = []
        for fi in feature_indices:
            start_logits = all_start_logits[fi]
            end_logits = all_end_logits[fi]
            # Row-first access avoids re-reading the entire column for each feature
            offsets = features[fi]["offset_mapping"]

            start_indexes = np.argsort(start_logits)[-1:-n_best_size-1:-1].tolist()
            end_indexes = np.argsort(end_logits)[-1:-n_best_size-1:-1].tolist()

            for start_index in start_indexes:
                for end_index in end_indexes:
                    if start_index >= len(offsets) or end_index >= len(offsets):
                        continue
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    if end_index < start_index:
                        continue
                    length = offsets[end_index][1] - offsets[start_index][0]
                    if length <= 0 or (end_index - start_index + 1) > max_answer_length:
                        continue

                    start_char = offsets[start_index][0]
                    end_char = offsets[end_index][1]
                    valid_answers.append(
                        {"score": start_logits[start_index] + end_logits[end_index],
                         "text": context[start_char:end_char]}
                    )

        if len(valid_answers) > 0:
            best_answer = max(valid_answers, key=lambda x: x["score"])
            predictions[example_id] = best_answer["text"]
        else:
            predictions[example_id] = ""
    return predictions

# Convert predictions to SQuAD format
final_predictions = postprocess_qa_predictions(test_dataset, test_features, raw_predictions)

preds_list = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
refs_list = [{"id": ex["id"], "answers": ex["answers"]} for ex in test_dataset]

# Compute EM & F1
import evaluate
squad_metric = evaluate.load("squad")
results = squad_metric.compute(predictions=preds_list, references=refs_list)

print("📊 Test Results (2000 samples)")
print("Exact Match (EM):", results["exact_match"])
print("F1 Score:", results["f1"])


Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

📊 Test Results (2000 samples)
Exact Match (EM): 54.45
F1 Score: 68.42323437369109


# Running a sample test

In [13]:
import torch

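def answer_question(question, context):
    # Truncate long contexts so the input never exceeds the model's window
    inputs = tokenizer(
        question, context, truncation="only_second", max_length=max_length, return_tensors="pt"
    )
    # Move input tensors to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    model.eval()  # make sure dropout is disabled for inference
    with torch.no_grad():
        outputs = model(**inputs)
    # Greedy decoding: most likely start and end, chosen independently
    start = torch.argmax(outputs.start_logits)
    end = torch.argmax(outputs.end_logits) + 1
    # If end <= start the slice is empty and an empty string is returned
    return tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start:end])
    )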

# Test 1
q1 = "Who developed the theory of relativity?"
c1 = "Albert Einstein developed the theory of relativity in the early 20th century."
print("Answer 1:", answer_question(q1, c1))

# Test 2
q2 = "What is the capital of France?"
c2 = "France is a country in Europe. Its capital city is Paris."
print("Answer 2:", answer_question(q2, c2))

Answer 1: albert einstein
Answer 2: paris


# Reflection

**In this assignment**, I learned how to carve a large dataset into custom, non-overlapping splits for training, validation, and testing. I saw how extractive QA differs from classification: the model predicts an answer span (start and end token positions) instead of a label. Tokenizing question/context pairs into overlapping windows and mapping character-level answers to token positions was the most complex step. Even with a small subset (5,000/1,000/2,000 examples for train/validation/test) and two epochs, the model reached an Exact Match of 54.45 and an F1 of 68.42 on the test examples. Finally, the two custom contexts returned the expected spans ("albert einstein" and "paris"), which suggests that fine-tuning teaches the model to select meaningful answer spans.
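
Since mapping answers to token positions was the hardest step, here is a toy illustration of the idea with hand-written offsets in place of a real tokenizer. The helper `char_span_to_token_span` and the word-level offsets are assumptions made for this sketch; the notebook's `prepare_train_features` does the same job on real offset mappings and falls back to the `[CLS]` index when the answer is outside the window.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (start_token, end_token), or None if not covered.

    offsets: list of (char_start, char_end) pairs, one per token.
    """
    start_token = end_token = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        # First token that extends past the answer's first character
        if start_token is None and tok_end > start_char:
            start_token = i
        # Last token that begins before the answer's end
        if tok_start < end_char:
            end_token = i
    if start_token is None or end_token is None or start_token > end_token:
        return None
    return start_token, end_token


context = "Its capital city is Paris."
# Hand-written offsets for: Its | capital | city | is | Paris | .
offsets = [(0, 3), (4, 11), (12, 16), (17, 19), (20, 25), (25, 26)]

answer = "Paris"
start_char = context.index(answer)
end_char = start_char + len(answer)
print(char_span_to_token_span(offsets, start_char, end_char))  # (4, 4)
```

Real BERT tokens are subwords rather than whole words, so an answer usually spans several tokens, but the comparison of character offsets works the same way.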