<h1><center>Elevvo Internship</center></h1>
<h1><center>Task 6</center></h1>
<h2><center>Question Answering with Transformers</center></h2>

# **Hands-on Task 6**

In this notebook, we fine-tune and evaluate DistilBERT for extractive question answering on SQuAD v1.1.
We start from the distilbert-base-uncased checkpoint, a distilled version of BERT that is smaller and faster at inference.

**⚙️ Steps**

- Load and preprocess the dataset (SQuAD v1.1).

- Tokenize context–question pairs with a fast Hugging Face tokenizer loaded via AutoTokenizer.

- Fine-tune DistilBertForQuestionAnswering.

- Evaluate with EM (Exact Match) and F1 metrics.

- Save model in folder: qa_model_distilbert-base-uncased/.
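
The EM and F1 metrics named in the steps above can be sketched in plain Python. This is a minimal illustration, assuming the normalization commonly used for SQuAD scoring (lowercasing, then stripping punctuation and the articles a/an/the). The helper names `normalize_answer`, `exact_match`, and `f1_score` are illustrative and are not defined elsewhere in this notebook, which relies on `evaluate.load("squad")` for the reported numbers.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Nile River", "Nile River"))  # 1.0
print(round(f1_score("Saint Bernadette", "Saint Bernadette Soubirous"), 2))  # 0.8
```

When an example has several reference answers, the SQuAD metric takes the best score over the references and reports corpus averages as percentages; this sketch scores a single prediction against a single reference.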

# **1. Data Collection**

**Setup**

In [1]:
!pip install transformers datasets accelerate evaluate --quiet


**Import required libraries**

In [2]:
import random
import numpy as np
import torch
from datasets import load_dataset

**Set random seeds and device**

In [3]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

**Load the SQuAD v1.1 dataset from Hugging Face**

In [4]:
dataset = load_dataset("squad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

**Preview one sample**

In [5]:
sample = dataset["train"][0]
{k: sample[k] for k in ["context", "question", "answers"]}

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

**Check average context and answer lengths**

In [6]:
def avg_len(split):
    ctx = np.mean([len(x["context"].split()) for x in split])
    ans = np.mean([len(x["answers"]["text"][0].split()) for x in split])
    return round(ctx,1), round(ans,2)

avg_ctx_train, avg_ans_train = avg_len(dataset["train"])
avg_ctx_val, avg_ans_val = avg_len(dataset["validation"])
(avg_ctx_train, avg_ans_train, avg_ctx_val, avg_ans_val)

(np.float64(119.8), np.float64(3.16), np.float64(124.0), np.float64(3.02))

**Save smaller subsets for quick tests**

In [7]:
small_train = dataset["train"].select(range(3000))
small_val = dataset["validation"].select(range(500))
len(small_train), len(small_val)

(3000, 500)

# **2. Preprocessing**

**Import tokenizer**

In [8]:
from transformers import AutoTokenizer

**Choose the candidate models**

In [9]:
MODEL_NAMES = [
    "distilbert-base-uncased",
    "bert-base-uncased",
    "roberta-base",
    "albert-base-v2"
]

**Build tokenizers**

In [10]:
tokenizers = {name: AutoTokenizer.from_pretrained(name, use_fast=True) for name in MODEL_NAMES}
list(tokenizers.keys())

['distilbert-base-uncased',
 'bert-base-uncased',
 'roberta-base',
 'albert-base-v2']

**Tokenization setup**

In [11]:
def safe_lengths(tok, desired_max=384, desired_stride=128):
    m = tok.model_max_length if isinstance(tok.model_max_length, int) else 512
    max_len = min(desired_max, m)
    # ensure stride < max_len and leaves room for special tokens
    stride = min(desired_stride, max_len - 32) if max_len > 32 else max(1, max_len // 4)
    return max_len, stride
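
To make the roles of `max_length` and `stride` concrete, here is a toy sketch of how a long context is split into overlapping windows. It assumes `stride` means the number of tokens shared by consecutive windows, which is how the Hugging Face tokenizers document the parameter. The function `sliding_windows` is a hypothetical helper written for illustration and works on a plain list instead of real token ids.

```python
def sliding_windows(tokens, window, stride):
    """Split `tokens` into chunks of at most `window` items.

    Consecutive chunks share `stride` items, so an answer that is cut by
    one chunk boundary can still appear whole in the next chunk.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += step
    return chunks

context_tokens = list(range(10))  # stand-in for 10 context token ids
for chunk in sliding_windows(context_tokens, window=4, stride=2):
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

In the real tokenizer the space available to the context is smaller than `max_length`, because the question and the special tokens occupy part of every window.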

**Training feature builder (per tokenizer)**

In [12]:
def make_prepare_train_features(tok):
    pad_on_right = tok.padding_side == "right"
    max_length, doc_stride = safe_lengths(tok)

    def _fn(examples):
        tokenized = tok(
            examples["question" if pad_on_right else "context"],
            examples["context"  if pad_on_right else "question"],
            truncation="only_second" if pad_on_right else "only_first",  # truncate only the context, never the question
            max_length=max_length,
            stride=doc_stride,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding="max_length"
        )

        sample_map = tokenized.pop("overflow_to_sample_mapping")
        offset_mapping = tokenized.pop("offset_mapping")

        start_positions = []
        end_positions = []

        for i, offsets in enumerate(offset_mapping):
            input_ids = tokenized["input_ids"][i]
            cls_index = input_ids.index(tok.cls_token_id) if tok.cls_token_id in input_ids else 0
            sample_idx = sample_map[i]
            answers = examples["answers"][sample_idx]

            if len(answers["answer_start"]) == 0:
                start_positions.append(cls_index)
                end_positions.append(cls_index)
                continue

            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            sequence_ids = tokenized.sequence_ids(i)

            # find context token span
            idx = 0
            ctx_id = 1 if pad_on_right else 0
            while idx < len(sequence_ids) and sequence_ids[idx] != ctx_id:
                idx += 1
            context_start = idx
            while idx < len(sequence_ids) and sequence_ids[idx] == ctx_id:
                idx += 1
            context_end = idx - 1

            if context_start >= len(offsets) or context_end < 0:
                start_positions.append(cls_index)
                end_positions.append(cls_index)
                continue

            if not (offsets[context_start] and offsets[context_start][0] <= start_char and
                    offsets[context_end] and offsets[context_end][1] >= end_char):
                start_positions.append(cls_index)
                end_positions.append(cls_index)
            else:
                while context_start < len(offsets):
                    o = offsets[context_start]
                    if o and o[0] <= start_char < o[1]:
                        break
                    context_start += 1

                while context_end >= 0:
                    o = offsets[context_end]
                    if o and o[0] < end_char <= o[1]:
                        break
                    context_end -= 1

                # Fall back to CLS if either scan ran off the sequence without finding its token
                start_positions.append(context_start if context_start < len(offsets) else cls_index)
                end_positions.append(context_end if context_end >= 0 else cls_index)

        tokenized["start_positions"] = start_positions
        tokenized["end_positions"] = end_positions
        return tokenized

    return _fn
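
The core of the feature builder above is mapping a character-level answer span to token indices through the offset mapping. The sketch below isolates that step on a toy example. It assumes a whitespace "tokenizer" and a single `None` entry standing in for a special token; `char_span_to_token_span` is a hypothetical helper, not a function from this notebook or from `transformers`.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Return (start_token, end_token) covering [start_char, end_char).

    `offsets` holds one (char_start, char_end) pair per token, or None for
    tokens that do not belong to the context. Returns None when the answer
    is not fully contained in this window.
    """
    start_tok = end_tok = None
    for i, off in enumerate(offsets):
        if off is None:
            continue
        if off[0] <= start_char < off[1]:
            start_tok = i
        if off[0] < end_char <= off[1]:
            end_tok = i
    if start_tok is None or end_tok is None:
        return None
    return start_tok, end_tok

context = "The Nile River flows north."

# Build toy offsets with a whitespace split (an assumption for illustration).
offsets = [None]  # placeholder for a special token such as [CLS]
pos = 0
for word in context.split():
    begin = context.index(word, pos)
    offsets.append((begin, begin + len(word)))
    pos = begin + len(word)

answer = "Nile River"
start_char = context.index(answer)
end_char = start_char + len(answer)

print(char_span_to_token_span(offsets, start_char, end_char))  # (2, 3)
```

Returning `None` here corresponds to the case where the builder labels the window with the CLS index because the answer lies outside it.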

**Validation feature builder (per tokenizer)**

In [13]:
def make_prepare_validation_features(tok):
    pad_on_right = tok.padding_side == "right"
    max_length, doc_stride = safe_lengths(tok)

    def _fn(examples):
        tokenized = tok(
            examples["question" if pad_on_right else "context"],
            examples["context"  if pad_on_right else "question"],
            truncation="only_second" if pad_on_right else "only_first",  # truncate only the context, never the question
            max_length=max_length,
            stride=doc_stride,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding="max_length"
        )

        sample_map = tokenized.pop("overflow_to_sample_mapping")
        tokenized["example_id"] = []

        for i in range(len(tokenized["input_ids"])):
            sample_idx = sample_map[i]
            tokenized["example_id"].append(examples["id"][sample_idx])

            sequence_ids = tokenized.sequence_ids(i)
            ctx_id = 1 if pad_on_right else 0
            offsets = tokenized["offset_mapping"][i]
            tokenized["offset_mapping"][i] = [
                o if sequence_ids[k] == ctx_id else None
                for k, o in enumerate(offsets)
            ]

        return tokenized

    return _fn

**Map features for every model**

In [14]:
train_features = {}
validation_features = {}

for name, tok in tokenizers.items():
    train_features[name] = dataset["train"].map(
        make_prepare_train_features(tok),
        batched=True,
        remove_columns=dataset["train"].column_names
    )
    validation_features[name] = dataset["validation"].map(
        make_prepare_validation_features(tok),
        batched=True,
        remove_columns=dataset["validation"].column_names
    )

sorted([(k, len(v), len(validation_features[k])) for k, v in train_features.items()])

[('albert-base-v2', 88638, 10808),
 ('bert-base-uncased', 88524, 10784),
 ('distilbert-base-uncased', 88524, 10784),
 ('roberta-base', 88756, 10790)]

**Quick sanity check**

In [15]:
one_name = MODEL_NAMES[0]
example_train = train_features[one_name][0]
example_val = validation_features[one_name][0]
list(example_train.keys()), list(example_val.keys())

(['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
 ['input_ids', 'attention_mask', 'offset_mapping', 'example_id'])

# **3. Model & Training — DistilBERT**


**Import Libraries**

In [16]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer, default_data_collator
import torch
import numpy as np
import pandas as pd
import evaluate

**SQuAD metric and small constants**

In [17]:
metric = evaluate.load("squad")
n_best = 20
max_answer_len = 30

**Post‑processing to convert logits → text spans**

In [47]:
def postprocess_qa_predictions(examples, features, raw_predictions):
    # Assume the first two elements of the tuple are start and end logits
    if isinstance(raw_predictions, tuple) and len(raw_predictions) >= 2:
        start_logits, end_logits = raw_predictions[:2]
    else:
        # If not a tuple with at least two elements, raise an informative error
        raise ValueError(f"Expected raw_predictions to be a tuple with at least two elements (start_logits, end_logits), but received type {type(raw_predictions)} with length {len(raw_predictions) if hasattr(raw_predictions, '__len__') else 'N/A'}")

    id2idx = {k: i for i, k in enumerate(examples["id"])}
    feats_per_ex = {}
    for i, f in enumerate(features):
        ex_idx = id2idx[f["example_id"]]
        feats_per_ex.setdefault(ex_idx, []).append(i)
    preds = []
    for ex_idx, ex in enumerate(examples):
        ctx = ex["context"]
        f_idxs = feats_per_ex.get(ex_idx, [])
        cand = []
        for fi in f_idxs:
            s_log = start_logits[fi]
            e_log = end_logits[fi]
            offsets = features[fi]["offset_mapping"]
            s_idx = np.argsort(s_log)[-n_best:]
            e_idx = np.argsort(e_log)[-n_best:]
            for s in s_idx:
                for e in e_idx:
                    if s >= len(offsets) or e >= len(offsets):
                        continue
                    # Check if offsets[s] and offsets[e] are not None BEFORE accessing their elements
                    if offsets[s] is None or offsets[e] is None:
                        continue
                    if e < s or (e - s + 1) > max_answer_len:
                        continue
                    score = float(s_log[s] + e_log[e])
                    cand.append((score, offsets[s][0], offsets[e][1]))
        if not cand:
            preds.append({"id": ex["id"], "prediction_text": ""})
        else:
            best = max(cand, key=lambda x: x[0])
            preds.append({"id": ex["id"], "prediction_text": ctx[best[1]:best[2]]})
    return preds
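
The span search inside `postprocess_qa_predictions` can be illustrated on a single toy feature without NumPy or a model. The sketch below uses made-up logits and offsets, and the helper name `best_span` is hypothetical. It follows the same rules as the function above: take the top `n_best` start and end indices, skip tokens outside the context, reject spans that end before they start or exceed `max_answer_len`, and keep the highest summed score.

```python
def best_span(start_logits, end_logits, offsets, n_best=20, max_answer_len=30):
    """Return (score, char_start, char_end) of the best valid span, or None."""
    def top(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best]

    best = None
    for s in top(start_logits):
        for e in top(end_logits):
            if offsets[s] is None or offsets[e] is None:
                continue  # token is not part of the context
            if e < s or e - s + 1 > max_answer_len:
                continue  # reversed or overly long span
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[0]:
                best = (score, offsets[s][0], offsets[e][1])
    return best

context = "The Nile River flows north."
offsets = [None, (0, 3), (4, 8), (9, 14), (15, 20), (21, 27)]
start_logits = [5.0, 0.1, 3.0, 0.2, 0.0, -1.0]  # made-up values
end_logits = [5.0, 0.0, 0.5, 2.5, 0.3, -1.0]    # made-up values

score, char_start, char_end = best_span(start_logits, end_logits, offsets)
print(context[char_start:char_end], score)  # Nile River 5.5
```

Index 0 has the largest logits in this toy input but is masked with `None`, which shows why the validation features replace non-context offsets with `None` before post-processing.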

**Metric callback factory per model**

In [19]:
def make_compute_metrics(model_name):
    val_feats = validation_features[model_name]
    val_examples = dataset["validation"]
    def _fn(eval_pred):
        raw = eval_pred if isinstance(eval_pred, tuple) else eval_pred.predictions
        preds = postprocess_qa_predictions(val_examples, val_feats, raw)
        refs = [{"id": ex["id"], "answers": ex["answers"]} for ex in val_examples]
        return metric.compute(predictions=preds, references=refs)
    return _fn

**TrainingArguments helper that works on old and new transformers**

In [20]:
def make_args(model_name, lr=3e-5, bs=16, epochs=2, fp16=torch.cuda.is_available()):
    out = f"./qa_runs_{model_name.replace('/', '_')}"
    try:
        return TrainingArguments(
            output_dir=out,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            learning_rate=lr,
            per_device_train_batch_size=bs,
            per_device_eval_batch_size=bs,
            num_train_epochs=epochs,
            weight_decay=0.01,
            fp16=fp16,
            logging_steps=100,
            report_to="none",
            load_best_model_at_end=True,
            metric_for_best_model="f1",
            greater_is_better=True
        )
    except TypeError:
        return TrainingArguments(
            output_dir=out,
            do_train=True,
            do_eval=True,
            learning_rate=lr,
            per_device_train_batch_size=bs,
            per_device_eval_batch_size=bs,
            num_train_epochs=epochs,
            weight_decay=0.01,
            fp16=fp16,
            logging_steps=100,
            report_to="none",
            save_total_limit=2,
            save_steps=1000,
            eval_steps=1000
        )

**Build trainer**

In [21]:
def build_trainer(name, model, args):
    try:
        return Trainer(
            model=model,
            args=args,
            train_dataset=train_features[name],
            eval_dataset=validation_features[name],
            processing_class=tokenizers[name],
            data_collator=default_data_collator,
            compute_metrics=make_compute_metrics(name)
        )
    except TypeError:
        return Trainer(
            model=model,
            args=args,
            train_dataset=train_features[name],
            eval_dataset=validation_features[name],
            tokenizer=tokenizers[name],
            data_collator=default_data_collator,
            compute_metrics=make_compute_metrics(name)
        )

In [22]:
name = "distilbert-base-uncased"
print(f"=== Training {name} ===")
model = AutoModelForQuestionAnswering.from_pretrained(name)
args = make_args(name)
trainer = build_trainer(name, model, args)
trainer.train()
metrics = trainer.evaluate()

=== Training distilbert-base-uncased ===


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Step | Training Loss |
|---|---|
| 100 | 4.254 |
| 200 | 3.1223 |
| 300 | 2.5611 |
| 400 | 2.1896 |
| 500 | 2.013 |
| 600 | 1.8504 |
| 700 | 1.7535 |
| 800 | 1.756 |
| 900 | 1.6331 |
| 1000 | 1.5658 |


In [23]:
save_dir = f"./qa_model_{name.replace('/', '_')}"
trainer.save_model(save_dir)
tokenizers[name].save_pretrained(save_dir)

('./qa_model_distilbert-base-uncased/tokenizer_config.json',
 './qa_model_distilbert-base-uncased/special_tokens_map.json',
 './qa_model_distilbert-base-uncased/vocab.txt',
 './qa_model_distilbert-base-uncased/added_tokens.json',
 './qa_model_distilbert-base-uncased/tokenizer.json')

In [24]:
df_results = pd.DataFrame([{"model": name, **{k: float(v) for k, v in metrics.items()}}])
df_results

| | model | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch |
|---|---|---|---|---|---|
| 0 | distilbert-base-uncased | 30.9438 | 348.503 | 21.781 | 2.0 |


# **4. Inference & Quick Sanity Check**

**Load the fine‑tuned checkpoint for this model**

In [25]:
from transformers import pipeline

MODEL_NAME = "distilbert-base-uncased"
CKPT_DIR   = f"./qa_model_{MODEL_NAME.replace('/', '_')}"
tokenizer  = AutoTokenizer.from_pretrained(CKPT_DIR, use_fast=True)
model      = AutoModelForQuestionAnswering.from_pretrained(CKPT_DIR)

**Build the QA pipeline**

In [26]:
qa_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

Device set to use cuda:0


**Example context and question**

In [27]:
context = """
The Nile River is the longest river in the world, flowing northward through
eastern Africa into the Mediterranean Sea. It has historically been of great
importance to Egyptian civilization.
"""

question = "Which river is the longest in the world?"

# Run inference
result = qa_pipeline(question=question, context=context)
print("Question:", question)
print("Answer:", result["answer"])
print("Score:", round(result["score"], 4))

Question: Which river is the longest in the world?
Answer: The Nile River
Score: 0.6842


# **5. Export (checkpoint)**

**Set paths for this model**

In [28]:
import os, json, pathlib

MODEL_NAME = "distilbert-base-uncased"
CKPT_DIR   = f"./qa_model_{MODEL_NAME.replace('/', '_')}"
pathlib.Path(CKPT_DIR).mkdir(parents=True, exist_ok=True)
CKPT_DIR

'./qa_model_distilbert-base-uncased'

**Save final checkpoint (model + tokenizer + a mini card)**

In [29]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Save the in-memory fine-tuned weights held by the trainer, so the files on disk match the trained model
trainer.save_model(CKPT_DIR)
tokenizers[MODEL_NAME].save_pretrained(CKPT_DIR)

with open(os.path.join(CKPT_DIR, "model_card.txt"), "w") as f:
    f.write(f"Model: {MODEL_NAME}\n")
    f.write("Notes: Fine-tuned on SQuAD v1.1 for extractive QA\n")
print("Saved checkpoint to:", CKPT_DIR)

Saved checkpoint to: ./qa_model_distilbert-base-uncased


**Write metrics JSON (EM/F1)**

In [49]:
# Step 1: Collect runtime metrics (eval_runtime, throughput, epoch); no eval loss appears because the validation features carry no labels
runtime_metrics = trainer.evaluate()

# Step 2: Compute EM and F1
raw_predictions = trainer.predict(validation_features[MODEL_NAME]).predictions
compute_fn = make_compute_metrics(MODEL_NAME)
qa_metrics = compute_fn(raw_predictions)

# Step 3: Merge both
all_metrics = {**runtime_metrics, **qa_metrics}

# Step 4: Save to JSON
metrics_out = {
    "model": MODEL_NAME,
    "metrics": {k: float(v) for k, v in all_metrics.items()}
}
with open(f"./metrics_{MODEL_NAME.replace('/', '_')}.json", "w") as f:
    json.dump(metrics_out, f, indent=2)

metrics_out

{'model': 'distilbert-base-uncased',
 'metrics': {'eval_runtime': 30.9092,
  'eval_samples_per_second': 348.893,
  'eval_steps_per_second': 21.806,
  'epoch': 2.0,
  'exact_match': 77.1523178807947,
  'f1': 85.3652932551452}}