<h1><center>Elevvo Internship</center></h1> <h1><center>Task 6</center></h1> <h2><center>Model Comparison: DistilBERT vs BERT vs RoBERTa</center></h2>

# **Hands-on: Comparison Notebook**

- Load three fine‑tuned QA checkpoints

- Read EM/F1 from the exported metrics_*.json

- Quick side‑by‑side inference and latency

- Summary table with metrics and sizes

# **1- Setup & Paths**

**Install + Imports**

In [1]:
!pip install transformers datasets evaluate --quiet

import os, json, pathlib, torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline

**GPU / device**

In [2]:
device = 0 if torch.cuda.is_available() else -1
device

-1

**Mount Google Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Point to Drive folder**

In [36]:
BASE_DIR = "/content/drive/MyDrive/qa_models"
os.listdir(BASE_DIR)

['qa_model_bert-base-uncased',
 'qa_model_distilbert-base-uncased',
 'qa_model_roberta-base',
 'metrics_bert-base-uncased.json',
 'metrics_roberta-base.json',
 'metrics_distilbert-base-uncased.json']

**Define the expected model dirs**

In [37]:
MODEL_DIRS = {
    "distilbert-base-uncased": os.path.join(BASE_DIR, "qa_model_distilbert-base-uncased"),
    "bert-base-uncased":       os.path.join(BASE_DIR, "qa_model_bert-base-uncased"),
    "roberta-base":            os.path.join(BASE_DIR, "qa_model_roberta-base"),
}
MODEL_DIRS

{'distilbert-base-uncased': '/content/drive/MyDrive/qa_models/qa_model_distilbert-base-uncased',
 'bert-base-uncased': '/content/drive/MyDrive/qa_models/qa_model_bert-base-uncased',
 'roberta-base': '/content/drive/MyDrive/qa_models/qa_model_roberta-base'}

**Sanity check files inside each folder**

In [38]:
for name, folder in MODEL_DIRS.items():
    print(f"\n{name} → {folder}")
    if os.path.isdir(folder):
        print("  Files:", sorted(os.listdir(folder))[:8], "...")
    else:
        print("  MISSING folder!")


distilbert-base-uncased → /content/drive/MyDrive/qa_models/qa_model_distilbert-base-uncased
  Files: ['config.json', 'model.safetensors', 'model_card.txt', 'special_tokens_map.json', 'tokenizer.json', 'tokenizer_config.json', 'training_args.bin', 'vocab.txt'] ...

bert-base-uncased → /content/drive/MyDrive/qa_models/qa_model_bert-base-uncased
  Files: ['config.json', 'model.safetensors', 'model_card.txt', 'special_tokens_map.json', 'tokenizer.json', 'tokenizer_config.json', 'training_args.bin', 'vocab.txt'] ...

roberta-base → /content/drive/MyDrive/qa_models/qa_model_roberta-base
  Files: ['config.json', 'merges.txt', 'model.safetensors', 'model_card.txt', 'special_tokens_map.json', 'tokenizer.json', 'tokenizer_config.json', 'training_args.bin'] ...


**Find Saved Metric Files**

In [48]:
import glob, json, os, pathlib

def find_metrics_for(name: str, folder: str):
    candidates = [
        f"metrics_{name}.json".replace("/", "_"),
        "metrics.json",
        f"metrics_{pathlib.Path(folder).name}.json"
    ]

    # Look in BASE_DIR
    for c in candidates:
        p = os.path.join(BASE_DIR, c)
        if os.path.exists(p):
            return p

    # Look in CWD
    for c in candidates:
        p = os.path.join(os.getcwd(), c)
        if os.path.exists(p):
            return p

    # Look inside the model folder
    for c in candidates:
        p = os.path.join(folder, c)
        if os.path.exists(p):
            return p

    # Look anywhere under BASE_DIR
    pattern = os.path.join(BASE_DIR, "**", "metrics*.json")
    for p in glob.glob(pattern, recursive=True):
        if name.replace("/", "_") in os.path.basename(p):
            return p

    return None

METRICS_FILES = {}
for name, folder in MODEL_DIRS.items():
    found = find_metrics_for(name, folder)
    if found:
        METRICS_FILES[name] = found

print("Found metrics files:")
for k, v in METRICS_FILES.items():
    print(f"  {k}: {v}")

Found metrics files:
  distilbert-base-uncased: /content/drive/MyDrive/qa_models/metrics_distilbert-base-uncased.json
  bert-base-uncased: /content/drive/MyDrive/qa_models/metrics_bert-base-uncased.json
  roberta-base: /content/drive/MyDrive/qa_models/metrics_roberta-base.json
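The lookup above only locates the files. The later cell reads each one with `data.get("metrics", {})` and then pulls `exact_match` and `f1`, so the files are presumably shaped roughly like the sketch below. The helper name `read_em_f1` and the sample numbers are made up for illustration, and the real files may carry additional keys.

```python
import json
import math
import os
import tempfile

def read_em_f1(path):
    """Return (exact_match, f1) from a metrics JSON, using NaN for missing keys."""
    with open(path) as f:
        data = json.load(f)
    met = data.get("metrics", {})
    em = met.get("exact_match")
    f1 = met.get("f1")
    return (
        float(em) if em is not None else math.nan,
        float(f1) if f1 is not None else math.nan,
    )

# Illustrative file with placeholder values (not real results).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "metrics_demo.json")
    with open(demo_path, "w") as f:
        json.dump({"metrics": {"exact_match": 80.0, "f1": 87.5}}, f)
    demo_em, demo_f1 = read_em_f1(demo_path)

print(demo_em, demo_f1)
```

In the notebook this would be called as `read_em_f1(METRICS_FILES["roberta-base"])`.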


# **2- Load Models & Build Pipelines**

**Load each model + tokenizer and record load time**

In [40]:
import time

pipelines = {}
load_rows = []

for name, path in MODEL_DIRS.items():
    t0 = time.perf_counter()
    tok = AutoTokenizer.from_pretrained(path, use_fast=True)
    mdl = AutoModelForQuestionAnswering.from_pretrained(path)
    qa  = pipeline("question-answering", model=mdl, tokenizer=tok, device=device)
    t1 = time.perf_counter()

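    # Sum regular files only; directory entries also report a size and would inflate the total
    size_mb = sum(p.stat().st_size for p in pathlib.Path(path).rglob("*") if p.is_file()) / (1024**2)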
    pipelines[name] = qa
    load_rows.append({
        "model": name,
        "load_time_s": round(t1 - t0, 3),
        "checkpoint_size_MB": round(size_mb, 1)
    })

pd.DataFrame(load_rows)

Device set to use cpu
Device set to use cpu
Device set to use cpu


|   | model | load_time_s | checkpoint_size_MB |
|---|---|---|---|
| 0 | distilbert-base-uncased | 0.326 | 254.1 |
| 1 | bert-base-uncased | 0.361 | 416.3 |
| 2 | roberta-base | 0.587 | 477.9 |
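The checkpoint size is a sum of file sizes under each model folder, divided by `1024**2`, so the column is strictly in MiB even though it is labelled MB. A reusable version of that computation might look like the sketch below; the name `dir_size_mb` is hypothetical and the temporary files exist only to show the behaviour.

```python
import pathlib
import tempfile

def dir_size_mb(folder):
    """Sum the sizes of regular files under `folder`, in MiB (bytes / 1024**2)."""
    root = pathlib.Path(folder)
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total_bytes / (1024 ** 2)

# Small demonstration tree: one file at the top level, one in a subfolder.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "sub").mkdir()
    (root / "a.bin").write_bytes(b"\0" * (1024 * 1024))        # 1 MiB
    (root / "sub" / "b.bin").write_bytes(b"\0" * (512 * 1024))  # 0.5 MiB
    demo_size = dir_size_mb(root)

print(demo_size)  # 1.5
```

With the notebook's variables, `dir_size_mb(MODEL_DIRS["roberta-base"])` would give the size for one checkpoint.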


# **3- Read Saved Metrics (EM / F1)**

**Assemble comparison table from metrics JSONs**

In [41]:
rows = []

for name, mfile in METRICS_FILES.items():
    if os.path.exists(mfile):
        with open(mfile) as f:
            data = json.load(f)
        met = data.get("metrics", {})
        exact_match = met.get("exact_match")
        f1 = met.get("f1")
        rows.append({
            "model": name,
            "exact_match": float(exact_match) if exact_match is not None else np.nan,
            "f1": float(f1) if f1 is not None else np.nan
        })
    else:
        rows.append({"model": name, "exact_match": np.nan, "f1": np.nan})

df_metrics = pd.DataFrame(rows).sort_values("f1", ascending=False).reset_index(drop=True)
df_metrics

|   | model | exact_match | f1 |
|---|---|---|---|
| 0 | roberta-base | 85.969726 | 92.085488 |
| 1 | bert-base-uncased | 81.012299 | 88.356818 |
| 2 | distilbert-base-uncased | 77.152318 | 85.365293 |
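The notebook only reads EM and F1 from disk and does not show how they were produced. Assuming they follow the usual SQuAD-style definitions (an assumption, suggested only by the `evaluate` install at the top), a per-example sketch looks like this. The function names are illustrative; the table values would then be these per-example scores averaged over the evaluation set and scaled to 0-100.

```python
import collections
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))

def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Nile River", "Nile River"))  # 1.0, the article is stripped
print(exact_match("Nile", "Nile River"))            # 0.0
print(round(token_f1("Nile", "Nile River"), 4))     # 0.6667
```

This also explains why F1 is always at least as high as EM in the table: a partially overlapping answer earns partial F1 credit but no EM credit.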


**Merge with load info**

In [42]:
df_load = pd.DataFrame(load_rows)
df_summary = df_metrics.merge(df_load, on="model", how="left")
df_summary

|   | model | exact_match | f1 | load_time_s | checkpoint_size_MB |
|---|---|---|---|---|---|
| 0 | roberta-base | 85.969726 | 92.085488 | 0.587 | 477.9 |
| 1 | bert-base-uncased | 81.012299 | 88.356818 | 0.361 | 416.3 |
| 2 | distilbert-base-uncased | 77.152318 | 85.365293 | 0.326 | 254.1 |


# **4- Quick Side‑by‑Side Inference**

**Provide a context + several questions**

In [43]:
context = """
The Nile River is the longest river in the world, flowing northward through eastern Africa
into the Mediterranean Sea. It has historically been of great importance to Egyptian civilization.
The Amazon River, however, has the largest discharge of water in the world.
"""

questions = [
    "Which river is the longest in the world?",
    "Which river has the largest discharge of water?",
    "Into which sea does the Nile flow?"
]

**Run all models and time them**

In [47]:
records = []

for q in questions:
    for name, qa in pipelines.items():
        t0 = time.perf_counter()
        out = qa(question=q, context=context)
        t1 = time.perf_counter()
        records.append({
            "question": q,
            "model": name,
            "answer": out.get("answer", ""),
            "score": round(float(out.get("score", 0.0)), 4),
            "latency_s": round(t1 - t0, 3)
        })

pd.DataFrame(records)

|   | question | model | answer | score | latency_s |
|---|---|---|---|---|---|
| 0 | Which river is the longest in the world? | distilbert-base-uncased | The Nile River | 0.7052 | 0.15 |
| 1 | Which river is the longest in the world? | bert-base-uncased | Nile River | 0.4226 | 0.299 |
| 2 | Which river is the longest in the world? | roberta-base | The Nile River | 0.4981 | 0.3 |
| 3 | Which river has the largest discharge of water? | distilbert-base-uncased | The Amazon River | 0.6022 | 0.137 |
| 4 | Which river has the largest discharge of water? | bert-base-uncased | Amazon River | 0.4641 | 0.253 |
| 5 | Which river has the largest discharge of water? | roberta-base | Amazon River | 0.5741 | 0.285 |
| 6 | Into which sea does the Nile flow? | distilbert-base-uncased | Mediterranean Sea | 0.7904 | 0.133 |
| 7 | Into which sea does the Nile flow? | bert-base-uncased | Mediterranean Sea | 0.7313 | 0.267 |
| 8 | Into which sea does the Nile flow? | roberta-base | Mediterranean Sea | 0.7197 | 0.282 |
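Each latency above comes from a single call, so it can be skewed by one-off costs on the first invocation or by background noise. A more stable measurement, sketched below, runs a warm-up call and reports the median over several repeats. `time_call` and `fake_qa` are hypothetical names; `fake_qa` merely stands in for a real pipeline so the snippet runs without any model.

```python
import statistics
import time

def time_call(fn, *args, warmup=1, repeats=5, **kwargs):
    """Call `fn` a few times and return (median_seconds, all_samples)."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # untimed warm-up call
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples), samples

def fake_qa(question, context):
    """Stand-in for a QA pipeline: sleeps briefly and returns a fixed answer."""
    time.sleep(0.01)
    return {"answer": "stub", "score": 1.0}

median_s, samples = time_call(fake_qa, question="q", context="c", repeats=3)
print(f"median latency: {median_s:.3f} s over {len(samples)} runs")
```

In the notebook, the same helper could wrap a real pipeline, for example `time_call(pipelines["bert-base-uncased"], question=q, context=context)`.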


**Per‑question best answer by score**

In [45]:
df = pd.DataFrame(records)
winners = df.sort_values(["question", "score"], ascending=[True, False]).groupby("question").head(1).reset_index(drop=True)
winners

|   | question | model | answer | score | latency_s |
|---|---|---|---|---|---|
| 0 | Into which sea does the Nile flow? | distilbert-base-uncased | Mediterranean Sea | 0.7904 | 0.142 |
| 1 | Which river has the largest discharge of water? | distilbert-base-uncased | The Amazon River | 0.6022 | 0.135 |
| 2 | Which river is the longest in the world? | distilbert-base-uncased | The Nile River | 0.7052 | 1.374 |


# **5- Final Comparison Table**

**Rank models by F1 (fallback to score mean if F1 missing)**

In [46]:
avg_scores = df.groupby("model")["score"].mean().rename("avg_pipeline_score").reset_index()
final = df_summary.merge(avg_scores, on="model", how="left")

if final["f1"].notna().any():
    final = final.sort_values(["f1", "exact_match", "avg_pipeline_score"], ascending=False)
else:
    final = final.sort_values("avg_pipeline_score", ascending=False)

final.reset_index(drop=True)

|   | model | exact_match | f1 | load_time_s | checkpoint_size_MB | avg_pipeline_score |
|---|---|---|---|---|---|---|
| 0 | roberta-base | 85.969726 | 92.085488 | 0.587 | 477.9 | 0.5973 |
| 1 | bert-base-uncased | 81.012299 | 88.356818 | 0.361 | 416.3 | 0.539333 |
| 2 | distilbert-base-uncased | 77.152318 | 85.365293 | 0.326 | 254.1 | 0.699267 |
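One way to read the final table is as an accuracy-versus-size trade-off relative to the smallest checkpoint. The sketch below copies the F1 and size values from the table and computes, for each model, the F1 gain and the size ratio against that baseline. The variable names are illustrative and the choice of baseline is an assumption made here, not something the notebook prescribes.

```python
# F1 and checkpoint size copied from the final comparison table.
table_rows = [
    {"model": "roberta-base", "f1": 92.085488, "size_mb": 477.9},
    {"model": "bert-base-uncased", "f1": 88.356818, "size_mb": 416.3},
    {"model": "distilbert-base-uncased", "f1": 85.365293, "size_mb": 254.1},
]

# Use the smallest checkpoint as the reference point.
baseline = min(table_rows, key=lambda r: r["size_mb"])

tradeoffs = {}
for r in table_rows:
    tradeoffs[r["model"]] = {
        "f1_gain": round(r["f1"] - baseline["f1"], 2),
        "size_ratio": round(r["size_mb"] / baseline["size_mb"], 2),
    }

for model, t in tradeoffs.items():
    print(f"{model}: +{t['f1_gain']} F1 at {t['size_ratio']}x the size")
```

On these numbers, RoBERTa gains about 6.72 F1 points over DistilBERT at roughly 1.88 times the checkpoint size, while BERT gains about 2.99 points at roughly 1.64 times the size.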
