# 🤖 DistilBERT Fine-tuning for Question Answering on SQuAD Dataset

This notebook demonstrates how to fine-tune a DistilBERT model for question answering using the Stanford Question Answering Dataset (SQuAD) v2.0. DistilBERT is a smaller, faster version of BERT that retains 97% of BERT's performance while being 40% smaller and 60% faster.

## 📋 Overview
- **Model**: DistilBERT (uncased) - A distilled version of BERT
- **Dataset**: SQuAD v2.0 - Stanford Question Answering Dataset
- **Task**: Extractive Question Answering
- **Framework**: Hugging Face Transformers

## 🎯 What This Notebook Does
1. Loads a pre-trained DistilBERT model and tokenizer
2. Preprocesses the SQuAD v2.0 dataset for question answering
3. Fine-tunes the model on the training data
4. Saves the fine-tuned model for future use

In [None]:
# 📦 Install required libraries
!pip install -q datasets transformers accelerate

print("✅ Libraries installed successfully!")

## 🔧 Model and Tokenizer Setup

We'll use DistilBERT, which is:
- **Faster**: 60% faster inference than BERT
- **Smaller**: 40% fewer parameters
- **Efficient**: Maintains 97% of BERT's performance
- **Versatile**: Pre-trained on a large corpus of English data

In [None]:
# ---------------------------------------------
# 📥 Load Pretrained Tokenizer & Model
# ---------------------------------------------
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering

print("🔄 Loading DistilBERT tokenizer and model...")

# Load tokenizer: converts raw text into BERT-compatible tokens/IDs
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Load DistilBERT model for QA: outputs start and end positions in context
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

print("✅ Model and tokenizer loaded successfully!")
print(f"📊 Model parameters: {model.num_parameters():,}")

## 📚 Dataset Loading

SQuAD (Stanford Question Answering Dataset) v2.0 contains:
- **Training set**: ~130,000 examples
- **Validation set**: ~12,000 examples
- **Features**: Questions, contexts, and answers (some questions have no answers)
- **Task**: Given a question and context, find the answer span in the context
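
To make the "some questions have no answers" point concrete, here is a minimal sketch of two records in the `squad_v2` layout. The question and context text are made up for illustration, and `has_answer` and `gold_span` are hypothetical helpers, not part of the `datasets` library.

```python
# Hypothetical records in the squad_v2 layout; the text is invented for illustration.
context = "France is a country in Europe. Its capital city is Paris."

answerable = {
    "question": "What is the capital of France?",
    "context": context,
    # answer_start is a character offset into the context
    "answers": {"text": ["Paris"], "answer_start": [context.index("Paris")]},
}

unanswerable = {
    "question": "What is the population of France?",
    "context": context,
    # Empty lists mark a question that the context cannot answer
    "answers": {"text": [], "answer_start": []},
}


def has_answer(record):
    """Return True when the record carries at least one gold answer."""
    return len(record["answers"]["text"]) > 0


def gold_span(record):
    """Return (start_char, end_char) of the first gold answer, or None."""
    if not has_answer(record):
        return None
    start = record["answers"]["answer_start"][0]
    return start, start + len(record["answers"]["text"][0])


for record in (answerable, unanswerable):
    print(record["question"], "->", gold_span(record))
```

The character offsets returned by `gold_span` are what the preprocessing step later converts into token positions.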

In [None]:
# ---------------------------------------------
# 📚 Load Dataset
# ---------------------------------------------
from datasets import load_dataset

print("🔄 Loading SQuAD v2.0 dataset...")

# Load the SQuAD v2 dataset which includes questions with and without answers
dataset = load_dataset("squad_v2")

print("✅ Dataset loaded successfully!")
print(f"📊 Training examples: {len(dataset['train']):,}")
print(f"📊 Validation examples: {len(dataset['validation']):,}")

# Display a sample
print("\n📖 Sample from dataset:")
sample = dataset["train"][0]
print(f"Question: {sample['question']}")
print(f"Context: {sample['context'][:200]}...")
print(f"Answer: {sample['answers']['text'][0] if sample['answers']['text'] else 'No answer'}")

## ⚡ Data Preprocessing

The preprocessing function handles several important tasks:

1. **Tokenization**: Converts text to tokens that DistilBERT can understand
2. **Truncation**: Handles long contexts by splitting them into overlapping chunks (the overlap is set by `stride`)
3. **Answer Mapping**: Maps character-level answer positions to token positions
4. **Padding**: Ensures all inputs have the same length for efficient batch processing

This is crucial for training the model effectively on the SQuAD dataset.
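
The answer-mapping step is the least obvious of the four, so here is a minimal sketch of it on hand-written data. The offsets and sequence ids below are toy values standing in for what a fast tokenizer would return, and `char_span_to_token_span` is a hypothetical helper written for this illustration.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char, cls_index=0):
    """Map a character span in the context to (start_token, end_token) in one chunk.

    offsets:      (char_start, char_end) per token
    sequence_ids: 0 for question tokens, 1 for context tokens, None otherwise
    Returns (cls_index, cls_index) when the answer is not fully inside the chunk.
    """
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]
    first, last = context_tokens[0], context_tokens[-1]

    # The chunk must cover the whole answer, otherwise label it "no answer"
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return cls_index, cls_index

    # Walk forward past every token that starts at or before start_char
    start_tok = first
    while start_tok <= last and offsets[start_tok][0] <= start_char:
        start_tok += 1

    # Walk backward past every token that ends at or after end_char
    end_tok = last
    while end_tok >= first and offsets[end_tok][1] >= end_char:
        end_tok -= 1

    return start_tok - 1, end_tok + 1


# Toy chunk: [CLS] capital ? [SEP] its capital is paris . [SEP]
context = "its capital is paris."
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, None]
offsets = [(0, 0), (0, 7), (7, 8), (0, 0),
           (0, 3), (4, 11), (12, 14), (15, 20), (20, 21), (0, 0)]

start_char = context.index("paris")
end_char = start_char + len("paris")
print(char_span_to_token_span(offsets, sequence_ids, start_char, end_char))
```

Pointing both labels at `[CLS]` when the answer falls outside the chunk matches how unanswerable questions are labeled, so the model learns a single "no answer here" target.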

In [None]:
# ---------------------------------------------
# ⚡ Preprocessing Function
# ---------------------------------------------
def preprocess(example):
    """
    Preprocesses examples for question answering task.
    
    Args:
        example: A batch of examples from the dataset
        
    Returns:
        dict: Tokenized inputs with start/end positions for answers
    """
    # Tokenize question + context
    inputs = tokenizer(
        example["question"],
        example["context"],
        max_length=256,                      # Limit input length (faster training)
        stride=128,                          # Overlap for long contexts
        truncation="only_second",            # Only truncate context, not question
        return_offsets_mapping=True,         # Map tokens back to character positions
        return_overflowing_tokens=True,      # Handle long contexts split into multiple chunks
        padding="max_length"                 # Pad to fixed size
    )

    # Helps us track which chunk came from which example
    sample_mapping = inputs.pop("overflow_to_sample_mapping")

    # Offsets map tokens to positions in original text
    offsets = inputs.pop("offset_mapping")

    starts, ends = [], []  # Stores token-level answer positions

    # Loop through each chunk/tokenized input
    for i, offset in enumerate(offsets):
        ids = inputs["input_ids"][i]                # Token IDs
        cls = ids.index(tokenizer.cls_token_id)     # Index of [CLS] token
        seq_ids = inputs.sequence_ids(i)            # 0 = question, 1 = context, None = special tokens and padding
        sample_idx = sample_mapping[i]              # Index of the source example within this batch
        ans = example["answers"][sample_idx]        # Answers from the current batch (correct for train and validation)

        # Case: No answer (empty list)
        if len(ans["answer_start"]) == 0:
            starts.append(cls)
            ends.append(cls)
        else:
            # Get answer's start and end characters
            start_char = ans["answer_start"][0]
            end_char = start_char + len(ans["text"][0])

            # Find token index for first context token
            start_tok = 0
            while seq_ids[start_tok] != 1:
                start_tok += 1

            # Find token index for last context token
            end_tok = len(ids) - 1
            while seq_ids[end_tok] != 1:
                end_tok -= 1

            # Check if answer lies within the current chunk
            if not (offset[start_tok][0] <= start_char and offset[end_tok][1] >= end_char):
                starts.append(cls)
                ends.append(cls)
            else:
                # Move forward to exact token that includes start_char
                while start_tok < len(offset) and offset[start_tok][0] <= start_char:
                    start_tok += 1
                starts.append(start_tok - 1)

                # Move backward to exact token that includes end_char
                while end_tok >= 0 and offset[end_tok][1] >= end_char:
                    end_tok -= 1
                ends.append(end_tok + 1)

    # Add answer labels
    inputs["start_positions"] = starts
    inputs["end_positions"] = ends
    return inputs

print("✅ Preprocessing function defined!")

## 🔄 Apply Preprocessing

Now we'll apply the preprocessing function to the entire dataset. This step:
- Tokenizes all questions and contexts
- Sets the output format to PyTorch tensors
- Prepares the data for efficient training
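
Because long contexts are split into overlapping chunks, the tokenized dataset usually has more rows than the original one. The sketch below approximates how many chunks one example produces. It assumes three special tokens per input (`[CLS]` plus two `[SEP]`), and `context_windows` is a hypothetical helper, not a tokenizer API.

```python
def context_windows(num_context_tokens, num_question_tokens,
                    max_length=256, stride=128, num_special_tokens=3):
    """Return (start, end) context-token ranges, one per generated feature."""
    # Room left for context tokens once the question and special tokens are placed
    capacity = max_length - num_question_tokens - num_special_tokens
    # Consecutive windows share `stride` tokens, so each one advances by this much
    step = capacity - stride
    if capacity <= 0 or step <= 0:
        raise ValueError("question too long for this max_length/stride combination")

    windows = []
    start = 0
    while True:
        end = min(start + capacity, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        start += step
    return windows


# Assumed token counts, chosen only for illustration
print(context_windows(num_context_tokens=100, num_question_tokens=10))
print(context_windows(num_context_tokens=600, num_question_tokens=10))
```

With the notebook's settings (`max_length=256`, `stride=128`), a 600-token context and a 10-token question yield five overlapping features under these assumptions, while a 100-token context yields one.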

In [None]:
# ---------------------------------------------
# ⚙️ Apply Preprocessing to Entire Dataset
# ---------------------------------------------
print("🔄 Applying preprocessing to dataset...")

tokenized = dataset.map(
    preprocess,
    batched=True,
    remove_columns=dataset["train"].column_names  # Keep only tokenized data, drop original text
)
tokenized.set_format("torch")  # Convert to PyTorch tensors

print("✅ Preprocessing completed!")
print(f"📊 Tokenized training examples: {len(tokenized['train']):,}")
print(f"📊 Tokenized validation examples: {len(tokenized['validation']):,}")

## ⚖️ Training Setup

We'll configure the training with commonly used settings for fine-tuning:
- **Learning Rate**: 3e-5 (standard for BERT-like models)
- **Batch Size**: 12 (adjust based on your GPU memory)
- **Mixed Precision**: FP16 for faster training (requires a CUDA GPU)
- **Epochs**: 1 for demo (increase for better results)
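
To get a feel for what these settings imply, the following rough sketch estimates the number of optimizer steps in a run. It assumes a single device and no gradient accumulation, and it uses the approximate 130,000 training examples quoted earlier; the real count after chunking will be higher. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_features, per_device_batch_size, num_epochs=1, num_devices=1):
    """Estimate optimizer steps, assuming the last partial batch is kept."""
    batches_per_epoch = math.ceil(num_features / (per_device_batch_size * num_devices))
    return batches_per_epoch * num_epochs


steps = optimizer_steps(num_features=130_000, per_device_batch_size=12)
print(f"Estimated optimizer steps: {steps:,}")
print(f"Log lines at logging_steps=100: {steps // 100}")
```

Halving the batch size to fit a smaller GPU roughly doubles the step count, which is worth keeping in mind when reading the training logs.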

In [None]:
# ---------------------------------------------
# ⚖️ Data Collator
# ---------------------------------------------
from transformers import DataCollatorWithPadding

# Inputs are already padded to max_length in preprocess(), so this collator mainly stacks features into batches
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# ---------------------------------------------
# 🛠 Define Training Settings
# ---------------------------------------------
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./fast-distilbert-qa",       # Directory to save model checkpoints
    eval_strategy="epoch",                   # Evaluate at the end of each epoch (evaluation_strategy in older releases)
    learning_rate=3e-5,                      # Fine-tuning learning rate
    per_device_train_batch_size=12,          # Train batch size per GPU
    per_device_eval_batch_size=12,           # Eval batch size per GPU
    num_train_epochs=1,                      # Just 1 epoch for demo; increase for better results
    weight_decay=0.01,                       # Regularization to avoid overfitting
    fp16=True,                               # Use mixed precision (requires a CUDA GPU; set to False on CPU)
    logging_steps=100,                       # Log every 100 steps
    report_to="none"                         # Disable external logging (e.g., WandB)
)

print("✅ Training arguments configured!")

## 🚀 Training Time!

The actual fine-tuning process begins here. The Trainer API will:
1. Handle the training loop automatically
2. Apply gradients and optimize the model
3. Evaluate performance on the validation set
4. Save checkpoints during training
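
For extractive QA, the loss being minimized is the average of two cross-entropy terms, one for the start position and one for the end position. The sketch below reproduces that computation in plain Python on toy logits; `cross_entropy` and `qa_loss` are illustrative helpers, not the library's implementation.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of target_index under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def qa_loss(start_logits, end_logits, start_position, end_position):
    """Average the start-position and end-position cross-entropy losses."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2


# Toy logits over a 4-token sequence; the gold answer spans tokens 2..3
confident = qa_loss([0.0, 0.0, 5.0, 0.0], [0.0, 0.0, 0.0, 5.0], 2, 3)
mistaken = qa_loss([5.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], 2, 3)
print(f"confident and correct: {confident:.3f}")
print(f"confident but wrong:   {mistaken:.3f}")
```

Unanswerable chunks use the same loss with both targets set to the `[CLS]` index.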

**Note**: Training time depends on your hardware. On a single GPU, expect roughly 30-60 minutes for 1 epoch; CPU-only training will be much slower.

In [None]:
# ---------------------------------------------
# 🔁 Wrap Everything in Trainer API
# ---------------------------------------------
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,                     # Newer transformers releases prefer processing_class=tokenizer
    data_collator=data_collator
)

print("🚀 Starting training...")
print("⏱️ This may take 30-60 minutes depending on your hardware...")

# ---------------------------------------------
# 🚀 Start Training
# ---------------------------------------------
trainer.train()

print("🎉 Training completed!")

## 💾 Save the Model

Finally, we'll save our fine-tuned model and tokenizer so you can use them later for inference or further training.

In [None]:
# ---------------------------------------------
# 💾 Save Final Fine-Tuned Model
# ---------------------------------------------
print("💾 Saving fine-tuned model and tokenizer...")

trainer.save_model("fast-distilbert-squad")            # Save model weights
tokenizer.save_pretrained("fast-distilbert-squad")     # Save tokenizer

print("✅ Model and tokenizer saved to 'fast-distilbert-squad' directory!")
print("🎯 You can now use this model for question answering tasks!")
print("✅ Done!")  # Training complete

## 🎯 Next Steps

Your fine-tuned DistilBERT model is ready! Here's what you can do next:

1. **Test the Model**: Load the saved model and test it on new questions
2. **Deploy**: Use the model in production applications
3. **Further Training**: Train for more epochs to improve performance
4. **Evaluation**: Run comprehensive evaluation metrics
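
For the evaluation step, SQuAD results are usually reported as exact match (EM) and token-level F1 after light text normalization. The following is a simplified sketch of those two metrics for a single prediction and a single reference. It does not handle multiple references or the no-answer threshold, and the function names are chosen for this example.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-overlap F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    # If either side is empty (no answer), score 1.0 only when both are empty
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Paris.", "paris"))
print(f1_score("in Paris France", "Paris"))
```

F1 gives partial credit when the predicted span overlaps the reference, which makes it more forgiving than EM for slightly too long or too short answers.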

### 📖 How to Use Your Model

```python
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering

# Load your fine-tuned model
tokenizer = DistilBertTokenizerFast.from_pretrained("fast-distilbert-squad")
model = DistilBertForQuestionAnswering.from_pretrained("fast-distilbert-squad")

# Use for inference
question = "What is the capital of France?"
context = "France is a country in Europe. Its capital city is Paris."
inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
# Extract answer from outputs...
```
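
One common way to turn the model outputs into an answer is to pick the start/end pair with the highest combined logit, subject to the end not preceding the start. The sketch below shows that search on toy logits so it runs without a trained model. `best_span` and the `max_answer_tokens` limit are assumptions for this example. With the real model you would pass `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, then decode the chosen slice of `inputs["input_ids"][0]` with `tokenizer.decode`.

```python
def best_span(start_logits, end_logits, max_answer_tokens=30):
    """Return ((start, end), score) maximizing start_logit + end_logit.

    The search begins from (0, 0), the [CLS] position, which SQuAD v2 models
    use to signal "no answer".
    """
    best = (0, 0)
    best_score = start_logits[0] + end_logits[0]
    for start, start_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last_end = min(start + max_answer_tokens, len(end_logits))
        for end in range(start, last_end):
            score = start_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score


# Toy tokens and logits, invented for illustration
tokens = ["[CLS]", "what", "capital", "[SEP]", "its", "capital", "is", "paris", ".", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 0.2, 0.3, 0.1, 4.0, 0.0, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 0.2, 0.3, 3.5, 0.5, 0.0]

(start, end), score = best_span(start_logits, end_logits)
print(tokens[start:end + 1], score)
```

This sketch searches every position, including question tokens; a fuller version would restrict candidates to context tokens and compare the best span against the `[CLS]` score to decide whether to return no answer.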

🎉 **Congratulations!** You've successfully fine-tuned a DistilBERT model for question answering!