# **Question answering model**

**Prepared by:** Jacqueline Fernández Ramírez, Jazmín Fernández Ramírez and Nina M. Odoux.

## **Project goals**

The goal of this project is to build a Q&A system using a transformer model. We aim to fine-tune a pre-trained model on the SQuAD dataset to predict answers to natural language questions based on a given context passage.

We tokenize both the question and the context, then align the character-level answer positions with the corresponding token indices, so that the answer span can be located precisely within the tokenized context.

We use the `offset_mapping` feature provided by `BertTokenizerFast` (and by `DistilBertTokenizerFast` as well) to perform this alignment. It allows the model to learn where the answer starts and ends in the tokenized input.
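The alignment idea can be sketched without any library. The snippet below is only a toy illustration: `tokenize_with_offsets` is a hypothetical regex-based stand-in for a real tokenizer, and it produces `(start, end)` character offsets in the same spirit as `offset_mapping`. The sentence and answer are invented for the example.

```python
import re

def tokenize_with_offsets(text):
    """Toy tokenizer (hypothetical): words and punctuation with (start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_span_to_token_span(offsets, answer_start, answer_end):
    """Map the character span [answer_start, answer_end) to inclusive token indices."""
    start_token = end_token = None
    for i, (start, end) in enumerate(offsets):
        if start <= answer_start < end:   # token containing the first answer character
            start_token = i
        if start < answer_end <= end:     # token containing the last answer character
            end_token = i
            break
    return start_token, end_token

context = "The Eiffel Tower is located in Paris, France."
answer = "Eiffel Tower"
answer_start = context.index(answer)
answer_end = answer_start + len(answer)   # exclusive end index

tokens = tokenize_with_offsets(context)
offsets = [(s, e) for _, s, e in tokens]
start_token, end_token = char_span_to_token_span(offsets, answer_start, answer_end)
span_tokens = [tok for tok, _, _ in tokens[start_token:end_token + 1]]
print(start_token, end_token, span_tokens)
```

With a real fast tokenizer the offsets come from the library instead of a regex, but the comparison between character positions and token offsets follows the same pattern.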

---

In this notebook, we focus on using a pre-trained BERT model to find answers in texts. The model is already trained to understand natural language and interpret context effectively, but we want to make it even better suited to our data. To do this, we plan to fine-tune the model, a process where we adjust the pre-trained parameters using our specific dataset. This lets the model learn patterns and nuances unique to the task at hand.

Later, because local runs gave poor results, we switch to Google Colab for training. It provides the resources needed to work with the full dataset, so the model has access to as much information as possible.

When given a text and a question, the model identifies the part of the text where the answer is likely to be (a range) and extracts the relevant words. This is achieved by finding the START and END tokens of the answer. It is similar to highlighting a phrase or key words in a text to point out important details.

By fine-tuning the pre-trained model, we aim to build an answer-extraction system that is both accurate and adapted to our project.

**Data for the project:**
* We use SQuAD (Stanford Question Answering Dataset). Its training split contains 87,599 examples, each made of a question, a text passage, and an answer; the validation split contains 10,570 more.
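To make the data format concrete, here is a small sketch of what one example looks like. The record is invented for illustration (it is not taken from SQuAD), but it follows the field layout used later in this notebook: `question`, `context`, and an `answers` dict with parallel `text` and `answer_start` lists.

```python
# Toy record in the SQuAD field layout; the content is invented for illustration.
context = "The Eiffel Tower is located in Paris, France."
example = {
    "question": "Where is the Eiffel Tower located?",
    "context": context,
    "answers": {"text": ["Paris"], "answer_start": [context.index("Paris")]},
}

answer_text = example["answers"]["text"][0]
answer_start = example["answers"]["answer_start"][0]  # character index into the context
answer_end = answer_start + len(answer_text)          # exclusive character index

# The character offsets should slice the answer back out of the context.
recovered = example["context"][answer_start:answer_end]
print(recovered)
```

Note that the answer position is given in characters, not tokens, which is why the alignment step described above is needed.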


---

## How does this notebook work? 

Main steps:

* Loading and preprocessing the SQuAD dataset.

* Creating a custom dataset class for token alignment.

* Fine-tuning a DistilBERT model for question answering.

* Evaluating the model’s predictions.

* Testing the model on question/context pairs.

## Set up and classes

- We performed several steps to fine-tune a BERT model on the SQuAD dataset for question answering tasks. We began by importing the necessary libraries, including **transformers** for the BERT model and tokenizer, and **datasets** for loading the SQuAD dataset. We also used the torch library for creating data loaders and handling tensor operations, which are essential for training the model.

- We initialized the **BertTokenizerFast** from the pre-trained 'bert-base-uncased' model. This tokenizer is optimized for speed and supports additional features like offset mapping.

- We then prepared the SQuAD dataset by loading it using the datasets library. We defined a custom **SQuADDataset class** to handle the tokenization of questions and contexts. This class inherits from torch.utils.data.Dataset and overrides the **__len__ and __getitem__** methods to provide the length of the dataset and to fetch individual data points, respectively.

- In the **__getitem__** method, we processed each example from the dataset to extract the question, context, and answer text.

We calculated the answer's start and end positions within the context. The tokenizer then converted the question and context into input IDs, attention masks, and token type IDs, which are required for the BERT model. We used the **return_offsets_mapping** feature to map the token positions back to the original text, enabling the identification of answer spans or ranges.

- We created a **DataLoader** for each of the *training and validation* datasets. It batches the data and *shuffles* the training set so that the order of examples changes between epochs. We set the batch size to 16, a value that can be tuned to balance memory usage against training speed and results.

- We initialized the **BertForQuestionAnswering** model from the pre-trained **'bert-base-uncased'** model. This model is specifically designed for question answering tasks and includes a span prediction head that outputs the start and end logits for the answer span.

- We set up the optimizer using Stochastic Gradient Descent (SGD) with a learning rate of 0.01 and a momentum of 0.9. The optimizer updates the model parameters during training to minimize the loss. We also considered Adam, but at the moment it does not load properly in our environment, apparently because of dependency or version issues.
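As a worked illustration of what SGD with momentum does at each step, the sketch below applies the update rule by hand to a one-parameter toy problem. It assumes the formulation used by PyTorch's `SGD` with default dampening (velocity = momentum × velocity + gradient, then parameter -= lr × velocity); the helper name `sgd_momentum_step` and the toy objective are hypothetical.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update on plain Python lists (illustrative only)."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Toy objective f(w) = w^2, whose gradient is 2w.
w, v = [1.0], [0.0]
history = []
for step in range(3):
    grads = [2.0 * w[0]]
    w, v = sgd_momentum_step(w, grads, v)
    history.append(w[0])
    print(step, round(w[0], 6))
```

The velocity accumulates past gradients, so successive steps in the same direction grow larger than plain SGD steps with the same learning rate.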


## Training loop and eval

- We implemented the training loop, which iterates over the dataset for a specified number of epochs. For each batch, we moved the inputs to the appropriate device (CPU or GPU) and computed the model's forward pass to obtain the start and end logits.

-> We calculated the loss using the model's built-in loss function, which compares the predicted start and end positions to the actual positions.

-> We then computed the gradients using backpropagation, and the optimizer updated the model parameters.

- We defined an **evaluation function** to calculate the model's performance on the validation set. It computes the exact match (EM) and F1 score, the two standard metrics reported for SQuAD. The exact match measures the percentage of predictions that match the ground truth exactly, while the F1 score measures the token overlap between the predicted and actual answers.

- We evaluated the model **on the validation set** using the evaluation function and printed the results.

- Finally, we saved the fine-tuned model and tokenizer to disk using the **save_pretrained** method. This allows the model to be loaded and used for inference or further fine-tuning in the future.
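The two metrics mentioned above can be sketched in a few lines. This is a simplified version written for illustration, assuming SQuAD-style normalization (lowercasing, stripping punctuation and the articles "a", "an", "the"); it is not the notebook's evaluation function, and the `evaluate` library's `squad` metric should be preferred for reported numbers.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower.", "Eiffel Tower")
f1 = f1_score("the Eiffel Tower in Paris", "Eiffel Tower")
print(em, round(f1, 4))
```

In the second call the prediction contains the whole answer plus two extra tokens, so recall is 1.0 but precision is 0.5, which is why F1 rewards partial overlap where exact match gives no credit.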

#### **Install Required Libraries**

In [1]:
!pip install transformers datasets evaluate torch tqdm



#### **Packages**

In [2]:
# Standard library
import collections
import datetime
import os
import random
import time
import warnings

# Third-party
import evaluate
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
from datasets import load_dataset
from torch.optim import SGD, Adam, AdamW
from torch.utils.data import DataLoader, Dataset
from tqdm.notebook import tqdm  # notebook-friendly progress bars
from transformers import (
    AutoTokenizer,
    BertForQuestionAnswering,
    BertTokenizer,
    BertTokenizerFast,
    DistilBertForQuestionAnswering,
    DistilBertTokenizerFast,
    default_data_collator,
)

# Silence library warnings to keep the notebook output readable
warnings.simplefilter("ignore")

# **Phase 1. Initial tests**

In [3]:
# Loading Squad 1 dataset
# ========================================================

squad_dataset = load_dataset("squad")
print(f"Dataset loaded with {len(squad_dataset['train'])} training examples")


Dataset loaded with 87599 training examples


A custom class called SQuADDataset is created to prepare the data for the model. This class transforms each question/context/answer example into a format that the model can understand.

In [4]:
# Creating a Python Class for Tokenizing Inputs of Squad
# ========================================================
# In this class, we prepare the SQuAD dataset for input into a BERT model.
# The goal is to transform raw text data (questions, contexts, and answers)
# into a structured format that the model can understand and use for training
# or evaluation.

class SQuADDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=384):
        # =============================
        # Constructor for initializing the dataset
        # We chose to pass the dataset, tokenizer, and a maximum length for input sequences
        # as parameters to allow flexibility. The max_length ensures inputs don't exceed
        # the model's constraints.
        # =============================
        self.dataset = dataset  # The dataset containing questions, contexts, and answers
        self.tokenizer = tokenizer  # Tokenizer for processing text into input IDs and other features
        self.max_length = max_length  # Maximum allowed length for tokenized sequences

    def __len__(self):
        # =============================
        # This method returns the total number of examples in the dataset.
        # It's necessary for PyTorch to know the size of the dataset when iterating over it.
        # =============================
        return len(self.dataset)

    def find_answer_token_positions(self, context, answer_start, answer_end, encoding):
        # =============================
        # Method to locate the start and end token positions for the answer.
        # BERT uses subword tokenization, meaning words can be split into smaller units (e.g., "running" -> ["run", "##ning"]).
        # This complicates mapping character-level positions (from the original text)
        # to token-level positions. We chose to handle this complexity here.
        # =============================
        offsets = encoding["offset_mapping"].squeeze(0).tolist()  # (start_char, end_char) for every token
        # sequence_ids() labels each token as question (0), context (1), or special/padding (None).
        # Question offsets are relative to the question string, so only context tokens may match.
        sequence_ids = encoding.sequence_ids(0)  # Requires a fast tokenizer

        start_token = end_token = None  # Initialize start and end token positions as None

        for i, (start, end) in enumerate(offsets):
            if sequence_ids[i] != 1:
                continue  # Skip question, special, and padding tokens
            # Identify the token that contains the start of the answer
            if start <= answer_start < end:
                start_token = i
            # Identify the token that contains the end of the answer
            if start < answer_end <= end:
                end_token = i
                break  # Exit loop early once end_token is found

        # If no match is found (e.g., the answer was truncated away), default to the first or last token
        if start_token is None:
            start_token = 0
        if end_token is None:
            end_token = len(offsets) - 1

        return start_token, end_token

    def __getitem__(self, idx):
        # =============================
        # This method retrieves an individual example from the dataset.
        # We chose to include all necessary preprocessing steps here,
        # so that each data point is fully prepared before being passed to the model.
        # =============================
        example = self.dataset[idx]  # Extract the example at the given index
        question = example["question"]  # The question being asked
        context = example["context"]  # The passage containing the answer
        answer_text = example["answers"]["text"][0]  # The correct answer text
        answer_start = example["answers"]["answer_start"][0]  # Start character position of the answer

        # Calculate the end position of the answer based on its length
        answer_end = answer_start + len(answer_text)

        # =============================
        # Tokenize the inputs (question and context) using the tokenizer.
        # We chose these specific parameters to ensure that:
        # - Long contexts are truncated properly (only the context, not the question).
        # - All inputs are padded to the same maximum length for batch processing.
        # - Tensors are returned in PyTorch format for compatibility with the model.
        # - Offset mappings are included for character-to-token position mapping.
        # =============================
        encoding = self.tokenizer(
            question,
            context,
            max_length=self.max_length,  # Ensure tokenized sequences don't exceed the maximum length
            truncation="only_second",  # Truncate only the context if it's too long
            padding="max_length",  # Pad inputs to max_length to ensure uniformity in batch processing
            return_tensors="pt",  # Return the tokenized inputs as PyTorch tensors
            return_token_type_ids=True,  # Include token type IDs to differentiate between question and context
            return_offsets_mapping=True  # Include mappings of character positions to token positions
        )

        # =============================
        # Remove the batch dimension created by the tokenizer.
        # Although the tokenizer expects multiple inputs (batches),
        # we process one example at a time here, so the batch dimension is unnecessary.
        # We chose to "squeeze" this dimension for simplicity.
        # =============================
        input_ids = encoding["input_ids"].squeeze(0)  # Token IDs of the input
        attention_mask = encoding["attention_mask"].squeeze(0)  # Mask to indicate valid tokens
        token_type_ids = encoding["token_type_ids"].squeeze(0)  # IDs to distinguish question and context

        # Locate the start and end token positions of the answer within the tokenized context
        start_pos, end_pos = self.find_answer_token_positions(context, answer_start, answer_end, encoding)

        # =============================
        # Return the processed data in a dictionary format.
        # This format is chosen to include everything the model needs for training or evaluation:
        # - Tokenized inputs (input_ids, attention_mask, token_type_ids)
        # - Token positions of the answer (start_positions, end_positions)
        # - The original answer text (useful for debugging or evaluation).
        # =============================
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids,
            "start_positions": torch.tensor(start_pos),  # Convert to tensor for compatibility with PyTorch
            "end_positions": torch.tensor(end_pos),  # Convert to tensor for compatibility with PyTorch
            "answer_text": answer_text  # Include the original answer text for reference
        }

When a model like BERT works with text, it first breaks the content into small pieces called tokens. These tokens are fragments of the original text—such as whole words, parts of words, or even symbols. This allows the model to interpret language and better understand context.

But here's the **challenge:** the tokens used by the model don't have positions that directly match those in the original text. So, in order for the model to locate an answer within the text, we need to translate the positions from the original text into the positions of these tokens.

This "mapping" task is crucial because it ensures the model can find accurate answers and align them with the original text. That way, when the system identifies where an answer starts and ends, it knows exactly which part to highlight.
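To illustrate why token positions drift away from word positions, the sketch below implements a toy greedy longest-match split in the style of WordPiece. The tiny vocabulary is made up for the example, and the real `bert-base-uncased` vocabulary may split (or not split) these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces (toy version)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary, chosen so that two of the three words get split.
vocab = {"run", "##ning", "dog", "##s", "bark"}
sentence = "running dogs bark"
words = sentence.split()
tokens = [piece for word in words for piece in wordpiece(word, vocab)]

word_index = words.index("bark")
token_index = tokens.index("bark")
print(tokens, word_index, token_index)
```

Here "bark" is the third word but the fifth token, so an answer position expressed in words or characters cannot be reused directly as a token position.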

In [5]:
def find_answer_token_positions(self, context, answer_start, answer_end, encoding):
    # ========================================================
    # This method maps the character positions of the answer (from the original text)
    # to the corresponding token positions in the tokenized input.
    # BERT's subword tokenization doesn't maintain a straightforward 1:1 mapping
    # between words and tokens. For example, "running" might become ["run", "##ning"].
    # We need this step to accurately locate the answer within the tokenized context.
    # ========================================================

    # Get the offset mappings, which contain pairs of start and end character positions
    # for each token in the tokenized context. This helps us locate the answer within the tokens.
    offset_mapping = self.tokenizer(
        context,
        return_offsets_mapping=True,  # Return character-to-token mappings
        add_special_tokens=False      # Exclude special tokens (e.g., [CLS], [SEP])
    )["offset_mapping"]

    # Initialize variables to store the token positions for the start and end of the answer
    start_token = None
    end_token = None

    # ========================================================
    # We chose to adjust token positions for [CLS] and [SEP] special tokens,
    # and account for the tokens generated by the question (since we are combining
    # the question and context into a single input).
    # ========================================================
    # The first context token comes right after [CLS], the question tokens, and [SEP].
    # sequence_ids() labels context tokens with 1, so the first 1 gives that position directly.
    offset = encoding.sequence_ids(0).index(1)  # Total adjustment for special tokens and question length

    # Iterate through offset mappings to locate the start and end token positions of the answer
    for i, (start_char, end_char) in enumerate(offset_mapping):
        if start_char <= answer_start < end_char:
            start_token = i + offset  # Adjust token position by offset
        if start_char < answer_end <= end_char:  # answer_end is exclusive, so it may equal end_char
            end_token = i + offset
            break  # Stop the loop early after finding the end token for efficiency

    # ========================================================
    # Handle cases where the exact start or end token couldn't be found:
    # - If the start position isn't found, select the token closest to the start.
    # - If the end position isn't found, select the token closest to the end.
    # We chose this approach to ensure the answer boundaries are as accurate as possible.
    # ========================================================
    if start_token is None:
        for i, (start_char, end_char) in enumerate(offset_mapping):
            if start_char > answer_start:
                start_token = i + offset - 1  # Choose the previous token
                break

    if end_token is None:
        for i, (start_char, end_char) in enumerate(offset_mapping):
            if end_char > answer_end:
                end_token = i + offset  # Choose the current token
                break

    # ========================================================
    # Handle edge cases where no token positions are found:
    # - Default to the first token for start.
    # - Default to a token within the valid range for end, avoiding out-of-bounds errors.
    # ========================================================
    if start_token is None:
        start_token = offset  # Default to the offset if nothing else is found
    if end_token is None:
        end_token = min(encoding["input_ids"].shape[-1] - 1, start_token + 5)  # Use the sequence length (last dim), not the batch size

    # Return the start and end token positions as a tuple, which is what __getitem__ unpacks
    return start_token, end_token

#### **Why do we need this function?**

When training a question-answering model, we have to tell the model exactly where the answer appears in the context. The problem is: the original answer positions are given in characters, but the model processes tokens (like subwords). So this function helps us convert character positions into token positions, making sure the model learns the correct answer boundaries during training.
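To see why correct token positions matter for training, it helps to look at the objective. The sketch below assumes the loss used by Hugging Face question-answering heads, which to our understanding is the average of two cross-entropy terms, one over start logits and one over end logits. The logits are invented numbers and the helper names are hypothetical.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target token under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]

def span_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start and end cross-entropy terms."""
    return 0.5 * (cross_entropy(start_logits, start_position)
                  + cross_entropy(end_logits, end_position))

# Invented logits over a 4-token input.
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [0.0, 0.2, 1.5, -0.5]

loss_aligned = span_loss(start_logits, end_logits, start_position=1, end_position=2)
loss_misaligned = span_loss(start_logits, end_logits, start_position=3, end_position=0)
print(loss_aligned < loss_misaligned)
```

If the labels point at the wrong tokens, the model is penalized for predicting the true span and pushed toward the mislabeled one, so an alignment bug silently degrades training.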

#### **Some notes:**

When we use a pre-trained BERT model to answer questions, we load it with all its settings ready to understand and process natural language accurately. Basically, this model can read a text, interpret a question, and find an answer within that text.

To handle the data efficiently, we create a DataLoader, which works like an assistant that organizes the information and delivers it to the model in small portions (batches). This is especially useful when working with large amounts of data, because it ensures the model doesn't get overwhelmed and can work quickly—even on computers with limited resources.

### **BERT use_fast**

For our project, we decided to use BERT’s fast tokenizer because of its data processing speed. The choice was also driven by our limited computational resources: without it, we could not obtain results within a reasonable time for testing.

It handles large volumes of text more efficiently by using techniques like multithreading, and it supports the offset mapping we rely on to align answers with tokens. We also checked that it preserved all the functionality required to interpret natural language accurately, which was crucial to meet the goals of the project.

In [6]:
squad_dataset = load_dataset("squad")
small_dataset = squad_dataset["train"].select(range(16))  # select a fixed subsample of the first 16 training examples

# Tokenizer
# ========================================================
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Dataset (customized)
# ========================================================
custom_dataset = SQuADDataset(small_dataset, tokenizer)
dataloader = DataLoader(custom_dataset, batch_size=8, shuffle=True)


### **Testing the setup with a small sample**

To keep things fast and easy to debug, we use only a small portion of the SQuAD training set — just 16 examples.  
Although this tiny sample won’t give us great results, it’s perfect for making sure everything is working correctly before scaling up.

We also prepare a DataLoader to organize the data into batches. This helps the model process multiple examples at once, making training more efficient and memory-friendly.
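What the DataLoader does with `batch_size` and `shuffle` can be mimicked in plain Python. This is only a rough sketch of the batching logic: the helper `iter_batches` is hypothetical, and the real `DataLoader` additionally collates each batch into tensors.

```python
import random

def iter_batches(examples, batch_size, shuffle=False, seed=None):
    """Yield lists of examples, optionally in shuffled order (illustrative only)."""
    indices = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for i in range(0, len(indices), batch_size):
        yield [examples[j] for j in indices[i:i + batch_size]]

examples = list(range(16))  # stand-ins for the 16 sampled SQuAD examples
batches = list(iter_batches(examples, batch_size=8, shuffle=True, seed=0))
batch_sizes = [len(batch) for batch in batches]
print(len(batches), batch_sizes)
```

With 16 examples and a batch size of 8, each pass over the data yields two batches; when the dataset size is not a multiple of the batch size, the last batch is simply smaller.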

In [9]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # <<< Move the model to the same device


BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, 

#### **Load the pre-trained question answering model**

We use BertForQuestionAnswering, a version of BERT with a span-prediction head for question answering tasks.  
By loading the "bert-base-uncased" checkpoint from Hugging Face, we get a strong starting point: the encoder already models English, but the question-answering head is newly initialized, so the model still has to be fine-tuned on our data before its answers are meaningful.

In [7]:
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## **Experiment with different hyperparameters, such as learning rate, batch size, and number of epochs.**

In [12]:
def process_batches(dataloader, model, num_batches=2):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)  # Make sure the model is on the right device
    model.eval()

    processed_examples = []

    for i, batch in enumerate(dataloader):
        if i >= num_batches:
            break

        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        token_type_ids = batch["token_type_ids"].to(device)
        start_positions = batch["start_positions"].to(device)
        end_positions = batch["end_positions"].to(device)

        with torch.no_grad():
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids
            )

            start_logits = outputs.start_logits
            end_logits = outputs.end_logits

            predicted_start = torch.argmax(start_logits, dim=1)
            predicted_end = torch.argmax(end_logits, dim=1)

        # Move the tensors to the CPU BEFORE using them
        input_ids_cpu = input_ids.cpu().numpy()
        predicted_start_cpu = predicted_start.cpu().numpy()
        predicted_end_cpu = predicted_end.cpu().numpy()
        actual_start_cpu = start_positions.cpu().numpy()
        actual_end_cpu = end_positions.cpu().numpy()

        for j in range(len(input_ids_cpu)):
            predicted_answer_tokens = input_ids_cpu[j][predicted_start_cpu[j]:predicted_end_cpu[j] + 1]
            predicted_answer = tokenizer.decode(predicted_answer_tokens)

            actual_answer = batch["answer_text"][j]

            processed_examples.append({
                "original_answer": actual_answer,
                "predicted_answer": predicted_answer,
                "actual_start_token": actual_start_cpu[j],
                "actual_end_token": actual_end_cpu[j],
                "predicted_start_token": predicted_start_cpu[j],
                "predicted_end_token": predicted_end_cpu[j]
            })

            print(f"Example {j+1} in batch {i+1}:")
            print(f"Input IDs: {input_ids_cpu[j]}")
            print(f"Start Logits: {start_logits[j]}")
            print(f"End Logits: {end_logits[j]}")
            print(f"Predicted Start Token: {predicted_start_cpu[j]}")
            print(f"Predicted End Token: {predicted_end_cpu[j]}")
            print(f"Actual Start Token: {actual_start_cpu[j]}")
            print(f"Actual End Token: {actual_end_cpu[j]}")
            print(f"Predicted Answer: {predicted_answer}")
            print(f"Actual Answer: {actual_answer}")
            print("-" * 50)

        print(f"Processed batch {i+1}/{num_batches}")

    return processed_examples


**Inspecting model predictions on batches**

Now that our model and data are ready, it’s time to **test the model** by running it on a few small batches of examples.

This function does several key things:
- Moves the model and data to the correct device (CPU or GPU).
- Feeds input data into the model and collects predictions.
- Extracts the most likely start and end tokens of the answer span.
- Converts the predicted token IDs back into readable text.
- Prints and compares the **actual answers** vs. **predicted answers**.

This step is not for training — we’re just evaluating how the model performs with the current settings.  
It’s a great way to debug and **build intuition** about what the model is learning.
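One thing to watch for when reading these outputs: taking the argmax of the start and end logits independently can give an end token that comes before the start token, which decodes to an empty answer. A common alternative is to score start/end pairs jointly and keep only valid spans. The sketch below shows that idea on invented logits; `best_span` and `max_answer_len` are hypothetical names, not part of the notebook.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit, with start <= end."""
    best_score, best = float("-inf"), (0, 0)
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Invented logits where independent argmax picks an invalid span.
start_logits = [0.1, 0.2, 3.0, 0.5]
end_logits = [0.0, 2.5, 0.3, 2.0]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
constrained = best_span(start_logits, end_logits)
print(naive, constrained)
```

The exhaustive search is quadratic in sequence length, which is fine for inspection; restricting the search to the top few start and end candidates is a typical way to speed it up.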

In [13]:
# Define the custom dataset class
# ========================================================
# This class is used to preprocess and prepare the SQuAD dataset for training and evaluation.
# We chose to create this class to customize data handling and ensure that each example
# (question, context, and answer) is converted into a format compatible with the BERT model.

class SQuADDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, tokenizer, max_length=384):
        # ========================================================
        # Constructor for the dataset class.
        # We pass the dataset, tokenizer, and maximum input length as parameters to ensure
        # flexibility for handling different datasets and tokenization settings.
        # ========================================================
        self.dataset = dataset  # The dataset containing examples of questions and answers
        self.tokenizer = tokenizer  # The tokenizer used for encoding text into tokens
        self.max_length = max_length  # Maximum allowed length for tokenized sequences

    def __len__(self):
        # ========================================================
        # Returns the total number of examples in the dataset.
        # This is required by PyTorch to determine the size of the dataset.
        # ========================================================
        return len(self.dataset)

    def __getitem__(self, idx):
        # ========================================================
        # Method to retrieve and process a specific example from the dataset.
        # We chose to include the preprocessing steps directly in this method for simplicity
        # and to ensure that each example is ready to be passed to the model.
        # ========================================================
        example = self.dataset[idx]  # Extract the example at the given index
        question = example["question"]  # The question text
        context = example["context"]  # The context (passage) where the answer is located
        answer_text = example["answers"]["text"][0]  # The correct answer text
        answer_start = example["answers"]["answer_start"][0]  # Start character position of the answer

        # Calculate the end position of the answer based on its length
        answer_end = answer_start + len(answer_text)

        # ========================================================
        # Tokenize the inputs (question and context) using the tokenizer.
        # We chose these specific parameters to ensure proper truncation, padding,
        # and compatibility with BERT for question answering tasks.
        # ========================================================
        encoding = self.tokenizer(
            question,
            context,
            max_length=self.max_length,  # Ensure tokenized sequences don't exceed the maximum length
            truncation="only_second",  # Truncate only the context if it's too long
            padding="max_length",  # Pad inputs to max_length for uniformity during batch processing
            return_tensors="pt",  # Return tensors in PyTorch format
            return_token_type_ids=True,  # Include token type IDs to differentiate question and context
            return_offsets_mapping=True  # Include mappings of character positions to token positions
        )

        # ========================================================
        # Remove the batch dimension created by the tokenizer.
        # Although the tokenizer assumes multiple inputs are being processed together,
        # we are processing one example at a time, so the batch dimension is unnecessary.
        # ========================================================
        input_ids = encoding["input_ids"].squeeze(0)  # Token IDs of the input text
        attention_mask = encoding["attention_mask"].squeeze(0)  # Mask to indicate valid tokens
        token_type_ids = encoding["token_type_ids"].squeeze(0)  # IDs to differentiate question and context

        # Find the token positions for the start and end of the answer
        start_pos, end_pos = self.find_answer_token_positions(context, answer_start, answer_end, encoding)

        # ========================================================
        # Return the processed data in a dictionary format.
        # This format is chosen to ensure all information required by the model is included.
        # ========================================================
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids,
            "start_positions": torch.tensor(start_pos),  # Convert start token position to tensor
            "end_positions": torch.tensor(end_pos),  # Convert end token position to tensor
            "answer_text": answer_text  # Include the original answer text for reference
        }

    def find_answer_token_positions(self, context, answer_start, answer_end, encoding):
        # ========================================================
        # Method to locate the start and end token positions for the answer.
        # We chose to implement this method because BERT's subword tokenization
        # doesn't maintain a straightforward mapping between characters and tokens.
        # ========================================================
        offsets = encoding["offset_mapping"][0].tolist()  # (start_char, end_char) for each token
        segment_ids = encoding["token_type_ids"][0].tolist()  # 0 = question/padding, 1 = context
        start_token = end_token = None  # Initialize start and end token positions as None

        for i, (start, end) in enumerate(offsets):
            # Character offsets restart at 0 for the context, so skip question tokens;
            # otherwise an answer near the start of the context could match the question
            if segment_ids[i] != 1:
                continue
            # Identify the token that contains the start of the answer
            if start <= answer_start < end and start_token is None:
                start_token = i
            # Identify the token that contains the end of the answer
            if start < answer_end <= end and end_token is None:
                end_token = i

        # Handle cases where the answer isn't found (e.g., it was truncated away):
        # point both positions at the [CLS] token (index 0) so the span stays valid
        if start_token is None or end_token is None:
            start_token = end_token = 0

        return start_token, end_token

# Prepare the dataset
# ========================================================
# Select 16 examples from the training set and prepare them using the custom dataset class.
# We chose to limit the data size for initial experimentation and debugging purposes.
train_dataset = SQuADDataset(squad_dataset["train"].select(range(16)), tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)  # Prepare data loader

# Load the model
# ========================================================
# Load a pre-trained BERT model for question answering.
# We chose to use "bert-base-uncased" as it is widely adopted and effective for this task.
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# Define the SGD optimizer
# ========================================================
# Configure the optimizer with Stochastic Gradient Descent (SGD) to update model parameters.
# We chose SGD for its simplicity and effectiveness, with a learning rate of 0.01 and momentum of 0.9.
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)

# Training loop
# ========================================================
# This function fine-tunes the model using the prepared data.
# We chose to implement it with steps for gradient updates, loss computation, and optimizer adjustments.
def train(model, dataloader, optimizer, epochs=3):
    model.train()  # Set the model to training mode
    for epoch in range(epochs):
        for batch in dataloader:
            # Reset gradients before processing the batch
            optimizer.zero_grad()

            # Move batch data to the appropriate device
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            token_type_ids = batch["token_type_ids"].to(device)
            start_positions = batch["start_positions"].to(device)
            end_positions = batch["end_positions"].to(device)

            # Forward pass through the model
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                start_positions=start_positions,
                end_positions=end_positions
            )

            loss = outputs.loss  # Compute the loss
            loss.backward()  # Backpropagate gradients
            optimizer.step()  # Update model parameters

        # Print progress after each epoch
        print(f"Epoch {epoch + 1}/{epochs} completed.")

# Set device
# ========================================================
# Automatically select GPU if available, otherwise use CPU.
# We chose this setup to ensure compatibility with different environments.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # Move the model to the selected device

# Fine-tune the model
# ========================================================
# Run the training loop to fine-tune the model on the prepared dataset.
train(model, train_dataloader, optimizer, epochs=3)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3 completed.
Epoch 2/3 completed.
Epoch 3/3 completed.


### **Custom Dataset for BERT QA**

To train a model on SQuAD, we first define a custom dataset class. This helps us format each (question, context, answer) pair into the input BERT expects — including token IDs, attention masks, and token type IDs.

We also locate where the answer appears in the context, converting character positions into token positions. This step is key for teaching the model where the answer begins and ends.

* **DataLoader and model setup**

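We use a small sample of the SQuAD dataset (16 examples) to keep things fast and testable. The data is loaded in batches of 8 with DataLoader. We then load the pre-trained bert-base-uncased checkpoint, a common and reliable starting point. This checkpoint is not yet fine-tuned for question answering: as the warning above shows, the qa_outputs layer is newly initialized and has to be learned during training.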

* **Optimizer and training**

For training, we use Stochastic Gradient Descent (SGD) with a learning rate of 0.01 and momentum of 0.9. The training loop runs for 3 epochs, updating the model based on the loss between its predicted start and end positions and the correct ones.
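
To make the optimizer step concrete, here is a minimal sketch of the update that SGD with momentum applies to a single scalar parameter, following the formulation documented for torch.optim.SGD (with dampening 0 and no Nesterov momentum). The function name `sgd_momentum_step` and the toy loss are illustrative assumptions, not part of the notebook.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update for a single scalar parameter."""
    # The velocity accumulates past gradients, scaled down by the momentum factor
    velocity = momentum * velocity + grad
    # The parameter moves against the accumulated direction
    param = param - lr * velocity
    return param, velocity


# Toy example: minimize loss(param) = param ** 2, whose gradient is 2 * param
param, velocity = 1.0, 0.0
for step in range(3):
    grad = 2.0 * param
    param, velocity = sgd_momentum_step(param, grad, velocity)
    print(f"step {step + 1}: param={param:.4f}, velocity={velocity:.4f}")
```

With momentum 0.9, each step reuses 90% of the previous update direction, so a learning rate of 0.01 produces larger effective steps than plain SGD would.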

In [15]:
# Evaluate the model on some examples from the dataset
model.eval()
processed_examples = []

# Predict on one batch from the training set (the loop stops once at least 3 examples are collected)
for batch in train_dataloader:
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    token_type_ids = batch["token_type_ids"].to(device)

    with torch.no_grad():
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )

    predicted_start = torch.argmax(outputs.start_logits, dim=1)
    predicted_end = torch.argmax(outputs.end_logits, dim=1)

    # Convert tensors to CPU for processing
    input_ids_cpu = input_ids.cpu().numpy()
    predicted_start_cpu = predicted_start.cpu().numpy()
    predicted_end_cpu = predicted_end.cpu().numpy()
    actual_start_cpu = batch["start_positions"].cpu().numpy()
    actual_end_cpu = batch["end_positions"].cpu().numpy()
    answers = batch["answer_text"]
    
    # Treat each example individually
    for j in range(len(input_ids_cpu)):
        input_id = input_ids_cpu[j]
        tokens = tokenizer.convert_ids_to_tokens(input_id)
        pred_ans = tokenizer.convert_tokens_to_string(tokens[predicted_start_cpu[j]:predicted_end_cpu[j]+1])

        processed_examples.append({
            "original_answer": answers[j],
            "predicted_answer": pred_ans,
            "actual_start_token": actual_start_cpu[j],
            "actual_end_token": actual_end_cpu[j],
            "predicted_start_token": predicted_start_cpu[j],
            "predicted_end_token": predicted_end_cpu[j],
        })

    # We only need a few
    if len(processed_examples) >= 3:
        break

To evaluate the model, we first set it to evaluation mode with model.eval() and then feed a batch from the training dataset through the dataloader. For each example, the model predicts the start and end positions of the answer by taking the argmax of the start and end logits. We slice the tokens between these positions and use the tokenizer to map them back to text. This allows us to compare the model's predicted answer with the ground truth (the original answer).

We then collect the results, including the predicted and actual start/end token positions, for the whole batch and display the first 3 examples. Because these examples come from the training set, this is a sanity check that the pipeline runs end to end rather than a measure of how well the model generalizes.
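
Taking the argmax of the start and end logits independently can produce an end position before the start, or a very long span (one of the printed examples runs from token 83 to token 154). A common alternative is to score start/end pairs jointly. The sketch below is one possible way to do that in plain Python; the function name `best_span`, the `max_answer_len` limit, and the toy logits are assumptions made for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy logits: independent argmax would give start=2 and end=1 (an invalid span)
start_logits = [0.1, 0.2, 3.0, 0.3, 0.1]
end_logits = [0.0, 2.5, 0.4, 1.9, 0.2]
start, end, score = best_span(start_logits, end_logits)
print(start, end)
```

In the notebook, the per-example logits could be converted with `.tolist()` before being passed to a helper like this.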

In [16]:
# Show results for a few examples
# ========================================================
for i, example in enumerate(processed_examples[:3]):
    print(f"\nExample {i+1}:")
    print(f"Original answer: {example['original_answer']}")
    print(f"Predicted answer: {example['predicted_answer']}")
    print(f"Actual token positions: {example['actual_start_token']} to {example['actual_end_token']}")
    print(f"Predicted token positions: {example['predicted_start_token']} to {example['predicted_end_token']}")



Example 1:
Original answer: Old College
Predicted answer: retired priests and brothers reside in fatima house ( a former retreat center ), holy cross house, as well as columba hall near the grotto. the university through the moreau seminary has ties to theologian frederick buechner. while not catholic, buechner has praised writers from notre dame and moreau seminary created a buechner prize for preaching
Actual token positions: 59 to 60
Predicted token positions: 83 to 154

Example 2:
Original answer: three
Predicted answer: three newspapers, both a radio and television station, and several magazines and journals. begun as a one - page journal in september 1876
Actual token positions: 39 to 39
Predicted token positions: 39 to 64

Example 3:
Original answer: a Marian place of prayer and reflection
Predicted answer: a copper statue of christ
Actual token positions: 95 to 101
Predicted token positions: 50 to 54


# **Phase n°2. Optimized QA pipeline**

In [18]:
class SQuADDataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=384):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        example = self.dataset[idx]
        question = example["question"]
        context = example["context"]
        answer = example["answers"]["text"][0]
        answer_start = example["answers"]["answer_start"][0]
        answer_end = answer_start + len(answer)

        # Tokenize inputs with offsets
        encoding = self.tokenizer(
            question,
            context,
            truncation="only_second",
            padding="max_length",
            max_length=self.max_length,
            return_offsets_mapping=True,
            return_tensors="pt"
        )

        input_ids = encoding["input_ids"].squeeze(0)
        attention_mask = encoding["attention_mask"].squeeze(0)
        offsets = encoding["offset_mapping"].squeeze(0)

        # Identify token positions for answer
        start_pos, end_pos = self.char_to_token_positions(offsets, answer_start, answer_end)

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "start_positions": torch.tensor(start_pos),
            "end_positions": torch.tensor(end_pos)
        }

    def char_to_token_positions(self, offsets, char_start, char_end):
        start_token = end_token = 0
        for idx, (start, end) in enumerate(offsets.tolist()):
            if start <= char_start < end:
                start_token = idx
            if start < char_end <= end:
                end_token = idx
        return start_token, end_token


The **SQuADDataset** class is a custom PyTorch dataset for SQuAD. It tokenizes each example (question and context), turning **raw text** into the tensors the model expects. The `__getitem__` method retrieves a specific example, tokenizes it, and identifies the start and end positions of the answer within the tokenized context.

**Key components include:**

* **Tokenizer:** It splits the question and context into tokens, ensuring the sequence doesn't exceed the maximum length (max_length=384). Padding and truncation ensure uniform input lengths.

* **Offsets:** After tokenization, the tokenizer also provides offsets, which map each token to its corresponding character span in the original text. This is important because BERT's tokenization may split words into multiple tokens, and we need to correctly identify the token positions that correspond to the start and end of the answer.

* **char_to_token_positions:** This method converts character-level start and end positions (from the raw text) into token-level positions by iterating over the offsets and identifying the token positions where the answer starts and ends.

In summary, this **class** prepares the dataset by tokenizing text and converting answer positions into tokens, making it ready for training with a model like **BERT**. The transformation is key to enabling the model to correctly predict the answer positions from tokenized input data.
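
The character-to-token conversion can be illustrated without a tokenizer by writing the offsets by hand. The sketch below is a variant of the idea, not the notebook's exact method: it additionally restricts the search to context tokens, because character offsets restart at 0 for the context and a question token can otherwise cover the same character range. The `sequence_ids` list mirrors what fast tokenizers expose (None for special tokens, 0 for the question, 1 for the context); the function name and the toy encoding are assumptions.

```python
def char_span_to_token_span(offsets, sequence_ids, char_start, char_end):
    """Map a [char_start, char_end) answer span in the context to token indices.

    Returns (0, 0) when the answer is not fully covered by context tokens.
    """
    start_token = end_token = None
    for idx, (start, end) in enumerate(offsets):
        if sequence_ids[idx] != 1:  # skip question, special, and padding tokens
            continue
        if start <= char_start < end:
            start_token = idx
        if start < char_end <= end:
            end_token = idx
    if start_token is None or end_token is None:
        return 0, 0
    return start_token, end_token


# Hand-written toy encoding of: [CLS] where ? [SEP] it is in paris . [SEP]
context = "It is in Paris."
offsets = [(0, 0), (0, 5), (5, 6), (0, 0), (0, 2), (3, 5), (6, 8), (9, 14), (14, 15), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, None]

for answer in ["Paris", "It"]:
    char_start = context.index(answer)
    span = char_span_to_token_span(offsets, sequence_ids, char_start, char_start + len(answer))
    print(answer, "->", span)
```

For the answer "It", the question token "where" also covers characters 0 to 5, which is why the segment filter matters in this toy case.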

In [19]:
# Load the full SQuAD dataset
raw_dataset = load_dataset("squad")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Create dataset
train_dataset = SQuADDataset(raw_dataset["train"], tokenizer)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

In [20]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
def train(model, dataloader, optimizer, device, epochs=2):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch in dataloader:
            optimizer.zero_grad()

            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            start_positions = batch["start_positions"].to(device)
            end_positions = batch["end_positions"].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                start_positions=start_positions,
                end_positions=end_positions
            )

            loss = outputs.loss
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{epochs} - Loss: {avg_loss:.4f}")

#### **Training the model with dropout**

Dropout plays a key role in regularization during training. Calling model.train() enables it: on each forward pass, dropout randomly zeroes a fraction of the activations, which discourages overfitting and encourages the model to generalize better. Calling model.eval() disables it again for inference.

**Training Loop Breakdown:**

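* Model in Training Mode (model.train()):
This activates dropout, which helps prevent overfitting by randomly deactivating neurons during training. BERT and DistilBERT use layer normalization rather than batch normalization, so dropout is the only layer whose behavior changes between training and evaluation mode.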

* Data Preparation:
The data is moved to the appropriate device (CPU/GPU), and input features like input_ids, attention_mask, and the start and end positions of the answers are prepared for the model.

* Forward Pass:
The model processes the input data and produces start and end logits, one score per token for each. Because start_positions and end_positions are passed in, it also returns the loss: the average of the cross-entropy losses for the start and end positions.

* Backpropagation (loss.backward()):
The model calculates gradients for all the parameters based on the loss, which are used to update the model’s weights.

* Optimizer Step (optimizer.step()):
The optimizer adjusts the model’s weights using the gradients from backpropagation.

* Average Loss:
The total loss across all batches in an epoch is averaged to monitor the model's learning progress.
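
As a rough illustration of what `outputs.loss` represents, the sketch below computes the span loss for a single example in plain Python: one cross-entropy term for the start position, one for the end position, averaged. The function names and toy logits are assumptions for illustration; the real model computes this over a whole batch of tensors.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


def span_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start and end cross-entropy losses."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2


# A model that is confident and correct gets a small loss
confident = span_loss([0.0, 8.0, 0.0, 0.0], [0.0, 0.0, 8.0, 0.0], 1, 2)
# A model with uniform scores over 4 tokens gets ln(4) for each term
uniform = span_loss([0.0] * 4, [0.0] * 4, 1, 2)
print(f"confident: {confident:.4f}, uniform: {uniform:.4f}")
```

This also gives a reference point for reading the printed epoch losses: an untrained head that spreads probability uniformly over 384 positions would start near ln(384).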

**Why Dropout is important:**

Dropout helps regularize the model by preventing it from relying too heavily on specific neurons, encouraging it to learn more robust features. This is particularly useful when training deep models that might otherwise overfit to the training data.
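
The following sketch shows the idea behind dropout on a plain list of activations, using the "inverted" form in which surviving values are rescaled during training so that nothing needs to change at evaluation time. The function name, the `rng` parameter, and the probability values are illustrative assumptions; in the notebook, the dropout layers inside the model handle this automatically.

```python
import random


def dropout(values, p=0.1, training=True, rng=random):
    """Inverted dropout: zero each value with probability p and rescale the rest."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: values pass through unchanged
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
print("train:", dropout(activations, p=0.5, rng=random.Random(0)))
print("eval: ", dropout(activations, p=0.5, training=False))
```

The `training` flag plays the role that model.train() and model.eval() play for the real model.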

In [22]:
def evaluate(model, tokenizer, dataset, device, num_examples=3):
    model.eval()
    results = []

    for i in range(num_examples):
        example = dataset[i]
        question = example["question"]
        context = example["context"]

        # Tokenize for inference
        encoding = tokenizer(
            question,
            context,
            truncation="only_second",
            padding="max_length",
            max_length=384,
            return_offsets_mapping=True,
            return_tensors="pt"
        )

        input_ids = encoding["input_ids"].to(device)
        attention_mask = encoding["attention_mask"].to(device)
        offset_mapping = encoding["offset_mapping"][0]

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            start_logits = outputs.start_logits
            end_logits = outputs.end_logits

        # Get predicted positions
        start_idx = torch.argmax(start_logits)
        end_idx = torch.argmax(end_logits)

        # Convert token positions to character positions
        start_char = offset_mapping[start_idx][0].item()
        end_char = offset_mapping[end_idx][1].item()

        predicted_answer = context[start_char:end_char]

        results.append({
            "question": question,
            "context": context,
            "true_answer": example["answers"]["text"][0],
            "predicted_answer": predicted_answer
        })

    return results

In [23]:
# 1. Train model
train(model, train_loader, optimizer, device, epochs=2)

# 2. Try the model on a few samples
sample_results = evaluate(model, tokenizer, raw_dataset["validation"], device, num_examples=3)

# 3. Print results
for i, res in enumerate(sample_results):
    print(f"\n--- Example {i+1} ---")
    print(f"Question: {res['question']}")
    print(f"True Answer: {res['true_answer']}")
    print(f"Predicted Answer: {res['predicted_answer']}")


Epoch 1/2 - Loss: 1.4105
Epoch 2/2 - Loss: 0.9084

--- Example 1 ---
Question: Which NFL team represented the AFC at Super Bowl 50?
True Answer: Denver Broncos
Predicted Answer: Denver Broncos

--- Example 2 ---
Question: Which NFL team represented the NFC at Super Bowl 50?
True Answer: Carolina Panthers
Predicted Answer: Carolina Panthers

--- Example 3 ---
Question: Where did Super Bowl 50 take place?
True Answer: Santa Clara, California
Predicted Answer: Levi's Stadium


#### **Evaluation and real predictions**

After training, we evaluate the model on a few examples from the SQuAD validation set to see how well it performs. The evaluate() function selects a few samples, tokenizes the question-context pairs, and feeds them through the model. It then identifies the most likely start and end positions of the answer span using the highest scoring tokens (argmax of logits). These token indices are mapped back to character positions in the original context to extract the predicted answer. This approach allows us to compare what the model predicts versus the ground truth provided in the dataset.

In the printed results, the model performs well on two questions, correctly predicting "Denver Broncos" and "Carolina Panthers" as the respective teams in the Super Bowl. On the third question it returns "Levi's Stadium", while the reference we compare against is "Santa Clara, California": a related location, but not the same text span. Note that evaluate() checks only the first reference answer (answers["text"][0]), whereas the SQuAD validation set can list several acceptable answers per question, so a plausible prediction may still be reported as a miss. These examples give us a quick sense of how well the model understands and extracts relevant information from the context.
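
The step that maps predicted token indices back to text can be sketched on its own with hand-written offsets. The example below assumes offsets for the context tokens only and adds a guard for the case where the predicted end precedes the predicted start; the function name and the toy sentence are illustrative assumptions.

```python
def extract_answer(context, offsets, start_idx, end_idx):
    """Slice the context using the character offsets of the predicted tokens."""
    if end_idx < start_idx:
        return ""  # invalid span: the end token precedes the start token
    start_char = offsets[start_idx][0]
    end_char = offsets[end_idx][1]
    return context[start_char:end_char]


context = "Played in Santa Clara, California."
# Tokens: played | in | santa | clara | , | california | .
offsets = [(0, 6), (7, 9), (10, 15), (16, 21), (21, 22), (23, 33), (33, 34)]

print(extract_answer(context, offsets, 2, 5))  # Santa Clara, California
```

Slicing the original context, rather than joining tokens, preserves the original casing and punctuation spacing.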

## **Additional validation**

In [24]:
def ask_question(model, tokenizer, question, context):
    model.eval()
    inputs = tokenizer(
        question,
        context,
        return_tensors="pt",
        truncation="only_second",
        padding="max_length",
        max_length=384,
        return_offsets_mapping=True
    )

    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)
    offset_mapping = inputs["offset_mapping"][0]

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        start_idx = torch.argmax(outputs.start_logits)
        end_idx = torch.argmax(outputs.end_logits)

    start_char = offset_mapping[start_idx][0].item()
    end_char = offset_mapping[end_idx][1].item()

    return context[start_char:end_char]

In [25]:
context = "The Eiffel Tower is located in Paris, France. It was built in 1889 and is 324 meters tall."
question = "Where is the Eiffel Tower located?"
answer = ask_question(model, tokenizer, question, context)
print("Predicted Answer:", answer)

Predicted Answer: Paris, France


### **Ask your own questions**

To make the model interactive, we define an ask_question() function that lets us input any question and context and get an answer from the trained model. The function uses the tokenizer to prepare the inputs and maps the predicted token positions back to the original context to extract the answer text. It works similarly to the evaluation function but is designed for custom user inputs instead of examples from the dataset.

In the test case provided, the context is a short passage about the Eiffel Tower, and the question is "Where is the Eiffel Tower located?". The model correctly predicts "Paris, France" as the answer. This shows that the fine-tuned model also works on text outside the SQuAD dataset, at least for a short, unambiguous passage.

# **Evaluation**

In [26]:
def evaluate_model_on_dataset(model, tokenizer, dataset, device, num_samples=100):
    model.eval()
    exact_matches = []
    f1_scores = []

    for i in range(min(num_samples, len(dataset))):
        example = dataset[i]
        question = example["question"]
        context = example["context"]
        true_answer = example["answers"]["text"][0]

        predicted_answer = ask_question(model, tokenizer, question, context)

        # Compute metrics
        metrics = compute_metrics(predicted_answer, true_answer)
        exact_matches.append(metrics["exact_match"])
        f1_scores.append(metrics["f1"])

    avg_em = sum(exact_matches) / len(exact_matches)
    avg_f1 = sum(f1_scores) / len(f1_scores)

    print(f" Evaluation on {len(exact_matches)} samples:")
    print(f"🔹 Average Exact Match: {avg_em:.2f}")
    print(f"🔹 Average F1 Score: {avg_f1:.2f}")

    return avg_em, avg_f1


This function, evaluate_model_on_dataset, evaluates the model's overall performance on a sample of the validation dataset using two standard QA metrics: Exact Match (EM) and F1 Score. For each example, it predicts the answer using the ask_question() function, compares it to the first reference answer, and calculates both metrics through a compute_metrics helper, which must be defined before this cell runs. EM checks if the prediction exactly matches the correct answer, while F1 measures overlap between the predicted and actual answers in terms of tokens. The results give us a quantitative sense of how well the model is doing.
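
Since the compute_metrics helper is not shown in this part of the notebook, here is a hedged sketch of how SQuAD-style EM and F1 are commonly computed: both strings are normalized (lowercased, punctuation and articles removed), EM compares the normalized strings, and F1 is computed over the shared tokens. The function names below, including `squad_em_f1`, are assumptions and may differ from the notebook's own helper; this version also scores against every reference answer and keeps the best.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def squad_em_f1(prediction, references):
    """Score a prediction against every reference answer and keep the best."""
    return {
        "exact_match": max(exact_match(prediction, r) for r in references),
        "f1": max(f1_score(prediction, r) for r in references),
    }


print(squad_em_f1("the Denver Broncos", ["Denver Broncos"]))
print(squad_em_f1("Santa Clara", ["Santa Clara, California"]))
```

Passing all of example["answers"]["text"] as `references`, instead of only the first entry, avoids penalizing predictions that match an alternative reference answer.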

In [None]:
evaluate_model_on_dataset(model, tokenizer, raw_dataset["validation"], device, num_samples=100)

# **Phase n°3. Final tests and model**

In [None]:
warnings.simplefilter("ignore")

weight_path = "kaporter/bert-base-uncased-finetuned-squad"
# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained(weight_path)
# Load the model
model = BertForQuestionAnswering.from_pretrained(weight_path)

In [None]:
# Tokenize question and context
tokens = tokenizer.tokenize("Pregunta [SEP] Contexto")
print(tokens) # Make sure '[SEP]' is present

sep_idx = tokens.index('[SEP]')

# Assign 0 to the question tokens, including the [SEP] token that separates the question from the context, and 1 to the rest.
token_type_ids = [0 for i in range(sep_idx+1)] + [1 for i in range(sep_idx+1,len(tokens))]
print(token_type_ids)

['pre', '##gun', '##ta', '[SEP]', 'context', '##o']
[0, 0, 0, 0, 1, 1]


#### **Understanding token_type_ids for Question-Answering**
                            
In question-answering tasks like SQuAD, BERT expects the input in the form:
[CLS] question [SEP] context [SEP]

To help the model distinguish between the question and the context, we use token_type_ids. DistilBERT has no segment embeddings and does not take this input, which is why Phase 2 passed only input_ids and attention_mask. These IDs are a list of 0s and 1s:

* 0 for tokens belonging to the question (including the [SEP] token right after the question),

* 1 for tokens belonging to the context.

This correctly sets up the token_type_ids so the model knows which part is the question and which is the context.
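
The rule can be written as a small helper that works on any token list containing the separator. This is only a sketch in plain Python; the function name `build_token_type_ids` is an assumption, and the token list is the one this notebook prints for the placeholder pair "Pregunta" / "Contexto" once [CLS] and the final [SEP] are included.

```python
def build_token_type_ids(tokens, sep_token="[SEP]"):
    """Return 0 up to and including the first [SEP], then 1 for the rest."""
    sep_idx = tokens.index(sep_token)  # position of the first separator
    return [0] * (sep_idx + 1) + [1] * (len(tokens) - sep_idx - 1)


tokens = ["[CLS]", "pre", "##gun", "##ta", "[SEP]", "context", "##o", "[SEP]"]
print(build_token_type_ids(tokens))  # [0, 0, 0, 0, 0, 1, 1, 1]
```

Note that the final [SEP] belongs to the context segment and therefore gets a 1.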

In [None]:
encoding = tokenizer.encode_plus(
    "Pregunta", "Contexto",
    return_token_type_ids=True
)
print(encoding['token_type_ids'])

[0, 0, 0, 0, 0, 1, 1, 1]


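This confirms that the tokenizer treats "Pregunta" and "Contexto" as two distinct segments, which is essential for accurate question-answering: [CLS], the three question word pieces, and the first [SEP] get 0, while the two context word pieces and the final [SEP] get 1.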

In [None]:
# Verify that `input_ids` and `token_type_ids` are nested lists
print(input_ids)  # Make sure it's something like [[...]]
print(token_type_ids)

tensor([[  101,  1996, 15080,  1997,  8972,  5445,  2034,  4158,  1999,  2054,
          2095,  2012, 10289,  8214,  1029,   102,  1996,  2118,  2034,  3253,
          4619,  5445,  1010,  1999,  1996,  2433,  1997,  1037,  3040,  1997,
          2840,  1006,  5003,  1007,  1010,  1999,  1996,  8421,  1516,  8492,
          3834,  2095,  1012,  1996,  2565,  4423,  2000,  2421,  3040,  1997,
          4277,  1006,  2222,  1012,  1049,  1012,  1007,  1998,  3040,  1997,
          2942,  3330,  1999,  2049,  2220,  5711,  1997,  3930,  1010,  2077,
          1037,  5337,  4619,  2082,  2495,  2001,  2764,  2007,  1037,  9459,
          2025,  3223,  2000,  4374,  1996,  5445,  1012,  2023,  2904,  1999,
          4814,  2007,  5337,  5918,  2764,  2005,  4619,  5445,  1010,  2164,
          5378,  8972,  1006,  8065,  1007,  5445,  1012,  2651,  2169,  1997,
          1996,  2274,  6667,  3749,  4619,  2495,  1012,  2087,  1997,  1996,
          7640,  2013,  1996,  2267,  1997,  2840,  

In [None]:
# Example of using input_ids and token_type_ids manually for question answering

# Step 1: Encode question and context as a single sequence.
# The '[SEP]' token is used to separate the question from the context.
# We use `add_special_tokens=True` to ensure [CLS] and [SEP] tokens are added correctly.
input_ids = tokenizer.encode("Pregunta [SEP] Contexto", add_special_tokens=True)
# Convert input_ids back to readable tokens.
# This is useful to inspect the positions and structure of the input.
tokens = tokenizer.convert_ids_to_tokens(input_ids)
sep_idx = tokens.index('[SEP]')
token_type_ids = [0 for i in range(sep_idx+1)] + [1 for i in range(sep_idx+1, len(tokens))]

# Make sure they are nested lists
# Wrap input_ids and token_type_ids into lists to simulate a batch of size 1.
# The model expects batched input (i.e., nested lists or tensors with shape [batch_size, sequence_length]).
input_ids = [input_ids]
token_type_ids = [token_type_ids]

# Convert lists to PyTorch tensors so they can be used by the model.
input_ids = torch.tensor(input_ids)
token_type_ids = torch.tensor(token_type_ids)

# Run the model in inference mode to get predictions.
# We feed in the input_ids and token_type_ids.
out = model(input_ids, token_type_ids=token_type_ids)

# Extract the logits (raw predictions) for start and end positions of the answer span.
start_logits = out.start_logits
end_logits = out.end_logits

# Get start and end indexes
# Use argmax to get the most probable start and end positions.
# These are indices in the tokenized input where the answer is likely located.
answer_start = torch.argmax(start_logits, dim=1).item()
answer_end = torch.argmax(end_logits, dim=1).item()

# Build response
# Reconstruct the answer from the original tokens.
ans = ''.join(tokens[answer_start:answer_end+1])
print('Predicted answer:', ans)

Predicted answer: [CLS]pre##gun##ta[SEP]context##o


#### **Manual token type IDs for question-answering**

In this example, we manually construct token_type_ids to explicitly separate the question from the context. This is useful when you're building input tensors yourself and want to ensure the model can distinguish between the two segments.

We then wrap the lists as nested lists (batch format), convert them to tensors, and pass them to the model. Finally, we extract the predicted span using argmax on the start and end logits, and reconstruct the answer from tokens. Since the placeholder text contains no real answer, the predicted span simply runs from [CLS] to the last context token. This is a hands-on way to understand how BERT handles question-context separation internally using token type IDs; DistilBERT does not use them.
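
Joining the tokens with an empty string keeps the "##" markers and the special tokens in the output. The sketch below shows one way to turn WordPiece tokens back into readable text in plain Python; the function name `wordpieces_to_text` is an assumption. With a real tokenizer, `tokenizer.convert_tokens_to_string`, used earlier in this notebook, handles the "##" merging.

```python
def wordpieces_to_text(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Drop special tokens and glue '##' continuation pieces onto the previous word."""
    words = []
    for token in tokens:
        if token in special_tokens:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append without a space
        else:
            words.append(token)
    return " ".join(words)


tokens = ["[CLS]", "pre", "##gun", "##ta", "[SEP]", "context", "##o", "[SEP]"]
print(wordpieces_to_text(tokens[0:7]))  # pregunta contexto
```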


In [None]:
print("Tokens:", tokens)
print("Start index:", answer_start)
print("End index:", answer_end)

Tokens: ['[CLS]', 'pre', '##gun', '##ta', '[SEP]', 'context', '##o', '[SEP]']
Start index: 0
End index: 6


In [None]:
from datasets import load_dataset
dataset = load_dataset("squad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Each entry in the dataset includes:

* An **ID** (unique identifier)

* A **title** (name of the Wikipedia page)

* The **context** (a paragraph from the article)

* The **question**

* The **answers** (a list of possible correct answers and their start character positions)

In [None]:
# to make text bold

# Define ANSI escape sequences for bold text
s_bold = '\033[1m'
e_bold = '\033[0;0m'

# Print sample from training data
print(s_bold + 'Train Data Sample.....' + e_bold)
train_data = dataset["train"]
for data in train_data:
    print(' ')
    # Print ID, Title, Context, Answer, and Answer Start Index in bold
    print(s_bold + 'ID -' + e_bold, data['id'])
    print(s_bold +'TITLE - '+ e_bold, data['title'])
    print(s_bold + 'CONTEXT - '+ e_bold,data['context'])
    print(s_bold + 'ANSWERS - ' + e_bold,data['answers']['text'])
    print(s_bold + 'ANSWERS START INDEX - ' + e_bold,data['answers']['answer_start'])
    print(' ')
    break # Only print first sample

# Separator between train and validation data
print('---'*30)
# Print sample from validation data
print(s_bold + 'Validation Data Sample.....' + e_bold)
val_data = dataset["validation"]
for data in val_data:
    print(' ')
    # Print ID, Title, Context, Answer, and Answer Start Index in bold
    print(s_bold + 'ID -' + e_bold, data['id'])
    print(s_bold +'TITLE - '+ e_bold, data['title'])
    print(s_bold + 'CONTEXT - '+ e_bold,data['context'])
    print(s_bold + 'ANSWERS - ' + e_bold,data['answers']['text'])
    print(s_bold + 'ANSWERS START INDEX - ' + e_bold,data['answers']['answer_start'])
    print(' ')
    break # Only print first sample

Train Data Sample.....
 
ID - 5733be284776f41900661182
TITLE -  University_of_Notre_Dame
CONTEXT -  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
ANSWERS -  ['Saint Bernadette Soubirous']
ANSWERS START INDEX -  [515]
 
-----------------------------------------------------------------------

To better understand the structure of the SQuAD dataset and what the model is expected to learn, we print one sample from the training set and one from the validation set. The code uses ANSI escape sequences (\033[1m and \033[0;0m) to display the field labels in bold in the cell output. For each sample, we print the key fields: the ID, the title of the article, the context (a paragraph from Wikipedia), the list of reference answers, and the character position where each answer starts in the context. This shows what kind of input the model will process and what an answer span looks like in the raw text. Only the first entry from each split is printed here, but the same loop can be extended to inspect more examples or to run data quality checks before training.
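One data quality check of the kind mentioned above is to confirm that each answer text really sits at its recorded character offset. The sketch below assumes records laid out like the SQuAD entries printed earlier; the helper name answers_match_context and the two sample records are hypothetical and not taken from the dataset.

```python
def answers_match_context(example):
    """Return True if every answer text is found at its recorded character offset."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Hypothetical record laid out like a SQuAD entry (not taken from the dataset).
context = "The Grotto is a replica of the grotto at Lourdes, France."
good = {
    "context": context,
    "answers": {
        "text": ["Lourdes, France"],
        "answer_start": [context.index("Lourdes, France")],
    },
}
# Same record with a deliberately wrong offset.
bad = {
    "context": context,
    "answers": {"text": ["Lourdes, France"], "answer_start": [0]},
}

print(answers_match_context(good))  # True
print(answers_match_context(bad))   # False
```

The same function could be passed to a filter over the dataset to list suspicious records before training.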

In [None]:
# Keep only the training samples whose number of reference answers is not exactly 1.
# The result is empty: every training example has a single reference answer.
dataset["train"].filter(lambda x: len(x["answers"]["text"]) != 1)

Filter:   0%|          | 0/87599 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

In [None]:
# Apply the same filter to the validation set: most validation examples have several reference answers.
dataset["validation"].filter(lambda x: len(x["answers"]["text"]) != 1)

Filter:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10567
})

In [None]:
# Sample a subset of the dataset to reduce training time.
# Keep the first 8000 training samples
dataset["train"] = dataset["train"].select(range(8000))
# Keep the first 2000 validation samples to reduce validation time
dataset["validation"] = dataset["validation"].select(range(2000))
# Display the modified dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 2000
    })
})

In [None]:
# model_checkpoint = "bert-base-cased"
# tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Specify the trained model checkpoint (DistilBERT in this case)
trained_checkpoint = "distilbert-base-uncased"
# Load the tokenizer corresponding to the pre-trained model
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

# Extract context, question, and answer from the first example in the train dataset
context = dataset["train"][0]["context"]
question = dataset["train"][0]["question"]
answer = dataset["train"][0]["answers"]["text"]

# Tokenize the question and context pair
inputs = tokenizer(
    question,
    context,
    max_length=160,  # maximum number of tokens per feature (question + context + special tokens)
    truncation="only_second",  # truncate only the context, never the question
    stride=70,  # number of overlapping tokens between consecutive context pieces
    return_overflowing_tokens=True,  # return the overflowing pieces as extra features
)

# Print the number of features produced and the sample each one comes from.
# Only one example is tokenized here, so every feature maps back to sample 0.
print(f"The 4 examples gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

# Display the question, context, and answer from the dataset
print('Question: ',question)
print(' ')
print('Context : ',context)
print(' ')
print('Answer: ', answer)
print('--'*25)

# Iterate over each tokenized piece of context and decode the tokens to view them
for i,ids in enumerate(inputs["input_ids"]):
    print('Context piece', i+1)
    print(tokenizer.decode(ids[ids.index(tokenizer.sep_token_id):]))  # decode from the first [SEP] onward
    print(' ')

The 4 examples gave 2 features.
Here is where each comes from: [0, 0].
Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
 
Context :  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
 
Answer:  ['Saint Bernadette Soubirous']
--------------------------------------------------
Context piece 1
[SEP] architecturally, the s

#### **Tokenization and Preprocessing for Question Answering**

In this section, we delve into tokenization, a crucial step in preparing text data for use with transformer models like DistilBERT. Tokenization breaks down text into manageable pieces (tokens), enabling the model to understand and process it. The steps below explain how we use tokenization and the importance of each part of the process.

1. **Loading the pre-trained model and tokenizer**:
First, we load a pre-trained tokenizer associated with the model we want to use (DistilBERT here). The tokenizer is responsible for converting text into token IDs that the model can process.

2. **Extracting data**:
We extract the context, question, and answer from the first sample in the training dataset.

3. **Tokenizing the question and context**:
Next, we tokenize the question and context pair. Key parameters:

* max_length: Caps the number of tokens in each feature (question, context piece, and special tokens together).

* truncation: With "only_second", only the context is truncated when the pair exceeds max_length; the question is always kept whole.

* stride: Sets how many tokens overlap between consecutive context pieces, so an answer that falls near a boundary is likely to appear whole in at least one piece.

* return_overflowing_tokens: Returns the truncated remainder as additional features instead of discarding it, which splits a long context into multiple pieces.

4. **Exploring the tokenized output**:
The tokenized output contains input_ids (token IDs) and overflow_to_sample_mapping (the index of the original sample each feature came from). We print the number of features and the mapping.

5. **Visualizing the question, context, and answer:**
We display the original question, context, and the correct answer to understand the data we’re working with.

6. **Decoding the tokenized context**:
Finally, we decode the tokenized context back to readable text. This allows us to visualize how the context was split into pieces that the model will process.
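The interaction between the length limit and stride can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions: the function split_with_stride is hypothetical, it works on a bare list of context tokens, and its window argument stands for the room left for the context (the real tokenizer also counts the question and special tokens against max_length).

```python
def split_with_stride(token_ids, window, stride):
    """Split token_ids into pieces of at most `window` tokens that overlap by `stride` tokens."""
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    step = window - stride  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the current piece already reaches the end
        start += step
    return chunks


chunks = split_with_stride(list(range(10)), window=4, stride=2)
for piece in chunks:
    print(piece)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

Each piece repeats the last two tokens of the previous one, which is the overlap that stride controls.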

In [None]:
del tokenizer
# Load the tokenizer associated with the pre-trained model (DistilBERT)
trained_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

def train_data_preprocess(examples):
    """
    Generate the start and end token indices of the answer within the context for each example.
    """

    def find_context_start_end_index(sequence_ids):
        """
        Return the token indices where the context starts and ends.
        """
        token_idx = 0
        while sequence_ids[token_idx] != 1:  # skip special tokens and question tokens
            token_idx += 1                   # the loop stops at the first context token
        context_start_idx = token_idx

        while sequence_ids[token_idx] == 1:  # advance past the last context token
            token_idx += 1
        context_end_idx = token_idx - 1
        return context_start_idx, context_end_idx

    # Extract questions, context, and answers
    questions = [q.strip() for q in examples["question"]]
    context = examples["context"]
    answers = examples["answers"]

    # Tokenize the question-context pairs
    inputs = tokenizer(
        questions,
        context,
        max_length=512,  # maximum number of tokens per feature
        truncation="only_second",  # truncate only the context if needed
        stride=128,  # overlap between consecutive context pieces
        return_overflowing_tokens=True,  # adds overflow_to_sample_mapping (index of the source example)
        return_offsets_mapping=True,  # returns (start_char, end_char) for each token
        padding="max_length"  # pad every feature to max_length
    )


    start_positions = []
    end_positions = []

    for i,mapping_idx_pairs in enumerate(inputs['offset_mapping']):
        context_idx = inputs['overflow_to_sample_mapping'][i]

        # from main context
        # Get the answer from the context
        answer = answers[context_idx]
        answer_start_char_idx = answer['answer_start'][0]
        answer_end_char_idx = answer_start_char_idx + len(answer['text'][0])

        # now we have to find it in sub contexts
        tokens = inputs['input_ids'][i]
        sequence_ids = inputs.sequence_ids(i)

        # finding the context start and end indexes wrt sub context tokens
        context_start_idx,context_end_idx = find_context_start_end_index(sequence_ids)

        # Character span of the context covered by this feature
        context_start_char_index = mapping_idx_pairs[context_start_idx][0]
        context_end_char_index = mapping_idx_pairs[context_end_idx][1]

        # If the answer is not fully inside this feature's context, label it (0, 0)
        if (context_start_char_index > answer_start_char_idx) or (
            context_end_char_index < answer_end_char_idx):
            start_positions.append(0)
            end_positions.append(0)

        else:
            # Otherwise, find the token positions of the answer.
            # Here idx is a token index.
            idx = context_start_idx
            while idx <= context_end_idx and mapping_idx_pairs[idx][0] <= answer_start_char_idx:
                idx += 1
            start_positions.append(idx - 1)  # first token of the answer

            idx = context_end_idx
            while idx >= context_start_idx and mapping_idx_pairs[idx][1] > answer_end_char_idx:
                idx -= 1
            # When the answer ends on a token boundary, the loop stops on the last answer
            # token, so idx + 1 is one past it (an exclusive end index, which matches
            # input_ids[start:end] slicing).
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

train_sample = dataset["train"].select(range(200))

train_dataset = train_sample.map(
    train_data_preprocess,
    batched=True,
    remove_columns=dataset["train"].column_names
)

len(dataset["train"]),len(train_dataset)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

(8000, 200)

We begin by loading the pre-trained DistilBERT tokenizer, which converts text into the token IDs the model consumes. The train_data_preprocess function processes a batch of examples and computes, for each tokenized feature, the start and end token indices of the answer within the context. Inside it, the helper find_context_start_end_index uses the sequence IDs to determine where the context begins and ends in the tokenized sequence. The function then converts the answer's character span into token positions with the offset mapping. When the answer is not fully contained in a feature's context piece, that feature is labeled (0, 0). Finally, we map the function over a 200-example subset of the training data and print the sizes of the original and processed datasets as a sanity check: 8000 and 200, which means no context in this subset overflowed into a second feature.
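The character-to-token alignment can be followed on a toy example without loading a tokenizer. This is a sketch under stated assumptions: the function char_span_to_token_span and the hand-written offsets are hypothetical, and the function returns an inclusive end token, which is the convention used in the Hugging Face course. The notebook function above compares with `>` in its second loop, so its end index lands one token later.

```python
def char_span_to_token_span(offsets, context_start, context_end, answer_start, answer_end):
    """Map a character span [answer_start, answer_end) to an inclusive token span.

    offsets holds one (start_char, end_char) pair per token; context_start and
    context_end are the first and last context token indices. Returns (0, 0)
    when the answer is not fully inside the context piece.
    """
    if offsets[context_start][0] > answer_start or offsets[context_end][1] < answer_end:
        return 0, 0

    idx = context_start
    while idx <= context_end and offsets[idx][0] <= answer_start:
        idx += 1
    start_token = idx - 1

    idx = context_end
    while idx >= context_start and offsets[idx][1] >= answer_end:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Hand-written offsets for "[CLS] who ? [SEP] she appeared to saint bernadette [SEP]".
context = "she appeared to saint bernadette"
offsets = [(0, 0), (0, 3), (3, 4), (0, 0),
           (0, 3), (4, 12), (13, 15), (16, 21), (22, 32), (0, 0)]
answer = "saint bernadette"
answer_start = context.index(answer)
answer_end = answer_start + len(answer)

span = char_span_to_token_span(offsets, 4, 8, answer_start, answer_end)
print(span)  # (7, 8)
```

Tokens 7 and 8 are "saint" and "bernadette", so the token span covers exactly the answer text.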

In [None]:
def print_context_and_answer(idx,mini_ds=dataset["train"]):
    # Print the index of the current example to keep track of which example we're printing
    print(idx)
    print('----')
    # Extract the question, context, and answer from the dataset for the given index
    question = mini_ds[idx]['question']
    context = mini_ds[idx]['context']
    answer = mini_ds[idx]['answers']['text']
    # Print the theoretical values (original question, context, and answer) as seen in the dataset
    print('Theoretical values :')
    print(' ')
    print('Question: ')
    print(question)
    print(' ')
    print('Context: ')
    print(context)
    print(' ')
    print('Answer: ')
    print(answer)
    print(' ')
    # Get the start and end character indices of the answer in the original context
    answer_start_char_idx = mini_ds[idx]['answers']['answer_start'][0]
    answer_end_char_idx = answer_start_char_idx + len(mini_ds[idx]['answers']['text'][0])
    # Print the start and end character indices of the answer in the original context
    print('Start and end index of text: ',answer_start_char_idx,answer_end_char_idx)
    print('----'*20)
    # Print the values after tokenization (processed data)
    print('Values after tokenization:')

    # Note: idx indexes train_dataset features; it matches the example index only
    # because no example in this sample overflowed into several features.
    # Find the index of the first [SEP] token (it separates the question from the context)
    sep_tok_index = train_dataset[idx]['input_ids'].index(tokenizer.sep_token_id)
    # Split the token IDs into question and context at [SEP] and decode them
    question_ = train_dataset[idx]['input_ids'][:sep_tok_index+1]
    question_decoded = tokenizer.decode(question_)
    context_ = train_dataset[idx]['input_ids'][sep_tok_index+1:]
    context_decoded = tokenizer.decode(context_)
    # Read the labeled start and end token positions of the answer
    start_idx = train_dataset[idx]['start_positions']
    end_idx = train_dataset[idx]['end_positions']
    # Decode the answer tokens between the labeled positions
    answer_toks = train_dataset[idx]['input_ids'][start_idx:end_idx]
    answer_decoded = tokenizer.decode(answer_toks)

    # Print the decoded values (question, context, answer) after tokenization
    print(' ')
    print('Question: ')
    print(question_decoded)
    print(' ')
    print('Context: ')
    print(context_decoded)
    print(' ')
    print('Answer: ')
    print(answer_decoded)
    print(' ')
    # Print the token start and end positions after tokenization
    print('Start pos and end pos of tokens: ',train_dataset[idx]['start_positions'],train_dataset[idx]['end_positions'])
    print('____'*20)

# Calling the function for multiple examples in the dataset to visualize context and answer
print_context_and_answer(0)
print_context_and_answer(1)
print_context_and_answer(2)
print_context_and_answer(3)

0
----
Theoretical values :
 
Question: 
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
 
Context: 
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
 
Answer: 
['Saint Bernadette Soubirous']
 
Start and end index of text:  515 541
--------------------------------------------------------------------------------
Values after tok

To check how the preprocessing transforms the input data for our Question Answering (QA) model, we created the function **print_context_and_answer()**. It does two things: first, it prints the original question, context, and answer as they appear in the raw dataset, along with the character-level indices that mark where the answer appears in the context. Then, it prints the same elements as they appear after tokenization: the decoded question and context, the decoded tokens of the labeled answer, and the start and end token positions. No model is involved at this point; the goal is to confirm that the labels are correct before the data is fed into the model.

Running this function on a few dataset examples lets us visually confirm that the tokenized input matches the original, human-readable input. For instance, in the first example the question is "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?", and the context contains a sentence indicating that she appeared to "Saint Bernadette Soubirous". After tokenization, the question and context should remain readable (lowercased, since the tokenizer is uncased), and decoding the tokens between the labeled start and end positions should reproduce the answer. This step is especially important for debugging and quality-checking the preprocessing pipeline, because any misalignment between the original and tokenized data would give the model wrong labels and hurt its performance.

In [None]:
def preprocess_validation_examples(examples):
    """
    Preprocesses the validation data for a Question Answering model.

    This function tokenizes the input question-context pairs using a pretrained tokenizer.
    It handles long contexts by creating overlapping chunks (with a stride), and ensures
    proper alignment between original text and token positions through offset mappings.
    """
    # Remove leading/trailing whitespace from all questions
    questions = [q.strip() for q in examples["question"]]
    # Tokenize the question-context pairs with specific parameters
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=512,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Map each chunk to the original example it came from
    sample_map = inputs.pop("overflow_to_sample_mapping")

    base_ids = [] # Will store the original example IDs for each chunk

    for i in range(len(inputs["input_ids"])):

        # Look up the original example this chunk came from (several chunks may share one)
        base_context_idx = sample_map[i]
        base_ids.append(examples["id"][base_context_idx])

        # Get the sequence IDs: 0 for question tokens, 1 for context tokens,
        # and None for special tokens like [CLS] or [SEP]
        sequence_ids = inputs.sequence_ids(i)
        # Get the character offsets for the current chunk
        offset = inputs["offset_mapping"][i]
        # Keep offsets only for context tokens; set all others to None
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]
    # Attach the original example IDs to the tokenized inputs
    inputs["base_id"] = base_ids
    return inputs


# del tokenizer
# Load tokenizer from a pretrained checkpoint
trained_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)
# Take a sample of 100 examples from the validation set
data_val_sample = dataset["validation"].select(range(100))
# Apply the preprocessing function to the validation examples
# - batched=True allows processing multiple examples at once
# - remove_columns drops the original columns since we now return tokenized data
eval_set = data_val_sample.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)
# Check the number of processed examples (including overflowed chunks)
len(eval_set)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

100

This function prepares validation data for a Question Answering model by tokenizing question-context pairs using a pretrained tokenizer. It handles long contexts by splitting them into overlapping chunks, ensuring that no relevant information is lost. Each tokenized chunk is linked back to its original example, and only the context tokens are kept for mapping text positions (ignoring question and special tokens). This allows the model to later align predictions with the original context. Finally, we apply this function to a sample of 100 validation examples and convert them into a format the model can use for evaluation.
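To illustrate why the offsets of non-context tokens are set to None, the sketch below masks a hand-written offset list and then maps a token span back to the original text. The helper names keep_context_offsets and span_to_text, as well as the toy offsets and sequence IDs, are assumptions made for this illustration only.

```python
def keep_context_offsets(offsets, sequence_ids):
    """Keep offsets for context tokens (sequence id 1) and replace all others with None."""
    return [o if sid == 1 else None for o, sid in zip(offsets, sequence_ids)]


def span_to_text(context, masked_offsets, start_token, end_token):
    """Return the context text for an inclusive token span, or None if it touches a masked token."""
    if masked_offsets[start_token] is None or masked_offsets[end_token] is None:
        return None
    return context[masked_offsets[start_token][0]:masked_offsets[end_token][1]]


# Toy feature for "[CLS] who ? [SEP] she appeared to saint bernadette [SEP]".
context = "she appeared to saint bernadette"
offsets = [(0, 0), (0, 3), (3, 4), (0, 0),
           (0, 3), (4, 12), (13, 15), (16, 21), (22, 32), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, None]

masked = keep_context_offsets(offsets, sequence_ids)
print(span_to_text(context, masked, 7, 8))  # saint bernadette
print(span_to_text(context, masked, 0, 2))  # None
```

A span that starts or ends in the question or on a special token is rejected, which is how masked offsets help discard invalid predictions at evaluation time.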

# **Final model BERT**

In [None]:
from datasets import load_dataset
dataset = load_dataset("squad")

# Sample a small subset of the dataset
dataset['train'] = dataset['train'].select(range(5000))
dataset['validation'] = dataset['validation'].select(range(500))

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5000
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 500
    })
})

In this final section of the notebook we load the SQuAD dataset (Stanford Question Answering Dataset) using the datasets library, which provides access to many standard NLP datasets. Then, we create smaller samples from the original training and validation sets to speed up processing and make it easier to experiment. Specifically, we keep only the first 5,000 training examples and 500 validation examples. This is useful for testing and development, especially when working with limited resources or when training time needs to be reduced.

In [None]:
# Define a custom Dataset class for Question Answering
class DataQA(Dataset):
    def __init__(self, dataset,mode="train"):
        self.mode = mode # set mode to either "train" or "validation"


        if self.mode == "train":
            # Use the training split and apply the training preprocessing function
            self.dataset = dataset["train"]
            self.data = self.dataset.map(
                train_data_preprocess,
                batched=True,
                remove_columns=dataset["train"].column_names,
            )
        else:
            # Use the validation split and apply the validation preprocessing function
            self.dataset = dataset["validation"]
            self.data = self.dataset.map(
                preprocess_validation_examples,
                batched=True,
                remove_columns=dataset["validation"].column_names,
            )

    def __len__(self):
        # Return the number of processed examples.
        # Defined at class level so that len() and DataLoader can use it.
        return len(self.data)

    def __getitem__(self, idx):
        # Get one processed example by index
        out = {}
        example = self.data[idx]
        # Convert input_ids and attention_mask to PyTorch tensors
        out['input_ids'] = torch.tensor(example['input_ids'])
        out['attention_mask'] = torch.tensor(example['attention_mask'])

        # If training, also return start and end positions for the answer
        if self.mode == "train":

            out['start_positions'] = torch.unsqueeze(torch.tensor(example['start_positions']),dim=0)
            out['end_positions'] = torch.unsqueeze(torch.tensor(example['end_positions']),dim=0)

        return out


In this part we define a custom dataset class called DataQA, which prepares our **question-answering data** to be used with **PyTorch**. When we initialize the class, it takes in the dataset and a mode ("train" or "validation"). If the mode is **"train"**, it selects the training data and applies a preprocessing function to prepare the inputs and labels (i.e., where the correct answer starts and ends in the context). If the mode is **"validation"**, it uses a different preprocessing function suited for evaluation, without the answer labels.

The `__getitem__` method allows us to retrieve one example at a time, formatted as PyTorch tensors. Each example includes the tokenized inputs (input_ids) and an attention mask. When in training mode, it also includes the correct start and end positions of the answer to help the model learn. This class structure makes it easy to feed the data into a **DataLoader** for both training and validation, keeping the workflow clean and modular.

In [None]:
# Load the pre-trained DistilBERT tokenizer
trained_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

# Create the training and validation datasets using the custom DataQA class
train_dataset = DataQA(dataset,mode="train")
val_dataset = DataQA(dataset,mode="validation")


# Let's inspect the first few examples in the training dataset
for i,d in enumerate(train_dataset):
    for k in d.keys():
        # Print the name and shape of each tensor in the example (e.g., input_ids, attention_mask, etc.)
        print(k + ' : ', d[k].shape)
    print('--'*40)
    # Stop after printing 4 training samples
    if i == 3:
        break

print('__'*50)

# Now inspect the first few examples in the validation dataset
for i,d in enumerate(val_dataset):
    for k in d.keys():
        # Print the name and length of each tensor in the validation example
        print(k + ' : ', len(d[k]))
    print('--'*40)
    # Stop after printing 4 validation samples
    if i == 3:
        break

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

input_ids :  torch.Size([512])
attention_mask :  torch.Size([512])
start_positions :  torch.Size([1])
end_positions :  torch.Size([1])
--------------------------------------------------------------------------------
input_ids :  torch.Size([512])
attention_mask :  torch.Size([512])
start_positions :  torch.Size([1])
end_positions :  torch.Size([1])
--------------------------------------------------------------------------------
input_ids :  torch.Size([512])
attention_mask :  torch.Size([512])
start_positions :  torch.Size([1])
end_positions :  torch.Size([1])
--------------------------------------------------------------------------------
input_ids :  torch.Size([512])
attention_mask :  torch.Size([512])
start_positions :  torch.Size([1])
end_positions :  torch.Size([1])
--------------------------------------------------------------------------------
____________________________________________________________________________________________________
input_ids :  512
attention_mask :  

We initialize the tokenizer from the pre-trained DistilBERT model and create two dataset objects—one for training and one for validation—using the custom DataQA class. The training set is tokenized and labeled with answer start and end positions, while the validation set only includes input IDs and attention masks. We then print the structure of the first four examples from both sets. The printed results confirm that each training example contains tensors of shape [512] for input_ids and attention_mask, and [1] for start_positions and end_positions, which matches the expected input format for a question-answering model. The validation set includes only input_ids and attention_mask, each with a length of 512, as it doesn't need answer positions. This output validates that the data has been correctly preprocessed and is ready for training and evaluation.

In [None]:
# Define a custom Dataset class for Question Answering
class DataQA(Dataset):
    def __init__(self, dataset, mode="train"):
        self.mode = mode # Set the mode to either "train" or "validation"

        # If the mode is 'train', process the training dataset
        if self.mode == "train":
            # sampling
            self.dataset = dataset["train"]
            self.data = self.dataset.map(
                train_data_preprocess, # Apply the training data preprocessing function
                batched=True,
                remove_columns=dataset["train"].column_names,
            )
        else: # If the mode is 'validation', process the validation dataset
            self.dataset = dataset["validation"]
            self.data = self.dataset.map(
                preprocess_validation_examples, # Apply the validation data preprocessing function
                batched=True,
                remove_columns=dataset["validation"].column_names,
            )

    def __len__(self):  # Define __len__ outside the conditional blocks
        return len(self.data)

    def __getitem__(self, idx):
        # Get an example from the dataset at the given index
        out = {}
        example = self.data[idx] # Retrieve the example at index 'idx'
        # Convert the input IDs and attention mask to PyTorch tensors
        out["input_ids"] = torch.tensor(example["input_ids"])
        out["attention_mask"] = torch.tensor(example["attention_mask"])

        # If in training mode, return the start and end positions of the answer as well
        if self.mode == "train":
            out["start_positions"] = torch.unsqueeze(
                torch.tensor(example["start_positions"]), dim=0
            )
            out["end_positions"] = torch.unsqueeze(
                torch.tensor(example["end_positions"]), dim=0
            )

        return out

# Load the dataset
dataset = load_dataset("squad")

# Sample a small subset of the training and validation datasets for experimentation
dataset['train'] = dataset['train'].select(range(5000)) # Select the first 5000 examples from the training data
dataset['validation'] = dataset['validation'].select(range(500)) # Select the first 500 examples from the validation data

# Load the pre-trained tokenizer from the distilbert-base-uncased checkpoint
trained_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

# Create instances of the DataQA class for both training and validation datasets
train_dataset = DataQA(dataset, mode="train")
val_dataset = DataQA(dataset, mode="validation")

# Create DataLoader instances for both training and validation datasets
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=2, 
)
eval_dataloader = DataLoader(
    val_dataset, collate_fn=default_data_collator, batch_size=2 # Set the batch size to 2
)

# Take one batch from the training DataLoader and print the shape of each tensor
for batch in train_dataloader:
    print(batch['input_ids'].shape)
    print(batch['attention_mask'].shape)
    print(batch['start_positions'].shape)
    print(batch['end_positions'].shape)
    break

print('---'*20)

# Do the same for one validation batch (no answer positions here)
for batch in eval_dataloader:
    print(batch['input_ids'].shape)
    print(batch['attention_mask'].shape)
    break

torch.Size([2, 512])
torch.Size([2, 512])
torch.Size([2, 1])
torch.Size([2, 1])
------------------------------------------------------------
torch.Size([2, 512])
torch.Size([2, 512])


Here we define and use a custom PyTorch dataset class for Question Answering tasks based on the SQuAD dataset. The DataQA class extends PyTorch’s Dataset and is designed to handle both training and validation modes. In the constructor (__init__), depending on the mode ("train" or "validation"), it selects the appropriate dataset split and applies a preprocessing function: train_data_preprocess for training and preprocess_validation_examples for validation. These functions tokenize the text and prepare the necessary input features. The __len__ method correctly returns the size of the preprocessed dataset, and __getitem__ returns a dictionary with tensors: for all samples, it returns input_ids and attention_mask, and in training mode, it also returns the answer’s start_positions and end_positions for supervised learning.

After defining the dataset class, the script loads the SQuAD dataset using Hugging Face’s load_dataset function and selects small samples (5000 training and 500 validation examples) for quick experimentation. It then loads the tokenizer from the distilbert-base-uncased checkpoint, which will be used to tokenize questions and contexts. With the datasets and tokenizer ready, the script instantiates DataQA for both training and validation data. These are wrapped in DataLoader objects for efficient batching and optional shuffling (enabled during training). A collate function, default_data_collator, is used to merge individual samples into a batch. Finally, to verify that everything works as expected, the script prints the shapes of the tensors in one batch from the training and validation data. This check confirms that input features have been correctly formatted and batched for model training and evaluation.
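
To make the batching step concrete, here is a minimal pure-Python sketch of what collation does conceptually: it groups the per-example dictionaries returned by `__getitem__` into one dictionary of per-key batches. The `collate` and `shape` helpers are hypothetical stand-ins written for illustration; they are not the implementation of `default_data_collator`, which additionally stacks the values into tensors.

```python
def collate(samples):
    """Merge a list of per-example dicts into one dict of per-key batches."""
    return {key: [sample[key] for sample in samples] for key in samples[0]}


def shape(nested):
    """Return the dimensions of a rectangular nested list, e.g. (2, 512)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Two toy training examples shaped like the output of DataQA.__getitem__:
# 512 token ids, a 512-entry attention mask, and one-element answer positions.
# The values themselves are made up.
samples = [
    {"input_ids": [0] * 512, "attention_mask": [1] * 512,
     "start_positions": [17], "end_positions": [19]},
    {"input_ids": [0] * 512, "attention_mask": [1] * 512,
     "start_positions": [42], "end_positions": [42]},
]

batch = collate(samples)
for name, values in batch.items():
    print(name, shape(values))
```

With a batch size of 2 this yields the same dimensions as the printed output: (2, 512) for the token-level fields and (2, 1) for the answer positions.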

In [None]:
from transformers import DistilBertForQuestionAnswering
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Available device: {device}')

checkpoint =  "distilbert-base-uncased"
model = DistilBertForQuestionAnswering.from_pretrained(checkpoint)
model = model.to(device)

Available device: cuda


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Define the optimizer to update the model parameters during training
# AdamW is a variant of Adam with decoupled weight decay, often used with transformer-based models
# lr=2e-5 specifies the learning rate for the optimization process
optimizer = AdamW(model.parameters(), lr=2e-5)

# Set the number of epochs for training. This means the entire training dataset will be passed through the model 2 times.
epochs = 2

# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs
print(total_steps)

# Define a helper function 'format_time' to convert elapsed time (in seconds) into a human-readable format
# This is useful for tracking and displaying the time spent during the training process (e.g., how long an epoch takes)
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))

    # Convert the elapsed time into a time string formatted as hours:minutes:seconds
    return str(datetime.timedelta(seconds=elapsed_rounded))

5010


In [None]:
# We need the processed validation data to look up token offsets at evaluation time
validation_processed_dataset = dataset["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)

After loading DistilBertForQuestionAnswering and moving it to the available device (GPU or CPU), we initialize the AdamW optimizer, which is commonly used for transformer-based models, with a learning rate of 2e-5, a typical value for fine-tuning. Training is configured to run for 2 epochs, and the total number of training steps is the number of batches multiplied by the number of epochs (2,505 × 2 = 5,010 here). The format_time helper converts elapsed seconds into an h:mm:ss string so that progress can be monitored during training. Finally, the validation split is preprocessed once more and kept as validation_processed_dataset. The model itself only consumes input_ids and attention_mask; the extra columns kept here, such as offset_mapping and base_id, are needed after inference to map predicted token indices back to character spans in the original context.
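
The step count and the time formatting can be checked with a small worked example. This is a sketch under one assumption: the figure of 5,010 training features is inferred from the 2,505 batches of size 2 shown in the training log, not read from the preprocessing code, and `count_training_steps` is a hypothetical helper introduced only for illustration.

```python
import datetime
import math


def count_training_steps(num_features, batch_size, epochs):
    """Number of optimizer steps: batches per epoch times number of epochs."""
    batches_per_epoch = math.ceil(num_features / batch_size)
    return batches_per_epoch * epochs


def format_time(elapsed):
    """Convert seconds to an h:mm:ss string (same idea as the notebook helper)."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))


# Assumed: 5,010 features after preprocessing, batch size 2, 2 epochs.
steps = count_training_steps(num_features=5010, batch_size=2, epochs=2)
print(steps)               # 2505 batches per epoch * 2 epochs
print(format_time(293.4))  # 293 seconds expressed as h:mm:ss
```

Note that the number of features can exceed the 5,000 selected examples, presumably because long contexts are split into more than one window during preprocessing.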

In [None]:
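def predict_answers_and_evaluate(start_logits, end_logits, eval_set, examples):
    """
    Turn start/end logits into text answers and score them with the SQuAD metric.

    Args:
        start_logits: start-position prediction logits, one row per feature
        end_logits: end-position prediction logits, one row per feature
        eval_set: processed validation data (with offset_mapping and base_id)
        examples: unprocessed validation data with the context text

    Returns:
        (predicted_answers, metrics), where metrics holds exact_match and f1
    """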
    # appending all id's corresponding to the base context id
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(eval_set):
        example_to_features[feature["base_id"]].append(idx)


    n_best = 20 # Define the number of top predictions to consider for each answer
    max_answer_length = 30 # Limit the maximum length of the predicted answer
    predicted_answers = [] # List to store the final predicted answers

    # Iterate through each example to generate predictions
    for example in examples:
        example_id = example["id"]
        context = example["context"]
        answers = []  # List to store potential answers for the current example

        # Loop over every feature (sub-context window) of this example
        # and collect candidate answers
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            # Index the row first so only this feature's offsets are loaded
            offsets = eval_set[feature_index]["offset_mapping"]

            # Sort the logits over all token positions and keep the indices
            # of the n_best highest-scoring start and end tokens
            start_indexes = np.argsort(start_logit).tolist()[::-1][:n_best]
            end_indexes = np.argsort(end_logit).tolist()[::-1][:n_best]

            # Generate possible answers by pairing each start index with each end index
            for start_index in start_indexes:
                for end_index in end_indexes:

                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length.
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                       ):
                        continue
                    # Append the valid answer to the list along with its logit score
                    answers.append({
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                        })


        # If we found any answers, select the one with the highest logit score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            # If no valid answers were found, append an empty answer
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    # Load the SQuAD evaluation metric
    metric = evaluate.load("squad")

    # Prepare the theoretical (true) answers for comparison with the predictions
    theoretical_answers = [
            {"id": ex["id"], "answers": ex["answers"]} for ex in examples
    ]

    # Compute the evaluation metrics (Exact Match and F1 score)
    metric_ = metric.compute(predictions=predicted_answers, references=theoretical_answers)
    return predicted_answers,metric_

The **predict_answers_and_evaluate** function turns the model's raw outputs into text answers and scores them on the validation set. It takes the **logits** for the start and end positions, groups the features by the example they belong to, and for each feature keeps the 20 highest-scoring start and end tokens. Every start/end pair is checked for valid **lengths and positions**: both tokens must lie inside the context, the end must not come before the start, and the span must be at most 30 tokens long. The pair with the highest combined logit score becomes the **best answer**, and its text is sliced out of the original context using the character offsets. Finally, the function computes the **performance metrics**, Exact Match and F1 score, with the SQuAD metric, so we can see how well the model is doing and whether it needs more tuning.
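
The span search described above can be reproduced on a toy input without the model. The following is a minimal pure-Python sketch for a single feature; `best_span`, the logits, and the offsets are all made up for illustration, and it assumes (as the `is None` check in the function suggests) that tokens outside the context carry an offset of `None`.

```python
def best_span(start_logit, end_logit, offsets, context,
              n_best=20, max_answer_length=30):
    """Return (text, score) of the highest-scoring valid answer span."""
    # Indices of the n_best largest logits, highest first.
    start_indexes = sorted(range(len(start_logit)),
                           key=lambda i: start_logit[i], reverse=True)[:n_best]
    end_indexes = sorted(range(len(end_logit)),
                         key=lambda i: end_logit[i], reverse=True)[:n_best]

    best = ("", float("-inf"))
    for s in start_indexes:
        for e in end_indexes:
            # Skip tokens that are not part of the context.
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logit[s] + end_logit[e]
            if score > best[1]:
                # Slice the answer text out of the original context.
                best = (context[offsets[s][0]:offsets[e][1]], score)
    return best


context = "The capital of France is Paris"
# Three non-context tokens, six context tokens, one trailing special token.
offsets = [None, None, None,
           (0, 3), (4, 11), (12, 14), (15, 21), (22, 24), (25, 30),
           None]
start_logit = [0.1, 0.0, 0.2, 0.3, 0.1, 0.0, 1.0, 0.2, 4.0, 0.5]
end_logit = [0.0, 0.1, 0.1, 0.2, 0.3, 0.1, 0.5, 0.4, 3.5, 5.0]

text, score = best_span(start_logit, end_logit, offsets, context)
print(text, score)
```

In this toy input the highest end logit sits on the trailing special token, but that position is skipped because its offset is `None`, so the selected span is the single context token covering "Paris".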

In [None]:
# Set a random seed for reproducibility of results
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Storing all training and validation stats
stats = []


# Measure total training time
total_train_time_start = time.time()

for epoch in range(epochs):
    print(' ')
    print(f'=====Epoch {epoch + 1}=====')
    print('Training....')

    # ===========
    #    Train
    # ===========
    # measure how long training epoch takes
    t0 = time.time()

    training_loss = 0 # Initialize the loss for this epoch
    # loop through train data
    model.train()
    for step,batch in enumerate(train_dataloader):


        # Print progress every 40 batches
        if step % 40 == 0 and step != 0:
            elapsed_time = format_time(time.time() - t0)
            # Report progress to the user
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed_time))


        # Move the batch data to the device (GPU or CPU)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)


        # Set gradients to zero before backpropagation
        model.zero_grad()
        # Forward pass through the model to get results
        result = model(input_ids = input_ids,
                        attention_mask = attention_mask,
                        start_positions = start_positions,
                        end_positions = end_positions,
                        return_dict=True)

        loss = result.loss  # Extract the loss from the result

        # Accumulate the loss over batches to calculate the average at the end
        training_loss += loss.item()

        # Perform backpropagation to calculate gradients
        loss.backward()

        # Update the model parameters using the computed gradients
        optimizer.step()


    # Calculate the average loss for the epoch
    avg_train_loss = training_loss/len(train_dataloader)

    # Measure how long the training epoch took
    training_time = format_time(time.time() - t0)


    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))


    # ===============
    #    Validation
    # ===============

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()


    start_logits,end_logits = [],[] # Lists to store the start and end logits for each batch
    for step,batch in enumerate(eval_dataloader):


        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)


        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            result = model(input_ids=input_ids,
                           attention_mask=attention_mask,
                           return_dict=True)


        # Append the predicted start and end logits to the lists
        start_logits.append(result.start_logits.cpu().numpy())
        end_logits.append(result.end_logits.cpu().numpy())


    start_logits = np.concatenate(start_logits)
    end_logits = np.concatenate(end_logits)
    # start_logits = start_logits[: len(val_dataset)]
    # end_logits = end_logits[: len(val_dataset)]


    # Evaluate the predictions using the helper function
    answers,metrics_ = predict_answers_and_evaluate(start_logits,end_logits,validation_processed_dataset,dataset["validation"])
    print(f'Exact match: {metrics_["exact_match"]}, F1 score: {metrics_["f1"]}')


    print('')
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)

    print("  Validation took: {:}".format(validation_time))

# Print the final message after training completes
print("")
print("Training complete!")

# Print the total time taken for training
print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_train_time_start)))

 
=====Epoch 1=====
Training....
  Batch    40  of  2,505.    Elapsed: 0:00:05.
  Batch    80  of  2,505.    Elapsed: 0:00:09.
  Batch   120  of  2,505.    Elapsed: 0:00:14.
  Batch   160  of  2,505.    Elapsed: 0:00:19.
  Batch   200  of  2,505.    Elapsed: 0:00:23.
  Batch   240  of  2,505.    Elapsed: 0:00:28.
  Batch   280  of  2,505.    Elapsed: 0:00:33.
  Batch   320  of  2,505.    Elapsed: 0:00:38.
  Batch   360  of  2,505.    Elapsed: 0:00:43.
  Batch   400  of  2,505.    Elapsed: 0:00:47.
  Batch   440  of  2,505.    Elapsed: 0:00:52.
  Batch   480  of  2,505.    Elapsed: 0:00:57.
  Batch   520  of  2,505.    Elapsed: 0:01:01.
  Batch   560  of  2,505.    Elapsed: 0:01:06.
  Batch   600  of  2,505.    Elapsed: 0:01:11.
  Batch   640  of  2,505.    Elapsed: 0:01:15.
  Batch   680  of  2,505.    Elapsed: 0:01:20.
  Batch   720  of  2,505.    Elapsed: 0:01:25.
  Batch   760  of  2,505.    Elapsed: 0:01:29.
  Batch   800  of  2,505.    Elapsed: 0:01:34.
  Batch   840  of  2,505.  

Exact match: 24.4, F1 score: 61.91984322523023

  Validation took: 0:04:16
 
=====Epoch 2=====
Training....
  Batch    40  of  2,505.    Elapsed: 0:00:05.
  Batch    80  of  2,505.    Elapsed: 0:00:10.
  Batch   120  of  2,505.    Elapsed: 0:00:14.
  Batch   160  of  2,505.    Elapsed: 0:00:19.
  Batch   200  of  2,505.    Elapsed: 0:00:24.
  Batch   240  of  2,505.    Elapsed: 0:00:29.
  Batch   280  of  2,505.    Elapsed: 0:00:34.
  Batch   320  of  2,505.    Elapsed: 0:00:38.
  Batch   360  of  2,505.    Elapsed: 0:00:43.
  Batch   400  of  2,505.    Elapsed: 0:00:48.
  Batch   440  of  2,505.    Elapsed: 0:00:52.
  Batch   480  of  2,505.    Elapsed: 0:00:57.
  Batch   520  of  2,505.    Elapsed: 0:01:01.
  Batch   560  of  2,505.    Elapsed: 0:01:06.
  Batch   600  of  2,505.    Elapsed: 0:01:11.
  Batch   640  of  2,505.    Elapsed: 0:01:15.
  Batch   680  of  2,505.    Elapsed: 0:01:20.
  Batch   720  of  2,505.    Elapsed: 0:01:25.
  Batch   760  of  2,505.    Elapsed: 0:01:29.

We train the model for **two epochs**. During each epoch the model is trained batch by batch, with progress reported every 40 steps so we can track the elapsed time. The **first epoch** took around 4 minutes and 53 seconds, with an average training loss of 1.65. On the validation set the model then achieved an **exact match** score of **24.4%** and an **F1 score** of approximately **61.92%**. In the **second epoch** the average training loss fell to 0.90 and validation performance improved slightly, with exact match rising to 25.4% and F1 to 63.73%. Overall, the **total training time** for the two epochs, including validation, was approximately **18 minutes and 15 seconds.**
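
The gap between the exact match and F1 scores is easier to read with the two metrics side by side. The sketch below is a simplified re-implementation for a single prediction and a single reference, written for illustration only; the notebook itself relies on `evaluate.load("squad")`, which also handles multiple reference answers and reports percentages. The reference string used here is an assumed example, not taken from SQuAD.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "albert einstein is"   # a span with one extra token
truth = "Albert Einstein"           # assumed reference answer
print(exact_match(prediction, truth))
print(round(f1_score(prediction, truth), 2))
```

A prediction that contains the right answer plus one extra token gets no exact-match credit but still earns a high F1 (precision 2/3, recall 1), which is one way a model can show a much higher F1 than exact match.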

In [None]:
def ask_question(model, tokenizer, question, context, device):
    # Prepare entry: question and context
    inputs = tokenizer(
        question, context, max_length=384, truncation=True, padding="max_length", return_tensors="pt"
    ).to(device)

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get start and end logits
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Get start and end indexes
    start_idx = torch.argmax(start_logits, dim=1).item()
    end_idx = torch.argmax(end_logits, dim=1).item()

    # Extract response
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_idx:end_idx+1])
    )

    return answer

# Example: Ask the model
question = "Who wrote the book?"
context = "The book was written by J.K. Rowling."

answer = ask_question(model, tokenizer, question, context, device)
print(f"Question: {question}")
print(f"Answer: {answer}")

Question: Who wrote the book?
Answer: j. k. rowling.


In [None]:
# Example: Ask the model
question = "Who developed the theory of relativity?"
context = "Albert Einstein is credited with developing the theory of relativity, which revolutionized our understanding of space, time, and gravity."

answer = ask_question(model, tokenizer, question, context, device)
print(f"Question: {question}")
print(f"Answer: {answer}")

Question: Who developed the theory of relativity?
Answer: albert einstein is


In [None]:
# Example: Ask the model
question = "What is the capital of France"
context = "The capital of France is Paris"

answer = ask_question(model, tokenizer, question, context, device)
print(f"Question: {question}")
print(f"Answer: {answer}")

Question: What is the capital of France
Answer: paris [SEP]


In [None]:
question = "What were the main objectives of the Apollo 11 mission, how did it lead to advances in modern space exploration, and what was its cultural significance?"
context = (
    "The Apollo 11 mission, launched by NASA in 1969, aimed to land humans on the Moon and return them safely to Earth. "
    "This mission marked the beginning of human exploration of other celestial bodies, leading to technological advancements such as improved spacecraft design. "
    "It also inspired humanity by symbolizing ambition and progress, becoming a cultural icon of what humans can achieve when united."
)

# Run the same steps as ask_question inline, but with max_length=512 and
# truncation restricted to the context ("only_second")
inputs = tokenizer(
    question, context, max_length=512, truncation="only_second", padding="max_length", return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Get logits
start_logits = outputs.start_logits
end_logits = outputs.end_logits

# Identify response tokens
start_idx = torch.argmax(start_logits, dim=1).item()
end_idx = torch.argmax(end_logits, dim=1).item()

# Convert to response
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_idx:end_idx+1])
)
print(f"Question: {question}")
print(f"Answer: {answer}")

Question: What were the main objectives of the Apollo 11 mission, how did it lead to advances in modern space exploration, and what was its cultural significance?
Answer: to land humans on the moon and return them safely to earth. this mission marked the beginning of human exploration of other celestial bodies, leading to technological advancements such as improved spacecraft design. it also inspired humanity by symbolizing ambition and progress,


The code above shows how to query the fine-tuned model through a small helper. The ask_question function tokenizes the question and context together, truncating or padding them to the maximum input length, and runs the model to obtain logits for the start and end positions of the answer. It takes the argmax of each set of logits and converts the tokens between those two indices back into a string with the tokenizer. Because distilbert-base-uncased lowercases its input, the answers are printed in lowercase. The examples show that the model finds the right region of the context, but taking the two argmax values independently does not always yield a clean span: the first answer keeps the trailing period, the second includes the extra word "is", the third includes the [SEP] token, and the Apollo 11 answer covers most of the passage. Unlike predict_answers_and_evaluate, this helper does not restrict the span to context tokens or limit its length.
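
One possible refinement, sketched here under stated assumptions rather than taken from the notebook, is to choose the start and end jointly and only among context tokens. The `pick_span` helper, the token list, and the logits below are hypothetical and exist only to illustrate the idea on the "capital of France" example.

```python
def pick_span(start_logits, end_logits, context_mask, max_answer_length=30):
    """Return the (start, end) pair with the highest combined logit such that
    both positions are context tokens, end >= start, and the span is short
    enough. Returns None when no valid span exists."""
    best_score, best_pair = float("-inf"), None
    for s, s_logit in enumerate(start_logits):
        if not context_mask[s]:
            continue
        last = min(s + max_answer_length, len(end_logits))
        for e in range(s, last):
            if not context_mask[e]:
                continue
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best_pair = score, (s, e)
    return best_pair


tokens = ["[CLS]", "what", "is", "the", "capital", "of", "france", "[SEP]",
          "the", "capital", "of", "france", "is", "paris", "[SEP]"]
# True only for the six context tokens (positions 8 to 13).
context_mask = [False] * 8 + [True] * 6 + [False]

# Made-up logits: the end logit peaks on the final [SEP], as in the example.
start_logits = [0.0] * 15
end_logits = [0.0] * 15
start_logits[13] = 5.0
end_logits[13] = 3.5
end_logits[14] = 4.0

# Independent argmax, as in ask_question.
naive_start = max(range(15), key=lambda i: start_logits[i])
naive_end = max(range(15), key=lambda i: end_logits[i])
naive_answer = " ".join(tokens[naive_start:naive_end + 1])

# Joint search restricted to context tokens.
start, end = pick_span(start_logits, end_logits, context_mask)
masked_answer = " ".join(tokens[start:end + 1])

print(naive_answer)
print(masked_answer)
```

With a fast tokenizer, a mask like `context_mask` could be derived from the encoding's `sequence_ids()`, which marks special tokens with `None` and context tokens with `1` when the question is passed first; that wiring is left out here to keep the sketch independent of the model.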

---

## **Final reflections**

This notebook has demonstrated the implementation of a basic yet functional Question Answering (QA) system, built by fine-tuning a pre-trained DistilBERT model from Hugging Face's Transformers library on a 5,000-example subset of SQuAD. Through a step-by-step process, we explored how to tokenize input, pass it through the model, and extract meaningful answers from natural language context. The ask_question function we developed encapsulates this pipeline, showcasing how deep learning models can be applied to real-world natural language processing (NLP) tasks.

By testing the model with various question-context pairs, we observed both its strengths—such as its ability to identify accurate spans of text—and its limitations, including occasional imprecision or inclusion of irrelevant tokens. These outcomes highlight the importance of careful preprocessing, model choice, and understanding token-level predictions when working with transformer-based models.

This exercise provided hands-on experience with PyTorch and the Hugging Face tools and reinforced key NLP concepts such as tokenization, interpreting start and end logits, and evaluating span predictions with Exact Match and F1. The notebook can serve as a foundation for exploring more advanced QA techniques or multilingual applications. With training on the full dataset and further fine-tuning, this kind of system can become a useful tool for extracting information from unstructured text.