The following code assembles the end-to-end data pipeline for fine-tuning a BERT-family question-answering model on my course materials. This cell imports the DistilBERT classes (DistilBERT is a lighter, distilled version of BERT), while the fine-tuning section further down loads bert-base-uncased. The pipeline consists of five phases: environment setup, data ingestion, text cleaning, GPU detection, and dataset preparation.

The code first imports the required libraries: PyTorch for tensor operations, pandas for DataFrame manipulation, the re module for pattern-based text cleaning, Hugging Face's transformers library for the model and tokenizer classes, and the datasets library for the Dataset format used during training. It then mounts my Google Drive with drive.mount() to reach the Excel file containing my web-scraped course data; this dataset holds structured question-answer pairs and their context passages extracted from the course modules.

Before cleaning, rows missing a Title, Context, Question, or Answer are dropped and the text columns are cast to strings. The cleaning step then applies four text-preprocessing passes: HTML tag removal takes care of web-scraping artifacts; URL elimination removes hyperlinks; whitespace normalization collapses runs of whitespace into a single space; and special-character filtering keeps only ASCII letters, digits, and basic punctuation. These transformations give the model cleaner, more uniform input.
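To make the four cleaning passes concrete, here is a minimal sketch applied to one made-up string. The function name `clean_text_sketch` and the sample text are illustrative assumptions, and the ordering (whitespace normalization last, so that gaps left by removed URLs are collapsed too) is a choice made for this sketch.

```python
import re

def clean_text_sketch(text: str) -> str:
    """Illustrative cleaner: tags, URLs, character filter, then whitespace."""
    text = re.sub(r'<[^>]+>', '', text)                       # remove HTML tags
    text = re.sub(r'http\S+', '', text)                       # remove URLs
    text = re.sub(r'[^A-Za-z0-9.,;:?!\'"()\-\s]', '', text)   # keep letters, digits, basic punctuation
    text = re.sub(r'\s+', ' ', text)                          # collapse whitespace last
    return text.strip()

# Hypothetical scraped fragment, not taken from the course dataset
sample = "<p>Grid   search is a <b>brute-force</b> method. See https://example.com/docs for details!</p>"
print(clean_text_sketch(sample))
# Grid search is a brute-force method. See for details!
```

One side effect worth knowing: the character filter is ASCII-only, so accented letters and symbols such as `&` are dropped rather than transliterated.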

The code then calls torch.cuda.is_available() to detect whether a GPU is present and selects it as the device; training on a GPU is substantially faster than on a CPU, by a rough estimate 5-10x. Finally, the preprocessed data is converted into Hugging Face's Dataset format and split 80/20 into training and evaluation sets using train_test_split(test_size=0.2, seed=42). The fixed seed makes the split reproducible across runs, which keeps experimental baselines consistent while I iteratively refine my QA model.
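The effect of a fixed seed can be illustrated without the datasets library. The sketch below is a plain-Python stand-in, not the algorithm that train_test_split uses internally; the helper name `split_indices` is hypothetical, and only the row count (710) and the 80/20 ratio come from this notebook.

```python
import random

def split_indices(n_rows: int, test_size: float, seed: int):
    """Shuffle row indices with a seeded RNG and cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # same seed gives the same permutation
    n_test = round(n_rows * test_size)     # 710 * 0.2 -> 142
    return indices[n_test:], indices[:n_test]

train_a, test_a = split_indices(710, 0.2, seed=42)
train_b, test_b = split_indices(710, 0.2, seed=42)

print(len(train_a), len(test_a))   # 568 142
print(test_a == test_b)            # True: the same seed reproduces the split
```

The sizes match the 568 and 142 rows reported in the cell output, and rerunning with the same seed yields the same evaluation rows, which is what makes metric comparisons between runs meaningful.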

In [226]:
import torch
import pandas as pd
import re
from datasets import Dataset
from transformers import DistilBertForQuestionAnswering, DistilBertTokenizerFast
from transformers import TrainingArguments, Trainer
from google.colab import drive


# Mount Drive
drive.mount('/content/drive/', force_remount=True)

# Load the Excel file
df = pd.read_excel("/content/drive/My Drive/Data Collection (ITE Elective Course Lesson)/Dataset/Webscraped data_Modules_Question and Answering.xlsx")


# Text Preprocessing Dataset

# Drop rows with missing essential fields
df.dropna(subset=['Title', 'Context', 'Question', 'Answer'], inplace=True)

# Ensure text fields are string
for col in ['Context', 'Question', 'Answer', 'Title']:
    df[col] = df[col].astype(str)

# Function to clean text
def clean_text(text):
    text = re.sub(r'<[^>]+>', '', text)          # remove HTML tags
    text = re.sub(r'http\S+', '', text)          # remove URLs
    text = re.sub(r'[^A-Za-z0-9.,;:?!\'"()\-\s]', '', text)  # keep letters, digits, basic punctuation
    text = re.sub(r'\s+', ' ', text)             # normalize whitespace last, after removals leave gaps
    return text.strip()

# Apply Cleaning
df['Context'] = df['Context'].apply(clean_text)
df['Question'] = df['Question'].apply(clean_text)
df['Answer'] = df['Answer'].apply(clean_text)

print(f"Cleaned dataset shape: {df.shape}")

# Check GPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU.")

print("\n--- Loading and Preprocessing Data ---")

#  Convert pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)

#  Split into train and eval sets (e.g., 80% / 20%)
dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_data = dataset["train"]
eval_data = dataset["test"]

print(train_data)
print(eval_data)


Mounted at /content/drive/
Cleaned dataset shape: (710, 7)
Using GPU: Tesla T4

--- Loading and Preprocessing Data ---
Dataset({
    features: ['ID', 'Title', 'Context', 'Question', 'Answer', 'Unnamed: 5', 'Unnamed: 6', '__index_level_0__'],
    num_rows: 568
})
Dataset({
    features: ['ID', 'Title', 'Context', 'Question', 'Answer', 'Unnamed: 5', 'Unnamed: 6', '__index_level_0__'],
    num_rows: 142
})


# **Bert-base-uncased**

This code prepares my dataset for BERT-based question answering in three steps: loading the pre-trained BERT components, locating each answer's character positions within its context, and tokenizing question-context pairs while tracking token-to-character alignment. First, it loads the bert-base-uncased tokenizer and model from Hugging Face, which serve as the starting point for my fine-tuning.

The add_answer_positions() function finds where the answer text falls within its corresponding context passage. It uses case-insensitive matching, context.lower().find(answer_lower), and records the positions at character level; these are later converted to token positions, because BERT predicts answer spans over tokens and not over raw text. When the answer is not a substring of the context (an edge case), the function falls back to default positions (0, 0) so that the pipeline does not crash.
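The character-level lookup can be sketched on its own. The helper name `find_answer_span` and the example sentence are assumptions for illustration; unlike the notebook's function, this version returns None when the answer is missing, so that a missing answer is not confused with an answer starting at position 0.

```python
def find_answer_span(context: str, answer: str):
    """Return (start, end) character offsets of answer in context, or None if absent."""
    start = context.lower().find(answer.lower())   # case-insensitive search
    if start == -1:
        return None
    return start, start + len(answer)              # end is exclusive

context = "Grid search is a brute force method that tries every combination."
span = find_answer_span(context, "a Brute Force method")
print(span)                                         # (15, 35)
print(context[span[0]:span[1]])                     # a brute force method
print(find_answer_span(context, "random search"))   # None
```

Slicing the context with the returned offsets is a cheap sanity check that the positions really point at the answer.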

Then, tokenize_and_align() does the most involved work: converting my text into token IDs, padded or truncated to BERT's maximum sequence length of 512 tokens, while preserving the answer position mapping. The tokenizer uses truncation="only_second" so that only the context is truncated and the question is kept intact, and return_offsets_mapping=True to record which characters each token covers. Using these offset mappings together with the sequence IDs, which distinguish question tokens (ID 0) from context tokens (ID 1), the function maps my character-level answer positions to token-level positions.
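The character-to-token mapping can be shown without loading a tokenizer. In the sketch below the offsets and sequence IDs are hand-written stand-ins for what a fast tokenizer would return (hypothetical values), and `char_span_to_token_span` is an illustrative helper, not the notebook's function. It also shows why sequence IDs matter: question and context offsets both start at 0, so offsets alone are ambiguous.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map the character span [start_char, end_char) onto context token indices.

    offsets: (char_start, char_end) per token.
    sequence_ids: 0 for question tokens, 1 for context tokens, None for special tokens.
    Returns None when the span is not covered by context tokens (e.g. after truncation).
    """
    token_start = token_end = None
    for idx, (tok_start, tok_end) in enumerate(offsets):
        if sequence_ids[idx] != 1:
            continue                          # skip question and special tokens
        if tok_start <= start_char < tok_end:
            token_start = idx
        if tok_start < end_char <= tok_end:
            token_end = idx
    if token_start is None or token_end is None:
        return None
    return token_start, token_end

# Hand-written stand-ins for tokenizer output (hypothetical values):
# [CLS] what is it ? [SEP] grid search is brute force [SEP]
offsets = [(0, 0), (0, 4), (5, 7), (8, 10), (10, 11), (0, 0),
           (0, 4), (5, 11), (12, 14), (15, 20), (21, 26), (0, 0)]
sequence_ids = [None, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, None]

print(char_span_to_token_span(offsets, sequence_ids, 15, 26))  # "brute force" -> (9, 10)
print(char_span_to_token_span(offsets, sequence_ids, 0, 4))    # "Grid" -> (6, 6)
print(char_span_to_token_span(offsets, sequence_ids, 30, 40))  # not in context -> None
```

Without the sequence-ID check, the span (0, 4) would also match the question token "what" at index 1.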

Finally, the code applies these transformations via `.map()` to both the training and evaluation datasets, converts the outputs to PyTorch tensors with `.set_format("torch", columns=...)`, and loads the model onto the GPU device for training. As a result, BERT receives aligned token-span targets during fine-tuning.

In [227]:
from transformers import BertTokenizerFast, BertForQuestionAnswering

MODEL_NAME = "bert-base-uncased"

# Load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME)

print(f"\n🧠 Loaded Pretrained QnA Model: {MODEL_NAME}")


def add_answer_positions(example):
    context = example["Context"]
    answer = example["Answer"]

    # Lowercase-safe matching
    context_lower = context.lower()
    answer_lower = answer.lower()

    answer_start = context_lower.find(answer_lower)
    if answer_start == -1:
        # If not found, set defaults
        example["start_positions"] = 0
        example["end_positions"] = 0
    else:
        answer_end = answer_start + len(answer)
        example["start_positions"] = answer_start
        example["end_positions"] = answer_end

    return example


def tokenize_and_align(examples):
    tokenized = tokenizer(
        examples["Question"],
        examples["Context"],
        truncation="only_second",  # focus on context truncation only
        padding="max_length",
        max_length=512,
        return_offsets_mapping=True
    )

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(tokenized["offset_mapping"]):
        sequence_ids = tokenized.sequence_ids(i)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        start_char = examples["start_positions"][i]
        end_char = examples["end_positions"][i]

        # Initialize defaults
        token_start_index = context_start
        token_end_index = context_start

        # Map char → token span
        for idx in range(context_start, context_end + 1):
            if offsets[idx][0] <= start_char < offsets[idx][1]:
                token_start_index = idx
            if offsets[idx][0] < end_char <= offsets[idx][1]:
                token_end_index = idx
                break

        start_positions.append(token_start_index)
        end_positions.append(token_end_index)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    tokenized.pop("offset_mapping")  # cleanup

    return tokenized


# Apply add_answer_positions to train and eval data
train_data = train_data.map(add_answer_positions)
eval_data = eval_data.map(add_answer_positions)

# Apply tokenization
tokenized_train = train_data.map(tokenize_and_align, batched=True)
tokenized_eval = eval_data.map(tokenize_and_align, batched=True)

tokenized_train.set_format("torch", columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions'])
tokenized_eval.set_format("torch", columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions'])


# === Load Model ===
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)

print(f"\n🧩 Model loaded successfully for QnA: {MODEL_NAME}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



🧠 Loaded Pretrained QnA Model: bert-base-uncased


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



🧩 Model loaded successfully for QnA: bert-base-uncased


# **METRICS AND TRAINING SETUP**

This code sets up the evaluation metrics and the training configuration for fine-tuning my BERT question-answering model. It defines two custom metrics, Exact Match (EM) and F1 Score, which measure answer quality by comparing the predicted answer text against the ground-truth answer. Both metrics operate on decoded text rather than on raw logits, which makes them easier to interpret.

The compute_metrics() function runs the evaluation workflow: it receives the model's start and end logits, converts them to token indices via np.argmax(), reconstructs the predicted answer text by decoding the token span, and compares this text against the gold-standard answer using both EM and F1. It also records an average time per example; because the timer starts inside compute_metrics(), after the model has already produced its logits, this figure covers the decoding and scoring step rather than the model's forward pass. In short, the function bridges BERT's token-level outputs and human-readable text comparisons.

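Taking the argmax of the start and end logits independently can produce an end index that lies before the start index, in which case the decoded span is empty. A common alternative in extractive QA post-processing is to search for the best pair with start <= end. The sketch below is one such variant under stated assumptions: the helper name `best_span`, the `max_answer_len` cap, and the toy logits are all illustrative and not part of the notebook.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = None
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):                  # only spans that end at or after the start
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Toy logits for a six-token input (made-up numbers)
start_logits = [0.1, 0.2, 0.3, 2.5, 0.4, 0.1]
end_logits = [0.2, 3.0, 0.1, 0.3, 2.0, 0.1]

naive_start = start_logits.index(max(start_logits))
naive_end = end_logits.index(max(end_logits))
print(naive_start, naive_end)                     # 3 1  (end before start: empty span)
print(best_span(start_logits, end_logits)[:2])    # (3, 4)
```

The nested loop is quadratic in sequence length, which is why a cap on the answer length is usually applied.
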
TrainingArguments defines the fine-tuning hyperparameters: 3 epochs, a batch size of 16 to manage GPU memory, a learning rate of 0.005, 500 warmup steps, and a weight decay of 0.01 for regularization. I set eval_strategy="no" because a full evaluation during training can be time-consuming; "epoch" or "steps" can be used instead when experimenting. fp16 is enabled whenever a GPU is available (fp16=torch.cuda.is_available()) so that training uses mixed precision. Finally, the Trainer object runs the complete training loop. It bundles the model, datasets, metrics, and tokenizer, and handles both the training iterations (with gradient updates) and any evaluation passes.
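It helps to work out how many optimizer steps these settings produce. The arithmetic below assumes a single device, no gradient accumulation, a final partial batch that is kept, and the Trainer's default linear warmup; the function `warmup_lr` is an illustrative helper that models only the warmup ramp.

```python
import math

train_rows = 568        # training rows reported by the split
batch_size = 16         # per_device_train_batch_size
epochs = 3              # num_train_epochs
warmup_steps = 500      # warmup_steps
peak_lr = 0.005         # learning_rate

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

def warmup_lr(step: int) -> float:
    """Learning rate during linear warmup: ramps from 0 to peak_lr over warmup_steps."""
    return peak_lr * min(step, warmup_steps) / warmup_steps

print(steps_per_epoch, total_steps)        # 36 108
print(total_steps < warmup_steps)          # True: training ends before warmup completes
print(round(warmup_lr(total_steps), 5))    # 0.00108
```

Under these assumptions the run lasts 108 steps, so the 500-step warmup never finishes and the learning rate is still rising when training stops. For comparison, the original BERT paper suggests fine-tuning learning rates between 2e-5 and 5e-5, so both the warmup length and the learning rate are worth revisiting when tuning.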




In [228]:
# --- 4. METRICS AND TRAINING SETUP ---

import time
from collections import Counter

import numpy as np
from transformers import TrainingArguments, Trainer
from transformers.data.data_collator import default_data_collator

def compute_exact_match(prediction, truth):
    # 1 if the texts match after trimming and lowercasing, else 0
    return int(prediction.strip().lower() == truth.strip().lower())

def compute_f1(prediction, truth):
    # Token-level F1 over lowercased, whitespace-split tokens
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()

    # Count shared tokens with multiplicity so repeated words are not under-counted
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_common == 0:
        return 0.0

    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)
    return 2 * (precision * recall) / (precision + recall)

def compute_metrics(eval_pred):
    start_time = time.time()

    predictions, labels = eval_pred

    # The BERT QA head produces start and end logits
    start_logits, end_logits = predictions

    # Take the highest scoring token for start and end
    start_positions = np.argmax(start_logits, axis=1)
    end_positions = np.argmax(end_logits, axis=1)

    exact_matches = []
    f1_scores = []

    # Tokenizer needed to decode back to text
    for i in range(len(start_positions)):
        input_ids = tokenized_eval[i]["input_ids"]
        pred_tokens = input_ids[start_positions[i]: end_positions[i] + 1]
        pred_text = tokenizer.decode(pred_tokens, skip_special_tokens=True)

        # Ground-truth answer text
        gold_text = eval_data[i]["Answer"]

        exact_matches.append(compute_exact_match(pred_text, gold_text))
        f1_scores.append(compute_f1(pred_text, gold_text))

    # Average time per example spent in this function (decoding and scoring, not the forward pass)
    avg_inference_time = (time.time() - start_time) / len(start_positions)

    metrics = {
        "Exact_Match": np.mean(exact_matches),
        "F1_Score": np.mean(f1_scores),
        "Avg_Inference_Time": avg_inference_time
    }

    return metrics

# Define training arguments (hyperparameters)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16, # Reduced batch size
    per_device_eval_batch_size=16, # Reduced batch size
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    learning_rate=0.005,
    logging_steps=100,
    eval_strategy="no",
    save_strategy="no",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[]
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer, # Keep tokenizer for data collation
    data_collator=default_data_collator, # Use default data collator
)



# **Execution (Evaluation)**

This code runs the training phase of fine-tuning my BERT question-answering model: starting training, evaluating performance, and saving the model. It first prints a short pre-run summary: the model name, the number of epochs, the batch size, and whether a GPU or CPU is in use. Training itself is started by calling trainer.train(), which handles the whole loop across 3 epochs (computing gradients, updating weights, and logging) without further intervention. The elapsed training time is also measured, which gives a rough indication of how much the GPU helps.

After training, the code evaluates the model on the held-out evaluation split using trainer.evaluate(), which applies my custom metric functions, Exact Match and F1 Score, to measure generalization to unseen data. Because save_strategy="no" means no intermediate checkpoints are written, the model evaluated here is the one left at the end of training. Results are printed to 4 decimal places for readability. This final evaluation matters because training metrics alone can be misleading under overfitting, whereas scores on withheld data give a more honest assessment. In the run shown below, Exact Match and F1 are both 0.0000 with an eval_loss of 6.2383.

The final step saves the fine-tuned model along with the tokenizer via trainer.save_model(), creating a reusable checkpoint for later inference. The directory is created beforehand with os.makedirs(..., exist_ok=True) in case the save path does not exist. The saved model can then be reloaded for the inference stage or for further analysis. Status messages throughout the workflow report progress, so the process is easy to monitor.



In [229]:
import time
import os

# --- 5. EXECUTION: START THE FINE-TUNING PROCESS ---

print("🚀" + "="*50)
print("              STARTING MODEL FINE-TUNING")
print("="*50 + "🚀")
print(f"Model: {MODEL_NAME}")
print(f"Number of Training Epochs: {training_args.num_train_epochs}")
print(f"Training Batch Size: {training_args.per_device_train_batch_size}")
print(f"Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
print("-" * 54)

training_start_time = time.time()

print("\n[INFO] Training in progress... Please wait.\n")
trainer.train()

# Calculate and display the total training time
training_end_time = time.time()
training_duration = training_end_time - training_start_time
print(f"\n[SUCCESS] Fine-tuning completed in {training_duration / 60:.2f} minutes.")


# --- 6. FINAL EVALUATION ON THE TEST SET ---

print("\n\n✅" + "="*50)
print("              PERFORMING FINAL EVALUATION")
print("="*50 + "✅")
print("Evaluating the best model checkpoint on the hold-out test set...")

final_eval_results = trainer.evaluate()

# Print the results in a more readable format
print("\n--- Final Evaluation Metrics ---")
for key, value in final_eval_results.items():
    # Format floating point numbers for better readability
    if isinstance(value, float):
        print(f"{key:<25}: {value:.4f}")
    else:
        print(f"{key:<25}: {value}")
print("-" * 34)


# --- 7. SAVING THE FINE-TUNED MODEL ---

print("\n\n💾" + "="*50)
print("                 SAVING THE FINAL MODEL")
print("="*50 + "💾")

# Define a path to save the model and tokenizer
final_model_path = "./fine-tuned-bert-qna-model"

# Create the directory if it doesn't exist
os.makedirs(final_model_path, exist_ok=True)

# save_model() writes what is needed to reload the model:
# - The model weights (model.safetensors or pytorch_model.bin, depending on the transformers version)
# - The model configuration (config.json)
# - The tokenizer files (vocab.txt, tokenizer_config.json, etc.), since a tokenizer was passed to the Trainer
trainer.save_model(final_model_path)

print(f"\n[SUCCESS] Model and tokenizer have been saved to: '{final_model_path}'")
print("\nThis model can now be loaded for inference in the 'Actual Testing' stage.")
print("\n" + "="*60)
print("                PROCESS COMPLETE")
print("="*60)

              STARTING MODEL FINE-TUNING
Model: bert-base-uncased
Number of Training Epochs: 3
Training Batch Size: 16
Device: GPU
------------------------------------------------------

[INFO] Training in progress... Please wait.



Step,Training Loss
100,1.8926



[SUCCESS] Fine-tuning completed in 0.71 minutes.


              PERFORMING FINAL EVALUATION
Evaluating the best model checkpoint on the hold-out test set...



--- Final Evaluation Metrics ---
eval_loss                : 6.2383
eval_Exact_Match         : 0.0000
eval_F1_Score            : 0.0000
eval_Avg_Inference_Time  : 0.0010
eval_runtime             : 1.1433
eval_samples_per_second  : 124.2020
eval_steps_per_second    : 7.8720
epoch                    : 3.0000
----------------------------------


                 SAVING THE FINAL MODEL

[SUCCESS] Model and tokenizer have been saved to: './fine-tuned-bert-qna-model'

This model can now be loaded for inference in the 'Actual Testing' stage.

                PROCESS COMPLETE


# **Actual Testing**

Lastly, this code runs an end-to-end inference pipeline that applies the fine-tuned BERT model to the held-out question-answering data, turning raw text into predicted answers with confidence scores. It uses Hugging Face's high-level pipeline() API, which wraps tokenization, model inference, and post-processing in a single call. The workflow begins by initializing the QnA pipeline on the GPU (device=0) if one is available and on the CPU (device=-1) otherwise.

Data preparation converts the held-out evaluation dataset from Hugging Face format to a pandas DataFrame, then rearranges it into a list of dictionaries with question and context keys, the format the pipeline expects. The whole list is passed to qna_pipeline() with batch_size=16, so examples are grouped into batches for the GPU instead of being sent one at a time. Note that the pipeline call returns only once all samples have been processed, so the tqdm bar wrapped around it advances while the finished results are collected rather than during inference itself.

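How batch_size=16 partitions the 142 evaluation samples can be sketched in a few lines. This is an illustration of the grouping only, not the pipeline's internal code; `iter_batches` and the placeholder samples are assumptions.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder question-context pairs standing in for the evaluation split
samples = [{"question": f"q{i}", "context": f"c{i}"} for i in range(142)]
batch_lengths = [len(batch) for batch in iter_batches(samples, 16)]

print(len(batch_lengths), batch_lengths[-1])   # 9 14
```

Eight full batches plus one batch of 14 cover all 142 samples, so the model runs 9 forward passes instead of 142.
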
Finally, the code extracts the answer text and confidence score from each pipeline output, appends them as new DataFrame columns, and shows them next to the ground-truth answers for validation. The predictions are also written to a CSV file on Google Drive for further analysis or sharing. Compared with training, the inference workflow is shorter: a single pipeline call, batched for throughput, with the outputs kept alongside the reference answers for inspection.



In [230]:
from transformers import pipeline
import pandas as pd
import torch
from tqdm.auto import tqdm # Import tqdm for the progress bar

# 'model' and 'tokenizer' come from the training cells above;
# 'eval_data' is the evaluation split created earlier.

# --- 1. Initialize the QnA Pipeline ---
qna_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)
print("✅ QnA Pipeline Initialized on GPU." if torch.cuda.is_available() else "QnA Pipeline Initialized on CPU.")


# --- 2. Prepare Data for Batch Inference ---
# Use the 'eval_data' split so the model is tested on examples it was not trained on.
# To run on the whole dataset instead, build the list from the original 'df'.
test_df = eval_data.to_pandas()  # convert the Hugging Face dataset back to a pandas DataFrame

# Build the list of dictionaries in the format the pipeline expects
inference_samples = [
    {"question": question, "context": context}
    for question, context in zip(test_df["Question"], test_df["Context"])
]

print(f"\n🚀 Prepared {len(inference_samples)} samples for inference.")


# --- 3. Run Batch Inference with a Progress Bar ---
# The pipeline call processes the whole list (in batches of 'batch_size') before returning,
# so the tqdm bar below tracks the collection of finished results, not inference itself.
# Adjust 'batch_size' for performance tuning.
predictions = []
for result in tqdm(qna_pipeline(inference_samples, batch_size=16), total=len(inference_samples), desc="Running Inference"):
    predictions.append(result)


# --- 4. Process and Display Results ---
# Extract just the predicted answers and scores
predicted_answers = [p['answer'] for p in predictions]
scores = [p['score'] for p in predictions]

# Add the predictions and scores as new columns to our test DataFrame
test_df['predicted_answer'] = predicted_answers
test_df['confidence_score'] = scores

# Display the results for review. The 'display()' function provides better formatting in notebooks.
print("\n✅ Inference Complete! Here are the results:")
display(test_df[['Question', 'Answer', 'predicted_answer', 'confidence_score']].head())


# --- 5. (Optional) Save the Results to a CSV ---
output_file = "/content/drive/My Drive/qna_predictions.csv"
test_df.to_csv(output_file, index=False)
print(f"\n💾 Results saved to {output_file}")

Device set to use cuda:0


✅ QnA Pipeline Initialized on GPU.

🚀 Prepared 142 samples for inference.






✅ Inference Complete! Here are the results:


| | Question | Answer | predicted_answer | confidence_score |
|---|---|---|---|---|
| 0 | How is sentiment analysis applied in customer ... | In customer service, it analyzes reviews to un... | PRACTICAL APPLICATIONS OF SENTIMENT ANALYSIS 1. | 7.6e-05 |
| 1 | What type of prediction problem does the logis... | The logistic regression model determines wheth... | A slightly more complex example might be using... | 0.001633 |
| 2 | What are the two primary strategies for solvin... | Multi-class classification can be approached b... | Multi-class classification problems can be | 0.000164 |
| 3 | What are the three fundamental components that... | The three components are the encoder, the cont... | Literature widely presents encoder | 0.002081 |
| 4 | What exhaustive approach characterizes grid se... | Grid search uses brute force by assembling all... | Grid search is a brute force method | 0.000331 |



💾 Results saved to /content/drive/My Drive/qna_predictions.csv
