# Dataset Context


This dataset focuses exclusively on cardiovascular health issues, offering a specialized resource for exploring how language models can improve medical understanding and patient care in this domain. Built on MedQuAD (the Medical Question Answering Dataset), it provides a rich collection of text data suitable for tasks such as summarization, question answering, token labeling, and text classification. It can be used with large language models, healthcare-specific transformers, and LayoutLM models for semi-structured documents, which makes it well suited for developing NLP applications that simplify complex cardiovascular information, improve accessibility for patients, and assist healthcare professionals in decision-making.

The file itself is a CSV version of MedQuAD converted from XML source files, excluding MedlinePlus data due to licensing restrictions. MedQuAD contains 47,457 medical question-answer pairs from 12 NIH websites, covering 37 question types such as treatment, diagnosis, and side effects, with additional annotations like question type, focus, synonyms, UMLS identifiers, and semantic categories. While some subsets had answers removed to respect copyright, metadata and URLs remain available for further exploration. The dataset also includes a QA test collection with 2,479 judged answers from the TREC-2017 LiveQA medical task, enabling evaluation of IR and QA systems. Together, these resources provide a comprehensive foundation for cardiovascular-focused NLP research and experimentation.


# BlueBERT Cardiovascular QA - Description

This document explains the code structure for fine-tuning BlueBERT on cardiovascular question answering tasks using Google Colab.

## 1. Environment Setup and Configuration

This section prepares the Google Colab environment for running machine learning code.

* Creates a helper function that installs Python packages without showing too much output
* Installs the essential libraries:
  * Transformers (version 4.44.0+) - for working with BERT-based models
  * Datasets (version 2.14.0+) - for efficient data handling
  * Accelerate (version 0.26.0+) - for optimized training
  * Evaluate (version 0.4.0+) - for calculating metrics
* Verifies the environment by checking:
  * Python version
  * Operating system platform
  * PyTorch version
  * GPU availability and name
  * CUDA version
* Alerts users if GPU is not detected and reminds them to enable it

Training transformer models without GPU acceleration would be extremely slow (hours instead of minutes). This setup ensures that Colab's GPU is properly detected and ready to use before starting any training.
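
The version requirements listed above can also be checked after installation. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `meets_minimum`, `report_versions`) are hypothetical and not part of the notebook, and the version parsing is deliberately simplified (it compares only the first three numeric components).

```python
import re
from importlib import metadata

# Minimum versions taken from the list above
MINIMUMS = {
    "transformers": "4.44.0",
    "datasets": "2.14.0",
    "accelerate": "0.26.0",
    "evaluate": "0.4.0",
}

def parse_version(text):
    """Turn '4.44.0' or '2.1.0+cu121' into a comparable tuple such as (4, 44, 0)."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))

def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)

def report_versions(minimums=MINIMUMS):
    """Return {package: status} without raising when a package is missing."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"below {minimum}"
        report[name] = f"{installed} ({status})"
    return report

for package, status in report_versions().items():
    print(f"{package}: {status}")
```

Running this right after the install step gives a quick confirmation that the pinned minimums were actually satisfied in the current runtime.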

## 2. Imports and GPU Configuration

This section imports all necessary libraries and confirms the GPU setup.

* Imports PyTorch for deep learning
* Imports NumPy for array operations
* Imports Pandas for data manipulation
* Imports key components from Transformers:
  * AutoTokenizer - converts text to tokens
  * AutoModelForQuestionAnswering - loads QA models
  * TrainingArguments - configures training settings
  * Trainer - manages the training loop
  * default_data_collator - handles data batching
* Imports Dataset from datasets library
* Suppresses warning messages for cleaner output
* Creates a device object pointing to GPU or CPU
* Displays detailed GPU information:
  * GPU model name (like Tesla T4)
  * CUDA version
  * Total GPU memory in gigabytes

* `device = torch.device("cuda")` - selects the GPU as the computing device (the code falls back to `torch.device("cpu")` when no GPU is found)
* `torch.cuda.is_available()` - checks if GPU is accessible
* `torch.cuda.get_device_name(0)` - retrieves GPU name
* `torch.cuda.get_device_properties(0).total_memory` - reads total GPU memory in bytes (the code converts it to gigabytes)

## 3. Dataset Loading Options

This section provides two flexible methods for loading the cardiovascular dataset.

* Offers two approaches to get your data into Colab:
  * **Option A (Google Drive):** Mount your Google Drive and load CSV from there
  * **Option B (Direct Upload):** Upload the file directly from your computer
* Uses flags (USE_DRIVE and USE_UPLOAD) to control which method is active
* Defaults to upload method for simplicity
* For Drive method: you specify the file path in your Drive
* For upload method: opens a file picker dialog
* Raises an error if no valid CSV path is provided

* `from google.colab import drive` - imports Drive functionality
* `drive.mount('/content/drive')` - mounts your Google Drive
* `from google.colab import files` - imports file upload capability
* `uploaded = files.upload()` - opens file selection dialog

Some users prefer keeping datasets in Google Drive so they persist across sessions. Others prefer uploading fresh files each time. Both methods work well depending on your workflow preferences.

## 4. Data Loading and Preprocessing

This section loads the cardiovascular QA dataset and prepares it for model training.

* Reads the CSV file into a Pandas dataframe
* Displays useful information:
  * Total number of records
  * Names of all columns
  * Preview of a sample question (first 80 characters)
  * Character count of a sample answer
* Removes rows with missing data (null values in question or answer columns)
* Shuffles the entire dataset randomly (using seed 42 for reproducibility)
* Splits data into:
  * Training set (80% of data)
  * Test set (20% of data)
* Shows the size of each set

* `dataset = pd.read_csv(CSV_PATH)` - loads the data
* `dataset.dropna(subset=["question", "answer"])` - removes incomplete rows
* `dataset.sample(frac=1.0, random_state=42)` - shuffles with fixed seed
* `split_idx = int(len(dataset_shuffled) * 0.80)` - finds 80% split point
* `train_data = dataset_shuffled.iloc[:split_idx]` - creates training set
* `test_data = dataset_shuffled.iloc[split_idx:]` - creates test set

Missing data would cause errors during training. Shuffling ensures the model sees diverse examples. The 80/20 split is standard - it gives plenty of training data while keeping enough test data to properly evaluate how well the model generalizes to unseen questions.
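
The drop, shuffle, and split steps can be illustrated without pandas. This is a minimal sketch using only the standard library as a stand-in for the `dropna`, `sample`, and `iloc` calls; the function names and the toy rows are hypothetical, and the shuffled order will differ from what pandas produces with the same seed.

```python
import random

def drop_incomplete(rows):
    """Keep only rows that have both a question and an answer."""
    return [r for r in rows if r.get("question") and r.get("answer")]

def shuffle_and_split(rows, train_fraction=0.80, seed=42):
    """Shuffle with a fixed seed, then slice at the train_fraction boundary."""
    rng = random.Random(seed)
    shuffled = list(rows)          # copy so the input list is left untouched
    rng.shuffle(shuffled)
    split_idx = int(len(shuffled) * train_fraction)
    return shuffled[:split_idx], shuffled[split_idx:]

# Toy rows for illustration only; one row has a missing answer
rows = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
rows.append({"question": "q10", "answer": None})

clean_rows = drop_incomplete(rows)
train_rows, test_rows = shuffle_and_split(clean_rows)
print(len(clean_rows), len(train_rows), len(test_rows))
```

Because the seed is fixed, repeated calls return the same split, which is what makes later hyperparameter comparisons fair.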

## 5. Model and Tokenizer Initialization

This section loads the pre-trained BlueBERT model designed for medical question answering.

* Sets the model name to "aaditya/Bluebert_emrqa"
* Downloads the BlueBERT tokenizer from Hugging Face
* Enables fast tokenizer for better performance
* Downloads the BlueBERT model (already adapted for QA tasks)
* Transfers the model to GPU (or CPU if no GPU available)
* Displays confirmation message

* `AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)` - loads tokenizer with fast mode
* `AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)` - loads the QA model
* `model.to(device)` - moves model to GPU for faster computation

BlueBERT is pre-trained on biomedical literature (PubMed abstracts) and clinical notes from electronic medical records (MIMIC-III). This exposes it to clinical language, medical procedures, and healthcare terminology as they appear in actual patient records. The "emrqa" variant has been further fine-tuned for question answering on EMR data, which makes it a natural starting point for clinical questions.

While BioBERT is trained on research articles (PubMed), BlueBERT includes training on clinical notes and EMRs. This makes BlueBERT better suited for understanding patient documentation language, while BioBERT excels at biomedical research terminology.

## 6. Tokenization and Feature Preparation

This complex section converts text into numerical format and prepares training labels.

* Converts Pandas dataframes to Hugging Face Dataset objects
* Sets maximum sequence length to 384 tokens
* Sets document stride to 128 tokens (for handling long passages)
* Defines the `prepare_train_features` function that:
  * Tokenizes questions and answers together
  * Truncates only the answer if text exceeds max length
  * Returns overflow tokens for long contexts
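  * Requests offset mappings (the current code iterates over them but does not use the character offsets)
  * Identifies which tokens are part of the question vs answer
  * Labels the answer span heuristically: the start is the first answer token, and the end is 50 tokens later or the last answer token, whichever comes first
  * Sets both positions to 0 if the window contains no answer tokens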
  * Removes offset mappings before returning (not needed by model)
* Applies tokenization to training dataset in batches
* Applies tokenization to test dataset in batches
* Removes original text columns (keeps only tokenized versions)

* `tokenizer(examples["question"], examples["answer"], truncation="only_second")` - tokenizes both, only truncates answer
* `return_overflowing_tokens=True` - handles texts longer than max length
* `return_offsets_mapping=True` - tracks character-to-token alignment
* `sequence_ids = tokenized.sequence_ids(i)` - identifies question (0) vs answer (1) tokens
* `start_positions.append(answer_start)` - labels where answer begins
* `end_positions.append(answer_end)` - labels where answer ends
* `tokenized.pop("offset_mapping")` - removes temporary data
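
To make the interaction between the maximum length and the document stride concrete, here is a rough sketch of how a long answer text is covered by overlapping windows. It is an approximation written in plain Python, not the tokenizer's implementation: `window_spans` is a hypothetical helper, the question length of 12 tokens is made up, and the 3 special tokens assume a BERT-style `[CLS] question [SEP] answer [SEP]` layout.

```python
def window_spans(n_context_tokens, capacity, stride):
    """Return (start, end) token spans; consecutive spans share `stride` tokens."""
    if stride >= capacity:
        raise ValueError("stride must be smaller than the per-window capacity")
    spans = []
    start = 0
    while True:
        end = min(start + capacity, n_context_tokens)
        spans.append((start, end))
        if end == n_context_tokens:
            break
        # Advance so the next window overlaps the previous one by `stride` tokens
        start += capacity - stride
    return spans

MAX_LENGTH = 384
DOC_STRIDE = 128
question_tokens = 12   # hypothetical question length
special_tokens = 3     # assumed [CLS] ... [SEP] ... [SEP] layout
capacity = MAX_LENGTH - question_tokens - special_tokens  # 369 answer tokens per window

for span in window_spans(900, capacity, DOC_STRIDE):
    print(span)
# (0, 369), (241, 610), (482, 851), (723, 900)
```

Each window becomes its own training feature, so one long question-answer pair can produce several rows in the tokenized dataset.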

Extractive question answering is harder than simple classification. The model must learn to predict exactly which tokens in the text form the answer. This requires precise position labels for both the start and end of the answer span. The tokenization process must carefully track these positions while handling edge cases like very long texts that need to be split.
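
For comparison, when the character span of the true answer inside the text is known, offset mappings are commonly used to translate it into token positions. The sketch below shows that general technique on hand-written offsets; it is not what the notebook code does (the notebook uses a fixed heuristic span), and `char_span_to_token_span` and the toy token layout are hypothetical.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token).

    offsets: per-token (char_start, char_end) pairs
    sequence_ids: per-token 0 (question), 1 (context), or None (special token)
    Returns (0, 0) when the span is not fully inside this window.
    """
    context_idx = [i for i, s in enumerate(sequence_ids) if s == 1]
    if not context_idx:
        return 0, 0
    first, last = context_idx[0], context_idx[-1]
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return 0, 0
    # Walk right until a token starts after the answer start, then step back one
    start_token = first
    while start_token <= last and offsets[start_token][0] <= start_char:
        start_token += 1
    # Walk left until a token ends before the answer end, then step forward one
    end_token = last
    while end_token >= first and offsets[end_token][1] >= end_char:
        end_token -= 1
    return start_token - 1, end_token + 1

# Toy layout: [CLS] what [SEP] aspirin reduces clot risk [SEP]
context = "aspirin reduces clot risk"
offsets = [(0, 0), (0, 4), (0, 0), (0, 7), (8, 15), (16, 20), (21, 25), (0, 0)]
sequence_ids = [None, 0, None, 1, 1, 1, 1, None]

start_char = context.index("clot risk")
end_char = start_char + len("clot risk")
print(char_span_to_token_span(offsets, sequence_ids, start_char, end_char))
# (5, 6): the tokens "clot" and "risk"
```

Windows that do not contain the full answer get the label (0, 0), which points at the first token (`[CLS]` in BERT-style inputs).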

## 7. Evaluation Metrics Definition

This section creates functions to measure model performance using multiple metrics.

* Defines `calculate_detailed_metrics` function that:
  * Extracts predicted start and end positions from model output
  * Compares predictions to actual answer positions
  * Calculates exact match (both start and end correct)
  * Calculates start position accuracy
  * Calculates end position accuracy
  * Calculates overall accuracy (average of three above)
  * Computes F1 score based on token overlap between prediction and truth
  * Computes precision (percentage of predicted tokens that are correct)
  * Computes recall (percentage of correct tokens that were predicted)
  * Handles edge cases (empty predictions, empty truths)
* Defines `evaluate_dataset` function that:
  * Makes predictions on a dataset
  * Calls calculate_detailed_metrics
  * Returns all metrics in a dictionary
* Defines `compute_qa_metrics` for Trainer API compatibility

* `pred_starts = np.argmax(predictions.predictions[0], axis=1)` - finds highest probability start position
* `pred_ends = np.argmax(predictions.predictions[1], axis=1)` - finds highest probability end position
* `exact_match = np.mean((pred_starts == true_starts) & (pred_ends == true_ends))` - checks perfect predictions
* `pred_tokens = set(range(ps, pe + 1))` - creates set of predicted token positions
* `true_tokens = set(range(ts, te + 1))` - creates set of actual answer token positions
* `common = len(pred_tokens & true_tokens)` - counts overlapping tokens
* `f1 = 2 * (precision * recall) / (precision + recall)` - calculates F1 from precision and recall


Each metric reveals different aspects of performance. Accuracy shows overall correctness. F1 balances finding all correct tokens (recall) with avoiding incorrect ones (precision). Precision tells you if the model's answers are trustworthy. Recall tells you if the model finds complete answers. Together they give a complete picture.
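
A small worked example shows how the token-overlap scores behave. This sketch mirrors the set-based logic described above in a standalone function; the name `span_overlap_scores` and the example spans are chosen here for illustration only.

```python
def span_overlap_scores(pred_start, pred_end, true_start, true_end):
    """Return (precision, recall, f1) from the overlap of two inclusive token spans."""
    pred_tokens = set(range(pred_start, pred_end + 1))
    true_tokens = set(range(true_start, true_end + 1))
    if not pred_tokens and not true_tokens:
        return 1.0, 1.0, 1.0
    common = len(pred_tokens & true_tokens)
    if common == 0:
        # Covers disjoint spans and the case where exactly one span is empty
        return 0.0, 0.0, 0.0
    precision = common / len(pred_tokens)
    recall = common / len(true_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Half-overlapping spans of equal length: 5 shared tokens out of 10 each
print(span_overlap_scores(10, 19, 15, 24))   # (0.5, 0.5, 0.5)

# Short but fully correct prediction: perfect precision, low recall
precision, recall, f1 = span_overlap_scores(15, 17, 15, 24)
print(precision, recall, round(f1, 4))       # 1.0 0.3 0.4615
```

The second case illustrates why precision and recall are reported separately: a prediction can be entirely correct yet cover only a fraction of the true answer.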

## 8. Training Configuration

This section configures hyperparameters optimized for Google Colab's environment.

* Creates a TrainingArguments object with these settings:
  * **num_train_epochs=3** - trains for 3 full passes through data
  * **per_device_train_batch_size=16** - processes 16 examples simultaneously
  * **per_device_eval_batch_size=16** - evaluates 16 examples at once
  * **gradient_accumulation_steps=2** - accumulates gradients to simulate batch of 32
  * **learning_rate=0.00003** - sets step size for weight updates (3e-5)
  * **weight_decay=0.01** - adds regularization to reduce overfitting
  * **warmup_ratio=0.1** - gradually increases learning rate for first 10%
  * **lr_scheduler_type="linear"** - decreases learning rate linearly
  * **max_grad_norm=1.0** - clips gradients to prevent explosion
  * **fp16=True** - uses mixed precision (usually faster and lighter on memory on GPUs that support it)
  * **dataloader_num_workers=2** - uses 2 worker processes for data loading
  * **eval_strategy="epoch"** - evaluates after each epoch
  * **save_strategy="epoch"** - saves checkpoint after each epoch
  * **save_total_limit=2** - keeps at most 2 checkpoints on disk (saves space)
  * **load_best_model_at_end=True** - loads best checkpoint when done
  * **metric_for_best_model="f1"** - uses F1 to determine best model
  * **seed=42** - fixes randomness for reproducibility

* `fp16=torch.cuda.is_available()` - enables mixed precision only with GPU
* `dataloader_pin_memory=True` - speeds up data transfer to GPU
* `greater_is_better=True` - higher F1 is better
* `report_to=[]` - disables external logging services

These settings are based on common practice for BERT fine-tuning and Colab's hardware limitations. The learning rate (3e-5) is a typical value for transformer fine-tuning. Batch size 16 with gradient accumulation of 2 gives an effective batch size of 32, which balances training stability with memory usage. Mixed precision (fp16) usually speeds up training noticeably on GPUs such as the T4, with minimal accuracy impact.
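
The effect of `warmup_ratio=0.1` combined with `lr_scheduler_type="linear"` can be sketched in a few lines. This is an illustrative reimplementation of a linear warmup followed by linear decay, not the Trainer's internal code; `linear_schedule_lr` is a hypothetical helper, and the total of 100 optimizer steps is made up (the real count depends on the number of tokenized features).

```python
import math

def linear_schedule_lr(step, total_steps, peak_lr=3e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Effective batch size from the settings above
effective_batch = 16 * 2   # per_device_train_batch_size * gradient_accumulation_steps
print("effective batch size:", effective_batch)

total_steps = 100  # hypothetical number of optimizer steps
for step in (0, 5, 10, 55, 100):
    print(step, f"{linear_schedule_lr(step, total_steps):.2e}")
```

With these numbers the rate climbs from 0 to 3e-5 over the first 10 steps, is back at half the peak by step 55, and reaches 0 at the final step.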

## 9. Trainer Initialization

This section creates the Trainer object that orchestrates the entire training process.

* Creates a Trainer instance combining:
  * The BlueBERT model
  * Training arguments from previous section
  * Tokenized training dataset
  * Tokenized evaluation dataset (test set)
  * Tokenizer object
  * Data collator (handles padding and batching)
  * Metrics computation function
* Displays confirmation message

* `Trainer(model=model, args=training_args, ...)` - initializes trainer
* `train_dataset=tokenized_train` - specifies training data
* `eval_dataset=tokenized_test` - specifies evaluation data
* `data_collator=default_data_collator` - handles batch preparation
* `compute_metrics=compute_qa_metrics` - calculates metrics during evaluation

The Trainer then handles the following automatically:

* Forward pass (running input through model)
* Loss calculation (measuring prediction errors)
* Backward pass (computing gradients)
* Weight updates (applying optimizer)
* Learning rate scheduling
* Mixed precision training
* Gradient accumulation
* Evaluation at specified intervals
* Checkpoint saving and loading
* Logging and progress tracking
* Distributed training across multiple GPUs (if available)

The Trainer abstracts away hundreds of lines of boilerplate code. Instead of manually writing training loops, gradient calculations, and checkpointing logic, you configure the Trainer once and it handles everything. This reduces bugs and makes code much more maintainable.
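
To give a feel for what such a loop involves, here is a toy sketch of a training loop with gradient accumulation, written in plain Python for a one-parameter model `y = w * x`. It is a deliberately simplified stand-in and not how the Trainer is implemented; the model, data, and function names are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, lr=0.01, epochs=50, micro_batch=2, accum_steps=2):
    """Plain gradient descent with gradient accumulation over micro-batches."""
    w = 0.0
    for _ in range(epochs):
        accumulated = 0.0
        for step, start in enumerate(range(0, len(data), micro_batch), 1):
            batch = data[start:start + micro_batch]
            # Forward and backward pass: scale so the accumulated sum is an average
            accumulated += grad_mse(w, batch) / accum_steps
            if step % accum_steps == 0:
                w -= lr * accumulated   # optimizer step on the accumulated gradient
                accumulated = 0.0       # reset for the next accumulation cycle
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]   # 8 points on the line y = 3x

w_accumulated = train(data, micro_batch=2, accum_steps=2)
w_large_batch = train(data, micro_batch=4, accum_steps=1)
print(round(w_accumulated, 6), round(w_large_batch, 6))
```

Both runs recover a slope of 3.0, and they agree with each other because two accumulated micro-batches of 2 yield the same update as one batch of 4. The toy data divides evenly into accumulation cycles; a real loop also has to handle a leftover partial cycle at the end of an epoch, along with scheduling, mixed precision, evaluation, and checkpointing.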

## 10. Hyperparameter Tuning Experiment with Train-Test Metrics

This section systematically tests 10 different hyperparameter combinations to find optimal settings.

* Defines 10 configurations with varying:
  * Epochs (8 to 18)
  * Learning rate (0.00002 to 0.00005)
  * Batch size (8, 12, or 16)
  * Gradient accumulation (1, 4, or 6)
  * Warmup ratio (0.05 to 0.18)
  * Weight decay (0.005 to 0.025)
* Creates an Excel workbook with columns for all metrics
* For each of the 10 configurations:
  * Prints iteration number and parameters
  * Reloads a completely fresh BlueBERT model
  * Moves model to GPU
  * Creates new training arguments with specific hyperparameters
  * Initializes a new Trainer
  * Starts training and tracks time
  * Evaluates on training set (checks if model learned patterns)
  * Evaluates on test set (checks if model generalizes)
  * Appends results to Excel with columns:
    * Configuration parameters (epochs, learning rate, etc.)
    * Train-Accuracy and Test-Accuracy
    * Train-F1 and Test-F1
    * Train-Precision and Test-Precision
    * Train-Recall and Test-Recall
    * Runtime in seconds
  * Deletes model and trainer objects
  * Clears GPU memory cache
  * Synchronizes CUDA to ensure clean state
* Formats Excel file:
  * Makes header row bold and centered
  * Auto-adjusts column widths for readability
  * Limits maximum column width to 20 characters
* Saves Excel file to Colab's content folder
* Displays completion message with file location

* `hyperparam_sets = [{"epochs": 8, "lr": 0.00003, ...}, ...]` - defines all configurations
* `model_fresh = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)` - loads fresh model each time
* `start_time = time.time()` - records start time
* `trainer.train()` - executes training
* `runtime = round(time.time() - start_time, 2)` - calculates training duration
* `train_metrics = evaluate_dataset(trainer, tokenized_train, "TRAIN")` - gets training performance
* `test_metrics = evaluate_dataset(trainer, tokenized_test, "TEST")` - gets test performance
* `ws.append([i, params["epochs"], params["lr"], ...])` - adds row to Excel
* `del model_fresh, trainer` - frees memory
* `torch.cuda.empty_cache()` - clears GPU memory
* `torch.cuda.synchronize()` - waits for GPU operations to complete
* `wb.save(output_excel)` - saves Excel file

The 10 configurations are chosen to explore the hyperparameter space intelligently. They test longer training (up to 18 epochs) compared to BioBERT because BlueBERT may need more epochs to adapt. Learning rates range from 0.00002 to 0.00005 to find the sweet spot. Different batch sizes and gradient accumulation settings test memory vs speed tradeoffs.
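
The memory vs speed tradeoff is easier to see when the effective batch size of each configuration is spelled out. The sketch below copies the batch size and gradient accumulation values from the 10 configurations and multiplies them; the variable names are chosen here for illustration.

```python
# (epochs, batch size, gradient accumulation steps) for the 10 configurations
configs = [
    (8, 16, 1), (10, 12, 1), (12, 16, 1), (12, 8, 4), (14, 12, 1),
    (15, 16, 1), (15, 8, 6), (16, 12, 1), (18, 16, 1), (18, 8, 4),
]

# Effective batch size = per-device batch size * accumulation steps (single GPU)
effective_batches = [batch * accum for _, batch, accum in configs]

for i, ((epochs, batch, accum), effective) in enumerate(zip(configs, effective_batches), 1):
    print(f"config {i:2d}: epochs={epochs:2d} batch={batch:2d} accum={accum} -> effective {effective}")
```

Only the configurations with batch size 8 use accumulation, which gives them the largest effective batches (32 and 48) while keeping the per-step memory footprint the smallest.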

When reviewing the results, look for:

* Small gap between training and test metrics (good generalization)
* High test F1 scores (model finds correct answers on new questions)
* Test accuracy close to but not exceeding training accuracy
* Reasonable training times (most complete in 2-5 minutes)
* Test accuracy of 0.99-1.0 deserves a closer look: near-perfect scores may point to trivially learnable labels or data leakage rather than genuine generalization
* Best configuration balances high test performance with realistic metrics

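Colab (T4) optimizations used in this section:

* Uses `dataloader_num_workers=0` (Colab runtimes have few CPU cores)
* Enables `fp16` mixed precision for faster training on GPU
* Clears GPU cache between experiments to reduce the risk of out-of-memory errors
* Uses batch sizes that fit comfortably in T4's memory
* Gradient accumulation simulates larger batches without extra memory
* Disables evaluation, checkpoint saving, and external logging during training (`eval_strategy="no"`, `save_strategy="no"`, `report_to=[]`) to reduce overhead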





# 1) Environment setup (Colab)
import sys
import subprocess

def pip_install(packages):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + packages)

# Core ML stack
pip_install([
    "transformers>=4.44.0",
    "datasets>=2.14.0",
    "accelerate>=0.26.0",
    "evaluate>=0.4.0",
])

# Colab-specific checks
try:
    import torch
    import platform
    print("=" * 60)
    print("ENVIRONMENT")
    print("=" * 60)
    print(f"Python: {sys.version.split()[0]} | Platform: {platform.platform()}")
    print(f"PyTorch: {torch.__version__}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)} | CUDA: {torch.version.cuda}")
    else:
        print("GPU not detected. Enable a GPU in Runtime > Change runtime type > T4/other.")
    print("=" * 60)
except Exception as e:
    print("Environment check failed:", e)

# 2) Imports and GPU config
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer
from transformers import default_data_collator
from datasets import Dataset
import warnings
warnings.filterwarnings('ignore')

print("=" * 60)
print("GPU CONFIGURATION")
print("=" * 60)
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"GPU Memory: {round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 2)} GB")
else:
    device = torch.device("cpu")
    print("GPU not available; training will be slower.")
print("=" * 60)

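# 3) Dataset loading options
USE_DRIVE = False   # set to True to load the CSV from Google Drive instead of uploading
CSV_PATH = ""       # when USE_DRIVE is True, set this to the CSV path inside your Drive

if USE_DRIVE:
    from google.colab import drive  # type: ignore
    drive.mount('/content/drive')

USE_UPLOAD = not USE_DRIVE
if USE_UPLOAD:
    try:
        from google.colab import files  # type: ignore
        uploaded = files.upload()
        # Pick the first uploaded file
        if uploaded:
            CSV_PATH = list(uploaded.keys())[0]
    except Exception as e:
        print("Upload failed:", e)

if not CSV_PATH:
    raise ValueError("No CSV path provided: upload a file or set CSV_PATH.")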

print("Dataset CSV:", CSV_PATH)

# 4) Data Loading and Preprocessing
print("\n" + "=" * 60)
print("LOADING CARDIOVASCULAR DATASET")
print("=" * 60)

dataset = pd.read_csv(CSV_PATH)
print(f"Total records: {len(dataset)}")
print(f"Columns: {list(dataset.columns)}")
print(f"Sample question: {str(dataset.iloc[0]['question'])[:80]}...")
print(f"Sample answer chars: {len(str(dataset.iloc[0]['answer']))}")

# Drop nulls
dataset = dataset.dropna(subset=["question", "answer"]).reset_index(drop=True)

# Train/test split (80/20)
dataset_shuffled = dataset.sample(frac=1.0, random_state=42).reset_index(drop=True)
split_idx = int(len(dataset_shuffled) * 0.80)
train_data = dataset_shuffled.iloc[:split_idx].copy()
test_data = dataset_shuffled.iloc[split_idx:].copy()

print(f"Train: {len(train_data)} | Test: {len(test_data)}")
print("=" * 60)

# 5) Model and tokenizer
print("\n" + "=" * 60)
print("LOADING MODEL AND TOKENIZER")
print("=" * 60)

MODEL_NAME = "aaditya/Bluebert_emrqa"
print("Model:", MODEL_NAME)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
model.to(device)

print("Model loaded.")
print("=" * 60)


# 6) Tokenization and Feature Preparation
print("\n" + "=" * 60)
print("TOKENIZING DATASET")
print("=" * 60)

train_ds = Dataset.from_pandas(train_data)
test_ds = Dataset.from_pandas(test_data)

MAX_LENGTH = 384
DOC_STRIDE = 128

def prepare_train_features(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["answer"],
        truncation="only_second",
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(tokenized["offset_mapping"]):
        sequence_ids = tokenized.sequence_ids(i)

        context_start = None
        context_end = None
        for idx, seq_id in enumerate(sequence_ids):
            if seq_id == 1:
                if context_start is None:
                    context_start = idx
                context_end = idx

        if context_start is None:
            start_positions.append(0)
            end_positions.append(0)
        else:
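            # Heuristic labels: the span starts at the first answer token and
            # ends 50 tokens later, capped at the last answer token in the window.
            answer_start = context_start
            answer_end = min(context_start + 50, context_end)
            start_positions.append(answer_start)
            end_positions.append(answer_end)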

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions

    # Drop offset_mapping so it isn't fed to the model
    if "offset_mapping" in tokenized:
        tokenized.pop("offset_mapping")

    return tokenized

print("Tokenizing train...")
tokenized_train = train_ds.map(
    prepare_train_features,
    batched=True,
    remove_columns=train_ds.column_names,
    desc="Tokenizing train",
)

print("Tokenizing test...")
tokenized_test = test_ds.map(
    prepare_train_features,
    batched=True,
    remove_columns=test_ds.column_names,
    desc="Tokenizing test",
)

print("Done.")
print("=" * 60)

# 7) Evaluation Metrics - Modular Helper Functions
import numpy as np

def calculate_detailed_metrics(predictions):
    """
    Calculate accuracy, F1, precision, and recall from model predictions.
    This function processes predictions in a single pass for efficiency.
    """
    pred_starts = np.argmax(predictions.predictions[0], axis=1)
    pred_ends = np.argmax(predictions.predictions[1], axis=1)

    true_starts = np.asarray(predictions.label_ids[0]).reshape(-1)
    true_ends = np.asarray(predictions.label_ids[1]).reshape(-1)

    # Calculate exact match and accuracy metrics
    exact_match = np.mean((pred_starts == true_starts) & (pred_ends == true_ends))
    start_accuracy = np.mean(pred_starts == true_starts)
    end_accuracy = np.mean(pred_ends == true_ends)
    accuracy = (exact_match + start_accuracy + end_accuracy) / 3

    # Calculate F1, precision, and recall from token overlap
    f1_scores = []
    precision_scores = []
    recall_scores = []

    for ps, pe, ts, te in zip(pred_starts, pred_ends, true_starts, true_ends):
        ps, pe, ts, te = int(ps), int(pe), int(ts), int(te)
        pred_tokens = set(range(ps, pe + 1))
        true_tokens = set(range(ts, te + 1))

        if not pred_tokens and not true_tokens:
            f1_scores.append(1.0)
            precision_scores.append(1.0)
            recall_scores.append(1.0)
        elif not pred_tokens or not true_tokens:
            f1_scores.append(0.0)
            precision_scores.append(0.0)
            recall_scores.append(0.0)
        else:
            common = len(pred_tokens & true_tokens)
            if common == 0:
                f1_scores.append(0.0)
                precision_scores.append(0.0)
                recall_scores.append(0.0)
            else:
                precision = common / len(pred_tokens)
                recall = common / len(true_tokens)
                f1 = 2 * (precision * recall) / (precision + recall)
                f1_scores.append(f1)
                precision_scores.append(precision)
                recall_scores.append(recall)

    return {
        "accuracy": float(accuracy),
        "f1": float(np.mean(f1_scores)) if f1_scores else 0.0,
        "precision": float(np.mean(precision_scores)) if precision_scores else 0.0,
        "recall": float(np.mean(recall_scores)) if recall_scores else 0.0,
    }

def evaluate_dataset(trainer, tokenized_data, dataset_name="Dataset"):
    """
    Evaluate model on a given dataset and return all metrics.
    Makes a single prediction call for efficiency.
    """
    predictions = trainer.predict(tokenized_data)
    metrics = calculate_detailed_metrics(predictions)
    return metrics

def compute_qa_metrics(eval_pred):
    """
    Compute metrics for Trainer's evaluation during training.
    Kept for compatibility with Trainer API.
    """
    predictions, label_ids = eval_pred
    start_logits, end_logits = predictions

    pred_starts = np.argmax(start_logits, axis=1)
    pred_ends = np.argmax(end_logits, axis=1)

    true_starts = np.asarray(label_ids[0]).reshape(-1)
    true_ends = np.asarray(label_ids[1]).reshape(-1)

    exact_match = np.mean((pred_starts == true_starts) & (pred_ends == true_ends))
    start_accuracy = np.mean(pred_starts == true_starts)
    end_accuracy = np.mean(pred_ends == true_ends)

    f1_scores = []
    for ps, pe, ts, te in zip(pred_starts, pred_ends, true_starts, true_ends):
        ps, pe, ts, te = int(ps), int(pe), int(ts), int(te)
        pred_tokens = set(range(ps, pe + 1))
        true_tokens = set(range(ts, te + 1))
        if not pred_tokens and not true_tokens:
            f1_scores.append(1.0)
        elif not pred_tokens or not true_tokens:
            f1_scores.append(0.0)
        else:
            common = len(pred_tokens & true_tokens)
            if common == 0:
                f1_scores.append(0.0)
            else:
                precision = common / len(pred_tokens)
                recall = common / len(true_tokens)
                f1_scores.append(2 * (precision * recall) / (precision + recall))

    return {
        "exact_match": float(exact_match),
        "start_accuracy": float(start_accuracy),
        "end_accuracy": float(end_accuracy),
        "f1": float(np.mean(f1_scores)) if f1_scores else 0.0,
    }

print("Metrics functions ready.")
print("=" * 60)

# 8) Training configuration (Colab-friendly)
print("\n" + "=" * 60)
print("TRAINING CONFIGURATION")
print("=" * 60)

training_args = TrainingArguments(
    output_dir="./results_cardio_qa",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=0.00003,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    fp16=torch.cuda.is_available(),
    dataloader_pin_memory=True,
    dataloader_num_workers=2,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    logging_dir="./logs_cardio_qa",
    logging_steps=50,
    logging_strategy="steps",
    report_to=[],
    seed=42,
    disable_tqdm=False,
    remove_unused_columns=True,
)

print("Configured. FP16:", training_args.fp16)
print("=" * 60)

# 9) Initialize Trainer
print("\n" + "=" * 60)
print("INITIALIZING TRAINER")
print("=" * 60)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_qa_metrics,
)

print("Trainer ready.")
print("=" * 60)

# 10) Hyperparameter Tuning with Train-Test Metrics
import time
from openpyxl import Workbook
from openpyxl.styles import Font, Alignment

print("\n" + "=" * 60)
print("HYPERPARAMETER TUNING (10 CONFIGURATIONS)")
print("=" * 60)

# Define hyperparameter configurations (T4 GPU optimized for cardiovascular dataset)
# Avoiding: grad_accum=2, warmup=0.1, weight_decay=0.01
hyperparam_sets = [
    {"epochs": 8,  "lr": 0.00003, "batch": 16, "grad_accum": 1, "warmup": 0.05, "weight_decay": 0.005},
    {"epochs": 10, "lr": 0.00003, "batch": 12, "grad_accum": 1, "warmup": 0.08, "weight_decay": 0.015},
    {"epochs": 12, "lr": 0.00002, "batch": 16, "grad_accum": 1, "warmup": 0.06, "weight_decay": 0.02},
    {"epochs": 12, "lr": 0.00004, "batch": 8,  "grad_accum": 4, "warmup": 0.12, "weight_decay": 0.005},
    {"epochs": 14, "lr": 0.00003, "batch": 12, "grad_accum": 1, "warmup": 0.15, "weight_decay": 0.008},
    {"epochs": 15, "lr": 0.000025, "batch": 16, "grad_accum": 1, "warmup": 0.18, "weight_decay": 0.012},
    {"epochs": 15, "lr": 0.00005, "batch": 8,  "grad_accum": 6, "warmup": 0.08, "weight_decay": 0.025},
    {"epochs": 16, "lr": 0.000035, "batch": 12, "grad_accum": 1, "warmup": 0.12, "weight_decay": 0.015},
    {"epochs": 18, "lr": 0.00002, "batch": 16, "grad_accum": 1, "warmup": 0.06, "weight_decay": 0.018},
    {"epochs": 18, "lr": 0.00004, "batch": 8,  "grad_accum": 4, "warmup": 0.15, "weight_decay": 0.022},
]

total_iters = len(hyperparam_sets)

# Excel setup with gradient accumulation column
wb = Workbook()
ws = wb.active
ws.title = "QA Hyperparameter Results"
ws.append([
    "Iteration", "Epochs", "Learning Rate", "Batch Size", "Grad Accum", "Warmup Ratio", "Weight Decay",
    "Train-Accuracy", "Test-Accuracy",
    "Train-F1", "Test-F1",
    "Train-Precision", "Test-Precision",
    "Train-Recall", "Test-Recall",
    "Runtime (s)"
])

# Main hyperparameter tuning loop
for i, params in enumerate(hyperparam_sets, 1):
    print(f"\n{'='*60}")
    print(f"▶️ Iteration {i}/{total_iters}")
    print(f"Params: Epochs={params['epochs']}, LR={params['lr']}, Batch={params['batch']}, GradAccum={params['grad_accum']}")
    print(f"{'='*60}")

    # Reload fresh model for each iteration
    print("üîÑ Loading fresh model...")
    model_fresh = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
    model_fresh.to(device)

    # Configure training arguments (T4 optimized)
    training_args = TrainingArguments(
        output_dir=f"./results_run_{i}",
        num_train_epochs=params["epochs"],
        per_device_train_batch_size=params["batch"],
        per_device_eval_batch_size=params["batch"],
        gradient_accumulation_steps=params["grad_accum"],
        learning_rate=params["lr"],
        warmup_ratio=params["warmup"],
        weight_decay=params["weight_decay"],
        eval_strategy="no",
        save_strategy="no",
        logging_dir=f"./logs_run_{i}",
        report_to=[],
        disable_tqdm=False,
        seed=42,
        fp16=torch.cuda.is_available(),
        dataloader_num_workers=0,  # T4 optimization (limited CPU cores)
        dataloader_pin_memory=True,
    )

    # Create trainer
    trainer = Trainer(
        model=model_fresh,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_test,
        tokenizer=tokenizer,  # newer Transformers releases deprecate this in favor of `processing_class=`
        data_collator=default_data_collator,
    )

    # Train model
    print("🚀 Training...")
    start_time = time.time()
    trainer.train()
    runtime = round(time.time() - start_time, 2)

    # Evaluate on BOTH train and test sets
    print("📊 Evaluating on training set...")
    train_metrics = evaluate_dataset(trainer, tokenized_train, "TRAIN")

    print("📊 Evaluating on test set...")
    test_metrics = evaluate_dataset(trainer, tokenized_test, "TEST")

    # Append results to Excel
    ws.append([
        i,
        params["epochs"],
        params["lr"],
        params["batch"],
        params["grad_accum"],
        params["warmup"],
        params["weight_decay"],
        round(train_metrics["accuracy"], 4),
        round(test_metrics["accuracy"], 4),
        round(train_metrics["f1"], 4),
        round(test_metrics["f1"], 4),
        round(train_metrics["precision"], 4),
        round(test_metrics["precision"], 4),
        round(train_metrics["recall"], 4),
        round(test_metrics["recall"], 4),
        runtime
    ])

    print(f"✅ Iteration {i} complete!")
    print(f"   Train - F1: {train_metrics['f1']:.4f}, Acc: {train_metrics['accuracy']:.4f}")
    print(f"   Test  - F1: {test_metrics['f1']:.4f}, Acc: {test_metrics['accuracy']:.4f}")
    print(f"   Runtime: {runtime}s")

    # Clear GPU cache and free memory (T4 optimization)
    del model_fresh, trainer
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

# Format Excel file
print("\n" + "=" * 60)
print("FORMATTING EXCEL FILE")
print("=" * 60)

# Bold headers
for cell in ws[1]:
    cell.font = Font(bold=True)
    cell.alignment = Alignment(horizontal="center", vertical="center")

# Auto-adjust column widths
for col in ws.columns:
    max_length = 0
    col_letter = col[0].column_letter
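    for cell in col:
        # Empty cells count as zero width instead of len("None");
        # str() cannot fail here, so no try/except is needed
        cell_length = len(str(cell.value)) if cell.value is not None else 0
        max_length = max(max_length, cell_length)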
    adjusted_width = min((max_length + 2), 20)
    ws.column_dimensions[col_letter].width = adjusted_width

# Save Excel file
output_excel = "/content/QA_Hyperparameter_Results_TrainTest.xlsx"
wb.save(output_excel)

print(f"\n✅ All {total_iters} configurations completed successfully!")
print("📊 Results saved to Excel file:")
print(f"➡️ {output_excel}")
print("\n" + "=" * 60)
print("EXPERIMENT COMPLETE")
print("=" * 60)
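
The loop above passes both `per_device_train_batch_size` and `gradient_accumulation_steps` to `TrainingArguments`, so the batch size that matters for each optimizer update is their product. The sketch below is a rough illustration of that arithmetic, not part of the original notebook. The helper names are hypothetical, and `num_features=900` is a made-up placeholder, since the number of tokenized training windows is not reported anywhere in the run log. The step count is an approximation: the exact rounding differs between Transformers versions. Because `logging_steps` is left at its default of 500, runs with fewer than 500 optimizer steps print an empty `Step,Training Loss` table.

```python
import math


def effective_batch_size(per_device_batch, grad_accum, num_devices=1):
    """Number of examples contributing to one optimizer update."""
    return per_device_batch * grad_accum * num_devices


def approx_optimizer_steps(num_features, per_device_batch, grad_accum, epochs):
    """Approximate total optimizer steps for a single-GPU run.

    Assumption: one update every `grad_accum` batches, rounded down per epoch.
    Newer Transformers versions may round differently.
    """
    batches_per_epoch = math.ceil(num_features / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


def warmup_steps(total_steps, warmup_ratio):
    """Warmup length implied by a warmup ratio, rounded up."""
    return math.ceil(total_steps * warmup_ratio)


# Last configuration in hyperparam_sets: batch=8, grad_accum=4, epochs=18, warmup=0.15
NUM_FEATURES = 900  # hypothetical placeholder, NOT taken from the run log

total = approx_optimizer_steps(NUM_FEATURES, per_device_batch=8, grad_accum=4, epochs=18)
print("Effective batch size:", effective_batch_size(8, 4))
print("Approx. optimizer steps:", total)
print("Warmup steps:", warmup_steps(total, 0.15))
```

Seen this way, the two gradient-accumulation settings in the visible configurations (`batch=8` with `grad_accum=6` or `grad_accum=4`) train with effective batches of 48 and 32, larger than any of the runs with `grad_accum=1`.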

ENVIRONMENT
Python: 3.12.12 | Platform: Linux-6.6.105+-x86_64-with-glibc2.35
PyTorch: 2.8.0+cu126
GPU: Tesla T4 | CUDA: 12.6

GPU CONFIGURATION
GPU Available: Tesla T4
CUDA Version: 12.6
GPU Memory: 14.74 GB


Saving medquadCardiovascular.csv to medquadCardiovascular (1).csv
Dataset CSV: medquadCardiovascular (1).csv

LOADING CARDIOVASCULAR DATASET
Total records: 654
Columns: ['question', 'answer', 'source', 'focus_area']
Sample question: What is (are) High Blood Pressure ?...
Sample answer chars: 5586
Train: 523 | Test: 131

LOADING MODEL AND TOKENIZER
Model: aaditya/Bluebert_emrqa
Model loaded.

TOKENIZING DATASET
Tokenizing train...

Tokenizing test...

Done.
Metrics functions ready.

TRAINING CONFIGURATION
Configured. FP16: True

INITIALIZING TRAINER
Trainer ready.

HYPERPARAMETER TUNING (10 CONFIGURATIONS)

▶️ Iteration 1/10
Params: Epochs=8, LR=3e-05, Batch=16, GradAccum=1
🔄 Loading fresh model...
🚀 Training...

Step,Training Loss

📊 Evaluating on training set...
📊 Evaluating on test set...

✅ Iteration 1 complete!
   Train - F1: 0.9989, Acc: 0.9814
   Test  - F1: 0.9900, Acc: 0.8428
   Runtime: 154.44s

▶️ Iteration 2/10
Params: Epochs=10, LR=3e-05, Batch=12, GradAccum=1
🔄 Loading fresh model...
🚀 Training...

Step,Training Loss
500,1.3911

📊 Evaluating on training set...
📊 Evaluating on test set...

✅ Iteration 2 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9966, Acc: 0.9493
   Runtime: 183.91s

▶️ Iteration 3/10
Params: Epochs=12, LR=2e-05, Batch=16, GradAccum=1
🔄 Loading fresh model...
🚀 Training...

Step,Training Loss
500,1.4352

📊 Evaluating on training set...
📊 Evaluating on test set...

✅ Iteration 3 complete!
   Train - F1: 0.9999, Acc: 0.9959
   Test  - F1: 0.9941, Acc: 0.8302
   Runtime: 211.93s

▶️ Iteration 4/10
Params: Epochs=12, LR=4e-05, Batch=8, GradAccum=4
🔄 Loading fresh model...
🚀 Training...

Step,Training Loss

📊 Evaluating on training set...
📊 Evaluating on test set...

✅ Iteration 4 complete!
   Train - F1: 1.0000, Acc: 0.9993
   Test  - F1: 0.9967, Acc: 0.8758
   Runtime: 214.5s

▶️ Iteration 5/10
Params: Epochs=14, LR=3e-05, Batch=12, GradAccum=1
🔄 Loading fresh model...
🚀 Training...

Step,Training Loss
500,1.8889
1000,0.0358

📊 Evaluating on training set...
📊 Evaluating on test set...

✅ Iteration 5 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9983, Acc: 0.9823
   Runtime: 255.33s

▶️ Iteration 6/10
Params: Epochs=15, LR=2.5e-05, Batch=16, GradAccum=1
🔄 Loading fresh model...
🚀 Training...

Step,Training Loss
500,1.9734

📊 Evaluating on training set...
📊 Evaluating on test set...

✅ Iteration 6 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9956, Acc: 0.9062
   Runtime: 263.71s

▶️ Iteration 7/10
Params: Epochs=15, LR=5e-05, Batch=8, GradAccum=6
🔄 Loading fresh model...
🚀 Training...

Step,Training Loss

📊 Evaluating on training set...
📊 Evaluating on test set...

✅ Iteration 7 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9901, Acc: 0.8479
   Runtime: 262.65s

▶️ Iteration 8/10
Params: Epochs=16, LR=3.5e-05, Batch=12, GradAccum=1
🔄 Loading fresh model...
🚀 Training...

Step,Training Loss
500,1.7167
1000,0.0468

📊 Evaluating on training set...
📊 Evaluating on test set...

✅ Iteration 8 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9999, Acc: 0.9924
   Runtime: 290.72s

▶️ Iteration 9/10
Params: Epochs=18, LR=2e-05, Batch=16, GradAccum=1
🔄 Loading fresh model...
🚀 Training...

Step,Training Loss
500,1.5644
1000,0.0738

📊 Evaluating on training set...
📊 Evaluating on test set...

✅ Iteration 9 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9858, Acc: 0.8986
   Runtime: 315.66s

▶️ Iteration 10/10
Params: Epochs=18, LR=4e-05, Batch=8, GradAccum=4
🔄 Loading fresh model...
🚀 Training...

Step,Training Loss
500,1.0855

📊 Evaluating on training set...
📊 Evaluating on test set...

✅ Iteration 10 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9956, Acc: 0.8986
   Runtime: 319.21s

FORMATTING EXCEL FILE

✅ All 10 configurations completed successfully!
📊 Results saved to Excel file:
➡️ /content/QA_Hyperparameter_Results_TrainTest.xlsx

EXPERIMENT COMPLETE
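
The ten runs in the log can be compared by sorting on test accuracy and looking at the train-test accuracy gap. The sketch below is illustrative only and is not part of the original notebook: the metric values are copied by hand from the log, and the helper names `rank_by_test_accuracy` and `accuracy_gap` are hypothetical. The same comparison could presumably be done by reading the saved Excel file instead.

```python
# (iteration, test F1, test accuracy, train accuracy), copied from the run log
results = [
    (1, 0.9900, 0.8428, 0.9814),
    (2, 0.9966, 0.9493, 1.0000),
    (3, 0.9941, 0.8302, 0.9959),
    (4, 0.9967, 0.8758, 0.9993),
    (5, 0.9983, 0.9823, 1.0000),
    (6, 0.9956, 0.9062, 1.0000),
    (7, 0.9901, 0.8479, 1.0000),
    (8, 0.9999, 0.9924, 1.0000),
    (9, 0.9858, 0.8986, 1.0000),
    (10, 0.9956, 0.8986, 1.0000),
]


def rank_by_test_accuracy(rows):
    """Return the runs sorted from highest to lowest test accuracy."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


def accuracy_gap(row):
    """Train accuracy minus test accuracy; a larger gap hints at overfitting."""
    _, _, test_acc, train_acc = row
    return round(train_acc - test_acc, 4)


ranked = rank_by_test_accuracy(results)
for row in ranked:
    iteration, test_f1, test_acc, _ = row
    print(
        f"Iteration {iteration:>2}: test acc={test_acc:.4f}, "
        f"test F1={test_f1:.4f}, train-test gap={accuracy_gap(row):.4f}"
    )
```

On these numbers, iteration 8 (16 epochs, learning rate 3.5e-05, batch size 12) has the highest test accuracy and iteration 3 the lowest. With only 131 test records and a single seed, differences between neighboring configurations should be read with caution.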
