# Dataset Context

This dataset focuses exclusively on cardiovascular health issues, offering a specialized resource for exploring how language models can support medical understanding and patient care in this domain. Built on MedQuAD (the Medical Question Answering Dataset), it provides a collection of text data suitable for tasks such as summarization, question answering, token labeling, and text classification. It can be used with large language models, healthcare-specific transformers, and LayoutLM models for semi-structured documents, which makes it well suited to developing NLP applications that simplify complex cardiovascular information, improve accessibility for patients, and assist healthcare professionals in decision-making.

The file is a CSV version of MedQuAD, converted from the XML source files and excluding MedlinePlus data because of licensing restrictions. MedQuAD contains 47,457 medical question-answer pairs from 12 NIH websites, covering 37 question types such as treatment, diagnosis, and side effects, with additional annotations such as question type, focus, synonyms, UMLS identifiers, and semantic categories. Some subsets had their answers removed to respect copyright, but their metadata and URLs remain available for further exploration. The dataset also includes a QA test collection with 2,479 judged answers from the TREC-2017 LiveQA medical task, which enables evaluation of IR and QA systems. Together, these resources provide a foundation for cardiovascular-focused NLP research and experimentation.


# BioBERT Cardiovascular QA - Code Structure and Functionality

This document explains the code structure for fine-tuning BioBERT on cardiovascular question answering tasks using Google Colab.

## 1. Environment Setup and Configuration

This section sets up everything needed to run the machine learning code in Google Colab.


* Creates a helper function to install Python packages quietly without cluttering the output
* Installs the core libraries needed for the project:
  * Transformers (version 4.44.0 or higher) - works with pre-trained BERT models
  * Datasets (version 2.14.0 or higher) - handles data efficiently
  * Accelerate (version 0.26.0 or higher) - speeds up training
  * Evaluate (version 0.4.0 or higher) - calculates performance metrics
* Checks the Python version and operating system details
* Verifies that PyTorch is installed correctly
* Detects if a GPU is available for training
* Shows GPU information like model name and CUDA version
* Warns users if no GPU is found and reminds them to enable it in Colab settings

GPU acceleration matters for training transformer models: on a CPU, fine-tuning takes far longer. This section confirms that the environment is configured properly before the actual work starts.

## 2. Imports and GPU Configuration

This section imports all the necessary Python libraries and confirms GPU availability.

* Imports PyTorch for deep learning operations
* Imports NumPy for numerical calculations
* Imports Pandas for working with data tables
* Imports specific tools from Transformers library:
  * AutoTokenizer - converts text to numbers the model understands
  * AutoModelForQuestionAnswering - loads pre-trained QA models
  * TrainingArguments - sets up training parameters
  * Trainer - handles the training process
  * default_data_collator - organizes batches of data
* Imports Dataset from the datasets library for efficient data handling
* Turns off warning messages to keep output clean
* Creates a CUDA device object if GPU is available
* Displays detailed GPU information including memory size

**Important code sections:**
* `device = torch.device("cuda")` - tells PyTorch to use the GPU
* `torch.cuda.get_device_name(0)` - gets the GPU model name (like Tesla T4)
* `torch.cuda.get_device_properties(0).total_memory` - shows how much GPU memory is available

## 3. Dataset Loading Options

This section provides flexible ways to load the cardiovascular dataset in Colab.

* Uses boolean flags (USE_DRIVE and USE_UPLOAD) to switch between methods
* Defaults to the upload option for simplicity
* For Google Drive: requires you to specify the file path
* For upload: uses Colab's file picker interface
* Raises an error if no valid CSV path is provided

**Important code sections:**
* `from google.colab import drive` - enables Google Drive mounting
* `drive.mount('/content/drive')` - connects your Google Drive
* `from google.colab import files` - enables file upload
* `files.upload()` - opens the file picker dialog
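
The selection logic described above can be sketched as a small pure-Python helper, which makes it possible to check the behavior outside Colab. The function name `resolve_csv_path` and its parameters are hypothetical and not part of the notebook; `uploaded` stands in for the dict of filename to file contents that `files.upload()` returns.

```python
def resolve_csv_path(use_drive, drive_path="", uploaded=None):
    """Pick the CSV path from the Drive setting or from an upload dict.

    Hypothetical helper: `uploaded` mimics the {filename: bytes} dict
    returned by google.colab.files.upload().
    """
    if use_drive:
        path = drive_path
    else:
        # Take the first uploaded filename, if there is one
        path = next(iter(uploaded), "") if uploaded else ""
    if not path:
        raise ValueError("provide CSV_PATH")
    return path


print(resolve_csv_path(False, uploaded={"medquadCardiovascular.csv": b""}))
```

Keeping the decision in one function means the "no valid path" error is raised in a single place, whichever loading method is selected.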



## 4. Data Loading and Preprocessing

This section loads the cardiovascular medical questions and answers, then prepares them for training.

* Reads the CSV file using Pandas
* Shows diagnostic information:
  * Total number of question-answer pairs
  * Column names in the dataset
  * Sample question preview
  * Length of a sample answer
* Removes any rows with missing questions or answers (null values)
* Shuffles the dataset randomly using a fixed random seed (42) for reproducibility
* Splits data into training set (80%) and test set (20%)
* Displays the size of each set

**Important code sections:**
* `dataset.dropna(subset=["question", "answer"])` - removes incomplete data
* `dataset.sample(frac=1.0, random_state=42)` - shuffles with fixed seed
* `split_idx = int(len(dataset_shuffled) * 0.80)` - calculates 80/20 split point

Clean data is crucial. Null values would cause errors during training. The 80/20 split ensures we have plenty of data for training while keeping enough aside to test how well the model generalizes to new questions.
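
As a worked example of the cleaning and splitting steps, the sketch below applies the same three calls to a toy DataFrame. The toy data and variable names are illustrative assumptions; the row count is chosen so that 654 rows remain after cleaning, matching the record count shown in the run output.

```python
import pandas as pd

# Toy stand-in for the CSV: 655 rows, one of which has a missing answer
toy = pd.DataFrame({
    "question": [f"q{i}" for i in range(655)],
    "answer": [f"a{i}" for i in range(655)],
})
toy.loc[3, "answer"] = None  # incomplete row that dropna should remove

clean = toy.dropna(subset=["question", "answer"]).reset_index(drop=True)
shuffled = clean.sample(frac=1.0, random_state=42).reset_index(drop=True)

split_idx = int(len(shuffled) * 0.80)  # int() truncates 523.2 down to 523
train = shuffled.iloc[:split_idx]
test = shuffled.iloc[split_idx:]

print(len(clean), len(train), len(test))  # 654 523 131
```

Because `int()` truncates, any leftover row from the 80% calculation goes to the test set, which is why 654 records split into 523 and 131.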

## 5. Model and Tokenizer Initialization

This section loads the pre-trained BioBERT model and its tokenizer.

* Sets the model name to "dmis-lab/biobert-base-cased-v1.1"
* Downloads the BioBERT tokenizer from Hugging Face
* Enables the fast tokenizer option for better performance
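* Downloads the BioBERT weights and loads them into a question-answering model; the span-prediction head (`qa_outputs`) is newly initialized and is learned during fine-tuning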
* Moves the model to the GPU (or CPU if no GPU available)

**Important code sections:**
* `AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)` - loads tokenizer
* `AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)` - loads model
* `model.to(device)` - transfers model to GPU for faster computation

BioBERT is not just regular BERT. It starts from BERT and is further pre-trained on biomedical literature (PubMed abstracts and, in some releases, PMC full-text articles), which exposes it to medical terminology, diseases, treatments, and healthcare concepts. Because it already models medical language, fine-tuning it on cardiovascular questions typically converges faster and performs better than starting from a general-purpose model.

## 6. Tokenization and Feature Preparation

This is one of the most complex sections. It converts text into numbers and prepares labels for training.

* Converts Pandas dataframes into Hugging Face Dataset objects
* Sets maximum sequence length to 384 tokens (a common choice for QA; BERT's hard limit is 512)
* Sets document stride to 128 tokens for handling long texts
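* Creates a function called `prepare_train_features` that:
  * Tokenizes each question together with its answer text, which serves as the context
  * Truncates only the answer part if the pair is too long
  * Splits long answers into overlapping windows, so one example can produce several features
  * Uses `sequence_ids` to identify which tokens belong to the question vs the answer
  * Sets the start position to the first answer token and the end position at most 50 tokens later (a fixed span; the CSV columns contain no annotated answer spans)
  * Pads every feature to the maximum length
  * Removes `offset_mapping`, which the model does not need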
* Applies this tokenization to both training and test datasets in batches

**Important code sections:**
* `tokenizer(examples["question"], examples["answer"], truncation="only_second")` - tokenizes both question and answer, only cutting the answer if needed
* `sequence_ids = tokenized.sequence_ids(i)` - identifies which tokens are question (0) and which are answer (1)
* `start_positions.append(answer_start)` - marks where the answer starts
* `end_positions.append(answer_end)` - marks where the answer ends

Question answering is different from simple classification. The model doesn't just predict yes/no or choose a category. Instead, it predicts a start token and an end token that delimit the answer span in the text. In this notebook the target span is positional (the first answer token through at most 50 tokens later) rather than a human-annotated answer, so the metrics reported later measure how well the model reproduces that rule.
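
The labeling rule used by `prepare_train_features` can be sketched without downloading a tokenizer by working directly on a list of sequence ids. The function name `fixed_span_labels` and the hand-written `ids` list are illustrative assumptions; in the notebook the ids come from `tokenized.sequence_ids(i)`.

```python
def fixed_span_labels(sequence_ids, max_span=50):
    """Return (start, end) token indices for the fixed-span rule.

    sequence_ids holds one entry per token: None for special or padding
    tokens, 0 for question tokens, and 1 for answer (context) tokens.
    """
    context = [idx for idx, seq_id in enumerate(sequence_ids) if seq_id == 1]
    if not context:
        return 0, 0  # no answer tokens in this feature: point at index 0
    start = context[0]
    end = min(start + max_span, context[-1])
    return start, end


# Layout: [CLS] q q q [SEP] a a a a [SEP] pad pad
ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None, None, None]
print(fixed_span_labels(ids))  # (5, 8)
```

For answers longer than 50 tokens the end label is always `start + 50`, so the target depends mainly on the question length rather than on the answer's content.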

## 7. Evaluation Metrics Definition

This section defines how to measure model performance with multiple metrics.

* Creates a function `calculate_detailed_metrics` that:
  * Extracts predicted start and end positions from model output
  * Compares predictions to true answer positions
  * Calculates exact match (when both start and end are correct)
  * Calculates start accuracy (correct start position)
  * Calculates end accuracy (correct end position)
  * Calculates overall accuracy as average of the three above
  * Computes F1 score based on token overlap
  * Computes precision (what percent of predicted tokens are correct)
  * Computes recall (what percent of correct tokens were found)
* Creates `evaluate_dataset` function to evaluate any dataset efficiently
* Creates `compute_qa_metrics` for compatibility with Trainer API

**Important code sections:**
* `pred_starts = np.argmax(predictions.predictions[0], axis=1)` - finds most likely start position
* `pred_ends = np.argmax(predictions.predictions[1], axis=1)` - finds most likely end position
* `exact_match = np.mean((pred_starts == true_starts) & (pred_ends == true_ends))` - checks perfect predictions
* F1 calculation treats answer spans as sets of token positions and measures overlap

Different metrics tell different stories. Accuracy here averages exact match, start accuracy, and end accuracy. Precision is the share of predicted tokens that fall inside the true span, recall is the share of true-span tokens that were predicted, and F1 balances the two. Looking at all of them together gives a fuller picture of model performance.
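
The token-overlap scores can be illustrated on a single pair of spans. This is a minimal sketch of the same set-based calculation; the function name `span_overlap_scores` is an assumption made for this example.

```python
def span_overlap_scores(pred_start, pred_end, true_start, true_end):
    """Precision, recall, and F1 between two inclusive token spans."""
    pred = set(range(pred_start, pred_end + 1))  # empty if end < start
    true = set(range(true_start, true_end + 1))
    if not pred and not true:
        return 1.0, 1.0, 1.0
    common = len(pred & true)
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / len(pred)
    recall = common / len(true)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Predicted tokens 2..5 against true tokens 4..7: two tokens overlap
print(span_overlap_scores(2, 5, 4, 7))  # (0.5, 0.5, 0.5)
```

A prediction whose end comes before its start yields an empty token set and therefore scores zero, which is how inverted spans are penalized.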

## 8. Training Configuration

This section sets up all the hyperparameters and training options optimized for Colab's T4 GPU.

* Creates a TrainingArguments object with settings:
  * **num_train_epochs=3** - trains for 3 complete passes through the data
  * **per_device_train_batch_size=16** - processes 16 examples at once
  * **gradient_accumulation_steps=2** - simulates batch size of 32 (16 x 2)
  * **learning_rate=0.00003** - controls how fast the model updates (3e-5)
  * **weight_decay=0.01** - adds regularization to prevent overfitting
  * **warmup_ratio=0.1** - gradually increases learning rate for first 10% of training
  * **max_grad_norm=1.0** - prevents gradient explosion
  * **fp16=True** - uses mixed precision, which usually speeds up training on GPU (the code sets it from `torch.cuda.is_available()`)
  * **eval_strategy="epoch"** - evaluates after each epoch
  * **save_strategy="epoch"** - saves checkpoint after each epoch
  * **save_total_limit=2** - keeps at most 2 checkpoints on disk to save space
  * **metric_for_best_model="f1"** - uses F1 score to determine best model
  * **seed=42** - fixes random seed for reproducibility

**Important code sections:**
* `fp16=torch.cuda.is_available()` - enables mixed precision only if GPU exists
* `dataloader_num_workers=0` - loads batches in the main process, which suits Colab runtimes with few CPU cores
* `load_best_model_at_end=True` - automatically loads best checkpoint when done

These values follow common practice for fine-tuning BERT models within the memory limits of Colab's T4 GPU. The batch size balances speed and memory usage. The learning rate is a standard fine-tuning value. Mixed precision (fp16) usually speeds up training with little or no loss in accuracy.
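
The arithmetic behind the batch and warmup settings can be sketched as follows. The helper names are hypothetical, the feature count of 1000 is a made-up figure (the notebook does not print the real one), and the step count is only an approximation, since the exact number the Trainer uses depends on the library version and on how the last partial batch is handled.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_schedule(num_features, per_device_batch, grad_accum_steps,
                    epochs, warmup_ratio):
    """Rough estimate of total optimizer steps and warmup steps."""
    eff = effective_batch_size(per_device_batch, grad_accum_steps)
    steps_per_epoch = math.ceil(num_features / eff)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


print(effective_batch_size(16, 2))  # 32
# Hypothetical 1000 training features with the settings listed above
print(approx_schedule(1000, 16, 2, epochs=3, warmup_ratio=0.1))  # (96, 10)
```

Under these assumptions the learning rate would ramp up over roughly the first 10 of 96 updates and then decay linearly.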

## 9. Trainer Initialization

This section creates the Trainer object that handles all the training complexity.

* Combines everything into one Trainer object:
  * The BioBERT model
  * Training arguments from previous section
  * Training dataset (tokenized)
  * Evaluation dataset (tokenized test set)
  * Tokenizer
  * Data collator (handles batching and padding)
  * Metrics function for evaluation

**Important code sections:**
* `Trainer(model=model, args=training_args, ...)` - creates the trainer
* `train_dataset=tokenized_train` - provides training data
* `eval_dataset=tokenized_test` - provides test data for evaluation
* `compute_metrics=compute_qa_metrics` - enables custom metrics calculation

**What the Trainer does for you:**
* Handles the training loop (forward pass, loss calculation, backward pass)
* Manages gradient computation and backpropagation
* Updates model weights using the optimizer
* Adjusts learning rate according to schedule
* Handles mixed precision training
* Runs evaluation at specified intervals
* Saves checkpoints
* Logs metrics to track progress
* Supports distributed training across multiple GPUs

Without the Trainer, you would need to write hundreds of lines of code to handle all these aspects manually. The Trainer API makes it simple while still offering extensive customization options.

## 10. Hyperparameter Tuning Experiment with Train-Test Metrics

This section runs 10 different training experiments to find the best hyperparameter combination.

* Defines 10 different configurations varying:
  * Number of epochs (5 to 14)
  * Learning rate (0.00003 to 0.00007)
  * Batch size (8, 12, or 16)
  * Gradient accumulation (1, 4, or 6)
  * Warmup ratio (0.06 to 0.18)
  * Weight decay (0.006 to 0.022)
* Creates an Excel workbook to store results
* For each configuration:
  * Reloads a fresh BioBERT model (ensures fair comparison)
  * Configures training arguments with specific hyperparameters
  * Creates a new Trainer
  * Trains the model and times how long it takes
  * Evaluates on training set (to check if model learned)
  * Evaluates on test set (to check if model generalizes)
  * Saves all results to Excel with columns:
    * Hyperparameter values
    * Train-Accuracy, Test-Accuracy
    * Train-F1, Test-F1
    * Train-Precision, Test-Precision
    * Train-Recall, Test-Recall
    * Runtime in seconds
  * Clears GPU memory before next iteration
* Formats the Excel file with bold headers and adjusted column widths
* Saves the file to Colab's content folder

* `model_fresh = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)` - reloads fresh model each time
* `trainer.train()` - runs the training
* `train_metrics = evaluate_dataset(trainer, tokenized_train, "TRAIN")` - evaluates training performance
* `test_metrics = evaluate_dataset(trainer, tokenized_test, "TEST")` - evaluates test performance
* `torch.cuda.empty_cache()` - clears GPU memory between experiments
* `wb.save(output_excel)` - saves Excel file with all results

By testing 10 carefully chosen configurations, you can identify which hyperparameters work best for this specific task. The side-by-side train-test metrics make it easy to spot overfitting (when training performance is much better than test performance). Starting fresh for each experiment ensures fair comparison without any carryover effects from previous training.

**What to look for in the results:**
* Configurations where test accuracy is close to training accuracy (good generalization)
* High F1 scores on test set (model finds correct answers)
* Reasonable training time (under 3-4 minutes per experiment)
* Avoid configurations where training accuracy is perfect (1.0) but test accuracy drops significantly (overfitting)

**T4-specific optimizations:**
* Uses `dataloader_num_workers=0` because Colab's T4 runtime has few CPU cores
* Enables `fp16` mixed precision, which usually speeds up training on the T4
* Clears GPU cache between iterations to prevent memory issues
* Uses efficient batch sizes (8, 12, 16) that fit in T4's memory
* Gradient accumulation simulates larger batches without exceeding memory
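
The train-test comparison described above can be sketched in plain Python. The two rows are copied from the first two iterations in the run output; the helper name `add_gaps` and the ranking rule (highest test F1 first, then smallest accuracy gap) are assumptions made for illustration, not the notebook's selection criterion.

```python
# Metrics copied from iterations 1 and 2 of the run output
results = [
    {"iteration": 1, "train_acc": 0.9832, "test_acc": 0.8824,
     "train_f1": 0.9995, "test_f1": 0.9944},
    {"iteration": 2, "train_acc": 0.9940, "test_acc": 0.9730,
     "train_f1": 0.9998, "test_f1": 0.9996},
]


def add_gaps(rows):
    """Attach train-minus-test gaps; larger gaps suggest more overfitting."""
    return [
        {**r,
         "acc_gap": round(r["train_acc"] - r["test_acc"], 4),
         "f1_gap": round(r["train_f1"] - r["test_f1"], 4)}
        for r in rows
    ]


# Rank by test F1 (descending), breaking ties with the smaller accuracy gap
ranked = sorted(add_gaps(results), key=lambda r: (-r["test_f1"], r["acc_gap"]))
for r in ranked:
    print(r["iteration"], r["test_f1"], r["acc_gap"])
```

With these two rows, iteration 2 ranks first: it has the higher test F1 and an accuracy gap of 0.021 compared with 0.1008 for iteration 1.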



# 1) Environment setup (Colab)
import sys
import subprocess

def pip_install(packages):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + packages)

# Core ML stack
pip_install([
    "transformers>=4.44.0",
    "datasets>=2.14.0",
    "accelerate>=0.26.0",
    "evaluate>=0.4.0",
])

# Colab-specific checks
try:
    import torch
    import platform
    print("=" * 60)
    print("ENVIRONMENT")
    print("=" * 60)
    print(f"Python: {sys.version.split()[0]} | Platform: {platform.platform()}")
    print(f"PyTorch: {torch.__version__}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)} | CUDA: {torch.version.cuda}")
    else:
        print("GPU not detected. Enable a GPU in Runtime > Change runtime type > T4/other.")
    print("=" * 60)
except Exception as e:
    print("Environment check failed:", e)

# 2) Imports and GPU config
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer
from transformers import default_data_collator
from datasets import Dataset
import warnings
warnings.filterwarnings('ignore')

print("=" * 60)
print("GPU CONFIGURATION")
print("=" * 60)
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"GPU Memory: {round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 2)} GB")
else:
    device = torch.device("cpu")
    print("GPU not available; training will be slower.")
print("=" * 60)

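# 3) Dataset loading options (Colab)
# Flags described in section 3; upload is the default
USE_DRIVE = False  # set to True to read the CSV from Google Drive
CSV_PATH = ""      # required when USE_DRIVE is True: path to the CSV on Drive

if USE_DRIVE:
    from google.colab import drive  # type: ignore
    drive.mount('/content/drive')

USE_UPLOAD = not USE_DRIVE
if USE_UPLOAD:
    try:
        from google.colab import files  # type: ignore
        uploaded = files.upload()  # opens the file picker
        # Pick the first uploaded file
        if uploaded:
            CSV_PATH = list(uploaded.keys())[0]
    except Exception:
        # Upload unavailable or cancelled; the check below reports the missing path
        pass

if not CSV_PATH:
    raise ValueError("provide CSV_PATH")

print("Dataset CSV:", CSV_PATH)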

# 4) Load and split dataset
print("\n" + "=" * 60)
print("LOADING CARDIOVASCULAR DATASET")
print("=" * 60)

dataset = pd.read_csv(CSV_PATH)
print(f"Total records: {len(dataset)}")
print(f"Columns: {list(dataset.columns)}")
print(f"Sample question: {str(dataset.iloc[0]['question'])[:80]}...")
print(f"Sample answer chars: {len(str(dataset.iloc[0]['answer']))}")

# Drop nulls
dataset = dataset.dropna(subset=["question", "answer"]).reset_index(drop=True)

# Train/test split (80/20)
dataset_shuffled = dataset.sample(frac=1.0, random_state=42).reset_index(drop=True)
split_idx = int(len(dataset_shuffled) * 0.80)
train_data = dataset_shuffled.iloc[:split_idx].copy()
test_data = dataset_shuffled.iloc[split_idx:].copy()

print(f"Train: {len(train_data)} | Test: {len(test_data)}")
print("=" * 60)

# 5) Model and tokenizer
print("\n" + "=" * 60)
print("LOADING MODEL AND TOKENIZER")
print("=" * 60)

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"
print("Model:", MODEL_NAME)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
model.to(device)

print("Model loaded.")
print("=" * 60)


# 6) Tokenization and label prep (extractive QA)
print("\n" + "=" * 60)
print("TOKENIZING DATASET")
print("=" * 60)

train_ds = Dataset.from_pandas(train_data)
test_ds = Dataset.from_pandas(test_data)

MAX_LENGTH = 384
DOC_STRIDE = 128

def prepare_train_features(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["answer"],
        truncation="only_second",
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(tokenized["offset_mapping"]):
        sequence_ids = tokenized.sequence_ids(i)

        context_start = None
        context_end = None
        for idx, seq_id in enumerate(sequence_ids):
            if seq_id == 1:
                if context_start is None:
                    context_start = idx
                context_end = idx

        if context_start is None:
            start_positions.append(0)
            end_positions.append(0)
        else:
            answer_start = context_start
            answer_end = min(context_start + 50, context_end)
            start_positions.append(answer_start)
            end_positions.append(answer_end)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions

    # Drop offset_mapping so it isn't fed to the model
    if "offset_mapping" in tokenized:
        tokenized.pop("offset_mapping")

    return tokenized

print("Tokenizing train...")
tokenized_train = train_ds.map(
    prepare_train_features,
    batched=True,
    remove_columns=train_ds.column_names,
    desc="Tokenizing train",
)

print("Tokenizing test...")
tokenized_test = test_ds.map(
    prepare_train_features,
    batched=True,
    remove_columns=test_ds.column_names,
    desc="Tokenizing test",
)

print("Done.")
print("=" * 60)

# 7) Evaluation Metrics - Modular Helper Functions
import numpy as np

def calculate_detailed_metrics(predictions):
    """
    Calculate accuracy, F1, precision, and recall from model predictions.
    This function processes predictions in a single pass for efficiency.
    """
    pred_starts = np.argmax(predictions.predictions[0], axis=1)
    pred_ends = np.argmax(predictions.predictions[1], axis=1)

    true_starts = np.asarray(predictions.label_ids[0]).reshape(-1)
    true_ends = np.asarray(predictions.label_ids[1]).reshape(-1)

    # Calculate exact match and accuracy metrics
    exact_match = np.mean((pred_starts == true_starts) & (pred_ends == true_ends))
    start_accuracy = np.mean(pred_starts == true_starts)
    end_accuracy = np.mean(pred_ends == true_ends)
    accuracy = (exact_match + start_accuracy + end_accuracy) / 3

    # Calculate F1, precision, and recall from token overlap
    f1_scores = []
    precision_scores = []
    recall_scores = []

    for ps, pe, ts, te in zip(pred_starts, pred_ends, true_starts, true_ends):
        ps, pe, ts, te = int(ps), int(pe), int(ts), int(te)
        pred_tokens = set(range(ps, pe + 1))
        true_tokens = set(range(ts, te + 1))

        if not pred_tokens and not true_tokens:
            f1_scores.append(1.0)
            precision_scores.append(1.0)
            recall_scores.append(1.0)
        elif not pred_tokens or not true_tokens:
            f1_scores.append(0.0)
            precision_scores.append(0.0)
            recall_scores.append(0.0)
        else:
            common = len(pred_tokens & true_tokens)
            if common == 0:
                f1_scores.append(0.0)
                precision_scores.append(0.0)
                recall_scores.append(0.0)
            else:
                precision = common / len(pred_tokens)
                recall = common / len(true_tokens)
                f1 = 2 * (precision * recall) / (precision + recall)
                f1_scores.append(f1)
                precision_scores.append(precision)
                recall_scores.append(recall)

    return {
        "accuracy": float(accuracy),
        "f1": float(np.mean(f1_scores)) if f1_scores else 0.0,
        "precision": float(np.mean(precision_scores)) if precision_scores else 0.0,
        "recall": float(np.mean(recall_scores)) if recall_scores else 0.0,
    }

def evaluate_dataset(trainer, tokenized_data, dataset_name="Dataset"):
    """
    Evaluate model on a given dataset and return all metrics.
    Makes a single prediction call for efficiency.
    """
    predictions = trainer.predict(tokenized_data)
    metrics = calculate_detailed_metrics(predictions)
    return metrics

def compute_qa_metrics(eval_pred):
    """
    Compute metrics for Trainer's evaluation during training.
    Kept for compatibility with Trainer API.
    """
    predictions, label_ids = eval_pred
    start_logits, end_logits = predictions

    pred_starts = np.argmax(start_logits, axis=1)
    pred_ends = np.argmax(end_logits, axis=1)

    true_starts = np.asarray(label_ids[0]).reshape(-1)
    true_ends = np.asarray(label_ids[1]).reshape(-1)

    exact_match = np.mean((pred_starts == true_starts) & (pred_ends == true_ends))
    start_accuracy = np.mean(pred_starts == true_starts)
    end_accuracy = np.mean(pred_ends == true_ends)

    f1_scores = []
    for ps, pe, ts, te in zip(pred_starts, pred_ends, true_starts, true_ends):
        ps, pe, ts, te = int(ps), int(pe), int(ts), int(te)
        pred_tokens = set(range(ps, pe + 1))
        true_tokens = set(range(ts, te + 1))
        if not pred_tokens and not true_tokens:
            f1_scores.append(1.0)
        elif not pred_tokens or not true_tokens:
            f1_scores.append(0.0)
        else:
            common = len(pred_tokens & true_tokens)
            if common == 0:
                f1_scores.append(0.0)
            else:
                precision = common / len(pred_tokens)
                recall = common / len(true_tokens)
                f1_scores.append(2 * (precision * recall) / (precision + recall))

    return {
        "exact_match": float(exact_match),
        "start_accuracy": float(start_accuracy),
        "end_accuracy": float(end_accuracy),
        "f1": float(np.mean(f1_scores)) if f1_scores else 0.0,
    }

print("Metrics functions ready.")
print("=" * 60)

# 8) Training configuration (Colab-friendly)
print("\n" + "=" * 60)
print("TRAINING CONFIGURATION")
print("=" * 60)

training_args = TrainingArguments(
    output_dir="./results_cardio_qa",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=0.00003,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    fp16=torch.cuda.is_available(),
    dataloader_pin_memory=True,
    dataloader_num_workers=0,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    logging_dir="./logs_cardio_qa",
    logging_steps=50,
    logging_strategy="steps",
    report_to=[],
    seed=42,
    disable_tqdm=False,
    remove_unused_columns=True,
)

print("Configured. FP16:", training_args.fp16)
print("=" * 60)

# 9) Initialize Trainer
print("\n" + "=" * 60)
print("INITIALIZING TRAINER")
print("=" * 60)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_qa_metrics,
)

print("Trainer ready.")
print("=" * 60)

# 10) Hyperparameter Tuning with Train-Test Metrics
import time
from openpyxl import Workbook
from openpyxl.styles import Font, Alignment

print("\n" + "=" * 60)
print("HYPERPARAMETER TUNING (10 CONFIGURATIONS)")
print("=" * 60)

# Define hyperparameter configurations (T4 GPU optimized for cardiovascular dataset)
# Avoiding: grad_accum=2, warmup=0.1, weight_decay=0.01
hyperparam_sets = [
    {"epochs": 5,  "lr": 0.00004,  "batch": 16, "grad_accum": 1, "warmup": 0.07, "weight_decay": 0.008},
    {"epochs": 6,  "lr": 0.00005,  "batch": 16, "grad_accum": 1, "warmup": 0.09, "weight_decay": 0.015},
    {"epochs": 6,  "lr": 0.000035, "batch": 12, "grad_accum": 1, "warmup": 0.12, "weight_decay": 0.018},
    {"epochs": 8,  "lr": 0.00006,  "batch": 8,  "grad_accum": 4, "warmup": 0.15, "weight_decay": 0.006},
    {"epochs": 8,  "lr": 0.000045, "batch": 16, "grad_accum": 1, "warmup": 0.18, "weight_decay": 0.012},
    {"epochs": 10, "lr": 0.00004,  "batch": 16, "grad_accum": 1, "warmup": 0.06, "weight_decay": 0.022},
    {"epochs": 10, "lr": 0.00007,  "batch": 8,  "grad_accum": 6, "warmup": 0.12, "weight_decay": 0.007},
    {"epochs": 12, "lr": 0.000035, "batch": 12, "grad_accum": 1, "warmup": 0.15, "weight_decay": 0.016},
    {"epochs": 12, "lr": 0.00003,  "batch": 16, "grad_accum": 1, "warmup": 0.08, "weight_decay": 0.02},
    {"epochs": 14, "lr": 0.00005,  "batch": 8,  "grad_accum": 4, "warmup": 0.18, "weight_decay": 0.009},
]

total_iters = len(hyperparam_sets)

# Excel setup with gradient accumulation column
wb = Workbook()
ws = wb.active
ws.title = "QA Hyperparameter Results"
ws.append([
    "Iteration", "Epochs", "Learning Rate", "Batch Size", "Grad Accum", "Warmup Ratio", "Weight Decay",
    "Train-Accuracy", "Test-Accuracy",
    "Train-F1", "Test-F1",
    "Train-Precision", "Test-Precision",
    "Train-Recall", "Test-Recall",
    "Runtime (s)"
])

# Main hyperparameter tuning loop
for i, params in enumerate(hyperparam_sets, 1):
    print(f"\n{'='*60}")
    print(f" Iteration {i}/{total_iters}")
    print(f"Params: Epochs={params['epochs']}, LR={params['lr']}, Batch={params['batch']}, GradAccum={params['grad_accum']}")
    print(f"{'='*60}")

    # Reload fresh model for each iteration
    print(" Loading fresh model...")
    model_fresh = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
    model_fresh.to(device)

    # Configure training arguments (T4 optimized)
    training_args = TrainingArguments(
        output_dir=f"./results_run_{i}",
        num_train_epochs=params["epochs"],
        per_device_train_batch_size=params["batch"],
        per_device_eval_batch_size=params["batch"],
        gradient_accumulation_steps=params["grad_accum"],
        learning_rate=params["lr"],
        warmup_ratio=params["warmup"],
        weight_decay=params["weight_decay"],
        eval_strategy="no",
        save_strategy="no",
        logging_dir=f"./logs_run_{i}",
        report_to=[],
        disable_tqdm=False,
        seed=42,
        fp16=torch.cuda.is_available(),
        dataloader_num_workers=0,  # T4 optimization (limited CPU cores)
        dataloader_pin_memory=True,
    )

    # Create trainer
    trainer = Trainer(
        model=model_fresh,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_test,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

    # Train model
    print(" Training...")
    start_time = time.time()
    trainer.train()
    runtime = round(time.time() - start_time, 2)

    # Evaluate on BOTH train and test sets
    print(" Evaluating on training set...")
    train_metrics = evaluate_dataset(trainer, tokenized_train, "TRAIN")

    print(" Evaluating on test set...")
    test_metrics = evaluate_dataset(trainer, tokenized_test, "TEST")

    # Append results to Excel
    ws.append([
        i,
        params["epochs"],
        params["lr"],
        params["batch"],
        params["grad_accum"],
        params["warmup"],
        params["weight_decay"],
        round(train_metrics["accuracy"], 4),
        round(test_metrics["accuracy"], 4),
        round(train_metrics["f1"], 4),
        round(test_metrics["f1"], 4),
        round(train_metrics["precision"], 4),
        round(test_metrics["precision"], 4),
        round(train_metrics["recall"], 4),
        round(test_metrics["recall"], 4),
        runtime
    ])

    print(f"✅ Iteration {i} complete!")
    print(f"   Train - F1: {train_metrics['f1']:.4f}, Acc: {train_metrics['accuracy']:.4f}")
    print(f"   Test  - F1: {test_metrics['f1']:.4f}, Acc: {test_metrics['accuracy']:.4f}")
    print(f"   Runtime: {runtime}s")
    del model_fresh, trainer
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

print("\n" + "=" * 60)
print("FORMATTING EXCEL FILE")
print("=" * 60)
for cell in ws[1]:
    cell.font = Font(bold=True)
    cell.alignment = Alignment(horizontal="center", vertical="center")
# Size each column to its longest cell plus padding, capped at 20 characters
for col in ws.columns:
    col_letter = col[0].column_letter
    max_length = max(len(str(cell.value)) for cell in col)
    ws.column_dimensions[col_letter].width = min(max_length + 2, 20)

# Save Excel file
output_excel = "/content/Salameda_QA_Hyperparameter_Results_TrainTest.xlsx"
wb.save(output_excel)


ENVIRONMENT
Python: 3.12.12 | Platform: Linux-6.6.105+-x86_64-with-glibc2.35
PyTorch: 2.8.0+cu126
GPU: Tesla T4 | CUDA: 12.6
GPU CONFIGURATION
GPU Available: Tesla T4
CUDA Version: 12.6
GPU Memory: 14.74 GB


Saving medquadCardiovascular.csv to medquadCardiovascular (2).csv
Dataset CSV: medquadCardiovascular (2).csv

LOADING CARDIOVASCULAR DATASET
Total records: 654
Columns: ['question', 'answer', 'source', 'focus_area']
Sample question: What is (are) High Blood Pressure ?...
Sample answer chars: 5586
Train: 523 | Test: 131

LOADING MODEL AND TOKENIZER
Model: dmis-lab/biobert-base-cased-v1.1


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at dmis-lab/biobert-base-cased-v1.1 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded.

TOKENIZING DATASET
Tokenizing train...


Tokenizing test...


Done.
Metrics functions ready.

TRAINING CONFIGURATION
Configured. FP16: True

INITIALIZING TRAINER
Trainer ready.

HYPERPARAMETER TUNING (10 CONFIGURATIONS)

 Iteration 1/10
Params: Epochs=5, LR=4e-05, Batch=16, GradAccum=1
 Loading fresh model...


 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


✅ Iteration 1 complete!
   Train - F1: 0.9995, Acc: 0.9832
   Test  - F1: 0.9944, Acc: 0.8824
   Runtime: 106.95s

 Iteration 2/10
Params: Epochs=6, LR=5e-05, Batch=16, GradAccum=1
 Loading fresh model...


 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


✅ Iteration 2 complete!
   Train - F1: 0.9998, Acc: 0.9940
   Test  - F1: 0.9996, Acc: 0.9730
   Runtime: 116.54s

 Iteration 3/10
Params: Epochs=6, LR=3.5e-05, Batch=12, GradAccum=1
 Loading fresh model...


 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


✅ Iteration 3 complete!
   Train - F1: 0.9925, Acc: 0.9142
   Test  - F1: 0.9816, Acc: 0.7966
   Runtime: 120.62s

 Iteration 4/10
Params: Epochs=8, LR=6e-05, Batch=8, GradAccum=4
 Loading fresh model...


 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


✅ Iteration 4 complete!
   Train - F1: 0.9993, Acc: 0.9537
   Test  - F1: 0.9954, Acc: 0.9363
   Runtime: 153.42s

 Iteration 5/10
Params: Epochs=8, LR=4.5e-05, Batch=16, GradAccum=1
 Loading fresh model...


 Training...


Step,Training Loss
500,0.9729


 Evaluating on training set...


 Evaluating on test set...


✅ Iteration 5 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9997, Acc: 0.9951
   Runtime: 151.93s

 Iteration 6/10
Params: Epochs=10, LR=4e-05, Batch=16, GradAccum=1
 Loading fresh model...


 Training...


Step,Training Loss
500,0.8519


 Evaluating on training set...


 Evaluating on test set...


✅ Iteration 6 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 1.0000, Acc: 0.9975
   Runtime: 189.31s

 Iteration 7/10
Params: Epochs=10, LR=7e-05, Batch=8, GradAccum=6
 Loading fresh model...


 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


✅ Iteration 7 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9999, Acc: 0.9951
   Runtime: 187.99s

 Iteration 8/10
Params: Epochs=12, LR=3.5e-05, Batch=12, GradAccum=1
 Loading fresh model...


 Training...


Step,Training Loss
500,1.3103


 Evaluating on training set...


 Evaluating on test set...


✅ Iteration 8 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 1.0000, Acc: 0.9975
   Runtime: 236.19s

 Iteration 9/10
Params: Epochs=12, LR=3e-05, Batch=16, GradAccum=1
 Loading fresh model...


 Training...


Step,Training Loss
500,1.0836


 Evaluating on training set...


 Evaluating on test set...


✅ Iteration 9 complete!
   Train - F1: 1.0000, Acc: 0.9987
   Test  - F1: 0.9993, Acc: 0.9681
   Runtime: 228.68s

 Iteration 10/10
Params: Epochs=14, LR=5e-05, Batch=8, GradAccum=4
 Loading fresh model...


 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


✅ Iteration 10 complete!
   Train - F1: 1.0000, Acc: 0.9966
   Test  - F1: 0.9950, Acc: 0.9044
   Runtime: 268.47s

FORMATTING EXCEL FILE
