# Cardiovascular Medical QA Fine-Tuning (BlueBERT)

This notebook fine-tunes a medical BERT QA model on cardiovascular Q&A data using a Colab GPU.



## Code Structure and Functionality

### 1. Environment Setup and Configuration

This code block prepares the Google Colab environment for fine-tuning the BlueBERT model on cardiovascular question-answering tasks. It defines a utility function `pip_install()` that quietly installs the required Python packages: Transformers (≥4.44.0) for working with pre-trained models, Datasets (≥2.14.0) for data handling, Accelerate (≥0.26.0) for training support, and Evaluate (≥0.4.0) for performance metrics. After installing these dependencies, the code prints the Python version, platform information, and PyTorch version, and checks for GPU availability. If a CUDA-enabled GPU is detected, it displays the GPU name (ideally a T4 on Colab) and the CUDA version. If no GPU is found, it prompts the user to enable GPU acceleration in Colab's runtime settings, which is needed for efficient transformer training.
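As a quick sanity check after installation, the installed versions can be read back with the standard library. This is a minimal sketch rather than part of the notebook: `installed_version` is a hypothetical helper, and it only reports versions instead of enforcing the minimums listed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the pip_install() call in this notebook.
for name in ["transformers", "datasets", "accelerate", "evaluate"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```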

### 2. Imports and GPU Configuration

This section imports the libraries used throughout the pipeline: PyTorch for deep learning, NumPy for numerical operations, Pandas for data manipulation, and the Transformers components `AutoTokenizer` and `AutoModelForQuestionAnswering` for loading pre-trained models. It also imports `TrainingArguments`, `Trainer`, and `default_data_collator` to drive fine-tuning, along with the `Dataset` class for data handling. Warning messages are suppressed for cleaner output. The GPU configuration logic then checks whether CUDA is available and creates the appropriate device object (GPU, or CPU as a fallback). For GPU setups, it displays the GPU model name, CUDA version, and total GPU memory in gigabytes, which helps when choosing batch sizes for the T4's 16 GB of VRAM (the run logged in this notebook reports 14.74 GB).

### 3. Dataset Loading Options

This block provides flexible methods for loading the cardiovascular QA dataset in Google Colab. Users can toggle between two approaches via boolean flags: **Option A** mounts Google Drive to access a CSV file stored there (requiring the user to specify the Drive path), while **Option B** uses Colab's file upload interface to let users upload a file directly from their local machine. The code defaults to the upload option for simplicity. After attempting the selected method, it validates that a valid CSV path was obtained; if not, it raises a `ValueError` to alert the user that the dataset must be provided before proceeding. This dual-option design accommodates different user workflows and data storage preferences in the Colab environment.

### 4. Data Loading and Preprocessing

Once the CSV path is established, this section loads the cardiovascular medical QA dataset using Pandas and performs basic preprocessing. It first reads the CSV file and displays diagnostic information: total record count, column names, a preview of a sample question (first 80 characters), and the character length of a sample answer. The code then removes any rows with null values in the `question` or `answer` columns. Next, it shuffles the dataset with a fixed random seed (42) for reproducibility and splits it into training and validation sets using an 85/15 ratio, a common split that keeps most of the data for training while reserving enough examples for meaningful validation. With the 654 records in the logged run, this yields 555 training rows and 99 validation rows.
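The split arithmetic can be checked with a small standard-library sketch. It mirrors the shuffle-then-slice logic on row indices only; `shuffled_split` is a hypothetical helper, and its permutation will not match the one Pandas produces for the same seed.

```python
import random


def shuffled_split(n_rows, train_frac=0.85, seed=42):
    """Shuffle row indices with a fixed seed, then slice into train and validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # int() truncates, so any remainder goes to the validation set
    split_idx = int(n_rows * train_frac)
    return indices[:split_idx], indices[split_idx:]


train_idx, val_idx = shuffled_split(654)
print(len(train_idx), len(val_idx))  # 555 99
```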

### 5. Model and Tokenizer Initialization

This block initializes the pre-trained BlueBERT model specifically fine-tuned for medical question answering. It loads the `"aaditya/Bluebert_emrqa"` model from the Hugging Face Hub, which is a BERT variant pre-trained on biomedical literature and further adapted for extractive question answering in electronic medical records (EMR). This specialized pre-training gives BlueBERT a strong foundation in medical terminology and QA patterns specific to clinical text. Both the tokenizer and model are loaded using Auto classes, which automatically handle the correct architecture and configuration. The tokenizer is initialized with fast tokenizer mode enabled for improved performance. Once loaded, the model is transferred to the appropriate device (GPU if available, otherwise CPU) to leverage hardware acceleration for all subsequent operations.

### 6. Tokenization and Feature Preparation

This section transforms raw text into the numerical inputs that the BlueBERT model expects and prepares labels for extractive question answering. The code converts the Pandas DataFrames into Hugging Face `Dataset` objects and defines two key parameters: a maximum sequence length of 384 tokens and a document stride of 128 tokens for handling long contexts. The `prepare_train_features()` function tokenizes each question together with its answer, using the answer text as the context (second sequence) and truncating only that portion (`truncation="only_second"`) to fit within the maximum length. Answers that exceed the limit overflow into additional features that overlap by the stride, so one long answer yields several training examples. For extractive QA, the model must predict the start and end token positions of the answer span. Because the dataset has no annotated spans inside a separate passage, the function derives the labels heuristically: it finds the tokens that belong to the context (`sequence_id == 1`), sets the start position to the first context token, and sets the end position to 50 tokens later or to the last context token, whichever comes first. If a feature has no context tokens, both positions default to 0. One consequence is that the start label is always the first context token, which is why start accuracy reaches 1.0 in the logged runs. The function removes `offset_mapping` before returning, since the model does not consume it. Tokenization is applied to both the training and validation datasets in batched mode for efficiency.
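To make the overflow behaviour concrete, here is a simplified, tokenizer-free sketch of how a long context is cut into overlapping windows. It is an illustration under stated assumptions, not the tokenizer's actual implementation: `context_windows` is a hypothetical helper, the 10-token question and 1000-token context are made-up numbers, and `capacity` stands for the context budget left after the question and the three BERT special tokens.

```python
def context_windows(n_context_tokens, capacity, overlap):
    """Return (start, end) context-token index pairs, end exclusive, one per feature."""
    if overlap >= capacity:
        raise ValueError("overlap must be smaller than capacity")
    windows = []
    start = 0
    while True:
        end = min(start + capacity, n_context_tokens)
        windows.append((start, end))
        if end >= n_context_tokens:
            break
        # Step forward so that `overlap` tokens are shared with the previous window
        start += capacity - overlap
    return windows


MAX_LENGTH = 384
DOC_STRIDE = 128
question_tokens = 10  # hypothetical question length
special_tokens = 3    # [CLS] plus two [SEP] tokens in a BERT question/context pair
capacity = MAX_LENGTH - question_tokens - special_tokens  # 371 context tokens per feature

for window in context_windows(1000, capacity, DOC_STRIDE):
    print(window)
# (0, 371)
# (243, 614)
# (486, 857)
# (729, 1000)
```

Each window starts 243 tokens after the previous one (371 minus the 128-token overlap), so a 1000-token answer becomes four features in this sketch.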

### 7. Evaluation Metrics Definition

This section defines custom evaluation metrics for question answering. The `compute_qa_metrics()` function receives predictions and labels from the model, extracts the start and end logits for each example, and converts them to predicted token positions using argmax. It calculates four metrics: **exact match** (the fraction of examples where both start and end positions are predicted correctly), **start accuracy** (correctness of the start position), **end accuracy** (correctness of the end position), and **token-level F1 score**. The F1 calculation treats the predicted and true answer spans as sets of token positions and computes precision and recall from their overlap. A span is empty when its end index precedes its start index: if both spans are empty, F1 is 1.0; if only one is empty, F1 is 0.0. For overlapping spans, the size of the intersection gives precision (overlap / predicted tokens) and recall (overlap / true tokens), which are combined into their harmonic mean. All values are returned as fractions between 0 and 1. Together, these metrics show how accurately the model locates answer spans.
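A worked example helps explain why F1 can stay high while exact match is low. The sketch below reimplements the per-example span F1 described above as a standalone function; `span_f1` is a hypothetical name and the token positions are made up for illustration.

```python
def span_f1(pred_start, pred_end, true_start, true_end):
    """Token-level F1 between two inclusive index spans."""
    pred_tokens = set(range(pred_start, pred_end + 1))
    true_tokens = set(range(true_start, true_end + 1))
    if not pred_tokens and not true_tokens:
        return 1.0
    if not pred_tokens or not true_tokens:
        return 0.0
    common = len(pred_tokens & true_tokens)
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# Correct start, but the predicted end is 22 tokens too early:
# 29 predicted tokens, 51 true tokens, 29 in common.
print(round(span_f1(12, 40, 12, 62), 3))  # 0.725
```

Here precision is 1.0 and recall is 29/51, so the example scores 0.725 on F1 even though it counts as a miss for exact match and end accuracy.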

### 8. Training Configuration (T4 GPU Optimized)

This block configures the hyperparameters and training settings for Google Colab's T4 GPU (16 GB VRAM). The `TrainingArguments` object is set up to train quickly while avoiding out-of-memory errors. Key settings include: **per-device batch size of 16**, **gradient accumulation of 2 steps** (an effective batch size of 32 for more stable gradients), **3 epochs**, **learning rate of 3e-5** with a linear schedule, **warmup ratio of 10%**, **weight decay of 0.01**, gradient clipping at a norm of 1.0, and **FP16 mixed-precision training** whenever CUDA is available, which typically speeds up training and lowers memory use on the T4. The configuration evaluates and checkpoints at the end of each epoch, keeps at most 2 checkpoints on disk (`save_total_limit=2`), reloads the checkpoint with the best F1 score when training ends, and logs every 50 steps. No optimizer is set explicitly, so the Transformers default (an AdamW variant) is used. These settings balance training speed, memory efficiency, and model performance for the cardiovascular QA task.
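The relationship between batch size, gradient accumulation, and warmup can be sketched with simple arithmetic. This is an approximation for a single GPU, not the Trainer's exact bookkeeping (its rounding may differ between versions); `schedule_summary` is a hypothetical helper and the feature count of 1056 is a made-up number, since the real count depends on how many overflow features tokenization produces.

```python
import math


def schedule_summary(num_features, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_features / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


summary = schedule_summary(
    num_features=1056,   # hypothetical number of tokenized training features
    per_device_batch=16,
    grad_accum=2,
    epochs=3,
    warmup_ratio=0.1,
)
print(summary)
# {'effective_batch': 32, 'steps_per_epoch': 33, 'total_steps': 99, 'warmup_steps': 10}
```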

### 9. Trainer Initialization

This section instantiates the Hugging Face `Trainer` object, which orchestrates the training and evaluation pipeline. The Trainer combines the BlueBERT model, training arguments, tokenized datasets, tokenizer, data collator, and custom metrics function into a single training system. The `default_data_collator` stacks the tokenized examples into batch tensors; no dynamic padding is needed because every feature was already padded to `max_length` during tokenization. Because both training and evaluation datasets are provided, the Trainer runs validation at the intervals specified in the training arguments. The `compute_metrics` function adds the custom QA metrics to each evaluation. This high-level abstraction handles gradient computation, backpropagation, optimizer steps, learning rate scheduling, mixed-precision training, logging, checkpointing, and distributed training if multiple GPUs are available, which would otherwise require hundreds of lines of custom code.

### 10. Hyperparameter Tuning Experiment (T4 GPU Optimized - 10 Iterations)

This block runs a hyperparameter search across 10 hand-picked configurations sized for the T4's 16 GB memory limit. Nine configurations use a **per-device batch size of 16** and one uses **8**; the search loop does not set gradient accumulation, so the effective batch size equals the per-device batch size. The experiment varies **epochs** (10, 12, or 15) and **learning rates** (1e-5 to 5e-5), while keeping the **warmup ratio** at 0.1 and **weight decay** at 0.01. For each configuration, the code **reloads a fresh BlueBERT model** so that every run starts from the same pre-trained weights and no run inherits updates from a previous one. Each iteration trains with checkpoint saving disabled, measures runtime, evaluates on the validation set, and computes precision and recall from the token-level overlap between predicted and true answer spans. The Accuracy value recorded for each run is the mean of exact match, start accuracy, and end accuracy. All results are collected in an Excel workbook with bold, centered headers and auto-adjusted column widths so the runs can be compared side by side.
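In the tuning code, the spreadsheet's Accuracy column is computed as the mean of exact match, start accuracy, and end accuracy. The sketch below applies that formula to the epoch 1 evaluation values logged for the first configuration; `composite_accuracy` is a hypothetical name for the inline expression used in the loop.

```python
def composite_accuracy(exact_match, start_accuracy, end_accuracy):
    """Mean of the three position-accuracy metrics, as written to the Accuracy column."""
    return (exact_match + start_accuracy + end_accuracy) / 3


# Values copied from the logged epoch 1 evaluation of iteration 1.
logged = {
    "eval_exact_match": 0.15656565656565657,
    "eval_start_accuracy": 1.0,
    "eval_end_accuracy": 0.15656565656565657,
}

score = composite_accuracy(
    logged["eval_exact_match"],
    logged["eval_start_accuracy"],
    logged["eval_end_accuracy"],
)
print(round(score, 4))  # 0.4377
```

Because start accuracy is 1.0 in these logs, the composite never drops below one third, so exact match and F1 are the more informative columns when comparing runs.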

### 11. Final Training and Evaluation

This section runs the final training with the configuration from section 8 and evaluates the result. The code calls `trainer.train()` to start fine-tuning, which runs for 3 epochs with the section 8 settings (batch size 16, gradient accumulation 2, learning rate 3e-5, FP16 enabled on GPU). This requires `trainer` to refer to the Trainer built in section 9; a tuning loop that reassigns `trainer` would make this step retrain the last tuning configuration instead. During training, the Trainer handles the forward pass, loss calculation, backward pass, gradient accumulation, optimizer steps, learning rate scheduling, and evaluation at the end of each epoch. After training completes, the code prints the final training loss. The evaluation step then runs `trainer.evaluate()` on the validation set to compute exact match, start accuracy, end accuracy, and F1 score. These metrics summarize how well the fine-tuned BlueBERT model locates answer spans on the cardiovascular question-answering task. They are printed in sorted key order so they are easy to scan.

### 12. Model Saving and Persistence

This final section saves the trained model for later use and deployment. The code uses the Trainer's `save_model()` method, followed by `tokenizer.save_pretrained()`, to write the fine-tuned model weights, configuration, and tokenizer to a local directory (`./fine_tuned_cardio_qa_model`) on the Colab runtime. The saved components are the model weights (`model.safetensors` in recent Transformers releases, `pytorch_model.bin` in older ones), the model configuration (`config.json`), and the tokenizer files (`tokenizer_config.json`, `vocab.txt`, `special_tokens_map.json`). Together they form a self-contained model package that can be loaded later for inference or further fine-tuning using standard Hugging Face methods.

The section also provides an optional way to copy the saved model to Google Drive. Colab runtime storage is temporary and is erased when the session ends, so the Drive backup (controlled by the `SAVE_TO_DRIVE` flag) keeps the model accessible beyond the current session. When enabled, the code mounts Google Drive (if not already mounted), removes any existing model directory at the destination path, and copies the entire model directory to `/content/drive/MyDrive/fine_tuned_cardio_qa_model`. The result is a local copy for immediate use and a Drive copy for long-term storage or sharing.
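Before copying the directory to Drive, it can be worth confirming that the expected files were actually written. The following is a minimal sketch using only the standard library: `missing_model_files` is a hypothetical helper, and the file names are the ones listed in this section, which may vary with the installed Transformers version.

```python
from pathlib import Path

# File names taken from the description above; treat them as assumptions.
CONFIG_AND_TOKENIZER_FILES = [
    "config.json",
    "tokenizer_config.json",
    "vocab.txt",
    "special_tokens_map.json",
]
WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin"]  # either one is enough


def missing_model_files(model_dir):
    """Return the names of expected files that are absent from `model_dir`."""
    root = Path(model_dir)
    missing = [name for name in CONFIG_AND_TOKENIZER_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append("model.safetensors or pytorch_model.bin")
    return missing


print(missing_model_files("./fine_tuned_cardio_qa_model"))
```

An empty list means every expected file is present; a directory that does not exist simply reports all files as missing.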

In [None]:
# 1) Environment setup (Colab)
import sys
import subprocess

def pip_install(packages):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + packages)

# Core ML stack
pip_install([
    "transformers>=4.44.0",
    "datasets>=2.14.0",
    "accelerate>=0.26.0",
    "evaluate>=0.4.0",
])

# Colab-specific checks
try:
    import torch
    import platform
    print("=" * 60)
    print("ENVIRONMENT")
    print("=" * 60)
    print(f"Python: {sys.version.split()[0]} | Platform: {platform.platform()}")
    print(f"PyTorch: {torch.__version__}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)} | CUDA: {torch.version.cuda}")
    else:
        print("GPU not detected. Enable a GPU in Runtime > Change runtime type > T4/other.")
    print("=" * 60)
except Exception as e:
    print("Environment check failed:", e)

# 2) Imports and GPU config
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer
from transformers import default_data_collator
from datasets import Dataset
import warnings
warnings.filterwarnings('ignore')

print("=" * 60)
print("GPU CONFIGURATION")
print("=" * 60)
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"GPU Memory: {round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 2)} GB")
else:
    device = torch.device("cpu")
    print("GPU not available; training will be slower.")
print("=" * 60)

# 3) Dataset loading (Cardiovascular QA)
# Option A: Mount Drive
USE_DRIVE = False  # set True to use Drive
CSV_PATH = ""       # e.g., "/content/drive/MyDrive/medquadCardiovascular.csv"

if USE_DRIVE:
    from google.colab import drive  # type: ignore
    drive.mount('/content/drive')

# Option B: Upload a file
USE_UPLOAD = not USE_DRIVE
if USE_UPLOAD:
    try:
        from google.colab import files  # type: ignore
        uploaded = files.upload()
        # Pick the first uploaded file
        if uploaded:
            CSV_PATH = list(uploaded.keys())[0]
    except Exception:
        pass

if not CSV_PATH:
    # Fallback sample: you can place the CSV at a public URL and download it
    # For now, raise an error to prompt the user.
    raise ValueError("Please provide CSV_PATH via Drive or upload.")

print("Dataset CSV:", CSV_PATH)

# 4) Data Loading and Preprocessing
print("\n" + "=" * 60)
print("LOADING CARDIOVASCULAR DATASET")
print("=" * 60)

dataset = pd.read_csv(CSV_PATH)
print(f"Total records: {len(dataset)}")
print(f"Columns: {list(dataset.columns)}")
print(f"Sample question: {str(dataset.iloc[0]['question'])[:80]}...")
print(f"Sample answer chars: {len(str(dataset.iloc[0]['answer']))}")

# Drop nulls
dataset = dataset.dropna(subset=["question", "answer"]).reset_index(drop=True)

# Train/val split (85/15)
dataset_shuffled = dataset.sample(frac=1.0, random_state=42).reset_index(drop=True)
split_idx = int(len(dataset_shuffled) * 0.85)
train_data = dataset_shuffled.iloc[:split_idx].copy()
eval_data = dataset_shuffled.iloc[split_idx:].copy()

print(f"Train: {len(train_data)} | Val: {len(eval_data)}")
print("=" * 60)

# 5) Model and tokenizer
print("\n" + "=" * 60)
print("LOADING MODEL AND TOKENIZER")
print("=" * 60)

MODEL_NAME = "aaditya/Bluebert_emrqa"
print("Model:", MODEL_NAME)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
model.to(device)

print("Model loaded.")
print("=" * 60)


# 6) Tokenization and Feature Preparation
print("\n" + "=" * 60)
print("TOKENIZING DATASET")
print("=" * 60)

train_ds = Dataset.from_pandas(train_data)
eval_ds = Dataset.from_pandas(eval_data)

MAX_LENGTH = 384
DOC_STRIDE = 128

def prepare_train_features(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["answer"],
        truncation="only_second",
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(tokenized["offset_mapping"]):
        sequence_ids = tokenized.sequence_ids(i)

        context_start = None
        context_end = None
        for idx, seq_id in enumerate(sequence_ids):
            if seq_id == 1:
                if context_start is None:
                    context_start = idx
                context_end = idx

        if context_start is None:
            start_positions.append(0)
            end_positions.append(0)
        else:
            answer_start = context_start
            answer_end = min(context_start + 50, context_end)
            start_positions.append(answer_start)
            end_positions.append(answer_end)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions

    # Drop offset_mapping so it isn't fed to the model
    if "offset_mapping" in tokenized:
        tokenized.pop("offset_mapping")

    return tokenized

print("Tokenizing train...")
tokenized_train = train_ds.map(
    prepare_train_features,
    batched=True,
    remove_columns=train_ds.column_names,
    desc="Tokenizing train",
)

print("Tokenizing eval...")
tokenized_eval = eval_ds.map(
    prepare_train_features,
    batched=True,
    remove_columns=eval_ds.column_names,
    desc="Tokenizing eval",
)

print("Done.")
print("=" * 60)

# 7) Evaluation Metrics
import numpy as np

def compute_qa_metrics(eval_pred):
    predictions, label_ids = eval_pred
    start_logits, end_logits = predictions

    pred_starts = np.argmax(start_logits, axis=1)
    pred_ends = np.argmax(end_logits, axis=1)

    # Ensure label_ids is treated as a tuple
    true_starts = np.asarray(label_ids[0]).reshape(-1)
    true_ends = np.asarray(label_ids[1]).reshape(-1)

    exact_match = np.mean((pred_starts == true_starts) & (pred_ends == true_ends))
    start_accuracy = np.mean(pred_starts == true_starts)
    end_accuracy = np.mean(pred_ends == true_ends)

    f1_scores = []
    for ps, pe, ts, te in zip(pred_starts, pred_ends, true_starts, true_ends):
        ps, pe, ts, te = int(ps), int(pe), int(ts), int(te)
        pred_tokens = set(range(ps, pe + 1))
        true_tokens = set(range(ts, te + 1))
        if not pred_tokens and not true_tokens:
            f1_scores.append(1.0)
        elif not pred_tokens or not true_tokens:
            f1_scores.append(0.0)
        else:
            common = len(pred_tokens & true_tokens)
            if common == 0:
                f1_scores.append(0.0)
            else:
                precision = common / len(pred_tokens)
                recall = common / len(true_tokens)
                f1_scores.append(2 * (precision * recall) / (precision + recall))

    return {
        "exact_match": float(exact_match),
        "start_accuracy": float(start_accuracy),
        "end_accuracy": float(end_accuracy),
        "f1": float(np.mean(f1_scores)) if f1_scores else 0.0,
    }

print("Metrics ready.")

# 8) Training configuration (Colab-friendly)
print("\n" + "=" * 60)
print("TRAINING CONFIGURATION")
print("=" * 60)

training_args = TrainingArguments(
    output_dir="./results_cardio_qa",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    fp16=torch.cuda.is_available(),
    dataloader_pin_memory=True,
    dataloader_num_workers=2,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    logging_dir="./logs_cardio_qa",
    logging_steps=50,
    logging_strategy="steps",
    report_to=[],
    seed=42,
    disable_tqdm=False,
    remove_unused_columns=True,
)

print("Configured. FP16:", training_args.fp16)
print("=" * 60)

# 9) Initialize Trainer
print("\n" + "=" * 60)
print("INITIALIZING TRAINER")
print("=" * 60)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_qa_metrics,
)

print("Trainer ready.")

# 10) Hyperparameter tuning
import time
from openpyxl import Workbook

print("\n" + "=" * 60)
print("HYPERPARAMETER TUNING (10 ITERATIONS)")
print("=" * 60)

# Define different sets of hyperparameters with REASONABLE values for fine-tuning
hyperparam_sets = [
    {"epochs": 10, "lr": 3e-5,  "batch": 16, "warmup": 0.1,  "weight_decay": 0.01},
    {"epochs": 12, "lr": 3e-5,  "batch": 16, "warmup": 0.1,  "weight_decay": 0.01},
    {"epochs": 10, "lr": 2e-5,  "batch": 16, "warmup": 0.1,  "weight_decay": 0.01},
    {"epochs": 12, "lr": 2e-5,  "batch": 16, "warmup": 0.1,  "weight_decay": 0.01},
    {"epochs": 15, "lr": 1e-5,  "batch": 16, "warmup": 0.1,  "weight_decay": 0.01},
    {"epochs": 15, "lr": 1.5e-5, "batch": 16, "warmup": 0.1,  "weight_decay": 0.01},
    {"epochs": 15, "lr": 2.5e-5, "batch": 16, "warmup": 0.1,  "weight_decay": 0.01},
    {"epochs": 15, "lr": 4e-5,  "batch": 16, "warmup": 0.1,  "weight_decay": 0.01},
    {"epochs": 15, "lr": 5e-5,  "batch": 16, "warmup": 0.1,  "weight_decay": 0.01},
    {"epochs": 12, "lr": 3e-5,  "batch": 8,  "warmup": 0.1,  "weight_decay": 0.01},
]

total_iters = len(hyperparam_sets)

# Excel setup
wb = Workbook()
ws = wb.active
ws.title = "QA Hyperparameter Results"
ws.append([
    "Iteration", "Epochs", "Learning Rate", "Batch Size", "Warmup Ratio", "Weight Decay",
    "Accuracy", "F1-Score", "Precision", "Recall", "Runtime (s)"
])

# Loop through each configuration
for i, params in enumerate(hyperparam_sets, 1):
    print(f"\n{'='*60}")
    print(f"▶️ Training Iteration {i}/{total_iters}")
    print(f"Params: {params}")
    print(f"{'='*60}")

    # CRITICAL: Reload the model for each iteration so runs do not share weights
    print("🔄 Reloading fresh model...")
    model_fresh = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
    model_fresh.to(device)

    # Per-run arguments; named run_args so the section 8 training_args stay intact
    run_args = TrainingArguments(
        output_dir=f"./results_run_{i}",
        num_train_epochs=params["epochs"],
        per_device_train_batch_size=params["batch"],
        per_device_eval_batch_size=params["batch"],
        learning_rate=params["lr"],
        warmup_ratio=params["warmup"],
        weight_decay=params["weight_decay"],
        eval_strategy="epoch",
        save_strategy="no",
        logging_dir=f"./logs_run_{i}",
        report_to=[],
        disable_tqdm=True,
        seed=42,
        fp16=torch.cuda.is_available(),
    )

    # Per-run Trainer; named run_trainer so the section 9 trainer is not overwritten
    run_trainer = Trainer(
        model=model_fresh,  # Use fresh model instance
        args=run_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
        compute_metrics=compute_qa_metrics,
    )

    start_time = time.time()
    run_trainer.train()
    runtime = round(time.time() - start_time, 2)

    eval_results = run_trainer.evaluate()

    # Extract basic metrics (handle 'eval_' prefix from HF evaluate)
    exact_match = eval_results.get("eval_exact_match", eval_results.get("exact_match", 0))
    start_acc = eval_results.get("eval_start_accuracy", eval_results.get("start_accuracy", 0))
    end_acc = eval_results.get("eval_end_accuracy", eval_results.get("end_accuracy", 0))
    # "Accuracy" here is the mean of the three position metrics, not a standard QA metric
    accuracy = (exact_match + start_acc + end_acc) / 3
    f1 = eval_results.get("eval_f1", eval_results.get("f1", 0))

    # Calculate precision and recall from token overlap
    predictions = run_trainer.predict(tokenized_eval)
    pred_starts = np.argmax(predictions.predictions[0], axis=1)
    pred_ends = np.argmax(predictions.predictions[1], axis=1)

    true_starts = np.asarray(predictions.label_ids[0]).reshape(-1)
    true_ends = np.asarray(predictions.label_ids[1]).reshape(-1)

    precision_scores = []
    recall_scores = []

    for ps, pe, ts, te in zip(pred_starts, pred_ends, true_starts, true_ends):
        ps, pe, ts, te = int(ps), int(pe), int(ts), int(te)
        pred_tokens = set(range(ps, pe + 1))
        true_tokens = set(range(ts, te + 1))

        if not pred_tokens and not true_tokens:
            precision_scores.append(1.0)
            recall_scores.append(1.0)
        elif not pred_tokens or not true_tokens:
            precision_scores.append(0.0)
            recall_scores.append(0.0)
        else:
            common = len(pred_tokens & true_tokens)
            if common == 0:
                precision_scores.append(0.0)
                recall_scores.append(0.0)
            else:
                precision = common / len(pred_tokens)
                recall = common / len(true_tokens)
                precision_scores.append(precision)
                recall_scores.append(recall)

    precision = float(np.mean(precision_scores))
    recall = float(np.mean(recall_scores))

    ws.append([
        i, params["epochs"], params["lr"], params["batch"], params["warmup"], params["weight_decay"],
        round(accuracy, 4), round(f1, 4), round(precision, 4), round(recall, 4), runtime
    ])


    print(f"✅ Iteration {i} done - F1: {f1:.4f}, Accuracy: {accuracy:.4f}, Time: {runtime}s")

# ==========================
# ✅ Save All Results in One Excel File
# ==========================
from openpyxl.styles import Font, Alignment

# Auto-format headers for readability
for cell in ws[1]:
    cell.font = Font(bold=True)
    cell.alignment = Alignment(horizontal="center", vertical="center")

# Adjust column widths (optional aesthetic)
for col in ws.columns:
    max_length = 0
    col_letter = col[0].column_letter
    for cell in col:
        # str() also handles empty cells (None), so no try/except is needed
        max_length = max(max_length, len(str(cell.value)))
    adjusted_width = (max_length + 2)
    ws.column_dimensions[col_letter].width = adjusted_width

# Save Excel file
output_excel = "/content/QA_Hyperparameter_Results_All.xlsx"
wb.save(output_excel)

print(f"\n✅ All {total_iters} runs completed successfully!")
print("📊 Final results saved in one Excel file:")
print(f"➡️ {output_excel}")

# 11) Final training
print("\n" + "=" * 60)
print("STARTING TRAINING")
print("=" * 60)
print("\n🚀 Training in progress...\n")

train_result = trainer.train()

print("\n" + "=" * 60)
print("TRAINING COMPLETED")
print("=" * 60)
print("Training Loss:", getattr(train_result, "training_loss", None))

# 11) Final evaluation
print("\n" + "=" * 60)
print("FINAL EVALUATION")
print("=" * 60)

eval_results = trainer.evaluate()
for k, v in sorted(eval_results.items()):
    print(f"{k}: {v}")

ENVIRONMENT
Python: 3.12.12 | Platform: Linux-6.6.105+-x86_64-with-glibc2.35
PyTorch: 2.8.0+cu126
GPU: Tesla T4 | CUDA: 12.6
GPU CONFIGURATION
GPU Available: Tesla T4
CUDA Version: 12.6
GPU Memory: 14.74 GB


Saving medquadCardiovascular.csv to medquadCardiovascular.csv
Dataset CSV: medquadCardiovascular.csv

LOADING CARDIOVASCULAR DATASET
Total records: 654
Columns: ['question', 'answer', 'source', 'focus_area']
Sample question: What is (are) High Blood Pressure ?...
Sample answer chars: 5586
Train: 555 | Val: 99

LOADING MODEL AND TOKENIZER
Model: aaditya/Bluebert_emrqa
Model loaded.

TOKENIZING DATASET
Tokenizing train...
Tokenizing eval...
Done.
Metrics ready.

TRAINING CONFIGURATION
Configured. FP16: True

INITIALIZING TRAINER
Trainer ready.

HYPERPARAMETER TUNING (30 ITERATIONS)

▶️ Training Iteration 1/10
Params: {'epochs': 10, 'lr': 3e-05, 'batch': 16, 'warmup': 0.1, 'weight_decay': 0.01}
🔄 Reloading fresh model...

{'eval_loss': 1.3546062707901, 'eval_exact_match': 0.15656565656565657, 'eval_start_accuracy': 1.0, 'eval_end_accuracy': 0.15656565656565657, 'eval_f1': 0.9706429539363233, 'eval_runtime': 1.0246, 'eval_samples_per_second': 193.239, 'eval_steps_per_second': 12.687, 'epoch': 1.0}
{'eval_loss': 0.82837975025177, 'eval_exact_match': 0.29797979797979796, 'eval_start_accuracy': 1.0, 'eval_end_accuracy': 0.29797979797979796, 'eval_f1': 0.9794160253862033, 'eval_runtime': 1.0506, 'eval_samples_per_second': 188.459, 'eval_steps_per_second': 12.374, 'epoch': 2.0}
{'eval_loss': 0.5316691398620605, 'eval_exact_match': 0.5555555555555556, 'eval_start_accuracy': 1.0, 'eval_end_accuracy': 0.5555555555555556, 'eval_f1': 0.9869422455324685, 'eval_runtime': 1.0722, 'eval_samples_per_second': 184.674, 'eval_steps_per_second': 12.125, 'epoch': 3.0}
{'eval_loss': 0.30997493863105774, 'eval_exact_match': 0.7929292929292929, 'eval_start_accuracy': 1.0, 'eval_end_accuracy': 0.7929292929292929, 'eval_f1': 0.9

In [None]:
# 12) Save the trained model
print("\n" + "=" * 60)
print("SAVING TRAINED MODEL")
print("=" * 60)

# Define output directory
output_model_dir = "./fine_tuned_cardio_qa_model"

# Save the model and tokenizer
trainer.save_model(output_model_dir)
tokenizer.save_pretrained(output_model_dir)

print(f"✅ Model saved to: {output_model_dir}")
print("📦 Saved components:")
print("   - Model weights: model.safetensors (pytorch_model.bin on older Transformers versions)")
print("   - Model config: config.json")
print("   - Tokenizer files: tokenizer_config.json, vocab.txt, etc.")
print("=" * 60)

# Optional: Save to Google Drive for persistence
SAVE_TO_DRIVE = False  # Set to True if you want to save to Drive

if SAVE_TO_DRIVE:
    try:
        import os  # needed for the path checks below
        import shutil
        from google.colab import drive  # type: ignore

        # Mount Drive if not already mounted
        if not os.path.exists('/content/drive'):
            drive.mount('/content/drive')

        # Define Drive destination
        drive_model_dir = "/content/drive/MyDrive/fine_tuned_cardio_qa_model"

        # Copy model to Drive, replacing any earlier copy
        print("\n📤 Copying model to Google Drive...")
        if os.path.exists(drive_model_dir):
            shutil.rmtree(drive_model_dir)
        shutil.copytree(output_model_dir, drive_model_dir)

        print(f"✅ Model also saved to Google Drive: {drive_model_dir}")
        print("💾 Your model will persist even after the Colab session ends!")
    except Exception as e:
        print(f"⚠️ Could not save to Google Drive: {e}")

print("\n" + "=" * 60)
print("🎉 TRAINING PIPELINE COMPLETE!")
print("=" * 60)


SAVING TRAINED MODEL
✅ Model saved to: ./fine_tuned_cardio_qa_model
📦 Saved components:
   - Model weights: pytorch_model.bin
   - Model config: config.json
   - Tokenizer files: tokenizer_config.json, vocab.txt, etc.

🎉 TRAINING PIPELINE COMPLETE!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
