# Cardiovascular Medical QA Fine-Tuning (BlueBERT) - Random Grid Search

This notebook fine-tunes a medical BERT QA model on cardiovascular Q&A data using a Colab GPU with **Random Grid Search** for hyperparameter optimization.


### 1. Environment Setup and Configuration

This section prepares the computational environment for fine-tuning the cardiovascular QA model. The setup process begins by defining a utility function to install essential Python packages using pip, including the Transformers library (version 4.44.0 or higher) for working with pre-trained models, Datasets for efficient data handling, Accelerate for optimized training, and Evaluate for model performance metrics. After installing these core dependencies, the code performs comprehensive environment checks to verify the Python version, platform details, and PyTorch installation. Most importantly, it detects whether a GPU is available for training and displays relevant information such as the GPU model name and CUDA version. If no GPU is detected, the system alerts the user to enable GPU acceleration in Google Colab's runtime settings, as GPU support is crucial for efficient training of transformer models. This initial setup ensures that all necessary dependencies are in place and that the hardware configuration is optimal for the computationally intensive task of fine-tuning a BERT-based question answering model on medical data.


In [None]:
# 1) Environment setup (Colab)
import sys
import subprocess

def pip_install(packages):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + packages)

pip_install([
    "transformers>=4.44.0",
    "datasets>=2.14.0",
    "accelerate>=0.26.0",
    "evaluate>=0.4.0",
])
try:
    import torch
    import platform
    print("=" * 60)
    print("ENVIRONMENT")
    print("=" * 60)
    print(f"Python: {sys.version.split()[0]} | Platform: {platform.platform()}")
    print(f"PyTorch: {torch.__version__}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)} | CUDA: {torch.version.cuda}")
    else:
        print("GPU not detected. Enable a GPU in Runtime > Change runtime type > T4/other.")
    print("=" * 60)
except Exception as e:
    print("Environment check failed:", e)


ENVIRONMENT
Python: 3.12.12 | Platform: Linux-6.6.105+-x86_64-with-glibc2.35
PyTorch: 2.8.0+cu126
GPU: Tesla T4 | CUDA: 12.6
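
The `>=` constraints passed to pip can also be checked after installation. The following is a minimal sketch using only the standard library. The helper names `version_tuple`, `meets_minimum`, and `check_installed` are illustrative and not part of the notebook, and the simplified parser ignores pre-release suffixes.

```python
import re
from importlib import metadata

# Minimum versions taken from the pip_install call above.
MINIMUMS = {
    "transformers": "4.44.0",
    "datasets": "2.14.0",
    "accelerate": "0.26.0",
    "evaluate": "0.4.0",
}

def version_tuple(version):
    """Parse the leading numeric components of a version string."""
    core = version.split("+", 1)[0]  # drop local tags such as "+cu126"
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum` (numeric parts only)."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so "2.14" compares equal to "2.14.0"
    b += (0,) * (width - len(b))
    return a >= b

def check_installed(minimums=MINIMUMS):
    """Map each package name to True/False; missing packages count as False."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = False
    return report

for package, ok in check_installed().items():
    print(f"{package}: {'OK' if ok else 'missing or too old'}")
```

Running this right after the install cell gives a quick pass/fail view per package without importing the heavy libraries themselves.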


### 2. Imports and GPU Configuration

Following the initial setup, the code imports all necessary libraries for the machine learning pipeline, including PyTorch for deep learning operations, NumPy for numerical computations, Pandas for data manipulation, and specific modules from the Transformers library such as AutoTokenizer and AutoModelForQuestionAnswering for handling pre-trained models. The code also imports training utilities like TrainingArguments, Trainer, and default_data_collator to streamline the fine-tuning process, along with the Datasets library for efficient data handling. Warning messages are suppressed to keep the output clean. The GPU configuration section then performs a detailed check to determine if CUDA-enabled GPU hardware is available. If a GPU is detected, the code creates a CUDA device object and displays comprehensive information including the GPU model name, CUDA version, and total available GPU memory in gigabytes. This information is critical for understanding the computational resources available and for optimizing batch sizes and memory usage during training. If no GPU is found, the system falls back to CPU mode and warns the user that training will be significantly slower, emphasizing the importance of GPU acceleration for transformer model fine-tuning.


In [None]:
# 2) Imports and GPU config
import torch
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer
from transformers import default_data_collator
from datasets import Dataset
import warnings
warnings.filterwarnings('ignore')

print("=" * 60)
print("GPU CONFIGURATION")
print("=" * 60)
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"GPU Memory: {round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 2)} GB")
else:
    device = torch.device("cpu")
    print("GPU not available; training will be slower.")
print("=" * 60)


GPU CONFIGURATION
GPU Available: Tesla T4
CUDA Version: 12.6
GPU Memory: 14.74 GB


### 3. Dataset Loading Options

This section loads the cardiovascular QA dataset in Google Colab. With `USE_DRIVE = False` (the setting used here), the user uploads a CSV file directly from their local machine, and the uploaded file name becomes `CSV_PATH`.

In [None]:
# 3) Dataset loading (Cardiovascular QA)
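USE_DRIVE = False
CSV_PATH = None  # Set this manually (e.g. to a mounted Drive path) when USE_DRIVE is True

USE_UPLOAD = not USE_DRIVE
if USE_UPLOAD:
    try:
        from google.colab import files
        uploaded = files.upload()
        if uploaded:
            # Use the first uploaded file as the dataset
            CSV_PATH = list(uploaded.keys())[0]
    except Exception as e:
        # Report the failure instead of silently ignoring it
        print("Upload failed:", e)

if not CSV_PATH:
    raise ValueError("Please provide CSV_PATH via Drive or upload.")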

print("Dataset CSV:", CSV_PATH)

Saving medquadCardiovascular.csv to medquadCardiovascular (1).csv
Dataset CSV: medquadCardiovascular (1).csv



### 4. Data Loading and Preprocessing

Once the dataset path is established, this section loads the cardiovascular medical QA data from the CSV file using Pandas and performs essential preprocessing steps. The code first reads the CSV and displays diagnostic information including the total number of records, column names, a sample question preview, and the character length of a sample answer to give users insight into the dataset structure and content. To ensure data quality, the code removes any rows containing null values in the critical question or answer columns, preventing training issues from incomplete data. The dataset is then randomly shuffled with a fixed random seed for reproducibility and split into training and test sets using an 80/20 ratio, which provides substantial training data while reserving enough examples for unbiased testing. The training set will be used to update the model weights, while the test set allows for evaluation of model performance after training. This preprocessing ensures clean, well-organized data that's ready for the tokenization and model training phases.

In [None]:

# 4) Data Loading and Preprocessing
print("\n" + "=" * 60)
print("LOADING CARDIOVASCULAR DATASET")
print("=" * 60)

dataset = pd.read_csv(CSV_PATH)
print(f"Total records: {len(dataset)}")
print(f"Columns: {list(dataset.columns)}")
print(f"Sample question: {str(dataset.iloc[0]['question'])[:80]}...")
print(f"Sample answer chars: {len(str(dataset.iloc[0]['answer']))}")

# Drop nulls
dataset = dataset.dropna(subset=["question", "answer"]).reset_index(drop=True)

# Train/test split (80/20)
dataset_shuffled = dataset.sample(frac=1.0, random_state=42).reset_index(drop=True)
split_idx = int(len(dataset_shuffled) * 0.80)
train_data = dataset_shuffled.iloc[:split_idx].copy()
test_data = dataset_shuffled.iloc[split_idx:].copy()

print(f"Train: {len(train_data)} | Test: {len(test_data)}")
print("=" * 60)



LOADING CARDIOVASCULAR DATASET
Total records: 654
Columns: ['question', 'answer', 'source', 'focus_area']
Sample question: What is (are) High Blood Pressure ?...
Sample answer chars: 5586
Train: 523 | Test: 131
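
The split sizes above follow directly from truncating `len(dataset) * 0.80` to an integer. As a rough illustration, the sketch below reproduces that arithmetic and a seeded shuffle-and-split on plain row indices. `random.Random` stands in for `DataFrame.sample` here, so the row order will differ from the notebook's; the function names are hypothetical.

```python
import random

def split_sizes(n_rows, train_fraction=0.80):
    """Sizes produced by slicing at int(n_rows * train_fraction)."""
    split_idx = int(n_rows * train_fraction)
    return split_idx, n_rows - split_idx

def shuffled_split(n_rows, train_fraction=0.80, seed=42):
    """Shuffle row indices with a fixed seed, then slice into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    split_idx = int(n_rows * train_fraction)
    return indices[:split_idx], indices[split_idx:]

train_n, test_n = split_sizes(654)
print(train_n, test_n)  # 523 131, matching the cell output above

train_idx, test_idx = shuffled_split(654)
# The two partitions never share a row and together cover the dataset.
print(set(train_idx).isdisjoint(test_idx), len(train_idx) + len(test_idx))
```

Because the fraction is truncated rather than rounded, any leftover row goes to the test set.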


### 5. Model and Tokenizer Initialization

This section initializes the pre-trained BlueBERT model specifically fine-tuned for medical question answering tasks. The code loads the "aaditya/Bluebert_emrqa" model from the Hugging Face model hub, which is a BERT variant pre-trained on biomedical literature and further adapted for extractive question answering in electronic medical records. Both the tokenizer and the model are loaded using the Auto classes from the Transformers library, which automatically handle the correct architecture and configuration. The tokenizer is initialized with the fast tokenizer option enabled for improved performance during text processing. Once loaded, the model is transferred to the appropriate device (GPU if available, otherwise CPU) to ensure all subsequent operations leverage the available hardware acceleration. This pre-trained model provides a strong foundation with domain-specific knowledge of medical terminology and question-answering patterns, significantly reducing the amount of training data and time needed compared to training from scratch.

In [None]:
# 5) Model and tokenizer
print("\n" + "=" * 60)
print("LOADING MODEL AND TOKENIZER")
print("=" * 60)

MODEL_NAME = "aaditya/Bluebert_emrqa"
print("Model:", MODEL_NAME)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
model.to(device)

print("Model loaded.")
print("=" * 60)


LOADING MODEL AND TOKENIZER
Model: aaditya/Bluebert_emrqa


Model loaded.


### 6. Tokenization and Feature Preparation

This critical section transforms the raw text data into numerical representations that the model can process, and prepares the labels for extractive question answering. The code converts the Pandas dataframes into Hugging Face Dataset objects, then defines two key parameters: a maximum sequence length of 384 tokens and a document stride of 128 tokens, which is the overlap between consecutive windows when a long context is split. The prepare_train_features function tokenizes each question together with its answer, which serves as the context, truncating only the answer portion when the pair exceeds the maximum length. Overflowing tokens are not discarded: each long answer yields several features, so the number of tokenized features can exceed the number of source rows. For extractive QA, the model predicts the start and end positions of the answer span within the tokenized context. The code identifies which tokens belong to the context (sequence_id == 1) rather than the question, then labels the span heuristically: the start position is the first context token and the end position is 50 tokens later, or the last context token if the context is shorter. These labels are not derived from character-level answer offsets, so every feature is labelled with the opening of its context window, a point to keep in mind when interpreting the metrics reported later. If no context tokens exist, both positions are set to 0 as a fallback. The function removes the offset_mapping field before returning, since it is not a model input. The same function is applied to both the training and test datasets in batched mode.

In [None]:
# 6) Tokenization and Feature Preparation
print("\n" + "=" * 60)
print("TOKENIZING DATASET")
print("=" * 60)

train_ds = Dataset.from_pandas(train_data)
test_ds = Dataset.from_pandas(test_data)

MAX_LENGTH = 384
DOC_STRIDE = 128

def prepare_train_features(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["answer"],
        truncation="only_second",
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(tokenized["offset_mapping"]):
        sequence_ids = tokenized.sequence_ids(i)

        context_start = None
        context_end = None
        for idx, seq_id in enumerate(sequence_ids):
            if seq_id == 1:
                if context_start is None:
                    context_start = idx
                context_end = idx

        if context_start is None:
            start_positions.append(0)
            end_positions.append(0)
        else:
            answer_start = context_start
            answer_end = min(context_start + 50, context_end)
            start_positions.append(answer_start)
            end_positions.append(answer_end)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions

    # Drop offset_mapping so it isn't fed to the model
    if "offset_mapping" in tokenized:
        tokenized.pop("offset_mapping")

    return tokenized

print("Tokenizing train...")
tokenized_train = train_ds.map(
    prepare_train_features,
    batched=True,
    remove_columns=train_ds.column_names,
    desc="Tokenizing train",
)

print("Tokenizing test...")
tokenized_test = test_ds.map(
    prepare_train_features,
    batched=True,
    remove_columns=test_ds.column_names,
    desc="Tokenizing test",
)

print("Done.")
print("=" * 60)



TOKENIZING DATASET
Tokenizing train...
Tokenizing test...
Done.
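
To make the windowing and labelling in `prepare_train_features` more concrete, here is a simplified sketch in plain Python. It approximates how `truncation="only_second"` with `stride=DOC_STRIDE` splits a long context, assuming three special tokens per feature (`[CLS]`, `[SEP]`, `[SEP]`) and treating the stride as the overlap between consecutive windows. The function names are hypothetical, and the real tokenizer may differ in edge cases.

```python
def context_windows(n_context, n_question, max_length=384, stride=128, n_special=3):
    """Approximate (start, end) context token ranges for each feature."""
    budget = max_length - n_question - n_special  # context tokens per window
    if budget <= stride:
        raise ValueError("question too long for the chosen max_length/stride")
    step = budget - stride  # each new window advances by this many tokens
    spans, start = [], 0
    while True:
        end = min(start + budget, n_context)
        spans.append((start, end))
        if end == n_context:
            break
        start += step
    return spans

def heuristic_label(context_start, context_end, span_len=50):
    """Mirror of the notebook's labels: first context token to 50 tokens later."""
    return context_start, min(context_start + span_len, context_end)

# A hypothetical 1000-token answer paired with a 10-token question.
spans = context_windows(n_context=1000, n_question=10)
print(spans)  # [(0, 371), (243, 614), (486, 857), (729, 1000)]

# Inside each feature the context begins after [CLS] + question + [SEP].
context_start = 1 + 10 + 1
for start, end in spans:
    context_end = context_start + (end - start) - 1
    print(heuristic_label(context_start, context_end))  # (12, 62) for every window
```

In this example one source row becomes four features, and all four receive the same label positions because the label depends only on where the context begins within the feature.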


### 7. Evaluation Metrics Definition

This section defines helper functions that compute evaluation metrics for the question answering task. The calculate_detailed_metrics function takes the output of a prediction call, applies argmax to the start and end logits to obtain predicted token positions, and compares them with the labelled positions. It returns four metrics: accuracy (defined here as the mean of exact-match rate, start-position accuracy, and end-position accuracy), token-level F1, precision, and recall. The F1 calculation treats the predicted and labelled spans as sets of token positions and computes precision and recall from their overlap, averaged over examples. Two special cases are handled explicitly: a span whose end precedes its start is an empty set and scores 0 against a non-empty span, and two empty spans score 1. The evaluate_dataset function wraps a single trainer.predict call and the metric calculation so that any tokenized dataset can be evaluated in one step. A separate compute_qa_metrics function returns exact match, start accuracy, end accuracy, and F1 in the form expected by the Hugging Face Trainer API during training. The grid search in section 10 uses evaluate_dataset for both the training and test datasets.

In [None]:
# 7) Evaluation Metrics - Modular Helper Functions
import numpy as np

def calculate_detailed_metrics(predictions):
    """
    Calculate accuracy, F1, precision, and recall from model predictions.
    This function processes predictions in a single pass for efficiency.
    """
    pred_starts = np.argmax(predictions.predictions[0], axis=1)
    pred_ends = np.argmax(predictions.predictions[1], axis=1)

    true_starts = np.asarray(predictions.label_ids[0]).reshape(-1)
    true_ends = np.asarray(predictions.label_ids[1]).reshape(-1)

    # Calculate exact match and accuracy metrics
    exact_match = np.mean((pred_starts == true_starts) & (pred_ends == true_ends))
    start_accuracy = np.mean(pred_starts == true_starts)
    end_accuracy = np.mean(pred_ends == true_ends)
    accuracy = (exact_match + start_accuracy + end_accuracy) / 3

    # Calculate F1, precision, and recall from token overlap
    f1_scores = []
    precision_scores = []
    recall_scores = []

    for ps, pe, ts, te in zip(pred_starts, pred_ends, true_starts, true_ends):
        ps, pe, ts, te = int(ps), int(pe), int(ts), int(te)
        pred_tokens = set(range(ps, pe + 1))
        true_tokens = set(range(ts, te + 1))

        if not pred_tokens and not true_tokens:
            f1_scores.append(1.0)
            precision_scores.append(1.0)
            recall_scores.append(1.0)
        elif not pred_tokens or not true_tokens:
            f1_scores.append(0.0)
            precision_scores.append(0.0)
            recall_scores.append(0.0)
        else:
            common = len(pred_tokens & true_tokens)
            if common == 0:
                f1_scores.append(0.0)
                precision_scores.append(0.0)
                recall_scores.append(0.0)
            else:
                precision = common / len(pred_tokens)
                recall = common / len(true_tokens)
                f1 = 2 * (precision * recall) / (precision + recall)
                f1_scores.append(f1)
                precision_scores.append(precision)
                recall_scores.append(recall)

    return {
        "accuracy": float(accuracy),
        "f1": float(np.mean(f1_scores)) if f1_scores else 0.0,
        "precision": float(np.mean(precision_scores)) if precision_scores else 0.0,
        "recall": float(np.mean(recall_scores)) if recall_scores else 0.0,
    }

def evaluate_dataset(trainer, tokenized_data, dataset_name="Dataset"):
    """
    Evaluate model on a given dataset and return all metrics.
    Makes a single prediction call for efficiency.
    """
    predictions = trainer.predict(tokenized_data)
    metrics = calculate_detailed_metrics(predictions)
    return metrics

def compute_qa_metrics(eval_pred):
    """
    Compute metrics for Trainer's evaluation during training.
    Kept for compatibility with Trainer API.
    """
    predictions, label_ids = eval_pred
    start_logits, end_logits = predictions

    pred_starts = np.argmax(start_logits, axis=1)
    pred_ends = np.argmax(end_logits, axis=1)

    true_starts = np.asarray(label_ids[0]).reshape(-1)
    true_ends = np.asarray(label_ids[1]).reshape(-1)

    exact_match = np.mean((pred_starts == true_starts) & (pred_ends == true_ends))
    start_accuracy = np.mean(pred_starts == true_starts)
    end_accuracy = np.mean(pred_ends == true_ends)

    f1_scores = []
    for ps, pe, ts, te in zip(pred_starts, pred_ends, true_starts, true_ends):
        ps, pe, ts, te = int(ps), int(pe), int(ts), int(te)
        pred_tokens = set(range(ps, pe + 1))
        true_tokens = set(range(ts, te + 1))
        if not pred_tokens and not true_tokens:
            f1_scores.append(1.0)
        elif not pred_tokens or not true_tokens:
            f1_scores.append(0.0)
        else:
            common = len(pred_tokens & true_tokens)
            if common == 0:
                f1_scores.append(0.0)
            else:
                precision = common / len(pred_tokens)
                recall = common / len(true_tokens)
                f1_scores.append(2 * (precision * recall) / (precision + recall))

    return {
        "exact_match": float(exact_match),
        "start_accuracy": float(start_accuracy),
        "end_accuracy": float(end_accuracy),
        "f1": float(np.mean(f1_scores)) if f1_scores else 0.0,
    }

print("Metrics functions ready.")
print("=" * 60)

Metrics functions ready.
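
As a small worked example of the span-overlap scoring used in `calculate_detailed_metrics`, the sketch below scores a single predicted span against a labelled span and shows how the combined accuracy is formed. The function names and the input numbers are illustrative, not taken from the notebook's runs.

```python
def span_scores(pred_start, pred_end, true_start, true_end):
    """Return (precision, recall, f1) for one predicted vs. labelled span."""
    pred = set(range(pred_start, pred_end + 1))  # empty if end < start
    true = set(range(true_start, true_end + 1))
    if not pred and not true:
        return 1.0, 1.0, 1.0
    common = len(pred & true)
    if common == 0:  # also covers the case where only one set is empty
        return 0.0, 0.0, 0.0
    precision = common / len(pred)
    recall = common / len(true)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def combined_accuracy(exact_match, start_accuracy, end_accuracy):
    """The notebook's 'accuracy': mean of three position-level rates."""
    return (exact_match + start_accuracy + end_accuracy) / 3

# Prediction covers tokens 10..30, label covers tokens 10..60.
p, r, f1 = span_scores(10, 30, 10, 60)
print(round(p, 4), round(r, 4), round(f1, 4))  # 1.0 0.4118 0.5833

# An inverted span (end before start) is empty and scores zero.
print(span_scores(30, 10, 10, 60))  # (0.0, 0.0, 0.0)

# Start always correct, end correct half the time, so exact match is 0.5.
print(round(combined_accuracy(0.5, 1.0, 0.5), 4))  # 0.6667
```

The last line shows why F1 and the combined accuracy can diverge: a prediction with the right start and a nearby but wrong end still earns a high overlap F1, while the exact-match and end-accuracy terms drop to zero for that example.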


### 8. Training Configuration

This section configures the hyperparameters and training settings for Google Colab's computational environment. The TrainingArguments object specifies that the model will train for 3 epochs with a batch size of 16 examples per device, using gradient accumulation over 2 steps to give an effective batch size of 32 (16 × 2) while managing memory constraints. The learning rate is set to 0.00003 (3e-5), a typical value for fine-tuning pre-trained models, with a linear schedule in which the first 10% of training steps are warmup: the learning rate rises from zero to its peak and then decays linearly. Weight decay (0.01) provides regularization to limit overfitting, and gradient clipping (max_grad_norm=1.0) guards against exploding gradients. Mixed-precision training (fp16) is enabled when a GPU is available, which speeds up training and reduces memory usage with minimal impact on model quality. Evaluation and checkpoint saving run at the end of each epoch; save_total_limit=2 caps the number of checkpoints kept on disk, and load_best_model_at_end reloads the checkpoint with the highest F1 score once training finishes. Data loading uses pin_memory and 2 worker processes. Logging occurs every 50 steps, and the random seed is fixed at 42 for reproducibility. These settings balance training efficiency, model performance, and the resource constraints typical of Colab's free-tier GPU allocation.

In [None]:
# 8) Training configuration (Colab-friendly)
print("\n" + "=" * 60)
print("TRAINING CONFIGURATION")
print("=" * 60)

training_args = TrainingArguments(
    output_dir="./results_cardio_qa",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=0.00003,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,
    fp16=torch.cuda.is_available(),
    dataloader_pin_memory=True,
    dataloader_num_workers=2,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    logging_dir="./logs_cardio_qa",
    logging_steps=50,
    logging_strategy="steps",
    report_to=[],
    seed=42,
    disable_tqdm=False,
    remove_unused_columns=True,
)

print("Configured. FP16:", training_args.fp16)
print("=" * 60)



TRAINING CONFIGURATION
Configured. FP16: True
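
The interaction between `warmup_ratio`, the linear scheduler, and gradient accumulation can be sketched numerically. The snippet below is an illustrative stand-alone reimplementation of a linear warmup and decay schedule, not the Trainer's internal code. The total of 100 optimizer steps is a made-up figure, since the real number depends on how many features tokenization produces.

```python
import math

def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step under linear warmup then decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

PEAK_LR = 0.00003
WARMUP_RATIO = 0.1
TOTAL_STEPS = 100  # hypothetical; not measured from the notebook

effective_batch = 16 * 2  # per-device batch size x gradient accumulation steps
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)

print("Effective batch size:", effective_batch)
print("Warmup steps:", warmup_steps)
for step in (0, 5, 10, 55, 100):
    lr = linear_warmup_lr(step, PEAK_LR, warmup_steps, TOTAL_STEPS)
    print(f"step {step:>3}: lr = {lr:.7f}")
```

With these assumed numbers the rate climbs to 0.00003 by step 10, is halfway down at step 55, and reaches zero at the final step.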



### 9. Trainer Initialization

This section instantiates the Hugging Face Trainer object, which orchestrates the entire training and evaluation process. The Trainer combines the model, training arguments, datasets, tokenizer, data collator, and metrics function into a unified training pipeline. The data collator handles the batching and padding of tokenized examples to ensure uniform tensor sizes within each batch. By providing both training and evaluation datasets, the Trainer can automatically run validation at the intervals specified in the training arguments. The compute_metrics function enables automatic calculation of custom metrics during evaluation, providing real-time feedback on model performance. This high-level abstraction handles many complex aspects of the training loop including gradient computation, backpropagation, optimizer steps, learning rate scheduling, mixed-precision training, logging, checkpointing, and distributed training if multiple GPUs are available. The Trainer API simplifies what would otherwise require hundreds of lines of custom training code, while still offering extensive customization options through the training arguments.

In [None]:
# 9) Initialize Trainer
print("\n" + "=" * 60)
print("INITIALIZING TRAINER")
print("=" * 60)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_qa_metrics,
)

print("Trainer ready.")
print("=" * 60)


INITIALIZING TRAINER
Trainer ready.


### 10. Random Grid Search Hyperparameter Tuning

This section implements an automated Random Grid Search to identify good hyperparameters for the cardiovascular QA task. Rather than selecting hyperparameters manually, Random Grid Search samples configurations at random from a predefined grid of values. The search space includes epochs (5 to 15), learning rates (0.00001 to 0.00006), batch sizes ([8, 12, 16]), gradient accumulation steps ([1, 2, 4, 6]), warmup ratios (0.05 to 0.20), and weight decay values (0.005 to 0.025). The experiment runs 15 random configurations, which covers only a small part of the grid but keeps the run time manageable on Google Colab's T4 GPU. For each configuration, a fresh copy of the pre-trained BlueBERT model is loaded to ensure a fair comparison. In the per-iteration logs and the Excel export, hyperparameters are shown in fixed-point decimal notation (e.g., 0.00003 instead of 3e-5) for clarity. The code evaluates performance on both the training and test datasets using the helper functions from section 7, which calculate accuracy, F1 score, precision, and recall from a single prediction call per dataset. Results are written to an Excel file with one column per hyperparameter and metric. The effective batch size (batch_size × gradient_accumulation_steps) is also recorded. The GPU cache is cleared between iterations to free memory on the T4 GPU. A fixed random seed (42) makes the random sampling reproducible, so the experiment can be repeated with identical configurations.
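
To put the 15 sampled configurations in perspective, the following sketch rebuilds the same grid with plain Python lists (instead of `np.linspace`), counts the full number of combinations, and draws seeded samples. The names `PARAM_SPACE`, `grid_size`, and `sample_configs` are illustrative, and the draws are not claimed to match the notebook's.

```python
import random

# Same value ranges as the search space described above, built without NumPy.
PARAM_SPACE = {
    "epochs": list(range(5, 16)),
    "lr": [round(0.00001 + 0.000005 * i, 6) for i in range(11)],
    "batch": [8, 12, 16],
    "grad_accum": [1, 2, 4, 6],
    "warmup": [round(0.05 + 0.01 * i, 2) for i in range(16)],
    "weight_decay": [round(0.005 + 0.001 * i, 4) for i in range(21)],
}

def grid_size(space):
    """Number of distinct configurations in the full grid."""
    total = 1
    for values in space.values():
        total *= len(values)
    return total

def sample_configs(space, n, seed=42):
    """Draw n configurations independently (with replacement) from the grid."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n)]

configs = sample_configs(PARAM_SPACE, 15)
distinct = {tuple(sorted(c.items())) for c in configs}

print("Full grid size:", grid_size(PARAM_SPACE))  # 487872
print("Sampled:", len(configs), "| distinct:", len(distinct))
print("Effective batch sizes:", sorted({c["batch"] * c["grad_accum"] for c in configs}))
```

Since each configuration is drawn independently, duplicates are possible in principle, which is why the sketch counts distinct configurations separately.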

In [None]:
# 10) Random Grid Search Hyperparameter Tuning
import time
import random
from openpyxl import Workbook
from openpyxl.styles import Font, Alignment, numbers

print("\n" + "=" * 60)
print("RANDOM GRID SEARCH HYPERPARAMETER TUNING (15 CONFIGURATIONS)")
print("=" * 60)

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)

# Define hyperparameter search space
param_space = {
    "epochs": list(range(5, 16)),  # 5 to 15
    "lr": [0.00001, 0.000015, 0.00002, 0.000025, 0.00003, 0.000035, 0.00004, 0.000045, 0.00005, 0.000055, 0.00006],
    "batch": [8, 12, 16],
    "grad_accum": [1, 2, 4, 6],
    "warmup": [round(x, 2) for x in np.linspace(0.05, 0.20, 16)],  # 0.05 to 0.20
    "weight_decay": [round(x, 4) for x in np.linspace(0.005, 0.025, 21)]  # 0.005 to 0.025
}

print("Hyperparameter Search Space:")
print(f"  Epochs: {min(param_space['epochs'])} - {max(param_space['epochs'])}")
print(f"  Learning Rate: {min(param_space['lr'])} - {max(param_space['lr'])}")
print(f"  Batch Size: {param_space['batch']}")
print(f"  Gradient Accumulation: {param_space['grad_accum']}")
print(f"  Warmup Ratio: {min(param_space['warmup'])} - {max(param_space['warmup'])}")
print(f"  Weight Decay: {min(param_space['weight_decay'])} - {max(param_space['weight_decay'])}")
print()

# Generate 15 random configurations
num_configs = 15
hyperparam_sets = []

for _ in range(num_configs):
    config = {
        "epochs": random.choice(param_space["epochs"]),
        "lr": random.choice(param_space["lr"]),
        "batch": random.choice(param_space["batch"]),
        "grad_accum": random.choice(param_space["grad_accum"]),
        "warmup": random.choice(param_space["warmup"]),
        "weight_decay": random.choice(param_space["weight_decay"])
    }
    hyperparam_sets.append(config)

total_iters = len(hyperparam_sets)

print(f"Generated {total_iters} random configurations")
print("=" * 60)

# Excel setup
wb = Workbook()
ws = wb.active
ws.title = "Random Grid Search Results"
ws.append([
    "Iteration", "Epochs", "Learning Rate", "Batch Size", "Grad Accum", "Effective Batch",
    "Warmup Ratio", "Weight Decay",
    "Train-Accuracy", "Test-Accuracy",
    "Train-F1", "Test-F1",
    "Train-Precision", "Test-Precision",
    "Train-Recall", "Test-Recall",
    "Runtime (s)"
])

# Main random grid search loop
for i, params in enumerate(hyperparam_sets, 1):
    print(f"\n{'='*60}")
    print(f" Iteration {i}/{total_iters}")
    print(f"Params: Epochs={params['epochs']}, LR={params['lr']:.6f}, Batch={params['batch']}, GradAccum={params['grad_accum']}")
    print(f"        Warmup={params['warmup']:.2f}, WeightDecay={params['weight_decay']:.4f}")
    print(f"        Effective Batch Size: {params['batch'] * params['grad_accum']}")
    print(f"{'='*60}")

    # Reload fresh model for each iteration
    print("Loading fresh model...")
    model_fresh = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
    model_fresh.to(device)

    # Configure training arguments (T4 optimized)
    training_args = TrainingArguments(
        output_dir=f"./results_run_{i}",
        num_train_epochs=params["epochs"],
        per_device_train_batch_size=params["batch"],
        per_device_eval_batch_size=params["batch"],
        gradient_accumulation_steps=params["grad_accum"],
        learning_rate=params["lr"],
        warmup_ratio=params["warmup"],
        weight_decay=params["weight_decay"],
        eval_strategy="no",
        save_strategy="no",
        logging_dir=f"./logs_run_{i}",
        report_to=[],
        disable_tqdm=False,
        seed=42,
        fp16=torch.cuda.is_available(),
        dataloader_num_workers=0,  # T4 optimization (limited CPU cores)
        dataloader_pin_memory=True,
    )

    # Create trainer
    trainer = Trainer(
        model=model_fresh,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_test,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

    # Train model
    print(" Training...")
    start_time = time.time()
    trainer.train()
    runtime = round(time.time() - start_time, 2)

    # Evaluate on BOTH train and test sets
    print(" Evaluating on training set...")
    train_metrics = evaluate_dataset(trainer, tokenized_train, "TRAIN")

    print(" Evaluating on test set...")
    test_metrics = evaluate_dataset(trainer, tokenized_test, "TEST")

    # Calculate effective batch size
    effective_batch = params["batch"] * params["grad_accum"]

    # Append results to Excel (using real numbers, not scientific notation)
    ws.append([
        i,
        params["epochs"],
        params["lr"],  # Will be formatted as real number
        params["batch"],
        params["grad_accum"],
        effective_batch,
        params["warmup"],
        params["weight_decay"],
        round(train_metrics["accuracy"], 4),
        round(test_metrics["accuracy"], 4),
        round(train_metrics["f1"], 4),
        round(test_metrics["f1"], 4),
        round(train_metrics["precision"], 4),
        round(test_metrics["precision"], 4),
        round(train_metrics["recall"], 4),
        round(test_metrics["recall"], 4),
        runtime
    ])

    print(f" Iteration {i} complete!")
    print(f"   Train - F1: {train_metrics['f1']:.4f}, Acc: {train_metrics['accuracy']:.4f}")
    print(f"   Test  - F1: {test_metrics['f1']:.4f}, Acc: {test_metrics['accuracy']:.4f}")
    print(f"   Runtime: {runtime}s")

    # Clear GPU cache and free memory (T4 optimization)
    del model_fresh, trainer
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

# Format Excel file
print("\n" + "=" * 60)
print("FORMATTING EXCEL FILE")
print("=" * 60)

# Bold headers
for cell in ws[1]:
    cell.font = Font(bold=True)
    cell.alignment = Alignment(horizontal="center", vertical="center")

# Apply fixed-point number formats so Excel does not show scientific notation
for row in range(2, ws.max_row + 1):
    # Learning Rate (column C)
    ws.cell(row=row, column=3).number_format = '0.000000'
    # Warmup Ratio (column G)
    ws.cell(row=row, column=7).number_format = '0.00'
    # Weight Decay (column H)
    ws.cell(row=row, column=8).number_format = '0.0000'
    # All metric columns (I-P)
    for col in range(9, 17):
        ws.cell(row=row, column=col).number_format = '0.0000'

# Auto-adjust column widths
for col in ws.columns:
    col_letter = col[0].column_letter
    # Longest rendered cell value in this column (header included)
    max_length = max(len(str(cell.value)) for cell in col)
    # Add a little padding, but cap the width at 20 characters
    ws.column_dimensions[col_letter].width = min(max_length + 2, 20)

# Save Excel file
output_excel = "/content/QA_RandomGridSearch_Results_blueBERT.xlsx"
wb.save(output_excel)

print(f"\n All {total_iters} configurations completed successfully!")
print(" Results saved to Excel file:")
print(f" {output_excel}")
print("\n" + "=" * 60)
print("RANDOM GRID SEARCH COMPLETE")
print("=" * 60)
print("=" * 60)


RANDOM GRID SEARCH HYPERPARAMETER TUNING (15 CONFIGURATIONS)
Hyperparameter Search Space:
  Epochs: 5 - 15
  Learning Rate: 1e-05 - 6e-05
  Batch Size: [8, 12, 16]
  Gradient Accumulation: [1, 2, 4, 6]
  Warmup Ratio: 0.05 - 0.2
  Weight Decay: 0.005 - 0.025

Generated 15 random configurations

 Iteration 1/15
Params: Epochs=15, LR=0.000015, Batch=8, GradAccum=4
        Warmup=0.12, WeightDecay=0.0120
        Effective Batch Size: 32
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 1 complete!
   Train - F1: 0.9989, Acc: 0.9428
   Test  - F1: 0.9924, Acc: 0.6730
   Runtime: 279.13s

 Iteration 2/15
Params: Epochs=7, LR=0.000015, Batch=16, GradAccum=1
        Warmup=0.18, WeightDecay=0.0060
        Effective Batch Size: 16
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 2 complete!
   Train - F1: 0.9970, Acc: 0.8504
   Test  - F1: 0.9905, Acc: 0.6806
   Runtime: 129.82s

 Iteration 3/15
Params: Epochs=5, LR=0.000015, Batch=8, GradAccum=2
        Warmup=0.05, WeightDecay=0.0220
        Effective Batch Size: 16
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 3 complete!
   Train - F1: 0.9927, Acc: 0.7304
   Test  - F1: 0.9893, Acc: 0.5615
   Runtime: 97.51s

 Iteration 4/15
Params: Epochs=8, LR=0.000060, Batch=16, GradAccum=6
        Warmup=0.12, WeightDecay=0.0190
        Effective Batch Size: 96
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 4 complete!
   Train - F1: 0.9968, Acc: 0.8463
   Test  - F1: 0.9923, Acc: 0.6679
   Runtime: 137.76s

 Iteration 5/15
Params: Epochs=14, LR=0.000030, Batch=8, GradAccum=2
        Warmup=0.18, WeightDecay=0.0150
        Effective Batch Size: 16
Loading fresh model...
 Training...


Step,Training Loss
500,1.7451


 Evaluating on training set...


 Evaluating on test set...


 Iteration 5 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9955, Acc: 0.9493
   Runtime: 268.79s

 Iteration 6/15
Params: Epochs=9, LR=0.000020, Batch=8, GradAccum=4
        Warmup=0.08, WeightDecay=0.0070
        Effective Batch Size: 32
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 6 complete!
   Train - F1: 0.9965, Acc: 0.8642
   Test  - F1: 0.9914, Acc: 0.6578
   Runtime: 167.38s

 Iteration 7/15
Params: Epochs=11, LR=0.000015, Batch=12, GradAccum=4
        Warmup=0.13, WeightDecay=0.0060
        Effective Batch Size: 48
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 7 complete!
   Train - F1: 0.9938, Acc: 0.7759
   Test  - F1: 0.9881, Acc: 0.5894
   Runtime: 193.0s

 Iteration 8/15
Params: Epochs=12, LR=0.000050, Batch=8, GradAccum=6
        Warmup=0.07, WeightDecay=0.0220
        Effective Batch Size: 48
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 8 complete!
   Train - F1: 0.9999, Acc: 0.9966
   Test  - F1: 0.9947, Acc: 0.8504
   Runtime: 218.86s

 Iteration 9/15
Params: Epochs=9, LR=0.000060, Batch=16, GradAccum=4
        Warmup=0.11, WeightDecay=0.0070
        Effective Batch Size: 64
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 9 complete!
   Train - F1: 0.9993, Acc: 0.9642
   Test  - F1: 0.9951, Acc: 0.7947
   Runtime: 156.17s

 Iteration 10/15
Params: Epochs=5, LR=0.000060, Batch=8, GradAccum=4
        Warmup=0.07, WeightDecay=0.0120
        Effective Batch Size: 32
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 10 complete!
   Train - F1: 0.9983, Acc: 0.9297
   Test  - F1: 0.9942, Acc: 0.7744
   Runtime: 93.17s

 Iteration 11/15
Params: Epochs=6, LR=0.000040, Batch=12, GradAccum=6
        Warmup=0.16, WeightDecay=0.0100
        Effective Batch Size: 72
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 11 complete!
   Train - F1: 0.9918, Acc: 0.6829
   Test  - F1: 0.9884, Acc: 0.5640
   Runtime: 105.16s

 Iteration 12/15
Params: Epochs=10, LR=0.000035, Batch=8, GradAccum=4
        Warmup=0.07, WeightDecay=0.0240
        Effective Batch Size: 32
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 12 complete!
   Train - F1: 0.9994, Acc: 0.9841
   Test  - F1: 0.9949, Acc: 0.7871
   Runtime: 185.44s

 Iteration 13/15
Params: Epochs=15, LR=0.000020, Batch=16, GradAccum=2
        Warmup=0.10, WeightDecay=0.0190
        Effective Batch Size: 32
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 13 complete!
   Train - F1: 0.9998, Acc: 0.9897
   Test  - F1: 0.9924, Acc: 0.7414
   Runtime: 264.85s

 Iteration 14/15
Params: Epochs=11, LR=0.000030, Batch=16, GradAccum=2
        Warmup=0.15, WeightDecay=0.0060
        Effective Batch Size: 32
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 14 complete!
   Train - F1: 0.9998, Acc: 0.9841
   Test  - F1: 0.9894, Acc: 0.7719
   Runtime: 194.14s

 Iteration 15/15
Params: Epochs=8, LR=0.000010, Batch=12, GradAccum=6
        Warmup=0.13, WeightDecay=0.0070
        Effective Batch Size: 72
Loading fresh model...
 Training...


Step,Training Loss


 Evaluating on training set...


 Evaluating on test set...


 Iteration 15 complete!
   Train - F1: 0.9693, Acc: 0.4526
   Test  - F1: 0.9713, Acc: 0.4157
   Runtime: 140.86s

FORMATTING EXCEL FILE

 All 15 configurations completed successfully!
 Results saved to Excel file:
 /content/QA_RandomGridSearch_Results_blueBERT.xlsx

RANDOM GRID SEARCH COMPLETE
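
In the output above, F1 stays between 0.9713 and 1.0000 for every configuration, while accuracy ranges from 0.4157 to 0.9493 on the test set. Ranking by test accuracy, and checking the gap between train and test accuracy, therefore separates the configurations more clearly than F1 does. The following is a minimal sketch of that comparison, assuming the log format printed above. The names `LOG_SAMPLE`, `parse_search_log`, and `rank_by_test_acc` are hypothetical and not part of the notebook; the sample text is three iterations copied from the output.

```python
import re

# Three iterations copied from the printed output above.
# LOG_SAMPLE is a hypothetical name; in practice the full captured log would go here.
LOG_SAMPLE = """
 Iteration 5 complete!
   Train - F1: 1.0000, Acc: 1.0000
   Test  - F1: 0.9955, Acc: 0.9493
   Runtime: 268.79s

 Iteration 8 complete!
   Train - F1: 0.9999, Acc: 0.9966
   Test  - F1: 0.9947, Acc: 0.8504
   Runtime: 218.86s

 Iteration 15 complete!
   Train - F1: 0.9693, Acc: 0.4526
   Test  - F1: 0.9713, Acc: 0.4157
   Runtime: 140.86s
"""

# Matches one "Iteration N complete!" summary block in the format shown above.
PATTERN = re.compile(
    r"Iteration (\d+) complete!\s+"
    r"Train - F1: ([\d.]+), Acc: ([\d.]+)\s+"
    r"Test\s+- F1: ([\d.]+), Acc: ([\d.]+)\s+"
    r"Runtime: ([\d.]+)s"
)


def parse_search_log(text):
    """Return one dict per completed iteration found in the log text."""
    rows = []
    for match in PATTERN.finditer(text):
        iteration, train_f1, train_acc, test_f1, test_acc, runtime = match.groups()
        rows.append({
            "iteration": int(iteration),
            "train_f1": float(train_f1),
            "train_acc": float(train_acc),
            "test_f1": float(test_f1),
            "test_acc": float(test_acc),
            "runtime_s": float(runtime),
            # Train/test accuracy gap as a rough overfitting indicator
            "acc_gap": round(float(train_acc) - float(test_acc), 4),
        })
    return rows


def rank_by_test_acc(rows):
    """Sort iterations from highest to lowest test accuracy."""
    return sorted(rows, key=lambda r: r["test_acc"], reverse=True)


ranked = rank_by_test_acc(parse_search_log(LOG_SAMPLE))
for r in ranked:
    print(
        f"Iteration {r['iteration']:>2}: "
        f"test_acc={r['test_acc']:.4f}, acc_gap={r['acc_gap']:.4f}, "
        f"runtime={r['runtime_s']:.2f}s"
    )
```

Applied to the full 15-iteration output, the same ranking would place iteration 5 first (test accuracy 0.9493). Reading the metrics back from the saved Excel file is an alternative, but it depends on the column headers written earlier in the notebook.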
