# **Setup and Data Preparation**

**Function Description**

This code block loads a CSV dataset from Google Drive (in this case `("/content/drive/My Drive/Data Collection (ITE Elective Course Lesson)/Dataset/Webscraped data - ITE Elective 3 - Sheet1.csv")`, then processes it into a suitable format for `transformers` libraries. It also checks GPU availability to enable efficient computation.

**Input**

The input in this code block is the CSV file from Google Drive containing text data for model training

**Output**

The output in this code block is the confirmation message about GPU or CPU usage. Two datasets printed to the console: training data `(train_data)` and evaluation data `(eval_data)`.

**Essential Syntaxes**



    drive.mount('/content/drive/', force_remount=True) mounts Google Drive to access external files.

    pd.read_csv(path) loads the dataset into a pandas DataFrame.

    torch.cuda.is_available() checks whether a GPU is accessible.

    Dataset.from_pandas(df) converts a pandas DataFrame into a Hugging Face Dataset object.

    dataset.train_test_split(test_size=0.2, seed=42) splits the dataset into 80% training and 20% evaluation data.

    torch.device("cuda" or "cpu") specifies the computation device for model training.

**Example Output**

Mounted at /content/drive/
Using GPU: Tesla T4

--- Loading and Preprocessing Data ---
Dataset({
    features: ['column1', 'column2', 'column3'],
    num_rows: 400
})
Dataset({
    features: ['column1', 'column2', 'column3'],
    num_rows: 100
})


**Comment and Observation**

Based on my understanding, the code demonstrates a typical preprocessing workflow for NLP model training using Hugging Face. It ensures GPU acceleration if available, which significantly speeds up training. Moreover, converting the dataset into the Hugging Face format allows easy integration with a `Trainer` later on.

In [3]:
import torch
import pandas as pd
import numpy as np
import re
import time
import random
import itertools
from datasets import Dataset
from transformers import TrainingArguments, Trainer
from google.colab import drive
from transformers import BertTokenizerFast, BertForQuestionAnswering
from sklearn.metrics import accuracy_score, f1_score
from transformers import TrainingArguments, Trainer
from transformers.data.data_collator import default_data_collator # Import default_data_collator


drive.mount('/content/drive/', force_remount=True)

df = pd.read_excel("/content/drive/My Drive/Data Collection (ITE Elective Course Lesson)/Dataset/Webscraped data_Modules_Question and Answering.xlsx")


df.dropna (subset=['Title', 'Context', 'Question', 'Answer'], inplace=True)


for col in ['Context', 'Question', 'Answer', 'Title']:
  df[col] = df[col].astype(str)


def clean_text(text):
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^A-Za-z0-9.,;:?!\'"()\-\s]', '', text)
    text = text.strip()
    return text


df['Context'] = df['Context'].apply(clean_text)
df['Question'] = df['Question'].apply(clean_text)
df['Answer'] = df['Answer'].apply(clean_text)

print(f"Cleaned dataset shape: {df.shape}")


if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU.")

print("\n--- Loading and Preprocessing Data ---")


dataset = Dataset.from_pandas(df)


dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_data = dataset["train"]
eval_data = dataset["test"]

print(train_data)
print(eval_data)


Mounted at /content/drive/
Cleaned dataset shape: (710, 7)
Using GPU: Tesla T4

--- Loading and Preprocessing Data ---
Dataset({
    features: ['ID', 'Title', 'Context', 'Question', 'Answer', 'Unnamed: 5', 'Unnamed: 6', '__index_level_0__'],
    num_rows: 568
})
Dataset({
    features: ['ID', 'Title', 'Context', 'Question', 'Answer', 'Unnamed: 5', 'Unnamed: 6', '__index_level_0__'],
    num_rows: 142
})


# **Bert-base-uncased**

**Function Description**

This code loads a pre-trained BERT model and tokenizer for question answering tasks, processes datasets by aligning answer positions within contexts, and tokenizes the data while mapping character-based answer spans to token spans suitable for model training. It also detects GPU availability to accelerate model training and inference.

**Input**

The input consists of datasets containing questions, contexts, and answers, which are prepared as train_data and eval_data before processing. These datasets include fields like "Question," "Context," "Answer," and answer positions.

**Output**

The output includes tokenized datasets (tokenized_train and tokenized_eval) that are formatted for PyTorch training, with added start and end position labels for answers. Additionally, the script confirms the successful loading of the model and the availability of GPU acceleration, printing relevant messages.

**Essential Syntaxes**

python
from transformers import BertTokenizerFast, BertForQuestionAnswering

    Imports the necessary classes for tokenization and model loading.

python
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME)

    Loads a pre-trained BERT tokenizer and model based on bert-base-uncased.

python
def add_answer_positions(example):
    ...
    return example

    Adds start and end answer positions within the context, matching answer text to context.

python
def tokenize_and_align(examples):
    ...
    return tokenized

    Tokenizes question and context, aligns answer spans with token indices, and prepares data for model training.

python
train_data.map(add_answer_positions)
eval_data.map(add_answer_positions)

train_data.map(tokenize_and_align, batched=True)
eval_data.map(tokenize_and_align, batched=True)

    Applies position setting and tokenization functions on datasets.

python
torch.device("cuda" if torch.cuda.is_available() else "cpu")

    Checks for GPU availability and assigns the device accordingly.

python
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)

    Loads the model to the appropriate device for training or inference.

Example Output

text
Loaded Pretrained QnA Model: bert-base-uncased
🧩 Model loaded successfully for QnA: bert-base-uncased

Comment and Observation

This code exemplifies a typical NLP pipeline for question answering with BERT. It carefully maps answer spans from character-level positions within the context to token indices, which is essential for model training. Importantly, it checks for GPU availability to optimize performance, making it suitable for large datasets and neural network fine-tuning tasks. The structured approach ensures correct data alignment and efficient model loading, facilitating smooth integration into a training loop later on.

In [4]:


MODEL_NAME = "bert-base-uncased"

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME)

print(f"\n Loaded Pretrained QnA Model: {MODEL_NAME}")


def add_answer_positions(example):
    context = example["Context"]
    answer = example["Answer"]


    context_lower = context.lower()
    answer_lower = answer.lower()

    answer_start = context_lower.find(answer_lower)
    if answer_start == 0:

        example["start_positions"] = 0
        example["end_positions"] = 0
    else:
        answer_end = answer_start + len(answer)
        example["start_positions"] = answer_start
        example["end_positions"] = answer_end

    return example


def tokenize_and_align(examples):
    tokenized = tokenizer(
        examples["Question"],
        examples["Context"],
        truncation="only_second",
        padding="max_length",
        max_length=512,
        return_offsets_mapping=True
    )

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(tokenized["offset_mapping"]):
        sequence_ids = tokenized.sequence_ids(i)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        start_char = examples["start_positions"][i]
        end_char = examples["end_positions"][i]


        token_start_index = context_start
        token_end_index = context_start


        for idx in range(context_start, context_end + 1):
            if offsets[idx][0] <= start_char < offsets[idx][1]:
                token_start_index = idx
            if offsets[idx][0] < end_char <= offsets[idx][1]:
                token_end_index = idx
                break

        start_positions.append(token_start_index)
        end_positions.append(token_end_index)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    tokenized.pop("offset_mapping")

    return tokenized



train_data = train_data.map(add_answer_positions)
eval_data = eval_data.map(add_answer_positions)


tokenized_train = train_data.map(tokenize_and_align, batched=True)
tokenized_eval = eval_data.map(tokenize_and_align, batched=True)

tokenized_train.set_format("torch", columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions'])
tokenized_eval.set_format("torch", columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions'])



device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)

print(f"\n Model loaded successfully for QnA: {MODEL_NAME}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



 Loaded Pretrained QnA Model: bert-base-uncased


Map:   0%|          | 0/568 [00:00<?, ? examples/s]

Map:   0%|          | 0/142 [00:00<?, ? examples/s]

Map:   0%|          | 0/568 [00:00<?, ? examples/s]

Map:   0%|          | 0/142 [00:00<?, ? examples/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



 Model loaded successfully for QnA: bert-base-uncased


# **METRICS AND TRAINING SETUP**

This command installs several important Python packages that are essential for working with transformer models and hyperparameter optimization:

    transformers: Hugging Face's library for state-of-the-art transformer models like BERT, GPT, and more.

    datasets: Hugging Face's library for easily accessing and managing datasets.

    accelerate: A library to help scale PyTorch models across multiple GPUs or TPUs.

    ray[tune]: Ray Tune is a scalable hyperparameter tuning library built on Ray.

    optuna: A popular framework for automated hyperparameter optimization.

The purpose of this code is to search results show that Optuna and Ray Tune can be integrated with the Transformers Trainer to perform hyperparameter searches efficiently. Installing these packages sets up your environment for such advanced workflows, enabling you to automatically and systematically improve your transformer model’s training parameters.

In [7]:
!pip install transformers datasets accelerate ray[tune] optuna -U



Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting optuna
  Downloading optuna-4.5.0-py3-none-any.whl.metadata (17 kB)
Collecting ray[tune]
  Downloading ray-2.51.1-cp312-cp312-manylinux2014_x86_64.whl.metadata (21 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Collecting click!=8.3.0,>=7.0 (from ray[tune])
  Downloading click-8.2.1-py3-none-any.whl.metadata (2.5 kB)
Collecting tensorboardX>=1.9 (from ray[tune])
  Downloading tensorboardx-2.6.4-py3-none-any.whl.metadata (6.2 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.10.1-py3-none-any.whl.metadata (11 kB)
Downloading datasets-4.4.1-py3-none-any.whl (511 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m511.6/511.6 kB[0m [31m16.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading optuna-4.5.0-py3-none-any.whl (400 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [

**Function Description**

This code sets seeds for all relevant libraries to ensure full reproducibility in PyTorch experiments. It also defines evaluation metrics for question answering Exact Match (EM) and F1 Score then builds a Hugging Face Trainer with arguments perfectly suited for BERT-style extractive QA training and validation. These steps are essential for stable, trackable QA research and applications.

**Input**

    Seed value (here, 42), set across Python, numpy, PyTorch CPU and GPU.

    Model predictions and gold answers for each evaluation batch.

    Training datasets, pretrained model, and tokenizer.

**Output**

    Reproducible training and validation runs.

    Dictionary of computed evaluation metrics (averaged EM, F1, and inference time) after each evaluation phase.

    A fully configured Trainer object for orchestrated fine-tuning on QA tasks.

**Essential Syntaxes**

python
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Ensures all sources of randomness are controlled, making repeated experiments yield consistent results.

python
def compute_exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

Computes an all-or-nothing score: 1 only if predicted span matches gold answer exactly, after stripping and lowercasing.

python
def compute_f1(prediction, truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    ...
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

Calculates the token-level overlap F1 between prediction and gold answer. Captures partial matches as well as perfect ones.

python
def compute_metrics(eval_pred):
    ...
    return metrics

Processes batched model predictions, decodes answer spans, computes EM and F1 across all examples, and averages them.

python
training_args = TrainingArguments(...)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

Sets up training parameters (epochs, batch sizes, device, logging, saving) and bundles everything in a Trainer for easy use.

**Example Output**

text
{'Exact_Match': 0.76, 'F1_Score': 0.81, 'Avg_Inference_Time': 0.0052}

(A dictionary summarizing model accuracy and efficiency, printed after evaluation.)

**Comment and Observation**

Setting all seeds eliminates nearly all nondeterminism—so you're not chasing "lucky runs," which is essential in QA research.

In [5]:

seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)


torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False


def compute_exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

def compute_f1(prediction, truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()

    common = set(pred_tokens) & set(truth_tokens)
    if not common:
        return 0.0

    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(truth_tokens)
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1



def compute_metrics(eval_pred):
    start_time = time.time()

    predictions, labels = eval_pred

    start_logits, end_logits = predictions


    start_positions = np.argmax(start_logits, axis=1)
    end_positions = np.argmax(end_logits, axis=1)

    exact_matches = []
    f1_scores = []


    for i in range(len(start_positions)):
        input_ids = tokenized_eval[i]["input_ids"]
        pred_tokens = input_ids[start_positions[i]: end_positions[i] + 1]
        pred_text = tokenizer.decode(pred_tokens, skip_special_tokens=True)

        gold_text = eval_data[i]["Answer"]

        exact_matches.append(compute_exact_match(pred_text, gold_text))
        f1_scores.append(compute_f1(pred_text, gold_text))

    avg_inference_time = (time.time() - start_time) / len(start_positions)

    metrics = {
        "Exact_Match": np.mean(exact_matches),
        "F1_Score": np.mean(f1_scores),
        "Avg_Inference_Time": avg_inference_time
    }

    return metrics


training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/results",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/content/drive/My Drive/logs",
    learning_rate=4e-5,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[],
    seed=42,
    data_seed=42
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

  trainer = Trainer(


## **With Grid Search**

Function Description

This code establishes a reproducible environment for training a BERT-based question answering model, defines evaluation metrics (Exact Match and F1), sets hyperparameters and training arguments for model fine-tuning, and implements an automated hyperparameter grid search using Optuna integrated with Hugging Face’s Trainer. After the search, it outputs the best hyperparameters and prints results from all trials for detailed analysis.

**Input**

    Tokenized train and evaluation datasets (tokenized_train and tokenized_eval) with questions, contexts, and answers.

    Pretrained model and tokenizer for BERT-based question answering.

    Defined hyperparameter search space including learning rate, batch size, and number of epochs.

    Utility functions for computing Exact Match and F1 evaluation metrics.

    A seed value (42) to ensure deterministic and reproducible training runs.

**Output**

    Metric results for each hyperparameter trial during grid search.

    The best hyperparameter configuration found according to the evaluation (maximizing F1 Score).

    A DataFrame printed to the console with all trial results sorted by performance metric.

    Messages indicating the progress and results of the hyperparameter search.

**Essential Syntaxes**

python
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

    Seeds different RNGs and configures CUDA backend for reproducible results.

python
def compute_exact_match(prediction, truth):
    ...
def compute_f1(prediction, truth):
    ...
def compute_metrics(eval_pred):
    ...

    Define evaluation metric functions that compute Exact Match, F1 score, and average inference time from model predictions and ground truth answers.

python
def tune_hp(trial):
    learning_rate = trial.suggest_categorical("learning_rate", [5e-5, 3e-5, 1e-5])
    per_device_train_batch_size = trial.suggest_categorical("per_device_train_batch_size", [8, 16])
    num_train_epochs = trial.suggest_categorical("num_train_epochs", [3, 4, 5])
    return {...}

    Hyperparameter search space definition for Optuna trials.

python
def model_init():
    return BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)

    Function to reinitialize the model freshly for each hyperparameter trial to avoid weight contamination.

python
grid_trainer = Trainer(
    model_init=model_init,
    args=grid_training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

    Trainer setup for hyperparameter search using the reinitialization function and grid search training arguments.

python
best_trial = grid_trainer.hyperparameter_search(
    backend="optuna",
    hp_space=tune_hp,
    direction="maximize",
    n_trials=18,
)

    Executes Optuna-powered hyperparameter search over 18 trials, maximizing the F1 metric.

python
df_results = pd.DataFrame(trial_results)
df_results.sort_values("metric_value", ascending=False, inplace=True)
print(df_results)

    Converts trial results to a DataFrame, sorts them by performance, and prints them for more interpretable analysis.

**Example Output**

text
--- Starting Grid Search ---
[I 2025-11-08 21:15:30,000] Trial 0 finished with value: 0.76
...
--- Grid Search Complete ---
BEST HYPERPARAMETERS FOUND:
{'learning_rate': 3e-5, 'per_device_train_batch_size': 8, 'num_train_epochs': 5}

Grid Search Trial Results (sorted by metric):
   learning_rate  per_device_train_batch_size  num_train_epochs  metric_value
1        3e-05                           8                5          0.78
0        5e-05                          16                4          0.76
...

Best hyperparameters detail:
learning_rate                3e-05
per_device_train_batch_size      8
num_train_epochs                5
metric_value                 0.78
Name: 1, dtype: object

**Comment and Observation**

This code robustly integrates reproducibility best practices with automatic hyperparameter tuning for transformer QA modeling, using Optuna’s search capabilities tightly coupled with Hugging Face Trainer. The use of a fresh model initialization per trial ensures unbiased evaluation of each hyperparameter set.

In [8]:
import random
import numpy as np
import pandas as pd
import torch
import time
from transformers import TrainingArguments, Trainer, set_seed

import optuna

seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)


torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

def compute_exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

def compute_f1(prediction, truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    common = set(pred_tokens) & set(truth_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(truth_tokens)
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

def compute_metrics(eval_pred):
    start_time = time.time()
    predictions, labels = eval_pred
    start_logits, end_logits = predictions
    start_positions = np.argmax(start_logits, axis=1)
    end_positions = np.argmax(end_logits, axis=1)
    exact_matches = []
    f1_scores = []
    for i in range(len(start_positions)):
        input_ids = tokenized_eval[i]["input_ids"]
        pred_tokens = input_ids[start_positions[i]: end_positions[i] + 1]
        pred_text = tokenizer.decode(pred_tokens, skip_special_tokens=True)
        gold_text = eval_data[i]["Answer"]
        exact_matches.append(compute_exact_match(pred_text, gold_text))
        f1_scores.append(compute_f1(pred_text, gold_text))
    avg_inference_time = (time.time() - start_time) / len(start_positions)
    metrics = {
        "Exact_Match": np.mean(exact_matches),
        "F1_Score": np.mean(f1_scores),
        "Avg_Inference_Time": avg_inference_time
    }
    return metrics

# Define training arguments (hyperparameters)
training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/results", # Changed to Google Drive
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/content/drive/My Drive/logs", # Changed to Google Drive
    learning_rate=4e-5,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[],
    seed=42,
    data_seed=42
)


def tune_hp(trial):
    learning_rate = trial.suggest_categorical("learning_rate", [5e-5, 3e-5, 1e-5])
    per_device_train_batch_size = trial.suggest_categorical("per_device_train_batch_size", [8, 16])
    num_train_epochs = trial.suggest_categorical("num_train_epochs", [3, 4, 5])  # New hyperparameter for epochs
    return {
        "learning_rate": learning_rate,
        "per_device_train_batch_size": per_device_train_batch_size,
        "num_train_epochs": num_train_epochs,
    }

grid_training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/grid_search_results",
    eval_strategy="epoch",
    save_strategy="no",
    load_best_model_at_end=False,
    metric_for_best_model="eval_F1_Score",
    fp16=torch.cuda.is_available(),
    report_to="none",
    num_train_epochs=3,
    warmup_steps=500,
    logging_dir="/content/drive/My Drive/grid_search_logs",
)

def model_init():

    return BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)

grid_trainer = Trainer(
    model_init=model_init,
    args=grid_training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

print("\n--- Starting Grid Search ---")
best_trial = grid_trainer.hyperparameter_search(
    backend="optuna",
    hp_space=tune_hp,
    direction="maximize",
    n_trials=18,
)

if best_trial:
    print("\n--- Grid Search Complete ---")
    print("BEST HYPERPARAMETERS FOUND:")
    print(best_trial)
    best_hps = best_trial.hyperparameters
    print("\nBest Hyperparameters:")
    for key, val in best_hps.items():
        print(f"  {key}: {val}")
else:
    print("Search failed or no best trial found.")

print("\nYou can now initialize TrainingArguments with best_hps for final training.")



if grid_trainer.hp_search_backend and grid_trainer.hp_search_backend.study:
    all_trials = grid_trainer.hp_search_backend.study.get_trials()


    trial_results = []
    for trial in all_trials:
        vals = trial.params.copy()

        vals['metric_value'] = trial.value
        trial_results.append(vals)

    df_results = pd.DataFrame(trial_results)


    df_results.sort_values("metric_value", ascending=False, inplace=True)

    print("\nGrid Search Trial Results (sorted by metric):")
    print(df_results)


    print("\nBest hyperparameters detail:")
    print(df_results.iloc[0])
else:
    print("\nCould not retrieve trial results from the study.")

  grid_trainer = Trainer(
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2025-11-08 14:50:14,436] A new study created in memory with name: no-name-93edb277-3782-4bbd-9b6d-1d49a75a4b85



--- Starting Grid Search ---


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,4.561098,0.0,0.237803,0.001281
2,No log,2.011742,0.0,0.117332,0.00326
3,No log,1.879116,0.0,0.020504,0.002383


[I 2025-11-08 14:51:17,204] Trial 0 finished with value: 0.02288724535352977 and parameters: {'learning_rate': 5e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 3}. Best is trial 0 with value: 0.02288724535352977.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.169468,0.0,0.095487,0.0011
2,No log,5.898768,0.0,0.11783,0.001457
3,No log,5.189728,0.0,0.198351,0.002886


[I 2025-11-08 14:52:10,306] Trial 1 finished with value: 0.20123753796077903 and parameters: {'learning_rate': 1e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}. Best is trial 1 with value: 0.20123753796077903.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.169468,0.0,0.098141,0.001426
2,No log,5.898782,0.0,0.11761,0.000767
3,No log,5.18948,0.0,0.198058,0.000765


[I 2025-11-08 14:53:00,076] Trial 2 finished with value: 0.1988229946566938 and parameters: {'learning_rate': 1e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}. Best is trial 1 with value: 0.20123753796077903.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.134999,0.0,0.196364,0.000992
2,No log,2.950027,0.0,0.326712,0.000803
3,No log,1.891041,0.0,0.017943,0.000721
4,No log,1.821497,0.0,0.025741,0.001074
5,No log,1.932394,0.0,0.062296,0.000825


[I 2025-11-08 14:54:27,898] Trial 3 finished with value: 0.06312078816287812 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 5}. Best is trial 1 with value: 0.20123753796077903.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,4.560705,0.0,0.235636,0.000792
2,No log,2.033477,0.0,0.103114,0.00107
3,No log,1.886484,0.0,0.019772,0.000711
4,No log,1.905676,0.0,0.217533,0.000775


[I 2025-11-08 14:55:36,887] Trial 4 finished with value: 0.2183079076599477 and parameters: {'learning_rate': 5e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 4}. Best is trial 4 with value: 0.2183079076599477.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.135839,0.0,0.195611,0.001205


[I 2025-11-08 14:55:54,672] Trial 5 pruned. 
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,4.560444,0.0,0.237803,0.000777
2,No log,2.015892,0.0,0.11933,0.000733
3,No log,1.88306,0.0,0.017943,0.001059


[I 2025-11-08 14:56:47,286] Trial 6 finished with value: 0.019002213843886135 and parameters: {'learning_rate': 5e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 3}. Best is trial 4 with value: 0.2183079076599477.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.135082,0.0,0.195587,0.000803


[I 2025-11-08 14:57:05,049] Trial 7 pruned. 
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.135082,0.0,0.196364,0.001236


[I 2025-11-08 14:57:22,873] Trial 8 pruned. 
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.009367,0.0,0.108641,0.000768
2,No log,4.894826,0.0,0.225395,0.000849
3,No log,3.879309,0.0,0.296071,0.000785


[I 2025-11-08 14:58:11,286] Trial 9 finished with value: 0.2968559928084098 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}. Best is trial 9 with value: 0.2968559928084098.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.00938,0.0,0.108641,0.000737
2,No log,4.89473,0.0,0.225136,0.001139
3,No log,3.879935,0.0,0.296071,0.000822
4,No log,2.274859,0.0,0.345724,0.000787


[I 2025-11-08 14:59:14,865] Trial 10 finished with value: 0.34651100628764797 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 4}. Best is trial 10 with value: 0.34651100628764797.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.009298,0.0,0.108823,0.000754
2,No log,4.894401,0.0,0.225395,0.001127
3,No log,3.879143,0.0,0.296071,0.000773
4,No log,2.277247,0.0,0.347941,0.000812


[I 2025-11-08 15:00:18,530] Trial 11 finished with value: 0.34875268419384425 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 4}. Best is trial 11 with value: 0.34875268419384425.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.009436,0.0,0.108641,0.000733


[I 2025-11-08 15:00:34,786] Trial 12 pruned. 
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.00934,0.0,0.108823,0.000985
2,No log,4.894119,0.0,0.223696,0.000784


[I 2025-11-08 15:01:06,723] Trial 13 pruned. 
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.009311,0.0,0.108823,0.000793
2,No log,4.894951,0.0,0.223696,0.000863


[I 2025-11-08 15:01:38,528] Trial 14 pruned. 
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.169468,0.0,0.095487,0.000728


[I 2025-11-08 15:01:54,884] Trial 15 pruned. 
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.00938,0.0,0.108823,0.000755
2,No log,4.894104,0.0,0.224901,0.00083


[I 2025-11-08 15:02:26,909] Trial 16 pruned. 
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.00938,0.0,0.108823,0.00077
2,No log,4.894675,0.0,0.223696,0.000797


[I 2025-11-08 15:02:58,785] Trial 17 pruned. 



--- Grid Search Complete ---
BEST HYPERPARAMETERS FOUND:
BestRun(run_id='11', objective=0.34875268419384425, hyperparameters={'learning_rate': 3e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 4}, run_summary=None)

Best Hyperparameters:
  learning_rate: 3e-05
  per_device_train_batch_size: 16
  num_train_epochs: 4

You can now initialize TrainingArguments with best_hps for final training.

Could not retrieve trial results from the study.


## **With Random Search**

**Function Description**

This code performs a random hyperparameter search for fine-tuning a BERT question answering model using Hugging Face’s Trainer API. It sets global reproducibility seeds, defines evaluation metrics (Exact Match and F1), and iteratively trains multiple models with randomly chosen hyperparameters (learning rate, batch size, epochs). Each trial trains a fresh model instance, evaluates performance, and stores results for later analysis.

**Input**

    Tokenized training and evaluation datasets (tokenized_train and tokenized_eval).

    Pretrained BERT question answering model and its corresponding tokenizer.

    Defined search space for three hyperparameters: learning rate, batch size, and number of training epochs.

    Seed value for deterministic behavior and reproducibility.

**Output**

    Training and evaluation metrics (Exact Match, F1 Score, average inference time) for each hyperparameter trial.

    Training time per trial.

    A sorted pandas DataFrame showing results of all trials ranked by F1 score.

    Display of the best-performing hyperparameter combination identified from the random search.

**Essential Syntaxes**

python
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

    Sets global random seeds and CUDA deterministic behavior to ensure reproducible results.

python
def compute_metrics(eval_pred):
    ...
    return metrics

    Defines the evaluation function computing Exact Match, F1 Score, and average inference time by decoding predicted token spans back to text.

python
args = copy.deepcopy(base_training_args)
args.per_device_train_batch_size = batch_size
args.learning_rate = lr
args.num_train_epochs = epochs
args.seed = seed_value + trial_num

    Creates an independent set of training arguments for each trial with randomized hyperparameters and a unique seed.

python
trainer = Trainer(
    model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

    Initializes a fresh Trainer for each trial with a new model instance and associated hyperparameters.

python
trainer.train()
metrics = trainer.evaluate()

    Runs training and evaluation phases for each trial.

python
df_results = pd.DataFrame(results)
df_results.sort_values("eval_f1_score", ascending=False, inplace=True)

    Aggregates trial results into a DataFrame and sorts by evaluation F1 score to identify the best trial.

**Example Output**

text
Trial 4: Learning Rate=3e-05, Batch Size=16, Epochs=5
...
Random Search Results Sorted by F1 Score:
   trial  learning_rate  batch_size  epochs  train_time_sec  eval_exact_match  eval_f1_score  eval_avg_inference_time
3      4        3e-05          16       5         1189.32              0.75           0.80                  0.0051
...
Best Hyperparameters:
trial                  4
learning_rate       3e-05
batch_size            16
epochs                 5
train_time_sec     1189.32
eval_exact_match    0.75
eval_f1_score       0.80
eval_avg_inference_time 0.0051
Name: 3, dtype: object

Comment and Observation

This implementation follows a straightforward randomized hyperparameter search strategy, ensuring each trial is statistically independent by reseeding and initializing a new model.

In [9]:
import random
import numpy as np
import pandas as pd
import torch
import time
from transformers import TrainingArguments, Trainer, set_seed
from transformers.data.data_collator import default_data_collator
from transformers import BertForQuestionAnswering  # Make sure you import your model
import optuna
import copy

seed_value = 42
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)


torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

def compute_exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

def compute_f1(prediction, truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    common = set(pred_tokens) & set(truth_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(truth_tokens)
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

def compute_metrics(eval_pred):
    start_time = time.time()
    predictions, labels = eval_pred
    start_logits, end_logits = predictions
    start_positions = np.argmax(start_logits, axis=1)
    end_positions = np.argmax(end_logits, axis=1)
    exact_matches = []
    f1_scores = []
    for i in range(len(start_positions)):
        input_ids = tokenized_eval[i]["input_ids"]
        pred_tokens = input_ids[start_positions[i]: end_positions[i] + 1]
        pred_text = tokenizer.decode(pred_tokens, skip_special_tokens=True)
        gold_text = eval_data[i]["Answer"]
        exact_matches.append(compute_exact_match(pred_text, gold_text))
        f1_scores.append(compute_f1(pred_text, gold_text))
    avg_inference_time = (time.time() - start_time) / len(start_positions)
    metrics = {
        "Exact_Match": np.mean(exact_matches),
        "F1_Score": np.mean(f1_scores),
        "Avg_Inference_Time": avg_inference_time
    }
    return metrics


base_training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/random_search_results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/content/drive/My Drive/random_search_logs",
    learning_rate=4e-5,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="no",
    load_best_model_at_end=False,
    fp16=torch.cuda.is_available(),
    report_to=[],

    seed=42,
    data_seed=42,
)


random.seed(seed_value)

num_trials = 18
results = []

for trial_num in range(1, num_trials + 1):

    lr = random.choice([5e-5, 3e-5, 1e-5])
    batch_size = random.choice([8, 16])
    epochs = random.choice([3, 4, 5])

    print(f"\nTrial {trial_num}: Learning Rate={lr}, Batch Size={batch_size}, Epochs={epochs}")


    args = copy.deepcopy(base_training_args)
    args.per_device_train_batch_size = batch_size
    args.per_device_eval_batch_size = batch_size
    args.learning_rate = lr
    args.num_train_epochs = epochs
    args.output_dir = f"/content/drive/My Drive/random_search_results/trial_{trial_num}"

    args.seed = seed_value + trial_num
    args.data_seed = seed_value + trial_num


    trainer = Trainer(
        model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
        args=args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

    start_time = time.time()
    trainer.train()
    train_time = time.time() - start_time

    metrics = trainer.evaluate()

    results.append({
        "trial": trial_num,
        "learning_rate": lr,
        "batch_size": batch_size,
        "epochs": epochs,
        "train_time_sec": train_time,
        "eval_exact_match": metrics.get("eval_Exact_Match"),
        "eval_f1_score": metrics.get("eval_F1_Score"),
        "eval_avg_inference_time": metrics.get("eval_Avg_Inference_Time"),
    })


df_results = pd.DataFrame(results)
df_results.sort_values("eval_f1_score", ascending=False, inplace=True)

print("\nRandom Search Results Sorted by F1 Score:")
display(df_results)

print("\nBest Hyperparameters:")
display(df_results.iloc[0])


Trial 1: Learning Rate=1e-05, Batch Size=8, Epochs=3


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.983701,0.0,0.1116,0.000722
2,6.064300,4.704707,0.0,0.219678,0.000761
3,4.756400,3.6741,0.0,0.319048,0.000821



Trial 2: Learning Rate=5e-05, Batch Size=16, Epochs=5


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.864092,0.0,0.164926,0.001136
2,No log,4.325779,0.0,0.219094,0.000785
3,5.158500,2.173169,0.0,0.271548,0.000806
4,5.158500,1.900278,0.0,0.017943,0.000706
5,5.158500,1.876008,0.0,0.074347,0.000737



Trial 3: Learning Rate=3e-05, Batch Size=8, Epochs=3


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.48681,0.0,0.16835,0.000747
2,5.795700,3.341656,0.0,0.341755,0.000797
3,3.209700,1.917612,0.0,0.018539,0.000715



Trial 4: Learning Rate=3e-05, Batch Size=16, Epochs=4


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.057562,0.0,0.212314,0.000757
2,No log,4.968213,0.0,0.268551,0.000781
3,5.562600,3.790273,0.0,0.326573,0.000808
4,5.562600,2.115113,0.0,0.264201,0.000764



Trial 5: Learning Rate=5e-05, Batch Size=16, Epochs=3


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.904035,0.0,0.191229,0.000771
2,No log,4.383355,0.0,0.254797,0.000782
3,5.180600,2.265693,0.0,0.322808,0.000879



Trial 6: Learning Rate=3e-05, Batch Size=8, Epochs=4


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.395769,0.0,0.208279,0.001068
2,5.778200,2.905244,0.0,0.343269,0.000894
3,2.967100,1.908113,0.0,0.017943,0.000699
4,2.967100,1.8438,0.0,0.128543,0.001105



Trial 7: Learning Rate=1e-05, Batch Size=16, Epochs=3


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.014236,0.0,0.069681,0.000736
2,No log,5.743755,0.0,0.099555,0.000936
3,5.903800,5.122689,0.0,0.148255,0.000769



Trial 8: Learning Rate=5e-05, Batch Size=16, Epochs=4


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.708737,0.0,0.143472,0.001036
2,No log,4.389945,0.0,0.227909,0.000798
3,5.148800,2.892556,0.0,0.306425,0.000828
4,5.148800,1.860859,0.0,0.026923,0.000734



Trial 9: Learning Rate=3e-05, Batch Size=16, Epochs=4


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.069349,0.0,0.200905,0.001234
2,No log,5.080105,0.0,0.176783,0.001251
3,5.592600,3.813518,0.0,0.238855,0.000822
4,5.592600,2.200797,0.0,0.280889,0.000789



Trial 10: Learning Rate=5e-05, Batch Size=8, Epochs=3


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,4.876513,0.0,0.129772,0.001016
2,5.375800,1.982022,0.0,0.018539,0.000766
3,2.236900,1.98117,0.0,0.017943,0.000745



Trial 11: Learning Rate=3e-05, Batch Size=8, Epochs=5


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.586625,0.0,0.267894,0.001122
2,5.830400,2.894347,0.0,0.29517,0.000878
3,3.045400,1.87984,0.0,0.017943,0.000753
4,3.045400,1.830453,0.0,0.017943,0.000757
5,1.775400,1.933125,0.0,0.06449,0.000723



Trial 12: Learning Rate=1e-05, Batch Size=8, Epochs=4


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.933566,0.0,0.112529,0.000716
2,6.089800,4.618569,0.0,0.202398,0.001107
3,4.702300,3.246699,0.0,0.311729,0.000813
4,4.702300,1.953852,0.0,0.017943,0.000703



Trial 13: Learning Rate=5e-05, Batch Size=16, Epochs=5


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.881423,0.0,0.175839,0.00075
2,No log,4.4144,0.0,0.273777,0.001038
3,5.192600,2.234987,0.0,0.345864,0.000791
4,5.192600,1.856277,0.0,0.017943,0.000697
5,5.192600,1.880855,0.0,0.086606,0.000806



Trial 14: Learning Rate=5e-05, Batch Size=8, Epochs=3


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,4.544921,0.0,0.269496,0.001017
2,5.229200,1.959491,0.0,0.020494,0.000797
3,2.217000,1.851912,0.0,0.020886,0.000713



Trial 15: Learning Rate=1e-05, Batch Size=8, Epochs=4


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.907475,0.0,0.1987,0.001174
2,6.033500,4.595242,0.0,0.199794,0.000814
3,4.630100,3.040465,0.0,0.300789,0.000791
4,4.630100,1.987139,0.0,0.024496,0.001047



Trial 16: Learning Rate=5e-05, Batch Size=16, Epochs=5


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.894642,0.0,0.212264,0.00078
2,No log,4.286855,0.0,0.302245,0.000935
3,5.109300,2.148203,0.0,0.331361,0.000832
4,5.109300,1.865143,0.0,0.02011,0.000732
5,5.109300,1.830909,0.0,0.036626,0.001034



Trial 17: Learning Rate=1e-05, Batch Size=8, Epochs=3


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.024662,0.0,0.217208,0.000776
2,6.106600,4.789709,0.0,0.257456,0.000793
3,4.788800,3.424017,0.0,0.343958,0.001219



Trial 18: Learning Rate=5e-05, Batch Size=8, Epochs=5


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,4.508865,0.0,0.285296,0.000792
2,5.305100,1.950616,0.0,0.020457,0.000707
3,2.225700,1.827825,0.0,0.017943,0.001147
4,2.225700,1.979432,0.0,0.047433,0.000724
5,1.603100,2.000849,0.0,0.124879,0.000739



Random Search Results Sorted by F1 Score:


Unnamed: 0,trial,learning_rate,batch_size,epochs,train_time_sec,eval_exact_match,eval_f1_score,eval_avg_inference_time
16,17,1e-05,8,3,51.064417,0.0,0.343958,0.000815
4,5,5e-05,16,3,46.995197,0.0,0.322808,0.000781
0,1,1e-05,8,3,51.301821,0.0,0.319048,0.000789
8,9,3e-05,16,4,62.373662,0.0,0.280889,0.000897
3,4,3e-05,16,4,62.209715,0.0,0.264201,0.000996
6,7,1e-05,16,3,46.980091,0.0,0.148255,0.00086
5,6,3e-05,8,4,68.232312,0.0,0.128543,0.001031
17,18,5e-05,8,5,84.058161,0.0,0.124879,0.000747
12,13,5e-05,16,5,76.957572,0.0,0.086606,0.000802
1,2,5e-05,16,5,77.271131,0.0,0.074347,0.001034



Best Hyperparameters:


Unnamed: 0,16
trial,17.0
learning_rate,1e-05
batch_size,8.0
epochs,3.0
train_time_sec,51.064417
eval_exact_match,0.0
eval_f1_score,0.343958
eval_avg_inference_time,0.000815


# **Execution (Evaluation)**

## **The First Execution (From Previous Activity)**

**Function Description**

This code block manages the full run of fine-tuning a BERT-based question answering model. It sets random seeds for reproducibility, trains the model, evaluates its performance, saves the trained model checkpoint, and communicates progress to the user.

**Input**

    The trainer object: A Hugging Face Trainer instance preconfigured with model, datasets, hyperparameters, metrics, and tokenizer.

    Model saving path: './bert_qa_best' for checkpoint storage.

**Output**

    Console messages detailing the start of training, evaluation results, and completion status.

    Printed evaluation metrics such as F1 Score and Exact Match after training.

    Model checkpoint saved to disk for future inference tasks.

**Essential Syntaxes**

python
set_seed(42)

Ensures all random processes (Python, NumPy, Torch) are seeded, producing reliable and repeatable results.

python
trainer.train()

Starts model fine-tuning using configured settings on your training data.

python
trainer.evaluate()

Computes performance metrics on your validation or evaluation dataset.

python
trainer.save_model("./bert_qa_best")

Saves the trained model, tokenizer, and config to a specified directory for later use.

**Example Output**

--- Starting Fine-Tuning (Expected Time: 1–4 hours on GPU) ---

[Training progress output]

--- Final Evaluation Results ---
{'eval_loss': 0.95, 'eval_Exact_Match': 0.77, 'eval_F1_Score': 0.82, ...}

Fine-tuning process complete. The resulting model can now be used for Inference (Stage 4).

**Comment and Observation**

In this block, you see a typical workflow for preparing a BERT QA model: not just training, but also evaluating, saving, and documenting progress in clear stages.

In [10]:
from transformers import set_seed


print("\n--- Starting Fine-Tuning (Expected Time: 1–4 hours on GPU) ---")

set_seed(42)


trainer.train()


print("\n--- Final Evaluation Results ---")


set_seed(42)

eval_results = trainer.evaluate()
print(eval_results)

Save the best model checkpoint for later Inference
trainer.save_model("./bert_qa_best")

print("\n Fine-tuning process complete. The resulting model can now be used for Inference (Stage 4).")


SyntaxError: invalid syntax (ipython-input-2606965218.py, line 20)

**Function Description**

This code block kicks off and completes the fine-tuning process for your BERT-based question answering model using the best hyperparameters from your random search. It ensures reproducibility, performs model training, evaluates performance, saves the model checkpoint, and provides feedback for your workflow status.

**Input**

    An initialized trainer object (already set up with best random search hyperparameters, model, datasets, metrics, and tokenizer).

    A Google Drive save path for the best checkpoint.

**Output**

    Progress and status messages in the console about the training, evaluation, and model saving stages.

    Final evaluation metrics displayed (like F1 and Exact Match for your validation set).

    A saved model checkpoint in Google Drive, ready for use during inference (Stage 4).

**Essential Syntaxes**

python
set_seed(42)

    Fixes random seeds for all libraries, ensuring results are reproducible across different runs.

python
trainer.train()

    Trains your BERT QA model on the training data using the best hyperparameters from your random search.

python
trainer.evaluate()

    Evaluates the trained model's performance on your held-out validation dataset.

python
trainer.save_model("/content/drive/My Drive/bert_qa_best_random_hps")

    Saves the complete, fine-tuned model checkpoint to Google Drive so you can load it later for question-answering tasks.

**Example Output**


--- Starting Fine-Tuning with Best Hyperparameters (Expected Time: 1–4 hours on GPU) ---

[284/284 02:38, Epoch 4/4]
Epoch 	Training Loss 	Validation Loss 	Exact Match 	F1 Score 	Avg Inference Time
1 	No log 	1.811164 	0.000000 	0.034671 	0.000789
2 	1.607500 	1.942523 	0.000000 	0.235041 	0.001137
3 	1.159300 	2.277214 	0.000000 	0.270486 	0.002345
4 	1.159300 	2.429585 	0.000000 	0.294740 	0.000993

**Comment and Observation**

This phase puts your best-found hyperparameters into practice, ensuring the training results you see are reliable and repeatable by setting the seed.

In [None]:
from transformers import set_seed

print("\n--- Starting Fine-Tuning with Best Hyperparameters (Expected Time: 1–4 hours on GPU) ---")


set_seed(42)

trainer.train()

print("\n--- Final Evaluation Results with Best Hyperparameters ---")


set_seed(42)

eval_results = trainer.evaluate()
print(eval_results)

trainer.save_model("/content/drive/My Drive/bert_qa_best_random_hps")

print("\n Fine-tuning process complete with best hyperparameters. The resulting model can now be used for Inference (Stage 4).")

## **Execution using the best hyperparameters found (Grid Search)**

**Function Description**

This code configures and initializes a Hugging Face Trainer using the best hyperparameters obtained from a prior grid search for fine-tuning a BERT question answering model. It sets up training arguments such as learning rate, batch size, number of epochs, and device preferences, then prepares the trainer with the selected datasets, metrics, and tokenizer for the final training run.

**Input**

    Best hyperparameters dictionary (best_hps) containing keys: num_train_epochs, per_device_train_batch_size, and learning_rate discovered from grid search.

    Pretrained BERT model instance (model).

    Tokenized training and evaluation datasets (tokenized_train and tokenized_eval).

    Predefined evaluation metric function (compute_metrics).

    Tokenizer for data collation.

    Default data collator to handle batch preparation.

**Output**

    An initialized Trainer object configured to train the model using the best hyperparameters.

    Printed confirmation output displaying the current training arguments, including directories, learning rate, batch size, and other settings.

**Essential Syntaxes**

python
best_training_args = TrainingArguments(
    output_dir=...,
    num_train_epochs=best_hps['num_train_epochs'],
    per_device_train_batch_size=best_hps['per_device_train_batch_size'],
    learning_rate=best_hps['learning_rate'],
    ...
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    seed=42,
    data_seed=42,
)

    Creates training argument instance using the best hyperparameters while maintaining consistent logging and saving configurations.

python
trainer = Trainer(
    model=model,
    args=best_training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

    Sets up the Hugging Face Trainer with the finalized arguments, enabling efficient training and evaluation.

python
print(best_training_args)

    Prints the training configuration to confirm settings prior to starting training.

**Example Output**


TrainingArguments(
  output_dir=/content/drive/My Drive/results_best_hps,
  num_train_epochs=5,
  per_device_train_batch_size=8,
  per_device_eval_batch_size=8,
  warmup_steps=500,
  weight_decay=0.01,
  learning_rate=3e-05,
  logging_dir=/content/drive/My Drive/logs_best_hps,
  save_strategy=epoch,
  eval_strategy=epoch,
  load_best_model_at_end=True,
  fp16=True,
  seed=42,
  data_seed=42,
)

**Comment and Observation**

This code completes the model fine-tuning pipeline by initializing the Trainer with carefully selected hyperparameters validated through grid search. Using load_best_model_at_end=True ensures that the best checkpoint saved during training will be loaded for final evaluations or deployment.

In [11]:

best_training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/results_best_hps",
    num_train_epochs=best_hps['num_train_epochs'],
    per_device_train_batch_size=best_hps['per_device_train_batch_size'],
    per_device_eval_batch_size=best_hps['per_device_train_batch_size'],
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/content/drive/My Drive/logs_best_hps",
    learning_rate=best_hps['learning_rate'],
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[],
    seed=42,
    data_seed=42
)


trainer = Trainer(
    model=model,
    args=best_training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

print("\n--- Initialized Trainer with Best Hyperparameters ---")
print(best_training_args)


--- Initialized Trainer with Best Hyperparameters ---
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=42,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=

  trainer = Trainer(


Function Description

This code block performs the final stages of BERT-based question answering model fine-tuning, including:

    Training the model using previously identified best hyperparameters.

    Evaluating the fine-tuned model's performance on the evaluation dataset.

    Saving the trained model checkpoint for future inference.

    Ensuring all random seeds are set for reproducibility.

Input

    The trainer object: configured with your best hyperparameters, model, datasets, tokenizer, and metric function.

    Google Drive path for saving the model checkpoint (as a string).

Output

    Console messages indicating the progress of training and evaluation.

    Printed final evaluation metrics (like F1 Score, Exact Match).

    The trained model saved in your specified Google Drive directory for Stage 4 (Inference).

Essential Syntaxes

python
set_seed(42)

    Fixes all Python and GPU-related random sources to make training and evaluation reproducible.

python
trainer.train()

    Triggers model fine-tuning (training) using the best hyperparameters found during search.

python
trainer.evaluate()

    Runs evaluation on the held-out dataset to report final performance metrics.

python
trainer.save_model("/content/drive/My Drive/bert_qa_best_hps")

    Saves a complete checkpoint of your trained model (weights, config, and tokenizer) to Google Drive. You can reload this checkpoint later for QA inference.

Example Output

text
--- Starting Fine-Tuning with Best Hyperparameters (Expected Time: 1–4 hours on GPU) ---

[Training progress output]

--- Final Evaluation Results with Best Hyperparameters ---
{'eval_loss': 1.02, 'eval_Exact_Match': 0.76, 'eval_F1_Score': 0.81, ...}

Fine-tuning process complete with best hyperparameters. The resulting model can now be used for Inference (Stage 4).

Comment and Observation

This final phase puts everything together: your model is trained on the best settings, thoroughly evaluated, and safely saved for later use.

In [12]:
from transformers import set_seed


print("\n--- Starting Fine-Tuning with Best Hyperparameters (Expected Time: 1–4 hours on GPU) ---")


set_seed(42)

trainer.train()


print("\n--- Final Evaluation Results with Best Hyperparameters ---")


set_seed(42)

eval_results = trainer.evaluate()
print(eval_results)

trainer.save_model("/content/drive/My Drive/bert_qa_best_hps")

print("\n Fine-tuning process complete with best hyperparameters. The resulting model can now be used for Inference (Stage 4).")


--- Starting Fine-Tuning with Best Hyperparameters (Expected Time: 1–4 hours on GPU) ---


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,6.018747,0.007042,0.266756,0.000814
2,No log,5.291263,0.0,0.227263,0.001024
3,5.619400,3.783773,0.0,0.276288,0.002263
4,5.619400,2.121073,0.0,0.173171,0.000872



--- Final Evaluation Results with Best Hyperparameters ---


{'eval_loss': 2.121073007583618, 'eval_Exact_Match': 0.0, 'eval_F1_Score': 0.1731714489590942, 'eval_Avg_Inference_Time': 0.0009119880031531965, 'eval_runtime': 1.1836, 'eval_samples_per_second': 119.971, 'eval_steps_per_second': 7.604, 'epoch': 4.0}

 Fine-tuning process complete with best hyperparameters. The resulting model can now be used for Inference (Stage 4).


## **Execution using the best hyperparameters found (Random Search)**

**Function Description**

This code block sets up a Hugging Face Trainer to fine-tune a BERT-based question answering model using the best hyperparameters found from a random search. It extracts the optimal combination of hyperparameters from your trial results and configures the training process to use these settings for a final training run.
Input

    best_hps_random_search: A dictionary containing the best hyperparameters (epochs, batch_size, learning_rate) identified during random search.

    Pretrained model and tokenizer for BERT QA.

    Tokenized training and evaluation datasets (tokenized_train, tokenized_eval).

    Previously defined metric function (compute_metrics).

**Output**

    An initialized Trainer object configured with the selected best hyperparameters—ready for fine-tuning and evaluation.

    Printed confirmation of the training arguments for transparency and tracking.

**Essential Syntaxes**

python
best_hps_random_search = df_results.iloc[0].to_dict()

    Grabs the best hyperparameter set as a Python dictionary from your sorted trial results DataFrame.

python
best_training_args_random_search = TrainingArguments(
    output_dir=...,  # folder for saving model outputs
    num_train_epochs=int(best_hps_random_search['epochs']),
    per_device_train_batch_size=int(best_hps_random_search['batch_size']),
    learning_rate=best_hps_random_search['learning_rate'],
    ...
)

    Initializes TrainingArguments with the values from your best random search trial.

python
trainer_random_search = Trainer(
    model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
    args=best_training_args_random_search,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

    Prepares a new Trainer object with all required settings for final training.

**Example Output**

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


--- Initialized Trainer with Best Hyperparameters from Random Search ---
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=42,

**Comment and Observation**

This block ensures you are leveraging the most effective hyperparameter configuration discovered during random search. By saving outputs and logs to Google Drive, results and checkpoints are preserved for later validation or deployment

In [14]:

best_hps_random_search = df_results.iloc[0].to_dict()


best_training_args_random_search = TrainingArguments(
    output_dir="/content/drive/My Drive/results_best_hps_random_search",
    num_train_epochs=int(best_hps_random_search['epochs']),
    per_device_train_batch_size=int(best_hps_random_search['batch_size']),
    per_device_eval_batch_size=int(best_hps_random_search['batch_size']),
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/content/drive/My Drive/logs_best_hps_random_search",
    learning_rate=best_hps_random_search['learning_rate'],
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[],
    seed=42,
    data_seed=42
)


trainer_random_search = Trainer(
    model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
    args=best_training_args_random_search,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

print("\n--- Initialized Trainer with Best Hyperparameters from Random Search ---")
print(best_training_args_random_search)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



--- Initialized Trainer with Best Hyperparameters from Random Search ---
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=42,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=N

  trainer_random_search = Trainer(


Function Description

This code performs the final round of fine-tuning for your BERT-based QA model using the best hyperparameters discovered from random search. It takes care to set seeds for reproducibility, starts model training, evaluates your model, saves the best checkpoint, and prints friendly, readable progress updates throughout.
Input

    The trainer_random_search object: a Hugging Face Trainer, pre-loaded with the model, datasets, tokenizer, metrics, and best random search hyperparameters.

    Save path for the trained model checkpoint (here: /content/drive/My Drive/bert_qa_best_random_hps).

Output

    Console logs and printed messages showing the start, completion, and high-level metrics of fine-tuning and evaluation.

    Final evaluation dictionary (eval_results_random_search) summarizing performance (e.g., F1 Score, Exact Match).

    Saved model checkpoint for future inference.

Essential Syntaxes

python
set_seed(42)

Sets all sources of randomness in your environment. This is crucial for reliable, repeatable experiments.

python
trainer_random_search.train()

Begins fine-tuning using best random search settings. This usually takes 1–4 hours with a GPU — so sit back, maybe grab a coffee while it runs!

​

python
trainer_random_search.evaluate()

Checks the model against your validation set and reports metrics, helping you measure how successful fine-tuning was.

python
trainer_random_search.save_model(...)

Stores your fully trained model on Google Drive, so you don’t need to retrain later.
Example Output

text
--- Starting Fine-Tuning with Best Hyperparameters from Random Search (Expected Time: 1–4 hours on GPU) ---

[Training progress output]

--- Final Evaluation Results with Best Hyperparameters from Random Search ---
{'eval_loss': 0.98, 'eval_Exact_Match': 0.80, 'eval_F1_Score': 0.87, ...}

Fine-tuning process complete with best hyperparameters from random search. The resulting model can now be used for Inference (Stage 4).

**Comment and Observation**

This workflow is designed to be user-friendly and transparent: you get clear status updates before, during, and after training. This is just a test Dr. Raga but as you can see this is the flow that we want to build which is q & a essay type.

In [15]:
from transformers import set_seed

print("\n--- Starting Fine-Tuning with Best Hyperparameters from Random Search (Expected Time: 1–4 hours on GPU) ---")


set_seed(42)


trainer_random_search.train()

print("\n--- Final Evaluation Results with Best Hyperparameters from Random Search ---")


set_seed(42)

eval_results_random_search = trainer_random_search.evaluate()
print(eval_results_random_search)

trainer_random_search.save_model("/content/drive/My Drive/bert_qa_best_random_hps")

print("\n Fine-tuning process complete with best hyperparameters from random search. The resulting model can now be used for Inference (Stage 4).")


--- Starting Fine-Tuning with Best Hyperparameters from Random Search (Expected Time: 1–4 hours on GPU) ---


Epoch,Training Loss,Validation Loss,Exact Match,F1 Score,Avg Inference Time
1,No log,5.973124,0.0,0.111054,0.00073
2,6.068700,4.714879,0.0,0.224718,0.00138
3,4.762800,3.688576,0.0,0.311595,0.001053



--- Final Evaluation Results with Best Hyperparameters from Random Search ---


{'eval_loss': 3.6885762214660645, 'eval_Exact_Match': 0.0, 'eval_F1_Score': 0.31159524856842297, 'eval_Avg_Inference_Time': 0.0008031818228708186, 'eval_runtime': 1.234, 'eval_samples_per_second': 115.069, 'eval_steps_per_second': 14.586, 'epoch': 3.0}

 Fine-tuning process complete with best hyperparameters from random search. The resulting model can now be used for Inference (Stage 4).


# **Actual Testing**

**Function Description**

This code creates an interactive question-answering session using a Hugging Face model pipeline. For each question from your DataFrame, it poses the question, gets your answer, compares it to the ground truth using exact match and F1 metrics, and also measures the semantic similarity using embeddings and cosine similarity. It uses the Hugging Face pipeline utility for simple and effective inference.

**Input**

    df: A DataFrame containing columns "Question", "Context", and "Answer" for each example.

    User input (your answer to each posed question).

**Output**

    Prints detailed evaluation for each test round: ground truth, model's predicted answer, your answer, Exact Match, F1 score, cosine similarity, and inference time.

**Essential Syntaxes**

python
qna_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device=device_num
)

    Sets up a question-answering pipeline using your fine-tuned model and tokenizer. Uses GPU if available, else CPU.

python
def get_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
    with torch.no_grad():
        outputs = model.bert(**inputs)
    return outputs.last_hidden_state[0][0].cpu().numpy()

    Computes the embedding for a given text by passing it through the base BERT model. Used to compare semantic similarity of answers.

python
bert_result = qna_pipeline({"question": row['Question'], "context": row['Context']})

    Performs inference: predicts the answer span using the model.

python
exact = compute_exact_match(user_answer, row['Answer'])
f1 = compute_f1(user_answer, row['Answer'])

    Calculates the exact match and F1 metrics for your answer against the reference answer.

python
cos_sim = cosine_similarity([user_emb], [gt_emb])[0][0]

    Finds cosine similarity (semantic closeness) between your answer and the ground truth, using embeddings.

**Example Output**

text
Question: 1
What does GeeksforGeeks provide?
Please type your answer:
<user types>

Evaluation
Ground Truth Answer: resources for computer science
Model Predicted Answer: resources for computer science
Your Answer: resources for computer science
Exact Match: 1
F1 Score: 1.0000
Cosine Similarity: 1.0000
Inference Time (BERT QA): 0.1122 seconds

**Comment and Observation**

This approach is ideal for human-in-the-loop evaluation. You can check your own understanding, see how the model performs, and compare results both exactly (Exact Match, F1) and by semantics (cosine similarity of embeddings).

In [16]:
import random
import time
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline


device_num = 0 if torch.cuda.is_available() else -1
question_times = 0


qna_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device=device_num
)

def get_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
    with torch.no_grad():
        outputs = model.bert(**inputs)
    return outputs.last_hidden_state[0][0].cpu().numpy()

def interactive_test(df):

    row = df.sample(n=1).iloc[0]


    print("\nQuestion:", question_times)
    print(row['Question'])


    user_answer = input("\nPlease type your answer:\n")


    start = time.time()
    bert_result = qna_pipeline({
        "question": row['Question'],
        "context": row['Context']
    })
    inference_time = time.time() - start


    exact = compute_exact_match(user_answer, row['Answer'])
    f1 = compute_f1(user_answer, row['Answer'])

    user_emb = get_embedding(user_answer, tokenizer, model)
    gt_emb = get_embedding(row['Answer'], tokenizer, model)

    cos_sim = cosine_similarity([user_emb], [gt_emb])[0][0]

    print("\n--- Evaluation ---")
    print(f"Ground Truth Answer: {row['Answer']}")
    print(f"Model Predicted Answer: {bert_result['answer']}")
    print(f"Your Answer: {user_answer}")
    print(f"Exact Match: {exact}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Cosine Similarity: {cos_sim:.4f}")
    print(f"Inference Time (BERT QA): {inference_time:.4f} seconds")



while question_times < 10:
    interactive_test(df)
    question_times += 1





Device set to use cuda:0



Question: 0
What is the fundamental purpose of a vector embedding, and what types of data can it represent for machine learning models?

Please type your answer:
Converting text into numerical representaiton





--- Evaluation ---
Ground Truth Answer: Their fundamental purpose is to serve as numerical representations of data points, expressing nonmathematical data like words or images as an array of numbers that ML models can process.
Model Predicted Answer: Vector embeddings
Your Answer: Converting text into numerical representaiton
Exact Match: 0
F1 Score: 0.0588
Cosine Similarity: 0.6943
Inference Time (BERT QA): 0.0734 seconds

Question: 1
What elimination strategy does Hyperband employ for early stopping, how does successive halving progressively reduce the configuration pool, and what selection criterion determines which configurations advance through iterations?

Please type your answer:
idk





--- Evaluation ---
Ground Truth Answer: Hyperband uses successive halving for early stopping, removing the worst-performing half after each training round and carrying the top 50 forward until one optimal configuration remains.
Model Predicted Answer: This
Your Answer: idk
Exact Match: 0
F1 Score: 0.0000
Cosine Similarity: 0.3365
Inference Time (BERT QA): 0.0150 seconds

Question: 2
What two sampling dimensions do subsample and colsamplebytree control in XGBoost training, and how do they respectively regulate data and feature utilization during model construction?

Please type your answer:
subsample sets data sample percentage per training round, and colsample_bytree sets feature percentage for tree construction.





--- Evaluation ---
Ground Truth Answer: subsample sets data sample percentage per training round, and colsamplebytree sets feature percentage for tree construction.
Model Predicted Answer: subsample
Your Answer: subsample sets data sample percentage per training round, and colsample_bytree sets feature percentage for tree construction.
Exact Match: 0
F1 Score: 0.8125
Cosine Similarity: 0.9936
Inference Time (BERT QA): 0.0140 seconds

Question: 3
What must be considered to ensure a successful linear-regression analysis?


KeyboardInterrupt: Interrupted by user