# **Setup and Data Preparation**

**Function Description**

This code block loads a dataset from Google Drive (for example, `"/content/drive/My Drive/Data Collection (ITE Elective Course Lesson)/Dataset/Webscraped data - ITE Elective 3 - Sheet1.csv"`; the code cell below reads the Excel file `Dataset_QnA.xlsx`), then processes it into a format suitable for the Hugging Face `datasets` and `transformers` libraries. It also checks GPU availability to enable efficient computation.

**Input**

The input to this code block is a spreadsheet file (CSV or Excel) on Google Drive containing the text data for model training.

**Output**

The output of this code block is a confirmation message about GPU or CPU usage, followed by two datasets printed to the console: training data (`train_data`) and evaluation data (`eval_data`).

**Essential Syntaxes**



    drive.mount('/content/drive/', force_remount=True) mounts Google Drive to access external files.

    pd.read_excel(path) loads the dataset into a pandas DataFrame (pd.read_csv(path) is the equivalent for CSV files).

    torch.cuda.is_available() checks whether a GPU is accessible.

    Dataset.from_pandas(df) converts a pandas DataFrame into a Hugging Face Dataset object.

    dataset.train_test_split(test_size=0.2, seed=42) splits the dataset into 80% training and 20% evaluation data.

    torch.device("cuda") or torch.device("cpu") specifies the computation device for model training.
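To make the 80/20 split concrete, here is a minimal pure-Python sketch of a seeded, shuffled split. The helper name `train_test_split_indices` is hypothetical, and this is only an illustration of the idea, not how the `datasets` library implements `train_test_split`.

```python
import random


def train_test_split_indices(num_rows, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) for a seeded, shuffled split."""
    indices = list(range(num_rows))
    # A local RNG keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_rows * test_size))
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = train_test_split_indices(500)
print(len(train_idx), len(test_idx))  # 400 100
```

With 500 rows this yields the same 400/100 proportions shown in the example output, and calling the helper again with the same seed returns the same partition.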

**Example Output**

Mounted at /content/drive/
Using GPU: Tesla T4

--- Loading and Preprocessing Data ---
Dataset({
    features: ['column1', 'column2', 'column3'],
    num_rows: 400
})
Dataset({
    features: ['column1', 'column2', 'column3'],
    num_rows: 100
})


**Comment and Observation**

Based on my understanding, the code demonstrates a typical preprocessing workflow for NLP model training using Hugging Face. It ensures GPU acceleration if available, which significantly speeds up training. Moreover, converting the dataset into the Hugging Face format allows easy integration with a `Trainer` later on.
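The code cell also cleans every text column with a few regular expressions before the conversion. The sketch below mirrors those regexes in a standalone function so the effect can be seen on one string; the sample sentence is made up for illustration.

```python
import re


def clean_text(text):
    """Strip HTML tags, URLs, and unusual symbols, then normalize whitespace."""
    text = re.sub(r'<[^>]+>', '', text)                       # HTML tags
    text = re.sub(r'http\S+', '', text)                       # URLs
    text = re.sub(r'[^A-Za-z0-9.,;:?!\"()\-\s]', '', text)    # anything outside the whitelist
    text = re.sub(r'\s+', ' ', text)                          # collapse whitespace runs
    return text.strip()


sample = "<p>BERT  is  great!</p> See https://example.com for more."
print(clean_text(sample))  # BERT is great! See for more.
```

One side effect worth knowing: the character whitelist has no apostrophe, so a word like "don't" comes out as "dont". That can matter later when answers are matched against the context.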

In [None]:
!ls "/content/drive/My Drive/Data Collection/Dataset_QnA.xlsx"

ls: cannot access '/content/drive/My Drive/Data Collection/Dataset_QnA.xlsx': No such file or directory


In [None]:
import torch
import pandas as pd
import numpy as np
import re
import time
import random
import itertools
from datasets import Dataset
from google.colab import drive
from transformers import BertTokenizerFast, BertForQuestionAnswering
from transformers import TrainingArguments, Trainer
from transformers.data.data_collator import default_data_collator
from sklearn.metrics import accuracy_score, f1_score


drive.mount('/content/drive/', force_remount=True)

# df = pd.read_excel("/content/drive/My Drive/Data Collection (ITE Elective Course Lesson)/Dataset/Webscraped data_Modules_Question and Answering.xlsx")
# df = pd.read_excel("/content/drive/My Drive/Data Collection (ITE Elective Course Lesson)/Dataset/Webscraped data_Modules_Question and Answering_Revised.xlsx")
df = pd.read_excel("/content/drive/My Drive/Data Collection/Dataset_QnA.xlsx")

# Clean column names by stripping whitespace and newline characters
df.columns = [col.strip().replace('\n', '') for col in df.columns]

# Print column names to debug KeyError
print("DataFrame columns:", df.columns.tolist())

# Diagnostic prints before dropna
print(f"DataFrame shape before dropna: {df.shape}")
print("NaN counts before dropna (in specified subset columns):")
print(df[['Title', 'Context', 'Exam Question', 'Exam Answer', 'Quiz Question', 'Quiz Answer']].isnull().sum())

df.dropna(subset=['Title', 'Context', 'Exam Question', 'Exam Answer', 'Quiz Question', 'Quiz Answer'], inplace=True)

# Diagnostic print after dropna
print(f"DataFrame shape after dropna: {df.shape}")


for col in ['Context', 'Exam Question', 'Exam Answer', 'Quiz Question', 'Quiz Answer', 'Title']:
    df[col] = df[col].astype(str)


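def clean_text(text):
    text = re.sub(r'<[^>]+>', '', text)   # strip HTML tags
    text = re.sub(r'http\S+', '', text)   # strip URLs
    # Keep letters, digits, basic punctuation, and whitespace only
    text = re.sub(r'[^A-Za-z0-9.,;:?!\"()\-\s]', '', text)
    text = re.sub(r'\s+', ' ', text)      # collapse whitespace last so removals leave no gaps
    return text.strip()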


df['Context'] = df['Context'].apply(clean_text)
df['Exam Question'] = df['Exam Question'].apply(clean_text)
df['Exam Answer'] = df['Exam Answer'].apply(clean_text)
df['Quiz Question'] = df['Quiz Question'].apply(clean_text)
df['Quiz Answer'] = df['Quiz Answer'].apply(clean_text)

print(f"Cleaned dataset shape: {df.shape}")


if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU.")

print("\n--- Loading and Preprocessing Data ---")


dataset = Dataset.from_pandas(df)


dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_data = dataset["train"]
eval_data = dataset["test"]

print(train_data)
print(eval_data)

AttributeError: partially initialized module 'datasets' has no attribute 'utils' (most likely due to a circular import)

# **Bert-base-uncased**

**Function Description**

This code loads a pre-trained BERT model and tokenizer for question answering tasks, processes datasets by aligning answer positions within contexts, and tokenizes the data while mapping character-based answer spans to token spans suitable for model training. It also detects GPU availability to accelerate model training and inference.

**Input**

The input consists of the `train_data` and `eval_data` splits prepared in the previous step. Each row holds a `Context` together with `Quiz Question`, `Exam Question`, `Quiz Answer`, and `Exam Answer` fields, from which the character-level answer positions are derived.

**Output**

The output includes tokenized datasets (tokenized_train and tokenized_eval) that are formatted for PyTorch training, with added start and end position labels for answers. Additionally, the script confirms the successful loading of the model and the availability of GPU acceleration, printing relevant messages.

**Essential Syntaxes**

python
from transformers import BertTokenizerFast, BertForQuestionAnswering

    Imports the necessary classes for tokenization and model loading.

python
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME)

    Loads a pre-trained BERT tokenizer and model based on bert-base-uncased.

python
def add_answer_positions(example):
    ...
    return example

    Adds start and end answer positions within the context, matching answer text to context.

python
def tokenize_and_align(examples):
    ...
    return tokenized

    Tokenizes question and context, aligns answer spans with token indices, and prepares data for model training.

python
train_data.map(add_answer_positions)
eval_data.map(add_answer_positions)

train_data.map(tokenize_and_align, batched=True)
eval_data.map(tokenize_and_align, batched=True)

    Applies position setting and tokenization functions on datasets.

python
torch.device("cuda" if torch.cuda.is_available() else "cpu")

    Checks for GPU availability and assigns the device accordingly.

python
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)

    Loads the model to the appropriate device for training or inference.
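The answer-position step boils down to a case-insensitive substring search. The sketch below shows that step in isolation; `find_answer_span` is a hypothetical helper name and the sentence is made up for illustration.

```python
def find_answer_span(context, answer):
    """Return (start_char, end_char) of `answer` inside `context`, or None if absent."""
    start = context.lower().find(answer.lower())
    if start == -1:  # str.find signals "not found" with -1, never with 0
        return None
    return start, start + len(answer)


context = "BERT was introduced by Google in 2018."
span = find_answer_span(context, "google")
print(span, context[span[0]:span[1]])  # (23, 29) Google
```

Index 0 is a valid match position (an answer at the very start of the context), so "not found" has to be tested against -1. Lowercasing keeps character positions aligned for ASCII text, which is what the cleaning step leaves behind.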

**Example Output**

text
Loaded Pretrained QnA Model: bert-base-uncased
🧩 Model loaded successfully for QnA: bert-base-uncased

**Comment and Observation**

This code exemplifies a typical NLP pipeline for question answering with BERT. It carefully maps answer spans from character-level positions within the context to token indices, which is essential for model training. Importantly, it checks for GPU availability to optimize performance, making it suitable for large datasets and neural network fine-tuning tasks. The structured approach ensures correct data alignment and efficient model loading, facilitating smooth integration into a training loop later on.
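The character-to-token mapping described above can be sketched without a tokenizer by writing the offsets out by hand. The function name `char_span_to_token_span` and the toy encoding are assumptions for illustration; a real fast tokenizer supplies `offset_mapping` and `sequence_ids` itself.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token).

    offsets      -- (char_start, char_end) for every token
    sequence_ids -- None for special tokens, 0 for the question, 1 for the context
    Returns (0, 0), the [CLS] position, when the span is not fully covered.
    """
    token_start = token_end = None
    for idx, (tok_start, tok_end) in enumerate(offsets):
        if sequence_ids[idx] != 1:
            continue  # only context tokens can hold the answer
        if tok_start <= start_char < tok_end:
            token_start = idx
        if tok_start < end_char <= tok_end:
            token_end = idx
            break
    if token_start is None or token_end is None:
        return 0, 0
    return token_start, token_end


# Hand-written toy encoding of: [CLS] who made bert ? [SEP] google made bert [SEP]
sequence_ids = [None, 0, 0, 0, 0, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 3), (4, 8), (9, 13), (13, 14), (0, 0),
           (0, 6), (7, 11), (12, 16), (0, 0)]

print(char_span_to_token_span(offsets, sequence_ids, 0, 6))    # (6, 6)  -> "google"
print(char_span_to_token_span(offsets, sequence_ids, 7, 16))   # (7, 8)  -> "made bert"
```

Pointing uncovered spans at index 0 is a common convention for extractive QA, since it gives the model a consistent "no answer here" target when the answer was truncated away.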

In [None]:


MODEL_NAME = "bert-base-uncased"

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME)

print(f"\n Loaded Pretrained QnA Model: {MODEL_NAME}")


def add_answer_positions(example):
    """Attach character-level answer positions (within the context) to one example.

    Assumption: the quiz pair is tried first and the exam pair is the fallback;
    the chosen pair is stored as "Question"/"Answer" for tokenization and scoring.
    """
    context_lower = example["Context"].lower()
    pairs = [
        (example["Quiz Question"], example["Quiz Answer"]),
        (example["Exam Question"], example["Exam Answer"]),
    ]

    # Default to the quiz pair with a (0, 0) span when neither answer is found.
    example["Question"], example["Answer"] = pairs[0]
    example["start_positions"] = 0
    example["end_positions"] = 0

    for question, answer in pairs:
        answer_start = context_lower.find(answer.lower())  # -1 means "not found"
        if answer and answer_start != -1:
            example["Question"], example["Answer"] = question, answer
            example["start_positions"] = answer_start
            example["end_positions"] = answer_start + len(answer)
            break

    return example


def tokenize_and_align(examples):
    # Encode (question, context) pairs; only the context may be truncated.
    tokenized = tokenizer(
        examples["Question"],
        examples["Context"],
        truncation="only_second",
        padding="max_length",
        max_length=512,
        return_offsets_mapping=True
    )

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(tokenized["offset_mapping"]):
        sequence_ids = tokenized.sequence_ids(i)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        start_char = examples["start_positions"][i]
        end_char = examples["end_positions"][i]

        # Fall back to the [CLS] token (index 0) when the answer is missing
        # or was cut off by truncation.
        token_start_index = 0
        token_end_index = 0
        found_start = found_end = None

        if end_char > start_char:
            for idx in range(context_start, context_end + 1):
                if offsets[idx][0] <= start_char < offsets[idx][1]:
                    found_start = idx
                if offsets[idx][0] < end_char <= offsets[idx][1]:
                    found_end = idx
                    break

        if found_start is not None and found_end is not None:
            token_start_index, token_end_index = found_start, found_end

        start_positions.append(token_start_index)
        end_positions.append(token_end_index)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    tokenized.pop("offset_mapping")

    return tokenized



train_data = train_data.map(add_answer_positions)
eval_data = eval_data.map(add_answer_positions)


tokenized_train = train_data.map(tokenize_and_align, batched=True)
tokenized_eval = eval_data.map(tokenize_and_align, batched=True)

tokenized_train.set_format("torch", columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions'])
tokenized_eval.set_format("torch", columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions'])



device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)

print(f"\n Model loaded successfully for QnA: {MODEL_NAME}")

# **METRICS AND TRAINING SETUP**

The command below installs several Python packages needed for working with transformer models and hyperparameter optimization:

    transformers: Hugging Face's library for state-of-the-art transformer models like BERT, GPT, and more.

    datasets: Hugging Face's library for easily accessing and managing datasets.

    accelerate: A library to help scale PyTorch models across multiple GPUs or TPUs.

    ray[tune]: Ray Tune is a scalable hyperparameter tuning library built on Ray.

    optuna: A popular framework for automated hyperparameter optimization.

Optuna and Ray Tune can both be used as backends for hyperparameter search with the Transformers `Trainer`. Installing these packages prepares the environment for those workflows, so the model's training parameters can be tuned automatically and systematically.
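After installing, it can help to confirm which versions actually ended up in the environment. The following is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions(["transformers", "datasets", "accelerate", "ray", "optuna"]))
```

A `None` entry means the package is not installed in the current runtime. In Colab, a runtime restart may be needed before upgraded packages are picked up.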

In [None]:
!pip install transformers datasets accelerate ray[tune] optuna -U



**Function Description**

This code sets seeds for all relevant libraries to make PyTorch experiments reproducible. It also defines two evaluation metrics for question answering, Exact Match (EM) and F1 Score, then builds a Hugging Face Trainer with arguments chosen for BERT-style extractive QA training and validation. These steps keep QA experiments stable and comparable across runs.

**Input**

    Seed value (here, 42), set across Python, numpy, PyTorch CPU and GPU.

    Model predictions and gold answers for each evaluation batch.

    Training datasets, pretrained model, and tokenizer.

**Output**

    Reproducible training and validation runs.

    Dictionary of computed evaluation metrics (averaged EM, F1, and inference time) after each evaluation phase.

    A fully configured Trainer object for orchestrated fine-tuning on QA tasks.

**Essential Syntaxes**

python
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Ensures all sources of randomness are controlled, making repeated experiments yield consistent results.

python
def compute_exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

Computes an all-or-nothing score: 1 only if predicted span matches gold answer exactly, after stripping and lowercasing.

python
def compute_f1(prediction, truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    ...
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

Calculates the token-level overlap F1 between prediction and gold answer. Captures partial matches as well as perfect ones.

python
def compute_metrics(eval_pred):
    ...
    return metrics

Processes batched model predictions, decodes answer spans, computes EM and F1 across all examples, and averages them.

python
training_args = TrainingArguments(...)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

Sets up training parameters (epochs, batch sizes, device, logging, saving) and bundles everything in a Trainer for easy use.
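A short worked example may help show how the two metrics differ. This sketch uses a multiset (`Counter`) overlap for F1, which is one common way to count repeated tokens; the function names and the sample strings are chosen here for illustration.

```python
from collections import Counter


def exact_match(prediction, truth):
    """1 if the strings match after stripping and lowercasing, else 0."""
    return int(prediction.strip().lower() == truth.strip().lower())


def token_f1(prediction, truth):
    """Token-level F1 based on the multiset overlap of whitespace tokens."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower", "Eiffel Tower"))         # 0
print(round(token_f1("the Eiffel Tower", "Eiffel Tower"), 2))  # 0.8
```

Here the prediction has one extra token, so precision is 2/3 and recall is 1, giving F1 = 0.8 while Exact Match is 0. This is why F1 is usually reported alongside EM for extractive QA.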

**Example Output**

text
{'Exact_Match': 0.76, 'F1_Score': 0.81, 'Avg_Inference_Time': 0.0052}

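(A dictionary summarizing answer quality and timing, printed after evaluation. `Avg_Inference_Time` is measured inside `compute_metrics`, so it covers span decoding and scoring rather than the model's forward pass.)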

**Comment and Observation**

Setting all seeds removes most sources of nondeterminism, so results do not hinge on "lucky runs," which matters when comparing QA experiments.
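The effect of seeding can be seen with the standard library alone. This is a toy sketch (the helper `sample_run` is hypothetical), standing in for the much larger set of random draws made during training.

```python
import random


def sample_run(seed, n=5):
    """Draw n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


print(sample_run(42) == sample_run(42))  # True: same seed, same sequence
print(sample_run(42) == sample_run(43))  # a different seed gives a different stream
```

The same principle applies to NumPy and PyTorch generators, although some GPU operations can remain nondeterministic even with fixed seeds.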

In [None]:

seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)


torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False


def compute_exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

from collections import Counter

def compute_f1(prediction, truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()

    # Multiset overlap, so repeated tokens are counted correctly
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_common == 0:
        return 0.0

    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1



def compute_metrics(eval_pred):
    start_time = time.time()

    predictions, labels = eval_pred

    start_logits, end_logits = predictions


    start_positions = np.argmax(start_logits, axis=1)
    end_positions = np.argmax(end_logits, axis=1)

    exact_matches = []
    f1_scores = []


    for i in range(len(start_positions)):
        input_ids = tokenized_eval[i]["input_ids"]
        pred_tokens = input_ids[start_positions[i]: end_positions[i] + 1]
        pred_text = tokenizer.decode(pred_tokens, skip_special_tokens=True)

        gold_text = eval_data[i]["Answer"]

        exact_matches.append(compute_exact_match(pred_text, gold_text))
        f1_scores.append(compute_f1(pred_text, gold_text))

    avg_inference_time = (time.time() - start_time) / len(start_positions)

    metrics = {
        "Exact_Match": np.mean(exact_matches),
        "F1_Score": np.mean(f1_scores),
        "Avg_Inference_Time": avg_inference_time
    }

    return metrics


training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/results",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/content/drive/My Drive/logs",
    learning_rate=4e-5,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[],
    seed=42,
    data_seed=42
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

## **With Grid Search**

**Function Description**

This code sets up a reproducible environment for training a BERT-based question answering model, defines evaluation metrics (Exact Match and F1), sets hyperparameters and training arguments for fine-tuning, and runs an automated hyperparameter search over a small categorical grid (3 learning rates × 2 batch sizes × 3 epoch counts = 18 combinations) using Optuna integrated with Hugging Face's Trainer. After the search, it prints the best hyperparameters and the results of all trials for detailed analysis.

**Input**

    Tokenized train and evaluation datasets (tokenized_train and tokenized_eval) with questions, contexts, and answers.

    Pretrained model and tokenizer for BERT-based question answering.

    Defined hyperparameter search space including learning rate, batch size, and number of epochs.

    Utility functions for computing Exact Match and F1 evaluation metrics.

    A seed value (42) to ensure deterministic and reproducible training runs.

**Output**

    Metric results for each hyperparameter trial during grid search.

    The best hyperparameter configuration found according to the evaluation (maximizing F1 Score).

    A DataFrame printed to the console with all trial results sorted by performance metric.

    Messages indicating the progress and results of the hyperparameter search.

**Essential Syntaxes**

python
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

    Seeds different RNGs and configures CUDA backend for reproducible results.

python
def compute_exact_match(prediction, truth):
    ...
def compute_f1(prediction, truth):
    ...
def compute_metrics(eval_pred):
    ...

    Define evaluation metric functions that compute Exact Match, F1 score, and average inference time from model predictions and ground truth answers.

python
def tune_hp(trial):
    learning_rate = trial.suggest_categorical("learning_rate", [5e-5, 3e-5, 1e-5])
    per_device_train_batch_size = trial.suggest_categorical("per_device_train_batch_size", [8, 16])
    num_train_epochs = trial.suggest_categorical("num_train_epochs", [3, 4, 5])
    return {...}

    Hyperparameter search space definition for Optuna trials.

python
def model_init():
    return BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)

    Function to reinitialize the model freshly for each hyperparameter trial to avoid weight contamination.

python
grid_trainer = Trainer(
    model_init=model_init,
    args=grid_training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

    Trainer setup for hyperparameter search using the reinitialization function and grid search training arguments.

python
best_trial = grid_trainer.hyperparameter_search(
    backend="optuna",
    hp_space=tune_hp,
    direction="maximize",
    n_trials=18,
)

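    Runs an Optuna-backed hyperparameter search over 18 trials with direction="maximize". Unless a compute_objective is supplied, the Trainer falls back to its default objective, which sums the reported evaluation metrics rather than using F1 alone.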

python
df_results = pd.DataFrame(trial_results)
df_results.sort_values("metric_value", ascending=False, inplace=True)
print(df_results)

    Converts trial results to a DataFrame, sorts them by performance, and prints them for more interpretable analysis.

**Example Output**

text
--- Starting Grid Search ---
[I 2025-11-08 21:15:30,000] Trial 0 finished with value: 0.76
...
--- Grid Search Complete ---
BEST HYPERPARAMETERS FOUND:
{'learning_rate': 3e-5, 'per_device_train_batch_size': 8, 'num_train_epochs': 5}

Grid Search Trial Results (sorted by metric):
   learning_rate  per_device_train_batch_size  num_train_epochs  metric_value
1        3e-05                           8                5          0.78
0        5e-05                          16                4          0.76
...

Best hyperparameters detail:
learning_rate                3e-05
per_device_train_batch_size      8
num_train_epochs                5
metric_value                 0.78
Name: 1, dtype: object

**Comment and Observation**

This code combines reproducibility practices with automatic hyperparameter tuning for transformer QA modeling, using Optuna's search through the Hugging Face Trainer. Re-initializing the model for each trial keeps weights from one trial from leaking into the next, so every hyperparameter set is evaluated from the same starting point.
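To see why 18 trials correspond to one full pass over the grid, the combinations can be enumerated directly. This sketch uses `itertools.product` on the same three value lists as the search space; the variable names are chosen here for illustration.

```python
import itertools

search_space = {
    "learning_rate": [5e-5, 3e-5, 1e-5],
    "per_device_train_batch_size": [8, 16],
    "num_train_epochs": [3, 4, 5],
}

names = list(search_space)
# One dict per combination: the Cartesian product of all value lists.
grid = [dict(zip(names, values))
        for values in itertools.product(*search_space.values())]

print(len(grid))  # 18
print(grid[0])    # {'learning_rate': 5e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 3}
```

An exhaustive grid visits each of these 18 dictionaries exactly once. A sampling-based search with 18 trials, by contrast, may repeat some combinations and skip others.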

In [None]:
import random
import numpy as np
import pandas as pd
import torch
import time
from transformers import TrainingArguments, Trainer, set_seed

import optuna

seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)


torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

def compute_exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

from collections import Counter

def compute_f1(prediction, truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    # Multiset overlap, so repeated tokens are counted correctly
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

def compute_metrics(eval_pred):
    start_time = time.time()
    predictions, labels = eval_pred
    start_logits, end_logits = predictions
    start_positions = np.argmax(start_logits, axis=1)
    end_positions = np.argmax(end_logits, axis=1)
    exact_matches = []
    f1_scores = []
    for i in range(len(start_positions)):
        input_ids = tokenized_eval[i]["input_ids"]
        pred_tokens = input_ids[start_positions[i]: end_positions[i] + 1]
        pred_text = tokenizer.decode(pred_tokens, skip_special_tokens=True)
        gold_text = eval_data[i]["Answer"]
        exact_matches.append(compute_exact_match(pred_text, gold_text))
        f1_scores.append(compute_f1(pred_text, gold_text))
    avg_inference_time = (time.time() - start_time) / len(start_positions)
    metrics = {
        "Exact_Match": np.mean(exact_matches),
        "F1_Score": np.mean(f1_scores),
        "Avg_Inference_Time": avg_inference_time
    }
    return metrics

# Define training arguments (hyperparameters)
training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/results", # Changed to Google Drive
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/content/drive/My Drive/logs", # Changed to Google Drive
    learning_rate=4e-5,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[],
    seed=42,
    data_seed=42
)


def tune_hp(trial):
    learning_rate = trial.suggest_categorical("learning_rate", [5e-5, 3e-5, 1e-5])
    per_device_train_batch_size = trial.suggest_categorical("per_device_train_batch_size", [8, 16])
    num_train_epochs = trial.suggest_categorical("num_train_epochs", [3, 4, 5])  # New hyperparameter for epochs
    return {
        "learning_rate": learning_rate,
        "per_device_train_batch_size": per_device_train_batch_size,
        "num_train_epochs": num_train_epochs,
    }

grid_training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/grid_search_results",
    eval_strategy="epoch",
    save_strategy="no",
    load_best_model_at_end=False,
    metric_for_best_model="eval_F1_Score",
    fp16=torch.cuda.is_available(),
    report_to="none",
    num_train_epochs=3,
    warmup_steps=500,
    logging_dir="/content/drive/My Drive/grid_search_logs",
)

def model_init():

    return BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)

grid_trainer = Trainer(
    model_init=model_init,
    args=grid_training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

print("\n--- Starting Grid Search ---")
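# Optuna's default sampler is not exhaustive, so request a real grid explicitly.
search_space = {
    "learning_rate": [5e-5, 3e-5, 1e-5],
    "per_device_train_batch_size": [8, 16],
    "num_train_epochs": [3, 4, 5],
}
# Persist the study so per-trial results can be reloaded after the search
# (the study and file names below are arbitrary).
study_name = f"bert_qa_grid_{int(time.time())}"
storage = "sqlite:///bert_qa_grid.db"

best_trial = grid_trainer.hyperparameter_search(
    backend="optuna",
    hp_space=tune_hp,
    compute_objective=lambda metrics: metrics["eval_F1_Score"],  # maximize F1 only
    direction="maximize",
    n_trials=18,
    # Extra keyword arguments are forwarded to optuna.create_study
    sampler=optuna.samplers.GridSampler(search_space),
    study_name=study_name,
    storage=storage,
)

if best_trial:
    print("\n--- Grid Search Complete ---")
    print("BEST HYPERPARAMETERS FOUND:")
    print(best_trial)
    best_hps = best_trial.hyperparameters
    print("\nBest Hyperparameters:")
    for key, val in best_hps.items():
        print(f"  {key}: {val}")
else:
    print("Search failed or no best trial found.")

print("\nYou can now initialize TrainingArguments with best_hps for final training.")


# The Trainer resets hp_search_backend once the search ends, so reload the study by name.
study = optuna.load_study(study_name=study_name, storage=storage)
if study.trials:
    all_trials = study.get_trials()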


    trial_results = []
    for trial in all_trials:
        vals = trial.params.copy()

        vals['metric_value'] = trial.value
        trial_results.append(vals)

    df_results = pd.DataFrame(trial_results)


    df_results.sort_values("metric_value", ascending=False, inplace=True)

    print("\nGrid Search Trial Results (sorted by metric):")
    print(df_results)


    print("\nBest hyperparameters detail:")
    print(df_results.iloc[0])
else:
    print("\nCould not retrieve trial results from the study.")

## **With Random Search**

**Function Description**

This code performs a random hyperparameter search for fine-tuning a BERT question answering model using Hugging Face's Trainer API. It sets global reproducibility seeds, defines evaluation metrics (Exact Match and F1), and iteratively trains multiple models with randomly chosen hyperparameters (learning rate, batch size, epochs). Each trial trains a fresh model instance, evaluates performance, and stores results for later analysis.

**Input**

    Tokenized training and evaluation datasets (tokenized_train and tokenized_eval).

    Pretrained BERT question answering model and its corresponding tokenizer.

    Defined search space for three hyperparameters: learning rate, batch size, and number of training epochs.

    Seed value for deterministic behavior and reproducibility.

**Output**

    Training and evaluation metrics (Exact Match, F1 Score, average inference time) for each hyperparameter trial.

    Training time per trial.

    A sorted pandas DataFrame showing results of all trials ranked by F1 score.

    Display of the best-performing hyperparameter combination identified from the random search.

**Essential Syntaxes**

python
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

    Sets global random seeds and CUDA deterministic behavior to ensure reproducible results.

python
def compute_metrics(eval_pred):
    ...
    return metrics

    Defines the evaluation function computing Exact Match, F1 Score, and average inference time by decoding predicted token spans back to text.

python
args = copy.deepcopy(base_training_args)
args.per_device_train_batch_size = batch_size
args.learning_rate = lr
args.num_train_epochs = epochs
args.seed = seed_value + trial_num

    Creates an independent set of training arguments for each trial with randomized hyperparameters and a unique seed.

python
trainer = Trainer(
    model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

    Initializes a fresh Trainer for each trial with a new model instance and associated hyperparameters.

python
trainer.train()
metrics = trainer.evaluate()

    Runs training and evaluation phases for each trial.

python
df_results = pd.DataFrame(results)
df_results.sort_values("eval_f1_score", ascending=False, inplace=True)

    Aggregates trial results into a DataFrame and sorts by evaluation F1 score to identify the best trial.
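The evaluation function picks the start and end tokens independently with `argmax`, which can produce an end index that lies before the start index and therefore an empty prediction. A common alternative, sketched below as an assumption rather than as what this notebook does, scores every valid (start, end) pair jointly. The helper name `best_span` and the logits are made up for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit, with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


start_logits = [0.1, 0.2, 3.0, 0.3]
end_logits = [0.1, 2.5, 0.2, 2.0]

# Independent argmax would give start=2, end=1, which is an empty span.
print(best_span(start_logits, end_logits))  # (2, 3)
```

The double loop is quadratic in sequence length. For 512-token inputs it is usually restricted to the top few start and end candidates, which this sketch leaves out for brevity.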

**Example Output**

text
Trial 4: Learning Rate=3e-05, Batch Size=16, Epochs=5
...
Random Search Results Sorted by F1 Score:
   trial  learning_rate  batch_size  epochs  train_time_sec  eval_exact_match  eval_f1_score  eval_avg_inference_time
3      4        3e-05          16       5         1189.32              0.75           0.80                  0.0051
...
Best Hyperparameters:
trial                  4
learning_rate       3e-05
batch_size            16
epochs                 5
train_time_sec     1189.32
eval_exact_match    0.75
eval_f1_score       0.80
eval_avg_inference_time 0.0051
Name: 3, dtype: object

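**Comment and Observation**

This implementation follows a straightforward random search: each trial draws its hyperparameters with `random.choice`, starts from a freshly loaded pretrained model, and uses its own seed. Because sampling is with replacement, the same combination can be drawn more than once across the 18 trials.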
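The sampling step on its own can be sketched as follows, to check how many distinct configurations a given seed actually produces. The helper `sample_trials` is hypothetical and uses a local generator instead of the global `random` state.

```python
import random


def sample_trials(num_trials, seed=42):
    """Draw (learning_rate, batch_size, epochs) tuples independently per trial."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        trials.append((
            rng.choice([5e-5, 3e-5, 1e-5]),
            rng.choice([8, 16]),
            rng.choice([3, 4, 5]),
        ))
    return trials


trials = sample_trials(18)
print(f"{len(set(trials))} distinct configurations out of {len(trials)} trials")
```

If the count of distinct configurations is well below 18, part of the training budget is being spent on repeats. Skipping already-seen tuples is one simple way to avoid that.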

In [None]:
import random
import numpy as np
import pandas as pd
import torch
import time
from transformers import TrainingArguments, Trainer, set_seed
from transformers.data.data_collator import default_data_collator
from transformers import BertForQuestionAnswering  # Make sure you import your model
import optuna
import copy

seed_value = 42
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)


torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

def compute_exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

from collections import Counter

def compute_f1(prediction, truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    # Multiset overlap, so repeated tokens are counted correctly
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

def compute_metrics(eval_pred):
    start_time = time.time()
    predictions, labels = eval_pred
    start_logits, end_logits = predictions
    start_positions = np.argmax(start_logits, axis=1)
    end_positions = np.argmax(end_logits, axis=1)
    exact_matches = []
    f1_scores = []
    for i in range(len(start_positions)):
        input_ids = tokenized_eval[i]["input_ids"]
        pred_tokens = input_ids[start_positions[i]: end_positions[i] + 1]
        pred_text = tokenizer.decode(pred_tokens, skip_special_tokens=True)
        gold_text = eval_data[i]["Answer"]
        exact_matches.append(compute_exact_match(pred_text, gold_text))
        f1_scores.append(compute_f1(pred_text, gold_text))
    avg_inference_time = (time.time() - start_time) / len(start_positions)
    metrics = {
        "Exact_Match": np.mean(exact_matches),
        "F1_Score": np.mean(f1_scores),
        "Avg_Inference_Time": avg_inference_time
    }
    return metrics


base_training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/random_search_results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/content/drive/My Drive/random_search_logs",
    learning_rate=4e-5,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="no",
    load_best_model_at_end=False,
    fp16=torch.cuda.is_available(),
    report_to=[],

    seed=42,
    data_seed=42,
)


random.seed(seed_value)

num_trials = 18
results = []

for trial_num in range(1, num_trials + 1):

    lr = random.choice([5e-5, 3e-5, 1e-5])
    batch_size = random.choice([8, 16])
    epochs = random.choice([3, 4, 5])

    print(f"\nTrial {trial_num}: Learning Rate={lr}, Batch Size={batch_size}, Epochs={epochs}")


    args = copy.deepcopy(base_training_args)
    args.per_device_train_batch_size = batch_size
    args.per_device_eval_batch_size = batch_size
    args.learning_rate = lr
    args.num_train_epochs = epochs
    args.output_dir = f"/content/drive/My Drive/random_search_results/trial_{trial_num}"

    args.seed = seed_value + trial_num
    args.data_seed = seed_value + trial_num


    trainer = Trainer(
        model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
        args=args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

    start_time = time.time()
    trainer.train()
    train_time = time.time() - start_time

    metrics = trainer.evaluate()

    results.append({
        "trial": trial_num,
        "learning_rate": lr,
        "batch_size": batch_size,
        "epochs": epochs,
        "train_time_sec": train_time,
        "eval_exact_match": metrics.get("eval_Exact_Match"),
        "eval_f1_score": metrics.get("eval_F1_Score"),
        "eval_avg_inference_time": metrics.get("eval_Avg_Inference_Time"),
    })


df_results = pd.DataFrame(results)
df_results.sort_values("eval_f1_score", ascending=False, inplace=True)

print("\nRandom Search Results Sorted by F1 Score:")
display(df_results)

print("\nBest Hyperparameters:")
display(df_results.iloc[0])
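
The search loop above draws each hyperparameter independently with random.choice, so the same combination can be sampled more than once. The search space has only 3 × 2 × 3 = 18 combinations, so with 18 trials some repeats are likely. Below is a minimal sketch of one alternative: sampling distinct combinations without replacement. The names `grid` and `rng` and the budget of 6 trials are assumptions for illustration, not part of the original notebook.

```python
import itertools
import random

# Same candidate values as in the search loop above.
learning_rates = [5e-5, 3e-5, 1e-5]
batch_sizes = [8, 16]
epoch_options = [3, 4, 5]

# Enumerate every combination once (3 * 2 * 3 = 18 entries).
grid = list(itertools.product(learning_rates, batch_sizes, epoch_options))

# A dedicated generator keeps the draw reproducible without touching global state.
rng = random.Random(42)
num_trials = 6  # hypothetical budget, smaller than the full grid

# random.sample draws without replacement, so no combination repeats.
trials = rng.sample(grid, k=num_trials)

for trial_num, (lr, batch_size, epochs) in enumerate(trials, start=1):
    print(f"Trial {trial_num}: Learning Rate={lr}, Batch Size={batch_size}, Epochs={epochs}")
```

With a budget equal to the grid size, this would reduce to a shuffled grid search, so a smaller budget is where random search saves time.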

# **Execution (Evaluation)**

## **The First Execution (From Previous Activity)**

**Function Description**

This code block manages the full run of fine-tuning a BERT-based question answering model. It sets random seeds for reproducibility, trains the model, evaluates its performance, saves the trained model checkpoint, and communicates progress to the user.

**Input**

    The trainer object: A Hugging Face Trainer instance preconfigured with model, datasets, hyperparameters, metrics, and tokenizer.

    Model saving path: './bert_qa_best' for checkpoint storage.

**Output**

    Console messages detailing the start of training, evaluation results, and completion status.

    Printed evaluation metrics such as F1 Score and Exact Match after training.

    Model checkpoint saved to disk for future inference tasks.

**Essential Syntaxes**

python
set_seed(42)

Ensures all random processes (Python, NumPy, Torch) are seeded, producing reliable and repeatable results.

python
trainer.train()

Starts model fine-tuning using configured settings on your training data.

python
trainer.evaluate()

Computes performance metrics on your validation or evaluation dataset.

python
trainer.save_model("./bert_qa_best")

Saves the trained model, tokenizer, and config to a specified directory for later use.

**Example Output**

--- Starting Fine-Tuning (Expected Time: 1–4 hours on GPU) ---

[Training progress output]

--- Final Evaluation Results ---
{'eval_loss': 0.95, 'eval_Exact_Match': 0.77, 'eval_F1_Score': 0.82, ...}

Fine-tuning process complete. The resulting model can now be used for Inference (Stage 4).

**Comment and Observation**

In this block, you see a typical workflow for preparing a BERT QA model: not just training, but also evaluating, saving, and documenting progress in clear stages.
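
To illustrate why the seed is set before training, here is a minimal sketch that uses only Python's standard `random` module as a stand-in. `set_seed` from `transformers` also seeds NumPy and PyTorch; the helper name `draw_numbers` is hypothetical.

```python
import random

def draw_numbers(seed, count=3):
    """Seed the generator, then draw `count` pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 100) for _ in range(count)]

first_run = draw_numbers(42)
second_run = draw_numbers(42)

# Re-seeding with the same value replays the same sequence.
print(first_run == second_run)
```

The same idea applies to weight initialization and data shuffling during fine-tuning: with the seed fixed, those random choices repeat from run to run.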

In [None]:
from transformers import set_seed


print("\n--- Starting Fine-Tuning (Expected Time: 1–4 hours on GPU) ---")

set_seed(42)


trainer.train()


print("\n--- Final Evaluation Results ---")


set_seed(42)

eval_results = trainer.evaluate()
print(eval_results)

# Save the best model checkpoint for later Inference
trainer.save_model("./bert_qa_best")

print("\n Fine-tuning process complete. The resulting model can now be used for Inference (Stage 4).")


**Function Description**

This code block kicks off and completes the fine-tuning process for your BERT-based question answering model using the best hyperparameters from your random search. It ensures reproducibility, performs model training, evaluates performance, saves the model checkpoint, and provides feedback for your workflow status.

**Input**

    An initialized trainer object (already set up with best random search hyperparameters, model, datasets, metrics, and tokenizer).

    A Google Drive save path for the best checkpoint.

**Output**

    Progress and status messages in the console about the training, evaluation, and model saving stages.

    Final evaluation metrics displayed (like F1 and Exact Match for your validation set).

    A saved model checkpoint in Google Drive, ready for use during inference (Stage 4).

**Essential Syntaxes**

python
set_seed(42)

    Seeds the Python, NumPy, and PyTorch random number generators so that results are reproducible across different runs.

python
trainer.train()

    Trains your BERT QA model on the training data using the best hyperparameters from your random search.

python
trainer.evaluate()

    Evaluates the trained model's performance on your held-out validation dataset.

python
trainer.save_model("/content/drive/My Drive/bert_qa_best_random_hps")

    Saves the complete, fine-tuned model checkpoint to Google Drive so you can load it later for question-answering tasks.

**Example Output**


--- Starting Fine-Tuning with Best Hyperparameters (Expected Time: 1–4 hours on GPU) ---

[284/284 02:38, Epoch 4/4]
Epoch 	Training Loss 	Validation Loss 	Exact Match 	F1 Score 	Avg Inference Time
1 	No log 	1.811164 	0.000000 	0.034671 	0.000789
2 	1.607500 	1.942523 	0.000000 	0.235041 	0.001137
3 	1.159300 	2.277214 	0.000000 	0.270486 	0.002345
4 	1.159300 	2.429585 	0.000000 	0.294740 	0.000993

**Comment and Observation**

This phase puts your best-found hyperparameters into practice, ensuring the training results you see are reliable and repeatable by setting the seed.
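
One detail worth checking in the log above: the run has 284 optimizer steps in total, while the TrainingArguments shown elsewhere in this notebook use warmup_steps=500. The configuration of this particular trainer is not shown here, so the sketch below assumes that the same warmup setting and a linear warmup-then-decay schedule apply. Under those assumptions, the learning rate would still be ramping up when training ends. The function name `linear_warmup_lr` and the peak learning rate are hypothetical.

```python
def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from peak_lr down to 0.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 284   # from the progress bar above
warmup_steps = 500  # assumed, as in the TrainingArguments in this notebook
peak_lr = 3e-5      # hypothetical peak learning rate

final_lr = linear_warmup_lr(total_steps, peak_lr, warmup_steps, total_steps)
print(f"LR at the last step: {final_lr:.3e} ({final_lr / peak_lr:.0%} of peak)")
```

If this applies, lowering warmup_steps to a fraction of the total step count may be worth trying on a dataset of this size.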

In [None]:
from transformers import set_seed

print("\n--- Starting Fine-Tuning with Best Hyperparameters (Expected Time: 1–4 hours on GPU) ---")


set_seed(42)

trainer.train()

print("\n--- Final Evaluation Results with Best Hyperparameters ---")


set_seed(42)

eval_results = trainer.evaluate()
print(eval_results)

trainer.save_model("/content/drive/My Drive/bert_qa_best_random_hps")

print("\n Fine-tuning process complete with best hyperparameters. The resulting model can now be used for Inference (Stage 4).")

## **Execution using the best hyperparameters found (Grid Search)**

**Function Description**

This code configures and initializes a Hugging Face Trainer using the best hyperparameters obtained from a prior grid search for fine-tuning a BERT question answering model. It sets up training arguments such as learning rate, batch size, number of epochs, and device preferences, then prepares the trainer with the selected datasets, metrics, and tokenizer for the final training run.

**Input**

    Best hyperparameters dictionary (best_hps) containing keys: num_train_epochs, per_device_train_batch_size, and learning_rate discovered from grid search.

    Pretrained BERT model instance (model).

    Tokenized training and evaluation datasets (tokenized_train and tokenized_eval).

    Predefined evaluation metric function (compute_metrics).

    Tokenizer for data collation.

    Default data collator to handle batch preparation.

**Output**

    An initialized Trainer object configured to train the model using the best hyperparameters.

    Printed confirmation output displaying the current training arguments, including directories, learning rate, batch size, and other settings.

**Essential Syntaxes**

python
best_training_args = TrainingArguments(
    output_dir=...,
    num_train_epochs=best_hps['num_train_epochs'],
    per_device_train_batch_size=best_hps['per_device_train_batch_size'],
    learning_rate=best_hps['learning_rate'],
    ...
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    seed=42,
    data_seed=42,
)

    Creates training argument instance using the best hyperparameters while maintaining consistent logging and saving configurations.

python
trainer = Trainer(
    model=model,
    args=best_training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

    Sets up the Hugging Face Trainer with the finalized arguments, enabling efficient training and evaluation.

python
print(best_training_args)

    Prints the training configuration to confirm settings prior to starting training.

**Example Output**


TrainingArguments(
  output_dir=/content/drive/My Drive/results_best_hps,
  num_train_epochs=5,
  per_device_train_batch_size=8,
  per_device_eval_batch_size=8,
  warmup_steps=500,
  weight_decay=0.01,
  learning_rate=3e-05,
  logging_dir=/content/drive/My Drive/logs_best_hps,
  save_strategy=epoch,
  eval_strategy=epoch,
  load_best_model_at_end=True,
  fp16=True,
  seed=42,
  data_seed=42,
)

**Comment and Observation**

This code completes the model fine-tuning pipeline by initializing the Trainer with carefully selected hyperparameters validated through grid search. Using load_best_model_at_end=True ensures that the best checkpoint saved during training will be loaded for final evaluations or deployment.
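
To make "best checkpoint" concrete: to my understanding, when `metric_for_best_model` is not set, the Trainer compares checkpoints by evaluation loss, which may not be the epoch with the highest F1. The sketch below shows how the two criteria can disagree. The log values and the helper `pick_best` are hypothetical and only illustrate the selection rule.

```python
# Hypothetical per-epoch evaluation logs (values are illustrative only).
epoch_logs = [
    {"epoch": 1, "eval_loss": 1.81, "eval_F1_Score": 0.03},
    {"epoch": 2, "eval_loss": 1.94, "eval_F1_Score": 0.24},
    {"epoch": 3, "eval_loss": 2.28, "eval_F1_Score": 0.27},
]

def pick_best(logs, metric, greater_is_better):
    """Return the log entry with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(logs, key=lambda log: log[metric])

best_by_loss = pick_best(epoch_logs, "eval_loss", greater_is_better=False)
best_by_f1 = pick_best(epoch_logs, "eval_F1_Score", greater_is_better=True)

print("Best epoch by loss:", best_by_loss["epoch"])
print("Best epoch by F1:", best_by_f1["epoch"])
```

If F1 should drive the choice, `metric_for_best_model` and `greater_is_better` in TrainingArguments are the settings to look at.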

In [None]:

best_training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/results_best_hps",
    num_train_epochs=best_hps['num_train_epochs'],
    per_device_train_batch_size=best_hps['per_device_train_batch_size'],
    per_device_eval_batch_size=best_hps['per_device_train_batch_size'],
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/content/drive/My Drive/logs_best_hps",
    learning_rate=best_hps['learning_rate'],
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[],
    seed=42,
    data_seed=42
)


trainer = Trainer(
    model=model,
    args=best_training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

print("\n--- Initialized Trainer with Best Hyperparameters ---")
print(best_training_args)

**Function Description**

This code block performs the final stages of BERT-based question answering model fine-tuning, including:

    Training the model using previously identified best hyperparameters.

    Evaluating the fine-tuned model's performance on the evaluation dataset.

    Saving the trained model checkpoint for future inference.

    Ensuring all random seeds are set for reproducibility.

**Input**

    The trainer object: configured with your best hyperparameters, model, datasets, tokenizer, and metric function.

    Google Drive path for saving the model checkpoint (as a string).

**Output**

    Console messages indicating the progress of training and evaluation.

    Printed final evaluation metrics (like F1 Score, Exact Match).

    The trained model saved in your specified Google Drive directory for Stage 4 (Inference).

**Essential Syntaxes**

python
set_seed(42)

    Seeds the Python, NumPy, and PyTorch (CPU and CUDA) random number generators to make training and evaluation reproducible.

python
trainer.train()

    Triggers model fine-tuning (training) using the best hyperparameters found during search.

python
trainer.evaluate()

    Runs evaluation on the held-out dataset to report final performance metrics.

python
trainer.save_model("/content/drive/My Drive/bert_qa_best_hps")

    Saves a complete checkpoint of your trained model (weights, config, and tokenizer) to Google Drive. You can reload this checkpoint later for QA inference.

**Example Output**

text
--- Starting Fine-Tuning with Best Hyperparameters (Expected Time: 1–4 hours on GPU) ---

[Training progress output]

--- Final Evaluation Results with Best Hyperparameters ---
{'eval_loss': 1.02, 'eval_Exact_Match': 0.76, 'eval_F1_Score': 0.81, ...}

Fine-tuning process complete with best hyperparameters. The resulting model can now be used for Inference (Stage 4).

**Comment and Observation**

This final phase puts everything together: your model is trained on the best settings, thoroughly evaluated, and safely saved for later use.

In [None]:
from transformers import set_seed


print("\n--- Starting Fine-Tuning with Best Hyperparameters (Expected Time: 1–4 hours on GPU) ---")


set_seed(42)

trainer.train()


print("\n--- Final Evaluation Results with Best Hyperparameters ---")


set_seed(42)

eval_results = trainer.evaluate()
print(eval_results)

trainer.save_model("/content/drive/My Drive/bert_qa_best_hps")

print("\n Fine-tuning process complete with best hyperparameters. The resulting model can now be used for Inference (Stage 4).")

## **Execution using the best hyperparameters found (Random Search)**

**Function Description**

This code block sets up a Hugging Face Trainer to fine-tune a BERT-based question answering model using the best hyperparameters found from a random search. It extracts the optimal combination of hyperparameters from your trial results and configures the training process to use these settings for a final training run.

**Input**

    best_hps_random_search: A dictionary containing the best hyperparameters (epochs, batch_size, learning_rate) identified during random search.

    Pretrained model and tokenizer for BERT QA.

    Tokenized training and evaluation datasets (tokenized_train, tokenized_eval).

    Previously defined metric function (compute_metrics).

**Output**

    An initialized Trainer object configured with the selected best hyperparameters, ready for fine-tuning and evaluation.

    Printed confirmation of the training arguments for transparency and tracking.

**Essential Syntaxes**

python
best_hps_random_search = df_results.iloc[0].to_dict()

    Grabs the best hyperparameter set as a Python dictionary from your sorted trial results DataFrame.

python
best_training_args_random_search = TrainingArguments(
    output_dir=...,  # folder for saving model outputs
    num_train_epochs=int(best_hps_random_search['epochs']),
    per_device_train_batch_size=int(best_hps_random_search['batch_size']),
    learning_rate=best_hps_random_search['learning_rate'],
    ...
)

    Initializes TrainingArguments with the values from your best random search trial.

python
trainer_random_search = Trainer(
    model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
    args=best_training_args_random_search,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

    Prepares a new Trainer object with all required settings for final training.
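
The selection step above can be sketched without pandas as "sort the trials by F1, take the first". This is a minimal illustration with made-up trial values; the variable names `ranked`, `best`, and `best_epochs` are assumptions.

```python
# Hypothetical trial results (illustrative values only).
results = [
    {"trial": 1, "learning_rate": 5e-5, "batch_size": 8, "epochs": 3, "eval_f1_score": 0.21},
    {"trial": 2, "learning_rate": 3e-5, "batch_size": 16, "epochs": 5, "eval_f1_score": 0.29},
    {"trial": 3, "learning_rate": 1e-5, "batch_size": 8, "epochs": 4, "eval_f1_score": 0.17},
]

# Highest F1 first, mirroring sort_values("eval_f1_score", ascending=False).
ranked = sorted(results, key=lambda r: r["eval_f1_score"], reverse=True)
best = ranked[0]

# Cast to int before handing the values to TrainingArguments.
best_epochs = int(best["epochs"])
best_batch_size = int(best["batch_size"])

print(best["trial"], best["learning_rate"], best_batch_size, best_epochs)
```

The int() casts matter more in the pandas version: a row taken with `iloc[0]` is a Series with a single dtype, so integer columns such as epochs can come back as floats (for example 5.0).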

**Example Output**

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


--- Initialized Trainer with Best Hyperparameters from Random Search ---
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=42,

**Comment and Observation**

This block ensures you are leveraging the most effective hyperparameter configuration discovered during random search. By saving outputs and logs to Google Drive, results and checkpoints are preserved for later validation or deployment.

In [None]:

best_hps_random_search = df_results.iloc[0].to_dict()


best_training_args_random_search = TrainingArguments(
    output_dir="/content/drive/My Drive/results_best_hps_random_search",
    num_train_epochs=int(best_hps_random_search['epochs']),
    per_device_train_batch_size=int(best_hps_random_search['batch_size']),
    per_device_eval_batch_size=int(best_hps_random_search['batch_size']),
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="/content/drive/My Drive/logs_best_hps_random_search",
    learning_rate=best_hps_random_search['learning_rate'],
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[],
    seed=42,
    data_seed=42
)


trainer_random_search = Trainer(
    model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
    args=best_training_args_random_search,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

print("\n--- Initialized Trainer with Best Hyperparameters from Random Search ---")
print(best_training_args_random_search)

**Function Description**

This code performs the final round of fine-tuning for your BERT-based QA model using the best hyperparameters discovered from random search. It takes care to set seeds for reproducibility, starts model training, evaluates your model, saves the best checkpoint, and prints friendly, readable progress updates throughout.

**Input**

    The trainer_random_search object: a Hugging Face Trainer, pre-loaded with the model, datasets, tokenizer, metrics, and best random search hyperparameters.

    Save path for the trained model checkpoint (here: /content/drive/My Drive/bert_qa_best_random_hps).

**Output**

    Console logs and printed messages showing the start, completion, and high-level metrics of fine-tuning and evaluation.

    Final evaluation dictionary (eval_results_random_search) summarizing performance (e.g., F1 Score, Exact Match).

    Saved model checkpoint for future inference.

**Essential Syntaxes**

python
set_seed(42)

Sets all sources of randomness in your environment. This is crucial for reliable, repeatable experiments.

python
trainer_random_search.train()

Begins fine-tuning using the best random search settings. This usually takes 1–4 hours with a GPU, so sit back and maybe grab a coffee while it runs!

python
trainer_random_search.evaluate()

Checks the model against your validation set and reports metrics, helping you measure how successful fine-tuning was.

python
trainer_random_search.save_model(...)

Stores your fully trained model on Google Drive, so you don't need to retrain later.

**Example Output**

text
--- Starting Fine-Tuning with Best Hyperparameters from Random Search (Expected Time: 1–4 hours on GPU) ---

[Training progress output]

--- Final Evaluation Results with Best Hyperparameters from Random Search ---
{'eval_loss': 0.98, 'eval_Exact_Match': 0.80, 'eval_F1_Score': 0.87, ...}

Fine-tuning process complete with best hyperparameters from random search. The resulting model can now be used for Inference (Stage 4).

**Comment and Observation**

This workflow is designed to be user-friendly and transparent: you get clear status updates before, during, and after training. This is just a test, Dr. Raga, but as you can see, this is the flow that we want to build: an essay-type Q&A.

In [None]:
from transformers import set_seed

print("\n--- Starting Fine-Tuning with Best Hyperparameters from Random Search (Expected Time: 1–4 hours on GPU) ---")


set_seed(42)


trainer_random_search.train()

print("\n--- Final Evaluation Results with Best Hyperparameters from Random Search ---")


set_seed(42)

eval_results_random_search = trainer_random_search.evaluate()
print(eval_results_random_search)

trainer_random_search.save_model("/content/drive/My Drive/bert_qa_best_random_hps")

print("\n Fine-tuning process complete with best hyperparameters from random search. The resulting model can now be used for Inference (Stage 4).")

## **Execution using all the best hyperparameters found (Random Search & Grid Search)**

In [None]:

best_hps = df_results.iloc[0].to_dict()  # Best random-search trial; keys used below: epochs, batch_size, learning_rate


best_training_args_with_best_hps = TrainingArguments(
    output_dir="/content/drive/My Drive/results_best_hps_random_search",
    num_train_epochs=int(best_hps['epochs']),
    per_device_train_batch_size=int(best_hps['batch_size']),
    per_device_eval_batch_size=int(best_hps['batch_size']),
    warmup_steps=500,   # Best warmup_steps found from my groupmates
    weight_decay=0.75, # Best weight decay found from my groupmates
    logging_dir="/content/drive/My Drive/logs_best_hps_random_search",
    learning_rate=best_hps['learning_rate'],
    logging_steps=50,
    eval_strategy="epoch", # Best eval_strategy found from my groupmates
    save_strategy="epoch", # Best save_strategy found from my groupmates
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[],
    seed=42,
    data_seed=42
)


trainer_with_best_hps = Trainer(
    model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
    args=best_training_args_with_best_hps,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

# Train the model
trainer_with_best_hps.train()

print("\n--- Trainer with Best Hyperparameters ---")
print(best_training_args_with_best_hps)

best_model_path = "/content/drive/My Drive/best_qa_model"
trainer_with_best_hps.save_model(best_model_path)
tokenizer.save_pretrained(best_model_path)
print(f"\nModel saved to {best_model_path}")

# **Actual Testing**

**Function Description**

This code creates an interactive question-answering session using a Hugging Face model pipeline. For each question from your DataFrame, it poses the question, gets your answer, compares it to the ground truth using exact match and F1 metrics, and also measures the semantic similarity using embeddings and cosine similarity. It uses the Hugging Face pipeline utility for simple and effective inference.

**Input**

    df: A DataFrame containing columns "Question", "Context", and "Answer" for each example.

    User input (your answer to each posed question).

**Output**

    Prints detailed evaluation for each test round: ground truth, model's predicted answer, your answer, Exact Match, F1 score, cosine similarity, and inference time.

**Essential Syntaxes**

python
qna_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device=device_num
)

    Sets up a question-answering pipeline using your fine-tuned model and tokenizer. Uses GPU if available, else CPU.

python
def get_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
    with torch.no_grad():
        outputs = model.bert(**inputs)
    return outputs.last_hidden_state[0][0].cpu().numpy()

    Computes an embedding for a given text by passing it through the base BERT encoder (model.bert) and taking the hidden state of the first token ([CLS]). Used to compare the semantic similarity of answers.

python
bert_result = qna_pipeline({"question": row['Question'], "context": row['Context']})

    Performs inference: predicts the answer span using the model.

python
exact = compute_exact_match(user_answer, row['Answer'])
f1 = compute_f1(user_answer, row['Answer'])

    Calculates the exact match and F1 metrics for your answer against the reference answer.

python
cos_sim = cosine_similarity([user_emb], [gt_emb])[0][0]

    Finds cosine similarity (semantic closeness) between your answer and the ground truth, using embeddings.
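
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. Below is a minimal pure-Python sketch of that computation; the notebook itself uses scikit-learn's `cosine_similarity`, and the function name `cosine_similarity_1d` and the toy vectors are assumptions for illustration.

```python
import math

def cosine_similarity_1d(a, b):
    """Cosine similarity between two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Undefined for zero vectors; treat as no similarity.
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_1d([1.0, 0.0], [0.0, 1.0])
partial = cosine_similarity_1d([1.0, 0.0], [1.0, 1.0])

print(f"{same_direction:.4f} {orthogonal:.4f} {partial:.4f}")
```

The score depends on direction, not magnitude, so scaling an embedding does not change it. The testing code later in this notebook treats a score of at least 0.75 as a correct answer.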

**Example Output**

text
Question: 1
What does GeeksforGeeks provide?
Please type your answer:
<user types>

Evaluation
Ground Truth Answer: resources for computer science
Model Predicted Answer: resources for computer science
Your Answer: resources for computer science
Exact Match: 1
F1 Score: 1.0000
Cosine Similarity: 1.0000
Inference Time (BERT QA): 0.1122 seconds

**Comment and Observation**

This approach is ideal for human-in-the-loop evaluation. You can check your own understanding, see how the model performs, and compare results both exactly (Exact Match, F1) and by semantics (cosine similarity of embeddings).
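
The two exact-comparison metrics can disagree on a partially correct answer: Exact Match is all-or-nothing, while F1 gives credit for overlapping words. The sketch below works through one such case. The helper names `exact_match` and `token_f1` are hypothetical, and the F1 here counts repeated tokens with a Counter, which is one common convention.

```python
from collections import Counter

def exact_match(prediction, truth):
    """1 if the strings match after trimming and lowercasing, else 0."""
    return int(prediction.strip().lower() == truth.strip().lower())

def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

truth = "resources for computer science"
answer = "resources for science"  # one word missing

em = exact_match(answer, truth)
f1 = token_f1(answer, truth)

# 3 shared tokens: precision = 3/3, recall = 3/4
print(f"Exact Match: {em}, F1 Score: {f1:.4f}")
```

This also explains how a training log can show an Exact Match of 0 alongside a nonzero F1 score.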

In [None]:
import random
import time
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline, BertForQuestionAnswering, BertTokenizer
import re # Import regex for placeholder F1 score calculation
from collections import Counter # Import Counter for F1 score calculation


# Load the fine-tuned model and tokenizer

best_model_path = "/content/drive/My Drive/best_qa_model"
model = BertForQuestionAnswering.from_pretrained(best_model_path)
tokenizer = BertTokenizer.from_pretrained(best_model_path)


# pipeline() takes a device index: 0 for the first GPU, -1 for CPU
device_num = 0 if torch.cuda.is_available() else -1
device = torch.device(f"cuda:{device_num}" if torch.cuda.is_available() else "cpu")
model.to(device) # Move the model to the selected device


correct_answer_treshold = 0.75 # Cosine similarity score to be considered a "correct" answer
mastery_threshold = 0.80 # Fraction of correct answers (e.g., 8 out of 10) needed to master a topic
max_questions_topic = 10 # Max Questions Per Topic


qna_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device=device_num
)

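def get_embedding(text, tokenizer, model):
    """Return the [CLS] token embedding of `text` from the model's BERT encoder as a NumPy vector."""
    model.eval()  # Ensure the model is in evaluation mode (disables dropout)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
    with torch.no_grad():
        outputs = model.bert(**inputs)
    # First sequence in the batch, first token ([CLS])
    return outputs.last_hidden_state[0][0].cpu().numpy()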
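

# A minimal sketch, assuming compute_exact_match() and compute_f1() follow the usual
# SQuAD-style definitions (lowercase, strip punctuation and articles, then compare tokens).
# The functions are only defined here if earlier cells have not already defined them,
# so an existing implementation is never overridden. `_normalize_answer` is a
# hypothetical helper name introduced for this sketch.
import re
import string
from collections import Counter


def _normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = str(text).lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


try:
    compute_exact_match, compute_f1  # reuse the earlier definitions if they exist
except NameError:
    def compute_exact_match(prediction, ground_truth):
        """Return 1 if both answers are identical after normalization, else 0."""
        return int(_normalize_answer(prediction) == _normalize_answer(ground_truth))

    def compute_f1(prediction, ground_truth):
        """Return the token-level F1 overlap between the two normalized answers."""
        pred_tokens = _normalize_answer(prediction).split()
        gt_tokens = _normalize_answer(ground_truth).split()
        if not pred_tokens or not gt_tokens:
            # Both empty counts as a match; one empty counts as a miss
            return float(pred_tokens == gt_tokens)
        num_common = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
        if num_common == 0:
            return 0.0
        precision = num_common / len(pred_tokens)
        recall = num_common / len(gt_tokens)
        return 2 * precision * recall / (precision + recall)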


def choose_QnA_mode():
    available_modes = ['Quiz', 'Exam']

    print("\nAvailable Modes:")
    for m in available_modes:
        print(f"- {m}")

    while True:
        # .title() normalizes capitalization so "quiz" and "QUIZ" both match
        user_input = input("\nType the mode you want to study: ").strip().title()

        if user_input in available_modes:
            print(f"\nYou selected QnA mode: {user_input}")
            return user_input
        else:
            print("Invalid mode. Please choose from 'Quiz' or 'Exam'.")


def choose_topic(df):
    available_topics = df['Title'].unique().tolist()

    print("\nAvailable Topics:")
    for t in available_topics:
        print(f"- {t}")

    while True:
        user_input = input("\nType the topic you want to study: ").strip().lower()

        # Exact match
        for topic in available_topics:
            if user_input == topic.lower():
                print(f"\nYou selected topic: {topic}")
                return topic

        # Partial match
        partial_matches = [topic for topic in available_topics if user_input in topic.lower()]

        if len(partial_matches) == 1:
            print(f"\nYou selected topic: {partial_matches[0]}")
            return partial_matches[0]

        elif len(partial_matches) > 1:
            print("\nDid you mean one of these?")
            for pm in partial_matches:
                print(f"- {pm}")
            print("Please type the full topic name.")

        else:
            print("\nTopic not found. Please try again.")
            print("Available topics include:")
            for t in available_topics:
                print(f"- {t}")


def interactive_question_answering(df, topic, asked_indices_set, question_answering_times, qna_mode):
    """
    df: full dataframe
    topic: chosen Title (string)
    asked_indices_set: a set() of df.index values already asked for this session,
                       used to avoid repeats.
    qna_mode: 'Quiz' or 'Exam'
    Returns: the index asked (or None) and whether the answer was correct
    """
    # Determine the question and answer columns based on the mode
    if qna_mode == 'Quiz':
        question_col = 'Quiz Question'
        answer_col = 'Quiz Answer'
    elif qna_mode == 'Exam':
        question_col = 'Exam Question'
        answer_col = 'Exam Answer'
    else:
        print("Invalid QnA Mode passed to function.")
        return None, False

    # Filter by topic
    topic_df = df[df['Title'] == topic].copy()

    # Keep only rows that have a non-empty question and an answer for the selected mode
    topic_df = topic_df.dropna(subset=[question_col, answer_col])
    topic_df = topic_df[topic_df[question_col].astype(str).str.strip() != '']

    if topic_df.empty:
        print(f"\nNo '{qna_mode}' questions are available for this topic.")
        return None, False

    # Get remaining rows (not asked yet)
    remaining = topic_df[~topic_df.index.isin(asked_indices_set)]

    if remaining.empty:
        print("\nYou've seen all available questions for this topic in this mode.")
        return None, False

    # Pick one random row (without replacement)
    row = remaining.sample(n=1).iloc[0]
    row_idx = row.name  # dataframe index

    # Read the question and answer from the mode-specific columns
    question_text = row[question_col]
    ground_truth_answer = row[answer_col]

    # Show question and collect answer
    print(f"\n--- Question {question_answering_times} ---")
    print(question_text)

    user_answer = input("\nPlease type your answer:\n")

    # Run the fine-tuned model on the question and time its inference
    start = time.time()
    bert_result = qna_pipeline({
        "question": question_text,
        "context": row.get('Context', '')
    })
    inference_time = time.time() - start

    # Compare the user's answer against the ground truth answer
    exact = compute_exact_match(user_answer, ground_truth_answer)
    f1 = compute_f1(user_answer, ground_truth_answer)

    user_emb = get_embedding(user_answer, tokenizer, model)
    gt_emb = get_embedding(ground_truth_answer, tokenizer, model)

    cos_sim = cosine_similarity([user_emb], [gt_emb])[0][0]

    print("\n--- Evaluation Per Question ---")
    print(f"Ground Truth Answer: {ground_truth_answer}")
    print(f"Model Predicted Answer: {bert_result.get('answer', '')}")
    print(f"Your Answer: {user_answer}")
    print(f"Exact Match: {exact}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Cosine Similarity: {cos_sim:.4f}")

    is_correct = cos_sim >= correct_answer_treshold
    if is_correct:
        print("Result: Correct! Your answer is semantically similar to the expected one.")
    else:
        print("Result: Incorrect. Needs improvement. Your answer differs from the expected one.")
    print(f"Inference Time: {inference_time:.4f} seconds\n")

    return row_idx, is_correct


def study_session(df):
    question_times = 0 # Number of Questions

    while True: # Loop for the entire session, continues until user quits
        # Choose the mode first, then the topic
        qna_mode = choose_QnA_mode()
        topic = choose_topic(df)

        # Start of a new topic round
        asked_indices_this_topic = set()
        correct_answers_this_topic = 0
        questions_asked_this_topic = 0

        for i in range(max_questions_topic):
            question_times += 1

            # Ask one question in the selected mode
            idx, is_correct = interactive_question_answering(df, topic, asked_indices_this_topic, question_times, qna_mode)

            # Stop the round when no questions are left
            if idx is None:
                break

            asked_indices_this_topic.add(idx)
            questions_asked_this_topic += 1

            if is_correct:
                correct_answers_this_topic += 1

        # Topic Round-up and Mastery Check
        if questions_asked_this_topic > 0:
            score_percentage = correct_answers_this_topic / questions_asked_this_topic
            print(f"\n--- Topic Summary for '{topic}' ({qna_mode} Mode) ---")
            print(f"You answered {correct_answers_this_topic} out of {questions_asked_this_topic} questions correctly.")

            if score_percentage >= mastery_threshold:
                print("Congratulations! You have a good grasp of this topic.")
            else:
                print("You're making progress! A little more practice on this topic would be helpful.")

        # Ask the user whether to continue; anything other than "yes" or "y" ends the session
        another_round = input("\nWould you like to choose another topic or mode? (yes/no): ").strip().lower()
        if another_round not in ('yes', 'y'):
            break

    print("\nStudy session finished. Great work!")

# Assuming `df` is already loaded from your CSV file in a previous step
# For example:
# import pandas as pd
# df = pd.read_csv("path/to/your/file.csv")
# df.columns = [col.strip().replace('"', '').replace('\n', '') for col in df.columns] # Clean column names
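
# A minimal sketch (not part of the original workflow): before starting a session,
# it can help to confirm that `df` has the columns the functions above read.
# The names `REQUIRED_COLUMNS` and `missing_columns` are hypothetical; the column
# names themselves are the ones used in choose_topic() and
# interactive_question_answering(). 'Context' is read with a default there, but it is
# listed here on the assumption that the model needs a context to predict an answer.
REQUIRED_COLUMNS = ['Title', 'Context', 'Quiz Question', 'Quiz Answer', 'Exam Question', 'Exam Answer']

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]

# Example usage (uncomment once `df` is loaded):
# missing = missing_columns(df.columns)
# if missing:
#     raise KeyError(f"Dataset is missing expected columns: {missing}")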

study_session(df)
