# **Install Necessary Python**

**Function Description**

This code sets seeds for all relevant libraries to ensure full reproducibility in PyTorch experiments. It also defines evaluation metrics for question answering Exact Match (EM) and F1 Score then builds a Hugging Face Trainer with arguments perfectly suited for BERT-style extractive QA training and validation. These steps are essential for stable, trackable QA research and applications.

**Input**

    Seed value (here, 42), set across Python, numpy, PyTorch CPU and GPU.

    Model predictions and gold answers for each evaluation batch.

    Training datasets, pretrained model, and tokenizer.

**Output**

    Reproducible training and validation runs.

    Dictionary of computed evaluation metrics (averaged EM, F1, and inference time) after each evaluation phase.

    A fully configured Trainer object for orchestrated fine-tuning on QA tasks.

**Essential Syntaxes**



    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


Ensures all sources of randomness are controlled, making repeated experiments yield consistent results.


    def compute_exact_match(prediction, truth):
        return int(prediction.strip().lower() == truth.strip().lower())


Computes an all-or-nothing score: 1 only if predicted span matches gold answer exactly, after stripping and lowercasing.



    def compute_f1(prediction, truth):
        pred_tokens = prediction.lower().split()
        truth_tokens = truth.lower().split()
        ...
        f1 = 2 * (precision * recall) / (precision + recall)
        return f1


Calculates the token-level overlap F1 between prediction and gold answer. Captures partial matches as well as perfect ones.



def compute_metrics(eval_pred):
    ...
    return metrics



Processes batched model predictions, decodes answer spans, computes EM and F1 across all examples, and averages them.

python

    training_args = TrainingArguments(...)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )


Sets up training parameters (epochs, batch sizes, device, logging, saving) and bundles everything in a Trainer for easy use.

Example Output

text

{'Exact_Match': 0.76, 'F1_Score': 0.81, 'Avg_Inference_Time': 0.0052}

code Code

(A dictionary summarizing model accuracy and efficiency, printed after evaluation.)

**Comment and Observation**

Setting all seeds eliminates nearly all nondeterminism—so you're not chasing "lucky runs," which is essential in QA research.

In [2]:

!pip install -U transformers # Explicitly update transformers
!pip install -U sentence-transformers
!pip install -U sentence-transformers -q
!pip install transformers datasets accelerate ray[tune] optuna -U
!pip install --upgrade gradio -q
!pip install --upgrade transformers datasets accelerate sentence-transformers gradio -q



# **Setup and Data Preparation**

**Function Description**

This code block loads a CSV dataset from Google Drive (in this case `("/content/drive/My Drive/Data Collection/Dataset_QnA.xlsx")`, then processes it into a suitable format for `transformers` libraries. It also checks GPU availability to enable efficient computation.

**Input**

The input in this code block is the CSV file from Google Drive containing text data for model training

**Output**

The output in this code block is the confirmation message about GPU or CPU usage. Two datasets printed to the console: training data `(train_data)` and evaluation data `(eval_data)`.

**Essential Syntaxes**



    drive.mount('/content/drive/', force_remount=True) mounts Google Drive to access external files.

    pd.read_csv(path) loads the dataset into a pandas DataFrame.

    torch.cuda.is_available() checks whether a GPU is accessible.

    Dataset.from_pandas(df) converts a pandas DataFrame into a Hugging Face Dataset object.

    dataset.train_test_split(test_size=0.2, seed=42) splits the dataset into 80% training and 20% evaluation data.

    torch.device("cuda" or "cpu") specifies the computation device for model training.

**Example Output**

Mounted at /content/drive/
Using GPU: Tesla T4

--- Loading and Preprocessing Data ---
Dataset({
    features: ['column1', 'column2', 'column3'],
    num_rows: 400
})
Dataset({
    features: ['column1', 'column2', 'column3'],
    num_rows: 100
})


**Comment and Observation**

Based on my understanding, the code demonstrates a typical preprocessing workflow for NLP model training using Hugging Face. It ensures GPU acceleration if available, which significantly speeds up training. Moreover, converting the dataset into the Hugging Face format allows easy integration with a `Trainer` later on.

In [3]:
import torch
import pandas as pd
import numpy as np
import re
import time
from datasets import Dataset
from transformers import TrainingArguments, Trainer, AutoTokenizer, AutoModelForQuestionAnswering
from transformers.data.data_collator import default_data_collator
from google.colab import drive

from sentence_transformers import SentenceTransformer, util



df_original = pd.read_csv("/content/sample_data/Dataset_QnA_Expanded_Modified.csv")



df_original.columns = [col.strip().replace('\n', '') for col in df_original.columns]

def clean_text(text):
    text = str(text)
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^A-Za-z0-9.,;:?!\"()\-\s]', '', text)
    text = text.strip()
    return text

for col in ['Context', 'Exam Question', 'Exam Answer', 'Quiz Question', 'Quiz Answer', 'Title']:
  df_original[col] = df_original[col].apply(clean_text)

# Drop rows where there isn't at least one valid Q&A pair
df_original.dropna(subset=['Context', 'Title'], inplace=True)
print(f"Shape after dropping rows with no context: {df_original.shape}")




restructured_data = []

for index, row in df_original.iterrows():
    # Add the Quiz Q&A pair if it's valid
    if row['Quiz Question'] and row['Quiz Answer']:
        restructured_data.append({
            'context': row['Context'],
            'title': row['Title'],
            'question': row['Quiz Question'],
            'answer': row['Quiz Answer'],
            'mode': 'Quiz' # Keep track of the source
        })

    # Add the Exam Q&A pair if it's valid
    if row['Exam Question'] and row['Exam Answer']:
        restructured_data.append({
            'context': row['Context'],
            'title': row['Title'],
            'question': row['Exam Question'],
            'answer': row['Exam Answer'],
            'mode': 'Exam'
        })

# Create the new, clean DataFrame
df_clean = pd.DataFrame(restructured_data)

print("\n--- Data Restructuring Complete ---")
print(f"Original DataFrame had {len(df_original)} rows.")
print(f"New, restructured DataFrame has {len(df_clean)} rows (one per Q&A pair).")
print("\nExample of the new solid structure:")
print(df_clean.head())


# --- 4. Convert to Hugging Face Dataset and Split ---
full_dataset = Dataset.from_pandas(df_clean)

dataset_split = full_dataset.train_test_split(test_size=0.2, seed=42)
train_data = dataset_split["train"]
eval_data = dataset_split["test"]

print("\n--- Final Datasets Ready for Preprocessing ---")
print("Training Data:")
print(train_data)
print("\nEvaluation Data:")
print(eval_data)


# --- 5. Check for GPU ---
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"\nUsing GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("\nGPU not available, using CPU.")

Shape after dropping rows with no context: (5040, 7)

--- Data Restructuring Complete ---
Original DataFrame had 5040 rows.
New, restructured DataFrame has 10080 rows (one per Q&A pair).

Example of the new solid structure:
                                             context              title  \
0  Linear regression analysis is used to predict ...  Linear Regression   
1  Linear regression analysis is used to predict ...  Linear Regression   
2  This form of analysis estimates the coefficien...  Linear Regression   
3  This form of analysis estimates the coefficien...  Linear Regression   
4  You can perform linear regression in Microsoft...  Linear Regression   

                                            question  \
0  What is the primary purpose of linear regressi...   
1  What is linear regression analysis fundamental...   
2  How does linear regression mathematically mode...   
3  What mathematical process does linear regressi...   
4  What are the specific tools available for 

# **Bert-base-uncased (Final Model)**

**Function Description**

This code establishes a sophisticated pipeline for preparing Question Answering (QnA) data. It initializes a BERT model for the QnA task and a SentenceTransformer model for semantic similarity. The script attempts to align abstract answers to the context using exact string matching first; if that fails, it employs semantic search to find the most similar sentence within the context. It then uses a sliding window tokenization strategy to handle long documents, aligns token positions, filters out invalid examples, and prepares the final datasets for training.

**Input**

The input consists of raw train_data and eval_data objects (likely Hugging Face Datasets) containing "question", "context", and "answer" fields. These datasets represent the raw QnA pairs where the provided answer text might not be an exact substring of the context or the context might exceed the model's maximum input length.

**Output**

The output includes tokenized_train and tokenized_eval datasets that contain model-ready features (input_ids, attention_mask, start_positions, end_positions) formatted for a PyTorch Trainer. The script also outputs console logs indicating model loading status, the device being used (CPU/GPU), and statistics showing how many data samples were retained after semantic alignment and cleaning.

**Essential Syntaxes**


    tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
    similarity_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
    code Code

    
Loads the tokenizer for text processing and the SentenceTransformer model to generate embeddings for semantic search.

  
    scores = util.cos_sim(answer_emb, context_embs)[0]
    best_idx = torch.argmax(scores).item()
    code Code

    
Calculates cosine similarity between the answer embedding and context sentence embeddings to identify the best semantic match.

  


    tokenizer(..., stride=128, return_overflowing_tokens=True, ...)
    code Code

    
Tokenizes inputs using a sliding window approach, creating overlapping chunks to handle contexts longer than the model's maximum limit.

  
    while sequence_ids[idx] != 1: idx += 1
    context_start = idx
    code Code

    
Iterates through sequence IDs to identify the specific boundaries (start and end indices) of the context within the tokenized input.

  


    train_data_aligned = train_data.map(prepare_semantic_labels)
    train_data_clean = train_data_aligned.filter(lambda x: x['end_char'] != 0)


    
Applies the semantic alignment function to the dataset and filters out examples where no valid answer position could be determined.

  


    tokenized_train = train_data_clean.map(tokenize_and_align_sliding, batched=True, ...)

    
Applies the sliding window tokenization and position alignment logic to the cleaned dataset in batches.

  

**Example Output**

text
--- Loaded QnA Model: bert-base-uncased on cuda ---
Loading Semantic Similarity Model...
Semantic Model Loaded.

--- Phase 1: Aligning Data Semantically ---
Original Train Size: 5000
Cleaned Train Size: 4850

--- Phase 2: Tokenizing with Sliding Window ---
Preprocessing Complete. Ready for Trainer.

**Comment and Observation**

This code demonstrates an advanced preprocessing workflow that solves two major problems in QnA: abstract answers and long contexts. The use of SentenceTransformer serves as a smart fallback mechanism, allowing the system to recover training labels even when the answer text doesn't exactly match the context string. Additionally, the sliding window mechanism ensures that long documents are split into manageable chunks without losing information at the boundaries, making the pipeline highly robust for real-world document parsing.

In [4]:
import torch
import numpy as np
import re
import collections
from transformers import BertTokenizerFast, BertForQuestionAnswering
from sentence_transformers import SentenceTransformer, util

# --- 1. Load Models ---
MODEL_NAME = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

print(f"\n--- Loaded QnA Model: {MODEL_NAME} on {device} ---")
print("Loading Semantic Similarity Model...")
similarity_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
print("Semantic Model Loaded.")



def prepare_semantic_labels(example):
    context = example.get('context', "")
    abstract_answer = example.get('answer', "")

    if not context or not abstract_answer:
        example["start_char"] = 0
        example["end_char"] = 0
        return example

    # A. Try Exact Match First
    start_char_exact = context.find(abstract_answer)
    if start_char_exact != -1:
        example["start_char"] = start_char_exact
        example["end_char"] = start_char_exact + len(abstract_answer)
        return example

    # B. Semantic Search
    context_sentences = re.split(r'(?<=[.?!])\s+', context)

    # Encode answer and sentences
    all_texts = [abstract_answer] + context_sentences
    embeddings = similarity_model.encode(all_texts, convert_to_tensor=True, show_progress_bar=False)

    answer_emb = embeddings[0]
    context_embs = embeddings[1:]

    scores = util.cos_sim(answer_emb, context_embs)[0]
    best_idx = torch.argmax(scores).item()
    best_score = scores[best_idx].item()
    best_sentence = context_sentences[best_idx]

    if best_score > 0.60:
        start_char = context.find(best_sentence)
        if start_char != -1:
            example["start_char"] = start_char
            example["end_char"] = start_char + len(best_sentence)
            return example

    example["start_char"] = 0
    example["end_char"] = 0
    return example


# --- 3. Tokenization with Sliding Window ---
def tokenize_and_align_sliding(examples):
    tokenized_inputs = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = tokenized_inputs.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_inputs.pop("offset_mapping")

    tokenized_inputs["start_positions"] = []
    tokenized_inputs["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        start_char = examples["start_char"][sample_idx]
        end_char = examples["end_char"][sample_idx]

        sequence_ids = tokenized_inputs.sequence_ids(i)

        # Find context start/end
        idx = 0
        while sequence_ids[idx] != 1: idx += 1
        context_start = idx
        while sequence_ids[idx] == 1: idx += 1
        context_end = idx - 1

        if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
            tokenized_inputs["start_positions"].append(0)
            tokenized_inputs["end_positions"].append(0)
        else:
            idx = context_start
            while idx <= context_end and offsets[idx][0] <= start_char: idx += 1
            tokenized_inputs["start_positions"].append(idx - 1)

            idx = context_end
            while idx >= context_start and offsets[idx][1] >= end_char: idx -= 1
            tokenized_inputs["end_positions"].append(idx + 1)

    return tokenized_inputs


# --- 4. Apply Preprocessing ---
print("\n--- Phase 1: Aligning Data Semantically ---")
train_data_aligned = train_data.map(prepare_semantic_labels)
eval_data_aligned = eval_data.map(prepare_semantic_labels)

train_data_clean = train_data_aligned.filter(lambda x: x['end_char'] != 0)
eval_data_clean = eval_data_aligned.filter(lambda x: x['end_char'] != 0)

print(f"Original Train Size: {len(train_data)}")
print(f"Cleaned Train Size: {len(train_data_clean)}")

print("\n--- Phase 2: Tokenizing with Sliding Window ---")
tokenized_train = train_data_clean.map(tokenize_and_align_sliding, batched=True, remove_columns=train_data_clean.column_names)
tokenized_eval = eval_data_clean.map(tokenize_and_align_sliding, batched=True, remove_columns=eval_data_clean.column_names)




print("\nPreprocessing Complete. Ready for Trainer.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



--- Loaded QnA Model: bert-base-uncased on cuda ---
Loading Semantic Similarity Model...


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Semantic Model Loaded.

--- Phase 1: Aligning Data Semantically ---


Map:   0%|          | 0/8064 [00:00<?, ? examples/s]

Map:   0%|          | 0/2016 [00:00<?, ? examples/s]

Filter:   0%|          | 0/8064 [00:00<?, ? examples/s]

Filter:   0%|          | 0/2016 [00:00<?, ? examples/s]

Original Train Size: 8064
Cleaned Train Size: 6553

--- Phase 2: Tokenizing with Sliding Window ---


Map:   0%|          | 0/6553 [00:00<?, ? examples/s]

Map:   0%|          | 0/1639 [00:00<?, ? examples/s]


Preprocessing Complete. Ready for Trainer.


# **METRICS AND TRAINING SETUP**

This command installs several important Python packages that are essential for working with transformer models and hyperparameter optimization:

    transformers: Hugging Face's library for state-of-the-art transformer models like BERT, GPT, and more.

    datasets: Hugging Face's library for easily accessing and managing datasets.

    accelerate: A library to help scale PyTorch models across multiple GPUs or TPUs.

    ray[tune]: Ray Tune is a scalable hyperparameter tuning library built on Ray.

    optuna: A popular framework for automated hyperparameter optimization.

The purpose of this code is to search results show that Optuna and Ray Tune can be integrated with the Transformers Trainer to perform hyperparameter searches efficiently. Installing these packages sets up your environment for such advanced workflows, enabling you to automatically and systematically improve your transformer model’s training parameters.

**Function Description**

This code sets seeds for all relevant libraries to ensure full reproducibility in PyTorch experiments. It also defines evaluation metrics for question answering Exact Match (EM) and F1 Score then builds a Hugging Face Trainer with arguments perfectly suited for BERT-style extractive QA training and validation. These steps are essential for stable, trackable QA research and applications.

**Input**

    Seed value (here, 42), set across Python, numpy, PyTorch CPU and GPU.

    Model predictions and gold answers for each evaluation batch.

    Training datasets, pretrained model, and tokenizer.

**Output**

    Reproducible training and validation runs.

    Dictionary of computed evaluation metrics (averaged EM, F1, and inference time) after each evaluation phase.

    A fully configured Trainer object for orchestrated fine-tuning on QA tasks.

**Essential Syntaxes**

python
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Ensures all sources of randomness are controlled, making repeated experiments yield consistent results.

python
def compute_exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

Computes an all-or-nothing score: 1 only if predicted span matches gold answer exactly, after stripping and lowercasing.

python
def compute_f1(prediction, truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    ...
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

Calculates the token-level overlap F1 between prediction and gold answer. Captures partial matches as well as perfect ones.

python
def compute_metrics(eval_pred):
    ...
    return metrics

Processes batched model predictions, decodes answer spans, computes EM and F1 across all examples, and averages them.

python
training_args = TrainingArguments(...)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

Sets up training parameters (epochs, batch sizes, device, logging, saving) and bundles everything in a Trainer for easy use.

**Example Output**

text
{'Exact_Match': 0.76, 'F1_Score': 0.81, 'Avg_Inference_Time': 0.0052}

(A dictionary summarizing model accuracy and efficiency, printed after evaluation.)

**Comment and Observation**

Setting all seeds eliminates nearly all nondeterminism—so you're not chasing "lucky runs," which is essential in QA research.

In [5]:
import random
import numpy as np
import pandas as pd
import torch
import time
import optuna
import copy

MODEL_NAME = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
torch.cuda.manual_seed_all(seed_value)


torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False


def compute_exact_match(prediction, truth):
    return int(prediction.strip().lower() == truth.strip().lower())

def compute_f1(prediction, truth):
    pred_tokens = prediction.strip().lower().split()
    truth_tokens = truth.strip().lower().split()

    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    common_tokens = set(pred_tokens) & set(truth_tokens)

    if len(common_tokens) == 0:
        return 0

    precision = len(common_tokens) / len(pred_tokens)
    recall = len(common_tokens) / len(truth_tokens)

    return 2 * (precision * recall) / (precision + recall)


# --- 6. Main Evaluation Metric Function (Semantic + F1 + Exact) ---
def compute_metrics_combined(eval_pred):
    start_logits, end_logits = eval_pred.predictions
    start_preds = np.argmax(start_logits, axis=1)
    end_preds = np.argmax(end_logits, axis=1)

    semantic_scores = []
    exact_matches = []
    f1_scores = []

    for i in range(len(start_preds)):
        # Decode Prediction
        pred_ids = tokenized_eval[i]['input_ids'][start_preds[i]:end_preds[i]+1]
        pred_text = tokenizer.decode(pred_ids, skip_special_tokens=True).strip()

        # Decode Target
        gold_start = tokenized_eval[i]['start_positions']
        gold_end = tokenized_eval[i]['end_positions']
        gold_ids = tokenized_eval[i]['input_ids'][gold_start:gold_end+1]
        gold_text = tokenizer.decode(gold_ids, skip_special_tokens=True).strip()

        if not pred_text or not gold_text:
            # Penalize empty predictions
            semantic_scores.append(0.0)
            exact_matches.append(0)
            f1_scores.append(0.0)
            continue

        # 1. Semantic Similarity
        emb1 = similarity_model.encode(pred_text, convert_to_tensor=True)
        emb2 = similarity_model.encode(gold_text, convert_to_tensor=True)
        sem_score = util.cos_sim(emb1, emb2).item()
        semantic_scores.append(sem_score)

        # 2. Exact Match
        exact_matches.append(compute_exact_match(pred_text, gold_text))

        # 3. F1 Score
        f1_scores.append(compute_f1(pred_text, gold_text))

    return {
        "Semantic_Similarity": np.mean(semantic_scores),
        "F1_Score": np.mean(f1_scores),
        "Exact_Match": np.mean(exact_matches)
    }




Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# **Execution (Evaluation)**

## **Execution using the all best hyperparameters found (Random Search)**

**Function Description**

This code fine-tunes a BERT question answering model using a pre-defined set of optimal hyperparameters. It initializes a Hugging Face Trainer with specific training arguments (including learning rate, batch size, number of epochs, warmup steps, and weight decay), trains the model on the provided datasets, and then saves the best-performing model and its tokenizer.

**Input**

    tokenized_train and tokenized_eval: Tokenized training and evaluation datasets.

    MODEL_NAME: The name of the pre-trained BERT question answering model.

    device: The device (CPU or GPU) on which the model will be trained.

    compute_metrics_combined: A function to compute evaluation metrics (e.g., Exact Match and F1 Score).

    tokenizer: The tokenizer corresponding to the pre-trained BERT model.

**Output**

    A trained BERT question answering model saved to a specified path.

    The tokenizer saved to the same specified path.

    Printed TrainingArguments showing the hyperparameters used for the best model.

    Console output indicating the path where the model and tokenizer are saved.

**Essential Syntaxes**


        
    best_training_args_with_best_hps = TrainingArguments(
        output_dir="/content/drive/My Drive/results_best_combination_of_hps",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        warmup_steps=500,
        weight_decay=0.05,
        logging_dir="/content/drive/My Drive/logs_best_combination_of_hps",
        learning_rate=1e-5,
        logging_steps=50,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        fp16=torch.cuda.is_available(),
        report_to=[],
        seed=42,
        data_seed=42
    )

      

Defines the training configuration with specific hyperparameters and settings, including saving and logging strategies, and enabling mixed-precision training (fp16) if a GPU is available.
code Python

    
    trainer_with_best_hps = Trainer(
        model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
        args=best_training_args_with_best_hps,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        compute_metrics=compute_metrics_combined,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

  

Initializes the Trainer object with the pre-trained model, the defined training arguments, datasets, metric computation function, tokenizer, and data collator.
code Python

    
    trainer_with_best_hps.train()

  

Executes the training process based on the configured arguments, fine-tuning the BERT model on the training data and evaluating it periodically.


    
    best_model_path = "/content/drive/My Drive/best_qa_model"
    trainer_with_best_hps.save_model(best_model_path)
    tokenizer.save_pretrained(best_model_path)

      

Saves the fine-tuned model (which is the best model due to load_best_model_at_end=True in TrainingArguments) and its tokenizer to the specified directory.



**Example Output**

    
Epoch 1: Training Loss=X.YZ, Evaluation F1=0.78, Exact Match=0.73
Epoch 2: Training Loss=A.BC, Evaluation F1=0.80, Exact Match=0.75
Epoch 3: Training Loss=D.EF, Evaluation F1=0.79, Exact Match=0.74

--- Trainer with Best Hyperparameters ---
TrainingArguments(
    output_dir='/content/drive/My Drive/results_best_combination_of_hps',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.05,
    logging_dir='/content/drive/My Drive/logs_best_combination_of_hps',
    learning_rate=1e-5,
    logging_steps=50,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    fp16=True,
    report_to=[],
    seed=42,
    data_seed=42,
    ...
)

Model saved to /content/drive/My Drive/best_qa_model


12.6s
Use Arrow Up and Arrow Down to select a turn, Enter to jump to it, and Escape to return to the chat.


In [6]:
from transformers import TrainingArguments, Trainer, set_seed, BertTokenizerFast, BertForQuestionAnswering
from transformers.data.data_collator import default_data_collator



best_training_args_with_best_hps = TrainingArguments(
    output_dir="/content/drive/My Drive/results_best_combination_of_hps",

    # 1. INCREASE EPOCHS: Small datasets need more passes to converge.
    num_train_epochs=3,

    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,

    # 2. REDUCE REGULARIZATION: Allow the model to learn more freedom.
    weight_decay=0.01,

    logging_dir="/content/drive/My Drive/logs_best_combination_of_hps",
    learning_rate=2e-5,
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,

    # 3. CRITICAL: Optimize for your SCORE, not the LOSS.
    metric_for_best_model="Semantic_Similarity",
    greater_is_better=True,

    # 4. Save storage space
    save_total_limit=2,

    fp16=torch.cuda.is_available(),
    report_to=[],
    seed=42,
    data_seed=42
)


trainer_with_best_hps = Trainer(
    model=BertForQuestionAnswering.from_pretrained(MODEL_NAME).to(device),
    args=best_training_args_with_best_hps,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics_combined,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)


trainer_with_best_hps.train()

print("\n--- Trainer with Best Hyperparameters ---")
print(best_training_args_with_best_hps)

best_model_path = "/content/drive/My Drive/best_qa_model"
trainer_with_best_hps.save_model(best_model_path)
tokenizer.save_pretrained(best_model_path)
print(f"\nModel saved to {best_model_path}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer_with_best_hps = Trainer(


Epoch,Training Loss,Validation Loss,Semantic Similarity,F1 Score,Exact Match
1,0.444,0.371591,0.804958,0.771374,0.769605
2,0.2999,0.343728,0.868974,0.817407,0.805471
3,0.3389,0.382749,0.906425,0.843924,0.832219



--- Trainer with Best Hyperparameters ---
TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=42,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStra

# **Question and Answering Task**

Function Description

This Python script creates an interactive command-line study session, allowing users to practice questions on various topics in either "Quiz" or "Exam" mode. It leverages a fine-tuned BERT Question Answering (Q&A) model to process questions and a Sentence Transformer for semantic similarity-based answer evaluation. The system guides the user through topic selection, presents questions, accepts free-form text answers, and provides instant, detailed feedback based on semantic alignment with ground truth. It tracks user performance per topic and offers mastery assessment.

**Input**

    df_original (Pandas DataFrame): A DataFrame containing structured educational content, including:

        'Title': Column for topic names.

        'Context': Explanatory text relevant to questions.

        'Quiz Question', 'Quiz Answer': Pairs for "Quiz" mode.

        'Exam Question', 'Exam Answer': Pairs for "Exam" mode.

    User Command-Line Inputs:

        Selection of study mode ('Quiz' or 'Exam').

        Selection of topic (typed by the user, supports partial matching).

        User's free-form text answers to presented questions.

        Responses to prompts for continuing with another topic or mode ('yes' or 'no').

    Pre-trained Model Path:

        best_model_path: String path to the directory containing the saved fine-tuned BertForQuestionAnswering model and its BertTokenizer.

**Output**

    Interactive Command-Line Dialogue: The script prints questions, prompts for user input, and displays evaluation results directly in the console.

    Per-Question Evaluation: For each answered question, it outputs:

        The Ground Truth Answer.

        The Model Predicted Answer (from the BERT Q&A pipeline).

        The Your Answer (user's input).

        Numerical metrics: Exact Match, F1 Score, and Cosine Similarity (between user's answer and ground truth embedding).

        A Result judgment: "Correct!" or "Incorrect."

        Inference Time for the BERT model.

    Per-Topic Summary: After exhausting questions for a topic (or reaching max_questions_topic), it presents:

        Number of correct answers out of questions asked.

        A mastery assessment message (e.g., "Congratulations! You have a good grasp..." or "You need to study again...").

    Session Control Prompts: Questions asking if the user wants to continue with another topic or mode.

    Final Message: "Study session finished. Great work!" upon exiting.

Essential Syntaxes


    
    model = BertForQuestionAnswering.from_pretrained(best_model_path)
    tokenizer = BertTokenizer.from_pretrained(best_model_path)
    qna_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer, device=device_num)

  

Loads the fine-tuned BERT model and tokenizer, then initializes a Hugging Face pipeline for efficient question answering inference. This setup prepares the AI "brain" for the study session.
code Python

      
    def get_embedding(text, tokenizer, model):
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
        with torch.no_grad():
            outputs = model.bert(**inputs)
        return outputs.last_hidden_state[0][0].cpu().numpy()

  

Defines a function to convert text into a numerical vector (embedding) using the base BERT model's last_hidden_state. This embedding captures the semantic meaning of the text, crucial for comparing user answers to ground truth.
code Python

    
    def choose_QnA_mode():
        # ... interactive input and validation loop ...
        return user_input

    def choose_topic(df):
        # ... interactive input, partial matching, and validation loop ...
        return topic

  

These functions handle interactive command-line input for selecting study mode and topic. They include robust loops for validation and provide helpful prompts, including partial topic matching to improve user experience.
code Python

    
    def interactive_question_answering(df, topic, asked_indices_set, question_answering_times, qna_mode):
        # ... filter DataFrame, select unasked question, present, get user input ...
        bert_result = qna_pipeline({"question": question_text, "context": row.get('Context', '')})
        user_emb = get_embedding(user_answer, tokenizer, model)
        gt_emb = get_embedding(ground_truth_answer, tokenizer, model)
        cos_sim = cosine_similarity([user_emb], [gt_emb])[0][0]
        is_correct = cos_sim >= correct_answer_treshold
        # ... print evaluation details ...
        return row_idx, is_correct

  

This is the core function for presenting a single question and evaluating the user's answer. It ensures unique questions per session, uses the qna_pipeline for model's answer, calculates cosine_similarity for semantic grading, and provides detailed feedback before returning the question's index and correctness.
code Python

    
    def study_session(df):
        while True: # Outer loop for choosing another mode
            qna_mode = choose_QnA_mode()
            while True: # Inner loop for choosing another topic
                topic = choose_topic(df)
                # ... loop for asking max_questions_topic, calling interactive_question_answering ...
                # ... calculate topic summary and mastery assessment ...
                # ... prompt for another topic ...
            # ... prompt for another mode ...
        print("\nStudy session finished. Great work!")

  

This orchestrator function manages the entire study flow using nested while True loops, allowing users to repeatedly select modes and topics. It calls choose_QnA_mode, choose_topic, and interactive_question_answering to run the session, accumulating results and providing summaries, until the user decides to exit.

**Example Output**


    
Welcome to the Study Session!

Available Modes:
- Quiz
- Exam

Type the mode you want to study: Quiz

You selected QnA mode: Quiz

Available Topics:
- Introduction to Machine Learning
- Neural Networks Basics
- Data Preprocessing

Type the topic you want to study: neural networks

You selected topic: Neural Networks Basics

--- Question 1 ---
What is the primary function of an activation function in a neural network?

Please type your answer:
To introduce non-linearity, allowing the network to learn complex patterns.

--- Evaluation Per Question ---
Ground Truth Answer: To introduce non-linearity into the network's output, enabling it to learn complex patterns.
Model Predicted Answer: To introduce non-linearity into the network, allowing it to learn complex patterns.
Your Answer: To introduce non-linearity, allowing the network to learn complex patterns.
Exact Match: 0.90
F1 Score: 0.9500
Cosine Similarity: 0.9850
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.0521 seconds

--- Question 2 ---
What is backpropagation used for in training neural networks?

Please type your answer:
It's an algorithm to adjust the weights based on the error.

--- Evaluation Per Question ---
Ground Truth Answer: It's an algorithm used to effectively train a neural network by iteratively adjusting the weights of the network based on the error gradient of the output.
Model Predicted Answer: To adjust the weights of the network based on the error of the output.
Your Answer: It's an algorithm to adjust the weights based on the error.
Exact Match: 0.65
F1 Score: 0.7800
Cosine Similarity: 0.8875
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.0487 seconds

You've seen all available questions for this topic in this mode.

--- Topic Summary for 'Neural Networks Basics' (Quiz Mode) ---
You answered 2 out of 2 questions correctly.
Congratulations! You have a good grasp of this topic.

Would you like to choose another topic? (yes/no): yes

Available Topics:
- Introduction to Machine Learning
- Neural Networks Basics
- Data Preprocessing

Type the topic you want to study: Data

You selected topic: Data Preprocessing

--- Question 1 ---
Explain the concept of feature scaling in data preprocessing.

Please type your answer:
Feature scaling is normalizing data to a standard range to prevent certain features from dominating.

--- Evaluation Per Question ---
Ground Truth Answer: Feature scaling is a method used to normalize the range of independent variables or features of data. It is important to prevent certain features with larger numerical ranges from dominating the learning process.
Model Predicted Answer: Normalizing data to a standard range to prevent certain features from dominating.
Your Answer: Feature scaling is normalizing data to a standard range to prevent certain features from dominating.
Exact Match: 0.85
F1 Score: 0.9200
Cosine Similarity: 0.9750
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.0510 seconds

--- Topic Summary for 'Data Preprocessing' (Quiz Mode) ---
You answered 1 out of 1 questions correctly.
Congratulations! You have a good grasp of this topic.

Would you like to choose another topic? (yes/no): no

Would you like to choose another mode? (yes/no): no

Study session finished. Great work!

  

In [11]:
import random
import time
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline, BertForQuestionAnswering, BertTokenizer
import re
from collections import Counter


best_model_path  = "/content/drive/My Drive/best_qa_model"
model = BertForQuestionAnswering.from_pretrained(best_model_path)
tokenizer = BertTokenizer.from_pretrained(best_model_path)


device_num = 0 if torch.cuda.is_available() else -1
device = torch.device(f"cuda:{device_num}" if torch.cuda.is_available() else "cpu")
model.to(device)


correct_answer_treshold = 0.50
mastery_threshold = 0.80
max_questions_topic = 10


qna_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device=device_num
)

def get_embedding(text, tokenizer, model):
    # Ensure model is in evaluation mode
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
    with torch.no_grad():
        outputs = model.bert(**inputs)
    return outputs.last_hidden_state[0][0].cpu().numpy()


def choose_QnA_mode():
   available_modes = ['Quiz', 'Exam']

   print("\nAvailable Modes:")
   for m in available_modes:
       print(f"- {m}")

   while True:
       user_input = input("\nType the mode you want to study: ").strip().title() # Use .title() for consistent capitalization

       if user_input in available_modes:
            print(f"\nYou selected QnA mode: {user_input}")
            return user_input
       else:
            print("Invalid mode. Please choose from 'Quiz' or 'Exam'.")


def choose_topic(df):
    available_topics = df['Title'].unique().tolist()

    print("\nAvailable Topics:")
    for t in available_topics:
        print(f"- {t}")

    while True:
        user_input = input("\nType the topic you want to study: ").strip().lower()


        for topic in available_topics:
            if user_input == topic.lower():
                print(f"\nYou selected topic: {topic}")
                return topic


        partial_matches = [topic for topic in available_topics if user_input in topic.lower()]

        if len(partial_matches) == 1:
            print(f"\nYou selected topic: {partial_matches[0]}")
            return partial_matches[0]

        elif len(partial_matches) > 1:
            print("\nDid you mean one of these?")
            for pm in partial_matches:
                print(f"- {pm}")
            print("Please type the full topic name.")

        else:
            print("\nTopic not found. Please try again.")
            print("Available topics include:")
            for t in available_topics:
                print(f"- {t}")


def interactive_question_answering(df, topic, asked_indices_set, question_answering_times, qna_mode):
    """
    df: full dataframe
    topic: chosen Title (string)
    asked_indices_set: a set() of df.index values already asked for this session,
                       used to avoid repeats.
    qna_mode: 'Quiz' or 'Exam'
    Returns: the index asked (or None) and whether the answer was correct
    """

    if qna_mode == 'Quiz':
        question_col = 'Quiz Question'
        answer_col = 'Quiz Answer'
    elif qna_mode == 'Exam':
        question_col = 'Exam Question'
        answer_col = 'Exam Answer'
    else:
        print("Invalid QnA Mode passed to function.")
        return None, False


    topic_df = df[df['Title'] == topic].copy()


    topic_df.dropna(subset=[question_col], inplace=True)
    topic_df = topic_df[topic_df[question_col].str.strip() != '']

    if topic_df.empty:
        print(f"\nNo '{qna_mode}' questions are available for this topic.")
        return None, False


    remaining = topic_df[~topic_df.index.isin(asked_indices_set)]

    if remaining.empty:
        print("\nYou've seen all available questions for this topic in this mode.")
        return None, False


    row = remaining.sample(n=1).iloc[0]
    row_idx = row.name

    question_text = row[question_col]
    ground_truth_answer = row[answer_col]

    print(f"\n--- Question {question_answering_times} ---")
    print(question_text)

    user_answer = input("\nPlease type your answer:\n")


    start = time.time()
    bert_result = qna_pipeline({
        "question": question_text,
        "context": row.get('Context', '')
    })
    inference_time = time.time() - start


    exact = compute_exact_match(user_answer, ground_truth_answer)
    f1 = compute_f1(user_answer, ground_truth_answer)

    user_emb = get_embedding(user_answer, tokenizer, model)
    gt_emb = get_embedding(ground_truth_answer, tokenizer, model)

    cos_sim = cosine_similarity([user_emb], [gt_emb])[0][0]

    print("\n--- Evaluation Per Question ---")
    print(f"Ground Truth Answer: {ground_truth_answer}")
    print(f"Model Predicted Answer: {bert_result.get('answer', '')}")
    print(f"Your Answer: {user_answer}")
    print(f"Exact Match: {exact}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Cosine Similarity: {cos_sim:.4f}")

    is_correct = cos_sim >= correct_answer_treshold
    if is_correct:
        print("Result: Correct! Your answer is semantically similar to the expected one.")
    else:
        print("Result: Incorrect. Need Improvement. Your answer is a bit different from the expected one.")
    print(f"Inference Time: {inference_time:.4f} seconds\n")

    return row_idx, is_correct

def study_session(df):
    print("Welcome to the Study Session!")

    while True:

        qna_mode = choose_QnA_mode()

        while True:
            topic = choose_topic(df)

            asked_indices_this_topic = set()
            correct_answers_this_topic = 0
            questions_asked_this_topic = 0

            for per_topic_qnum in range(1, max_questions_topic + 1):
                idx, is_correct = interactive_question_answering(
                    df,
                    topic,
                    asked_indices_this_topic,
                    question_answering_times=per_topic_qnum,
                    qna_mode=qna_mode
                )
                if idx is None:
                    break
                asked_indices_this_topic.add(idx)
                questions_asked_this_topic += 1
                if is_correct:
                    correct_answers_this_topic += 1


            if questions_asked_this_topic > 0:
                score_percentage = correct_answers_this_topic / questions_asked_this_topic
                print(f"\n--- Topic Summary for '{topic}' ({qna_mode} Mode) ---")
                print(f"You answered {correct_answers_this_topic} out of {questions_asked_this_topic} questions correctly.")

                if score_percentage >= mastery_threshold:
                    print("Congratulations! You have a good grasp of this topic.")
                else:
                    print("You need to study again. Better luck next time")
            else:
                print(f"\nNo questions were asked for '{topic}' in {qna_mode} mode.")


            while True:
                another_topic = input("\nWould you like to choose another topic? (yes/no): ").strip().lower()
                if another_topic in ("yes", "no"):
                    break
                print("Please answer 'yes' or 'no'.")

            if another_topic == "no":
                break


        while True:
            another_mode = input("\nWould you like to choose another mode? (yes/no): ").strip().lower()
            if another_mode in ("yes", "no"):
                break
            print("Please answer 'yes' or 'no'.")

        if another_mode == "no":
            break

    print("\nStudy session finished. Great work!")




study_session(df_original)


Device set to use cuda:0


Welcome to the Study Session!

Available Modes:
- Quiz
- Exam

Type the mode you want to study: Exam

You selected QnA mode: Exam

Available Topics:
- Linear Regression
- Logistic Regression
- Decision Tree
- Support Vector Machines (SVM)
- Random Forest
- Nave Bayes Classifiers
- Neural Network
- Backpropagation
- Encoder-Decoder Model
- Boosting
- Bagging
- Gradient Boosting
- Machine Learning Libraries
- Scikit-Learn (Sklearn)
- XGBoost
- PyTorch
- Feature Engineering
- Feature Extraction
- Feature Selection
- Vector Embedding
- Model Training
- Loss Function
- Training Data
- Model Parameters
- Gradient Descent
- Stochastic Gradient Descent
- Hyperparameter Tuning
- Learning Rate in Machine Learning
- Regularization
- Overfitting
- Underfitting
- Artificial Intelligence
- Natural Language Processing
- Machine Learning
- Surpervised Machine Learning
- labelled data
- Deep Learning
- Text Data
- NLP
- Data Collection
- Text Preprocessing Pipeline
- Lowercasing
- Special Character Rem




--- Evaluation Per Question ---
Ground Truth Answer: A standard outlier has a large residual (far from the line) but might not change the lines position much. An influential outlier (often with high leverage) actually pulls the regression line towards itself, significantly altering the model coefficients.
Model Predicted Answer: It measures the effect of deleting a given observation.
Your Answer: A standard outlier has a large residual (far from the line) but might not change the line's position much. An influential outlier (often with high leverage) actually pulls the regression line towards itself, significantly altering the model coefficients.
Exact Match: 0
F1 Score: 0.8421
Cosine Similarity: 0.9978
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.3417 seconds


--- Question 2 ---
Which GLM distribution family would you use to predict the number of customers entering a store per hour?

Please type your answer:
Poisson distribution (suitab




--- Evaluation Per Question ---
Ground Truth Answer: Poisson distribution (suitable for count data).
Model Predicted Answer: binary).
Your Answer: Poisson distribution (suitable for count data).
Exact Match: 1
F1 Score: 1.0000
Cosine Similarity: 1.0000
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1797 seconds


--- Question 3 ---
How does GLS improve upon OLS when error terms are correlated?

Please type your answer:
Generalized Least Squares (GLS) improves upon Ordinary Least Squares (OLS) by providing more efficient (lower variance) estimates when error terms are correlated (autocorrelation or serial correlation) or have unequal variances (heteroskedasticity). 





--- Evaluation Per Question ---
Ground Truth Answer: OLS is inefficient (large variance) when errors are correlated. GLS transforms the data using the known covariance structure to decorrelate the errors, producing the Best Linear Unbiased Estimator (BLUE).
Model Predicted Answer: OLS estimates remain unbiased but are inefficient.
Your Answer: Generalized Least Squares (GLS) improves upon Ordinary Least Squares (OLS) by providing more efficient (lower variance) estimates when error terms are correlated (autocorrelation or serial correlation) or have unequal variances (heteroskedasticity). 
Exact Match: 0
F1 Score: 0.1000
Cosine Similarity: 0.7687
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1303 seconds


--- Question 4 ---
How does the Ridge penalty affect the coefficients of correlated features?

Please type your answer:
The Ridge penalty shrinks the coefficients of correlated features and distributes their influence more evenly, preven




--- Evaluation Per Question ---
Ground Truth Answer: It shrinks them towards zero and tends to spread the coefficient value equally among correlated features, rather than assigning a huge value to one and a huge negative value to another.
Model Predicted Answer: Ridge adds a penalty 2 to the loss function.
Your Answer: The Ridge penalty shrinks the coefficients of correlated features and distributes their influence more evenly, preventing one coefficient from becoming disproportionately large to compensate for the other. Instead of large positive and negative coefficients, Ridge regression will shrink both coefficients towards zero, making them closer in value and smaller in magnitude
Exact Match: 0
F1 Score: 0.2683
Cosine Similarity: 0.8350
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1316 seconds


--- Question 5 ---
How can someone execute the linear regression method, and in what kinds of platforms is it possible to perform it?

Pleas




--- Evaluation Per Question ---
Ground Truth Answer: The general requirement is the availability of statistical or mathematical computation librariesfunctions. Why availability is beneficial: it allows data scientists to integrate regression analysis into various existing workflows (e.g., large-scale web services, research). How consistency is maintained: all platforms fundamentally solve for the same least squares coefficient estimates. Demonstration: A GUI environment (like SPSS) involves clicking menus to select variables, while a code-based environment (like Pythons scikit-learn) requires writing explicit code to fit the model (model.fit(X, y)).
Model Predicted Answer: You
Your Answer: The linear regression method can be executed using various programming languages and platforms, ranging from simple spreadsheet software to advanced statistical and machine learning environments.
Exact Match: 0
F1 Score: 0.0980
Cosine Similarity: 0.1646
Result: Incorrect. Need Improvement. Your answ




--- Evaluation Per Question ---
Ground Truth Answer: The p-value becomes very small (approaching zero), indicating strong statistical significance.
Model Predicted Answer: A
Your Answer: When the t-statistic is very large (e.g., 10), the p-value becomes very small, approaching zero. 
Exact Match: 0
F1 Score: 0.3077
Cosine Similarity: 0.9174
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1415 seconds


--- Question 7 ---
How do you avoid the Dummy Variable Trap?

Please type your answer:
To avoid the dummy variable trap, you must omit one of the dummy variables for each set of dummy variables, or alternatively, remove the intercept from your model





--- Evaluation Per Question ---
Ground Truth Answer: By dropping one of the dummy variables (using N1 columns for N categories).
Model Predicted Answer: Solution: Drop one column (Reference category) or drop Intercept.
Your Answer: To avoid the dummy variable trap, you must omit one of the dummy variables for each set of dummy variables, or alternatively, remove the intercept from your model
Exact Match: 0
F1 Score: 0.2927
Cosine Similarity: 0.8161
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1267 seconds


--- Question 8 ---
What is the difference between a Standardized Residual and a Studentized Residual?

Please type your answer:
The main difference is that a standardized residual uses the same standard deviation for all data points, while a studentized residual uses a standard deviation that is calculated separately for each point, excluding that point from the model's calculation





--- Evaluation Per Question ---
Ground Truth Answer: Standardized uses the population variance (often unknownestimated globally); Studentized divides by the sample standard deviation estimate, which varies per point based on leverage.
Model Predicted Answer: Internally Studentized uses the variance from the full dataset.
Your Answer: The main difference is that a standardized residual uses the same standard deviation for all data points, while a studentized residual uses a standard deviation that is calculated separately for each point, excluding that point from the model's calculation
Exact Match: 0
F1 Score: 0.2258
Cosine Similarity: 0.9122
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1294 seconds


--- Question 9 ---
You perform a Multiple Linear Regression with 20 features. The results show a statistically significant F-statistic (p 0.05. What is the likely cause of this contradiction, and how do you interpret the models utility?

Ple




--- Evaluation Per Question ---
Ground Truth Answer: This contradiction is a hallmark sign of Multicollinearity. The predictors are highly correlated with each other, which inflates the standard errors and lowers the t-statistics (making individual variables look insignificant). However, the significant F-statistic confirms that the variables jointly explain the variance in the target. The model is useful for prediction, but useless for interpretation (determining which specific variable drives the result).
Model Predicted Answer: While a t-test asks Is this specific variable useful?,
Your Answer: ChatGPT said:This contradiction usually happens when the model, as a whole, explains a meaningful amount of variance in the target (significant F-statistic), but individual coefficients appear non-significant because many of the predictors are correlated with each other—multicollinearity. When features overlap in the information they provide, the model can still perform well overall, but no 




--- Evaluation Per Question ---
Ground Truth Answer: Yes, it is still unbiased, but it is no longer the Best (most efficient) estimator.
Model Predicted Answer: (BLUE).
Your Answer: Yes. Even if the assumption of homoscedasticity is violated, the OLS estimator remains unbiased as long as the other key assumptions (linearity, no omitted variables, no perfect multicollinearity, and zero-mean errors) still hold. However, the estimator is no longer efficient—its variance increases, making the coefficient estimates less reliable. Standard errors also become incorrect, which can lead to misleading hypothesis tests and confidence intervals unless robust standard errors are used.
Exact Match: 0
F1 Score: 0.1190
Cosine Similarity: 0.6388
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1320 seconds


--- Topic Summary for 'Linear Regression' (Exam Mode) ---
You answered 9 out of 10 questions correctly.
Congratulations! You have a good grasp of this to

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



--- Evaluation Per Question ---
Ground Truth Answer: A Natural Logarithm transformation (ln(y)).
Model Predicted Answer: It is a parametric power transformation indexed by lambda ().
Your Answer: A Natural Logarithm transformation (ln(y)).
Exact Match: 1
F1 Score: 1.0000
Cosine Similarity: 1.0000
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1412 seconds


--- Question 2 ---
What is the advantage of Winsorization over Trimming (removing outliers)?

Please type your answer:
The primary advantage of Winsorization over trimming is that it retains all observations in the dataset, thereby preserving the original sample size, which is beneficial for statistical power and subsequent multivariate analyse





--- Evaluation Per Question ---
Ground Truth Answer: Winsorization preserves the sample size and the fact that the data point was high or low, whereas Trimming discards data, potentially reducing statistical power or introducing bias.
Model Predicted Answer: Instead
Your Answer: The primary advantage of Winsorization over trimming is that it retains all observations in the dataset, thereby preserving the original sample size, which is beneficial for statistical power and subsequent multivariate analyse
Exact Match: 0
F1 Score: 0.2712
Cosine Similarity: 0.7950
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1377 seconds


--- Question 3 ---
How does MICE preserve relationships between variables better than Mean Imputation?

Please type your answer:
MICE (Multiple Imputation by Chained Equations) preserves relationships between variables better than Mean Imputation because MICE uses information from the other variables in the dataset to estima




--- Evaluation Per Question ---
Ground Truth Answer: By predicting missing values using regression models based on other observed variables, rather than just filling with a constant average.
Model Predicted Answer: It iteratively models each missing variable as a regression of other variables.
Your Answer: MICE (Multiple Imputation by Chained Equations) preserves relationships between variables better than Mean Imputation because MICE uses information from the other variables in the dataset to estimate the missing values, thus accounting for correlations and covariance structures, while mean imputation does not
Exact Match: 0
F1 Score: 0.1290
Cosine Similarity: 0.8352
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1923 seconds


--- Question 4 ---
Why scale data before using KNN?

Please type your answer:
Scaling data before using K-Nearest Neighbors (KNN) is crucial because KNN is a distance-based algorithm. The core reason for scaling is 




--- Evaluation Per Question ---
Ground Truth Answer: To prevent features with large ranges from dominating the distance calculation.
Model Predicted Answer: Imputation: Fill missing.
Your Answer: Scaling data before using K-Nearest Neighbors (KNN) is crucial because KNN is a distance-based algorithm. The core reason for scaling is to ensure that all features contribute equally to the distance calculations, preventing features with larger magnitudes from dominating the process.
Exact Match: 0
F1 Score: 0.2692
Cosine Similarity: 0.8565
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1248 seconds


--- Question 5 ---
Why is Complete Case Analysis (dropping rows with NaNs) biased if data is NOT MCAR?

Please type your answer:
Complete Case Analysis (CCA), where rows with missing data (NaNs) are dropped, produces biased results if the data is not Missing Completely At Random (MCAR) because it can systematically exclude specific subsets of the pop




--- Evaluation Per Question ---
Ground Truth Answer: If data is MAR or MNAR, the missingness is systematic. Dropping those rows changes the distribution of the population in the dataset (e.g., removing a specific demographic), leading to a biased model that does not generalize to the real world.
Model Predicted Answer: Data
Your Answer: Complete Case Analysis (CCA), where rows with missing data (NaNs) are dropped, produces biased results if the data is not Missing Completely At Random (MCAR) because it can systematically exclude specific subsets of the population, leading to a sample that does not represent the full dataset
Exact Match: 0
F1 Score: 0.3488
Cosine Similarity: 0.3526
Result: Incorrect. Need Improvement. Your answer is a bit different from the expected one.
Inference Time: 0.1442 seconds


--- Question 6 ---
How does CEM handle continuous variables during the matching process?

Please type your answer:
It 'coarsens' them into bins (e.g., Age 20-30) to allow for exact matc




--- Evaluation Per Question ---
Ground Truth Answer: It coarsens them into bins (e.g., Age 20-30) to allow for exact matching on the bin, rather than requiring identical continuous values.
Model Predicted Answer: It
Your Answer: It 'coarsens' them into bins (e.g., Age 20-30) to allow for exact matching on the bin, rather than requiring identical continuous values.
Exact Match: 0
F1 Score: 0.9545
Cosine Similarity: 0.9842
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1278 seconds


--- Question 7 ---
Why is MICE better than Mean Imputation?

Please type your answer:
MICE is better than mean imputation because it accounts for relationships between variables, which leads to more accurate and less biased results





--- Evaluation Per Question ---
Ground Truth Answer: It preserves correlations between features.
Model Predicted Answer: Impute: MICE.
Your Answer: MICE is better than mean imputation because it accounts for relationships between variables, which leads to more accurate and less biased results
Exact Match: 0
F1 Score: 0.1481
Cosine Similarity: 0.9095
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1360 seconds


--- Question 8 ---
What is the main scalability drawback of KNN Imputation?

Please type your answer:
AI Overview The main scalability drawback of K-Nearest Neighbors (KNN) imputation is its computational expense and high memory consumption on large datasets. 





--- Evaluation Per Question ---
Ground Truth Answer: It requires calculating pairwise distances between all data points, making it computationally slow and memory-intensive for large datasets.
Model Predicted Answer: space.
Your Answer: AI Overview The main scalability drawback of K-Nearest Neighbors (KNN) imputation is its computational expense and high memory consumption on large datasets. 
Exact Match: 0
F1 Score: 0.1500
Cosine Similarity: 0.9224
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1272 seconds


--- Question 9 ---
What is the main risk of deterministic Regression Imputation?

Please type your answer:
The main risk of deterministic regression imputation is that it underestimates the variability (variance) in the data and the uncertainty of the estimates





--- Evaluation Per Question ---
Ground Truth Answer: It artificially inflates correlations between variables (since missing values are perfectly predicted by others) and biases standard errors downwards.
Model Predicted Answer: value.
Your Answer: The main risk of deterministic regression imputation is that it underestimates the variability (variance) in the data and the uncertainty of the estimates
Exact Match: 0
F1 Score: 0.0952
Cosine Similarity: 0.8708
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1403 seconds


--- Question 10 ---
What is the main advantage of SMOTE over simple Random Oversampling?

Please type your answer:
The main advantage of SMOTE (Synthetic Minority Over-sampling Technique) over simple Random Oversampling is that SMOTE generates synthetic minority class samples, which increases the diversity of the training data and helps to prevent overfitting





--- Evaluation Per Question ---
Ground Truth Answer: Random Oversampling duplicates existing points, leading to overfitting (the model memorizes specific points). SMOTE creates new, plausible variations of data, helping the model generalize the decision boundary for the minority class.
Model Predicted Answer: Instead
Your Answer: The main advantage of SMOTE (Synthetic Minority Over-sampling Technique) over simple Random Oversampling is that SMOTE generates synthetic minority class samples, which increases the diversity of the training data and helps to prevent overfitting
Exact Match: 0
F1 Score: 0.2462
Cosine Similarity: 0.7046
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1273 seconds


--- Topic Summary for 'Data Cleaning' (Exam Mode) ---
You answered 9 out of 10 questions correctly.
Congratulations! You have a good grasp of this topic.

Would you like to choose another topic? (yes/no): no

Would you like to choose another mode? (yes/no




--- Evaluation Per Question ---
Ground Truth Answer: Entropy is a concept from information theory that measures the impurity of the sample values, with values between 0 and 1. It equals zero when all samples belong to one class and is at its highest (1) when half are in one class and half in another. The attribute with the smallest amount of entropy should be used to select the best feature to split on.
Model Predicted Answer: Its difficult to explain information gain without first discussing entropy.
Your Answer: Entropy is a concept from information theory that measures the impurity of the sample values, with values between 0 and 1. It equals zero when all samples belong to one class and is at its highest (1) when half are in one class and half in another. The attribute with the smallest amount of entropy should be used to select the best feature to split on.
Exact Match: 1
F1 Score: 0.7424
Cosine Similarity: 1.0000
Result: Correct! Your answer is semantically similar to the expecte




--- Evaluation Per Question ---
Ground Truth Answer: Their accuracy can be maintained by forming an ensemble via a random forest algorithm, which predicts more accurate results particularly when the individual trees are uncorrelated with each other.
Model Predicted Answer: Another
Your Answer: The predictive accuracy of individual decision trees can be improved using pruning or through ensemble methods that combine multiple trees. Ensemble methods become particularly effective when the individual trees are uncorrelated (diverse). 
Exact Match: 0
F1 Score: 0.3607
Cosine Similarity: 0.8769
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1316 seconds


--- Question 3 ---
What is the term for the metric used to decide the optimal split at each node, and name two common examples of this metric?

Please type your answer:
The term used for the metric that decides the optimal split at each node in a decision tree is the splitting criterion (or impur




--- Evaluation Per Question ---
Ground Truth Answer: The metric is often called Impurity Measure or Split Criterion. Two common examples are Gini Impurity and Information Gain (using entropy).
Model Predicted Answer: It
Your Answer: The term used for the metric that decides the optimal split at each node in a decision tree is the splitting criterion (or impurity measure). Two common examples of this metric are Gini impurity and information gain (or entropy)
Exact Match: 0
F1 Score: 0.4333
Cosine Similarity: 0.9182
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1288 seconds


--- Question 4 ---
If Income is missing, and the tree uses Education Level as a surrogate, what does this imply?

Please type your answer:
When a decision tree uses a variable like Education Level as a surrogate for Income, it implies that the two variables are highly correlated and Education Level can serve as an effective substitute for predicting the target outcome





--- Evaluation Per Question ---
Ground Truth Answer: It implies that Education Level is highly correlated with Income and produces a similar partition of the data at that node.
Model Predicted Answer: splits.
Your Answer: When a decision tree uses a variable like Education Level as a surrogate for Income, it implies that the two variables are highly correlated and Education Level can serve as an effective substitute for predicting the target outcome
Exact Match: 0
F1 Score: 0.3390
Cosine Similarity: 0.8837
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1271 seconds


--- Question 5 ---
Which is computationally cheaper: Gini or Entropy?

Please type your answer:
The Gini index is computationally cheaper to calculate than Entropy. This is because the Gini calculation involves only squaring and summing probabilities, whereas the Entropy calculation requires the use of more complex logarithmic functions





--- Evaluation Per Question ---
Ground Truth Answer: Gini (no log calculation).
Model Predicted Answer: Gini is faster. Entropy penalizes impurity slightly more.
Your Answer: The Gini index is computationally cheaper to calculate than Entropy. This is because the Gini calculation involves only squaring and summing probabilities, whereas the Entropy calculation requires the use of more complex logarithmic functions
Exact Match: 0
F1 Score: 0.0526
Cosine Similarity: 0.7738
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1264 seconds


--- Question 6 ---
How is the best Surrogate Split chosen?

Please type your answer:
The best surrogate split in a decision tree is chosen by selecting the predictor variable and split point that best mimics the primary (optimal) split, as measured by a predictive measure of association.





--- Evaluation Per Question ---
Ground Truth Answer: It is the feature that best mimics the split of the primary feature (highest correlation with the primary split outcome).
Model Predicted Answer: If
Your Answer: The best surrogate split in a decision tree is chosen by selecting the predictor variable and split point that best mimics the primary (optimal) split, as measured by a predictive measure of association.
Exact Match: 0
F1 Score: 0.3019
Cosine Similarity: 0.9416
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1291 seconds


--- Question 7 ---
When dealing with a Decision Tree for a Regression task, what measure is typically used to evaluate the quality of a split instead of Gini Impurity or Information Gain?

Please type your answer:
The splitting criterion for regression is typically based on minimizing the variance or Mean Squared Error (MSE) within the resulting child nodes.





--- Evaluation Per Question ---
Ground Truth Answer: The splitting criterion for regression is typically based on minimizing the variance or Mean Squared Error (MSE) within the resulting child nodes.
Model Predicted Answer: Decision
Your Answer: The splitting criterion for regression is typically based on minimizing the variance or Mean Squared Error (MSE) within the resulting child nodes.
Exact Match: 1
F1 Score: 0.9091
Cosine Similarity: 1.0000
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1282 seconds


--- Question 8 ---
How does the Hoeffding bound help in streaming decision trees?

Please type your answer:
The Hoeffding bound helps in streaming decision trees by providing a statistically guaranteed minimum number of data instances needed at a node to select the best splitting attribute. This allows the algorithm to make a decision and split the node with a high probability of being correct, even when using only a small sample of the 




--- Evaluation Per Question ---
Ground Truth Answer: It statistically calculates how many samples are needed to guarantee (with high probability) that the chosen split is truly the best one, allowing the tree to grow without seeing the full dataset.
Model Predicted Answer: split.
Your Answer: The Hoeffding bound helps in streaming decision trees by providing a statistically guaranteed minimum number of data instances needed at a node to select the best splitting attribute. This allows the algorithm to make a decision and split the node with a high probability of being correct, even when using only a small sample of the data stream, which is essential for a one-pass learning process
Exact Match: 0
F1 Score: 0.1633
Cosine Similarity: 0.8729
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1280 seconds


--- Question 9 ---
Why does replacing a subtree with a leaf node help generalization?

Please type your answer:
Replacing a subtree with a leaf




--- Evaluation Per Question ---
Ground Truth Answer: It removes complex, specific branches that likely fit noise in the training data, replacing them with a majority-class prediction that is more robust for unseen data.
Model Predicted Answer: It splits the data into a training set and a validation set.
Your Answer: Replacing a subtree with a leaf node can help generalization by simplifying the model and reducing overfitting to the training data. This process, often referred to as pruning, addresses the problem of models that are too complex and have learned noise in the training set rather than the underlying true relationships
Exact Match: 0
F1 Score: 0.2338
Cosine Similarity: 0.9171
Result: Correct! Your answer is semantically similar to the expected one.
Inference Time: 0.1364 seconds


--- Question 10 ---
What splitting criterion does ID3 use?

Please type your answer:
Sparse: Most elements in the vector are zero (e.g., Bag-of-Words or TF-IDF). Dense: Most elements contain non-ze




--- Evaluation Per Question ---
Ground Truth Answer: Information Gain.
Model Predicted Answer: ID3 vs C4.5: ID3: Information Gain.
Your Answer: Sparse: Most elements in the vector are zero (e.g., Bag-of-Words or TF-IDF). Dense: Most elements contain non-zero, meaningful values (e.g., Word Embeddings like Word2Vec).
Exact Match: 0
F1 Score: 0.0000
Cosine Similarity: -0.0085
Result: Incorrect. Need Improvement. Your answer is a bit different from the expected one.
Inference Time: 0.1273 seconds


--- Topic Summary for 'Decision Tree' (Exam Mode) ---
You answered 9 out of 10 questions correctly.
Congratulations! You have a good grasp of this topic.

Would you like to choose another topic? (yes/no): no

Would you like to choose another mode? (yes/no): no

Study session finished. Great work!


**Function Description**

This Gradio application creates an interactive study session for users to test their knowledge on various topics in either "Quiz" or "Exam" mode. It loads a pre-trained BERT question-answering model and a Sentence Transformer model for semantic similarity. The UI allows users to select a study mode and a topic, generates questions with context, accepts user answers, and provides feedback based on semantic similarity to both a ground truth answer and the BERT model's extracted answer. It also manages session state to prevent repeating questions and allows users to reset their session.

**Input**

    User Selections:

        mode_input: Radio button selection ("Quiz" or "Exam").

        topic_input: Dropdown selection of available topics from df_original.

    User Input:

        user_answer_input: Textbox for the user to type their answer.

    Internal Data:

        df_original: A Pandas DataFrame containing contexts, quiz questions/answers, and exam questions/answers.

        best_qa_model: Path to the pre-trained BERT Question Answering model.

**Output**

    Interactive Gradio UI: A web-based interface with the following dynamic elements:

        context_display: Textbox showing the context for the current question.

        question_display: Textbox showing the generated question.

        user_answer_input: Textbox for user input.

        model_answer_output: Textbox displaying the answer extracted by the BERT model.

        similarity_output: Textbox showing detailed semantic similarity scores (overall, vs. ground truth, vs. model's answer).

        final_eval_output: Textbox providing a "Correct!" or "Incorrect." judgment.

    Session State Management: The asked_indices_state tracks questions already presented to the user to ensure variety.

    Console Output: Messages indicating model loading status and Gradio launch information.

**Essential Syntaxes**


    
    qa_model = BertForQuestionAnswering.from_pretrained(best_model_path).to(device)
    qa_tokenizer = BertTokenizerFast.from_pretrained(best_model_path)
    qna_pipeline = pipeline("question-answering", model=qa_model, tokenizer=qa_tokenizer, device=device_num)

  

Loads the fine-tuned BERT Question Answering model and its tokenizer from a specified path and initializes a Hugging Face pipeline for efficient Q&A inference.
code Python

    
    similarity_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

  

Loads a pre-trained Sentence Transformer model to convert text into numerical embeddings for semantic similarity calculations.
code Python

    
    def get_new_question(qna_mode, topic, asked_indices_json):
        # ... logic to filter DataFrame, select a random question, update asked_indices ...
        return question, context, ground_truth, json.dumps(cleaned_asked_indices), gr.Tabs(selected=1)

  

This function is the core logic for fetching a new question. It filters questions based on the selected mode and topic, ensures a question hasn't been asked before in the current session, updates the asked_indices_state, and returns the question, context, and ground truth answer. It also programmatically switches the active Gradio tab.
code Python

    
    def evaluate_answer(context, question, ground_truth, user_answer):
        model_pred = qna_pipeline(question=question, context=context)['answer']
        user_emb = get_embedding(user_answer)
        gt_emb = get_embedding(ground_truth)
        model_pred_emb = get_embedding(model_pred)
        sim_to_ground_truth = util.cos_sim(user_emb, gt_emb).item()
        sim_to_model_pred = util.cos_sim(user_emb, model_pred_emb).item()
        final_similarity_score = max(sim_to_ground_truth, sim_to_model_pred)
        # ... rest of the evaluation logic ...
        return model_pred, similarity_details, judgment

  

This function performs the evaluation. It first uses the BERT Q&A pipeline to get an answer from the model based on the context. Then, it converts the user's answer, ground truth, and model's answer into embeddings using the Sentence Transformer. It calculates cosine similarity between the user's answer and both the ground truth and the model's answer, taking the maximum as the final similarity score to determine correctness.
code Python

        
    with gr.Blocks(theme=gr.themes.Soft()) as demo:
        # ... UI component definitions ...
        start_btn.click(fn=get_new_question, inputs=[mode_input, topic_input, asked_indices_state], outputs=[question_display, context_display, ground_truth_display, asked_indices_state, tabs])
        next_question_btn.click(fn=get_new_question, inputs=[mode_input, topic_input, asked_indices_state], outputs=[question_display, context_display, ground_truth_display, asked_indices_state, tabs]).then(fn=clear_previous_results, inputs=None, outputs=[user_answer_input, model_answer_output, similarity_output, final_eval_output])
        evaluate_btn.click(...)
        reset_btn.click(...)

    


**Comment and Observation**

This Gradio application effectively integrates multiple NLP models (BERT for Q&A, Sentence Transformers for semantic similarity) into a user-friendly educational tool.


# **Question and Answering UI**

In [15]:
import gradio as gr
import pandas as pd
import random
import json
from transformers import pipeline, BertForQuestionAnswering, BertTokenizerFast
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics.pairwise import cosine_similarity
import torch
import re


available_topics = sorted(df_original['Title'].unique().tolist())

device_num = 0 if torch.cuda.is_available() else -1
device = torch.device(f"cuda:{device_num}" if torch.cuda.is_available() else "cpu")

best_model_path = "/content/drive/My Drive/best_qa_model" # Using the fully trained model
qa_model = BertForQuestionAnswering.from_pretrained(best_model_path).to(device)
qa_tokenizer = BertTokenizerFast.from_pretrained(best_model_path)
qna_pipeline = pipeline("question-answering", model=qa_model, tokenizer=qa_tokenizer, device=device_num)

similarity_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
print("All components loaded successfully!")

# --- 2. Define Backend Logic for the UI ---
def get_embedding(text):
    return similarity_model.encode(text, convert_to_tensor=True)

def get_new_question(qna_mode, topic, asked_indices_json):
    if not qna_mode or not topic:
        return "Please select a mode and topic first.", "", "", asked_indices_json, gr.Tabs(selected=0)

    question_col = 'Quiz Question' if qna_mode == 'Quiz' else 'Exam Question'
    answer_col = 'Quiz Answer' if qna_mode == 'Quiz' else 'Exam Answer'
    topic_df = df_original[(df_original['Title'] == topic) & (df_original[question_col] != '') & (df_original[answer_col] != '')].copy()

    asked_indices = set(json.loads(asked_indices_json))
    remaining_df = topic_df[~topic_df.index.isin(asked_indices)]

    if remaining_df.empty:
        return "No more questions for this topic! Please reset the session.", "", "", json.dumps([]), gr.Tabs(selected=0)

    random_row = remaining_df.sample(n=1).iloc[0]
    asked_indices.add(random_row.name)

    context, question, ground_truth = random_row['Context'], random_row[question_col], random_row[answer_col]


    cleaned_asked_indices = [int(i) for i in asked_indices]

    return question, context, ground_truth, json.dumps(cleaned_asked_indices), gr.Tabs(selected=1)

def evaluate_answer(context, question, ground_truth, user_answer):
    if not user_answer:
        return "", "", "Please provide an answer."

    model_pred = qna_pipeline(question=question, context=context)['answer']

    # Convert all three answers to numerical vectors (embeddings)
    user_emb = get_embedding(user_answer)
    gt_emb = get_embedding(ground_truth)
    model_pred_emb = get_embedding(model_pred)

    # How similar is the user's answer to the dataset's ground truth?
    sim_to_ground_truth = util.cos_sim(user_emb, gt_emb).item()

    # How similar is the user's answer to the model's extracted sentence?
    sim_to_model_pred = util.cos_sim(user_emb, model_pred_emb).item()


    # Take the HIGHEST of the two similarity scores.
    final_similarity_score = max(sim_to_ground_truth, sim_to_model_pred)


    is_similar = final_similarity_score >= 0.50

    judgment = "Correct! Your answer is semantically aligned with the core concepts." if is_similar else "Incorrect. Your answer seems to be missing key concepts or is phrased too differently."

    # We will now output both similarity scores for transparency
    similarity_details = (
        f"Overall Score: {final_similarity_score:.4f}\n"
        f"(vs. Dataset Answer: {sim_to_ground_truth:.4f})\n"
        f"(vs. Extracted Context: {sim_to_model_pred:.4f})"
    )

    return model_pred, similarity_details, judgment


def clear_previous_results():
    return "", "", "", ""


def reset_session():

    return "", "", "", "", "", "", "", json.dumps([]), gr.Tabs(selected=0)


with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown("# Interactive Study Session")
    asked_indices_state = gr.State(value=json.dumps([]))

    with gr.Tabs() as tabs:
        with gr.TabItem("1. Setup Session", id=0):
            mode_input = gr.Radio(['Quiz', 'Exam'], label="Choose Study Mode")
            topic_input = gr.Dropdown(available_topics, label="Choose a Topic")
            start_btn = gr.Button("Start Session", variant="primary")

        with gr.TabItem("2. Answer Question", id=1):
            with gr.Row():
                with gr.Column(scale=2):
                    context_display = gr.Textbox(label="Context", lines=10, interactive=False)
                    question_display = gr.Textbox(label="Generated Question", lines=4, interactive=False)
                    ground_truth_display = gr.Textbox(label="Ground Truth (Hidden)", visible=False)
                    user_answer_input = gr.Textbox(label="Your Answer", lines=4, placeholder="Type your answer here...")
                    with gr.Row():
                        evaluate_btn = gr.Button("Evaluate My Answer", variant="primary")
                        next_question_btn = gr.Button("Next Question")
                        reset_btn = gr.Button("Start New Session")
                with gr.Column(scale=1):
                    gr.Markdown("### Evaluation Results")
                    model_answer_output = gr.Textbox(label="Model's Extracted Answer", interactive=False)
                    similarity_output = gr.Textbox(label="Semantic Similarity Scores", lines=4, interactive=False)
                    final_eval_output = gr.Textbox(label="Final Judgment", interactive=False)



    start_btn.click(fn=get_new_question, inputs=[mode_input, topic_input, asked_indices_state], outputs=[question_display, context_display, ground_truth_display, asked_indices_state, tabs])
    next_question_btn.click(fn=get_new_question, inputs=[mode_input, topic_input, asked_indices_state], outputs=[question_display, context_display, ground_truth_display, asked_indices_state, tabs]).then(fn=clear_previous_results,
            inputs=None,
            outputs=[user_answer_input, model_answer_output, similarity_output, final_eval_output])
    evaluate_btn.click(
        fn=evaluate_answer,
        inputs=[context_display, question_display, ground_truth_display, user_answer_input],
        outputs=[model_answer_output, similarity_output, final_eval_output]
    )
    reset_btn.click(fn=reset_session, inputs=None, outputs=[question_display, context_display, ground_truth_display, user_answer_input, model_answer_output, similarity_output, final_eval_output, asked_indices_state, tabs])

demo.launch(debug=True, share=True)

Device set to use cuda:0


All components loaded successfully!
Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://5941bde93cd83e227a.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://5941bde93cd83e227a.gradio.live


