<a href="https://colab.research.google.com/github/ma55530/SemEval2026-CLARITY-FER/blob/main/hierarchiral/BERT(ClearRvsRest)v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Installing the necessary libraries**

In [1]:
%pip -q install scikit-learn torch pandas datasets transformers torchvision accelerate

In [2]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, precision_score, recall_score, f1_score
from sklearn.utils.class_weight import compute_class_weight
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments


# **Loading the data**

In [3]:
print("Loading dataset...")
df = load_dataset("ailsntua/QEvasion")

def clarity_to_label(row):
    mapping = {
        "Clear Reply": 0,
        "Ambivalent": 1,
        "Clear Non-Reply": 2
    }
    row["label"] = mapping[row["clarity_label"]]

    # Binary label for the first stage: 0 = Clear Reply, 1 = everything else
    # (Ambivalent or Clear Non-Reply). Compare against the integer label
    # assigned above, not the raw "clarity_label" string.
    row["binary_label"] = 0 if row["label"] == 0 else 1
    return row

df = df.map(clarity_to_label)
y_test = df["test"]["label"]


non_reply_lengths = [len(ans.split()) for ans, label in zip(df["train"]["interview_answer"], df["train"]["label"]) if label == 2]
other_lengths     = [len(ans.split()) for ans, label in zip(df["train"]["interview_answer"], df["train"]["label"]) if label != 2]

print("Mean length Clear Non-Reply:", np.mean(non_reply_lengths))
print("Mean length Others (Reply + Ambivalent):", np.mean(other_lengths))

print("Count Non-reply:", len(non_reply_lengths))
print("Count Others:", len(other_lengths))



Loading dataset...



Mean length Clear Non-Reply: 137.8061797752809
Mean length Others (Reply + Ambivalent): 311.506468305304
Count Non-reply: 356
Count Others: 3092


# **Defining the evaluation**

In [4]:
def evaluate(y_true, y_pred):
    print(classification_report(y_true, y_pred))
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
    plt.show()

def compute_metrics_binary(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "precision": precision_score(labels, preds, average='binary', zero_division=0),
        "recall": recall_score(labels, preds, average='binary', zero_division=0),
        "f1": f1_score(labels, preds, average='binary', zero_division=0)
    }

def compute_metrics_fine(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "precision": precision_score(labels, preds, average='macro', zero_division=0),
        "recall": recall_score(labels, preds, average='macro', zero_division=0),
        "f1": f1_score(labels, preds, average='macro', zero_division=0)
    }
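
The two `compute_metrics_*` functions above differ only in the `average` argument. As a rough illustration of what that changes, the dependency-free sketch below recomputes both variants by hand on made-up labels. The helper names `per_class_f1` and `macro_f1` and the toy data are hypothetical; the notebook itself relies on scikit-learn for these numbers.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 of one class treated as the positive class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (average='macro')."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)


# Toy three-class labels: 0 = Clear Reply, 1 = Ambivalent, 2 = Clear Non-Reply
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
three_class_macro_f1 = macro_f1(y_true, y_pred)

# Binary view used by the first stage: class 1 means "not a Clear Reply".
# average='binary' scores only the positive class (label 1).
y_true_bin = [0 if t == 0 else 1 for t in y_true]
y_pred_bin = [0 if p == 0 else 1 for p in y_pred]
binary_f1 = per_class_f1(y_true_bin, y_pred_bin, cls=1)
```

On this toy data the binary F1 is higher than the macro F1 because the binary score ignores how well class 0 is recovered and does not distinguish Ambivalent from Clear Non-Reply.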


# **Loading the model and tokenizer**

In [5]:


binary_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

fine_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(example):
    encoded = tokenizer(
        # The leading space keeps "Answer:" from fusing with the end of the question
        "Question: " + example["interview_question"] +
        " Answer: " + example["interview_answer"],
        padding="max_length",
        max_length=256,
        truncation=True
    )
    encoded["label"] = example["label"]
    encoded["binary_label"] = example["binary_label"]
    return encoded

tokenized_train = df["train"].map(tokenize)
tokenized_test = df["test"].map(tokenize)

cols_to_remove = ["interview_question", "interview_answer", "clarity_label"]

tokenized_train = tokenized_train.remove_columns(cols_to_remove)
tokenized_test  = tokenized_test.remove_columns(cols_to_remove)

tokenized_train.set_format("torch")
tokenized_test.set_format("torch")




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

# **Preparing the data for training**

In [6]:
# -----------------------------
# PREPARE BINARY DATA
# -----------------------------
binary_train = tokenized_train.remove_columns(["label"]).rename_column("binary_label", "labels")
binary_test  = tokenized_test.remove_columns(["label"]).rename_column("binary_label", "labels")

# Sanity check: print the distinct label values (not their counts) in each split
print("Binary label distribution (train):", set(int(x) for x in binary_train["labels"]))
print("Binary label distribution (test):",  set(int(x) for x in binary_test["labels"]))


# -----------------------------
# NOTE: Fine model data is prepared in a later cell, by filtering on the gold labels
# -----------------------------
print("Fine model data will be prepared in a later cell by filtering on the gold labels")
print("Final binary sizes:", len(binary_train), len(binary_test))


Binary label distribution (train): {0, 1}
Binary label distribution (test): {0, 1}
Fine model data will be prepared in a later cell by filtering on the gold labels
Final binary sizes: 3448 308


# **Defining the trainer and training**

In [None]:
os.environ["WANDB_DISABLED"] = "true"

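class WeightedTrainer(Trainer):
    """Trainer that applies per-class weights in the cross-entropy loss."""

    def __init__(self, *args, weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # 1-D float tensor with one weight per class, or None for no weighting
        self.weights = weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs["labels"]
        outputs = model(**inputs)
        logits = outputs.logits

        # Move the weights to the logits' device; None gives the unweighted loss
        weight = self.weights.to(logits.device) if self.weights is not None else None
        loss_fct = torch.nn.CrossEntropyLoss(weight=weight)
        loss = loss_fct(logits, labels)

        return (loss, outputs) if return_outputs else loss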

In [8]:
# -----------------------------
# PREPARE FINE DATA (ONLY LABELS 1 and 2)
# -----------------------------
# 1. Filter dataset to keep only Ambivalent (1) and Non-Reply (2)
fine_indices_train = [i for i, label in enumerate(tokenized_train["label"]) if label in [1, 2]]
fine_indices_test = [i for i, label in enumerate(tokenized_test["label"]) if label in [1, 2]]

train_fine_filtered = tokenized_train.select(fine_indices_train)
test_fine_filtered = tokenized_test.select(fine_indices_test)

def map_fine_labels(example):
    # Map: 1 (Ambivalent) -> 0
    #      2 (Non-Reply)  -> 1
    new_label = 0 if example["label"] == 1 else 1
    return {"labels": new_label}

# 2. Apply mapping and format
train_fine = train_fine_filtered.map(map_fine_labels).remove_columns(["label", "binary_label"])
test_fine = test_fine_filtered.map(map_fine_labels).remove_columns(["label", "binary_label"])

print(f"Fine model training data size: {len(train_fine)}")
print(f"Fine model test data size: {len(test_fine)}")
print("Labels mapped: 1->0 (Ambivalent), 2->1 (Non-Reply)")


Fine model training data size: 2396
Fine model test data size: 229
Labels mapped: 1->0 (Ambivalent), 2->1 (Non-Reply)
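
`WeightedTrainer` accepts per-class weights, but the training loop passes `torch.tensor([1.0, 1.0])`, and `compute_class_weight` is imported without being called. The sketch below shows how balanced weights could be derived from the split sizes printed in this notebook. It is an illustration under assumptions: the per-class counts are inferred by subtraction from the printed sizes (3448 training examples, 2396 in the fine subset, 356 Clear Non-Reply), and `balanced_weights` is a hypothetical helper that reproduces the "balanced" heuristic `n_samples / (n_classes * class_count)` documented for scikit-learn's `compute_class_weight`. Whether such weights improve the scores reported here has not been tested.

```python
# Sizes printed earlier in the notebook (training split)
total_train = 3448      # all training examples
fine_train = 2396       # Ambivalent + Clear Non-Reply
non_reply = 356         # Clear Non-Reply

# Per-class counts inferred by subtraction (assumption: no other labels exist)
ambivalent = fine_train - non_reply
clear_reply = total_train - fine_train


def balanced_weights(counts):
    """Balanced class weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return [n_samples / (n_classes * c) for c in counts]


# Stage 1: class 0 = Clear Reply, class 1 = everything else
binary_w = balanced_weights([clear_reply, fine_train])

# Stage 2: class 0 = Ambivalent, class 1 = Clear Non-Reply
fine_w = balanced_weights([ambivalent, non_reply])
```

The resulting lists could be passed as `weights=torch.tensor(binary_w, dtype=torch.float)` to `WeightedTrainer`. The second stage is the more imbalanced of the two, so its minority class receives the largest weight.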


In [9]:
# Everything else used below is already imported at the top of the notebook
from sklearn.metrics import accuracy_score

# Configuration
n_runs = 5
results = {
    "accuracy": [],
    "precision": [],
    "recall": [],
    "f1": []
}

# Gold three-class labels of the test split, used to score the combined predictions
truth = np.array(df["test"]["label"])

print(f"Starting evaluation of {n_runs} runs...")

for i in range(n_runs):
    current_seed = 42 + i  # Change seed for each run to observe variance
    print(f"\n--- Run {i+1}/{n_runs} (Seed: {current_seed}) ---")
    
    # 1. Re-initialize models
    binary_model_loop = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )
    # Fine model now has 2 labels (Ambivalent vs Non-Reply)
    fine_model_loop = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )
    
    # 2. Train Binary Model
    trainer_binary_loop = WeightedTrainer(
        weights=torch.tensor([1.0, 1.0]),  # uniform weights: same as unweighted cross-entropy
        model=binary_model_loop,
        train_dataset=binary_train,
        eval_dataset=binary_test,  # the test split also selects the best checkpoint
        compute_metrics=compute_metrics_binary,
        args=TrainingArguments(
            output_dir=f"binary_output_run_{i}",
            num_train_epochs=3,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            learning_rate=2e-5,
            weight_decay=0.01,
            eval_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="f1",
            fp16=True,
            report_to="none",
            overwrite_output_dir=True,
            seed=current_seed # IMPORTANT: Set varied seed
        ),
    )
    trainer_binary_loop.train()
    
    # 3. Train Fine Model
    # Using filtered/mapped train_fine dataset (labels 0/1)
    trainer_fine_loop = Trainer(
        model=fine_model_loop,
        train_dataset=train_fine,
        eval_dataset=test_fine, 
        compute_metrics=compute_metrics_binary, # Use binary metrics for fine model part too
        args=TrainingArguments(
            output_dir=f"fine_output_run_{i}",
            num_train_epochs=3,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            learning_rate=2e-5,
            weight_decay=0.01,
            eval_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="f1",
            fp16=True,
            report_to="none",
            overwrite_output_dir=True,
            seed=current_seed # IMPORTANT: Set varied seed
        ),
    )
    trainer_fine_loop.train()
    
    # 4. Evaluation Pipeline
    
    # Get predictions from Binary Model
    bin_out = trainer_binary_loop.predict(binary_test)
    bin_preds = np.argmax(bin_out.predictions, axis=-1)
    
    # Get predictions from Fine Model 
    # We predict on ALL test data, but only care about the output for samples where binary_model said "Others"
    test_for_fine = tokenized_test.remove_columns(["binary_label", "label"]) # Remove original labels to avoid conflicts
    fine_out = trainer_fine_loop.predict(test_for_fine)
    fine_preds_raw = np.argmax(fine_out.predictions, axis=-1)
    
    # Combine predictions
    final_preds = []
    for bp, fp in zip(bin_preds, fine_preds_raw):
        if bp == 0:
            final_preds.append(0) # Clear Reply
        else:
            # bp == 1 -> Others
            # Check fine model prediction (0=Ambivalent, 1=Non-Reply)
            if fp == 0:
                final_preds.append(1) # Ambivalent
            else:
                final_preds.append(2) # Non-Reply
    
    # Compute Metrics
    acc = accuracy_score(truth, final_preds)
    prec = precision_score(truth, final_preds, average='macro', zero_division=0)
    rec = recall_score(truth, final_preds, average='macro', zero_division=0)
    f1 = f1_score(truth, final_preds, average='macro', zero_division=0)
    
    results["accuracy"].append(acc)
    results["precision"].append(prec)
    results["recall"].append(rec)
    results["f1"].append(f1)
    
    print(f"Run {i+1} -> Acc: {acc:.4f}, Prec: {prec:.4f}, Rec: {rec:.4f}, F1: {f1:.4f}")

# 5. Report Statistics
print(f"\n===== AGGREGATED RESULTS ({n_runs} Runs) =====")
metrics_list = ["accuracy", "precision", "recall", "f1"]

for m in metrics_list:
    vals = results[m]
    print(f"{m.capitalize()}:")
    print(f"  Avg: {np.mean(vals):.4f}")
    print(f"  Min: {np.min(vals):.4f}")
    print(f"  Max: {np.max(vals):.4f}")
    print(f"  Std: {np.std(vals):.4f}")  # population std (NumPy default, ddof=0)
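
The combination step in the loop above can be read as a two-stage cascade: the binary model decides between Clear Reply and everything else, and the fine model's prediction is consulted only when the binary model says "everything else". A minimal standalone sketch of that rule follows; the function name `combine_cascade` is hypothetical and the inputs are toy values, not model outputs.

```python
def combine_cascade(bin_preds, fine_preds):
    """Merge stage-1 and stage-2 predictions into three-class labels.

    bin_preds:  0 = Clear Reply, 1 = other
    fine_preds: 0 = Ambivalent, 1 = Clear Non-Reply (ignored when bin_pred is 0)
    returns:    0 = Clear Reply, 1 = Ambivalent, 2 = Clear Non-Reply
    """
    final = []
    for bp, fp in zip(bin_preds, fine_preds):
        if bp == 0:
            final.append(0)                     # stage 1 wins outright
        else:
            final.append(1 if fp == 0 else 2)   # stage 2 refines "other"
    return final


# Every combination of the two stage outputs
toy_final = combine_cascade([0, 0, 1, 1], [0, 1, 0, 1])
```

One consequence of this design is that a stage-1 mistake cannot be repaired by stage 2: an example wrongly labelled Clear Reply never reaches the fine model's decision.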



Starting evaluation of 5 runs...

--- Run 1/5 (Seed: 42) ---


Binary model (Clear Reply vs. rest):

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.537475 | 0.791822 | 0.930131 | 0.855422 |
| 2 | No log | 0.52463 | 0.805344 | 0.921397 | 0.85947 |
| 3 | 0.551800 | 0.545502 | 0.819277 | 0.89083 | 0.853556 |

Fine model (Ambivalent vs. Clear Non-Reply):

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.286448 | 0.454545 | 0.217391 | 0.294118 |
| 2 | No log | 0.272229 | 0.375 | 0.521739 | 0.436364 |
| 3 | No log | 0.269911 | 0.482759 | 0.608696 | 0.538462 |


Run 1 -> Acc: 0.7045, Prec: 0.5934, Rec: 0.5913, F1: 0.5760

--- Run 2/5 (Seed: 43) ---


Binary model (Clear Reply vs. rest):

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.556213 | 0.751656 | 0.991266 | 0.854991 |
| 2 | No log | 0.544811 | 0.76431 | 0.991266 | 0.863118 |
| 3 | 0.568900 | 0.5506 | 0.805861 | 0.960699 | 0.876494 |

Fine model (Ambivalent vs. Clear Non-Reply):

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.278329 | 0.0 | 0.0 | 0.0 |
| 2 | No log | 0.262844 | 0.48 | 0.521739 | 0.5 |
| 3 | No log | 0.277029 | 0.48 | 0.521739 | 0.5 |


Run 2 -> Acc: 0.7208, Prec: 0.6207, Rec: 0.5685, F1: 0.5631

--- Run 3/5 (Seed: 44) ---


Binary model (Clear Reply vs. rest):

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.542592 | 0.75 | 0.995633 | 0.855535 |
| 2 | No log | 0.563768 | 0.795455 | 0.917031 | 0.851927 |
| 3 | 0.558600 | 0.546813 | 0.801498 | 0.934498 | 0.862903 |

Fine model (Ambivalent vs. Clear Non-Reply):

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.304237 | 0.555556 | 0.217391 | 0.3125 |
| 2 | No log | 0.248781 | 0.636364 | 0.304348 | 0.411765 |
| 3 | No log | 0.26421 | 0.526316 | 0.434783 | 0.47619 |


Run 3 -> Acc: 0.7143, Prec: 0.6110, Rec: 0.5395, F1: 0.5562

--- Run 4/5 (Seed: 45) ---


Binary model (Clear Reply vs. rest):

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.533477 | 0.782609 | 0.943231 | 0.855446 |
| 2 | No log | 0.531049 | 0.773519 | 0.969432 | 0.860465 |
| 3 | 0.555700 | 0.568469 | 0.791822 | 0.930131 | 0.855422 |

Fine model (Ambivalent vs. Clear Non-Reply):

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.283782 | 1.0 | 0.043478 | 0.083333 |
| 2 | No log | 0.234831 | 0.642857 | 0.391304 | 0.486486 |
| 3 | No log | 0.250322 | 0.521739 | 0.521739 | 0.521739 |


Run 4 -> Acc: 0.6948, Prec: 0.5964, Rec: 0.5243, F1: 0.5081

--- Run 5/5 (Seed: 46) ---


Binary model (Clear Reply vs. rest):

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.550644 | 0.75 | 0.995633 | 0.855535 |
| 2 | No log | 0.533136 | 0.791822 | 0.930131 | 0.855422 |
| 3 | 0.564100 | 0.55053 | 0.8 | 0.925764 | 0.8583 |

Fine model (Ambivalent vs. Clear Non-Reply):

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.260506 | 1.0 | 0.173913 | 0.296296 |
| 2 | No log | 0.225757 | 0.6 | 0.521739 | 0.55814 |
| 3 | No log | 0.248708 | 0.461538 | 0.521739 | 0.489796 |


Run 5 -> Acc: 0.7110, Prec: 0.6248, Rec: 0.5636, F1: 0.5784

===== AGGREGATED RESULTS (5 Runs) =====
Accuracy:
  Avg: 0.7091
  Min: 0.6948
  Max: 0.7208
  Std: 0.0089
Precision:
  Avg: 0.6093
  Min: 0.5934
  Max: 0.6248
  Std: 0.0126
Recall:
  Avg: 0.5574
  Min: 0.5243
  Max: 0.5913
  Std: 0.0234
F1:
  Avg: 0.5564
  Min: 0.5081
  Max: 0.5784
  Std: 0.0255
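
The `Std` values above come from `np.std`, which by default computes the population standard deviation (`ddof=0`). With only five runs, the sample standard deviation (`ddof=1`) is noticeably larger. The sketch below recomputes both from the per-run macro F1 scores printed above, using only the standard library; because the inputs are the rounded four-decimal values from the log, the last digit may differ slightly from a computation on the unrounded scores.

```python
import statistics

# Per-run macro F1, copied from the "Run i -> ..." lines above
f1_runs = [0.5760, 0.5631, 0.5562, 0.5081, 0.5784]

mean_f1 = statistics.fmean(f1_runs)
pop_std = statistics.pstdev(f1_runs)     # divides by n, like np.std(vals)
sample_std = statistics.stdev(f1_runs)   # divides by n - 1, like np.std(vals, ddof=1)

print(f"mean={mean_f1:.4f}  population std={pop_std:.4f}  sample std={sample_std:.4f}")
```

Which estimator to report is a judgment call; stating the choice next to the number makes the results easier to compare with other runs.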
