# BERT & RoBERTa: Zero-Shot, Few-Shot Classification

This notebook runs the same 4 evaluation conditions as the Claude and Groq notebooks,
adapted for **encoder-only models** (BERT, RoBERTa).

## How each condition is implemented

| Condition | Implementation |
|---|---|
| **Zero-Shot** | NLI (Natural Language Inference) pipeline — the task definition is the hypothesis |
| **Zero-Shot CoT** | Richer, more descriptive NLI hypothesis (encoders cannot reason step-by-step) |
| **Few-Shot** | Fine-tune the model on the 20 labelled examples, then evaluate |
| **Few-Shot CoT** | Same fine-tuning but with augmented training text (closest equivalent) |

### Models used
- **Zero-Shot**: `cross-encoder/nli-roberta-base` and `cross-encoder/nli-MiniLM2-L6-H768` (BERT-family)
- **Few-Shot**: `roberta-base` and `bert-base-uncased` (fine-tuned on 20 examples)

> **Runtime**: Set Colab to **GPU** (Runtime → Change runtime type → T4 GPU) before running.

In [1]:
!pip install transformers datasets scikit-learn torch -q

In [2]:
import os
import time
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import torch
from tqdm import tqdm

from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import Dataset
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report
)

# Device
DEVICE = 0 if torch.cuda.is_available() else -1
DEVICE_NAME = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {DEVICE_NAME}")

# Column names
TEXT_COLUMN  = 'text'
LABEL_COLUMN = 'label'

Using device: cuda


In [3]:
# ── LOAD DATA ─────────────────────────────────────────────────────────────────
# Same files generated in your original data-prep notebook
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv("final_merged_dataset.csv")

print(f"Original dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['label'].value_counts())




Original dataset shape: (309, 4)
Columns: ['id', 'text', 'similarity_score', 'label']

Label distribution:
label
0    173
1    136
Name: count, dtype: int64


In [4]:
# Step 2: Create a copy and save the full labeled dataset as backup
print("\nSaving full labeled dataset as backup...")
df_backup = df.copy()
df_backup.to_csv('labeled_data_backup.csv', index=False)


Saving full labeled dataset as backup...


In [5]:
# Step 3: Extract 20 examples where label = 1 for few-shot learning
print("\nExtracting 40 examples where label = 1 for few-shot prompts")
burnout_examples = df[df['label'] == 1].sample(n=40, random_state=42)
fewshot_examples = burnout_examples.copy()


Extracting 40 examples where label = 1 for few-shot prompts


In [6]:
# Save few-shot examples (with labels)
fewshot_examples.to_csv('fewshot_examples.csv', index=False)
print(f"Few-shot examples saved: {fewshot_examples.shape}")

Few-shot examples saved: (40, 4)


In [7]:
# Step 4: Create test set (remaining data after removing few-shot examples)
# Get indices of few-shot examples
fewshot_indices = burnout_examples.index

# Remove few-shot examples from the dataset to create test set
test_data = df.drop(fewshot_indices)

print(f"Test set shape: {test_data.shape}")
print(f"Test set label distribution:")
print(test_data['label'].value_counts())

Test set shape: (269, 4)
Test set label distribution:
label
0    173
1     96
Name: count, dtype: int64


In [8]:
# Step 5: Save test set WITH labels (for validation later)
test_data.to_csv('test_data_with_labels.csv', index=False)
print("\nTest data with labels saved: test_data_with_labels.csv")


Test data with labels saved: test_data_with_labels.csv


In [9]:
# Step 6: Create and save test set WITHOUT labels (for actual inference)
test_data_unlabeled = test_data.drop(columns=['label'])
test_data_unlabeled.to_csv('test_data_unlabeled.csv', index=False)
print("Test data without labels saved: test_data_unlabeled.csv")

Test data without labels saved: test_data_unlabeled.csv


---
## Part 1 — Zero-Shot & Zero-Shot CoT (NLI Pipeline)

We use **Natural Language Inference** to do zero-shot classification.
The model checks whether the text *entails* a hypothesis that describes the burnout label.

- **Zero-Shot**: concise hypothesis
- **Zero-Shot CoT**: more detailed hypothesis that walks through the reasoning criteria
  (the closest possible equivalent for an encoder model)

In [10]:
# ── NLI HYPOTHESES ────────────────────────────────────────────────────────────
# These replace the text prompts used for LLMs.
# The NLI pipeline will output a score for each candidate label.

# Zero-Shot: concise label descriptions
CANDIDATE_LABELS_ZS = [
    "The author is experiencing work-related burnout or stress in cybersecurity.",
    "The text is not about the author's own work-related burnout or stress."
]

# Zero-Shot CoT: richer hypothesis that mirrors the prompt wording
# (more context for the NLI model to reason about)
CANDIDATE_LABELS_ZS_COT = [
    (
        "The author personally discusses burnout or work-related stress in the past or present, "
        "related to their career, job responsibilities, workplace environment, "
        "or professional life in cybersecurity."
    ),
    (
        "The text does not describe the author's own work-related burnout or stress. "
        "It may discuss hypothetical situations, future concerns, or topics unrelated "
        "to the author's personal mental health at work."
    )
]

print("NLI hypotheses defined.")

NLI hypotheses defined.


In [11]:
# ── NLI MODEL NAMES ───────────────────────────────────────────────────────────
# cross-encoder/nli-roberta-base  → RoBERTa family
# cross-encoder/nli-MiniLM2-L6-H768 → BERT family (MiniLM is BERT-based)

NLI_MODELS = {
    'RoBERTa': 'cross-encoder/nli-roberta-base',
    'BERT':    'cross-encoder/nli-MiniLM2-L6-H768',
}

print("NLI models:", NLI_MODELS)

NLI models: {'RoBERTa': 'cross-encoder/nli-roberta-base', 'BERT': 'cross-encoder/nli-MiniLM2-L6-H768'}


In [25]:
def truncate_text(text, max_chars=512):
    """Keep texts short enough for encoder models (512 token limit)
    Note: Text values are usually small enough, so this only affects the few longest ones.
    """
    return str(text)[:max_chars]


def run_nli_classification(test_df, classifier, candidate_labels, approach_name, model_name):
    """
    Run zero-shot NLI classification.
    The first candidate label is treated as 'positive' (label=1).
    Whichever label scores higher wins.
    """
    results = []
    print(f"\n{'='*60}")
    print(f"Running : {approach_name}")
    print(f"Model   : {model_name}")
    print(f"{'='*60}")

    for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Classifying"):
        text = truncate_text(row[TEXT_COLUMN])

        try:
            output = classifier(
                text,
                candidate_labels,
                multi_label=False
            )
            # output['labels'][0] is the highest-scoring label
            top_label = output['labels'][0]
            # First candidate = burnout = label 1
            predicted = 1 if top_label == candidate_labels[0] else 0
            raw_output = f"Top: '{top_label}' | Scores: {dict(zip(output['labels'], [round(s,3) for s in output['scores']]))}"

        except Exception as e:
            predicted = 0
            raw_output = f"ERROR: {str(e)}"

        results.append({
            'id':              row['id'], # Store the actual 'id' for merging
            'text':            row[TEXT_COLUMN],
            'raw_output':      raw_output,
            'predicted_label': predicted,
            'approach':        approach_name,
            'model':           model_name
        })

    return pd.DataFrame(results)

In [15]:
# ── RUN ZERO-SHOT + ZERO-SHOT CoT FOR BOTH MODELS ─────────────────────────────
# all_results = []  <-- Moved to a separate initialization cell

for model_short_name, model_id in NLI_MODELS.items():
    print(f"\nLoading NLI pipeline: {model_id}")
    classifier = pipeline(
        'zero-shot-classification',
        model=model_id,
        device=DEVICE
    )

    # Zero-Shot
    approach = f"Zero-Shot_{model_short_name}"
    df = run_nli_classification(test_data, classifier, CANDIDATE_LABELS_ZS, approach, model_id)
    df.to_csv(f'results_{approach}.csv', index=False)
    all_results.append(df)
    print(f"Saved: results_{approach}.csv")

    # Zero-Shot CoT
    approach = f"Zero-Shot-CoT_{model_short_name}"
    df = run_nli_classification(test_data, classifier, CANDIDATE_LABELS_ZS_COT, approach, model_id)
    df.to_csv(f'results_{approach}.csv', index=False)
    all_results.append(df)
    print(f"Saved: results_{approach}.csv")

    # Free memory before loading next model
    del classifier
    torch.cuda.empty_cache() if torch.cuda.is_available() else None

print("\nZero-shot experiments complete.")


Loading NLI pipeline: cross-encoder/nli-roberta-base


Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

RobertaForSequenceClassification LOAD REPORT from: cross-encoder/nli-roberta-base
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



Running : Zero-Shot_RoBERTa
Model   : cross-encoder/nli-roberta-base


Classifying:   4%|▎         | 10/269 [00:01<00:29,  8.91it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Classifying: 100%|██████████| 269/269 [00:20<00:00, 13.41it/s]


Saved: results_Zero-Shot_RoBERTa.csv

Running : Zero-Shot-CoT_RoBERTa
Model   : cross-encoder/nli-roberta-base


Classifying: 100%|██████████| 269/269 [00:11<00:00, 23.39it/s]


Saved: results_Zero-Shot-CoT_RoBERTa.csv

Loading NLI pipeline: cross-encoder/nli-MiniLM2-L6-H768


config.json:   0%|          | 0.00/875 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/328M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/105 [00:00<?, ?it/s]

RobertaForSequenceClassification LOAD REPORT from: cross-encoder/nli-MiniLM2-L6-H768
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


tokenizer_config.json:   0%|          | 0.00/330 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]


Running : Zero-Shot_BERT
Model   : cross-encoder/nli-MiniLM2-L6-H768


Classifying: 100%|██████████| 269/269 [00:03<00:00, 80.95it/s]


Saved: results_Zero-Shot_BERT.csv

Running : Zero-Shot-CoT_BERT
Model   : cross-encoder/nli-MiniLM2-L6-H768


Classifying: 100%|██████████| 269/269 [00:03<00:00, 71.09it/s]

Saved: results_Zero-Shot-CoT_BERT.csv

Zero-shot experiments complete.





---
## Part 2 — Few-Shot & Few-Shot CoT (Fine-Tuning)

Encoder models learn from examples through **fine-tuning**, not in-context prompting.

- **Few-Shot**: fine-tune on the 20 labelled examples with the text as-is
- **Few-Shot CoT**: fine-tune on the same examples but with the task description
  prepended to each text — giving the model more signal about *what to look for*

In [16]:
# ── FEW-SHOT FINE-TUNING CONFIG ───────────────────────────────────────────────
FINETUNE_MODELS = {
    'RoBERTa': 'roberta-base',
    'BERT':    'bert-base-uncased',
}

# Training hyperparameters — tuned for very small datasets (20 examples)
FINETUNING_ARGS = dict(
    num_train_epochs        = 10,    # more epochs needed with tiny data
    per_device_train_batch_size = 4,
    learning_rate           = 2e-5,
    weight_decay            = 0.01,
    warmup_ratio            = 0.1,
    save_strategy           = 'no',
    logging_steps           = 5,
    seed                    = 42,
)

# CoT prefix: prepended to training text to add task context
COT_PREFIX = (
    "Classify whether the author personally experiences work-related burnout "
    "or stress in cybersecurity (past or present, not hypothetical). Text: "
)

print("Few-shot config ready.")

Few-shot config ready.


In [26]:
def tokenize_dataset(examples, tokenizer, max_length=256):
    return tokenizer(
        examples[TEXT_COLUMN],
        truncation=True,
        padding='max_length',
        max_length=max_length
    )


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1':       f1_score(labels, predictions, zero_division=0)
    }


def finetune_and_predict(train_df, test_df, model_id, approach_name, use_cot_prefix=False):
    """
    Fine-tune a BERT/RoBERTa model on train_df, then predict on test_df.
    use_cot_prefix=True prepends a task description to each training example
    (Few-Shot CoT equivalent for encoder models).
    """
    print(f"\n{'='*60}")
    print(f"Running : {approach_name}")
    print(f"Model   : {model_id}")
    print(f"{'='*60}")

    # ── Prepare training data ─────────────────────────────────────────────────
    train_texts = train_df[TEXT_COLUMN].astype(str).tolist()
    if use_cot_prefix:
        train_texts = [COT_PREFIX + t for t in train_texts]

    train_dataset = Dataset.from_dict({
        TEXT_COLUMN: train_texts,
        'labels':    train_df[LABEL_COLUMN].tolist()
    })

    # ── Tokenizer ─────────────────────────────────────────────────────────────
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    train_dataset = train_dataset.map(
        lambda x: tokenize_dataset(x, tokenizer),
        batched=True
    )
    train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

    # ── Model ─────────────────────────────────────────────────────────────────
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=2
    )

    # ── Training ──────────────────────────────────────────────────────────────
    output_dir = f'./ft_{approach_name.replace("/", "_")}'
    training_args = TrainingArguments(
        output_dir=output_dir,
        **FINETUNING_ARGS
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        compute_metrics=compute_metrics,
    )

    print(f"Fine-tuning on {len(train_dataset)} examples...")
    trainer.train()

    # ── Predict on test set ───────────────────────────────────────────────────
    print("Predicting on test set...")
    test_texts = test_df[TEXT_COLUMN].astype(str).tolist()
    if use_cot_prefix:
        test_texts = [COT_PREFIX + t for t in test_texts]

    test_dataset = Dataset.from_dict({TEXT_COLUMN: test_texts})
    test_dataset = test_dataset.map(
        lambda x: tokenize_dataset(x, tokenizer),
        batched=True
    )
    test_dataset.set_format('torch', columns=['input_ids', 'attention_mask'])

    raw_preds = trainer.predict(test_dataset)
    predicted_labels = np.argmax(raw_preds.predictions, axis=-1).tolist()

    # ── Build results DataFrame ───────────────────────────────────────────────
    results = []
    for i, (idx, row) in enumerate(test_df.iterrows()):
        results.append({
            'id':              row['id'], # Store the actual 'id' for merging
            'text':            row[TEXT_COLUMN],
            'raw_output':      str(predicted_labels[i]),
            'predicted_label': predicted_labels[i],
            'approach':        approach_name,
            'model':           model_id
        })

    # Free GPU memory
    del model, trainer
    torch.cuda.empty_cache() if torch.cuda.is_available() else None

    return pd.DataFrame(results)

In [19]:
# ── RUN FEW-SHOT + FEW-SHOT CoT FOR BOTH MODELS ───────────────────────────────
for model_short_name, model_id in FINETUNE_MODELS.items():

    # Few-Shot (plain fine-tuning)
    approach = f"Few-Shot_{model_short_name}"
    df = finetune_and_predict(
        fewshot_examples, test_data,
        model_id=model_id,
        approach_name=approach,
        use_cot_prefix=False
    )
    df.to_csv(f'results_{approach}.csv', index=False)
    all_results.append(df)
    print(f"Saved: results_{approach}.csv")

    # Few-Shot CoT (fine-tuning with task-description prefix)
    approach = f"Few-Shot-CoT_{model_short_name}"
    df = finetune_and_predict(
        fewshot_examples, test_data,
        model_id=model_id,
        approach_name=approach,
        use_cot_prefix=True
    )
    df.to_csv(f'results_{approach}.csv', index=False)
    all_results.append(df)
    print(f"Saved: results_{approach}.csv")

print("\nFew-shot experiments complete.")


Running : Few-Shot_RoBERTa
Model   : roberta-base


config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/197 [00:00<?, ?it/s]

RobertaForSequenceClassification LOAD REPORT from: roberta-base
Key                             | Status     | 
--------------------------------+------------+-
lm_head.dense.weight            | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
classifier.out_proj.bias        | MISSING    | 
classifier.dense.weight         | MISSING    | 
classifier.dense.bias           | MISSING    | 
classifier.out_proj.weight      | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


Fine-tuning on 40 examples...


Step,Training Loss
5,0.671834
10,0.604039
15,0.366582
20,0.075781
25,0.015215
30,0.004888
35,0.002216
40,0.001212
45,0.000965
50,0.000825


Predicting on test set...


Map:   0%|          | 0/269 [00:00<?, ? examples/s]

Saved: results_Few-Shot_RoBERTa.csv

Running : Few-Shot-CoT_RoBERTa
Model   : roberta-base


Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Loading weights:   0%|          | 0/197 [00:00<?, ?it/s]

RobertaForSequenceClassification LOAD REPORT from: roberta-base
Key                             | Status     | 
--------------------------------+------------+-
lm_head.dense.weight            | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
classifier.out_proj.bias        | MISSING    | 
classifier.dense.weight         | MISSING    | 
classifier.dense.bias           | MISSING    | 
classifier.out_proj.weight      | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


Fine-tuning on 40 examples...


Step,Training Loss
5,0.715654
10,0.594682
15,0.330407
20,0.058734
25,0.017524
30,0.007374
35,0.002591
40,0.001899
45,0.001083
50,0.001034


Predicting on test set...


Map:   0%|          | 0/269 [00:00<?, ? examples/s]

Saved: results_Few-Shot-CoT_RoBERTa.csv

Running : Few-Shot_BERT
Model   : bert-base-uncased


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
warmup_ratio is deprecated and will b

Fine-tuning on 40 examples...


Step,Training Loss
5,0.503379
10,0.373966
15,0.216156
20,0.126808
25,0.065819
30,0.036579
35,0.020054
40,0.01222
45,0.008529
50,0.006781


Predicting on test set...


Map:   0%|          | 0/269 [00:00<?, ? examples/s]

Saved: results_Few-Shot_BERT.csv

Running : Few-Shot-CoT_BERT
Model   : bert-base-uncased


Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
warmup_ratio is deprecated and will b

Fine-tuning on 40 examples...


Step,Training Loss
5,0.511815
10,0.349293
15,0.211866
20,0.114385
25,0.055642
30,0.028984
35,0.015593
40,0.010425
45,0.007503
50,0.005565


Predicting on test set...


Map:   0%|          | 0/269 [00:00<?, ? examples/s]

Saved: results_Few-Shot-CoT_BERT.csv

Few-shot experiments complete.


In [33]:
# ── SAVE COMBINED RESULTS ─────────────────────────────────────────────────────
combined_bert_roberta = pd.concat(all_results, ignore_index=True)
combined_bert_roberta.to_csv('all_results_bert_roberta_combined.csv', index=False)
print(f"Combined results saved: {len(combined_bert_roberta)} rows")
print(f"Approaches: {combined_bert_roberta['approach'].unique().tolist()}")

Combined results saved: 2152 rows
Approaches: ['Zero-Shot_RoBERTa', 'Zero-Shot-CoT_RoBERTa', 'Zero-Shot_BERT', 'Zero-Shot-CoT_BERT', 'Few-Shot_RoBERTa', 'Few-Shot-CoT_RoBERTa', 'Few-Shot_BERT', 'Few-Shot-CoT_BERT']


---
## Part 3 — download the files

In [28]:
# ── DOWNLOAD ALL OUTPUT FILES ─────────────────────────────────────────────────
from google.colab import files

files_to_download = [
    'results_Zero-Shot_RoBERTa.csv',
    'results_Zero-Shot-CoT_RoBERTa.csv',
    'results_Few-Shot_RoBERTa.csv',
    'results_Few-Shot-CoT_RoBERTa.csv',
    'results_Zero-Shot_BERT.csv',
    'results_Zero-Shot-CoT_BERT.csv',
    'results_Few-Shot_BERT.csv',
    'results_Few-Shot-CoT_BERT.csv',
    'all_results_bert_roberta_combined.csv',
    'bert_roberta_evaluation_summary.csv',
]

for fname in files_to_download:
    if os.path.exists(fname):
        print(f"Downloading: {fname}")
        files.download(fname)
    else:
        print(f"Skipped (not found): {fname}")

Downloading: results_Zero-Shot_RoBERTa.csv


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Downloading: results_Zero-Shot-CoT_RoBERTa.csv


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Downloading: results_Few-Shot_RoBERTa.csv


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Downloading: results_Few-Shot-CoT_RoBERTa.csv


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Downloading: results_Zero-Shot_BERT.csv


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Downloading: results_Zero-Shot-CoT_BERT.csv


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Downloading: results_Few-Shot_BERT.csv


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Downloading: results_Few-Shot-CoT_BERT.csv


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Downloading: all_results_bert_roberta_combined.csv


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Skipped (not found): bert_roberta_evaluation_summary.csv


## part 4: Evaluation

In [40]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# ── LOAD FILES ────────────────────────────────────────────────────────────
combined   = pd.read_csv('all_results_bert_roberta_combined.csv')
test_data  = pd.read_csv('test_data_with_labels.csv')   # has true labels

print("combined columns :", combined.columns.tolist())
print("test_data columns:", test_data.columns.tolist())
print(f"\ncombined 'index' sample : {combined['index'].head().tolist()}")
print(f"test_data 'id'   sample : {test_data['id'].head().tolist()}")

# ── ROOT CAUSE FIX ────────────────────────────────────────────────────────
# combined['index'] = pandas row position in test_df (0, 1, 2, ...)
# test_data has its own 'id' column (original dataset IDs)
#
# Solution: use the ROW POSITION of test_data to align, not the 'id' column.
# We reset test_data's index so position 0,1,2... matches combined['index'].

test_data_reset = test_data.reset_index(drop=True)   # position = 0,1,2,...

ground_truth = test_data_reset[[LABEL_COLUMN]].rename(columns={LABEL_COLUMN: 'true_label'})
ground_truth['index'] = ground_truth.index            # add position column to merge on

combined = combined.merge(ground_truth, on='index', how='left')

# ── SANITY CHECK ──────────────────────────────────────────────────────────
print(f"\nRows after merge : {len(combined)}")
print(f"NaN in true_label: {combined['true_label'].isna().sum()}")

# If NaN count is still > 0, print unmatched indices to diagnose further
if combined['true_label'].isna().sum() > 0:
    unmatched = combined[combined['true_label'].isna()]['index'].unique()
    print(f"Unmatched index values: {unmatched[:10]}")
    print(f"Max index in combined : {combined['index'].max()}")
    print(f"Max rows in test_data : {len(test_data_reset)}")

# Drop any rows that still couldn't be matched before computing metrics
combined = combined.dropna(subset=['true_label'])
combined['true_label']     = combined['true_label'].astype(int)
combined['predicted_label'] = combined['predicted_label'].astype(int)

# ── METRICS ───────────────────────────────────────────────────────────────
summary_rows = []
for approach, group in combined.groupby('approach'):
    y_true = group['true_label']
    y_pred = group['predicted_label']
    summary_rows.append({
        'Approach':  approach,
        'Model':     group['model'].iloc[0],
        'Accuracy':  round(accuracy_score(y_true, y_pred), 4),
        'Precision': round(precision_score(y_true, y_pred, zero_division=0), 4),
        'Recall':    round(recall_score(y_true, y_pred, zero_division=0), 4),
        'F1':        round(f1_score(y_true, y_pred, zero_division=0), 4),
        'N':         len(group)
    })

summary_df = pd.DataFrame(summary_rows).sort_values('F1', ascending=False).reset_index(drop=True)
print("\n", summary_df.to_string(index=False))

summary_df.to_csv('bert_roberta_evaluation_summary.csv', index=False)
print("\nSaved: bert_roberta_evaluation_summary.csv")

combined columns : ['index', 'text', 'raw_output', 'predicted_label', 'approach', 'model']
test_data columns: ['id', 'text', 'similarity_score', 'label']

combined 'index' sample : [1, 2, 3, 4, 5]
test_data 'id'   sample : ['1dqiog2', '1g49xt4', '1fqxn7a', '1fbdhwo', '1dgwtln']

Rows after merge : 2152
NaN in true_label: 264
Unmatched index values: [269 270 271 272 273 275 276 277 278 279]
Max index in combined : 308
Max rows in test_data : 269

              Approach                             Model  Accuracy  Precision  Recall     F1   N
    Few-Shot-CoT_BERT                 bert-base-uncased    0.3475     0.3475  1.0000 0.5157 236
 Few-Shot-CoT_RoBERTa                      roberta-base    0.3475     0.3475  1.0000 0.5157 236
        Few-Shot_BERT                 bert-base-uncased    0.3475     0.3475  1.0000 0.5157 236
     Few-Shot_RoBERTa                      roberta-base    0.3475     0.3475  1.0000 0.5157 236
    Zero-Shot_RoBERTa    cross-encoder/nli-roberta-base    0.4110    