# QEvasion – Transformer Fine-tuning (Clarity & Evasion)

In this notebook we fine-tune a pretrained transformer encoder on the QEvasion dataset
for the two main tasks:

- **Task 1 – Clarity-level classification (3-way)**  
  Labels: `clarity_label` → `clarity_id`

- **Task 2 – Evasion-level classification (9-way)**  
  Labels: `evasion_label` → `evasion_id` (on the train split)  
  + special **test evaluation** using annotators (`annotator1/2/3`).

We:
1. Load and preprocess the data.
2. Create train/validation/test splits.
3. Tokenize question–answer pairs with a pretrained tokenizer.
4. Fine-tune a transformer with a **manual PyTorch loop** (no `Trainer`).
5. Evaluate Task 1 on the official test split.
6. Train and evaluate Task 2, including test evaluation using annotators.


In [None]:
#!pip install -q "transformers==4.45.2" "datasets>=2.19" sentencepiece safetensors

## 1. Imports & device

We import:
- `datasets.load_dataset` to fetch QEvasion.
- `transformers` for tokenizer and model.
- `sklearn` for splits and metrics.
- `torch` for manual training.

We also detect whether a GPU is available.


In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)


Device: cuda


## 2. Load QEvasion & build the input text

We:
- Load the `ailsntua/QEvasion` dataset.
- Convert `train` and `test` splits to pandas.
- Build a single `text` column per example:

> `"Question: <question> [SEP] Answer: <answer>"`

This is what the model will see.


In [None]:
dataset = load_dataset("ailsntua/QEvasion")

train_df = dataset["train"].to_pandas()
test_df  = dataset["test"].to_pandas()

def build_text_column(df):
    df = df.copy()
    q = df["interview_question"].fillna("")
    a = df["interview_answer"].fillna("")
    df["text"] = "Question: " + q + " [SEP] Answer: " + a
    return df

train_df = build_text_column(train_df)
test_df  = build_text_column(test_df)

print("Train shape:", train_df.shape)
print("Test  shape:", test_df.shape)
train_df[["text", "clarity_label", "evasion_label"]].head()


Train shape: (3448, 21)
Test  shape: (308, 21)


Unnamed: 0,text,clarity_label,evasion_label
0,Question: Q. Of the Biden administration. And ...,Clear Reply,Explicit
1,Question: Q. Of the Biden administration. And ...,Ambivalent,General
2,Question: Q. No worries. Do you believe the co...,Ambivalent,Partial/half-answer
3,Question: Q. No worries. Do you believe the co...,Ambivalent,Dodging
4,"Question: Q. I can imagine. It is evening, I'd...",Clear Reply,Explicit


## 3. Encode labels & create splits

We create integer labels:

- **Clarity (Task 1)**: `clarity_label` → `clarity_id` in `{0,1,2}`.
- **Evasion (Task 2)**: `evasion_label` → `evasion_id` in `{0,…,8}` for train,
  and `-1` when no evasion label is given.

Then:

- For clarity:
  - `clar_train_df` / `clar_val_df` come from the train split (90/10, stratified).
  - `clar_test_df` is the official test split.
- For evasion:
  - we keep only rows with `evasion_id != -1`,
  - then split them into `ev_train_df` / `ev_val_df` (90/10, stratified).


In [None]:
# ----- Clarity labels (Task 1) -----
clarity_labels = sorted(
    list(set(train_df["clarity_label"].dropna().unique()) |
         set(test_df["clarity_label"].dropna().unique()))
)
clarity2id = {lbl: i for i, lbl in enumerate(clarity_labels)}
id2clarity = {i: lbl for lbl, i in clarity2id.items()}

train_df["clarity_id"] = train_df["clarity_label"].map(clarity2id)
test_df["clarity_id"]  = test_df["clarity_label"].map(clarity2id)

print("Clarity mapping:", clarity2id)

# ----- Evasion labels (Task 2 – train only) -----
evasion_labels = sorted(train_df["evasion_label"].dropna().unique())
evasion2id = {lbl: i for i, lbl in enumerate(evasion_labels)}
id2evasion = {i: lbl for lbl, i in evasion2id.items()}

mask_evasion_valid = train_df["evasion_label"].notna() & (train_df["evasion_label"] != "")
train_df["evasion_id"] = np.where(
    mask_evasion_valid,
    train_df["evasion_label"].map(evasion2id),
    -1,
)

print("Evasion mapping:", evasion2id)
train_df[["clarity_label", "clarity_id", "evasion_label", "evasion_id"]].head()


Clarity mapping: {'Ambivalent': 0, 'Clear Non-Reply': 1, 'Clear Reply': 2}
Evasion mapping: {'Claims ignorance': 0, 'Clarification': 1, 'Declining to answer': 2, 'Deflection': 3, 'Dodging': 4, 'Explicit': 5, 'General': 6, 'Implicit': 7, 'Partial/half-answer': 8}


Unnamed: 0,clarity_label,clarity_id,evasion_label,evasion_id
0,Clear Reply,2,Explicit,5
1,Ambivalent,0,General,6
2,Ambivalent,0,Partial/half-answer,8
3,Ambivalent,0,Dodging,4
4,Clear Reply,2,Explicit,5


In [None]:
# ---- Clarity: train / val / test ----
clar_full_train_df = train_df.copy()
y_clar_full = clar_full_train_df["clarity_id"].values

clar_train_idx, clar_val_idx = train_test_split(
    np.arange(len(clar_full_train_df)),
    test_size=0.1,
    stratify=y_clar_full,
    random_state=42,
)

clar_train_df = clar_full_train_df.iloc[clar_train_idx].reset_index(drop=True)
clar_val_df   = clar_full_train_df.iloc[clar_val_idx].reset_index(drop=True)
clar_test_df  = test_df.copy()

print("Clarity shapes (train, val, test):",
      clar_train_df.shape, clar_val_df.shape, clar_test_df.shape)

# ---- Evasion: train / val (only rows with evasion_id != -1) ----
evasion_train_df = train_df[train_df["evasion_id"] != -1].reset_index(drop=True)
y_eva_full = evasion_train_df["evasion_id"].values

ev_train_idx, ev_val_idx = train_test_split(
    np.arange(len(evasion_train_df)),
    test_size=0.1,
    stratify=y_eva_full,
    random_state=42,
)

ev_train_df = evasion_train_df.iloc[ev_train_idx].reset_index(drop=True)
ev_val_df   = evasion_train_df.iloc[ev_val_idx].reset_index(drop=True)

print("Evasion shapes (train, val):",
      ev_train_df.shape, ev_val_df.shape)


Clarity shapes (train, val, test): (3103, 23) (345, 23) (308, 22)
Evasion shapes (train, val): (3103, 23) (345, 23)


## 4. Tokenizer & PyTorch datasets

We choose a pretrained encoder (`roberta-base` or `distilroberta-base`), initialize
its tokenizer, and define a helper to tokenize batches of texts.

We then build simple PyTorch `Dataset` objects and `DataLoader`s for:

- Clarity: train / val / test
- Evasion: train / val


In [None]:
model_name = "roberta-base"  # you can switch to "distilroberta-base" for faster runs
tokenizer = AutoTokenizer.from_pretrained(model_name)

max_length = 256  # can increase to 256 if your QA pairs are long

def tokenize_texts(texts, labels):
    encodings = tokenizer(
        list(texts),
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
    encodings["labels"] = list(labels)
    return encodings


class TorchTextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}


In [None]:
# ---- Clarity tokenization ----
clar_train_enc = tokenize_texts(clar_train_df["text"], clar_train_df["clarity_id"])
clar_val_enc   = tokenize_texts(clar_val_df["text"],   clar_val_df["clarity_id"])
clar_test_enc  = tokenize_texts(clar_test_df["text"],  clar_test_df["clarity_id"])

clar_train_dataset = TorchTextDataset(clar_train_enc)
clar_val_dataset   = TorchTextDataset(clar_val_enc)
clar_test_dataset  = TorchTextDataset(clar_test_enc)

batch_size = 8

clar_train_loader = torch.utils.data.DataLoader(clar_train_dataset, batch_size=batch_size, shuffle=True)
clar_val_loader   = torch.utils.data.DataLoader(clar_val_dataset,   batch_size=batch_size)
clar_test_loader  = torch.utils.data.DataLoader(clar_test_dataset,  batch_size=batch_size)

# ---- Evasion tokenization ----
ev_train_enc = tokenize_texts(ev_train_df["text"], ev_train_df["evasion_id"])
ev_val_enc   = tokenize_texts(ev_val_df["text"],   ev_val_df["evasion_id"])

ev_train_dataset = TorchTextDataset(ev_train_enc)
ev_val_dataset   = TorchTextDataset(ev_val_enc)

ev_train_loader = torch.utils.data.DataLoader(ev_train_dataset, batch_size=batch_size, shuffle=True)
ev_val_loader   = torch.utils.data.DataLoader(ev_val_dataset,   batch_size=batch_size)

clar_train_dataset[0], ev_train_dataset[0]


({'input_ids': tensor([    0, 45641,    35,  1209,     4,  1534,    89,    10,  1989,  4258,
             14,    47,    74,   356,    13,    31,    42,  2557,  1067,     7,
           1679,   549,    47,   206,   383,    32,   164,   157,   116,    20,
            270,     4,  2647,     6,    38,   206,     5,  3527,    74,    28,
           1291,     4,   370,  1017,   386,    23,   513,    10,  6054,     4,
           3047,     6,    47,   216,     6,    25,    10,   432,   621,     6,
             38,   348,   626,   182,   157,    19,  2656,     4,   653,    47,
            236,     7,   109,    16,   386,    14,     4,   978,     6,    38,
           1017,   101,     7, 11829,    55,    87,    14,     4,   125,    23,
             10,  3527,     6,    38,   109,   679,     6,    23,   513,     6,
             52,   581,    33,  1145,   349,    97,     4,   166,    40,    33,
            450,   349,    97,     4, 13088,     6,    52,    40,    33,  6640,
            349,    97,    

## 5. Training utilities (loss, train loop, evaluation, predictions)

We implement:

- `train_one_epoch`: one pass over the training set.
- `evaluate`: compute loss, accuracy and macro F1 on a dataloader.
- `get_predictions`: get raw predictions and labels (for classification report).

We use **unweighted** cross-entropy loss here.  
(We tested class weights earlier; they did not consistently improve macro F1, so we keep the simpler version for the main pipeline.)


In [None]:
loss_fn = nn.CrossEntropyLoss()

def train_one_epoch(model, dataloader, optimizer, loss_fn):
    model.train()
    total_loss = 0.0

    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        )
        logits = outputs.logits
        loss = loss_fn(logits, batch["labels"])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(dataloader)
    return avg_loss


def evaluate(model, dataloader, loss_fn):
    model.eval()
    total_loss = 0.0
    all_labels = []
    all_preds  = []

    with torch.no_grad():
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}

            outputs = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )
            logits = outputs.logits
            loss = loss_fn(logits, batch["labels"])

            preds = torch.argmax(logits, dim=-1)

            total_loss += loss.item()
            all_labels.extend(batch["labels"].cpu().numpy())
            all_preds.extend(preds.cpu().numpy())

    avg_loss = total_loss / len(dataloader)
    acc = accuracy_score(all_labels, all_preds)
    macro_f1 = f1_score(all_labels, all_preds, average="macro")
    return avg_loss, acc, macro_f1


def get_predictions(model, dataloader):
    model.eval()
    all_labels = []
    all_preds  = []

    with torch.no_grad():
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}

            outputs = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1)

            all_labels.extend(batch["labels"].cpu().numpy())
            all_preds.extend(preds.cpu().numpy())

    return np.array(all_labels), np.array(all_preds)


## 6. Task 1 – Clarity model: training & evaluation

We fine-tune a transformer for 3-way clarity classification.

- Model: `roberta-base` with a classification head.
- Optimizer: AdamW, learning rate `2e-5`.
- Epochs: e.g. 8 (you can tune this).
- We keep the **best epoch** based on validation macro F1 (early stopping style).


In [None]:
num_clarity_labels = len(clarity2id)

clarity_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_clarity_labels,
    id2label=id2clarity,
    label2id=clarity2id,
).to(device)

optimizer_clarity = torch.optim.AdamW(clarity_model.parameters(), lr=2e-5)

num_epochs_clarity = 8  # you can try 6, 8, 10
best_val_f1 = 0.0
best_state_dict = None

for epoch in range(num_epochs_clarity):
    train_loss = train_one_epoch(clarity_model, clar_train_loader, optimizer_clarity, loss_fn)
    val_loss, val_acc, val_f1 = evaluate(clarity_model, clar_val_loader, loss_fn)

    print(f"Epoch {epoch+1}/{num_epochs_clarity}")
    print(f"  Train loss: {train_loss:.4f}")
    print(f"  Val   loss: {val_loss:.4f} | acc: {val_acc:.4f} | macro F1: {val_f1:.4f}")

    if val_f1 > best_val_f1:
        best_val_f1 = val_f1
        best_state_dict = clarity_model.state_dict().copy()
        print("New best clarity model saved (val macro F1 improved)")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/8
  Train loss: 0.9030
  Val   loss: 0.8922 | acc: 0.5913 | macro F1: 0.2477
New best clarity model saved (val macro F1 improved)
Epoch 2/8
  Train loss: 0.8738
  Val   loss: 0.7850 | acc: 0.6377 | macro F1: 0.4942
New best clarity model saved (val macro F1 improved)
Epoch 3/8
  Train loss: 0.7793
  Val   loss: 0.7370 | acc: 0.6551 | macro F1: 0.5812
New best clarity model saved (val macro F1 improved)
Epoch 4/8
  Train loss: 0.6921
  Val   loss: 0.7566 | acc: 0.6667 | macro F1: 0.5708
Epoch 5/8
  Train loss: 0.6275
  Val   loss: 0.7557 | acc: 0.6638 | macro F1: 0.6087
New best clarity model saved (val macro F1 improved)
Epoch 6/8
  Train loss: 0.5450
  Val   loss: 0.8770 | acc: 0.6580 | macro F1: 0.6046
Epoch 7/8
  Train loss: 0.4788
  Val   loss: 0.9010 | acc: 0.6435 | macro F1: 0.5927
Epoch 8/8
  Train loss: 0.4242
  Val   loss: 0.9412 | acc: 0.6377 | macro F1: 0.6268
New best clarity model saved (val macro F1 improved)


In [None]:
# Load the best epoch before evaluating on test
if best_state_dict is not None:
    clarity_model.load_state_dict(best_state_dict)

test_loss, test_acc, test_f1 = evaluate(clarity_model, clar_test_loader, loss_fn)
print("Clarity – TEST (best epoch)")
print(f"  Loss: {test_loss:.4f} | acc: {test_acc:.4f} | macro F1: {test_f1:.4f}")

# Detailed per-class report
y_true_clar, y_pred_clar = get_predictions(clarity_model, clar_test_loader)
print("\nClarity – classification report (TEST):")
print(classification_report(y_true_clar, y_pred_clar, target_names=clarity_labels))

print("Confusion matrix (rows=true, cols=pred):")
print(confusion_matrix(y_true_clar, y_pred_clar))


Clarity – TEST (best epoch)
  Loss: 0.8617 | acc: 0.6201 | macro F1: 0.5294

Clarity – classification report (TEST):
                 precision    recall  f1-score   support

     Ambivalent       0.74      0.67      0.71       206
Clear Non-Reply       0.47      0.35      0.40        23
    Clear Reply       0.42      0.56      0.48        79

       accuracy                           0.62       308
      macro avg       0.55      0.53      0.53       308
   weighted avg       0.64      0.62      0.63       308

Confusion matrix (rows=true, cols=pred):
[[139   7  60]
 [ 15   8   0]
 [ 33   2  44]]


## 7. Task 2 – Evasion model: training & internal validation

We now fine-tune a second transformer for 9-way evasion classification.

- Train/val data: only rows where `evasion_id != -1`.
- Same idea: keep the best epoch based on validation macro F1.

We do not compute test metrics yet; that will use the special annotator rule.


In [None]:
num_evasion_labels = len(evasion2id)

evasion_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_evasion_labels,
    id2label=id2evasion,
    label2id=evasion2id,
).to(device)

optimizer_eva = torch.optim.AdamW(evasion_model.parameters(), lr=2e-5)

num_epochs_eva = 8  # you can tune this as well
best_val_f1_eva = 0.0
best_state_dict_eva = None

for epoch in range(num_epochs_eva):
    train_loss_eva = train_one_epoch(evasion_model, ev_train_loader, optimizer_eva, loss_fn)
    val_loss_eva, val_acc_eva, val_f1_eva = evaluate(evasion_model, ev_val_loader, loss_fn)

    print(f"[Evasion] Epoch {epoch+1}/{num_epochs_eva}")
    print(f"  Train loss: {train_loss_eva:.4f}")
    print(f"  Val   loss: {val_loss_eva:.4f} | acc: {val_acc_eva:.4f} | macro F1: {val_f1_eva:.4f}")

    if val_f1_eva > best_val_f1_eva:
        best_val_f1_eva = val_f1_eva
        best_state_dict_eva = evasion_model.state_dict().copy()
        print("New best evasion model saved (val macro F1 improved)")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[Evasion] Epoch 1/8
  Train loss: 1.8906
  Val   loss: 1.8762 | acc: 0.3043 | macro F1: 0.0546
New best evasion model saved (val macro F1 improved)
[Evasion] Epoch 2/8
  Train loss: 1.8117
  Val   loss: 1.8088 | acc: 0.3043 | macro F1: 0.0759
New best evasion model saved (val macro F1 improved)
[Evasion] Epoch 3/8
  Train loss: 1.7057
  Val   loss: 1.7326 | acc: 0.3739 | macro F1: 0.2678
New best evasion model saved (val macro F1 improved)
[Evasion] Epoch 4/8
  Train loss: 1.6050
  Val   loss: 1.6716 | acc: 0.3362 | macro F1: 0.2876
New best evasion model saved (val macro F1 improved)
[Evasion] Epoch 5/8
  Train loss: 1.4969
  Val   loss: 1.6973 | acc: 0.3884 | macro F1: 0.3072
New best evasion model saved (val macro F1 improved)
[Evasion] Epoch 6/8
  Train loss: 1.3798
  Val   loss: 1.7125 | acc: 0.4029 | macro F1: 0.3594
New best evasion model saved (val macro F1 improved)
[Evasion] Epoch 7/8
  Train loss: 1.2367
  Val   loss: 1.8489 | acc: 0.3681 | macro F1: 0.3485
[Evasion] Epoch 8

In [None]:
# Load best evasion model before test evaluation
if best_state_dict_eva is not None:
    evasion_model.load_state_dict(best_state_dict_eva)

val_loss_eva, val_acc_eva, val_f1_eva = evaluate(evasion_model, ev_val_loader, loss_fn)
print("Evasion – VALIDATION (best epoch)")
print(f"  Loss: {val_loss_eva:.4f} | acc: {val_acc_eva:.4f} | macro F1: {val_f1_eva:.4f}")


Evasion – VALIDATION (best epoch)
  Loss: 1.7895 | acc: 0.3884 | macro F1: 0.3522


## 8. Task 2 – Test evaluation with annotators

The test split has no single ground-truth `evasion_label`.  
Instead, it has `annotator1`, `annotator2`, `annotator3`.

According to the dataset documentation:

> Any of the annotator labels (1, 2 or 3) is considered correct.

We therefore:

1. Build a **set of gold labels** for each test example.
2. Filter to those examples where at least one annotator gave a label.
3. Run the evasion model on these test texts.
4. Count a prediction as correct if `pred_label ∈ gold_set`.
5. Report accuracy under this “any annotator correct” rule.


In [None]:
def get_annotator_gold_set(row):
    labels = []
    for col in ["annotator1", "annotator2", "annotator3"]:
        val = row.get(col, None)
        if isinstance(val, str) and val != "":
            labels.append(val)
    return set(labels)

test_df["evasion_gold_set"] = test_df.apply(get_annotator_gold_set, axis=1)

# keep only rows with at least one annotator label
has_gold = test_df["evasion_gold_set"].apply(lambda s: len(s) > 0)
test_eva_df = test_df[has_gold].reset_index(drop=True)

print("Test examples with at least one annotator label:", len(test_eva_df))
print("Total test examples:", len(test_df))


Test examples with at least one annotator label: 308
Total test examples: 308


In [None]:
# Tokenize test texts for evasion
test_eva_enc = tokenize_texts(test_eva_df["text"], [0] * len(test_eva_df))  # dummy labels
test_eva_dataset = TorchTextDataset(test_eva_enc)
test_eva_loader  = torch.utils.data.DataLoader(test_eva_dataset, batch_size=batch_size)

# Get predictions
evasion_model.eval()
pred_labels = []

with torch.no_grad():
    for batch in test_eva_loader:
        batch = {k: v.to(device) for k, v in batch.items() if k != "labels"}  # ignore dummy labels

        outputs = evasion_model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        )
        logits = outputs.logits
        preds = torch.argmax(logits, dim=-1)
        pred_labels.extend(preds.cpu().numpy())

pred_labels = [id2evasion[i] for i in pred_labels]

gold_sets = test_eva_df["evasion_gold_set"].tolist()

correct_flags = [
    (pred in gold)
    for pred, gold in zip(pred_labels, gold_sets)
]

accuracy_any_annot = np.mean(correct_flags)
print("Task 2 – TEST accuracy (any annotator is correct):", accuracy_any_annot)


Task 2 – TEST accuracy (any annotator is correct): 0.4383116883116883


## 9. Multi-task model: shared encoder for clarity + evasion

We now define a single model with:
- one shared transformer encoder,
- one classification head for **clarity** (3 classes),
- one classification head for **evasion** (9 classes).

During training:
- all examples contribute to the clarity loss;
- only examples with a valid evasion label contribute to the evasion loss.


In [None]:
import torch.nn as nn
from transformers import AutoModel

class MultiTaskQEvasionModel(nn.Module):
    def __init__(self, model_name, num_clarity_labels, num_evasion_labels):
        super().__init__()
        # Shared encoder (RoBERTa encoder without classification head)
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size

        self.dropout = nn.Dropout(self.encoder.config.hidden_dropout_prob)

        # Clarity head (3 classes)
        self.clarity_head = nn.Linear(hidden_size, num_clarity_labels)
        # Evasion head (9 classes)
        self.evasion_head = nn.Linear(hidden_size, num_evasion_labels)

    def forward(self, input_ids, attention_mask):
        # Standard transformer forward
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # CLS / <s> token representation
        pooled = outputs.last_hidden_state[:, 0]  # (batch_size, hidden_size)
        pooled = self.dropout(pooled)

        clarity_logits = self.clarity_head(pooled)
        evasion_logits = self.evasion_head(pooled)

        return clarity_logits, evasion_logits


### 9.1 Multi-task dataset (clarity + evasion) for train/val

We build a single dataset where each example has:
- tokenized `text`,
- `clarity_labels` (always defined),
- `evasion_labels` (== -1 if missing),
- `evasion_mask` (1 if evasion label exists, else 0).


In [None]:
def tokenize_multitask(df):
    enc = tokenizer(
        list(df["text"]),
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
    enc["clarity_labels"] = df["clarity_id"].tolist()
    # evasion_id is -1 when no label (train); on val split, same
    enc["evasion_labels"] = df["evasion_id"].tolist()
    enc["evasion_mask"] = [1 if eid != -1 else 0 for eid in df["evasion_id"]]
    return enc

class MultiTaskDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {
            "input_ids": torch.tensor(self.encodings["input_ids"][idx]),
            "attention_mask": torch.tensor(self.encodings["attention_mask"][idx]),
            "clarity_labels": torch.tensor(self.encodings["clarity_labels"][idx]),
            "evasion_labels": torch.tensor(self.encodings["evasion_labels"][idx]),
            "evasion_mask": torch.tensor(self.encodings["evasion_mask"][idx]),
        }

# Build encodings for train/val
mt_train_enc = tokenize_multitask(clar_train_df)
mt_val_enc   = tokenize_multitask(clar_val_df)

mt_train_dataset = MultiTaskDataset(mt_train_enc)
mt_val_dataset   = MultiTaskDataset(mt_val_enc)

batch_size = 8  # reuse your value

mt_train_loader = torch.utils.data.DataLoader(mt_train_dataset, batch_size=batch_size, shuffle=True)
mt_val_loader   = torch.utils.data.DataLoader(mt_val_dataset,   batch_size=batch_size)


### 9.2 Training & evaluation for the multi-task model

Loss:
- `L_total = L_clarity + alpha * L_evasion` (only where evasion_mask == 1)

Metrics:
- Clarity accuracy / macro F1 on all examples.
- Evasion accuracy / macro F1 on the subset with labels.


In [None]:
alpha = 1.0  # weight for evasion loss
ce_clarity = nn.CrossEntropyLoss()
ce_evasion = nn.CrossEntropyLoss()

def train_one_epoch_multitask(model, dataloader, optimizer):
    model.train()
    total_loss = 0.0

    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        clarity_logits, evasion_logits = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        )

        clarity_labels = batch["clarity_labels"]
        evasion_labels = batch["evasion_labels"]
        evasion_mask   = batch["evasion_mask"].bool()

        loss_cl = ce_clarity(clarity_logits, clarity_labels)

        if evasion_mask.any():
            loss_ev = ce_evasion(
                evasion_logits[evasion_mask],
                evasion_labels[evasion_mask],
            )
            loss = loss_cl + alpha * loss_ev
        else:
            loss = loss_cl

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)


def evaluate_multitask(model, dataloader):
    model.eval()
    total_loss = 0.0

    all_cl_labels = []
    all_cl_preds  = []

    all_ev_labels = []
    all_ev_preds  = []

    with torch.no_grad():
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}

            clarity_logits, evasion_logits = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )

            clarity_labels = batch["clarity_labels"]
            evasion_labels = batch["evasion_labels"]
            evasion_mask   = batch["evasion_mask"].bool()

            loss_cl = ce_clarity(clarity_logits, clarity_labels)
            if evasion_mask.any():
                loss_ev = ce_evasion(
                    evasion_logits[evasion_mask],
                    evasion_labels[evasion_mask],
                )
                loss = loss_cl + alpha * loss_ev
            else:
                loss = loss_cl

            total_loss += loss.item()

            # Clarity preds (all examples)
            cl_pred = torch.argmax(clarity_logits, dim=-1)
            all_cl_labels.extend(clarity_labels.cpu().numpy())
            all_cl_preds.extend(cl_pred.cpu().numpy())

            # Evasion preds (only where label exists)
            if evasion_mask.any():
                ev_pred = torch.argmax(evasion_logits, dim=-1)
                all_ev_labels.extend(evasion_labels[evasion_mask].cpu().numpy())
                all_ev_preds.extend(ev_pred[evasion_mask].cpu().numpy())

    avg_loss = total_loss / len(dataloader)

    # Clarity metrics
    cl_acc = accuracy_score(all_cl_labels, all_cl_preds)
    cl_f1  = f1_score(all_cl_labels, all_cl_preds, average="macro")

    # Evasion metrics (only if we have any labels)
    if len(all_ev_labels) > 0:
        ev_acc = accuracy_score(all_ev_labels, all_ev_preds)
        ev_f1  = f1_score(all_ev_labels, all_ev_preds, average="macro")
    else:
        ev_acc, ev_f1 = None, None

    return avg_loss, cl_acc, cl_f1, ev_acc, ev_f1


### 9.3 Train multi-task model (clarity + evasion)

We keep the best epoch according to validation **clarity macro F1**.
You can also choose to use a combined criterion later.


In [34]:
num_clarity_labels = len(clarity2id)
num_evasion_labels = len(evasion2id)

mt_model = MultiTaskQEvasionModel(
    model_name=model_name,
    num_clarity_labels=num_clarity_labels,
    num_evasion_labels=num_evasion_labels,
).to(device)

optimizer_mt = torch.optim.AdamW(mt_model.parameters(), lr=2e-5)

num_epochs_mt = 10  # start with 6; you can try 8 later

best_val_cl_f1 = 0.0
best_state_dict_mt = None

for epoch in range(num_epochs_mt):
    train_loss_mt = train_one_epoch_multitask(mt_model, mt_train_loader, optimizer_mt)
    val_loss_mt, cl_acc_mt, cl_f1_mt, ev_acc_mt, ev_f1_mt = evaluate_multitask(mt_model, mt_val_loader)

    print(f"[MT] Epoch {epoch+1}/{num_epochs_mt}")
    print(f"  Train loss: {train_loss_mt:.4f}")
    print(f"  Val   loss: {val_loss_mt:.4f}")
    print(f"  Clarity  - acc: {cl_acc_mt:.4f} | macro F1: {cl_f1_mt:.4f}")
    if ev_acc_mt is not None:
        print(f"  Evasion  - acc: {ev_acc_mt:.4f} | macro F1: {ev_f1_mt:.4f}")

    if cl_f1_mt > best_val_cl_f1:
        best_val_cl_f1 = cl_f1_mt
        best_state_dict_mt = mt_model.state_dict().copy()
        print("New best multi-task model saved (val clarity macro F1 improved)")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[MT] Epoch 1/10
  Train loss: 2.7785
  Val   loss: 2.6182
  Clarity  - acc: 0.6000 | macro F1: 0.3391
  Evasion  - acc: 0.3043 | macro F1: 0.0519
New best multi-task model saved (val clarity macro F1 improved)
[MT] Epoch 2/10
  Train loss: 2.4972
  Val   loss: 2.4398
  Clarity  - acc: 0.6377 | macro F1: 0.5865
  Evasion  - acc: 0.3768 | macro F1: 0.2372
New best multi-task model saved (val clarity macro F1 improved)
[MT] Epoch 3/10
  Train loss: 2.2665
  Val   loss: 2.4156
  Clarity  - acc: 0.6580 | macro F1: 0.6081
  Evasion  - acc: 0.4087 | macro F1: 0.3220
New best multi-task model saved (val clarity macro F1 improved)
[MT] Epoch 4/10
  Train loss: 2.0341
  Val   loss: 2.2763
  Clarity  - acc: 0.6783 | macro F1: 0.6062
  Evasion  - acc: 0.4058 | macro F1: 0.3543
[MT] Epoch 5/10
  Train loss: 1.8251
  Val   loss: 2.4216
  Clarity  - acc: 0.6696 | macro F1: 0.6086
  Evasion  - acc: 0.4145 | macro F1: 0.3707
New best multi-task model saved (val clarity macro F1 improved)
[MT] Epoch 6/1

### 9.4 Clarity evaluation on TEST (using multi-task model)

We reuse the existing `clar_test_enc` / `clar_test_loader`, but pass batches through
the multi-task model and use only the clarity head.


In [35]:
# Reload best multi-task model
if best_state_dict_mt is not None:
    mt_model.load_state_dict(best_state_dict_mt)

# Use the existing clarity test loader: clar_test_loader
mt_model.eval()
all_labels_test = []
all_preds_test  = []

with torch.no_grad():
    for batch in clar_test_loader:
        batch_gpu = {k: v.to(device) for k, v in batch.items()}
        clarity_logits, _ = mt_model(
            input_ids=batch_gpu["input_ids"],
            attention_mask=batch_gpu["attention_mask"],
        )
        preds = torch.argmax(clarity_logits, dim=-1)

        all_labels_test.extend(batch_gpu["labels"].cpu().numpy())
        all_preds_test.extend(preds.cpu().numpy())

cl_test_acc  = accuracy_score(all_labels_test, all_preds_test)
cl_test_f1   = f1_score(all_labels_test, all_preds_test, average="macro")

print("Multi-task model – Clarity TEST")
print(f"  acc: {cl_test_acc:.4f} | macro F1: {cl_test_f1:.4f}")


Multi-task model – Clarity TEST
  acc: 0.6591 | macro F1: 0.5460


In [38]:
# Evaluate multi-task model on TEST for EVASION (any annotator is correct)

# 1) Build gold label sets from annotators
def get_annotator_gold_set(row):
    labels = []
    for col in ["annotator1", "annotator2", "annotator3"]:
        val = row.get(col, None)
        if isinstance(val, str) and val != "":
            labels.append(val)
    return set(labels)

test_df["evasion_gold_set"] = test_df.apply(get_annotator_gold_set, axis=1)
has_gold = test_df["evasion_gold_set"].apply(lambda s: len(s) > 0)
test_eva_df = test_df[has_gold].reset_index(drop=True)

print("Test examples with at least one annotator label:", len(test_eva_df))
print("Total test examples:", len(test_df))

# 2) Tokenize test texts (dummy labels just to reuse the dataset class)
test_eva_enc = tokenize_texts(test_eva_df["text"], labels=[0] * len(test_eva_df))
test_eva_dataset = TorchTextDataset(test_eva_enc)
test_eva_loader  = torch.utils.data.DataLoader(test_eva_dataset, batch_size=batch_size)

# 3) Use the multi-task model's evasion head to get predictions
if best_state_dict_mt is not None:
    mt_model.load_state_dict(best_state_dict_mt)

mt_model.eval()
pred_labels = []

with torch.no_grad():
    for batch in test_eva_loader:
        batch_gpu = {k: v.to(device) for k, v in batch.items()}
        _, evasion_logits = mt_model(
            input_ids=batch_gpu["input_ids"],
            attention_mask=batch_gpu["attention_mask"],
        )
        preds = torch.argmax(evasion_logits, dim=-1)
        pred_labels.extend(preds.cpu().numpy())

pred_labels_str = [id2evasion[i] for i in pred_labels]
gold_sets = test_eva_df["evasion_gold_set"].tolist()

# 4) Accuracy with "any annotator is correct" rule
correct_flags = [
    (pred in gold)
    for pred, gold in zip(pred_labels_str, gold_sets)
]

accuracy_any_annot = np.mean(correct_flags)
print("Multi-task model – Task 2 (Evasion) TEST accuracy (any annotator correct):", accuracy_any_annot)


Test examples with at least one annotator label: 308
Total test examples: 308
Multi-task model – Task 2 (Evasion) TEST accuracy (any annotator correct): 0.4642857142857143


In [36]:
# Evaluate multi-task model on the VALIDATION split
val_loss_mt, cl_acc_mt, cl_f1_mt, ev_acc_mt, ev_f1_mt = evaluate_multitask(mt_model, mt_val_loader)

print("Multi-task model – VALIDATION")
print(f"  Loss: {val_loss_mt:.4f}")
print(f"  Clarity  -> acc: {cl_acc_mt:.4f} | macro F1: {cl_f1_mt:.4f}")

if ev_acc_mt is not None:
    print(f"  Evasion  -> acc: {ev_acc_mt:.4f} | macro F1: {ev_f1_mt:.4f}")
else:
    print("  Evasion  -> no labeled examples in this split.")


Multi-task model – VALIDATION
  Loss: 3.2555
  Clarity  -> acc: 0.6812 | macro F1: 0.6558
  Evasion  -> acc: 0.4290 | macro F1: 0.4114
