# QEvasion – Classical Approach

This notebook prepares the QEvasion dataset
and trains classical baselines for two tasks:

- **Task 1 – Clarity-level classification (3-way)**  
  `clarity_label` in both `train` and `test`.

- **Task 2 – Evasion-level classification (9-way)**  
  - In `train`: `evasion_label` is a single gold label per example.  
  - In `test`: `evasion_label` is empty on purpose. Instead we have
    `annotator1`, `annotator2`, `annotator3`. **Any annotator label is considered correct.**

In this notebook we will:

1. Build a unified `text` field = Question + Answer.
2. Encode clarity and evasion labels as integers.
3. Train TF–IDF + linear models (SVM) for both tasks.
4. Implement a special evaluation for Task 2 on the official test set
   using the three annotators as a set of acceptable labels.


In [1]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

plt.rcParams["figure.figsize"] = (8, 4)
plt.rcParams["axes.grid"] = True

# Load QEvasion from Hugging Face
dataset = load_dataset("ailsntua/QEvasion")
dataset




DatasetDict({
    train: Dataset({
        features: ['title', 'date', 'president', 'url', 'question_order', 'interview_question', 'interview_answer', 'gpt3.5_summary', 'gpt3.5_prediction', 'question', 'annotator_id', 'annotator1', 'annotator2', 'annotator3', 'inaudible', 'multiple_questions', 'affirmative_questions', 'index', 'clarity_label', 'evasion_label'],
        num_rows: 3448
    })
    test: Dataset({
        features: ['title', 'date', 'president', 'url', 'question_order', 'interview_question', 'interview_answer', 'gpt3.5_summary', 'gpt3.5_prediction', 'question', 'annotator_id', 'annotator1', 'annotator2', 'annotator3', 'inaudible', 'multiple_questions', 'affirmative_questions', 'index', 'clarity_label', 'evasion_label'],
        num_rows: 308
    })
})

The dataset has two splits, `train` (3,448 rows) and `test` (308 rows). We convert both to pandas DataFrames.


In [2]:
train_df = dataset["train"].to_pandas()
test_df  = dataset["test"].to_pandas()

train_df.head()[[
    "title",
    "date",
    "president",
    "interview_question",
    "interview_answer",
    "clarity_label",
    "evasion_label",
]]


Unnamed: 0,title,date,president,interview_question,interview_answer,clarity_label,evasion_label
0,"The President's News Conference in Hanoi, Vietnam","September 10, 2023",Joseph R. Biden,Q. Of the Biden administration. And accused th...,"Well, look, first of all, theI am sincere abou...",Clear Reply,Explicit
1,"The President's News Conference in Hanoi, Vietnam","September 10, 2023",Joseph R. Biden,Q. Of the Biden administration. And accused th...,"Well, look, first of all, theI am sincere abou...",Ambivalent,General
2,"The President's News Conference in Hanoi, Vietnam","September 10, 2023",Joseph R. Biden,Q. No worries. Do you believe the country's sl...,"Look, I think China has a difficult economic p...",Ambivalent,Partial/half-answer
3,"The President's News Conference in Hanoi, Vietnam","September 10, 2023",Joseph R. Biden,Q. No worries. Do you believe the country's sl...,"Look, I think China has a difficult economic p...",Ambivalent,Dodging
4,"The President's News Conference in Hanoi, Vietnam","September 10, 2023",Joseph R. Biden,"Q. I can imagine. It is evening, I'd like to r...","Well, I hope I get to see Mr. Xi sooner than l...",Clear Reply,Explicit


## 2. Build unified input text

For all models, we will use a single text field combining question and answer, for example:

> `Question: <question> [SEP] Answer: <answer>`

The `[SEP]` is just a separator token in the raw text (for TF–IDF).  
For transformers later, we will let the tokenizer handle segments.
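
To see what happens to the `[SEP]` marker once the text reaches TF–IDF, here is a minimal, dependency-free sketch. It assumes the vectoriser's documented defaults (`lowercase=True`, `token_pattern=r"(?u)\b\w\w+\b"`) and only imitates them with `re`; the helper name `default_like_tokens` and the example sentence are hypothetical.

```python
import re

# Assumption: this mirrors TfidfVectorizer's documented defaults
# (lowercase=True, token_pattern=r"(?u)\b\w\w+\b"); it is not the library code.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_like_tokens(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


# Hypothetical question/answer pair, formatted like the `text` column.
example = "Question: Q. Do you agree? [SEP] Answer: Well, I think so."
tokens = default_like_tokens(example)
print(tokens)
```

Under these assumptions the brackets are stripped, so the separator survives only as the ordinary token `sep`, and one-character tokens such as the `Q` prefix are dropped. The words `question`, `sep`, and `answer` then occur in every example, so they should carry little discriminative weight.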


In [3]:
# Keep only rows whose clarity_label is present and not blank
train_df = train_df[train_df["clarity_label"].notna() & train_df["clarity_label"].astype(str).str.strip().ne("")]
test_df = test_df[test_df["clarity_label"].notna() & test_df["clarity_label"].astype(str).str.strip().ne("")]


In [4]:
def build_text_column(df):
    df = df.copy()
    q = df["interview_question"].fillna("")
    a = df["interview_answer"].fillna("")
    df["text"] = "Question: " + q + " [SEP] Answer: " + a
    return df

train_df = build_text_column(train_df)
test_df  = build_text_column(test_df)

train_df[["text", "clarity_label", "evasion_label"]].head()


Unnamed: 0,text,clarity_label,evasion_label
0,Question: Q. Of the Biden administration. And ...,Clear Reply,Explicit
1,Question: Q. Of the Biden administration. And ...,Ambivalent,General
2,Question: Q. No worries. Do you believe the co...,Ambivalent,Partial/half-answer
3,Question: Q. No worries. Do you believe the co...,Ambivalent,Dodging
4,"Question: Q. I can imagine. It is evening, I'd...",Clear Reply,Explicit


## 3. Label encoding

We map string labels to integer IDs:

- **Clarity** (Task 1): 3 labels in both `train` and `test`.
- **Evasion** (Task 2): 9 labels in `train`.  
  In `test`, `evasion_label` is empty (for Task 2); we will use `annotator1/2/3` there.

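We build the evasion mapping from the **training** data only, and the clarity mapping from the union of `train` and `test` labels, so that the integer IDs stay consistent across splits.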
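
The mapping idea can be sketched without pandas. The label lists below are hypothetical stand-ins for the DataFrame columns; the point is that sorting the unique labels makes the IDs deterministic, and that it is worth checking for labels the mapping does not cover.

```python
# Hypothetical label lists; the notebook reads these from the DataFrames.
train_labels = ["Clear Reply", "Ambivalent", "Ambivalent", "Clear Non-Reply"]
test_labels = ["Ambivalent", "Clear Reply"]

# Sort the unique labels so the integer IDs are deterministic across runs.
label2id = {lbl: i for i, lbl in enumerate(sorted(set(train_labels)))}
id2label = {i: lbl for lbl, i in label2id.items()}

# A test label missing from the mapping would become NaN under Series.map,
# so list any such labels explicitly before encoding.
unseen = sorted(set(test_labels) - set(label2id))
print(label2id, unseen)
```

An empty `unseen` list means every test label can be encoded with the training mapping.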


In [5]:
# Clarity labels: from train and test (just in case)
clarity_labels = sorted(
    list(set(train_df["clarity_label"].dropna().unique()) |
         set(test_df["clarity_label"].dropna().unique()))
)

print("Clarity labels:", clarity_labels)

clarity2id = {lbl: i for i, lbl in enumerate(clarity_labels)}
id2clarity = {i: lbl for lbl, i in clarity2id.items()}

clarity2id


Clarity labels: ['Ambivalent', 'Clear Non-Reply', 'Clear Reply']


{'Ambivalent': 0, 'Clear Non-Reply': 1, 'Clear Reply': 2}

In [6]:
# Evasion labels: only from train (test.evasion_label is intentionally empty for Task 2)
evasion_labels = sorted(train_df["evasion_label"].dropna().unique())
print("Evasion labels (train):", evasion_labels)

evasion2id = {lbl: i for i, lbl in enumerate(evasion_labels)}
id2evasion = {i: lbl for lbl, i in evasion2id.items()}

evasion2id


Evasion labels (train): ['Claims ignorance', 'Clarification', 'Declining to answer', 'Deflection', 'Dodging', 'Explicit', 'General', 'Implicit', 'Partial/half-answer']


{'Claims ignorance': 0,
 'Clarification': 1,
 'Declining to answer': 2,
 'Deflection': 3,
 'Dodging': 4,
 'Explicit': 5,
 'General': 6,
 'Implicit': 7,
 'Partial/half-answer': 8}

Now we add integer columns:

- `clarity_id` for both train and test.
- `evasion_id` only for rows where `evasion_label` is not empty (in train).


In [7]:
# Add clarity_id everywhere
train_df["clarity_id"] = train_df["clarity_label"].map(clarity2id)
test_df["clarity_id"]  = test_df["clarity_label"].map(clarity2id)

# For evasion, only valid labels in train
mask_evasion_valid = train_df["evasion_label"].notna() & (train_df["evasion_label"] != "")
train_df["evasion_id"] = np.where(
    mask_evasion_valid,
    train_df["evasion_label"].map(evasion2id),
    -1  # -1 = invalid / missing
)

train_df[["evasion_label", "evasion_id"]].head(10)


Unnamed: 0,evasion_label,evasion_id
0,Explicit,5
1,General,6
2,Partial/half-answer,8
3,Dodging,4
4,Explicit,5
5,Implicit,7
6,Deflection,3
7,Implicit,7
8,Explicit,5
9,Explicit,5


## 4. Task 1 – Clarity: train / validation / test splits

We create:

- `clar_train_df` and `clar_val_df` from the original training split, using a stratified split on `clarity_id`.
- `clar_test_df` from the original test split (we keep it as the official test set).

We stratify by `clarity_id` to keep the proportion of the three clarity classes similar across train and validation.


In [8]:
# Use all training examples for clarity
clar_full_train_df = train_df.copy()

X_clar_full = clar_full_train_df["text"].values
y_clar_full = clar_full_train_df["clarity_id"].values

clar_train_idx, clar_val_idx = train_test_split(
    np.arange(len(clar_full_train_df)),
    test_size=0.1,
    stratify=y_clar_full,
    random_state=42
)

clar_train_df = clar_full_train_df.iloc[clar_train_idx].reset_index(drop=True)
clar_val_df   = clar_full_train_df.iloc[clar_val_idx].reset_index(drop=True)
clar_test_df  = test_df.copy()

len(clar_train_df), len(clar_val_df), len(clar_test_df)


(3103, 345, 308)

In [9]:
print("Train clarity distribution:")
print(clar_train_df["clarity_label"].value_counts(), "\n")

print("Val clarity distribution:")
print(clar_val_df["clarity_label"].value_counts(), "\n")

print("Test clarity distribution:")
print(clar_test_df["clarity_label"].value_counts())


Train clarity distribution:
clarity_label
Ambivalent         1836
Clear Reply         947
Clear Non-Reply     320
Name: count, dtype: int64 

Val clarity distribution:
clarity_label
Ambivalent         204
Clear Reply        105
Clear Non-Reply     36
Name: count, dtype: int64 

Test clarity distribution:
clarity_label
Ambivalent         206
Clear Reply         79
Clear Non-Reply     23
Name: count, dtype: int64
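
As a quick check of the stratification, the counts printed above can be turned into class proportions. This is a small arithmetic sketch; the counts are copied from the output, and the helper name `proportions` is ours.

```python
# Class counts copied from the distributions printed above.
counts = {
    "train": {"Ambivalent": 1836, "Clear Reply": 947, "Clear Non-Reply": 320},
    "val": {"Ambivalent": 204, "Clear Reply": 105, "Clear Non-Reply": 36},
    "test": {"Ambivalent": 206, "Clear Reply": 79, "Clear Non-Reply": 23},
}


def proportions(split_counts):
    """Convert raw class counts into fractions of the split size."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}


props = {split: proportions(c) for split, c in counts.items()}
for split, p in props.items():
    print(split, {label: round(v, 3) for label, v in p.items()})
```

Train and validation proportions agree to within about half a percentage point, as expected from a stratified split. The official test split is not stratified against train and contains a noticeably larger share of `Ambivalent` examples, which is worth keeping in mind when comparing validation and test scores.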


## 5. Task 2 – Evasion: train / validation splits (from `train` only)

For Task 2, the gold `evasion_label` exists only in the **train** split.
We therefore:

1. Filter `train_df` to keep only rows with a valid `evasion_label`.
2. Split this subset into `ev_train_df` and `ev_val_df` (stratified on `evasion_id`).
3. Keep the official `test_df` for later evaluation using `annotator1/2/3`
   (where the `evasion_label` column is intentionally empty).


In [10]:
# Keep only rows with valid evasion labels in train
evasion_train_df = train_df[train_df["evasion_id"] != -1].reset_index(drop=True)

X_eva_full = evasion_train_df["text"].values
y_eva_full = evasion_train_df["evasion_id"].values

ev_train_idx, ev_val_idx = train_test_split(
    np.arange(len(evasion_train_df)),
    test_size=0.1,
    stratify=y_eva_full,
    random_state=42
)

ev_train_df = evasion_train_df.iloc[ev_train_idx].reset_index(drop=True)
ev_val_df   = evasion_train_df.iloc[ev_val_idx].reset_index(drop=True)

len(ev_train_df), len(ev_val_df)


(3103, 345)

In [11]:
print("Evasion train distribution:")
print(ev_train_df["evasion_label"].value_counts(), "\n")

print("Evasion val distribution:")
print(ev_val_df["evasion_label"].value_counts())


Evasion train distribution:
evasion_label
Explicit               947
Dodging                635
Implicit               439
General                347
Deflection             343
Declining to answer    131
Claims ignorance       107
Clarification           83
Partial/half-answer     71
Name: count, dtype: int64 

Evasion val distribution:
evasion_label
Explicit               105
Dodging                 71
Implicit                49
General                 39
Deflection              38
Declining to answer     14
Claims ignorance        12
Clarification            9
Partial/half-answer      8
Name: count, dtype: int64


## 6. Evaluation helpers

We define:
- A generic function to print accuracy and macro F1.
- Later, a special function for Task 2 test evaluation with multiple annotators.


In [12]:
def eval_classification(y_true, y_pred, label_type=""):
    acc = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average="macro")

    print(f"=== {label_type} ===")
    print(f"Accuracy : {acc:.4f}")
    print(f"Macro F1 : {macro_f1:.4f}")
    print("\nClassification report:")
    print(classification_report(y_true, y_pred))


## 7. TF–IDF vectorisation

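We build a TF–IDF representation on the training texts (Task 1 train) 
and reuse it for both clarity and evasion models as a classical baseline.
Note that the evasion split is stratified on a different label, so some evasion validation texts may also appear in the Task 1 training split on which the vectoriser is fitted. This affects only the vocabulary and IDF weights, not the labels; fitting a separate vectoriser per task would avoid it.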
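
For readers who want to see what the vectoriser computes, here is a minimal unigram sketch of TF–IDF weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalisation that scikit-learn documents as defaults; it uses whitespace tokenisation on a hypothetical three-document corpus and is not the library implementation.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows (unigrams only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = ["answer yes", "answer no", "answer maybe later"]  # hypothetical corpus
rows = tfidf_rows(docs)
print(rows[0])
```

In this toy corpus `answer` occurs in every document and therefore receives a lower weight than `yes`, which occurs in only one.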


In [13]:
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams + bigrams
    max_features=30000,    # can be tuned
)

# Fit on all clarity train texts (covers most vocabulary)
X_tfidf_clar_train = tfidf.fit_transform(clar_train_df["text"].values)
X_tfidf_clar_val   = tfidf.transform(clar_val_df["text"].values)
X_tfidf_clar_test  = tfidf.transform(clar_test_df["text"].values)

X_tfidf_clar_train.shape, X_tfidf_clar_val.shape, X_tfidf_clar_test.shape


((3103, 30000), (345, 30000), (308, 30000))

In [14]:
X_tfidf_eva_train = tfidf.transform(ev_train_df["text"].values)
X_tfidf_eva_val   = tfidf.transform(ev_val_df["text"].values)

X_tfidf_eva_train.shape, X_tfidf_eva_val.shape


((3103, 30000), (345, 30000))

## 8. Baseline 1 – Majority class classifier

As a very simple baseline, we predict the most frequent class in the training set
for each task. We evaluate on validation for both tasks and, for Task 1, also on test.
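
The macro F1 of such a baseline can be worked out by hand, which makes a useful sanity check on the reports. The sketch below assumes, as scikit-learn does by default, that classes with no predictions receive F1 = 0; the counts are the validation class counts printed earlier, and the function name `majority_macro_f1` is ours.

```python
def majority_macro_f1(class_counts, majority_label):
    """Macro F1 when every example is predicted as `majority_label`."""
    total = sum(class_counts.values())
    hits = class_counts[majority_label]
    precision = hits / total  # all predictions go to one class
    recall = 1.0  # every true member of that class is found
    f1_majority = 2 * precision * recall / (precision + recall)
    # Assumption: every other class scores F1 = 0 because it is never predicted.
    return f1_majority / len(class_counts)


clarity_val = {"Ambivalent": 204, "Clear Reply": 105, "Clear Non-Reply": 36}
print(round(majority_macro_f1(clarity_val, "Ambivalent"), 4))  # -> 0.2477
```

With one non-zero F1 divided by the number of classes, the macro score stays low even when accuracy looks reasonable, which is why macro F1 is the more informative metric on these imbalanced label sets.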


In [15]:
from collections import Counter

# Task 1 – clarity
clar_majority = Counter(clar_train_df["clarity_id"]).most_common(1)[0][0]

y_clar_val_majority = np.full_like(clar_val_df["clarity_id"].values, clar_majority)
y_clar_test_majority = np.full_like(clar_test_df["clarity_id"].values, clar_majority)

print("Majority baseline - clarity (val):")
eval_classification(clar_val_df["clarity_id"].values, y_clar_val_majority, label_type="Clarity (val)")

print("\nMajority baseline - clarity (test):")
eval_classification(clar_test_df["clarity_id"].values, y_clar_test_majority, label_type="Clarity (test)")


Majority baseline - clarity (val):
=== Clarity (val) ===
Accuracy : 0.5913
Macro F1 : 0.2477

Classification report:
              precision    recall  f1-score   support

           0       0.59      1.00      0.74       204
           1       0.00      0.00      0.00        36
           2       0.00      0.00      0.00       105

    accuracy                           0.59       345
   macro avg       0.20      0.33      0.25       345
weighted avg       0.35      0.59      0.44       345


Majority baseline - clarity (test):
=== Clarity (test) ===
Accuracy : 0.6688
Macro F1 : 0.2672

Classification report:
              precision    recall  f1-score   support

           0       0.67      1.00      0.80       206
           1       0.00      0.00      0.00        23
           2       0.00      0.00      0.00        79

    accuracy                           0.67       308
   macro avg       0.22      0.33      0.27       308
weighted avg       0.45      0.67      0.54       308





In [16]:
# Task 2 – evasion (on train/val only)
eva_majority = Counter(ev_train_df["evasion_id"]).most_common(1)[0][0]

y_eva_val_majority = np.full_like(ev_val_df["evasion_id"].values, eva_majority)

print("Majority baseline - evasion (val):")
eval_classification(ev_val_df["evasion_id"].values, y_eva_val_majority, label_type="Evasion (val)")


Majority baseline - evasion (val):
=== Evasion (val) ===
Accuracy : 0.3043
Macro F1 : 0.0519

Classification report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        12
           1       0.00      0.00      0.00         9
           2       0.00      0.00      0.00        14
           3       0.00      0.00      0.00        38
           4       0.00      0.00      0.00        71
           5       0.30      1.00      0.47       105
           6       0.00      0.00      0.00        39
           7       0.00      0.00      0.00        49
           8       0.00      0.00      0.00         8

    accuracy                           0.30       345
   macro avg       0.03      0.11      0.05       345
weighted avg       0.09      0.30      0.14       345





## 9. Baseline 2 – TF–IDF + Linear SVM (Task 1: Clarity)

We now train a linear SVM (LinearSVC) on TF–IDF features to predict `clarity_id`.
This is a standard classical baseline for text classification.


In [17]:
y_clar_train = clar_train_df["clarity_id"].values
y_clar_val   = clar_val_df["clarity_id"].values
y_clar_test  = clar_test_df["clarity_id"].values

clf_clarity = LinearSVC()
clf_clarity.fit(X_tfidf_clar_train, y_clar_train)

# Predictions
y_clar_val_pred  = clf_clarity.predict(X_tfidf_clar_val)
y_clar_test_pred = clf_clarity.predict(X_tfidf_clar_test)

print("TF–IDF + LinearSVC – clarity (val):")
eval_classification(y_clar_val, y_clar_val_pred, label_type="Clarity (val)")

print("\nTF–IDF + LinearSVC – clarity (test):")
eval_classification(y_clar_test, y_clar_test_pred, label_type="Clarity (test)")


TF–IDF + LinearSVC – clarity (val):
=== Clarity (val) ===
Accuracy : 0.6493
Macro F1 : 0.5667

Classification report:
              precision    recall  f1-score   support

           0       0.68      0.81      0.74       204
           1       0.61      0.39      0.47        36
           2       0.56      0.43      0.48       105

    accuracy                           0.65       345
   macro avg       0.62      0.54      0.57       345
weighted avg       0.64      0.65      0.64       345


TF–IDF + LinearSVC – clarity (test):
=== Clarity (test) ===
Accuracy : 0.6266
Macro F1 : 0.3938

Classification report:
              precision    recall  f1-score   support

           0       0.68      0.86      0.76       206
           1       0.50      0.13      0.21        23
           2       0.31      0.16      0.21        79

    accuracy                           0.63       308
   macro avg       0.50      0.38      0.39       308
weighted avg       0.57      0.63      0.58       308


## 10. Baseline 3 – TF–IDF + Linear SVM (Task 2: Evasion, internal validation)

For Task 2, we train another LinearSVC on the subset of `train` with valid `evasion_label`,
and evaluate on the internal validation split (`ev_val_df`).


In [18]:
y_eva_train = ev_train_df["evasion_id"].values
y_eva_val   = ev_val_df["evasion_id"].values

clf_evasion = LinearSVC()
clf_evasion.fit(X_tfidf_eva_train, y_eva_train)

y_eva_val_pred = clf_evasion.predict(X_tfidf_eva_val)

print("TF–IDF + LinearSVC – evasion (internal val):")
eval_classification(y_eva_val, y_eva_val_pred, label_type="Evasion (val, internal)")


TF–IDF + LinearSVC – evasion (internal val):
=== Evasion (val, internal) ===
Accuracy : 0.3420
Macro F1 : 0.2893

Classification report:
              precision    recall  f1-score   support

           0       0.20      0.08      0.12        12
           1       0.83      0.56      0.67         9
           2       0.80      0.29      0.42        14
           3       0.18      0.16      0.17        38
           4       0.32      0.35      0.33        71
           5       0.41      0.57      0.47       105
           6       0.14      0.10      0.12        39
           7       0.35      0.27      0.30        49
           8       0.00      0.00      0.00         8

    accuracy                           0.34       345
   macro avg       0.36      0.26      0.29       345
weighted avg       0.34      0.34      0.33       345



## 11. Task 2 – Evasion evaluation on test (multiple annotators)

On the official `test` split:
- `evasion_label` is empty (Task 2).
- Instead, `annotator1`, `annotator2`, `annotator3` each provide an evasion label.
- According to the dataset description, **any of these annotator labels is considered correct**.

We implement an evaluation function that:
1. Takes model predictions (one evasion label per example).
2. Builds, for each example, the set of acceptable gold labels `G` by collecting the non-empty annotator labels:

   ```text
   G = {annotator1, annotator2, annotator3} minus {empty}
   # i.e., keep only labels that are not "", None, or NaN
   ```

3. Counts a prediction as correct if it belongs to `G`, and reports the fraction of correct predictions as accuracy.


In [19]:
def get_annotator_gold_set(row):
    """
    Build the set of gold evasion labels from annotator1/2/3 for one row.
    Empty strings or NaNs are ignored.
    """
    labels = []
    for col in ["annotator1", "annotator2", "annotator3"]:
        val = row.get(col, None)
        if isinstance(val, str) and val != "":
            labels.append(val)
    return set(labels)


# Build gold sets for all test examples
test_df["evasion_gold_set"] = test_df.apply(get_annotator_gold_set, axis=1)

# Filter rows where we have at least one annotator label
has_gold = test_df["evasion_gold_set"].apply(lambda s: len(s) > 0)
test_eva_df = test_df[has_gold].reset_index(drop=True)

len(test_eva_df), len(test_df)


(308, 308)

In [20]:
# Transform test texts for evasion
X_tfidf_eva_test = tfidf.transform(test_eva_df["text"].values)

# Predict evasion IDs
y_eva_test_pred_ids = clf_evasion.predict(X_tfidf_eva_test)

# Convert predictions back to string labels
y_eva_test_pred_labels = [id2evasion[i] for i in y_eva_test_pred_ids]


In [21]:
# Compute accuracy where a prediction is correct if it matches any annotator label
correct_flags = []

for pred, gold_set in zip(y_eva_test_pred_labels, test_eva_df["evasion_gold_set"]):
    correct_flags.append(pred in gold_set)

accuracy_any_annot = np.mean(correct_flags)
accuracy_any_annot


np.float64(0.37012987012987014)

The value above (114 of 308 test examples, about 37.0%) is the **Task 2 test accuracy** under the rule:

> A prediction is counted as correct if it matches *any* of the annotators' evasion labels.

We can report this alongside the internal validation metrics as a baseline
for evasion-level classification on the official test split.
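
The any-annotator rule can also be wrapped in a small reusable function. This is a minimal sketch on hypothetical toy data; the function name `any_annotator_accuracy` is ours and not part of the dataset tooling.

```python
def any_annotator_accuracy(predictions, gold_sets):
    """Fraction of predictions that appear in the matching gold label set."""
    if len(predictions) != len(gold_sets):
        raise ValueError("predictions and gold_sets must have equal length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    hits = sum(pred in gold for pred, gold in zip(predictions, gold_sets))
    return hits / len(predictions)


# Hypothetical predictions and annotator label sets for three examples.
preds = ["Explicit", "Dodging", "General"]
golds = [{"Explicit", "Implicit"}, {"Deflection"}, {"General"}]
print(any_annotator_accuracy(preds, golds))
```

In the toy data, two of the three predictions fall inside their gold sets. Because an example can have up to three acceptable labels, this accuracy is not directly comparable to the single-label validation accuracy reported earlier.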
