# CLEF2025 Subtask 4a – Scientific Discourse Detection

This notebook presents a reproducible pipeline to train and evaluate a multi-label classifier for detecting scientific discourse in tweets.  
The model used is `microsoft/deberta-v3-base`.  
We follow a structured, multi-phase approach:
1. Baseline
2. Threshold tuning
3. Fine-tuning
4. Class weights
5. Ensemble

Output: `predictions.csv` for leaderboard submission.

## 📦 Setup & Imports

In [None]:
!pip install -q transformers datasets scikit-learn

import pandas as pd
import numpy as np
import torch
import json
import os
from tqdm import tqdm
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.utils.class_weight import compute_class_weight
from torch.nn import BCEWithLogitsLoss
from scipy.special import expit

# Setup
MODEL_NAME = "microsoft/deberta-v3-base"
SEED = 42
N_FOLDS = 5
torch.manual_seed(SEED)



In [None]:
# 🔧 Clone the official repository if not already cloned locally
!git clone https://gitlab.com/checkthat_lab/clef2025-checkthat-lab.git

# 📁 Set your base working directory
base_path = "/content/clef2025_task4a"
subdirs = ["data", "models", "predictions"]

for sub in subdirs:
    os.makedirs(os.path.join(base_path, sub), exist_ok=True)

# ✅ Copy key files from the official repository
source_path = "/content/clef2025-checkthat-lab/task4/subtask_4a"
!cp {source_path}/ct_train.tsv {base_path}/data/
!cp {source_path}/ct_test.tsv {base_path}/data/

# (optional if needed: ct_dev.tsv, baselines.ipynb)
print("✅ Data files successfully copied to /data folder.")

Cloning into 'clef2025-checkthat-lab'...
remote: Enumerating objects: 804, done.
remote: Counting objects: 100% (783/783), done.
remote: Compressing objects: 100% (457/457), done.
remote: Total 804 (delta 389), reused 660 (delta 313), pack-reused 21 (from 1)
Receiving objects: 100% (804/804), 77.82 MiB | 22.07 MiB/s, done.
Resolving deltas: 100% (393/393), done.
Updating files: 100% (155/155), done.
✅ Data files successfully copied to /data folder.


In [None]:
#📂 Load data and tokenize

base_path = "/content/clef2025_task4a"
data_path = os.path.join(base_path, "data")

# 📥 Read TSV files
train_file = os.path.join(data_path, "ct_train.tsv")
test_file = os.path.join(data_path, "ct_test.tsv")

# Validate existence
assert os.path.exists(train_file), "❌ ct_train.tsv not found"
assert os.path.exists(test_file), "❌ ct_test.tsv not found"

# Load datasets
df = pd.read_csv(train_file, sep="\t")
test_df = pd.read_csv(test_file, sep="\t")

# Validate content
print(f"✅ ct_train.tsv loaded: {df.shape[0]} rows, columns: {list(df.columns)}")
print(df.head(2))

print(f"✅ ct_test.tsv loaded: {test_df.shape[0]} rows, columns: {list(test_df.columns)}")
print(test_df.head(2))

# Convert the stringified label lists to actual lists (literal_eval only parses literals)
import ast
df['labels'] = df['labels'].apply(ast.literal_eval)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def tokenize(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)

# Prepare HuggingFace datasets
dataset = Dataset.from_pandas(df[['text', 'labels']])
dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

test_dataset = Dataset.from_pandas(test_df)
test_dataset = test_dataset.map(tokenize, batched=True, remove_columns=["text"])

print("✅ Tokenization completed for train and test.")

✅ ct_train.tsv loaded: 1229 rows, columns: ['index', 'text', 'labels']
   index                                               text           labels
0   1046  @user those eyes are a gift send straight from...  [0.0, 0.0, 0.0]
1    638  Remember when libs attacked @user for his conc...  [0.0, 0.0, 0.0]
✅ ct_test.tsv loaded: 240 rows, columns: ['index', 'text']
  index                                               text
0   t_0  'That's because if a broadband provider inform...
1   t_1  Prostate biopsy - adenocarcinoma, high-grade t...


✅ Tokenization completed for train and test.
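
The `labels` column arrives as strings such as `'[0.0, 0.0, 0.0]'`. As a minimal sketch of the parsing step, the snippet below uses `ast.literal_eval`, which accepts only Python literals. The helper name `parse_labels` and the sample strings are illustrative assumptions, not part of the dataset tooling.

```python
import ast

def parse_labels(raw, n_labels=3):
    """Parse a stringified label list into a list of floats (hypothetical helper)."""
    labels = ast.literal_eval(raw)  # parses literals only, never arbitrary code
    if not isinstance(labels, list) or len(labels) != n_labels:
        raise ValueError(f"expected a list of {n_labels} labels, got {raw!r}")
    return [float(x) for x in labels]

# Hypothetical rows in the same textual format as ct_train.tsv
rows = ["[0.0, 0.0, 0.0]", "[1.0, 0.0, 1.0]"]
parsed = [parse_labels(r) for r in rows]
print(parsed)
```

Validating the length up front surfaces malformed rows before they reach the tokenizer or the loss function.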


## 🔁 Phase 1 - Cross-validation: DeBERTa base without adjustments

In [None]:
# Configuration
EPOCHS = 10
LR = 2e-5

# Cross-validation
kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
fold_results = []
cv_logits = []
cv_labels = []

print(f"🚀 Starting baseline cross-validation: {N_FOLDS} folds")

for fold, (train_idx, val_idx) in enumerate(kf.split(df)):
    print(f"\n📂 Fold {fold + 1}/{N_FOLDS}")

    # Split data for the fold
    train_dataset_fold = dataset.select(train_idx)
    val_dataset_fold = dataset.select(val_idx)
    val_labels_fold = np.array(df.iloc[val_idx]["labels"].tolist())
    cv_labels.extend(val_labels_fold.tolist())

    # Load base model
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base",
        num_labels=3,
        problem_type="multi_label_classification"
    )

    # Training configuration
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=os.path.join(base_path, f"models/baseline_fold{fold}"),
            num_train_epochs=EPOCHS,
            per_device_train_batch_size=16,
            learning_rate=LR,
            seed=SEED,
            report_to="none",
            logging_steps=100,
        ),
        train_dataset=train_dataset_fold,
    )

    # Train and predict
    trainer.train()
    logits = trainer.predict(val_dataset_fold).predictions
    cv_logits.extend(logits.tolist())

    # Calculate temporary F1
    probs = expit(logits)
    preds_bin = (probs > 0.5).astype(int)
    f1 = f1_score(val_labels_fold, preds_bin, average='macro')
    fold_results.append(f1)
    print(f"✅ Fold macro F1: {f1:.4f}")

# Convert to DataFrame
cv_df = pd.DataFrame(cv_logits, columns=['cat1_logit', 'cat2_logit', 'cat3_logit'])
cv_df['labels'] = cv_labels

print("\n🎯 Final result baseline cross-validation:")
print(f"Average Macro F1: {np.mean(fold_results):.4f}")

🚀 Starting baseline cross-validation: 5 folds

📂 Fold 1/5


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
100,0.4665
200,0.2685
300,0.1648
400,0.0922
500,0.0603
600,0.0441


✅ Fold macro F1: 0.7949

📂 Fold 2/5


Step,Training Loss
100,0.4576
200,0.2714
300,0.1518
400,0.0845
500,0.0562
600,0.0388


✅ Fold macro F1: 0.8084

📂 Fold 3/5


Step,Training Loss
100,0.4646
200,0.2574
300,0.1348
400,0.0755
500,0.0438
600,0.0283


✅ Fold macro F1: 0.7955

📂 Fold 4/5


Step,Training Loss
100,0.4883
200,0.2616
300,0.1366
400,0.0835
500,0.0503
600,0.0351


✅ Fold macro F1: 0.8115

📂 Fold 5/5


Step,Training Loss
100,0.449
200,0.2287
300,0.1386
400,0.0825
500,0.0565
600,0.0361


✅ Fold macro F1: 0.8247

🎯 Final result baseline cross-validation:
Average Macro F1: 0.8070
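
To make the per-fold metric concrete, here is a dependency-free sketch of what "macro F1 at threshold 0.5" means for multi-label logits: apply a sigmoid to each logit, binarize each class independently, compute one F1 per class, and average. The toy logits and labels are invented for illustration, and the helper names are assumptions; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f1_binary(y_true, y_pred):
    """F1 for one class; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1_from_logits(logits, labels, threshold=0.5):
    """Binarize sigmoid(logit) per class, then average the per-class F1 scores."""
    n_classes = len(labels[0])
    preds = [[int(sigmoid(z) > threshold) for z in row] for row in logits]
    per_class = [
        f1_binary([row[c] for row in labels], [row[c] for row in preds])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes, per_class

# Toy data (illustrative only): 4 tweets, 3 categories
toy_logits = [[2.0, -1.0, 0.3], [-0.5, 1.5, -2.0], [1.2, 0.4, -0.1], [-3.0, -0.2, 2.2]]
toy_labels = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
macro, per_class = macro_f1_from_logits(toy_logits, toy_labels)
print(f"per-class F1: {per_class}, macro F1: {macro:.4f}")
```

Because every class contributes equally to the average, a weak minority class pulls the macro score down even when the other classes are predicted well.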


## 🎯 Phase 2 – Threshold tuning per class (PR curve)


In [None]:
def tune_thresholds(cv_df):
    thresholds = {}
    f1_scores = {}

    for i, cat in enumerate(['cat1', 'cat2', 'cat3']):
        y_true = [label[i] for label in cv_df['labels']]
        scores = expit(cv_df[f'{cat}_logit'].values)
        precision, recall, thresh = precision_recall_curve(y_true, scores)
        # precision/recall have one more entry than thresh; drop the final (recall=0) point
        f1 = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-8)
        best_thresh = thresh[np.argmax(f1)]
        thresholds[cat] = round(float(best_thresh), 4)
        f1_scores[cat] = round(float(np.max(f1)), 4)
        print(f"✅ {cat}: Optimal threshold = {thresholds[cat]} | F1 = {f1_scores[cat]}")

    return thresholds, f1_scores

# Prepare columns
for i, cat in enumerate(['cat1', 'cat2', 'cat3']):
    cv_df[f'{cat}_logit'] = np.array(cv_logits)[:, i]

# Compute thresholds
thresholds, best_f1s = tune_thresholds(cv_df)

# Save thresholds
threshold_path = os.path.join(base_path, "data/thresholds.json")
with open(threshold_path, "w") as f:
    json.dump(thresholds, f)

print(f"\n📁 Thresholds saved to: {threshold_path}")

✅ cat1: Optimal threshold = 0.4256 | F1 = 0.8182
✅ cat2: Optimal threshold = 0.7498 | F1 = 0.7946
✅ cat3: Optimal threshold = 0.7172 | F1 = 0.8443

📁 Thresholds saved to: /content/clef2025_task4a/data/thresholds.json
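
The search performed per class can be sketched without scikit-learn: try every observed score as a candidate threshold, predict positive when the score is at or above it, and keep the threshold with the highest F1. This is a simplified stand-in for the `precision_recall_curve` approach, not a reimplementation of it; the function name and the toy scores are assumptions made for illustration.

```python
def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1, predicting positive when score >= threshold."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < t)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy out-of-fold probabilities for one category (illustrative only)
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.55, 0.70]
t, f1 = best_f1_threshold(y_true, scores)
print(f"best threshold = {t}, F1 = {f1:.4f}")
```

Since the thresholds are selected on the same out-of-fold predictions they are scored on, the per-class F1 values reported at the tuned thresholds are likely to be somewhat optimistic.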


## 🧪 Phase 3 – Fine-tuning: compare different LRs and Epochs

In [None]:
# 📊 Configurations to test
fine_tune_configs = [
    {"name": "FT-1", "lr": 1e-5, "epochs": 10},
    {"name": "FT-2", "lr": 3e-5, "epochs": 6},
    {"name": "FT-3", "lr": 2e-5, "epochs": 12},
]

fine_tune_results = []
all_preds_finetuned = []
all_labels_finetuned = []

def run_finetune_experiment(lr, epochs, return_preds=False):
    kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
    all_preds = []
    all_labels = []
    all_logits = []

    for fold, (train_idx, val_idx) in enumerate(kf.split(df)):
        model = AutoModelForSequenceClassification.from_pretrained(
            MODEL_NAME, num_labels=3, problem_type="multi_label_classification"
        )

        train_dataset_fold = dataset.select(train_idx)
        val_dataset_fold = dataset.select(val_idx)
        val_labels = np.array(df.iloc[val_idx]["labels"].tolist())

        trainer = Trainer(
            model=model,
            args=TrainingArguments(
                output_dir=os.path.join(base_path, f"models/finetune_lr{lr}_ep{epochs}_fold{fold}"),
                num_train_epochs=epochs,
                per_device_train_batch_size=16,
                learning_rate=lr,
                report_to="none"
            ),
            train_dataset=train_dataset_fold,
        )

        trainer.train()
        logits = trainer.predict(val_dataset_fold).predictions
        probs = expit(logits)
        preds_bin = (probs > 0.5).astype(int)

        all_preds.extend(preds_bin.tolist())
        all_labels.extend(val_labels.tolist())
        all_logits.extend(logits.tolist())
        all_preds_finetuned.extend(probs.tolist())
        all_labels_finetuned.extend(val_labels.tolist())

    if return_preds:
        return f1_score(all_labels, all_preds, average="macro"), all_logits, all_labels
    return f1_score(all_labels, all_preds, average="macro")

print("🚀 Running fine-tuning experiments...")

for config in tqdm(fine_tune_configs):
    macro = run_finetune_experiment(config["lr"], config["epochs"])
    fine_tune_results.append({
        "Config": config["name"],
        "Learning Rate": config["lr"],
        "Epochs": config["epochs"],
        "Macro F1": round(macro, 4)
    })

# Show results
df_finetune = pd.DataFrame(fine_tune_results)
df_finetune = df_finetune.sort_values(by="Macro F1", ascending=False)
df_finetune.reset_index(drop=True, inplace=True)

print("\n📊 Fine-tuning results:")
display(df_finetune)

# 🏆 Best auto-selected configuration
best_config = df_finetune.iloc[0]
print(f"\n🏆 Best auto configuration: {best_config['Config']} → LR={best_config['Learning Rate']}, Epochs={int(best_config['Epochs'])}, Macro F1={best_config['Macro F1']}")

# ✅ Manual selection for the next phase
selected_lr = best_config["Learning Rate"]
selected_epochs = int(best_config["Epochs"])

print(f"\n📌 Manually using: LR={selected_lr}, Epochs={selected_epochs}")

🚀 Running fine-tuning experiments...


  0%|          | 0/3 [00:00<?, ?it/s]


Step,Training Loss
500,0.2725


Step,Training Loss
500,0.2596


Step,Training Loss
500,0.2614


Step,Training Loss
500,0.2616


Step,Training Loss
500,0.2713


 33%|███▎      | 1/3 [37:10<1:14:21, 2230.61s/it]

Step,Training Loss


Step,Training Loss


Step,Training Loss


Step,Training Loss


Step,Training Loss


 67%|██████▋   | 2/3 [57:51<27:28, 1648.62s/it]


Step,Training Loss
500,0.1967


Step,Training Loss
500,0.2078


Step,Training Loss
500,0.2067


Step,Training Loss
500,0.2395


Step,Training Loss
500,0.2271


100%|██████████| 3/3 [1:37:51<00:00, 1957.21s/it]


📊 Fine-tuning results:

,Config,Learning Rate,Epochs,Macro F1
0,FT-2,3e-05,6,0.8154
1,FT-3,2e-05,12,0.8105
2,FT-1,1e-05,10,0.7982

🏆 Best auto configuration: FT-2 → LR=3e-05, Epochs=6, Macro F1=0.8154

📌 Manually using: LR=3e-05, Epochs=6
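
Some of the `Step,Training Loss` tables in the training logs have no rows. A likely explanation is simple arithmetic: with 1229 samples and 5 folds, each training split holds about 983 tweets, and the 6-epoch runs finish before the first logging step. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` default logging interval of 500 steps for the cells that do not set `logging_steps`; the helper names are illustrative.

```python
import math

def total_optimizer_steps(n_train, batch_size, epochs):
    """Optimizer steps for one run, assuming one device and no gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs

def logged_steps(total_steps, logging_steps):
    """Steps at which a training-loss row would be emitted."""
    return list(range(logging_steps, total_steps + 1, logging_steps))

n_samples, n_folds, batch_size = 1229, 5, 16
n_train = n_samples - math.ceil(n_samples / n_folds)  # 983 when the validation fold has 246 rows

# (name, epochs, logging interval); 500 is assumed to be the default interval
runs = [("baseline", 10, 100), ("FT-1", 10, 500), ("FT-2", 6, 500), ("FT-3", 12, 500)]
for name, epochs, log_every in runs:
    total = total_optimizer_steps(n_train, batch_size, epochs)
    print(f"{name}: {total} steps, loss logged at {logged_steps(total, log_every)}")
```

Under these assumptions the baseline logs at steps 100 through 600, FT-1 and FT-3 log once at step 500, and FT-2 (372 steps) never logs. Setting `logging_steps` explicitly in every `TrainingArguments` would make the loss curves comparable across phases.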


## ⚖️ Phase 4 – Training with Class Weights

In [None]:
# 📌 Use same values from Phase 3 (Can be changed if needed)
lr_cw = selected_lr
epochs_cw = selected_epochs

# 🧮 Compute positive class weights per class (0 vs 1)
label_matrix = np.array(df['labels'].tolist())
weights_list = [
    compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=label_matrix[:, i])[1]
    for i in range(3)
]
class_weights_tensor = torch.tensor(weights_list)

print(f"📐 Applied Class Weights: {class_weights_tensor.tolist()}")

# 🎓 Extended Trainer class to use class weights
class WeightedLossTrainer(Trainer):
    def __init__(self, class_weights_tensor, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights_tensor

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
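        # Match the logits' device and dtype (the weights tensor may be float64)
        loss_fct = BCEWithLogitsLoss(pos_weight=self.class_weights.to(device=logits.device, dtype=logits.dtype))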
        loss = loss_fct(logits, labels.type_as(logits))
        return (loss, outputs) if return_outputs else loss

# 🔁 Cross-validation with class weights
kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
all_preds = []
all_labels = []

print(f"\n🚀 Training with class weights (lr={lr_cw}, epochs={epochs_cw})")

for fold, (train_idx, val_idx) in enumerate(kf.split(df)):
    print(f"📂 Fold {fold+1}/{N_FOLDS}")

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=3, problem_type="multi_label_classification"
    )

    train_dataset_fold = dataset.select(train_idx)
    val_dataset_fold = dataset.select(val_idx)
    val_labels = np.array(df.iloc[val_idx]["labels"].tolist())

    trainer = WeightedLossTrainer(
        class_weights_tensor=class_weights_tensor,
        model=model,
        args=TrainingArguments(
            output_dir=os.path.join(base_path, f"models/classweights_fold{fold}"),
            num_train_epochs=epochs_cw,
            per_device_train_batch_size=16,
            learning_rate=lr_cw,
            report_to="none"
        ),
        train_dataset=train_dataset_fold,
    )

    trainer.train()
    logits = trainer.predict(val_dataset_fold).predictions
    probs = expit(logits)
    preds_bin = (probs > 0.5).astype(int)

    all_preds.extend(preds_bin.tolist())
    all_labels.extend(val_labels.tolist())

# 📊 Final metric
macro_f1_cw = f1_score(all_labels, all_preds, average="macro")
print(f"\n✅ Macro F1 with class weights: {macro_f1_cw:.4f}")

📐 Applied Class Weights: [1.8453453453453454, 2.7433035714285716, 2.008169934640523]

🚀 Training with class weights (lr=3e-05, epochs=6)
📂 Fold 1/5


Step,Training Loss


📂 Fold 2/5


Step,Training Loss


📂 Fold 3/5


Step,Training Loss


📂 Fold 4/5


Step,Training Loss


📂 Fold 5/5


Step,Training Loss



✅ Macro F1 with class weights: 0.7882
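
The printed weights follow scikit-learn's `balanced` heuristic, `n_samples / (n_classes * count)`, evaluated for the positive class of each category. The sketch below reproduces them and shows how a `pos_weight` scales the positive term of the binary cross-entropy. The positive counts (333, 224, 306) are inferred from the printed weights rather than read from the data, so treat them as an assumption; the helper names are illustrative.

```python
import math

def balanced_positive_weight(n_samples, n_positive, n_classes=2):
    """'balanced' weight of the positive class: n_samples / (n_classes * n_positive)."""
    return n_samples / (n_classes * n_positive)

def weighted_bce(logit, label, pos_weight):
    """Single-element binary cross-entropy on a logit, with the positive term scaled."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1 - p))

n_samples = 1229
positives = {"cat1": 333, "cat2": 224, "cat3": 306}  # assumed counts, inferred from the weights
weights = {cat: balanced_positive_weight(n_samples, n) for cat, n in positives.items()}
print(weights)

# A missed positive (logit 0, label 1) costs pos_weight times more than a missed negative
print(weighted_bce(0.0, 1, weights["cat2"]), weighted_bce(0.0, 0, weights["cat2"]))
```

Note that this heuristic differs from the other common choice for `pos_weight`, the negative-to-positive ratio, which would give larger values for the same counts. Which of the two works better here is an empirical question.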


In [None]:
# Variables to store results
all_preds_finetuned = []
all_labels_finetuned = []

print(f"🚀 Running ensemble phase (lr={selected_lr}, epochs={selected_epochs})")

kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

for fold, (train_idx, val_idx) in enumerate(kf.split(df)):
    print(f"📂 Fold {fold+1}/{N_FOLDS}")

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=3, problem_type="multi_label_classification"
    )

    train_dataset_fold = dataset.select(train_idx)
    val_dataset_fold = dataset.select(val_idx)
    val_labels = np.array(df.iloc[val_idx]["labels"].tolist())

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=os.path.join(base_path, f"models/final_ft_fold{fold}"),
            num_train_epochs=selected_epochs,
            per_device_train_batch_size=16,
            learning_rate=selected_lr,
            report_to="none"
        ),
        train_dataset=train_dataset_fold,
    )

    trainer.train()
    logits = trainer.predict(val_dataset_fold).predictions
    probs = expit(logits)

    all_preds_finetuned.extend(probs.tolist())      # save probs for ensemble
    all_labels_finetuned.extend(val_labels.tolist())  # true labels

# Evaluate macro F1 for validation only
preds_bin = (np.array(all_preds_finetuned) > 0.5).astype(int)
macro_f1_ft = f1_score(all_labels_finetuned, preds_bin, average="macro")
print(f"✅ Macro F1 (fine-tuned, no weights): {macro_f1_ft:.4f}")

🚀 Running ensemble phase (lr=3e-05, epochs=6)
📂 Fold 1/5


Step,Training Loss


📂 Fold 2/5


Step,Training Loss


📂 Fold 3/5


Step,Training Loss


📂 Fold 4/5


Step,Training Loss


📂 Fold 5/5


Step,Training Loss


✅ Macro F1 (fine-tuned, no weights): 0.8178


## 🔀 Phase 5 – Ensemble (average of probabilities)

In [None]:
# ✅ Combine the out-of-fold predictions stored by the two previous cells:
# - all_preds_finetuned: sigmoid probabilities from the model without class weights
# - all_preds: 0/1 predictions (threshold 0.5) from the class-weighted model (Phase 4)
# Both were produced with the same KFold split, so the rows are aligned.

print("🔀 Combining probabilities from models with and without class weights...")

# Convert prediction lists to arrays; both are already in [0, 1], so no further sigmoid is applied
probs_finetuned = np.array(all_preds_finetuned, dtype=float)  # from model without class weights
probs_classweights = np.array(all_preds, dtype=float)         # from model with class weights (current Phase 4)

# Ensure compatible shape
assert probs_finetuned.shape == probs_classweights.shape, "❌ Prediction array shapes do not match."

# Average ensemble
ensemble_probs = (probs_finetuned + probs_classweights) / 2

# Apply thresholds from Phase 2
preds_ensemble = np.zeros_like(ensemble_probs)
preds_ensemble[:, 0] = (ensemble_probs[:, 0] > thresholds["cat1"]).astype(int)
preds_ensemble[:, 1] = (ensemble_probs[:, 1] > thresholds["cat2"]).astype(int)
preds_ensemble[:, 2] = (ensemble_probs[:, 2] > thresholds["cat3"]).astype(int)

# Evaluate against true labels
ensemble_macro_f1 = f1_score(all_labels, preds_ensemble, average="macro")

print(f"\n✅ Ensemble Macro F1: {ensemble_macro_f1:.4f}")

🔀 Combining probabilities from models with and without class weights...

✅ Ensemble Macro F1: 0.4141
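
Averaging is only meaningful when both inputs are on the same scale. The sketch below assumes each model's scores have passed through the sigmoid exactly once, averages them per class, and then applies the Phase 2 thresholds. The probability rows are invented for illustration and `ensemble_predict` is a hypothetical helper; applying a sigmoid a second time would squeeze every value into roughly [0.5, 0.73] and make the tuned thresholds meaningless.

```python
# Per-class thresholds for cat1..cat3, as reported by the Phase 2 output
THRESHOLDS = [0.4256, 0.7498, 0.7172]

def ensemble_predict(probs_a, probs_b, thresholds):
    """Average two probability matrices row by row, then binarize each class with its threshold."""
    preds = []
    for row_a, row_b in zip(probs_a, probs_b):
        avg = [(a + b) / 2 for a, b in zip(row_a, row_b)]
        preds.append([int(p > t) for p, t in zip(avg, thresholds)])
    return preds

# Toy probabilities from two models for two tweets (illustrative only)
probs_a = [[0.90, 0.60, 0.20], [0.30, 0.85, 0.75]]
probs_b = [[0.70, 0.80, 0.10], [0.50, 0.70, 0.80]]
print(ensemble_predict(probs_a, probs_b, THRESHOLDS))
```

The thresholds were tuned on the baseline model's probabilities, so re-tuning them on the averaged probabilities is a reasonable next step before trusting the ensemble score.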


## 📤 Final prediction on ct_test.tsv and export

In [None]:
print(f"🚀 Training final model on FULL train (lr={selected_lr}, epochs={selected_epochs})")

# Load clean model
model_final = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3, problem_type="multi_label_classification"
)

# Train on full dataset
trainer_final = Trainer(
    model=model_final,
    args=TrainingArguments(
        output_dir=os.path.join(base_path, "models/final_model"),
        num_train_epochs=selected_epochs,
        per_device_train_batch_size=16,
        learning_rate=selected_lr,
        report_to="none"
    ),
    train_dataset=dataset
)

trainer_final.train()

# 🧪 Inference on test set
print("🧪 Generating predictions for ct_test.tsv...")
pred_output = trainer_final.predict(test_dataset)
logits_test = pred_output.predictions
probs_test = expit(logits_test)

# 🧾 Load thresholds
thresholds_path = os.path.join(base_path, "data/thresholds.json")
with open(thresholds_path, "r") as f:
    thresholds = json.load(f)

# 🔍 Apply thresholds
preds_test = np.zeros_like(probs_test, dtype=int)  # integer array so the CSV holds 0/1 rather than 0.0/1.0
for i, cat in enumerate(['cat1', 'cat2', 'cat3']):
    preds_test[:, i] = (probs_test[:, i] > thresholds[cat]).astype(int)

# 🧾 Generate predictions.csv
pred_df = pd.DataFrame({
    "index": test_df["index"],
    "cat1_pred": preds_test[:, 0],
    "cat2_pred": preds_test[:, 1],
    "cat3_pred": preds_test[:, 2]
})

output_path = os.path.join(base_path, "predictions/predictions.csv")
pred_df.to_csv(output_path, index=False)

print(f"✅ predictions.csv successfully generated at: {output_path}")

🚀 Training final model on FULL train (lr=3e-05, epochs=6)


Step,Training Loss


🧪 Generating predictions for ct_test.tsv...


✅ predictions.csv successfully generated at: /content/clef2025_task4a/predictions/predictions.csv
