# 02_baseline_transformer_beto: Binary A/D

**Objective:** a baseline with a **Spanish transformer** (BETO here; *roberta-bne* or an equivalent model would also fit) to compare against TF-IDF and rules.  
**Rationale:** pretrained models capture semantics and context; with conservative preprocessing they usually outperform classical methods when the data carries enough signal.


In [1]:
# ===============================================================
# Setup: paths, imports, and shared utilities
# ===============================================================

from pathlib import Path
import pandas as pd
import re, unicodedata, os

# Try to import the shared utilities
try:
    import sys
    sys.path.insert(0, str(Path.cwd()))
    from utils_shared import setup_paths, guess_text_col, guess_label_col, normalize_label
    print("[INFO] Using utils_shared.py")

    # Centralized path setup
    paths = setup_paths()
    BASE_PATH = paths['BASE_PATH']
    DATA_PATH = paths['DATA_PATH']
    SPLITS_PATH = paths['SPLITS_PATH']

    # Use the centralized helpers
    _guess_text_col = guess_text_col
    _guess_label_col = guess_label_col
    _norm_label_bin = normalize_label

except ImportError:
    print("[WARNING] utils_shared.py not found, using local functions")

    # Manual path setup
    BASE_PATH = Path.cwd()
    if BASE_PATH.name == "notebooks":
        BASE_PATH = BASE_PATH.parent

    DATA_PATH = BASE_PATH / "data"
    SPLITS_PATH = DATA_PATH / "splits"

    DATA_PATH.mkdir(exist_ok=True)

    # Local helper functions
    def _guess_text_col(df):
        for c in ["texto", "text", "comment", "comentario"]:
            if c in df.columns:
                return c
        return df.columns[0]

    def _guess_label_col(df):
        for c in ["etiqueta", "label", "category"]:
            if c in df.columns:
                return c
        return df.columns[1] if len(df.columns) > 1 else df.columns[-1]

    def _norm_label_bin(s):
        if pd.isna(s):
            return ""
        s = str(s).strip().lower()
        # Strip accents so 'depresión' and 'depresion' compare equal
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
        return {'depresivo': 'depresion'}.get(s, s)

# Check that the splits exist
if not SPLITS_PATH.exists():
    raise FileNotFoundError(
        f"[ERROR] Splits not found in {SPLITS_PATH}\n"
        f"        Run 02_create_splits.ipynb first"
    )

print("[INFO] Paths configured:")
print(f"  BASE_PATH:   {BASE_PATH}")
print(f"  DATA_PATH:   {DATA_PATH}")
print(f"  SPLITS_PATH: {SPLITS_PATH}")

# Expected columns in dataset_base.csv
TEXT_COL = "texto"
LABEL_COL = "etiqueta"

[INFO] Using utils_shared.py
[INFO] Paths configured:
  BASE_PATH:   /Users/manuelnunez/Projects/psych-phenotyping-paraguay
  DATA_PATH:   /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data
  SPLITS_PATH: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/splits


## 1) Loading and **conservative** preprocessing (preserves accents/casing)

In [2]:
# ===============================================================
# Data loading and CONSERVATIVE preprocessing (BETO/transformer)
# ===============================================================
#
# PREPROCESSING STRATEGY: CONSERVATIVE (minimal)
#
# Why conservative/minimal preprocessing?
#
# 1. **Transformers are pretrained on "natural" text**:
#    - BETO/RoBERTa were trained on Spanish Wikipedia, news, and web text
#    - That text keeps its original casing, accents, and punctuation
#    - Aggressive normalization moves inputs away from the pretraining distribution
#
# 2. **WordPiece tokenization handles variation**:
#    - The cased checkpoint keeps case and accents, so "Depresión" and
#      "depresión" may map to different subtokens
#    - The model learns how such variants relate during pretraining
#    - Manual lowercasing is unnecessary and would not match a cased vocabulary
#
# 3. **Contextual embeddings capture semantics**:
#    - BETO can tell "no tengo apetito" from "tengo apetito" without explicit markers
#    - Attention captures negation implicitly
#    - Heuristics such as "no_X" are not needed
#
# Summary: only elongations are collapsed; everything else is preserved.
# For comparison: the rule-based baseline preserves text for pattern matching,
# TF-IDF normalizes it, and BETO keeps the original distribution.

import pandas as pd, re, unicodedata

# Load the unified splits produced by 02_create_splits.ipynb
dataset_base = pd.read_csv(SPLITS_PATH / 'dataset_base.csv')
train_indices = pd.read_csv(SPLITS_PATH / 'train_indices.csv')['row_id'].values
dev_indices = pd.read_csv(SPLITS_PATH / 'dev_indices.csv')['row_id'].values

text_col = _guess_text_col(dataset_base)
label_col = _guess_label_col(dataset_base)

print(f"[INFO] Splits: {len(train_indices)} train, {len(dev_indices)} val")

# Conservative cleaning function
# The pattern matches ANY character repeated 3+ times, so besides letters it
# also shortens digit runs ("1000") and punctuation runs ("...").
RE_MULTI = re.compile(r'(.)\1{2,}')

def clean_text_trf(s: str) -> str:
    """
    CONSERVATIVE cleaning for transformers (minimal preprocessing).

    Applies ONLY:
    - NFC normalization (canonical composed form for accents)
    - Elongation collapse (holaaa → holaa)
    - Whitespace normalization

    Preserves:
    - Upper and lower case (BETO uses both)
    - Accents (part of the vocabulary)
    - Punctuation (contextual signal), except runs of 3+ identical marks
    - Original text structure
    """
    if pd.isna(s):
        return ""

    s = str(s).strip()
    s = unicodedata.normalize("NFC", s)  # Compose accents (é as one code point, not e + combining accent)
    s = RE_MULTI.sub(r'\1\1', s)         # holaaa → holaa (reduces subword fragmentation)
    s = re.sub(r"\s+", " ", s).strip()   # Collapse repeated whitespace

    return s

dataset_base['texto_trf'] = dataset_base[text_col].map(clean_text_trf)

df_train = dataset_base[dataset_base['row_id'].isin(train_indices)].copy()
df_dev = dataset_base[dataset_base['row_id'].isin(dev_indices)].copy()

X_train, y_train = df_train['texto_trf'], df_train[label_col]
X_dev, y_dev = df_dev['texto_trf'], df_dev[label_col]

print(f"[INFO] Train distribution: {dict(y_train.value_counts())}")
print(f"[INFO] Val distribution: {dict(y_dev.value_counts())}")

[INFO] Splits: 1849 train, 641 val
[INFO] Train distribution: {'depresion': 1270, 'ansiedad': 579}
[INFO] Val distribution: {'depresion': 456, 'ansiedad': 185}
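
To make the effect of the cleaning step concrete, the sketch below re-declares the same regular expression and cleaning logic in a stand-alone form (without the pandas NaN check) and runs it on a few made-up strings. The sample texts are illustrative assumptions, not rows from the dataset. It also shows a side effect of the pattern: because it matches any repeated character, an ellipsis or a run of zeros is shortened too.

```python
import re
import unicodedata

# Same pattern as the notebook: any character repeated 3+ times in a row.
RE_MULTI = re.compile(r"(.)\1{2,}")

def clean_text_trf(s: str) -> str:
    """Stand-alone copy of the conservative cleaner (no pandas NaN check)."""
    s = unicodedata.normalize("NFC", str(s).strip())
    s = RE_MULTI.sub(r"\1\1", s)
    return re.sub(r"\s+", " ", s).strip()

# Made-up examples, not taken from the dataset.
samples = [
    "Holaaaa   estoy  MUY triste",  # elongation + repeated spaces
    "depresio\u0301n",              # 'o' followed by a combining acute accent
    "No puedo dormir...",           # an ellipsis is also a 3+ repetition
    "1000",                         # and so are the zeros in a number
]
for raw in samples:
    print(repr(raw), "->", repr(clean_text_trf(raw)))
```

If digits or ellipses matter for the task, one option would be to restrict the pattern to letters; that would change the preprocessing used for the results reported here, so it is only noted as a possibility.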


## 2) Tokenization and datasets

In [3]:
from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

MODEL = "dccuchile/bert-base-spanish-wwm-cased"  # BETO, Spanish, cased
tok = AutoTokenizer.from_pretrained(MODEL)

# Map label names to numeric IDs
label2id = {'depresion': 0, 'ansiedad': 1}
id2label = {0: 'depresion', 1: 'ansiedad'}

# IMPORTANT: add a numeric 'labels' column (the Trainer expects it)
df_train['labels'] = df_train[label_col].map(label2id)
df_dev['labels'] = df_dev[label_col].map(label2id)

print(f"[INFO] Label mapping: {label2id}")
print(f"[INFO] Train distribution (numeric): {dict(df_train['labels'].value_counts())}")
print(f"[INFO] Dev distribution (numeric): {dict(df_dev['labels'].value_counts())}")

def preprocess(batch):
    # No padding here: the collator pads each batch dynamically
    return tok(batch["texto_trf"], truncation=True, padding=False, max_length=256)

# Build datasets from the texto_trf and labels columns
train_ds = Dataset.from_pandas(df_train[['texto_trf', 'labels']].reset_index(drop=True)).map(preprocess, batched=True, remove_columns=["texto_trf"])
dev_ds = Dataset.from_pandas(df_dev[['texto_trf', 'labels']].reset_index(drop=True)).map(preprocess, batched=True, remove_columns=["texto_trf"])

collator = DataCollatorWithPadding(tokenizer=tok)

print(f"\n[INFO] Datasets created:")
print(f"  Train: {len(train_ds)} examples")
print(f"  Dev: {len(dev_ds)} examples")
print(f"  train_ds columns: {train_ds.column_names}")
print(f"  dev_ds columns: {dev_ds.column_names}")

[INFO] Label mapping: {'depresion': 0, 'ansiedad': 1}
[INFO] Train distribution (numeric): {0: 1270, 1: 579}
[INFO] Dev distribution (numeric): {0: 456, 1: 185}


Map:   0%|          | 0/1849 [00:00<?, ? examples/s]

Map:   0%|          | 0/641 [00:00<?, ? examples/s]


[INFO] Datasets created:
  Train: 1849 examples
  Dev: 641 examples
  train_ds columns: ['labels', 'input_ids', 'token_type_ids', 'attention_mask']
  dev_ds columns: ['labels', 'input_ids', 'token_type_ids', 'attention_mask']
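
The tokenizer call above uses `padding=False` and leaves padding to the collator, which pads each batch only up to its longest sequence. The following plain-Python sketch illustrates that idea; it is not the library's implementation, and `pad_batch`, the `pad_id` value, and the token IDs are hypothetical placeholders.

```python
from typing import Dict, List

def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in this batch (hypothetical helper)."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs: this batch is padded to length 4, not to max_length=256.
out = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(out["input_ids"])
print(out["attention_mask"])
```

With short texts this keeps batches much smaller than padding everything to `max_length`, which is the usual reason for deferring padding to batch time.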


## 3) Training and evaluation

In [4]:
import evaluate, numpy as np
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Mapping already defined in the previous cell
# label2id = {'depresion': 0, 'ansiedad': 1}
# id2label = {0: 'depresion', 1: 'ansiedad'}

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2, id2label=id2label, label2id=label2id)

metric_f1   = evaluate.load("f1")
metric_prec = evaluate.load("precision")
metric_rec  = evaluate.load("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "macro_f1":   metric_f1.compute(predictions=preds, references=labels, average="macro")["f1"],
        "macro_precision": metric_prec.compute(predictions=preds, references=labels, average="macro")["precision"],
        "macro_recall":    metric_rec.compute(predictions=preds, references=labels, average="macro")["recall"],
    }

args = TrainingArguments(
    output_dir="runs/beto_ad",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,          # reload the best checkpoint after training
    metric_for_best_model="macro_f1",
    greater_is_better=True,
    seed=42,
    logging_steps=50
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    tokenizer=tok,  # recent transformers releases deprecate this argument in favor of processing_class
    data_collator=collator,
    compute_metrics=compute_metrics
)

print("[INFO] Starting training...")
print(f"  Epochs: {args.num_train_epochs}")
print(f"  Batch size: {args.per_device_train_batch_size}")
print(f"  Learning rate: {args.learning_rate}")

trainer.train()
eval_res = trainer.evaluate()

import pandas as pd
(pd.DataFrame([eval_res]).to_csv(DATA_PATH/'beto_eval.csv', index=False, encoding='utf-8'))
print(f"\n[INFO] Training finished")
print(f"  Macro F1: {eval_res['eval_macro_f1']:.4f}")
print(f"  Macro Precision: {eval_res['eval_macro_precision']:.4f}")
print(f"  Macro Recall: {eval_res['eval_macro_recall']:.4f}")
print(f"[INFO] Eval saved: {DATA_PATH/'beto_eval.csv'}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[INFO] Starting training...
  Epochs: 3
  Batch size: 16
  Learning rate: 2e-05


Epoch,Training Loss,Validation Loss,Macro F1,Macro Precision,Macro Recall
1,0.3634,0.291494,0.828506,0.907972,0.794518
2,0.2325,0.297656,0.84099,0.867082,0.823429
3,0.1379,0.351497,0.840852,0.875007,0.819707



[INFO] Training finished
  Macro F1: 0.8410
  Macro Precision: 0.8671
  Macro Recall: 0.8234
[INFO] Eval saved: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/beto_eval.csv
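
The reported macro F1 (0.8410) equals the epoch 2 row of the training log, which is consistent with `load_best_model_at_end=True` and `metric_for_best_model="macro_f1"`. The sketch below reproduces that selection rule on the logged numbers in plain Python. It illustrates the rule only, not the Trainer's internal code, and the `history` structure is an assumption made for the example.

```python
# Per-epoch dev metrics copied from the training log above.
history = [
    {"epoch": 1, "val_loss": 0.291494, "macro_f1": 0.828506},
    {"epoch": 2, "val_loss": 0.297656, "macro_f1": 0.840990},
    {"epoch": 3, "val_loss": 0.351497, "macro_f1": 0.840852},
]

# greater_is_better=True: keep the epoch with the highest macro F1.
best = max(history, key=lambda row: row["macro_f1"])
print(f"best epoch by macro F1: {best['epoch']} ({best['macro_f1']:.4f})")

# Selecting by lowest validation loss would pick a different epoch.
lowest_loss = min(history, key=lambda row: row["val_loss"])
print(f"lowest validation loss: epoch {lowest_loss['epoch']}")
```

Validation loss rises after epoch 1 while training loss keeps falling, so the model starts to overfit early; the macro F1 difference between epochs 2 and 3 is very small.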


## 3) Single evaluation (dev set) and exportable metrics

**⚠️ IMPORTANT: HANDLING OF NEUTRAL CASES (BETO vs Rule-Based):**

**BETO is a FORCED BINARY transformer model:**
- AutoModelForSequenceClassification with `num_labels=2`
- Final layer: two logits, turned into class probabilities by softmax (ansiedad, depresion)
- It **cannot abstain** or produce "neutral" predictions
- It always assigns each case to the class with the highest output probability

**Difference from Rule-Based:**
- **Rule-Based:** can return "neutral" (~78.4% of cases have no matches)
  - Rule system: if there is NO match → neutral
  - For comparison, neutrals are converted to the majority class
  
- **BETO:** always binary (0% neutrals)
  - Neural network: learns a semantic representation
  - Even ambiguous texts are forced into one class
  - Probabilities [0.51, 0.49] → predicts the class with the higher probability

**Difference from TF-IDF:**
- **TF-IDF:** learns from char n-grams (surface form of the text)
- **BETO:** learns from semantic context (pretrained embeddings)
- Both are **forced binary** (0% neutrals)

**Implications for the comparison:**
1. ✅ BETO is directly comparable with Dummy/TF-IDF (all binary)
2. ⚠️ The comparison with Rule-Based is UNFAIR:
   - Low RB F1 = 78% neutrals + errors
   - High BETO F1 = forced decision + semantic context
3. 📊 BETO vs TF-IDF: both cover 100% of the dataset and compete on discrimination

**In this notebook:**
- There is NO conversion of neutrals (purely binary model)
- Metrics reflect the discriminative ability of the transformer
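
The forced-choice behaviour described above can be illustrated with a few lines of plain Python: softmax turns two logits into probabilities, and argmax always returns one of the two labels, even for a near-tie. The `with_abstention` function and its `min_conf` threshold are hypothetical, shown only to contrast with a system that can return no label; nothing like it is used in this notebook.

```python
import math
from typing import List, Optional

LABELS = ["depresion", "ansiedad"]  # same id order as label2id in this notebook

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forced_choice(logits: List[float]) -> str:
    """What argmax does: always return one of the two labels."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]

def with_abstention(logits: List[float], min_conf: float = 0.6) -> Optional[str]:
    """Hypothetical variant: return None when the top probability is below min_conf."""
    probs = softmax(logits)
    top = max(probs)
    return LABELS[probs.index(top)] if top >= min_conf else None

ambiguous = [0.04, 0.0]            # softmax gives roughly [0.51, 0.49]
print(softmax(ambiguous))
print(forced_choice(ambiguous))    # still a class label
print(with_abstention(ambiguous))  # None under the assumed 0.6 threshold
```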

In [5]:
import numpy as np, pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

pred_logits = trainer.predict(dev_ds).predictions
pred_ids = pred_logits.argmax(axis=-1)
y_true = df_dev["labels"].to_numpy()  # numeric 'labels' column, same row order as dev_ds

# Export paths
beto_pred_csv   = DATA_PATH/'beto_predictions.csv'
beto_report_csv = DATA_PATH/'beto_classification_report.csv'
beto_eval_csv   = DATA_PATH/'beto_eval.csv'  # already written above
beto_cm_csv     = DATA_PATH/'beto_confusion_matrix.csv'

pd.DataFrame(classification_report(y_true, pred_ids, target_names=['depresion','ansiedad'], output_dict=True, zero_division=0)).transpose().to_csv(beto_report_csv, index=True, encoding='utf-8')

cm = confusion_matrix(y_true, pred_ids, labels=[0,1])
pd.DataFrame(cm, index=['true_depresion','true_ansiedad'], columns=['pred_depresion','pred_ansiedad']).to_csv(beto_cm_csv)

# With texts (useful for error analysis)
dev_out = df_dev.copy()
dev_out["y_true"] = dev_out["labels"].map({0:"depresion",1:"ansiedad"})
dev_out["y_pred"] = [id2label[i] for i in pred_ids]
dev_out.to_csv(beto_pred_csv, index=False, encoding="utf-8")

print("[INFO] Exported:")
print(f"  - Predictions: {beto_pred_csv}")
print(f"  - Report: {beto_report_csv}")
print(f"  - Eval: {beto_eval_csv}")
print(f"  - Confusion matrix: {beto_cm_csv}")

# Show the report in the console
print("\n" + "="*60)
print("CLASSIFICATION REPORT (BETO)")
print("="*60)
print(classification_report(y_true, pred_ids, target_names=['depresion','ansiedad'], zero_division=0))
print("\nConfusion matrix:")
print(pd.DataFrame(cm, index=['true_depresion','true_ansiedad'], columns=['pred_depresion','pred_ansiedad']))

[INFO] Exported:
  - Predictions: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/beto_predictions.csv
  - Report: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/beto_classification_report.csv
  - Eval: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/beto_eval.csv
  - Confusion matrix: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/beto_confusion_matrix.csv

CLASSIFICATION REPORT (BETO)
              precision    recall  f1-score   support

   depresion       0.89      0.95      0.92       456
    ansiedad       0.85      0.70      0.77       185

    accuracy                           0.88       641
   macro avg       0.87      0.82      0.84       641
weighted avg       0.87      0.88      0.87       641


Confusion matrix:
                pred_depresion  pred_ansiedad
true_depresion             433             23
true_ansiedad               56            129
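
As a cross-check, the per-class and macro metrics in the report can be recomputed by hand from the four cells of the confusion matrix. The sketch below does this in plain Python; the dictionary layout and the `per_class` helper are choices made for this example, not part of the notebook.

```python
# Confusion matrix from the dev set (rows = true label, columns = predicted label).
cm = {
    "depresion": {"depresion": 433, "ansiedad": 23},
    "ansiedad":  {"depresion": 56,  "ansiedad": 129},
}
labels = list(cm)

def per_class(label: str):
    """Precision, recall, and F1 for one class, read off the matrix."""
    tp = cm[label][label]
    fp = sum(cm[other][label] for other in labels if other != label)  # column minus diagonal
    fn = sum(cm[label][other] for other in labels if other != label)  # row minus diagonal
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

scores = {label: per_class(label) for label in labels}
macro_f1 = sum(s[2] for s in scores.values()) / len(scores)
accuracy = sum(cm[l][l] for l in labels) / sum(sum(row.values()) for row in cm.values())

for label, (p, r, f1) in scores.items():
    print(f"{label:>10}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1:.4f}, accuracy = {accuracy:.2f}")
```

The off-diagonal cells are shared between the classes: the 56 anxiety cases predicted as depression are false positives for depression and, at the same time, false negatives for anxiety. This is why anxiety recall (129/185) is the weakest figure in the report.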


## 4) Error analysis (FP/FN)

In [6]:
# Export errors for qualitative analysis
# Uses dev_out, which already holds y_true/y_pred as text labels
# Note: in a binary task, false positives of one class are the false negatives
# of the other, so the four files below contain two distinct sets of rows.

fp_depresion = dev_out[(dev_out['y_true'] == 'ansiedad') & (dev_out['y_pred'] == 'depresion')].copy()
fp_depresion['error_type'] = 'FP_depresion'

fn_depresion = dev_out[(dev_out['y_true'] == 'depresion') & (dev_out['y_pred'] == 'ansiedad')].copy()
fn_depresion['error_type'] = 'FN_depresion'

fp_ansiedad = dev_out[(dev_out['y_true'] == 'depresion') & (dev_out['y_pred'] == 'ansiedad')].copy()
fp_ansiedad['error_type'] = 'FP_ansiedad'

fn_ansiedad = dev_out[(dev_out['y_true'] == 'ansiedad') & (dev_out['y_pred'] == 'depresion')].copy()
fn_ansiedad['error_type'] = 'FN_ansiedad'

beto_fp_dep_csv = DATA_PATH / 'beto_fp_depresion.csv'
beto_fn_dep_csv = DATA_PATH / 'beto_fn_depresion.csv'
beto_fp_ans_csv = DATA_PATH / 'beto_fp_ansiedad.csv'
beto_fn_ans_csv = DATA_PATH / 'beto_fn_ansiedad.csv'

fp_depresion[['texto_trf', 'y_true', 'y_pred', 'error_type']].to_csv(beto_fp_dep_csv, index=False, encoding='utf-8')
fn_depresion[['texto_trf', 'y_true', 'y_pred', 'error_type']].to_csv(beto_fn_dep_csv, index=False, encoding='utf-8')
fp_ansiedad[['texto_trf', 'y_true', 'y_pred', 'error_type']].to_csv(beto_fp_ans_csv, index=False, encoding='utf-8')
fn_ansiedad[['texto_trf', 'y_true', 'y_pred', 'error_type']].to_csv(beto_fn_ans_csv, index=False, encoding='utf-8')

print("[INFO] Error analysis exported:")
print(f"  FP depresion: {len(fp_depresion)} cases → {beto_fp_dep_csv.name}")
print(f"  FN depresion: {len(fn_depresion)} cases → {beto_fn_dep_csv.name}")
print(f"  FP ansiedad:  {len(fp_ansiedad)} cases → {beto_fp_ans_csv.name}")
print(f"  FN ansiedad:  {len(fn_ansiedad)} cases → {beto_fn_ans_csv.name}")

[INFO] Error analysis exported:
  FP depresion: 56 cases → beto_fp_depresion.csv
  FN depresion: 23 cases → beto_fn_depresion.csv
  FP ansiedad:  23 cases → beto_fp_ansiedad.csv
  FN ansiedad:  56 cases → beto_fn_ansiedad.csv


## 5) Cross-Validation 5-Fold (Patient-Level)

**Objective:** estimate F1 over all 90 patients and quantify the model's variance across folds.

**Strategy:**
- 5-fold StratifiedKFold at the patient level (stratified by each patient's majority label)
- Trains a fresh BETO model in each fold
- Reports macro F1 ± std and a 95% interval (mean ± 1.96·std across folds)

**Comparison:**
- TF-IDF CV: F1 = 0.850 ± 0.031
- BETO CV: expected ~0.84-0.86

---

**⚠️ HANDLING OF NEUTRALS IN CV:**

**BETO does NOT produce neutral predictions in any fold:**
- Transformer with forced binary classification (2 classes)
- Each case → softmax(logits) → argmax → final class
- CV variance reflects the **model's ability to generalize**
- There is NO conversion of neutrals (unlike Rule-Based)

**Difference from Rule-Based CV:**
- **Rule-Based:** ~78% neutrals per fold → converted to the majority class
  - CV variance = dataset heterogeneity + variable coverage
  - Low F1 because of limited vocabulary coverage
  
- **BETO:** 0% neutrals (purely binary)
  - CV variance = ability to generalize + semantic variability
  - F1 reflects actual discrimination (no penalty for coverage)

**Difference from TF-IDF CV:**
- Both are purely binary (0% neutrals)
- TF-IDF: variance driven by char patterns
- BETO: variance driven by semantic representation
- Direct comparison is valid (same "rules of the game")

**Expected interpretation:**
- If BETO ≈ TF-IDF → semantic context does not help more than char patterns
- If BETO > TF-IDF → semantics captures nuances that the surface form misses
- If BETO < TF-IDF → overfitting, or a dataset too small for a transformer
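
The key property of the patient-level strategy is that all cases from one patient land in the same fold, while folds stay roughly balanced by each patient's majority label. The sketch below shows that idea in plain Python on made-up data. It is a simplified stand-in for `StratifiedKFold` applied to patient IDs, not the scikit-learn algorithm, and `patient_level_folds` and the toy rows are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def patient_level_folds(rows, n_splits=5, seed=42):
    """rows: list of (patient_id, label). Returns a dict patient_id -> fold index.

    Each patient gets their majority label; patients of each label are
    shuffled and dealt round-robin across folds.
    """
    by_patient = defaultdict(list)
    for pid, label in rows:
        by_patient[pid].append(label)
    majority = {pid: Counter(lbls).most_common(1)[0][0] for pid, lbls in by_patient.items()}

    by_label = defaultdict(list)
    for pid in sorted(majority):
        by_label[majority[pid]].append(pid)

    rng = random.Random(seed)
    fold_of = {}
    for label in sorted(by_label):
        pids = by_label[label]
        rng.shuffle(pids)
        for i, pid in enumerate(pids):
            fold_of[pid] = i % n_splits
    return fold_of

# Toy data (made up): 10 patients with 3 posts each; patients 0-5 are mostly
# 'depresion', patients 6-9 mostly 'ansiedad'.
rows = []
for pid in range(10):
    main = "depresion" if pid < 6 else "ansiedad"
    other = "ansiedad" if main == "depresion" else "depresion"
    rows += [(pid, main), (pid, main), (pid, other)]

fold_of = patient_level_folds(rows, n_splits=2)
for fold in range(2):
    test_rows = [r for r in rows if fold_of[r[0]] == fold]
    train_rows = [r for r in rows if fold_of[r[0]] != fold]
    shared = {p for p, _ in test_rows} & {p for p, _ in train_rows}
    print(f"fold {fold}: {len(train_rows)} train rows, {len(test_rows)} test rows, "
          f"shared patients: {len(shared)}")
```

Because folds are built from patients rather than cases, the number of cases per fold varies with how much each patient wrote, which is also visible in the differing test-set sizes of the real run.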

In [7]:
# ===============================================================
# Cross-Validation 5-Fold (Patient-Level)
# ===============================================================

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, precision_score, recall_score
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
from datasets import Dataset

print("="*80)
print("CROSS-VALIDATION 5-FOLD (PATIENT-LEVEL) - BETO")
print("="*80)
print()

# Configuration
N_SPLITS = 5
RANDOM_STATE = 42
MODEL = "dccuchile/bert-base-spanish-wwm-cased"
MAX_LEN = 256
BATCH_SIZE = 8
EPOCHS = 3

# Prepare the full dataset (all patients, regardless of the train/dev split)
df_full = dataset_base.copy()
df_full = df_full.dropna(subset=[text_col, label_col]).copy()
df_full['texto_trf'] = df_full[text_col].map(clean_text_trf)

# Map labels
label2id = {'depresion': 0, 'ansiedad': 1}
id2label = {0: 'depresion', 1: 'ansiedad'}
df_full['labels'] = df_full[label_col].map(label2id)

print(f"✓ Full dataset: {len(df_full)} cases, {df_full['patient_id'].nunique()} patients")
print(f"✓ Distribution: {df_full[label_col].value_counts().to_dict()}")
print()

# Stratify at the patient level (majority label per patient)
patient_labels = df_full.groupby('patient_id')[label_col].agg(
    lambda x: x.value_counts().index[0]
).reset_index()
patient_labels.columns = ['patient_id', 'label_majority']

patient_ids = patient_labels['patient_id'].values
patient_y = patient_labels['label_majority'].map(label2id).values

# Create folds
skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)

# Tokenizer
tok = AutoTokenizer.from_pretrained(MODEL)

def tokenize_fn(examples):
    return tok(examples['texto_trf'], truncation=True, max_length=MAX_LEN)

# Run CV
cv_results = []

for fold_idx, (train_pat_idx, test_pat_idx) in enumerate(skf.split(patient_ids, patient_y), start=1):
    print(f"Fold {fold_idx}/{N_SPLITS}:", end=" ")

    # Patients in this fold
    train_pats = patient_ids[train_pat_idx]
    test_pats = patient_ids[test_pat_idx]

    # Select the cases of those patients
    train_df_fold = df_full[df_full['patient_id'].isin(train_pats)].copy()
    test_df_fold = df_full[df_full['patient_id'].isin(test_pats)].copy()

    n_train_cases = len(train_df_fold)
    n_test_cases = len(test_df_fold)

    print(f"{len(train_pats)} train patients ({n_train_cases} cases), {len(test_pats)} test patients ({n_test_cases} cases)")

    # Build HuggingFace datasets
    train_ds_fold = Dataset.from_pandas(train_df_fold[['texto_trf', 'labels']]).map(tokenize_fn, batched=True)
    test_ds_fold = Dataset.from_pandas(test_df_fold[['texto_trf', 'labels']]).map(tokenize_fn, batched=True)

    # Fresh model for each fold
    model_fold = AutoModelForSequenceClassification.from_pretrained(
        MODEL,
        num_labels=2,
        id2label=id2label,
        label2id=label2id
    )

    # Training args (no checkpoints, no best-model selection inside the fold)
    train_args = TrainingArguments(
        output_dir=f'./runs/beto_cv_fold{fold_idx}',
        num_train_epochs=EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        learning_rate=2e-5,
        logging_steps=100,
        save_strategy='no',
        report_to='none'
    )

    # Trainer
    trainer_fold = Trainer(
        model=model_fold,
        args=train_args,
        train_dataset=train_ds_fold,
        eval_dataset=test_ds_fold,
        data_collator=DataCollatorWithPadding(tok)
    )

    # Train
    trainer_fold.train()

    # Predict
    preds = trainer_fold.predict(test_ds_fold)
    pred_ids_fold = preds.predictions.argmax(axis=-1)
    y_true_fold = test_df_fold['labels'].to_numpy()

    # Metrics
    f1 = f1_score(y_true_fold, pred_ids_fold, average='macro', zero_division=0)
    prec = precision_score(y_true_fold, pred_ids_fold, average='macro', zero_division=0)
    rec = recall_score(y_true_fold, pred_ids_fold, average='macro', zero_division=0)

    cv_results.append({
        'fold': fold_idx,
        'f1_macro': f1,
        'precision': prec,
        'recall': rec,
        'n_train_patients': len(train_pats),
        'n_test_patients': len(test_pats),
        'n_train_cases': n_train_cases,
        'n_test_cases': n_test_cases
    })

    print(f"  → F1={f1:.3f}, Prec={prec:.3f}, Rec={rec:.3f}")

# Results
df_cv = pd.DataFrame(cv_results)

print()
print("="*80)
print("CROSS-VALIDATION RESULTS")
print("="*80)
print()
print(df_cv[['fold', 'f1_macro', 'precision', 'recall', 'n_test_patients']].to_string(index=False))
print()

# Statistics
f1_mean = df_cv['f1_macro'].mean()
f1_std = df_cv['f1_macro'].std()
f1_min = df_cv['f1_macro'].min()
f1_max = df_cv['f1_macro'].max()
# Note: mean ± 1.96·std describes the spread of fold scores. It is not a
# confidence interval for the mean, which would use std / sqrt(N_SPLITS).
f1_ci95_lower = f1_mean - 1.96 * f1_std
f1_ci95_upper = f1_mean + 1.96 * f1_std

print("📊 STATISTICS:")
print(f"   Macro F1:  {f1_mean:.3f} ± {f1_std:.3f}")
print(f"   95% CI:    [{f1_ci95_lower:.3f}, {f1_ci95_upper:.3f}]")
print(f"   Range:     [{f1_min:.3f}, {f1_max:.3f}]")
print()
print(f"   Precision: {df_cv['precision'].mean():.3f} ± {df_cv['precision'].std():.3f}")
print(f"   Recall:    {df_cv['recall'].mean():.3f} ± {df_cv['recall'].std():.3f}")
print()

# Comparison (TF-IDF figures are hard-coded from the TF-IDF baseline notebook)
print("🔍 COMPARISON WITH TF-IDF:")
print(f"   TF-IDF CV: F1 = 0.850 ± 0.031 (95% CI: [0.789, 0.910])")
print(f"   BETO CV:   F1 = {f1_mean:.3f} ± {f1_std:.3f} (95% CI: [{f1_ci95_lower:.3f}, {f1_ci95_upper:.3f}])")
print()

# Interval overlap is a rough heuristic, not a formal significance test
if f1_ci95_lower > 0.910:
    print("✅ BETO significantly outperforms TF-IDF (95% CIs do not overlap)")
elif f1_ci95_upper < 0.789:
    print("⚠️ TF-IDF significantly outperforms BETO (95% CIs do not overlap)")
else:
    print("➡️ BETO and TF-IDF are comparable (95% CIs overlap)")

print()

# Export
cv_output_dir = DATA_PATH / 'cv_results'
cv_output_dir.mkdir(exist_ok=True)
cv_output = cv_output_dir / 'beto_cv_results.csv'
df_cv.to_csv(cv_output, index=False)

print(f"💾 Results saved: {cv_output}")
print()
print("="*80)

CROSS-VALIDATION 5-FOLD (PATIENT-LEVEL) - BETO

✓ Full dataset: 3126 cases, 90 patients
✓ Distribution: {'depresion': 2201, 'ansiedad': 925}

Fold 1/5: 72 train patients (2428 cases), 18 test patients (698 cases)


Map:   0%|          | 0/2428 [00:00<?, ? examples/s]

Map:   0%|          | 0/698 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
100,0.4804
200,0.3442
300,0.3532
400,0.2407
500,0.2321
600,0.2198
700,0.0971
800,0.133
900,0.1291


  → F1=0.779, Prec=0.795, Rec=0.767
Fold 2/5: 72 train patients (2517 cases), 18 test patients (609 cases)


Map:   0%|          | 0/2517 [00:00<?, ? examples/s]

Map:   0%|          | 0/609 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
100,0.5355
200,0.3714
300,0.3262
400,0.2414
500,0.2544
600,0.2053
700,0.1427
800,0.1006
900,0.1246


  → F1=0.870, Prec=0.888, Rec=0.859
Fold 3/5: 72 train patients (2493 cases), 18 test patients (633 cases)


Map:   0%|          | 0/2493 [00:00<?, ? examples/s]

Map:   0%|          | 0/633 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
100,0.4982
200,0.3449
300,0.3476
400,0.3037
500,0.2731
600,0.2252
700,0.174
800,0.1264
900,0.1689


  → F1=0.813, Prec=0.805, Rec=0.826
Fold 4/5: 72 train patients (2514 cases), 18 test patients (612 cases)


Map:   0%|          | 0/2514 [00:00<?, ? examples/s]

Map:   0%|          | 0/612 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
100,0.5688
200,0.4031
300,0.3486
400,0.3127
500,0.262
600,0.2966
700,0.227
800,0.1366
900,0.1202


  → F1=0.839, Prec=0.868, Rec=0.820
Fold 5/5: 72 train patients (2552 cases), 18 test patients (574 cases)


Map:   0%|          | 0/2552 [00:00<?, ? examples/s]

Map:   0%|          | 0/574 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
100,0.5696
200,0.3605
300,0.3393
400,0.263
500,0.2237
600,0.2549
700,0.1246
800,0.1649
900,0.1244


  → F1=0.804, Prec=0.790, Rec=0.831

CROSS-VALIDATION RESULTS

 fold  f1_macro  precision   recall  n_test_patients
    1  0.778591   0.794630 0.767131               18
    2  0.870396   0.887678 0.858761               18
    3  0.813167   0.804542 0.826402               18
    4  0.839193   0.867715 0.819831               18
    5  0.804448   0.790021 0.831385               18

📊 STATISTICS:
   Macro F1:  0.821 ± 0.035
   95% CI:    [0.753, 0.890]
   Range:     [0.779, 0.870]

   Precision: 0.829 ± 0.045
   Recall:    0.821 ± 0.033

🔍 COMPARISON WITH TF-IDF:
   TF-IDF CV: F1 = 0.850 ± 0.031 (95% CI: [0.789, 0.910])
   BETO CV:   F1 = 0.821 ± 0.035 (95% CI: [0.753, 0.890])

➡️ BETO and TF-IDF are comparable (95% CIs overlap)

💾 Results saved: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/cv_results/beto_cv_results.csv
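
The interval reported above is the mean ± 1.96 times the standard deviation of the five fold scores. The sketch below recomputes it from the per-fold macro F1 values and, for contrast, also computes a Student-t interval for the mean. The t-based interval is an assumed alternative, not what the notebook reports; the constant 2.776 is the two-sided 95% t quantile for 4 degrees of freedom.

```python
import math
import statistics

# Per-fold macro F1 from the results table above.
f1_folds = [0.778591, 0.870396, 0.813167, 0.839193, 0.804448]

n = len(f1_folds)
mean = statistics.mean(f1_folds)
sd = statistics.stdev(f1_folds)  # sample SD (ddof=1), like pandas .std()

# What the notebook reports: mean ± 1.96 * SD (spread of fold scores).
band = (mean - 1.96 * sd, mean + 1.96 * sd)

# Assumed alternative: t-based interval for the mean of n = 5 folds.
T_CRIT_DF4 = 2.776
half_width = T_CRIT_DF4 * sd / math.sqrt(n)
ci_mean = (mean - half_width, mean + half_width)

print(f"mean = {mean:.3f}, SD = {sd:.3f}")
print(f"mean ± 1.96·SD    : [{band[0]:.3f}, {band[1]:.3f}]")
print(f"t-interval (mean) : [{ci_mean[0]:.3f}, {ci_mean[1]:.3f}]")
```

With only five folds, the t-interval for the mean is narrower than the ± 1.96·SD band, so conclusions drawn from "overlapping intervals" depend on which of the two is used. Fold scores are also not independent (training sets overlap), so either interval is best read as a rough indication.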



## 6) Exported results and next steps

**✅ Files generated by this baseline:**

Dev set evaluation:
- `beto_predictions.csv` - Per-case predictions
- `beto_eval.csv` - Aggregated macro metrics
- `beto_classification_report.csv` - Per-class report
- `beto_confusion_matrix.csv` - Confusion matrix
- `beto_fp_depresion.csv`, `beto_fn_depresion.csv`, `beto_fp_ansiedad.csv`, `beto_fn_ansiedad.csv` - Misclassified cases for error analysis

Cross-Validation:
- `cv_results/beto_cv_results.csv` - 5-fold CV results

---

**📊 For the full comparative analysis:**
→ Run notebook: `02_comparacion_resultados.ipynb`

That notebook consolidates all CV results, computes statistics (95% CI), compares models, and generates visualizations and interpretation for the paper/thesis.

---

**📝 Methodological notes:**
- **Dataset:** dataset_base.csv (3,155 cases, 90 patients; the CV cell, which drops rows with missing text or label, logs 3,126 cases)
- **Split:** Patient-level 60/20/20 (0% leakage)
- **CV:** 5-fold patient-level stratified (72 train / 18 test patients per fold, per the CV log)
- **Model:** BETO (dccuchile/bert-base-spanish-wwm-cased)
- **Hyperparameters:** epochs=3, lr=2e-5, max_length=256, batch_size=16 for the dev-set run and 8 for CV
- **Preprocessing:** Conservative (preserves casing, accents, punctuation)