# 02_baseline_dummy — Sanity Check Baselines

**Objective:** Implement trivial baselines (majority class, stratified random) to verify that the ML models capture real patterns. A trained model that does not clearly beat these scores has learned little beyond the class distribution.

**Exports:**
- `data/dummy_majority_eval.csv` + classification_report
- `data/dummy_stratified_eval.csv` + classification_report
- `data/dummy_cv_results.csv` (per-fold cross-validation metrics)

**For the full comparison:** see `05_comparacion_resultados.ipynb`

In [1]:
# ===============================================================
# Setup: imports and path configuration
# ===============================================================
import sys
from pathlib import Path
import pandas as pd
import numpy as np
from sklearn.dummy import DummyClassifier

# Import shared utilities
try:
    from utils_shared import setup_paths, load_splits, calculate_metrics, get_cv_splitter
    paths = setup_paths()
    DATA_PATH = paths['DATA_PATH']
    SPLITS_PATH = paths['SPLITS_PATH']
    print("[OK] Usando utils_shared.py")
except ImportError:
    print("[ERROR] No se encontró utils_shared.py. Verifica que estás en el directorio correcto.")
    raise

[OK] Usando utils_shared.py


## 1) Load data and splits

We use the same patient-level splits as the other baselines (0% leakage: no patient appears in more than one split).

In [2]:
# Load the datasets aligned with the Denoised strategy
try:
    # Train: use train_denoised (clinical signal)
    df_train = pd.read_csv(SPLITS_PATH / 'train_denoised.csv')

    # Dev: build from the splits (full dataset)
    df_base, _, dev_idx, _ = load_splits(SPLITS_PATH)
    df_dev = df_base.set_index('row_id').loc[dev_idx].reset_index()
    
    print(f"Train (Denoised): {len(df_train)} casos")
    print(f"Dev (Full): {len(df_dev)} casos")

    # Prepare X, y
    X_train = df_train['texto'].values
    y_train = df_train['etiqueta'].values

    X_dev = df_dev['texto'].values
    y_dev = df_dev['etiqueta'].values

    print(f"\nDistribución train:")
    print(pd.Series(y_train).value_counts())
    print(f"\nDistribución dev:")
    print(pd.Series(y_dev).value_counts())

except FileNotFoundError:
    print("[ERROR] No se encontraron los datasets. Ejecuta 03_rule_based_denoising.ipynb primero.")
    raise

Dataset base: 3131 casos
Train: 1867 casos
Dev: 627 casos

Distribución train:
depresion    1388
ansiedad      479
Name: count, dtype: int64

Distribución dev:
depresion    342
ansiedad     285
Name: count, dtype: int64
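
The "0% leakage" property of patient-level splits can be checked directly by confirming that no patient identifier appears in more than one split. The following is a minimal sketch, not the project's own check: the helper name `patient_overlap` and the toy identifiers are hypothetical, and it assumes each split exposes a patient identifier such as the `patient_id` column used later in this notebook.

```python
def patient_overlap(train_patients, dev_patients):
    """Return the sorted patient IDs that appear in both splits."""
    return sorted(set(train_patients) & set(dev_patients))


# Toy identifiers, for illustration only (not taken from the dataset).
train_ids = ["p01", "p01", "p02", "p03"]
dev_ids = ["p04", "p05", "p05"]
leaky_dev_ids = ["p03", "p04"]

clean = patient_overlap(train_ids, dev_ids)
leaky = patient_overlap(train_ids, leaky_dev_ids)

print(f"Shared patients (clean split): {clean}")
print(f"Shared patients (leaky split): {leaky}")
```

With the notebook's frames the call would look like `patient_overlap(df_train['patient_id'], df_dev['patient_id'])`, assuming both frames carry that column. An empty list is what a leakage-free split should return.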


## 2) Baseline 1: Majority Class

**Always** predicts the majority class (`depresion`).

In [3]:
# Fit (only learns which class is the most frequent)
dummy_majority = DummyClassifier(strategy='most_frequent', random_state=42)
dummy_majority.fit(X_train, y_train)

# Predict on the dev set
y_pred_majority = dummy_majority.predict(X_dev)

# Compute metrics
metrics_majority = calculate_metrics(y_dev, y_pred_majority)

print("=" * 60)
print("DUMMY BASELINE: MAJORITY CLASS")
print("=" * 60)
print(f"Macro F1: {metrics_majority['f1_macro']:.4f}")
print(f"Macro Precision: {metrics_majority['precision_macro']:.4f}")
print(f"Macro Recall: {metrics_majority['recall_macro']:.4f}")
print()
print(metrics_majority['report'])

# Export results
eval_majority = pd.DataFrame([{
    'modelo': 'dummy_majority',
    'f1_macro': metrics_majority['f1_macro'],
    'precision_macro': metrics_majority['precision_macro'],
    'recall_macro': metrics_majority['recall_macro'],
    'n_dev': len(y_dev)
}])
eval_majority.to_csv(DATA_PATH / 'dummy_majority_eval.csv', index=False)
print(f"✓ Exportado: {DATA_PATH / 'dummy_majority_eval.csv'}")

DUMMY BASELINE: MAJORITY CLASS
Macro F1: 0.3529
Macro Precision: 0.2727
Macro Recall: 0.5000

              precision    recall  f1-score   support

    ansiedad       0.00      0.00      0.00       285
   depresion       0.55      1.00      0.71       342

    accuracy                           0.55       627
   macro avg       0.27      0.50      0.35       627
weighted avg       0.30      0.55      0.39       627

✓ Exportado: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_majority_eval.csv
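
The majority-class scores follow directly from the dev class counts, so they can be reproduced by hand as a sanity check on the metric code. The sketch below uses only the support values printed in the report above (285 `ansiedad`, 342 `depresion`); treating the never-predicted class as scoring 0 mirrors what the report shows and is stated here as an assumption about how `calculate_metrics` handles division by zero.

```python
# Dev-set class counts, taken from the classification report above.
n_ansiedad = 285
n_depresion = 342
n_total = n_ansiedad + n_depresion

# The classifier always answers "depresion".
precision_dep = n_depresion / n_total  # every prediction is "depresion"
recall_dep = 1.0                       # every true "depresion" case is hit
f1_dep = 2 * precision_dep * recall_dep / (precision_dep + recall_dep)

# "ansiedad" is never predicted, so its precision, recall and F1 are taken as 0
# (assumption: undefined precision is reported as 0, as in the report above).
precision_ans = recall_ans = f1_ans = 0.0

# Macro averages weight both classes equally, regardless of support.
macro_precision = (precision_dep + precision_ans) / 2
macro_recall = (recall_dep + recall_ans) / 2
macro_f1 = (f1_dep + f1_ans) / 2

print(f"Macro F1: {macro_f1:.4f}")
print(f"Macro Precision: {macro_precision:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")
```

The macro average is what makes this baseline look weak: accuracy is 0.55, but the class that is never predicted pulls macro F1 down to about 0.35.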


## 3) Baseline 2: Stratified Random

Draws each prediction at random from the class distribution observed in the training set, ignoring the input text.

In [4]:
# Fit (only learns the class distribution)
dummy_stratified = DummyClassifier(strategy='stratified', random_state=42)
dummy_stratified.fit(X_train, y_train)

# Predict on the dev set
y_pred_stratified = dummy_stratified.predict(X_dev)

# Compute metrics
metrics_stratified = calculate_metrics(y_dev, y_pred_stratified)

print("=" * 60)
print("DUMMY BASELINE: STRATIFIED RANDOM")
print("=" * 60)
print(f"Macro F1: {metrics_stratified['f1_macro']:.4f}")
print(f"Macro Precision: {metrics_stratified['precision_macro']:.4f}")
print(f"Macro Recall: {metrics_stratified['recall_macro']:.4f}")
print()
print(metrics_stratified['report'])

# Export results
eval_stratified = pd.DataFrame([{
    'modelo': 'dummy_stratified',
    'f1_macro': metrics_stratified['f1_macro'],
    'precision_macro': metrics_stratified['precision_macro'],
    'recall_macro': metrics_stratified['recall_macro'],
    'n_dev': len(y_dev)
}])
eval_stratified.to_csv(DATA_PATH / 'dummy_stratified_eval.csv', index=False)
print(f"✓ Exportado: {DATA_PATH / 'dummy_stratified_eval.csv'}")

DUMMY BASELINE: STRATIFIED RANDOM
Macro F1: 0.4764
Macro Precision: 0.4945
Macro Recall: 0.4956

              precision    recall  f1-score   support

    ansiedad       0.45      0.26      0.33       285
   depresion       0.54      0.73      0.62       342

    accuracy                           0.52       627
   macro avg       0.49      0.50      0.48       627
weighted avg       0.50      0.52      0.49       627

✓ Exportado: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_stratified_eval.csv
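
Because the stratified dummy predicts independently of the input, its scores can be approximated from class proportions alone. The sketch below assumes the train counts printed in section 1 (1388 `depresion`, 479 `ansiedad`) are the ones the classifier was fitted on. Expected recall for a class is then the probability of predicting it, and expected precision is roughly that class's share of the dev set. Plugging those into the F1 formula is an approximation rather than the exact expectation of F1, so the result should only be close to the observed values, not equal to them.

```python
# Class counts as printed earlier in this notebook.
train_counts = {"depresion": 1388, "ansiedad": 479}
dev_counts = {"depresion": 342, "ansiedad": 285}

n_train = sum(train_counts.values())
n_dev = sum(dev_counts.values())

expected = {}
for label in train_counts:
    p_pred = train_counts[label] / n_train  # chance the dummy predicts this label
    prevalence = dev_counts[label] / n_dev  # share of dev cases with this label
    recall = p_pred          # predictions are independent of the true label
    precision = prevalence   # approximation: predicted cases mirror dev prevalence
    f1 = 2 * precision * recall / (precision + recall)
    expected[label] = {"precision": precision, "recall": recall, "f1": f1}

macro_precision = sum(m["precision"] for m in expected.values()) / len(expected)
macro_recall = sum(m["recall"] for m in expected.values()) / len(expected)
macro_f1 = sum(m["f1"] for m in expected.values()) / len(expected)

for label, m in expected.items():
    print(f"{label}: precision~{m['precision']:.2f} recall~{m['recall']:.2f} f1~{m['f1']:.2f}")
print(f"Approximate macro F1: {macro_f1:.4f}")
```

The approximation lands near the observed per-class values (recall 0.73 and 0.26, precision 0.54 and 0.45), which suggests the measured scores are what pure chance under the train prior would produce.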


## 4) Cross-Validation (5-Fold)

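For a more robust evaluation, we combine Train + Dev and run cross-validation that is stratified by class and grouped by patient, so that no patient appears in both the training and validation portion of a fold.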

In [5]:
from utils_shared import get_cv_splitter, calculate_metrics
from sklearn.dummy import DummyClassifier

# Combine Train + Dev for CV
# ignore_index=True gives the combined frame a clean 0..n-1 index, matching the positional .iloc lookups below
df_full = pd.concat([df_train, df_dev], ignore_index=True)
X_full = df_full['texto']
y_full = df_full['etiqueta']
groups_full = df_full['patient_id']  # Use patient_id directly as the CV group

cv = get_cv_splitter(n_splits=5)
cv_results = []

print("Iniciando Cross-Validation (Dummy Stratified)...")

for fold, (train_idx, val_idx) in enumerate(cv.split(X_full, y_full, groups_full)):
    X_tr, y_tr = X_full.iloc[train_idx], y_full.iloc[train_idx]
    X_val, y_val = X_full.iloc[val_idx], y_full.iloc[val_idx]
    
    # Dummy Stratified
    clf = DummyClassifier(strategy='stratified', random_state=42 + fold)
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_val)
    
    metrics = calculate_metrics(y_val, y_pred)
    cv_results.append({
        'fold': fold + 1,
        'model': 'Dummy Stratified',
        'f1_macro': metrics['f1_macro'],
        'precision_macro': metrics['precision_macro'],
        'recall_macro': metrics['recall_macro']
    })
    print(f"Fold {fold+1}: F1={metrics['f1_macro']:.4f}")

df_cv = pd.DataFrame(cv_results)
print("\nPromedio CV:")
print(df_cv.mean(numeric_only=True))

out_path = DATA_PATH / 'dummy_cv_results.csv'
df_cv.to_csv(out_path, index=False)
print(f"✓ Exportado: {out_path}")

Iniciando Cross-Validation (Dummy Stratified)...
Fold 1: F1=0.4632
Fold 2: F1=0.5085
Fold 3: F1=0.5131
Fold 4: F1=0.4461
Fold 5: F1=0.4760

Promedio CV:
fold               3.000000
f1_macro           0.481367
precision_macro    0.492582
recall_macro       0.491253
dtype: float64
✓ Exportado: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_cv_results.csv
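
The fold-level scores are easier to compare against other models when reported as mean and spread rather than mean alone. The sketch below recomputes both from the five rounded fold values printed above. Because it starts from rounded numbers, the mean can differ from the notebook's `0.481367` in the fifth decimal place.

```python
from statistics import mean, stdev

# Macro F1 per fold, copied from the cross-validation output above.
fold_f1 = [0.4632, 0.5085, 0.5131, 0.4461, 0.4760]

f1_mean = mean(fold_f1)
f1_std = stdev(fold_f1)  # sample standard deviation (n - 1 in the denominator)

print(f"Macro F1 (5-fold): {f1_mean:.4f} ± {f1_std:.4f}")
```

The single-split stratified score of 0.4764 lies within one standard deviation of this mean, so the dev-set result is consistent with the cross-validated estimate.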
