# 02_baseline_dummy: Sanity Check Baselines

**Objective:** Implement trivial baselines (majority class and stratified random) to verify that the ML models capture real patterns rather than class imbalance alone.

**Exports:**
- `data/dummy_majority_eval.csv` + classification report
- `data/dummy_stratified_eval.csv` + classification report

**For the full comparison:** see `02_comparacion_resultados.ipynb`

In [1]:
# ===============================================================
# Setup: imports and path configuration
# ===============================================================
import sys
from pathlib import Path
import pandas as pd
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

# Import shared utilities
try:
    from utils_shared import setup_paths, load_splits
    paths = setup_paths()
    DATA_PATH = paths['DATA_PATH']
    SPLITS_PATH = paths['SPLITS_PATH']
    print("[OK] Using utils_shared.py")
except ImportError:
    print("[WARNING] utils_shared.py not found, using manual configuration")
    BASE_PATH = Path.cwd()
    if BASE_PATH.name == "notebooks":
        BASE_PATH = BASE_PATH.parent
    DATA_PATH = BASE_PATH / "data"
    SPLITS_PATH = DATA_PATH / "splits"

print(f" DATA_PATH: {DATA_PATH}")
print(f" SPLITS_PATH: {SPLITS_PATH}")

[OK] Using utils_shared.py
 DATA_PATH: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data
 SPLITS_PATH: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/splits


## 1) Load data and splits

We use the same patient-level splits as the other baselines (no patient appears in more than one split, so there is 0% leakage).
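
A patient-level split means that every `patient_id` falls in exactly one split. A minimal sketch of how that property could be checked, using a small made-up frame in place of `dataset_base.csv` (the toy rows, the row lists, and the `find_leaked_patients` helper are illustrative assumptions, not part of the project):

```python
import pandas as pd

# Toy stand-in for dataset_base.csv (same column names, invented rows).
df_toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3, 4, 5],
    "patient_id": ["p1", "p1", "p2", "p3", "p3", "p4"],
    "etiqueta": ["depresion", "depresion", "ansiedad",
                 "depresion", "ansiedad", "ansiedad"],
})

# Hypothetical split indices, as they would be read from the train/dev index files.
train_rows = [0, 1, 2]
dev_rows = [3, 4, 5]


def find_leaked_patients(df, rows_a, rows_b):
    """Return the set of patient_ids that appear in both groups of row_ids."""
    indexed = df.set_index("row_id")
    patients_a = set(indexed.loc[rows_a, "patient_id"])
    patients_b = set(indexed.loc[rows_b, "patient_id"])
    return patients_a & patients_b


leaked = find_leaked_patients(df_toy, train_rows, dev_rows)
print(f"Patients shared by train and dev: {sorted(leaked)}")  # an empty list means no leakage
```

An empty intersection is what "0% leakage" amounts to; any shared `patient_id` would let a model see text from a dev patient during training.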

---

**⚠️ IMPORTANT - HANDLING OF NEUTRAL CASES (Dummy vs Rule-Based):**

**Key difference:**
- **Dummy baselines:** These are **forced binary** models (they always predict anxiety or depression)
  - Majority: always predicts the majority class (depression)
  - Stratified: predicts at random following the training class distribution (about 69/31 depression/anxiety in the training split)
  - **They never produce "neutral" predictions**

- **Rule-Based:** Can return "neutral" when it finds NO matches (78.4% of cases)
  - It has 3 possible outputs: anxiety, depression, neutral
  - To compare it with the dummy baselines, we map neutral → majority class

**Implications for the comparison:**
1. ✅ Dummy and ML models (TF-IDF/BETO) are directly comparable (all binary)
2. ⚠️ Rule-Based is NOT directly comparable (it produces neutral predictions)
3. ✅ Strategy: map Rule-Based neutral predictions to the majority class to force a comparison
4. 📊 A low Rule-Based F1 reflects two things: the ~78% of neutral cases mapped to the majority class (wrong whenever the true label is anxiety) plus the errors within the ~22% of cases where a rule fired

**In this notebook:**
- Dummy baselines produce no neutral predictions → pure binary metrics
- We establish a **minimum floor** for the comparison (does the model learn anything useful?)
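
The neutral → majority mapping described above can be sketched as follows. The gold labels and rule-based outputs here are invented for illustration, and the variable names are assumptions; the Rule-Based notebook may implement the conversion differently.

```python
from sklearn.metrics import f1_score

MAJORITY_LABEL = "depresion"  # majority class in the training split

# Invented gold labels and three-way rule-based outputs, for illustration only.
y_true = ["depresion", "ansiedad", "depresion", "ansiedad", "depresion", "ansiedad"]
rb_raw = ["neutral", "neutral", "depresion", "ansiedad", "neutral", "depresion"]

# Force a binary prediction: every "neutral" becomes the majority class.
rb_binary = [MAJORITY_LABEL if pred == "neutral" else pred for pred in rb_raw]

# Coverage = share of cases where a rule actually fired.
coverage = sum(pred != "neutral" for pred in rb_raw) / len(rb_raw)
f1_forced = f1_score(y_true, rb_binary, average="macro", zero_division=0)

print(f"Coverage (non-neutral share): {coverage:.2f}")
print(f"Macro F1 after neutral -> majority: {f1_forced:.3f}")
```

Every neutral case whose true label is the minority class becomes an error after the mapping, which is why low coverage drags the forced macro F1 toward the majority-class baseline.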

In [2]:
# Load the base dataset
df = pd.read_csv(SPLITS_PATH / "dataset_base.csv")
print(f"Base dataset: {len(df)} cases")
print(f"Columns: {df.columns.tolist()}")

# Load the split indices
train_idx = pd.read_csv(SPLITS_PATH / "train_indices.csv")['row_id'].values
dev_idx = pd.read_csv(SPLITS_PATH / "dev_indices.csv")['row_id'].values

print(f"\nTrain: {len(train_idx)} cases")
print(f"Val: {len(dev_idx)} cases")

# Build X, y for train and val, using row_id as the index
df_indexed = df.set_index('row_id')
X_train = df_indexed.loc[train_idx, 'texto'].values
y_train = df_indexed.loc[train_idx, 'etiqueta'].values
X_dev = df_indexed.loc[dev_idx, 'texto'].values
y_dev = df_indexed.loc[dev_idx, 'etiqueta'].values

print("\nTrain distribution:")
print(pd.Series(y_train).value_counts())
print("\nVal distribution:")
print(pd.Series(y_dev).value_counts())

Base dataset: 3127 cases
Columns: ['row_id', 'patient_id', 'texto', 'etiqueta']

Train: 1849 cases
Val: 641 cases

Train distribution:
depresion    1270
ansiedad      579
Name: count, dtype: int64

Val distribution:
depresion    456
ansiedad     185
Name: count, dtype: int64


## 2) Baseline 1: Majority Class

**Always** predicts the majority class (depression).

In [3]:
# Fit (only learns the majority class)
dummy_majority = DummyClassifier(strategy='most_frequent', random_state=42)
dummy_majority.fit(X_train, y_train)

# Predict on the validation set
y_pred_majority = dummy_majority.predict(X_dev)

# Metrics. 'ansiedad' is never predicted, so its precision is undefined;
# zero_division=0 reports it as 0 without raising UndefinedMetricWarning.
f1_majority = f1_score(y_dev, y_pred_majority, average='macro', zero_division=0)
prec_majority = precision_score(y_dev, y_pred_majority, average='macro', zero_division=0)
rec_majority = recall_score(y_dev, y_pred_majority, average='macro', zero_division=0)

print("=" * 60)
print("DUMMY BASELINE: MAJORITY CLASS")
print("=" * 60)
print(f"Macro F1: {f1_majority:.4f}")
print(f"Macro Precision: {prec_majority:.4f}")
print(f"Macro Recall: {rec_majority:.4f}")
print()

# Full classification report
report_majority = classification_report(y_dev, y_pred_majority, output_dict=True, zero_division=0)
report_majority_df = pd.DataFrame(report_majority).transpose()
print(report_majority_df)

# Export macro metrics
eval_majority = pd.DataFrame([{
    'modelo': 'dummy_majority',
    'f1_macro': f1_majority,
    'precision_macro': prec_majority,
    'recall_macro': rec_majority,
    'n_dev': len(y_dev)
}])
eval_majority.to_csv(DATA_PATH / 'dummy_majority_eval.csv', index=False)
print(f"\n✓ Exported: {DATA_PATH / 'dummy_majority_eval.csv'}")

# Export classification report
report_majority_df.to_csv(DATA_PATH / 'dummy_majority_classification_report.csv')
print(f"✓ Exported: {DATA_PATH / 'dummy_majority_classification_report.csv'}")

DUMMY BASELINE: MAJORITY CLASS
Macro F1: 0.4157
Macro Precision: 0.3557
Macro Recall: 0.5000

              precision    recall  f1-score     support
ansiedad       0.000000  0.000000  0.000000  185.000000
depresion      0.711388  1.000000  0.831358  456.000000
accuracy       0.711388  0.711388  0.711388    0.711388
macro avg      0.355694  0.500000  0.415679  641.000000
weighted avg   0.506074  0.711388  0.591419  641.000000

✓ Exported: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_majority_eval.csv
✓ Exported: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_majority_classification_report.csv
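
The majority-class scores reported for the dev set follow directly from its class counts (456 `depresion`, 185 `ansiedad`); no model is needed to obtain them. A short worked check in plain Python:

```python
# Dev-set class counts taken from the distribution printed in section 1.
n_dep, n_anx = 456, 185

# Always predicting "depresion": every depression case is a true positive,
# and every anxiety case is a false positive for the depression class.
prec_dep = n_dep / (n_dep + n_anx)
rec_dep = 1.0
f1_dep = 2 * prec_dep * rec_dep / (prec_dep + rec_dep)

# "ansiedad" is never predicted: its recall is 0, and its undefined
# precision (0 predictions) is treated as 0, so its F1 is 0 as well.
prec_anx = rec_anx = f1_anx = 0.0

macro_f1 = (f1_dep + f1_anx) / 2
macro_precision = (prec_dep + prec_anx) / 2
macro_recall = (rec_dep + rec_anx) / 2

print(f"Macro F1: {macro_f1:.4f}")
print(f"Macro Precision: {macro_precision:.4f}")
print(f"Macro Recall: {macro_recall:.4f}")
```

Macro recall is pinned at 0.5 for any always-one-class predictor on a binary task, so the macro F1 of this baseline depends only on how imbalanced the evaluation set is.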


## 3) Baseline 2: Stratified Random

Predicts at random, following the class distribution of the training set.
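
Since this baseline is random, a single seed gives one draw from a distribution of scores. A minimal sketch of how that spread could be inspected, assuming synthetic label arrays with the same class counts as the train and dev splits (the seed range of 200 is an arbitrary choice, and individual values will not match the notebook's run because the row order differs):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Synthetic labels with the class counts reported for the train and dev splits.
y_train_toy = np.array(["depresion"] * 1270 + ["ansiedad"] * 579)
y_dev_toy = np.array(["depresion"] * 456 + ["ansiedad"] * 185)

# DummyClassifier ignores the features, so placeholder columns are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_dev_toy = np.zeros((len(y_dev_toy), 1))

scores = []
for seed in range(200):
    clf = DummyClassifier(strategy="stratified", random_state=seed)
    clf.fit(X_train_toy, y_train_toy)
    y_pred = clf.predict(X_dev_toy)
    scores.append(f1_score(y_dev_toy, y_pred, average="macro", zero_division=0))

scores = np.array(scores)
print(f"Macro F1 over {len(scores)} seeds: mean={scores.mean():.3f}, std={scores.std(ddof=1):.3f}")
print(f"Range: [{scores.min():.3f}, {scores.max():.3f}]")
```

Reporting the mean over seeds (or the range) makes the floor less dependent on the particular `random_state` used in the cell below.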

In [4]:
# Fit (only learns the class distribution)
dummy_stratified = DummyClassifier(strategy='stratified', random_state=42)
dummy_stratified.fit(X_train, y_train)

# Predict on the validation set
y_pred_stratified = dummy_stratified.predict(X_dev)

# Metrics
f1_stratified = f1_score(y_dev, y_pred_stratified, average='macro')
prec_stratified = precision_score(y_dev, y_pred_stratified, average='macro')
rec_stratified = recall_score(y_dev, y_pred_stratified, average='macro')

print("=" * 60)
print("DUMMY BASELINE: STRATIFIED RANDOM")
print("=" * 60)
print(f"Macro F1: {f1_stratified:.4f}")
print(f"Macro Precision: {prec_stratified:.4f}")
print(f"Macro Recall: {rec_stratified:.4f}")
print()

# Full classification report
report_stratified = classification_report(y_dev, y_pred_stratified, output_dict=True)
report_stratified_df = pd.DataFrame(report_stratified).transpose()
print(report_stratified_df)

# Export macro metrics
eval_stratified = pd.DataFrame([{
    'modelo': 'dummy_stratified',
    'f1_macro': f1_stratified,
    'precision_macro': prec_stratified,
    'recall_macro': rec_stratified,
    'n_dev': len(y_dev)
}])
eval_stratified.to_csv(DATA_PATH / 'dummy_stratified_eval.csv', index=False)
print(f"\n✓ Exported: {DATA_PATH / 'dummy_stratified_eval.csv'}")

# Export classification report
report_stratified_df.to_csv(DATA_PATH / 'dummy_stratified_classification_report.csv')
print(f"✓ Exported: {DATA_PATH / 'dummy_stratified_classification_report.csv'}")

DUMMY BASELINE: STRATIFIED RANDOM
Macro F1: 0.4826
Macro Precision: 0.4835
Macro Recall: 0.4826

              precision    recall  f1-score     support
ansiedad       0.266010  0.291892  0.278351  185.000000
depresion      0.700913  0.673246  0.686801  456.000000
accuracy       0.563183  0.563183  0.563183    0.563183
macro avg      0.483462  0.482569  0.482576  641.000000
weighted avg   0.575395  0.563183  0.568917  641.000000

✓ Exported: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_stratified_eval.csv
✓ Exported: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_stratified_classification_report.csv


## 4) Export results (comparable with other baselines)

**Standardized outputs:**
- Same file-naming scheme as the TF-IDF, BETO, and Rule-Based baselines
- CSV format with macro metrics (F1, precision, recall)
- Detailed per-class classification reports

**About neutral predictions:**
- ✅ Dummy baselines produce NO neutral predictions (pure binary models)
- ✅ No conversion of predictions is required
- ✅ Metrics are directly comparable with TF-IDF and BETO
- ⚠️ Rule-Based does produce neutral predictions → its metrics include a coverage penalty

**Exported files:**
1. `dummy_majority_eval.csv` - Aggregate metrics (macro F1, precision, recall)
2. `dummy_majority_classification_report.csv` - Detailed per-class report
3. `dummy_stratified_eval.csv` - Aggregate metrics
4. `dummy_stratified_classification_report.csv` - Detailed per-class report
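
If every baseline writes its `*_eval.csv` with the same columns, as the standardization above intends, the files can be stacked into one comparison table. A minimal sketch, assuming the column layout written by this notebook; it writes two small stand-in files into a temporary directory so that it runs on its own. The `collect_eval_files` helper is hypothetical, and the metric values are rounded from the outputs above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def collect_eval_files(data_dir):
    """Stack every *_eval.csv in data_dir and sort by macro F1 (best first)."""
    frames = [pd.read_csv(path) for path in sorted(Path(data_dir).glob("*_eval.csv"))]
    table = pd.concat(frames, ignore_index=True)
    return table.sort_values("f1_macro", ascending=False).reset_index(drop=True)


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Stand-ins for the exported files (values rounded from this notebook's outputs).
    pd.DataFrame([{"modelo": "dummy_majority", "f1_macro": 0.4157,
                   "precision_macro": 0.3557, "recall_macro": 0.5000,
                   "n_dev": 641}]).to_csv(tmp_dir / "dummy_majority_eval.csv", index=False)
    pd.DataFrame([{"modelo": "dummy_stratified", "f1_macro": 0.4826,
                   "precision_macro": 0.4835, "recall_macro": 0.4826,
                   "n_dev": 641}]).to_csv(tmp_dir / "dummy_stratified_eval.csv", index=False)

    comparison = collect_eval_files(tmp_dir)

print(comparison[["modelo", "f1_macro"]])
```

Keeping `n_dev` in each file gives a quick check that all rows in the stacked table were evaluated on the same dev set.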

In [5]:
# ===============================================================
# CROSS-VALIDATION 5-FOLD - DUMMY BASELINES
# ===============================================================
#
# ⚠️ HANDLING OF NEUTRAL PREDICTIONS IN CV:
#
# Dummy baselines ARE PURE BINARY MODELS:
#   - Majority: always predicts the majority class (depression)
#   - Stratified: predicts at random following the training class distribution (roughly 70/30)
#   - They do NOT produce "neutral" predictions
#
# DIFFERENCE FROM RULE-BASED CV:
#   - Rule-Based: produces ~78% neutral predictions → maps them to the majority class
#   - Dummy: produces no neutral predictions → direct binary predictions
#   - TF-IDF/BETO: same as dummy (pure binary)
#
# INTERPRETING CV VARIANCE:
#   - Dummy Majority: variance = differences in class distribution across folds
#   - Dummy Stratified: variance = chance + class distribution (stochastic baseline)
#   - If an ML model has a clearly higher F1 with similar variance → it is learning real patterns
#
# ===============================================================

from sklearn.model_selection import StratifiedKFold

print("="*80)
print("CROSS-VALIDATION 5-FOLD - DUMMY BASELINES")
print("="*80)
print()

# Configuration
N_SPLITS = 5
RANDOM_STATE = 42

# Prepare the full dataset
df_full = pd.read_csv(SPLITS_PATH / 'dataset_base.csv')
df_full = df_full.dropna(subset=['texto', 'etiqueta']).copy()

print(f"✓ Full dataset: {len(df_full)} cases")
print(f"✓ Unique patients: {df_full['patient_id'].nunique()}")
print()

# Majority label per patient (used to stratify the folds)
patient_labels = df_full.groupby('patient_id')['etiqueta'].agg(
    lambda x: x.value_counts().index[0]
).reset_index()
patient_labels.columns = ['patient_id', 'label_majority']

# Build stratified folds over patients (not over individual cases)
skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)
patient_ids = patient_labels['patient_id'].values
patient_y = patient_labels['label_majority'].values

# ===============================================================
# CV for MAJORITY CLASS
# ===============================================================
print("🔹 MAJORITY CLASS CV:")
print("-" * 80)
cv_majority_results = []

for fold_idx, (train_patient_idx, test_patient_idx) in enumerate(skf.split(patient_ids, patient_y), start=1):
    # Patients in this fold
    train_patients = patient_ids[train_patient_idx]
    test_patients = patient_ids[test_patient_idx]

    # Select the cases that belong to those patients
    train_df = df_full[df_full['patient_id'].isin(train_patients)]
    test_df = df_full[df_full['patient_id'].isin(test_patients)]

    X_train_cv = train_df['texto'].values
    y_train_cv = train_df['etiqueta'].values
    X_test_cv = test_df['texto'].values
    y_test_cv = test_df['etiqueta'].values

    # Fit and predict (ALWAYS predicts the majority class, never "neutral")
    dummy_maj_cv = DummyClassifier(strategy='most_frequent', random_state=42)
    dummy_maj_cv.fit(X_train_cv, y_train_cv)
    y_pred_cv = dummy_maj_cv.predict(X_test_cv)

    # Metrics
    f1_cv = f1_score(y_test_cv, y_pred_cv, average='macro', zero_division=0)
    prec_cv = precision_score(y_test_cv, y_pred_cv, average='macro', zero_division=0)
    rec_cv = recall_score(y_test_cv, y_pred_cv, average='macro', zero_division=0)

    cv_majority_results.append({
        'fold': fold_idx,
        'f1_macro': f1_cv,
        'precision': prec_cv,
        'recall': rec_cv,
        'n_train_patients': len(train_patients),
        'n_test_patients': len(test_patients),
        'n_test_cases': len(X_test_cv)
    })

    print(f"Fold {fold_idx}: F1={f1_cv:.3f}, Prec={prec_cv:.3f}, Rec={rec_cv:.3f}")

# Majority results
cv_majority_df = pd.DataFrame(cv_majority_results)
f1_maj_mean = cv_majority_df['f1_macro'].mean()
f1_maj_std = cv_majority_df['f1_macro'].std()
# Note: mean ± 1.96·std is the approximate range covering 95% of fold scores
# (assuming normality); it is wider than a confidence interval for the mean.
f1_maj_ci_lower = f1_maj_mean - 1.96 * f1_maj_std
f1_maj_ci_upper = f1_maj_mean + 1.96 * f1_maj_std

print()
print("📊 MAJORITY CLASS - Statistics:")
print(f"   F1 macro:  {f1_maj_mean:.3f} ± {f1_maj_std:.3f}")
print(f"   CI95%:     [{f1_maj_ci_lower:.3f}, {f1_maj_ci_upper:.3f}]")
print(f"   Min-Max:   [{cv_majority_df['f1_macro'].min():.3f}, {cv_majority_df['f1_macro'].max():.3f}]")
print()

# ===============================================================
# CV for STRATIFIED RANDOM
# ===============================================================
print("🔹 STRATIFIED RANDOM CV:")
print("-" * 80)
cv_stratified_results = []

for fold_idx, (train_patient_idx, test_patient_idx) in enumerate(skf.split(patient_ids, patient_y), start=1):
    # Patients in this fold
    train_patients = patient_ids[train_patient_idx]
    test_patients = patient_ids[test_patient_idx]

    # Select the cases that belong to those patients
    train_df = df_full[df_full['patient_id'].isin(train_patients)]
    test_df = df_full[df_full['patient_id'].isin(test_patients)]

    X_train_cv = train_df['texto'].values
    y_train_cv = train_df['etiqueta'].values
    X_test_cv = test_df['texto'].values
    y_test_cv = test_df['etiqueta'].values

    # Fit and predict (samples from the class distribution, never "neutral")
    dummy_strat_cv = DummyClassifier(strategy='stratified', random_state=42)
    dummy_strat_cv.fit(X_train_cv, y_train_cv)
    y_pred_cv = dummy_strat_cv.predict(X_test_cv)

    # Metrics
    f1_cv = f1_score(y_test_cv, y_pred_cv, average='macro', zero_division=0)
    prec_cv = precision_score(y_test_cv, y_pred_cv, average='macro', zero_division=0)
    rec_cv = recall_score(y_test_cv, y_pred_cv, average='macro', zero_division=0)

    cv_stratified_results.append({
        'fold': fold_idx,
        'f1_macro': f1_cv,
        'precision': prec_cv,
        'recall': rec_cv,
        'n_train_patients': len(train_patients),
        'n_test_patients': len(test_patients),
        'n_test_cases': len(X_test_cv)
    })

    print(f"Fold {fold_idx}: F1={f1_cv:.3f}, Prec={prec_cv:.3f}, Rec={rec_cv:.3f}")

# Stratified results
cv_stratified_df = pd.DataFrame(cv_stratified_results)
f1_strat_mean = cv_stratified_df['f1_macro'].mean()
f1_strat_std = cv_stratified_df['f1_macro'].std()
# Same caveat as above: this is a spread of fold scores, not a CI for the mean.
f1_strat_ci_lower = f1_strat_mean - 1.96 * f1_strat_std
f1_strat_ci_upper = f1_strat_mean + 1.96 * f1_strat_std

print()
print("📊 STRATIFIED RANDOM - Statistics:")
print(f"   F1 macro:  {f1_strat_mean:.3f} ± {f1_strat_std:.3f}")
print(f"   CI95%:     [{f1_strat_ci_lower:.3f}, {f1_strat_ci_upper:.3f}]")
print(f"   Min-Max:   [{cv_stratified_df['f1_macro'].min():.3f}, {cv_stratified_df['f1_macro'].max():.3f}]")
print()

# ===============================================================
# EXPORT CV RESULTS
# ===============================================================
cv_output_dir = DATA_PATH / 'cv_results'
cv_output_dir.mkdir(exist_ok=True)

cv_majority_df.to_csv(cv_output_dir / 'dummy_majority_cv_results.csv', index=False)
cv_stratified_df.to_csv(cv_output_dir / 'dummy_stratified_cv_results.csv', index=False)

print("💾 Results exported:")
print(f"   - {cv_output_dir / 'dummy_majority_cv_results.csv'}")
print(f"   - {cv_output_dir / 'dummy_stratified_cv_results.csv'}")
print()
print("="*80)
print("✅ Dummy baselines cross-validation completed")
print("="*80)

CROSS-VALIDATION 5-FOLD - DUMMY BASELINES

✓ Full dataset: 3126 cases
✓ Unique patients: 90

🔹 MAJORITY CLASS CV:
--------------------------------------------------------------------------------
Fold 1: F1=0.417, Prec=0.358, Rec=0.500
Fold 2: F1=0.395, Prec=0.327, Rec=0.500
Fold 3: F1=0.408, Prec=0.345, Rec=0.500
Fold 4: F1=0.422, Prec=0.364, Rec=0.500
Fold 5: F1=0.423, Prec=0.366, Rec=0.500

📊 MAJORITY CLASS - Statistics:
   F1 macro:  0.413 ± 0.011
   CI95%:     [0.391, 0.435]
   Min-Max:   [0.395, 0.423]

🔹 STRATIFIED RANDOM CV:
--------------------------------------------------------------------------------
Fold 1: F1=0.499, Prec=0.499, Rec=0.499
Fold 2: F1=0.485, Prec=0.487, Rec=0.488
Fold 3: F1=0.486, Prec=0.486, Rec=0.487
Fold 4: F1=0.495, Prec=0.496, Rec=0.496
Fold 5: F1=0.487, Prec=0.489, Rec=0.487

📊 STRATIFIED RANDOM - Statistics:
   F1 macro:  0.491 ± 0.006
   CI95%:     [0.478, 0.503]
   Min-Max:   [0.485, 0.499]

💾 Results exported:
 

## 5) Exported results and next steps

**✅ Files generated by this baseline:**

Dev-set evaluation (written by the cells above):
- `dummy_majority_eval.csv` - Aggregate macro metrics
- `dummy_majority_classification_report.csv` - Per-class report
- `dummy_stratified_eval.csv` - Aggregate macro metrics
- `dummy_stratified_classification_report.csv` - Per-class report

Also listed for this baseline, but not written by any cell shown in this notebook:
- `dummy_majority_predictions.csv` - Per-case predictions
- `dummy_majority_confusion_matrix.csv` - Confusion matrix
- `dummy_stratified_predictions.csv` - Per-case predictions
- `dummy_stratified_confusion_matrix.csv` - Confusion matrix

Cross-validation:
- `cv_results/dummy_majority_cv_results.csv` - 5-fold CV results
- `cv_results/dummy_stratified_cv_results.csv` - 5-fold CV results
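
For the per-case prediction and confusion-matrix files in the list above, a minimal sketch of how they could be produced from the dev labels and a baseline's predictions. Toy values stand in for `dev_idx`, `y_dev`, and `y_pred_majority`; the column names `y_true`/`y_pred` and the table layout are assumptions, not the project's actual schema.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

LABELS = ["ansiedad", "depresion"]

# Toy stand-ins for dev_idx, y_dev, and the majority-class predictions.
row_ids = [10, 11, 12, 13, 14]
y_true = ["depresion", "ansiedad", "depresion", "depresion", "ansiedad"]
y_pred = ["depresion"] * len(y_true)

# Per-case predictions: one row per dev case.
predictions_df = pd.DataFrame({"row_id": row_ids, "y_true": y_true, "y_pred": y_pred})

# Confusion matrix with rows = true class and columns = predicted class.
cm = confusion_matrix(y_true, y_pred, labels=LABELS)
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{label}" for label in LABELS],
    columns=[f"pred_{label}" for label in LABELS],
)

print(predictions_df)
print(cm_df)

# In the notebook these could then be written next to the other exports, e.g.:
# predictions_df.to_csv(DATA_PATH / "dummy_majority_predictions.csv", index=False)
# cm_df.to_csv(DATA_PATH / "dummy_majority_confusion_matrix.csv")
```

Passing `labels=LABELS` fixes the row and column order, so the matrix keeps a column for `ansiedad` even when a baseline never predicts it.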

---

**📊 For the full comparative analysis:**
→ Run the notebook `02_comparacion_resultados.ipynb`

That notebook consolidates all CV results, computes statistics (95% CI), compares models, and produces the figures and interpretation for the paper/thesis.

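---

**📝 Methodological notes:**
- **Dataset:** dataset_base.csv (3,127 cases; 3,126 after dropping rows with missing text or label; 90 patients)
- **Split:** Patient-level 60/20/20 (0% leakage)
- **CV:** 5-fold patient-level stratified (72 train / 18 test patients per fold)
- **Variance:** Both baselines vary little across folds. Dummy Majority: F1 0.413 ± 0.011 (deterministic given the training labels, so the variation comes from class balance across folds). Stratified: F1 0.491 ± 0.006 (random predictions with a fixed seed)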
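
The interval printed by the CV cell is `mean ± 1.96 · std`, which describes the spread of individual fold scores. An interval for the mean itself is usually computed from the standard error with a Student-t critical value. A sketch using the five Majority fold scores from the CV output (rounded to three decimals, so results may differ slightly from the unrounded values; 2.776 is the two-sided 95% t value for 4 degrees of freedom, hard-coded here):

```python
import math
import statistics

# Majority-class macro F1 per fold, as printed in the CV output (3 decimals).
fold_f1 = [0.417, 0.395, 0.408, 0.422, 0.423]

n = len(fold_f1)
mean_f1 = statistics.mean(fold_f1)
std_f1 = statistics.stdev(fold_f1)   # sample standard deviation (ddof=1), like pandas .std()
std_err = std_f1 / math.sqrt(n)      # standard error of the mean

T_CRIT_DF4 = 2.776                   # t(0.975, df=4), hard-coded to avoid a SciPy dependency
half_width = T_CRIT_DF4 * std_err

print(f"mean = {mean_f1:.3f}, std = {std_f1:.3f}")
print(f"mean ± 1.96·std : [{mean_f1 - 1.96 * std_f1:.3f}, {mean_f1 + 1.96 * std_f1:.3f}]")
print(f"t-based 95% CI  : [{mean_f1 - half_width:.3f}, {mean_f1 + half_width:.3f}]")
```

With only five folds the t-based interval is still rough, but it answers the question usually asked when comparing models: how uncertain the mean score is, rather than how much single folds scatter.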