# 02_baseline_dummy: Sanity Check Baselines

**Objective:** Implement trivial baselines (majority class, stratified random) as a performance floor, to verify that the ML models capture real patterns rather than overfitting or merely reflecting the class distribution.

**Exports:**
- `data/dummy_majority_eval.csv` with the majority baseline's metrics
- `data/dummy_stratified_eval.csv` with the stratified random baseline's metrics
- `data/02_baselines_con_dummy.csv` consolidated table with ALL models (including the dummy baselines)

In [1]:
# ===============================================================
# Setup: imports and path configuration
# ===============================================================
import sys
from pathlib import Path
import pandas as pd
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

# Import shared utilities
try:
    from utils_shared import setup_paths, load_splits
    paths = setup_paths()
    DATA_PATH = paths['DATA_PATH']
    SPLITS_PATH = paths['SPLITS_PATH']
    print("[OK] Using utils_shared.py")
except ImportError:
    print("[WARNING] utils_shared.py not found, using manual configuration")
    BASE_PATH = Path.cwd()
    if BASE_PATH.name == "notebooks":
        BASE_PATH = BASE_PATH.parent
    DATA_PATH = BASE_PATH / "data"
    SPLITS_PATH = DATA_PATH / "splits"

print(f" DATA_PATH: {DATA_PATH}")
print(f" SPLITS_PATH: {SPLITS_PATH}")

[OK] Using utils_shared.py
 DATA_PATH: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data
 SPLITS_PATH: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/splits


## 1) Load data and splits

We use the same patient-level splits as the other baselines, so no patient appears in both train and validation (0% leakage).
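
The no-leakage property of a patient-level split can be checked directly. The following is a minimal sketch, not part of the original notebook: the helper `patient_overlap` and the toy patient IDs are hypothetical, and measuring leakage as the share of validation patients also seen in train is an assumption about how the 0% figure is defined.

```python
def patient_overlap(train_patient_ids, val_patient_ids):
    """Return the set of patient IDs that appear in both splits."""
    return set(train_patient_ids) & set(val_patient_ids)


# Toy data (hypothetical IDs): several rows per patient,
# but each patient belongs to exactly one split.
train_patients = ["p01", "p01", "p02", "p03"]
val_patients = ["p04", "p04", "p05"]

overlap = patient_overlap(train_patients, val_patients)

# Assumed definition: share of validation patients that also occur in train.
leakage_pct = 100 * len(overlap) / len(set(val_patients))
print(f"Shared patients: {len(overlap)} ({leakage_pct:.1f}% of validation patients)")
```

With the notebook's data, the two inputs would be `df_indexed.loc[train_idx, 'patient_id']` and `df_indexed.loc[val_idx, 'patient_id']`, once `df_indexed` has been built in the next cell.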

In [2]:
# Load base dataset
df = pd.read_csv(SPLITS_PATH / "dataset_base.csv")
print(f"Base dataset: {len(df)} cases")
print(f"Columns: {df.columns.tolist()}")

# Load splits
train_idx = pd.read_csv(SPLITS_PATH / "train_indices.csv")['row_id'].values
val_idx = pd.read_csv(SPLITS_PATH / "val_indices.csv")['row_id'].values

print(f"\nTrain: {len(train_idx)} cases")
print(f"Val: {len(val_idx)} cases")

# Build X, y for train and val using row_id as the index
df_indexed = df.set_index('row_id')
X_train = df_indexed.loc[train_idx, 'texto'].values
y_train = df_indexed.loc[train_idx, 'etiqueta'].values
X_val = df_indexed.loc[val_idx, 'texto'].values
y_val = df_indexed.loc[val_idx, 'etiqueta'].values

print("\nTrain distribution:")
print(pd.Series(y_train).value_counts())
print("\nVal distribution:")
print(pd.Series(y_val).value_counts())

Base dataset: 3155 cases
Columns: ['row_id', 'patient_id', 'texto', 'etiqueta']

Train: 2509 cases
Val: 646 cases

Train distribution:
depresion    1745
ansiedad      764
Name: count, dtype: int64

Val distribution:
depresion    485
ansiedad     161
Name: count, dtype: int64


## 2) Baseline 1: Majority Class

**Always** predicts the majority class (depression, label `depresion`).

In [3]:
# Train (only learns the majority class)
dummy_majority = DummyClassifier(strategy='most_frequent', random_state=42)
dummy_majority.fit(X_train, y_train)

# Predict on validation
y_pred_majority = dummy_majority.predict(X_val)

# Metrics. 'ansiedad' is never predicted, so its precision is undefined;
# zero_division=0 scores it as 0 and avoids sklearn's UndefinedMetricWarning.
f1_majority = f1_score(y_val, y_pred_majority, average='macro', zero_division=0)
prec_majority = precision_score(y_val, y_pred_majority, average='macro', zero_division=0)
rec_majority = recall_score(y_val, y_pred_majority, average='macro')

print("=" * 60)
print("DUMMY BASELINE: MAJORITY CLASS")
print("=" * 60)
print(f"Macro F1: {f1_majority:.4f}")
print(f"Macro Precision: {prec_majority:.4f}")
print(f"Macro Recall: {rec_majority:.4f}")
print()

# Full classification report
report_majority = classification_report(y_val, y_pred_majority, output_dict=True, zero_division=0)
report_majority_df = pd.DataFrame(report_majority).transpose()
print(report_majority_df)

# Export macro metrics
eval_majority = pd.DataFrame([{
    'modelo': 'dummy_majority',
    'f1_macro': f1_majority,
    'precision_macro': prec_majority,
    'recall_macro': rec_majority,
    'n_val': len(y_val)
}])
eval_majority.to_csv(DATA_PATH / 'dummy_majority_eval.csv', index=False)
print(f"\n✓ Exported: {DATA_PATH / 'dummy_majority_eval.csv'}")

# Export classification report
report_majority_df.to_csv(DATA_PATH / 'dummy_majority_classification_report.csv')
print(f"✓ Exported: {DATA_PATH / 'dummy_majority_classification_report.csv'}")

DUMMY BASELINE: MAJORITY CLASS
Macro F1: 0.4288
Macro Precision: 0.3754
Macro Recall: 0.5000

              precision    recall  f1-score     support
ansiedad       0.000000  0.000000  0.000000  161.000000
depresion      0.750774  1.000000  0.857648  485.000000
accuracy       0.750774  0.750774  0.750774    0.750774
macro avg      0.375387  0.500000  0.428824  646.000000
weighted avg   0.563662  0.750774  0.643900  646.000000

✓ Exported: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_majority_eval.csv
✓ Exported: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_majority_classification_report.csv
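
The macro scores reported for the majority baseline follow directly from the validation class counts (485 `depresion`, 161 `ansiedad`). The sketch below reproduces them by hand as a worked check; it is not part of the original notebook, and the helper name `f1` is hypothetical. It assumes the same convention as the report above, where the undefined precision of the never-predicted class counts as 0.

```python
# Validation class counts taken from the notebook output
n_depresion, n_ansiedad = 485, 161
n_val = n_depresion + n_ansiedad


def f1(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The majority baseline labels every validation case as 'depresion'.
prec_dep = n_depresion / n_val   # every prediction is 'depresion'; 485 of 646 are right
rec_dep = 1.0                    # all true 'depresion' cases are recovered
prec_anx = 0.0                   # 'ansiedad' is never predicted: undefined, scored as 0
rec_anx = 0.0                    # no true 'ansiedad' case is recovered

macro_precision = (prec_dep + prec_anx) / 2
macro_recall = (rec_dep + rec_anx) / 2
macro_f1 = (f1(prec_dep, rec_dep) + f1(prec_anx, rec_anx)) / 2

print(f"Macro F1: {macro_f1:.4f}")                # 0.4288
print(f"Macro Precision: {macro_precision:.4f}")  # 0.3754
print(f"Macro Recall: {macro_recall:.4f}")        # 0.5000
```

Accuracy for this baseline is 0.7508, yet its macro F1 is only 0.4288, which is why macro-averaged metrics are the more informative comparison on this imbalanced dataset.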


## 3) Baseline 2: Stratified Random

Predicts at random while respecting the class distribution of the training set.

In [4]:
# Train (only learns the class distribution)
dummy_stratified = DummyClassifier(strategy='stratified', random_state=42)
dummy_stratified.fit(X_train, y_train)

# Predict on validation
y_pred_stratified = dummy_stratified.predict(X_val)

# Metrics
f1_stratified = f1_score(y_val, y_pred_stratified, average='macro')
prec_stratified = precision_score(y_val, y_pred_stratified, average='macro')
rec_stratified = recall_score(y_val, y_pred_stratified, average='macro')

print("=" * 60)
print("DUMMY BASELINE: STRATIFIED RANDOM")
print("=" * 60)
print(f"Macro F1: {f1_stratified:.4f}")
print(f"Macro Precision: {prec_stratified:.4f}")
print(f"Macro Recall: {rec_stratified:.4f}")
print()

# Full classification report
report_stratified = classification_report(y_val, y_pred_stratified, output_dict=True)
report_stratified_df = pd.DataFrame(report_stratified).transpose()
print(report_stratified_df)

# Export macro metrics
eval_stratified = pd.DataFrame([{
    'modelo': 'dummy_stratified',
    'f1_macro': f1_stratified,
    'precision_macro': prec_stratified,
    'recall_macro': rec_stratified,
    'n_val': len(y_val)
}])
eval_stratified.to_csv(DATA_PATH / 'dummy_stratified_eval.csv', index=False)
print(f"\n✓ Exported: {DATA_PATH / 'dummy_stratified_eval.csv'}")

# Export classification report
report_stratified_df.to_csv(DATA_PATH / 'dummy_stratified_classification_report.csv')
print(f"✓ Exported: {DATA_PATH / 'dummy_stratified_classification_report.csv'}")

DUMMY BASELINE: STRATIFIED RANDOM
Macro F1: 0.4934
Macro Precision: 0.4960
Macro Recall: 0.4955

              precision    recall  f1-score     support
ansiedad       0.243781  0.304348  0.270718  161.000000
depresion      0.748315  0.686598  0.716129  485.000000
accuracy       0.591331  0.591331  0.591331    0.591331
macro avg      0.496048  0.495473  0.493424  646.000000
weighted avg   0.622572  0.591331  0.605121  646.000000

✓ Exported: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_stratified_eval.csv
✓ Exported: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/dummy_stratified_classification_report.csv
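
Because the stratified baseline draws each prediction from the training class distribution, independently of the input, its expected scores can be worked out from the class counts alone. The sketch below is an illustrative calculation, not part of the original notebook; it uses the train and validation counts printed earlier, and all variable names are hypothetical.

```python
# Class counts taken from the notebook output
train_counts = {"depresion": 1745, "ansiedad": 764}
val_counts = {"depresion": 485, "ansiedad": 161}

n_train = sum(train_counts.values())
n_val = sum(val_counts.values())

# Probability that the stratified dummy predicts each class (the train prior)
pred_prob = {label: n / n_train for label, n in train_counts.items()}

# Predictions ignore the true label, so the expected recall of a class
# equals the probability of predicting that class.
expected_recall = dict(pred_prob)
expected_macro_recall = sum(expected_recall.values()) / len(expected_recall)

# Expected accuracy: chance that a random prediction matches the true label,
# weighted by how often each label occurs in validation.
expected_accuracy = sum(val_counts[label] / n_val * pred_prob[label] for label in val_counts)

for label, recall in expected_recall.items():
    print(f"Expected recall, {label}: {recall:.3f}")
print(f"Expected macro recall: {expected_macro_recall:.3f}")
print(f"Expected accuracy: {expected_accuracy:.3f}")
```

The observed values for `random_state=42` (recall 0.304 and 0.687, accuracy 0.591) sit close to these expectations. The gap is sampling noise from a single seed, so small differences from this baseline should be read with that variability in mind.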


## 4) Export consolidated table

We generate the table `02_baselines_con_dummy.csv`, which includes ALL models (dummy + ML).

**For the full analysis of improvement, interpretation, and visualizations:**
→ See `notebooks/02_comparacion_resultados.ipynb`

In [5]:
# ===============================================================
# Export consolidated table with all models (including dummy)
# ===============================================================

# Load ML model results
comp_original = pd.read_csv(DATA_PATH / '02_baselines_comparacion.csv')
print("ML model results:")
print(comp_original)

# Rename comp_original columns to match the dummy eval tables
comp_original_renamed = comp_original.rename(columns={
    'baseline': 'modelo',
    'macro_f1': 'f1_macro',
    'macro_precision': 'precision_macro',
    'macro_recall': 'recall_macro',
    'n': 'n_val'
})

# Build consolidated table including the dummy baselines
comp_con_dummy = pd.concat([
    eval_majority[['modelo', 'f1_macro', 'precision_macro', 'recall_macro', 'n_val']],
    eval_stratified[['modelo', 'f1_macro', 'precision_macro', 'recall_macro', 'n_val']],
    comp_original_renamed[['modelo', 'f1_macro', 'precision_macro', 'recall_macro', 'n_val']]
], ignore_index=True)

# Sort by F1 (descending)
comp_con_dummy = comp_con_dummy.sort_values('f1_macro', ascending=False).reset_index(drop=True)

print("\n" + "=" * 80)
print("CONSOLIDATED TABLE (all models)")
print("=" * 80)
print(comp_con_dummy.to_string(index=False))

# Export consolidated table
out_csv = DATA_PATH / '02_baselines_con_dummy.csv'
comp_con_dummy.to_csv(out_csv, index=False)
print(f"\n✓ Exported: {out_csv}")
print("\n📌 For the full analysis, improvement over dummy, and visualizations:")
print("   → notebooks/02_comparacion_resultados.ipynb")

ML model results:
     baseline  macro_f1  macro_precision  macro_recall      n
0  rule_based  0.503060         0.516587      0.510629  646.0
1       tfidf  0.755270         0.745770      0.768432  646.0
2        beto  0.741577         0.736377      0.747711  646.0

CONSOLIDATED TABLE (all models)
          modelo  f1_macro  precision_macro  recall_macro  n_val
           tfidf  0.755270         0.745770      0.768432  646.0
            beto  0.741577         0.736377      0.747711  646.0
      rule_based  0.503060         0.516587      0.510629  646.0
dummy_stratified  0.493424         0.496048      0.495473  646.0
  dummy_majority  0.428824         0.375387      0.500000  646.0

✓ Exported: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/02_baselines_con_dummy.csv

📌 For the full analysis, improvement over dummy, and visualizations:
   → notebooks/02_comparacion_resultados.ipynb
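
As a quick reading aid for the consolidated table, the sketch below computes each model's macro F1 gain over the stronger of the two dummy baselines. It is an illustration only: the values are copied from the table output, the variable names are hypothetical, and the full comparison belongs to `notebooks/02_comparacion_resultados.ipynb`, whose method may differ.

```python
# Macro F1 values copied from the consolidated table
f1_macro = {
    "tfidf": 0.755270,
    "beto": 0.741577,
    "rule_based": 0.503060,
    "dummy_stratified": 0.493424,
    "dummy_majority": 0.428824,
}

# Use the best-scoring dummy as the floor that real models must clear
best_dummy = max(score for name, score in f1_macro.items() if name.startswith("dummy_"))

# Absolute macro F1 gain of each non-dummy model over that floor
gains = {
    name: score - best_dummy
    for name, score in f1_macro.items()
    if not name.startswith("dummy_")
}

for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12} +{gain:.4f} macro F1 over the best dummy")
```

On these numbers, `tfidf` and `beto` clear the floor by roughly 0.25 to 0.26 macro F1, while `rule_based` exceeds the stratified dummy by less than 0.01, a margin comparable to the seed-to-seed variation a random baseline can show.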
