# 04_baseline_tfidf — TF-IDF + LinearSVC

**Goal:** Implement a robust baseline using **character-level TF-IDF** (n-grams of length 3 to 5, built within word boundaries) and **LinearSVC**.

**Strategy:**
- **Train:** Use `train_denoised.csv` (only cases with clinical signal) to avoid learning noise.
- **Dev:** Use `dev_full.csv` (the full dataset) to evaluate in a realistic scenario.
- **Features:** Character n-grams are robust to spelling errors and morphological variants.
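
The robustness claim for character n-grams can be illustrated with a small, standard-library sketch. The helper names `char_wb_ngrams` and `jaccard` and the example words are assumptions made for illustration; the function is a simplified stand-in for scikit-learn's `char_wb` analyzer (it pads each word with one space and ignores words shorter than the n-gram length), not its actual implementation.

```python
def char_wb_ngrams(text, n_min=3, n_max=5):
    """Simplified char_wb: character n-grams taken inside each space-padded word."""
    grams = set()
    for word in text.split():
        padded = f" {word} "
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                grams.add(padded[i:i + n])
    return grams


def jaccard(a, b):
    """Share of n-grams that two sets have in common."""
    return len(a & b) / len(a | b)


correct = char_wb_ngrams("ansiedad")
typo = char_wb_ngrams("anciedad")       # hypothetical misspelling
unrelated = char_wb_ngrams("tristeza")  # a different word entirely

overlap_typo = jaccard(correct, typo)
overlap_unrelated = jaccard(correct, unrelated)
print(f"ansiedad vs anciedad: {overlap_typo:.2f}")
print(f"ansiedad vs tristeza: {overlap_unrelated:.2f}")
```

A word-level vocabulary would treat the misspelling as a completely new token, whereas the character n-grams on either side of the wrong letter are still shared with the correct spelling.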

**Outputs:**
- `data/tfidf_eval.csv`
- `data/tfidf_classification_report.csv`
- `data/tfidf_cv_results.csv`

In [13]:
# ===============================================================
# Setup: imports and path configuration
# ===============================================================
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import re
import unicodedata
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

# Import shared utilities
try:
    from utils_shared import setup_paths, load_splits, calculate_metrics, get_cv_splitter
    paths = setup_paths()
    DATA_PATH = paths['DATA_PATH']
    SPLITS_PATH = paths['SPLITS_PATH']
    print("[OK] Usando utils_shared.py")
except ImportError:
    print("[ERROR] No se encontró utils_shared.py. Verifica que estás en el directorio correcto.")
    raise

[OK] Usando utils_shared.py


## 1) Data Loading and Preprocessing

In [14]:
# Load processed datasets
try:
    # Train: use train_denoised (clinical signal) from SPLITS_PATH
    df_train = pd.read_csv(SPLITS_PATH / 'train_denoised.csv')
    
    # Dev: build from the splits (full dataset)
    df_base, _, dev_idx, _ = load_splits(SPLITS_PATH)
    df_dev = df_base.set_index('row_id').loc[dev_idx].reset_index()
    
    print(f"Train (Denoised): {len(df_train)} casos")
    print(f"Dev (Full): {len(df_dev)} casos")
except FileNotFoundError:
    print("[ERROR] No se encontraron los datasets. Ejecuta 03_rule_based_denoising.ipynb primero.")
    raise

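# Aggressive preprocessing for TF-IDF
# Matches any character repeated three or more times in a row
RE_MULTI = re.compile(r'(.)\1{2,}')

def clean_text_ml(s: str) -> str:
    """Normalize raw text before character-level TF-IDF."""
    if pd.isna(s):
        return ""
    s = str(s).lower().strip()
    s = unicodedata.normalize("NFC", s)
    # Collapse runs of 3+ identical characters to two ("holaaa" -> "holaa")
    s = RE_MULTI.sub(r'\1\1', s)
    # Replace anything outside letters, digits, and basic punctuation with a space
    s = re.sub(r"[^a-z0-9áéíóúüñ\s.,!?:/\-]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    # Mark simple negations: "no tengo" -> "no_tengo"
    s = re.sub(r"\bno\s+([a-záéíóúüñ]{2,})", r"no_\1", s)
    return s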

print("Preprocesando textos...")
df_train['texto_ml'] = df_train['texto'].map(clean_text_ml)
df_dev['texto_ml'] = df_dev['texto'].map(clean_text_ml)

X_train = df_train['texto_ml']
y_train = df_train['etiqueta']
X_dev = df_dev['texto_ml']
y_dev = df_dev['etiqueta']

Train (Denoised): 993 casos
Dev (Full): 627 casos
Preprocesando textos...
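
To make the effect of the cleaning steps concrete, here is a minimal, standard-library sketch that applies the same regular expressions to a few made-up strings. The sample sentences are hypothetical, and the `None` check is an assumption standing in for the notebook's `pd.isna` test so the snippet runs without pandas.

```python
import re
import unicodedata

RE_MULTI = re.compile(r'(.)\1{2,}')


def clean_text_ml(s):
    # Same steps as the notebook function; `None` replaces the pd.isna check
    if s is None:
        return ""
    s = str(s).lower().strip()
    s = unicodedata.normalize("NFC", s)
    s = RE_MULTI.sub(r'\1\1', s)                            # "rrrr" -> "rr"
    s = re.sub(r"[^a-z0-9áéíóúüñ\s.,!?:/\-]", " ", s)       # drop other symbols
    s = re.sub(r"\s+", " ", s).strip()                      # collapse whitespace
    s = re.sub(r"\bno\s+([a-záéíóúüñ]{2,})", r"no_\1", s)   # negation marking
    return s


# Hypothetical inputs, not taken from the dataset
samples = ["NO  puedo dormirrrr!!!!", "Me siento   triste :(", None]
cleaned = [clean_text_ml(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

Because the negation is fused into a single token (`no_puedo`), the character n-grams that span the underscore differ from those of the bare verb, which gives the classifier a way to tell negated from non-negated mentions.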


## 2) Training and Evaluation

In [15]:
# Pipeline: TF-IDF (character n-grams) + LinearSVC
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        analyzer='char_wb',
        ngram_range=(3, 5),
        min_df=2,
        max_features=10000
    )),
    ('clf', LinearSVC(class_weight='balanced', random_state=42))
])

print("Entrenando modelo...")
pipeline.fit(X_train, y_train)

print("Evaluando en Dev...")
y_pred = pipeline.predict(X_dev)

# Compute metrics
metrics = calculate_metrics(y_dev, y_pred)

print("=" * 60)
print("TF-IDF + LinearSVC Results")
print("=" * 60)
print(f"Macro F1: {metrics['f1_macro']:.4f}")
print(f"Macro Precision: {metrics['precision_macro']:.4f}")
print(f"Macro Recall: {metrics['recall_macro']:.4f}")
print()
print(metrics['report'])

# Export results
eval_df = pd.DataFrame([{
    'modelo': 'tfidf_svm',
    'f1_macro': metrics['f1_macro'],
    'precision_macro': metrics['precision_macro'],
    'recall_macro': metrics['recall_macro'],
    'n_train': len(X_train),
    'n_dev': len(X_dev)
}])
eval_df.to_csv(DATA_PATH / 'tfidf_eval.csv', index=False)
print(f"✓ Exportado: {DATA_PATH / 'tfidf_eval.csv'}")

# Export the full classification report
report_df = pd.DataFrame(metrics['report_dict']).transpose()
report_df.to_csv(DATA_PATH / 'tfidf_classification_report.csv')
print(f"✓ Exportado: {DATA_PATH / 'tfidf_classification_report.csv'}")

Entrenando modelo...
Evaluando en Dev...
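============================================================
TF-IDF + LinearSVC Results
============================================================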
Macro F1: 0.8386
Macro Precision: 0.8679
Macro Recall: 0.8336

              precision    recall  f1-score   support

    ansiedad       0.94      0.71      0.81       285
   depresion       0.80      0.96      0.87       342

    accuracy                           0.85       627
   macro avg       0.87      0.83      0.84       627
weighted avg       0.86      0.85      0.84       627

✓ Exportado: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/tfidf_eval.csv
✓ Exportado: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/tfidf_classification_report.csv
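
As a sanity check on the report above, the macro and weighted averages can be recomputed from the per-class rows. This is a sketch that uses the two-decimal values printed in the report, so the last digit may differ slightly from the full-precision metrics; the helper names `macro_avg` and `weighted_avg` are assumptions for illustration.

```python
# Per-class values copied from the classification report (rounded to 2 decimals)
report = {
    "ansiedad":  {"precision": 0.94, "recall": 0.71, "f1": 0.81, "support": 285},
    "depresion": {"precision": 0.80, "recall": 0.96, "f1": 0.87, "support": 342},
}


def macro_avg(metric):
    """Unweighted mean over classes: every class counts the same."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean over classes weighted by the number of true examples (support)."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall", "f1"):
    print(f"{metric:<9} macro={macro_avg(metric):.3f}  weighted={weighted_avg(metric):.3f}")
```

The macro average is the headline metric here because it does not let the larger class (`depresion`, 342 cases) mask the lower recall on `ansiedad`.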


## 3) Cross-Validation (5-Fold)

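Robust evaluation using Stratified Group K-Fold: folds are stratified by label and grouped by `patient_id`, so all texts from one patient fall in the same fold. The folds are drawn from the denoised train set concatenated with the full dev set.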
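
The property that group-aware splitting is meant to guarantee can be checked directly. The sketch below is a standard-library illustration; `leaked_groups` is a hypothetical helper, and the toy patient ids and index lists are invented for the example rather than produced by `get_cv_splitter`.

```python
def leaked_groups(groups, train_idx, val_idx):
    """Return the group ids that appear on both sides of a split."""
    train_groups = {groups[i] for i in train_idx}
    val_groups = {groups[i] for i in val_idx}
    return train_groups & val_groups


# Toy data: six rows from three hypothetical patients
groups = ["p1", "p1", "p2", "p2", "p3", "p3"]

# Group-aware split: patient p3 is held out entirely
good = leaked_groups(groups, train_idx=[0, 1, 2, 3], val_idx=[4, 5])

# Row-level split: patient p2 has rows on both sides
bad = leaked_groups(groups, train_idx=[0, 1, 2], val_idx=[3, 4, 5])

print("group-aware split leaks:", good)
print("row-level split leaks:", bad)
```

Inside the cross-validation loop, a call such as `leaked_groups(groups_full.to_numpy(), train_idx, val_idx)` should return an empty set for every fold if the splitter groups by patient as described.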

In [16]:
from utils_shared import get_cv_splitter

# Combine Train + Dev
df_full = pd.concat([df_train, df_dev]).reset_index(drop=True)
# Recompute the cleaning step to ensure consistency
df_full['texto_ml'] = df_full['texto'].map(clean_text_ml)

X_full = df_full['texto_ml']
y_full = df_full['etiqueta']
groups_full = df_full['patient_id']  # Use patient_id directly as the group key

cv = get_cv_splitter(n_splits=5)
cv_results = []

print("Iniciando Cross-Validation (TF-IDF)...")

for fold, (train_idx, val_idx) in enumerate(cv.split(X_full, y_full, groups_full)):
    X_tr, y_tr = X_full.iloc[train_idx], y_full.iloc[train_idx]
    X_val, y_val = X_full.iloc[val_idx], y_full.iloc[val_idx]
    
    # Re-initialize the pipeline so no fitted state carries over between folds
    pipeline_cv = Pipeline([
        ('tfidf', TfidfVectorizer(
            analyzer='char_wb',
            ngram_range=(3, 5),
            min_df=2,
            max_features=10000
        )),
        ('clf', LinearSVC(class_weight='balanced', random_state=42))
    ])
    
    pipeline_cv.fit(X_tr, y_tr)
    # Fold-local names so the Dev-set `y_pred` and `metrics` from section 2 are not overwritten
    y_val_pred = pipeline_cv.predict(X_val)
    
    fold_metrics = calculate_metrics(y_val, y_val_pred)
    cv_results.append({
        'fold': fold + 1,
        'model': 'TF-IDF + LinearSVC',
        'f1_macro': fold_metrics['f1_macro'],
        'precision_macro': fold_metrics['precision_macro'],
        'recall_macro': fold_metrics['recall_macro']
    })
    print(f"Fold {fold+1}: F1={fold_metrics['f1_macro']:.4f}")

df_cv = pd.DataFrame(cv_results)
print("\nPromedio CV:")
print(df_cv.mean(numeric_only=True))

out_path = DATA_PATH / 'tfidf_cv_results.csv'
df_cv.to_csv(out_path, index=False)
print(f"✓ Exportado: {out_path}")

Iniciando Cross-Validation (TF-IDF)...
Fold 1: F1=0.8420
Fold 2: F1=0.7966
Fold 3: F1=0.8772
Fold 4: F1=0.8032
Fold 5: F1=0.8022

Promedio CV:
fold               3.000000
f1_macro           0.824234
precision_macro    0.826822
recall_macro       0.841238
dtype: float64
✓ Exportado: /Users/manuelnunez/Projects/psych-phenotyping-paraguay/data/tfidf_cv_results.csv
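
The fold scores above vary noticeably, so it can help to report the spread alongside the mean. A minimal sketch using only the standard library and the five macro F1 values printed in the output:

```python
from statistics import mean, stdev

# Macro F1 per fold, copied from the cross-validation output above
fold_f1 = [0.8420, 0.7966, 0.8772, 0.8032, 0.8022]

f1_mean = mean(fold_f1)
f1_std = stdev(fold_f1)  # sample standard deviation (divides by n - 1)

print(f"Macro F1: {f1_mean:.4f} ± {f1_std:.4f}")
print(f"Range: {min(fold_f1):.4f} to {max(fold_f1):.4f}")
```

In the notebook itself, `df_cv['f1_macro'].std()` would give the same sample standard deviation, since pandas also divides by n - 1 by default.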
