# Notebook 2: Training Classification Models

**Project:** Final Deliverable - Optimization of Retention Strategies

**Authors:** Juan David Valencia, Juan Esteban Cuellar

**Date:** November 2025

---

## Objective

This notebook implements the **training, tuning, and evaluation of classification models** to predict users with high growth potential (`high_growth`).

**Business Problem:** Identify which users are most likely to become high-growth users (`delta_orders > 8`) so that the promotional budget can be allocated more effectively.

**Target Variable:** `high_growth` (binary: 1 if `delta_orders > 8`, 0 otherwise)

**Algorithms to Compare:**
1. Random Forest Classifier
2. XGBoost Classifier
3. LightGBM Classifier

**Evaluation Metrics:**
- **AUC-ROC** (target: > 0.75) - Primary metric
- **F1-Score** (target: > 0.65)
- **Precision@20%** (target: > 0.80) - For targeting the top 20% of users
- Confusion matrix
- ROC and Precision-Recall curves
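
Precision@20% is the least standard metric in this list, so a minimal sketch may help. It ranks users by predicted score and measures the share of true positives among the top fraction. The helper name `precision_at_k` and the toy arrays are illustrative assumptions, not part of the project code.

```python
import numpy as np

def precision_at_k(y_true, scores, fraction=0.20):
    """Share of true positives among the top `fraction` of users ranked by score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(len(scores) * fraction))   # number of users that would be targeted
    top_k = np.argsort(scores)[-k:]           # indices of the k highest scores
    return float(y_true[top_k].mean())

# Toy data (hypothetical): 10 users, 3 of them high-growth
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.1, 0.2, 0.8, 0.3, 0.4, 0.05, 0.35, 0.15, 0.25]

print(precision_at_k(y_true, scores, fraction=0.20))
print(precision_at_k(y_true, scores, fraction=0.30))
```

With a 20% cut-off only the two best-scored users are targeted, so the metric rewards models that put real high-growth users at the very top of the ranking.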

**Input:** Processed datasets from `data/processed/`

**Output:** 
- Best trained model: `models/best_classifier.pkl`
- Evaluation report: `models/classification_report.json`
- Figures: `documento/figuras/*_best_model.png`

---

## 1. Validation and Experimentation Strategy (Phase 3 - 5%)

### 1.1 Experimentation Strategy

**Modeling Process:**

1. **Initial Training:**
   - Train multiple algorithms (RF, XGBoost, LightGBM) on the **TRAIN** set
   - Use baseline configurations to establish a performance baseline

2. **Hyperparameter Tuning:**
   - Use **5-fold cross-validation** on the TRAIN set to tune hyperparameters
   - Technique: **GridSearchCV** (exhaustive search over the specified parameter grid)
   - Optimization metric: **AUC-ROC** (summarizes the trade-off between sensitivity and specificity across all thresholds)

3. **Model Selection:**
   - Evaluate each algorithm's best configuration on the **VALIDATION** set
   - Compare models on multiple metrics (AUC-ROC, F1, Precision@20%)
   - Select the best model based on business criteria plus statistical metrics

4. **Final Evaluation:**
   - Evaluate the best model on the **TEST** set (once only, with no retraining)
   - Report final metrics plus confidence intervals
   - Qualitative analysis: feature importance, misclassified cases
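
Step 4 calls for confidence intervals on the final metrics. One common way to obtain them, shown here as a sketch rather than the method used in this project, is a percentile bootstrap over test-set users. The function `bootstrap_ci`, the stand-in metric `accuracy_at_half`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for metric(y_true, y_score)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample users with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                              # skip resamples containing a single class
        stats.append(metric(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)

def accuracy_at_half(y_true, y_score):
    """Stand-in metric for the demo: accuracy at a 0.5 threshold."""
    return float(np.mean((y_score >= 0.5) == (y_true == 1)))

# Synthetic, hypothetical data: scores loosely correlated with the label
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, size=500), 0, 1)

low, high = bootstrap_ci(y_true, y_score, accuracy_at_half)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

In the notebook, the same function could be called with `roc_auc_score` as `metric` and the test-set labels and probabilities as inputs.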

### 1.2 Rationale for the Strategy

- **5-fold CV:** Balances robustness (multiple folds) against computational cost
- **GridSearchCV:** Ensures systematic exploration of the specified hyperparameter grid
- **Hold-out test set:** Unbiased estimate of final performance (data never seen during training or selection)
- **AUC-ROC as the primary metric:** Its value does not depend on class prevalence (about 20% high-growth), unlike accuracy
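
The last point can be made concrete. AUC-ROC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, so repeating the negatives changes the prevalence but not the AUC. The sketch below uses a hand-written `pairwise_auc` helper and toy scores, both assumptions for illustration. It is quadratic in sample size and only meant for small examples.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative); ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]            # every positive/negative pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = np.array([1, 1, 0, 0, 0])
scores = np.array([0.9, 0.4, 0.5, 0.3, 0.1])
auc_small = pairwise_auc(y_true, scores)

# Repeat every negative so it appears 4 times: prevalence drops from 40% to about 14%
y_imb = np.concatenate([y_true, np.tile(y_true[y_true == 0], 3)])
s_imb = np.concatenate([scores, np.tile(scores[y_true == 0], 3)])
auc_imb = pairwise_auc(y_imb, s_imb)

print(auc_small, auc_imb)
```

Both values agree because the ranking of positives against negatives is unchanged. Precision-based metrics would move under the same manipulation, which is why the notebook also tracks Precision@20% and the Precision-Recall curve.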

### 1.3 Distribution Check

We will verify that the Train/Val/Test sets preserve the distributions of the key variables.
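
For numeric variables, one possible way to quantify how well a split preserves a distribution is the population stability index (PSI). This is a sketch of an additional check, not something the notebook computes. The function name, bin count, and synthetic samples are assumptions.

```python
import numpy as np

def population_stability_index(reference, other, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric variable, using quantile bins of `reference`."""
    reference = np.asarray(reference, dtype=float)
    other = np.asarray(other, dtype=float)
    # Interior bin edges taken from the reference quantiles
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1])
    ref_counts = np.bincount(np.searchsorted(edges, reference, side="right"),
                             minlength=len(edges) + 1)
    oth_counts = np.bincount(np.searchsorted(edges, other, side="right"),
                             minlength=len(edges) + 1)
    # Clip to avoid log(0) when a bin is empty
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    oth_frac = np.clip(oth_counts / oth_counts.sum(), eps, None)
    return float(np.sum((oth_frac - ref_frac) * np.log(oth_frac / ref_frac)))

# Synthetic, hypothetical samples
rng = np.random.default_rng(42)
train_like = rng.normal(0, 1, 5000)
val_like = rng.normal(0, 1, 2000)      # same distribution as train_like
shifted = rng.normal(1, 1, 2000)       # mean shifted by one standard deviation

psi_same = population_stability_index(train_like, val_like)
psi_shift = population_stability_index(train_like, shifted)
print(f"same distribution: {psi_same:.4f}, shifted: {psi_shift:.4f}")
```

A value near zero indicates that the two samples fill the reference bins in nearly the same proportions, and larger values indicate drift.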

## 2. Setup and Data Loading

In [None]:
# Imports
import pandas as pd
import numpy as np
import pickle
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn - Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import (
    roc_auc_score, f1_score, precision_score, recall_score, accuracy_score,
    confusion_matrix, classification_report, roc_curve, precision_recall_curve,
    average_precision_score
)

# XGBoost and LightGBM
import xgboost as xgb
import lightgbm as lgb

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Configuration
pd.set_option('display.max_columns', None)
np.random.seed(42)

print("✅ Imports complete")
print(f"📅 Run date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"\n📦 Versions:")
print(f"   - XGBoost: {xgb.__version__}")
print(f"   - LightGBM: {lgb.__version__}")

✅ Imports complete
📅 Run date: 2025-11-27 09:44:28

📦 Versions:
   - XGBoost: 3.1.2
   - LightGBM: 4.6.0


In [1]:
# Load processed datasets
TRAIN_PATH = '../data/processed/train.csv'
VAL_PATH = '../data/processed/val.csv'
TEST_PATH = '../data/processed/test.csv'

print("📂 Loading processed datasets...\n")

train_df = pd.read_csv(TRAIN_PATH)
val_df = pd.read_csv(VAL_PATH)
test_df = pd.read_csv(TEST_PATH)

# shape[1] counts every column, including uid and the two targets
print(f"✅ Train: {train_df.shape[0]:,} users × {train_df.shape[1]} columns")
print(f"✅ Validation: {val_df.shape[0]:,} users × {val_df.shape[1]} columns")
print(f"✅ Test: {test_df.shape[0]:,} users × {test_df.shape[1]} columns")

print(f"\n📊 Distribution of high_growth:")
print(f"   - Train: {train_df['high_growth'].mean()*100:.2f}% positives")
print(f"   - Validation: {val_df['high_growth'].mean()*100:.2f}% positives")
print(f"   - Test: {test_df['high_growth'].mean()*100:.2f}% positives")

📂 Loading processed datasets...



NameError: name 'pd' is not defined

In [None]:
# Separate features and targets
# Reminder: uid, high_growth, and delta_orders are not features

feature_cols = [col for col in train_df.columns if col not in ['uid', 'high_growth', 'delta_orders']]

X_train = train_df[feature_cols]
y_train = train_df['high_growth']

X_val = val_df[feature_cols]
y_val = val_df['high_growth']

X_test = test_df[feature_cols]
y_test = test_df['high_growth']

print(f"📊 Features for modeling: {len(feature_cols)}")
print(f"\n📋 First 10 features:")
for i, feat in enumerate(feature_cols[:10], 1):
    print(f"   {i:2d}. {feat}")
print(f"   ... ({len(feature_cols) - 10} more features)")

print(f"\n✅ X_train: {X_train.shape}")
print(f"✅ y_train: {y_train.shape} (positives: {y_train.sum():,} = {y_train.mean()*100:.2f}%)")

📊 Features for modeling: 51

📋 First 10 features:
    1. total_orders_tmenos1
    2. efo_to_four
    3. log_efo_to_four
    4. category_diversity
    5. num_categories
    6. num_shops
    7. num_brands
    8. brand001_ratio
    9. days_since_first_order
   10. orders_per_day
   ... (41 more features)

✅ X_train: (25000, 51)
✅ y_train: (25000,) (positives: 5,090 = 20.36%)


### 2.1 Distribution Check

We verify that the distributions are preserved across Train/Val/Test.

In [None]:
from scipy.stats import chi2_contingency

print("🔍 Checking that distributions are preserved...\n")

# Check high_growth
contingency_table = pd.DataFrame({
    'Train': y_train.value_counts(sort=False),
    'Validation': y_val.value_counts(sort=False),
    'Test': y_test.value_counts(sort=False)
})

print("📊 Contingency Table (high_growth):")
print(contingency_table)

chi2, p_value, dof, expected = chi2_contingency(contingency_table.values)

print(f"\n📈 Chi-Square Test:")
print(f"   - Chi² = {chi2:.4f}")
print(f"   - P-value = {p_value:.4f}")
print(f"   - Conclusion: {'✅ Distributions preserved (p > 0.05)' if p_value > 0.05 else '⚠️ Possible difference in distributions'}")

# Check the distribution of delta_orders (regression target)
print(f"\n📊 delta_orders statistics:")
stats_df = pd.DataFrame({
    'Train': train_df['delta_orders'].describe(),
    'Validation': val_df['delta_orders'].describe(),
    'Test': test_df['delta_orders'].describe()
})
print(stats_df)

print(f"\n✅ Check complete: distributions are suitable for modeling")

🔍 Checking that distributions are preserved...

📊 Contingency Table (high_growth):
             Train  Validation  Test
high_growth                         
0            19910        6637  6637
1             5090        1696  1697

📈 Chi-Square Test:
   - Chi² = 0.0003
   - P-value = 0.9999
   - Conclusion: ✅ Distributions preserved (p > 0.05)

📊 delta_orders statistics:
              Train   Validation         Test
count  25000.000000  8333.000000  8334.000000
mean       6.846280     6.840034     6.888649
std        4.870811     5.105109     5.063934
min        1.000000     1.000000     1.000000
25%        4.000000     4.000000     4.000000
50%        5.000000     5.000000     5.000000
75%        8.000000     8.000000     8.000000
max       95.000000   108.000000    65.000000

✅ Check complete: distributions are suitable for modeling


## 3. Baseline Model: Random Forest Classifier

We start with Random Forest as the baseline model because of its robustness and ease of interpretation.

In [None]:
print("🌲 RANDOM FOREST CLASSIFIER - Baseline Configuration\n")

# Baseline configuration
rf_baseline = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42,
    n_jobs=-1,
    class_weight='balanced'  # Handle class imbalance
)

print("⏳ Training Random Forest baseline...")
rf_baseline.fit(X_train, y_train)

# Predictions
y_train_pred_rf = rf_baseline.predict(X_train)
y_train_proba_rf = rf_baseline.predict_proba(X_train)[:, 1]

y_val_pred_rf = rf_baseline.predict(X_val)
y_val_proba_rf = rf_baseline.predict_proba(X_val)[:, 1]

# Metrics
print(f"\n📊 Random Forest Metrics (Baseline):")
print(f"\n{'='*60}")
print(f"{'Metric':<25} {'Train':>15} {'Validation':>15}")
print(f"{'='*60}")
print(f"{'AUC-ROC':<25} {roc_auc_score(y_train, y_train_proba_rf):>15.4f} {roc_auc_score(y_val, y_val_proba_rf):>15.4f}")
print(f"{'F1-Score':<25} {f1_score(y_train, y_train_pred_rf):>15.4f} {f1_score(y_val, y_val_pred_rf):>15.4f}")
print(f"{'Precision':<25} {precision_score(y_train, y_train_pred_rf):>15.4f} {precision_score(y_val, y_val_pred_rf):>15.4f}")
print(f"{'Recall':<25} {recall_score(y_train, y_train_pred_rf):>15.4f} {recall_score(y_val, y_val_pred_rf):>15.4f}")
print(f"{'Accuracy':<25} {accuracy_score(y_train, y_train_pred_rf):>15.4f} {accuracy_score(y_val, y_val_pred_rf):>15.4f}")
print(f"{'='*60}")

print(f"\n✅ Random Forest baseline trained")

🌲 RANDOM FOREST CLASSIFIER - Baseline Configuration

⏳ Training Random Forest baseline...

📊 Random Forest Metrics (Baseline):

Metric                              Train      Validation
AUC-ROC                            0.9935          0.9906
F1-Score                           0.8812          0.8752
Precision                          0.8105          0.8031
Recall                             0.9654          0.9617
Accuracy                           0.9470          0.9442

✅ Random Forest baseline trained


### 3.1 Hyperparameter Tuning - Random Forest

In [None]:
print("🔧 HYPERPARAMETER TUNING - Random Forest\n")

# Hyperparameter grid
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
    'min_samples_split': [10, 20, 30],
    'min_samples_leaf': [5, 10, 15],
    'max_features': ['sqrt', 'log2']
}

print(f"📋 Search grid:")
for param, values in param_grid_rf.items():
    print(f"   - {param}: {values}")

total_combinations = np.prod([len(v) for v in param_grid_rf.values()])
print(f"\n🔢 Total combinations: {total_combinations:,}")
print(f"⏱️ Estimated time: ~{total_combinations * 5 // 60} minutes (with 5-fold CV)\n")

# GridSearchCV
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1, class_weight='balanced'),
    param_grid=param_grid_rf,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

print("⏳ Starting GridSearchCV for Random Forest...")
print("   (This may take several minutes)\n")

rf_grid.fit(X_train, y_train)

print(f"\n✅ Tuning complete!")
print(f"\n🏆 Best hyperparameters:")
for param, value in rf_grid.best_params_.items():
    print(f"   - {param}: {value}")

print(f"\n📊 Best AUC-ROC (5-fold CV): {rf_grid.best_score_:.4f}")

# Keep the best model
best_rf = rf_grid.best_estimator_

🔧 HYPERPARAMETER TUNING - Random Forest

📋 Search grid:
   - n_estimators: [100, 200, 300]
   - max_depth: [10, 15, 20, None]
   - min_samples_split: [10, 20, 30]
   - min_samples_leaf: [5, 10, 15]
   - max_features: ['sqrt', 'log2']

🔢 Total combinations: 216
⏱️ Estimated time: ~18 minutes (with 5-fold CV)

⏳ Starting GridSearchCV for Random Forest...
   (This may take several minutes)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits

✅ Tuning complete!

🏆 Best hyperparameters:
   - max_depth: None
   - max_features: sqrt
   - min_samples_leaf: 5
   - min_samples_split: 10
   - n_estimators: 300

📊 Best AUC-ROC (5-fold CV): 0.9942


In [None]:
# Evaluate the best Random Forest on validation
y_val_pred_rf_opt = best_rf.predict(X_val)
y_val_proba_rf_opt = best_rf.predict_proba(X_val)[:, 1]

print(f"📊 Random Forest Metrics (Tuned) on Validation:")
print(f"   - AUC-ROC: {roc_auc_score(y_val, y_val_proba_rf_opt):.4f}")
print(f"   - F1-Score: {f1_score(y_val, y_val_pred_rf_opt):.4f}")
print(f"   - Precision: {precision_score(y_val, y_val_pred_rf_opt):.4f}")
print(f"   - Recall: {recall_score(y_val, y_val_pred_rf_opt):.4f}")

# Precision@20%
top_20_idx = np.argsort(y_val_proba_rf_opt)[-int(len(y_val) * 0.20):]
precision_at_20 = y_val.iloc[top_20_idx].mean()
print(f"   - Precision@20%: {precision_at_20:.4f}")

📊 Random Forest Metrics (Tuned) on Validation:
   - AUC-ROC: 0.9953
   - F1-Score: 0.9179
   - Precision: 0.8774
   - Recall: 0.9623
   - Precision@20%: 0.9334


## 4. Model 2: XGBoost Classifier

In [None]:
print("🚀 XGBOOST CLASSIFIER - Baseline Configuration\n")

# Compute scale_pos_weight to handle class imbalance
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(f"⚖️ Scale pos weight: {scale_pos_weight:.2f}")

# Baseline
xgb_baseline = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    n_jobs=-1,
    eval_metric='logloss'
)

print("⏳ Training XGBoost baseline...")
xgb_baseline.fit(X_train, y_train)

y_val_pred_xgb = xgb_baseline.predict(X_val)
y_val_proba_xgb = xgb_baseline.predict_proba(X_val)[:, 1]

print(f"\n📊 XGBoost Metrics (Baseline) on Validation:")
print(f"   - AUC-ROC: {roc_auc_score(y_val, y_val_proba_xgb):.4f}")
print(f"   - F1-Score: {f1_score(y_val, y_val_pred_xgb):.4f}")
print(f"   - Precision: {precision_score(y_val, y_val_pred_xgb):.4f}")
print(f"   - Recall: {recall_score(y_val, y_val_pred_xgb):.4f}")

print(f"\n✅ XGBoost baseline trained")

🚀 XGBOOST CLASSIFIER - Baseline Configuration

⚖️ Scale pos weight: 3.91
⏳ Training XGBoost baseline...

📊 XGBoost Metrics (Baseline) on Validation:
   - AUC-ROC: 0.9999
   - F1-Score: 0.9965
   - Precision: 0.9965
   - Recall: 0.9965

✅ XGBoost baseline trained
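
The two imbalance settings used so far, `class_weight='balanced'` for Random Forest and `scale_pos_weight` for XGBoost, are both derived from the class counts. The short calculation below reproduces them from the training counts reported earlier (19,910 negatives, 5,090 positives). The variable names are illustrative.

```python
# Class counts taken from the training split reported in Section 2
n_neg, n_pos = 19910, 5090
n_samples, n_classes = n_neg + n_pos, 2

# scikit-learn's class_weight='balanced' heuristic: n_samples / (n_classes * class_count)
w_neg = n_samples / (n_classes * n_neg)
w_pos = n_samples / (n_classes * n_pos)

# Ratio of negatives to positives, as computed in the XGBoost cell
scale_pos_weight = n_neg / n_pos

print(f"balanced weights: negative={w_neg:.4f}, positive={w_pos:.4f}")
print(f"scale_pos_weight: {scale_pos_weight:.2f}")
print(f"w_pos / w_neg:    {w_pos / w_neg:.2f}")
```

The ratio of the two balanced weights equals `scale_pos_weight`, so both settings give positives the same weight relative to negatives, even though the absolute weights differ.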


### 4.1 Hyperparameter Tuning - XGBoost

In [None]:
print("🔧 HYPERPARAMETER TUNING - XGBoost\n")

param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [4, 6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

print(f"📋 Search grid:")
for param, values in param_grid_xgb.items():
    print(f"   - {param}: {values}")

total_combinations = np.prod([len(v) for v in param_grid_xgb.values()])
print(f"\n🔢 Total combinations: {total_combinations:,}")
print(f"⏱️ Estimated time: ~{total_combinations * 3 // 60} minutes\n")

xgb_grid = GridSearchCV(
    xgb.XGBClassifier(
        scale_pos_weight=scale_pos_weight,
        random_state=42,
        n_jobs=-1,
        eval_metric='logloss'
    ),
    param_grid=param_grid_xgb,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

print("⏳ Starting GridSearchCV for XGBoost...")
xgb_grid.fit(X_train, y_train)

print(f"\n✅ Tuning complete!")
print(f"\n🏆 Best hyperparameters:")
for param, value in xgb_grid.best_params_.items():
    print(f"   - {param}: {value}")

print(f"\n📊 Best AUC-ROC (5-fold CV): {xgb_grid.best_score_:.4f}")

best_xgb = xgb_grid.best_estimator_

🔧 HYPERPARAMETER TUNING - XGBoost

📋 Search grid:
   - n_estimators: [100, 200, 300]
   - max_depth: [4, 6, 8, 10]
   - learning_rate: [0.01, 0.05, 0.1]
   - subsample: [0.7, 0.8, 0.9]
   - colsample_bytree: [0.7, 0.8, 0.9]

🔢 Total combinations: 324
⏱️ Estimated time: ~16 minutes

⏳ Starting GridSearchCV for XGBoost...
Fitting 5 folds for each of 324 candidates, totalling 1620 fits

✅ Tuning complete!

🏆 Best hyperparameters:
   - colsample_bytree: 0.9
   - learning_rate: 0.1
   - max_depth: 8
   - n_estimators: 300
   - subsample: 0.9

📊 Best AUC-ROC (5-fold CV): 1.0000


In [None]:
# Evaluate the best XGBoost on validation
y_val_pred_xgb_opt = best_xgb.predict(X_val)
y_val_proba_xgb_opt = best_xgb.predict_proba(X_val)[:, 1]

print(f"📊 XGBoost Metrics (Tuned) on Validation:")
print(f"   - AUC-ROC: {roc_auc_score(y_val, y_val_proba_xgb_opt):.4f}")
print(f"   - F1-Score: {f1_score(y_val, y_val_pred_xgb_opt):.4f}")
print(f"   - Precision: {precision_score(y_val, y_val_pred_xgb_opt):.4f}")
print(f"   - Recall: {recall_score(y_val, y_val_pred_xgb_opt):.4f}")

# Precision@20%
top_20_idx_xgb = np.argsort(y_val_proba_xgb_opt)[-int(len(y_val) * 0.20):]
precision_at_20_xgb = y_val.iloc[top_20_idx_xgb].mean()
print(f"   - Precision@20%: {precision_at_20_xgb:.4f}")

📊 XGBoost Metrics (Tuned) on Validation:
   - AUC-ROC: 1.0000
   - F1-Score: 0.9982
   - Precision: 0.9982
   - Recall: 0.9982
   - Precision@20%: 1.0000


## 5. Model 3: LightGBM Classifier

In [None]:
print("⚡ LIGHTGBM CLASSIFIER - Baseline Configuration\n")

# Baseline
lgb_baseline = lgb.LGBMClassifier(
    n_estimators=100,
    max_depth=10,
    learning_rate=0.1,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    n_jobs=-1,
    verbose=-1
)

print("⏳ Training LightGBM baseline...")
lgb_baseline.fit(X_train, y_train)

y_val_pred_lgb = lgb_baseline.predict(X_val)
y_val_proba_lgb = lgb_baseline.predict_proba(X_val)[:, 1]

print(f"\n📊 LightGBM Metrics (Baseline) on Validation:")
print(f"   - AUC-ROC: {roc_auc_score(y_val, y_val_proba_lgb):.4f}")
print(f"   - F1-Score: {f1_score(y_val, y_val_pred_lgb):.4f}")
print(f"   - Precision: {precision_score(y_val, y_val_pred_lgb):.4f}")
print(f"   - Recall: {recall_score(y_val, y_val_pred_lgb):.4f}")

print(f"\n✅ LightGBM baseline trained")

⚡ LIGHTGBM CLASSIFIER - Baseline Configuration

⏳ Training LightGBM baseline...

📊 LightGBM Metrics (Baseline) on Validation:
   - AUC-ROC: 0.9999
   - F1-Score: 0.9988
   - Precision: 0.9994
   - Recall: 0.9982

✅ LightGBM baseline trained


### 5.1 Hyperparameter Tuning - LightGBM

In [None]:
print("🔧 HYPERPARAMETER TUNING - LightGBM\n")

param_grid_lgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [8, 10, 15, -1],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [31, 50, 70],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

print(f"📋 Search grid:")
for param, values in param_grid_lgb.items():
    print(f"   - {param}: {values}")

total_combinations = np.prod([len(v) for v in param_grid_lgb.values()])
print(f"\n🔢 Total combinations: {total_combinations:,}")
print(f"⏱️ Estimated time: ~{total_combinations * 2 // 60} minutes\n")

lgb_grid = GridSearchCV(
    lgb.LGBMClassifier(
        scale_pos_weight=scale_pos_weight,
        subsample_freq=1,  # LightGBM ignores subsample unless subsample_freq > 0
        random_state=42,
        n_jobs=-1,
        verbose=-1
    ),
    param_grid=param_grid_lgb,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

print("⏳ Starting GridSearchCV for LightGBM...")
lgb_grid.fit(X_train, y_train)

print(f"\n✅ Tuning complete!")
print(f"\n🏆 Best hyperparameters:")
for param, value in lgb_grid.best_params_.items():
    print(f"   - {param}: {value}")

print(f"\n📊 Best AUC-ROC (5-fold CV): {lgb_grid.best_score_:.4f}")

best_lgb = lgb_grid.best_estimator_

🔧 HYPERPARAMETER TUNING - LightGBM

📋 Search grid:
   - n_estimators: [100, 200, 300]
   - max_depth: [8, 10, 15, -1]
   - learning_rate: [0.01, 0.05, 0.1]
   - num_leaves: [31, 50, 70]
   - subsample: [0.7, 0.8, 0.9]
   - colsample_bytree: [0.7, 0.8, 0.9]

🔢 Total combinations: 972
⏱️ Estimated time: ~32 minutes

⏳ Starting GridSearchCV for LightGBM...
Fitting 5 folds for each of 972 candidates, totalling 4860 fits


KeyboardInterrupt: 
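
The exhaustive LightGBM search covers 972 combinations (4,860 fits) and was interrupted before finishing. One possible way to keep the search tractable is to evaluate a random subset of the grid. The sketch below draws such a subset using only the standard library. The helper `sample_grid` and the budget of 50 candidates are assumptions for illustration. scikit-learn's `RandomizedSearchCV` provides the same idea through its `n_iter` parameter.

```python
import itertools
import random

# Same grid as in the cell above
param_grid_lgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [8, 10, 15, -1],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [31, 50, 70],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
}

def sample_grid(param_grid, n_iter, seed=42):
    """Draw n_iter distinct parameter combinations from a grid, without replacement."""
    keys = list(param_grid)
    all_combos = list(itertools.product(*(param_grid[k] for k in keys)))
    rng = random.Random(seed)
    chosen = rng.sample(all_combos, k=min(n_iter, len(all_combos)))
    return [dict(zip(keys, combo)) for combo in chosen]

candidates = sample_grid(param_grid_lgb, n_iter=50)
print(f"{len(candidates)} of {3 * 4 * 3 * 3 * 3 * 3} combinations selected")
print(candidates[0])
```

With 5-fold CV, 50 candidates mean 250 fits instead of 4,860, at the cost of possibly missing the best grid point.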

In [None]:
# Evaluate the best LightGBM on validation
y_val_pred_lgb_opt = best_lgb.predict(X_val)
y_val_proba_lgb_opt = best_lgb.predict_proba(X_val)[:, 1]

print(f"📊 LightGBM Metrics (Tuned) on Validation:")
print(f"   - AUC-ROC: {roc_auc_score(y_val, y_val_proba_lgb_opt):.4f}")
print(f"   - F1-Score: {f1_score(y_val, y_val_pred_lgb_opt):.4f}")
print(f"   - Precision: {precision_score(y_val, y_val_pred_lgb_opt):.4f}")
print(f"   - Recall: {recall_score(y_val, y_val_pred_lgb_opt):.4f}")

# Precision@20%
top_20_idx_lgb = np.argsort(y_val_proba_lgb_opt)[-int(len(y_val) * 0.20):]
precision_at_20_lgb = y_val.iloc[top_20_idx_lgb].mean()
print(f"   - Precision@20%: {precision_at_20_lgb:.4f}")

## 6. Model Comparison and Selection of the Best Model

In [None]:
print("="*80)
print("MODEL COMPARISON ON THE VALIDATION SET")
print("="*80)

# Build the comparison table
results = pd.DataFrame({
    'Random Forest': {
        'AUC-ROC': roc_auc_score(y_val, y_val_proba_rf_opt),
        'F1-Score': f1_score(y_val, y_val_pred_rf_opt),
        'Precision': precision_score(y_val, y_val_pred_rf_opt),
        'Recall': recall_score(y_val, y_val_pred_rf_opt),
        'Precision@20%': precision_at_20
    },
    'XGBoost': {
        'AUC-ROC': roc_auc_score(y_val, y_val_proba_xgb_opt),
        'F1-Score': f1_score(y_val, y_val_pred_xgb_opt),
        'Precision': precision_score(y_val, y_val_pred_xgb_opt),
        'Recall': recall_score(y_val, y_val_pred_xgb_opt),
        'Precision@20%': precision_at_20_xgb
    },
    'LightGBM': {
        'AUC-ROC': roc_auc_score(y_val, y_val_proba_lgb_opt),
        'F1-Score': f1_score(y_val, y_val_pred_lgb_opt),
        'Precision': precision_score(y_val, y_val_pred_lgb_opt),
        'Recall': recall_score(y_val, y_val_pred_lgb_opt),
        'Precision@20%': precision_at_20_lgb
    }
}).T

print("\n")
print(results.to_string())
print("\n")

# Identify the best model per metric
print("🏆 Best model per metric:")
for metric in results.columns:
    # Named metric_winner so it is not confused with the fitted `best_model` assigned below
    metric_winner = results[metric].idxmax()
    best_value = results[metric].max()
    print(f"   - {metric}: {metric_winner} ({best_value:.4f})")

# Select the best model by AUC-ROC (primary metric)
best_model_name = results['AUC-ROC'].idxmax()
best_auc = results.loc[best_model_name, 'AUC-ROC']

print(f"\n🎯 BEST MODEL SELECTED: {best_model_name}")
print(f"   - AUC-ROC: {best_auc:.4f} {'✅ (target of >0.75 met)' if best_auc > 0.75 else '⚠️ (target of >0.75 not met)'}")

# Assign the best model
if best_model_name == 'Random Forest':
    best_model = best_rf
    y_val_proba_best = y_val_proba_rf_opt
    y_val_pred_best = y_val_pred_rf_opt
elif best_model_name == 'XGBoost':
    best_model = best_xgb
    y_val_proba_best = y_val_proba_xgb_opt
    y_val_pred_best = y_val_pred_xgb_opt
else:
    best_model = best_lgb
    y_val_proba_best = y_val_proba_lgb_opt
    y_val_pred_best = y_val_pred_lgb_opt

print("="*80)
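
The selection above relies on `idxmax()` over AUC-ROC, which returns the first row when several models share the same maximum. When AUC values are nearly identical, a tie-break on the secondary metrics may be preferable. The sketch below shows one possible rule, under the assumption that Precision@20% and then F1 are the next priorities. The scores are hypothetical placeholders, not results from this notebook.

```python
# Hypothetical validation scores, NOT results from this notebook
scores = {
    "Model A": {"AUC-ROC": 0.98, "Precision@20%": 0.95, "F1-Score": 0.90},
    "Model B": {"AUC-ROC": 0.98, "Precision@20%": 0.97, "F1-Score": 0.89},
    "Model C": {"AUC-ROC": 0.97, "Precision@20%": 0.99, "F1-Score": 0.93},
}

# Metrics in order of priority; later entries only matter when earlier ones tie
priority = ["AUC-ROC", "Precision@20%", "F1-Score"]

def select_best(scores, priority, decimals=4):
    """Pick the model with the best metric tuple, compared in priority order."""
    def key(name):
        # Round so that differences below the reported precision count as ties
        return tuple(round(scores[name][m], decimals) for m in priority)
    return max(scores, key=key)

best = select_best(scores, priority)
print(best)
```

Here Model A and Model B tie on AUC-ROC, so the higher Precision@20% decides in favor of Model B.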

## 7. Detailed Quantitative Evaluation of the Best Model

In [None]:
print(f"📊 DETAILED EVALUATION: {best_model_name}\n")

# Make sure the figures directory exists before saving
import os
os.makedirs('../documento/figuras', exist_ok=True)

# Confusion matrix
cm = confusion_matrix(y_val, y_val_pred_best)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix - counts
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['No High-Growth', 'High-Growth'],
            yticklabels=['No High-Growth', 'High-Growth'])
axes[0].set_title(f'Confusion Matrix - {best_model_name}\n(Counts)', fontsize=12)
axes[0].set_ylabel('Actual')
axes[0].set_xlabel('Predicted')

# Confusion matrix - normalized by actual class (rows sum to 1)
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_norm, annot=True, fmt='.2%', cmap='Blues', ax=axes[1],
            xticklabels=['No High-Growth', 'High-Growth'],
            yticklabels=['No High-Growth', 'High-Growth'])
axes[1].set_title(f'Confusion Matrix - {best_model_name}\n(Normalized)', fontsize=12)
axes[1].set_ylabel('Actual')
axes[1].set_xlabel('Predicted')

plt.tight_layout()
plt.savefig('../documento/figuras/confusion_matrix_best_model.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Confusion matrix saved to documento/figuras/")
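
Precision, recall, F1, and accuracy can all be read off the confusion matrix. As a sketch, the helper below derives them from a 2×2 matrix in scikit-learn's layout (rows are actual classes, columns are predicted classes, labels ordered `[0, 1]`). The function name and the example counts are hypothetical, and the code assumes every denominator is non-zero.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Derive standard metrics from a 2x2 confusion matrix (rows = actual, columns = predicted)."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp)                    # of predicted positives, how many are real
    recall = tp / (tp + fn)                       # of real positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "accuracy": float(accuracy),
    }

# Hypothetical counts: 100 actual negatives, 50 actual positives
cm_example = np.array([[90, 10],
                       [5, 45]])
derived = metrics_from_confusion(cm_example)
print(derived)
```

Applied to `cm` from the cell above, the result should match the values printed by `classification_report` for the High-Growth class.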

In [None]:
# ROC and Precision-Recall curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve
fpr, tpr, thresholds_roc = roc_curve(y_val, y_val_proba_best)
auc_score = roc_auc_score(y_val, y_val_proba_best)

axes[0].plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {auc_score:.4f})')
axes[0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
axes[0].set_xlim([0.0, 1.0])
axes[0].set_ylim([0.0, 1.05])
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title(f'ROC Curve - {best_model_name}')
axes[0].legend(loc="lower right")
axes[0].grid(True, alpha=0.3)

# Precision-Recall Curve
precision_curve, recall_curve, thresholds_pr = precision_recall_curve(y_val, y_val_proba_best)
ap_score = average_precision_score(y_val, y_val_proba_best)

axes[1].plot(recall_curve, precision_curve, color='green', lw=2, label=f'PR curve (AP = {ap_score:.4f})')
axes[1].axhline(y=y_val.mean(), color='navy', linestyle='--', lw=2, label=f'Baseline (prevalence = {y_val.mean():.4f})')
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title(f'Precision-Recall Curve - {best_model_name}')
axes[1].legend(loc="lower left")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../documento/figuras/roc_pr_curves_best_model.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ ROC and PR curves saved to documento/figuras/")

In [None]:
# Classification Report
print("\n📋 Classification Report (Validation):")
print("="*60)
print(classification_report(y_val, y_val_pred_best, target_names=['No High-Growth', 'High-Growth']))
print("="*60)

## 8. Qualitative Evaluation: Feature Importance

In [None]:
print(f"🔍 FEATURE IMPORTANCE ANALYSIS - {best_model_name}\n")

# Get feature importances
if hasattr(best_model, 'feature_importances_'):
    importances = best_model.feature_importances_
    feature_importance_df = pd.DataFrame({
        'feature': feature_cols,
        'importance': importances
    }).sort_values('importance', ascending=False)
    
    print("📊 Top 20 Most Important Features:\n")
    print(feature_importance_df.head(20).to_string(index=False))
    
    # Plot
    fig, ax = plt.subplots(figsize=(10, 8))
    top_20 = feature_importance_df.head(20)
    ax.barh(range(len(top_20)), top_20['importance'], color='steelblue')
    ax.set_yticks(range(len(top_20)))
    ax.set_yticklabels(top_20['feature'])
    ax.invert_yaxis()
    ax.set_xlabel('Importance')
    ax.set_title(f'Top 20 Feature Importances - {best_model_name}', fontsize=14)
    ax.grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    plt.savefig('../documento/figuras/feature_importance_best_model.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\n✅ Feature importance saved to documento/figuras/")
else:
    print("⚠️ The model does not expose feature_importances_")
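
The `feature_importances_` attribute is computed from the training process and its definition differs between libraries. A model-agnostic complement is permutation importance: shuffle one feature at a time on held-out data and measure how much the metric drops. The sketch below is a simplified, hand-written version with a toy predictor standing in for `best_model.predict`. All names and data in it are assumptions. scikit-learn ships a full implementation as `sklearn.inspection.permutation_importance`.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, metric, n_repeats=5, seed=42):
    """Mean drop in `metric` when each column of X is shuffled independently."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = metric(y, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between feature j and y
            drops[j] += baseline - metric(y, predict(X_perm))
    return drops / n_repeats

# Toy setup (hypothetical): the label depends only on feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)

def predict(data):
    """Stand-in for best_model.predict: thresholds feature 0."""
    return (data[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

importances = permutation_importance_sketch(predict, X, y, accuracy)
print(importances)
```

Shuffling feature 0 destroys the toy model's accuracy, while shuffling the unused features leaves it unchanged, so only feature 0 receives a non-zero importance.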

### 8.1 Error Analysis (Misclassified Cases)

In [None]:
print("🔎 ERROR ANALYSIS\n")

# Identify errors
errors_df = val_df.copy()
errors_df['prediction'] = y_val_pred_best
errors_df['probability'] = y_val_proba_best
errors_df['error'] = (errors_df['high_growth'] != errors_df['prediction']).astype(int)

# False positives (predicted high-growth, but actually not)
false_positives = errors_df[(errors_df['high_growth'] == 0) & (errors_df['prediction'] == 1)]
print(f"📍 False Positives: {len(false_positives):,} ({len(false_positives)/len(errors_df)*100:.2f}%)")

# False negatives (predicted not high-growth, but actually is)
false_negatives = errors_df[(errors_df['high_growth'] == 1) & (errors_df['prediction'] == 0)]
print(f"📍 False Negatives: {len(false_negatives):,} ({len(false_negatives)/len(errors_df)*100:.2f}%)")

print(f"\n📊 False Positives (top 5 by highest probability):")
fp_top = false_positives.nlargest(5, 'probability')[['delta_orders', 'prediction', 'probability']]
print(fp_top)

print(f"\n📊 False Negatives (top 5 by lowest probability):")
fn_top = false_negatives.nsmallest(5, 'probability')[['delta_orders', 'prediction', 'probability']]
print(fn_top)

print(f"\n💡 Insights:")
print(f"   - Mean delta for FPs: {false_positives['delta_orders'].mean():.2f} orders")
print(f"   - Mean delta for FNs: {false_negatives['delta_orders'].mean():.2f} orders")
print(f"   - FNs are high-growth users that the model underestimates")
print(f"   - FPs are users that the model overestimates (lower risk for the business)")
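
If false positives are indeed cheaper for the business than false negatives, the default 0.5 decision threshold is not necessarily the best choice. The sketch below picks the threshold that minimizes total cost on a validation set. The function name, the threshold grid, the cost values, and the toy data are all assumptions, since the notebook does not state actual costs.

```python
import numpy as np

def best_threshold(y_true, proba, cost_fp=1.0, cost_fn=3.0):
    """Return the (threshold, total cost) pair with the lowest cost on a fixed grid."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    thresholds = np.linspace(0.05, 0.95, 19)       # 0.05, 0.10, ..., 0.95
    costs = []
    for t in thresholds:
        pred = (proba >= t).astype(int)
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))                   # first minimum if several thresholds tie
    return float(thresholds[best]), float(costs[best])

# Toy validation data (hypothetical)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
proba = [0.12, 0.23, 0.31, 0.47, 0.42, 0.61, 0.72, 0.56, 0.83, 0.91]

thr_asym, cost_asym = best_threshold(y_true, proba, cost_fp=1.0, cost_fn=3.0)
thr_sym, cost_sym = best_threshold(y_true, proba, cost_fp=1.0, cost_fn=1.0)
print(f"FN three times as costly: threshold={thr_asym:.2f}, cost={cost_asym}")
print(f"equal costs:              threshold={thr_sym:.2f}, cost={cost_sym}")
```

Making false negatives more expensive pushes the chosen threshold down, so more users are flagged as high-growth. Any threshold chosen this way should be fixed on validation data before the test set is touched.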

## 9. Final Evaluation on the Test Set (Once Only)

In [None]:
print("="*80)
print(f"FINAL EVALUATION ON THE TEST SET - {best_model_name}")
print("="*80)
print("\n⚠️ IMPORTANT: This evaluation is run ONLY ONCE on the test set.\n")

# Predictions on test
y_test_pred = best_model.predict(X_test)
y_test_proba = best_model.predict_proba(X_test)[:, 1]

# Final metrics
test_auc = roc_auc_score(y_test, y_test_proba)
test_f1 = f1_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Precision@20%
top_20_idx_test = np.argsort(y_test_proba)[-int(len(y_test) * 0.20):]
test_precision_at_20 = y_test.iloc[top_20_idx_test].mean()

print(f"\n📊 FINAL METRICS ON THE TEST SET:\n")
print(f"{'='*60}")
print(f"{'Metric':<25} {'Value':>15} {'Target':>15}")
print(f"{'='*60}")
print(f"{'AUC-ROC':<25} {test_auc:>15.4f} {'>0.75':>15} {'✅' if test_auc > 0.75 else '⚠️'}")
print(f"{'F1-Score':<25} {test_f1:>15.4f} {'>0.65':>15} {'✅' if test_f1 > 0.65 else '⚠️'}")
print(f"{'Precision@20%':<25} {test_precision_at_20:>15.4f} {'>0.80':>15} {'✅' if test_precision_at_20 > 0.80 else '⚠️'}")
print(f"{'Precision':<25} {test_precision:>15.4f} {'':>15}")
print(f"{'Recall':<25} {test_recall:>15.4f} {'':>15}")
print(f"{'Accuracy':<25} {test_accuracy:>15.4f} {'':>15}")
print(f"{'='*60}")

# Test confusion matrix
cm_test = confusion_matrix(y_test, y_test_pred)
print(f"\n📊 Confusion Matrix (Test):")
print(cm_test)

print(f"\n✅ Test set evaluation complete")
print("="*80)

## 10. Save the Best Model and Report

In [None]:
import os

print("💾 Saving best model and report...\n")

# Create directories if they do not exist
os.makedirs('../models', exist_ok=True)
os.makedirs('../documento/figuras', exist_ok=True)

# Save model
model_path = '../models/best_classifier.pkl'
with open(model_path, 'wb') as f:
    pickle.dump(best_model, f)

print(f"✅ Model saved: {model_path}")

# Build report
report = {
    'model_type': best_model_name,
    'model_class': str(type(best_model)),
    'best_params': best_model.get_params() if hasattr(best_model, 'get_params') else {},
    'training_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'dataset_sizes': {
        'train': len(X_train),
        'validation': len(X_val),
        'test': len(X_test)
    },
    'metrics_validation': {
        'auc_roc': float(roc_auc_score(y_val, y_val_proba_best)),
        'f1_score': float(f1_score(y_val, y_val_pred_best)),
        'precision': float(precision_score(y_val, y_val_pred_best)),
        'recall': float(recall_score(y_val, y_val_pred_best)),
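        # Rank by the selected model's scores; top_20_idx holds the Random Forest ranking
        'precision_at_20': float(y_val.iloc[np.argsort(y_val_proba_best)[-int(len(y_val) * 0.20):]].mean())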
    },
    'metrics_test': {
        'auc_roc': float(test_auc),
        'f1_score': float(test_f1),
        'precision': float(test_precision),
        'recall': float(test_recall),
        'accuracy': float(test_accuracy),
        'precision_at_20': float(test_precision_at_20)
    },
    'feature_count': len(feature_cols),
    'feature_names': feature_cols,
    'class_distribution_test': {
        'negative': int((y_test == 0).sum()),
        'positive': int((y_test == 1).sum())
    }
}

# Save report as JSON (default=str covers parameter values json cannot encode natively)
report_path = '../models/classification_report.json'
with open(report_path, 'w') as f:
    json.dump(report, f, indent=2, default=str)

print(f"✅ Report saved: {report_path}")

# Save feature importance if available
if hasattr(best_model, 'feature_importances_'):
    feature_importance_path = '../models/feature_importance.csv'
    feature_importance_df.to_csv(feature_importance_path, index=False)
    print(f"✅ Feature importance saved: {feature_importance_path}")

print(f"\n✅ All artifacts saved successfully")
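
The `best_params` entry of the report comes from `get_params()`, which may contain values that the `json` module cannot encode as strict JSON, such as NaN, NumPy scalars, or arbitrary objects. One way to handle this, sketched below with a hypothetical helper `to_jsonable` and made-up parameter values, is to normalize the dictionary before dumping it.

```python
import json
import math

def to_jsonable(value):
    """Recursively convert a value into something strict JSON can represent."""
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    if isinstance(value, float) and math.isnan(value):
        return None                        # NaN is not valid strict JSON
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    if hasattr(value, "item"):             # NumPy scalars expose .item()
        return to_jsonable(value.item())
    return str(value)                      # last resort: readable representation

class FakeScalar:
    """Stand-in for a NumPy integer scalar, used only in this demo."""
    def item(self):
        return 7

class Custom:
    """Stand-in for an arbitrary object found among model parameters."""
    def __str__(self):
        return "custom"

# Hypothetical parameter dictionary
params = {
    "max_depth": 8,
    "missing": float("nan"),
    "grid": (1, 2),
    "n_estimators": FakeScalar(),
    "callback": Custom(),
}

encoded = json.dumps(to_jsonable(params), allow_nan=False)
print(encoded)
```

Passing `allow_nan=False` makes `json.dumps` raise if a NaN slips through, which keeps the report readable by strict JSON parsers.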

## 11. Executive Summary

In [None]:
print("="*80)
print("EXECUTIVE SUMMARY - TRAINING CLASSIFICATION MODELS")
print("="*80)

print(f"\n🎯 OBJECTIVE:")
print(f"   Predict users with high growth potential (high_growth)")

print(f"\n📊 DATA:")
print(f"   - Train: {len(X_train):,} users (60%)")
print(f"   - Validation: {len(X_val):,} users (20%)")
print(f"   - Test: {len(X_test):,} users (20%)")
print(f"   - Features: {len(feature_cols)} (11 numeric + 40 categorical)")
print(f"   - Class imbalance: {y_train.mean()*100:.1f}% positives")

print(f"\n🤖 MODELS EVALUATED:")
print(f"   1. Random Forest Classifier (with GridSearchCV)")
print(f"   2. XGBoost Classifier (with GridSearchCV)")
print(f"   3. LightGBM Classifier (with GridSearchCV)")

print(f"\n🏆 BEST MODEL: {best_model_name}")
print(f"\n📈 METRICS ON THE TEST SET (final evaluation):")
print(f"   - AUC-ROC: {test_auc:.4f} (target: >0.75) {'✅' if test_auc > 0.75 else '⚠️'}")
print(f"   - F1-Score: {test_f1:.4f} (target: >0.65) {'✅' if test_f1 > 0.65 else '⚠️'}")
print(f"   - Precision@20%: {test_precision_at_20:.4f} (target: >0.80) {'✅' if test_precision_at_20 > 0.80 else '⚠️'}")
print(f"   - Precision: {test_precision:.4f}")
print(f"   - Recall: {test_recall:.4f}")
print(f"   - Accuracy: {test_accuracy:.4f}")

print(f"\n💼 BUSINESS IMPLICATIONS:")
print(f"   - The model ranks a random high-growth user above a random non-high-growth user {test_auc:.1%} of the time (AUC-ROC)")
print(f"   - When targeting the top 20% of users, {test_precision_at_20:.1%} are actually high-growth")
print(f"   - This allows the promotional budget to be focused on the users with the highest ROI")

print(f"\n📁 ARTIFACTS GENERATED:")
print(f"   - models/best_classifier.pkl (trained model)")
print(f"   - models/classification_report.json (detailed metrics)")
print(f"   - models/feature_importance.csv (feature importance)")
print(f"   - documento/figuras/confusion_matrix_best_model.png")
print(f"   - documento/figuras/roc_pr_curves_best_model.png")
print(f"   - documento/figuras/feature_importance_best_model.png")

print(f"\n🚀 NEXT STEP:")
print(f"   Phase 5: Build the dashboard with Streamlit")
print(f"   - Integrate the trained model")
print(f"   - Create an interface for real-time predictions")
print(f"   - Personalized recommendation system")

print(f"\n✅ MODEL TRAINING COMPLETED SUCCESSFULLY")
print("="*80)