# 🤖 NeoScore - Credit Scoring Models

**Author**: Luca Camus  
**Date**: January 2026  
**Objective**: Train ML models to predict credit risk

**Models to implement**:
1. Logistic Regression (interpretable baseline)
2. Random Forest (robust ensemble)
3. XGBoost (state-of-the-art gradient boosting)
4. **Random Forest - Behavior Only** (no balance variables)

## 1. Configuration

In [None]:
# Install dependencies
!pip install google-cloud-bigquery pandas matplotlib seaborn scikit-learn xgboost imbalanced-learn --quiet

In [None]:
# Imports
from google.colab import auth
auth.authenticate_user()

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.cloud import bigquery

# Scikit-learn
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    roc_auc_score, roc_curve, confusion_matrix, 
    classification_report, precision_recall_curve,
    f1_score, accuracy_score
)

# XGBoost
from xgboost import XGBClassifier

# Imbalanced-learn (in case of class imbalance)
from imblearn.over_sampling import SMOTE

# Configuration
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)

print('✅ Configuration complete')

## 2. Load Data

In [None]:
# BigQuery client
PROJECT_ID = 'scoring-bancario'
client = bigquery.Client(project=PROJECT_ID)

# Load data
query = """
SELECT *
FROM `scoring-bancario.analisis_bancario.customer_features`
"""

df = client.query(query).to_dataframe()
print(f'📊 Dataset loaded: {df.shape[0]:,} customers x {df.shape[1]} columns')

## 3. Feature Preparation (Avoiding Leakage!)

⚠️ **IMPORTANT**: Do NOT use `preliminary_credit_score` as a feature
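
The rule above can be enforced mechanically before any model is trained. The following is a minimal sketch and not part of the original notebook: the helper `check_no_leakage` and the `FORBIDDEN` set are hypothetical names introduced for illustration. Only `preliminary_credit_score` and the target `high_risk_flag` come from this notebook.

```python
# Hypothetical guard (not in the original notebook): fail fast if a
# leaky column ends up in the feature list.
FORBIDDEN = {"preliminary_credit_score", "high_risk_flag"}


def check_no_leakage(features, forbidden=FORBIDDEN):
    """Return the feature list unchanged, or raise if it holds a forbidden column."""
    leaked = sorted(set(features) & set(forbidden))
    if leaked:
        raise ValueError(f"Leaky columns in feature list: {leaked}")
    return list(features)


# Passes silently because no forbidden column is present.
safe = check_no_leakage(["age", "avg_balance", "total_spend"])
```

Calling the guard on the final feature list right before building `X` would turn an accidental inclusion into an immediate error instead of an inflated AUC.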

In [None]:
# ALLOWED features (raw data, no leakage)
FEATURES = [
    'age',                      # Demographics
    'avg_balance',              # Balance
    'last_balance',
    'min_balance',
    'max_balance',
    'total_spend',              # Spend
    'avg_spend',
    'max_spend',
    'min_spend',
    'std_spend',
    'total_transactions',       # Activity
    'days_active',
    'unique_transaction_days',
    'transaction_frequency',
    'spend_to_balance_ratio',   # Ratios
    'spend_volatility',
    'avg_daily_transactions',
    'avg_daily_spend',
]

# Target variable
TARGET = 'high_risk_flag'

print(f'📊 Features to use: {len(FEATURES)}')
print(f'🎯 Target variable: {TARGET}')

In [None]:
# Check which features are available in the dataset
available_features = [f for f in FEATURES if f in df.columns]
missing_features = [f for f in FEATURES if f not in df.columns]

print(f'✅ Features available: {len(available_features)}')
if missing_features:
    print(f'⚠️ Features not found: {missing_features}')

FEATURES = available_features

In [None]:
# Build X and y
X = df[FEATURES].copy()
y = df[TARGET].copy()

print(f'X shape: {X.shape}')
print(f'y shape: {y.shape}')
print('\n📊 Target distribution (%):')
print(y.value_counts(normalize=True).round(4) * 100)

In [None]:
# Handle null values
print('\n📊 Nulls per column before imputation:')
null_counts = X.isnull().sum()
print(null_counts[null_counts > 0])

# Impute nulls with the column median
# (note: medians are computed on the full dataset, before the train/test split)
X = X.fillna(X.median())

print('\n✅ Nulls after imputation:', X.isnull().sum().sum())
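
The cell above computes medians over every row, so rows that later land in the test set influence the imputed values. A stricter variant learns the medians from the training rows only and reuses them on the test rows. The following dependency-free sketch illustrates the idea under stated assumptions: `fit_medians` and `impute` are hypothetical helpers, rows are plain dicts, and the toy values are not from the NeoScore dataset.

```python
from statistics import median


def fit_medians(train_rows, columns):
    """Per-column median over the training rows only, skipping missing values (None)."""
    return {
        col: median(row[col] for row in train_rows if row[col] is not None)
        for col in columns
    }


def impute(rows, medians):
    """Return copies of the rows with missing values replaced by the given medians."""
    return [
        {**row, **{c: m for c, m in medians.items() if row[c] is None}}
        for row in rows
    ]


# Toy data (illustrative values only)
train = [{"avg_spend": 10.0}, {"avg_spend": None}, {"avg_spend": 30.0}]
test = [{"avg_spend": None}, {"avg_spend": 5.0}]

medians = fit_medians(train, ["avg_spend"])  # learned from train only
train_imputed = impute(train, medians)
test_imputed = impute(test, medians)         # reuses the train medians
```

With pandas, the same pattern would likely amount to computing `X_train.median()` after the split and passing it to `fillna` for both `X_train` and `X_test`.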

## 4. Train/Test Split

In [None]:
# Stratified split (preserves class proportions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print(f'📊 Train: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.0f}%)')
print(f'📊 Test:  {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.0f}%)')

print('\n📊 Distribution in Train (%):')
print(y_train.value_counts(normalize=True).round(4) * 100)

print('\n📊 Distribution in Test (%):')
print(y_test.value_counts(normalize=True).round(4) * 100)
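
To make concrete what `stratify=y` buys, the sketch below splits each class separately so that both parts keep roughly the same class mix. It is an illustration of the idea only, not scikit-learn's actual allocation logic; `stratified_split` is a hypothetical helper and the labels are toy data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 90 low-risk (0) and 10 high-risk (1), illustrative only
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
test_positive_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

In this toy case the test part keeps the 10% share of high-risk labels, which is the property the distribution printouts above are meant to confirm.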

In [None]:
# Scale features for Logistic Regression (scaler is fit on train only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('✅ Features scaled')

## 5. Model 1: Logistic Regression (Baseline)

In [None]:
# Train Logistic Regression
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight='balanced'
)

lr_model.fit(X_train_scaled, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test_scaled)
y_prob_lr = lr_model.predict_proba(X_test_scaled)[:, 1]

# Metrics
lr_auc = roc_auc_score(y_test, y_prob_lr)
lr_gini = 2 * lr_auc - 1

print('=' * 50)
print('📊 LOGISTIC REGRESSION - Results')
print('=' * 50)
print(f'ROC-AUC: {lr_auc:.4f}')
print(f'Gini:    {lr_gini:.4f}')
print(f'\n{classification_report(y_test, y_pred_lr, target_names=["Low Risk", "High Risk"])}')
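
The Gini coefficient reported above is a rescaling of ROC-AUC: `Gini = 2 * AUC - 1`, so 0 means random ranking and 1 means perfect ranking. AUC itself can be read as the probability that a randomly chosen high-risk customer gets a higher score than a randomly chosen low-risk one, with ties counted as one half. The sketch below computes both by brute force; `pairwise_auc` and `gini_from_auc` are hypothetical helpers, the data is a toy example, and the loop is quadratic, so it is meant for intuition rather than for real datasets.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Rescale AUC from [0.5, 1] to the Gini range [0, 1]."""
    return 2 * auc - 1


# Toy example (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)   # 3 of 4 pairs ranked correctly
gini = gini_from_auc(auc)
```

Here three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75 and a Gini of 0.5.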

In [None]:
# Coefficients (interpretability); features are standardized, so magnitudes are comparable
coef_df = pd.DataFrame({
    'Feature': FEATURES,
    'Coefficient': lr_model.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)

print('📊 Most important features (Logistic Regression):')
print(coef_df.head(10).to_string(index=False))

## 6. Model 2: Random Forest

In [None]:
# Train Random Forest
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    random_state=42,
    class_weight='balanced',
    n_jobs=-1
)

rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)
y_prob_rf = rf_model.predict_proba(X_test)[:, 1]

# Metrics
rf_auc = roc_auc_score(y_test, y_prob_rf)
rf_gini = 2 * rf_auc - 1

print('=' * 50)
print('📊 RANDOM FOREST - Results')
print('=' * 50)
print(f'ROC-AUC: {rf_auc:.4f}')
print(f'Gini:    {rf_gini:.4f}')
print(f'\n{classification_report(y_test, y_pred_rf, target_names=["Low Risk", "High Risk"])}')

In [None]:
# Feature Importance
importance_df = pd.DataFrame({
    'Feature': FEATURES,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot
plt.figure(figsize=(10, 8))
plt.barh(importance_df['Feature'][::-1], importance_df['Importance'][::-1], color='steelblue')
plt.xlabel('Importance')
plt.title('Feature Importance - Random Forest', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print('\n📊 Top 10 Features (Random Forest):')
print(importance_df.head(10).to_string(index=False))

## 7. Model 3: XGBoost

In [None]:
# Compute scale_pos_weight to handle class imbalance
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(f'Scale pos weight: {scale_pos_weight:.2f}')

# Train XGBoost
xgb_model = XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    eval_metric='auc'
)

xgb_model.fit(X_train, y_train)

# Predictions
y_pred_xgb = xgb_model.predict(X_test)
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]

# Metrics
xgb_auc = roc_auc_score(y_test, y_prob_xgb)
xgb_gini = 2 * xgb_auc - 1

print('=' * 50)
print('📊 XGBOOST - Results')
print('=' * 50)
print(f'ROC-AUC: {xgb_auc:.4f}')
print(f'Gini:    {xgb_gini:.4f}')
print(f'\n{classification_report(y_test, y_pred_xgb, target_names=["Low Risk", "High Risk"])}')
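
The `scale_pos_weight` value used above is the ratio of negative to positive training labels, so that errors on the rarer high-risk class weigh more during training. A small worked example, assuming labels coded as 0 (low risk) and 1 (high risk): `negative_positive_ratio` is a hypothetical helper and the counts are made up for illustration.

```python
def negative_positive_ratio(labels):
    """Ratio of negative (0) to positive (1) labels, as passed to scale_pos_weight."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("No positive labels: the ratio is undefined")
    return n_neg / n_pos


# Toy labels: 900 low-risk and 100 high-risk customers (illustrative only)
labels = [0] * 900 + [1] * 100
ratio = negative_positive_ratio(labels)   # 900 / 100
```

With a 90/10 split the ratio is 9, meaning each high-risk example counts roughly as much as nine low-risk ones. A perfectly balanced dataset gives a ratio of 1, which leaves the weighting neutral.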

## 8. 🎯 Model 4: Random Forest - BEHAVIOR ONLY

⚠️ **ROBUSTNESS TEST**: Can the model predict risk WITHOUT knowing the balances?

**EXCLUDED variables**:
- `avg_balance`, `last_balance`, `min_balance`, `max_balance`
- `spend_to_balance_ratio`

**INCLUDED variables** (behavior only):
- `age`, `avg_spend`, `total_spend`, `total_transactions`, `days_active`, etc.
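
One way to keep the two lists consistent is to derive the behavior-only set from the full feature list by exclusion, rather than maintaining it by hand. The sketch below is an assumption about how that could look, not the notebook's own method; `ALL_FEATURES`, `BALANCE_FEATURES`, and `behavior_features` are hypothetical names, while the column names are the ones used in this notebook.

```python
# Full feature list as defined in section 3 of this notebook.
ALL_FEATURES = [
    "age", "avg_balance", "last_balance", "min_balance", "max_balance",
    "total_spend", "avg_spend", "max_spend", "min_spend", "std_spend",
    "total_transactions", "days_active", "unique_transaction_days",
    "transaction_frequency", "spend_to_balance_ratio", "spend_volatility",
    "avg_daily_transactions", "avg_daily_spend",
]

# Columns that carry balance information, directly or through a ratio.
BALANCE_FEATURES = {
    "avg_balance", "last_balance", "min_balance", "max_balance",
    "spend_to_balance_ratio",
}

# Order-preserving exclusion: keep everything that is not balance-based.
behavior_features = [f for f in ALL_FEATURES if f not in BALANCE_FEATURES]
```

Deriving the list this way means a feature added to the full model later is either picked up automatically or must be consciously marked as balance-based.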

In [None]:
# BEHAVIOR-ONLY features (no balance)
FEATURES_BEHAVIOR = [
    'age',                      # Demographics
    'total_spend',              # Spend
    'avg_spend',
    'max_spend',
    'min_spend',
    'std_spend',
    'total_transactions',       # Activity
    'days_active',
    'unique_transaction_days',
    'transaction_frequency',
    'spend_volatility',
    'avg_daily_transactions',
    'avg_daily_spend',
]

# EXCLUDED variables
EXCLUDED_FEATURES = [
    'avg_balance', 'last_balance', 'min_balance', 'max_balance',
    'spend_to_balance_ratio'
]

print('✅ BEHAVIOR features (no balance):')
for f in FEATURES_BEHAVIOR:
    print(f'   • {f}')

print('\n❌ EXCLUDED features (balance-based):')
for f in EXCLUDED_FEATURES:
    print(f'   ⚠️ {f}')

In [None]:
# Keep only the features available in the dataset
FEATURES_BEHAVIOR = [f for f in FEATURES_BEHAVIOR if f in df.columns]

# Build X for the behavior-only model
X_behavior = df[FEATURES_BEHAVIOR].copy()
X_behavior = X_behavior.fillna(X_behavior.median())

print(f'📊 Behavior features: {len(FEATURES_BEHAVIOR)}')

In [None]:
# Train/Test split for the behavior-only model (same seed and stratification as before)
X_train_beh, X_test_beh, y_train_beh, y_test_beh = train_test_split(
    X_behavior, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print(f'📊 Train: {X_train_beh.shape[0]:,} samples')
print(f'📊 Test:  {X_test_beh.shape[0]:,} samples')

In [None]:
# Train Random Forest - Behavior Only
rf_behavior = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    random_state=42,
    class_weight='balanced',
    n_jobs=-1
)

rf_behavior.fit(X_train_beh, y_train_beh)

# Predictions
y_pred_beh = rf_behavior.predict(X_test_beh)
y_prob_beh = rf_behavior.predict_proba(X_test_beh)[:, 1]

# Metrics
beh_auc = roc_auc_score(y_test_beh, y_prob_beh)
beh_gini = 2 * beh_auc - 1

print('=' * 60)
print('📊 RANDOM FOREST - BEHAVIOR ONLY (No Balance)')
print('=' * 60)
print(f'ROC-AUC: {beh_auc:.4f}')
print(f'Gini:    {beh_gini:.4f}')
print(f'\n{classification_report(y_test_beh, y_pred_beh, target_names=["Low Risk", "High Risk"])}')

In [None]:
# Feature Importance - Behavior Only
importance_beh_df = pd.DataFrame({
    'Feature': FEATURES_BEHAVIOR,
    'Importance': rf_behavior.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot
plt.figure(figsize=(10, 6))
plt.barh(importance_beh_df['Feature'][::-1], importance_beh_df['Importance'][::-1], color='coral')
plt.xlabel('Importance')
plt.title('Feature Importance - Behavior Only (No Balance)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print('\n📊 Most important features (Behavior Only):')
print(importance_beh_df.to_string(index=False))

## 9. Comparison: With Balance vs Without Balance

In [None]:
# KS (Kolmogorov-Smirnov): largest gap between TPR and FPR across thresholds
def calculate_ks(y_true, y_prob):
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    return max(tpr - fpr)

ks_lr = calculate_ks(y_test, y_prob_lr)
ks_rf = calculate_ks(y_test, y_prob_rf)
ks_xgb = calculate_ks(y_test, y_prob_xgb)
ks_beh = calculate_ks(y_test_beh, y_prob_beh)

# Comparison table
comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost', 'RF Behavior Only'],
    'ROC-AUC': [lr_auc, rf_auc, xgb_auc, beh_auc],
    'Gini': [lr_gini, rf_gini, xgb_gini, beh_gini],
    'KS': [ks_lr, ks_rf, ks_xgb, ks_beh],
    'Uses Balance': ['Yes', 'Yes', 'Yes', 'NO']
}).round(4)

print('=' * 70)
print('📊 COMPARISON: WITH BALANCE vs WITHOUT BALANCE')
print('=' * 70)
print(comparison.to_string(index=False))
print('=' * 70)
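
The KS column measures how far apart the score distributions of the two classes are: it is the largest difference between the true positive rate and the false positive rate over all score thresholds. The sketch below walks the thresholds by hand to show where that maximum comes from. `ks_statistic` is a hypothetical helper, labels are assumed to be coded 0/1, and the data is a toy example.

```python
def ks_statistic(y_true, y_score):
    """Largest gap between TPR and FPR when sweeping the threshold from high to low."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes must be present to compute KS")

    pairs = sorted(zip(y_score, y_true), reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        # Consume every observation sharing this score before measuring the gap.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        best = max(best, tp / n_pos - fp / n_neg)
        i = j
    return best


# Toy example (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
ks = ks_statistic(y_true, y_score)
```

In the toy data the gap peaks at 0.5, first when only the top-scored customer is flagged (TPR 0.5, FPR 0) and again once the third score is included (TPR 1.0, FPR 0.5).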

In [None]:
# Comparative visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

models = ['LR', 'RF', 'XGB', 'RF\n(Behavior)']
colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12']
aucs = [lr_auc, rf_auc, xgb_auc, beh_auc]

# ROC-AUC
axes[0].bar(models, aucs, color=colors)
axes[0].set_title('ROC-AUC by Model', fontsize=14, fontweight='bold')
axes[0].set_ylim(0, 1.1)
axes[0].axhline(0.7, color='gray', linestyle='--', alpha=0.5, label='Acceptable threshold (0.7)')
for i, v in enumerate(aucs):
    axes[0].text(i, v + 0.02, f'{v:.4f}', ha='center', fontweight='bold')
axes[0].legend()

# ROC curves (computed here for all four models)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_prob_xgb)
fpr_beh, tpr_beh, _ = roc_curve(y_test_beh, y_prob_beh)

axes[1].plot(fpr_lr, tpr_lr, label=f'LR (AUC={lr_auc:.3f})', linewidth=2)
axes[1].plot(fpr_rf, tpr_rf, label=f'RF (AUC={rf_auc:.3f})', linewidth=2)
axes[1].plot(fpr_xgb, tpr_xgb, label=f'XGB (AUC={xgb_auc:.3f})', linewidth=2)
axes[1].plot(fpr_beh, tpr_beh, label=f'RF Behavior (AUC={beh_auc:.3f})', linewidth=2, linestyle='--')
axes[1].plot([0, 1], [0, 1], 'k--', alpha=0.5)
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curves', fontsize=14, fontweight='bold')
axes[1].legend(loc='lower right')

plt.tight_layout()
plt.show()

## 10. Conclusions

In [None]:
print('=' * 70)
print('📊 CONCLUSIONS - NeoScore Credit Scoring')
print('=' * 70)

# Interpretation: absolute and relative AUC drop when balance features are removed
drop_in_auc = rf_auc - beh_auc
pct_drop = (drop_in_auc / rf_auc) * 100

print(f'''
1. MODELS WITH BALANCE (Full RF):
   - ROC-AUC: {rf_auc:.4f}
   - ⚠️ May be "cheating" by using balance to predict risk

2. BEHAVIOR-ONLY MODEL (RF Without Balance):
   - ROC-AUC: {beh_auc:.4f}
   - Drop in AUC: {drop_in_auc:.4f} ({pct_drop:.1f}%)
   - This is the model's ACTUAL PERFORMANCE without "cheating"

3. INTERPRETATION:
''')

if beh_auc >= 0.70:
    print('   ✅ The behavior model has AUC >= 0.70')
    print('   ✅ It CAN predict risk without knowing the balance')
    print('   ✅ The model is ROBUST and useful for production')
elif beh_auc >= 0.60:
    print('   ⚠️ The behavior model has AUC between 0.60 and 0.70')
    print('   ⚠️ MODERATE predictive power without balance')
    print('   ⚠️ Consider adding more behavior features')
else:
    print('   ❌ The behavior model has AUC < 0.60')
    print('   ❌ Behavior ALONE is not enough to predict risk')
    print('   ❌ The original model relied too heavily on balance')

print('''
4. MOST IMPORTANT FEATURES (Without Balance):
''')
print(importance_beh_df.head(5).to_string(index=False))

print('\n' + '=' * 70)
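
The decision rule in the cell above can be isolated into two small functions, which makes the thresholds and the drop arithmetic easy to check in isolation. This is a sketch only: `relative_auc_drop` and `interpret_behavior_auc` are hypothetical helpers, the band labels are shorthand for the three printed verdicts, and the AUC values are made-up inputs, not results from this notebook.

```python
def relative_auc_drop(full_auc, behavior_auc):
    """Percentage of the full model's AUC lost when balance features are removed."""
    return (full_auc - behavior_auc) / full_auc * 100


def interpret_behavior_auc(auc):
    """Map the behavior-only AUC to the three bands used in the conclusions."""
    if auc >= 0.70:
        return "robust"
    if auc >= 0.60:
        return "moderate"
    return "insufficient"


# Hypothetical inputs, for illustration only
example_full_auc = 0.80
example_behavior_auc = 0.72

drop_pct = relative_auc_drop(example_full_auc, example_behavior_auc)
verdict = interpret_behavior_auc(example_behavior_auc)
```

For these example inputs the absolute drop is 0.08, which is 10% of the full model's AUC, and the behavior-only model still lands in the top band.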