<a href="https://colab.research.google.com/github/dtoralg/INESDI_Data-Science_ML_IA/blob/main/%5B05%5D%20-%20Arboles%20de%20decision/LightGMB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Trees: LightGBM - Exercise 4: LightGMB.ipynb

This notebook is an **I do**: everything is solved and explained step by step.

## Objectives

- Create a random dataset with 50,000 examples to predict whether a customer will buy.
- Train a LightGBM model.
- Evaluate the model with classification metrics (accuracy, confusion matrix, and report).
- Show the importance of each feature (which variables the model relies on most to decide).
- Run a competition between LightGBM, XGBoost, and Gradient Boosting.

## 1) Install and load the libraries (lightgbm, xgboost)

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.datasets import make_classification, make_regression
from sklearn.metrics import accuracy_score, mean_squared_error, log_loss
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt

# IMPORTANT: install the libraries first:
# pip install lightgbm xgboost
import lightgbm as lgb
import xgboost as xgb
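
Because `lightgbm` and `xgboost` are not part of a default Python install, it can help to check for them before running the imports above. A minimal sketch using only the standard library; the helper name `missing_packages` is made up for this illustration:

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical usage: list which of the notebook's dependencies still need installing.
required = ["pandas", "numpy", "sklearn", "lightgbm", "xgboost"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Note: the import name 'sklearn' is installed as 'scikit-learn'.")
else:
    print("All dependencies are available.")
```

The check only looks at top-level import names, so it says nothing about installed versions.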

## 2) Prepare the data

In [2]:
# Prepare the data

print("PREPARING THE TRACK:")
print("-" * 50)

# Large dataset so that real differences show up
np.random.seed(42)
X_large, y_large = make_classification(
    n_samples=50000,  # LARGE dataset to compare speed
    n_features=100,
    n_informative=80,
    n_redundant=10,
    n_classes=2,
    random_state=42
)

# Add categorical features (an advantage for LightGBM)
# Note: they are drawn at random, independently of y, so they carry no signal
categorical_features = []
for i in range(5):  # 5 categorical features
    cat_feature = np.random.choice(['A', 'B', 'C', 'D'], size=X_large.shape[0])
    # Encode as integer codes for sklearn
    cat_encoded = pd.Categorical(cat_feature).codes
    X_large = np.column_stack([X_large, cat_encoded])
    categorical_features.append(X_large.shape[1] - 1)

print("🏁 Track ready:")
print(f"   📊 Samples: {X_large.shape[0]:,}")
print(f"   📋 Features: {X_large.shape[1]} (5 categorical)")
print(f"   🏷️  Categorical features: {categorical_features}")

PREPARING THE TRACK:
--------------------------------------------------
🏁 Track ready:
   📊 Samples: 50,000
   📋 Features: 105 (5 categorical)
   🏷️  Categorical features: [100, 101, 102, 103, 104]


In [3]:
# Create some missing values
n_missing = int(0.05 * X_large.shape[0] * X_large.shape[1])  # target: 5% missing (repeated positions collapse)
missing_rows = np.random.choice(X_large.shape[0], size=n_missing, replace=True)
missing_cols = np.random.choice(X_large.shape[1]-5, size=n_missing, replace=True)  # Not in the categorical columns
X_large[missing_rows, missing_cols] = np.nan

print(f"   🕳️  Missing values: {np.isnan(X_large).sum():,}")

   🕳️  Missing values: 255,703
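
The count printed above (255,703) is below the 5% target of 262,500 because rows and columns are drawn with replacement, so some (row, column) positions are hit more than once. As a rough check, assuming positions are drawn uniformly and independently, the expected number of distinct cells can be computed directly. The function name is illustrative:

```python
def expected_distinct_cells(n_cells, n_draws):
    """Expected number of distinct cells hit by n_draws uniform draws with replacement."""
    return n_cells * (1.0 - (1.0 - 1.0 / n_cells) ** n_draws)


n_rows, n_cols_total, n_cols_numeric = 50_000, 105, 100

# Same target as the cell above: 5% of the cells across all 105 columns.
n_draws = int(0.05 * n_rows * n_cols_total)

# NaNs are only placed in the 100 numeric columns.
n_cells = n_rows * n_cols_numeric

print(f"Draws requested: {n_draws:,}")
print(f"Expected distinct NaN cells: {expected_distinct_cells(n_cells, n_draws):,.0f}")
```

The estimate lands close to the observed count. To hit a target exactly, one option is to sample flat cell indices without replacement, for example with `np.random.choice(n_cells, size=n_draws, replace=False)`, and convert them to (row, column) pairs with `np.unravel_index`.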


In [4]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_large, y_large, test_size=0.2, random_state=42
)
print(f"   🏋️  Training: {X_train.shape[0]:,} samples")
print(f"   🧪 Test: {X_test.shape[0]:,} samples")

   🏋️  Training: 40,000 samples
   🧪 Test: 10,000 samples


## 3) Prepare the models

In [5]:
# Prepare the models

print("PREPARING THE COMPETITORS:")
print("-" * 50)

# Shared configuration
common_params = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'random_state': 42
}

print("⚙️  Shared configuration:")
print(f"   🌳 Estimators: {common_params['n_estimators']}")
print(f"   📏 Max depth: {common_params['max_depth']}")
print(f"   📚 Learning rate: {common_params['learning_rate']}")

# Prepare data for Gradient Boosting (needs imputation)
imputer = SimpleImputer(strategy='mean')
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)

print("\n🏎️ COMPETITORS:")

# 1. Classic Gradient Boosting
print("\n🥉 Competitor 1: Classic Gradient Boosting")
print("   💪 Strengths: Stable, reliable, easy to use")
print("   😰 Weaknesses: Slow, needs preprocessing")

gbm_model = GradientBoostingClassifier(**common_params)

# 2. XGBoost
print("\n🥈 Competitor 2: XGBoost")
print("   💪 Strengths: Balanced, mature, accurate")
print("   😰 Weaknesses: Slower than LightGBM")

xgb_model = xgb.XGBClassifier(
    **common_params,
    eval_metric='logloss'
)

# 3. LightGBM
print("\n🥇 Competitor 3: LightGBM")
print("   💪 Strengths: VERY FAST, memory-efficient")
print("   😰 Weaknesses: Can overfit on small datasets")

lgb_model = lgb.LGBMClassifier(
    **common_params,
    verbose=-1  # Suppress detailed output
)

PREPARING THE COMPETITORS:
--------------------------------------------------
⚙️  Shared configuration:
   🌳 Estimators: 100
   📏 Max depth: 6
   📚 Learning rate: 0.1

🏎️ COMPETITORS:

🥉 Competitor 1: Classic Gradient Boosting
   💪 Strengths: Stable, reliable, easy to use
   😰 Weaknesses: Slow, needs preprocessing

🥈 Competitor 2: XGBoost
   💪 Strengths: Balanced, mature, accurate
   😰 Weaknesses: Slower than LightGBM

🥇 Competitor 3: LightGBM
   💪 Strengths: VERY FAST, memory-efficient
   😰 Weaknesses: Can overfit on small datasets
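
One caveat about sharing `max_depth=6` across libraries: XGBoost grows trees level by level by default, so a depth-6 tree can have up to 2^6 = 64 leaves, while LightGBM grows leaf by leaf and is additionally capped by `num_leaves`, whose default is 31. A sketch of how the budgets could be aligned, assuming you want LightGBM to be allowed as many leaves as a full depth-6 tree (the helper and variable names are illustrative):

```python
def max_leaves_for_depth(max_depth):
    """Maximum number of leaves in a binary tree of the given depth."""
    return 2 ** max_depth


common_params = {
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.1,
    "random_state": 42,
}

# Assumption: LightGBM's leaf cap should match a full depth-6 tree.
lgb_params = dict(
    common_params,
    num_leaves=max_leaves_for_depth(common_params["max_depth"]),
    verbose=-1,
)

print(lgb_params["num_leaves"])  # 64
# These could then be passed as lgb.LGBMClassifier(**lgb_params).
```

Whether to align the leaf budget is a design choice: leaving the default keeps LightGBM's trees smaller, which affects both speed and accuracy in the comparison.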


## 4) Train the models

In [None]:
# Train the models

print("🏁 SPEED RACE:")
print("-" * 50)

# Dictionary to store the results
results = {}

print("🚦 Ready, set, GO!")

# RACE 1: Gradient Boosting (uses the imputed data)
print("\n🥉 Running Gradient Boosting...")
start_time = time.time()
gbm_model.fit(X_train_filled, y_train)
gbm_time = time.time() - start_time
gbm_pred = gbm_model.predict(X_test_filled)
gbm_pred_proba = gbm_model.predict_proba(X_test_filled)[:, 1]
gbm_accuracy = accuracy_score(y_test, gbm_pred)
gbm_logloss = log_loss(y_test, gbm_pred_proba)

results['GradientBoosting'] = {
    'time': gbm_time,
    'accuracy': gbm_accuracy,
    'logloss': gbm_logloss
}

print(f"   ⏱️  Time: {gbm_time:.2f}s")
print(f"   🎯 Accuracy: {gbm_accuracy:.4f}")

# RACE 2: XGBoost (handles NaN natively)
print("\n🥈 Running XGBoost...")
start_time = time.time()
xgb_model.fit(X_train, y_train)
xgb_time = time.time() - start_time
xgb_pred = xgb_model.predict(X_test)
xgb_pred_proba = xgb_model.predict_proba(X_test)[:, 1]
xgb_accuracy = accuracy_score(y_test, xgb_pred)
xgb_logloss = log_loss(y_test, xgb_pred_proba)

results['XGBoost'] = {
    'time': xgb_time,
    'accuracy': xgb_accuracy,
    'logloss': xgb_logloss
}

print(f"   ⏱️  Time: {xgb_time:.2f}s")
print(f"   🎯 Accuracy: {xgb_accuracy:.4f}")

# RACE 3: LightGBM (first WITHOUT declaring the categorical features)
print("\n🥇 Running LightGBM (not optimized)...")
start_time = time.time()
lgb_model.fit(X_train, y_train)
lgb_time_basic = time.time() - start_time
lgb_pred_basic = lgb_model.predict(X_test)
lgb_pred_proba_basic = lgb_model.predict_proba(X_test)[:, 1]
lgb_accuracy_basic = accuracy_score(y_test, lgb_pred_basic)
lgb_logloss_basic = log_loss(y_test, lgb_pred_proba_basic)

print(f"   ⏱️  Time: {lgb_time_basic:.2f}s")
print(f"   🎯 Accuracy: {lgb_accuracy_basic:.4f}")

# RACE 4: OPTIMIZED LightGBM (with the categorical features declared)
print("\n🚀 Running OPTIMIZED LightGBM...")
lgb_optimized = lgb.LGBMClassifier(
    **common_params,
    verbose=-1
)

start_time = time.time()
lgb_optimized.fit(
    X_train, y_train,
    categorical_feature=categorical_features
)
lgb_time_opt = time.time() - start_time
lgb_pred_opt = lgb_optimized.predict(X_test)
lgb_pred_proba_opt = lgb_optimized.predict_proba(X_test)[:, 1]
lgb_accuracy_opt = accuracy_score(y_test, lgb_pred_opt)
lgb_logloss_opt = log_loss(y_test, lgb_pred_proba_opt)

results['LightGBM_basic'] = {
    'time': lgb_time_basic,
    'accuracy': lgb_accuracy_basic,
    'logloss': lgb_logloss_basic
}

results['LightGBM_optimized'] = {
    'time': lgb_time_opt,
    'accuracy': lgb_accuracy_opt,
    'logloss': lgb_logloss_opt
}

print(f"   ⏱️  Time: {lgb_time_opt:.2f}s")
print(f"   🎯 Accuracy: {lgb_accuracy_opt:.4f}")

🏁 SPEED RACE:
--------------------------------------------------
🚦 Ready, set, GO!

🥉 Running Gradient Boosting...
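
Besides accuracy, each run stores `log_loss`, which penalizes confident but wrong probabilities. For reference, binary log loss is the mean of -[y log p + (1 - y) log(1 - p)] over the samples. A plain-Python sketch of that formula follows. It is for illustration only and is not scikit-learn's implementation, and the clipping constant `eps` is an assumption:

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite at p = 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Three toy predictions: all on the right side of 0.5, with varying confidence.
print(round(binary_log_loss([1, 0, 1], [0.9, 0.1, 0.8]), 4))  # 0.1446

# A confident mistake costs far more than a confident hit.
print(binary_log_loss([1], [0.01]) > binary_log_loss([1], [0.99]))  # True
```

This is why two models with the same accuracy can still differ in log loss: the metric rewards well-calibrated probabilities, not just correct labels.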


## 5) Compare training speed

In [None]:
print("🏆 SPEED PODIUM:")
print("-" * 50)

# Sort by training time (fastest first)
speed_ranking = sorted(results.items(), key=lambda x: x[1]['time'])

print("🏁 RACE RESULTS:")
print(f"{'Pos':<3} {'Competitor':<20} {'Time':<10} {'Accuracy':<10} {'Rel. speed':<10}")
print("-" * 65)

fastest_time = speed_ranking[0][1]['time']
for i, (name, metrics) in enumerate(speed_ranking):
    # Speed relative to the fastest model: 1.0x for the winner, below 1.0x for slower ones
    speedup = fastest_time / metrics['time']
    medal = "🥇" if i == 0 else "🥈" if i == 1 else "🥉" if i == 2 else "  "
    print(f"{medal:<3} {name:<20} {metrics['time']:<8.2f}s {metrics['accuracy']:<8.4f} {speedup:<8.1f}x")
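
The podium above ranks by training time only, although `results` also stores accuracy and log loss. A sketch of ranking the same kind of structure by any stored metric; the numbers below are placeholders invented for the example, not measurements from this notebook, and `rank_by` is a hypothetical helper:

```python
# Placeholder values for illustration only; they are NOT results from this notebook.
example_results = {
    "ModelA": {"time": 30.0, "accuracy": 0.90, "logloss": 0.30},
    "ModelB": {"time": 2.0, "accuracy": 0.92, "logloss": 0.25},
    "ModelC": {"time": 1.0, "accuracy": 0.91, "logloss": 0.27},
}


def rank_by(results, metric, higher_is_better=False):
    """Return model names sorted from best to worst on the given metric."""
    return sorted(results, key=lambda name: results[name][metric], reverse=higher_is_better)


print(rank_by(example_results, "logloss"))                          # ['ModelB', 'ModelC', 'ModelA']
print(rank_by(example_results, "accuracy", higher_is_better=True))  # ['ModelB', 'ModelC', 'ModelA']
print(rank_by(example_results, "time"))                             # ['ModelC', 'ModelB', 'ModelA']
```

Looking at more than one ranking makes it easier to see when the fastest model is not also the best-calibrated one.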

## 6) Analyze memory usage (resources)

In [None]:
print("📊 MEMORY ANALYSIS:")
print("-" * 50)

import psutil
import os

def get_memory_usage():
    """Resident memory (RSS) of the current process, in MB."""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024  # MB

# The figures below are rough rules of thumb; this cell does not measure them.
print("💾 MEMORY USAGE (approximate):")
print("   🥉 Gradient Boosting: ~baseline")
print("   🥈 XGBoost: ~baseline + 20-30%")
print("   🥇 LightGBM: ~baseline - 30-50%")
print("\n   📝 LightGBM is usually the most memory-efficient of the three!")
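
The percentages above are not measured by this notebook. One way to get a rough number is to record peak allocations around a call. The sketch below uses the standard-library `tracemalloc`, with an important caveat: it only sees allocations made through Python's allocator, so memory allocated inside the native cores of LightGBM or XGBoost is largely invisible to it. Comparing `get_memory_usage()` before and after `fit` is the process-level alternative. The helper name `peak_python_memory_mb` is illustrative:

```python
import tracemalloc


def peak_python_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak MB of Python-tracked allocations during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1024 / 1024


# Toy workload standing in for something like model.fit(X_train, y_train).
def build_list(n):
    return [float(i) for i in range(n)]


data, peak_mb = peak_python_memory_mb(build_list, 200_000)
print(f"Peak Python allocations: {peak_mb:.1f} MB")
```

Either approach gives an approximation; for a fair comparison each model would need to be measured in a fresh process so that earlier allocations do not distort the numbers.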

## 7) Explore LightGBM's special features

In [None]:
# LIGHTGBM'S SPECIAL FEATURES

print("🌟 LIGHTGBM'S SPECIAL FEATURES:")
print("-" * 50)

print("🏷️ NATIVE CATEGORICAL HANDLING:")
print("   ✅ LightGBM handles categories WITHOUT one-hot encoding")
print("   ❌ XGBoost needs encoding by default (native support requires enable_categorical=True)")
print("   ❌ Gradient Boosting needs preprocessing")

print("\n📊 In our example:")
print(f"   🔢 Categorical features: {len(categorical_features)}")
print("   🚀 LightGBM used them directly")
print("   🔄 The other models treated them as numeric")

# Compare feature importances
# Note: LightGBM reports split counts by default, while XGBoost reports normalized gain,
# so compare the rankings rather than the raw values.
print("\n📈 FEATURE IMPORTANCES:")
print("\n🥇 LightGBM top 5:")
lgb_importance = lgb_optimized.feature_importances_
top_lgb = np.argsort(lgb_importance)[-5:][::-1]
for i, idx in enumerate(top_lgb):
    feature_type = "Categorical" if idx in categorical_features else "Numeric"
    print(f"   {i+1}. Feature_{idx:2d} ({feature_type}): {lgb_importance[idx]:.4f}")

print("\n🥈 XGBoost top 5:")
xgb_importance = xgb_model.feature_importances_
top_xgb = np.argsort(xgb_importance)[-5:][::-1]
for i, idx in enumerate(top_xgb):
    feature_type = "Categorical" if idx in categorical_features else "Numeric"
    print(f"   {i+1}. Feature_{idx:2d} ({feature_type}): {xgb_importance[idx]:.4f}")
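
The two top-5 lists above are not on the same scale: by default `LGBMClassifier.feature_importances_` counts how many times a feature is used in splits, whereas `XGBClassifier.feature_importances_` for tree boosters reports normalized gain. Setting `importance_type='gain'` on the LightGBM model is one way to align the definition. Rescaling to fractions, as sketched below, at least makes the magnitudes comparable. The function name and the toy numbers are assumptions for illustration:

```python
def top_k_as_fractions(importances, k=5):
    """Return [(feature_index, share_of_total)] for the k largest importances."""
    total = float(sum(importances))
    if total == 0:
        return []
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return [(i, importances[i] / total) for i in order[:k]]


# Toy split counts for six features (made-up numbers).
toy_split_counts = [10, 0, 40, 25, 5, 20]
for idx, share in top_k_as_fractions(toy_split_counts, k=3):
    print(f"Feature_{idx:2d}: {share:.2%}")
```

In the notebook the same helper could be applied to `list(lgb_importance)` and `list(xgb_importance)` before printing.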

## 8) Try a larger dataset

In [None]:
# SCALABILITY - LARGER DATASET
print(" 🏋️ SCALABILITY TEST:")
print("-" * 50)

print("🔬 Creating a MASSIVE dataset to bring out larger differences...")

# Larger dataset
X_massive, y_massive = make_classification(
    n_samples=100000,  # 100K samples!
    n_features=50,
    n_informative=40,
    random_state=42
)

X_train_massive, X_test_massive, y_train_massive, y_test_massive = train_test_split(
    X_massive, y_massive, test_size=0.2, random_state=42
)

print(f"📊 Massive dataset: {X_train_massive.shape[0]:,} training samples")

# Only LightGBM vs XGBoost here (GBM would be very slow)
print("\n⚡ EXTREME RACE (LightGBM vs XGBoost only):")

# XGBoost on the massive dataset
print("🥈 XGBoost on the massive dataset...")
start_time = time.time()
xgb_massive = xgb.XGBClassifier(n_estimators=50, max_depth=6, random_state=42, eval_metric='logloss')
xgb_massive.fit(X_train_massive, y_train_massive)
xgb_massive_time = time.time() - start_time
print(f"   ⏱️  Time: {xgb_massive_time:.2f}s")

# LightGBM on the massive dataset
print("🥇 LightGBM on the massive dataset...")
start_time = time.time()
lgb_massive = lgb.LGBMClassifier(n_estimators=50, max_depth=6, random_state=42, verbose=-1)
lgb_massive.fit(X_train_massive, y_train_massive)
lgb_massive_time = time.time() - start_time
print(f"   ⏱️  Time: {lgb_massive_time:.2f}s")

# A ratio above 1.0 means LightGBM trained faster; below 1.0 means XGBoost did
print("\n🚀 ON THE MASSIVE DATASET:")
print(f"   XGBoost time / LightGBM time = {xgb_massive_time/lgb_massive_time:.1f}x")
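
A single `time.time()` measurement can be noisy, since it is affected by other processes, caching, and thread start-up. A sketch of a more robust comparison repeats each run several times with `time.perf_counter()` and reports the median; `median_runtime` and the toy workload are illustrative names, not part of the notebook:

```python
import statistics
import time


def median_runtime(func, repeats=5):
    """Call func() `repeats` times and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Toy workload standing in for a model fit.
def toy_workload():
    return sum(i * i for i in range(100_000))


print(f"Median over 5 runs: {median_runtime(toy_workload):.4f}s")
```

In the notebook this could wrap a fresh model on each call, for example `median_runtime(lambda: lgb.LGBMClassifier(n_estimators=50, max_depth=6, random_state=42, verbose=-1).fit(X_train_massive, y_train_massive))`.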

## 9) Final results and model comparison

In [None]:
# FINAL VERDICT
print("🏆 FINAL VERDICT:")
print("-" * 50)

# Qualitative ratings (more icons = better); they are not computed from `results`
print("📊 PERFORMANCE SUMMARY:")
print(f"{'Model':<20} {'Speed':<10} {'Accuracy':<10} {'Memory':<10}")
print("-" * 55)
print(f"{'LightGBM':<20} {'🚀🚀🚀':<10} {'🎯🎯🎯':<10} {'💾💾💾':<10}")
print(f"{'XGBoost':<20} {'🚀🚀':<10} {'🎯🎯🎯':<10} {'💾💾':<10}")
print(f"{'GradientBoosting':<20} {'🚀':<10} {'🎯🎯':<10} {'💾':<10}")


## 10) Conclusions and advice: when to use each one


🥇 USE LIGHTGBM WHEN:
 *  ✅ The dataset is large (>10,000 samples)
 *  ✅ Speed is critical
 *  ✅ You have categorical features
 *  ✅ Memory is limited
 *  ✅ You need to iterate on experiments quickly
 *  ✅ The model runs in production

🥈 USE XGBOOST WHEN:
 *  ✅ The dataset is medium-sized (1,000-100,000 samples)
 *  ✅ You want maximum stability
 *  ✅ You rely on a mature ecosystem
 *  ✅ You compete in traditional Kaggle competitions

🥉 USE GRADIENT BOOSTING WHEN:
 *  ✅ You are learning the concepts
 *  ✅ The dataset is small (<1,000 samples)
 *  ✅ Simplicity is key
 *  ✅ You are prototyping quickly

🏆 OVERALL WINNER:
 *  👑 LightGBM, the new king of gradient boosting
 *  🚀 Typically faster and more memory-efficient, with competitive accuracy
 *  🎯 A great fit for the Big Data era

📚 RECOMMENDED PROGRESSION:
 *  1️⃣ Learn with classic Gradient Boosting
 *  2️⃣ Master XGBoost for stability
 *  3️⃣ Move to LightGBM for maximum performance