# Ensemble Methods in Python: A Detailed Explanation

Ensemble methods in machine learning combine multiple models to achieve better predictive performance than any single model. In Python, several libraries make these methods straightforward to implement.

## Why use ensemble methods?

- Higher accuracy: Combining multiple models can reduce variance (as in bagging) or bias (as in boosting), which usually yields better predictions.
- Greater stability: Averaging over many models makes the result less sensitive to the particular training sample, so bagging-style ensembles tend to overfit less than a single model.
- Better generalization: An ensemble can capture more complex patterns and generalize better to new data.
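
The variance argument can be illustrated with a small simulation. This is a minimal sketch under an idealized assumption: the "models" are simulated as unbiased predictors with independent noise, which real models trained on the same data are not, so the gain in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0
n_trials, n_models = 10_000, 25

# Each column is one hypothetical model: unbiased, independent noise with std = 1.
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

single_var = predictions[:, 0].var()           # variance of one model alone
ensemble_var = predictions.mean(axis=1).var()  # variance of the averaged prediction

print(f"single model variance: {single_var:.3f}")
print(f"ensemble variance:     {ensemble_var:.3f}")  # roughly single_var / n_models
```

With independent errors the variance of the average shrinks roughly in proportion to the number of models; correlated errors weaken this effect.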

## Main types of ensemble methods in Python

1. Bagging (Bootstrap Aggregating)

- Idea: Trains multiple models on random samples drawn with replacement from the original training set, then aggregates their predictions by voting or averaging.
- Implementation in Python:
    - Random Forest: A set of decision trees in which each tree is built on a bootstrap sample of the data and considers a random subset of features at each split.
    - Extra Trees: Similar to Random Forest, but it also draws split thresholds at random to increase randomness (in scikit-learn it uses the full training set by default rather than bootstrap samples).
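
The bagging idea can be sketched by hand with plain decision trees. This is an illustrative sketch, not how scikit-learn implements its forests; the number of trees and the seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(0)
n_models = 25
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Aggregate by majority vote across the trees.
all_preds = np.stack([t.predict(X_test) for t in trees])  # shape (n_models, n_test)
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in all_preds.T])

bagging_accuracy = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {bagging_accuracy:.3f}")
```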

2. Boosting

- Idea: Builds models sequentially, where each new model focuses on correcting the errors of the ensemble built so far.
- Implementation in Python:
    - AdaBoost: Reweights the observations each round so that misclassified ones carry more weight in the next model.
    - Gradient Boosting: Minimizes a loss function by successively adding weak models, each fitted to the negative gradient of the loss.
    - XGBoost: A highly optimized implementation of Gradient Boosting.
    - LightGBM: Another optimized implementation of Gradient Boosting, designed especially for large datasets.
    - CatBoost: Another Gradient Boosting implementation, with built-in handling of categorical features.
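
The sequential error-correcting loop can be sketched for the simplest case, gradient boosting with squared loss, where the negative gradient is just the residual. The synthetic regression data, tree depth, learning rate, and number of rounds below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate, n_rounds = 0.1, 100
prediction = np.full_like(y, y.mean())   # round 0: a constant model
initial_mse = np.mean((y - prediction) ** 2)

weak_learners = []
for _ in range(n_rounds):
    residuals = y - prediction           # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # small step toward the residuals
    weak_learners.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.3f} -> {final_mse:.3f}")
```

Each shallow tree is a weak model on its own; the training error drops because every round fits what the previous rounds left unexplained.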

3. Stacking

- Idea: Combines the predictions of multiple models using a meta-model trained on those predictions.
- Implementation in Python:
    - Scikit-learn: Provides StackingClassifier and StackingRegressor for building stacking models.
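
A hand-rolled version makes the two levels of stacking explicit. This is a sketch with arbitrarily chosen base models (a shallow tree and k-nearest neighbors), not a description of StackingClassifier's internals.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# Level 0: out-of-fold class probabilities, so the meta-model never sees
# predictions made on rows a base model was trained on.
meta_train = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")
    for m in base_models
])

# Level 1: the meta-model learns how to weigh the base models' outputs.
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_train)

# At prediction time, each base model is refit on the full training set.
meta_test = np.hstack([
    m.fit(X_train, y_train).predict_proba(X_test) for m in base_models
])
stacked_accuracy = meta_model.score(meta_test, y_test)
print(f"stacked accuracy: {stacked_accuracy:.3f}")
```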

## Important considerations

- Base model selection: The choice of base models depends on the problem and the data.
- Hyperparameter tuning: It is important to tune the hyperparameters of both the base models and the ensemble method.
- Complexity: Ensemble methods can be computationally expensive, especially on large datasets.
- Interpretability: Ensemble models are usually less interpretable than individual models.
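
One common way to recover some interpretability is permutation importance, which measures how much a score drops when a single feature is shuffled. The sketch below assumes a random forest on iris; the forest size and number of repeats are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and record the drop in test accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

ranked = sorted(zip(iris.feature_names, result.importances_mean), key=lambda p: -p[1])
for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```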

## Useful Python libraries

- Scikit-learn: The most popular general-purpose machine learning library in Python.
- XGBoost: A library specialized in Gradient Boosting.
- LightGBM: Another library specialized in Gradient Boosting.
- CatBoost: A library specialized in Gradient Boosting with built-in handling of categorical data.

Ensemble methods are a powerful tool for improving the performance of machine learning models. Combining multiple models can produce more robust and accurate results.

## Other topics

- Comparison of different ensemble methods
- Practical applications of ensemble methods
- Feature selection techniques in ensemble models
- Hyperparameter optimization in ensemble models

# Implementations

In [16]:
import warnings

import catboost as cb
import lightgbm as lgb
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Show full DataFrames when printing
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Silence library warnings to keep the notebook output short
warnings.filterwarnings("ignore")

In [3]:
iris = load_iris()
X = iris.data  
y = iris.target 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Bagging

1. Random Forest

In [4]:
rf = RandomForestClassifier(n_estimators=5, random_state=420).fit(X_train, y_train)

y_pred = rf.predict(X_test)
accuracy_vote = accuracy_score(y_test, y_pred)
recall_vote = recall_score(y_test, y_pred, average='macro')
f1_vote = f1_score(y_test, y_pred, average='macro')
matthews_vote = matthews_corrcoef(y_test, y_pred)
accuracy_vote, recall_vote, f1_vote, matthews_vote

(1.0, 1.0, 1.0, 1.0)
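
A perfect score on a 30-sample test split is a noisy estimate, so it can help to cross-check it with cross-validation over the whole dataset. The following is a minimal sketch; the 5-fold stratified split and its seed are assumptions, not part of the original run.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Stratified folds keep the three classes balanced in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=5, random_state=420),
    X, y, cv=cv, scoring="accuracy",
)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```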

In [12]:
rf_opt = RandomForestClassifier(random_state=420)
parameters = {
    'n_estimators': (1,2,3,4,5,6,7,8,9,10),                
    'criterion': ('gini', 'entropy', 'log_loss'),            
                }
# with GridSearch
grid_search_rf = GridSearchCV(
    estimator = rf_opt,
    param_grid = parameters,
    scoring = 'accuracy',
    n_jobs = -1,
    cv = 5
)

rf_optim = grid_search_rf.fit(X_train, y_train)
y_pred = rf_optim.predict(X_test)

print(grid_search_rf.best_params_ ) 
print(grid_search_rf.best_score_ ) 

{'criterion': 'entropy', 'n_estimators': 7}
0.9416666666666667
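
The search above reports only the best parameters and the best cross-validation score. A possible follow-up, sketched here with a deliberately small illustrative grid, is to inspect every combination through `cv_results_` and to score the refit model on the held-out split.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(random_state=420),
    param_grid={"n_estimators": [1, 5, 10]},  # small grid, for illustration only
    scoring="accuracy",
    cv=5,
).fit(X_train, y_train)

# One row per parameter combination, best mean CV score first.
columns = ["param_n_estimators", "mean_test_score", "std_test_score"]
results = pd.DataFrame(grid.cv_results_)[columns].sort_values(
    "mean_test_score", ascending=False)
print(results)

# With refit=True (the default), score() uses the best estimator refit on X_train.
test_accuracy = grid.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```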


2. Extra Trees

In [4]:
ex = ExtraTreesClassifier(n_estimators=5, random_state=420).fit(X_train, y_train)

y_pred = ex.predict(X_test)
accuracy_vote = accuracy_score(y_test, y_pred)
recall_vote = recall_score(y_test, y_pred, average='macro')
f1_vote = f1_score(y_test, y_pred, average='macro')
matthews_vote = matthews_corrcoef(y_test, y_pred)
accuracy_vote, recall_vote, f1_vote, matthews_vote

(1.0, 1.0, 1.0, 1.0)

In [14]:
et_opt = ExtraTreesClassifier(random_state=420)
parameters = {
    'n_estimators': (1,2,3,4,5,6,7,8,9,10),      
    'criterion': ('gini', 'entropy', 'log_loss'),
                }
# with GridSearch
grid_search_et = GridSearchCV(
    estimator = et_opt,
    param_grid = parameters,
    scoring = 'accuracy',
    n_jobs = -1,
    cv = 5
)

et_optim = grid_search_et.fit(X_train, y_train)
y_pred = et_optim.predict(X_test)

print(grid_search_et.best_params_ ) 
print(grid_search_et.best_score_ ) 

{'criterion': 'gini', 'n_estimators': 1}
0.95


## Boosting

1. AdaBoost

In [39]:
adb = AdaBoostClassifier(n_estimators=5, random_state=420).fit(X_train, y_train)

y_pred = adb.predict(X_test)
accuracy_vote = accuracy_score(y_test, y_pred)
recall_vote = recall_score(y_test, y_pred, average='macro')
f1_vote = f1_score(y_test, y_pred, average='macro')
matthews_vote = matthews_corrcoef(y_test, y_pred)
accuracy_vote, recall_vote, f1_vote, matthews_vote

(1.0, 1.0, 1.0, 1.0)

In [40]:
adb_opt = AdaBoostClassifier(random_state=420)
parameters = {
    'learning_rate': (0.01, 0.05, 0.1, 0.2, 0.3, 0.5),
    'n_estimators': (1,2,3,4,5,6,7,8,9,10),
    'algorithm': ('SAMME', 'SAMME.R'),  # 'SAMME.R' is deprecated since scikit-learn 1.4; drop it on releases that no longer accept it
                }

# with GridSearch
grid_search_adb = GridSearchCV(
    estimator = adb_opt,
    param_grid = parameters,
    scoring = 'accuracy',
    n_jobs = -1,
    cv = 5
)

adb_optim = grid_search_adb.fit(X_train, y_train)
y_pred = adb_optim.predict(X_test)

print(grid_search_adb.best_params_ ) 
print(grid_search_adb.best_score_ ) 

{'algorithm': 'SAMME', 'learning_rate': 0.5, 'n_estimators': 10}
0.9583333333333334


2. Gradient Boosting

In [34]:
gb = GradientBoostingClassifier(n_estimators=5, random_state=420).fit(X_train, y_train)

y_pred = gb.predict(X_test)
accuracy_vote = accuracy_score(y_test, y_pred)
recall_vote = recall_score(y_test, y_pred, average='macro')
f1_vote = f1_score(y_test, y_pred, average='macro')
matthews_vote = matthews_corrcoef(y_test, y_pred)
accuracy_vote, recall_vote, f1_vote, matthews_vote

(1.0, 1.0, 1.0, 1.0)

In [36]:
gb_opt = GradientBoostingClassifier(random_state=420)
parameters = {
    'learning_rate': (0.01, 0.05, 0.1, 0.2, 0.3, 0.5),
    'n_estimators': (1,2,3,4,5,6,7,8,9,10),
    'loss': ('exponential', 'log_loss')  # 'exponential' supports only binary targets, so those fits fail on 3-class iris and score NaN
                }

# with GridSearch
grid_search_gb = GridSearchCV(
    estimator = gb_opt,
    param_grid = parameters,
    scoring = 'accuracy',
    n_jobs = -1,
    cv = 5
)

gb_optim = grid_search_gb.fit(X_train, y_train)
y_pred = gb_optim.predict(X_test)

print(grid_search_gb.best_params_ ) 
print(grid_search_gb.best_score_ ) 

{'learning_rate': 0.05, 'loss': 'log_loss', 'n_estimators': 1}
0.95


3. XGBoost

In [13]:
xgbc = xgb.XGBClassifier(tree_method="hist").fit(X_train, y_train)

y_pred = xgbc.predict(X_test)
accuracy_vote = accuracy_score(y_test, y_pred)
recall_vote = recall_score(y_test, y_pred, average='macro')
f1_vote = f1_score(y_test, y_pred, average='macro')
matthews_vote = matthews_corrcoef(y_test, y_pred)
accuracy_vote, recall_vote, f1_vote, matthews_vote

(1.0, 1.0, 1.0, 1.0)

4. LightGBM

In [25]:
# Build the LightGBM training dataset
train_data = lgb.Dataset(X_train, label=y_train)

params = {
    'objective': 'multiclass',   # iris has three classes; the default objective is regression
    'num_class': 3,              # required by the multiclass objective
    'metric': 'multi_logloss',   # evaluation metric: multiclass log loss
    'num_leaves': 31,            # maximum number of leaves per tree
    'learning_rate': 0.05,       # learning rate
    'verbose': -1,               # silence the training log
}

# Train the model (100 rounds is the default of lgb.train)
gbm = lgb.train(params, train_data, num_boost_round=100)

# predict returns one probability per class; keep the most likely class
y_pred = gbm.predict(X_test).argmax(axis=1)

accuracy_vote = accuracy_score(y_test, y_pred)
recall_vote = recall_score(y_test, y_pred, average='macro')
f1_vote = f1_score(y_test, y_pred, average='macro')
matthews_vote = matthews_corrcoef(y_test, y_pred)
accuracy_vote, recall_vote, f1_vote, matthews_vote

5. CatBoost

In [27]:
cat = cb.CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    eval_metric='Accuracy',
    random_seed=420
).fit(X_train, y_train)

y_pred = cat.predict(X_test)
accuracy_vote = accuracy_score(y_test, y_pred)
recall_vote = recall_score(y_test, y_pred, average='macro')
f1_vote = f1_score(y_test, y_pred, average='macro')
matthews_vote = matthews_corrcoef(y_test, y_pred)
accuracy_vote, recall_vote, f1_vote, matthews_vote

0:	learn: 0.9500000	total: 124ms	remaining: 2m 4s
1:	learn: 0.9500000	total: 126ms	remaining: 1m 2s
2:	learn: 0.9500000	total: 127ms	remaining: 42.3s
3:	learn: 0.9666667	total: 129ms	remaining: 32s
4:	learn: 0.9666667	total: 130ms	remaining: 25.9s
5:	learn: 0.9666667	total: 132ms	remaining: 21.8s
6:	learn: 0.9666667	total: 133ms	remaining: 18.9s
7:	learn: 0.9583333	total: 135ms	remaining: 16.7s
8:	learn: 0.9583333	total: 136ms	remaining: 15s
9:	learn: 0.9500000	total: 137ms	remaining: 13.6s
10:	learn: 0.9500000	total: 139ms	remaining: 12.5s
11:	learn: 0.9500000	total: 140ms	remaining: 11.5s
12:	learn: 0.9500000	total: 141ms	remaining: 10.7s
13:	learn: 0.9500000	total: 143ms	remaining: 10s
14:	learn: 0.9500000	total: 144ms	remaining: 9.47s
15:	learn: 0.9500000	total: 146ms	remaining: 9.01s
16:	learn: 0.9500000	total: 149ms	remaining: 8.59s
17:	learn: 0.9500000	total: 150ms	remaining: 8.18s
18:	learn: 0.9583333	total: 151ms	remaining: 7.8s
19:	learn: 0.9583333	total: 152ms	remaining: 7.4

(1.0, 1.0, 1.0, 1.0)

## Stacking classifier

In [19]:
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=420)),
    ('svc', make_pipeline(StandardScaler(), 
                          LinearSVC(random_state=420)))
]

clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy_vote = accuracy_score(y_test, y_pred)
recall_vote = recall_score(y_test, y_pred, average='macro')
f1_vote = f1_score(y_test, y_pred, average='macro')
matthews_vote = matthews_corrcoef(y_test, y_pred)

accuracy_vote, recall_vote, f1_vote, matthews_vote

(1.0, 1.0, 1.0, 1.0)

## Voting classifier

### Soft

In [20]:
# Create the base models
clf1 = RandomForestClassifier(n_estimators=5, random_state=420)
clf2 = SVC(probability=True, random_state=420)  # soft voting needs predict_proba
clf3 = LogisticRegression(random_state=420)

# Create the voting model
voting = VotingClassifier(
    estimators=[('rf', clf1), ('svc', clf2), ('lr', clf3)],
    voting='soft'
)
# Train the model
voting.fit(X_train, y_train)

In [21]:
y_pred = voting.predict(X_test)

accuracy_vote = accuracy_score(y_test, y_pred)
recall_vote = recall_score(y_test, y_pred, average='macro')
f1_vote = f1_score(y_test, y_pred, average='macro')
matthews_vote = matthews_corrcoef(y_test, y_pred)

accuracy_vote, recall_vote, f1_vote, matthews_vote

(1.0, 1.0, 1.0, 1.0)

### Hard

In [22]:
# Create the base models
clf1 = RandomForestClassifier(n_estimators=5, random_state=420)
clf2 = SVC(probability=True, random_state=420)
clf3 = LogisticRegression(random_state=420)

# Create the voting model
voting = VotingClassifier(
    estimators=[('rf', clf1), ('svc', clf2), ('lr', clf3)],
    voting='hard'
)
# Train the model
voting.fit(X_train, y_train)

In [23]:
y_pred = voting.predict(X_test)

accuracy_vote = accuracy_score(y_test, y_pred)
recall_vote = recall_score(y_test, y_pred, average='macro')
f1_vote = f1_score(y_test, y_pred, average='macro')
matthews_vote = matthews_corrcoef(y_test, y_pred)

accuracy_vote, recall_vote, f1_vote, matthews_vote

(1.0, 1.0, 1.0, 1.0)
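
Both voting modes reach the same score here, but they can disagree on individual samples. The toy example below uses made-up probabilities for a single sample (hypothetical values, not output from the models above) to show how.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one sample
# (rows: classifiers, columns: classes 0, 1, 2).
probas = np.array([
    [0.45, 0.55, 0.00],
    [0.40, 0.60, 0.00],
    [0.90, 0.10, 0.00],
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes, minlength=3).argmax()

# Soft voting: average the probabilities, then pick the largest mean.
soft_winner = probas.mean(axis=0).argmax()

print(f"hard voting picks class {hard_winner}")  # two of three votes go to class 1
print(f"soft voting picks class {soft_winner}")  # the confident third model tips it to class 0
```

Soft voting lets one very confident classifier outweigh two lukewarm ones, which is why it requires base models that expose calibrated-enough probabilities.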

