# Ensemble Learning 

Ensemble machine learning methods combine predictions from multiple models to improve overall performance and generalization. In this lab we will use the Lending Club dataset to explore various ensemble techniques for classification tasks, using the [PyCaret](https://pycaret.org/) library.

PyCaret is an open-source, low-code machine learning library in Python that simplifies the end-to-end machine learning process. It is designed to minimize the amount of code and manual effort required for common tasks in machine learning, making it accessible to both beginners and experienced data scientists. Let's create a simple PyCaret workflow for a classification problem using Lending Club data.

Start by installing PyCaret and loading the data.

In [1]:
#!pip install pycaret
from pycaret.classification import *

In [2]:
import pandas as pd
# Load the Excel workbook into a pandas DataFrame
# (read_excel reads the first sheet by default)
data = pd.read_excel('loan_data.xlsx')


In [3]:
# Setup PyCaret
clf_setup = setup(data, target='not.fully.paid', train_size=0.8, session_id=42)
# Compare models
best_model = compare_models()

,Description,Value
0,Session id,42
1,Target,not.fully.paid
2,Target type,Binary
3,Original data shape,"(9578, 14)"
4,Transformed data shape,"(9578, 20)"
5,Transformed train set shape,"(7662, 20)"
6,Transformed test set shape,"(1916, 20)"
7,Numeric features,12
8,Categorical features,1
9,Preprocess,True


,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.8401,0.6752,0.0326,0.5015,0.0608,0.0425,0.0938,0.204
dummy,Dummy Classifier,0.84,0.5,0.0,0.0,0.0,0.0,0.0,0.01
lr,Logistic Regression,0.8397,0.6226,0.0065,0.3517,0.0127,0.0082,0.0336,0.289
ridge,Ridge Classifier,0.8396,0.0,0.0049,0.4,0.0096,0.0058,0.0298,0.01
rf,Random Forest Classifier,0.8386,0.6461,0.0228,0.4028,0.043,0.027,0.0633,0.131
et,Extra Trees Classifier,0.8384,0.6355,0.0432,0.4628,0.0782,0.052,0.0993,0.068
lda,Linear Discriminant Analysis,0.8382,0.678,0.0416,0.4386,0.0756,0.0496,0.0937,0.013
catboost,CatBoost Classifier,0.838,0.6611,0.0416,0.4539,0.0754,0.0493,0.0951,1.529
ada,Ada Boost Classifier,0.8371,0.6617,0.031,0.3838,0.0571,0.0344,0.0695,0.055
lightgbm,Light Gradient Boosting Machine,0.8361,0.6474,0.0506,0.4113,0.0891,0.0559,0.0952,0.158


The `compare_models()` function trains the models available in PyCaret's model library on the provided dataset. It automatically performs cross-validation and evaluates each model on a set of predefined metrics (accuracy, AUC, recall, precision, F1, and others). The results are ranked by accuracy by default, and the function returns the top-ranked model, so the `best_model` variable holds a reference to it.
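The Dummy Classifier in the leaderboard reaches 0.84 accuracy without learning anything, which suggests that roughly 84% of the loans belong to the majority class. The sketch below uses synthetic labels (not the Lending Club data) to show why accuracy alone can mislead in that situation; the helper `basic_metrics` and both prediction lists are made up for illustration.

```python
def basic_metrics(y_true, y_pred):
    """Return accuracy, recall and precision for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Guard against division by zero when nothing is (or should be) positive.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }

# Synthetic labels: 160 positives (16%) followed by 840 negatives.
y_true = [1] * 160 + [0] * 840

# A "dummy" model that always predicts the majority class.
always_negative = [0] * 1000

# A cautious model: finds 5 of the 160 positives and raises 5 false alarms.
cautious = [1] * 5 + [0] * 155 + [1] * 5 + [0] * 835

dummy_scores = basic_metrics(y_true, always_negative)
cautious_scores = basic_metrics(y_true, cautious)
print(dummy_scores)
print(cautious_scores)
```

Both sets of predictions have the same accuracy, yet only the second one identifies any positives. This is why the Recall, F1 and AUC columns deserve as much attention as Accuracy when reading the leaderboard.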

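So which model is best? The Gradient Boosting Classifier (`gbc`) ranks first by accuracy, although only marginally ahead of the Dummy Classifier (0.8401 vs. 0.84), and its recall is very low (0.0326). Let's take it as our base model and try to tune it for a better fit.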

In [4]:
# Create a Model (choose the best performing one from compare_models)
gbc_model = create_model('gbc')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8462,0.6632,0.0569,0.7778,0.1061,0.0861,0.1833
1,0.8396,0.7068,0.0569,0.5,0.1022,0.0718,0.1262
2,0.8473,0.6903,0.0492,0.8571,0.093,0.0771,0.1832
3,0.8381,0.6417,0.0246,0.375,0.0462,0.0271,0.0606
4,0.8355,0.6355,0.0246,0.3,0.0455,0.0219,0.0442
5,0.842,0.6724,0.041,0.5556,0.0763,0.0557,0.1181
6,0.8368,0.6977,0.0081,0.25,0.0157,0.0057,0.0176
7,0.8368,0.6633,0.0325,0.4,0.0602,0.0369,0.075
8,0.8394,0.6717,0.0244,0.5,0.0465,0.0321,0.0821
9,0.8394,0.7095,0.0081,0.5,0.016,0.0109,0.0473


The `create_model()` function trains a single model, identified by its ID (here `'gbc'`), and evaluates it with cross-validation. The table above shows one row of metrics per fold; the number of folds (10 by default) can be changed with the `fold` parameter.
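To make the per-fold table concrete, here is a minimal sketch of plain k-fold cross-validation in pure Python. It is an illustration only: PyCaret handles the splitting internally (to our understanding, with stratified folds for classification), which this sketch does not reproduce. The function `k_fold_indices` and the toy majority-class "model" are hypothetical, and the labels are synthetic.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, non-overlapping folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def majority_class(labels):
    """A trivial 'model': predict whichever class is more frequent in training."""
    return 1 if sum(labels) * 2 > len(labels) else 0

# Synthetic labels: every 5th sample is positive (20% positives).
y = [1 if i % 5 == 0 else 0 for i in range(100)]

fold_accuracies = []
for train_idx, val_idx in k_fold_indices(len(y), k=10):
    prediction = majority_class([y[i] for i in train_idx])
    correct = sum(1 for i in val_idx if y[i] == prediction)
    fold_accuracies.append(correct / len(val_idx))

mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracies)
print(mean_accuracy)
```

Each sample is used for validation exactly once, and the reported score is the average over folds, which is less sensitive to a single lucky or unlucky split.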

The function returns the trained model object (for most models, a scikit-learn compatible estimator); the per-fold metrics are displayed as a table rather than stored in the returned object. Let's tune the model based on this output.

In [5]:
# Tune the Model
gbc_model_tuned = tune_model(gbc_model)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8396,0.6649,0.0,0.0,0.0,0.0,0.0
1,0.8396,0.7172,0.0,0.0,0.0,0.0,0.0
2,0.8407,0.6788,0.0,0.0,0.0,0.0,0.0
3,0.8407,0.6438,0.0,0.0,0.0,0.0,0.0
4,0.8407,0.6307,0.0,0.0,0.0,0.0,0.0
5,0.8407,0.6663,0.0,0.0,0.0,0.0,0.0
6,0.8394,0.7085,0.0,0.0,0.0,0.0,0.0
7,0.8394,0.6562,0.0,0.0,0.0,0.0,0.0
8,0.8394,0.6682,0.0,0.0,0.0,0.0,0.0
9,0.8394,0.696,0.0,0.0,0.0,0.0,0.0


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
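
The log above shows that tuning did not help: every fold of the tuned candidate has zero recall, and PyCaret returned the original model instead, so `gbc_model_tuned` is effectively the untuned `gbc_model`. One thing to try is a different optimization target and a larger search. The sketch below assumes an experiment has already been initialized with `setup()`; `tune_for_f1` is a hypothetical helper name, while `optimize` and `n_iter` are parameters of `tune_model()`. Whether this improves recall on the Lending Club data has not been tested here.

```python
def tune_for_f1(model, n_iter=50):
    """Tune `model`, ranking hyperparameter candidates by F1 instead of accuracy.

    Assumes pycaret is installed and setup() has already been run in this session.
    """
    # Imported lazily so that defining this helper does not itself require pycaret.
    from pycaret.classification import tune_model

    # n_iter controls how many candidate configurations are tried
    # (the log above shows 10 candidates being fitted).
    return tune_model(model, optimize='F1', n_iter=n_iter)

# Usage inside the notebook, after create_model('gbc'):
# gbc_model_f1 = tune_for_f1(gbc_model)
```

Optimizing for F1 penalizes candidates that never predict the positive class, which accuracy does not do on imbalanced data.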


In [6]:
# Plot an assessment metric if desired. Look into the function's documentation to learn more.
#plot_model(gbc_model_tuned, 'confusion_matrix')


You can use this trained model object for various purposes, such as making predictions on new data, further fine-tuning, or incorporating it into an ensemble. The example below passes the original dataset to `predict_model()` for illustration; because most of those rows were used for training, the reported metrics are optimistic:

In [7]:
# Make predictions (here on the full dataset; in practice, pass unseen rows as `data`)
predictions = predict_model(gbc_model_tuned, data=data)
# Open the interactive evaluation widget for the tuned model
evaluate_model(gbc_model_tuned)

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,0.8496,0.7504,0.0705,0.871,0.1304,0.109,0.2221



In [9]:
# Save the model together with its transformation pipeline; load it back with load_model
save_model(gbc_model_tuned, 'gbc_model_tuned')
#loaded_model = load_model('gbc_model_tuned')

# Deploy the model. Running this requires an Amazon AWS account (which costs money), so it is not recommended here. Keep the option in the back of your mind.
#deploy_model(gbc_model_tuned, model_name='gbc_model_tuned')

Transformation Pipeline and Model Successfully Saved
Transformation Pipeline and Model Successfully Loaded


### Ensemble Model

Once you've identified a base model, you can also build an ensemble from it as follows:

In [10]:
# Create an ensemble model using bagging with the best-performing model
bagged_model = ensemble_model(best_model, method='Bagging')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8422,0.6792,0.0325,0.6667,0.062,0.0478,0.1225
1,0.8318,0.704,0.0163,0.2,0.0301,0.0061,0.0124
2,0.846,0.6862,0.041,0.8333,0.0781,0.0642,0.1637
3,0.842,0.6467,0.0246,0.6,0.0472,0.0351,0.0976
4,0.8368,0.6394,0.0,0.0,0.0,-0.0077,-0.0273
5,0.8394,0.6574,0.0164,0.4,0.0315,0.0192,0.0533
6,0.8368,0.6943,0.0081,0.25,0.0157,0.0057,0.0176
7,0.8407,0.6604,0.0163,0.6667,0.0317,0.0243,0.0864
8,0.8394,0.67,0.0163,0.5,0.0315,0.0216,0.067
9,0.8407,0.721,0.0081,1.0,0.0161,0.0136,0.0827


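PyCaret supports several ensemble methods. The `method` parameter of `ensemble_model()` selects between `'Bagging'` and `'Boosting'`; stacking and blending are available through the separate `stack_models()` and `blend_models()` functions.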

Bagging is an ensemble technique that aims to improve the stability and accuracy of a model by combining the predictions of multiple instances of the same model, each trained on a different subset of the training data. Here's how you can create a bagged ensemble in PyCaret:
* Choose the best model: evaluate and compare different models. Identify the best-performing model as we did before. 
* Create an ensemble: train multiple instances of the best model, each on a different bootstrap sample (a random sample drawn with replacement from the original training data), and combine their predictions. Bagging helps reduce overfitting and variance.
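
The two steps above can be sketched from scratch. This is a minimal illustration of bagging, assuming a one-feature toy dataset with two deliberately mislabeled points and decision stumps as base learners. It is not how PyCaret or scikit-learn implement bagging, and all names (`fit_stump`, `bagged_stumps`, `predict_bagged`) are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the threshold with the fewest training errors (predict 1 when x > threshold)."""
    best_threshold, best_errors = None, len(ys) + 1
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_stumps(xs, ys, n_estimators=25, seed=42):
    """Fit one stump per bootstrap sample and return the list of thresholds."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def predict_bagged(thresholds, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy data: label is 1 for x >= 10, with two noisy (flipped) labels.
xs = list(range(20))
ys = [0] * 10 + [1] * 10
ys[3] = 1
ys[15] = 0

thresholds = bagged_stumps(xs, ys)
print(sorted(thresholds))
print([predict_bagged(thresholds, x) for x in xs])
```

Individual stumps disagree about the exact threshold because each one sees a different bootstrap sample; the vote averages out those differences, which is the variance reduction that bagging aims for.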


In our case, accuracy gains seem modest, but bagging can be a valuable tool in other contexts. Let's tune and evaluate the bagged model as before.

In [11]:
tuned_bagged_model = tune_model(bagged_model)
evaluate_model(tuned_bagged_model)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8475,0.6656,0.0488,1.0,0.093,0.0793,0.2032
1,0.8396,0.7033,0.0325,0.5,0.0611,0.0423,0.095
2,0.8446,0.6849,0.0328,0.8,0.063,0.0511,0.1419
3,0.8381,0.6472,0.0246,0.375,0.0462,0.0271,0.0606
4,0.8381,0.6377,0.0164,0.3333,0.0312,0.0166,0.0423
5,0.8446,0.6679,0.0328,0.8,0.063,0.0511,0.1419
6,0.8368,0.7092,0.0081,0.25,0.0157,0.0057,0.0176
7,0.842,0.6622,0.0244,0.75,0.0472,0.0375,0.1163
8,0.8394,0.6724,0.0244,0.5,0.0465,0.0321,0.0821
9,0.8394,0.7074,0.0081,0.5,0.016,0.0109,0.0473


Fitting 10 folds for each of 10 candidates, totalling 100 fits



### Exercise 

Change the ensemble method and compare it with Bagging. 