
### What is Stacking?

**Stacking (Stacked Generalization)** is an ensemble learning technique that combines multiple machine learning models to improve overall performance. The fundamental idea is to leverage the strengths of different models by training a second model on their predictions, so that the combined prediction is more accurate and robust than that of any single model.

#### How Does Stacking Work?

1. **Base Models (Level-0 Models)**:
   - These are the individual models that are trained on the same dataset. Each base model may learn different aspects of the data, potentially making different types of errors.
   - Common base models include logistic regression, decision trees, random forests, support vector machines, and gradient boosting machines.

2. **Meta-Model (Level-1 Model)**:
   - The meta-model is trained to combine the predictions of the base models. It takes the outputs (predictions) of the base models as input features.
   - The meta-model learns to predict the final output based on the patterns and correlations it finds in the predictions of the base models.

3. **Training Process**:
   - The training data is split into a training set and a validation (hold-out) set, or into cross-validation folds.
   - Base models are trained on the training portion and make predictions on the held-out portion, which they have not seen.
   - These held-out predictions are used as input features to train the meta-model. Training the meta-model on predictions for rows the base models were fitted on would leak the target and overstate how reliable the base models are.
   - In the final prediction phase, the base models make predictions on new data, and these predictions are fed into the meta-model to make the final prediction.
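The training process above can be sketched with a single hold-out split. This is a minimal illustration, not the notebook's method: the synthetic data from `make_classification`, the two base models, and all variable names are assumptions chosen for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the credit-default dataset)
X, y = make_classification(n_samples=800, n_features=10, random_state=0)
X_dev, X_new, y_dev, y_new = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
# Split the development data into a part for the base models and a
# held-out part for the meta-model
X_fit, X_val, y_fit, y_val = train_test_split(
    X_dev, y_dev, test_size=0.5, stratify=y_dev, random_state=0)

# Level-0: base models are trained on the training portion only
base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]
for model in base_models:
    model.fit(X_fit, y_fit)

# Held-out predictions become the meta-model's input features
meta_val = np.column_stack(
    [m.predict_proba(X_val)[:, 1] for m in base_models])

# Level-1: the meta-model learns how to combine the base predictions
meta_model = LogisticRegression().fit(meta_val, y_val)

# Final prediction: base models predict on new data, meta-model combines
meta_new = np.column_stack(
    [m.predict_proba(X_new)[:, 1] for m in base_models])
final_pred = meta_model.predict(meta_new)

print("Meta-feature matrix shape:", meta_val.shape)
print("Accuracy on new rows:", (final_pred == y_new).mean())
```

scikit-learn's `StackingClassifier` follows the same idea but builds the meta-features from cross-validated predictions, so no rows have to be set aside permanently for the meta-model.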

#### Why Use Stacking?

1. **Improved Performance**:
   - By combining the strengths of multiple models, stacking often results in better predictive performance than any single model alone.
   - Different models may capture different aspects of the data, and the meta-model can learn to weigh these appropriately.

2. **Reduced Overfitting**:
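   - Stacking can reduce overfitting when the base models overfit in different ways, because their errors partly cancel once the predictions are combined.
   - This benefit depends on training the meta-model on held-out (out-of-fold) predictions. A simple, regularized meta-model such as logistic regression also keeps the second level from overfitting in turn.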

3. **Flexibility**:
   - Stacking allows the use of a wide variety of base models, including both linear and non-linear models.
   - It can be easily extended to include more complex meta-models, such as neural networks.


The approach below breaks the problem down by metric: each candidate model is tuned for the metric it handles best, and the tuned models are then fed into a meta-model.

### Proposed Strategy:

1. **Identify Best Performing Models**:
   - For each metric (class 1 recall, class 1 precision, class 0 recall, class 0 precision), identify the models that have historically performed the best.
   - This might involve running initial evaluations to gather performance metrics for various models.

2. **Tune Models Individually**:
   - Tune the identified best performing models for each metric using grid search or random search.
   - Collect the best parameters for each model based on their performance on the respective metrics.

3. **Combine Models into Meta-Model**:
   - Use the tuned models as inputs to a meta-model (e.g., stacking classifier).
   - Ensure that the meta-model leverages the strengths of each individual model.

4. **Evaluate and Adjust**:
   - Evaluate the performance of the meta-model.
   - If needed, make adjustments to the models or the meta-model based on performance.

### Steps in Detail:

1. **Initial Evaluation**:
   - Evaluate a set of candidate models (e.g., Logistic Regression, Random Forest, LightGBM, HistGradientBoosting) for each metric.
   - Record their performance for class 1 recall, class 1 precision, class 0 recall, and class 0 precision.

2. **Tune the Best Models**:
   - For the models identified as best for each metric, perform hyperparameter tuning.
   - Use grid search or random search to find the best parameters that maximize the respective metrics.

3. **Create Meta-Model**:
   - Use the tuned models in a stacking classifier or a weighted voting classifier.
   - The meta-learner in a stacking classifier can be a simple model (e.g., Logistic Regression) to combine the strengths of the base models.



### Initial Evaluation

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score
from loan_data_utils import load_and_preprocess_data

# Load and preprocess data. load_and_preprocess_data comes from loan_data_utils.py,
# which is written by the last cell of this notebook.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
categorical_columns = ['sex', 'education', 'marriage']
target = 'default_payment_next_month'

X, y = load_and_preprocess_data(url, categorical_columns, target)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Identify numeric and categorical columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['category']).columns.tolist()

# Define the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(drop='first'))
        ]), categorical_features)
    ])

# Define candidate models
candidate_models = {
    'LogReg': LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000),
    'RF': RandomForestClassifier(random_state=42, class_weight='balanced'),
    'LGBM': LGBMClassifier(random_state=42, class_weight='balanced'),
    'HGB': HistGradientBoostingClassifier(random_state=42, class_weight='balanced')
}

# Create pipelines for each candidate model
pipelines = {name: Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
             for name, model in candidate_models.items()}

# Set threshold for classification
THRESHOLD = 0.25

# Function to evaluate models
def evaluate_models(pipelines, X_train, y_train, X_test, y_test, threshold=0.25):
    results = []
    for name, pipeline in pipelines.items():
        pipeline.fit(X_train, y_train)
        y_proba = pipeline.predict_proba(X_test)[:, 1]
        y_pred = (y_proba >= threshold).astype(int)
        recall_1 = recall_score(y_test, y_pred, pos_label=1)
        precision_1 = precision_score(y_test, y_pred, pos_label=1, zero_division=0)
        recall_0 = recall_score(y_test, y_pred, pos_label=0)
        precision_0 = precision_score(y_test, y_pred, pos_label=0, zero_division=0)
        results.append({
            'Model': name,
            'Recall Class 1': recall_1,
            'Precision Class 1': precision_1,
            'Recall Class 0': recall_0,
            'Precision Class 0': precision_0
        })
    return pd.DataFrame(results)

# Evaluate candidate models
evaluation_results = evaluate_models(pipelines, X_train, y_train, X_test, y_test, threshold=THRESHOLD)
evaluation_results


[LightGBM] [Info] Number of positive: 5309, number of negative: 18691
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001846 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3276
[LightGBM] [Info] Number of data points in the train set: 24000, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000


| | Model | Recall Class 1 | Precision Class 1 | Recall Class 0 | Precision Class 0 |
|---|---|---|---|---|---|
| 0 | LogReg | 0.925396 | 0.238678 | 0.161780 | 0.884211 |
| 1 | RF | 0.599849 | 0.461182 | 0.800984 | 0.875760 |
| 2 | LGBM | 0.887717 | 0.288160 | 0.377274 | 0.922071 |
| 3 | HGB | 0.915599 | 0.280341 | 0.332549 | 0.932773 |
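The "best model per metric" step of the strategy can be read directly off these results. The sketch below re-enters the measured values by hand (the name `results` is an illustrative stand-in for `evaluation_results`) and picks the top model in each column with `idxmax`.

```python
import pandas as pd

# Measured values copied from the evaluation output above
results = pd.DataFrame({
    "Model": ["LogReg", "RF", "LGBM", "HGB"],
    "Recall Class 1": [0.925396, 0.599849, 0.887717, 0.915599],
    "Precision Class 1": [0.238678, 0.461182, 0.288160, 0.280341],
    "Recall Class 0": [0.161780, 0.800984, 0.377274, 0.332549],
    "Precision Class 0": [0.884211, 0.875760, 0.922071, 0.932773],
})

# For each metric column, look up the model in the row with the highest value
metric_columns = [c for c in results.columns if c != "Model"]
best_per_metric = {
    c: results.loc[results[c].idxmax(), "Model"] for c in metric_columns
}

for metric, model in best_per_metric.items():
    print(f"{metric}: {model}")
```

At the 0.25 threshold, LogReg leads on class 1 recall, RF on class 1 precision and class 0 recall, and HGB on class 0 precision.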


### Select & Tune Models

In [2]:
import pandas as pd
from sklearn.metrics import make_scorer, recall_score, precision_score
from sklearn.model_selection import GridSearchCV
import joblib
import json

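# Custom scorers for recall and precision for class 0 and class 1.
# These score the output of predict(), i.e. the default 0.5 cut-off,
# not the 0.25 THRESHOLD used in the initial evaluation.
scorers = {
    'recall_class_1': make_scorer(recall_score, pos_label=1),
    'precision_class_1': make_scorer(precision_score, pos_label=1, zero_division=0),
    'recall_class_0': make_scorer(recall_score, pos_label=0),
    'precision_class_0': make_scorer(precision_score, pos_label=0, zero_division=0)
}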

def tune_and_save_models(pipelines, param_grids, X_train, y_train, evaluation_results, scorers, models_file, params_file):
    best_models = {}
    best_params = {}

    for metric, scorer in scorers.items():
        # Metric names look like 'recall_class_1'; map them to result columns like 'Recall Class 1'
        kind, _, class_num = metric.split('_')
        column = f'{kind.capitalize()} Class {class_num}'
        # Pick the model with the highest score for this metric
        model_name = evaluation_results.loc[evaluation_results[column].idxmax(), 'Model']

        tuned_model = tune_model(pipelines[model_name], param_grids[model_name], X_train, y_train, scoring=scorer)

        best_models[metric] = tuned_model.best_estimator_
        best_params[metric] = tuned_model.best_params_

    joblib.dump(best_models, models_file)

    with open(params_file, 'w') as json_file:
        json.dump(best_params, json_file, indent=4)

    return best_models, best_params

# Function to perform grid search for a given model
def tune_model(pipeline, param_grid, X_train, y_train, scoring):
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring=scoring)
    grid_search.fit(X_train, y_train)
    return grid_search

# Define parameter grids for the selected models
param_grids = {
    'LogReg': {'classifier__C': [0.1, 1, 10]},
    'RF': {'classifier__n_estimators': [100, 200], 'classifier__max_depth': [10, 20]},
    'LGBM': {'classifier__n_estimators': [100, 200], 'classifier__learning_rate': [0.01, 0.1]},
    'HGB': {'classifier__max_iter': [100, 200], 'classifier__learning_rate': [0.01, 0.1]}
}

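# NOTE: this placeholder overwrites the evaluation_results computed in the initial evaluation,
# so the model selection (and the recorded output below) follows these example values.
# Delete this block to select models from the measured results instead.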
evaluation_results = pd.DataFrame({
    'Model': ['LogReg', 'RF', 'LGBM', 'HGB'],
    'Recall Class 1': [0.8, 0.82, 0.85, 0.83],
    'Precision Class 1': [0.75, 0.78, 0.81, 0.79],
    'Recall Class 0': [0.9, 0.88, 0.89, 0.87],
    'Precision Class 0': [0.85, 0.86, 0.84, 0.83]
})


# Tune and save models
models_file = 'best_models.pkl'
params_file = 'best_params.json'

best_models, best_params = tune_and_save_models(pipelines, param_grids, X_train, y_train, evaluation_results, scorers, models_file, params_file)

# Print the best parameters
for metric, params in best_params.items():
    print(f"Best parameters for {metric}: {params}")


[LightGBM] [Info] Number of positive: 4248, number of negative: 14952
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006647 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3276
[LightGBM] [Info] Number of data points in the train set: 19200, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
[LightGBM] [Info] Number of positive: 4247, number of negative: 14953
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002626 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3269
[LightGBM] [Info] Number of data points in the train set: 19200, number of used features: 30
[LightGBM] [Info] [bin

In [3]:
# Print the best parameters
for metric, params in best_params.items():
    print(f"Best parameters for {metric}: {params}")

Best parameters for recall_class_1: {'classifier__learning_rate': 0.01, 'classifier__n_estimators': 200}
Best parameters for precision_class_1: {'classifier__learning_rate': 0.1, 'classifier__n_estimators': 200}
Best parameters for recall_class_0: {'classifier__C': 0.1}
Best parameters for precision_class_0: {'classifier__max_depth': 10, 'classifier__n_estimators': 200}
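The initial evaluation applies a 0.25 cut-off to `predict_proba`, while scorers built with `make_scorer(recall_score, ...)` score the output of `predict()`, which is effectively a 0.5 cut-off for binary classifiers like these. One way to tune against the same cut-off might be a callable scorer, which scikit-learn accepts with the signature `scorer(estimator, X, y)`. The sketch below uses synthetic data and a plain logistic regression as stand-ins, and `recall_class_1_at_threshold` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV

THRESHOLD = 0.25  # same cut-off as in the initial evaluation

def recall_class_1_at_threshold(estimator, X, y):
    """Recall for class 1 after applying THRESHOLD to the class 1 probability."""
    proba = estimator.predict_proba(X)[:, 1]
    y_pred = (proba >= THRESHOLD).astype(int)
    return recall_score(y, y_pred, pos_label=1)

# Synthetic, imbalanced stand-in data (an assumption; not the credit-default dataset)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
    scoring=recall_class_1_at_threshold,  # callable scorer instead of make_scorer
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

With a pipeline, the same scorer would work unchanged, since it only relies on `predict_proba`.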


### Create Meta-Model - Voting Classifier
A stacking classifier is an ensemble method that uses a meta-learner to combine the predictions of multiple base models: the base models are trained on the training data, and their predictions become the input features from which the meta-learner learns the best combination. The cell below does not train a meta-learner. It uses scikit-learn's `VotingClassifier` with `voting='soft'`, which averages the predicted class probabilities of the four tuned models with equal weights. The 0.25 threshold is then applied to the averaged probability.

In [4]:
import joblib
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import recall_score, precision_score, f1_score, classification_report

# Load the best models
best_models = joblib.load('best_models.pkl')

# Create a list of (name, model) tuples for the VotingClassifier
estimators = [
    ('recall_class_1', best_models['recall_class_1']),
    ('precision_class_1', best_models['precision_class_1']),
    ('recall_class_0', best_models['recall_class_0']),
    ('precision_class_0', best_models['precision_class_0'])
]

# Initialize the VotingClassifier with the best models
voting_clf = VotingClassifier(estimators=estimators, voting='soft')

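# Fit the VotingClassifier on the training data.
# VotingClassifier clones and refits each estimator, so the tuned pipelines are trained again here.
voting_clf.fit(X_train, y_train)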

# Set threshold for classification
THRESHOLD = 0.25

# Predict probabilities
y_proba_voting = voting_clf.predict_proba(X_test)[:, 1]

# Apply the threshold to get the final predictions
y_pred_voting = (y_proba_voting >= THRESHOLD).astype(int)

# Evaluate the performance of the VotingClassifier
recall_1 = recall_score(y_test, y_pred_voting, pos_label=1)
precision_1 = precision_score(y_test, y_pred_voting, pos_label=1, zero_division=0)
recall_0 = recall_score(y_test, y_pred_voting, pos_label=0)
precision_0 = precision_score(y_test, y_pred_voting, pos_label=0, zero_division=0)
f1_macro = f1_score(y_test, y_pred_voting, average='macro')

# Print the evaluation metrics
print(f'Recall Class 1: {recall_1:.4f}')
print(f'Precision Class 1: {precision_1:.4f}')
print(f'Recall Class 0: {recall_0:.4f}')
print(f'Precision Class 0: {precision_0:.4f}')
print(f'F1 Macro: {f1_macro:.4f}')

# Print the classification report
print(classification_report(y_test, y_pred_voting))

# Save the final VotingClassifier model
joblib.dump(voting_clf, 'voting_classifier.pkl')


[LightGBM] [Info] Number of positive: 5309, number of negative: 18691
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003467 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3276
[LightGBM] [Info] Number of data points in the train set: 24000, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
[LightGBM] [Info] Number of positive: 5309, number of negative: 18691
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001807 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3276
[LightGBM] [Info] Number of data points in the train set: 24000, number of used features: 30
[LightGBM] [Info] [bin

['voting_classifier.pkl']
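The cell above combines the tuned models with `VotingClassifier`, which averages probabilities rather than learning how to combine them. For comparison, a learned meta-learner could be sketched with scikit-learn's `StackingClassifier`. This is an illustration under assumptions, not the notebook's pipeline: the data is synthetic and the two base models stand in for the tuned pipelines.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption; not the credit-default dataset)
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.78, 0.22], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Base (level-0) models; in the notebook these would be the tuned pipelines
estimators = [
    ("logreg", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ("rf", RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0)),
]

# The meta-learner (level-1) is trained on cross-validated base predictions
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_tr, y_tr)

# Apply the same style of probability cut-off as elsewhere in the notebook
THRESHOLD = 0.25
y_proba = stack.predict_proba(X_te)[:, 1]
y_pred = (y_proba >= THRESHOLD).astype(int)
print("Predicted positives:", int(y_pred.sum()), "of", len(y_pred))
```

Because the meta-learner rescales the base probabilities, a threshold chosen for the voting ensemble would likely need to be re-selected for a stacked model.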

### Custom Aggregation Rules
Custom aggregation rules involve combining the predictions of multiple models using predefined rules or heuristics. Instead of simply averaging the predictions, custom aggregation can apply different weights or thresholds to the models' predictions based on their performance. This allows for more control over how the final prediction is made, ensuring that the combined model leverages the strengths of individual models more effectively.

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score, classification_report

# NOTE: the original cell referenced `pipelines_2` and `LOW_THRESHOLD`, which are not
# defined in this notebook. This version assumes the pipelines fitted in the initial
# evaluation (`pipelines`) and the 0.25 cut-off (`THRESHOLD`) are the intended stand-ins.

# Predict class 1 probabilities for all models
probs_logreg = pipelines['LogReg'].predict_proba(X_test)[:, 1]
probs_rf = pipelines['RF'].predict_proba(X_test)[:, 1]
probs_lgbm = pipelines['LGBM'].predict_proba(X_test)[:, 1]
probs_hgb = pipelines['HGB'].predict_proba(X_test)[:, 1]

# Combine probabilities using custom rules: RF and LGBM count double,
# and the sum is divided by the total weight (1 + 2 + 2 + 1 = 6)
combined_probs = (probs_logreg + 2 * probs_rf + 2 * probs_lgbm + probs_hgb) / 6

# Apply threshold
final_y_pred_2 = (combined_probs >= THRESHOLD).astype(int)

# Evaluate the custom aggregation approach
recall = recall_score(y_test, final_y_pred_2, pos_label=1)
precision = precision_score(y_test, final_y_pred_2, pos_label=1, zero_division=0)
f1 = f1_score(y_test, final_y_pred_2, pos_label=1)

print('Custom Aggregation Performance:')
print(f'Recall: {recall}')
print(f'Precision: {precision}')
print(f'F1 Score: {f1}')

# Classification report for detailed evaluation
print(classification_report(y_test, final_y_pred_2))
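The weighted combination in the cell above (weights 1, 2, 2, 1, divided by their sum of 6) can be checked on toy numbers. The probabilities below are made up for illustration and are not model output. They are chosen so that the weighted rule and a plain average disagree on the second sample.

```python
import numpy as np

# Made-up class 1 probabilities for three samples (illustrative only)
probs = {
    "logreg": np.array([0.10, 0.40, 0.80]),
    "rf":     np.array([0.20, 0.10, 0.90]),
    "lgbm":   np.array([0.30, 0.25, 0.70]),
    "hgb":    np.array([0.20, 0.35, 0.60]),
}
weights = {"logreg": 1, "rf": 2, "lgbm": 2, "hgb": 1}

# Stack the models row-wise, then take a weighted average over the model axis
stacked = np.vstack([probs[name] for name in probs])
w = np.array([weights[name] for name in probs])
combined = np.average(stacked, axis=0, weights=w)

# The plain (unweighted) mean, for comparison
simple = stacked.mean(axis=0)

THRESHOLD = 0.25
pred_weighted = (combined >= THRESHOLD).astype(int)
pred_simple = (simple >= THRESHOLD).astype(int)

print("Weighted:", combined.round(4), pred_weighted)
print("Simple:  ", simple.round(4), pred_simple)
```

Here the second sample is flagged by the plain average (0.275) but not by the weighted rule (about 0.242), because the double-weighted RF and LGBM models both score it low.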


#### Write Loan Data Utils Script

In [1]:
script_content=r'''
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def load_data_from_url(url):
    try:
        df = pd.read_excel(url, header=1)
        logging.info("Data loaded successfully from URL.")
    except Exception as e:
        logging.error(f"Error loading data from URL: {e}")
        return None
    return df

def clean_column_names(df):
    df.columns = [col.lower().replace(' ', '_') for col in df.columns]
    return df

def remove_id_column(df):
    if 'id' in df.columns:
        df = df.drop(columns=['id'])
    return df

def rename_columns(df):
    rename_dict = {'pay_0': 'pay_1'}
    df = df.rename(columns=rename_dict)
    return df

def convert_categorical(df, categorical_columns):
    df[categorical_columns] = df[categorical_columns].astype('category')
    return df

def split_features_target(df, target):
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

def load_and_preprocess_data(url, categorical_columns, target):
    df = load_data_from_url(url)
    if df is not None:
        df = clean_column_names(df)
        df = remove_id_column(df)
        df = rename_columns(df)
        df = convert_categorical(df, categorical_columns)
        X, y = split_features_target(df, target)
        return X, y
    return None, None

def plot_class_distribution(y_train, target_name):
    plt.figure(figsize=(8, 5))
    sns.countplot(x=y_train, hue=y_train, palette='mako')
    plt.title(f'Class Distribution in Training Set: {target_name}')
    plt.xlabel('Class')
    plt.ylabel('Count')
    plt.legend([], [], frameon=False)

    # Calculate the percentage for each class
    total = len(y_train)
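    # Sort by class label so each percentage lines up with its bar (value_counts sorts by count)
    class_counts = y_train.value_counts().sort_index()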
    for i, count in enumerate(class_counts):
        percentage = 100 * count / total
        plt.text(i, count, f'{percentage:.1f}%', ha='center', va='bottom')

    plt.show()


'''

# Write the script to a file
with open("loan_data_utils.py", "w") as file:
    file.write(script_content)

print("Script successfully written to loan_data_utils.py")
# Reload script to make functions available for use
import importlib
import loan_data_utils
importlib.reload(loan_data_utils)

from loan_data_utils import *


Script successfully written to loan_data_utils.py
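The column-cleaning steps in the script can be checked on a toy frame. The column names below are assumptions modeled on the spreadsheet's header, and the steps are inlined rather than imported so the snippet runs without `loan_data_utils.py`.

```python
import pandas as pd

# Toy frame with header names assumed to resemble the raw spreadsheet
raw = pd.DataFrame({
    "ID": [1, 2],
    "SEX": [2, 1],
    "PAY_0": [0, -1],
    "default payment next month": [1, 0],
})

df = raw.copy()
# clean_column_names: lower-case and replace spaces with underscores
df.columns = [col.lower().replace(" ", "_") for col in df.columns]
# remove_id_column and rename_columns
df = df.drop(columns=["id"]).rename(columns={"pay_0": "pay_1"})
# convert_categorical
df["sex"] = df["sex"].astype("category")
# split_features_target
target = "default_payment_next_month"
X, y = df.drop(columns=[target]), df[target]

print(list(X.columns), y.name)
```

This also shows why the notebook passes `target = 'default_payment_next_month'`: the name only exists after the spaces in the raw header are replaced with underscores.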
