<a href="https://colab.research.google.com/github/micah-shull/pipelines/blob/main/pipelines_16_ensemble_03_resampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import sklearn
print(sklearn.__version__)

1.3.2


### Load & Preprocess Data

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import recall_score, precision_score, f1_score, classification_report, make_scorer
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from loan_data_utils import load_and_preprocess_data
import joblib
import json
# Suppress warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Load and preprocess data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
categorical_columns = ['sex', 'education', 'marriage']
target = 'default_payment_next_month'

# Assuming the `load_and_preprocess_data` function is defined elsewhere
X, y = load_and_preprocess_data(url, categorical_columns, target)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Identify numeric and categorical columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['category']).columns.tolist()

# Define the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(drop='first'))
        ]), categorical_features)
    ])

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



### Best Models

In [13]:
# Load the best models information from the JSON file
with open('best_models.json', 'r') as json_file:
    best_models = json.load(json_file)

# Print out the contents of the JSON file
print(json.dumps(best_models, indent=4))

{
    "recall_class_1": "LGBMClassifier",
    "precision_class_1": "RandomForestClassifier",
    "recall_class_0": "RandomForestClassifier",
    "precision_class_0": "HistGradientBoostingClassifier",
    "best_params": {
        "recall_class_1": {
            "classifier__force_row_wise": true,
            "classifier__learning_rate": 0.01,
            "classifier__n_estimators": 200
        },
        "precision_class_1": {
            "classifier__max_depth": 20,
            "classifier__n_estimators": 200
        },
        "recall_class_0": {
            "classifier__max_depth": 20,
            "classifier__n_estimators": 200
        },
        "precision_class_0": {
            "classifier__learning_rate": 0.1,
            "classifier__max_iter": 100
        }
    }
}


### Define Models, Resampling Techniques, and Pipelines

In [4]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from lightgbm import LGBMClassifier
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score

# Assuming preprocessor and data loading functions are defined elsewhere

# Define candidate models
candidate_models = {
    'LGBM': LGBMClassifier(random_state=42, class_weight='balanced', force_row_wise=True),
    'RF': RandomForestClassifier(random_state=42, class_weight='balanced'),
    'HGB': HistGradientBoostingClassifier(random_state=42, class_weight='balanced')
}

# Define resampling methods
resampling_methods = {
    'SMOTE': SMOTE(random_state=42),
    'ADASYN': ADASYN(random_state=42),
    'UnderSampling': RandomUnderSampler(random_state=42)
}

# Create pipelines for each candidate model with resampling
pipelines = {}
for resampling_name, resampler in resampling_methods.items():
    for model_name, model in candidate_models.items():
        pipeline_name = f'{resampling_name}_{model_name}'
        pipelines[pipeline_name] = ImbPipeline(steps=[('preprocessor', preprocessor),
                                                      ('resampler', resampler),
                                                      ('classifier', model)])

# Function to apply class-specific thresholds
def predict_with_class_specific_thresholds(model, X, threshold_class_1, threshold_class_0):
    y_proba = model.predict_proba(X)
    y_pred = np.zeros(y_proba.shape[0])

    # Apply thresholds to obtain predictions
    y_pred[y_proba[:, 1] >= threshold_class_1] = 1  # Predict class 1 for probabilities above threshold_class_1
    y_pred[y_proba[:, 0] >= threshold_class_0] = 0  # Predict class 0 for probabilities above threshold_class_0

    return y_pred

# Function to evaluate models with multiple thresholds
def evaluate_models_with_multiple_thresholds(pipelines, X_train, y_train, X_test, y_test, thresholds_class_1, thresholds_class_0):
    results = []
    for name, pipeline in pipelines.items():
        pipeline.fit(X_train, y_train)
        for threshold_class_1 in thresholds_class_1:
            for threshold_class_0 in thresholds_class_0:
                y_pred = predict_with_class_specific_thresholds(pipeline, X_test, threshold_class_1, threshold_class_0)
                recall_1 = recall_score(y_test, y_pred, pos_label=1)
                precision_1 = precision_score(y_test, y_pred, pos_label=1, zero_division=0)
                recall_0 = recall_score(y_test, y_pred, pos_label=0)
                precision_0 = precision_score(y_test, y_pred, pos_label=0, zero_division=0)
                results.append({
                    'Model': name,
                    'Threshold Class 1': threshold_class_1,
                    'Threshold Class 0': threshold_class_0,
                    'Recall Class 1': recall_1,
                    'Precision Class 1': precision_1,
                    'Recall Class 0': recall_0,
                    'Precision Class 0': precision_0
                })
    return pd.DataFrame(results)

# Set thresholds for evaluation
thresholds_class_1 = np.arange(0.2, 0.5, 0.05)
thresholds_class_0 = np.arange(0.2, 0.5, 0.05)

# Assuming data is loaded and split into X_train, X_test, y_train, y_test
# Evaluate candidate models with multiple thresholds
evaluation_results_multiple_thresholds = evaluate_models_with_multiple_thresholds(pipelines, X_train, y_train, X_test, y_test, thresholds_class_1, thresholds_class_0)

# Find the best threshold combination for each model based on F1 Macro score
evaluation_results_multiple_thresholds['F1 Macro'] = 2 * (evaluation_results_multiple_thresholds['Precision Class 1'] * evaluation_results_multiple_thresholds['Recall Class 1']) / (evaluation_results_multiple_thresholds['Precision Class 1'] + evaluation_results_multiple_thresholds['Recall Class 1'])
best_thresholds = evaluation_results_multiple_thresholds.loc[evaluation_results_multiple_thresholds.groupby('Model')['F1 Macro'].idxmax()]

# Save the best threshold combinations to a JSON file
best_thresholds_dict = best_thresholds.to_dict(orient='records')

with open('best_thresholds.json', 'w') as json_file:
    json.dump(best_thresholds_dict, json_file, indent=4)

print("Best threshold combinations saved to 'best_thresholds.json'")
print("Best threshold combinations for each model:")
best_thresholds


[LightGBM] [Info] Number of positive: 18691, number of negative: 18691
[LightGBM] [Info] Total Bins 6533
[LightGBM] [Info] Number of data points in the train set: 37382, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 18146, number of negative: 18691
[LightGBM] [Info] Total Bins 6544
[LightGBM] [Info] Number of data points in the train set: 36837, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
[LightGBM] [Info] Number of positive: 5309, number of negative: 5309
[LightGBM] [Info] Total Bins 3260
[LightGBM] [Info] Number of data points in the train set: 10618, number of used features: 29
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Best threshold combinations saved to 'best_thresholds.json'
Best threshold combinations for each model:


Unnamed: 0,Model,Threshold Class 1,Threshold Class 0,Recall Class 1,Precision Class 1,Recall Class 0,Precision Class 0,F1 Macro
185,ADASYN_HGB,0.2,0.45,0.393369,0.617751,0.93088,0.843841,0.480663
113,ADASYN_LGBM,0.2,0.45,0.386586,0.646096,0.939867,0.843642,0.483734
149,ADASYN_RF,0.2,0.45,0.415976,0.580442,0.914616,0.846504,0.484636
77,SMOTE_HGB,0.2,0.45,0.400904,0.610092,0.927242,0.844969,0.483856
5,SMOTE_LGBM,0.2,0.45,0.38734,0.628362,0.934945,0.843111,0.479254
41,SMOTE_RF,0.2,0.45,0.423512,0.580579,0.913118,0.847973,0.48976
293,UnderSampling_HGB,0.2,0.45,0.581763,0.502277,0.836294,0.875644,0.539106
221,UnderSampling_LGBM,0.2,0.45,0.599096,0.482403,0.817462,0.877757,0.534454
256,UnderSampling_RF,0.2,0.4,0.531274,0.527695,0.864969,0.866638,0.529478


### Evaluating Models with Multiple Thresholds

In [5]:
# Define the range of thresholds to test
thresholds_class_1 = np.arange(0.2, 0.5, 0.05)
thresholds_class_0 = np.arange(0.2, 0.5, 0.05)

# Evaluate candidate models with multiple thresholds
evaluation_results_multiple_thresholds = evaluate_models_with_multiple_thresholds(pipelines, X_train, y_train, X_test, y_test, thresholds_class_1, thresholds_class_0)

# Find the best threshold combination for each model based on F1 Macro score
evaluation_results_multiple_thresholds['F1 Macro'] = 2 * (evaluation_results_multiple_thresholds['Precision Class 1'] * evaluation_results_multiple_thresholds['Recall Class 1']) / (evaluation_results_multiple_thresholds['Precision Class 1'] + evaluation_results_multiple_thresholds['Recall Class 1'])
best_thresholds = evaluation_results_multiple_thresholds.loc[evaluation_results_multiple_thresholds.groupby('Model')['F1 Macro'].idxmax()]

# Save the best threshold combinations to a JSON file
best_thresholds_dict = best_thresholds.to_dict(orient='records')
with open('best_thresholds.json', 'w') as json_file:
    json.dump(best_thresholds_dict, json_file, indent=4)

print("Best threshold combinations saved to 'best_thresholds.json'")
print("Best threshold combinations for each model:")
best_thresholds


[LightGBM] [Info] Number of positive: 18691, number of negative: 18691
[LightGBM] [Info] Total Bins 6533
[LightGBM] [Info] Number of data points in the train set: 37382, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 18146, number of negative: 18691
[LightGBM] [Info] Total Bins 6544
[LightGBM] [Info] Number of data points in the train set: 36837, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
[LightGBM] [Info] Number of positive: 5309, number of negative: 5309
[LightGBM] [Info] Total Bins 3260
[LightGBM] [Info] Number of data points in the train set: 10618, number of used features: 29
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Best threshold combinations saved to 'best_thresholds.json'
Best threshold combinations for each model:


Unnamed: 0,Model,Threshold Class 1,Threshold Class 0,Recall Class 1,Precision Class 1,Recall Class 0,Precision Class 0,F1 Macro
185,ADASYN_HGB,0.2,0.45,0.393369,0.617751,0.93088,0.843841,0.480663
113,ADASYN_LGBM,0.2,0.45,0.386586,0.646096,0.939867,0.843642,0.483734
149,ADASYN_RF,0.2,0.45,0.415976,0.580442,0.914616,0.846504,0.484636
77,SMOTE_HGB,0.2,0.45,0.400904,0.610092,0.927242,0.844969,0.483856
5,SMOTE_LGBM,0.2,0.45,0.38734,0.628362,0.934945,0.843111,0.479254
41,SMOTE_RF,0.2,0.45,0.423512,0.580579,0.913118,0.847973,0.48976
293,UnderSampling_HGB,0.2,0.45,0.581763,0.502277,0.836294,0.875644,0.539106
221,UnderSampling_LGBM,0.2,0.45,0.599096,0.482403,0.817462,0.877757,0.534454
256,UnderSampling_RF,0.2,0.4,0.531274,0.527695,0.864969,0.866638,0.529478


### Defining Custom Scorers and the Tuning Function

In [6]:
from sklearn.metrics import make_scorer, recall_score, precision_score
from sklearn.model_selection import GridSearchCV
import joblib
import json

# Custom scorers for recall and precision for class 0 and class 1
scorers = {
    'recall_class_1': make_scorer(recall_score, pos_label=1),
    'precision_class_1': make_scorer(precision_score, pos_label=1),
    'recall_class_0': make_scorer(recall_score, pos_label=0),
    'precision_class_0': make_scorer(precision_score, pos_label=0)
}

# Function to perform grid search for a given model
def tune_model(pipeline, param_grid, X_train, y_train, scoring):
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring=scoring)
    grid_search.fit(X_train, y_train)
    return grid_search

# Function to tune and save models
def tune_and_save_models(pipelines, param_grids, X_train, y_train, best_thresholds_df, scorers):
    best_models = {}
    best_params = {}

    for metric, scorer in scorers.items():
        if 'recall' in metric:
            class_num = metric.split('_')[-1]
            model_name = best_thresholds_df.loc[best_thresholds_df[f'Recall Class {class_num}'].idxmax(), 'Model']
        else:
            class_num = metric.split('_')[-1]
            model_name = best_thresholds_df.loc[best_thresholds_df[f'Precision Class {class_num}'].idxmax(), 'Model']

        tuned_model = tune_model(pipelines[model_name], param_grids[model_name.split('_')[-1]], X_train, y_train, scoring=scorer)

        best_models[metric] = tuned_model.best_estimator_
        best_params[metric] = tuned_model.best_params_

        # Save each model individually
        joblib.dump(tuned_model.best_estimator_, f'best_model_{metric}.pkl')
        print(f"Best model for {metric} saved to 'best_model_{metric}.pkl'")

    with open('best_params.json', 'w') as json_file:
        json.dump(best_params, json_file, indent=4)
    print(f"Best parameters saved to 'best_params.json'")

    return best_models, best_params


### Defining Parameter Grids for Model Tuning

In [7]:
# Define parameter grids for the selected models
param_grids = {
    # 'LogReg': {'classifier__C': [0.1, 1, 10]},
    'RF': {'classifier__n_estimators': [100, 200], 'classifier__max_depth': [10, 20]},
    'LGBM': {'classifier__n_estimators': [100, 200], 'classifier__learning_rate': [0.01, 0.1]},
    'HGB': {'classifier__max_iter': [100, 200], 'classifier__learning_rate': [0.01, 0.1]}
}

### Tuning and Saving Models (Step 1: Recall for Class 1)

In [8]:
# Load the best threshold combinations
with open('best_thresholds.json', 'r') as json_file:
    best_thresholds = json.load(json_file)
best_thresholds_df = pd.DataFrame(best_thresholds)

# File names for saving the best models and parameters
best_models_file_recall_class_1 = 'best_model_recall_class_1.pkl'
best_params_file_recall_class_1 = 'best_params_recall_class_1.json'

# Tune and save models for recall for class 1
best_models_recall_class_1 = {}
best_params_recall_class_1 = {}

# Get the best model name for recall class 1
model_name_recall_class_1 = best_thresholds_df.loc[best_thresholds_df['Recall Class 1'].idxmax(), 'Model']

# Perform tuning for recall class 1
tuned_model_recall_class_1 = tune_model(pipelines[model_name_recall_class_1], param_grids[model_name_recall_class_1.split('_')[-1]], X_train, y_train, scoring=scorers['recall_class_1'])

best_models_recall_class_1['recall_class_1'] = tuned_model_recall_class_1.best_estimator_
best_params_recall_class_1['recall_class_1'] = tuned_model_recall_class_1.best_params_

# Save the best models for recall class 1
joblib.dump(best_models_recall_class_1, best_models_file_recall_class_1)
print(f"Best models for recall class 1 saved to '{best_models_file_recall_class_1}'")

# Save the best parameters for recall class 1
with open(best_params_file_recall_class_1, 'w') as json_file:
    json.dump(best_params_recall_class_1, json_file, indent=4)
print(f"Best parameters for recall class 1 saved to '{best_params_file_recall_class_1}'")

# Print the best parameters for recall for class 1
for metric, params in best_params_recall_class_1.items():
    print(f"Best parameters for {metric}: {params}")


[LightGBM] [Info] Number of positive: 4248, number of negative: 4248
[LightGBM] [Info] Total Bins 3259
[LightGBM] [Info] Number of data points in the train set: 8496, number of used features: 29
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4247, number of negative: 4247
[LightGBM] [Info] Total Bins 3254
[LightGBM] [Info] Number of data points in the train set: 8494, number of used features: 29
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4247, number of negative: 4247
[LightGBM] [Info] Total Bins 3256
[LightGBM] [Info] Number of data points in the train set: 8494, number of used features: 29
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4247, number of negative: 4247
[LightGBM] [Info] Total Bins 3258
[LightGBM] [Info] Number of data points in the train set: 8494, number of u

### Tuning and Saving Models (Step 2: Precision for Class 1)

In [9]:
# File names for saving the best models and parameters
best_models_file_precision_class_1 = 'best_model_precision_class_1.pkl'
best_params_file_precision_class_1 = 'best_params_precision_class_1.json'

# Tune and save models for precision for class 1
best_models_precision_class_1 = {}
best_params_precision_class_1 = {}

# Get the best model name for precision class 1
model_name_precision_class_1 = best_thresholds_df.loc[best_thresholds_df['Precision Class 1'].idxmax(), 'Model']

# Perform tuning for precision class 1
tuned_model_precision_class_1 = tune_model(pipelines[model_name_precision_class_1], param_grids[model_name_precision_class_1.split('_')[-1]], X_train, y_train, scoring=scorers['precision_class_1'])

best_models_precision_class_1['precision_class_1'] = tuned_model_precision_class_1.best_estimator_
best_params_precision_class_1['precision_class_1'] = tuned_model_precision_class_1.best_params_

# Save the best models for precision class 1
joblib.dump(best_models_precision_class_1, best_models_file_precision_class_1)
print(f"Best models for precision class 1 saved to '{best_models_file_precision_class_1}'")

# Save the best parameters for precision class 1
with open(best_params_file_precision_class_1, 'w') as json_file:
    json.dump(best_params_precision_class_1, json_file, indent=4)
print(f"Best parameters for precision class 1 saved to '{best_params_file_precision_class_1}'")

# Print the best parameters for precision for class 1
for metric, params in best_params_precision_class_1.items():
    print(f"Best parameters for {metric}: {params}")


[LightGBM] [Info] Number of positive: 14566, number of negative: 14952
[LightGBM] [Info] Total Bins 6560
[LightGBM] [Info] Number of data points in the train set: 29518, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
[LightGBM] [Info] Number of positive: 14573, number of negative: 14953
[LightGBM] [Info] Total Bins 6520
[LightGBM] [Info] Number of data points in the train set: 29526, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
[LightGBM] [Info] Number of positive: 14508, number of negative: 14953
[LightGBM] [Info] Total Bins 6543
[LightGBM] [Info] Number of data points in the train set: 29461, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
[Light

### Tuning and Saving Models (Step 3: Recall for Class 0)

In [10]:
# File names for saving the best models and parameters
best_models_file_recall_class_0 = 'best_model_recall_class_0.pkl'
best_params_file_recall_class_0 = 'best_params_recall_class_0.json'

# Tune and save models for recall for class 0
best_models_recall_class_0 = {}
best_params_recall_class_0 = {}

# Get the best model name for recall class 0
model_name_recall_class_0 = best_thresholds_df.loc[best_thresholds_df['Recall Class 0'].idxmax(), 'Model']

# Perform tuning for recall class 0
tuned_model_recall_class_0 = tune_model(pipelines[model_name_recall_class_0], param_grids[model_name_recall_class_0.split('_')[-1]], X_train, y_train, scoring=scorers['recall_class_0'])

best_models_recall_class_0['recall_class_0'] = tuned_model_recall_class_0.best_estimator_
best_params_recall_class_0['recall_class_0'] = tuned_model_recall_class_0.best_params_

# Save the best models for recall class 0
joblib.dump(best_models_recall_class_0, best_models_file_recall_class_0)
print(f"Best models for recall class 0 saved to '{best_models_file_recall_class_0}'")

# Save the best parameters for recall class 0
with open(best_params_file_recall_class_0, 'w') as json_file:
    json.dump(best_params_recall_class_0, json_file, indent=4)
print(f"Best parameters for recall class 0 saved to '{best_params_file_recall_class_0}'")

# Print the best parameters for recall for class 0
for metric, params in best_params_recall_class_0.items():
    print(f"Best parameters for {metric}: {params}")


[LightGBM] [Info] Number of positive: 14566, number of negative: 14952
[LightGBM] [Info] Total Bins 6560
[LightGBM] [Info] Number of data points in the train set: 29518, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
[LightGBM] [Info] Number of positive: 14573, number of negative: 14953
[LightGBM] [Info] Total Bins 6520
[LightGBM] [Info] Number of data points in the train set: 29526, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
[LightGBM] [Info] Number of positive: 14508, number of negative: 14953
[LightGBM] [Info] Total Bins 6543
[LightGBM] [Info] Number of data points in the train set: 29461, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
[Light

### Tuning and Saving Models (Step 4: Precision for Class 0)

In [11]:
# File names for saving the best models and parameters
best_models_file_precision_class_0 = 'best_model_precision_class_0.pkl'
best_params_file_precision_class_0 = 'best_params_precision_class_0.json'

# Tune and save models for precision for class 0
best_models_precision_class_0 = {}
best_params_precision_class_0 = {}

# Get the best model name for precision class 0
model_name_precision_class_0 = best_thresholds_df.loc[best_thresholds_df['Precision Class 0'].idxmax(), 'Model']

# Perform tuning for precision class 0
tuned_model_precision_class_0 = tune_model(pipelines[model_name_precision_class_0], param_grids[model_name_precision_class_0.split('_')[-1]], X_train, y_train, scoring=scorers['precision_class_0'])

best_models_precision_class_0['precision_class_0'] = tuned_model_precision_class_0.best_estimator_
best_params_precision_class_0['precision_class_0'] = tuned_model_precision_class_0.best_params_

# Save the best models for precision class 0
joblib.dump(best_models_precision_class_0, best_models_file_precision_class_0)
print(f"Best models for precision class 0 saved to '{best_models_file_precision_class_0}'")

# Save the best parameters for precision class 0
with open(best_params_file_precision_class_0, 'w') as json_file:
    json.dump(best_params_precision_class_0, json_file, indent=4)
print(f"Best parameters for precision class 0 saved to '{best_params_file_precision_class_0}'")

# Print the best parameters for precision for class 0
for metric, params in best_params_precision_class_0.items():
    print(f"Best parameters for {metric}: {params}")



[LightGBM] [Info] Number of positive: 4248, number of negative: 4248
[LightGBM] [Info] Total Bins 3259
[LightGBM] [Info] Number of data points in the train set: 8496, number of used features: 29
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4247, number of negative: 4247
[LightGBM] [Info] Total Bins 3254
[LightGBM] [Info] Number of data points in the train set: 8494, number of used features: 29
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4247, number of negative: 4247
[LightGBM] [Info] Total Bins 3256
[LightGBM] [Info] Number of data points in the train set: 8494, number of used features: 29
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 4247, number of negative: 4247
[LightGBM] [Info] Total Bins 3258
[LightGBM] [Info] Number of data points in the train set: 8494, number of u

In [12]:
import joblib
import json
import pandas as pd

# Load the best models and parameters for class 0 recall and precision
best_models = joblib.load('best_models.pkl')
with open('best_params.json', 'r') as json_file:
    best_params = json.load(json_file)

# Load the best threshold combinations
with open('best_thresholds.json', 'r') as json_file:
    best_thresholds = json.load(json_file)

# Function to print model details
def print_model_details(metric, model, params):
    model_name = model.__class__.__name__
    print(f"{metric}: {model_name}")
    print(f"Parameters:")
    for param, value in params.items():
        print(f"  {param}: {value}")

# Print best models and parameters
print("Best Models and Parameters:\n")
for metric in ['recall_class_1', 'precision_class_1', 'recall_class_0', 'precision_class_0']:
    print_model_details(metric, best_models[metric], best_params[metric])
    print("\n")

# Print best thresholds
print("Best Thresholds:\n")
best_thresholds_df = pd.DataFrame(best_thresholds)
print(best_thresholds_df[['Model', 'Threshold Class 1', 'Threshold Class 0']])


FileNotFoundError: [Errno 2] No such file or directory: 'best_models.pkl'

### Create and Evaluate Voting Classifier

In [15]:
import joblib
import json
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the best models and parameters
best_model_recall_class_1 = joblib.load('best_model_recall_class_1.pkl')
best_model_precision_class_1 = joblib.load('best_model_precision_class_1.pkl')
best_model_recall_class_0 = joblib.load('best_model_recall_class_0.pkl')
best_model_precision_class_0 = joblib.load('best_model_precision_class_0.pkl')

# Initialize the VotingClassifier with all four models
voting_clf_optimized = VotingClassifier(estimators=[
    ('recall_class_1', best_model_recall_class_1),
    ('precision_class_1', best_model_precision_class_1),
    ('recall_class_0', best_model_recall_class_0),
    ('precision_class_0', best_model_precision_class_0)
], voting='soft')

# Fit the VotingClassifier on the training data
voting_clf_optimized.fit(X_train, y_train)

# Extract thresholds for evaluation
with open('best_thresholds.json', 'r') as json_file:
    best_thresholds = json.load(json_file)
best_thresholds_df = pd.DataFrame(best_thresholds)

THRESHOLD_CLASS_1 = best_thresholds_df.loc[best_thresholds_df['Model'] == 'LGBM', 'Threshold Class 1'].values[0]
THRESHOLD_CLASS_0 = best_thresholds_df.loc[best_thresholds_df['Model'] == 'LGBM', 'Threshold Class 0'].values[0]

# Predict with the VotingClassifier
y_pred_voting_optimized = predict_with_class_specific_thresholds(voting_clf_optimized, X_test, THRESHOLD_CLASS_1, THRESHOLD_CLASS_0)

def evaluate_and_print_performance(y_test, y_pred, classifier_name):
    recall_1 = recall_score(y_test, y_pred, pos_label=1)
    precision_1 = precision_score(y_test, y_pred, pos_label=1, zero_division=0)
    recall_0 = recall_score(y_test, y_pred, pos_label=0)
    precision_0 = precision_score(y_test, y_pred, pos_label=0, zero_division=0)
    f1_macro = f1_score(y_test, y_pred, average='macro')
    accuracy = accuracy_score(y_test, y_pred)

    print(f"\n{classifier_name} Performance:")
    print(f'Recall Class 1: {recall_1:.4f}')
    print(f'Precision Class 1: {precision_1:.4f}')
    print(f'Recall Class 0: {recall_0:.4f}')
    print(f'Precision Class 0: {precision_0:.4f}')
    print(f'F1 Macro: {f1_macro:.4f}')
    print(f'Accuracy: {accuracy:.4f}')
    print(f"\nClassification Report for {classifier_name}:\n")
    print(classification_report(y_test, y_pred))

# Evaluate and print performance for VotingClassifier
evaluate_and_print_performance(y_test, y_pred_voting_optimized, "Optimized VotingClassifier")

# Combine the results into a DataFrame
results = pd.DataFrame({
    'Metric': ['Recall Class 1', 'Precision Class 1', 'Recall Class 0', 'Precision Class 0', 'F1 Macro', 'Accuracy'],
    'Optimized VotingClassifier': [recall_1, precision_1, recall_0, precision_0, f1_macro, accuracy]
})

# Set the Metric column as the index
results.set_index('Metric', inplace=True)

# Plot the results
plt.figure(figsize=(14, 10))
sns.barplot(data=results.reset_index().melt(id_vars='Metric', var_name='Model', value_name='Score'),
            x='Metric', y='Score', hue='Model', palette='viridis')

plt.title('Performance Comparison: Optimized VotingClassifier', fontsize=16)
plt.ylabel('Score', fontsize=14)
plt.xlabel('Metric', fontsize=14)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='upper right', fontsize=12)
plt.tight_layout()
plt.show()

# Save the final VotingClassifier model
joblib.dump(voting_clf_optimized, 'voting_classifier_optimized.pkl')


ValueError: The estimator dict should be a classifier.

In [14]:
import joblib
import json
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the best models and parameters
best_models_recall_class_1 = joblib.load('best_model_recall_class_1.pkl')
best_models_precision_class_1 = joblib.load('best_model_precision_class_1.pkl')
best_models_recall_class_0 = joblib.load('best_model_recall_class_0.pkl')
best_models_precision_class_0 = joblib.load('best_model_precision_class_0.pkl')

# Initialize the VotingClassifier with all four models
voting_clf_optimized = VotingClassifier(estimators=[
    ('recall_class_1', best_models_recall_class_1['classifier']),
    ('precision_class_1', best_models_precision_class_1['classifier']),
    ('recall_class_0', best_models_recall_class_0['classifier']),
    ('precision_class_0', best_models_precision_class_0['classifier'])
], voting='soft')

# Fit the VotingClassifier on the training data
voting_clf_optimized.fit(X_train, y_train)

# Set thresholds for evaluation
THRESHOLD_CLASS_1 = best_thresholds_df.loc[best_thresholds_df['Model'] == 'LGBM', 'Threshold Class 1'].values[0]
THRESHOLD_CLASS_0 = best_thresholds_df.loc[best_thresholds_df['Model'] == 'LGBM', 'Threshold Class 0'].values[0]

# Predict with the VotingClassifier
y_pred_voting_optimized = predict_with_class_specific_thresholds(voting_clf_optimized, X_test, THRESHOLD_CLASS_1, THRESHOLD_CLASS_0)

def evaluate_and_print_performance(y_test, y_pred, classifier_name):
    recall_1 = recall_score(y_test, y_pred, pos_label=1)
    precision_1 = precision_score(y_test, y_pred, pos_label=1, zero_division=0)
    recall_0 = recall_score(y_test, y_pred, pos_label=0)
    precision_0 = precision_score(y_test, y_pred, pos_label=0, zero_division=0)
    f1_macro = f1_score(y_test, y_pred, average='macro')
    accuracy = accuracy_score(y_test, y_pred)

    print(f"\n{classifier_name} Performance:")
    print(f'Recall Class 1: {recall_1:.4f}')
    print(f'Precision Class 1: {precision_1:.4f}')
    print(f'Recall Class 0: {recall_0:.4f}')
    print(f'Precision Class 0: {precision_0:.4f}')
    print(f'F1 Macro: {f1_macro:.4f}')
    print(f'Accuracy: {accuracy:.4f}')
    print(f"\nClassification Report for {classifier_name}:\n")
    print(classification_report(y_test, y_pred))

# Evaluate and print performance for VotingClassifier
evaluate_and_print_performance(y_test, y_pred_voting_optimized, "Optimized VotingClassifier")

# Combine the results into a DataFrame
results = pd.DataFrame({
    'Metric': ['Recall Class 1', 'Precision Class 1', 'Recall Class 0', 'Precision Class 0', 'F1 Macro', 'Accuracy'],
    'Optimized VotingClassifier': [recall_1, precision_1, recall_0, precision_0, f1_macro, accuracy]
})

# Set the Metric column as the index
results.set_index('Metric', inplace=True)

# Plot the results
plt.figure(figsize=(14, 10))
sns.barplot(data=results.reset_index().melt(id_vars='Metric', var_name='Model', value_name='Score'),
            x='Metric', y='Score', hue='Model', palette='viridis')

plt.title('Performance Comparison: Optimized VotingClassifier', fontsize=16)
plt.ylabel('Score', fontsize=14)
plt.xlabel('Metric', fontsize=14)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='upper right', fontsize=12)
plt.tight_layout()
plt.show()

# Save the final VotingClassifier model
joblib.dump(voting_clf_optimized, 'voting_classifier_optimized.pkl')


KeyError: 'classifier'

### Create and Evaluate Stacking Classifier

### Compare Classifiers

In [None]:
# Calculate accuracy for both models
accuracy_voting = accuracy_score(y_test, y_pred_voting_optimized)
accuracy_stacking = accuracy_score(y_test, y_pred_stacking_optimized)

# Combine the results into a DataFrame
results = pd.DataFrame({
    'Metric': ['Recall Class 1', 'Precision Class 1', 'Recall Class 0', 'Precision Class 0', 'F1 Macro', 'Accuracy'],
    'VotingClassifier': [recall_score(y_test, y_pred_voting_optimized, pos_label=1),
                         precision_score(y_test, y_pred_voting_optimized, pos_label=1, zero_division=0),
                         recall_score(y_test, y_pred_voting_optimized, pos_label=0),
                         precision_score(y_test, y_pred_voting_optimized, pos_label=0, zero_division=0),
                         f1_score(y_test, y_pred_voting_optimized, average='macro'),
                         accuracy_voting],
    'StackingClassifier': [recall_score(y_test, y_pred_stacking_optimized, pos_label=1),
                           precision_score(y_test, y_pred_stacking_optimized, pos_label=1, zero_division=0),
                           recall_score(y_test, y_pred_stacking_optimized, pos_label=0),
                           precision_score(y_test, y_pred_stacking_optimized, pos_label=0, zero_division=0),
                           f1_score(y_test, y_pred_stacking_optimized, average='macro'),
                           accuracy_stacking]
})

# Set the Metric column as the index
results.set_index('Metric', inplace=True)

# Plot the results
plt.figure(figsize=(14, 10))
sns.barplot(data=results.reset_index().melt(id_vars='Metric', var_name='Model', value_name='Score'),
            x='Metric', y='Score', hue='Model', palette='viridis')

plt.title('Performance Comparison: VotingClassifier vs. StackingClassifier', fontsize=16)
plt.ylabel('Score', fontsize=14)
plt.xlabel('Metric', fontsize=14)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='upper right', fontsize=12)
plt.tight_layout()
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import accuracy_score

# Calculate accuracy for both models
accuracy_voting = accuracy_score(y_test, y_pred_voting)
accuracy_stacking = accuracy_score(y_test, y_pred_stacking)

# Combine the results into a DataFrame
results = pd.DataFrame({
    'Metric': ['Recall Class 1', 'Precision Class 1', 'Recall Class 0', 'Precision Class 0', 'F1 Macro', 'Accuracy'],
    'VotingClassifier': [recall_1_voting, precision_1_voting, recall_0_voting, precision_0_voting, f1_macro_voting, accuracy_voting],
    'StackingClassifier': [recall_1_stacking, precision_1_stacking, recall_0_stacking, precision_0_stacking, f1_macro_stacking, accuracy_stacking]
})

# Set the Metric column as the index
results.set_index('Metric', inplace=True)

# Plot the results
plt.figure(figsize=(14, 10))
sns.barplot(data=results.reset_index().melt(id_vars='Metric', var_name='Model', value_name='Score'),
            x='Metric', y='Score', hue='Model', palette='viridis')

plt.title('Performance Comparison: VotingClassifier vs. StackingClassifier', fontsize=16)
plt.ylabel('Score', fontsize=14)
plt.xlabel('Metric', fontsize=14)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.legend(loc='upper right', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
results

### Interpretation of Results

1. **Recall for Class 1 (Loan Defaults)**:
   - VotingClassifier: 0.583
   - StackingClassifier: 0.399
   - The VotingClassifier shows a higher recall for class 1, indicating it is better at identifying loan defaults.

2. **Precision for Class 1 (Loan Defaults)**:
   - VotingClassifier: 0.491
   - StackingClassifier: 0.641
   - The StackingClassifier has higher precision for class 1, meaning it has fewer false positives compared to the VotingClassifier.

3. **Recall for Class 0 (Non-Defaults)**:
   - VotingClassifier: 0.828
   - StackingClassifier: 0.937
   - The StackingClassifier shows higher recall for class 0, indicating it is better at identifying non-defaults.

4. **Precision for Class 0 (Non-Defaults)**:
   - VotingClassifier: 0.875
   - StackingClassifier: 0.846
   - The VotingClassifier has a slightly higher precision for class 0.

5. **F1 Macro**:
   - VotingClassifier: 0.692
   - StackingClassifier: 0.690
   - The F1 Macro scores are quite close, with the VotingClassifier having a marginally higher score.

6. **Accuracy**:
   - VotingClassifier: 0.774
   - StackingClassifier: 0.818
   - The StackingClassifier has higher overall accuracy.

### Recommendations for Next Steps

1. **Further Improve Recall for Class 1**:
   - **Adjust Class Weights**: Consider adjusting the class weights further to give more importance to class 1 (loan defaults) in both classifiers.
   - **Threshold Tuning**: Explore a wider range of thresholds specifically targeting the recall for class 1. Lower thresholds might help improve recall.

2. **Balancing Precision and Recall**:
   - **Combine Models**: Experiment with combining models that have high recall and high precision for class 1 in an ensemble.
   - **Stacking with Weighted Voting**: Instead of simple stacking, try weighted voting where models that perform better on class 1 are given higher weights.

3. **Use of Advanced Resampling Techniques**:
   - **SMOTE with Tomek Links**: This combination can help improve the quality of the synthetic samples and remove borderline cases.
   - **ADASYN**: Adaptive Synthetic Sampling can generate more synthetic data for minority class instances that are harder to learn.

4. **Feature Engineering**:
   - **Interaction Features**: Create new features by interacting existing features to capture complex relationships.
   - **Domain-Specific Features**: Add domain-specific features that could provide better signals for loan defaults.

5. **Hyperparameter Tuning**:
   - **Grid Search**: Perform a more exhaustive grid search over hyperparameters, especially focusing on the parameters that control the balance between precision and recall.
   - **Random Search**: Use random search for hyperparameter optimization over a broader range of values.

6. **Model Interpretability**:
   - **SHAP Values**: Use SHAP values to understand the impact of features on the predictions. This can provide insights into which features are most important for predicting loan defaults.

By focusing on these areas, you can work towards achieving higher recall for class 1 while maintaining high precision, thereby improving the overall performance of your loan default prediction model.