
This follow-up notebook isolates the comparison between SMOTE and ADASYN so that each resampling technique's effect on model training, validation, and prediction can be assessed without interference from other preprocessing steps or analyses. Holding the preprocessing and model configuration fixed makes metrics such as ROC AUC and classification accuracy directly comparable between the two methods. A standalone notebook is also easier to share and to rerun on future projects involving imbalanced datasets.

In [2]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from collections import Counter

import tensorflow as tf
from tensorflow.keras.utils import to_categorical
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense
import xgboost as xgb
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, KFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder

In [4]:
df = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [6]:
df.drop("id", axis=1, inplace=True)  # 'id' is a row identifier, not a feature, so it is excluded from training
X = df.drop('loan_status', axis=1)   # feature columns
y = df['loan_status']                # label we want to predict

In [8]:
numeric_features = ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length']
categorical_features = ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file']

# Define ColumnTransformer for scaling and one-hot encoding
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

In [10]:
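# Note: the preprocessor is fitted on all rows here, before the train/validation split,
# so the scaling statistics also reflect the validation rows.
X = preprocessor.fit_transform(X)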

In [12]:
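# No random_state or stratify is set, so the split (and the class counts printed below) varies between runs
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.20)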
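
One way to keep validation rows out of the scaler's statistics is to split first and fit the preprocessor on the training rows only. The sketch below illustrates that order on a tiny made-up frame; the column names `amount` and `grade`, the toy values, the fixed `random_state`, and `handle_unknown="ignore"` are assumptions for illustration, not part of the notebook's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the loan table.
toy = pd.DataFrame({
    "amount": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "grade": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "B"],
    "target": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
})
X_toy = toy.drop("target", axis=1)
y_toy = toy["target"]

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["amount"]),
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["grade"]),
    ]
)

# Fit on the training rows only, then reuse the fitted transformer on validation rows.
X_tr_prep = toy_preprocessor.fit_transform(X_tr)
X_val_prep = toy_preprocessor.transform(X_val)
```

With this order, the mean and standard deviation used for scaling come from the training rows alone, and the validation rows are transformed with those same statistics.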

In [14]:
smote = SMOTE(sampling_strategy='minority', random_state=42)
adasyn = ADASYN(sampling_strategy='minority',random_state=42)
print("Original dataset distribution:", Counter(y_train))

# Apply SMOTE on the training dataset
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Apply ADASYN on the training dataset
X_train_adasyn, y_train_adasyn = adasyn.fit_resample(X_train, y_train)

# Check the new distribution after SMOTE and ADASYN
print("SMOTE oversampled dataset distribution:", Counter(y_train_smote))
print("ADASYN oversampled dataset distribution:", Counter(y_train_adasyn))

Original dataset distribution: Counter({0: 40216, 1: 6700})
SMOTE oversampled dataset distribution: Counter({0: 40216, 1: 40216})
ADASYN oversampled dataset distribution: Counter({1: 40547, 0: 40216})


In [24]:
xgb_model_smote = xgb.XGBClassifier(
    max_depth=8,              # Max depth of trees
    learning_rate=0.005,       # Learning rate (step size shrinkage)
    n_estimators=8000,         # Number of trees to be built
    subsample=0.8,            # Fraction of samples used per tree
    colsample_bytree=1,     # Fraction of features used per tree
    colsample_bylevel=0.8,    # Fraction of features per tree level
    min_child_weight=1,       # Minimum sum of instance weight in a child
    gamma=0.005,                # Minimum loss reduction required for split
    scale_pos_weight=1,       # Balancing positive/negative classes
    reg_alpha=0.4,           # L1 regularization
    reg_lambda=0.15,           # L2 regularization
    tree_method='hist',       # Use histogram-based algorithm
    random_state=42,          # Seed for reproducibility
    objective='binary:logistic',  # For binary classification
    eval_metric='auc',        # Evaluation metric
    n_jobs=-1                 # Use all available cores
)

In [26]:
# Same hyperparameters as xgb_model_smote; clone() returns a new, unfitted copy
from sklearn.base import clone
xgb_model_adasyn = clone(xgb_model_smote)

In [22]:
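# SMOTE pipeline
xgb_model_smote.fit(X_train_smote, y_train_smote)
# Note: cross_val_score clones the estimator and refits the clone on folds of (X_cv, y_cv),
# so the fit on the SMOTE-resampled data above does not influence these scores.
smote_scores = cross_val_score(xgb_model_smote, X_cv, y_cv, scoring='roc_auc', cv=5)

# ADASYN pipeline
xgb_model_adasyn.fit(X_train_adasyn, y_train_adasyn)
# Same caveat: both models share hyperparameters and seed, so the two sets of scores are identical.
adasyn_scores = cross_val_score(xgb_model_adasyn, X_cv, y_cv, scoring='roc_auc', cv=5)

# Compare scores
print("Average ROC AUC with SMOTE:", smote_scores.mean())
print("Average ROC AUC with ADASYN:", adasyn_scores.mean())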

Average ROC AUC with SMOTE: 0.9404447268681139
Average ROC AUC with ADASYN: 0.9404447268681139
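
The identical averages above follow from how `cross_val_score` works: it clones the estimator and refits the clone on folds of the data it is given, so an earlier `fit` call does not affect the result. A minimal sketch of this behaviour, and of scoring an already-fitted model on held-out rows instead, is given below. It uses synthetic data and `LogisticRegression` as a lightweight stand-in for the XGBoost models; all names in it are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic imbalanced data (an assumption; not the loan dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Two models with the same hyperparameters, fitted on different training sets.
model_a = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
model_b = LogisticRegression(max_iter=1000).fit(X_tr[:200], y_tr[:200])

# cross_val_score ignores those fits: it refits clones on folds of the held-out data.
cv_a = cross_val_score(model_a, X_ho, y_ho, scoring="roc_auc", cv=5)
cv_b = cross_val_score(model_b, X_ho, y_ho, scoring="roc_auc", cv=5)
same_cv_scores = np.allclose(cv_a, cv_b)

# Scoring the fitted models directly reflects what each one actually learned.
auc_a = roc_auc_score(y_ho, model_a.predict_proba(X_ho)[:, 1])
auc_b = roc_auc_score(y_ho, model_b.predict_proba(X_ho)[:, 1])
print("cross_val_score results identical:", same_cv_scores)
print("Held-out AUC, model A:", auc_a)
print("Held-out AUC, model B:", auc_b)
```

Applied to this notebook, the second pattern would mean calling `roc_auc_score(y_cv, xgb_model_smote.predict_proba(X_cv)[:, 1])` and the ADASYN equivalent after fitting.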


In [32]:
from sklearn.metrics import roc_auc_score


skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Initialize lists to hold the AUC scores for each method
smote_aucs = []
adasyn_aucs = []

# Stratified K-Fold Cross-Validation
# Note: both models are refit on X_train_smote / X_train_adasyn from the earlier 80/20 split,
# not on this fold's training rows. Each validation fold therefore overlaps the data the
# models were trained on, so the AUCs below are optimistic.
for train_index, val_index in skf.split(X, y):
    # Split data (positional indexing on the pandas Series)
    X_train, X_cv = X[train_index], X[val_index]
    y_train, y_cv = y.iloc[train_index], y.iloc[val_index]

    # SMOTE Pipeline
    xgb_model_smote.fit(X_train_smote, y_train_smote)
    y_pred_smote = xgb_model_smote.predict_proba(X_cv)[:, 1]
    smote_auc = roc_auc_score(y_cv, y_pred_smote)
    smote_aucs.append(smote_auc)

    # ADASYN Pipeline
    xgb_model_adasyn.fit(X_train_adasyn, y_train_adasyn)
    y_pred_adasyn = xgb_model_adasyn.predict_proba(X_cv)[:, 1]
    adasyn_auc = roc_auc_score(y_cv, y_pred_adasyn)
    adasyn_aucs.append(adasyn_auc)

# Calculate the mean AUC scores for each method
print("Average ROC AUC with SMOTE across folds:", np.mean(smote_aucs))
print("Average ROC AUC with ADASYN across folds:", np.mean(adasyn_aucs))


Average ROC AUC with SMOTE across folds: 0.9912255596174189
Average ROC AUC with ADASYN across folds: 0.9906954358506368


In [34]:
# The 'id' column is present in the test data, and we want to retain it in the output
ids = test['id']
# Drop the 'id' column before preprocessing and prediction
X_new_test = test.drop(['id'], axis=1)

In [40]:
# Preprocess the new test data (same preprocessor used for training data)
X_new_test_preprocessed = preprocessor.transform(X_new_test)

# Make predictions
predictions_smote = xgb_model_smote.predict(X_new_test_preprocessed)
predictions_adasyn = xgb_model_adasyn.predict(X_new_test_preprocessed)
predictions_probs_smote = xgb_model_smote.predict_proba(X_new_test_preprocessed)[:,1]
predictions_probs_adasyn = xgb_model_adasyn.predict_proba(X_new_test_preprocessed)[:,1]

In [42]:
# Assign labels based on a threshold of 0.5
predicted_labels_smote = [1 if prob >= 0.5 else 0 for prob in predictions_probs_smote]
predicted_labels_adasyn = [1 if prob >= 0.5 else 0 for prob in predictions_probs_adasyn]

# Count the distribution of predicted labels
test_distribution_smote = Counter(predicted_labels_smote)
test_distribution_adasyn = Counter(predicted_labels_adasyn)
print(f"Approximate test dataset distribution with SMOTE (based on predictions): {test_distribution_smote}")
print(f"Approximate test dataset distribution with ADASYN (based on predictions): {test_distribution_adasyn}")

Approximate test dataset distribution with SMOTE (based on predictions): Counter({0: 34700, 1: 4398})
Approximate test dataset distribution with ADASYN (based on predictions): Counter({0: 34713, 1: 4385})


In [44]:
from scipy.stats import chi2_contingency
from collections import Counter

# Predicted-class counts copied from the output above
smote_distribution = Counter({0: 34700, 1: 4398})
adasyn_distribution = Counter({0: 34713, 1: 4385})

# Rows of the 2x2 contingency table: one row of class counts per resampling method
smote_counts = [smote_distribution[0], smote_distribution[1]]
adasyn_counts = [adasyn_distribution[0], adasyn_distribution[1]]

# Chi-squared test of homogeneity (SciPy applies Yates' continuity correction to 2x2 tables by default)
chi2, p_value = chi2_contingency([smote_counts, adasyn_counts])[:2]
chi2, p_value


(0.01846984847130902, 0.8918973150483489)

The chi-squared test comparing the predicted-class counts of the SMOTE-trained and ADASYN-trained models yields a statistic of approximately 0.018 and a p-value of 0.892. With a p-value this high, there is no evidence that the two models differ in the proportion of test rows they assign to each class. The test only compares aggregate counts, so it does not show that the two models agree on individual rows.

This suggests that either method could be suitable here, and that the choice between them should rest on other metrics, such as model performance.
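
Because both models label the same test rows, their predictions are paired, and a paired comparison can look at the rows on which they disagree. The sketch below computes an exact two-sided McNemar p-value from two label sequences using only the standard library. The function name `exact_mcnemar_p_value` and the toy labels are assumptions for illustration, and this test is a swapped-in alternative, not the analysis run in the notebook.

```python
from math import comb


def exact_mcnemar_p_value(labels_a, labels_b):
    """Two-sided exact McNemar p-value for two paired sequences of 0/1 labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label sequences must have the same length")

    # Discordant pairs: rows where exactly one of the two models predicts class 1.
    only_a = sum(1 for a, b in zip(labels_a, labels_b) if a == 1 and b == 0)
    only_b = sum(1 for a, b in zip(labels_a, labels_b) if a == 0 and b == 1)
    n_discordant = only_a + only_b
    if n_discordant == 0:
        return 1.0  # the models never disagree

    # Under the null hypothesis, each discordant pair falls on either side with probability 0.5.
    k = min(only_a, only_b)
    tail = sum(comb(n_discordant, i) for i in range(k + 1))
    # Integer arithmetic up to the final division avoids underflow for large counts.
    return min(1.0, 2 * tail / 2 ** n_discordant)


# Toy labels (hypothetical): the two sequences disagree on two rows, both in the same direction.
toy_a = [1, 1, 1, 0, 0, 0, 1, 0]
toy_b = [1, 0, 0, 0, 0, 0, 1, 0]
toy_p = exact_mcnemar_p_value(toy_a, toy_b)
print("Exact McNemar p-value:", toy_p)
```

In this notebook the call would be `exact_mcnemar_p_value(predicted_labels_smote, predicted_labels_adasyn)`, which uses the per-row labels rather than the aggregate counts.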