# Ronald's analysis of how SMOTE-ENN can be used to artificially boost model accuracy
This notebook demonstrates how classifier performance on the diabetes dataset can be artificially boosted by applying SMOTE-ENN to the data *before* splitting them into test and train sets.

An explanation of the boost is provided in *Final Thoughts* below.

In [1]:
# reference to the data
data_file = './data/diabetes_binary_health_indicators_BRFSS2015.csv'

## Imports and settings

In [2]:
# imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

## Load data

In [3]:
# load the data
df = pd.read_csv(data_file)

## Drop uninformative features to speed up the rest of the script

In [4]:
# drop uninformative features to speed things up a bit
drop_cols = ['CholCheck', 'AnyHealthcare', 'HvyAlcoholConsump', 'Stroke', 'NoDocbcCost', 'DiffWalk', 'HeartDiseaseorAttack', 'Veggies', 'PhysActivity', 'Sex', 'Fruits']
df = df.drop(columns=drop_cols)

## Define features (X), target (y), model, and resampler

In [5]:
# extract features and target
X = df.drop('Diabetes_binary', axis=1)
y = df['Diabetes_binary']

# define model and its parameter space
model = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=20230629))
param_space = {
    'gradientboostingclassifier__n_estimators': [100, 200, 300],
    'gradientboostingclassifier__learning_rate': [0.1, 0.01, 0.001],
    'gradientboostingclassifier__max_depth': [3, 5, 10, 20]
}
    
# create the optimizer object
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_space, n_iter=36, cv=5, n_jobs=6, scoring='accuracy', random_state=20230629)

# create the resampler object
resampler = SMOTEENN(sampling_strategy = 'all', random_state = 20230630)

# Analysis 1: Split -> SMOTE-ENN -> Fit -> Evaluate
This is the standard order. It is sound because the test set remains an unmodified subset of the original data, so the reported accuracy reflects performance on real, unresampled cases.

In [6]:
# split the raw data
print('splitting...')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20230629)

# resample using SMOTE-ENN (applied to training set only)
print('resampling...')
X_train_resampled, y_train_resampled = resampler.fit_resample(X_train, y_train)

# fit
print('fitting...')
random_search.fit(X_train_resampled, y_train_resampled)

# evaluate
y_pred = random_search.predict(X_test)
print(f"accuracy = {accuracy_score(y_test, y_pred):.3f}")

splitting...
resampling...
fitting...
accuracy = 0.773


# Analysis 2: SMOTE-ENN -> Split -> Fit -> Evaluate
This order is unusual and problematic because it modifies the test data, which can bias the evaluation (for more on this, see *Final Thoughts* below).

In [7]:
# resample raw data
print('resampling...')
Xr, yr = resampler.fit_resample(X, y)

# split resampled data
print('splitting...')
X_train, X_test, y_train, y_test = train_test_split(Xr, yr, test_size=0.2, random_state=20230629)

# fit
print('fitting...')
random_search.fit(X_train, y_train)

# evaluate
y_pred = random_search.predict(X_test)
print(f"accuracy = {accuracy_score(y_test, y_pred):.3f}")

resampling...
splitting...
fitting...
accuracy = 0.963


# Final Thoughts 
I think that the accuracy boost comes from the ENN part of SMOTE-ENN: 

"*ENN (Edited Nearest Neighbor) is a cleaning technique that removes any instance of the majority class whose predicted class by the k-NN method contradicts the actual class of the instance*" (source: GPT-4)

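By applying ENN *before* splitting the data, we remove many of the difficult cases (samples whose nearest neighbors mostly belong to the other class) before the test set is created. As a consequence, the test set is much less challenging, which leads to higher measured accuracy. Note that the quote mentions only the majority class, whereas the default ENN step inside imblearn's `SMOTEENN` appears to clean all classes, so difficult cases of both classes can be removed.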

However, this is bad practice, because the accuracy increase is not due to a better-trained model but to the removal of difficult-to-classify instances from the test data. Tinkering with the test set in this way biases the evaluation, making it seem as if the model performs better than it actually would on unseen, real-world data.
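
The effect can be illustrated with a small sketch. It uses a hand-rolled, simplified majority-vote version of ENN (not imblearn's implementation) on hypothetical two-dimensional Gaussian data, and a fixed decision rule that is never retrained. Only the evaluation set changes, so any accuracy difference comes purely from removing hard cases.

```python
import numpy as np

def knn_majority_labels(X, y, k):
    """Leave-one-out k-NN prediction for every sample (binary labels 0/1)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a sample is never its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbors
    return (y[nn].mean(axis=1) > 0.5).astype(int)

rng = np.random.default_rng(0)
n = 500
# hypothetical overlapping classes centered at (0, 0) and (1.5, 1.5)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(n, 2)),
                   rng.normal(1.5, 1.0, size=(n, 2))])
y_toy = np.repeat([0, 1], n)

# simplified ENN: keep only samples whose neighbors agree with their label
keep = knn_majority_labels(X_toy, y_toy, k=3) == y_toy

# fixed classifier (the line halfway between the two class centers)
pred = (X_toy.sum(axis=1) > 1.5).astype(int)

acc_raw = (pred == y_toy).mean()               # evaluated on all samples
acc_enn = (pred[keep] == y_toy[keep]).mean()   # evaluated on ENN-cleaned samples
print(f"kept {keep.sum()} of {len(y_toy)} samples")
print(f"accuracy on raw data     = {acc_raw:.3f}")
print(f"accuracy on cleaned data = {acc_enn:.3f}")
```

The classifier is identical in both evaluations; it only looks better on the cleaned data because the samples it would have misclassified are mostly the ones ENN discards.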

## Appendix: rerun the above with SMOTE instead of SMOTE-ENN
If the accuracy boost is really due to ENN (as I hypothesized above), then the accuracy difference should be much smaller when using SMOTE. To test this, I reran the above analysis with SMOTE instead of SMOTE-ENN. 

As can be seen below, the accuracy boost from resampling *before* splitting indeed becomes smaller (0.053 with SMOTE versus 0.190 with SMOTE-ENN), but it does not disappear.

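This is because there is still a problem of *data leakage*: SMOTE creates each synthetic sample by interpolating between an original sample and one of its nearest neighbors. When resampling precedes the split, synthetic samples in the training set can be derived from samples that end up in the test set, and vice versa. Hence, the training set contains information about the test set, which artificially increases model accuracy.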
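
A rough way to see this leakage is to measure how close test samples lie to their nearest training sample under both orders. The sketch below assumes a simplified SMOTE-like interpolation written in NumPy (the helper `smote_like` is hypothetical, not imblearn's `SMOTE`) and uses hypothetical 10-dimensional Gaussian data for a single class.

```python
import numpy as np

def smote_like(X, n_new, k, rng):
    """Simplified SMOTE: interpolate between a sample and one of its k nearest neighbors."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X), size=n_new)   # randomly chosen base samples
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                 # interpolation weights in [0, 1)
    return X[base] + lam * (X[neigh] - X[base])

def mean_nn_distance(A, B):
    """Mean distance from each row of A to its nearest row in B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.min(axis=1).mean()

rng = np.random.default_rng(0)
X_min = rng.normal(size=(200, 10))   # hypothetical minority-class samples

# order 1: split -> resample (test set holds raw samples only)
idx = rng.permutation(len(X_min))
test_idx, train_idx = idx[:40], idx[40:]
train_a = np.vstack([X_min[train_idx],
                     smote_like(X_min[train_idx], 800, 5, rng)])
test_a = X_min[test_idx]

# order 2: resample -> split (raw and synthetic samples are mixed, then split)
X_all = np.vstack([X_min, smote_like(X_min, 1000, 5, rng)])
idx = rng.permutation(len(X_all))
n_test = len(X_all) // 5
test_b, train_b = X_all[idx[:n_test]], X_all[idx[n_test:]]

dist_clean = mean_nn_distance(test_a, train_a)
dist_leaky = mean_nn_distance(test_b, train_b)
print(f"split -> resample: mean distance to nearest training sample = {dist_clean:.2f}")
print(f"resample -> split: mean distance to nearest training sample = {dist_leaky:.2f}")
```

In the resample-then-split order the test samples sit much closer to training samples, because many of them lie on the same interpolation segments. This is consistent with the leakage explanation above, although it is only an illustration on toy data.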

In [8]:
# redefine the resampler
resampler = SMOTE(sampling_strategy = 'all', random_state = 20230630)

# rerun analysis #1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20230629)
X_train_resampled, y_train_resampled = resampler.fit_resample(X_train, y_train)
random_search.fit(X_train_resampled, y_train_resampled)
y_pred = random_search.predict(X_test)
print(f"accuracy (split -> resample) = {accuracy_score(y_test, y_pred):.3f}")

# rerun analysis #2
Xr, yr = resampler.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(Xr, yr, test_size=0.2, random_state=20230629)
random_search.fit(X_train, y_train)
y_pred = random_search.predict(X_test)
print(f"accuracy (resample -> split) = {accuracy_score(y_test, y_pred):.3f}")

accuracy (split -> resample) = 0.856
accuracy (resample -> split) = 0.909
