# A look into how SMOTE-ENN can artificially boost model accuracy
### By Ronald van den Berg
This notebook demonstrates how classifier performance on the diabetes dataset can be artificially boosted by applying SMOTE-ENN to the data *before* splitting them into test and train sets.

As explained below, the boost results from a poor methodological choice, not from a better model.

In [1]:
# reference to the data
data_file = './data/diabetes_binary_health_indicators_BRFSS2015.csv'

## Imports and settings

In [2]:
# imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load data

In [3]:
# load the data
df = pd.read_csv(data_file)

# log-transform the BMI column as its distribution looks log-normalish
df['BMI_log'] = np.log(df['BMI'])
df = df.drop('BMI', axis=1)

# Define features (X), target (y), model, and resampler

In [5]:
# extract features and target
X = df.drop('Diabetes_binary', axis=1)
y = df['Diabetes_binary']

# define model and its parameter space
model = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=20230629))
param_space = {
    'gradientboostingclassifier__n_estimators': [100, 200, 300],
    'gradientboostingclassifier__learning_rate': [0.1, 0.01, 0.001],
    'gradientboostingclassifier__max_depth': [3, 5, 10, 20]
}
    
# create the optimizer object
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_space,
    n_iter=36,
    cv=5,
    n_jobs=6,
    scoring='accuracy',
    random_state=20230629,
)

# create the resampler object
resampler = SMOTEENN(sampling_strategy='all', random_state=20230630)

# Analysis 1: Split -> SMOTE-ENN -> Fit -> Evaluate
This is the standard order of operations. It is sound, because the test set remains an unmodified subset of the original data.

In [6]:
# split the raw data
print('splitting...')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20230629)

# resample using SMOTE-ENN (applied to training set only)
print('resampling...')
X_train_resampled, y_train_resampled = resampler.fit_resample(X_train, y_train)

# fit
print('fitting...')
random_search.fit(X_train_resampled, y_train_resampled)

# evaluate
y_pred = random_search.predict(X_test)
print(f"accuracy = {accuracy_score(y_test, y_pred):.3f}")

splitting...
resampling...
fitting...
accuracy = 0.780


# Analysis 2: SMOTE-ENN -> Split -> Fit -> Evaluate
This order is unusual and problematic, because the test data are tinkered with, which can bias the evaluation (for more on this, see *Final Thoughts* below).

In [7]:
# resample raw data
print('resampling...')
Xr, yr = resampler.fit_resample(X, y)

# split resampled data
print('splitting...')
X_train, X_test, y_train, y_test = train_test_split(Xr, yr, test_size=0.2, random_state=20230629)

# fit
print('fitting...')
random_search.fit(X_train, y_train)

# evaluate
y_pred = random_search.predict(X_test)
print(f"accuracy = {accuracy_score(y_test, y_pred):.3f}")

resampling...
splitting...
fitting...
accuracy = 0.962


# Intermediate conclusions 
Applying SMOTE-ENN resampling before splitting the data gives an enormous accuracy boost compared to applying resampling to only the training data (78.0% -> 96.2%).

I think that the boost comes from the ENN part of SMOTE-ENN: 

"*ENN (Edited Nearest Neighbor) is a cleaning technique that removes any instance of the majority class whose predicted class by the k-NN method contradicts the actual class of the instance*" (source: GPT-4)

In other words: **ENN removes difficult-to-classify cases from the data**. By doing this before splitting the data into test and train sets, we prevent difficult cases from entering the test data. 

That leads to a boost in accuracy, but in my view this is a *cheat*: the superior accuracy is not due to having a superior model or a superior training method, but due to having made the test set less challenging. The high accuracy will *not* generalize to real-world data, which contain many challenging cases.
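
The mechanism can be illustrated with a toy example. The sketch below is a simplified majority-vote variant of ENN written from scratch, not imbalanced-learn's implementation. The 1-D data, the helper names `enn_keep_mask` and `threshold_accuracy`, and the decision threshold of 4.0 are all made up for illustration. The classifier is fixed; only the evaluation data change.

```python
from collections import Counter

def enn_keep_mask(xs, ys, k=3):
    """True for points whose label matches the majority label of their k nearest neighbours."""
    keep = []
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # distances to all other points (the point itself is excluded)
        others = sorted((abs(xi - xj), j) for j, xj in enumerate(xs) if j != i)
        neighbour_labels = [ys[j] for _, j in others[:k]]
        majority_label = Counter(neighbour_labels).most_common(1)[0][0]
        keep.append(majority_label == yi)
    return keep

def threshold_accuracy(xs, ys, threshold=4.0):
    """Accuracy of a fixed rule: predict class 1 when x exceeds the threshold."""
    hits = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)

# Two classes that overlap in the middle of the range
xs = [0.0, 1.1, 2.3, 3.0, 3.8, 4.1, 5.5, 6.2, 7.4, 8.9]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

keep = enn_keep_mask(xs, ys)
xs_clean = [x for x, kept in zip(xs, keep) if kept]
ys_clean = [y for y, kept in zip(ys, keep) if kept]

acc_all = threshold_accuracy(xs, ys)
acc_clean = threshold_accuracy(xs_clean, ys_clean)

print('removed x values:', [x for x, kept in zip(xs, keep) if not kept])
# removed x values: [3.0, 3.8, 4.1, 5.5]
print(f"accuracy on all points     = {acc_all:.2f}")    # 0.80
print(f"accuracy on cleaned points = {acc_clean:.2f}")  # 1.00
```

All removed points lie in the region where the classes overlap, so the same rule scores higher on the cleaned data without having improved in any way.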

## Appendix: does ENN perhaps even *hurt* classifier training?
If ENN removes difficult cases from the dataset, then the classifier will never be exposed to them. That might actually hurt model training, because the classifier will only learn how to deal with the easier cases.

To test this, we rerun the above analysis with SMOTE resampling. It's identical to SMOTE-ENN, except that it doesn't remove challenging cases from the resampled data.

In [8]:
# redefine the resampler
resampler = SMOTE(sampling_strategy='all', random_state=20230630)

# rerun analysis #1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20230629)
X_train_resampled, y_train_resampled = resampler.fit_resample(X_train, y_train)
random_search.fit(X_train_resampled, y_train_resampled)
y_pred = random_search.predict(X_test)
print(f"accuracy (split -> resample) = {accuracy_score(y_test, y_pred):.3f}")

# rerun analysis #2
Xr, yr = resampler.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(Xr, yr, test_size=0.2, random_state=20230629)
random_search.fit(X_train, y_train)
y_pred = random_search.predict(X_test)
print(f"accuracy (resample -> split) = {accuracy_score(y_test, y_pred):.3f}")

accuracy (split -> resample) = 0.863
accuracy (resample -> split) = 0.919


## Summary & Final Thoughts
Here is a summary of all results:

| Approach | Classifier | Resampling Method | Resampling Target | Accuracy | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | Gradient boosting | SMOTE-ENN | Training data only | 78.0% | The ENN step harms model training (cf. Approach 3) |
| 2 | Gradient boosting | SMOTE-ENN | Train + test data | 96.2% | Accuracy boost due to cheating |
| 3 | Gradient boosting | SMOTE | Training data only | 86.3% | Best approach out of these four |
| 4 | Gradient boosting | SMOTE | Train + test data | 91.9% | Accuracy boost due to cheating |

**Approaches 2** and **4** give superior accuracy, but that's because of fiddling with the test data. This will not generalize to real data and, therefore, these approaches should probably be avoided.

**Approach 3** gives 86.3% accuracy. This is similar to the best accuracy that I got when playing around with other classifiers (including KNN, Random Forest, XGBoost, etc.) and class balancing methods (oversampling, class weights). Hence, I believe that 86% is close to the maximum accuracy on this data set.

Interestingly, when we compare **Approach 1** to **Approach 3**, we see that the ENN step even *hurts* performance. The likely reason is that the classifier is not exposed to difficult cases during training. As a result, it can only classify the relatively easy cases when evaluated on the test set.
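
Because accuracy on an imbalanced data set depends heavily on the class ratio, it may help to read the figures above against the accuracy of a trivial model that always predicts the most frequent class. The helper below is a hypothetical addition (`majority_baseline` is not part of the analysis above), and the labels are toy values with an assumed 85/15 split, not the diabetes data.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / sum(counts.values())

# Toy labels with a hypothetical 85/15 class split (not the diabetes data)
toy_labels = [0] * 85 + [1] * 15
baseline = majority_baseline(toy_labels)
print(f"majority-class baseline = {baseline:.3f}")  # 0.850
```

For this notebook, the relevant input would be the labels of an unmodified test split, as used in Approaches 1 and 3. After SMOTE the classes are balanced, so the corresponding baseline for Approaches 2 and 4 is close to 50%. This is another reason why accuracies measured on resampled and on untouched test data are not directly comparable.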
