# Binary Classifier to Predict Outcome Y with 5-fold Cross-Validation

Import the required libraries, installing them first if necessary.
Note that some versions of scikit-learn and imbalanced-learn (imblearn) are incompatible with each other; a resolution is described here:
https://github.com/scikit-learn-contrib/imbalanced-learn/issues/995

In [1]:
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_validate
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

Load the dataset and provide contextual information on column types.

In [2]:
# Load the dataset
df = pd.read_csv("mydata.csv")

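# Column types. binary_columns is informational only: X5 and W reach the
# model unchanged via remainder='passthrough', and Y is the target.
binary_columns = ['X5', 'W', 'Y']
categorical_columns = ['X6', 'X8']
numeric_columns = ['X1', 'X2', 'X3', 'X4', 'X7', 'X9']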

Split the data into the features and the target.

In [3]:
X = df.drop(columns=['Y'])  # Features
y = df['Y']  # Target

Create the preprocessing step: standard scaling for the numeric columns and one-hot encoding for the categorical columns. The remaining (binary) feature columns are passed through unchanged.

In [4]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_columns),
        ('cat', OneHotEncoder(), categorical_columns),
    ],
    remainder='passthrough'
)
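
To make the two transformations concrete, here is a minimal standard-library sketch of what standard scaling and one-hot encoding compute for a single column each. The helper names `standardize` and `one_hot` and the toy values are hypothetical, and the sketch ignores details that scikit-learn handles for you (constant columns, unseen categories, sparse output).

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a numeric column: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0); assumes the column is not constant
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode a categorical column as indicator vectors, one slot per sorted category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical toy columns, standing in for one numeric and one categorical feature
x_numeric = [2.0, 4.0, 6.0, 8.0]
x_categorical = ["a", "b", "a", "c"]

x_scaled = standardize(x_numeric)
categories, x_encoded = one_hot(x_categorical)

print(x_scaled)
print(categories, x_encoded)
```

Like `StandardScaler`, the sketch divides by the population standard deviation, so the scaled column has mean 0 and standard deviation 1.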

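Create the model as a pipeline that chains the preprocessor, random undersampling of the majority class (Hasanin, Khoshgoftaar, Leevy, & Bauder, 2019), and a gradient boosting classifier. Because the undersampler is a step in an imblearn `Pipeline`, it is applied only when the pipeline is fit, so each held-out fold keeps its original class distribution.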

In [5]:
# Classifier pipeline and model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('undersampler', RandomUnderSampler(sampling_strategy='majority')),
    ('classifier', GradientBoostingClassifier(n_estimators=100))
])
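
The idea behind random undersampling of the majority class can be sketched in plain Python as follows. This is an illustration under simplifying assumptions (two classes, plain lists as inputs), not imbalanced-learn's implementation; the function name `undersample_majority` and the 90/10 toy labels are hypothetical.

```python
import random
from collections import Counter


def undersample_majority(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_label, _ = counts.most_common(1)[0]
    n_minority = min(counts.values())

    # Keep a random subset of the majority class, sized to match the minority class
    majority_idx = [i for i, lab in enumerate(labels) if lab == majority_label]
    keep = set(rng.sample(majority_idx, n_minority))

    # Keep every minority-class row
    keep.update(i for i, lab in enumerate(labels) if lab != majority_label)

    kept = sorted(keep)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical imbalanced toy data: 90 negatives, 10 positives
toy_labels = [0] * 90 + [1] * 10
toy_rows = list(range(100))

rows_res, labels_res = undersample_majority(toy_rows, toy_labels)
print(Counter(toy_labels), "->", Counter(labels_res))
```

The classifier then trains on a balanced sample, at the cost of discarding most of the majority-class rows in each training fold.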

Perform 5-fold cross-validation on the model and print the balanced accuracy along with some additional metrics. Balanced accuracy, the mean of the per-class recalls, is reported because plain accuracy can be misleading on imbalanced datasets (Brodersen, Ong, Stephan, & Buhmann, 2010).

In [6]:
# Cross-validation (5 folds), scoring all metrics on the same fitted models
scoring_metrics = ['balanced_accuracy', 'accuracy', 'f1', 'roc_auc']
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring_metrics)

for scoring_metric in scoring_metrics:
    # With a list of metrics, test scores are stored under 'test_<metric>'
    test_scores = cv_results[f"test_{scoring_metric}"]
    print(
        f"{scoring_metric} mean +/- std. dev.: "
        f"{test_scores.mean():.3f} +/- "
        f"{test_scores.std():.3f}"
    )

balanced_accuracy mean +/- std. dev.: 0.755 +/- 0.007
accuracy mean +/- std. dev.: 0.710 +/- 0.010
f1 mean +/- std. dev.: 0.572 +/- 0.010
roc_auc mean +/- std. dev.: 0.823 +/- 0.005
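
To see why balanced accuracy is the headline metric here, the following minimal sketch compares it with plain accuracy for a trivial predictor on a hypothetical 90/10 dataset. The helper functions and toy labels are illustrative assumptions written in plain Python, not scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of samples with the given true label that were predicted correctly."""
    total = sum(1 for t in y_true if t == label)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / total


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    labels = sorted(set(y_true))
    return sum(recall(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical imbalanced labels and a predictor that always outputs the majority class
y_true_toy = [0] * 90 + [1] * 10
y_pred_toy = [0] * 100

acc = accuracy(y_true_toy, y_pred_toy)
bal_acc = balanced_accuracy(y_true_toy, y_pred_toy)
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```

The always-negative predictor scores 0.90 on accuracy but only 0.50 on balanced accuracy, because its recall is 1.0 on the majority class and 0.0 on the minority class.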


References:

Brodersen, K. H., Ong, C. S., Stephan, K. E., & Buhmann, J. M. (2010). The balanced accuracy and its posterior distribution. Paper presented at the 2010 20th International Conference on Pattern Recognition, 3121-3124. doi:10.1109/ICPR.2010.764

Hasanin, T., Khoshgoftaar, T. M., Leevy, J. L., & Bauder, R. A. (2019). Severely imbalanced big data challenges: Investigating data sampling approaches. Journal of Big Data, 6(1), 107. doi:10.1186/s40537-019-0274-4