# XGBoost in a pipeline using the Chronic Kidney Disease dataset

by Héctor Ramírez
<hr>

Throughout this example, we will work with the Chronic Kidney Disease dataset from https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease. This dataset requires a fair amount of wrangling before it can be fed to a model.

The dataset contains both categorical and numeric features, and many of them have missing values. The goal here is to predict whether a patient has chronic kidney disease, using various blood indicators as features.
<hr>

In [1]:
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper, CategoricalImputer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.feature_extraction import DictVectorizer
import xgboost as xgb
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
import warnings; warnings.simplefilter(action='ignore', category=FutureWarning)
# import warnings; warnings.simplefilter('ignore')

In [3]:
df = pd.read_csv('chronic_kidney_disease.csv', na_values='?')
nulls_per_column = df.isnull().sum()
print("Number of missing values in each column:\n", nulls_per_column)

Number of missing values in each column:
 age                 9
bp                 12
sg                 47
al                 46
su                 49
rbc               152
pc                 65
pcc                 4
ba                  4
bgr                44
bu                 19
sc                 17
sod                87
pot                88
hemo               52
pcv                71
wc                106
rc                131
htn                 2
dm                  2
cad                 2
appet               1
pe                  1
ane                 1
classification      0
dtype: int64
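
To make the effect of `na_values='?'` concrete, here is a minimal sketch on a tiny in-memory CSV. The rows are invented for illustration and are not taken from the real dataset; the names `toy_csv` and `toy_df` are likewise hypothetical. It reports missing values per column both as counts and as fractions.

```python
import io
import pandas as pd

# Toy CSV invented for illustration; '?' marks a missing entry.
toy_csv = io.StringIO(
    "age,bp,rbc\n"
    "48,80,?\n"
    "?,70,normal\n"
    "62,?,?\n"
    "51,90,normal\n"
)

# na_values="?" tells pandas to parse every '?' cell as NaN.
toy_df = pd.read_csv(toy_csv, na_values="?")

missing_counts = toy_df.isnull().sum()      # number of NaNs per column
missing_fraction = toy_df.isnull().mean()   # share of NaNs per column

print(missing_counts)
print(missing_fraction.round(2))
```

Looking at fractions rather than raw counts can make it easier to judge which columns are mostly empty.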


<hr>

## Preprocessing data

To train a model on this dataset and use it in a pipeline, we need to do some preprocessing first.

We will be using sklearn_pandas, which allows you to chain many more processing steps inside a pipeline than scikit-learn currently supports. Specifically, we will impute missing categorical values directly with the CategoricalImputer() class from sklearn_pandas, and use the DataFrameMapper() class to apply any sklearn-compatible transformer to DataFrame columns; the resulting output can be either a NumPy array or a DataFrame.
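
As a rough illustration of the two imputation strategies, the sketch below uses only scikit-learn on a toy DataFrame (invented values). Note the substitution: `SimpleImputer(strategy="most_frequent")` stands in for `CategoricalImputer()` here, on the assumption that both fill gaps with the most common value; it may also be a useful fallback if your installed sklearn_pandas release does not provide `CategoricalImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data invented for illustration: one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 51.0],
    "rbc": pd.Series(["normal", np.nan, "abnormal", "normal"], dtype=object),
})

# Numeric column: replace NaN with the column median.
num_imputer = SimpleImputer(strategy="median")
age_filled = num_imputer.fit_transform(toy[["age"]]).ravel()

# Categorical column: replace NaN with the most frequent value
# (a stand-in for sklearn_pandas' CategoricalImputer, see the note above).
cat_imputer = SimpleImputer(strategy="most_frequent")
rbc_filled = cat_imputer.fit_transform(toy[["rbc"]]).ravel()

print(age_filled)
print(rbc_filled)
```

The missing age becomes the median of 48, 62 and 51, and the missing rbc entry becomes "normal", the most common value in that column.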
<hr>

In [4]:
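# Features are all columns except the last one, which holds the class label
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# Encode the target as integers: 'ckd' -> 0, any other label -> 1
target = y.apply(lambda x: 0 if x=='ckd' else 1)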

categorical_feature_mask = X.dtypes == object
categorical_columns = X.columns[categorical_feature_mask].tolist()
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()

# Apply numeric imputer
numeric_imputation_mapper = DataFrameMapper(
    [([numeric_feature], SimpleImputer(strategy="median"))
     for numeric_feature in non_categorical_columns],
    input_df=True, df_out=True)

# Apply categorical imputer
categorical_imputation_mapper = DataFrameMapper(
    [(category_feature, CategoricalImputer())
     for category_feature in categorical_columns],
    input_df=True, df_out=True)

# Combine the numeric and categorical transformations
numeric_categorical_union = FeatureUnion([
    ("num_mapper", numeric_imputation_mapper),
    ("cat_mapper", categorical_imputation_mapper)
])
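
One detail worth checking at this point: with scikit-learn's default output settings, a FeatureUnion stacks whatever its transformers return into a single NumPy array, so column names are not carried forward. The sketch below illustrates this with two stand-in column selectors. `FunctionTransformer` and the helpers `select_numeric` and `select_categorical` are hypothetical substitutes for the DataFrameMapper objects above, chosen so that the example needs only scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Toy data invented for illustration.
toy = pd.DataFrame({
    "age": [48.0, 62.0, 51.0],
    "rbc": ["normal", "abnormal", "normal"],
})

def select_numeric(df):
    # Hypothetical helper: returns the numeric part as a DataFrame
    return df[["age"]]

def select_categorical(df):
    # Hypothetical helper: returns the categorical part as a DataFrame
    return df[["rbc"]]

union = FeatureUnion([
    ("num", FunctionTransformer(select_numeric)),
    ("cat", FunctionTransformer(select_categorical)),
])

# Both branches return DataFrames, but the union stacks them into an array.
combined = union.fit_transform(toy)
print(type(combined).__name__, combined.shape)

# Converting the array back to records yields integer keys, not column names.
records = pd.DataFrame(combined).to_dict("records")
print(records[0])
```

This suggests why a later pipeline step has to cope with plain arrays as well as DataFrames.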

<hr>
We also create a transformer class called Dictifier() to include in the pipeline; it converts DataFrames to lists of dictionaries, as required by DictVectorizer():
<hr>

In [5]:
class Dictifier(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # DictVectorizer expects a list of dicts, one per row
        if isinstance(X, pd.DataFrame):
            return X.to_dict("records")
        else:
            # e.g. a NumPy array: wrap it in a DataFrame first
            return pd.DataFrame(X).to_dict("records")
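
To see what the dictionary format buys us, here is a small sketch of the hand-off from records to DictVectorizer() on a toy DataFrame (invented values). It assumes `sparse=False` purely so the result prints as a dense array; the pipeline itself does not set that option.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Toy data invented for illustration.
toy = pd.DataFrame({
    "age": [48.0, 62.0],
    "rbc": ["normal", "abnormal"],
})

# One dict per row, the same conversion Dictifier performs.
records = toy.to_dict("records")

# sparse=False is an assumption made here for readable output.
vectorizer = DictVectorizer(sort=False, sparse=False)
encoded = vectorizer.fit_transform(records)

print(records)
print(vectorizer.feature_names_)   # numeric keys kept, string values expanded
print(encoded)
```

Numeric values pass through under their own key, while each distinct string value gets its own indicator column named `key=value`.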

<hr>
Next, we assemble the pipeline from the FeatureUnion object built out of the mappers, the Dictifier(), a DictVectorizer() that encodes the records (one-hot columns for string values), and our XGBClassifier().
<br><br>
We run 3-fold cross-validation and compute the area under the receiver operating characteristic curve (AUC):
<hr>

In [6]:
pipeline = Pipeline([
    ("featureunion", numeric_categorical_union),
    ("dictifier", Dictifier()),
    ("vectorizer", DictVectorizer(sort=False)),
    ("clf", xgb.XGBClassifier())
])

cross_val_scores = cross_val_score(pipeline, X, target.values, scoring="roc_auc", cv=3)

print("3-fold AUC: ", np.mean(cross_val_scores))

3-fold AUC:  0.998637406769937
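
For intuition about the metric itself, the following sketch computes AUC on four hand-made predictions (invented scores, unrelated to the model above) and compares it with the pairwise interpretation: the fraction of positive/negative pairs in which the positive example receives the higher score. Ties are ignored in the manual version, which is fine here because the toy scores contain none.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, invented for illustration.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# Pairwise view: how often does a positive outrank a negative?
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)  # both are 0.75: 3 of the 4 pairs are ordered correctly
```

An AUC close to 1, as in the cross-validation result above, means that almost every such pair is ranked in the right order.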


<hr>
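Lastly, let's run a randomized search to identify the best hyperparameters. Note that with n_iter=2 and cv=2, only two randomly sampled parameter combinations are evaluated, each with 2-fold cross-validation: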
<hr>

In [11]:
gbm_param_grid = {
    'clf__learning_rate': np.arange(0.05, 1, 0.05),
    'clf__max_depth': np.arange(3, 10, 1),
    'clf__n_estimators': np.arange(50, 200, 50)
}

# Perform RandomizedSearchCV
randomized_roc_auc = RandomizedSearchCV(estimator=pipeline, param_distributions=gbm_param_grid, 
                                        cv=2, n_iter=2, scoring='roc_auc')

# Fit the estimator
randomized_roc_auc.fit(X, target.values)

# Compute metrics
print(randomized_roc_auc.best_score_)
print("\nThe best set of parameters for this grid is:\n", randomized_roc_auc.best_params_)

0.9980266666666665

The best set of parameters for this grid is:
 {'clf__n_estimators': 50, 'clf__max_depth': 9, 'clf__learning_rate': 0.1}
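
To get a feel for how little of the grid two iterations cover, the sketch below counts every combination in the grid and draws two candidates with scikit-learn's `ParameterSampler`. To my understanding this mirrors how the search draws candidates, but treat that as an assumption. The `random_state=0` is also an assumption added here for reproducibility; the search above does not fix a seed, so its draws will differ.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same grid as in the search above.
gbm_param_grid = {
    'clf__learning_rate': np.arange(0.05, 1, 0.05),
    'clf__max_depth': np.arange(3, 10, 1),       # 3, 4, ..., 9
    'clf__n_estimators': np.arange(50, 200, 50)  # 50, 100, 150 (200 is excluded)
}

# Size of the full grid, i.e. what an exhaustive search would try.
total_combinations = len(ParameterGrid(gbm_param_grid))

# Two randomly drawn candidates; random_state=0 is an assumption for this sketch.
sampled = list(ParameterSampler(gbm_param_grid, n_iter=2, random_state=0))

print("Combinations in the full grid:", total_combinations)
for candidate in sampled:
    print(candidate)
```

Raising `n_iter` (and `cv`) trades longer run time for a better chance of finding a strong combination.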
