## This example is recreated from the Towards Data Science tutorial that is saved as a PDF in the GitHub repository

In [1]:
import pandas as pd

rain = pd.read_csv("data/weatherAUS.csv")

In [2]:
cols_to_drop = ["Date", "Location", "RainTomorrow", "Rainfall"]

rain.drop(cols_to_drop, axis=1, inplace=True)

Next, we compute the proportion of missing values in each column. If the proportion is 40% or higher, we will drop the column:

In [3]:
missing_props = rain.isna().mean(axis=0)
over_threshold = missing_props[missing_props >= 0.4]

Three columns contain more than 40% missing values. We will drop them:

In [4]:
rain.drop(over_threshold.index, 
          axis=1, 
          inplace=True)
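To make the missing-value filter concrete, here is a minimal sketch on a made-up frame. The `toy` data and its column names are hypothetical and are not part of the weather dataset. `isna().mean()` returns the fraction of missing values per column, and the boolean mask selects the columns at or above the 40% threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the weather dataset
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, np.nan, 5.0],  # 3 of 5 values missing
        "b": [1.0, 2.0, 3.0, np.nan, 5.0],        # 1 of 5 values missing
        "c": ["x", "y", "z", "x", "y"],           # nothing missing
    }
)

# Fraction of missing values per column
toy_missing = toy.isna().mean(axis=0)

# Columns whose missing fraction is 40% or higher
toy_over = toy_missing[toy_missing >= 0.4]

# Drop them without modifying the original frame
toy_reduced = toy.drop(toy_over.index, axis=1)

print(toy_missing)
print(list(toy_reduced.columns))
```

Only column `a` crosses the threshold here, so `toy_reduced` keeps `b` and `c`.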

Now, before we move on to pipelines, let’s divide the data into feature and target arrays:

In [5]:
X = rain.drop("RainToday", axis=1)
y = rain.RainToday

Next, there are both categorical and numeric features. We will build two separate pipelines and combine them later.
The next code examples rely heavily on sklearn pipelines. If you are not familiar with them, check out my separate article for a complete guide.
For the categorical features, we will impute the missing values with the mode of the column and encode them with One-Hot encoding:

In [6]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        # sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
        ("oh-encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]
)
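As a quick check of what the two categorical steps do, here is a small sketch on a hypothetical one-column frame (`toy_cat` and the `wind` column are made up for illustration). The missing entry is filled with the most frequent value before encoding. The encoder is left at its default sparse output and converted with `.toarray()`, so the snippet does not depend on which scikit-learn version names the dense-output argument.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with one missing value
toy_cat = pd.DataFrame({"wind": ["N", "S", np.nan, "N"]}, dtype=object)

toy_cat_pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Default sparse output; converted to a dense array below
        ("oh-encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# The NaN becomes "N" (the mode), then each category gets its own column
encoded = toy_cat_pipeline.fit_transform(toy_cat).toarray()
print(encoded)
```

The result has one column per category (`N`, `S`), and the imputed third row is encoded as `N`.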

For the numeric features, I will impute missing values with the column mean and apply StandardScaler so that the features have a mean of 0 and a variance of 1:

In [7]:
from sklearn.preprocessing import StandardScaler

numeric_pipeline = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")), 
           ("scale", StandardScaler())]
)
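The effect of the numeric steps can be verified on a tiny hypothetical array (`toy_num` is made up for illustration). After mean imputation and scaling, the column should have a mean close to 0 and a variance close to 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric column with one missing value
toy_num = np.array([[1.0], [2.0], [np.nan], [6.0]])

toy_num_pipeline = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")),
           ("scale", StandardScaler())]
)

# The NaN is replaced by the mean of the observed values (3.0), then scaled
scaled = toy_num_pipeline.fit_transform(toy_num)

# Population variance (ddof=0), which is what StandardScaler normalizes
print(scaled.mean(), scaled.var())
```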

Finally, we will combine the two pipelines with a column transformer. To specify which columns the pipelines are designed for, we should first isolate the categorical and numeric feature names:

In [8]:
cat_cols = X.select_dtypes(exclude="number").columns
num_cols = X.select_dtypes(include="number").columns

Next, we will pass these, along with their corresponding pipelines, into a ColumnTransformer instance:

In [9]:
from sklearn.compose import ColumnTransformer

full_processor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipeline, num_cols),
        ("categorical", categorical_pipeline, cat_cols),
    ]
)
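To illustrate how the column transformer routes columns, here is a reduced sketch on a hypothetical two-column frame. `toy_X`, `temp`, and `wind` are made-up names, and single transformers stand in for the full pipelines to keep it short. The output places the numeric columns first, followed by the one-hot columns, in the order the transformers are listed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy features: one numeric, one categorical
toy_X = pd.DataFrame(
    {
        "temp": [20.0, np.nan, 25.0],
        "wind": pd.Series(["N", "S", "N"], dtype=object),
    }
)

# Same dtype-based selection as in the cell above
toy_cat_cols = toy_X.select_dtypes(exclude="number").columns
toy_num_cols = toy_X.select_dtypes(include="number").columns

toy_processor = ColumnTransformer(
    transformers=[
        ("numeric", SimpleImputer(strategy="mean"), toy_num_cols),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), toy_cat_cols),
    ]
)

toy_out = toy_processor.fit_transform(toy_X)

# 1 numeric column + 2 one-hot columns
print(toy_out.shape)
```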

The full pipeline is finally ready. The only thing missing is the XGBoost classifier, which we will add in the next section.

In [10]:
import xgboost as xgb

xgb_cl = xgb.XGBClassifier()

Fortunately, the classifier follows the familiar fit-predict pattern of sklearn, meaning we can use it like any other sklearn model.
Before we train the classifier, let’s preprocess the data and divide it into train and test sets:

In [11]:
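# Apply preprocessing
X_processed = full_processor.fit_transform(X)

# Impute the target and flatten it to 1-D (avoids the column-vector warning)
y_processed = SimpleImputer(strategy="most_frequent").fit_transform(
    y.values.reshape(-1, 1)
).ravel()
# Encode "Yes"/"No" as 1/0; recent XGBoost versions expect integer class labels
y_processed = (y_processed == "Yes").astype(int)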

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y_processed, stratify=y_processed, random_state=1121218
)

Since the target contains NaN, I imputed it by hand. Also, it is important to pass y_processed to stratify so that the split contains the same proportion of categories in both sets.
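The effect of `stratify` is easy to see on a small synthetic target. This is a sketch with made-up data: `toy_y` and `toy_features` are hypothetical, and `random_state=0` is an arbitrary choice. With stratification, the share of the positive class should be close to 20% in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives, 20 positives
toy_y = np.array([0] * 80 + [1] * 20)
toy_features = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_y, stratify=toy_y, random_state=0
)

# Share of the positive class in each split
print(toy_y_train.mean(), toy_y_test.mean())
```

Without `stratify`, the two proportions can drift apart, which matters more as the classes become more imbalanced.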
Now, we fit the classifier with default parameters and evaluate its performance:

In [12]:
from sklearn.metrics import accuracy_score

# Init classifier
xgb_cl = xgb.XGBClassifier()

# Fit
xgb_cl.fit(X_train, y_train)

# Predict
preds = xgb_cl.predict(X_test)

# Score
accuracy_score(y_test, preds)

0.8507080984463082

### Explore this resource in more depth, as it discusses the hyperparameters in detail

In [13]:
param_grid = {
    "max_depth": [3, 4, 5, 7],
    "learning_rate": [0.1, 0.01, 0.05],
    "gamma": [0, 0.25, 1],
    "reg_lambda": [0, 1, 10],
    "scale_pos_weight": [1, 3, 5],
    "subsample": [0.8],
    "colsample_bytree": [0.5],
}

In the grid, I fixed subsample and colsample_bytree to recommended values to speed things up and prevent overfitting.
We will import GridSearchCV from sklearn.model_selection, instantiate it, and fit it to our preprocessed data:

**The step below can take 10-20 minutes.**

In [14]:

# from sklearn.model_selection import GridSearchCV

# # Init classifier
# xgb_cl = xgb.XGBClassifier(objective="binary:logistic")

# # Init Grid Search
# grid_cv = GridSearchCV(xgb_cl, param_grid, n_jobs=-1, cv=3, scoring="roc_auc")

# # Fit
# _ = grid_cv.fit(X_processed, y_processed)
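To see why the search is slow, it helps to count the model fits. The sketch below repeats the grid so that it is self-contained and multiplies the number of values per parameter by the `cv=3` folds used above. The final refit on the best parameters is not counted.

```python
from math import prod

# Same values as param_grid above, repeated so the snippet runs on its own
grid = {
    "max_depth": [3, 4, 5, 7],
    "learning_rate": [0.1, 0.01, 0.05],
    "gamma": [0, 0.25, 1],
    "reg_lambda": [0, 1, 10],
    "scale_pos_weight": [1, 3, 5],
    "subsample": [0.8],
    "colsample_bytree": [0.5],
}

# Every combination of values is one candidate model
n_candidates = prod(len(values) for values in grid.values())

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * 3

print(n_candidates, n_fits)  # 324 972
```

Fixing `subsample` and `colsample_bytree` to a single value each keeps them from multiplying the count further.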

After an excruciatingly long time, we finally got the best params and best score:

**Expect the step above to take about 10-20 minutes.**

In [15]:
#grid_cv.best_score_

In [16]:
#grid_cv.best_params_

In [17]:
#### There is a little more after this point, but I stopped here for the night.