<img width="10%" alt="Naas" src="https://landen.imgix.net/jtci2pxwjczr/assets/5ice39g4.png?w=160"/>

# XGBoost - A binary classification example with hyper-parameter optimization
<a href="https://app.naas.ai/user-redirect/naas/downloader?url=https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/template.ipynb" target="_parent">
<img src="https://img.shields.io/badge/-Open%20in%20Naas-success?labelColor=000000&logo="/>
</a>

**Tags:** #xgboost #snippet #classification #tabular #cross-validation #optimization #modeling

**Author:** [Oussama El Bahaoui](https://www.linkedin.com/in/oelbahaoui/)

[XGBoost often outperforms deep learning models](https://arxiv.org/pdf/2106.03253.pdf) on tabular data.

This notebook is a recipe for training an XGBoost binary classification model, including hyper-parameter optimization with a grid search.

## Input

### Install required packages

In [None]:
%pip install xgboost

### Import packages

In [64]:
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

from xgboost import XGBClassifier

### Variables

Define the parameter grid that will be explored during hyper-parameter optimization.

In [79]:
# Random seed.
SEED = 42

# A parameters grid for XGBoost classifier.
PARAMS = {
    'objective': ['binary:logistic'],
    'nthread': [-1],
    'random_state': [SEED],
    
    'min_child_weight': [1, 5, 10],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.02, 0.3],
    'n_estimators': [100, 200, 500],
}

### Read the dataset

In [66]:
# Load a toy dataset for binary classification task.

data = load_breast_cancer(as_frame=True)
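Before modeling, it can help to check the size and class balance of the dataset. The following is a minimal sketch; the variable names `n_samples`, `n_features` and `class_counts` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer

# Load the same toy dataset as above, as pandas objects.
data = load_breast_cancer(as_frame=True)
X, y = data["data"], data["target"]

# Number of rows (samples) and columns (features).
n_samples, n_features = X.shape

# Number of samples per target value (0 and 1).
class_counts = y.value_counts().to_dict()

print("Samples:", n_samples, "Features:", n_features)
print("Class names (for target 0 and 1):", list(data["target_names"]))
print("Samples per class:", class_counts)
```

The classes are moderately imbalanced, which is worth keeping in mind when reading the accuracy score later on.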

## Model

### Create input features and labels

In [67]:
X = data["data"]
y = data["target"]

### Split the dataset into train and test sets

In [80]:
# Use 70% of the data for training and 30% for testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)
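The split above is purely random. As a possible variant (not what this notebook uses), passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both subsets. A minimal sketch, with illustrative variable names:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

SEED = 42

data = load_breast_cancer(as_frame=True)
X, y = data["data"], data["target"]

# Same 70/30 split, but stratified on the label.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=SEED, stratify=y
)

print("Train size:", len(X_tr), "Test size:", len(X_te))

# Share of class 1 in each subset; the two values should be close.
print("Class 1 share (train):", round(y_tr.mean(), 3))
print("Class 1 share (test):", round(y_te.mean(), 3))
```

Note that switching to a stratified split changes which rows end up in the test set, so the accuracy reported later in this notebook would not be reproduced exactly.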

### Create a model

In [82]:
# Create an XGBoost classifier model.

model = XGBClassifier()

### Optimize the model's hyper-parameters using grid search

A good practice is to evaluate each hyper-parameter combination with [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html).

3 or 5 folds are commonly recommended values, but to make the notebook run faster, we use 2 here.
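To illustrate what a fold is, the sketch below splits a small made-up label vector into 2 stratified folds. When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified k-fold splitting, so `StratifiedKFold` is used here; the toy arrays `X_toy` and `y_toy` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 6 samples of class 0 and 4 samples of class 1.
y_toy = np.array([0] * 6 + [1] * 4)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=2)

folds = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    # Each sample is used for validation in exactly one fold.
    folds.append(
        {
            "train_size": len(train_idx),
            "validation_size": len(val_idx),
            "validation_positives": int(y_toy[val_idx].sum()),
        }
    )

for i, fold in enumerate(folds):
    print(f"Fold {i}:", fold)
```

Each fold serves once as validation data while the model is trained on the remaining samples, and the scores are averaged across folds.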

In [83]:
# Perform a grid search on the hyper-parameters defined in the PARAMS dict.

grid_search = GridSearchCV(
    estimator=model,
    param_grid=PARAMS,
    scoring = 'accuracy',
    n_jobs = -1,
    cv = 2,
    verbose=True
).fit(X_train, y_train)

Fitting 2 folds for each of 81 candidates, totalling 162 fits


Note that the training above takes about 2 minutes to finish. Increasing the number of CV folds will also increase the execution time.
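The figures in the log above follow from the size of the grid: the number of candidates is the product of the number of values per parameter, and each candidate is fitted once per fold. The sketch below reproduces that arithmetic with the standard library only; `CV_FOLDS`, `candidates`, `n_candidates` and `n_fits` are illustrative names.

```python
from itertools import product

SEED = 42
CV_FOLDS = 2

# Same grid as in the Variables section.
PARAMS = {
    'objective': ['binary:logistic'],
    'nthread': [-1],
    'random_state': [SEED],
    'min_child_weight': [1, 5, 10],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.02, 0.3],
    'n_estimators': [100, 200, 500],
}

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(PARAMS, values)) for values in product(*PARAMS.values())]

n_candidates = len(candidates)
n_fits = n_candidates * CV_FOLDS

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_fits)
```

Adding one more value to any of the four tuned parameters, or one more fold, multiplies the work accordingly. With the default `refit=True`, `GridSearchCV` also performs one extra fit of the best candidate on the full training set, which is not counted in the log line.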

## Output

### Accuracy on test data

In [84]:
# Evaluate on the test data.

y_pred = grid_search.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9707602339181286
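To put the score in concrete terms, the sketch below converts the reported accuracy back into a number of test samples. It assumes the 569-sample breast cancer dataset and that scikit-learn rounds a fractional test size up to the next whole sample; the variable names are illustrative.

```python
import math

n_samples = 569                      # size of the breast cancer dataset
test_size = 0.3                      # fraction used in train_test_split
reported_accuracy = 0.9707602339181286

# Number of samples in the test set (fractional size rounded up).
n_test = math.ceil(test_size * n_samples)

# Accuracy is correct predictions divided by test samples.
n_correct = round(reported_accuracy * n_test)
n_errors = n_test - n_correct

print("Test samples:", n_test)
print("Correct predictions:", n_correct)
print("Misclassified samples:", n_errors)
```

With a test set of this size, a single additional error shifts the accuracy by roughly 0.6 percentage points, so small differences between runs should be read with caution.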


### Best parameters

In [85]:
grid_search.best_params_

{'learning_rate': 0.02,
 'max_depth': 4,
 'min_child_weight': 1,
 'n_estimators': 500,
 'nthread': -1,
 'objective': 'binary:logistic',
 'random_state': 42}

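### Saving the model

In [87]:
# Recreate the model using the best hyper-parameters.
# (With the default refit=True, grid_search.best_estimator_ already holds
# a model refit on the training set with these parameters.)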

best_model = XGBClassifier(**grid_search.best_params_).fit(X_train, y_train)

In [89]:
# Save the model as a JSON file.

best_model.save_model("best_model.json")

### (Optional) Load the model

In [96]:
from xgboost import XGBClassifier

trained_model = XGBClassifier()
trained_model.load_model("best_model.json")