# Loading and Exploring the Data

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import xgboost as xgb

warnings.filterwarnings('ignore')
diamonds = sns.load_dataset('diamonds')
diamonds.head()

# with real-world datasets, explore, clean, and visualize the data before modeling
# here, summarize only the categorical features of seaborn's built-in diamonds dataset
diamonds.describe(exclude = np.number)

          cut  color clarity
count   53940  53940   53940
unique      5      7       8
top     Ideal      G     SI1
freq    21551  11292   13065
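
For non-numeric columns, `describe` reports `count`, `unique`, `top`, and `freq` instead of quartiles. The following is a minimal pure-Python sketch of what those four rows mean. The list `toy_cuts` and the helper `summarize_categorical` are made up for illustration and are not part of pandas or the dataset.

```python
from collections import Counter

def summarize_categorical(values):
    """Return count, unique, top, and freq for a list of labels.

    Assumes there are no missing values; pandas would skip them.
    """
    counts = Counter(values)
    top, freq = counts.most_common(1)[0]  # most frequent label and its count
    return {
        "count": len(values),   # number of observations
        "unique": len(counts),  # number of distinct labels
        "top": top,
        "freq": freq,
    }

# hypothetical toy data, not the real diamonds column
toy_cuts = ["Ideal", "Premium", "Ideal", "Good", "Ideal", "Premium"]
summary = summarize_categorical(toy_cuts)
print(summary)
```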


# How to Build an XGBoost DMatrix

In [2]:
from sklearn.model_selection import train_test_split

# goal: predict diamond prices using their physical measurements, so target will be the price column
# candidate features are isolated into X and target labels into y

# extract feature and target arrays
X, y = diamonds.drop('price', axis=1), diamonds[['price']]

# this dataset has three categorical columns; normally they would need ordinal or one-hot encoding
# XGBoost can handle categoricals internally once they are cast to the pandas "category" dtype

# extract text features
cats = X.select_dtypes(exclude = np.number).columns.tolist()

# convert to pandas category
for col in cats:
    X[col] = X[col].astype('category')

# should get three category features when printing dtypes attribute:
print(X.dtypes)

# split the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

# print the XGBoost version and build configuration (xgb was imported in the first cell)
print(xgb.__version__)
build_info = xgb.build_info()
for name in sorted(build_info.keys()):
    print(f'{name}: {build_info[name]}')

# create regression matrices
dtrain_reg = xgb.DMatrix(X_train, y_train, enable_categorical = True)
dtest_reg = xgb.DMatrix(X_test, y_test, enable_categorical = True)

carat       float64
cut        category
color      category
clarity    category
depth       float64
table       float64
x           float64
y           float64
z           float64
dtype: object
3.0.0
BUILTIN_PREFETCH_PRESENT: True
CLANG_VERSION: [15, 0, 0]
DEBUG: False
MM_PREFETCH_PRESENT: False
USE_CUDA: False
USE_DLOPEN_NCCL: False
USE_FEDERATED: False
USE_NCCL: False
USE_OPENMP: True
USE_RMM: False
libxgboost: /opt/anaconda3/lib/python3.12/site-packages/xgboost/lib/libxgboost.dylib


# Python XGBoost Regression

**After building the DMatrices, choose a value for the `objective` parameter. It tells XGBoost which machine learning problem is being solved and which loss function and default evaluation metric to use for that problem.**

## Training

Specify the objective function and any other XGBoost hyperparameters in a dictionary, conventionally named `params`.

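Inside these initial `params`, `tree_method` selects the tree construction algorithm. Older XGBoost releases enabled GPU acceleration by setting it to `gpu_hist`; in XGBoost 2.0 and later that value is deprecated in favor of `tree_method` set to `hist` combined with `device` set to `cuda`. Without a GPU, omit the parameter or set it to `hist`.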

Next, set `num_boost_round`, the number of boosting rounds. Internally, XGBoost minimizes the loss function (squared error here, reported as RMSE) in small incremental rounds; this parameter specifies how many of those rounds to run.

The ideal number of rounds is usually found through hyperparameter tuning.

In [12]:
# define hyperparameters
params = {'objective': 'reg:squarederror', 
          'tree_method': 'hist'} # set tree_method to hist because no GPU

n = 100 
model = xgb.train(
    params = params,
    dtrain = dtrain_reg, 
    num_boost_round = n,
)
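
To make the idea of "small incremental rounds" concrete, here is a heavily simplified pure-Python sketch of boosting with squared error: each round fits a one-split stump to the current residuals and adds a scaled copy of it to the predictions. This is an illustration only, not XGBoost's algorithm, which also uses second-order gradients, regularization, and deeper trees. The data, the learning rate, and the helper names `fit_stump` and `rmse` are all assumptions made for the example.

```python
import math

def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# hypothetical one-feature toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10.0, 12.0, 15.0, 30.0, 32.0, 35.0, 60.0, 62.0]
learning_rate = 0.3   # illustrative shrinkage factor
num_boost_round = 20

preds = [0.0] * len(xs)
history = [rmse(ys, preds)]  # RMSE before any boosting round
for _ in range(num_boost_round):
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(rmse(ys, preds))

print(f"RMSE before boosting: {history[0]:.2f}")
print(f"RMSE after {num_boost_round} rounds: {history[-1]:.2f}")
```

Each round can only lower (or keep) the training RMSE in this sketch, which is why more rounds keep improving the fit on the training data.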

## Evaluation

1. Use `dtest_reg` DMatrix to measure the model's performance on unseen data.
2. Once predictions are generated with `predict`, pass them to scikit-learn's `mean_squared_error` function to compare against `y_test` and report the error as RMSE.

In [13]:
from sklearn.metrics import mean_squared_error

preds = model.predict(dtest_reg)

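# compare results against y_test
# np.sqrt of the MSE gives the RMSE; the `squared=False` argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"RMSE of the base model: {rmse:.3f}")

RMSE of the base model: 552.861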
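
For reference, RMSE is the square root of the mean of the squared differences between targets and predictions. A minimal sketch of that calculation in plain Python follows. The prices and predictions are made-up numbers, and `root_mean_squared_error` here is a local helper defined for the example.

```python
import math

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# hypothetical prices and predictions, in the same units as the target
toy_prices = [500.0, 1500.0, 3000.0]
toy_preds = [450.0, 1600.0, 2900.0]

toy_rmse = root_mean_squared_error(toy_prices, toy_preds)
print(f"Toy RMSE: {toy_rmse:.3f}")
```

Because the errors are squared before averaging, RMSE is expressed in the target's own units (here, price) and penalizes large misses more heavily than small ones.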


## Using Validation Sets During Training

Use evaluation arrays to track model performance in real time as it improves incrementally across boosting rounds.

1. Set up the parameters again.
2. Create a list of two tuples, each containing a DMatrix for the model to evaluate and a name for it.
3. Pass this list to the `evals` parameter of `xgb.train` to see model performance after each boosting round.

**Note:** with a large number of boosting rounds, set the `verbose_eval` parameter to an integer to print the output only every `verbose_eval` rounds.

In [15]:
# 1: set up params
params = {"objective": "reg:squarederror",
          "tree_method": "hist"}
n = 100

# 2: set array and array names
evals = [(dtrain_reg, "train"), (dtest_reg, "validation")]

# 3: track model performance
model = xgb.train(
    params = params,
    dtrain = dtrain_reg, 
    num_boost_round = n,
    evals = evals,
    verbose_eval = 10 
)


[0]	train-rmse:2874.49146	validation-rmse:2817.90814
[10]	train-rmse:548.36512	validation-rmse:592.03160
[20]	train-rmse:491.09887	validation-rmse:558.53485
[30]	train-rmse:469.58201	validation-rmse:555.51015
[40]	train-rmse:454.32953	validation-rmse:554.45666
[50]	train-rmse:438.68033	validation-rmse:554.13365
[60]	train-rmse:425.38361	validation-rmse:551.57888
[70]	train-rmse:414.71115	validation-rmse:549.26109
[80]	train-rmse:405.41008	validation-rmse:549.03952
[90]	train-rmse:391.04269	validation-rmse:551.87206
[99]	train-rmse:383.48826	validation-rmse:552.86131
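
The log above reports rounds 0, 10, 20, and so on, plus the final round 99. The sketch below reproduces that reporting pattern in plain Python. The helper `logged_rounds` is hypothetical and only mirrors what the log shows; it is not XGBoost's implementation.

```python
def logged_rounds(num_boost_round, verbose_eval):
    """Rounds that get printed: every `verbose_eval`-th round plus the last one."""
    last = num_boost_round - 1
    return [i for i in range(num_boost_round) if i % verbose_eval == 0 or i == last]

rounds = logged_rounds(100, 10)
print(rounds)
```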


# XGBoost Early Stopping

Goal: reach the **golden middle**, where the model has trained long enough to perform well on the validation set but not so long that it overfits the training data. **Early stopping** forces training to stop once the validation loss has failed to improve for `early_stopping_rounds` consecutive rounds.

In [16]:
n = 10000

model = xgb.train(
    params = params,
    dtrain = dtrain_reg, 
    num_boost_round = n,
    evals = evals,
    verbose_eval = 50, 
    early_stopping_rounds = 50
)

[0]	train-rmse:2874.49146	validation-rmse:2817.90814
[50]	train-rmse:438.68033	validation-rmse:554.13365
[100]	train-rmse:381.96310	validation-rmse:553.73941
[128]	train-rmse:358.11000	validation-rmse:553.05030
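
The stopping rule itself is simple to sketch: remember the best validation loss seen so far and stop once a set number of consecutive rounds fail to beat it. The helper `early_stop` and the loss values below are assumptions made for illustration. With `early_stopping_rounds = 50`, a stop at round 128 as in the log above is consistent with the best validation score occurring around round 78.

```python
def early_stop(validation_losses, patience):
    """Return (best_round, stopped_round) for per-round validation losses.

    Training stops after `patience` consecutive rounds without improvement.
    """
    best_round, best_loss = 0, float("inf")
    stalled = 0  # consecutive rounds without improvement
    for current_round, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_round, best_loss = current_round, loss
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                return best_round, current_round
    # ran out of rounds before the patience was exhausted
    return best_round, len(validation_losses) - 1

# hypothetical validation losses; the late drop to 2.0 is never reached
toy_losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.3, 2.0]
best_round, stopped_round = early_stop(toy_losses, patience=3)
print(f"best round: {best_round}, stopped at round: {stopped_round}")
```

The toy sequence also shows the trade-off in choosing the patience: a small value can stop before a later improvement, while a large value spends more rounds before giving up. When early stopping is used, the booster returned by `xgb.train` records the best round in its `best_iteration` attribute.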


# XGBoost Cross-Validation

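**k-fold cross-validation:** set aside a test set for the final performance evaluation of each model. Split the training data into $k$ folds, then train $k$ times, each time using $k - 1$ folds for training and the remaining fold for validation, so that every fold serves as the validation set exactly once.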

In [18]:
params = {"objective": "reg:squarederror",
          "tree_method": "hist"}
n = 1000

results = xgb.cv(
    params, dtrain_reg,
    num_boost_round = n,
    nfold = 5, 
    early_stopping_rounds = 20
)

results.head()

# take the minimum of the test-rmse-mean column
best_rmse = results['test-rmse-mean'].min()

best_rmse

550.2735543625861
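
The fold bookkeeping behind k-fold cross-validation can be sketched in a few lines of plain Python. The generator `k_fold_indices` is a hypothetical helper written for this example. It uses contiguous folds for readability, whereas real implementations typically shuffle the rows first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 5))
for fold_number, (train, validation) in enumerate(folds):
    print(f"fold {fold_number}: train={train} validation={validation}")
```

Every sample lands in exactly one validation fold, so the averaged validation score uses each training row once for evaluation.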

# XGBoost Classification

The two most popular classification objectives:
1. `binary:logistic`: binary classification (the target contains only two classes)
2. `multi:softprob`: multi-class classification (more than two classes in the target)
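
With `multi:softprob`, the model outputs one probability per class for each sample, and the predicted label is the class with the highest probability. The sketch below shows that step for a single sample in plain Python. The raw scores are made up, and `softmax` is a local helper defined for the example.

```python
import math

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical raw scores for one sample across 5 classes
raw_scores = [0.2, 1.5, -0.3, 2.1, 0.0]
probabilities = softmax(raw_scores)

# the predicted label is the index of the largest probability
predicted_class = probabilities.index(max(probabilities))
print([round(p, 3) for p in probabilities], predicted_class)
```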

In [22]:
from sklearn.preprocessing import OrdinalEncoder

X, y = diamonds.drop("cut", axis=1), diamonds[["cut"]]

# encode y to numeric
y_encoded = OrdinalEncoder().fit_transform(y)

# extract text features
cats = X.select_dtypes(exclude = np.number).columns.tolist()

# convert to pd.Categorical
for col in cats:
    X[col] = X[col].astype('category')

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, random_state=1, stratify=y_encoded)

# create classification matrices
dtrain_clf = xgb.DMatrix(X_train, y_train, enable_categorical = True)
dtest_clf = xgb.DMatrix(X_test, y_test, enable_categorical = True)
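
The encoding step maps each class label to an integer code from 0 to the number of classes minus 1, which is the label format the multi-class objectives expect. Below is a minimal pure-Python sketch, assuming the encoder's default behavior of ordering categories by their sorted unique values. The helper `ordinal_encode` and the list `toy_cuts` are made up for illustration.

```python
def ordinal_encode(labels):
    """Map each distinct label to a code based on sorted order."""
    categories = sorted(set(labels))
    mapping = {category: code for code, category in enumerate(categories)}
    return [float(mapping[label]) for label in labels], mapping

# hypothetical sample of cut grades
toy_cuts = ["Ideal", "Premium", "Good", "Very Good", "Fair", "Ideal"]
codes, mapping = ordinal_encode(toy_cuts)
print(mapping)
print(codes)
```

Under this assumption the codes follow alphabetical order rather than cut quality. That is harmless for a classification target, where the codes only act as class identifiers.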

In [23]:
# set params
params = {'objective': 'multi:softprob',
          'tree_method': 'hist',
          'num_class': 5}
n = 1000

# train the model using 5-fold cv
results = xgb.cv(
    params, dtrain_clf,
    num_boost_round = n, 
    nfold = 5,
    metrics = ['mlogloss', 'auc', 'merror']
)

In [27]:
# three classification metrics were used. results:
results.keys()

# see the best AUC score (take the maximum of test-auc-mean column)
results['test-auc-mean'].max()

0.9403143599245043