# XGBoost

In [34]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

## Intro to XGBoost

## XGBoost Library

XGBoost is an industry-proven, open-source library that provides a gradient boosting framework designed to scale quickly and efficiently to billions of data points.

Docs: https://xgboost.readthedocs.io/en/stable/index.html

**XGBoost** is an optimized distributed gradient boosting library designed to be highly **efficient, flexible and portable**. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
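
To make the term "gradient boosting" concrete, here is a minimal sketch of the idea for squared-error regression: each new tree is fitted to the residuals of the current ensemble. This is a toy illustration, not XGBoost's actual algorithm (which adds regularization, second-order information and histogram-based splits). The helper names `fit_toy_boosting` and `predict_toy_boosting` and the synthetic data are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_toy_boosting(X, y, n_rounds=50, learning_rate=0.3, max_depth=2):
    """Toy gradient boosting for squared error: every tree fits the current residuals."""
    base = float(np.mean(y))                 # start from a constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                 # negative gradient of 0.5 * (y - pred)**2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred += learning_rate * tree.predict(X)   # shrink each tree's contribution
        trees.append(tree)
    return {"base": base, "trees": trees, "learning_rate": learning_rate}


def predict_toy_boosting(model, X):
    """Sum the constant start value and all shrunken tree predictions."""
    pred = np.full(X.shape[0], model["base"])
    for tree in model["trees"]:
        pred += model["learning_rate"] * tree.predict(X)
    return pred


# Synthetic data (an assumption, for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=500)

toy_model = fit_toy_boosting(X_toy, y_toy)
toy_pred = predict_toy_boosting(toy_model, X_toy)

baseline_rmse = float(np.sqrt(np.mean((y_toy - y_toy.mean()) ** 2)))
boosted_rmse = float(np.sqrt(np.mean((y_toy - toy_pred) ** 2)))
print(f"constant-prediction RMSE: {baseline_rmse:.3f}")
print(f"boosted RMSE (training data): {boosted_rmse:.3f}")
```

The learning rate shrinks each tree's contribution, which is the same role the `eta` / `learning_rate` parameter plays in XGBoost.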

Installation: https://xgboost.readthedocs.io/en/stable/install.html#python
- `pip install xgboost`

In [35]:
import xgboost

print(xgboost.__version__)

1.7.4


## XGBoost native API
- XGBoost can be used directly through its native API

In [36]:
diamonds = sns.load_dataset("diamonds")
diamonds.head()

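| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |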


In [37]:
diamonds.shape

(53940, 10)

In [38]:
diamonds.describe()

| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


In [39]:
diamonds.describe(exclude=np.number)

| | cut | color | clarity |
|---|---|---|---|
| count | 53940 | 53940 | 53940 |
| unique | 5 | 7 | 8 |
| top | Ideal | G | SI1 |
| freq | 21551 | 11292 | 13065 |


In [40]:
from sklearn.model_selection import train_test_split

# Extract feature and target arrays
X, y = diamonds.drop('price', axis=1), diamonds[['price']]

# Extract text features
cats = X.select_dtypes(exclude=np.number).columns.tolist()

# Convert to pandas category dtype.
# No explicit one-hot encoding is needed; we only declare the columns as categorical.
for col in cats:
    X[col] = X[col].astype('category')
    
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [41]:
import xgboost as xgb

# Create regression matrices. DMatrix is XGBoost's highly optimized data container (its role is similar to a pandas DataFrame)
dtrain_reg = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_reg = xgb.DMatrix(X_test, y_test, enable_categorical=True)

In [66]:
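# Define hyperparameters
params = {"objective": "reg:squarederror", "tree_method": "hist"}
# objective is the loss function; it also tells XGBoost whether the task is
# classification or regression.
# reg:squarederror -> regression with squared-error loss

# tree_method selects the tree construction algorithm, which also determines the hardware used:
# hist     -> histogram-based algorithm on the CPU
# gpu_hist -> histogram-based algorithm on the GPU (XGBoost 1.7; from 2.0 the GPU is selected with the device parameter)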

In [43]:
# xgb.train plays the same role as .fit in scikit-learn
model = xgb.train(params=params, dtrain=dtrain_reg, num_boost_round=100)

In [44]:
from sklearn.metrics import mean_squared_error
# The native API can be partly combined with scikit-learn, e.g. with its metrics
preds = model.predict(dtest_reg)

In [45]:
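# Take the square root of the MSE; the squared=False shortcut is not available in every scikit-learn version
rmse = np.sqrt(mean_squared_error(y_test, preds))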
print(f"RMSE of the base model: {rmse:.3f}")

RMSE of the base model: 545.388
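
For reference, RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as the target (here, price). A minimal sketch with made-up numbers:

```python
import numpy as np

# Made-up values, for illustration only
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

errors = y_true - y_pred              # [1, 0, -2]
mse = float(np.mean(errors ** 2))     # (1 + 0 + 4) / 3
rmse_manual = float(np.sqrt(mse))

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse_manual:.4f}")
```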


### Using Validation Sets During Training
- shows how the model improves over the boosting rounds

In [46]:
params = {"objective": "reg:squarederror", "tree_method": "hist"}

In [67]:
evals = [(dtest_reg, "validation"), (dtrain_reg, "train")]

model = xgb.train(params=params, dtrain=dtrain_reg, num_boost_round=100, evals=evals, verbose_eval=10)  # Print every ten rounds

[0]	validation-rmse:3930.87087	train-rmse:3985.31595
[10]	validation-rmse:591.03042	train-rmse:557.19710
[20]	validation-rmse:550.76666	train-rmse:495.31647
[30]	validation-rmse:547.16647	train-rmse:467.13670
[40]	validation-rmse:544.10422	train-rmse:447.26879
[50]	validation-rmse:543.97371	train-rmse:432.51681
[60]	validation-rmse:544.77874	train-rmse:420.72943
[70]	validation-rmse:544.77491	train-rmse:408.72053
[80]	validation-rmse:544.33808	train-rmse:395.88816
[90]	validation-rmse:545.99682	train-rmse:383.62262
[99]	validation-rmse:545.38842	train-rmse:378.37454


### XGBoost Early Stopping

In [49]:
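# No early stopping yet: with 5000 rounds the training RMSE keeps falling while the validation RMSE gets worse (overfitting)
model = xgb.train(params=params, dtrain=dtrain_reg, num_boost_round=5000, evals=evals, verbose_eval=500)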

[0]	validation-rmse:3930.87087	train-rmse:3985.31595
[500]	validation-rmse:567.09440	train-rmse:202.29270
[1000]	validation-rmse:573.92496	train-rmse:124.11251
[1500]	validation-rmse:577.05346	train-rmse:86.53817
[2000]	validation-rmse:579.29924	train-rmse:64.88654
[2500]	validation-rmse:580.64166	train-rmse:49.60797
[3000]	validation-rmse:581.33497	train-rmse:39.07501
[3500]	validation-rmse:581.94197	train-rmse:31.53707
[4000]	validation-rmse:582.09487	train-rmse:26.22932
[4500]	validation-rmse:582.35214	train-rmse:22.08176
[4999]	validation-rmse:582.41460	train-rmse:19.09406


In [68]:
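# If there is more than one metric in the eval_metric parameter given in params,
# the last metric will be used for early stopping.
# Likewise, the last dataset in evals is the one that is monitored, so the validation set goes last here.
evals = [(dtrain_reg, "train"), (dtest_reg, "validation")]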

model = xgb.train(
   params=params,
   dtrain=dtrain_reg,
   num_boost_round=1000,
   evals=evals,
   verbose_eval=50,
   # Activate early stopping
   early_stopping_rounds=50  # stop if the validation metric has not improved for 50 consecutive rounds
)

[0]	train-rmse:3985.31595	validation-rmse:3930.87087
[50]	train-rmse:432.51681	validation-rmse:543.97371
[88]	train-rmse:385.79451	validation-rmse:545.25177


### XGBoost Cross-Validation

In [51]:
params = {"objective": "reg:squarederror", "tree_method": "hist"}
n = 1000

# Similar to the cross-validation helpers in scikit-learn (e.g. cross_val_score)
results = xgb.cv(params, dtrain_reg, num_boost_round=n, nfold=5, early_stopping_rounds=20)

In [52]:
results.head()

| | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
|---|---|---|---|---|
| 0 | 3985.648654 | 10.343596 | 3986.913623 | 41.642778 |
| 1 | 2848.365726 | 8.014086 | 2851.020437 | 28.028733 |
| 2 | 2063.401458 | 4.637773 | 2068.629977 | 19.969459 |
| 3 | 1521.493751 | 3.874078 | 1530.496272 | 13.59233 |
| 4 | 1156.827103 | 2.991735 | 1170.413316 | 11.695597 |


In [53]:
best_rmse = results['test-rmse-mean'].min()
best_rmse

550.7196748119261

### XGBoost Classification

In [54]:
from sklearn.preprocessing import OrdinalEncoder

X, y = diamonds.drop("cut", axis=1), diamonds[['cut']]

# Encode y to numeric
y_encoded = OrdinalEncoder().fit_transform(y)

# Extract text features
cats = X.select_dtypes(exclude=np.number).columns.tolist()

# Convert to pd.Categorical
for col in cats:
    X[col] = X[col].astype('category')

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, random_state=1, stratify=y_encoded)

In [55]:
y["cut"].unique()

['Ideal', 'Premium', 'Good', 'Very Good', 'Fair']
Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
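
To interpret the numeric class ids later, it helps to keep the fitted encoder around. The sketch below uses a small array of plain strings (an assumption for illustration; the notebook itself encodes the `cut` column of the diamonds frame) to show how `OrdinalEncoder` assigns ids and how `inverse_transform` maps them back to names.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# A few cut labels as plain strings, shaped (n_samples, 1) as the encoder expects
cuts = np.array(["Ideal", "Premium", "Good", "Very Good", "Fair", "Ideal"]).reshape(-1, 1)

encoder = OrdinalEncoder()
codes = encoder.fit_transform(cuts)          # float array of shape (6, 1)

# For plain strings the categories are sorted alphabetically; the position is the class id
print(list(encoder.categories_[0]))
print(codes.ravel())

# Map predicted class ids back to the original names
decoded = encoder.inverse_transform(codes)
print(decoded.ravel())
```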

In [56]:
# Create classification matrices
dtrain_clf = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_clf = xgb.DMatrix(X_test, y_test, enable_categorical=True)

In [57]:
params = {"objective": "multi:softprob", "tree_method": "hist", "num_class": 5}
# "multi:softprob" is a multiclass classification objective; it requires the number of classes to be given in "num_class"

results = xgb.cv(
   params, dtrain_clf,
   num_boost_round=100,
   nfold=5,
   metrics=["mlogloss", "auc", "merror"],
   early_stopping_rounds=20
)

In [58]:
results.head()

| | train-mlogloss-mean | train-mlogloss-std | train-auc-mean | train-auc-std | train-merror-mean | train-merror-std | test-mlogloss-mean | test-mlogloss-std | test-auc-mean | test-auc-std | test-merror-mean | test-merror-std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.25734 | 0.000793 | 0.892373 | 0.00052 | 0.255772 | 0.000619 | 1.260954 | 0.001528 | 0.88706 | 0.001848 | 0.260042 | 0.002096 |
| 1 | 1.073036 | 0.000955 | 0.897153 | 0.000239 | 0.253918 | 0.00065 | 1.079572 | 0.002348 | 0.891436 | 0.001965 | 0.258139 | 0.00289 |
| 2 | 0.954985 | 0.001309 | 0.900132 | 0.00081 | 0.251533 | 0.002001 | 0.96439 | 0.003143 | 0.894313 | 0.001703 | 0.255617 | 0.001872 |
| 3 | 0.874229 | 0.001387 | 0.902673 | 0.000454 | 0.249951 | 0.00196 | 0.886645 | 0.003445 | 0.896172 | 0.002041 | 0.255172 | 0.002239 |
| 4 | 0.815914 | 0.002027 | 0.905565 | 0.001047 | 0.248999 | 0.002175 | 0.831286 | 0.003392 | 0.898215 | 0.001337 | 0.254283 | 0.002097 |


In [59]:
results['test-auc-mean'].max()

0.9387768205705852

## XGBoost Sklearn
- XGBoost can be used inside the familiar scikit-learn workflow (e.g. pipelines)
- more familiar syntax (no need to work with DMatrix objects, etc.)
- less flexibility than the native XGBoost API

In [60]:
from sklearn import datasets

X,y = datasets.load_diabetes(return_X_y=True)

In [61]:
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

In [62]:
scores = cross_val_score(XGBRegressor(objective='reg:squarederror'), X, y, scoring='neg_mean_squared_error')

In [63]:
# Scores are negative MSE values; flip the sign and take the square root to get the RMSE per fold
(-scores)**0.5

array([62.80101886, 65.78389959, 62.21211593, 66.40836809, 67.3001013 ])

In [64]:
from sklearn import datasets

X,y = datasets.load_breast_cancer(return_X_y=True)

In [65]:
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

cross_val_score(XGBClassifier(), X, y).mean()

0.9771619313771154