# Extreme Gradient Boosting with XGBoost

- XGBoost is one of the most popular machine learning frameworks among data scientists.
- According to the Kaggle [State of Data Science Survey 2021](https://www.kaggle.com/kaggle-survey-2021), almost 50% of respondents said they used XGBoost, which placed it behind only TensorFlow and scikit-learn.

## Set up the workspace

In [1]:
import pickle
import dill
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb

## Load the data

In [2]:
# Load the pre-pickled ready-to-use data arrays:
with open("./assets/clean_trans_split_arrays.pkl", mode="rb") as file_bin:
    arrs = pickle.load(file_bin)
    
X_train, X_test, y_train, y_test = arrs.values()
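
Unpacking with `arrs.values()` depends on the order in which the arrays were inserted into the pickled dict. A minimal sketch of a more explicit alternative follows, assuming the dict uses the key names `X_train`, `X_test`, `y_train`, and `y_test`. Those names are hypothetical, since the notebook does not show the real keys, and the tiny lists stand in for the actual arrays.

```python
import io
import pickle

# Hypothetical stand-in for the pickled dict; the real key names and data
# are not shown in the notebook.
arrs = {
    "X_train": [[1.0], [2.0]],
    "X_test": [[3.0]],
    "y_train": [10.0, 20.0],
    "y_test": [30.0],
}

# Round-trip through pickle using an in-memory buffer instead of a file.
buffer = io.BytesIO()
pickle.dump(arrs, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

# Look each array up by name so the result does not depend on dict order.
keys = ("X_train", "X_test", "y_train", "y_test")
X_train, X_test, y_train, y_test = (loaded[key] for key in keys)

print(X_train, X_test, y_train, y_test)
```

Looking the arrays up by key also fails loudly with a `KeyError` if the file does not contain what the notebook expects.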

## Create regression matrices

In [3]:
dtrain_reg = xgb.DMatrix(X_train, y_train)
dtest_reg = xgb.DMatrix(X_test, y_test)

## XGBoost Cross-Validation

In [4]:
%%time
# Set params dict and number of boosting rounds:
params = {"objective": "reg:squarederror", "tree_method": "hist"}
n = 1000

# Perform cross-validation:
results = xgb.cv(
    params,
    dtrain_reg,
    num_boost_round=n,
    nfold=5,
    early_stopping_rounds=20
)

CPU times: total: 9.05 s
Wall time: 5.26 s


In [5]:
results.head()

|   | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
|---|-----------------|----------------|----------------|---------------|
| 0 | 172256.235355 | 577.967148 | 172457.275831 | 2598.436413 |
| 1 | 128740.931671 | 387.619438 | 129320.91278 | 2332.304914 |
| 2 | 100177.072093 | 271.279978 | 101183.366792 | 2142.113519 |
| 3 | 81726.725022 | 290.016991 | 83285.255977 | 1767.843292 |
| 4 | 70135.803183 | 325.686084 | 72182.948068 | 1408.304093 |
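
With `nfold=5`, each boosting round is scored on five held-out folds, and the mean and standard deviation columns above summarize those five scores. The sketch below shows only the basic idea of partitioning row indices into folds. It is not XGBoost's implementation: the helper name `kfold_indices` is made up for illustration, and the folds here are contiguous and unshuffled, which is an assumption chosen to keep the example short.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row lands in exactly one test fold, so each observation is used for validation once and for training four times.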


## Make predictions and evaluate the model

In [6]:
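# Train the model.
# Note: with several entries in `evals`, early stopping monitors the last one
# ("train" here). Training RMSE keeps falling, so the stop never triggers in
# the run logged below; list the validation set last to stop on it instead.
evals = [(dtest_reg, "validation"), (dtrain_reg, "train")]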

model = xgb.train(
    params,
    dtrain_reg,
    num_boost_round=n,
    verbose_eval=50,
    evals=evals,
    early_stopping_rounds=100
)

[0]	validation-rmse:174652.56938	train-rmse:172287.51008
[50]	validation-rmse:49398.48671	train-rmse:35251.88040
[100]	validation-rmse:48353.49164	train-rmse:28057.61499
[150]	validation-rmse:48165.12815	train-rmse:23541.82792
[200]	validation-rmse:48095.89689	train-rmse:20145.62204
[250]	validation-rmse:48011.49016	train-rmse:17420.78824
[300]	validation-rmse:48063.54561	train-rmse:15199.31090
[350]	validation-rmse:48128.03629	train-rmse:13402.41046
[400]	validation-rmse:48214.65729	train-rmse:11742.19678
[450]	validation-rmse:48200.43362	train-rmse:10369.65087
[500]	validation-rmse:48225.83824	train-rmse:9170.97598
[550]	validation-rmse:48245.27286	train-rmse:8167.28447
[600]	validation-rmse:48284.36748	train-rmse:7214.76369
[650]	validation-rmse:48320.55914	train-rmse:6387.25938
[700]	validation-rmse:48343.15453	train-rmse:5747.40130
[750]	validation-rmse:48343.52475	train-rmse:5132.44643
[800]	validation-rmse:48347.67699	train-rmse:4577.34768
[850]	validation-rmse:48373.63408	train
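
The log above shows validation RMSE bottoming out around round 250 while training RMSE keeps falling. The sketch below applies a simple patience rule to the validation values logged for rounds 0 to 800. It is only an approximation: the log records every 50th round, so the true best round may fall between checkpoints, and the helper `first_stop` is a hypothetical illustration rather than XGBoost's own early-stopping code.

```python
# Validation RMSE copied from the training log above (every 50th round).
logged = [
    (0, 174652.56938), (50, 49398.48671), (100, 48353.49164),
    (150, 48165.12815), (200, 48095.89689), (250, 48011.49016),
    (300, 48063.54561), (350, 48128.03629), (400, 48214.65729),
    (450, 48200.43362), (500, 48225.83824), (550, 48245.27286),
    (600, 48284.36748), (650, 48320.55914), (700, 48343.15453),
    (750, 48343.52475), (800, 48347.67699),
]


def first_stop(history, patience):
    """Return (best_round, stop_round); stop_round is None if never reached."""
    best_round, best_score = history[0]
    for rnd, score in history[1:]:
        if score < best_score:
            # New best score: remember it and keep going.
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, rnd
    return best_round, None


best_round, stop_round = first_stop(logged, patience=100)
print(f"best logged round: {best_round}, stop at round: {stop_round}")
```

Judged on these checkpoints alone, a patience of 100 rounds on the validation score would have ended training near round 350 instead of running on past round 850.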

In [7]:
# Make predictions on the test set:
y_preds = model.predict(dtest_reg)

In [8]:
# Evaluate the model:
# RMSE is the square root of the MSE; np.sqrt avoids the `squared` argument,
# which recent scikit-learn releases deprecate.
rmse = np.sqrt(mean_squared_error(y_test, y_preds))
print(f"RMSE of the base model: {rmse:.3f}")

r2 = r2_score(y_test, y_preds)
print(f"R-squared of the base model: {r2:.2f}")

RMSE of the base model: 48391.618
R-squared of the base model: 0.83
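
For reference, both metrics can be computed by hand. The sketch below uses four made-up target and prediction values (not taken from the notebook's data) and the standard definitions: RMSE is the square root of the mean squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values chosen only for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(f"RMSE: {rmse(y_true, y_pred):.3f}")
print(f"R-squared: {r_squared(y_true, y_pred):.2f}")
```

RMSE is expressed in the same units as the target, which is why it is easier to interpret alongside R-squared than the raw MSE.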
