# Introduction

<div class="alert alert-block alert-warning">
<font color=black><br>

**What?** Native XGBoost API vs. Sklearn API 

**Reference:** https://coderzcolumn.com/tutorials/machine-learning/xgboost-an-in-depth-guide-python#6<br>

<br></font>
</div>

# Import modules

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import xgboost as xgb
import sklearn
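# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this notebook needs an older scikit-learn release to run as written.
from sklearn.datasets import load_boston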
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Import dataset

<div class="alert alert-block alert-info">
<font color=black><br>

- Boston Housing Dataset: a regression dataset describing various attributes of houses in the Boston area, together with their median value (in thousands of dollars).
- It is used here as the regression task for comparing the two APIs.

<br></font>
</div>

In [9]:
boston = load_boston()

# Print just the lines from 5 to 29
for line in boston.DESCR.split("\n")[5:29]:
    print(line)

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the

In [10]:
boston_df = pd.DataFrame(data=boston.data, columns = boston.feature_names)
# Add one column with the target
boston_df["Price"] = boston.target
boston_df.head()

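|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |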


In [12]:
# Splitting the dataset
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((455, 13), (51, 13), (455,), (51,))

# Native XGBoost API

<div class="alert alert-block alert-info">
<font color=black><br>

- XGBoost offers two APIs: a native one and a scikit-learn-style one. **Why would you use the native XGBoost API?**
- Although the scikit-learn API of XGBoost is easy to use and fits well in a scikit-learn pipeline, it is sometimes better to use the native API.
- Advantages include:
    - Automatically find the best number of boosting rounds
    - Built-in cross validation
    - Custom objective functions
- The native XGBoost API only accepts data wrapped in a **DMatrix**.
- DMatrix is XGBoost's internal data structure; it holds both the features and the labels.
- It is designed to be efficient and to speed up training.

<br></font>
</div>

In [14]:
dmat_train = xgb.DMatrix(X_train, Y_train, feature_names = boston.feature_names)
dmat_test  = xgb.DMatrix(X_test, Y_test, feature_names = boston.feature_names)
dmat_train, dmat_test

(<xgboost.core.DMatrix at 0x13e4877c0>, <xgboost.core.DMatrix at 0x13e487760>)

In [17]:
booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:squarederror'},
                    dmat_train,
                    evals=[(dmat_train, "train"), (dmat_test, "test")])

[0]	train-rmse:3.94894	test-rmse:3.59159
[1]	train-rmse:3.37195	test-rmse:3.26373
[2]	train-rmse:3.09769	test-rmse:3.12218
[3]	train-rmse:2.78200	test-rmse:2.94107
[4]	train-rmse:2.53499	test-rmse:2.75222
[5]	train-rmse:2.37140	test-rmse:2.78515
[6]	train-rmse:2.23286	test-rmse:2.64519
[7]	train-rmse:2.16047	test-rmse:2.64290
[8]	train-rmse:2.03129	test-rmse:2.58895
[9]	train-rmse:1.96511	test-rmse:2.61442


In [18]:
booster

<xgboost.core.Booster at 0x131496100>

In [19]:
pd.DataFrame({ "Actuals":Y_test[:10], "Prediction":booster.predict(dmat_test)[:10]})

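|   | Actuals | Prediction |
|---|---------|------------|
| 0 | 23.6 | 25.580267 |
| 1 | 32.4 | 31.743393 |
| 2 | 13.6 | 13.508162 |
| 3 | 22.8 | 23.470869 |
| 4 | 16.1 | 13.658171 |
| 5 | 20.0 | 22.350372 |
| 6 | 17.8 | 17.217281 |
| 7 | 14.0 | 14.332675 |
| 8 | 19.6 | 20.501831 |
| 9 | 16.8 | 20.756474 |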


In [26]:
print("Train RMSE : ",booster.eval(dmat_train))
print("Test  RMSE : ",booster.eval(dmat_test))

Train RMSE :  [0]	eval-rmse:1.965108
Test  RMSE :  [0]	eval-rmse:2.614419


In [32]:
print("Test  R2 Score: %.2f" % r2_score(Y_test, booster.predict(dmat_test)))
print("Train R2 Score: %.2f" % r2_score(Y_train, booster.predict(dmat_train)))

Test  R2 Score: 0.89
Train R2 Score: 0.96
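
To make the two metrics concrete, here is a small worked sketch that computes RMSE and R2 from their definitions, using only the ten actual/predicted pairs displayed earlier. Because it covers ten rows rather than the full test set of 51, its results are not expected to match the scores printed above.

```python
import numpy as np

# The ten actual/predicted pairs shown in the comparison table above.
actuals = np.array([23.6, 32.4, 13.6, 22.8, 16.1, 20.0, 17.8, 14.0, 19.6, 16.8])
preds = np.array([25.580267, 31.743393, 13.508162, 23.470869, 13.658171,
                  22.350372, 17.217281, 14.332675, 20.501831, 20.756474])

residuals = actuals - preds

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: one minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actuals - actuals.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("RMSE: %.3f" % rmse)
print("R2  : %.3f" % r2)
```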


# Scikit-learn API

<div class="alert alert-block alert-info">
<font color=black><br>

- XGBoost provides estimators with almost the same API as scikit-learn estimators.
- This helps developers with a scikit-learn background pick up XGBoost faster.
- It also lets us use an XGBoost model with scikit-learn's grid search functionality.

<br></font>
</div>

In [34]:
xgb_regressor = xgb.XGBRegressor()

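# Note: passing eval_metric to fit() has been deprecated since XGBoost 1.6 and is
# rejected by newer releases; there, use xgb.XGBRegressor(eval_metric="mae") instead.
xgb_regressor.fit(X_train, Y_train, eval_set=[(X_test, Y_test)], eval_metric="mae", verbose=10)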

[0]	validation_0-mae:14.61328
[10]	validation_0-mae:1.86316
[20]	validation_0-mae:1.70020
[30]	validation_0-mae:1.62740
[40]	validation_0-mae:1.63325
[50]	validation_0-mae:1.62120
[60]	validation_0-mae:1.61760
[70]	validation_0-mae:1.62004
[80]	validation_0-mae:1.61866
[90]	validation_0-mae:1.62278
[99]	validation_0-mae:1.62320


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [35]:
print("Test  R2 Score : %.2f"%xgb_regressor.score(X_test, Y_test))
print("Train R2 Score : %.2f"%xgb_regressor.score(X_train, Y_train))

Test  R2 Score : 0.93
Train R2 Score : 1.00


# Conclusion

<div class="alert alert-block alert-danger">
<font color=black><br>

- XGBoost offers two APIs; it is up to you to decide which one to use.

<br></font>
</div>