### XGBoost Tutorial

An overview of XGBoost: gradient-boosted tree models for high-dimensional regression and classification

Reasons for implementation
- Popular across multiple languages
- Supports distributed training with both Apache Spark and PySpark
- Support for model inference across a variety of data types (arrays, dataframes)

In [2]:
import xgboost as xgb
import numpy as np

### Splitting Data

Given our set of observations, let's generate a train-test split using hold-out validation. One can use the train_test_split function in scikit-learn to do this. For this walkthrough we simply generate small random arrays instead.
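As a minimal sketch of a hold-out split, the snippet below uses scikit-learn's `train_test_split`. The array sizes, `test_size=0.2`, `random_state=0`, and the variable names are illustrative assumptions, not values used elsewhere in this tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 10))          # 100 samples, 10 features (assumed sizes)
y = rng.integers(0, 2, size=100)   # binary target

# Hold out 20% of the rows for evaluation; stratify keeps the class
# proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)  # (80, 10) (20, 10)
```

The resulting arrays can be passed to `xgb.DMatrix` in the same way as the random arrays used in the cells that follow.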



In [3]:
data = np.random.rand(5, 10)  # 5 samples, each with 10 features
label = np.random.randint(2, size=5)  # binary target (0 or 1)


We need to create an instance of DMatrix for each part of the train-test split. DMatrix is an XGBoost-specific data structure, optimized for memory efficiency and training speed

In [4]:
dtrain = xgb.DMatrix(data, label=label)

data

array([[0.28431508, 0.28683263, 0.20217617, 0.28254538, 0.25608215,
        0.80456018, 0.59100851, 0.90657822, 0.34999947, 0.66811695],
       [0.48179547, 0.98085121, 0.58507306, 0.71301614, 0.07395093,
        0.10818143, 0.09168387, 0.57739787, 0.53345759, 0.49890587],
       [0.34311117, 0.79279597, 0.45381072, 0.80826824, 0.95939382,
        0.80803174, 0.79325094, 0.98760777, 0.84828737, 0.2745148 ],
       [0.43117076, 0.62000763, 0.09584611, 0.45536008, 0.02126448,
        0.25107378, 0.17018281, 0.77229535, 0.95354187, 0.77518453],
       [0.38378566, 0.49475906, 0.5087447 , 0.7828558 , 0.90374019,
        0.45839864, 0.25924966, 0.36963579, 0.623325  , 0.64791987]])

In [5]:
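test_data = np.random.rand(5, 10)  # features in evaluation set
test_label = np.random.randint(2, size=5)  # targets in evaluation set

# Build the evaluation DMatrix from the evaluation features, not the training data
dtest = xgb.DMatrix(test_data, label=test_label)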

### Specifying Parameters



In [6]:
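# max_depth: maximum depth of each tree; eta: learning rate;
# objective: binary classification with probability output
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
param['nthread'] = 4  # number of parallel threads
param['eval_metric'] = 'auc'  # area under the ROC curve

# Datasets to evaluate after each boosting round
evallist = [(dtrain, 'train'), (dtest, 'eval')]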

In [7]:
num_round = 10
bst = xgb.train(param, dtrain, num_round, evals=evallist)

[0]	train-auc:0.50000	eval-auc:0.50000
[1]	train-auc:0.50000	eval-auc:0.50000
[2]	train-auc:0.50000	eval-auc:0.50000
[3]	train-auc:0.50000	eval-auc:0.50000
[4]	train-auc:0.50000	eval-auc:0.50000
[5]	train-auc:0.50000	eval-auc:0.50000
[6]	train-auc:0.50000	eval-auc:0.50000
[7]	train-auc:0.50000	eval-auc:0.50000
[8]	train-auc:0.50000	eval-auc:0.50000
[9]	train-auc:0.50000	eval-auc:0.50000


