### XGBoost Tutorial

An overview of XGBoost, a gradient boosting library.

Reasons for implementation
- Popular across multiple languages
- Allows for distributed training with both Apache Spark and PySpark
- Support for model inference across a variety of data types (arrays, dataframes)

In [2]:
import xgboost as xgb
import numpy as np

### Data Structures

There are two types of data structures for training models: NumPy arrays and DMatrices. A DMatrix is a data structure specific to XGBoost that is optimized for both memory efficiency and training speed. DMatrices are the recommended input when training with the native XGBoost API.



In [10]:
X_train = np.random.rand(100, 10)
y_train = np.random.randint(2, size=100)
X_test = np.random.rand(100, 10)
y_test = np.random.randint(2, size=100)



dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)


### Specifying Parameters

XGBoost parameters fall into three groups:

1. General Parameters: Guide the overall functioning
2. Booster Parameters: Guide the individual booster (tree/regression) at each step
3. Learning Task Parameters: Guide the optimization performed
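One way to keep the three groups visible in code is to build them as separate dictionaries and merge them into the single dictionary that XGBoost expects. This is only an organizational sketch; the dictionary names are hypothetical and the values are examples rather than recommendations:

```python
# Hypothetical grouping of parameters by the three categories above.
general_params = {"booster": "gblinear", "seed": 0}
booster_params = {"lambda": 0, "alpha": 0, "updater": "coord_descent"}
task_params = {"objective": "reg:squarederror", "eval_metric": "rmse"}

# XGBoost takes one flat dictionary, so the groups are merged before use.
params = {**general_params, **booster_params, **task_params}
print(params)
```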


### General Parameters
- booster [default=gbtree]
    - Select the type of model to run at each iteration. The main options are:
        - gbtree: tree-based models
        - gblinear: linear models
        - dart: tree-based models with dropout
- silent [default=0]:
    - Silent mode is activated if this is set to 1, i.e. no running messages will be printed.
    - It is generally good to keep it at 0, as the messages can help in understanding the model.
    - Recent XGBoost releases replace silent with the verbosity parameter.
- nthread [defaults to the maximum number of threads available if not set]
    - This is used for parallel processing; enter the number of cores to use.
    - To run on all cores, leave the value unset and the algorithm will detect the number automatically.
- num_pbuffer [set automatically by XGBoost, no need to be set by user]
    - Size of the prediction buffer, normally equal to the number of rows in the training data. The buffer stores the prediction results of the previous boosting step, so this parameter should not be set by the user.
- num_feature [set automatically by XGBoost, no need to be set by user]
    - Feature dimension used in boosting. If not set by the user, XGBoost uses the maximum feature dimension present in the data.
- num_class [must be set by the user for multiclass objectives]
    - Number of unique classes in the data, used for multiclass classification problems. It is required with the multi:softmax and multi:softprob objectives and is not needed for regression or binary classification.
- eval_metric [default according to the objective parameter]
    - Evaluation metrics for validation data. A default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking).
    - Multiple evaluation metrics can be specified; in Python they are passed as a list.
- seed [default=0]
    - The random number seed.
    - Can be used for generating reproducible results and also for parameter tuning.

### Booster Parameters (Linear Booster)

- lambda [default=0, alias: reg_lambda]
    - L2 regularization term on weights; increasing this value makes the model more conservative.
- alpha [default=0, alias: reg_alpha]
    - L1 regularization term on weights; increasing this value makes the model more conservative.
- lambda_bias [default=0, alias: reg_lambda_bias]
    - L2 regularization term on bias; increasing this value makes the model more conservative.
- updater [default=shotgun]
    - Algorithm used to fit the linear model (shotgun is a parallel coordinate descent algorithm; coord_descent is ordinary coordinate descent)
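For orientation, the role of the three regularization terms can be written as a generic penalized objective. This is a textbook-style sketch, not XGBoost's exact internal formulation: the symbols $w$ (weights), $b$ (bias), $\ell$ (per-example loss) and $\lambda_b$ (standing in for lambda_bias) are notation introduced here, and the library may scale the penalties differently.

```latex
\mathcal{L}(w, b) = \sum_{i=1}^{n} \ell\left(y_i,\; w^\top x_i + b\right)
  + \frac{\lambda}{2} \lVert w \rVert_2^2
  + \alpha \lVert w \rVert_1
  + \frac{\lambda_b}{2}\, b^2
```

Larger values of $\lambda$, $\alpha$, or $\lambda_b$ penalize large weights or bias more heavily, which is the sense in which they make the model more conservative.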


### Learning Task Parameters

- objective [default=reg:squarederror, named reg:linear in older releases]
    - This defines the loss function to be minimized. Commonly used values are:
        - binary:logistic: logistic regression for binary classification, returns predicted probability (not class)
        - multi:softmax: multiclass classification using the softmax objective, returns predicted class (not probabilities)
            - you also need to set the additional num_class parameter, which defines the number of unique classes
        - multi:softprob: same as softmax, but returns the predicted probability of each data point belonging to each class.
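The difference between the outputs of these objectives can be illustrated with plain NumPy. The sketch below does not call XGBoost; it applies the sigmoid and softmax transforms to made-up raw scores (the `margins` array is hypothetical) to show what "probability" versus "class" output means:

```python
import numpy as np

def sigmoid(margin):
    """Map raw scores to probabilities, as in a binary logistic model."""
    return 1.0 / (1.0 + np.exp(-margin))

def softmax(margins):
    """Map per-class raw scores to probabilities that sum to 1 per row."""
    shifted = margins - margins.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up raw scores for two data points and three classes.
margins = np.array([[2.0, 0.5, -1.0],
                    [0.1, 0.2, 3.0]])

probs = softmax(margins)        # analogous to the multi:softprob output
classes = probs.argmax(axis=1)  # analogous to the multi:softmax output
binary_prob = sigmoid(np.array([0.0, 2.0]))  # analogous to binary:logistic

print(probs.round(3))
print(classes)
print(binary_prob.round(3))
```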




### Component-Wise Linear Model Gradient Boosting Setup

Here we set the objective to squared error by specifying the reg:squarederror objective (named reg:linear in older releases). We also set eval_metric to rmse, the root mean squared error. Assuming a standard linear model, no additional regularization is applied (lambda = 0, alpha = 0). The scikit-learn wrapper used below runs its default of 100 boosting rounds (n_estimators=100), so 100 single-coefficient updates are performed; the num_round and watchlist variables apply only to the native xgb.train API.

Importantly, with feature_selector set to greedy and top_k set to 1, each round updates the coefficient of the feature with the largest gradient magnitude. This greedy approach to feature selection is what makes the fit component-wise.
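The greedy, one-feature-per-round idea can be sketched in plain NumPy. This is a simplified illustration under squared error, not XGBoost's implementation: the function name `greedy_componentwise_fit`, the fixed intercept, and the damped Newton step are all assumptions made for the sketch.

```python
import numpy as np

def greedy_componentwise_fit(X, y, n_rounds=100, eta=0.1):
    """Greedy coordinate-wise fitting of a linear model under squared error."""
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    intercept = y.mean()  # kept fixed here for simplicity
    for _ in range(n_rounds):
        residual = y - (X @ coef + intercept)
        # Gradient of half the mean squared error w.r.t. each coefficient.
        grad = -(X.T @ residual) / n_samples
        # Greedy selection: the feature with the largest gradient magnitude.
        j = int(np.argmax(np.abs(grad)))
        # Damped Newton step on that single coordinate.
        hess = np.mean(X[:, j] ** 2)
        coef[j] -= eta * grad[j] / hess
    return coef, intercept

# Synthetic data in which only feature 2 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)

coef, intercept = greedy_componentwise_fit(X, y)
mse_before = np.mean((y - y.mean()) ** 2)
mse_after = np.mean((y - (X @ coef + intercept)) ** 2)
print(coef.round(3), mse_before, mse_after)
```

Because only one coefficient changes per round, features that never have the largest gradient keep a coefficient of exactly zero.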

In [67]:
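# gblinear booster fitted by coordinate descent; the greedy selector with
# top_k=1 updates one coefficient per boosting round.
param = {'objective':'reg:squarederror', 'booster':'gblinear',
         'updater':'coord_descent','alpha': 0, 'lambda': 0,
         'feature_selector': 'greedy','top_k':1,'eta':0.001}

param['eval_metric']='rmse'
# watchlist and num_round are inputs for the native xgb.train API;
# the scikit-learn wrapper below does not read them.
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 1000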

### Compatibility with Scikit-Learn

XGBoost provides scikit-learn-compatible estimators such as XGBRegressor. These expose the scikit-learn API (fit, predict, score) for training and inference, which is useful for cross-validation and grid search.

In [68]:
boosted_regresion = xgb.XGBRegressor(**param)
boosted_regresion.fit(X_train, y_train, verbose=True)



In [69]:
boosted_regresion.predict(X_test)

array([0.5092283 , 0.5098041 , 0.5098117 , 0.50728726, 0.5045136 ,
       0.5113001 , 0.5044889 , 0.51273537, 0.51384264, 0.50841755,
       0.511699  , 0.51606023, 0.50493   , 0.50798434, 0.5067369 ,
       0.50631016, 0.5106881 , 0.5100418 , 0.5128517 , 0.51386493,
       0.5082587 , 0.50911283, 0.50645244, 0.5053252 , 0.5068168 ,
       0.5086351 , 0.51529956, 0.5109724 , 0.5108534 , 0.51142025,
       0.51053756, 0.50996286, 0.51328623, 0.5076377 , 0.5063854 ,
       0.5105663 , 0.50518596, 0.5093748 , 0.5080307 , 0.51511973,
       0.51483965, 0.5069136 , 0.5159051 , 0.5113169 , 0.515719  ,
       0.5131129 , 0.50840616, 0.50934434, 0.50524217, 0.51557666,
       0.5100183 , 0.51206183, 0.5111399 , 0.51266617, 0.50658846,
       0.5139409 , 0.5090528 , 0.5126986 , 0.51089936, 0.5100083 ,
       0.5104867 , 0.5093832 , 0.51042724, 0.50744903, 0.5117112 ,
       0.5065937 , 0.5158394 , 0.515577  , 0.50447214, 0.5095995 ,
       0.5148381 , 0.515874  , 0.5121128 , 0.5110455 , 0.50492

In [83]:
boosted_regresion.score(X_test, y_test)

-0.004055962991140083
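For scikit-learn regressors, `score` returns the coefficient of determination R². A value near zero, like the one above, means the model does about as well as always predicting the mean of the targets, which is what one would expect with random labels. The sketch below computes R² by hand on made-up values; the function name `r2_manual` is hypothetical.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([0.0, 1.0, 1.0, 0.0, 1.0])

# Always predicting the mean gives R^2 = 0; a perfect prediction gives R^2 = 1.
r2_mean = r2_manual(y_true, np.full_like(y_true, y_true.mean()))
r2_perfect = r2_manual(y_true, y_true)
print(r2_mean, r2_perfect)
```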

In [70]:
boosted_regresion.get_params()

{'objective': 'reg:squarederror',
 'base_score': None,
 'booster': 'gblinear',
 'callbacks': None,
 'colsample_bylevel': None,
 'colsample_bynode': None,
 'colsample_bytree': None,
 'early_stopping_rounds': None,
 'enable_categorical': False,
 'eval_metric': 'rmse',
 'feature_types': None,
 'gamma': None,
 'gpu_id': None,
 'grow_policy': None,
 'importance_type': None,
 'interaction_constraints': None,
 'learning_rate': None,
 'max_bin': None,
 'max_cat_threshold': None,
 'max_cat_to_onehot': None,
 'max_delta_step': None,
 'max_depth': None,
 'max_leaves': None,
 'min_child_weight': None,
 'missing': nan,
 'monotone_constraints': None,
 'n_estimators': 100,
 'n_jobs': None,
 'num_parallel_tree': None,
 'predictor': None,
 'random_state': None,
 'reg_alpha': None,
 'reg_lambda': None,
 'sampling_method': None,
 'scale_pos_weight': None,
 'subsample': None,
 'tree_method': None,
 'validate_parameters': None,
 'verbosity': None,
 'updater': 'coord_descent',
 'alpha': 0,
 'lambda': 0,
 'f

In [71]:
boosted_regresion.intercept_

array([0.00445262])

In [72]:
boosted_regresion.coef_

array([0.       , 0.       , 0.       , 0.       , 0.       , 0.       ,
       0.0121198, 0.       , 0.       , 0.       ])

The fit appears to produce sparse solutions (only one of the ten coefficients above is non-zero), but this still needs to be confirmed.