~ XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

~ XGBoost stands for eXtreme Gradient Boosting.

~ It was developed by Tianqi Chen, with computational speed and model performance as its main design goals.

# The first XGBoost model: Basic model with default parameters

In [1]:
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")

In [3]:
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]

In [4]:
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

In [5]:
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

subsample [default=1]

    Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data before growing trees, which helps prevent overfitting. Subsampling occurs once in every boosting iteration.
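To make the idea of per-iteration row subsampling concrete, here is a minimal sketch in plain Python. It is an illustration only: the helper `subsample_rows`, the fixed-size sampling without replacement, and the toy sizes are assumptions made for this example, and XGBoost's internal sampling scheme may differ.

```python
import random

def subsample_rows(n_rows, ratio, rng):
    """Pick a random subset of row indices, without replacement (hypothetical helper)."""
    n_keep = max(1, int(n_rows * ratio))
    return sorted(rng.sample(range(n_rows), n_keep))

rng = random.Random(7)  # fixed seed only so the sketch is repeatable
n_rows, n_rounds = 10, 3  # toy sizes, not taken from the dataset

# A fresh subset is drawn once per boosting round
samples = [subsample_rows(n_rows, 0.5, rng) for _ in range(n_rounds)]
for round_idx, rows in enumerate(samples):
    print("round %d trains on rows %s" % (round_idx, rows))
```

Each round sees a different half of the rows, so no single tree is fitted to the full training set.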

scale_pos_weight [default=1]

    Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances)
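The typical value mentioned above can be computed directly from the label vector. The sketch below uses made-up labels and a hypothetical helper name, `suggested_scale_pos_weight`; it assumes binary labels encoded as 0 and 1.

```python
def suggested_scale_pos_weight(labels):
    """Return sum(negative instances) / sum(positive instances) for 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive instances in labels")
    return negatives / positives

toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]  # made-up labels: 6 negatives, 2 positives
ratio = suggested_scale_pos_weight(toy_labels)
print(ratio)  # 3.0
```

The resulting number could then be passed as `scale_pos_weight` when constructing the classifier, treating it as a starting point to tune rather than a fixed rule.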
    
booster [default=gbtree]

    Which booster to use. Can be gbtree, gblinear or dart; gbtree and dart use tree based models while gblinear uses linear functions.

base_score [default=0.5]

    The initial prediction score of all instances (the global bias).

colsample_bylevel [default=1]

    Subsample ratio of columns for each split, in each level. Subsampling occurs each time a new split is made. This parameter has no effect when tree_method is set to hist.
    range: (0,1]

gamma [default=0, alias: min_split_loss]

    Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
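One way to see what gamma does is to write out the split gain from the XGBoost paper, where G and H are the sums of first and second derivatives of the loss in each child. The sketch below is a simplified illustration, not the library's code: the function name `split_gain` and the gradient values are invented for the example, and the L1 term (alpha) is ignored.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty."""
    def score(g, h):
        # Structure score of a node with gradient sum g and hessian sum h
        return g * g / (h + reg_lambda)

    gain = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return gain - gamma

# Made-up gradient statistics for one candidate split
gain_no_penalty = split_gain(-4.0, 3.0, 4.0, 3.0)
gain_with_gamma = split_gain(-4.0, 3.0, 4.0, 3.0, gamma=5.0)

print(gain_no_penalty)   # 4.0, so the split is kept
print(gain_with_gamma)   # -1.0, so the split is rejected
```

The same candidate split goes from worthwhile to rejected once gamma exceeds its raw gain, which is why larger gamma values yield smaller, more conservative trees.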

max_delta_step [default=0]

    Maximum delta step allowed for each leaf output. A value of 0 means there is no constraint. A positive value can help make the update step more conservative. This parameter is usually not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value between 1 and 10 might help control the update.

max_depth [default=6]

    Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. Note that a limit is required when grow_policy is set to depthwise. Also note that the XGBClassifier output printed in this notebook reports max_depth=3 rather than 6.
    range: [0,∞]

min_child_weight [default=1]

    Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node whose sum of instance weight is less than min_child_weight, the building process gives up further partitioning. In a linear regression task, this simply corresponds to the minimum number of instances needed in each node. The larger min_child_weight is, the more conservative the algorithm will be.
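The "sum of instance weight (hessian)" wording is easier to follow with numbers. The sketch below assumes unit sample weights and uses the standard second derivatives of squared error (always 1) and logistic loss (p * (1 - p)); the helper names and the probabilities are made up for illustration.

```python
def hessian_squared_error():
    """Second derivative of 0.5 * (prediction - y)**2 with respect to the prediction."""
    return 1.0

def hessian_logistic(probability):
    """Second derivative of the logistic loss, given the predicted probability."""
    return probability * (1.0 - probability)

def passes_min_child_weight(hessians, min_child_weight=1.0):
    """True if a child with these per-instance hessians is allowed to exist."""
    return sum(hessians) >= min_child_weight

# Regression: three instances always sum to 3.0, i.e. the instance count
regression_child = [hessian_squared_error() for _ in range(3)]

# Classification: three confidently predicted instances carry very little weight
logistic_child = [hessian_logistic(p) for p in (0.99, 0.98, 0.99)]

print(sum(regression_child), passes_min_child_weight(regression_child))
print(round(sum(logistic_child), 4), passes_min_child_weight(logistic_child))
```

For squared error the check reduces to a minimum instance count, while for logistic loss a node full of confidently classified instances can fall below the threshold even when it holds several rows.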

lambda [default=1, alias: reg_lambda]

    L2 regularization term on weights. Increasing this value makes the model more conservative.

alpha [default=0, alias: reg_alpha]

    L1 regularization term on weights. Increasing this value makes the model more conservative.

silent [default=0]

    0 means printing running messages, 1 means silent mode




In [6]:
print(model)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)


In [7]:
# make predictions for test data
y_pred = model.predict(X_test)
# predict() already returns class labels; round() is kept as a harmless safeguard
predictions = [round(value) for value in y_pred]

In [8]:
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 77.95%
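For reference, the accuracy reported above is simply the fraction of test rows whose predicted label matches the true label. The sketch below recomputes that quantity by hand on made-up labels; the function `accuracy` is a hypothetical stand-in written for this example, not scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

toy_true = [1, 0, 1, 1, 0]  # made-up labels, not from the diabetes dataset
toy_pred = [1, 0, 0, 1, 0]  # one mistake out of five
toy_accuracy = accuracy(toy_true, toy_pred)
print("Accuracy: %.2f%%" % (toy_accuracy * 100.0))  # Accuracy: 80.00%
```

Because accuracy counts every row equally, it can look flattering on imbalanced data, which is where a setting such as scale_pos_weight becomes relevant.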


References:
----------------------------

https://machinelearningmastery.com/xgboost-python-mini-course/
