# Python Guide

## Loading Data

The XGBoost python module is able to load data from:

- LibSVM text format file

- Comma-separated values (CSV) file

- NumPy 2D array

- SciPy 2D sparse array

- cuDF DataFrame

- Pandas data frame, and

- XGBoost binary buffer file.

### Loading LibSVM text file

In [0]:
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')

### Loading a CSV File

Categorical features not supported

Note that XGBoost does not provide specialization for categorical features; if your data contains categorical features, load it as a NumPy array first and then perform corresponding preprocessing steps like one-hot encoding.

In [0]:
import xgboost as xgb

In [0]:
dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')

Use Pandas to load CSV files with headers

Currently, the DMLC data parser cannot parse CSV files with headers. Use Pandas (see below) to read CSV files with headers.

### Loading Numpy Array

In [0]:
import numpy as np

In [6]:
data = np.random.rand(5,10)
print(data.shape)                          # 5 rows and 10 columns
print(data)

(5, 10)
[[0.34650435 0.96959556 0.37590924 0.04053457 0.84074108 0.60443536
  0.18500431 0.45165124 0.90740282 0.94262351]
 [0.80151117 0.15329277 0.26991186 0.85996061 0.76673533 0.36360156
  0.27042515 0.3736201  0.74103191 0.39593222]
 [0.6176251  0.28389767 0.0563803  0.3557452  0.46335382 0.88809816
  0.67109161 0.89352332 0.83948869 0.64880124]
 [0.22806585 0.43723511 0.03820966 0.84788656 0.35439614 0.94936312
  0.27664777 0.02229596 0.70474659 0.62919792]
 [0.62069342 0.97258938 0.66994115 0.96339683 0.22351267 0.23808324
  0.69661897 0.38953276 0.6236524  0.10591636]]


In [0]:
label = np.random.randint(2, size=5)  # Binary target

In [0]:
dtrain = xgb.DMatrix(data, label = label)

In [10]:
print(dtrain)

<xgboost.core.DMatrix object at 0x7fcdd0a401d0>


### Loading Pandas DataFrame

In [0]:
import pandas as pd

In [0]:
df = pd.DataFrame(np.arange(12).reshape((4,3)), columns = ['a','b','c'])

In [13]:
df.head()

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11


In [0]:
label = pd.DataFrame(np.random.randint(2, size=4))

In [15]:
label.head()

Unnamed: 0,0
0,1
1,1
2,0
3,1


In [0]:
dtrain = xgb.DMatrix(data, label = label)

### Saving into XGBoost Buffer file

In [0]:
dtrain.save_binary('train.buffer')

In [18]:
dtrain2 = xgb.DMatrix('train.buffer')

[13:08:06] 5x10 matrix with 50 entries loaded from train.buffer


## Other Stuff

In [0]:
# Missing values can be replaced by a default value in the DMatrix constructor:
dtrain = xgb.DMatrix(data, label=label, missing=-999.0)

In [0]:
# Weights can be set when needed:
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)

## Setting Parameters, Training, Saving, Re-Loading, Visualization

### Parameters Setting

- XGBoost can use either a list of pairs or a dictionary to set parameters. 

For instance:

- Booster parameters

In [0]:
param = {'max_depth': 2, 'eta': 1, 'objective' : 'binary:logistic'}
param['nthread'] = 4
param['eval_metric']  = 'auc'

In [0]:
# Can set multiple metrics as well
param['eval_metric'] = ['auc','rmse']

In [0]:
# Specify validation to watch performance
evallist = [(dtest, 'eval'), (dtrain, 'train')]

### Training example

- Training a model requires a parameter list and data set.

In [0]:
df = pd.DataFrame(np.arange(12).reshape((4,3)), columns = ['a','b','c'])
label = pd.DataFrame(np.random.randint(2, size=4))

In [98]:
df.head()

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11


In [99]:
label.head()

Unnamed: 0,0
0,1
1,1
2,1
3,0


In [0]:
dtrain = xgb.DMatrix(df, label = label)

In [0]:
param = {'max_depth' : 2, 'eta' : 0.2, 'objective' : 'binary:logistic', 'eval_metric' : 'error'}

In [0]:
num_round = 10

In [0]:
bst = xgb.train(params = param, dtrain = dtrain, num_boost_round=num_round)

In [0]:
# After training, the model can be saved.
bst.save_model('save_model.model')

In [0]:
# dumping model as text file
bst.dump_model('dump.raw.txt')

In [0]:
# dumping model with feature map
bst.dump_model('dump.raw.txt', 'featmap.txt')

In [115]:
bst = xgb.Booster()
bst.load_model('/content/save_model.model')
print(bst)

<xgboost.core.Booster object at 0x7fcdce97f400>


### Early Stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If there’s more than one, it will use the last.

In [0]:
 bst = xgb.train(params = param, dtrain = dtrain, num_boost_round=num_round, early_stopping_rounds=2)

The model will train until the validation score stops improving. Validation error needs to decrease at least every early_stopping_rounds to continue training.

- If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. Note that xgboost.train() will return a model from the last iteration, not the best one.

### Predictions from Model

- Trained Model can be used to make predictions on dataset

In [0]:
ypred = bst.predict(dtest)

- If early stopping is enabled you can get predictions from the best iteration with bst.best_ntree_limit

In [0]:
ypred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)

### Plotting

You can use plotting module to plot importance and output tree.

To plot importance, use xgboost.plot_importance(). This function requires matplotlib to be installed.

In [0]:
xgb.plot_importance(bst)

To plot the output tree via matplotlib, use xgboost.plot_tree(), specifying the ordinal number of the target tree. This function requires graphviz and matplotlib.

In [0]:
xgb.plot_tree(bst, num_trees=2)

When you use IPython, you can use the xgboost.to_graphviz() function, which converts the target tree to a graphviz instance. The graphviz instance is automatically rendered in IPython.


In [0]:
xgb.to_graphviz(bst, num_trees=2)

## Parameter Tuning

- Use the concept of Bias-Variance TradeOff

### Control Overfitting

When you observe high training accuracy, but low test accuracy,it is likely that you encountered overfitting problem.

There are in general two ways that you can control overfitting in XGBoost:

- The first way is to directly control model complexity.

- This includes max_depth, min_child_weight and gamma.

- The second way is to add randomness to make training robust to noise.

- This includes subsample and colsample_bytree.

- You can also reduce stepsize eta. Remember to increase num_round when you do so.

### Handle Imbalanced Dataset

For common cases such as ads clickthrough log, the dataset is extremely imbalanced. This can affect the training of XGBoost model, and there are two ways to improve it.

If you care only about the overall performance metric (AUC) of your prediction

- Balance the positive and negative weights via scale_pos_weight

- Use AUC for evaluation

If you care about predicting the right probability

- In such a case, you cannot re-balance the dataset

- Set parameter max_delta_step to a finite number (say 1) to help convergence

Parameter tuning is art use the following webpage and master

https://xgboost.readthedocs.io/en/latest/parameter.html

### GPUs for XGBoost

- Can be used.
- Specify the tree_method parameter as 'gpu_hist'

Equivalent to the XGBoost fast histogram algorithm. Much faster and uses considerably less memory. NOTE: Will run very slowly on GPUs older than Pascal architecture.

- Faster performance.