# DMatrix: Data Structure for XGBoost

This is a simple review of the DMatrix object provided by XGBoost.

#### Notes:
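- By default, DMatrix accepts only numeric column types: **float, int, bool**. Categorical variables must be encoded first, or passed with `enable_categorical=True` in recent XGBoost releases.
- DMatrix is XGBoost's internal data container, so building it once and reusing it makes training **more efficient** in both memory and speed.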

#### References:
- [Python API Reference - Core Data Structure](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.DMatrix)

In [30]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import sys
sys.path.append('../../')
from datasets import solar
from tools.reader import get_dcol
import xgboost as xgb

### load data

In [31]:
# load data
data, dcol = solar.load()
# select data
ly = ['y']
lx = ['doy', 'hour', 'LCDC267', 'MCDC267', 'HCDC267', 'TCDC267', 'logAPCP267', 'RH267', 'TMP267', 'DSWRF267']
data = data[lx + ly]
dcol = get_dcol(data, ltarget=ly)

Load data..


### creating DMatrix object

In [32]:
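# build the DMatrix from the feature columns and the target column
xgdata = xgb.DMatrix(
    data[dcol['lx']],
    label=data[dcol['ly']],
    missing=np.nan,            # value treated as missing
    silent=False,
    feature_names=dcol['lx'],
    nthread=-1                 # use all available threads when loading
)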

In [43]:
# basic information
print('feature names:', xgdata.feature_names)
print('feature types:', xgdata.feature_types)
print('shape: %s x %s' % (xgdata.num_row(), xgdata.num_col()))

feature names: ['doy', 'hour', 'LCDC267', 'MCDC267', 'HCDC267', 'TCDC267', 'logAPCP267', 'RH267', 'TMP267', 'DSWRF267']
feature types: ['int', 'int', 'float', 'float', 'float', 'float', 'float', 'float', 'float', 'float']
shape: 26029 x 10
