## eXtreme Gradient Boosting

Using a second-order Taylor expansion of the loss around the current prediction $a_{N-1}(x_i)$:

$L(y_i, a_{N-1}(x_i)+b_N(x_i)) \approx L(y_i, a_{N-1}(x_i)) + g_ib_N(x_i)+ \frac{1}{2}h_ib_N^2(x_i)$

where 

$g_i=\frac{\partial}{\partial z}L(y_i,z)|_{z=a_{N-1}(x_i)}$

$h_i=\frac{\partial^2}{\partial z^2}L(y_i,z)|_{z=a_{N-1}(x_i)}$

(To train the new decision tree we use not only the first-order derivative of the loss, but also the second-order derivative.)
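
As an illustration, the sketch below computes $g_i$ and $h_i$ for two common losses. The choice of losses (squared error and logistic loss), the function names and the sample numbers are assumptions made for this example; they do not come from the notes above.

```python
import math

def squared_error_grad_hess(y, z):
    """g and h for L(y, z) = 0.5 * (y - z)**2."""
    g = z - y   # first derivative with respect to z
    h = 1.0     # second derivative with respect to z
    return g, h

def logistic_grad_hess(y, z):
    """g and h for the logistic loss with label y in {0, 1} and raw score z."""
    p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
    g = p - y
    h = p * (1.0 - p)
    return g, h

# Hypothetical targets y_i and current ensemble predictions a_{N-1}(x_i).
y = [1.0, 0.0, 1.0]
a_prev = [0.2, -0.5, 1.5]

# One (g_i, h_i) pair per observation, evaluated at z = a_{N-1}(x_i).
grad_hess = [logistic_grad_hess(yi, zi) for yi, zi in zip(y, a_prev)]
```

Both derivatives are taken with respect to the prediction $z$, not with respect to the features, so they are cheap to evaluate once per boosting round.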

The previous gradient boosting task:

$\sum^{l}_{i=1}L(y_i, a_{N-1}(x_i) + b_N(x_i)) \to \min_{b_N(x)}$

becomes a new optimization task:

$\sum^{l}_{i=1}(g_ib_N(x_i)+ \frac{1}{2}h_ib_N^2(x_i)) \to \min_{b_N(x)}$

(The term $L(y_i, a_{N-1}(x_i))$ does not depend on the new tree $b_N$, so it is dropped. The remaining objective depends on the loss function only through its first-order and second-order derivatives $g_i$ and $h_i$, evaluated at the current prediction.)
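
A small numerical check of this step, as a sketch: for the squared error loss $L(y,z)=\frac{1}{2}(y-z)^2$ (an assumed choice, with $g_i = a_{N-1}(x_i)-y_i$ and $h_i=1$) the quadratic objective equals the exact change in total loss, because the Taylor expansion has no higher-order terms. The function names and the numbers are hypothetical.

```python
def quadratic_objective(g, h, b):
    """Sum over observations of g_i * b_i + 0.5 * h_i * b_i**2."""
    return sum(gi * bi + 0.5 * hi * bi * bi for gi, hi, bi in zip(g, h, b))

def total_squared_loss(y, z):
    """Total loss for L(y, z) = 0.5 * (y - z)**2."""
    return sum(0.5 * (yi - zi) ** 2 for yi, zi in zip(y, z))

# Hypothetical targets, current predictions a_{N-1}(x_i), and new-tree outputs b_N(x_i).
y = [3.0, -1.0, 2.0]
a_prev = [2.0, 0.0, 2.5]
b_new = [0.5, -0.5, -0.25]

# Derivatives of the squared error loss at the current predictions.
g = [zi - yi for yi, zi in zip(y, a_prev)]
h = [1.0] * len(y)

a_next = [zi + bi for zi, bi in zip(a_prev, b_new)]
exact_change = total_squared_loss(y, a_next) - total_squared_loss(y, a_prev)
approx_change = quadratic_objective(g, h, b_new)
```

For other losses the two quantities agree only approximately, and the agreement is better when the outputs of the new tree are small.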

### Regularization

A decision tree splits the feature space into $T$ regions:

$q : \mathbb{R}^d \to \{X_1, X_2, \dots, X_T\}$

We can then write the decision tree as:

$b_N(x_i)=w_{q(x_i)}$

where $w\in \mathbb{R}^T$ is a vector of scores, and $w_j$ is the score of the tree leaf that corresponds to the region $X_j$. Here $w_{q(x_i)}$ denotes the score of the leaf whose region contains $x_i$.

(The tree is thus described by its regions, one per leaf, and by the values in the leaves.)

Tree complexity:

$\Omega(b_N) = \gamma T+\frac{1}{2}\lambda \sum^T_{j=1} w^2_j$

(The first term penalizes the number of leaves, the second the sum of squared leaf scores.)

(There are two new coefficients, $\gamma$ and $\lambda$. Adding this term to the loss function gives a regularizer. $\gamma$ penalizes the tree for having too many regions, so the tree tends to have fewer leaves. $\lambda$ penalizes leaf values that are too large.)

__example__

What will be the value of tree complexity of XGBoost for the tree with two leaves having weights 1 and 2? Suppose that regularization coefficients are equal to 1.

$T=2,\ \gamma T = 2,\ \frac{1}{2}\lambda \sum^T_{j=1} w^2_j = \frac{1}{2}(1^2+2^2) = 2.5 \implies \Omega(b_N)=4.5$
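
The same calculation as a minimal sketch in code. The function name `tree_complexity` is made up for this example.

```python
def tree_complexity(weights, gamma, lam):
    """Omega = gamma * T + 0.5 * lambda * (sum of squared leaf scores)."""
    T = len(weights)  # number of leaves
    return gamma * T + 0.5 * lam * sum(w * w for w in weights)

# The tree from the example: two leaves with weights 1 and 2, gamma = lambda = 1.
omega = tree_complexity([1.0, 2.0], gamma=1.0, lam=1.0)
```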

Optimization task with regularization:

$\sum^{l}_{i=1}(g_ib_N(x_i)+ \frac{1}{2}h_ib_N^2(x_i)) + \Omega(b_N) \to \min_{b_N(x)}$

### Structure score

Substitute $b_N(x_i)=w_{q(x_i)}$ and group the observations by leaf to rewrite the objective function:

$\sum^l_{i=1}(g_ib_N(x_i)+ \frac{1}{2}h_ib_N^2(x_i)) + \gamma T+\frac{1}{2}\lambda \sum^T_{j=1} w^2_j$

$=\sum^T_{j=1}((\sum_{i\in I_j} g_i)w_j + \frac{1}{2}(\sum_{i\in I_j} h_i+\lambda)w^2_j)+\gamma T$

where $I_j=\{i | q(x_i)=X_j\}$

($I_j$ is the set of indices $i$ of the observations $x_i$ that fall into the region $X_j$. Here again, $q$ is the decision function that assigns each observation to a region.)

The formula above can be simplified.

Denote:

$G_j= \sum_{i\in I_j} g_i$

$H_j= \sum_{i\in I_j} h_i$
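
A minimal sketch of how $G_j$ and $H_j$ could be accumulated. It assumes each observation has already been assigned a leaf index $j$ (the list `leaf_of`, which plays the role of $q(x_i)$); the names and numbers are hypothetical.

```python
from collections import defaultdict

def leaf_sums(leaf_of, g, h):
    """Return dicts G and H with G[j] = sum of g_i and H[j] = sum of h_i over i in I_j."""
    G = defaultdict(float)
    H = defaultdict(float)
    for j, gi, hi in zip(leaf_of, g, h):
        G[j] += gi
        H[j] += hi
    return dict(G), dict(H)

# Five observations: the first two fall into leaf 0, the other three into leaf 1.
leaf_of = [0, 0, 1, 1, 1]
g = [1.0, -2.0, 0.5, 0.5, 1.0]
h = [1.0, 1.0, 1.0, 1.0, 1.0]

G, H = leaf_sums(leaf_of, g, h)
```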

Optimization task:

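$\sum^T_{j=1}(G_jw_j+\frac{1}{2}(H_j+\lambda)w^2_j)+\gamma T \to \min_{b_N(x)}$

($T$ is the number of regions, i.e. the number of leaves.)

For a fixed tree structure $q$, the $w_j$ are independent of each other and each term in the sum is quadratic in $w_j$, so (assuming $H_j+\lambda > 0$) we can find the optimal score values and the corresponding value of the loss function in closed form: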

$w^{*}_j = -\frac{G_j}{H_j+\lambda}$

$L^{*} = -\frac{1}{2}\sum^T_{j=1}\frac{G^2_j}{H_j+\lambda} + \gamma T$
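
For completeness, a short derivation of these two formulas, assuming $H_j+\lambda>0$ so that each quadratic has a minimum. Setting the derivative of the $j$-th term to zero:

$\frac{\partial}{\partial w_j}(G_jw_j+\frac{1}{2}(H_j+\lambda)w^2_j) = G_j+(H_j+\lambda)w_j = 0 \implies w^{*}_j = -\frac{G_j}{H_j+\lambda}$

Substituting $w^{*}_j$ back into the $j$-th term:

$G_jw^{*}_j+\frac{1}{2}(H_j+\lambda)(w^{*}_j)^2 = -\frac{G^2_j}{H_j+\lambda}+\frac{1}{2}\frac{G^2_j}{H_j+\lambda} = -\frac{1}{2}\frac{G^2_j}{H_j+\lambda}$

Summing over $j$ and adding $\gamma T$ gives $L^{*}$.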

(So, given the first-order and second-order derivatives, we know the exact optimal leaf values for a fixed tree structure and the loss they achieve. $L^{*}$ therefore measures how good a tree structure is: the smaller, the better.)
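
The closed-form expressions can be sketched in code as follows. The function names and the values of $G_j$, $H_j$, $\lambda$ and $\gamma$ are made up for illustration, and the sketch assumes $H_j+\lambda>0$ for every leaf.

```python
def optimal_weights(G, H, lam):
    """w*_j = -G_j / (H_j + lambda) for every leaf j."""
    return [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]

def structure_score(G, H, lam, gamma):
    """L* = -0.5 * sum_j G_j**2 / (H_j + lambda) + gamma * T."""
    return -0.5 * sum(Gj * Gj / (Hj + lam) for Gj, Hj in zip(G, H)) + gamma * len(G)

def objective(G, H, w, lam, gamma):
    """Regularized quadratic objective for arbitrary leaf scores w."""
    per_leaf = sum(Gj * wj + 0.5 * (Hj + lam) * wj * wj for Gj, Hj, wj in zip(G, H, w))
    return per_leaf + gamma * len(G)

# Hypothetical per-leaf sums for a tree with two leaves.
G = [-1.0, 2.0]
H = [2.0, 3.0]
lam, gamma = 1.0, 1.0

w_star = optimal_weights(G, H, lam)          # approximately [0.333, -0.5]
l_star = structure_score(G, H, lam, gamma)   # approximately 1.333

# Plugging w* into the objective reproduces L*; other scores give a larger value.
at_optimum = objective(G, H, w_star, lam, gamma)
elsewhere = objective(G, H, [0.0, 0.0], lam, gamma)
```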

### Tree structure learning

Is it worth splitting a leaf into two new leaves?

Left leaf loss:

$L_l = -\frac{1}{2} \times \frac{G^2_l}{H_l+\lambda} + \gamma$

Right leaf loss:

$L_r = -\frac{1}{2} \times \frac{G^2_r}{H_r+\lambda} + \gamma$

Loss of the initial (unsplit) leaf:

$L_i = -\frac{1}{2} \times \frac{(G_l+G_r)^2}{H_l+H_r+\lambda} + \gamma$

Gain obtained by splitting the leaf into two:

Gain = $L_i-(L_l+L_r)=\frac{1}{2}(\frac{G^2_l}{H_l+\lambda} + \frac{G^2_r}{H_r+\lambda} - \frac{(G_l+G_r)^2}{H_l+H_r+\lambda}) - \gamma$

If Gain > 0, the split reduces the regularized loss and we may split the leaf; otherwise we leave it as it is.
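
The gain formula can be turned into a split search. The sketch below scans all thresholds of a single feature and keeps the split with the largest positive gain. The exhaustive scan, the use of midpoints as thresholds, and all names and numbers are assumptions made for this illustration; the notes above only define the gain itself.

```python
def leaf_term(G, H, lam):
    """G**2 / (H + lambda), the per-leaf term of the gain formula."""
    return G * G / (H + lam)

def split_gain(G_l, H_l, G_r, H_r, lam, gamma):
    """Gain = 0.5 * (left term + right term - unsplit term) - gamma."""
    unsplit = leaf_term(G_l + G_r, H_l + H_r, lam)
    return 0.5 * (leaf_term(G_l, H_l, lam) + leaf_term(G_r, H_r, lam) - unsplit) - gamma

def best_split(x, g, h, lam, gamma):
    """Return (gain, threshold) for the best split with positive gain, else None."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G_total, H_total = sum(g), sum(h)
    G_l = H_l = 0.0
    best = None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_l += g[i]  # running sums for the left leaf
        H_l += h[i]
        if x[i] == x[nxt]:
            continue  # equal feature values cannot be separated by a threshold
        gain = split_gain(G_l, H_l, G_total - G_l, H_total - H_l, lam, gamma)
        if gain > 0 and (best is None or gain > best[0]):
            best = (gain, 0.5 * (x[i] + x[nxt]))  # midpoint threshold (assumption)
    return best

# Hypothetical leaf with four observations of one feature.
x = [1.0, 2.0, 3.0, 4.0]
g = [-1.0, -1.0, 1.0, 1.0]
h = [1.0, 1.0, 1.0, 1.0]

result = best_split(x, g, h, lam=1.0, gamma=0.1)          # splits between 2.0 and 3.0
no_split = best_split(x, [1.0] * 4, h, lam=1.0, gamma=1.0)  # no positive gain
```

In the second call all gradients agree, so no split has positive gain and the function returns `None`, which corresponds to leaving the leaf as it is.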

__Summary__

The first and second derivatives of the loss function are used for the optimization.

It is possible to add __explicit__ regularization to the loss function.

We can measure the quality of a tree structure explicitly and use the gain to make split decisions.