# Gradient Boosting

## Definition

Gradient boosting is a machine-learning technique that builds an accurate prediction model by combining many simple prediction models (weak learners), each one fitted to correct the errors of the ensemble built so far. It performs strongly in a wide range of practical applications, particularly on structured (tabular) data.

- **Ensemble Learning**: Gradient boosting is a form of ensemble learning, where multiple models (typically decision trees) are combined to improve prediction accuracy.

- **Stage-wise Additive Model**: In gradient boosting, models are added one at a time, and existing models in the ensemble are not changed. 

- **Gradient Descent**: Each new model is fitted to the negative gradient of the loss with respect to the current predictions, so adding it amounts to a gradient-descent step in function space.

- **Loss Function Optimization**: It can optimize on any differentiable loss function, making it applicable to a wide range of tasks.

## Assumptions

None

## Algorithm

1. **Initial Model**: Gradient boosting starts from a very simple initial model $ F_0(x) $, usually a constant. In general it is the constant that minimizes the total loss; for regression with squared-error loss this is the mean of the target values $ y $:

   $$ F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma) $$

   Here, $ L $ is the loss function, $ y_i $ are the target values, and $ \gamma $ is the constant that minimizes the loss.

2. **Iterative Improvement**: For each subsequent iteration $ t = 1, 2, ..., T $:

   a. Compute the pseudo-residuals. For each data point $ i $, the pseudo-residual is the negative gradient of the loss function with respect to the model's prediction, evaluated at the previous model's prediction $ F_{t-1}(x_i) $:

      $$ r_{it} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{t-1}(x)} $$

   b. Train a weak learner (e.g., a decision tree) to predict these pseudo-residuals from the features. Let this weak learner be $ h_t(x) $.

   c. Choose a multiplier $ \gamma_t $ that minimizes the loss when the scaled weak learner is added to the current model:

      $$ \gamma_t = \arg\min_\gamma \sum_{i=1}^n L(y_i, F_{t-1}(x_i) + \gamma h_t(x_i)) $$

   d. Update the model:

      $$ F_t(x) = F_{t-1}(x) + \gamma_t h_t(x) $$

3. **Final Model**: After T iterations, the final model $ F_T(x) $ is used for making predictions. It's the sum of the initial model and all the incremental improvements:

   $$ F_T(x) = F_0(x) + \sum_{t=1}^T \gamma_t h_t(x) $$

In this formulation:
- $ F_t(x) $ is the model at iteration $ t $.
- $ L(y, F(x)) $ is the loss function, comparing the true value $ y $ and the model prediction $ F(x) $.
- $ h_t(x) $ is the weak learner trained at iteration $ t $.
- $ \gamma_t $ is the multiplier for the weak learner's predictions at iteration $ t $.
- $ r_{it} $ are the pseudo-residuals, representing the negative gradient of the loss function with respect to the model's predictions.
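
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes squared-error loss $ L = \frac{1}{2}(y - F)^2 $ (so the pseudo-residual is simply $ y - F $), a single numeric feature, and one-split regression stumps as weak learners. The names `fit_stump` and `gradient_boost` are hypothetical helpers introduced here.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump on a single feature by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20):
    # Step 1: F_0 is the constant minimizing squared loss, i.e. the mean of y.
    f0 = sum(ys) / len(ys)
    preds = [f0] * len(ys)
    stages = []
    for _ in range(n_rounds):
        # Step 2a: negative gradient of (y - F)^2 / 2 with respect to F is y - F.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Step 2b: fit the weak learner h_t to the pseudo-residuals.
        h = fit_stump(xs, residuals)
        hx = [h(x) for x in xs]
        # Step 2c: for squared loss the line search has a closed form.
        denom = sum(v * v for v in hx)
        gamma = sum(r * v for r, v in zip(residuals, hx)) / denom if denom > 0 else 0.0
        # Step 2d: F_t = F_{t-1} + gamma_t * h_t.
        preds = [p + gamma * v for p, v in zip(preds, hx)]
        stages.append((gamma, h))

    # Step 3: the final model is F_0 plus all scaled weak learners.
    def predict(x):
        return f0 + sum(g * h(x) for g, h in stages)

    return predict


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.0, 5.1]
model = gradient_boost(xs, ys, n_rounds=20)
print([round(model(x), 2) for x in xs])
```

With squared-error loss and a least-squares stump, the line search in step 2c returns a multiplier of 1 up to rounding; it is kept here only to mirror the general algorithm, where other losses need a genuine search.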

## Pros and Cons

### Pros of Gradient Boosting

1. **High Performance**: Gradient boosting often produces highly accurate models and is a frequent component of winning entries in machine learning competitions.

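2. **Handles Different Types of Data**: With tree-based learners it works on numerical features without normalization or scaling, because tree splits depend only on the ordering of feature values. Categorical features are supported natively by some implementations and need encoding in others.

3. **Automatic Feature Selection**: Trees split only on features that reduce the loss, so uninformative features tend to be ignored. This can be advantageous with high-dimensional data.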

4. **Flexibility**: Can be used for both regression and classification tasks. It also allows for the optimization of different loss functions.

5. **Handles Non-linear Relationships**: Capable of capturing complex non-linear relationships between features and the target.

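6. **Robust to Outliers**: With a robust loss such as Huber or absolute error, boosting is less sensitive to outliers in the target than least-squares methods like linear regression. With squared-error loss it remains sensitive to them.

7. **Missing Values Handling**: Some implementations (XGBoost, for example) handle missing values natively; others require imputation first.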


### Cons of Gradient Boosting

1. **Overfitting Risk**: If not properly tuned, gradient boosting models can easily overfit, especially on small datasets.

2. **Computationally Intensive**: Trees are built sequentially, each depending on the previous ones, so training cannot be parallelized across trees and can be slow on large datasets.

3. **Parameter Tuning**: Requires careful tuning of several parameters like the number of trees, depth of trees, learning rate, etc., which can be time-consuming and complex.

4. **Difficult to Interpret**: The models, being an ensemble of trees, are not as interpretable as simpler models like linear regression.

5. **Less Effective on Sparse Data**: Compared to algorithms like linear models, gradient boosting might not perform as well with very sparse data.

6. **Memory Usage**: Can consume more memory than other models, which might be a limiting factor in some environments.

7. **Scalability Issues**: While there are scalable implementations, the algorithm itself is more challenging to scale for very large datasets compared to simpler models.


# XGBoost

## Details
XGBoost is an open-source software library providing a high-performance implementation of gradient boosted decision trees. Designed for efficiency, flexibility, and portability, XGBoost works well on large and complex datasets. It extends the standard gradient boosting method with several key improvements and optimizations.

### Objective Function

The objective function in XGBoost, which it tries to minimize, consists of two parts: the loss function and the regularization term. 

1. **Loss Function**: As in standard gradient boosting, this part measures the difference between the predicted and actual values. For $ n $ training instances, the training loss is:

   $$ \mathcal{L}(\Theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) $$

   Here, $ l(y_i, \hat{y}_i) $ is a differentiable convex loss function, $ y_i $ is the actual value, and $ \hat{y}_i $ is the predicted value for the $ i $-th instance.

2. **Regularization Term**: This term is added to reduce overfitting; classical gradient boosting has no explicit complexity penalty in its objective. It penalizes complex trees, i.e., trees with many leaves or large leaf weights.

   $$ \Omega(f) = \gamma T + \frac{1}{2} \lambda \| w \|^2 $$

   In this formula:
   - $ f $ represents a tree.
   - $ T $ is the number of leaves in the tree.
   - $ w $ is a vector of scores on the leaves.
   - $ \gamma $ and $ \lambda $ are regularization parameters (where $ \gamma $ penalizes the number of leaves, and $ \lambda $ penalizes the leaf weights).

The overall objective, summing the regularization term over every tree $ f_k $ in the ensemble, is therefore:

$$ \text{Obj}(\Theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k) $$
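
To see how this objective leads to a per-leaf score, here is a sketch of the standard second-order derivation. The notation is introduced for this sketch: $ \hat{y}_i^{(t-1)} $ is the prediction before the new tree $ f_t $ is added, $ g_i $ and $ h_i $ are the first and second derivatives of $ l(y_i, \hat{y}_i^{(t-1)}) $ with respect to that prediction, $ I_j $ is the set of instances falling in leaf $ j $, and $ G_j = \sum_{i \in I_j} g_i $, $ H_j = \sum_{i \in I_j} h_i $.

Expanding the loss to second order around the current prediction gives:

$$ \text{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t) $$

Dropping the terms that do not depend on $ f_t $ and grouping instances by leaf, where every instance in leaf $ j $ receives the weight $ w_j $:

$$ \text{Obj}^{(t)} \approx \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T $$

Each bracket is a quadratic in $ w_j $, so the optimal leaf weight and the resulting objective value are:

$$ w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T $$

The quantity $ \text{Obj}^* $ scores a fixed tree structure: the lower it is, the better the structure.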

### Tree Building and Pruning

XGBoost uses a greedy algorithm for tree building. For each split in each tree, the algorithm chooses the split that maximizes the gain. The gain is calculated as follows:

$$ \text{Gain} = \text{Loss}_{\text{before split}} - \text{Loss}_{\text{after split}} $$

Written in terms of gradient statistics, and including the penalty $ \gamma $ for the extra leaf that the split creates, the gain is:

$$ \text{Gain} = \frac{1}{2}\left[ \frac{(\sum_{i\in I_L} g_i)^2}{\sum_{i\in I_L} h_i + \lambda} + \frac{(\sum_{i\in I_R} g_i)^2}{\sum_{i\in I_R} h_i + \lambda} - \frac{(\sum_{i\in I} g_i)^2}{\sum_{i\in I} h_i + \lambda} \right] - \gamma $$

Where:
- $ I $ is the set of indices of the data points in the parent node.
- $ I_L $ and $ I_R $ are the indices of the data points in the left and right child nodes, respectively, after the split.
- $ g_i $ and $ h_i $ are the first and second derivatives of the loss function with respect to the prediction $ \hat{y}_i $, evaluated at the current prediction for instance $ i $.
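
The gain formula can be checked on a small example. The sketch below assumes squared-error loss $ l = \frac{1}{2}(y - \hat{y})^2 $, for which $ g_i = \hat{y}_i - y_i $ and $ h_i = 1 $; the function names and the toy data are hypothetical and only illustrate the arithmetic.

```python
def node_score(g_sum, h_sum, lam):
    """Score G^2 / (H + lambda) of a node, built from its gradient sums."""
    return g_sum ** 2 / (h_sum + lam)


def split_gain(g, h, left_idx, right_idx, lam=1.0, gamma=0.0):
    """Gain of splitting a parent node into the given left and right index sets."""
    g_left = sum(g[i] for i in left_idx)
    h_left = sum(h[i] for i in left_idx)
    g_right = sum(g[i] for i in right_idx)
    h_right = sum(h[i] for i in right_idx)
    parent = node_score(g_left + g_right, h_left + h_right, lam)
    children = node_score(g_left, h_left, lam) + node_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


# Four instances with targets 1, 1, 5, 5 and a current prediction of 3 for all.
y = [1.0, 1.0, 5.0, 5.0]
y_hat = [3.0, 3.0, 3.0, 3.0]
g = [p - t for p, t in zip(y_hat, y)]  # first derivatives for squared loss
h = [1.0] * len(y)                     # second derivatives for squared loss

good = split_gain(g, h, [0, 1], [2, 3], lam=1.0, gamma=1.0)  # separates the two groups
bad = split_gain(g, h, [0, 2], [1, 3], lam=1.0, gamma=1.0)   # mixes them
print(good, bad)
```

The split that separates the two target groups has a positive gain, while the split that mixes them gains nothing and ends up negative once $ \gamma $ is subtracted.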

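XGBoost grows each tree depth-first up to the maximum depth and then prunes it backward from the leaves, removing splits whose gain is negative. Because pruning happens after growth, a split with negative gain can survive if a split beneath it has positive gain.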
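
A minimal sketch of bottom-up pruning, assuming a hypothetical nested-dict tree in which every internal node stores the gain of its split (with $ \gamma $ already subtracted) and every node stores a weight. The collapse rule used here, removing a negative-gain split only once both of its children are leaves, is an assumption made for illustration and is not taken from XGBoost's source code.

```python
def is_leaf(node):
    """A node without a recorded split gain is treated as a leaf."""
    return "gain" not in node


def prune(node):
    """Return a pruned copy of the tree, working from the leaves upward."""
    if is_leaf(node):
        return node
    left = prune(node["left"])
    right = prune(node["right"])
    # Collapse this split only if it has negative gain and nothing useful remains below it.
    if node["gain"] < 0 and is_leaf(left) and is_leaf(right):
        return {"weight": node["weight"]}
    return {**node, "left": left, "right": right}


# The root has negative gain but is kept, because a positive-gain split sits beneath it.
kept = prune({
    "gain": -0.5, "weight": 0.0,
    "left": {"gain": 2.0, "weight": 0.4,
             "left": {"weight": 1.0}, "right": {"weight": -0.2}},
    "right": {"weight": -0.6},
})

# The negative-gain split under the root is collapsed into a leaf.
collapsed = prune({
    "gain": 1.0, "weight": 0.0,
    "left": {"gain": -0.3, "weight": 0.4,
             "left": {"weight": 0.5}, "right": {"weight": 0.3}},
    "right": {"weight": -0.6},
})
print(kept)
print(collapsed)
```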

### Approximate Greedy Algorithm

For large datasets, XGBoost uses an approximate algorithm for split finding. It proposes candidate split points based on percentiles of each feature's distribution. The algorithm then maps continuous feature values into buckets delimited by these candidate points, aggregates the gradient statistics per bucket, and evaluates splits only at the bucket boundaries.
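
The bucketing idea can be sketched as follows. This is a simplified illustration under stated assumptions: candidates are taken at plain, unweighted percentiles of one feature, whereas XGBoost's own quantile procedure is more elaborate; all function names are hypothetical, and the gain expression is the regularized gain from the tree-building section with gradient statistics $ g_i $ and $ h_i $.

```python
from bisect import bisect_right


def candidate_splits(values, n_buckets):
    """Propose split points at evenly spaced percentiles of one feature."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, n_buckets):
        cut = ordered[(k * n) // n_buckets]
        if not cuts or cut > cuts[-1]:  # skip duplicate candidates
            cuts.append(cut)
    return cuts


def bucket_stats(values, g, h, cuts):
    """Sum g and h per bucket; bucket b holds values v with cuts[b-1] <= v < cuts[b]."""
    g_sums = [0.0] * (len(cuts) + 1)
    h_sums = [0.0] * (len(cuts) + 1)
    for v, gi, hi in zip(values, g, h):
        b = bisect_right(cuts, v)
        g_sums[b] += gi
        h_sums[b] += hi
    return g_sums, h_sums


def best_bucket_split(g_sums, h_sums, cuts, lam=1.0, gamma=0.0):
    """Scan bucket boundaries only, instead of every distinct feature value."""
    g_total, h_total = sum(g_sums), sum(h_sums)
    best_cut, best_gain = None, float("-inf")
    g_left = h_left = 0.0
    for b, cut in enumerate(cuts):
        g_left += g_sums[b]  # everything below `cut` goes to the left child
        h_left += h_sums[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = 0.5 * (g_left ** 2 / (h_left + lam)
                      + g_right ** 2 / (h_right + lam)
                      - g_total ** 2 / (h_total + lam)) - gamma
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain


# Toy data: the target jumps from 0 to 10 at feature value 7; current prediction is 5.
values = list(range(1, 13))
g = [5.0 if v < 7 else -5.0 for v in values]  # squared loss: g = prediction - target
h = [1.0] * len(values)

cuts = candidate_splits(values, n_buckets=4)
g_sums, h_sums = bucket_stats(values, g, h, cuts)
best_cut, best_gain = best_bucket_split(g_sums, h_sums, cuts)
print(cuts, best_cut, best_gain)
```

Only three candidate thresholds are evaluated here instead of eleven, which is the source of the savings on large datasets; the cost is that the true best threshold may fall between candidates.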

## Differences Between XGBoost and Standard Gradient Boosting

The main differences between XGBoost (Extreme Gradient Boosting) and a traditional Gradient Boosting Decision Tree (GBDT) are:

1. **Regularization to Avoid Overfitting**:
   - XGBoost incorporates regularization terms in the objective function (a penalty on the number of leaves and an L2 penalty on leaf weights, plus an optional L1 penalty on leaf weights), which helps to avoid overfitting. As a result, XGBoost often generalizes better than traditional GBDT, whose objective has no explicit regularization term.

2. **Use of First and Second Order Derivatives**:
   - XGBoost uses a second-order Taylor expansion of the loss function, incorporating both first-order (gradient) and second-order (Hessian) derivatives. This can lead to faster convergence than GBDT, which typically uses only first-order derivatives.

3. **Support for Different Base Learners**:
   - While GBDT by definition uses decision trees as base learners, XGBoost is more flexible and also allows linear models as base learners. This extends the range of problems XGBoost can effectively tackle.

4. **Feature Subsampling**:
   - Similar to Random Forest, XGBoost introduces feature subsampling, which not only helps in preventing overfitting but also reduces computational requirements. This feature is not a standard part of traditional GBDT.

5. **Approximate Greedy Algorithm for Split Finding**:
   - Alongside exact greedy search, XGBoost implements an approximate greedy algorithm for split finding, which is more efficient than the exact greedy algorithm used in traditional GBDT, especially for large datasets. It reduces both computation time and memory usage.

6. **Handling of Missing Values and Sparse Data**:
   - XGBoost can automatically handle missing values and is designed to work well with sparse datasets, a feature that may not be as efficiently handled in traditional GBDT.

7. **Parallel Processing**:
   - XGBoost parallelizes work at the feature level during split finding. It sorts feature values once and stores them in a block structure that is reused in subsequent iterations, which allows the gain for each feature to be computed in parallel when splitting a node. Boosting itself remains sequential across trees; the parallelism lies within the construction of each tree. Traditional GBDT implementations might not offer this level of parallelism.

In summary, XGBoost extends the gradient boosting framework by incorporating features like regularization, use of second-order derivatives, more flexible base learners, feature subsampling, efficient algorithms for large datasets, enhanced handling of sparse data, and parallel processing capabilities. These advancements contribute to XGBoost's effectiveness, especially in large-scale and complex machine learning tasks.

## Code