In [1]:
%load_ext autoreload
%autoreload 2

## XGBoost Algorithm

XGBoost (Extreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm. It became popular in machine learning competitions for its predictive performance and training speed. Here is a high-level overview of how XGBoost works and its key features:

## How XGBoost Works

1. **Ensemble of Trees**: XGBoost builds an ensemble of decision trees in a sequential manner. Each tree tries to correct the mistakes of the previous ones.

2. **Gradient Boosting**: At its core, XGBoost uses gradient boosting: each new tree is fit to the residuals (more precisely, the negative gradients of the loss) of the current ensemble, and its output is added to the running prediction.

3. **Regularization**: Unlike traditional gradient boosting, XGBoost adds a regularization term to the objective: L1 and L2 penalties on the leaf weights, plus a penalty on the number of leaves. This helps reduce overfitting.

4. **Handling Missing Values**: XGBoost handles missing values automatically. During training, for each candidate split it tries sending the missing values left and right, and keeps the direction with the higher gain as that split's default direction. The same default direction is used at prediction time.

5. **Tree Pruning**: XGBoost grows each tree up to a maximum depth and then prunes it backward, removing splits whose reduction in the loss function falls below a threshold (the `gamma` parameter).

6. **Learning Rate (Shrinkage)**: Like other boosting methods, XGBoost scales each tree's contribution by a learning rate. A smaller rate makes each correction more conservative, which reduces overfitting but usually requires more trees.

7. **Parallel Processing**: XGBoost is designed to be efficient and can run on a single machine as well as in distributed environments. Trees are still built one after another; the parallelism is within each tree, where the search for the best split is spread across multiple CPU cores.

8. **Objective Function**: The objective function in XGBoost is composed of a loss function (dependent on the problem type) and a regularization term. The algorithm supports custom objective functions as well.

9. **Cross-validation**: XGBoost has a built-in cross-validation routine (`xgboost.cv`) that reports evaluation metrics at each iteration of the boosting process.
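
The regularization and pruning points above can be made concrete with the second-order formulas from the XGBoost objective. The following is a minimal sketch, not XGBoost's actual code: the function names are hypothetical, only the L2 penalty (`reg_lambda`) and the per-leaf penalty (`gamma`) are modeled, and the L1 penalty is left out for brevity. `grad_sum` and `hess_sum` stand for the sums of first and second derivatives of the loss over the samples in a node.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf weight w* = -G / (H + lambda); a larger lambda shrinks it toward 0."""
    return -grad_sum / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda=1.0):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node, minus the cost gamma of the extra leaf."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Toy node: gradients of opposite sign on each side, so the split separates them well.
print(split_gain(-2.0, 1.0, 2.0, 1.0, reg_lambda=1.0, gamma=0.0))
# The same split with a large gamma has negative gain and would be pruned.
print(split_gain(-2.0, 1.0, 2.0, 1.0, reg_lambda=1.0, gamma=3.0))
```

A split is worth keeping only when its gain is positive, which is how `gamma` acts as the pruning threshold.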

### Simplified XGBoost Classifier Class

The `SimplifiedXGBoostClassifier` class implemented below is a basic sketch of the core boosting loop behind XGBoost, restricted to binary classification. The class itself adds none of the regularization, second-order, or missing-value machinery described above. It includes the following functionality:

- **Binary Classification Support**: It is designed to handle binary classification tasks, such as email spam detection.

- **Sequential Tree Building**: The class builds decision trees sequentially, where each tree learns from the mistakes (residuals) of all trees before it.

- **Learning Rate**: Incorporates a learning rate to scale the contribution of each tree.

- **Logistic Loss for Pseudo-Residuals**: Computes pseudo-residuals as the negative gradient of the logistic loss, `y - sigmoid(F)`, so each tree is fit to the gap between the labels and the current predicted probabilities.

- **Predictions and Probabilities**: It can output class labels for predictions and also provide the probability scores for belonging to the positive class.

- **Custom CART Tree Usage**: Utilizes a custom CART implementation for tree building, allowing flexibility in modifying the tree construction process.
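
To see why `y - sigmoid(F)` is the right pseudo-residual for the logistic loss, the short check below compares it against a finite-difference estimate of the negative gradient. It is a standalone sketch using only the standard library; the helper names are illustrative and are not part of the class.

```python
import math


def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))


def log_loss(y, f):
    """Logistic loss for a label y in {0, 1} and a raw score f (log-odds)."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def pseudo_residual(y, f):
    """Analytic negative gradient of log_loss with respect to f."""
    return y - sigmoid(f)


def numeric_negative_gradient(y, f, eps=1e-6):
    """Central finite-difference estimate of -d(log_loss)/df."""
    return -(log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)


for y, f in [(1, 0.3), (0, 0.3), (1, -2.0), (0, 1.5)]:
    print(y, f, pseudo_residual(y, f), numeric_negative_gradient(y, f))
```

The two columns agree up to numerical error. Real XGBoost also uses the second derivative, `p * (1 - p)` for this loss, which the simplified class does not.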


In [2]:
%%writefile ../../src/models/xgboost.py
import numpy as np
from src.models.cart import CART

class SimplifiedXGBoostClassifier:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=2):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.trees = []
        self.initial_prediction = 0.0  # Initial prediction will be updated to log(odds)

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

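    def _log_odds(self, p):
        # Clip p away from 0 and 1 so a single-class y does not give infinite log-odds
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return np.log(p / (1 - p))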

    def fit(self, X, y):
        # Map labels to {0, 1}: label 1 is the positive class, any other value is negative
        y = (y == 1).astype(int)
        
        # Start with an initial prediction of log(odds)
        p = np.mean(y)
        self.initial_prediction = self._log_odds(p)
        F_m = np.full(len(y), self.initial_prediction)
        
        for _ in range(self.n_estimators):
            # Compute pseudo-residuals as the negative gradient of the logistic loss
            preds = self._sigmoid(F_m)
            residuals = y - preds
            
            # Fit a CART to the pseudo-residuals
            tree = CART(max_depth=self.max_depth, min_samples_split=self.min_samples_split, criterion='mse')
            tree.fit(X, residuals)
            self.trees.append(tree)
            
            # Update model predictions
            update_preds = tree.predict(X)
            F_m += self.learning_rate * update_preds
            
    def predict_proba(self, X):
        # Aggregate predictions from all trees
        F_m = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            F_m += self.learning_rate * tree.predict(X)
        
        # Convert to probabilities
        probs = self._sigmoid(F_m)
        return np.vstack((1 - probs, probs)).T

    def predict(self, X):
        proba = self.predict_proba(X)
        # Convert probabilities to class labels
        return (proba[:, 1] >= 0.5).astype(int)


Overwriting ../../src/models/xgboost.py
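
The additive update in `fit` and the choice of log-odds as the starting score can be illustrated without any trees. The sketch below is a deliberately reduced stand-in, not the class above: the "tree" is replaced by a constant that predicts the mean pseudo-residual, and the function name is hypothetical. Starting from a score of 0, the update slowly walks toward the log-odds of the base rate; starting at the log-odds, there is nothing left for a constant model to correct.

```python
import math


def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))


def boost_constant_model(y, n_rounds, learning_rate, initial_score=0.0):
    """Run the additive update with a constant 'tree' that predicts the mean residual."""
    score = initial_score
    for _ in range(n_rounds):
        mean_residual = sum(label - sigmoid(score) for label in y) / len(y)
        score += learning_rate * mean_residual
    return score


y = [1, 0, 0, 0]
base_rate = sum(y) / len(y)
log_odds = math.log(base_rate / (1 - base_rate))

score_from_zero = boost_constant_model(y, n_rounds=1000, learning_rate=0.1)
score_from_log_odds = boost_constant_model(
    y, n_rounds=1, learning_rate=0.1, initial_score=log_odds
)
print(log_odds, score_from_zero, score_from_log_odds)
```

Initializing at the log-odds therefore saves the boosting rounds that would otherwise be spent learning the class balance, leaving the trees to model how the features shift the score.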


In [3]:
import numpy as np

from src.data.load_dataset import load_spambase
from src.models.xgboost import SimplifiedXGBoostClassifier

from sklearn.model_selection import train_test_split

In [8]:
X, y = load_spambase()
# Split the dataset into training+validation and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Further split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, stratify=y_train_val, random_state=42) # 0.25 x 0.8 = 0.2

X_train.shape, X_val.shape, X_test.shape

((2760, 57), (920, 57), (921, 57))
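
The printed shapes can be reproduced by hand. The sketch below assumes the held-out count is rounded up and the remainder goes to the other side, which is consistent with the output above; `split_sizes` is a hypothetical helper, not a scikit-learn function, and the total is simply the sum of the three printed row counts.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_remaining, n_held_out), rounding the held-out count up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 2760 + 920 + 921                               # rows across all three splits
n_train_val, n_test = split_sizes(n_total, 0.2)          # first split: 20% test
n_train, n_val = split_sizes(n_train_val, 0.25)          # second split: 25% of the rest
print(n_train, n_val, n_test)
```

The result is roughly a 60/20/20 split into training, validation, and test sets.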

In [9]:
model = SimplifiedXGBoostClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, min_samples_split=2)
model.fit(X_train, y_train)

In [10]:
y_pred = model.predict(X_val)
print(f'Accuracy: {np.mean(y_val == y_pred):.2f}')

Accuracy: 0.92
