# XGBoost
* XGBoost (Extreme Gradient Boosting) is an optimized and efficient gradient boosting framework. It follows the usual Gradient Boosting process, but includes several key concepts to improve model performance, speed, and scalability.
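    * Regularization: XGBoost adds L1 (`reg_alpha`) and L2 (`reg_lambda`) penalties on the leaf weights to the objective function, plus a minimum loss reduction required to make a split (`gamma`). Together these control tree complexity and help prevent overfitting
    * Loss Functions: In addition to its built-in objectives, XGBoost lets users supply a custom loss function, as long as its first and second derivatives (gradient and Hessian) with respect to the prediction can be computed
    * Tree Construction: XGBoost's `hist` tree method uses a histogram-based approach to find the best splits. Feature values are bucketed into a fixed number of bins, gradient statistics are accumulated per bin, and only the bin boundaries are evaluated as split candidates, which is much cheaper than scanning every distinct value. Sparse data is handled by a sparsity-aware algorithm that iterates only over the non-missing entries of each feature
    * Missing Values: XGBoost handles missing values during training. At each split it learns a default direction (left or right) for rows where the feature is missing, choosing whichever direction reduces the loss more on the training data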
    * Scalability: Parallel and distributed computing is supported, making it efficient for larger datasets
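    * Categorical Features: With `enable_categorical=True` (and columns marked as categorical, e.g. the pandas `category` dtype), XGBoost can split on categories directly, so there is no need to do One-Hot Encoding beforehand. Without that setting, categorical columns must be converted to numbers by the user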
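
To make the regularization and missing-value points above more concrete, here is a minimal pure-Python sketch of regularized split finding on a single feature. It is not XGBoost's actual implementation: the function names (`leaf_score`, `split_gain`, `best_split`) are hypothetical, only the L2 term (`lam`) and the per-split penalty (`gamma`) are modeled, and it scans sorted values exactly instead of using histogram bins (the histogram method would accumulate the same sums over bins).

```python
# Simplified sketch, assuming squared-error-style gradient/Hessian inputs.
# All names here are illustrative, not part of the xgboost API.

def leaf_score(G, H, lam):
    """Score of a leaf with gradient sum G and Hessian sum H (L2 penalty lam)."""
    return G * G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Loss reduction from splitting a node into left and right children."""
    parent = leaf_score(GL + GR, HL + HR, lam)
    children = leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
    return 0.5 * (children - parent) - gamma

def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Return (gain, threshold, missing_goes_left) for one feature.

    `values` may contain None for missing entries. Returns (0.0, None, None)
    when no split has positive gain.
    """
    present = sorted(
        (v, g, h) for v, g, h in zip(values, grads, hess) if v is not None
    )
    G_miss = sum(g for v, g in zip(values, grads) if v is None)
    H_miss = sum(h for v, h in zip(values, hess) if v is None)
    G_tot, H_tot = sum(grads), sum(hess)

    best = (0.0, None, None)
    GL = HL = 0.0
    for i in range(len(present) - 1):
        v, g, h = present[i]
        GL += g
        HL += h
        nxt = present[i + 1][0]
        if nxt == v:
            continue  # no threshold fits between equal values
        threshold = (v + nxt) / 2
        # Try sending the missing rows right, then left; keep the better one.
        for missing_left in (False, True):
            gl = GL + (G_miss if missing_left else 0.0)
            hl = HL + (H_miss if missing_left else 0.0)
            gain = split_gain(gl, hl, G_tot - gl, H_tot - hl, lam, gamma)
            if gain > best[0]:
                best = (gain, threshold, missing_left)
    return best

# Toy data: squared error with an initial prediction of 0, so grad = -y, hess = 1.
values = [1.0, 2.0, 3.0, 4.0, None]
targets = [1.0, 1.0, 5.0, 5.0, 5.0]
grads = [-t for t in targets]
hess = [1.0] * len(targets)

gain, threshold, missing_left = best_split(values, grads, hess)
print(gain, threshold, missing_left)
```

On this toy data the row with the missing value has a target that resembles the high-valued rows, so the best split sends missing values to the right. Raising `gamma` enough makes every candidate gain non-positive, in which case no split is made.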

XGBoost is widely regarded as one of the most effective machine learning algorithms, particularly for tabular data. There are other packages that use the Gradient Boosting framework and are worth checking out (e.g. LightGBM, CatBoost). LightGBM focuses on training speed and memory usage, while CatBoost (Categorical Boosting) handles categorical features more effectively by using a permutation-based approach that reduces the overfitting caused by encoding categories with their own target values.
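
The permutation-based idea can be sketched in a few lines of plain Python. This is a simplified illustration of ordered target statistics, not CatBoost's implementation: the function name `ordered_target_stats` and the parameters `prior` and `a` are assumptions made for this example. Each row's category is encoded using only the targets of rows that come earlier in a random permutation, so a row never sees its own target.

```python
import random

def ordered_target_stats(categories, targets, prior, a=1.0, seed=0):
    """Encode categories using targets of earlier rows in a random permutation.

    Hypothetical helper for illustration only. `prior` is the value used when
    no earlier row shares the category, and `a` controls how strongly the
    encoding is pulled toward that prior.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Only rows already visited contribute, so the row's own target is excluded.
        encoded[idx] = (s + a * prior) / (n + a)
        sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded

categories = ["a", "b", "a", "a"]
targets = [1.0, 0.0, 1.0, 0.0]
encoded = ordered_target_stats(categories, targets, prior=0.5)
print(encoded)
```

The single `"b"` row always receives the prior, because no earlier row shares its category, regardless of the permutation.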

In [1]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [2]:
# Load the dataset
california_housing = fetch_california_housing()
X = california_housing.data
y = california_housing.target

In [3]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
# Create an XGBoost regressor
model = xgb.XGBRegressor(random_state=42)

# Define the hyperparameter grid to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Perform grid search cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Get the best parameters from the grid search
best_params = grid_search.best_params_

# GridSearchCV refits the best configuration on the full training set
# by default (refit=True), so the fitted model can be reused directly
best_model = grid_search.best_estimator_

print(f"Best Parameters: {best_params}")

# Make predictions on the test set
y_pred = best_model.predict(X_test)

print(f"Training R-Squared: {best_model.score(X_train, y_train)}")
print(f"Testing R-Squared: {best_model.score(X_test, y_test)}")
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

Best Parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 200, 'subsample': 1.0}
Training R-Squared: 0.9280429184163838
Testing R-Squared: 0.8436781914859918
Mean Squared Error: 0.20484550137161106
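
The Mean Squared Error is easier to interpret after taking its square root, which puts the error back in the units of the target. The short sketch below copies the MSE from the output above and relies on the fact that the California housing target is expressed in units of $100,000; the variable names are only illustrative.

```python
import math

# Test-set MSE copied from the printed output above
mse = 0.20484550137161106

# RMSE is in the same units as the target (hundreds of thousands of dollars)
rmse = math.sqrt(mse)
rmse_dollars = rmse * 100_000

print(f"RMSE: {rmse:.4f} (about ${rmse_dollars:,.0f})")
```

In other words, the tuned model's typical prediction error on the test set is roughly 0.45 target units, or about $45,000 in median house value.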
