# XGBoost Classifier

XGBoost (Extreme Gradient Boosting) is an optimized, open-source implementation of gradient-boosted decision trees. It is widely used for supervised learning tasks such as classification and regression. XGBoost builds an ensemble of decision trees sequentially, with each new tree fitted to correct the errors of the trees before it. Key features include:

- **Gradient Boosting**: Optimizes a loss function by adding weak learners (decision trees) iteratively.
- **Regularization**: Adds L1 (`reg_alpha`) and L2 (`reg_lambda`) penalties on leaf weights to reduce overfitting.
- **Parallelization**: Evaluates candidate splits in parallel across CPU threads, which speeds up training.
- **Handling Missing Values**: Learns a default direction at each split for samples with missing values, so no imputation is required.
- **Custom Objective Functions**: Allows users to define their own loss functions.

In this notebook, XGBoost is used to classify images into different categories based on extracted features. You can check the [documentation](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier) for more details.
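To make the "each tree corrects the previous ones" idea concrete, here is a minimal sketch of gradient boosting for squared-error regression, using one-split decision stumps on a single feature. It is a toy illustration written in plain Python, not how XGBoost is implemented: XGBoost also uses second-order gradient information, regularized tree growth, and many engineering optimizations. The function names `fit_stump` and `boost` and the toy data are made up for this example.

```python
# Toy gradient boosting: each round fits a stump to the current residuals.

def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    # Skip the largest value so the right-hand side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Return the fitted stumps and the training MSE after each boosting round."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return stumps, history


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
stumps, mse_history = boost(xs, ys)
print(f"MSE after round 1: {mse_history[0]:.3f}, after round 20: {mse_history[-1]:.3f}")
```

The training error shrinks round by round because every stump is fitted to what the ensemble still gets wrong. The learning rate trades off how quickly that happens against how many rounds are needed.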

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# imports and path setup
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

import numpy as np
import tqdm
from joblib import Parallel, delayed
from xgboost import XGBClassifier
from sklearn.utils import shuffle
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from userkits.features import *
from userkits.utils import *

## Load and Shuffle Data

In [3]:
# load data from train directories
X, y = load_train_data(data_dir='../train')
X, y = shuffle(X, y, random_state=42)

Loading train data: 100%|██████████| 42/42 [00:07<00:00,  5.57it/s]


## Transform Data and Add Features

The steps to include new features are detailed in (the file). You can find the definitions of currently included features there.

In [4]:
def extract_features(images):
    """Build one feature vector per image, processing images in parallel."""
    def process_image(img):
        # Concatenate the output of every feature function into one flat list.
        feats = []
        # add feature functions here
        feats.extend(color_histogram(img))
        feats.extend(lbp_texture_features(img))
        feats.extend(find_mean(img))
        feats.extend(find_stddev(img))
        return feats

    # n_jobs=-1 uses all available CPU cores
    features_list = Parallel(n_jobs=-1)(
        delayed(process_image)(img) for img in tqdm.tqdm(images, desc="Extracting features")
    )
    return np.array(features_list)
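Because `process_image` calls `feats.extend(...)` on each result, a feature function needs to take one image and return a flat sequence of numbers. The sketch below shows what an additional feature function could look like. The function `gradient_magnitude_stats` is hypothetical and is not part of `userkits.features`. It also assumes each image is a NumPy array of shape `(H, W)` or `(H, W, C)`, which the notebook does not state.

```python
import numpy as np


def gradient_magnitude_stats(img):
    """Hypothetical feature: mean and std of the grayscale gradient magnitude."""
    arr = np.asarray(img, dtype=np.float64)
    # Assumption: a third axis, if present, holds color channels; average them.
    gray = arr.mean(axis=2) if arr.ndim == 3 else arr
    gy, gx = np.gradient(gray)        # derivatives along rows, then along columns
    magnitude = np.hypot(gx, gy)
    # Return a flat list so the caller can use feats.extend(...).
    return [float(magnitude.mean()), float(magnitude.std())]


flat = np.full((8, 8, 3), 128, dtype=np.uint8)            # uniform image, no edges
ramp = np.tile(np.arange(8, dtype=np.float64), (8, 1))    # brightness rises left to right
flat_feats = gradient_magnitude_stats(flat)
ramp_feats = gradient_magnitude_stats(ramp)
print(flat_feats, ramp_feats)
```

Under those assumptions, it could be wired in with one more `feats.extend(gradient_magnitude_stats(img))` line inside `process_image`.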

In [5]:
X_features = extract_features(X)
X_features.shape

Extracting features: 100%|██████████| 17735/17735 [00:21<00:00, 822.13it/s]


(17735, 528)

In [6]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
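The encoder maps each class name to an integer and can map predictions back again, which is what the prediction step at the end of the notebook relies on. A small round-trip sketch on toy labels (these are not the notebook's classes):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["cat", "dog", "cat", "bird"])  # toy labels for illustration
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the unique labels in sorted order; the integer code is the index.
print(encoder.classes_)
print(encoded)

# inverse_transform recovers the original names from the integer codes.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

The same fitted encoder must be reused for decoding. Fitting a fresh one on a different label set could assign different integers to the same class.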

## Common Hyperparameters
- `learning_rate`: Shrinks the contribution of each new tree (alias `eta`). Lower values usually need more trees to reach the same fit. Default is 0.3.
- `max_depth`: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. Default is 6.
- `n_estimators`: Number of gradient boosted trees. Equivalent to the number of boosting rounds. Default is 100.
- `min_child_weight`: Minimum sum of instance weight (hessian) needed in a child. Used to control overfitting. Default is 1.
- `subsample`: Fraction of samples used for fitting the individual base learners. Default is 1.0.
- `colsample_bytree`: Fraction of features used for fitting individual trees. Default is 1.0.
- `gamma`: Minimum loss reduction required to make a further partition on a leaf node. Default is 0.

In [7]:
# Split the data and train the model
X_train, X_test, y_train, y_test = train_test_split(X_features, y_encoded, test_size=0.2)  # you can change test_size
clf = XGBClassifier()  # you can tune hyperparameters here
clf.fit(X_train, y_train)
print("Train Accuracy:", clf.score(X_train, y_train))
print("Test Accuracy:", clf.score(X_test, y_test))

Train Accuracy: 0.9988018043416972
Test Accuracy: 0.5996616859317734
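
The gap between train accuracy (about 0.999) and test accuracy (about 0.600) suggests the default model is overfitting. The split above is also unseeded, so these numbers will vary from run to run. One way to make comparisons between settings more reliable is a seeded, stratified split. The sketch below demonstrates this on synthetic data; `X_demo` and `y_demo` are stand-ins invented for the example, not the notebook's features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))               # synthetic stand-in for X_features
y_demo = np.repeat([0, 1, 2], [700, 200, 100])    # deliberately imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,     # keep class proportions the same in both splits
    random_state=42,     # fixed seed so the split is reproducible
)

# Per-class counts in the train and test splits.
print(np.bincount(y_tr), np.bincount(y_te))
```

In the notebook this would correspond to adding `stratify=y_encoded` and a `random_state` to the existing `train_test_split` call. Doing so changes the split, so the accuracies printed above would no longer match exactly.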


## Evaluate

In [8]:
# load eval data
X_eval, file_ids = load_eval_data("../eval_data")

Loading eval data: 100%|██████████| 1486/1486 [00:31<00:00, 46.52it/s]


In [9]:
X_eval_features = extract_features(X_eval)
eval_predictions = clf.predict(X_eval_features)
print(eval_predictions[:5])

Extracting features: 100%|██████████| 1486/1486 [00:49<00:00, 30.26it/s]


[31 18  0 16 36]


In [10]:
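# Map integer predictions back to the original class names.
try:
    preds = label_encoder.inverse_transform(eval_predictions)
except ValueError:
    # inverse_transform raises ValueError on labels it has not seen; fall back to raw integers
    preds = eval_predictions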

save_predictions(preds, file_ids, output_file='../output/xgboost_predictions.csv')

Saved ../output/xgboost_predictions.csv
