## Simple XGBoost Example

In this notebook, we show a very simple usage pattern for XGBoost. To run it, install the package into your Python environment with `pip install xgboost`.

author: Keith Chugg (chugg@usc.edu)

ChatGPT was used in the generation of this code.

In [1]:
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Breast Cancer Dataset

* Source: Wisconsin Diagnostic Breast Cancer (WDBC) dataset
* Task: Binary classification (malignant vs. benign breast cancer)
* Features: 30 numerical features computed from digitized images of fine needle aspirates of breast masses
* Samples: 569 instances
* Classes:
  * 0 = Malignant (cancerous)
  * 1 = Benign (non-cancerous)

More details:  https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset
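The label encoding listed above can be checked directly against the dataset object. The following is a minimal sketch; the variable name `class_counts` is introduced here for illustration and is not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is ordered by label: index 0 -> 'malignant', index 1 -> 'benign'
class_counts = np.bincount(data.target)
for label, (name, count) in enumerate(zip(data.target_names, class_counts)):
    print(f"{label} = {name}: {count} samples")
```

Note that the positive label (1) is the benign class here, which is the opposite of the convention used in many medical screening examples.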

In [2]:
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [3]:
print(data.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [4]:
print(f'X_train: shape: {X_train.shape}')
print(f'X_test: shape: {X_test.shape}\n')

print(f'y_train: shape: {y_train.shape}')
print(f'y_test: shape: {y_test.shape}\n')

print(f'Classes:  {set(y_train)}')
print(f'Class 1 (Benign) examples in train: {np.sum(y_train)} or {100 * np.mean(y_train):.2f}%')
print(f'Class 1 (Benign) examples in test: {np.sum(y_test)} or {100 * np.mean(y_test):.2f}%')

X_train: shape: (455, 30)
X_test: shape: (114, 30)

y_train: shape: (455,)
y_test: shape: (114,)

Classes:  {0, 1}
Class 1 (Benign) examples in train: 286 or 62.86%
Class 1 (Benign) examples in test: 71 or 62.28%
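The class proportions above come out close in train and test by chance of the random split. If you want that guaranteed, `train_test_split` accepts a `stratify` argument. The sketch below is one way to do it; the `_s` suffixed variable names are assumptions added for illustration so they do not overwrite the notebook's own split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y asks scikit-learn to preserve the class proportions in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Benign fraction overall:  {np.mean(y):.4f}")
print(f"Benign fraction in train: {np.mean(y_train_s):.4f}")
print(f"Benign fraction in test:  {np.mean(y_test_s):.4f}")
```

Because the stratified split selects different samples than the unstratified one, accuracies computed on it would not be directly comparable to the numbers reported in this notebook.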


In [5]:
# Create and train an XGBoost classifier
model = xgb.XGBClassifier(eval_metric="logloss")
model.fit(X_train, y_train)


In [6]:
# Make predictions
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")


Accuracy: 0.9561
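Accuracy is a single number and does not say which class the errors fall in. For a malignant-vs-benign task that distinction can matter, so a confusion matrix is a natural companion. The sketch below uses made-up labels purely for illustration (they are not results from this notebook); the same calls would apply to `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (0 = malignant, 1 = benign);
# in the notebook these would be y_test and y_pred.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes, in label order [0, 1]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

# Per-class recall: fraction of each true class that was predicted correctly
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(f"Accuracy: {accuracy_score(y_true_demo, y_pred_demo):.4f}")
print(f"Malignant recall: {recall_per_class[0]:.4f}")
print(f"Benign recall:    {recall_per_class[1]:.4f}")
```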


## Compare with Logistic Regression
Let's run a quick comparison using logistic regression...

In [7]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression model
lr_model = LogisticRegression(max_iter=10000)  # Increased iterations for convergence
lr_model.fit(X_train, y_train)

# Make predictions with the logistic regression model
y_pred = lr_model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy:.4f}")

Logistic Regression Accuracy: 0.9561
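One likely reason the cell above needs `max_iter=10000` is that the raw features have very different scales, which slows the solver's convergence. A common alternative is to standardize the features inside a pipeline. This is a sketch under that assumption, not part of the original comparison; the names `scaled_lr`, `lr_pred`, and `lr_accuracy` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fit on the training data only, then applied to the test data
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)

lr_pred = scaled_lr.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_pred)
print(f"Scaled Logistic Regression Accuracy: {lr_accuracy:.4f}")
```

Tree-based models such as XGBoost split on feature thresholds, so they generally do not need this scaling step.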


## Hyperparameter Optimization for XGBoost Using Optuna

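Let's use a package to optimize hyperparameters for XGBoost. Optuna is one such package; install it with `pip install optuna`.

You could also use grid search (e.g., `GridSearchCV` from sklearn), but Optuna is generally more efficient. Its default sampler, the Tree-structured Parzen Estimator (TPE), is a derivative-free (zero-order) method that uses the results of earlier trials to decide where to sample next, rather than exhaustively evaluating every combination.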
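To get a feel for why exhaustive grid search scales poorly, the sketch below counts the model fits a grid would need for the eight hyperparameters tuned in this notebook. The three candidate values per hyperparameter are hypothetical, chosen only for illustration; the 30 trials and 5 folds match the Optuna cell that follows.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: three candidate values per hyperparameter (illustration only)
param_grid = {
    "n_estimators": [50, 150, 300],
    "max_depth": [3, 6, 10],
    "learning_rate": [0.01, 0.05, 0.3],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "gamma": [0, 2.5, 5],
    "lambda": [1e-3, 1, 10],
    "alpha": [1e-3, 1, 10],
}

n_folds = 5
n_combinations = len(ParameterGrid(param_grid))
grid_fits = n_combinations * n_folds  # one fit per combination per fold

n_trials = 30
optuna_fits = n_trials * n_folds  # one fit per trial per fold

print(f"Grid combinations: {n_combinations}")
print(f"Model fits for grid search with {n_folds}-fold CV: {grid_fits}")
print(f"Model fits for {n_trials} Optuna trials with {n_folds}-fold CV: {optuna_fits}")
```

A dictionary like `param_grid` is what you would pass to `GridSearchCV`; even this coarse grid requires far more fits than a 30-trial study.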

In [10]:
import optuna
from sklearn.model_selection import cross_val_score



# Define the objective function for Optuna
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300, step=50),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "gamma": trial.suggest_float("gamma", 0, 5),
        "lambda": trial.suggest_float("lambda", 1e-3, 10),
        "alpha": trial.suggest_float("alpha", 1e-3, 10),
    }

    model = xgb.XGBClassifier(**params, eval_metric="logloss")
    # Use K-fold cross-validation (K=5) on the training set for model selection
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    return scores.mean()

# Run hyperparameter optimization
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)  # Run 30 trials

# Best hyperparameters
best_params = study.best_params
print('\n\nBest parameters:', best_params)

# Train final model using the best parameters
best_model = xgb.XGBClassifier(**best_params, eval_metric="logloss")
best_model.fit(X_train, y_train)

# Evaluate on the test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'\nOptimized XGBoost Accuracy: {accuracy:.4f}')

[I 2025-04-02 14:53:08,645] A new study created in memory with name: no-name-9cff65c8-0f09-4ac3-8df2-d818f14e1491
[I 2025-04-02 14:53:08,949] Trial 0 finished with value: 0.9494505494505494 and parameters: {'n_estimators': 300, 'max_depth': 6, 'learning_rate': 0.1646975871476703, 'subsample': 0.9732396819302107, 'colsample_bytree': 0.6071848655882724, 'gamma': 2.913471373403009, 'lambda': 0.9546463235629663, 'alpha': 4.624736085639256}. Best is trial 0 with value: 0.9494505494505494.
[I 2025-04-02 14:53:09,095] Trial 1 finished with value: 0.9362637362637363 and parameters: {'n_estimators': 50, 'max_depth': 8, 'learning_rate': 0.023144240677034304, 'subsample': 0.6546200536072991, 'colsample_bytree': 0.8248673875255169, 'gamma': 0.6242570370149325, 'lambda': 8.166998359684966, 'alpha': 6.640090849873428}. Best is trial 0 with value: 0.9494505494505494.
[I 2025-04-02 14:53:09,393] Trial 2 finished with value: 0.9582417582417582 and parameters: {'n_estimators': 250, 'max_depth': 3, 'lear



Best parameters: {'n_estimators': 200, 'max_depth': 10, 'learning_rate': 0.040459536751322404, 'subsample': 0.7428904006870367, 'colsample_bytree': 0.9088127698005426, 'gamma': 0.9932675506954631, 'lambda': 3.8568616614781113, 'alpha': 0.06132368540752009}

Optimized XGBoost Accuracy: 0.9561
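The objective function above scores each trial with `cross_val_score(..., cv=5)` on the training set only, so the test set never influences model selection. The sketch below spells out roughly what that call does, using an explicit fold loop. To keep it lightweight, it uses a `DecisionTreeClassifier` as a stand-in estimator (an assumption for this sketch, not the notebook's model); an `XGBClassifier` would slot in the same way.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Stand-in estimator (an assumption for this sketch); any classifier works the same way
estimator = DecisionTreeClassifier(max_depth=3, random_state=0)

# For a classifier and an integer cv, cross_val_score uses unshuffled StratifiedKFold
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in folds.split(X_train, y_train):
    fold_model = clone(estimator)  # fresh, unfitted copy for each fold
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    fold_pred = fold_model.predict(X_train[val_idx])
    manual_scores.append(accuracy_score(y_train[val_idx], fold_pred))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(estimator, X_train, y_train, cv=5, scoring="accuracy")
print(f"Manual fold accuracies:  {np.round(manual_scores, 4)}")
print(f"cross_val_score results: {np.round(library_scores, 4)}")
print(f"Mean accuracy: {manual_scores.mean():.4f}")
```

The mean of the five fold accuracies is the value each Optuna trial reports, which is why the trial values in the log differ from the final test-set accuracy.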
