# Module 3 - Exercise 5: XGBoost

<a href="https://colab.research.google.com/github/jumpingsphinx/jumpingsphinx.github.io/blob/main/notebooks/module3-trees/exercise5-xgboost.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Objectives

By the end of this exercise, you will be able to:

- Apply XGBoost for classification and regression
- Perform systematic hyperparameter tuning
- Implement early stopping and cross-validation
- Analyze feature importance (gain, weight, cover)
- Handle imbalanced data with scale_pos_weight
- Compare XGBoost with other algorithms
- Build complete ML pipelines

## Prerequisites

- Completion of Exercise 4 (Boosting)
- Understanding of gradient boosting
- Familiarity with hyperparameter tuning

## Setup

Run this cell first to import required libraries:

In [None]:
# Install XGBoost (required for Google Colab)
!pip install xgboost -q

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer, load_wine, fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_auc_score, roc_curve, mean_squared_error, r2_score
)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import xgboost as xgb
from xgboost import XGBClassifier, XGBRegressor
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print(f"XGBoost version: {xgb.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print("\nSetup complete!")

---

## Part 1: XGBoost Basics

### Background

**XGBoost** (eXtreme Gradient Boosting) is an optimized gradient boosting library and one of the most widely used algorithms for structured/tabular data in machine learning competitions.

**Key advantages:**
- Speed: Highly optimized C++ backend
- Performance: Often achieves state-of-the-art results
- Regularization: Built-in L1 and L2 regularization
- Missing values: Learns a default split direction for missing entries, so no imputation is required
- Parallel processing: Tree construction is parallelized
- Built-in cross-validation via `xgb.cv`

**Core hyperparameters:**
- `n_estimators`: Number of boosting rounds (trees)
- `max_depth`: Maximum tree depth
- `learning_rate` (eta): Step size shrinkage (0.01-0.3)
- `subsample`: Fraction of samples for each tree (0.5-1.0)
- `colsample_bytree`: Fraction of features for each tree (0.5-1.0)
- `gamma`: Minimum loss reduction for split
- `reg_alpha`: L1 regularization
- `reg_lambda`: L2 regularization
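
As a rough illustration of how the hyperparameters above fit together, the sketch below collects them in a plain dictionary. The values are illustrative starting points chosen for this example, not tuned recommendations, and `starter_params` is a hypothetical name.

```python
# Illustrative starting values only (assumptions, not tuned results).
starter_params = {
    "n_estimators": 200,      # number of boosting rounds (trees)
    "max_depth": 4,           # maximum depth of each tree
    "learning_rate": 0.1,     # eta: shrinkage applied to each tree's contribution
    "subsample": 0.8,         # fraction of rows sampled for each tree
    "colsample_bytree": 0.8,  # fraction of features sampled for each tree
    "gamma": 0.0,             # minimum loss reduction required to make a split
    "reg_alpha": 0.0,         # L1 regularization
    "reg_lambda": 1.0,        # L2 regularization
}

# With xgboost installed, the dictionary can be unpacked into the wrapper:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**starter_params, random_state=42)
for name, value in starter_params.items():
    print(f"{name:>18}: {value}")
```

Keeping the settings in one dictionary makes it easy to override a single value while experimenting, for example `{**starter_params, "max_depth": 6}`.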

### Exercise 1.1: Simple Classification with XGBoost

**Task:** Train XGBoost on the Breast Cancer dataset with default parameters.

In [None]:
# Load Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer)

# Create and train XGBoost classifier
xgb_basic = XGBClassifier(random_state=42, eval_metric='logloss')
xgb_basic.fit(X_train, y_train)
y_pred_train = xgb_basic.predict(X_train)
y_pred_test = xgb_basic.predict(X_test)
train_acc = accuracy_score(y_train, y_pred_train)
test_acc = accuracy_score(y_test, y_pred_test)
print(f"Training Accuracy: {train_acc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")

---

## Part 2: Early Stopping

### Background

**Early stopping** monitors performance on a validation set and stops training when performance stops improving. This prevents overfitting and saves computation time.

**Key parameter:**
- `early_stopping_rounds`: Stop if no improvement for N rounds

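**Usage** (XGBoost 1.6 and later take `early_stopping_rounds` in the constructor; passing it to `fit()` was deprecated and then removed in 2.0):
```python
model = XGBClassifier(n_estimators=500, early_stopping_rounds=10)
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)])
```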
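
To make the stopping rule concrete, here is a simplified sketch of the patience logic in plain Python. It is not XGBoost's implementation; the function name `best_round_with_patience` and the loss values are made up for illustration.

```python
def best_round_with_patience(val_losses, patience):
    """Return (best_round, stopped_round) for a sequence of validation losses.

    Training stops once `patience` rounds have passed without a new best loss.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and keep training.
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, round_idx
    # Ran out of rounds without triggering the stopping rule.
    return best_round, len(val_losses) - 1


# Hypothetical validation log loss per boosting round.
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.36, 0.39, 0.41]
best, stopped = best_round_with_patience(losses, patience=3)
print(f"best round: {best}, stopped at round: {stopped}")
```

In this toy run the best loss occurs at round 3 and training halts three rounds later, so the later, worse rounds are never trained.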

### Exercise 2.1: Implement Early Stopping

**Task:** Use early stopping to find the optimal number of trees.

In [None]:
# Create train/validation/test split
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=42, stratify=y_train_full
)

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
print(f"Test set: {X_test.shape}")
print()

# Your code here: Train XGBoost with early stopping
xgb_early = XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss',
    early_stopping_rounds=20
)

xgb_early.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=False
)
print(f"\nBest iteration: {xgb_early.best_iteration}")
print(f"Best score: {xgb_early.best_score:.4f}")
print()

# Retrieve evaluation results
results = xgb_early.evals_result()

# Plot training curves
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(results['validation_0']['logloss'], label='Training')
plt.plot(results['validation_1']['logloss'], label='Validation')
plt.axvline(x=xgb_early.best_iteration, color='r', linestyle='--', 
           label=f'Best iteration ({xgb_early.best_iteration})')
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.title('Training vs Validation Loss')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
# Calculate accuracy at sampled boosting rounds
# (refit only at the sampled points instead of at every round)
train_accs = []
val_accs = []
for i in range(len(results['validation_0']['logloss'])):
    if i % 20 == 0 or i < 10:  # Sample points to speed up
        xgb_temp = XGBClassifier(
            n_estimators=i+1,
            learning_rate=0.1,
            random_state=42,
            eval_metric='logloss'
        )
        xgb_temp.fit(X_train, y_train, verbose=False)
        train_accs.append((i, accuracy_score(y_train, xgb_temp.predict(X_train))))
        val_accs.append((i, accuracy_score(y_val, xgb_temp.predict(X_val))))

train_iters, train_acc_vals = zip(*train_accs) if train_accs else ([], [])
val_iters, val_acc_vals = zip(*val_accs) if val_accs else ([], [])

plt.plot(train_iters, train_acc_vals, 'o-', label='Training', alpha=0.7)
plt.plot(val_iters, val_acc_vals, 's-', label='Validation', alpha=0.7)
plt.axvline(x=xgb_early.best_iteration, color='r', linestyle='--',
           label=f'Best iteration ({xgb_early.best_iteration})')
plt.xlabel('Boosting Round')
plt.ylabel('Accuracy')
plt.title('Training vs Validation Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Evaluate on test set
test_acc = accuracy_score(y_test, xgb_early.predict(X_test))
print(f"Test Accuracy: {test_acc:.4f}")
print("\n✓ Early stopping prevents overfitting!")

---

## Part 3: Hyperparameter Tuning

### Background

Systematic hyperparameter tuning is crucial for XGBoost performance.

**Tuning strategy:**
1. Fix learning_rate at 0.1
2. Tune tree-specific parameters (max_depth, min_child_weight)
3. Tune sampling parameters (subsample, colsample_bytree)
4. Tune regularization (gamma, reg_alpha, reg_lambda)
5. Lower learning_rate and increase n_estimators
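
One reason to tune in stages is cost. The sketch below counts model fits for an exhaustive search versus a staged search over steps 2-4. The grid mirrors the one used in Exercise 3.1; the grouping into `stages` is an assumption made for this example.

```python
from math import prod

# Candidate values per parameter (five parameters, three values each).
grid = {
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3, 5],
    "subsample": [0.7, 0.8, 1.0],
    "colsample_bytree": [0.7, 0.8, 1.0],
    "gamma": [0, 0.1, 0.3],
}
cv_folds = 3

# Exhaustive search: every combination is fitted once per fold.
full_fits = prod(len(values) for values in grid.values()) * cv_folds

# Staged search: tune one group at a time while holding the others fixed.
stages = [
    ["max_depth", "min_child_weight"],
    ["subsample", "colsample_bytree"],
    ["gamma"],
]
staged_fits = sum(prod(len(grid[p]) for p in stage) for stage in stages) * cv_folds

print(f"full grid fits:   {full_fits}")
print(f"staged grid fits: {staged_fits}")
```

The staged search is far cheaper, but it can miss interactions between parameters in different groups, so it is a heuristic rather than a guarantee of finding the best combination.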

### Exercise 3.1: Grid Search for Hyperparameters

**Task:** Perform systematic grid search on key XGBoost parameters.

In [None]:
# Define parameter grid
param_grid = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0],
    'gamma': [0, 0.1, 0.3]
}

print("Hyperparameter Grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")
print(f"\nTotal combinations: {np.prod([len(v) for v in param_grid.values()])}")
print("\nThis will take a few minutes...\n")

# Your code here: Perform grid search
xgb_grid = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss'
)

grid_search = GridSearchCV(
    xgb_grid,
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_full, y_train_full)
print("\nBest parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")
print(f"\nBest CV score: {grid_search.best_score_:.4f}")

# Evaluate best model on test set
best_xgb = grid_search.best_estimator_
test_acc = accuracy_score(y_test, best_xgb.predict(X_test))
test_auc = roc_auc_score(y_test, best_xgb.predict_proba(X_test)[:, 1])

print(f"\nTest Accuracy: {test_acc:.4f}")
print(f"Test AUC-ROC: {test_auc:.4f}")

# Visualize top parameter combinations
results_df = pd.DataFrame(grid_search.cv_results_)
results_df = results_df.sort_values('rank_test_score')

print("\nTop 5 parameter combinations:")
print(results_df[['params', 'mean_test_score', 'std_test_score']].head())

print("\n✓ Hyperparameter tuning complete!")

---

## Part 4: Feature Importance Types

### Background

XGBoost provides three types of feature importance:

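1. **Weight (Frequency)**: Number of times a feature is used to split across all trees
2. **Gain**: Average improvement in loss from splits on the feature
3. **Cover**: Average coverage of splits on the feature, measured as the sum of second-order gradients (Hessians) of the samples reaching the split; roughly, how many samples the splits affect

**Best practice:** Prefer 'gain' when ranking features by predictive contribution; 'weight' tends to favor features with many distinct values because they offer more split points.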
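
The three definitions can be illustrated on a toy list of splits. This is a hand-built sketch, not output from XGBoost: the `splits` records and their gain and cover numbers are invented for the example.

```python
from collections import defaultdict

# Hypothetical split records: (feature, gain, cover) for each split in a toy model.
splits = [
    ("proline", 12.0, 100.0),
    ("proline", 4.0, 40.0),
    ("flavanoids", 9.0, 60.0),
    ("hue", 1.0, 10.0),
    ("hue", 2.0, 20.0),
    ("hue", 3.0, 30.0),
]

weight = defaultdict(int)         # number of splits using the feature
total_gain = defaultdict(float)   # summed loss improvement
total_cover = defaultdict(float)  # summed coverage
for feature, gain, cover in splits:
    weight[feature] += 1
    total_gain[feature] += gain
    total_cover[feature] += cover

# 'gain' and 'cover' are averages over the splits that use each feature.
avg_gain = {f: total_gain[f] / weight[f] for f in weight}
avg_cover = {f: total_cover[f] / weight[f] for f in weight}

for f in weight:
    print(f"{f:<12} weight={weight[f]} gain={avg_gain[f]:.1f} cover={avg_cover[f]:.1f}")
```

Here `hue` ranks first by weight but last by gain, which is why the three metrics can produce different feature rankings for the same model.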

### Exercise 4.1: Compare Feature Importance Metrics

**Task:** Analyze feature importance using all three metrics on Wine dataset.

In [None]:
# Load Wine dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

print("Wine Dataset:")
print(f"Shape: {X_wine.shape}")
print(f"Classes: {wine.target_names}")
print(f"Features: {wine.feature_names}")
print()

# Split data
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
)

# Train XGBoost
xgb_wine = XGBClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    random_state=42,
    eval_metric='mlogloss'  # For multiclass
)

xgb_wine.fit(X_train_wine, y_train_wine)

# Get feature importance with different metrics
importance_types = ['weight', 'gain', 'cover']

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for idx, importance_type in enumerate(importance_types):
    # Get importance
    importance = xgb_wine.get_booster().get_score(importance_type=importance_type)
    
    # Convert to feature names and sort
    feature_importance = {}
    for key, value in importance.items():
        feature_idx = int(key[1:])  # Remove 'f' prefix
        feature_importance[wine.feature_names[feature_idx]] = value
    
    # Sort by importance
    sorted_features = sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)
    features, importances = zip(*sorted_features)
    
    # Plot
    axes[idx].barh(range(len(features)), importances, alpha=0.7)
    axes[idx].set_yticks(range(len(features)))
    axes[idx].set_yticklabels(features)
    axes[idx].set_xlabel(f'{importance_type.capitalize()} Importance')
    axes[idx].set_title(f'Feature Importance: {importance_type.upper()}')
    axes[idx].invert_yaxis()
    axes[idx].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

# Print top features for each metric
print("Top 5 features by importance type:\n")
for importance_type in importance_types:
    importance = xgb_wine.get_booster().get_score(importance_type=importance_type)
    feature_importance = {}
    for key, value in importance.items():
        feature_idx = int(key[1:])
        feature_importance[wine.feature_names[feature_idx]] = value
    
    sorted_features = sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)[:5]
    
    print(f"{importance_type.upper()}:")
    for i, (feature, score) in enumerate(sorted_features, 1):
        print(f"  {i}. {feature}: {score:.2f}")
    print()

# Evaluate model
test_acc = accuracy_score(y_test_wine, xgb_wine.predict(X_test_wine))
print(f"Test Accuracy: {test_acc:.4f}")
print("\n✓ Feature importance analysis complete!")

---

## Part 5: XGBoost Regression

### Background

XGBoost excels at regression tasks as well as classification.

**Key differences for regression:**
- Use `XGBRegressor` instead of `XGBClassifier`
- Evaluation metrics: RMSE, MAE, R²
- Objective function: `reg:squarederror` (default)
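
For reference, the three evaluation metrics listed above can be computed by hand. The sketch below uses only the standard library; `regression_metrics` is a hypothetical helper and the numbers are toy values.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Toy targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

RMSE penalizes large errors more heavily than MAE, while R² reports the share of target variance the predictions explain.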

### Exercise 5.1: XGBoost for California Housing

**Task:** Apply XGBoost regression with hyperparameter tuning.

In [None]:
# Load Housing Data
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(X_h, y_h, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_h_scaled = scaler.fit_transform(X_train_h)
X_test_h_scaled = scaler.transform(X_test_h)

# Train Baseline
xgb_reg_baseline = XGBRegressor(random_state=42)
xgb_reg_baseline.fit(X_train_h_scaled, y_train_h)

# Predict
y_pred_train = xgb_reg_baseline.predict(X_train_h_scaled)
y_pred_test = xgb_reg_baseline.predict(X_test_h_scaled)

# Evaluate
train_mse = mean_squared_error(y_train_h, y_pred_train)
test_mse = mean_squared_error(y_test_h, y_pred_test)
test_r2 = r2_score(y_test_h, y_pred_test)

print("Baseline XGBoost Regression:")
print(f"Train MSE: {train_mse:.4f}")
print(f"Test MSE:  {test_mse:.4f}")
print(f"Test R²:   {test_r2:.4f}")

---

## Part 6: Imbalanced Classification

### Background

Real-world datasets are often imbalanced (e.g., fraud detection, disease diagnosis).

**XGBoost solutions:**
1. `scale_pos_weight`: Weight for positive class = (count_negative / count_positive)
2. Custom evaluation metrics (AUC-ROC, F1, Precision-Recall)
3. Threshold adjustment
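
The first and third options can be sketched without any model. The example below computes `scale_pos_weight` from label counts and shows how lowering the decision threshold changes recall on the positive class. The helper names, labels and probabilities are all hypothetical.

```python
def scale_pos_weight_from_labels(labels):
    """Ratio of negative to positive samples (labels are 0 or 1)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    return n_neg / n_pos

def recall_at_threshold(labels, probabilities, threshold):
    """Fraction of positives whose predicted probability reaches the threshold."""
    true_pos = sum(1 for y, p in zip(labels, probabilities) if y == 1 and p >= threshold)
    n_pos = sum(1 for y in labels if y == 1)
    return true_pos / n_pos

# Toy data: 18 negatives, 2 positives, with made-up predicted probabilities.
labels = [0] * 18 + [1] * 2
probabilities = [0.1] * 18 + [0.4, 0.8]

spw = scale_pos_weight_from_labels(labels)
recall_default = recall_at_threshold(labels, probabilities, threshold=0.5)
recall_lowered = recall_at_threshold(labels, probabilities, threshold=0.3)

print(f"scale_pos_weight: {spw:.1f}")
print(f"recall at 0.5: {recall_default:.2f}")
print(f"recall at 0.3: {recall_lowered:.2f}")
```

Lowering the threshold recovers the positive that scored 0.4, at the usual cost of admitting more false positives on real data.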

### Exercise 6.1: Handle Imbalanced Data

**Task:** Create imbalanced dataset and use scale_pos_weight.

In [None]:
# Create imbalanced dataset from Breast Cancer
# Keep all malignant (class 0) and subsample benign (class 1) so that benign becomes the minority (about 9:1)
np.random.seed(42)

malignant_idx = np.where(y_cancer == 0)[0]
benign_idx = np.where(y_cancer == 1)[0]

# Keep all malignant and subsample benign to create 1:9 imbalance
n_benign_keep = len(malignant_idx) // 9
benign_idx_sample = np.random.choice(benign_idx, size=n_benign_keep, replace=False)

# Combine indices
imbalanced_idx = np.concatenate([malignant_idx, benign_idx_sample])
np.random.shuffle(imbalanced_idx)

X_imbalanced = X_cancer[imbalanced_idx]
y_imbalanced = y_cancer[imbalanced_idx]

print("Imbalanced Dataset:")
print(f"Total samples: {len(y_imbalanced)}")
print(f"Class 0 (malignant): {np.sum(y_imbalanced == 0)} ({np.sum(y_imbalanced == 0) / len(y_imbalanced) * 100:.1f}%)")
print(f"Class 1 (benign): {np.sum(y_imbalanced == 1)} ({np.sum(y_imbalanced == 1) / len(y_imbalanced) * 100:.1f}%)")
print(f"Imbalance ratio: 1:{len(malignant_idx) / n_benign_keep:.1f}")
print()

# Split data
X_train_imb, X_test_imb, y_train_imb, y_test_imb = train_test_split(
    X_imbalanced, y_imbalanced, test_size=0.2, random_state=42, stratify=y_imbalanced
)

# Train without scale_pos_weight
xgb_no_weight = XGBClassifier(
    n_estimators=100,
    max_depth=3,
    random_state=42,
    eval_metric='logloss'
)

xgb_no_weight.fit(X_train_imb, y_train_imb)
y_pred_no_weight = xgb_no_weight.predict(X_test_imb)

# Your code here: Calculate scale_pos_weight and train with it
scale_pos_weight = np.sum(y_train_imb == 0) / np.sum(y_train_imb == 1)

xgb_with_weight = XGBClassifier(
    n_estimators=100,
    max_depth=3,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    eval_metric='logloss'
)

xgb_with_weight.fit(X_train_imb, y_train_imb)
print(f"Calculated scale_pos_weight: {scale_pos_weight:.2f}")
print()

y_pred_with_weight = xgb_with_weight.predict(X_test_imb)

# Compare results
print("Performance Comparison:\n")

print("WITHOUT scale_pos_weight:")
print(classification_report(y_test_imb, y_pred_no_weight, 
                          target_names=['Malignant', 'Benign']))

print("\nWITH scale_pos_weight:")
print(classification_report(y_test_imb, y_pred_with_weight,
                          target_names=['Malignant', 'Benign']))

# ROC curves
y_proba_no_weight = xgb_no_weight.predict_proba(X_test_imb)[:, 1]
y_proba_with_weight = xgb_with_weight.predict_proba(X_test_imb)[:, 1]

fpr_no, tpr_no, _ = roc_curve(y_test_imb, y_proba_no_weight)
fpr_with, tpr_with, _ = roc_curve(y_test_imb, y_proba_with_weight)

auc_no = roc_auc_score(y_test_imb, y_proba_no_weight)
auc_with = roc_auc_score(y_test_imb, y_proba_with_weight)

plt.figure(figsize=(10, 6))
plt.plot(fpr_no, tpr_no, label=f'Without weight (AUC = {auc_no:.3f})', linewidth=2)
plt.plot(fpr_with, tpr_with, label=f'With weight (AUC = {auc_with:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves: Impact of scale_pos_weight')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\n✓ Imbalanced classification handled!")
print("\nKey observation: scale_pos_weight typically improves recall on the minority (positive) class")

---

## Part 7: Comparing XGBoost vs Random Forest vs Gradient Boosting

### Background

**Head-to-head comparison** of three ensemble methods:

| Method | Speed | Accuracy | Overfitting | Tuning |
|--------|-------|----------|-------------|--------|
| Random Forest | Fast | Good | Low | Easy |
| Gradient Boosting | Medium | Very Good | Medium | Moderate |
| XGBoost | Very Fast | Excellent | Low (regularized) | Complex |

### Exercise 7.1: Comprehensive Comparison

**Task:** Compare all three methods on the same dataset.

In [None]:
# Use Breast Cancer dataset
X_train_comp, X_test_comp, y_train_comp, y_test_comp = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

# Define models
models = {
    'Random Forest': RandomForestClassifier(
        n_estimators=100,
        max_depth=5,
        random_state=42,
        n_jobs=-1
    ),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        random_state=42
    ),
    'XGBoost': XGBClassifier(
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        random_state=42,
        eval_metric='logloss'
    )
}

# Train and evaluate each model
import time

results = {}

print("Training and evaluating models...\n")
print(f"{'Model':<20} {'Train Time (s)':<15} {'Train Acc':<12} {'Test Acc':<12} {'AUC-ROC'}")
print("-" * 75)

for name, model in models.items():
    # Train
    start_time = time.time()
    model.fit(X_train_comp, y_train_comp)
    train_time = time.time() - start_time
    
    # Predictions
    y_pred_train = model.predict(X_train_comp)
    y_pred_test = model.predict(X_test_comp)
    y_proba_test = model.predict_proba(X_test_comp)[:, 1]
    
    # Metrics
    train_acc = accuracy_score(y_train_comp, y_pred_train)
    test_acc = accuracy_score(y_test_comp, y_pred_test)
    auc = roc_auc_score(y_test_comp, y_proba_test)
    
    # Store results
    results[name] = {
        'train_time': train_time,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'auc': auc,
        'y_proba': y_proba_test
    }
    
    print(f"{name:<20} {train_time:<15.4f} {train_acc:<12.4f} {test_acc:<12.4f} {auc:.4f}")

print()

# Visualize ROC curves
plt.figure(figsize=(10, 6))

for name, result in results.items():
    fpr, tpr, _ = roc_curve(y_test_comp, result['y_proba'])
    plt.plot(fpr, tpr, linewidth=2, label=f"{name} (AUC = {result['auc']:.3f})")

plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves: XGBoost vs Random Forest vs Gradient Boosting')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()

# Cross-validation comparison
print("\nCross-Validation Comparison (5-fold):\n")
print(f"{'Model':<20} {'Mean CV Score':<15} {'Std CV Score'}")
print("-" * 50)

for name, model in models.items():
    cv_scores = cross_val_score(model, X_cancer, y_cancer, cv=5, scoring='accuracy')
    print(f"{name:<20} {cv_scores.mean():<15.4f} {cv_scores.std():.4f}")

print("\n✓ Comparison complete!")
print("\nKey insights:")
print("- XGBoost typically fastest training time")
print("- Similar accuracy across methods on this small, clean dataset")
print("- XGBoost has regularization built-in (less overfitting)")
print("- Random Forest easiest to tune")

---

## Part 8: Model Persistence

### Background

Trained models need to be saved for production deployment.

**XGBoost options:**
1. **Pickle/Joblib**: Python-specific, includes sklearn wrapper
2. **save_model/load_model** with a `.ubj` file: XGBoost native binary (UBJSON) format (preferred)
3. **save_model/load_model** with a `.json` file: Human-readable, cross-platform

### Exercise 8.1: Save and Load XGBoost Models

**Task:** Save a trained model and reload it for predictions.

In [None]:
import pickle
import json
import os

# Train a model to save
xgb_to_save = XGBClassifier(
    n_estimators=50,
    max_depth=3,
    learning_rate=0.1,
    random_state=42,
    eval_metric='logloss'
)

xgb_to_save.fit(X_train, y_train)
original_predictions = xgb_to_save.predict(X_test)
original_accuracy = accuracy_score(y_test, original_predictions)

print(f"Original model accuracy: {original_accuracy:.4f}\n")

# Method 1: Pickle (Python-specific)
print("Method 1: Pickle")
pickle_file = 'xgb_model.pkl'
with open(pickle_file, 'wb') as f:
    pickle.dump(xgb_to_save, f)
print(f"  Saved to {pickle_file}")

with open(pickle_file, 'rb') as f:
    xgb_loaded_pickle = pickle.load(f)
pickle_predictions = xgb_loaded_pickle.predict(X_test)
pickle_accuracy = accuracy_score(y_test, pickle_predictions)
print(f"  Loaded accuracy: {pickle_accuracy:.4f}")
print(f"  Predictions match: {np.array_equal(original_predictions, pickle_predictions)}\n")

# Method 2: XGBoost native format (recommended)
print("Method 2: XGBoost native binary")
binary_file = 'xgb_model.ubj'
xgb_to_save.save_model(binary_file)
print(f"  Saved to {binary_file}")

xgb_loaded_binary = XGBClassifier()
xgb_loaded_binary.load_model(binary_file)
binary_predictions = xgb_loaded_binary.predict(X_test)
binary_accuracy = accuracy_score(y_test, binary_predictions)
print(f"  Loaded accuracy: {binary_accuracy:.4f}")
print(f"  Predictions match: {np.array_equal(original_predictions, binary_predictions)}\n")

# Method 3: JSON (human-readable)
print("Method 3: JSON (human-readable)")
json_file = 'xgb_model.json'
xgb_to_save.save_model(json_file)
print(f"  Saved to {json_file}")

xgb_loaded_json = XGBClassifier()
xgb_loaded_json.load_model(json_file)
json_predictions = xgb_loaded_json.predict(X_test)
json_accuracy = accuracy_score(y_test, json_predictions)
print(f"  Loaded accuracy: {json_accuracy:.4f}")
print(f"  Predictions match: {np.array_equal(original_predictions, json_predictions)}\n")

# Compare file sizes
print("File sizes:")
for filename in [pickle_file, binary_file, json_file]:
    size_kb = os.path.getsize(filename) / 1024
    print(f"  {filename}: {size_kb:.2f} KB")

# Clean up
for filename in [pickle_file, binary_file, json_file]:
    os.remove(filename)

print("\n✓ Model persistence complete!")
print("\nRecommendation: Use save_model/load_model with .ubj or .json")

---

## Part 9: Complete ML Pipeline

### Background

A production ML pipeline includes:
1. Data preprocessing
2. Feature engineering
3. Train/validation/test split
4. Hyperparameter tuning
5. Model training
6. Evaluation
7. Model persistence
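
A detail that is easy to get wrong in steps 1 and 3 is fitting the preprocessing on the training split only and reusing those statistics for validation and test data. The sketch below shows the idea for standardization in plain Python; `fit_scaler` and `transform` are hypothetical helpers standing in for a fitted scaler.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Standardize values using statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]

# Toy single-feature data.
train = [10.0, 12.0, 14.0]
test = [16.0]

# Fit on the training split only, then apply the same statistics to the test split.
mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)

print(f"train mean after scaling: {mean(train_scaled):.4f}")
print(f"test value after scaling: {test_scaled[0]:.4f}")
```

Fitting the scaler on all data would leak information from the test split into training, which makes the evaluation in step 6 look better than it should.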

### Exercise 9.1: End-to-End Pipeline

**Task:** Build a complete pipeline for Wine classification.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load data
wine = load_wine()
X = wine.data
y = wine.target

print("Building complete ML pipeline for Wine classification\n")

# Step 1: Split data (train/val/test)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.2, random_state=42, stratify=y_temp
)

print(f"Train: {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")
print()

# Step 2: Create preprocessing pipeline
preprocessor = Pipeline([
    ('scaler', StandardScaler())
])

# Fit preprocessor on training data only
X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

# Step 3: Hyperparameter tuning (grid search with cross-validation)
print("Step 1: Hyperparameter tuning...\n")

param_grid = {
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.3],
    'n_estimators': [50, 100]
}

xgb_pipeline = XGBClassifier(
    random_state=42,
    eval_metric='mlogloss'
)

grid_search = GridSearchCV(
    xgb_pipeline,
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)

grid_search.fit(X_train_processed, y_train)

print("Best parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")
print()

# Step 4: Train final model with early stopping
print("Step 2: Training final model with early stopping...\n")

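# Drop n_estimators from the tuned parameters so it is not passed twice
best_params = {k: v for k, v in grid_search.best_params_.items() if k != 'n_estimators'}

final_model = XGBClassifier(
    **best_params,
    n_estimators=200,  # Allow more trees; early stopping picks the cutoff
    random_state=42,
    eval_metric='mlogloss',
    early_stopping_rounds=20  # Set in the constructor, as in Part 2
)

# Early stopping monitors the last entry in eval_set (the validation split)
final_model.fit(
    X_train_processed, y_train,
    eval_set=[(X_train_processed, y_train), (X_val_processed, y_val)],
    verbose=False
)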

print(f"Best iteration: {final_model.best_iteration} (0-based, so {final_model.best_iteration + 1} trees)")
print()

# Step 5: Comprehensive evaluation
print("Step 3: Model evaluation\n")

# Predictions
y_pred_train = final_model.predict(X_train_processed)
y_pred_val = final_model.predict(X_val_processed)
y_pred_test = final_model.predict(X_test_processed)

# Metrics
train_acc = accuracy_score(y_train, y_pred_train)
val_acc = accuracy_score(y_val, y_pred_val)
test_acc = accuracy_score(y_test, y_pred_test)

print(f"Train Accuracy: {train_acc:.4f}")
print(f"Validation Accuracy: {val_acc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")
print()

print("Test Set Classification Report:")
print(classification_report(y_test, y_pred_test, target_names=wine.target_names))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_test)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
           xticklabels=wine.target_names,
           yticklabels=wine.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Final Model')
plt.show()

# Step 6: Feature importance
importance = final_model.get_booster().get_score(importance_type='gain')
feature_importance = {}
for key, value in importance.items():
    feature_idx = int(key[1:])
    feature_importance[wine.feature_names[feature_idx]] = value

sorted_features = sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)

plt.figure(figsize=(10, 6))
features, importances = zip(*sorted_features[:10])
plt.barh(range(len(features)), importances, alpha=0.7)
plt.yticks(range(len(features)), features)
plt.xlabel('Importance (Gain)')
plt.title('Top 10 Feature Importances')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

# Step 7: Save complete pipeline
print("\nStep 4: Saving pipeline...\n")

pipeline_dict = {
    'preprocessor': preprocessor,
    'model': final_model,
    'feature_names': wine.feature_names,
    'target_names': wine.target_names.tolist()
}

with open('wine_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline_dict, f)

print("Pipeline saved to wine_pipeline.pkl")

# Test loading and prediction
with open('wine_pipeline.pkl', 'rb') as f:
    loaded_pipeline = pickle.load(f)

# Make prediction on new sample
sample = X_test[0:1]
sample_processed = loaded_pipeline['preprocessor'].transform(sample)
prediction = loaded_pipeline['model'].predict(sample_processed)
prediction_proba = loaded_pipeline['model'].predict_proba(sample_processed)

print(f"\nSample prediction: {loaded_pipeline['target_names'][prediction[0]]}")
print(f"Confidence: {prediction_proba[0][prediction[0]]:.4f}")

# Clean up
os.remove('wine_pipeline.pkl')

print("\n✓ Complete ML pipeline built successfully!")

---

## Part 10: Kaggle-Style Challenge

### Background

In Kaggle competitions, you need to:
1. Maximize a specific metric (accuracy, AUC, RMSE, etc.)
2. Create predictions for a test set
3. Submit in required format

### Exercise 10.1: Mini Competition

**Task:** Build the best XGBoost model for California Housing (minimize RMSE).

In [None]:
# Load data
housing = fetch_california_housing()
X = housing.data
y = housing.target

print("KAGGLE-STYLE CHALLENGE: California Housing Price Prediction")
print("="*70)
print("\nObjective: Minimize RMSE on test set")
print("\nDataset:")
print(f"  Samples: {X.shape[0]}")
print(f"  Features: {X.shape[1]}")
print(f"  Target: Median house value ($100k)")
print()

# Create train/test split (simulate Kaggle setup)
X_train_kaggle, X_test_kaggle, y_train_kaggle, y_test_kaggle = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Training set: {X_train_kaggle.shape}")
print(f"Test set: {X_test_kaggle.shape} (labels hidden for competition)")
print()

# Your turn: Build the best model!
# Tips:
# 1. Feature engineering (create new features)
# 2. Hyperparameter tuning
# 3. Ensemble multiple models
# 4. Use cross-validation

print("Building competition model...\n")

# Baseline model
baseline_model = XGBRegressor(
    n_estimators=100,
    random_state=42
)
baseline_model.fit(X_train_kaggle, y_train_kaggle)
baseline_pred = baseline_model.predict(X_test_kaggle)
baseline_rmse = np.sqrt(mean_squared_error(y_test_kaggle, baseline_pred))

print(f"Baseline RMSE: {baseline_rmse:.4f}\n")

# Your improved model (example)
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_kaggle)
X_test_scaled = scaler.transform(X_test_kaggle)

# Hyperparameter tuning
param_grid_competition = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [200, 500],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'gamma': [0, 0.1]
}

# Use RandomizedSearchCV for speed
from sklearn.model_selection import RandomizedSearchCV

xgb_competition = XGBRegressor(random_state=42)

random_search = RandomizedSearchCV(
    xgb_competition,
    param_grid_competition,
    n_iter=20,  # Try 20 random combinations
    cv=5,
    scoring='neg_root_mean_squared_error',
    random_state=42,
    n_jobs=-1,
    verbose=0
)

random_search.fit(X_train_scaled, y_train_kaggle)

print("Best parameters found:")
for param, value in random_search.best_params_.items():
    print(f"  {param}: {value}")
print()

# Get best model
best_model = random_search.best_estimator_

# Make predictions on test set
test_predictions = best_model.predict(X_test_scaled)

# Evaluate (in real Kaggle, you wouldn't have test labels)
test_rmse = np.sqrt(mean_squared_error(y_test_kaggle, test_predictions))
test_r2 = r2_score(y_test_kaggle, test_predictions)

print("COMPETITION RESULTS")
print("="*50)
print(f"Baseline RMSE: {baseline_rmse:.4f}")
print(f"Your RMSE:     {test_rmse:.4f}")
print(f"Improvement:   {((baseline_rmse - test_rmse) / baseline_rmse * 100):.2f}%")
print(f"Test R²:       {test_r2:.4f}")
print()

# Create submission file (Kaggle format)
submission = pd.DataFrame({
    'Id': range(len(test_predictions)),
    'Predicted': test_predictions
})

submission.to_csv('submission.csv', index=False)
print("Submission file created: submission.csv")
print(submission.head())

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Predicted vs Actual
axes[0].scatter(y_test_kaggle, test_predictions, alpha=0.5, s=20)
axes[0].plot([y_test_kaggle.min(), y_test_kaggle.max()],
            [y_test_kaggle.min(), y_test_kaggle.max()],
            'r--', lw=2, label='Perfect prediction')
axes[0].set_xlabel('Actual Price ($100k)')
axes[0].set_ylabel('Predicted Price ($100k)')
axes[0].set_title(f'Competition Results (RMSE = {test_rmse:.4f})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Distribution comparison
axes[1].hist(y_test_kaggle, bins=30, alpha=0.5, label='Actual', density=True)
axes[1].hist(test_predictions, bins=30, alpha=0.5, label='Predicted', density=True)
axes[1].set_xlabel('Price ($100k)')
axes[1].set_ylabel('Density')
axes[1].set_title('Distribution Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Clean up
import os
os.remove('submission.csv')

print("\n✓ Kaggle-style challenge complete!")
print("\nNext steps to improve:")
print("- Feature engineering (polynomial features, interactions)")
print("- Ensemble multiple XGBoost models with different seeds")
print("- Stack XGBoost with other models (RF, GBM)")
print("- Use XGBoost's built-in cross-validation (xgb.cv)")
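
The "different seeds" idea from the list above can be sketched as a plain average of predictions. This is a minimal illustration, not part of the original exercise: `average_predictions` and the stand-in `ConstantOffsetModel` class are hypothetical names, and the stand-in exists only so the snippet runs without training anything.

```python
import numpy as np

def average_predictions(models, X):
    """Average the predictions of several fitted regressors (simple seed ensemble)."""
    preds = np.column_stack([m.predict(X) for m in models])
    return preds.mean(axis=1)

# Hypothetical stand-in models so the sketch runs without training anything.
class ConstantOffsetModel:
    def __init__(self, offset):
        self.offset = offset

    def predict(self, X):
        # Pretend each "model" is the same function plus a model-specific error.
        return np.asarray(X).sum(axis=1) + self.offset

X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
models = [ConstantOffsetModel(o) for o in (-0.3, 0.0, 0.3)]
ensemble_pred = average_predictions(models, X_demo)
print(ensemble_pred)
```

In the notebook, the stand-ins would be replaced by several `XGBRegressor` instances fitted with different `random_state` values. Different seeds typically only produce different models when some randomness is enabled, for example `subsample` or `colsample_bytree` below 1.0.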

---

## Challenge Problems

### Challenge 1: Custom Objective Function

Implement a custom objective function for XGBoost (e.g., Huber loss for robust regression).

In [None]:
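def huber_loss_objective(y_true, y_pred):
    """
    Custom Huber loss objective function.
    
    Huber loss is less sensitive to outliers than MSE.

    Note: the scikit-learn wrapper (XGBRegressor) calls a custom objective
    as objective(y_true, y_pred); the native xgb.train API instead expects
    obj(preds, dtrain).
    """
    # Your code here
    # Return gradient and hessian (one value per sample)
    pass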

# Use custom objective
# xgb_custom = XGBRegressor(objective=huber_loss_objective)

print("Challenge 1: Implement custom Huber loss objective!")
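
One possible starting point, not the reference solution: the sketch below uses the pseudo-Huber loss, a smooth stand-in for Huber loss whose second derivative is strictly positive (the exact Huber Hessian is zero for large residuals, which is awkward for second-order boosting). The name `pseudo_huber_objective` and the `delta=1.0` default are assumptions made for this illustration. The `(y_true, y_pred)` argument order matches what the scikit-learn wrapper passes to a custom objective.

```python
import numpy as np

def pseudo_huber_objective(y_true, y_pred, delta=1.0):
    """Gradient and Hessian of the pseudo-Huber loss with respect to y_pred.

    Per-sample loss: delta**2 * (sqrt(1 + (r / delta)**2) - 1), with r = y_pred - y_true.
    """
    residual = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    scale = 1.0 + (residual / delta) ** 2
    grad = residual / np.sqrt(scale)  # close to r near 0, saturates near +/- delta
    hess = 1.0 / scale ** 1.5         # always strictly positive
    return grad, hess

# Quick check on toy values (no XGBoost needed): one perfect prediction, one outlier.
grad, hess = pseudo_huber_objective(np.array([0.0, 0.0]), np.array([0.0, 100.0]))
print(grad, hess)
```

The outlier's gradient is capped near `delta` instead of growing with the residual, which is what makes the loss robust. Passing the function as `XGBRegressor(objective=pseudo_huber_objective)` should train against it; XGBoost's built-in `reg:pseudohubererror` objective is a useful comparison.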

### Challenge 2: XGBoost with DMatrix

Use XGBoost's native DMatrix API for maximum performance.

In [None]:
# Convert to DMatrix (XGBoost's internal data structure)
# dtrain = xgb.DMatrix(X_train, label=y_train)
# dtest = xgb.DMatrix(X_test, label=y_test)

# Train using native API
# params = {
#     'max_depth': 3,
#     'eta': 0.1,
#     'objective': 'binary:logistic',
#     'eval_metric': 'logloss'
# }

# bst = xgb.train(params, dtrain, num_boost_round=100,
#                 evals=[(dtrain, 'train'), (dtest, 'test')],
#                 early_stopping_rounds=10)

print("Challenge 2: Use XGBoost's DMatrix API for better performance!")

### Challenge 3: Multi-Output XGBoost

Implement XGBoost for multi-output regression (predict multiple targets simultaneously).

In [None]:
from sklearn.multioutput import MultiOutputRegressor

# Create multi-output dataset
# from sklearn.datasets import make_regression
# X, y = make_regression(n_samples=1000, n_features=10, n_targets=3, random_state=42)
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Wrap XGBoost in MultiOutputRegressor (fits one XGBoost model per target)
# multi_xgb = MultiOutputRegressor(XGBRegressor(n_estimators=100))
# multi_xgb.fit(X_tr, y_tr)

print("Challenge 3: Implement multi-output regression with XGBoost!")

### Challenge 4: XGBoost Feature Interaction

Use SHAP values to analyze feature interactions in XGBoost models.

In [None]:
# Install SHAP: !pip install shap

# import shap

# explainer = shap.TreeExplainer(xgb_model)
# shap_values = explainer.shap_values(X_test)

# shap.summary_plot(shap_values, X_test, feature_names=feature_names)
# shap.dependence_plot('feature_name', shap_values, X_test)

print("Challenge 4: Use SHAP for interpretability!")

### Challenge 5: XGBoost with GPU Acceleration

Train XGBoost on a GPU, which can substantially shorten training on large datasets (requires a CUDA-enabled GPU).

In [None]:
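# Use GPU for training
# XGBoost >= 2.0: select the GPU with the `device` parameter
# xgb_gpu = XGBClassifier(
#     n_estimators=1000,
#     tree_method='hist',
#     device='cuda'  # or 'cuda:0' to pick a specific GPU
# )
# Older releases used tree_method='gpu_hist', gpu_id=0, predictor='gpu_predictor',
# which are deprecated since XGBoost 2.0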

# xgb_gpu.fit(X_train, y_train)

print("Challenge 5: Use GPU acceleration for faster training!")
print("Note: Requires CUDA-enabled GPU")

---

## Reflection Questions

1. **What makes XGBoost faster than traditional gradient boosting?**
   - Think about parallelization and tree construction

2. **When should you use early stopping vs setting a fixed n_estimators?**
   - Consider computational cost and overfitting

3. **How does scale_pos_weight help with imbalanced datasets?**
   - What's the mathematical effect on the loss function?

4. **Why are there three different feature importance metrics (weight, gain, cover)?**
   - When would each be most useful?

5. **How does XGBoost handle missing values automatically?**
   - What's the algorithm's approach?

6. **What's the relationship between learning_rate and n_estimators?**
   - How do you balance them for best performance?

7. **When would you choose XGBoost over Random Forest?**
   - Consider accuracy, speed, interpretability, and tuning effort

8. **How do regularization parameters (gamma, alpha, lambda) prevent overfitting?**
   - What aspect of the model does each control?

9. **Why is feature scaling not required for XGBoost, even though it is standard practice for many other models?**
   - How do tree-based models differ from linear models?

10. **What are the trade-offs of using XGBoost's native API vs sklearn wrapper?**
    - Consider functionality, ease of use, and performance
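
As a hint for question 3, the sketch below shows the per-sample gradient and Hessian of the logistic loss when positive samples carry an extra weight, which is how `scale_pos_weight` is usually described. It is an illustration in NumPy, not XGBoost's source code, and the function name `weighted_logloss_grad_hess` is hypothetical. The XGBoost documentation suggests the ratio of negative to positive counts as a typical starting value.

```python
import numpy as np

def weighted_logloss_grad_hess(y_true, raw_score, scale_pos_weight=1.0):
    """Gradient/Hessian of logistic loss w.r.t. the raw score, with positives up-weighted."""
    y_true = np.asarray(y_true, dtype=float)
    p = 1.0 / (1.0 + np.exp(-np.asarray(raw_score, dtype=float)))  # sigmoid
    w = np.where(y_true == 1, scale_pos_weight, 1.0)  # extra weight on positives only
    grad = w * (p - y_true)
    hess = w * p * (1.0 - p)
    return grad, hess

y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # 1 positive, 9 negatives
spw = (y == 0).sum() / (y == 1).sum()          # negatives / positives
grad, hess = weighted_logloss_grad_hess(y, np.zeros(len(y)), scale_pos_weight=spw)
print(spw, grad[0], grad[1:].sum())
```

With the ratio-based weight, the single positive pulls on the model as hard as the nine negatives combined, so at the initial score the two classes contribute equal and opposite gradient.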

---

## Summary

In this exercise, you mastered:

✓ XGBoost installation and basic usage  
✓ Early stopping to prevent overfitting  
✓ Systematic hyperparameter tuning  
✓ Three types of feature importance (weight, gain, cover)  
✓ XGBoost for regression tasks  
✓ Handling imbalanced datasets with scale_pos_weight  
✓ Comparing XGBoost with Random Forest and Gradient Boosting  
✓ Model persistence (pickle, binary, JSON)  
✓ Building complete production pipelines  
✓ Kaggle-style competition workflows  

**Key Takeaways:**

- **XGBoost is a strong default for structured data**: Widely used in industry and competitions for tabular problems
- **Regularization is built-in**: Less prone to overfitting than standard GBM
- **Speed matters**: Highly optimized, supports parallel and GPU training
- **Hyperparameter tuning is crucial**: Default params rarely optimal
- **Early stopping saves time**: Stops adding trees once the validation metric stops improving
- **Feature importance aids interpretation**: Three metrics provide different insights
- **Handles real-world challenges**: Missing values, imbalanced data, large datasets

**XGBoost Hyperparameter Tuning Strategy:**

1. **Start with defaults**: Establish baseline performance
2. **Tune tree parameters**: max_depth, min_child_weight
3. **Tune sampling**: subsample, colsample_bytree
4. **Add regularization**: gamma, reg_alpha, reg_lambda
5. **Fine-tune learning**: Lower learning_rate, increase n_estimators
6. **Use early stopping**: Find optimal iteration automatically
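
Step 6 relies on a patience rule: training stops once the validation metric has gone a fixed number of rounds without improving. The pure-Python sketch below illustrates that rule on a made-up loss curve. The function name `early_stopping_round` is hypothetical, and XGBoost's exact off-by-one conventions may differ from this simplified version.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stop_round) for a patience-based early stopping rule.

    Training stops at the first round that is `patience` rounds past the best
    (lowest) validation loss seen so far; otherwise it runs to the last round.
    """
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(val_losses) - 1

# Made-up validation losses: the minimum is at round 2, then the curve drifts up.
losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.90, 0.95]
best, stopped = early_stopping_round(losses, patience=3)
print(best, stopped)
```

A lower `learning_rate` usually moves the best round later, which is why step 5 pairs it with a larger `n_estimators` and lets early stopping choose the final count.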

**When to Use XGBoost:**

✓ Structured/tabular data  
✓ Need high accuracy  
✓ Kaggle competitions  
✓ Large datasets  
✓ Mixed feature types  
✓ Missing values  
✓ Imbalanced classes  

**When NOT to Use XGBoost:**

✗ Image/video data (use CNNs)  
✗ Text data (use transformers)  
✗ Need simple interpretability  
✗ Very small datasets (<100 samples)  
✗ Hard real-time inference with tight latency budgets (large ensembles may be too slow)  

**Next Steps:**

- Practice on Kaggle competitions
- Explore LightGBM and CatBoost (XGBoost alternatives)
- Study SHAP for model interpretability
- Learn XGBoost's distributed training (Dask, Spark)
- Experiment with custom objective functions
- Try GPU acceleration for large datasets

**Resources:**

- [XGBoost Documentation](https://xgboost.readthedocs.io/)
- [XGBoost Parameters Explained](https://xgboost.readthedocs.io/en/latest/parameter.html)
- [Kaggle XGBoost Tutorials](https://www.kaggle.com/learn/xgboost)
- [SHAP for XGBoost](https://github.com/slundberg/shap)

---

**Congratulations!** You now have comprehensive knowledge of XGBoost, one of the most powerful machine learning algorithms for structured data. Apply these skills to real-world problems and Kaggle competitions!

**Need help?** Check the solution notebook or open an issue on [GitHub](https://github.com/jumpingsphinx/jumpingsphinx.github.io/issues).