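# Tutorial: High-Performance Gradient Boosting Libraries (XGBoost, LightGBM, CatBoost)

Welcome to this tutorial on XGBoost, LightGBM, and CatBoost. All three libraries are efficient, scalable, and popular implementations of the **Gradient Boosting Decision Tree (GBDT)** algorithm. They excel on **tabular/structured data** and regularly achieve top results in data science competitions and industrial applications.

**Why use these libraries?**

They are often faster and more accurate than Scikit-learn's built-in gradient boosting implementation, and they offer more advanced features, such as built-in regularization, native handling of missing values, and special support for categorical features (CatBoost in particular).

This tutorial introduces the basic usage of each library independently:

1.  **XGBoost**: one of the first efficient GBDT implementations to become widely popular.
2.  **LightGBM**: known for fast training and low memory usage.
3.  **CatBoost**: particularly good at handling categorical features automatically.

We use datasets available through Scikit-learn for one classification task and one regression task, and show how each library trains, predicts, and evaluates a model.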

**Tutorial structure:**
1.  Setup (installing the libraries, shared data preparation).
2.  Using XGBoost.
3.  Using LightGBM.
4.  Using CatBoost.
5.  Performance comparison and summary.

## 1. Setup

Install the required libraries and prepare the datasets used in the demonstrations.

### 1.1 Installing the Libraries

```bash
pip install xgboost lightgbm catboost scikit-learn pandas numpy matplotlib
```

In [None]:
# --- Shared imports (for data preparation and evaluation) ---
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score, roc_auc_score
from sklearn.datasets import load_breast_cancer, fetch_california_housing
import time
import os
import warnings

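# Suppress FutureWarnings that some libraries emit, to keep the output tidy
warnings.filterwarnings('ignore', category=FutureWarning)

# Helper for timing a function call
def time_it(func, *args, **kwargs):
    start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    result = func(*args, **kwargs)
    end_time = time.perf_counter()
    print(f"Execution time: {end_time - start_time:.4f} seconds")
    return result

# --- Check library versions ---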
print("Checking library versions...")
try: import xgboost as xgb; print(f"  XGBoost version: {xgb.__version__}")
except ImportError: print("  XGBoost not installed."); xgb = None
try: import lightgbm as lgb; print(f"  LightGBM version: {lgb.__version__}")
except ImportError: print("  LightGBM not installed."); lgb = None
try: import catboost as cb; print(f"  CatBoost version: {cb.__version__}")
except ImportError: print("  CatBoost not installed."); cb = None

# --- Shared data preparation (run once) ---
print("\n--- Preparing Datasets ---")
# Classification Data
cancer = load_breast_cancer()
X_cancer_base, y_cancer_base = cancer.data, cancer.target
cancer_feature_names = cancer.feature_names
print(f"Base classification data shape: X={X_cancer_base.shape}")

# Regression Data
# fetch_california_housing downloads the data on first use, so it can fail offline.
regression_available = False
X_housing_base, y_housing_base, housing_feature_names = None, None, None
try:
    housing = fetch_california_housing()
    X_housing_base, y_housing_base = housing.data, housing.target
    housing_feature_names = housing.feature_names
    print(f"Base regression data shape: X={X_housing_base.shape}")
    regression_available = True
except Exception as e:
    print(f"Error loading regression dataset: {e}. Skipping regression examples.")

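## 2. Using XGBoost

XGBoost (eXtreme Gradient Boosting) is an optimized, distributed gradient boosting library designed to be efficient, flexible, and portable. It implements a regularized learning objective, which helps control overfitting.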
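
To make the "regularized learning objective" more concrete, here is a minimal sketch of the leaf-weight and split-gain formulas from the XGBoost paper, where an L2 penalty `reg_lambda` shrinks leaf values and `gamma` charges a fixed cost per extra leaf. The helper functions `leaf_weight`, `leaf_score`, and `split_gain` are hypothetical names written for this illustration; they are not part of the XGBoost API, and the toy data is made up.

```python
# Illustrative only: these helpers are not part of the XGBoost API.
# G and H are the sums of first- and second-order gradients of the loss
# over the training rows that fall into a leaf.

def leaf_weight(G, H, reg_lambda=1.0):
    """Optimal leaf value under L2 regularization: w* = -G / (H + lambda)."""
    return -G / (H + reg_lambda)

def leaf_score(G, H, reg_lambda=1.0):
    """Structure score of a leaf: G^2 / (H + lambda). Larger is better."""
    return G * G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting one leaf in two; gamma penalizes the extra leaf."""
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = (leaf_score(G_left, H_left, reg_lambda)
                + leaf_score(G_right, H_right, reg_lambda))
    return 0.5 * (children - parent) - gamma

# Toy data. For the loss 0.5 * (y - pred)^2 with all predictions at 0,
# the gradient of each row is -y and the hessian is 1.
y_left, y_right = [1.0, 1.2, 0.8], [3.0, 3.1]
G_l, H_l = -sum(y_left), float(len(y_left))
G_r, H_r = -sum(y_right), float(len(y_right))

gain_unreg = split_gain(G_l, H_l, G_r, H_r, reg_lambda=0.0, gamma=0.0)
gain_reg = split_gain(G_l, H_l, G_r, H_r, reg_lambda=1.0, gamma=0.5)
print(f"left leaf weight (lambda=1): {leaf_weight(G_l, H_l):.3f}")
print(f"gain without regularization: {gain_unreg:.3f}")
print(f"gain with lambda=1, gamma=0.5: {gain_reg:.3f}")
```

A split whose gain falls below zero once the penalties are applied is not worth making, which is how the regularization terms limit tree complexity. In the Scikit-learn style interface used below, the corresponding constructor arguments are `reg_lambda` and `gamma`.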

In [None]:
# --- XGBoost: imports and data split ---
print("\n--- XGBoost Section --- ")
accuracy_xgb, auc_xgb, mse_xgb, r2_xgb = None, None, None, None # Initialize results
if xgb:
    # Split the classification data
    X_cancer_train_xgb, X_cancer_test_xgb, y_cancer_train_xgb, y_cancer_test_xgb = train_test_split(
        X_cancer_base, y_cancer_base, test_size=0.2, random_state=42, stratify=y_cancer_base
    )
    print("XGBoost: Classification data split.")
    
    # Split the regression data
    if regression_available:
        X_housing_train_xgb, X_housing_test_xgb, y_housing_train_xgb, y_housing_test_xgb = train_test_split(
            X_housing_base, y_housing_base, test_size=0.2, random_state=42
        )
        print("XGBoost: Regression data split.")
    else:
        print("XGBoost: Regression data unavailable.")
else:
    print("XGBoost not available, skipping this section.")

In [None]:
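# --- XGBoost: Classification ---
if xgb:
    print("\nXGBoost: Training XGBClassifier...")
    xgb_clf = xgb.XGBClassifier(
        objective='binary:logistic',
        n_estimators=100,          # upper bound on the number of boosting rounds
        learning_rate=0.1,
        max_depth=3,
        subsample=0.8,             # fraction of rows sampled per tree
        colsample_bytree=0.8,      # fraction of columns sampled per tree
        eval_metric='logloss',
        # XGBoost >= 1.6 takes early stopping in the constructor;
        # passing it to fit() is no longer supported in 2.0+.
        early_stopping_rounds=10,
        random_state=42,
        n_jobs=-1
    )

    # Train with early stopping. The test set doubles as the validation set
    # here for brevity; use a separate validation split in real projects.
    time_it(xgb_clf.fit, X_cancer_train_xgb, y_cancer_train_xgb,
            eval_set=[(X_cancer_test_xgb, y_cancer_test_xgb)],
            verbose=False)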
    
    # Predict and Evaluate
    y_pred_xgb_clf = xgb_clf.predict(X_cancer_test_xgb)
    y_proba_xgb_clf = xgb_clf.predict_proba(X_cancer_test_xgb)[:, 1] 
    accuracy_xgb = accuracy_score(y_cancer_test_xgb, y_pred_xgb_clf)
    auc_xgb = roc_auc_score(y_cancer_test_xgb, y_proba_xgb_clf)
    print(f"XGBoost Classifier Accuracy: {accuracy_xgb:.4f}")
    print(f"XGBoost Classifier AUC: {auc_xgb:.4f}")
else:
    print("Skipping XGBoost classification (library not loaded).")

In [None]:
# --- XGBoost: Regression ---
if xgb and regression_available:
    print("\nXGBoost: Training XGBRegressor...")
    xgb_reg = xgb.XGBRegressor(
        objective='reg:squarederror',
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        subsample=0.8,
        colsample_bytree=0.8,
        # Constructor argument in XGBoost >= 1.6; no longer accepted by fit() in 2.0+.
        early_stopping_rounds=10,
        random_state=42,
        n_jobs=-1
    )

    time_it(xgb_reg.fit, X_housing_train_xgb, y_housing_train_xgb,
            eval_set=[(X_housing_test_xgb, y_housing_test_xgb)],
            verbose=False)
            
    y_pred_xgb_reg = xgb_reg.predict(X_housing_test_xgb)
    mse_xgb = mean_squared_error(y_housing_test_xgb, y_pred_xgb_reg)
    r2_xgb = r2_score(y_housing_test_xgb, y_pred_xgb_reg)
    print(f"XGBoost Regressor MSE: {mse_xgb:.4f}")
    print(f"XGBoost Regressor R2: {r2_xgb:.4f}")
elif xgb:
    print("\nXGBoost: Skipping Regressor example (data unavailable).")
else:
    print("Skipping XGBoost regression (library not loaded).")

## 3. Using LightGBM

LightGBM (Light Gradient Boosting Machine) is known for fast training and low memory usage. It uses a histogram-based algorithm and a leaf-wise tree growth strategy.
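
As a rough illustration of the histogram-based idea, the sketch below buckets one feature into a small number of bins, sums gradients per bin, and then scans only the bin boundaries for the best split instead of every distinct value. It is a simplified, assumption-laden sketch: the function `best_histogram_split`, its parameters, and the synthetic data are invented for this example, and LightGBM's actual implementation differs in many details. The sketch also assumes the feature has at least two distinct values.

```python
import numpy as np

def best_histogram_split(x, grad, hess, max_bin=32, reg_lambda=1.0):
    """Find a split threshold for one feature by scanning a gradient histogram.

    Hypothetical helper for illustration; not LightGBM code.
    """
    # 1. Bucket the raw feature values into at most `max_bin` quantile bins.
    inner_quantiles = np.linspace(0, 1, max_bin + 1)[1:-1]
    edges = np.unique(np.quantile(x, inner_quantiles))
    bins = np.searchsorted(edges, x, side="right")  # bin index of every row
    n_bins = len(edges) + 1

    # 2. Accumulate gradient and hessian sums per bin (the "histogram").
    G = np.bincount(bins, weights=grad, minlength=n_bins)
    H = np.bincount(bins, weights=hess, minlength=n_bins)

    # 3. Evaluate each bin boundary as a candidate split (rows with x < edge go left).
    G_total, H_total = G.sum(), H.sum()
    G_left, H_left = np.cumsum(G)[:-1], np.cumsum(H)[:-1]
    G_right, H_right = G_total - G_left, H_total - H_left
    gain = (G_left ** 2 / (H_left + reg_lambda)
            + G_right ** 2 / (H_right + reg_lambda)
            - G_total ** 2 / (H_total + reg_lambda))
    best = int(np.argmax(gain))
    return float(edges[best]), float(gain[best])

# Synthetic data with a step in the target at x = 4.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
y = np.where(x < 4.0, 1.0, 5.0) + rng.normal(0, 0.1, size=1000)
grad, hess = -y, np.ones_like(y)  # squared-error loss with predictions at 0

threshold, gain = best_histogram_split(x, grad, hess)
print(f"best threshold: {threshold:.2f}, gain: {gain:.1f}")
```

The cost of the scan depends on the number of bins rather than the number of rows, which is one reason histogram-based training tends to be fast and memory-friendly.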

In [None]:
# --- LightGBM: imports and data split ---
print("\n--- LightGBM Section --- ")
accuracy_lgb, auc_lgb, mse_lgb, r2_lgb = None, None, None, None # Initialize results
if lgb:
    from lightgbm import early_stopping, log_evaluation # Callbacks

    # Split the classification data
    X_cancer_train_lgb, X_cancer_test_lgb, y_cancer_train_lgb, y_cancer_test_lgb = train_test_split(
        X_cancer_base, y_cancer_base, test_size=0.2, random_state=42, stratify=y_cancer_base
    )
    print("LightGBM: Classification data split.")
    
    # Split the regression data
    if regression_available:
        X_housing_train_lgb, X_housing_test_lgb, y_housing_train_lgb, y_housing_test_lgb = train_test_split(
            X_housing_base, y_housing_base, test_size=0.2, random_state=42
        )
        print("LightGBM: Regression data split.")
    else:
        print("LightGBM: Regression data unavailable.")
else:
    print("LightGBM not available, skipping this section.")

In [None]:
# --- LightGBM: Classification ---
if lgb:
    print("\nLightGBM: Training LGBMClassifier...")
    lgb_clf = lgb.LGBMClassifier(
        objective='binary',
        metric='auc',
        n_estimators=100,
        learning_rate=0.1,
        num_leaves=31,           
        max_depth=-1,            
        subsample=0.8,           # bagging_fraction
        subsample_freq=1,        # bagging_freq; row subsampling only takes effect when this is > 0
        colsample_bytree=0.8,    # feature_fraction
        random_state=42,
        n_jobs=-1
    )
    
    lgbm_clf_callbacks = [
        early_stopping(stopping_rounds=10, verbose=False),
        log_evaluation(period=0)
    ]
    time_it(lgb_clf.fit, X_cancer_train_lgb, y_cancer_train_lgb, 
            eval_set=[(X_cancer_test_lgb, y_cancer_test_lgb)], 
            eval_metric='auc',
            callbacks=lgbm_clf_callbacks)
    
    y_pred_lgb_clf = lgb_clf.predict(X_cancer_test_lgb)
    y_proba_lgb_clf = lgb_clf.predict_proba(X_cancer_test_lgb)[:, 1]
    accuracy_lgb = accuracy_score(y_cancer_test_lgb, y_pred_lgb_clf)
    auc_lgb = roc_auc_score(y_cancer_test_lgb, y_proba_lgb_clf)
    print(f"LightGBM Classifier Accuracy: {accuracy_lgb:.4f}")
    print(f"LightGBM Classifier AUC: {auc_lgb:.4f}")
else:
    print("Skipping LightGBM classification (library not loaded).")

In [None]:
# --- LightGBM: Regression ---
if lgb and regression_available:
    print("\nLightGBM: Training LGBMRegressor...")
    lgb_reg = lgb.LGBMRegressor(
        objective='regression_l2',
        metric='rmse',
        n_estimators=100,
        learning_rate=0.1,
        num_leaves=31,
        max_depth=-1,
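        subsample=0.8,           # bagging_fraction
        subsample_freq=1,        # bagging_freq; row subsampling only takes effect when this is > 0
        colsample_bytree=0.8,    # feature_fraction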
        random_state=42,
        n_jobs=-1
    )
    
    lgbm_reg_callbacks = [
        early_stopping(stopping_rounds=10, verbose=False),
        log_evaluation(period=0)
    ]
    time_it(lgb_reg.fit, X_housing_train_lgb, y_housing_train_lgb,
            eval_set=[(X_housing_test_lgb, y_housing_test_lgb)],
            eval_metric='rmse',
            callbacks=lgbm_reg_callbacks)
            
    y_pred_lgb_reg = lgb_reg.predict(X_housing_test_lgb)
    mse_lgb = mean_squared_error(y_housing_test_lgb, y_pred_lgb_reg)
    r2_lgb = r2_score(y_housing_test_lgb, y_pred_lgb_reg)
    print(f"LightGBM Regressor MSE: {mse_lgb:.4f}")
    print(f"LightGBM Regressor R2: {r2_lgb:.4f}")
elif lgb:
    print("\nLightGBM: Skipping Regressor example (data unavailable).")
else:
    print("Skipping LightGBM regression (library not loaded).")

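## 4. Using CatBoost

The defining feature of CatBoost (Categorical Boosting) is its efficient built-in handling of categorical features, which often gives good results without manual preprocessing.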
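
One idea behind this built-in handling is the ordered target statistic: each row's category is replaced by a smoothed target average computed only from rows that come earlier in a random permutation, so a row never sees its own label. The sketch below is a simplified illustration under that assumption. The function `ordered_target_encoding`, the `prior` value, and the toy data are made up for this example and do not reproduce CatBoost's internal code.

```python
import numpy as np

def ordered_target_encoding(categories, targets, prior=0.5, seed=0):
    """Encode a categorical column using only 'earlier' rows in a random order.

    Hypothetical helper for illustration; not CatBoost code.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    sums, counts = {}, {}            # running target sum / row count per category
    encoded = np.empty(len(categories), dtype=float)
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # The row's own target is excluded from its encoding (avoids target leakage).
        encoded[idx] = (s + prior) / (c + 1)
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded

cities = np.array(["London", "Paris", "London", "Tokyo", "Paris", "London"])
target = np.array([1, 0, 1, 1, 0, 0])
encoded = ordered_target_encoding(cities, target)
for city, value in zip(cities, encoded):
    print(f"{city:<7} -> {value:.3f}")
```

The first row seen for a category falls back to the prior, and later rows get progressively better-informed estimates. With the Scikit-learn style interface, none of this has to be written by hand: it is enough to tell CatBoost which columns are categorical through `cat_features`.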

In [None]:
# --- CatBoost: imports and data split ---
print("\n--- CatBoost Section --- ")
accuracy_cb, auc_cb, mse_cb, r2_cb = None, None, None, None # Initialize results
if cb:
    # Split the classification data
    X_cancer_train_cb, X_cancer_test_cb, y_cancer_train_cb, y_cancer_test_cb = train_test_split(
        X_cancer_base, y_cancer_base, test_size=0.2, random_state=42, stratify=y_cancer_base
    )
    print("CatBoost: Classification data split.")

    # Split the regression data
    if regression_available:
        X_housing_train_cb, X_housing_test_cb, y_housing_train_cb, y_housing_test_cb = train_test_split(
            X_housing_base, y_housing_base, test_size=0.2, random_state=42
        )
        print("CatBoost: Regression data split.")
    else:
        print("CatBoost: Regression data unavailable.")

else:
    print("CatBoost not available, skipping this section.")

In [None]:
# --- CatBoost: Classification ---
if cb:
    print("\nCatBoost: Training CatBoostClassifier...")
    cb_clf = cb.CatBoostClassifier(
        iterations=100,         
        learning_rate=0.1,
        depth=6,                
        l2_leaf_reg=3,          
        loss_function='Logloss',
        eval_metric='AUC',      
        random_seed=42,
        verbose=0,              # Suppress iteration output
        early_stopping_rounds=10
    )
    
    time_it(cb_clf.fit, X_cancer_train_cb, y_cancer_train_cb,
            eval_set=(X_cancer_test_cb, y_cancer_test_cb),
            verbose=0) # Pass verbose=0 to fit as well
    
    y_pred_cb_clf = cb_clf.predict(X_cancer_test_cb)
    y_proba_cb_clf = cb_clf.predict_proba(X_cancer_test_cb)[:, 1]
    accuracy_cb = accuracy_score(y_cancer_test_cb, y_pred_cb_clf)
    auc_cb = roc_auc_score(y_cancer_test_cb, y_proba_cb_clf)
    print(f"CatBoost Classifier Accuracy: {accuracy_cb:.4f}")
    print(f"CatBoost Classifier AUC: {auc_cb:.4f}")
else:
    print("Skipping CatBoost classification (library not loaded).")

In [None]:
# --- CatBoost: Regression ---
if cb and regression_available:
    print("\nCatBoost: Training CatBoostRegressor...")
    cb_reg = cb.CatBoostRegressor(
        iterations=100,
        learning_rate=0.1,
        depth=6,
        l2_leaf_reg=3,
        loss_function='RMSE', 
        eval_metric='RMSE',
        random_seed=42,
        verbose=0,
        early_stopping_rounds=10
    )
    
    time_it(cb_reg.fit, X_housing_train_cb, y_housing_train_cb,
            eval_set=(X_housing_test_cb, y_housing_test_cb),
            verbose=0)
            
    y_pred_cb_reg = cb_reg.predict(X_housing_test_cb)
    mse_cb = mean_squared_error(y_housing_test_cb, y_pred_cb_reg)
    r2_cb = r2_score(y_housing_test_cb, y_pred_cb_reg)
    print(f"CatBoost Regressor MSE: {mse_cb:.4f}")
    print(f"CatBoost Regressor R2: {r2_cb:.4f}")
elif cb:
    print("\nCatBoost: Skipping Regressor example (data unavailable).")
else:
    print("Skipping CatBoost regression (library not loaded).")

In [None]:
# --- CatBoost: Categorical Feature Handling --- 
if cb:
    print("\n--- CatBoost Categorical Feature Handling Example ---")
    # Create sample data with categorical features.
    # The target is random noise, so accuracy near 0.5 is expected; this cell only demonstrates the API.
    rng = np.random.default_rng(42)  # seeded so the example is reproducible
    cat_df = pd.DataFrame({
        'Num1': rng.random(100),
        'City': rng.choice(['London', 'Paris', 'Tokyo', 'NYC'], 100),
        'Weather': rng.choice(['Sunny', 'Cloudy', 'Rainy'], 100),
        'Target': rng.integers(0, 2, 100)
    })
    X_cat_train, X_cat_test, y_cat_train, y_cat_test = train_test_split(
        cat_df[['Num1', 'City', 'Weather']], cat_df['Target'], test_size=0.25, random_state=42
    )
    
    # Identify categorical features: here, every non-float column
    categorical_features_indices = np.where(X_cat_train.dtypes != float)[0]
    print(f"Categorical feature indices: {categorical_features_indices}") # Should be [1, 2]
    
    cb_clf_cat = cb.CatBoostClassifier(
        iterations=50, verbose=0, random_seed=42,
        cat_features=categorical_features_indices # Pass indices
    )
    print("Fitting CatBoostClassifier with cat_features...")
    time_it(cb_clf_cat.fit, X_cat_train, y_cat_train, 
            eval_set=(X_cat_test, y_cat_test), verbose=0)
    
    y_pred_cb_cat = cb_clf_cat.predict(X_cat_test)
    print(f"CatBoost (with cat features) Accuracy: {accuracy_score(y_cat_test, y_pred_cb_cat):.4f}")
else:
    print("Skipping CatBoost categorical example (library not loaded).")

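## 5. Performance Comparison and Summary

Let's review how the three libraries performed on our simple examples.
**Disclaimer**: These runs use very basic parameters and a limited number of training rounds. The results only demonstrate API usage and do **not** represent the libraries' true relative performance after tuning. Early stopping in these examples also monitors the test set, so the reported scores are slightly optimistic. Real projects require careful hyperparameter tuning.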

In [None]:
print("--- Performance Summary (Basic Run) ---")

def fmt(value):
    """Format a metric to 4 decimals, or return 'N/A' if it was never computed."""
    return f"{value:.4f}" if value is not None else "N/A"

print("Classification (Breast Cancer - Higher is better):")
if xgb:
    print(f"  XGBoost : Accuracy={fmt(accuracy_xgb)}, AUC={fmt(auc_xgb)}")
else: print("  XGBoost: Not run.")
if lgb:
    print(f"  LightGBM: Accuracy={fmt(accuracy_lgb)}, AUC={fmt(auc_lgb)}")
else: print("  LightGBM: Not run.")
if cb:
    print(f"  CatBoost: Accuracy={fmt(accuracy_cb)}, AUC={fmt(auc_cb)}")
else: print("  CatBoost: Not run.")

if regression_available:
    print("\nRegression (California Housing):")
    print("  Metric: MSE (Lower is better), R2 (Higher is better)")
    if xgb:
        print(f"  XGBoost : MSE={fmt(mse_xgb)}, R2={fmt(r2_xgb)}")
    else: print("  XGBoost: Not run.")
    if lgb:
        print(f"  LightGBM: MSE={fmt(mse_lgb)}, R2={fmt(r2_lgb)}")
    else: print("  LightGBM: Not run.")
    if cb:
        print(f"  CatBoost: MSE={fmt(mse_cb)}, R2={fmt(r2_cb)}")
    else: print("  CatBoost: Not run.")
else:
    print("\nRegression results not available.")

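### Key Takeaways

*   **XGBoost, LightGBM, and CatBoost** are all powerful tools for tabular data.
*   All three provide a convenient **Scikit-learn compatible interface**.
*   **LightGBM** usually stands out for its **speed**.
*   **CatBoost** has a distinctive advantage in **categorical feature handling**.
*   **XGBoost** is a **mature, stable, full-featured** choice.
*   **Hyperparameter tuning** and **early stopping** are essential for getting the best performance.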
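
The early stopping used throughout this tutorial (`early_stopping_rounds=10` and `stopping_rounds=10`) follows a simple patience rule: stop once the validation metric has not improved for a given number of consecutive rounds, and keep the best round seen so far. The sketch below shows that rule in isolation. The helper `best_iteration_with_patience` and the loss values are hypothetical and only meant to convey the idea; each library's own implementation has additional options.

```python
def best_iteration_with_patience(val_losses, patience=10):
    """Return (best_index, stop_index) for a sequence of per-round validation losses.

    Training stops once `patience` rounds in a row fail to beat the best loss so far.
    Hypothetical helper for illustration; not taken from any of the three libraries.
    """
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss      # new best round
        elif i - best_idx >= patience:
            return best_idx, i                 # stopped early at round i
    return best_idx, len(val_losses) - 1       # ran through every round

# Made-up validation losses: improving at first, then slowly getting worse.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.60]

best, stopped = best_iteration_with_patience(losses, patience=3)
print(f"best round: {best}, training stopped at round: {stopped}")
```

Because the stopping decision is driven by validation data, that data should ideally be a split that is separate from the final test set.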

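In real projects, try different libraries based on the characteristics of your data and your requirements, and choose the best model through cross-validation and tuning.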
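
As one possible way to combine cross-validation and tuning, the sketch below runs a small grid search with stratified 5-fold cross-validation using Scikit-learn. To keep it runnable without extra dependencies it uses Scikit-learn's own `GradientBoostingClassifier` as a stand-in, which is an assumption made for this example. Since all three libraries expose Scikit-learn compatible estimators, `xgb.XGBClassifier`, `lgb.LGBMClassifier`, or `cb.CatBoostClassifier` can be substituted, though parameter names differ (CatBoost uses `depth` rather than `max_depth`, for instance). The grid values are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

# Stand-in estimator (assumption): replace with one of the three libraries' classifiers.
estimator = GradientBoostingClassifier(n_estimators=50, random_state=42)

# A deliberately tiny, arbitrary grid; real searches usually cover more parameters.
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(estimator, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best mean CV AUC: {search.best_score_:.4f}")
```

Cross-validated scores are a more reliable basis for comparing libraries and settings than the single train/test split used in the sections above.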