# Week 9 — Gradient Boost (Capstone)
**Author:** James Hardison II  
**Date:** 2025-10-29

This notebook applies **Gradient Boosting** to the Capstone dataset, aligned with BU DX799 Week 9 objectives. It includes:
- Learning rate, number of estimators, and tree depth exploration
- Regularization via subsampling and leaf constraints
- Metrics, tuning, and feature importance
- Short reflections per section for Milestone Two

> ⚠️ Replace the **PLUG IN YOUR DATA** section with your real dataset.


## 1. Project Context (Brief)
- **Project title:** Predicting Kidney Disease Progression (example)
- **Objective:** Predict `GFR` (or another clinical target) from demographics and labs.
- **Dataset summary:** N rows, M features. Data sources: [describe].
- **Target variable:** e.g., `gfr_cleaned` (regression)

> Keep this section concise so peers from other domains can follow the context.


In [None]:
# 2. Imports and Utility
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

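# Helper that returns RMSE, MAE, and R2 as a one-row DataFrame
def regression_report(y_true, y_pred, label="model"):
    # np.sqrt(MSE) works across scikit-learn versions; squared=False is deprecated in newer releases
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))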
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return pd.DataFrame({
        "model":[label],
        "RMSE":[rmse],
        "MAE":[mae],
        "R2":[r2]
    })

np.random.seed(42)


## 3. Data — PLUG IN YOUR DATA
Replace the synthetic data block with data loading steps for your project.

**Expected cell to customize:**
- Load your data (CSV, SQL, etc.).
- Define `numeric_features` and `categorical_features` if applicable.
- Select `X` features and `y` target.
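
As a minimal sketch of what a customized cell could look like, the block below reads a tiny inline CSV and builds a preprocessor with both numeric and categorical columns. The column names (`sex`, `gfr_cleaned`), the `_demo` variable names, and the inline data are assumptions for illustration only; the inline text stands in for a real file or query.

```python
import io

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical inline CSV standing in for a real file, e.g. pd.read_csv("your_file.csv")
csv_text = """age,creatinine,sex,gfr_cleaned
54,1.2,F,78.0
67,,M,52.5
45,0.9,F,95.1
72,2.1,M,38.4
"""
df_demo = pd.read_csv(io.StringIO(csv_text))

# Declare the target and the feature groups explicitly
target_demo = "gfr_cleaned"
numeric_features_demo = ["age", "creatinine"]
categorical_features_demo = ["sex"]

X_demo = df_demo[numeric_features_demo + categorical_features_demo]
y_demo = df_demo[target_demo]

# Median-impute numeric columns; one-hot encode categorical columns
preprocess_demo = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), numeric_features_demo),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features_demo),
    ],
    remainder="drop",
)

X_demo_prepared = preprocess_demo.fit_transform(X_demo)
print(X_demo_prepared.shape)  # 2 numeric columns + 2 one-hot columns
```

In a real project the preprocessor would be fit only on the training split, which happens automatically when it is the first step of a `Pipeline`.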


In [None]:
# Example synthetic dataset for illustration (remove when using your real data)
# Let's create a regression dataset with a few informative features.
n = 1000
X_df = pd.DataFrame({
    "age": np.random.randint(20, 85, size=n),
    "creatinine": np.random.gamma(shape=2.0, scale=0.8, size=n),
    "uacr": np.random.gamma(shape=1.5, scale=30.0, size=n),
    "bmi": np.random.normal(28, 5, size=n),
    "albumin": np.random.normal(4.2, 0.5, size=n),
})
# True signal (toy): GFR decreases with creatinine, age, and BMI, rises slightly with uacr;
# albumin has no effect; Gaussian noise with standard deviation 5
y = 120 - 12*X_df["creatinine"] - 0.3*X_df["age"] + 0.02*X_df["uacr"] - 0.5*X_df["bmi"] + np.random.normal(0, 5, size=n)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42)

# Basic numeric-only pipeline (imputer placeholder in case of missing values)
numeric_features = list(X_df.columns)
numeric_transform = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

preprocess = ColumnTransformer(
    transformers=[("num", numeric_transform, numeric_features)],
    remainder="drop"
)

print("Train size:", X_train.shape, " Test size:", X_test.shape)
X_train.head()


## 4. Baseline Gradient Boost
Train an initial model to set a baseline. Record RMSE/MAE/R2 and a brief interpretation.
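
An optional reference point, not part of the original workflow: a mean predictor gives a floor against which the baseline RMSE can be read. The sketch below uses its own toy data and hypothetical variable names (`X_toy`, `rmse_dummy`, `rmse_gb`); it only illustrates the comparison.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy regression data (assumption: linear signal plus Gaussian noise)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=500)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Mean predictor: always predicts the training-set mean of y
dummy = DummyRegressor(strategy="mean").fit(Xtr, ytr)
# Default gradient boosting model for comparison
gb = GradientBoostingRegressor(random_state=0).fit(Xtr, ytr)

rmse_dummy = np.sqrt(mean_squared_error(yte, dummy.predict(Xte)))
rmse_gb = np.sqrt(mean_squared_error(yte, gb.predict(Xte)))
print(f"Mean-predictor RMSE:    {rmse_dummy:.3f}")
print(f"Gradient boosting RMSE: {rmse_gb:.3f}")
```

A baseline that barely beats the mean predictor suggests weak signal or a data problem rather than a tuning problem.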


In [None]:
gb0 = Pipeline(steps=[
    ("prep", preprocess),
    ("gb", GradientBoostingRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    ))
])

gb0.fit(X_train, y_train)
pred0 = gb0.predict(X_test)
baseline_metrics = regression_report(y_test, pred0, label="GB_baseline")
baseline_metrics


## 5. Learning Rate Study
Compare a few learning rates with a fixed number of estimators and depth.
Record RMSE vs. `learning_rate`. A smaller learning rate often generalizes better but needs more trees to reach the same fit.
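
To make the role of the learning rate concrete, here is a toy boosting loop for squared error in which each tree is fit to the current residuals and its prediction is scaled by `learning_rate` before being added. This is a simplified sketch, not scikit-learn's implementation; the function name `boost_squared_error` and the toy data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def boost_squared_error(X, y, n_rounds, learning_rate, max_depth=3):
    """Toy gradient boosting for squared error: each tree fits the residuals."""
    init = float(np.mean(y))            # initial prediction: mean of y
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred             # negative gradient of 0.5 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)  # shrunken update
        trees.append(tree)
    return init, trees, pred


# Toy 1-D data (assumption: noisy sine curve)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.2, size=300)

# Same number of rounds, two learning rates: compare training RMSE
train_rmse_by_lr = {}
for lr in (0.01, 0.1):
    _, _, fitted = boost_squared_error(X_toy, y_toy, n_rounds=50, learning_rate=lr)
    train_rmse_by_lr[lr] = float(np.sqrt(np.mean((y_toy - fitted) ** 2)))
    print(f"learning_rate={lr}: train RMSE = {train_rmse_by_lr[lr]:.3f}")
```

With only 50 rounds, the smaller rate leaves more of the training signal unfitted, which is why a low `learning_rate` is usually paired with a larger `n_estimators`.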


In [None]:
learning_rates = [0.01, 0.05, 0.1, 0.2, 0.3]
results_lr = []
for lr in learning_rates:
    model = Pipeline(steps=[
        ("prep", preprocess),
        ("gb", GradientBoostingRegressor(
            n_estimators=200,
            learning_rate=lr,
            max_depth=3,
            random_state=42
        ))
    ])
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))  # squared=False is deprecated in newer scikit-learn
    results_lr.append((lr, rmse))

df_lr = pd.DataFrame(results_lr, columns=["learning_rate", "RMSE"])

# Plot
plt.figure()
plt.plot(df_lr["learning_rate"], df_lr["RMSE"], marker="o")
plt.title("RMSE vs Learning Rate")
plt.xlabel("learning_rate")
plt.ylabel("RMSE")
plt.show()

df_lr.sort_values("RMSE")


## 6. Number of Estimators Curve
Hold `learning_rate` constant. Increase `n_estimators` and observe train vs. test RMSE.
Stop when test error plateaus to avoid overfitting and excess compute.
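
One way to trace this curve without refitting for every value is `staged_predict`, which yields predictions after each boosting stage of a single fitted model. The sketch below uses toy data and hypothetical names (`staged_rmse`, `best_n`); in practice a separate validation split, rather than the final test set, is the safer place to choose the stopping point.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data (assumption: mildly nonlinear signal plus noise)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 4))
y_toy = 2.0 * X_toy[:, 0] + X_toy[:, 1] ** 2 + rng.normal(scale=0.5, size=600)
Xtr, Xval, ytr, yval = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Fit once with the largest number of trees of interest
gb = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=3, random_state=0
).fit(Xtr, ytr)

# staged_predict yields held-out predictions after 1, 2, ..., 300 trees
staged_rmse = np.array([
    np.sqrt(mean_squared_error(yval, pred)) for pred in gb.staged_predict(Xval)
])

best_n = int(np.argmin(staged_rmse)) + 1  # stages are 1-indexed
print("Held-out RMSE is lowest at n_estimators =", best_n)
```

`GradientBoostingRegressor` also accepts `n_iter_no_change` and `validation_fraction` for built-in early stopping, which may be simpler when only the final model is needed.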


In [None]:
from sklearn.metrics import mean_squared_error

n_list = [50, 100, 150, 200, 300, 400]
train_rmse, test_rmse = [], []

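# Loop variable is named n_est so it does not overwrite n (the dataset size) defined earlier
for n_est in n_list:
    model = Pipeline(steps=[
        ("prep", preprocess),
        ("gb", GradientBoostingRegressor(
            n_estimators=n_est,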
            learning_rate=0.1,
            max_depth=3,
            random_state=42
        ))
    ])
    model.fit(X_train, y_train)
    pred_tr = model.predict(X_train)
    pred_te = model.predict(X_test)
    # np.sqrt(MSE) instead of squared=False, which is deprecated in newer scikit-learn
    train_rmse.append(np.sqrt(mean_squared_error(y_train, pred_tr)))
    test_rmse.append(np.sqrt(mean_squared_error(y_test, pred_te)))

plt.figure()
plt.plot(n_list, train_rmse, marker="o", label="Train RMSE")
plt.plot(n_list, test_rmse, marker="o", label="Test RMSE")
plt.title("RMSE vs n_estimators")
plt.xlabel("n_estimators")
plt.ylabel("RMSE")
plt.legend()
plt.show()

pd.DataFrame({"n_estimators": n_list, "Train_RMSE": train_rmse, "Test_RMSE": test_rmse})


## 7. Regularization: Depth, Leaf Size, Subsample
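Constrain model complexity to reduce overfitting:
- `max_depth` → shallower trees capture fewer feature interactions and often generalize better
- `min_samples_leaf` → requires a minimum number of training rows per leaf, which prevents tiny, noisy leaves
- `subsample` < 1.0 → fits each tree on a random fraction of the training rows (stochastic gradient boosting)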

Evaluate combinations and compare RMSE.
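
A quick way to see what these controls do is to compare the train and held-out RMSE of a constrained and an unconstrained model. The sketch below uses toy data; the two settings and the names `settings` and `rmse_by_setting` are illustrative assumptions, not recommended values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data (assumption: two informative features, three noise features)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 5))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=1.0, size=600)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

settings = {
    "constrained": dict(max_depth=2, min_samples_leaf=10, subsample=0.8),
    "unconstrained": dict(max_depth=8, min_samples_leaf=1, subsample=1.0),
}

rmse_by_setting = {}
for name, kwargs in settings.items():
    gb = GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.1, random_state=0, **kwargs
    ).fit(Xtr, ytr)
    rmse_tr = np.sqrt(mean_squared_error(ytr, gb.predict(Xtr)))
    rmse_te = np.sqrt(mean_squared_error(yte, gb.predict(Xte)))
    rmse_by_setting[name] = (rmse_tr, rmse_te)
    print(f"{name:>13}: train RMSE = {rmse_tr:.3f}, test RMSE = {rmse_te:.3f}")
```

A large gap between train and held-out RMSE is the usual sign that a model is memorizing the training rows; the constrained settings are meant to narrow that gap.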


In [None]:
params = [
    {"max_depth":2, "min_samples_leaf":5, "subsample":0.8},
    {"max_depth":3, "min_samples_leaf":5, "subsample":0.8},
    {"max_depth":3, "min_samples_leaf":10, "subsample":0.8},
    {"max_depth":4, "min_samples_leaf":5, "subsample":0.7},
]

rows = []
for p in params:
    model = Pipeline(steps=[
        ("prep", preprocess),
        ("gb", GradientBoostingRegressor(
            n_estimators=200,
            learning_rate=0.1,
            max_depth=p["max_depth"],
            min_samples_leaf=p["min_samples_leaf"],
            subsample=p["subsample"],
            random_state=42
        ))
    ])
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))  # squared=False is deprecated in newer scikit-learn
    rows.append({**p, "RMSE": rmse})

pd.DataFrame(rows).sort_values("RMSE")


## 8. Hyperparameter Tuning (GridSearchCV)
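Search a modest grid to find a good balance. The grid below has 3^5 = 243 combinations, so 5-fold CV runs 1,215 fits plus one final refit; trim it if compute is limited. Report best params and CV score.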
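
Beyond the single best setting, the `cv_results_` attribute shows how close the runner-up combinations are. The sketch below runs a deliberately tiny grid on toy data and tabulates mean CV RMSE per combination; the data and the names `search` and `cv_table` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Toy data (assumption: linear signal plus noise)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.5, size=300)

# Tiny grid: 4 combinations x 3 folds
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_toy, y_toy)

# Scores are negated RMSE, so flip the sign for readability
cv_table = (
    pd.DataFrame(search.cv_results_)
    .assign(mean_rmse=lambda d: -d["mean_test_score"])
    [["param_learning_rate", "param_max_depth", "mean_rmse",
      "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(cv_table)
```

If several combinations fall within one standard deviation of the best score, the simpler one (fewer or shallower trees) is often a reasonable choice.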


In [None]:
pipe = Pipeline(steps=[
    ("prep", preprocess),
    ("gb", GradientBoostingRegressor(random_state=42))
])

param_grid = {
    "gb__learning_rate": [0.05, 0.1, 0.2],
    "gb__n_estimators": [150, 200, 300],
    "gb__max_depth": [2, 3, 4],
    "gb__min_samples_leaf": [3, 5, 10],
    "gb__subsample": [0.7, 0.8, 1.0]
}

gcv = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=5, n_jobs=-1)
gcv.fit(X_train, y_train)

print("Best params:", gcv.best_params_)
print("CV RMSE:", -gcv.best_score_)

best_model = gcv.best_estimator_
pred_best = best_model.predict(X_test)
report_best = regression_report(y_test, pred_best, label="GB_best")
report_best


## 9. Feature Importance
Rank features by the model's impurity-based importance (`feature_importances_`). Discuss whether the ranking makes clinical or domain sense.
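
Impurity-based importances are computed from the training data, so a held-out cross-check can be useful. One option, offered here as a supplementary sketch rather than part of the original workflow, is `sklearn.inspection.permutation_importance`, which measures how much the score degrades when a feature is shuffled. The toy data and feature names below are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data (assumption): one strong feature, one weak feature, one pure-noise feature
rng = np.random.default_rng(0)
names = ["signal_strong", "signal_weak", "noise"]
X_toy = pd.DataFrame(rng.normal(size=(500, 3)), columns=names)
y_toy = 5.0 * X_toy["signal_strong"] + 1.0 * X_toy["signal_weak"] \
    + rng.normal(scale=0.5, size=500)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

gb = GradientBoostingRegressor(random_state=0).fit(Xtr, ytr)

# Shuffle each feature 10 times on held-out data and record the score drop
perm = permutation_importance(
    gb, Xte, yte, scoring="neg_root_mean_squared_error",
    n_repeats=10, random_state=0,
)

perm_table = pd.DataFrame({
    "feature": names,
    "impurity_importance": gb.feature_importances_,
    "permutation_mean": perm.importances_mean,
    "permutation_std": perm.importances_std,
}).sort_values("permutation_mean", ascending=False)
print(perm_table)
```

Features that rank high on both measures are the safest to discuss; disagreement between the two is worth noting in the reflection.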


In [None]:
# Extract importance from the final GB step
gb_step = best_model.named_steps["gb"]
importances = pd.Series(gb_step.feature_importances_, index=numeric_features).sort_values(ascending=False)
importances.head(15)


In [None]:
plt.figure()
importances.sort_values(ascending=True).plot(kind="barh")
plt.title("Gradient Boost Feature Importances")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()


## 10. Interpretation and Reflection
- **Overfitting controls:** [Summarize which settings helped and evidence from RMSE curves.]
- **Metrics used:** RMSE, MAE, R2. Best test RMSE = [value].
- **Expected vs unexpected:** [Note any surprises in feature importance or curves.]
- **EDA linkage:** [Explain how EDA guided feature selection and transformations.]
- **Decision:** Is Gradient Boost a top candidate for Milestone Two depth? Why?

> Keep this concise and Milestone-ready.


## 11. Sources (APA) + Yellowdig Note
- Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. *Journal of Machine Learning Research*, 12, 2825–2830.
- Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. *The Annals of Statistics*, *29*(5), 1189–1232.
- Scikit-learn User Guide (Gradient Boosting).

**Yellowdig post idea:** Share the sklearn User Guide link and justify it as a high-quality source due to clear API docs, examples, and mathematical grounding.
