In [None]:
!wget https://raw.githubusercontent.com/mattswatson/intro-to-trees-workshop/refs/heads/main/eicu_processed.csv

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.inspection import DecisionBoundaryDisplay
import matplotlib.pyplot as plt

def plot_tree_boundaries(model, x_train, y_train, feature_names, target_names):
    # Parameters
    n_classes = len(np.unique(y_train))
    plot_colors = "rb"

    # Plot the decision boundary
    g = DecisionBoundaryDisplay.from_estimator(
        model,
        x_train,
        cmap=plt.cm.RdYlBu,
        response_method="predict",
        xlabel=feature_names[0],
        ylabel=feature_names[1],
    )

    # Plot the training points, one color per class
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y_train == i)[0]
        plt.scatter(
            x_train.iloc[idx, 0],
            x_train.iloc[idx, 1],
            c=color,  # a single named color, so no colormap is needed
            label=target_names[i],
            edgecolor="black",
            s=15,
        )

    # Show which color corresponds to which class
    plt.legend()

    return g

features = ['age','acutephysiologyscore']
outcome = 'actualhospitalmortality'

data = pd.read_csv('eicu_processed.csv')

x = data[features]
y = data[outcome]

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=42)

# Gradient Boosting

Next, we move on to gradient boosting, which combines ideas from the previous methods. As a “boosting” method, it builds trees iteratively, with each new tree trained to correct the errors of the trees before it. Gradient boosting can also sub-sample the variables considered at each split (just like Random Forests), which can help to prevent overfitting.

Although the details are beyond the scope of this tutorial, the biggest innovation in gradient boosting is that it provides a unifying mathematical framework for boosting models. The approach explicitly casts tree building as an optimization problem, defining mathematical functions for how well a tree performs (the loss, which we had before) and for how complex a tree is. Seen this way, AdaBoost can be treated as a “special case” of gradient boosting in which the loss function is the exponential loss.
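To make the optimization view concrete, here is a minimal hand-rolled sketch of gradient boosting for a regression problem with squared-error loss, where the negative gradient of the loss is proportional to the residual. It is an illustration under stated assumptions, not how scikit-learn implements it: the function name `fit_gradient_boosting`, the synthetic sine data, and the parameter values are all made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(x, y, n_estimators=10, learning_rate=0.1, max_depth=2):
    """Illustrative gradient boosting for squared-error loss (hypothetical helper)."""
    # Start from a constant prediction; the mean minimizes squared error.
    init = float(np.mean(y))
    prediction = np.full(len(y), init)
    trees, losses = [], []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is proportional to the residual.
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(x, residuals)
        # Take a small step in the direction suggested by the new tree.
        prediction = prediction + learning_rate * tree.predict(x)
        trees.append(tree)
        # Track the training loss after adding this tree.
        losses.append(float(np.mean((y - prediction) ** 2)))
    return init, trees, losses


# Synthetic data (an assumption for this sketch, unrelated to the eICU dataset)
rng = np.random.default_rng(0)
x_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(x_demo[:, 0]) + rng.normal(scale=0.1, size=200)

init, trees, losses = fit_gradient_boosting(x_demo, y_demo)
print(f"Training MSE after first tree: {losses[0]:.3f}")
print(f"Training MSE after last tree:  {losses[-1]:.3f}")
```

Each tree is fitted to what the current ensemble still gets wrong, so the training loss shrinks as trees are added. Swapping the squared-error loss for another differentiable loss changes only how the residual-like target is computed, which is the sense in which the framework is unifying.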

Again, our data cannot contain missing values. Let’s fix that and build a gradient boosting model.

In [4]:
# Fill missing data with -1
data_no_nans = data.fillna(-1)

x = data_no_nans[features]
y = data_no_nans[outcome]

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=42)

**Task:** Use [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) to train a gradient boosting classifier with 10 estimators.
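
If the API is unfamiliar, the following sketch shows the general fit and predict pattern on a small synthetic dataset rather than the eICU data. The variable names (`x_syn`, `demo_model`, and so on) and the `make_classification` settings are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-feature binary classification problem (illustrative only)
x_syn, y_syn = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
x_syn_train, x_syn_test, y_syn_train, y_syn_test = train_test_split(
    x_syn, y_syn, train_size=0.7, random_state=42
)

# n_estimators sets the number of boosting stages (trees)
demo_model = GradientBoostingClassifier(n_estimators=10, random_state=0)
demo_model.fit(x_syn_train, y_syn_train)

demo_pred = demo_model.predict(x_syn_test)
demo_accuracy = demo_model.score(x_syn_test, y_syn_test)
print(f"Test accuracy on synthetic data: {demo_accuracy:.2f}")
```

The same pattern should carry over to `x_train` and `y_train` from the cell above.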

In [None]:
from sklearn import ensemble

np.random.seed(321)
model =
model =

plot_tree_boundaries(model, x_train, y_train, feature_names=features, target_names=['Alive', 'Dead'])