In [None]:
!wget https://raw.githubusercontent.com/mattswatson/intro-to-trees-workshop/refs/heads/main/eicu_processed.csv

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.inspection import DecisionBoundaryDisplay
import matplotlib.pyplot as plt

def plot_tree_boundaries(model, x_train, y_train, feature_names, target_names):
    """Plot a fitted model's decision regions with the training points overlaid."""
    n_classes = len(np.unique(y_train))
    plot_colors = "rb"  # one color per class: red for class 0, blue for class 1

    # Shade the decision regions predicted by the model
    g = DecisionBoundaryDisplay.from_estimator(
        model,
        x_train,
        cmap=plt.cm.RdYlBu,
        response_method="predict",
        xlabel=feature_names[0],
        ylabel=feature_names[1],
    )

    # Overlay the training points, one color per class
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y_train == i)[0]
        g.ax_.scatter(
            x_train.iloc[idx, 0],
            x_train.iloc[idx, 1],
            c=color,
            label=target_names[i],
            edgecolor="black",
            s=15,
        )
    g.ax_.legend()

    return g

features = ['age','acutephysiologyscore']
outcome = 'actualhospitalmortality'

data = pd.read_csv('eicu_processed.csv')

x = data[features]
y = data[outcome]

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=42)

In the previous workbook we saw one ensembling technique, boosting, which works by reweighting incorrectly classified samples so that the ensemble's sub-models place more importance on them.

Bootstrap aggregation, or “Bagging”, is another form of ensemble learning.

With boosting, we iteratively changed the dataset so that new trees focused on the “difficult” observations. Bagging also trains each tree on a modified version of the dataset, but instead of selectively choosing which observations to focus on, we randomly sample the data each time.

Boosting aimed to iteratively improve our overall model with new trees. With bagging, we now build trees on what we hope are independent datasets.

Let’s take a step back and think about a practical example. Say we wanted a good model of heart disease. If we saw researchers build a model from a dataset of patients from their hospital, we might think this would be sufficient. If the researchers were then able to acquire a new dataset from new patients and build a second model, we’d be inclined to feel that the combination of the two models would be better than either one individually.

This is the scenario that bagging aims to replicate, except instead of actually going out and collecting new datasets, we instead use “bootstrapping” to create new sets of data from our current dataset. If you are unfamiliar with bootstrapping, you can treat it as magic for now (and if you are familiar with the bootstrap, you already know that it is magic).
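To take a little of the magic out of it: a bootstrap sample is drawn from the original dataset *with replacement*, usually at the same size as the original. The sketch below is only an illustration on a toy array of row indices (the seed and the size of 1000 are arbitrary assumptions, not part of the workshop data). It shows that some rows get picked several times while others are left out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Toy stand-in for a dataset: 1000 row indices (e.g. one per patient)
n_samples = 1000
original_rows = np.arange(n_samples)

# A bootstrap sample has the same size as the original data,
# but rows are drawn with replacement
bootstrap_rows = rng.choice(original_rows, size=n_samples, replace=True)

# Fraction of the original rows that made it into this bootstrap sample
unique_fraction = len(np.unique(bootstrap_rows)) / n_samples
print(f"Fraction of original rows present: {unique_fraction:.2f}")
```

On average a bootstrap sample contains roughly 63% of the original rows (the limit is 1 - 1/e), so each resampled dataset differs from the others, which is what lets the trees built on them differ too.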

Let’s take a look at a simple bootstrap model.

**Task:** Use [BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) to train a bagged ensemble of 6 decision trees, where the base estimator (the `estimator` parameter, called `base_estimator` before scikit-learn 1.2) is a `DecisionTreeClassifier` with `max_depth=5`.

In [None]:
import numpy as np
from sklearn import tree, ensemble
import matplotlib.pyplot as plt

np.random.seed(321)
tree_model = 
ensemble_model = 
ensemble_model = ensemble_model.fit(x_train, y_train)

# Each call draws one tree's decision regions in its own figure
for estimator in ensemble_model.estimators_:
    plot_tree_boundaries(estimator, x_train, y_train, feature_names=features, target_names=['Alive', 'Dead'])

We can see that each individual tree varies considerably. This is because each tree is trained on a different random (bootstrap) sample of the training data.
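If you get stuck on the task, here is one possible way to fill in the two blanks. This is a sketch rather than the official solution: it runs on a synthetic two-feature dataset (`x_demo` and `y_demo` are hypothetical stand-ins for `x_train` and `y_train`) so that it is self-contained, and it passes the tree positionally so that it does not depend on whether your scikit-learn version names the parameter `estimator` or `base_estimator`.

```python
import numpy as np
from sklearn import ensemble, tree
from sklearn.datasets import make_classification

# Synthetic stand-in for x_train / y_train (an assumption, not the eICU data)
x_demo, y_demo = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

np.random.seed(321)
# Base learner: a shallow decision tree
tree_model = tree.DecisionTreeClassifier(max_depth=5)
# Bagging wrapper: 6 copies of the tree, each fit on its own bootstrap sample
ensemble_model = ensemble.BaggingClassifier(tree_model, n_estimators=6)
ensemble_model = ensemble_model.fit(x_demo, y_demo)

# Inspect the fitted sub-models
depths = [est.get_depth() for est in ensemble_model.estimators_]
print(len(ensemble_model.estimators_), depths)
```

In the workshop cell itself you would keep `x_train` and `y_train` in the call to `fit`.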

In [None]:
plot_tree_boundaries(ensemble_model, x_train, y_train, feature_names=features,  target_names=['Alive', 'Dead'])

Of course, since this is a simple dataset, we are not seeing dramatic differences between the models. Don’t worry, we’ll quantitatively evaluate them later.
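To make the "aggregation" half of bagging concrete, the sketch below builds a small bagged ensemble by hand: it fits each tree on its own bootstrap sample and then averages the trees' predicted class probabilities. It is an illustration on synthetic data (`x_demo`, `y_demo`, and the seeds are assumptions) in the spirit of what `BaggingClassifier` does, not a reproduction of its internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature dataset (hypothetical stand-in for the workshop data)
x_demo, y_demo = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
rng = np.random.default_rng(321)

# Bootstrap: fit each tree on rows drawn with replacement
trees = []
for _ in range(6):
    rows = rng.integers(0, len(x_demo), size=len(x_demo))
    trees.append(DecisionTreeClassifier(max_depth=5).fit(x_demo[rows], y_demo[rows]))

# Aggregate: average the per-tree class probabilities, then pick the likeliest class
mean_proba = np.mean([t.predict_proba(x_demo) for t in trees], axis=0)
bagged_pred = mean_proba.argmax(axis=1)

# Accuracy on the training data, for illustration only
single_acc = [t.score(x_demo, y_demo) for t in trees]
bagged_acc = (bagged_pred == y_demo).mean()
print("Individual trees:", np.round(single_acc, 3))
print("Averaged ensemble:", round(bagged_acc, 3))
```

These accuracies are measured on the same data the trees were trained on, so they say nothing about generalization; they only show how the individual predictions are combined.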

Next up, a minor addition creates one of the most popular models in machine learning.