In [None]:
!wget https://raw.githubusercontent.com/mattswatson/intro-to-trees-workshop/refs/heads/main/eicu_processed.csv

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.inspection import DecisionBoundaryDisplay
import matplotlib.pyplot as plt

def plot_tree_boundaries(model, x_train, y_train, feature_names, target_names):
    # Parameters
    n_classes = len(np.unique(y_train))
    plot_colors = "rb"  # one colour per class (red, blue)

    # Plot the decision boundary
    g = DecisionBoundaryDisplay.from_estimator(
        model,
        x_train,
        cmap=plt.cm.RdYlBu,
        response_method="predict",
        xlabel=feature_names[0],
        ylabel=feature_names[1],
    )

    # Plot the training points, one colour per class
    # (assumes the class labels are the integers 0..n_classes-1)
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y_train == i)[0]
        plt.scatter(
            x_train.iloc[idx, 0],
            x_train.iloc[idx, 1],
            c=color,
            label=target_names[i],
            edgecolor="black",
            s=15
        )
        
    return g

features = ['age','acutephysiologyscore']
outcome = 'actualhospitalmortality'

data = pd.read_csv('eicu_processed.csv')

x = data[features]
y = data[outcome]

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=42)

# Model Evaluation

We’ve now learned the basics of the various tree methods and have visualized most of them; however, how do we actually know which one is the best predictive model? Let’s finish by comparing the performance of our models on our held-out test data. Our goal, remember, is to predict whether or not a patient will survive their hospital stay using the patient’s age and acute physiology score computed on the first day of their ICU stay.

We will begin by training a model for each of the techniques we have looked at so far.

In [2]:
# Fill missing data with -1
data_no_nans = data.fillna(-1)

x = data_no_nans[features]
y = data_no_nans[outcome]

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.7, random_state =  42)

In [3]:
from sklearn import metrics, ensemble, tree

# Train one model per technique; random_state is fixed so that results are repeatable between runs
models = dict()
models['Decision Tree'] = tree.DecisionTreeClassifier(criterion='entropy', splitter='best', random_state=42).fit(x_train, y_train)
models['Gradient Boosting'] = ensemble.GradientBoostingClassifier(n_estimators=10, random_state=42).fit(x_train, y_train)
models['Random Forest'] = ensemble.RandomForestClassifier(n_estimators=10, random_state=42).fit(x_train, y_train)
models['Bagging'] = ensemble.BaggingClassifier(n_estimators=10, random_state=42).fit(x_train, y_train)
models['AdaBoost'] = ensemble.AdaBoostClassifier(n_estimators=10, random_state=42).fit(x_train, y_train)

We now have a model for each of the techniques we have looked at. There are a number of different metrics we can use to measure the performance of our models, depending on what we want to measure. The most basic is accuracy: on the test set, what proportion of predictions does our model get correct?

In [None]:
print('Accuracy\tModel')
for current_model in models:
    # Predict a hard class label for every sample in the held-out test set
    predictions = models[current_model].predict(x_test)

    # Accuracy is the fraction of test samples whose predicted label matches the true label
    score = metrics.accuracy_score(y_test, predictions)
    print('{:0.3f}\t{}'.format(score, current_model))

As you might expect (although this is, crucially, not always the case!), the more advanced gradient boosting technique achieves the best performance.

**Question:** What is a possible issue with relying on accuracy as a performance metric?

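Accuracy is not always a suitable performance metric, especially when our data is imbalanced (i.e. there are many more samples in one class than the other). In that situation, a model can score a high accuracy simply by predicting the majority class for every sample.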
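
To make this concrete, here is a minimal sketch using made-up labels (not the eICU data). The 95/5 class split and the "always predict the majority class" model are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn import metrics

# Hypothetical, heavily imbalanced labels: 95 negatives (0) and 5 positives (1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = metrics.accuracy_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)

print('Accuracy: {:0.2f}'.format(accuracy))  # 0.95
print('Recall:   {:0.2f}'.format(recall))    # 0.00
```

The accuracy looks excellent, yet the model never identifies a single positive sample.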

Another issue with accuracy is that it only measures performance at a single decision threshold. To inspect how well our models perform at different thresholds, we can plot the Receiver Operating Characteristic curve. This curve plots the false positive rate (x-axis) against the true positive rate (y-axis) at different decision thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives (and vice-versa when increasing the threshold). We can summarise this curve by calculating the area under the curve (AUROC, for Area Under the Receiver Operating Characteristic curve).

**Task:** Search for the correct functions in [sklearn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) to display the ROC curve, and calculate the AUROC for each model.

In [None]:
import matplotlib.pyplot as plt

ax = plt.subplot()

for current_model in models:    
   # Display the ROC curve

print('AUROC\tModel')
for current_model in models:    
    # Calculate and print AUROC
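
If you get stuck, the sketch below shows one possible approach on synthetic stand-in data rather than on our models. Everything with a `demo` or `xd_`/`yd_` prefix is a name made up for this example, and the data comes from `make_classification`, not from eICU. It uses `metrics.roc_curve` and `metrics.roc_auc_score`; `metrics.RocCurveDisplay.from_estimator` is an alternative that handles the plotting for you.

```python
import matplotlib.pyplot as plt
from sklearn import ensemble, metrics, tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (NOT the eICU dataset): two features, roughly 10% positives
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, train_size=0.7, random_state=42, stratify=y_demo
)

# Two illustrative models; the hyperparameters are arbitrary choices for this sketch
demo_models = {
    'Decision Tree': tree.DecisionTreeClassifier(max_depth=3, random_state=0),
    'Gradient Boosting': ensemble.GradientBoostingClassifier(n_estimators=10, random_state=0),
}

ax = plt.subplot()
auroc = {}
for name, model in demo_models.items():
    model.fit(xd_train, yd_train)

    # ROC analysis needs a continuous score, so use the predicted probability of the positive class
    proba = model.predict_proba(xd_test)[:, 1]

    # One (false positive rate, true positive rate) pair per decision threshold
    fpr, tpr, thresholds = metrics.roc_curve(yd_test, proba)
    auroc[name] = metrics.roc_auc_score(yd_test, proba)
    ax.plot(fpr, tpr, label='{} (AUROC = {:0.3f})'.format(name, auroc[name]))

# The diagonal is the performance of a classifier that guesses at random
ax.plot([0, 1], [0, 1], linestyle='--', color='grey', label='Chance')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.legend()

print('AUROC\tModel')
for name, score in auroc.items():
    print('{:0.3f}\t{}'.format(score, name))
```

Note that the curve is computed from `predict_proba` rather than `predict`: hard labels would give only a single operating point per model.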

## Precision and Recall

A binary classification model can fail in one of two ways: false negatives (a predictive model misses a positive sample) and false positives (a predictive model incorrectly labels a negative sample as positive). In some cases, we may prefer our model to make one type of error over the other. For example, in our mortality prediction scenario, it may be desirable to make fewer false negative classifications (as each false negative is a missed patient at risk of critical deterioration) at the expense of making more false positive errors. Conversely, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase in false negatives).

This is where metrics such as precision and recall come in. Precision measures: What proportion of positive identifications was actually correct?

**Question:** Can you come up with a possible formula for precision?

Recall, on the other hand, measures: What proportion of actual positives was identified correctly?

**Question:** Can you come up with a possible formula for recall?

Let's calculate the precision and recall of our models.

**Task:** Search for the correct functions in [sklearn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) to calculate the precision and recall for each model.

In [None]:
print('Precision\tRecall\tModel')
for current_model in models:    
    # Calculate and print the precision and recall

We can see that all of our models have low precision and recall scores, despite having high AUROC and accuracy. Let's take a look at the Gradient Boosting results in a bit more detail to see what these results are telling us.

**Question:** The gradient boosting model has a precision of 0.6667 - what does this mean in terms of the positively classified predictions?

**Question:** The gradient boosting model has a recall of 0.3333 - what does this mean in terms of our predictions?

As our model has a precision of 0.6667, this means that when it predicts a patient will not survive their hospital stay, it is correct two thirds of the time. With a recall of 0.3333, it means we correctly identify only a third of the patients who do not survive their stay.
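
As a worked example, the sketch below uses a small set of made-up labels, chosen so that the counts reproduce a precision of 0.6667 and a recall of 0.3333. These are not our model's actual predictions; `y_true_toy` and `y_pred_toy` are hypothetical names for this illustration.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy labels (1 = did not survive), chosen to reproduce the values above
y_true_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order: TN, FP, FN, TP
tn, fp, fn, tp = metrics.confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # 2 / (2 + 1): of the 3 positive predictions, 2 are correct
recall = tp / (tp + fn)     # 2 / (2 + 4): of the 6 actual positives, 2 are found

print('TP={} FP={} FN={} TN={}'.format(tp, fp, fn, tn))
print('Precision: {:0.4f}'.format(precision))  # 0.6667
print('Recall:    {:0.4f}'.format(recall))     # 0.3333
```

Four of the six patients who do not survive are missed entirely, which the precision alone would not reveal.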

Precision and recall are opposing metrics: increasing the decision threshold typically increases precision while reducing recall, and vice-versa. We can visualise this by plotting a precision-recall curve.
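
Before plotting the full curve, it can help to sweep a handful of thresholds by hand. The sketch below does this on synthetic stand-in data (generated with `make_classification`, not eICU); the model, the `demo`/`xd_`/`yd_` names and the threshold values are all assumptions made for illustration.

```python
from sklearn import ensemble, metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (NOT the eICU dataset)
x_demo, y_demo = make_classification(
    n_samples=2000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, train_size=0.7, random_state=42, stratify=y_demo
)

demo_model = ensemble.GradientBoostingClassifier(n_estimators=10, random_state=0)
demo_model.fit(xd_train, yd_train)
proba = demo_model.predict_proba(xd_test)[:, 1]

print('Threshold\tPrecision\tRecall')
recalls = []
for threshold in [0.1, 0.3, 0.5, 0.7]:
    # Classify a sample as positive when its predicted probability reaches the threshold
    predictions = (proba >= threshold).astype(int)

    # zero_division=0 avoids a warning if no sample is predicted positive at this threshold
    precision = metrics.precision_score(yd_test, predictions, zero_division=0)
    recall = metrics.recall_score(yd_test, predictions)
    recalls.append(recall)
    print('{:0.1f}\t\t{:0.3f}\t\t{:0.3f}'.format(threshold, precision, recall))
```

Raising the threshold can only shrink the set of positive predictions, so recall never increases. Precision usually rises, but this is not guaranteed at every step.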

In [None]:
ax = plt.subplot()

for current_model in models:    
    # Show the precision-recall curve

The F-score measures this trade-off between precision and recall, and is often used to give a general idea of model performance, especially when the data is imbalanced. Typically we use the F1 score, which is the harmonic mean of precision and recall, but this can be generalised to the F-β score, which treats recall as β times as important as precision (so β > 1 favours recall and β < 1 favours precision).

**Question:** What is the formula for the F-β score?

Let's calculate the F1 score of our models.

In [None]:
print('F1\tModel')
for current_model in models:    
    # Calculate the F1 score for the model
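
As a reference point, the sketch below computes the F1 score by hand for a set of made-up labels with a precision of 2/3 and a recall of 1/3, and checks the result against `metrics.f1_score`. The labels are hypothetical and are not our model's predictions. It also calls `metrics.fbeta_score` with two values of β to show how the weighting shifts the score.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy labels (1 = did not survive): precision = 2/3, recall = 1/3
y_true_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

precision = metrics.precision_score(y_true_toy, y_pred_toy)
recall = metrics.recall_score(y_true_toy, y_pred_toy)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1 = metrics.f1_score(y_true_toy, y_pred_toy)

# beta > 1 gives more weight to recall, beta < 1 gives more weight to precision
f2 = metrics.fbeta_score(y_true_toy, y_pred_toy, beta=2)
f_half = metrics.fbeta_score(y_true_toy, y_pred_toy, beta=0.5)

print('F1 (manual):  {:0.4f}'.format(f1_manual))  # 0.4444
print('F1 (sklearn): {:0.4f}'.format(f1))         # 0.4444
print('F2:           {:0.4f}'.format(f2))         # 0.3704
print('F0.5:         {:0.4f}'.format(f_half))     # 0.5556
```

Because recall is the weaker of the two metrics here, emphasising recall (β = 2) lowers the score, while emphasising precision (β = 0.5) raises it.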

With such low precision and recall values, our models may not be very useful in practice. There are multiple possible explanations for this: our task might be difficult, our models might not be powerful enough and so on. One notable point is that our data is imbalanced, and we have (so far) only used two features (age and acute physiology score) in our models.

**Question:** What proportion of our data belongs to the positive class (i.e. what fraction of patients do not survive their hospital stay)?
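
One way to check this is with `value_counts(normalize=True)` from pandas. The sketch below runs on a made-up outcome column, assuming the positive class is encoded as `1`; the encoding in the real dataset may differ, so inspect the column before reusing this.

```python
import pandas as pd

# Hypothetical outcome column (1 = did not survive); the real labels may be encoded differently
toy_outcome = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name='actualhospitalmortality')

# normalize=True returns the fraction of samples in each class rather than raw counts
class_proportions = toy_outcome.value_counts(normalize=True)
print(class_proportions)

positive_rate = class_proportions.loc[1]
print('Positive class proportion: {:0.2f}'.format(positive_rate))  # 0.20
```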

In the next (and final) workbook, we will look at using different Python libraries to create more complex models that incorporate all available features in the dataset.