# DATASCI 503, Group Work 7: Trees and Tree Ensembles

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams (assigned by the GSI) to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. During lab, feel free to flag down your GSI to ask questions at any point!

##  Classification Trees

A **classification tree** is built through a process known as binary recursive partitioning. The main idea is as follows:
* Recursively partition the input space into rectangular boxes
* At each step, ask a question about one variable (split the feature space into parts)
* Repeat for each branch (recursively partition the feature space into boxes)
* Goal: each box should contain data points mostly from the same class
* Each box is labelled with its majority class
* The end result: a tree of splits, a partitioning of the variable space into boxes and assignment of a class label to each box
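
The partitioning steps above can be illustrated on a small example. The following is a minimal sketch on a hypothetical toy dataset (the grid, the labels, and the names `X_toy`, `y_toy`, and `toy_tree` are assumptions made for illustration, not part of the lab data). It fits a shallow tree and prints the resulting splits, so that each root-to-leaf path can be read as one rectangular box with its class label.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: a 10 x 10 grid of points in the unit square.
grid = (np.arange(10) + 0.5) / 10
x0, x1 = np.meshgrid(grid, grid)
X_toy = np.column_stack([x0.ravel(), x1.ravel()])

# Class 1 occupies the upper-right quadrant; everything else is class 0.
y_toy = ((X_toy[:, 0] > 0.5) & (X_toy[:, 1] > 0.5)).astype(int)

# Two levels of binary splits are enough to carve out the boxes here.
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Each root-to-leaf path in the printout is one box with its class label.
print(export_text(toy_tree, feature_names=["x0", "x1"]))
print("number of boxes (leaves):", toy_tree.get_n_leaves())
print("training accuracy:", toy_tree.score(X_toy, y_toy))
```

Each split asks a question about a single variable, which is why the boxes are axis-aligned rectangles.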



Scikit-learn's decision trees implement an optimized version of CART. The fundamental principles behind the algorithm are largely based on the concepts introduced by Breiman et al. in the book *Classification and Regression Trees*.

Additionally, scikit-learn extends the basic CART algorithm with several modern features, such as support for missing values, a choice of splitting criteria (Gini impurity and entropy for classification; mean squared error and mean absolute error for regression), and pre-pruning options.

In [None]:
import io
import zipfile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from sklearn import tree
from sklearn.ensemble import (
    GradientBoostingClassifier,
    GradientBoostingRegressor,
    RandomForestClassifier,
    RandomForestRegressor,
)
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    confusion_matrix,
    mean_squared_error,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

We look at the spam email classification problem. This dataset is available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/spambase).

In [None]:
url = "https://archive.ics.uci.edu/static/public/94/spambase.zip"
response = requests.get(url, timeout=30)
z = zipfile.ZipFile(io.BytesIO(response.content))
z.extractall("data")  # extract into data/ so that "data/spambase.data" below resolves

In [None]:
column_names = [
    "word_freq_make",
    "word_freq_address",
    "word_freq_all",
    "word_freq_3d",
    "word_freq_our",
    "word_freq_over",
    "word_freq_remove",
    "word_freq_internet",
    "word_freq_order",
    "word_freq_mail",
    "word_freq_receive",
    "word_freq_will",
    "word_freq_people",
    "word_freq_report",
    "word_freq_addresses",
    "word_freq_free",
    "word_freq_business",
    "word_freq_email",
    "word_freq_you",
    "word_freq_credit",
    "word_freq_your",
    "word_freq_font",
    "word_freq_000",
    "word_freq_money",
    "word_freq_hp",
    "word_freq_hpl",
    "word_freq_george",
    "word_freq_650",
    "word_freq_lab",
    "word_freq_labs",
    "word_freq_telnet",
    "word_freq_857",
    "word_freq_data",
    "word_freq_415",
    "word_freq_85",
    "word_freq_technology",
    "word_freq_1999",
    "word_freq_parts",
    "word_freq_pm",
    "word_freq_direct",
    "word_freq_cs",
    "word_freq_meeting",
    "word_freq_original",
    "word_freq_project",
    "word_freq_re",
    "word_freq_edu",
    "word_freq_table",
    "word_freq_conference",
    "char_freq_;",
    "char_freq_(",
    "char_freq_[",
    "char_freq_!",
    "char_freq_$",
    "char_freq_#",
    "capital_run_length_average",
    "capital_run_length_longest",
    "capital_run_length_total",
    "is_spam",
]

In [None]:
spam_data = pd.read_csv("data/spambase.data", names=column_names)

In the dataset, there are 57 continuous variables as input variables and 1 binary outcome variable. The variable information is given as follows:

``word_freq_WORD``: percentage of words in the email that match WORD (48 variables taking value in [0,100]);

``char_freq_CHAR``: percentage of characters in the email that match CHAR (6 variables taking value in [0,100]);

``capital_run_length_average``: average length of uninterrupted sequences of capital letters;

``capital_run_length_longest``: length of longest uninterrupted sequence of capital letters;

``capital_run_length_total``: total number of capital letters in the email;

``is_spam``: denotes whether the email was considered spam (1) or not (0).


In [None]:
spam_train, spam_test = train_test_split(
    spam_data, test_size=0.3, random_state=1, stratify=spam_data["is_spam"]
)

Let us take a look at the word frequencies. In a word vector, words that appear too frequently or too rarely are often of little use for making predictions. We order the first 48 columns according to their mean frequency. Later we will use a tree model to check whether this intuition is correct.

In [None]:
# Calculate the column mean for the first 48 columns which corresponds to word frequency
word_freq_mean = spam_train.mean(axis=0)[0:48]

# Change the order by the frequency value
word_freq_mean_sort = word_freq_mean.sort_values(ascending=True)
word_freq_mean_sort

It’s interesting to see that the word ``george`` has a high frequency. This is because George is the donor of the dataset. We may see later that ``hp``, the place where George works, is selected by the tree model. Note that relying on these two words may cause generalization problems when we apply the model to the emails of other users. We also find that the labels are slightly unbalanced, which reminds us to adjust for the label distribution when we generalize the model to a broader application.

In [None]:
pd.crosstab(index=spam_train["is_spam"], columns="count")

You can check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) for `DecisionTreeClassifier`. Here are some arguments we use:

- ``criterion``: {“gini”, “entropy”, “log_loss”}, default= "gini"

The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "log_loss" and "entropy" both for the Shannon information gain, see Mathematical formulation.

- ``class_weight``: dict, list of dict or "balanced", default=None

Weights associated with classes in the form {class_label: weight}. If None, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

- ``ccp_alpha``: non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ``ccp_alpha`` will be chosen. By default, no pruning is performed.

We will explain each argument in detail later.

In [None]:
X_spam_train = spam_train.drop(["is_spam"], axis=1)
y_spam_train = spam_train["is_spam"]
X_spam_test = spam_test.drop(["is_spam"], axis=1)
y_spam_test = spam_test["is_spam"]

In [None]:
tree1 = DecisionTreeClassifier(random_state=0)
tree1 = tree1.fit(X_spam_train, y_spam_train)

In [None]:
training_error = 1 - tree1.score(X_spam_train, y_spam_train)
test_error = 1 - tree1.score(X_spam_test, y_spam_test)
print("training error is", training_error)
print("test error is", test_error)

One of the most attractive properties of trees is that they can be graphically displayed. We use the ``plot_tree()`` function to depict the tree structure.

In [None]:
plt.figure(figsize=(20, 10))
tree.plot_tree(tree1)
plt.show()

In [None]:
import graphviz

In [None]:
dot_data = tree.export_graphviz(
    tree1,
    out_file=None,
    feature_names=column_names[:-1],
    class_names=["not_spam", "spam"],
    filled=True,
    rounded=True,
    special_characters=True,
)

graph = graphviz.Source(dot_data)
graph.render("decision_tree.dot", format="png", cleanup=True)  # Save and render the graph as PNG
# graph  # Display the graph

Since we didn't penalize the size of the tree, we generated an extremely complex model: the tree keeps splitting until its leaves are pure or cannot be split further, which produces some very long branches. Note that ``plot_tree()`` does not scale the length of a branch by the reduction in impurity of the corresponding split; instead, the impurity of each node is printed inside its box.

As mentioned in the lecture, we prune a tree by finding a sub-tree $T$ that minimizes:
$$
  C(T)=\sum_{t=1}^{|T|} N_{t} \cdot \text{Impurity} (t)+c_{p} \cdot|T|.
$$
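
In scikit-learn, the ``ccp_alpha`` argument plays the role of the penalty $c_p$, with $|T|$ counted as the number of leaves. The following is a minimal sketch on synthetic stand-in data (the dataset and the names `X_demo`, `y_demo`, `picked_alphas`, and `leaf_counts` are assumptions made for illustration). It uses `cost_complexity_pruning_path` to list candidate penalties and then refits the tree at a few of them to show how the number of leaves shrinks as the penalty grows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the spam dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate penalty values at which the optimal pruned subtree changes.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)
# Clipping guards against tiny negative values caused by floating-point error;
# np.unique also sorts the values in increasing order.
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))

# Refit with a few of the penalties and record the resulting tree size.
picked_alphas = alphas[:: max(1, len(alphas) // 5)]
leaf_counts = []
for alpha in picked_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_demo, y_demo)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"ccp_alpha={alpha:.5f}  leaves={pruned.get_n_leaves()}")
```

In practice the penalty would be chosen by validation error or cross-validation rather than by inspecting the leaf counts.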

## Comparing Different Splitting Criteria


There are multiple options for the splitting criterion. So far we have used the Gini index. Now we try the entropy criterion, $-\sum_{k=1}^{K} p_{k}(m) \log p_{k}(m)$, where $p_{k}(m)$ is the proportion of class $k$ in node $m$.
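
To get a feel for the two impurity measures, they can be evaluated on a few class-proportion vectors. This is a small sketch, not scikit-learn's internal implementation: the helper names `gini` and `entropy` are hypothetical, and the base-2 logarithm is an assumption (a different base only rescales the entropy).

```python
import numpy as np


def gini(p):
    """Gini index: sum_k p_k * (1 - p_k)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))


def entropy(p):
    """Entropy: -sum_k p_k * log2(p_k), with 0 * log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    # max(...) avoids returning a negative zero for a pure node.
    return float(max(0.0, -np.sum(nonzero * np.log2(nonzero))))


# A pure node, a mostly pure node, and a maximally mixed two-class node.
for proportions in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(
        proportions,
        "gini =", round(gini(proportions), 3),
        "entropy =", round(entropy(proportions), 3),
    )
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so they often lead to similar, though not identical, trees.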

We modify the argument ``criterion`` in ``DecisionTreeClassifier``:

- ``criterion``: {“gini”, “entropy”, “log_loss”}, default= ”gini”

Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain, see [Mathematical formulation](https://scikit-learn.org/stable/modules/tree.html#tree-mathematical-formulation).

In [None]:
tree2 = DecisionTreeClassifier(random_state=0, criterion="entropy")
tree2 = tree2.fit(X_spam_train, y_spam_train)

training_error = 1 - tree2.score(X_spam_train, y_spam_train)
test_error = 1 - tree2.score(X_spam_test, y_spam_test)
print("training error is", training_error)
print("test error is", test_error)

Using cross-entropy, we obtain a smaller test error than with the Gini index.

There are two types of misclassification: misclassifying ``Spam`` as ``Non-Spam`` and misclassifying ``Non-Spam`` as ``Spam``. The counts of the two error types can be unbalanced.

Our model tends to predict ``Spam`` as ``Non-Spam``, since ``Non-Spam`` is the dominant class in our training set. This can be a problem when we want to control a particular type of error.

We can assign class weights to offset the imbalance and reduce that type of error. In our case, since the training and test sets have the same label distribution, the overall test error might increase slightly.

To assign weights, we modify the argument ``class_weight`` in ``DecisionTreeClassifier``:

- ``class_weight``: dict, list of dict or “balanced”, default=None

Explanation of this argument: Weights associated with classes in the form {class_label: weight}. If None, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
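
As a quick check on how such weights relate to the ``"balanced"`` option, the sketch below compares the two on hypothetical labels (the array `y_demo` and the 6-to-4 class split are assumptions made for illustration). It uses `compute_class_weight` from `sklearn.utils.class_weight` and shows that weighting the minority class by the ratio of class counts matches the balanced weights up to a constant factor.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 6 non-spam (0) and 4 spam (1).
y_demo = np.array([0] * 6 + [1] * 4)

# "balanced" uses n_samples / (n_classes * count_of_class).
balanced = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

# Ratio-based weights: the majority class gets 1, the minority gets the count ratio.
ratio = np.mean(y_demo == 0) / np.mean(y_demo == 1)
manual = np.array([1.0, ratio])

print("balanced weights: ", balanced)
print("ratio weights:    ", manual)
# The two weightings agree after rescaling so that class 0 has weight 1.
print("balanced rescaled:", balanced / balanced[0])
```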




In [None]:
y_spam_train.value_counts()

In [None]:
proportion_non_spam = y_spam_train.value_counts()[0] / y_spam_train.shape[0]
proportion_spam = y_spam_train.value_counts()[1] / y_spam_train.shape[0]
proportion_non_spam / proportion_spam

In [None]:
# Model without balancing labels
tree2_no_weight = DecisionTreeClassifier(random_state=0)
tree2_no_weight.fit(X_spam_train, y_spam_train)

training_error = 1 - tree2_no_weight.score(X_spam_train, y_spam_train)
test_error = 1 - tree2_no_weight.score(X_spam_test, y_spam_test)

print("training error is", training_error)
print("test error is", test_error)

y_pred_test = tree2_no_weight.predict(X_spam_test)
cm = confusion_matrix(y_spam_test, y_pred_test)
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=tree2_no_weight.classes_)
cm_display.plot();

In [None]:
# Model with balancing labels
tree2_weight = DecisionTreeClassifier(
    random_state=0, class_weight={0: 1, 1: proportion_non_spam / proportion_spam}
)
tree2_weight.fit(X_spam_train, y_spam_train)

training_error = 1 - tree2_weight.score(X_spam_train, y_spam_train)
test_error = 1 - tree2_weight.score(X_spam_test, y_spam_test)
print("training error is", training_error)
print("test error is", test_error)

y_pred_test = tree2_weight.predict(X_spam_test)
cm = confusion_matrix(y_spam_test, y_pred_test)
cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=tree2_weight.classes_)
cm_display.plot();

## Regression Trees

In [None]:
# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y[::5, :] += 0.5 - rng.rand(20, 2)

# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_3 = DecisionTreeRegressor(max_depth=8)
regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)

# Predict
X_test = np.arange(-100.0, 100.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)

# Plot the results
plt.figure()
s = 25
plt.scatter(y[:, 0], y[:, 1], c="navy", s=s, edgecolor="black", label="data")
plt.scatter(
    y_1[:, 0],
    y_1[:, 1],
    c="cornflowerblue",
    s=s,
    edgecolor="black",
    label="max_depth=2",
)
plt.scatter(y_2[:, 0], y_2[:, 1], c="red", s=s, edgecolor="black", label="max_depth=5")
plt.scatter(y_3[:, 0], y_3[:, 1], c="orange", s=s, edgecolor="black", label="max_depth=8")
plt.xlim([-6, 6])
plt.ylim([-6, 6])
plt.xlabel("target 1")
plt.ylabel("target 2")
plt.title("Multi-output Decision Tree Regression")
plt.legend(loc="best")
plt.show()

## Random Forests

A **random forest** is an ensemble method that combines multiple CART models for classification and regression tasks. As in bagging, each CART model is fitted to a bootstrapped sample, but at each split only a random subset of $m$ of the $p$ features is considered as split candidates; typically $m \approx \sqrt{p}$ for classification and $m \approx p/3$ for regression (Section 15.3 of *Elements of Statistical Learning*). Selecting random subsets of features *decorrelates* the trees.
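
One rough way to look at decorrelation is to measure how similar the individual trees' predictions are. The sketch below does this on synthetic stand-in data (the dataset, the forest size, and the helper `mean_tree_correlation` are assumptions made for illustration). It compares $m = p$ (``max_features=None``) with $m \approx \sqrt{p}$ (``max_features="sqrt"``); one would expect the second setting to give a lower average correlation, but the exact numbers depend on the data and the random seed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the spam dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=16, n_informative=4, random_state=0
)
X_tr, X_te, y_tr = X_demo[:400], X_demo[400:], y_demo[:400]


def mean_tree_correlation(max_features):
    """Average pairwise correlation between the trees' predicted P(class 1)."""
    forest = RandomForestClassifier(
        n_estimators=30, max_features=max_features, random_state=0
    ).fit(X_tr, y_tr)
    probs = np.array([est.predict_proba(X_te)[:, 1] for est in forest.estimators_])
    corr = np.corrcoef(probs)  # one row per tree
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
    return off_diagonal.mean()


corr_bagging = mean_tree_correlation(None)  # m = p: every feature is a candidate
corr_forest = mean_tree_correlation("sqrt")  # m = sqrt(p): random feature subsets
print("mean correlation, m = p:      ", round(corr_bagging, 3))
print("mean correlation, m = sqrt(p):", round(corr_forest, 3))
```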




In this section, we will also look at the spam email classification problem using the spam dataset.

We will use ``RandomForestClassifier`` from scikit-learn to fit random forests; see the official scikit-learn documentation for its full list of arguments.

Note that we can adjust the number of predictors considered at each split by changing the ``max_features`` argument:

- ``max_features``: {“sqrt”, “log2”, None}, int or float, default=”sqrt”

The usage of other arguments, such as ``criterion``, ``ccp_alpha`` and ``class_weight``, is the same as for ``DecisionTreeClassifier``.


We can fit a bagging model with ``RandomForestClassifier``. Instead of writing a for-loop over bootstrap samples, we set the number of predictors considered at each split to the number of all predictors in our training set ($m = p$).

In [None]:
clf = RandomForestClassifier(random_state=0, max_features=X_spam_train.shape[1])
clf.fit(X_spam_train, y_spam_train)
accuracy = clf.score(X_spam_test, y_spam_test)
test_error_bagging = 1 - accuracy
test_error_bagging
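
For comparison, the bootstrap for-loop that the call above replaces could be sketched as follows. This is a simplified illustration on synthetic stand-in data (the dataset, `n_trees`, and all variable names are assumptions), and it combines the trees by majority vote, whereas ``RandomForestClassifier`` averages the trees' predicted class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the spam dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # hypothetical ensemble size; odd, so a 0/1 vote cannot tie
votes = np.zeros((n_trees, len(X_te)), dtype=int)

for b in range(n_trees):
    # Bootstrap: draw n training rows with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Every split considers all p features, which is what makes this bagging.
    bagged_tree = DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx])
    votes[b] = bagged_tree.predict(X_te)

# Majority vote over the trees (labels are 0/1, so the mean vote suffices).
y_pred_bagged = (votes.mean(axis=0) > 0.5).astype(int)
print("bagging test error:", np.mean(y_pred_bagged != y_te))
```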

Now we fit a random forest. By default, ``max_features="sqrt"``, which means that the number of features considered at each split, $m$, is the square root of the number of predictors in our training set, $\sqrt{p}$.

In [None]:
rf = RandomForestClassifier(random_state=0)
rf.fit(X_spam_train, y_spam_train)
accuracy = rf.score(X_spam_test, y_spam_test)
test_error_rf = 1 - accuracy
test_error_rf

## Feature importance based on mean decrease in impurity

Feature importances are provided by the fitted attribute ``feature_importances_``. They are computed as the mean, over the trees in the forest, of the accumulated impurity decrease within each tree; the standard deviation across trees indicates how much they vary. Note that impurity-based feature importances can be misleading for high-cardinality features (many unique values).

In [None]:
importances = rf.feature_importances_
forest_importances = pd.Series(importances, index=rf.feature_names_in_)

In [None]:
plt.figure(figsize=(10, 12))
forest_importances.sort_values(ascending=True).plot.barh()
plt.xlabel("Mean decrease in impurity")  # barh puts the values on the x-axis
plt.show()
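
The mean and spread across trees can also be computed by hand from the individual estimators. The following is a minimal sketch on synthetic stand-in data (the dataset and the names `forest`, `per_tree`, `mdi_mean`, and `mdi_std` are assumptions made for illustration). It relies on the ``estimators_`` attribute of a fitted forest, where each tree exposes its own ``feature_importances_``.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the spam dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# One row of impurity-based importances per tree in the forest.
per_tree = np.array([est.feature_importances_ for est in forest.estimators_])

mdi_mean = per_tree.mean(axis=0)  # should match forest.feature_importances_
mdi_std = per_tree.std(axis=0)  # spread across trees, usable as error bars

for j in range(X_demo.shape[1]):
    print(f"feature {j}: mean={mdi_mean[j]:.3f}  std={mdi_std[j]:.3f}")
```

The standard deviations could be passed as ``xerr`` to a horizontal bar plot to show the variability of each importance.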

## Feature importance based on feature permutation

Permutation feature importance overcomes limitations of impurity-based feature importance: it is not biased toward high-cardinality features and can be computed on a held-out test set.

In [None]:
from sklearn.inspection import permutation_importance

In [None]:
result = permutation_importance(
    rf, X_spam_test, y_spam_test, n_repeats=10, random_state=42, n_jobs=2
)

In [None]:
forest_importances = pd.Series(result.importances_mean, index=rf.feature_names_in_)
plt.figure(figsize=(10, 12))
forest_importances.sort_values(ascending=True).plot.barh()
plt.xlabel("Mean decrease in accuracy")  # barh puts the values on the x-axis
plt.show()

We observe some negative values for permutation importances. In those cases, the predictions on the shuffled (or noisy) data happened to be more accurate than those on the real data. This happens when the feature didn't matter (it should have had an importance close to 0), but random chance caused the predictions on the shuffled data to be more accurate.
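
The idea behind permutation importance can be sketched by hand: shuffle one column of the test set, re-score the model, and record the drop in accuracy. The example below is a simplified illustration on hypothetical data in which only the first column carries signal (the data, the model settings, and all variable names are assumptions); it is not the exact procedure used inside ``permutation_importance``.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical data: only column 0 determines the label; columns 1-2 are noise.
X_demo = rng.normal(size=(600, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_tr, X_te = X_demo[:400], X_demo[400:]
y_tr, y_te = y_demo[:400], y_demo[400:]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
baseline = model.score(X_te, y_te)

perm_importance = []
for j in range(X_te.shape[1]):
    drops = []
    for _ in range(10):  # repeat the shuffle to average out chance
        X_shuffled = X_te.copy()
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        drops.append(baseline - model.score(X_shuffled, y_te))
    perm_importance.append(np.mean(drops))
    print(f"feature {j}: mean drop in accuracy = {perm_importance[-1]:.3f}")
```

For the noise columns the drop fluctuates around zero, which is how small negative importances can arise.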

## Boosting

A single learned tree may be too restrictive to fully capture the patterns in the data. One way to combine multiple trees is boosting. With CART as the base learner, gradient boosting fits each new tree to the errors of the previously learned model.

$$f_m(x)=f_{m-1}(x)+\left(\underset{h_m \in H}{\operatorname{argmin}}\left[\sum_{i=1}^N L\left(y_i, f_{m-1}\left(x_i\right)+h_m\left(x_i\right)\right)\right]\right)(x)$$

The $m^{\textrm{th}}$ model is an amalgamation of weak learners. Each $h_m$ is a weak learner trained to minimize the remaining error after $f_{m-1}$ is learned.
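
For squared-error loss, the update above can be sketched by hand: each new tree is fitted to the residuals of the current model. The example below is a simplified illustration under stated assumptions, not the implementation inside ``GradientBoostingRegressor``: the data are hypothetical, the tree fitted to the residuals only approximates the argmin over $H$, and the shrinkage factor `learning_rate` is an addition that does not appear in the formula.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical 1-D regression data: a noisy sine curve.
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # shrinkage; an added assumption, not in the formula above
n_rounds = 100

# f_0: the best constant under squared loss is the mean of y.
f_current = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - f_current) ** 2)]

for m in range(n_rounds):
    residuals = y_demo - f_current  # what f_{m-1} still gets wrong
    h_m = DecisionTreeRegressor(max_depth=2, random_state=m).fit(X_demo, residuals)
    # f_m = f_{m-1} + learning_rate * h_m
    f_current = f_current + learning_rate * h_m.predict(X_demo)
    mse_history.append(np.mean((y_demo - f_current) ** 2))

print("training MSE before boosting:", round(mse_history[0], 4))
print("training MSE after boosting: ", round(mse_history[-1], 4))
```

The training error keeps shrinking as rounds are added, so the number of rounds is usually chosen with a validation set.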

In [None]:
clf = GradientBoostingClassifier(
    n_estimators=1000, random_state=0, max_features=X_spam_train.shape[1]
)
clf.fit(X_spam_train, y_spam_train)
accuracy = clf.score(X_spam_test, y_spam_test)
test_error_boosting = 1 - accuracy
test_error_boosting

In [None]:
clf = GradientBoostingClassifier(n_estimators=1000, random_state=0, max_features="sqrt")
clf.fit(X_spam_train, y_spam_train)
accuracy = clf.score(X_spam_test, y_spam_test)
test_error_boosting = 1 - accuracy
test_error_boosting

---

**Problem 1:** Motivation (free response)

In at most **two** sentences, explain whether you expect decision tree techniques to outperform our previously covered regression methods **and why.**

> BEGIN SOLUTION

Decision trees may outperform linear regression methods when the underlying relationships between features and the target are nonlinear or involve complex interactions. However, if the relationship is approximately linear, traditional regression methods may perform equally well or better due to their lower variance.
> END SOLUTION


---

**Problem 2:** Variable Setup and Selection

For this problem, we will try to predict individuals' high-density lipoprotein (HDL) cholesterol levels. Please do the following:

1. Use the following features for predictive purposes from our NHANES dataset: Gender, Age, Weight, Height, BMI, WaistSize, HouseholdSize, and Ethnicity. You may need to refer to the documentation to figure out their variable names:
   - [HDL_L](https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/HDL_L.htm)
   - [DEMO_L](https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DEMO_L.htm)
   - [BMX_L](https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/BMX_L.htm)

2. Rename the variable names to be English-readable but still in Python variable style (e.g., `RIAGENDR` becomes `Gender`).

3. Drop all rows with missing values.

Store your final cleaned DataFrame in a variable named `my_df`.

In [None]:
# BEGIN SOLUTION
bmx_df = pd.read_sas("data/NHANES/BMX_L.xpt")
demo_df = pd.read_sas("data/NHANES/DEMO_L.xpt")
hdl_df = pd.read_sas("data/NHANES/HDL_L.xpt")

# Inner join on SEQN
df = pd.merge(hdl_df, bmx_df, on="SEQN", how="inner")
df = pd.merge(df, demo_df, on="SEQN", how="inner")

# Select and rename columns
selected_columns = [
    "LBDHDD",
    "RIAGENDR",
    "RIDAGEYR",
    "BMXWT",
    "BMXHT",
    "DMDHHSIZ",
    "BMXBMI",
    "BMXWAIST",
    "RIDRETH1",
]
filtered_data = df[selected_columns].copy()
my_df = filtered_data.rename(
    columns={
        "LBDHDD": "HDL",
        "RIAGENDR": "Gender",
        "RIDAGEYR": "Age",
        "BMXWT": "Weight",
        "BMXHT": "Height",
        "BMXBMI": "BMI",
        "BMXWAIST": "WaistSize",
        "DMDHHSIZ": "HouseholdSize",
        "RIDRETH1": "Ethnicity",
    }
)

# Drop rows with missing values
my_df = my_df.dropna()
# END SOLUTION
my_df.head()

In [None]:
# Test assertions
assert "my_df" in dir(), "my_df should be defined"
assert len(my_df.columns) == 9, f"Expected 9 columns, got {len(my_df.columns)}"
assert "HDL" in my_df.columns, "HDL column should be present"
assert "Gender" in my_df.columns, "Gender column should be present"
assert my_df.isna().sum().sum() == 0, "There should be no missing values"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "Age" in my_df.columns, "Age column should be present"
assert "Weight" in my_df.columns, "Weight column should be present"
assert "Height" in my_df.columns, "Height column should be present"
assert "BMI" in my_df.columns, "BMI column should be present"
assert "WaistSize" in my_df.columns, "WaistSize column should be present"
assert "HouseholdSize" in my_df.columns, "HouseholdSize column should be present"
assert "Ethnicity" in my_df.columns, "Ethnicity column should be present"
assert len(my_df) > 0, "DataFrame should not be empty"
# END HIDDEN TESTS

---

**Problem 3:** Train, Val, Test Split

Split your data into train, validation, and test sets with a 60%/20%/20% breakdown of observations, respectively.

**Use `random_state=42` for this and all subsequent problems.**

Store your splits in variables named `X_train`, `X_validation`, `X_test`, `y_train`, `y_validation`, and `y_test`.

In [None]:
# BEGIN SOLUTION
X = my_df.drop(columns=["HDL"])
y = my_df["HDL"]

# First split: 80% train+val, 20% test
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Second split: 75% of 80% = 60% train, 25% of 80% = 20% validation
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42
)
# END SOLUTION

In [None]:
# Test assertions
total_samples = len(my_df)
train_ratio = len(X_train) / total_samples
val_ratio = len(X_validation) / total_samples
test_ratio = len(X_test) / total_samples

assert 0.58 < train_ratio < 0.62, f"Train ratio should be ~60%, got {train_ratio:.2%}"
assert 0.18 < val_ratio < 0.22, f"Validation ratio should be ~20%, got {val_ratio:.2%}"
assert 0.18 < test_ratio < 0.22, f"Test ratio should be ~20%, got {test_ratio:.2%}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    len(X_train) + len(X_validation) + len(X_test) == total_samples
), "Splits should cover all data"
assert len(y_train) == len(X_train), "X_train and y_train should have same length"
assert len(y_validation) == len(
    X_validation
), "X_validation and y_validation should have same length"
assert len(y_test) == len(X_test), "X_test and y_test should have same length"
# END HIDDEN TESTS

---

**Problem 4:** Regression Models

Train an instance of CART (`DecisionTreeRegressor`), Random Forest (`RandomForestRegressor`), and Gradient Boosting (`GradientBoostingRegressor`) to predict HDL levels. Evaluate each model's performance on the train, validation, and test sets using mean squared error (MSE).

Store your results in variables with the format `[split]_mse_[model]`. For example:
- `train_mse_cart`, `validation_mse_cart`, `test_mse_cart`
- `train_mse_rf`, `validation_mse_rf`, `test_mse_rf`
- `train_mse_boosting`, `validation_mse_boosting`, `test_mse_boosting`

Use `random_state=42` when initializing your models.

In [None]:
# BEGIN SOLUTION
# CART Model
nhanes_cart = DecisionTreeRegressor(random_state=42)
nhanes_cart.fit(X_train, y_train)

train_mse_cart = mean_squared_error(y_train, nhanes_cart.predict(X_train))
validation_mse_cart = mean_squared_error(y_validation, nhanes_cart.predict(X_validation))
test_mse_cart = mean_squared_error(y_test, nhanes_cart.predict(X_test))

# Random Forest Model
nhanes_rf = RandomForestRegressor(random_state=42)
nhanes_rf.fit(X_train, y_train)

train_mse_rf = mean_squared_error(y_train, nhanes_rf.predict(X_train))
validation_mse_rf = mean_squared_error(y_validation, nhanes_rf.predict(X_validation))
test_mse_rf = mean_squared_error(y_test, nhanes_rf.predict(X_test))

# Gradient Boosting Model
nhanes_boosting = GradientBoostingRegressor(random_state=42)
nhanes_boosting.fit(X_train, y_train)

train_mse_boosting = mean_squared_error(y_train, nhanes_boosting.predict(X_train))
validation_mse_boosting = mean_squared_error(y_validation, nhanes_boosting.predict(X_validation))
test_mse_boosting = mean_squared_error(y_test, nhanes_boosting.predict(X_test))
# END SOLUTION

print(
    f"CART - Train: {train_mse_cart:.2f}, Val: {validation_mse_cart:.2f}, Test: {test_mse_cart:.2f}"
)
print(f"RF - Train: {train_mse_rf:.2f}, Val: {validation_mse_rf:.2f}, Test: {test_mse_rf:.2f}")
print(
    f"Boosting - Train: {train_mse_boosting:.2f}, "
    f"Val: {validation_mse_boosting:.2f}, Test: {test_mse_boosting:.2f}"
)

In [None]:
# Test assertions
assert train_mse_cart >= 0, "MSE should be non-negative"
assert validation_mse_cart >= 0, "MSE should be non-negative"
assert test_mse_cart >= 0, "MSE should be non-negative"
assert train_mse_rf >= 0, "MSE should be non-negative"
assert test_mse_rf >= 0, "MSE should be non-negative"
assert train_mse_boosting >= 0, "MSE should be non-negative"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# CART should have near-zero training error (overfits)
assert train_mse_cart < 1, "CART should have very low training MSE"
# Ensemble methods should generally have lower test error than single tree
assert (
    test_mse_rf < test_mse_cart or test_mse_boosting < test_mse_cart
), "Ensemble methods should generally outperform single tree"
# END HIDDEN TESTS

---

**Problem 5:** Regression Results Display

Create a DataFrame that compares the performance of the decision tree methods. The DataFrame should have:
- **Index**: Train Error, Validation Error, Test Error
- **Columns**: CART, Random Forest, Boosting
- **Values**: The corresponding MSE values

Store the DataFrame in a variable named `results_df`.

In [None]:
# BEGIN SOLUTION
results = np.array(
    [
        [train_mse_cart, train_mse_rf, train_mse_boosting],
        [validation_mse_cart, validation_mse_rf, validation_mse_boosting],
        [test_mse_cart, test_mse_rf, test_mse_boosting],
    ]
)
results_df = pd.DataFrame(
    results,
    index=["Train Error", "Validation Error", "Test Error"],
    columns=["CART", "Random Forest", "Boosting"],
)
# END SOLUTION
results_df

In [None]:
# Test assertions
assert isinstance(results_df, pd.DataFrame), "results_df should be a DataFrame"
assert results_df.shape == (3, 3), f"Expected shape (3, 3), got {results_df.shape}"
assert list(results_df.columns) == [
    "CART",
    "Random Forest",
    "Boosting",
], "Column names are incorrect"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "Train Error" in results_df.index, "Index should contain 'Train Error'"
assert "Validation Error" in results_df.index, "Index should contain 'Validation Error'"
assert "Test Error" in results_df.index, "Index should contain 'Test Error'"
assert results_df.loc["Train Error", "CART"] == train_mse_cart, "CART train MSE should match"
# END HIDDEN TESTS

---

**Problem 6:** Switching to Classification

Previously, we treated HDL as a regression problem. However, it can be helpful to view HDL levels in categorical bins. According to the [Cleveland Clinic](https://my.clevelandclinic.org/health/articles/11920-cholesterol-numbers-what-do-they-mean), HDL levels can be categorized as:

- **Heart-Healthy (0)**: HDL >= 60 mg/dL
- **At-Risk (1)**: HDL 40-59 mg/dL for men, or HDL 50-59 mg/dL for women
- **Dangerous (2)**: HDL < 40 mg/dL for men, or HDL < 50 mg/dL for women

Create a new column called `Level` in `my_df` with these categorical labels (0, 1, or 2).

Then create new train, validation, and test splits with the same 60%/20%/20% breakdown, but this time **stratify on the `Level` values**. Use `random_state=42`.

**Note**: Gender is encoded as 1 for male and 2 for female in the NHANES dataset.

In [None]:
# BEGIN SOLUTION
def get_hdl_level(gender, hdl):
    """
    Categorize HDL level based on gender and HDL value.
    Gender: 1 = male, 2 = female
    Returns: 0 = Heart-Healthy, 1 = At-Risk, 2 = Dangerous
    """
    if hdl >= 60:
        return 0  # Heart-Healthy
    elif gender == 1:  # Male
        if hdl >= 40:
            return 1  # At-Risk
        else:
            return 2  # Dangerous
    elif hdl >= 50:
        return 1  # At-Risk
    else:
        return 2  # Dangerous


my_df["Level"] = my_df.apply(lambda row: get_hdl_level(row["Gender"], row["HDL"]), axis=1)

# Create new splits for classification, stratified by Level
X = my_df.drop(columns=["HDL", "Level"])
y = my_df["Level"]

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val
)
# END SOLUTION

In [None]:
# Test assertions
assert "Level" in my_df.columns, "Level column should be added to my_df"
assert set(my_df["Level"].unique()).issubset(
    {0, 1, 2}
), "Level should only contain values 0, 1, or 2"

# Test that Level assignments follow the correct logic based on HDL and Gender
# Heart-Healthy (0): HDL >= 60 for all genders
heart_healthy = my_df[my_df["HDL"] >= 60]
assert (heart_healthy["Level"] == 0).all(), "All HDL >= 60 should be Heart-Healthy (0)"

# Dangerous (2) for males: HDL < 40
dangerous_males = my_df[(my_df["Gender"] == 1) & (my_df["HDL"] < 40)]
assert (dangerous_males["Level"] == 2).all(), "Males with HDL < 40 should be Dangerous (2)"

# Dangerous (2) for females: HDL < 50
dangerous_females = my_df[(my_df["Gender"] == 2) & (my_df["HDL"] < 50)]
assert (dangerous_females["Level"] == 2).all(), "Females with HDL < 50 should be Dangerous (2)"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Check that stratification was applied (class proportions should be similar across splits)
train_prop = y_train.value_counts(normalize=True).sort_index()
test_prop = y_test.value_counts(normalize=True).sort_index()
for level in [0, 1, 2]:
    if level in train_prop.index and level in test_prop.index:
        assert (
            abs(train_prop[level] - test_prop[level]) < 0.05
        ), f"Stratification issue for level {level}"
# END HIDDEN TESTS

---

**Problem 7:** Classification Models

Similar to Problem 4, train CART (`DecisionTreeClassifier`), Random Forest (`RandomForestClassifier`), and Gradient Boosting (`GradientBoostingClassifier`) models for the classification task. Evaluate each model's performance using accuracy.

Store your results in variables with the format `[split]_acc_[model]`. For example:
- `train_acc_cart`, `validation_acc_cart`, `test_acc_cart`
- `train_acc_rf`, `validation_acc_rf`, `test_acc_rf`
- `train_acc_boosting`, `validation_acc_boosting`, `test_acc_boosting`

Use `random_state=42` when initializing your models.

In [None]:
# BEGIN SOLUTION
# CART Model
nhanes_cart = DecisionTreeClassifier(random_state=42)
nhanes_cart.fit(X_train, y_train)

train_acc_cart = accuracy_score(y_train, nhanes_cart.predict(X_train))
validation_acc_cart = accuracy_score(y_validation, nhanes_cart.predict(X_validation))
test_acc_cart = accuracy_score(y_test, nhanes_cart.predict(X_test))

# Random Forest Model
nhanes_rf = RandomForestClassifier(random_state=42)
nhanes_rf.fit(X_train, y_train)

train_acc_rf = accuracy_score(y_train, nhanes_rf.predict(X_train))
validation_acc_rf = accuracy_score(y_validation, nhanes_rf.predict(X_validation))
test_acc_rf = accuracy_score(y_test, nhanes_rf.predict(X_test))

# Gradient Boosting Model
nhanes_boosting = GradientBoostingClassifier(random_state=42)
nhanes_boosting.fit(X_train, y_train)

train_acc_boosting = accuracy_score(y_train, nhanes_boosting.predict(X_train))
validation_acc_boosting = accuracy_score(y_validation, nhanes_boosting.predict(X_validation))
test_acc_boosting = accuracy_score(y_test, nhanes_boosting.predict(X_test))
# END SOLUTION

print(
    f"CART - Train: {train_acc_cart:.3f}, Val: {validation_acc_cart:.3f}, Test: {test_acc_cart:.3f}"
)
print(f"RF - Train: {train_acc_rf:.3f}, Val: {validation_acc_rf:.3f}, Test: {test_acc_rf:.3f}")
print(
    f"Boosting - Train: {train_acc_boosting:.3f}, "
    f"Val: {validation_acc_boosting:.3f}, Test: {test_acc_boosting:.3f}"
)

In [None]:
# Test assertions
for acc in [
    train_acc_cart, validation_acc_cart, test_acc_cart,
    train_acc_rf, validation_acc_rf, test_acc_rf,
    train_acc_boosting, validation_acc_boosting, test_acc_boosting,
]:
    assert 0 <= acc <= 1, "Accuracy should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# CART typically overfits, so training accuracy should be very high
assert train_acc_cart > 0.9, "CART should have high training accuracy"
# At least one ensemble method should reach 90% or more of CART's test accuracy
assert (
    test_acc_rf >= test_acc_cart * 0.9 or test_acc_boosting >= test_acc_cart * 0.9
), "Ensemble methods should be competitive"
# END HIDDEN TESTS

---

**Problem 8:** Classification Results Display

Create a DataFrame that compares the classification performance of the decision tree methods. The DataFrame should have:
- **Index**: Train Accuracy, Validation Accuracy, Test Accuracy
- **Columns**: CART, Random Forest, Boosting
- **Values**: The corresponding accuracy values

Store the DataFrame in a variable named `classification_results_df`.

In [None]:
# BEGIN SOLUTION
results = np.array(
    [
        [train_acc_cart, train_acc_rf, train_acc_boosting],
        [validation_acc_cart, validation_acc_rf, validation_acc_boosting],
        [test_acc_cart, test_acc_rf, test_acc_boosting],
    ]
)
classification_results_df = pd.DataFrame(
    results,
    index=["Train Accuracy", "Validation Accuracy", "Test Accuracy"],
    columns=["CART", "Random Forest", "Boosting"],
)
# END SOLUTION
classification_results_df

In [None]:
# Test assertions
assert isinstance(
    classification_results_df, pd.DataFrame
), "classification_results_df should be a DataFrame"
assert classification_results_df.shape == (
    3,
    3,
), f"Expected shape (3, 3), got {classification_results_df.shape}"
assert list(classification_results_df.columns) == [
    "CART",
    "Random Forest",
    "Boosting",
], "Column names are incorrect"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "Train Accuracy" in classification_results_df.index, "Index should contain 'Train Accuracy'"
assert (
    "Validation Accuracy" in classification_results_df.index
), "Index should contain 'Validation Accuracy'"
assert "Test Accuracy" in classification_results_df.index, "Index should contain 'Test Accuracy'"
assert (
    classification_results_df.loc["Train Accuracy", "CART"] == train_acc_cart
), "CART train accuracy should match"
# END HIDDEN TESTS

---

**Problem 9a:** Error Costs (free response)

In 1-2 sentences, identify which misclassification error is the worst one to make. Your answer should follow this format: "Predicting [THIS CLASS] when the ground truth is [THIS OTHER CLASS] is the worst because [REASON]."

Consider the health implications of each type of error.

> BEGIN SOLUTION

Predicting "Heart-Healthy" (0) when the ground truth is "Dangerous" (2) is the worst error because it would lead to a false sense of security for patients who actually have critically low HDL levels and need immediate medical intervention to prevent heart disease.
> END SOLUTION


---

**Problem 9b:** Error Analysis

Now that you have identified which error is the most costly, analyze the confusion matrices for each model to determine which model best avoids that specific error.

The code below generates a test-set confusion matrix for each model. After examining them, rank the models from **lowest to highest** number of occurrences of the worst error you identified.

In [None]:
# Generate confusion matrices for analysis
cm_cart = confusion_matrix(y_test, nhanes_cart.predict(X_test))
cm_cart_display = ConfusionMatrixDisplay(
    confusion_matrix=cm_cart, display_labels=nhanes_cart.classes_
)

cm_rf = confusion_matrix(y_test, nhanes_rf.predict(X_test))
cm_rf_display = ConfusionMatrixDisplay(confusion_matrix=cm_rf, display_labels=nhanes_rf.classes_)

cm_boosting = confusion_matrix(y_test, nhanes_boosting.predict(X_test))
cm_boosting_display = ConfusionMatrixDisplay(
    confusion_matrix=cm_boosting, display_labels=nhanes_boosting.classes_
)

# Plot confusion matrices
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

class_names = ["Heart-Healthy", "At-Risk", "Dangerous"]
displays = [
    (cm_cart_display, "CART"),
    (cm_rf_display, "Random Forest"),
    (cm_boosting_display, "Boosting"),
]
for ax, (display, name) in zip(axes, displays):
    display.plot(ax=ax)
    ax.set_title(f"{name} Confusion Matrix")
    # Replace the numeric class labels (0, 1, 2) with readable names
    ax.set_xticklabels(class_names)
    ax.set_yticklabels(class_names)

plt.tight_layout()
plt.show()

> BEGIN SOLUTION

Based on the confusion matrices, the ranking of models from lowest to highest count of the worst error, predicting "Heart-Healthy" when the truth is "Dangerous" (row 2, column 0 of each matrix, since rows are true labels and columns are predicted labels), is:

1. Random Forest (lowest misclassification)
2. Boosting
3. CART (highest misclassification)

Note: The exact ranking may vary depending on the random state and data split.
> END SOLUTION
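
The same comparison can be made programmatically by reading one cell from each confusion matrix. The sketch below uses made-up labels and hypothetical model names (`model_a`, `model_b`, `model_c`), not the lab's fitted models, and assumes the class coding 0 = Heart-Healthy, 1 = At-Risk, 2 = Dangerous.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: 0 = Heart-Healthy, 1 = At-Risk, 2 = Dangerous
toy_truth = np.array([0, 0, 1, 1, 2, 2, 2, 2])
toy_predictions = {
    "model_a": np.array([0, 0, 1, 0, 0, 0, 2, 2]),
    "model_b": np.array([0, 1, 1, 1, 0, 2, 2, 2]),
    "model_c": np.array([0, 0, 1, 1, 2, 2, 1, 2]),
}

worst_error_counts = {}
for name, toy_pred in toy_predictions.items():
    cm = confusion_matrix(toy_truth, toy_pred, labels=[0, 1, 2])
    # Rows are true classes and columns are predicted classes,
    # so cm[2, 0] counts "Dangerous" cases predicted as "Heart-Healthy"
    worst_error_counts[name] = int(cm[2, 0])

# Sort model names from fewest to most worst-case errors
ranking = sorted(worst_error_counts, key=worst_error_counts.get)
print(worst_error_counts)  # {'model_a': 2, 'model_b': 1, 'model_c': 0}
print(ranking)  # ['model_c', 'model_b', 'model_a']
```

Passing `labels=[0, 1, 2]` fixes the row and column order, so the indexing stays valid even if a class happens to be missing from a particular set of predictions.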


---

**Problem 10:** Maximizing Performance

Decision trees work well for data with nonlinear relationships, and their performance depends heavily on hyperparameter choices. Your task is to achieve a test classification accuracy of **57% or higher**.

You may use any decision tree ensemble method covered in this lab. Hyperparameters you can tune include:
- Learning rate (`learning_rate`, gradient boosting only)
- Number of estimators (trees) (`n_estimators`)
- Tree depth (`max_depth`)
- Minimum samples in each leaf (`min_samples_leaf`)
- Maximum features considered at each split (`max_features`)
- Minimum impurity decrease (`min_impurity_decrease`)

Store your final model in a variable named `final_model`.

**Important**: Do not include HDL as a feature, as it would make the problem trivial.

**Hint**: See the [GradientBoostingClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) for available hyperparameters.
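
One way to choose among hyperparameter settings is to compare candidates on a validation set and keep the test set for the final check. The sketch below is a minimal illustration on synthetic data from `make_classification`; the grid values and the names `X_toy`, `candidate`, and `best_params` are assumptions for the example, not recommended settings for the NHANES task.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the lab's features (assumption)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_toy_train, X_toy_val, y_toy_train, y_toy_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

best_score, best_params = -1.0, None
# Small illustrative grid over two hyperparameters
for learning_rate, max_depth in product([0.05, 0.1], [2, 3]):
    candidate = GradientBoostingClassifier(
        n_estimators=50,
        learning_rate=learning_rate,
        max_depth=max_depth,
        random_state=42,
    )
    candidate.fit(X_toy_train, y_toy_train)
    score = candidate.score(X_toy_val, y_toy_val)  # validation accuracy
    if score > best_score:
        best_score = score
        best_params = {"learning_rate": learning_rate, "max_depth": max_depth}

print(best_params, round(best_score, 3))
```

Selecting on validation accuracy rather than test accuracy keeps the reported test score an honest estimate of performance on unseen data.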

In [None]:
# BEGIN SOLUTION
# Tuned Gradient Boosting model for improved accuracy
final_model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=5,
    min_samples_leaf=10,
    max_features="sqrt",
    random_state=42,
)
final_model.fit(X_train, y_train)
# END SOLUTION

final_accuracy = final_model.score(X_test, y_test)
print(f"Test Accuracy: {final_accuracy:.4f}")

In [None]:
# Test assertions
assert "final_model" in dir(), "final_model should be defined"
final_accuracy = final_model.score(X_test, y_test)
assert final_accuracy >= 0.57, f"Test accuracy should be >= 57%, got {final_accuracy:.2%}"
print(f"Congratulations! Your model achieved {final_accuracy:.2%} accuracy.")
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Ensure HDL is not used as a feature
assert "HDL" not in X_test.columns, "HDL should not be used as a feature"
# Check that the model is a valid classifier
assert hasattr(final_model, "predict"), "Model should have a predict method"
assert hasattr(final_model, "score"), "Model should have a score method"
# END HIDDEN TESTS