# DATASCI 503, Homework 7: Ensemble Methods and Decision Trees

This assignment covers classification trees, random forests, and gradient boosting classifiers.

---

**Problem 1:** Majority Vote vs Average Probability

Suppose we produce ten bootstrapped samples from a dataset containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X): 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.

There are two common ways to combine these results into a single class prediction. One is the majority vote approach discussed in this chapter. The second is to classify based on the average probability.

**Question:** What is the final classification under each of these two approaches?

> BEGIN SOLUTION

**Majority Vote Approach:**

For the majority vote, we count how many estimates predict Red (probability > 0.5). The estimates above 0.5 are: 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75. That gives us 6 out of 10 votes for Red. Since 6/10 > 0.5, the majority decision is **Red**.

**Average Probability Approach:**

Taking the average of all probabilities:

$$\frac{0.1 + 0.15 + 0.2 + 0.2 + 0.55 + 0.6 + 0.6 + 0.65 + 0.7 + 0.75}{10} = \frac{4.5}{10} = 0.45$$

Since 0.45 < 0.5, the average probability approach predicts **Green**.

**Conclusion:** The two methods give different predictions. Majority vote predicts Red, while average probability predicts Green. The discrepancy arises because majority vote ignores how confident each tree is. The four low estimates (0.1, 0.15, 0.2, 0.2) lie 0.3 to 0.4 below the 0.5 threshold, whereas the six Red votes exceed it by at most 0.25, so the low estimates pull the average below 0.5 even though most of the individual trees predict Red.
> END SOLUTION
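
As a quick check of the arithmetic above, the following minimal sketch applies both rules to the ten estimates from the problem statement. The variable names are illustrative, and the 0.5 cutoff is the same threshold used in the solution.

```python
import numpy as np

# The ten bootstrap estimates of P(Class is Red | X) from the problem statement
probs = np.array([0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75])

# Majority vote: each tree votes Red when its own estimate exceeds 0.5
red_votes = int((probs > 0.5).sum())
majority_label = "Red" if red_votes > len(probs) / 2 else "Green"

# Average probability: threshold the mean of the estimates instead
mean_prob = probs.mean()
average_label = "Red" if mean_prob > 0.5 else "Green"

print(f"Red votes: {red_votes}/{len(probs)} -> {majority_label}")
print(f"Mean probability: {mean_prob:.2f} -> {average_label}")
```

The two labels disagree (6 of 10 votes for Red, mean probability 0.45 for Green), which matches the hand calculation.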


---

**Problem 2:** Gradient Boosting with Decision Stumps

Consider using gradient boosting to solve a regression problem. Assume that at each iteration, we fit the residuals using a "decision stump": a decision tree with exactly two leaf nodes. In this case, the final estimate of the regression function can be expressed in the form:

$$\hat{f}(X) = \sum_{j=1}^{p} \hat{f}_j(X_j)$$

**Question:** Explain why this is the case.

> BEGIN SOLUTION

In gradient boosting, $\hat{f}$ is constructed as a sum of weak learners:

$$\hat{f}(X) = \lambda \sum_{m=1}^{M} h_m(X)$$

where each $h_m$ is a decision stump and $\lambda$ is the learning rate.

A decision stump can only make one split, so each $h_m$ effectively chooses some feature $k \in \{1, \ldots, p\}$ and a threshold $c$, then predicts:

$$h_m(X) = \begin{cases}
a_1 & \text{if } X_k < c \\
a_2 & \text{if } X_k \geq c
\end{cases}$$

where $a_1$ and $a_2$ are the predicted values (typically the mean residual in each leaf).

Crucially, each stump depends on only **one** feature. We can regroup the stumps by which feature they split on. For each feature $j$, let $\mathcal{S}_j$ be the set of stumps that split on feature $X_j$. Define:

$$\hat{f}_j(X_j) = \lambda \sum_{m \in \mathcal{S}_j} h_m(X)$$

Since each $h_m$ in $\mathcal{S}_j$ depends only on $X_j$, the function $\hat{f}_j$ depends only on $X_j$.

If some feature $X_j$ is never split on by any stump, we simply set $\hat{f}_j(X_j) = 0$.

This decomposition allows us to write:

$$\hat{f}(X) = \sum_{j=1}^{p} \hat{f}_j(X_j)$$

This is an **additive model**, where the contribution of each feature is separable. This property makes gradient boosting with stumps particularly interpretable.
> END SOLUTION
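
The regrouping argument can be checked numerically. The sketch below is one possible illustration, not part of the assignment: it fits scikit-learn's `GradientBoostingRegressor` with `max_depth=1` (so every weak learner is a stump) on synthetic data, sums the stumps by split feature, and confirms that the full prediction equals the sum of the per-feature functions up to a constant. The data, the names `f_hat` and `remainder`, and the hyperparameter values are assumptions made for the example. Note that scikit-learn starts boosting from a constant initial prediction rather than from zero, which is why a constant remainder is expected.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (an assumption for illustration): 3 features, additive signal plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

# max_depth=1 makes every weak learner a decision stump
gbr = GradientBoostingRegressor(
    n_estimators=200, max_depth=1, learning_rate=0.1, random_state=0
)
gbr.fit(X, y)

# Regroup the stumps by the feature used at the root split
p = X.shape[1]
f_hat = np.zeros((X.shape[0], p))  # column j holds f_j(X_j) at the training points
for stump in gbr.estimators_[:, 0]:
    k = stump.tree_.feature[0]  # index of the split feature; negative if no split was made
    if k >= 0:
        f_hat[:, k] += gbr.learning_rate * stump.predict(X)

# Whatever is left after summing the f_j should be the same constant in every row
remainder = gbr.predict(X) - f_hat.sum(axis=1)
is_additive = np.allclose(remainder, remainder[0])
print(f"Remainder is constant across rows: {is_additive}")
```

Because each column of `f_hat` is built only from stumps that split on that feature, column $j$ depends on $X_j$ alone, which is the additive structure derived above.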


---

**Problem 3:** Classification Tree Sketches

This problem contains hand-drawn sketches illustrating decision tree concepts.

![Picture_A.jpg](attachment:Picture_A.jpg)

![Picture_B.jpg](attachment:Picture_B.jpg)

> BEGIN SOLUTION

The sketches above illustrate the partitioning of feature space by decision trees. These visualizations demonstrate how recursive binary splitting creates rectangular decision regions in 2D feature space.
> END SOLUTION


## Crab Species Classification

The following problems use the crabs dataset, which contains five size-related measurements of two different species of crabs (blue and orange). There are 50 male and 50 female crabs of each species. We will classify species based on the predictor variables and evaluate errors using the misclassification rate.

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")

---

**Problem 4:** Train-Test Split

Load the crabs dataset and perform a train-test split with the following specifications:
- Random state: 6789
- Training set size: 80% of the dataset
- Stratify the split according to both species and sex

This stratification ensures that approximately the same proportions of species/sex combinations appear in both the training and test datasets.

Store the results in `X_train`, `X_test`, `y_train`, and `y_test`. The features should include the five numerical measurements and sex (encoded as 0 for Female, 1 for Male). The target should be the species column.
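
To see what stratifying on two columns achieves, here is a small sketch on a hypothetical stand-in table with the same group structure as the crabs data (50 rows per species/sex combination). It builds a single combined label and stratifies on that, which is one common way to express joint stratification; the names `toy`, `strata`, `toy_train`, and `toy_test` are made up for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the crabs table: 50 rows per species/sex combination
toy = pd.DataFrame(
    {
        "sp": ["B"] * 100 + ["O"] * 100,
        "sex": (["M"] * 50 + ["F"] * 50) * 2,
        "value": range(200),
    }
)

# Combine both columns into one label so each species/sex pair is its own stratum
strata = toy["sp"] + "_" + toy["sex"]

toy_train, toy_test = train_test_split(
    toy, train_size=0.8, random_state=0, stratify=strata
)

# Count how often each of the four combinations appears in each split
train_counts = toy_train.groupby(["sp", "sex"]).size()
test_counts = toy_test.groupby(["sp", "sex"]).size()
print(train_counts)
print(test_counts)
```

With four equally sized groups and an 80/20 split, each combination contributes 40 rows to the training set and 10 to the test set.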

In [None]:
# BEGIN SOLUTION
# Load the crabs dataset
crabs = pd.read_csv("data/crabs.csv", index_col=[0])

# Encode sex as numerical: Male=1, Female=0
sex_mapping = {"M": 1, "F": 0}
crabs["sex"] = crabs["sex"].map(sex_mapping)

# Perform stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    crabs.drop(["sp", "index"], axis=1),
    crabs[["sp"]],
    train_size=0.8,
    random_state=6789,
    stratify=crabs[["sex", "sp"]],
)
# END SOLUTION

In [None]:
# Test assertions
assert X_train.shape == (160, 6), f"X_train should have shape (160, 6), got {X_train.shape}"
assert X_test.shape == (40, 6), f"X_test should have shape (40, 6), got {X_test.shape}"
assert y_train.shape == (160, 1), f"y_train should have shape (160, 1), got {y_train.shape}"
assert y_test.shape == (40, 1), f"y_test should have shape (40, 1), got {y_test.shape}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "sex" in X_train.columns, "X_train should include 'sex' column"
assert set(X_train["sex"].unique()) == {0, 1}, "Sex should be encoded as 0 and 1"
assert set(y_train["sp"].unique()) == {"B", "O"}, "Species should be 'B' and 'O'"
# Check stratification: roughly equal proportions in train and test
train_ratio = y_train["sp"].value_counts(normalize=True)
test_ratio = y_test["sp"].value_counts(normalize=True)
assert abs(train_ratio["B"] - test_ratio["B"]) < 0.1, "Stratification failed"
# END HIDDEN TESTS

---

**Problem 5:** Decision Tree Classifier

Train a classification tree to predict species from the five numerical measurements and sex. 

**(a)** Use cross-validation (5 folds) to select the optimal `max_leaf_nodes` from the values {2, 3, ..., 10}.
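
For intuition about what the grid search produces, the sketch below runs the same kind of search on synthetic data and tabulates the cross-validated accuracy for each candidate value. The data from `make_classification` and the names `X_syn`, `y_syn`, `demo_grid`, and `cv_table` are assumptions for illustration only; the resulting numbers say nothing about the crabs data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not the crabs measurements)
X_syn, y_syn = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

demo_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_leaf_nodes": list(range(2, 11))},
    cv=5,
)
demo_grid.fit(X_syn, y_syn)

# One row per candidate value, with the mean and spread of the 5 fold accuracies
cv_table = pd.DataFrame(demo_grid.cv_results_)[
    ["param_max_leaf_nodes", "mean_test_score", "std_test_score"]
]
print(cv_table)
print(f"Selected max_leaf_nodes: {demo_grid.best_params_['max_leaf_nodes']}")
```

Looking at `std_test_score` alongside the mean is useful on small datasets, where several candidate values may be indistinguishable within fold-to-fold noise.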

In [None]:
# BEGIN SOLUTION
# Use GridSearchCV to find optimal max_leaf_nodes
grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_leaf_nodes": [2, 3, 4, 5, 6, 7, 8, 9, 10]},
    cv=5,
    verbose=0,
    return_train_score=True,
)

grid.fit(X_train, y_train)
print(f"Best parameters: {grid.best_params_}")
tree_clf = grid.best_estimator_
# END SOLUTION

In [None]:
# Test assertions
assert hasattr(tree_clf, "predict"), "tree_clf should be a fitted classifier"
assert tree_clf.max_leaf_nodes is not None, "tree_clf should have max_leaf_nodes set"
assert 2 <= tree_clf.max_leaf_nodes <= 10, "max_leaf_nodes should be between 2 and 10"
print("All tests passed!")

# BEGIN HIDDEN TESTS
train_acc = tree_clf.score(X_train, y_train)
test_acc = tree_clf.score(X_test, y_test)
assert train_acc > 0.8, "Training accuracy should be above 80%"
assert test_acc > 0.7, "Test accuracy should be above 70%"
# END HIDDEN TESTS

**(b)** Plot the tree and comment on which variables are used.

**(c)** Compute and report training and test errors.

Store the best tree in a variable called `tree_clf`.

In [None]:
# Plot the tree and compute errors
train_accuracy = tree_clf.score(X_train, y_train)
test_accuracy = tree_clf.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy}")
print(f"Testing Accuracy: {test_accuracy}")
print(f"Training Error: {round(1 - train_accuracy, 4)}")
print(f"Testing Error: {round(1 - test_accuracy, 4)}")

plt.figure(figsize=(16, 10))
plot_tree(tree_clf, feature_names=X_train.columns, filled=True)
plt.title("Decision Tree for Crab Species Classification")
plt.show()

> BEGIN SOLUTION

**Analysis of Variables Used by the Tree:**

The tree primarily uses BD (body depth), CW (carapace width), and FL (frontal lobe size). FL appears to be used most frequently for splitting. Notably, 'sex' is not a helpful predictor for species classification. RW (rear width) and CL (carapace length) are not used, likely because they are highly correlated with CW and FL, so the tree obtains similar information from the variables it does use.
> END SOLUTION
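
Reading the variables off a plotted tree works for small trees, but the same information can be extracted programmatically. The sketch below shows one way to do this with the fitted tree's `tree_.feature` array; it uses the bundled iris data purely as a stand-in, and `demo_tree`, `used_names`, and `unused_names` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Iris is used only as a small bundled stand-in for the crabs data
iris = load_iris()
demo_tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
demo_tree.fit(iris.data, iris.target)

# tree_.feature stores the split feature of each node; leaves are marked with a negative value
split_features = demo_tree.tree_.feature
used_idx = np.unique(split_features[split_features >= 0])
used_names = [iris.feature_names[i] for i in used_idx]
print(f"Features used for splits: {used_names}")

# Features that never appear in a split have zero impurity-based importance
unused_names = [
    name
    for name, imp in zip(iris.feature_names, demo_tree.feature_importances_)
    if imp == 0
]
print(f"Features with zero importance: {unused_names}")
```

Applied to `tree_clf` with `X_train.columns` in place of `iris.feature_names`, the same pattern would list which crab measurements the selected tree actually splits on.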


---

**Problem 6:** Random Forest Classifier

Train a random forest with the following specifications:
- Use m=5 randomly selected predictors for each split (`max_features=5`)
- Use 1000 trees (`n_estimators=1000`)

**(a)** Make a variable importance plot.

**(b)** Compare the variable importance with your results from the single decision tree.

**(c)** Compute training and test errors.

Store the random forest classifier in a variable called `rf_clf`.

In [None]:
# BEGIN SOLUTION
# Train random forest classifier
rf_clf = RandomForestClassifier(n_estimators=1000, max_features=5, random_state=42)
rf_clf.fit(X_train, y_train.values.ravel())

# Score once and reuse the results (scoring 1000 trees repeatedly is wasteful)
rf_train_accuracy = rf_clf.score(X_train, y_train)
rf_test_accuracy = rf_clf.score(X_test, y_test)

print(f"Training Accuracy: {rf_train_accuracy}")
print(f"Testing Accuracy: {rf_test_accuracy}")
print(f"Training Error: {round(1 - rf_train_accuracy, 4)}")
print(f"Testing Error: {round(1 - rf_test_accuracy, 4)}")

# Create variable importance plot
importance_order = np.argsort(rf_clf.feature_importances_)
plt.figure(figsize=(10, 6))
plt.barh(
    X_train.columns[importance_order],
    width=rf_clf.feature_importances_[importance_order],
    color=["#F9EAF9", "#E1C2E1", "#C8AAC8", "#AB90AB", "#917A91", "#6C596C"],
    edgecolor="#413E41",
)
plt.title("Random Forest Feature Importance")
plt.xlabel("Mean Decrease in Impurity (normalized)")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert hasattr(rf_clf, "predict"), "rf_clf should be a fitted classifier"
assert rf_clf.n_estimators == 1000, "Random forest should have 1000 trees"
assert rf_clf.max_features == 5, "max_features should be 5"
print("All tests passed!")

# BEGIN HIDDEN TESTS
rf_train_acc = rf_clf.score(X_train, y_train)
rf_test_acc = rf_clf.score(X_test, y_test)
assert rf_train_acc > 0.9, "RF training accuracy should be above 90%"
assert len(rf_clf.feature_importances_) == 6, "Should have importance for all 6 features"
# END HIDDEN TESTS

> BEGIN SOLUTION

**Comparison with Single Decision Tree:**

The results differ from the single decision tree. In the decision tree, FL (frontal lobe size) was used frequently, but in the random forest importance ranking, it may be further down the list. BD (body depth) and CW (carapace width) remain important predictors. CL (carapace length) gains more usage in the random forest. With random subsets of features at each split, CL and RW become more useful when CW is not in the subset, due to the correlation between these measurements. This demonstrates how random forests can reveal the importance of correlated predictors that might be masked in a single tree.
> END SOLUTION
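
The masking effect described above can be reproduced on synthetic data. The following sketch is an illustration under stated assumptions, not a reanalysis of the crabs data: `x2` is constructed as a near copy of `x1`, `x3` is pure noise, and the forest uses `max_features=1` (rather than the `max_features=5` of this problem) to exaggerate the effect of random feature subsets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): x2 is a near copy of x1, x3 is pure noise
rng = np.random.default_rng(0)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X_corr = np.column_stack([x1, x2, x3])
y_corr = (x1 > 0).astype(int)

# A single tree sees all features at every split
single_tree = DecisionTreeClassifier(random_state=0).fit(X_corr, y_corr)

# The forest considers one randomly chosen feature per split (max_features=1)
forest = RandomForestClassifier(
    n_estimators=200, max_features=1, random_state=0
).fit(X_corr, y_corr)

for name, model in [("single tree", single_tree), ("random forest", forest)]:
    shares = ", ".join(
        f"{col}={imp:.2f}"
        for col, imp in zip(["x1", "x2", "x3"], model.feature_importances_)
    )
    print(f"{name}: {shares}")
```

In the single tree, a correlated copy has little chance to contribute once the better predictor is available at every split. In the forest, `x2` is sometimes the only candidate at a node, so it picks up a share of the importance that the single tree would attribute to `x1`.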


---

**Problem 7:** Gradient Boosting Classifier

Fit a `HistGradientBoostingClassifier` to the data. Store the histogram gradient boosting classifier (with `max_iter=1000`) in a variable called `hgb_clf`.

**(a)** Plot the training and test errors as a function of the number of trees M, for M from 1 to 1000.

**Hint:** Fit the classifier once with `max_iter=1000`, then use `staged_predict` to calculate the error at each iteration.

In [None]:
# BEGIN SOLUTION
# Fit HistGradientBoostingClassifier
hgb_clf = HistGradientBoostingClassifier(max_iter=1000, random_state=42)
hgb_clf.fit(X_train, y_train.values.ravel())

# Calculate errors at each iteration using staged_predict
train_scores = [accuracy_score(y_train, output) for output in hgb_clf.staged_predict(X_train)]
test_scores = [accuracy_score(y_test, output) for output in hgb_clf.staged_predict(X_test)]

train_errors = 1 - np.array(train_scores)
test_errors = 1 - np.array(test_scores)
# END SOLUTION

In [None]:
# Test assertions
assert hasattr(hgb_clf, "predict"), "hgb_clf should be a fitted classifier"
assert hgb_clf.max_iter == 1000, "max_iter should be 1000"
assert len(train_scores) == 1000, "Should have 1000 training scores"
assert len(test_scores) == 1000, "Should have 1000 test scores"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert all(0 <= s <= 1 for s in train_scores), "All train scores should be between 0 and 1"
assert all(0 <= s <= 1 for s in test_scores), "All test scores should be between 0 and 1"
# Check that early iterations have worse performance than later ones (generally)
assert train_scores[0] < train_scores[-1], "Training should improve over iterations"
# END HIDDEN TESTS

In [None]:
# Plot error curves
plt.figure(figsize=(10, 6))
plt.title("Error Curves for Histogram Gradient Boosting Classifier")
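# Use M = 1, ..., 1000 on the x-axis so that x=40 corresponds to M=40 trees
n_trees = np.arange(1, len(train_errors) + 1)
plt.plot(n_trees, train_errors, color="#900F30", label="Training Error")
plt.plot(n_trees, test_errors, color="#571C3D", label="Testing Error")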
plt.xlabel("Number of Iterations (Number of Boosting Trees)")
plt.ylabel("Classification Error")
plt.axvline(x=40, color="grey", linestyle="--", label="Chosen M=40")
plt.legend()
plt.tight_layout()
plt.show()

**(b)** Choose an optimal value of M and justify your choice.

**(c)** Report training and test errors for your chosen M.

In [None]:
# Fit with optimal M and report errors
optimal_m = 40  # SOLUTION
hgb_optimal = HistGradientBoostingClassifier(max_iter=optimal_m, random_state=42)
hgb_optimal.fit(X_train, y_train.values.ravel())

print(f"Training Accuracy: {hgb_optimal.score(X_train, y_train)}")
print(f"Testing Accuracy: {hgb_optimal.score(X_test, y_test)}")
print(f"Training Error: {round(1 - hgb_optimal.score(X_train, y_train), 4)}")
print(f"Testing Error: {round(1 - hgb_optimal.score(X_test, y_test), 4)}")

> BEGIN SOLUTION

**Justification for M=40:**

I chose M=40 because, according to the error plot, there is no significant improvement in test error beyond this point. In fact, continuing to add more trees may lead to overfitting (as evidenced by the training error continuing to decrease while test error plateaus or increases). Using M=40 reduces model complexity and improves computational efficiency without sacrificing predictive performance. With this setting, we obtain a test error of approximately 0.075 and a training error of approximately 0.031.
> END SOLUTION


---

**Problem 8:** Method Comparison

Comment on which method appears to perform best for this dataset and whether the results (training and test errors) are consistent across methods.

> BEGIN SOLUTION

**Comparison of Methods:**

| Method | Training Error | Test Error |
|--------|----------------|------------|
| Decision Tree | ~0.06 | ~0.10 |
| Random Forest | ~0.00 | ~0.15 |
| Gradient Boosting (M=40) | ~0.03 | ~0.075 |

**Best Performer:** Gradient Boosting appears to perform best on this dataset with a test error of approximately 0.075 (92.5% accuracy).

**Consistency:** All methods achieve reasonable performance, but there are notable differences:
- The single decision tree achieves ~90% test accuracy, which is quite good for an interpretable model.
- The random forest achieves perfect training accuracy but the lowest test accuracy of the three (~85%); this gap suggests overfitting.
- Gradient boosting achieves the best balance between training and test performance.

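**Conclusion:** For this dataset, gradient boosting provides the best generalization. The random forest, however, has a higher test error (~0.15) than the single decision tree (~0.10), so ensembling does not uniformly help here. The dataset is small (200 samples, 40 of them in the test set), so each misclassified crab changes the test error by 0.025. The differences between methods therefore rest on a handful of observations, and the small sample may limit the ability of more complex models to show their full advantage.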
> END SOLUTION
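
To put the test errors in the table into perspective, the sketch below converts each approximate rate into the number of misclassified test crabs it implies and attaches a rough binomial standard error. The rates are the rounded values from the table above, `n_test = 40` follows from the 80/20 split of 200 crabs, and the standard error formula is a simple approximation that treats test observations as independent.

```python
import math

n_test = 40  # size of the test split (20% of 200 crabs)

# Approximate test errors copied from the comparison table
reported_test_error = {
    "Decision Tree": 0.10,
    "Random Forest": 0.15,
    "Gradient Boosting (M=40)": 0.075,
}

n_wrong_by_method = {}
for method, err in reported_test_error.items():
    # Number of misclassified test crabs implied by the error rate
    n_wrong = round(err * n_test)
    n_wrong_by_method[method] = n_wrong
    # Rough binomial standard error of an error rate estimated from n_test samples
    std_err = math.sqrt(err * (1 - err) / n_test)
    print(f"{method}: {n_wrong} of {n_test} misclassified, error {err:.3f} +/- {std_err:.3f}")
```

The three methods differ by only one to three test observations, and the gaps between their error rates are of the same order as a single standard error, so the ranking should be read as suggestive rather than conclusive.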
