# Chapter 3.5-3.6: Decision Trees and Random Forests

Goal: Visualize how decision trees make splits, observe overfitting, and see how random forests improve stability.

### Topics:
- How decision trees split on features (Gini impurity)
- Controlling tree depth to prevent overfitting
- Random forests: bagging + random feature subsets
- Feature importances

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

## Quick Recap

- **Decision trees** split data on features that best separate the classes (using Gini impurity)
- Deeper trees fit training data better but are prone to **overfitting**
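- `max_depth` caps how many levels of splits the tree can make; a smaller value gives a simpler model
- **Random forests** = many decision trees, each trained on a bootstrap sample of the rows, with a random subset of features considered at each split
- Because the forest averages the votes of many trees, its predictions are less sensitive to small changes in the training data than a single tree's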
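
To make the Gini impurity mentioned above concrete, here is a minimal sketch in plain Python. The helper names `gini` and `weighted_gini` are made up for this illustration and are not scikit-learn functions; the formula used is the standard $1 - \sum_k p_k^2$, where $p_k$ is the fraction of samples in class $k$.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                          # 50/50 mix of two classes
print(weighted_gini([0, 0, 0], [1, 1, 1]))   # perfect split: both children pure
print(weighted_gini([0, 0, 1], [0, 1, 1]))   # weak split: children still mixed
```

The parent node has impurity 0.5, the perfect split drops it to 0.0, and the weak split only lowers it to about 0.444. A tree prefers the split with the lowest weighted impurity.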

## Data

We'll use the **California Housing** dataset, converted to a binary classification problem: is a block group's median house value above the overall median (1) or not (0)?

In [None]:
# Load California Housing and convert to binary classification
housing = fetch_california_housing(as_frame=True)
df = housing.frame

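# Create binary target: 1 if strictly above the median, 0 otherwise
median_value = df['MedHouseVal'].median()
df['high_value'] = (df['MedHouseVal'] > median_value).astype(int)

# MedHouseVal is in units of $100,000, so scale before printing
print(f"Median house value: ${median_value * 100000:,.0f}")
# Check class balance (roughly 50/50, since we split at the median)
df['high_value'].value_counts()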

## Practice

### 1. Use AI — Prepare features and train/test split

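Select features (`MedInc`, `HouseAge`, `AveRooms`, `AveOccup`, `Latitude`, `Longitude`), define the target as `high_value`, and split into train/test (80/20). Leave `MedHouseVal` out of the features: `high_value` is derived from it, so including it would leak the answer.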
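
For reference, here is a minimal sketch of an 80/20 split with `train_test_split`. It uses synthetic stand-in data from `make_classification` (an assumption for this example, not the housing data), so the shapes are the only thing to read from it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the housing data): 1000 rows, 6 features
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With the housing DataFrame, `X` would be the selected feature columns and `y` the `high_value` column.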

In [None]:
# Step 1: Select features and target


# Step 2: Train/test split (80/20, random_state=42)


### 2. Use AI — Fit and visualize a shallow decision tree

Fit a `DecisionTreeClassifier(max_depth=3, random_state=42)` and visualize it with `plot_tree()`. Use `filled=True` and pass the feature names.
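
If the plot is hard to read, a text rendering can help when tracing paths. The sketch below fits a depth-3 tree on synthetic stand-in data and prints it with `export_text` from `sklearn.tree`. The data and the feature names `f0` to `f3` are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (NOT the housing data)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names for this example

# max_depth=3 allows at most 3 levels of splits, so at most 8 leaves
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Each indented line is one split; "class:" lines are the leaf predictions
print(export_text(tree, feature_names=feature_names))
print("Depth:", tree.get_depth(), "Leaves:", tree.get_n_leaves())
```

`plot_tree(tree, feature_names=feature_names, filled=True)` draws the same structure as a figure.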

In [None]:
# Step 1: Fit DecisionTreeClassifier with max_depth=3


# Step 2: Visualize with plot_tree (use figsize=(20, 10) for readability)


### 3. Interpretation — Reading the tree

Look at the tree visualization:
- What is the **first feature** the tree splits on? Why do you think it chose that feature?
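- Trace the path for a block group with `MedInc` = 3.5 (about $35,000; the column is in units of $10,000) and `HouseAge` = 20. What does the tree predict? If the path reaches a split on a feature not given here, note which value you would need.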

**Your answer:**

(Write your answer here)

### 4. Use AI — Compare trees with different depths

Fit decision trees with `max_depth` = 2, 5, 10, and `None` (unlimited). Print the **train accuracy** and **test accuracy** for each. Which one overfits, that is, shows the largest gap between train and test accuracy?
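
The comparison loop might look like the sketch below. It runs on synthetic stand-in data with some label noise (`flip_y=0.1`, an assumption chosen so there is noise to memorize), so the numbers will differ from the housing results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the housing data); flip_y randomly flips 10% of labels
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for depth in [2, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)  # accuracy on data the tree has seen
    test_acc = tree.score(X_test, y_test)     # accuracy on held-out data
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth!s:>4}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between the two columns for a given depth is the signature of overfitting.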

In [None]:
# Fit trees with different max_depth values and compare train/test accuracy


### 5. Interpretation — Overfitting

Explain in your own words why an unlimited-depth tree overfits. What is it "memorizing"?

**Your answer:**

(Write your answer here)

### 6. Use AI — Fit a random forest and compare

Fit a `RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)`. Compare its test accuracy to the best single decision tree from above.
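
As a rough template, the sketch below fits a single tree and a forest with the same `max_depth` on synthetic stand-in data (an assumption for this example) and prints both test accuracies. Which model wins on the housing data is for you to check.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the housing data)
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=42
).fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"Single tree test accuracy:   {tree_acc:.3f}")
print(f"Random forest test accuracy: {forest_acc:.3f}")
print("Trees in forest:", len(forest.estimators_))
```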

In [None]:
# Step 1: Fit RandomForestClassifier


# Step 2: Print test accuracy and compare to best single tree


### 7. Use AI — Compare feature importances

Plot the feature importances for both the single decision tree (`max_depth=5`) and the random forest, side by side.
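
Both models expose a `feature_importances_` array after fitting. The sketch below prints the two arrays as a small table before any plotting. It assumes synthetic stand-in data and the hypothetical feature names `f0` to `f5`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the housing data)
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)
forest = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=42
).fit(X, y)

# Importances are normalized, so each column sums to 1
print(f"{'feature':<8} {'tree':>6} {'forest':>7}")
for name, t_imp, f_imp in zip(
    feature_names, tree.feature_importances_, forest.feature_importances_
):
    print(f"{name:<8} {t_imp:>6.3f} {f_imp:>7.3f}")
```

The same two arrays can be passed to a bar-plot call on each of the two axes returned by `plt.subplots(1, 2)`.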

In [None]:
# Plot feature importances side by side
# Hint: use plt.subplots(1, 2, figsize=(14, 5))


### 8. Interpretation — Feature importance stability

Are the feature importances the same between the single tree and the random forest? Why might a random forest give more stable and reliable feature importances?
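
One way to probe stability empirically is to refit both models on several bootstrap resamples of the rows and look at how much each feature's importance moves. The sketch below does this on synthetic stand-in data. The choices of 5 resamples and 50 trees are arbitrary assumptions to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the housing data)
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
rng = np.random.default_rng(0)

tree_imps, forest_imps = [], []
for _ in range(5):
    # Bootstrap resample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X[idx], y[idx])
    forest = RandomForestClassifier(
        n_estimators=50, max_depth=5, random_state=42
    ).fit(X[idx], y[idx])
    tree_imps.append(tree.feature_importances_)
    forest_imps.append(forest.feature_importances_)

tree_imps = np.array(tree_imps)      # shape: (resamples, features)
forest_imps = np.array(forest_imps)

# Standard deviation across resamples: smaller means more stable
print("Tree importance std per feature:  ", tree_imps.std(axis=0).round(3))
print("Forest importance std per feature:", forest_imps.std(axis=0).round(3))
```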

**Your answer:**

(Write your answer here)

## Discussion

If you had to explain your model to a home buyer, would you show them the decision tree or the random forest results? Why?

(Discuss with a neighbor)