# Lesson 5.5: Decision Trees

## A Flowchart of Yes/No Questions

### PHP Parallel
A decision tree is like **nested if/else statements** that the computer writes for itself:
```php
if ($tds > 85) {
    if ($flow_rate < 1.0) return 'critical';
    else return 'warning';
} else {
    return 'ok';
}
```
But instead of YOU writing these rules, the tree LEARNS them from data!
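
To make the parallel concrete, the sketch below fits a tiny tree and prints the rules it learned as indented if/else-style text using scikit-learn's `export_text`. The toy readings, the variable names (`readings`, `needs_attention`, `toy_tree`), and the single-threshold labeling rule are invented for illustration and are not the lesson's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy readings, invented for illustration only
rng = np.random.default_rng(0)
readings = pd.DataFrame({
    "tds": rng.uniform(30, 120, 200),
    "flow_rate": rng.uniform(0.5, 2.5, 200),
})

# Label the rows with a hand-written rule (1 = needs attention),
# then see whether the tree can rediscover that rule from the data alone
needs_attention = (readings["tds"] > 85).astype(int)

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(readings, needs_attention)

# export_text renders the learned splits as nested, indented conditions
rules = export_text(toy_tree, feature_names=list(readings.columns))
print(rules)
```

The printed threshold should land close to the 85 used to create the labels, even though the tree was never told that number.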

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

%matplotlib inline

In [None]:
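# Synthetic water filter data (the fixed seed makes it reproducible)
np.random.seed(42)
n = 300
age = np.random.randint(10, 365, n)                   # filter age in days
tds = 30 + age * 0.25 + np.random.randn(n) * 15       # TDS output rises as the filter ages
flow = 2.5 - age * 0.004 + np.random.randn(n) * 0.3   # flow rate drops as the filter ages
pressure = np.random.uniform(30, 70, n)               # independent of age, not used in the label

X = pd.DataFrame({'tds_output': tds, 'flow_rate': flow, 'age_days': age, 'pressure': pressure})
# Label: 1 = needs maintenance (high TDS or low flow), 0 = OK
y = ((tds > 80) | (flow < 1.0)).astype(int)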

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Train a simple tree (limited depth so we can visualize it)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(f"Accuracy: {accuracy_score(y_test, tree.predict(X_test)):.1%}")

In [None]:
# VISUALIZE THE TREE - this is the best part!
plt.figure(figsize=(16, 8))
plot_tree(tree, feature_names=list(X.columns), class_names=['OK', 'Maintenance'],
          filled=True, rounded=True, fontsize=10)
plt.title('Decision Tree: Water Filter Maintenance Prediction')
plt.show()
print("Read it top-to-bottom: each box asks a yes/no question!")
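
Each box drawn by `plot_tree` also lists a `gini` value by default. As a rough sketch of what that number means, the helper functions below (hypothetical names, not scikit-learn's internal code) compute Gini impurity for a set of labels and score a candidate yes/no question by the weighted impurity of the two groups it creates. The six example readings are made up.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / labels.size
    return float(1.0 - np.sum(proportions ** 2))

def weighted_split_impurity(feature_values, labels, threshold):
    """Impurity after asking 'is the feature <= threshold?', weighted by group size."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = labels.size
    return (left.size / n) * gini_impurity(left) + (right.size / n) * gini_impurity(right)

# Made-up example: six TDS readings and whether each filter needed maintenance
example_tds = [40, 55, 70, 90, 100, 110]
example_labels = [0, 0, 0, 1, 1, 1]

before = gini_impurity(example_labels)
split_at_80 = weighted_split_impurity(example_tds, example_labels, 80)
split_at_60 = weighted_split_impurity(example_tds, example_labels, 60)

print(f"Impurity before any split: {before:.3f}")
print(f"After 'tds <= 80?':        {split_at_80:.3f}")
print(f"After 'tds <= 60?':        {split_at_60:.3f}")
```

A lower weighted impurity means the question separates the classes more cleanly, which is why a tree would prefer the split at 80 over the split at 60 here.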

In [None]:
# Feature importance - which features matter most?
importance = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=True)

plt.figure(figsize=(8, 4))
importance.plot(kind='barh', color='steelblue')
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
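
The impurity-based importances that scikit-learn reports are normalized to sum to 1, so each value can be read as a share of the tree's total impurity reduction. The sketch below rebuilds the lesson's synthetic data so it runs on its own, fits a depth-3 tree on all rows (a simplification, not the train/test split used above), and lists the shares along with any feature the tree never split on. The `*_demo` names are introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Rebuild the lesson's synthetic data so this cell is self-contained
np.random.seed(42)
n = 300
age = np.random.randint(10, 365, n)
tds = 30 + age * 0.25 + np.random.randn(n) * 15
flow = 2.5 - age * 0.004 + np.random.randn(n) * 0.3
pressure = np.random.uniform(30, 70, n)
X_demo = pd.DataFrame({'tds_output': tds, 'flow_rate': flow,
                       'age_days': age, 'pressure': pressure})
y_demo = ((tds > 80) | (flow < 1.0)).astype(int)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# One importance value per feature, in column order
importance_share = pd.Series(demo_tree.feature_importances_, index=X_demo.columns)
print(importance_share.sort_values(ascending=False).round(3))
print(f"Sum of importances: {importance_share.sum():.3f}")

# A feature with importance 0 was never used in any split
unused = importance_share[importance_share == 0].index.tolist()
print("Features the tree never split on:", unused)
```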

In [None]:
# DANGER: Overfitting! Deep trees memorize the training data
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=42)  # No limit!
deep_tree.fit(X_train, y_train)

print(f"Shallow tree (depth=3): Train={accuracy_score(y_train, tree.predict(X_train)):.1%}, Test={accuracy_score(y_test, tree.predict(X_test)):.1%}")
print(f"Deep tree (no limit):   Train={accuracy_score(y_train, deep_tree.predict(X_train)):.1%}, Test={accuracy_score(y_test, deep_tree.predict(X_test)):.1%}")
print("\nDeep tree: perfect on training. If its test score is lower, that gap → OVERFITTING!")
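
One way to see the "memorizing" is to look at how large each tree grows. The sketch below regenerates the lesson's data so it runs on its own, then uses `get_depth()` and `get_n_leaves()` to compare a depth-3 tree with an unrestricted one. The names `shallow` and `deep` are introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the lesson's synthetic data so this cell is self-contained
np.random.seed(42)
n = 300
age = np.random.randint(10, 365, n)
tds = 30 + age * 0.25 + np.random.randn(n) * 15
flow = 2.5 - age * 0.004 + np.random.randn(n) * 0.3
pressure = np.random.uniform(30, 70, n)
X_demo = pd.DataFrame({'tds_output': tds, 'flow_rate': flow,
                       'age_days': age, 'pressure': pressure})
y_demo = ((tds > 80) | (flow < 1.0)).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(Xtr, ytr)
deep = DecisionTreeClassifier(max_depth=None, random_state=42).fit(Xtr, ytr)

# An unrestricted tree keeps splitting until every leaf is pure,
# so it can end up with a separate leaf for individual odd training rows
for name, model in [("Shallow", shallow), ("Deep", deep)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
          f"train={model.score(Xtr, ytr):.1%}, test={model.score(Xte, yte):.1%}")
```

Because the labels in this dataset follow a clean threshold rule, the train/test gap may be small here. Real, noisier data tends to show a larger one.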

## Random Forest (Preview)

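Instead of one tree, train MANY trees, each on a random sample of the rows (and considering a random subset of features at each split), and **vote** on the answer. This usually reduces overfitting compared with a single deep tree.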

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print(f"Random Forest: Train={accuracy_score(y_train, forest.predict(X_train)):.1%}, Test={accuracy_score(y_test, forest.predict(X_test)):.1%}")
print("Compare this train/test gap with the single deep tree above.")
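
To look inside the "vote", the sketch below pulls the individual trees out of `forest.estimators_` and combines them by hand. In scikit-learn the forest averages each tree's class probabilities rather than counting hard votes, so the manual average should match `predict_proba`. The data is regenerated so the cell runs on its own, and the `demo_forest` and `*_demo` names are introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Rebuild the lesson's synthetic data so this cell is self-contained
np.random.seed(42)
n = 300
age = np.random.randint(10, 365, n)
tds = 30 + age * 0.25 + np.random.randn(n) * 15
flow = 2.5 - age * 0.004 + np.random.randn(n) * 0.3
pressure = np.random.uniform(30, 70, n)
X_demo = pd.DataFrame({'tds_output': tds, 'flow_rate': flow,
                       'age_days': age, 'pressure': pressure})
y_demo = ((tds > 80) | (flow < 1.0)).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xtr, ytr)

# The individual trees were fitted on plain arrays, so pass NumPy values
# to them to avoid a feature-name warning
X_values = Xte.to_numpy()
per_tree_proba = np.stack([t.predict_proba(X_values) for t in demo_forest.estimators_])

# Average the per-tree class probabilities across all trees
averaged = per_tree_proba.mean(axis=0)

print("Trees in the forest:", len(demo_forest.estimators_))
print("Manual average matches forest.predict_proba:",
      np.allclose(averaged, demo_forest.predict_proba(Xte)))

# How many trees lean toward 'Maintenance' for the first test row?
leaning = int((per_tree_proba[:, 0, 1] > 0.5).sum())
print(f"Trees leaning toward maintenance for the first test row: {leaning} of 100")
```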

## Exercise

1. Try different `max_depth` values (2, 5, 10, None) and compare train vs. test accuracy.
2. Which features are most important for predicting maintenance?
3. Compare `DecisionTreeClassifier` vs. `RandomForestClassifier` accuracy.

In [None]:
# YOUR CODE HERE