# Decision Trees

This notebook introduces decision trees by training a scikit-learn `DecisionTreeClassifier` on the Iris dataset and visualizing the result.

The Iris dataset is one of the most famous and widely used datasets in machine learning. It was introduced by the statistician Ronald Fisher in 1936 in his paper "The Use of Multiple Measurements in Taxonomic Problems."

The dataset contains 150 samples from three species of Iris flowers (Iris setosa, Iris virginica, and Iris versicolor), with 50 samples from each species. For each flower, four features were measured (all in centimeters):

- sepal length
- sepal width
- petal length
- petal width
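
As a quick check of the description above, the sketch below loads scikit-learn's bundled copy of the dataset (the same `load_iris` used later in this notebook) and prints its shape, feature names, and per-class counts. The variable name `class_counts` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# One row per flower, one column per measured feature
print("Data shape:", iris.data.shape)
print("Features:", iris.feature_names)

# Count how many samples belong to each species
class_counts = {
    str(name): int(count)
    for name, count in zip(iris.target_names, np.bincount(iris.target))
}
print("Samples per class:", class_counts)
```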

The Iris dataset is often used for classification tasks because it provides a simple but non-trivial challenge. One class (Iris setosa) is linearly separable from the other two, while the other two classes (Iris virginica and Iris versicolor) have some overlap, making it a good test case for different classification algorithms. It's commonly used for teaching machine learning concepts, demonstrating visualization techniques, and benchmarking classification algorithms.
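
The separability of Iris setosa can be illustrated with a single threshold, which is a special case of a linear boundary. The sketch below uses petal length (column index 2 in scikit-learn's copy of the data); choosing that feature is an assumption made here for illustration, not something this notebook relies on later.

```python
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # third column: petal length (cm)
is_setosa = iris.target == 0     # class 0 is setosa in iris.target_names

# If the longest setosa petal is shorter than the shortest petal of the
# other two species, any threshold between the two values separates setosa.
setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()
separable_by_threshold = bool(setosa_max < others_min)

print(f"Longest setosa petal:          {setosa_max:.1f} cm")
print(f"Shortest non-setosa petal:     {others_min:.1f} cm")
print(f"Separable by one threshold:    {separable_by_threshold}")
```

No comparable single-feature threshold cleanly separates versicolor from virginica, which is the overlap the paragraph above refers to.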

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns

# Load the iris dataset
iris = load_iris()
X = iris.data[:, :2]  # Use only the first two features for visualization
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train a decision tree classifier
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train, y_train)

# Make predictions
y_pred = tree_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


In [None]:
# Create a figure with multiple plots
plt.figure(figsize=(15, 10))

# Plot 1: Decision boundaries
plt.subplot(2, 2, 1)
h = 0.02  # Step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = tree_clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdYlBu)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.RdYlBu, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Decision Boundaries')

# Plot 2: Decision Tree Visualization
plt.subplot(2, 2, 2)
plot_tree(tree_clf, filled=True, feature_names=iris.feature_names[:2], 
          class_names=iris.target_names, rounded=True)
plt.title('Decision Tree')

# Plot 3: Feature Importance
plt.subplot(2, 2, 3)
importance = tree_clf.feature_importances_
plt.bar(iris.feature_names[:2], importance)
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importance')

# Plot 4: Confusion Matrix
plt.subplot(2, 2, 4)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')

plt.tight_layout()
plt.show()

# Print some information about the tree
print(f"Number of nodes: {tree_clf.tree_.node_count}")
print(f"Tree depth: {tree_clf.get_depth()}")
print(f"Number of leaves: {tree_clf.get_n_leaves()}")

The four plots summarize how the decision tree classifies the Iris flowers:

1. **Decision Boundaries**: The plot shows how the decision tree divides the feature space using sepal length and width. The red, blue, and yellow regions represent the three Iris species. The decision boundaries are axis-aligned (characteristic of decision trees, because each split compares a single feature to a threshold), creating rectangular regions. There's a clear vertical boundary around sepal length = 5, which appears to separate one species (likely Setosa) from the others.

2. **Decision Tree Structure**: The tree has a depth of 3 with 15 nodes total and 8 leaf nodes. The first split is on sepal length, which aligns with what we see in the decision boundaries. The tree's structure reveals the hierarchical decision-making process.

3. **Feature Importance**: Sepal length is shown to be significantly more important (about 0.72) than sepal width (about 0.28). This is consistent with the first split in the tree being on sepal length, since importance is computed from the impurity reduction each feature's splits achieve.

4. **Confusion Matrix**: The model classifies setosa well (18/19 correct) but confuses versicolor and virginica, which is expected because these two species are known to have overlapping characteristics. The overall accuracy is about (18+8+8)/45 = 34/45 ≈ 76%, which is reasonable given that we're using only two of the four available features.
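
Several of the observations in this list can be cross-checked in code. The sketch below refits the same classifier with the same settings as the cells above, prints the learned rules with `export_text` (each rule tests one feature against one threshold, which is why the boundaries are rectangular), and verifies three relationships: a binary tree has `2 * leaves - 1` nodes, accuracy equals the confusion-matrix diagonal divided by its total, and the feature importances sum to 1. The name `acc_from_cm` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Same data, split, and model settings as the cells above
iris = load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train, y_train)
y_pred = tree_clf.predict(X_test)

# Text view of the tree: every rule has the form "feature <= threshold"
print(export_text(tree_clf, feature_names=list(iris.feature_names[:2])))

# In a binary tree every internal node has two children,
# so node count = 2 * leaf count - 1
n_nodes = tree_clf.tree_.node_count
n_leaves = tree_clf.get_n_leaves()
print(f"nodes={n_nodes}, leaves={n_leaves}, "
      f"2*leaves-1={2 * n_leaves - 1}")

# Accuracy is the share of test samples on the confusion-matrix diagonal
cm = confusion_matrix(y_test, y_pred)
acc_from_cm = np.trace(cm) / cm.sum()
print(f"accuracy from confusion matrix: {acc_from_cm:.4f}")
print(f"accuracy_score:                 {accuracy_score(y_test, y_pred):.4f}")

# Importances are normalized, so they sum to 1 across features
print(f"importance sum: {tree_clf.feature_importances_.sum():.4f}")
```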

This visualization demonstrates how decision trees make classification decisions with axis-aligned boundaries. It also helps motivate ensemble methods like random forests: a single shallow decision tree has limitations in capturing the relationship between versicolor and virginica when using only sepal measurements, although part of that limitation comes from the restricted features rather than from the model itself.
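
One way to probe how much of the limitation comes from the model and how much from the features is to compare a single tree with a random forest on both feature sets. The sketch below is an illustration only: the 5-fold cross-validation, the `n_estimators=100` setting, and the `feature_sets`, `models`, and `scores` names are choices made here, not part of the notebook above, and the exact scores will depend on the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Two views of the data: sepal measurements only, and all four features
feature_sets = {
    "sepal only": iris.data[:, :2],
    "all four": iris.data,
}

# Factories so each evaluation starts from a fresh, unfitted model
models = {
    "single tree": lambda: DecisionTreeClassifier(max_depth=3, random_state=42),
    "random forest": lambda: RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for feature_name, X in feature_sets.items():
    for model_name, make_model in models.items():
        # Mean accuracy over 5 stratified folds
        cv_scores = cross_val_score(make_model(), X, iris.target, cv=5)
        scores[(feature_name, model_name)] = cv_scores.mean()
        print(f"{feature_name:>10} | {model_name:<13} | "
              f"mean accuracy = {cv_scores.mean():.3f}")
```

Comparing the rows shows whether switching models or adding the petal measurements makes the larger difference on this dataset.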