# Decision Trees

Continuing our discussion of decision trees (DTs).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns

In [None]:
# Load the iris dataset
iris = load_iris()
print(iris.feature_names)
print(iris.target_names)

In [None]:
X = iris.data
y = iris.target
X.shape

In [None]:
# Split the data into training (n=105) and testing (n=45) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train a decision tree classifier
tree_clf = DecisionTreeClassifier(max_depth=4)
tree_clf.fit(X_train, y_train)

# Make predictions
y_pred = tree_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.2f}")

100% accuracy on the test data, even for a single split, is noteworthy.

Let's see if we just got lucky. 5-fold CV will give us a better estimate of the model's performance on new data.

In [None]:
from sklearn.model_selection import cross_val_score

# Use 5-fold cross-validation
cv_scores = cross_val_score(tree_clf, X, y, cv=5)

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.2f}")
print(f"Standard deviation: {cv_scores.std():.2f}")

In [None]:
plt.figure(figsize=(15, 10))
plot_tree(tree_clf, filled=True, feature_names=iris.feature_names, 
          class_names=iris.target_names, rounded=True, fontsize=12)
plt.title('Decision Tree')

plt.tight_layout()
plt.show()

# Print some information about the tree
print(f"Number of nodes: {tree_clf.tree_.node_count}")
print(f"Tree depth: {tree_clf.get_depth()}")
print(f"Number of leaves: {tree_clf.get_n_leaves()}")

Each node contains:

- Decision feature and threshold
- Gini impurity of the node's class distribution
- Number of samples in the node
- Number of samples per class, in the order setosa, versicolor, virginica
- Majority class (tie goes to the class with the lowest index in the target array)
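
The Gini value shown in each node can be reproduced by hand from the class counts. The sketch below assumes the standard definition (one minus the sum of squared class proportions); the helper name `gini_impurity` and the example counts are made up for illustration.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# Hypothetical class counts in the order setosa, versicolor, virginica
pure_node = [31, 0, 0]
mixed_node = [0, 4, 4]

print(gini_impurity(pure_node))   # a pure node has impurity 0
print(gini_impurity(mixed_node))  # an even two-class node has impurity 0.5

# np.argmax returns the first maximum, so ties go to the lowest class index
print(np.argmax(mixed_node))
```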

Starting from the top:

1. **Root Node**: The first decision is based on petal width ≤ 0.8 cm
   - If TRUE (left branch): The flower is classified as "setosa" with 100% certainty (gini = 0.0)
   - If FALSE (right branch): Continue to the next decision
2. **Second Level** (right branch): Check if petal width ≤ 1.75 cm
   - If TRUE (left branch): Likely versicolor, but needs more checks
   - If FALSE (right branch): Continue checking for virginica
3. **Third Level**: 
   - Left path checks petal length ≤ 4.95 cm, then further splits on petal width ≤ 1.6 cm
   - Right path checks petal length ≤ 4.85 cm, then further splits on sepal width ≤ 3.1 cm
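
The rules read off the figure can also be printed as plain text with `export_text`. This is a minimal sketch that refits the same kind of tree. The `random_state` on the classifier is an assumption added for repeatability, and because ties between equally good splits are broken randomly, the features and thresholds printed here may differ from those in the figure.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# random_state here is an assumption, added so the printed rules are repeatable
tree_clf = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_clf.fit(X_train, y_train)

# One line per node: "feature <= threshold" for splits, "class: ..." for leaves
rules = export_text(tree_clf, feature_names=iris.feature_names)
print(rules)
```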

The key insights:
- Setosa is very easy to identify (just one rule: petal width ≤ 0.8 cm)
- Versicolor and virginica require more complex rules to separate
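- As we get to the leaf nodes (bottom), most have gini = 0.0, meaning every training sample in the leaf belongs to a single class
- The model achieved 100% test accuracy because the rules learned from the training data also held for every flower in this particular test split

The tree demonstrates that even a relatively simple decision tree can classify this test split perfectly because the species are fairly well separated in the feature space. The cross-validation scores above remain the better guide to performance on new data.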

## The SKL Way

Here we will demonstrate the general process for model selection and evaluation using scikit-learn (SKL) pipelines.

First, load the packages.

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

Then load the data and split it into test and train.

In [None]:
# Load the data
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Create a `Pipeline` object that defines the steps to apply, in order, whenever `fit` or `predict` is called.

In [None]:
# Create a pipeline with preprocessing and model
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('classifier', DecisionTreeClassifier(random_state=42))
])

Assume we want to compare results across a range of depths for the DT. We can define a dictionary that maps each parameter name to the values of interest. Note the syntax `classifier__max_depth`, where the double underscore (`__`) separates the pipeline step name from the parameter of that step.

In [None]:
# Define parameter grid for grid search
param_grid = {
    'classifier__max_depth': np.arange(1, 10)  # Try depths 1-9
}
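
One way to check which names are valid in a parameter grid is to list the pipeline's own parameters. The sketch below rebuilds the same pipeline so that it runs on its own; the variable name `tunable` is just for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Every key of the form "<step>__<parameter>" can be used in a parameter grid
tunable = sorted(name for name in pipeline.get_params() if '__' in name)
print(tunable)

# set_params follows the same naming convention
pipeline.set_params(classifier__max_depth=3)
print(pipeline.named_steps['classifier'].max_depth)
```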

Use `GridSearchCV` to search the parameter grid using cross-validation. First we create the properly specified object instance.

In [None]:
# Set up GridSearchCV
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',
    return_train_score=True
)

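Calling the `fit` method performs the following for each parameter value specified in the grid:

- Split the training data into 5 folds
- For each fold in turn, fit the scaler and the classifier on the other 4 folds
- Apply the fitted scaler to the held-out fold and generate predictions for it
- Score the predictions
- Average the 5 scores to get the cross-validation score for that parameter value

Once the search is complete, the pipeline is refit on all of the training data using the best parameter value.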
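
The steps listed above can be written out as an explicit loop. This is only a rough sketch of what the search does, not scikit-learn's actual implementation. It assumes stratified 5-fold splitting without shuffling, which is what `cv=5` means for a classifier; the names `manual_scores` and `best_depth` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Assumed equivalent of cv=5 for a classifier: stratified folds, no shuffling
folds = StratifiedKFold(n_splits=5)

manual_scores = {}
for depth in range(1, 10):
    fold_scores = []
    for fit_idx, val_idx in folds.split(X_train, y_train):
        # Fresh, unfitted copy of the pipeline with this depth
        model = clone(pipeline).set_params(classifier__max_depth=depth)
        # The scaler learns its mean and variance from the 4 fitting folds only
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # The held-out fold is scaled with those statistics, then scored (accuracy)
        fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))
    manual_scores[depth] = np.mean(fold_scores)

best_depth = max(manual_scores, key=manual_scores.get)
print(f"Best depth: {best_depth}, mean CV accuracy: {manual_scores[best_depth]:.2f}")
```

Putting the scaler inside the pipeline is what keeps the held-out fold out of the scaling statistics; scaling the whole training set before splitting would leak information from the validation fold.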


In [None]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

The best parameter value and its mean cross-validation score are stored in the `best_params_` and `best_score_` attributes of the fitted object.

In [None]:
# Print results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.2f}")
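
Beyond the single best value, the `cv_results_` attribute holds the scores for every depth that was tried. The sketch below rebuilds the search so that it runs on its own, then prints mean training and validation accuracy side by side. A training score that keeps rising while the validation score stalls is a common sign of overfitting.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=42))
])
grid_search = GridSearchCV(
    pipeline,
    {'classifier__max_depth': np.arange(1, 10)},
    cv=5,
    scoring='accuracy',
    return_train_score=True
)
grid_search.fit(X_train, y_train)

results = grid_search.cv_results_
# mean_train_score is only present because return_train_score=True
for params, train, val in zip(results['params'],
                              results['mean_train_score'],
                              results['mean_test_score']):
    depth = params['classifier__max_depth']
    print(f"depth={depth}: train={train:.2f}, validation={val:.2f}")
```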

Pick the best model using `best_estimator_` (the pipeline refit on the full training set with the best parameter) and score it on the held-out test data to estimate its performance on unseen data.

In [None]:
# Evaluate on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test accuracy with best model: {test_accuracy:.2f}")

Generate a classification report comparing the test labels with the predictions.

In [None]:
# More detailed evaluation
y_pred = best_model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

A feature's importance is calculated from its total contribution to decreasing impurity across all nodes in the tree that split on it. This information is a natural byproduct of the DT fitting process.

In [None]:
# Show feature importances
importances = best_model.named_steps['classifier'].feature_importances_
for feature, importance in zip(iris.feature_names, importances):
    print(f"{feature}: {importance:.4f}")
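
scikit-learn normalizes these importances so that they sum to 1, and a feature that is never used for a split gets an importance of 0. The short sketch below checks this on a standalone tree and lists the features from most to least important. The `max_depth=3` setting is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# max_depth=3 is an arbitrary choice for this illustration
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)

importances = clf.feature_importances_
print(f"Sum of importances: {importances.sum():.2f}")

# List features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"{iris.feature_names[idx]}: {importances[idx]:.4f}")
```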