# Decision Trees

Finishing our discussion of Tree Based methods.

But first...

## Feature Selection / Extraction Methods

These are alternatives to the other feature selection methods we have discussed: procedural (best subset / forward / backward selection) and algorithmic (regularization in MLR, feature importance in d-trees). They are not functionally the same, but they are used to answer the same question: how do we simplify the data?

### Make Dataset

In [None]:
import numpy as np
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Generate a synthetic dataset with built-in correlations and redundant features
def generate_sample_data(n_samples=1000, n_features=20, n_informative=10, random_state=42):
    """
    Generate a synthetic dataset with some informative and some redundant features.
    
    Parameters:
    -----------
    n_samples : int
        Number of samples to generate
    n_features : int
        Total number of features to generate
    n_informative : int
        Number of features that are informative
    random_state : int
        Random seed for reproducibility
        
    Returns:
    --------
    X : ndarray of shape (n_samples, n_features)
        The generated feature matrix
    y : ndarray of shape (n_samples,)
        The generated target vector
    feature_names : list
        Names for each feature
    """
    # Create a classification problem
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_informative,
        n_redundant=int(n_features * 0.3),  # 30% redundant features
        n_repeated=0,
        n_classes=2,
        n_clusters_per_class=2,
        weights=None,
        flip_y=0.05,
        class_sep=1.0,
        hypercube=True,
        shift=0.0,
        scale=1.0,
        shuffle=True,
        random_state=random_state
    )
    
    # Create feature names
    feature_names = [f'feature_{i+1}' for i in range(n_features)]
    
    # Add some additional correlated features to make PCA more interesting
    if n_features > 5:
        # Seeded generator so the added noise is reproducible along with the rest of the data
        rng = np.random.default_rng(random_state)

        # Make feature_5 correlated with feature_1 and feature_2
        X[:, 4] = 0.7 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * rng.standard_normal(n_samples)
        
        # Make feature_6 correlated with feature_3
        X[:, 5] = 0.85 * X[:, 2] + 0.15 * rng.standard_normal(n_samples)
    
    print(f"Generated dataset with {n_samples} samples and {n_features} features")
    print(f"Class distribution: {np.bincount(y)}")
    
    return X, y, feature_names

# Generate the data
X, y, feature_names = generate_sample_data()

# Print some basic information
print("\nFeature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
print("First 5 feature names:", feature_names[:5])

# Calculate feature correlations to show the redundancy
import pandas as pd
import seaborn as sns

# Create a correlation matrix
df = pd.DataFrame(X, columns=feature_names)
correlation_matrix = df.corr().abs()

# Visualize correlations
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', 
            xticklabels=feature_names, yticklabels=feature_names)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

print("\nYou can now use this data with the PCA and SelectKBest examples!")

### Feature Selection

Choosing the "best" features in the dataset. Everything we've done so far falls into this category - doesn't make new features, just helps us choose / identify the most "important."

`SelectKBest` is another alternative from SKL. It selects features based on statistical tests of relationship (e.g., ANOVA F-test, Chi-squared) with the target. This bridges the gap between the inferential and predictive approaches we've discussed.

The basic implementation is given below, using ANOVA (`f_classif`). For each feature, this compares the class means by measuring the ratio ($F$) of between-class to within-class variance, so higher F-values indicate features that better separate the classes. Refer to your stats class notes for more info! ;-P

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Create and fit SelectKBest for classification
k = 5  # Select top 5 features
selector = SelectKBest(f_classif, k=k)  
X_selected = selector.fit_transform(X, y)

# Get selected feature indices
selected_indices = selector.get_support(indices=True)
selected_features = [feature_names[i] for i in selected_indices]
print(selected_features)
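
To make the F-score less of a black box, here is a minimal sketch that computes the one-way ANOVA F-statistic by hand and compares it with `f_classif`. The small dataset and every name ending in `_demo`, as well as the helper `anova_f`, are illustrative assumptions and not part of the notebook's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Small synthetic problem, kept separate from the notebook's X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=2,
                                     n_redundant=0, random_state=0)

def anova_f(feature, labels):
    """One-way ANOVA F-statistic for a single feature (hypothetical helper)."""
    classes = np.unique(labels)
    grand_mean = feature.mean()
    # Between-class sum of squares: class size times squared offset of the class mean
    ss_between = sum(
        (labels == c).sum() * (feature[labels == c].mean() - grand_mean) ** 2
        for c in classes
    )
    # Within-class sum of squares: spread of each class around its own mean
    ss_within = sum(
        ((feature[labels == c] - feature[labels == c].mean()) ** 2).sum()
        for c in classes
    )
    df_between = len(classes) - 1
    df_within = len(feature) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

manual_f = np.array([anova_f(X_demo[:, j], y_demo) for j in range(X_demo.shape[1])])
sklearn_f, _ = f_classif(X_demo, y_demo)

print("Manual F-values: ", np.round(manual_f, 3))
print("f_classif values:", np.round(sklearn_f, 3))
```

The two rows should agree up to floating-point error, which shows that the score depends only on one feature and the target at a time.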

Here is some code to extract the scores and plot them.

In [None]:
# Get scores and p-values
scores = selector.scores_
pvalues = selector.pvalues_

# Create a DataFrame with all scores and selection status
feature_scores = pd.DataFrame({
    'Feature': feature_names,
    'Score': scores,
    'p-value': pvalues,
    'Selected': np.isin(np.arange(len(feature_names)), selected_indices)
})

# Sort by score in descending order
feature_scores = feature_scores.sort_values('Score', ascending=False)

# Visualization 1: Bar chart of feature scores
plt.figure(figsize=(12, 6))
bars = plt.bar(
    feature_scores['Feature'], 
    feature_scores['Score'],
    color=[('steelblue' if selected else 'lightgray') for selected in feature_scores['Selected']]
)
plt.xticks(rotation=90)
plt.title(f'Feature Importance Scores (Top {k} Selected)')
plt.xlabel('Features')
plt.ylabel('F-Score')
plt.tight_layout()
plt.show()

#### Key Characteristics

1. Univariate Analysis: Each feature is evaluated independently, ignoring interactions
2. Quick Computation: Much faster than wrapper methods like RFE
3. No Model Training: Doesn't require training a model to evaluate features
4. Statistical Foundation: Based on well-established statistical tests

The key limitation is that `SelectKBest` doesn't consider feature interactions or redundancy. Two highly correlated features might both be selected even though they provide similar information.
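
The redundancy problem is easy to reproduce. The sketch below builds a tiny hand-made dataset (an assumption for illustration, not the notebook's data) with one strong feature, one weak feature, one pure-noise feature, and a near-copy of the strong feature, then asks `SelectKBest` for the top two.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n_rows = 500
y_toy = rng.integers(0, 2, size=n_rows)

strong = 2.0 * y_toy + rng.standard_normal(n_rows)        # clearly separates the classes
weak = 0.5 * y_toy + rng.standard_normal(n_rows)          # separates them only a little
noise = rng.standard_normal(n_rows)                       # unrelated to the target
strong_copy = strong + 0.05 * rng.standard_normal(n_rows) # near-duplicate of `strong`

toy_names = ['strong', 'weak', 'noise', 'strong_copy']
X_toy = np.column_stack([strong, weak, noise, strong_copy])

toy_selector = SelectKBest(f_classif, k=2).fit(X_toy, y_toy)
chosen = toy_selector.get_support(indices=True)

print("Selected:", [toy_names[i] for i in chosen])
print("Correlation between the two selected columns:",
      round(float(np.corrcoef(X_toy[:, chosen[0]], X_toy[:, chosen[1]])[0, 1]), 4))
```

Because each feature is scored on its own, the near-duplicate takes the second slot even though `weak` would add more new information.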

### Feature Extraction

Unlike feature *selection*, feature *extraction* creates **new features** by transforming the original ones. Several such methods exist, including factor analysis, t-SNE, and UMAP. Here, we'll focus on Principal Component Analysis (PCA).

PCA transforms existing features into uncorrelated components ordered by the amount of variance they explain. It works by finding orthogonal directions (eigenvectors) in the feature space where data varies the most, and projects the data onto these directions. See also this [Wikipedia page on PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) and related links. You may remember eigenvalues and eigenvectors from linear algebra (I didn't).

Each resulting principal component is a linear combination of the original features. The weights in that combination (the loadings) are the entries of the corresponding eigenvector.
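
For readers who want to see the eigenvector connection, the following sketch runs PCA "by hand" on a small random matrix and checks it against scikit-learn. The matrix `A` and its built-in correlation are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 300 samples, 4 features; feature 3 is built mostly from features 0 and 1
A = rng.standard_normal((300, 4))
A[:, 3] = 0.8 * A[:, 0] + 0.2 * A[:, 1] + 0.1 * rng.standard_normal(300)
A_scaled = StandardScaler().fit_transform(A)

# Eigen-decomposition of the covariance matrix of the standardized data
cov = np.cov(A_scaled, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

manual_ratio = eigvals / eigvals.sum()      # share of total variance per component
manual_scores = A_scaled @ eigvecs          # data projected onto the eigenvectors

pca_demo = PCA(n_components=4).fit(A_scaled)

print("Manual explained variance ratio: ", np.round(manual_ratio, 4))
print("sklearn explained variance ratio:", np.round(pca_demo.explained_variance_ratio_, 4))
```

The component directions themselves can differ from scikit-learn's by a sign flip, which does not change the variance explained.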

The basic SKL implementation follows. Note: always standardize your data first, since PCA is sensitive to feature scales! 

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always scale data before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create and fit PCA
pca = PCA(n_components=5)  # or n_components=0.95 to retain 95% variance
X_pca = pca.fit_transform(X_scaled)

# Access explained variance
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
print(sum(explained_variance))

These five components explain 65.6% of the variance in the feature space.

But what do they represent? That is harder to answer. Feature loadings give us the contribution of each original feature to a principal component.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import pandas as pd
import seaborn as sns


# Feature loadings (coefficients)
n_components = 5  # Examine the first five components
loadings = pd.DataFrame(
    data=pca.components_[:n_components, :].T,
    columns=[f'PC{i+1}' for i in range(n_components)],
    index=feature_names
)

# Create a heatmap of feature loadings
plt.figure(figsize=(12, 8))
sns.heatmap(loadings, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Feature Loadings (Coefficients) for First Five Principal Components')
plt.tight_layout()
plt.show()



From this, if we understand the data we may be able to intuit what each component is "capturing" about the original data.

#### Key Characteristics

1. Unsupervised Transformation: Creates new features (principal components) based solely on input feature relationships, without considering the target variable
2. Orthogonality: Produces completely uncorrelated components, eliminating multicollinearity issues
3. Variance Maximization: Orders components by explained variance, allowing dimensionality reduction with minimal information loss
4. Linear Combinations: Creates components as weighted linear combinations of original features

The key limitation is that PCA components lose direct interpretability since they're combinations of original features. The transformation is also sensitive to feature scaling, so standardization is essential before applying PCA. Additionally, PCA assumes linear relationships and is less effective when nonlinear patterns dominate the data.

## Packaged Ensemble Methods

### Random Forests (Bagging)

Implementation is straightforward, as you have come to expect.

Critical parameters for `RandomForestClassifier` include:

- `n_estimators`: Number of trees (diminishing returns, convergence properties)
- `max_features`: Number of features considered for each split
  - Classification default: `sqrt(n_features)`
  - Regression default: `n_features`
- `max_depth`: How deep to grow trees (depth vs. generalization)
- `min_samples_split` / `min_samples_leaf`: Controlling tree complexity
- `bootstrap`: Whether to use bootstrap sampling
- `oob_score`: Whether to use out-of-bag samples for validation
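
The `bootstrap` and `oob_score` options are linked: a bootstrap sample of size $n$ drawn with replacement leaves out roughly $(1 - 1/n)^n \approx 1/e \approx 36.8\%$ of the rows, and those left-out rows are what each tree can be validated on. A quick simulation (the row count of 10,000 is an arbitrary choice) shows this:

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000

# Draw one bootstrap sample: n_rows row indices, sampled with replacement
sample_idx = rng.integers(0, n_rows, size=n_rows)

# Mark every row that appears at least once in the sample
in_bag = np.zeros(n_rows, dtype=bool)
in_bag[sample_idx] = True

oob_fraction = 1.0 - in_bag.mean()
print(f"Out-of-bag fraction: {oob_fraction:.3f} (theory: {np.exp(-1):.3f})")
```

So each tree in the forest has about a third of the training rows available as a free validation set, which is what `rf.oob_score_` aggregates.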

In [None]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, 
                          n_informative=15, n_redundant=5, 
                          random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                   random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                           max_depth=None, min_samples_split=2,
                           bootstrap=True, oob_score=True,
                           random_state=42)
rf.fit(X_train, y_train)

# Evaluate
print(f"Training accuracy: {rf.score(X_train, y_train):.4f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.4f}")
print(f"OOB accuracy: {rf.oob_score_:.4f}")

### Gradient Boosting (Boosting)

Key parameters:

- `learning_rate`: Controls contribution of each tree
- `n_estimators`: Number of sequential trees
- `subsample`: Fraction of samples for stochastic gradient boosting (< 1.0)
- `max_depth`: Depth of individual trees (usually shallow, 3-5 levels)
- `min_samples_split` / `min_samples_leaf`: Controls tree complexity
- `max_features`: Number of features for each split (similar to Random Forest)

In practice, the full value of each tree's error prediction is not used. Only a portion of it, set by the learning rate, is added to the ensemble's prediction.
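
The role of the learning rate is easiest to see in a stripped-down boosting loop. The sketch below is a simplification under stated assumptions: it uses a regression target with squared-error loss (where the error to be predicted is just the residual), a toy sine-wave dataset, and depth-2 trees. It is not how `GradientBoostingClassifier` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_toy = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y_toy = np.sin(x_toy).ravel() + 0.2 * rng.standard_normal(200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y_toy - prediction                       # current errors
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(x_toy, residuals)                           # a small tree predicts the errors
    prediction += learning_rate * tree.predict(x_toy)    # add only a fraction of that prediction
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

A smaller `learning_rate` makes each step more cautious, so more rounds (`n_estimators`) are needed to reach the same training error; the two parameters are usually tuned together.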

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import matplotlib.pyplot as plt

# Train Gradient Boosting model
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                               max_depth=3, subsample=0.8,
                               random_state=42)
gb.fit(X_train, y_train)

# Evaluate
print(f"Training accuracy: {gb.score(X_train, y_train):.4f}")
print(f"Test accuracy: {gb.score(X_test, y_test):.4f}")

# Staged predictions (analyzing learning curve)
train_scores = []
test_scores = []

# For each iteration/stage of the boosting process:
for y_pred_train in gb.staged_predict(X_train):
    train_scores.append(accuracy_score(y_train, y_pred_train))
    
for y_pred_test in gb.staged_predict(X_test):
    test_scores.append(accuracy_score(y_test, y_pred_test))

# Plot learning curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(train_scores) + 1), train_scores, label='Training')
plt.plot(range(1, len(test_scores) + 1), test_scores, label='Testing')
plt.xlabel('Number of Boosting Iterations')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Gradient Boosting Learning Curve')
plt.tight_layout()
plt.show()

#### State of the Art in Gradient Boosting

While the SKL implementation is adequate, for the state of the art in Gradient Boosting, use either `XGBoost` ([eXtreme Gradient Boosting](https://xgboost.ai/), by Tianqi Chen of DMLC) or `LightGBM` ([LightGBM docs](https://lightgbm.readthedocs.io/en/stable/), by Microsoft). Both build on baseline GBM implementations like SKL's with different performance and efficiency optimizations. In short, XGBoost, which has built-in regularization and handles missing values automatically, is best when accuracy is critical and training time isn't a bottleneck. LightGBM is typically faster to train and handles categorical features better, so it is preferred for large data sets or other situations where training time is a major concern.

It is reasonable to claim that, in many cases, the best single model available for tabular data is XGBoost. It is a strong default choice with excellent performance across a wide range of scenarios.

The cell below shows a basic implementation of XGB with the SKL interface using the same data that we just trained the RF and GBM on.

As we'll see, the key to success with XGBoost is proper parameter tuning, which often requires experimentation based on your specific dataset. Key parameters include:

- `n_estimators`: Number of trees (boosting rounds).
- `max_depth`: Maximum depth of each tree.
- `learning_rate`: Step size shrinkage used to prevent overfitting.
- `subsample`: Fraction of samples used for fitting individual trees.
- `colsample_bytree`: Fraction of features used for building each tree.
- `gamma`: Minimum loss reduction required for a split.
- `reg_alpha`: L1 regularization term on weights.
- `reg_lambda`: L2 regularization term on weights.


In [None]:
from xgboost import XGBClassifier

# Create and train model
model = XGBClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    random_state=42
)
model.fit(X_train, y_train)

# Make predictions
y_pred_train = model.predict(X_train)  # Predict on training data
y_pred_test = model.predict(X_test)    # Predict on test data

# Evaluate
print(f"Training accuracy: {accuracy_score(y_train, y_pred_train):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_pred_test):.4f}")

In this case, XGB appears to have made me a liar. Test accuracy is lower than was previously measured with either RF or GBM.

Of course, for a more representative result, we should use `GridSearchCV` or `RandomizedSearchCV` to optimize the hyperparameters and then evaluate the best model on the test data. Below we'll explore a large hyperparameter space ($4 \times 5 \times 4 \times 3 \times 3 \times 4 \times 3 \times 4 \times 4 = 138,240$ parameter combinations) for the optimal solution. Using an exhaustive grid search with 5-fold cross-validation would fit **691,200 models!** By using random search we sample only 50 combinations (250 fits).
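
Rather than multiplying by hand, the size of a grid can be counted directly with scikit-learn's `ParameterGrid`. The dictionary below, named `search_space` here only to avoid clashing with the notebook's variables, repeats the value lists used in the search cell that follows.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the search cell that follows (copied for illustration)
search_space = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.2, 0.5],
    'min_child_weight': [1, 3, 5],
    'reg_alpha': [0, 0.1, 0.5, 1],
    'reg_lambda': [0.1, 0.5, 1, 5]
}

n_combinations = len(ParameterGrid(search_space))  # product of the list lengths
n_folds = 5
n_sampled = 50

print(f"Grid combinations:           {n_combinations:,}")
print(f"Exhaustive grid search fits: {n_combinations * n_folds:,}")
print(f"Random search fits:          {n_sampled * n_folds:,}")
```

Both search classes also refit the winning configuration once on the full training set by default, so the real totals are one fit higher.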

In [None]:
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
import time

# 1. Define a wide parameter grid
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.2, 0.5],
    'min_child_weight': [1, 3, 5],
    'reg_alpha': [0, 0.1, 0.5, 1],
    'reg_lambda': [0.1, 0.5, 1, 5]
}

# 2. First evaluate baseline model with cross-validation
base_model = XGBClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    random_state=42
)

base_cv_scores = cross_val_score(base_model, X_train, y_train, cv=5, scoring='accuracy')
print(f"Baseline CV Accuracy: {base_cv_scores.mean():.4f} ± {base_cv_scores.std():.4f}")

# 3. RandomizedSearchCV (more efficient for large parameter spaces)
start_time = time.time()

random_search = RandomizedSearchCV(
    estimator=XGBClassifier(objective='binary:logistic', random_state=42),
    param_distributions=param_grid,
    n_iter=50,  # Number of parameter settings sampled
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,  # Use all available cores
    verbose=1
)

random_search.fit(X_train, y_train)

random_search_time = time.time() - start_time
print(f"RandomizedSearchCV completed in {random_search_time:.2f} seconds")
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.4f}")

# Test the best model from random search
y_pred = random_search.predict(X_test)
print(f"Test accuracy with best random search model: {accuracy_score(y_test, y_pred):.4f}")

# Compare all approaches
print("\nComparison Summary:")
print(f"Baseline model test accuracy: {accuracy_score(y_test, base_model.fit(X_train, y_train).predict(X_test)):.4f}")
print(f"RandomizedSearchCV best test accuracy: {accuracy_score(y_test, random_search.predict(X_test)):.4f}")

This improves our test accuracy to 0.90, a significant improvement over the baseline - enough to surpass the default performance (an unfair comparison) of both RF and GBM above.

## Nested Cross-Validation for Ensemble Evaluation

When working with ensemble methods that require hyperparameter tuning, standard cross-validation can lead to optimistically biased performance estimates. Nested cross-validation provides a more honest assessment by separating model selection from model evaluation.

### The Problem with Simple CV for Ensembles

With simple CV and hyperparameter tuning, we:
1. Split data into K folds
2. Try different hyperparameters and select the best based on CV performance
3. Report this best performance as our estimate

This can lead to overfitting to the validation set because the hyperparameters were specifically chosen to maximize performance on that data.

### Nested CV Solution

Nested cross-validation uses two loops:
- An outer loop that splits data into training and test sets multiple times
- An inner loop that performs hyperparameter tuning using only the training portion from each outer split

This creates a clear separation between:
1. Model selection (finding optimal hyperparameters in the inner loop)
2. Model evaluation (assessing performance on truly unseen data in the outer loop)

For ensemble methods, this is particularly important because they have many hyperparameters that can be tuned to maximize performance. Without nested CV, we risk selecting hyperparameter combinations that happen to work well on our validation data but don't generalize to new data.

Here's a practical implementation:

In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Outer and inner cross-validation splitters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Model and parameter grid
model = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'max_features': ['sqrt', 'log2']
}

# Nested CV scores
nested_scores = []

# For each outer training/testing split:
# split() yields a pair of index arrays (train, test) that are used to extract the features and targets
for train_idx, test_idx in outer_cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    # Inner CV for hyperparameter tuning
    grid_search = GridSearchCV(
        estimator=model, param_grid=param_grid,
        cv=inner_cv, scoring='accuracy', n_jobs=-1
    )
    grid_search.fit(X_train, y_train)
    
    # Evaluate best model from inner CV on outer test fold
    best_model = grid_search.best_estimator_
    score = best_model.score(X_test, y_test)
    nested_scores.append(score)
    
    print(f"Outer fold: Best params {grid_search.best_params_}, Test score: {score:.4f}")

# Overall performance estimate
print(f"Nested CV accuracy: {np.mean(nested_scores):.4f} ± {np.std(nested_scores):.4f}")

This output shows that each outer loop uses a fold to test the performance of the best model found by the grid search, which uses its own CV to score each hyperparameter set. The nested CV accuracy is the mean of the outer-fold test scores. It is used to assess the overall model stability, *not* as a way to choose the best model. As we can see, the models selected by the first and last outer folds have identical parameter sets. The same is true for the second and fourth.

This process is typically used as a precursor to final model training. The key insight is that nested CV doesn't directly give you the "best" hyperparameters - it gives you an unbiased estimate of performance and shows you how stable your hyperparameter selection process is across different data subsets.
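
When the per-fold parameter printout is not needed, the same nested structure can be written more compactly by passing a `GridSearchCV` object to `cross_val_score`. The sketch below is an illustration on a small separate dataset; the `_demo` names, the reduced grid, and the 25-tree forests are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Small dataset kept separate from the notebook's X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, random_state=0)

inner_demo = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_demo = KFold(n_splits=5, shuffle=True, random_state=0)  # estimates performance

# The grid search is itself an estimator: fit() tunes, predict() uses the best model
tuned_rf = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={'max_depth': [3, None], 'max_features': ['sqrt', 'log2']},
    cv=inner_demo,
    scoring='accuracy'
)

# Each outer fold re-runs the whole grid search on its own training portion
nested_demo_scores = cross_val_score(tuned_rf, X_demo, y_demo,
                                     cv=outer_demo, scoring='accuracy')

print(f"Nested CV accuracy: {nested_demo_scores.mean():.4f} ± {nested_demo_scores.std():.4f}")
```

The explicit loop is still the better choice when you want to inspect which parameters each outer fold selected.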

Typically, the next step is to run a final grid search on all data to find the best parameters for your final model, as shown below:

In [None]:
# After completing nested CV evaluation (as shown in previous example)
print(f"Nested CV accuracy: {np.mean(nested_scores):.4f} ± {np.std(nested_scores):.4f}")

# Now train the final model using all data
print("\n--- Training Final Deployment Model ---")

# Create the final grid search on all data
final_grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,  # 5-fold CV on all data (the nested inner loop used 3 folds)
    scoring='accuracy',
    n_jobs=-1
)

# Fit on ALL data (not just a training subset)
final_grid_search.fit(X, y)

# Show best parameters found using all available data
print(f"Final model best parameters: {final_grid_search.best_params_}")
print(f"Final model CV score: {final_grid_search.best_score_:.4f}")

# Create the final model with the best parameters
final_model = RandomForestClassifier(
    **final_grid_search.best_params_,
    random_state=42
)

# Train on all data
final_model.fit(X, y)

print("Final model trained and ready for deployment")

The final model is fit using all available data because we're not using it to predict performance or guide model development. We are willing to accept its output because we validated its performance using the prior steps.

When deployed, `final_model` will be used to generate predictions for new observations.

Key Benefits for Ensemble Evaluation:

1. Unbiased performance estimation: Each performance score comes from data that was not used for hyperparameter selection
2. Reliable model comparison: Properly compare complex ensembles against simpler models
3. Reduced risk of overfitting: Avoid overestimating the performance of heavily tuned ensembles

## Custom Ensembles

This is an open-ended topic, as it covers an arbitrary combination of models. This notebook can only scratch the surface with an introduction and some starter code.

The key question is how to combine the results of multiple models. Two common approaches are voting and stacking.

### Voting Methods

As the name suggests, voting combines the base models' predictions by vote. For classification problems using `VotingClassifier`, this can be a simple majority of the predicted labels (hard voting) or an average, optionally weighted, of the predicted probabilities (soft voting). Likewise, `VotingRegressor` uses a simple or weighted average of the predicted values.
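
Hard and soft voting can disagree, which a few made-up numbers make concrete. The probabilities below are hypothetical, and the 0.5 threshold is a binary-case simplification of taking the class with the highest averaged probability.

```python
import numpy as np

# Hypothetical P(class 1) from three classifiers (rows) for four samples (columns)
proba_class1 = np.array([
    [0.45, 0.80, 0.30, 0.55],
    [0.45, 0.60, 0.20, 0.40],
    [0.90, 0.70, 0.40, 0.45],
])

# Hard voting: each model casts one vote for its own predicted label
votes = (proba_class1 >= 0.5).astype(int)
hard_pred = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

# Soft voting: average the probabilities first, then pick the label
soft_pred = (proba_class1.mean(axis=0) >= 0.5).astype(int)

# Weighted soft voting: the third model counts twice as much
model_weights = np.array([1.0, 1.0, 2.0])
weighted_pred = (np.average(proba_class1, axis=0, weights=model_weights) >= 0.5).astype(int)

print("Hard voting:         ", hard_pred)      # [0 1 0 0]
print("Soft voting:         ", soft_pred)      # [1 1 0 0]
print("Weighted soft voting:", weighted_pred)  # [1 1 0 0]
```

On the first sample, two lukewarm "no" votes beat one confident "yes" under hard voting, while soft voting lets the confident model carry the decision.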

A sample implementation is provided below using a combination of Logistic Regression, Random Forest, and Support Vector Machine (covered in 7130) classifiers.

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Create three base models
log_clf = LogisticRegression(random_state=42)
rf_clf = RandomForestClassifier(random_state=42)
svc_clf = SVC(probability=True, random_state=42)

# Create and train voting classifier
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rf_clf), ('svc', svc_clf)],
    voting='soft'
)
voting_clf.fit(X_train, y_train)

# Compare individual models with ensemble
for clf, label in zip([log_clf, rf_clf, svc_clf, voting_clf],
                      ['Logistic Regression', 'Random Forest', 'SVM', 'Voting']):
    clf.fit(X_train, y_train)
    print(f"{label} test accuracy: {clf.score(X_test, y_test):.4f}")

As you can see, the voting test accuracy is higher than most individual models, but not all. Remember that this is train-test validation of an often-misleading performance metric.

For any ensemble it is best to include a diverse but complementary set of models - ideally each has different strengths / weaknesses to provide the best mix of coverage. The rationale for this should be self-evident. One way to evaluate that is by analyzing the correlation between model predictions, as seen below. Other methods exist, including agreement / disagreement analysis and Cohen's Kappa Matrix (a measure of agreement beyond chance).

In [None]:
# Get predictions from each base model
logistic_preds = log_clf.predict(X_test)
rf_preds = rf_clf.predict(X_test)
svc_preds = svc_clf.predict(X_test)

# Create a DataFrame of predictions
pred_df = pd.DataFrame({
    'Logistic': logistic_preds,
    'RandomForest': rf_preds,
    'SVM': svc_preds
})

# 1. Correlation matrix of predictions
corr_matrix = pred_df.corr()
print("Prediction correlation matrix:")
print(corr_matrix)

# Visualize correlation matrix
plt.figure(figsize=(5, 4))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation between Model Predictions')
plt.tight_layout()
plt.show()
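
Cohen's Kappa, mentioned above, can be arranged into the same kind of matrix with `cohen_kappa_score`. The sketch below uses simulated predictions (random label flips of a made-up truth vector) so that it runs on its own; the names `model_a`, `model_b`, `model_c` and the helper `noisy_copy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=200)

def noisy_copy(labels, flip_rate):
    """Return the labels with a random fraction flipped (simulates an imperfect model)."""
    flips = rng.random(labels.shape[0]) < flip_rate
    return np.where(flips, 1 - labels, labels)

demo_preds = {
    'model_a': noisy_copy(truth, 0.10),
    'model_b': noisy_copy(truth, 0.10),
    'model_c': noisy_copy(truth, 0.25),
}

# Pairwise agreement beyond chance between every pair of models
model_names = list(demo_preds)
kappa_matrix = pd.DataFrame(
    [[cohen_kappa_score(demo_preds[a], demo_preds[b]) for b in model_names]
     for a in model_names],
    index=model_names, columns=model_names
)
print(kappa_matrix.round(3))
```

In this notebook, the dictionary could instead hold `logistic_preds`, `rf_preds`, and `svc_preds` from the cell above. Values near 1 mean two models make nearly the same predictions and add little diversity to the ensemble.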

### Stacking Methods

Stacking uses the outputs of the base models as inputs for a meta-model that generates the final prediction. Unsurprisingly, it is implemented with SKL's `StackingClassifier` and `StackingRegressor`. Note that stacking requires CV to prevent overfitting in the meta-model. The process involves training each base model on $K-1$ folds before generating predictions for the held-out data (remaining fold). That way the predictions used as inputs for the meta-model are based on unseen data, and represent generalized performance of the base models.
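
The out-of-fold mechanism can be spelled out with `cross_val_predict`. What follows is a rough sketch of the idea on a small separate dataset, with two arbitrary base models; it is an illustration of the process described above, not a reproduction of `StackingClassifier`'s internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small dataset kept separate from the notebook's variables
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

demo_base = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# Step 1: out-of-fold probabilities. Every training row is predicted by a
# model that was fit on the other K-1 folds, so it never "sees" that row.
meta_train = np.column_stack([
    cross_val_predict(m, Xd_train, yd_train, cv=5, method='predict_proba')[:, 1]
    for m in demo_base
])

# Step 2: the meta-model learns how to combine those out-of-fold predictions
meta_demo = LogisticRegression().fit(meta_train, yd_train)

# Step 3: refit each base model on all training data and build test meta-features
meta_test = np.column_stack([
    m.fit(Xd_train, yd_train).predict_proba(Xd_test)[:, 1]
    for m in demo_base
])

stacked_acc = meta_demo.score(meta_test, yd_test)
print(f"Meta-feature matrix shape: {meta_train.shape}")
print(f"Hand-rolled stacking test accuracy: {stacked_acc:.4f}")
```

The fitted `meta_demo.coef_` shows how much weight the meta-model gives each base model's probability.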

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('svc', SVC(probability=True, random_state=42))
]

# Meta-model
meta_model = LogisticRegression(random_state=42)

# Stacking classifier with CV
stacking_clf = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5,  # 5-fold cross-validation
    stack_method='predict_proba'
)
stacking_clf.fit(X_train, y_train)
print(f"Stacking test accuracy: {stacking_clf.score(X_test, y_test):.4f}")

# Compare with base models
for name, clf in base_models:
    clf.fit(X_train, y_train)
    print(f"{name} test accuracy: {clf.score(X_test, y_test):.4f}")

Unlike with the voting result, the stacked ensemble outperforms all base models. This performance improvement is due to a combination of both the intrinsic cross-validation and the model architecture. While voting uses fixed weights or simple majority, stacking's meta-model learns optimal weights for combining base model predictions. This allows it to give more influence to models that perform better in specific regions of the feature space. It does so by learning patterns that the simple vote can't capture.

### SKL Pipelines for Ensembles

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Define base models for stacking
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('svc', SVC(probability=True, random_state=42))
]

# Create stacking ensemble
stacking_ensemble = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(random_state=42),
    cv=5,
    stack_method='predict_proba'
)

# Create pipeline with preprocessing and stacking ensemble
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('stacking', stacking_ensemble)
])

# Train and evaluate
pipeline.fit(X_train, y_train)
print(f"Pipeline test accuracy: {pipeline.score(X_test, y_test):.4f}")

# Grid search for hyperparameter tuning
param_grid = {
    'pca__n_components': [0.85, 0.9, 0.95],
    'stacking__rf__n_estimators': [50, 100],
    'stacking__gb__learning_rate': [0.05, 0.1],
    'stacking__final_estimator__C': [0.1, 1.0, 10.0]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
print(f"Test accuracy with best params: {grid_search.score(X_test, y_test):.4f}")

# Compare with individual models through pipeline for fair comparison
base_pipelines = {}
for name, clf in base_models:
    base_pipelines[name] = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=0.95)),
        ('classifier', clf)
    ])
    base_pipelines[name].fit(X_train, y_train)
    print(f"{name} pipeline test accuracy: {base_pipelines[name].score(X_test, y_test):.4f}")

Here the best ensemble achieves an accuracy of 0.9400, which is slightly lower than that for the base pipeline (0.9450). If it were easy everyone would be doing it!

### Accuracy is Not Enough

Remember to thoroughly interrogate your results. The following comprehensive code block demonstrates a number of methods that you can use to do so. Most we have discussed; a few we have not (primarily McNemar's test, which checks whether the difference between two models' errors is statistically significant).
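
Since McNemar's test is new here, a small standalone sketch may help. It uses the chi-square version with continuity correction, which looks only at the samples where exactly one of the two models is correct. The helper name `mcnemar_test` and the toy predictions are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test with continuity correction (hypothetical helper).

    Returns (only_a_correct, only_b_correct, statistic, p_value).
    """
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_right = pred_a == y_true
    b_right = pred_b == y_true
    only_a = int(np.sum(a_right & ~b_right))   # A correct, B wrong
    only_b = int(np.sum(~a_right & b_right))   # A wrong, B correct
    if only_a + only_b == 0:
        return only_a, only_b, 0.0, 1.0        # the models never disagree
    statistic = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    p_value = chi2.sf(statistic, 1)            # chi-square with 1 degree of freedom
    return only_a, only_b, statistic, p_value

# Toy example: 100 samples, all with true label 0
y_toy = np.zeros(100, dtype=int)
pred_toy_a = y_toy.copy()
pred_toy_a[0:25] = 1     # model A is wrong on samples 0-24
pred_toy_b = y_toy.copy()
pred_toy_b[20:35] = 1    # model B is wrong on samples 20-34

n_only_a, n_only_b, stat, p = mcnemar_test(y_toy, pred_toy_a, pred_toy_b)
print(f"Only A correct: {n_only_a}, only B correct: {n_only_b}")
print(f"McNemar statistic: {stat:.3f}, p-value: {p:.4f}")
```

Here model B has the higher accuracy (85% vs. 75%), yet the p-value of about 0.10 says the gap could plausibly be chance at the usual 0.05 level. With very few disagreements, an exact binomial version of the test is generally preferred.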

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import learning_curve

# Create a collection of models to compare
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Voting Ensemble': voting_clf,  # Fitted in the voting example above
    'Stacking Ensemble': stacking_clf  # Fitted in the stacking example above
}

# Dictionary to store results
results = {}

print("Performance Comparison Against Single Models\n")
print("=" * 70)

# Calculate metrics for each model
for name, model in models.items():
    if name not in ['Voting Ensemble', 'Stacking Ensemble']:  # These are already fitted
        model.fit(X_train, y_train)
    
    # Get predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'confusion_matrix': confusion_matrix(y_test, y_pred)
    }
    
    # Calculate ROC AUC if the model supports predict_proba
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
        results[name]['roc_auc'] = roc_auc_score(y_test, y_prob)
    else:
        results[name]['roc_auc'] = None
    
    # Print basic results
    print(f"\nModel: {name}")
    print(f"Accuracy:  {results[name]['accuracy']:.4f}")
    print(f"Precision: {results[name]['precision']:.4f}")
    print(f"Recall:    {results[name]['recall']:.4f}")
    print(f"F1 Score:  {results[name]['f1']:.4f}")
    if results[name]['roc_auc'] is not None:
        print(f"ROC AUC:   {results[name]['roc_auc']:.4f}")
    print("-" * 40)

# Create DataFrame for easy comparison
metrics_df = pd.DataFrame({
    model_name: {
        'Accuracy': results[model_name]['accuracy'],
        'Precision': results[model_name]['precision'],
        'Recall': results[model_name]['recall'],
        'F1 Score': results[model_name]['f1'],
        'ROC AUC': results[model_name]['roc_auc']
    }
    for model_name in results.keys()
}).T  # Transpose for better display

print("\nMetrics Summary:")
print(metrics_df)

# Visualization 1: Bar chart comparison of metrics
# DataFrame.plot creates its own figure, so a separate plt.figure() call would leave an empty one
metrics_df[['Accuracy', 'Precision', 'Recall', 'F1 Score']].astype(float).plot(kind='bar', figsize=(12, 6))
plt.title('Model Performance Comparison')
plt.ylabel('Score')
plt.xlabel('Model')
plt.xticks(rotation=45)
plt.legend(loc='lower right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Visualization 2: ROC Curves
plt.figure(figsize=(10, 8))
for name, model in models.items():
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_prob)
        plt.plot(fpr, tpr, label=f'{name} (AUC = {results[name]["roc_auc"]:.3f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.show()

# Statistical Significance Testing with McNemar's test
print("\nStatistical Significance Testing (McNemar's Test):")
print("H0: Two models have the same error rate")
print("p-value < 0.05 indicates statistically significant difference")
print("-" * 70)

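model_names = list(models.keys())
# Start from 1.0 so the diagonal (a model compared with itself) reads as "no difference"
p_values = np.ones((len(model_names), len(model_names)))
# Predict once per model rather than once per pair
test_preds = {name: models[name].predict(X_test) for name in model_names}

for i, model1 in enumerate(model_names):
    for j, model2 in enumerate(model_names):
        if i != j:
            # Get predictions
            y_pred1 = test_preds[model1]
            y_pred2 = test_preds[model2]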
            
            # Create contingency table with counts of:
            # [0,0]: both models correct
            # [0,1]: model1 correct, model2 incorrect
            # [1,0]: model1 incorrect, model2 correct
            # [1,1]: both models incorrect
            table = [
                [(y_pred1 == y_test) & (y_pred2 == y_test), (y_pred1 == y_test) & (y_pred2 != y_test)],
                [(y_pred1 != y_test) & (y_pred2 == y_test), (y_pred1 != y_test) & (y_pred2 != y_test)]
            ]
            table = np.array([[np.sum(cell) for cell in row] for row in table])
            
            # McNemar's exact test: under H0 the discordant pairs split 50/50,
            # so test table[0, 1] against Binomial(n_discordant, 0.5)
            if table[0, 1] + table[1, 0] > 0:  # Check if there are any discordant predictions
                try:
                    # scipy.stats.binomtest is available in SciPy >= 1.7
                    p_values[i, j] = stats.binomtest(
                        k=table[0, 1],
                        n=table[0, 1] + table[1, 0],
                        p=0.5
                    ).pvalue
                except AttributeError:
                    # Older SciPy releases only provide binom_test
                    p_values[i, j] = stats.binom_test(
                        x=table[0, 1],
                        n=table[0, 1] + table[1, 0],
                        p=0.5
                    )
            else:
                p_values[i, j] = 1.0  # No difference if no discordant predictions

# Create p-value matrix
p_value_df = pd.DataFrame(p_values, index=model_names, columns=model_names)
print(p_value_df.round(4))

# Visualization of p-values (highlight significant differences)
plt.figure(figsize=(10, 8))
mask = np.zeros_like(p_values, dtype=bool)
np.fill_diagonal(mask, True)  # Mask the diagonal

sns.heatmap(
    p_value_df, 
    annot=True, 
    fmt='.4f', 
    cmap='coolwarm_r', 
    mask=mask,
    vmin=0, 
    vmax=0.1,
    linewidths=0.5
)
plt.title("McNemar's Test p-values (< 0.05 indicates significant difference)")
plt.tight_layout()
plt.show()

# Learning Curves - How performance varies with training data size
def plot_learning_curve(estimator, title, X, y, cv=5, n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure(figsize=(10, 6))
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring='accuracy'
    )
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    
    return plt

# Generate learning curves for each model (except ensemble models which are already fitted)
for name, model in models.items():
    if name not in ['Voting Ensemble', 'Stacking Ensemble']:  # Skip already fitted models
        plot_learning_curve(
            model, f'Learning Curve - {name}', X, y, cv=5, n_jobs=-1
        )
        plt.tight_layout()
        plt.show()

From this you should be able to conclude that ensemble methods (particularly stacking) provide the best overall performance, though the difference between Random Forest and the ensembles is not statistically significant. Gradient Boosting shows strong potential but would likely benefit from more data. The Decision Tree serves as a good baseline but is clearly inadequate for this task.
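
Since McNemar's test is the one method here we have not covered before, the following minimal sketch shows what it reduces to: an exact two-sided binomial test on the discordant cases only (those where exactly one of the two models is correct). It uses just the standard library and toy predictions invented for illustration; `mcnemar_exact` is a hypothetical helper name, not a SciPy or scikit-learn function.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts cases only model A got right
    and c counts cases only model B got right.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in triples if a == t and m != t)
    c = sum(1 for t, a, m in triples if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant cases, nothing to distinguish the models
    # Under H0 each discordant case is a fair coin flip: b ~ Binomial(n, 0.5).
    # Two-sided p-value: double the smaller tail, capped at 1.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy data (invented): model B is right on five cases that A misses,
# A is right on one case that B misses, and both are right on the last four.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
pred_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
pred_b = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]

b, c, p = mcnemar_exact(y_true, pred_a, pred_b)
print(f"A-only correct: {b}, B-only correct: {c}, p-value: {p:.5f}")
# prints: A-only correct: 1, B-only correct: 5, p-value: 0.21875
```

Even a 5-to-1 split among discordant cases is not significant at the 0.05 level here, because only six cases carry any information. That is worth keeping in mind whenever two models agree on most of the test set.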

### Cost-Performance Tradeoffs

In practice model selection involves balancing multiple factors:

- Performance (accuracy, precision, recall)
- Computational costs (training time, prediction time)
- Model complexity and interpretability
- Resource constraints in deployment settings

The following, final code chunk demonstrates these trade-offs.

Note: some of the timing data here is estimated, so only the general relationships are illustrated.

In [None]:
import time
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings

warnings.filterwarnings('ignore')

print("\n\n### Model Complexity vs. Performance Trade-offs ###")
print("=" * 70)

# Dictionary to track complexity; 'n_params' is a rough proxy (number of base
# estimators), not a count of learned parameters
model_complexity = {
    'Decision Tree': {'n_params': 1, 'complexity': 'Low'},  # single tree
    'Random Forest': {'n_params': 100, 'complexity': 'Medium'},  # 100 trees
    'Gradient Boosting': {'n_params': 100, 'complexity': 'Medium-High'},  # 100 trees, sequential
    'Voting Ensemble': {'n_params': 3, 'complexity': 'High'},  # 3 base models
    'Stacking Ensemble': {'n_params': 4, 'complexity': 'Very High'}  # 3 base models + meta-model
}

# Collect timing information
timing_results = {}

# Test models for timing (training and prediction)
for name, model in models.items():
    if name not in ['Voting Ensemble', 'Stacking Ensemble']:  # Skip already trained models
        # Time training (perf_counter is a monotonic, high-resolution clock)
        start_time = time.perf_counter()
        model.fit(X_train, y_train)
        train_time = time.perf_counter() - start_time
        
        # Time prediction
        start_time = time.perf_counter()
        model.predict(X_test)
        predict_time = time.perf_counter() - start_time
        
        # Store results
        timing_results[name] = {
            'training_time': train_time,
            'prediction_time': predict_time,
            'accuracy': results[name]['accuracy']
        }

# For pre-trained ensemble models, just time prediction
for name in ['Voting Ensemble', 'Stacking Ensemble']:
    if name in models:
        # Time prediction only (perf_counter is a monotonic, high-resolution clock)
        start_time = time.perf_counter()
        models[name].predict(X_test)
        predict_time = time.perf_counter() - start_time
        
        # Estimate training time based on complexity (relative to base models)
        if name == 'Voting Ensemble':
            # Roughly sum of base models
            train_time = sum([timing_results[m]['training_time'] 
                             for m in ['Random Forest', 'Gradient Boosting', 'Decision Tree']]) * 1.1
        else:  # Stacking
            # Sum of base models + overhead for meta-model training with CV
            train_time = sum([timing_results[m]['training_time'] 
                             for m in ['Random Forest', 'Gradient Boosting', 'Decision Tree']]) * 1.5
        
        # Store results
        timing_results[name] = {
            'training_time': train_time,
            'prediction_time': predict_time,
            'accuracy': results[name]['accuracy']
        }

# Convert to DataFrame for easier handling
timing_df = pd.DataFrame(timing_results).T
timing_df['model'] = timing_df.index
timing_df['complexity'] = timing_df.index.map(lambda x: model_complexity[x]['complexity'])
timing_df['n_params'] = timing_df.index.map(lambda x: model_complexity[x]['n_params'])

# Display results
print("\nModel Performance and Computational Cost:")
print(timing_df[['complexity', 'training_time', 'prediction_time', 'accuracy']])

# Visualization of trade-offs
plt.figure(figsize=(8, 5))

# Create bubble chart - x: training time, y: accuracy, size: model complexity
plt.scatter(timing_df['training_time'], 
           timing_df['accuracy'], 
           s=timing_df['n_params']*5,  # Size scaled by the n_params proxy (number of base estimators)
           alpha=0.7)

# Add labels for each point (label-based lookup; positional [] on a string-indexed Series is deprecated)
for model_name in timing_df.index:
    plt.annotate(model_name, 
                (timing_df.loc[model_name, 'training_time'], timing_df.loc[model_name, 'accuracy']),
                xytext=(5, 5), textcoords='offset points')

plt.title('Model Performance vs. Training Time')
plt.xlabel('Training Time (seconds)')
plt.ylabel('Accuracy')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Note: Bubble size represents relative model complexity.\n")

# Create bar chart for prediction time
plt.figure(figsize=(6, 4))
plt.barh(timing_df.index, timing_df['prediction_time'], color='skyblue')
plt.xlabel('Prediction Time (seconds)')
plt.title('Model Prediction Time Comparison')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()


Key Trade-off Insights:

1. Complexity vs. Performance: While more complex models generally perform better, the
   improvement from Random Forest to Stacking Ensemble may not justify the increased
   complexity and computational cost for all applications.
2. Training vs. Prediction Time: Ensemble methods take significantly longer to train but
   their prediction time may be acceptable for many applications.
3. Diminishing Returns: There appears to be a point of diminishing returns where additional
   complexity yields only marginal performance gains.
4. Recommended Approach: For this dataset, Random Forest offers the best balance of
   performance and complexity, while Stacking Ensemble should be considered when
   maximum accuracy is critical and training time is not a constraint.
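
The timings behind these insights come from a single run each, and single wall-clock measurements can be noisy. As a hedged sketch of one way to steady them, the helper below repeats a call several times and reports the median. `time_call` and the stand-in workload `slow_sum` are hypothetical names introduced only for this illustration.

```python
import statistics
import time

def time_call(func, *args, repeats=5, **kwargs):
    """Run func(*args, **kwargs) `repeats` times.

    Returns (median_seconds, result_of_last_call).
    """
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), result

def slow_sum(n):
    """Stand-in workload so the example runs without any fitted model."""
    return sum(i * i for i in range(n))

median_s, value = time_call(slow_sum, 100_000, repeats=5)
print(f"median time over 5 runs: {median_s:.6f} s (result: {value})")
```

In the notebook the same helper could wrap a call such as `model.fit` with `X_train, y_train`, with the caveat that the model is then refitted on every repeat, so `repeats` should stay small for the slower ensembles.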

## Conclusion: The Art and Science of Machine Learning

As we conclude our exploration of tree-based methods and ensemble learning, it's worth reflecting on the broader journey we've undertaken throughout this course. We began with fundamental statistical concepts and linear models, progressed through increasingly complex algorithms, and have now arrived at state-of-the-art ensemble techniques that represent some of the most powerful tools in the modern machine learning toolkit.

The ensemble methods we've explored—bagging, boosting, voting, and stacking—demonstrate an important principle in machine learning: by combining multiple perspectives, we often achieve greater insight than any single model can provide alone. This mirrors the collaborative nature of scientific progress itself.

However, as our final analysis of cost-performance tradeoffs reveals, machine learning is not merely a quest for the highest accuracy. The most sophisticated model is not always the most appropriate solution. The true art of applied machine learning lies in balancing theoretical performance with practical constraints—computational resources, interpretability needs, and deployment requirements.

As you move forward in your careers, remember that the techniques covered in this course are not just algorithms to be implemented, but frameworks for thinking about problems. The process of feature selection, model development, rigorous validation, and thoughtful evaluation represents a systematic approach to knowledge discovery that extends far beyond prediction tasks.

The field continues to evolve rapidly, but the fundamental principles we've covered—understanding your data, avoiding overfitting, properly validating models, and critically evaluating performance—will remain essential regardless of which new algorithms emerge. 

I encourage you to maintain both technical rigor and creative curiosity as you apply these tools to solve meaningful problems in your respective fields. The most impressive machine learning systems are not those with the most complex architectures, but those that most effectively address real-world needs.

Thank you for your engagement and hard work throughout this course. I look forward to seeing how you will use these methods to extract knowledge from data and drive innovation in your future endeavors.