# Lecture 9: Tree-Based Models - Programming Example

## Introduction: Advancing Beyond Linear Constraints with Tree-Based Intelligence

Welcome back to your Capital City Bikes consulting engagement! Eight months after deploying your linear regression models, the board has approached you with competitive intelligence that demands immediate action. Three rival bike-sharing companies have entered your market with sophisticated ML systems achieving demonstrably better predictions during complex scenarios like weather transitions and seasonal shifts.

The CEO's message is direct: "Our linear models served us well for Series A funding, but competitors are now outperforming us with advanced ensemble methods. The Series B investors expect state-of-the-art predictive capabilities. We need you to implement tree-based models that capture the non-linear patterns and feature interactions our linear approach is missing."

Think of tree-based modeling as graduating from basic algebra to calculus. While linear regression assumes constant relationships across all conditions, decision trees and random forests discover conditional patterns: "If temperature is warm AND humidity is low AND it's a weekday, expect high commuter demand. But if temperature is warm AND humidity is high, expect 30% lower demand regardless of day type." These conditional rules mirror how experienced operations managers actually think about demand.

Your task: engineer sophisticated features that expose non-linear patterns, implement decision trees to understand their interpretable rule-based logic, deploy Random Forest ensembles that achieve production-grade accuracy, and analyze feature importance to guide strategic investments. Every technique must demonstrate measurable improvements over your linear baseline to justify the algorithmic complexity to stakeholders.

> **🚀 Interactive Learning Alert**
>
> This is an advanced hands-on tutorial with production-grade ensemble modeling challenges. For the best experience:
>
> - **Click "Open in Colab"** at the bottom to run code interactively
> - **Execute each code cell** by pressing **Shift + Enter**
> - **Complete the challenges** to practice your tree-based modeling skills
> - **Think like a senior consultant** - algorithm choice impacts funding discussions and competitive positioning

---

## Step 1: Feature Engineering for Non-Linear Pattern Discovery

Let's begin with feature engineering to enhance the quality of our feature set, followed by a preliminary analysis to better understand the data.

In [None]:
# Import essential libraries for data manipulation, modeling, and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load Washington D.C. bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Sort chronologically to maintain temporal integrity for time series modeling
df = df.sort_values('datetime').reset_index(drop=True)

print("=== FEATURE ENGINEERING FOR TREE-BASED MODELS ===")
print(f"Dataset: {len(df):,} hourly observations")
print(f"Time range: {df['datetime'].min()} to {df['datetime'].max()}\n")

# Existing features in dataset
print("Original features:")
print(df.columns.tolist())
print()

# Extract temporal features that capture demand cycles
df['hour'] = df['datetime'].dt.hour
df['dayofweek'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month
df['year'] = df['datetime'].dt.year

# Create binary features for operational planning segments
# Binary encoding converts categorical conditions into 0/1 indicators that
# trees can use for clean threshold-based splitting decisions
df['is_rush_hour'] = (((df['hour'] >= 7) & (df['hour'] <= 9)) |
                      ((df['hour'] >= 17) & (df['hour'] <= 19))).astype(int)
# Rush hours (hours 7-9 and 17-19) represent peak commuter demand periods;
# the explicit parentheses make the AND-before-OR grouping visible

df['is_weekend'] = (df['dayofweek'] >= 5).astype(int)
# Weekend indicator captures leisure vs. commuter demand patterns

df['is_night'] = ((df['hour'] >= 22) | (df['hour'] <= 5)).astype(int)
# Night hours (10pm-5am) represent low-demand maintenance windows

# Weather condition severity encoding
# Map weather codes to interpretable severity levels for better business understanding
df['weather_severity'] = df['weather'].map({1: 0, 2: 1, 3: 2, 4: 3})
# 0=clear, 1=cloudy, 2=light_rain, 3=heavy_rain/snow

# Temperature-based categorical features for threshold effects
# Cut temperature into bins representing operational planning segments:
# Cold (<10°C), Cool (10-20°C), Warm (20-30°C), Hot (>30°C)
df['temp_category'] = pd.cut(df['temp'], bins=[-np.inf, 10, 20, 30, np.inf],
                               labels=['cold', 'cool', 'warm', 'hot'])

# Humidity-based categorical features
# High humidity (>70%) significantly reduces cycling comfort
df['humidity_category'] = pd.cut(df['humidity'], bins=[-np.inf, 40, 70, np.inf],
                                   labels=['dry', 'moderate', 'humid'])

# Interaction features that capture combined effects
# Temperature × Hour interaction: warm mornings differ from warm evenings in demand patterns
df['temp_hour_interaction'] = df['temp'] * df['hour']

# Working day × Hour: commuter patterns differ dramatically between working days and weekends
df['workingday_hour'] = df['workingday'] * df['hour']

# Season-Weather interaction: rain in summer affects demand differently than rain in winter
df['season_weather'] = df['season'] * df['weather_severity']

print("=== ENGINEERED FEATURES ===")
print("Temporal features: hour, dayofweek, month, year")
print("Binary indicators: is_rush_hour, is_weekend, is_night")
print("Categorical encodings: weather_severity, temp_category, humidity_category")
print("Interaction features: temp_hour_interaction, workingday_hour, season_weather")
print()

# Prepare feature matrix for tree-based modeling
# Note: scikit-learn's tree estimators require numeric inputs, so the string-labelled
# temp_category and humidity_category columns are kept for exploration only and
# are not included in the feature matrix
feature_columns = [
    # Weather features
    'temp', 'atemp', 'humidity', 'windspeed', 'weather_severity',
    # Temporal features
    'hour', 'dayofweek', 'month', 'season',
    # Binary indicators
    'workingday', 'holiday', 'is_rush_hour', 'is_weekend', 'is_night',
    # Interaction features
    'temp_hour_interaction', 'workingday_hour', 'season_weather'
]

X = df[feature_columns]
y = df['count']

print(f"Feature matrix: {X.shape[0]} observations × {X.shape[1]} features")
print(f"Target variable: count (hourly bike rentals)")
print()

# Display feature statistics for business understanding
print("=== FEATURE STATISTICS (Business Intelligence) ===")
print(f"Rush hour observations: {df['is_rush_hour'].sum():,} ({df['is_rush_hour'].mean()*100:.1f}%)")
print(f"Weekend observations: {df['is_weekend'].sum():,} ({df['is_weekend'].mean()*100:.1f}%)")
print(f"Night observations: {df['is_night'].sum():,} ({df['is_night'].mean()*100:.1f}%)")
print()

# Show demand differences across key segments for operational insights
print("=== DEMAND PATTERNS BY SEGMENT ===")
print(f"Rush hour demand: {df[df['is_rush_hour']==1]['count'].mean():.0f} bikes/hour")
print(f"Non-rush hour demand: {df[df['is_rush_hour']==0]['count'].mean():.0f} bikes/hour")
print(f"Weekend demand: {df[df['is_weekend']==1]['count'].mean():.0f} bikes/hour")
print(f"Weekday demand: {df[df['is_weekend']==0]['count'].mean():.0f} bikes/hour")

**What this does:**
- Loads Washington D.C. bike-sharing data and sorts chronologically for time series integrity
- Engineers temporal features (hour, dayofweek, month) that capture cyclical demand patterns
- Creates binary indicators (is_rush_hour, is_weekend, is_night) for operational segments
- Builds interaction features (temp×hour, workingday×hour) that expose non-linear effects
- Categorizes continuous variables (temp_category, humidity_category) for exploratory segment analysis; these labelled bins are not part of the model's feature matrix
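
The indicator and binning logic summarized above can be sanity-checked on a tiny hand-built frame before trusting it on the full dataset. The sketch below is illustrative only: the `toy` frame, its four timestamps, and its temperatures are made up, and `Series.between` is used as an equivalent way to write the inclusive hour ranges.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the bike-sharing dataset)
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2024-01-01 08:00",  # Monday, morning rush
        "2024-01-06 14:00",  # Saturday, afternoon
        "2024-01-03 23:00",  # Wednesday, night
        "2024-01-05 18:00",  # Friday, evening rush
    ]),
    "temp": [4.0, 15.0, 22.0, 31.0],
})

hour = toy["datetime"].dt.hour
dayofweek = toy["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6

# Series.between is inclusive on both ends by default
toy["is_rush_hour"] = (hour.between(7, 9) | hour.between(17, 19)).astype(int)
toy["is_weekend"] = (dayofweek >= 5).astype(int)
toy["is_night"] = ((hour >= 22) | (hour <= 5)).astype(int)

# pd.cut uses right-closed bins by default: (-inf, 10], (10, 20], (20, 30], (30, inf]
toy["temp_category"] = pd.cut(
    toy["temp"],
    bins=[-np.inf, 10, 20, 30, np.inf],
    labels=["cold", "cool", "warm", "hot"],
)

print(toy)
```

Because the bins are right-closed, a reading of exactly 10°C lands in `cold` and exactly 20°C in `cool`, which is worth knowing when interpreting segment boundaries.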

### Challenge 1: Analyze Feature Distributions and Relationships

Your client asks: "Can you create a visual report showing how different features impact demand? I need to see which factors drive ridership so we can understand the data better." Explore feature interactions and segment analysis through visualizations.

**Your Task:** Create visualizations showing demand patterns across different feature combinations (e.g., rush_hour + working_day, temperature + weather severity).

In [None]:
# Your code here - analyze feature distributions and demand patterns

# Example 1: Rush hour + working day combination
segment_analysis = df.groupby(['is_rush_hour', 'workingday'])['count'].agg(['mean', 'count'])
print("=== RUSH HOUR × WORKING DAY ANALYSIS ===")
print(segment_analysis)

# Example 2: Create a heatmap showing demand by hour and day of week
hourly_daily = df.pivot_table(values='count', index='___', columns='___', aggfunc='mean')
plt.figure(figsize=(12, 6))
sns.heatmap(hourly_daily, cmap='YlOrRd', fmt='.0f', cbar_kws={'label': 'Average Demand'})
plt.title('___')
plt.xlabel('___')
plt.ylabel('___')
plt.tight_layout()
plt.show()

# Example 3: Visualize temperature × weather severity interaction
# Create scatter plot or box plots showing how demand varies

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Start with `.groupby(['is_rush_hour', 'workingday'])['count'].agg(['mean', 'count', 'std'])` to see how demand varies across combinations. For the heatmap, use `df.pivot_table(values='count', index='hour', columns='dayofweek', aggfunc='mean')` which creates a matrix showing average demand for each hour-day combination. Set `cmap='YlOrRd'` for a heat-based color scheme that makes patterns visually obvious. For temperature × weather interactions, consider using `sns.boxplot(x='temp_category', y='count', hue='weather_severity', data=df)` to show distributions. Look for segments with 2-3x demand differences - these represent high-value operational optimization opportunities. The heatmap will clearly show morning/evening rush hour peaks on weekdays versus flatter weekend patterns. Business insight: rush hour + working day combinations might show 300+ bikes/hour while night + weekend shows <50 bikes/hour, revealing where fleet positioning matters most.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Example 1: Rush hour + working day combination
segment_analysis = df.groupby(['is_rush_hour', 'workingday'])['count'].agg(['mean', 'count', 'std'])
print("=== RUSH HOUR × WORKING DAY ANALYSIS ===")
print(segment_analysis.round(1))
print()

# Example 2: Heatmap showing demand by hour and day of week
hourly_daily = df.pivot_table(values='count', index='hour', columns='dayofweek', aggfunc='mean')
plt.figure(figsize=(12, 6))
sns.heatmap(hourly_daily, cmap='YlOrRd', annot=True, fmt='.0f', cbar_kws={'label': 'Average Demand (bikes/hour)'})
plt.title('Demand Heatmap: Hour × Day of Week', fontsize=14, fontweight='bold')
plt.xlabel('Day of Week (0=Mon, 6=Sun)', fontsize=11)
plt.ylabel('Hour of Day', fontsize=11)
plt.tight_layout()
plt.show()

# Example 3: Temperature × Weather severity interaction
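plt.figure(figsize=(12, 6))
ax = sns.boxplot(x='temp_category', y='count', hue='weather_severity', data=df)
plt.title('Demand by Temperature Category and Weather Severity', fontsize=14, fontweight='bold')
plt.xlabel('Temperature Category', fontsize=11)
plt.ylabel('Hourly Demand (bikes)', fontsize=11)
# Relabel the legend through its own handles so each colour keeps the matching severity name
# (passing labels alone relies on artist order and can mislabel the boxes)
severity_names = {0: 'Clear', 1: 'Cloudy', 2: 'Light Rain', 3: 'Heavy Rain'}
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, [severity_names[int(float(lbl))] for lbl in labels], title='Weather Severity')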
plt.tight_layout()
plt.show()

print("=== KEY INSIGHTS FOR CAPITAL CITY BIKES ===")
print("✓ Rush hour + working day: Highest demand segment (optimize fleet positioning)")
print("✓ Heatmap reveals: Strong morning (7-9am) and evening (5-7pm) peaks on weekdays")
print("✓ Temperature × Weather: Warm+clear weather shows 3x demand vs. cold+rainy conditions")
print("✓ Operational priority: Focus dynamic repositioning on weekday rush hours")
```

</details>
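
To see exactly what the `pivot_table` call produces before reading the heatmap, it can help to run it on a frame small enough to check by hand. This is a minimal sketch on made-up numbers (the `toy_counts` frame is hypothetical), not output from the real dataset.

```python
import pandas as pd

# Hypothetical observations: two hours, two days of the week (0=Mon, 5=Sat)
toy_counts = pd.DataFrame({
    "hour":      [8, 8, 8, 14, 14, 14],
    "dayofweek": [0, 0, 5, 0, 5, 5],
    "count":     [300, 340, 60, 120, 200, 220],
})

# Rows become hours, columns become days, cells hold the mean count
toy_matrix = toy_counts.pivot_table(
    values="count", index="hour", columns="dayofweek", aggfunc="mean"
)
print(toy_matrix)

# Ratio between two cells, the kind of segment gap the heatmap makes visible
monday_vs_saturday_8am = toy_matrix.loc[8, 0] / toy_matrix.loc[8, 5]
print(f"8am Monday vs 8am Saturday: {monday_vs_saturday_8am:.1f}x")
```

Each cell is simply the average of the rows sharing that hour and day, so the Monday 8am cell here is the mean of 300 and 340.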

---

## Step 2: Implement Decision Tree Regressor

Now that we've seen how our engineered features impact demand, let's try our first decision tree model. Decision trees can capture non-linear patterns through interpretable if-then rules that linear regression cannot represent.
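
As a rough illustration of that claim, the sketch below builds synthetic demand that spikes only during rush hours on working days, then fits a linear model and a depth-limited tree to the raw `hour` and `workingday` columns. Everything here is an assumption made for the demonstration (the data-generating rule, the noise level, `max_depth=5`); it is not the notebook's dataset or model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2000
hour = rng.integers(0, 24, n)
workingday = rng.integers(0, 2, n)

# Synthetic rule: demand jumps only when it is rush hour AND a working day
is_rush = ((hour >= 7) & (hour <= 9)) | ((hour >= 17) & (hour <= 19))
demand = 50 + 250 * (is_rush & (workingday == 1)) + rng.normal(0, 10, n)

X_demo = np.column_stack([hour, workingday])
X_fit, X_eval = X_demo[:1500], X_demo[1500:]
y_fit, y_eval = demand[:1500], demand[1500:]

linear_demo = LinearRegression().fit(X_fit, y_fit)
tree_demo = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_fit, y_fit)

linear_r2 = r2_score(y_eval, linear_demo.predict(X_eval))
tree_r2 = r2_score(y_eval, tree_demo.predict(X_eval))
print(f"Linear regression R²: {linear_r2:.3f}")
print(f"Decision tree R²:     {tree_r2:.3f}")
```

The linear model can only tilt a plane across `hour` and `workingday`, whereas the tree can carve out the two rush-hour windows with threshold splits, which is the kind of conditional pattern this step is about.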

In [None]:
# Import tree-based modeling tools from scikit-learn
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Already have X and y from Step 1 feature engineering
print("=== DECISION TREE IMPLEMENTATION ===\n")

# Create chronological train-test split (80/20) for honest evaluation
# Time series data requires chronological splitting to prevent temporal leakage
split_index = int(len(df) * 0.8)
X_train = X.iloc[:split_index]
X_test = X.iloc[split_index:]
y_train = y.iloc[:split_index]
y_test = y.iloc[split_index:]

print(f"Training set: {len(X_train):,} observations ({len(X_train)/len(X)*100:.1f}%)")
print(f"Testing set:  {len(X_test):,} observations ({len(X_test)/len(X)*100:.1f}%)")
print(f"Training period: {df.iloc[:split_index]['datetime'].min()} to {df.iloc[:split_index]['datetime'].max()}")
print(f"Testing period:  {df.iloc[split_index:]['datetime'].min()} to {df.iloc[split_index:]['datetime'].max()}")
print()

# Decision Tree with unlimited depth (demonstrating overfitting potential)
print("--- Training Decision Tree (Unlimited Depth) ---")
# DecisionTreeRegressor creates a tree that recursively partitions feature space
# to minimize mean squared error within each region (leaf node)
tree_unlimited = DecisionTreeRegressor(random_state=42)
# No max_depth specified = tree grows until every leaf is pure or holds fewer than
# min_samples_split samples (default 2)
tree_unlimited.fit(X_train, y_train)

# Examine tree structure to understand model complexity
print(f"Tree depth: {tree_unlimited.get_depth()}")
print(f"Number of leaves: {tree_unlimited.get_n_leaves()}")
print(f"Total nodes: {tree_unlimited.tree_.node_count}")
print()

# Generate predictions on both training and testing sets
train_pred_unlimited = tree_unlimited.predict(X_train)
test_pred_unlimited = tree_unlimited.predict(X_test)

# Calculate performance metrics
train_r2_unlimited = r2_score(y_train, train_pred_unlimited)
test_r2_unlimited = r2_score(y_test, test_pred_unlimited)
train_rmse_unlimited = np.sqrt(mean_squared_error(y_train, train_pred_unlimited))
test_rmse_unlimited = np.sqrt(mean_squared_error(y_test, test_pred_unlimited))

print("=== DECISION TREE PERFORMANCE (Unlimited Depth) ===")
print(f"Training:  R² = {train_r2_unlimited:.4f}, RMSE = {train_rmse_unlimited:.2f} bikes")
print(f"Testing:   R² = {test_r2_unlimited:.4f}, RMSE = {test_rmse_unlimited:.2f} bikes")
print(f"Overfit gap: {train_r2_unlimited - test_r2_unlimited:.4f}")
print()

if (train_r2_unlimited - test_r2_unlimited) > 0.30:
    print("⚠ SEVERE OVERFITTING DETECTED:")
    print(f"  • Training R² near-perfect ({train_r2_unlimited:.1%}) but testing R² only {test_r2_unlimited:.1%}")
    print(f"  • Gap of {train_r2_unlimited - test_r2_unlimited:.1%} indicates memorization, not learning")
    print(f"  • Tree depth of {tree_unlimited.get_depth()} with {tree_unlimited.get_n_leaves():,} leaves creates overly specific rules")
    print(f"  • Solution: Limit tree depth or use ensemble methods (Random Forest)")
elif (train_r2_unlimited - test_r2_unlimited) > 0.15:
    print("⚠ MODERATE OVERFITTING:")
    print(f"  • Performance gap suggests some memorization of training data")
    print(f"  • Consider constraining tree depth or using regularization")
else:
    print("✓ Good generalization - training and testing performance similar")

print()

print("=== WHY SINGLE TREES OVERFIT ===")
print("• Decision trees recursively partition data until leaves are pure")
print("• Unlimited depth = tree memorizes training data noise and outliers")
print("• Result: Near-perfect training fit but poor generalization to new data")
print("• Solution: Use ensemble methods (Random Forest) that average many trees")

**What this does:**
- Creates chronological 80/20 train-test split preserving temporal order for honest evaluation
- Trains unlimited-depth decision tree showing severe overfitting (training R² ≈99%, test R² ≈45%)
- Displays tree structure metrics (depth, leaves, nodes) revealing model complexity
- Calculates overfit gap (train R² - test R²) demonstrating why single trees struggle with generalization
- Shows that unlimited trees memorize training data patterns rather than learning generalizable relationships
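
The overfitting mechanism can be reproduced in isolation. The following sketch uses synthetic data (a noisy daily cycle, an assumption made purely for illustration) with the same 80/20 positional split idea, and compares an unlimited tree with a depth-3 tree. It does not reproduce the notebook's numbers.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 1000
hour_of_day = rng.uniform(0, 24, n)  # continuous values, so rows are effectively unique
signal = 100 + 80 * np.sin(hour_of_day / 24 * 2 * np.pi)
noisy_demand = signal + rng.normal(0, 40, n)

X_noise = hour_of_day.reshape(-1, 1)
cut = int(n * 0.8)  # first 80% for training, last 20% for testing
X_tr, X_te = X_noise[:cut], X_noise[cut:]
y_tr, y_te = noisy_demand[:cut], noisy_demand[cut:]

deep_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
small_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

gaps = {}
for name, model in [("unlimited", deep_tree), ("max_depth=3", small_tree)]:
    train_r2 = r2_score(y_tr, model.predict(X_tr))
    test_r2 = r2_score(y_te, model.predict(X_te))
    gaps[name] = train_r2 - test_r2
    print(f"{name:<12} leaves={model.get_n_leaves():4d} "
          f"train R²={train_r2:.3f} test R²={test_r2:.3f} gap={gaps[name]:.3f}")
```

With unique inputs and continuous targets, the unlimited tree can isolate every training row in its own leaf, so its training fit is perfect by construction while the noise it memorized does not carry over to the held-out rows.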

### Challenge 2: Visualize Decision Tree Structure

Your client asks: "Can you show me how the tree makes decisions? I want to understand the business rules it learned." Create a visualization of a shallow tree for interpretability.

**Your Task:** Train a very shallow tree (max_depth=3) and visualize its structure with feature names and decision thresholds.

In [None]:
# Your code here - create and visualize shallow decision tree

# Train a shallow tree for visualization (max_depth=3 for clarity)
tree_shallow = DecisionTreeRegressor(max_depth=___, min_samples_leaf=50, random_state=42)
tree_shallow.fit(X_train, y_train)

# Calculate performance of shallow tree
test_pred_shallow = tree_shallow.predict(X_test)
test_r2_shallow = r2_score(y_test, test_pred_shallow)

print(f"=== SHALLOW TREE (max_depth=3) ===")
print(f"Tree depth: {tree_shallow.get_depth()}")
print(f"Number of leaves: {tree_shallow.get_n_leaves()}")
print(f"Test R²: {test_r2_shallow:.4f}")
print()

# Visualize tree structure
plt.figure(figsize=(20, 10))
plot_tree(tree_shallow,
          feature_names=_____,
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('Decision Tree Structure (max_depth=3) - Interpretable Business Rules',
          fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Extract and display the most important decision rules
print("=== TOP DECISION RULES (Business Interpretation) ===")
feature_importance_shallow = pd.DataFrame({
    'feature': feature_columns,
    'importance': tree_shallow.feature_importances_
}).sort_values('importance', ascending=False).head(5)
print(feature_importance_shallow)

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Use `max_depth=3` to create a tree shallow enough to visualize clearly on one page. Pass `feature_names=feature_columns` (the list of column names you defined in Step 1) to `plot_tree()` so it displays readable feature labels instead of generic "X[0]" notation. Set `filled=True` to color nodes by prediction value (darker colors = higher predicted demand) and `rounded=True` for professional appearance. After visualization, use `tree_shallow.feature_importances_` to see how much of the total impurity (variance) reduction each feature accounts for; the features with the largest share are the key drivers the tree identified. A shallow tree sacrifices accuracy for interpretability, so expect test R² around 65-75%, but you gain the ability to communicate exact decision logic to stakeholders. The visualization will show something like: "If hour <= 12.5 AND workingday <= 0.5, predict low demand (weekend morning pattern)". These are the if-then business rules your operations team can actually use for planning.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Train a shallow tree for visualization (max_depth=3 for clarity)
tree_shallow = DecisionTreeRegressor(max_depth=3, min_samples_leaf=50, random_state=42)
tree_shallow.fit(X_train, y_train)

# Calculate performance of shallow tree
test_pred_shallow = tree_shallow.predict(X_test)
test_r2_shallow = r2_score(y_test, test_pred_shallow)

print(f"=== SHALLOW TREE (max_depth=3) ===")
print(f"Tree depth: {tree_shallow.get_depth()}")
print(f"Number of leaves: {tree_shallow.get_n_leaves()}")
print(f"Test R²: {test_r2_shallow:.4f}")
print()

# Visualize tree structure
plt.figure(figsize=(20, 10))
plot_tree(tree_shallow,
          feature_names=feature_columns,
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('Decision Tree Structure (max_depth=3) - Interpretable Business Rules',
          fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Extract and display the most important decision rules
print("=== TOP DECISION RULES (Business Interpretation) ===")
feature_importance_shallow = pd.DataFrame({
    'feature': feature_columns,
    'importance': tree_shallow.feature_importances_
}).sort_values('importance', ascending=False).head(5)
print(feature_importance_shallow.round(4))
print()

print("=== BUSINESS RULE TRANSLATION ===")
print("The tree makes decisions using if-then logic:")
print("• Root node: Splits on most predictive feature (likely 'hour' or 'is_rush_hour')")
print("• Each split creates two branches: one for observations meeting condition, one for those that don't")
print("• Leaf nodes (colored boxes): Final demand predictions for observations reaching that leaf")
print("• Node color intensity: Darker = higher predicted demand, Lighter = lower predicted demand")
print()
print("Example interpretation:")
print("'If hour <= 12.5 AND workingday = 1 → Predict 150 bikes/hour (morning commute)'")
print("'If hour > 18.5 AND temp > 20 → Predict 280 bikes/hour (warm evening peak)'")
```

</details>
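
When a plotted tree is awkward to share (for example in an email or a ticket), scikit-learn's `export_text` prints the same if-then structure as plain text. This is a small sketch on synthetic data; the two-feature setup and the demand rule are assumptions for illustration, not the notebook's fitted tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
n = 500
hour_col = rng.integers(0, 24, n)
workingday_col = rng.integers(0, 2, n)

# Synthetic rule: daytime hours add 200 bikes, working days add 60
rule_demand = (40
               + 200 * ((hour_col >= 7) & (hour_col <= 19))
               + 60 * workingday_col
               + rng.normal(0, 10, n))

X_rules = np.column_stack([hour_col, workingday_col])
rule_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_rules, rule_demand)

# export_text renders each split as an indented "feature <= threshold" line
rules = export_text(rule_tree, feature_names=["hour", "workingday"], decimals=1)
print(rules)
```

Each indented line is one condition on the path to a leaf, and the leaf's `value` is the predicted demand for observations that satisfy every condition above it.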

---

## Step 3: Deploy Random Forest Ensemble

Let's implement Random Forest to overcome individual tree overfitting through ensemble averaging of multiple diverse trees.
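
Before using the library class, the averaging idea can be sketched by hand: train several unlimited trees on bootstrap resamples and average their predictions. This is a simplified bagging illustration on synthetic data (the data, the 50-tree count, and the helper `rmse` are all assumptions), and it omits details that `RandomForestRegressor` handles internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
n = 600
X_bag = rng.uniform(0, 24, size=(n, 1))
y_bag = 100 + 80 * np.sin(X_bag[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 40, n)
X_fit, y_fit = X_bag[:480], y_bag[:480]
X_hold, y_hold = X_bag[480:], y_bag[480:]

n_trees = 50
tree_preds = np.empty((n_trees, len(X_hold)))
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    member = DecisionTreeRegressor(random_state=i).fit(X_fit[idx], y_fit[idx])
    tree_preds[i] = member.predict(X_hold)


def rmse(pred):
    """Root mean squared error against the held-out targets."""
    return float(np.sqrt(np.mean((y_hold - pred) ** 2)))


mean_single_rmse = float(np.mean([rmse(p) for p in tree_preds]))
bagged_rmse = rmse(tree_preds.mean(axis=0))
print(f"Average RMSE of individual trees: {mean_single_rmse:.1f}")
print(f"RMSE of the averaged prediction:  {bagged_rmse:.1f}")
```

The averaged prediction cannot have a higher RMSE than the average of the individual RMSEs (a consequence of the triangle inequality), and the more the trees disagree with one another, the larger the improvement.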

In [None]:
# Import Random Forest from scikit-learn's ensemble module
from sklearn.ensemble import RandomForestRegressor

print("=== RANDOM FOREST ENSEMBLE IMPLEMENTATION ===\n")

# Train Random Forest with default parameters first
print("--- Training Random Forest (Default: 100 trees) ---")
# RandomForestRegressor creates an ensemble of decision trees:
# - Each tree trains on a bootstrap sample (random sampling with replacement)
# - Each split can draw from a random feature subset via max_features; note that
#   RandomForestRegressor considers all features by default ('sqrt' is the classifier default)
# - Final prediction = average of all tree predictions (reduces variance)
rf_default = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
# n_estimators=100: Build 100 decision trees in the forest
# random_state=42: Ensures reproducible results across runs
# n_jobs=-1: Use all CPU cores for parallel training (speeds up computation)
rf_default.fit(X_train, y_train)

print(f"Forest size: {rf_default.n_estimators} trees")
print(f"max_features setting: {rf_default.max_features!r} (regressor default: all {X_train.shape[1]} features are candidates at each split)")
print()

# Generate predictions with Random Forest
train_pred_rf = rf_default.predict(X_train)
test_pred_rf = rf_default.predict(X_test)

train_r2_rf = r2_score(y_train, train_pred_rf)
test_r2_rf = r2_score(y_test, test_pred_rf)
train_rmse_rf = np.sqrt(mean_squared_error(y_train, train_pred_rf))
test_rmse_rf = np.sqrt(mean_squared_error(y_test, test_pred_rf))

print("=== RANDOM FOREST PERFORMANCE ===")
print(f"Training:  R² = {train_r2_rf:.4f}, RMSE = {train_rmse_rf:.2f} bikes")
print(f"Testing:   R² = {test_r2_rf:.4f}, RMSE = {test_rmse_rf:.2f} bikes")
print(f"Overfit gap: {train_r2_rf - test_r2_rf:.4f}")
print()

# Compare Random Forest vs Single Decision Tree
print("=== ALGORITHM PERFORMANCE COMPARISON ===")
print("Model                          | Train R²  | Test R²   | Overfit Gap | Status")
print("-" * 85)
print(f"Single Tree (Unlimited)        | {train_r2_unlimited:.4f}    | {test_r2_unlimited:.4f}    | {train_r2_unlimited - test_r2_unlimited:.4f}      | Severe Overfit")
print(f"Random Forest (100 trees)      | {train_r2_rf:.4f}    | {test_r2_rf:.4f}    | {train_r2_rf - test_r2_rf:.4f}      | Good Balance")
print()

# Calculate competitive advantages for business reporting
test_improvement_vs_tree = (test_r2_rf - test_r2_unlimited) / test_r2_unlimited * 100

print("=== RANDOM FOREST COMPETITIVE ADVANTAGES ===")
print(f"Test R² improvement vs single tree: {test_improvement_vs_tree:+.1f}%")
print(f"Overfit gap reduction: {train_r2_unlimited - test_r2_unlimited:.4f} → {train_r2_rf - test_r2_rf:.4f}")
print()

if test_r2_rf >= 0.85:
    print("✓ EXCELLENT PERFORMANCE:")
    print("  • Test R² ≥ 85% meets Series B investor expectations")
    print("  • Production-ready accuracy for operational deployment")
    print("  • Competitive advantage over linear baseline established")
elif test_r2_rf >= 0.75:
    print("✓ STRONG PERFORMANCE:")
    print("  • Test R² ≥ 75% represents significant improvement")
    print("  • Suitable for operational planning and strategic decision-making")
    print("  • Demonstrates advanced ML capabilities to stakeholders")
else:
    print("⚠ MODERATE PERFORMANCE:")
    print("  • Test R² suggests room for further optimization")
    print("  • Consider additional feature engineering or hyperparameter tuning")

print()

# Demonstrate ensemble diversity by examining individual tree predictions
print("=== ENSEMBLE DIVERSITY DEMONSTRATION ===")
print("Examining predictions from first 10 trees for one observation:")
example_obs = X_test.iloc[0:1]
print(f"Example observation features:")
print(f"  Hour: {example_obs['hour'].values[0]}, Temp: {example_obs['temp'].values[0]:.1f}°C")
print(f"  Working day: {example_obs['workingday'].values[0]}, Rush hour: {example_obs['is_rush_hour'].values[0]}")
print()

tree_predictions = []
# Convert to numpy array once to avoid feature name warnings
example_obs_array = example_obs.to_numpy()
for i in range(min(10, rf_default.n_estimators)):
    # Each tree in the forest makes independent predictions
    tree_pred = rf_default.estimators_[i].predict(example_obs_array)[0]
    tree_predictions.append(tree_pred)
    print(f"Tree {i+1:2d} predicts: {tree_pred:6.1f} bikes")

print(f"\nAverage of 10 trees:  {np.mean(tree_predictions):6.1f} bikes")
print(f"Full ensemble (100):  {rf_default.predict(example_obs)[0]:6.1f} bikes")
print(f"Prediction spread:    {np.max(tree_predictions) - np.min(tree_predictions):6.1f} bikes")
print(f"Standard deviation:   {np.std(tree_predictions):6.1f} bikes")
print()

print("WHY DIVERSITY MATTERS:")
print("• Each tree sees different bootstrap sample (random observations)")
print("• Setting max_features below 1.0 would also randomize the features tried at each split (regressor default: all features)")
print("• Individual trees make different predictions (some high, some low)")
print("• Averaging cancels individual errors → more stable, reliable forecast")
print("• This is the 'wisdom of crowds' principle: collective intelligence > individual guesses")

**What this does:**
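- Trains Random Forest with 100 trees using bootstrap sampling (per-split feature subsampling is available through `max_features` but is not active with the regressor's default setting)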
- Evaluates on both training and testing sets showing dramatically reduced overfitting vs. single tree
- Compares performance against unlimited decision tree demonstrating ensemble advantages
- Shows individual tree predictions for one observation revealing diversity in the forest
- Calculates prediction spread and standard deviation quantifying ensemble variance reduction
- Provides business-focused performance assessment (excellent/strong/moderate categories)
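
The relationship between individual trees and the forest's output can be verified directly: for a regressor, the forest prediction is the mean of its trees' predictions. The sketch below checks this on synthetic data (the six random features and the target formula are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 400 rows, 6 numeric features
rng = np.random.default_rng(3)
X_forest = rng.normal(size=(400, 6))
y_forest = (3 * X_forest[:, 0]
            + X_forest[:, 1] * X_forest[:, 2]
            + rng.normal(0, 0.5, 400))

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_forest, y_forest)

sample = X_forest[:5]
# One row of predictions per tree: shape (25 trees, 5 observations)
per_tree = np.stack([tree.predict(sample) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(sample)

print("Manual average of trees:", np.round(manual_average, 3))
print("forest.predict output:  ", np.round(forest_prediction, 3))
print("Per-observation std across trees:", np.round(per_tree.std(axis=0), 3))
```

The standard deviation across trees gives a rough sense of how much the ensemble members disagree on each observation; it is a diagnostic of diversity, not a calibrated prediction interval.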

### Challenge 3: Compare Different Ensemble Sizes

Your client asks: "Do we really need 100 trees? Could we get similar performance with fewer trees (faster training) or do we need more trees for better accuracy?" Experiment with ensemble size.

**Your Task:** Train Random Forests with different numbers of trees (10, 50, 100, 200, 300) and analyze the performance vs. training time tradeoff.

In [None]:
# Your code here - compare different ensemble sizes

import time

ensemble_sizes = [10, 50, 100, 200, 300]
results = []

for n_trees in ensemble_sizes:
    print(f"Training Random Forest with {n_trees} trees...")

    # Time the training process
    start_time = time.time()
    rf_temp = RandomForestRegressor(n_estimators=_____, random_state=42, n_jobs=-1)
    rf_temp.fit(X_train, y_train)
    training_time = time.time() - start_time

    # Evaluate performance
    test_pred_temp = rf_temp.predict(X_test)
    test_r2_temp = r2_score(y_test, test_pred_temp)
    test_rmse_temp = np.sqrt(mean_squared_error(y_test, test_pred_temp))

    # Store results
    results.append({
        'n_trees': n_trees,
        'training_time': training_time,
        'test_r2': test_r2_temp,
        'test_rmse': test_rmse_temp
    })

    print(f"  Training time: {training_time:.2f}s, Test R²: {test_r2_temp:.4f}")
    print()

# Visualize performance vs ensemble size
results_df = pd.DataFrame(results)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel 1: Test R² vs ensemble size
axes[0].plot(results_df['n_trees'], results_df['test_r2'], 'o-', linewidth=2, markersize=8, color='darkgreen')
axes[0].set_xlabel('Number of Trees', fontsize=11)
axes[0].set_ylabel('Test R²', fontsize=11)
axes[0].set_title('Test Performance vs Ensemble Size', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Panel 2: Training time vs ensemble size
axes[1].plot(results_df['n_trees'], results_df['training_time'], 's-', linewidth=2, markersize=8, color='darkorange')
axes[1].set_xlabel('Number of Trees', fontsize=11)
axes[1].set_ylabel('Training Time (seconds)', fontsize=11)
axes[1].set_title('Training Time vs Ensemble Size', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Business recommendation
print("=== ENSEMBLE SIZE RECOMMENDATION ===")
print(results_df.to_string(index=False))

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Loop through each ensemble size and create a fresh `RandomForestRegressor(n_estimators=n_trees, random_state=42, n_jobs=-1)` for each iteration. Use `time.time()` before and after `.fit()` to measure training duration: `start = time.time(); model.fit(X, y); duration = time.time() - start`. Store all results in a list of dictionaries, then convert to DataFrame for easy analysis and visualization. The performance curve typically shows diminishing returns: 10→50 trees gives large improvement, 100→200 gives small improvement, 200→300 gives minimal improvement. Training time increases linearly with tree count (200 trees takes ~2x as long as 100 trees) so there's a clear tradeoff. Business insight: 100-200 trees usually provides the sweet spot - excellent performance without excessive training time. For production deployment, consider whether faster predictions (fewer trees) or maximum accuracy (more trees) matters more to your client's use case. If they need real-time predictions for millions of users, fewer trees might be preferable; if they're doing daily batch forecasting for operational planning, more trees at higher accuracy makes sense.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
import time

ensemble_sizes = [10, 50, 100, 200, 300]
results = []

for n_trees in ensemble_sizes:
    print(f"Training Random Forest with {n_trees} trees...")

    # Time the training process
    start_time = time.time()
    rf_temp = RandomForestRegressor(n_estimators=n_trees, random_state=42, n_jobs=-1)
    rf_temp.fit(X_train, y_train)
    training_time = time.time() - start_time

    # Evaluate performance
    test_pred_temp = rf_temp.predict(X_test)
    test_r2_temp = r2_score(y_test, test_pred_temp)
    test_rmse_temp = np.sqrt(mean_squared_error(y_test, test_pred_temp))

    # Store results
    results.append({
        'n_trees': n_trees,
        'training_time': training_time,
        'test_r2': test_r2_temp,
        'test_rmse': test_rmse_temp
    })

    print(f"  Training time: {training_time:.2f}s, Test R²: {test_r2_temp:.4f}")
    print()

# Visualize performance vs ensemble size
results_df = pd.DataFrame(results)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel 1: Test R² vs ensemble size
axes[0].plot(results_df['n_trees'], results_df['test_r2'], 'o-', linewidth=2, markersize=8, color='darkgreen')
axes[0].set_xlabel('Number of Trees', fontsize=11)
axes[0].set_ylabel('Test R²', fontsize=11)
axes[0].set_title('Test Performance vs Ensemble Size', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Panel 2: Training time vs ensemble size
axes[1].plot(results_df['n_trees'], results_df['training_time'], 's-', linewidth=2, markersize=8, color='darkorange')
axes[1].set_xlabel('Number of Trees', fontsize=11)
axes[1].set_ylabel('Training Time (seconds)', fontsize=11)
axes[1].set_title('Training Time vs Ensemble Size', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Business recommendation
print("=== ENSEMBLE SIZE RECOMMENDATION ===")
print(results_df.to_string(index=False))
print()

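# Note: idxmax() selects the highest test R² and ignores training time,
# so this is an accuracy-first recommendation rather than a cost-weighted one
best_value_idx = results_df['test_r2'].idxmax()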
best_value = results_df.loc[best_value_idx]
print(f"✓ Recommended: {int(best_value['n_trees'])} trees")
print(f"  • Test R²: {best_value['test_r2']:.4f}")
print(f"  • Training time: {best_value['training_time']:.2f}s")
print(f"  • Rationale: {'Excellent accuracy-speed balance' if best_value['n_trees'] <= 100 else 'Maximum accuracy justified for critical forecasting'}")
```

</details>
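
The solution recommends the ensemble size with the highest test R², which will often be the largest forest even when the gain is negligible. One possible alternative, sketched here under stated assumptions, is to pick the smallest ensemble whose test R² falls within a tolerance of the best score. The helper name `smallest_within_tolerance`, the `tol` default, and the numbers in `demo` are all illustrative assumptions, not measured results from this tutorial.

```python
import pandas as pd


def smallest_within_tolerance(results_df, tol=0.005):
    """Return the row with the fewest trees whose test R² is within `tol` of the best."""
    best_r2 = results_df["test_r2"].max()
    good_enough = results_df[results_df["test_r2"] >= best_r2 - tol]
    return good_enough.sort_values("n_trees").iloc[0]


# Illustrative numbers only (hypothetical, not output from the tutorial's models)
demo = pd.DataFrame({
    "n_trees": [10, 50, 100, 200, 300],
    "training_time": [0.2, 0.9, 1.8, 3.6, 5.4],
    "test_r2": [0.820, 0.845, 0.851, 0.853, 0.854],
})

choice = smallest_within_tolerance(demo, tol=0.005)
print(f"Smallest ensemble within tolerance: {int(choice['n_trees'])} trees")
```

With the real `results_df` from the solution, the same call would return a row you can report alongside the accuracy-first choice, making the accuracy versus training-time tradeoff explicit for stakeholders.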

---

## Step 4: Feature Importance Analysis

Let's analyze which features drive bike demand predictions to guide strategic investments and operational decisions.

In [None]:
# Extract feature importance from trained Random Forest
print("=== RANDOM FOREST FEATURE IMPORTANCE ANALYSIS ===\n")

# Feature importance based on mean decrease in impurity (MDI)
# Higher values = feature produced larger impurity (variance) reductions across all trees,
# measured on the training data; scores are normalized to sum to 1
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': rf_default.feature_importances_
}).sort_values('importance', ascending=False)

print("--- Feature Importance Rankings ---")
print(f"{'Rank':<6} {'Feature':<25} {'Importance':<12} {'Percentage':<12} {'Visual'}")
print("-" * 75)

for rank, (idx, row) in enumerate(feature_importance.iterrows(), 1):
    bar = '█' * int(row['importance'] * 100)
    percentage = row['importance'] * 100
    print(f"{rank:<6} {row['feature']:<25} {row['importance']:.4f}       {percentage:>6.2f}%        {bar}")

print()

# Calculate cumulative importance to identify critical feature subset
feature_importance['cumulative'] = feature_importance['importance'].cumsum()

print("--- Cumulative Importance Analysis ---")
for i in range(min(5, len(feature_importance))):
    feature_name = feature_importance.iloc[i]['feature']
    cumulative = feature_importance.iloc[i]['cumulative']
    print(f"Top {i+1} feature(s): {cumulative:.1%} of total predictive power")
    if cumulative >= 0.80:
        print(f"  → {i+1} features capture 80% of model intelligence")
        break

print()

# Visualize feature importance
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Panel 1: Horizontal bar chart of top 10 features
top_features = feature_importance.head(10)
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(top_features)))
axes[0].barh(range(len(top_features)), top_features['importance'], color=colors)
axes[0].set_yticks(range(len(top_features)))
axes[0].set_yticklabels(top_features['feature'])
axes[0].invert_yaxis()
axes[0].set_xlabel('Importance Score', fontsize=11)
axes[0].set_title('Top 10 Feature Importance', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='x')

# Panel 2: Cumulative importance curve
axes[1].plot(range(1, len(feature_importance) + 1), feature_importance['cumulative'],
             'o-', linewidth=2, markersize=6, color='darkgreen')
axes[1].axhline(y=0.80, color='red', linestyle='--', linewidth=2, label='80% threshold')
axes[1].axhline(y=0.90, color='orange', linestyle='--', linewidth=2, label='90% threshold')
axes[1].set_xlabel('Number of Top Features', fontsize=11)
axes[1].set_ylabel('Cumulative Importance', fontsize=11)
axes[1].set_title('Cumulative Feature Contribution', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("=== BUSINESS INSIGHTS FOR CAPITAL CITY BIKES ===")

# Interpret top 3 features for strategic recommendations
for rank in range(min(3, len(feature_importance))):
    feature = feature_importance.iloc[rank]['feature']
    importance = feature_importance.iloc[rank]['importance']

    print(f"\n{rank+1}. {feature}: {importance:.1%} importance")

    # Business interpretation by feature type
    if feature in ['hour', 'is_rush_hour']:
        print("   → IMPLICATION: Time-of-day dominates demand patterns")
        print("   → STRATEGY: Optimize fleet positioning by hour (rush hours critical)")
        print("   → INVESTMENT: Real-time repositioning systems, surge pricing algorithms")
    elif feature in ['temp', 'atemp']:
        print("   → IMPLICATION: Temperature drives cycling comfort decisions")
        print("   → STRATEGY: Weather-responsive capacity planning")
        print("   → INVESTMENT: Weather API integration, temperature-based forecasting")
    elif feature in ['workingday', 'dayofweek', 'is_weekend']:
        print("   → IMPLICATION: Commuter vs. leisure patterns differ fundamentally")
        print("   → STRATEGY: Separate weekday (commute) vs. weekend (leisure) operations")
        print("   → INVESTMENT: Day-specific marketing, differential pricing strategies")
    elif feature in ['season', 'month']:
        print("   → IMPLICATION: Seasonal variations require long-term planning")
        print("   → STRATEGY: Adjust fleet size, maintenance schedules by season")
        print("   → INVESTMENT: Seasonal fleet scaling, predictive maintenance")
    elif feature in ['weather_severity', 'humidity', 'windspeed']:
        print("   → IMPLICATION: Weather conditions directly impact usage decisions")
        print("   → STRATEGY: Dynamic bike redistribution based on forecasts")
        print("   → INVESTMENT: Weather-triggered alerts, covered bike stations")
    else:
        print("   → IMPLICATION: Feature provides supplementary predictive value")
        print("   → STRATEGY: Maintain in model for marginal accuracy gains")

print("\n" + "="*75)
print("STRATEGIC RECOMMENDATION:")
top_feature = feature_importance.iloc[0]['feature']
top_importance = feature_importance.iloc[0]['importance']
print(f"✓ '{top_feature}' dominates with {top_importance:.1%} importance")
print(f"✓ Top 3 features capture {feature_importance.iloc[2]['cumulative']:.1%} of predictive power")
print(f"✓ Focus operational investments on temporal optimization and weather responsiveness")
print(f"✓ These insights justify Series B funding requests for real-time forecasting systems")

**What this does:**
- Extracts `.feature_importances_` from trained Random Forest showing MDI (Mean Decrease in Impurity) scores
- Displays ranked table with importance scores, percentages, and visual bars for quick interpretation
- Calculates cumulative importance showing how many features capture 80%/90% of predictive power
- Creates two-panel visualization: horizontal bar chart (top 10 features) and cumulative curve
- Translates top features into business implications with strategic recommendations for each
- Provides actionable insights for resource allocation, investment decisions, and operational priorities
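
MDI scores are computed on the training data and can favor features with many distinct values. A common cross-check is permutation importance, which measures how much a held-out score drops when one feature's values are shuffled. The sketch below uses synthetic data (an assumption made so the block is self-contained) rather than the bike-sharing features; `signal` and `noise_id` are hypothetical column names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
n = 600
signal = rng.uniform(0, 1, n)                 # truly predictive feature
noise_id = rng.permutation(n).astype(float)   # high-cardinality pure noise
y = 10 * signal + rng.normal(0, 1, n)

X = np.column_stack([signal, noise_id])
X_train, X_test = X[:450], X[450:]
y_train, y_test = y[:450], y[450:]

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each feature on the held-out set and record the drop in R²
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

for name, mdi, pi in zip(["signal", "noise_id"], rf.feature_importances_, perm.importances_mean):
    print(f"{name:<10} MDI = {mdi:.3f}   permutation = {pi:.3f}")
```

To apply the same idea to the tutorial's model, you would pass `rf_default`, `X_test`, and `y_test` to `permutation_importance` and compare the resulting ranking with the MDI ranking; features that rank high on MDI but near zero on permutation importance deserve a closer look.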

### Challenge 4: Experiment with New Features and Analyze Feature Importance

Your client asks: "We've seen what features matter now, but what if we engineer additional features? Could we discover new patterns that improve predictions? I want to experiment with new features and see how they compare in the importance rankings."

**Your Task:** Create 3-5 new experimental features (e.g., squared terms, new time windows, weather combinations), retrain the Random Forest, and analyze how feature importance changes.

In [None]:
# Your code here - create new features and analyze importance changes

# Step 1: Create new experimental features
# Examples of features you might try:
# - Squared/polynomial features: temp_squared, humidity_squared
# - New time windows: is_midday, is_late_night
# - Weather combinations: temp_humidity_interaction
# - Lag features: previous_hour_pattern
# - Day type combinations: weekend_hour_interaction

print("=== EXPERIMENTING WITH NEW FEATURES ===\n")

# Add your new features to the dataframe
df['temp_squared'] = df['temp'] ** 2
df['___'] = ___  # Add 2-4 more experimental features
df['___'] = ___
df['___'] = ___

# Create new feature list including original + experimental features
experimental_features = feature_columns + ['temp_squared', '___', '___', '___']

print(f"Original features: {len(feature_columns)}")
print(f"Experimental features: {len(experimental_features)}")
print(f"New features added: {len(experimental_features) - len(feature_columns)}")
print()

# Step 2: Train Random Forest with expanded feature set
X_experimental = df[experimental_features]
y_experimental = df['count']

X_train_exp = X_experimental.iloc[:split_index]
X_test_exp = X_experimental.iloc[split_index:]

rf_experimental = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_experimental.fit(X_train_exp, y_train)

# Step 3: Evaluate performance change
test_pred_exp = rf_experimental.predict(X_test_exp)
test_r2_exp = r2_score(y_test, test_pred_exp)

print(f"=== PERFORMANCE COMPARISON ===")
print(f"Baseline model (original features):     R² = {test_r2_rf:.4f}")
print(f"Experimental model (expanded features): R² = {test_r2_exp:.4f}")
print(f"Performance change:                     {(test_r2_exp - test_r2_rf):.4f} ({((test_r2_exp - test_r2_rf)/test_r2_rf)*100:+.2f}%)")
print()

# Step 4: Analyze new feature importance rankings
feature_importance_exp = pd.DataFrame({
    'feature': experimental_features,
    'importance': rf_experimental.feature_importances_
}).sort_values('importance', ascending=False)

print("=== TOP 15 FEATURES (WITH EXPERIMENTAL FEATURES) ===")
for rank, (idx, row) in enumerate(feature_importance_exp.head(15).iterrows(), 1):
    is_new = '🆕 NEW' if row['feature'] not in feature_columns else '      '
    bar = '█' * int(row['importance'] * 100)
    print(f"{rank:<3} {is_new} {row['feature']:<30} {row['importance']:.4f}  {bar}")

print()

# Step 5: Compare feature importance distributions
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Panel 1: Original model top 10
top_original = feature_importance.head(10)
colors_original = plt.cm.viridis(np.linspace(0.3, 0.9, len(top_original)))
axes[0].barh(range(len(top_original)), top_original['importance'], color=colors_original)
axes[0].set_yticks(range(len(top_original)))
axes[0].set_yticklabels(top_original['feature'])
axes[0].invert_yaxis()
axes[0].set_xlabel('Importance Score', fontsize=11)
axes[0].set_title('Baseline Model - Top 10 Features', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='x')

# Panel 2: Experimental model top 10
top_experimental = feature_importance_exp.head(10)
colors_experimental = plt.cm.plasma(np.linspace(0.3, 0.9, len(top_experimental)))
axes[1].barh(range(len(top_experimental)), top_experimental['importance'], color=colors_experimental)
axes[1].set_yticks(range(len(top_experimental)))
axes[1].set_yticklabels(top_experimental['feature'])
axes[1].invert_yaxis()
axes[1].set_xlabel('Importance Score', fontsize=11)
axes[1].set_title('Experimental Model - Top 10 Features', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

# Step 6: Analyze insights
print("=== EXPERIMENTAL FEATURE INSIGHTS ===")
# Check if any new features made it to top 10
new_features_in_top10 = [f for f in top_experimental['feature'].values[:10] if f not in feature_columns]
if new_features_in_top10:
    print(f"✓ {len(new_features_in_top10)} new feature(s) in top 10: {', '.join(new_features_in_top10)}")
    print("  → These features captured previously hidden patterns")
else:
    print("✗ No new features in top 10")
    print("  → Original features still dominate, experimental features add marginal value")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Good experimental features to try: `temp_squared = df['temp'] ** 2` (squared temperature), `is_midday = ((df['hour'] >= 11) & (df['hour'] <= 14)).astype(int)` (lunch-hour patterns), `temp_humidity_interaction = df['temp'] * df['humidity']` (a rough discomfort proxy), `hour_squared = df['hour'] ** 2` (squared hour), or `weekend_weather = df['is_weekend'] * df['weather_severity']` (weekend sensitivity to bad weather). After creating features, make sure they appear in `experimental_features = feature_columns + ['new_feature1', 'new_feature2', ...]`.

One caveat for tree-based models: splits depend only on the ordering of a feature's values, so a monotonic transform of a single feature (such as squaring a non-negative `temp` or `hour`) gives the trees no new information and mostly splits importance between the original and its copy. Indicators and interactions between different features are more likely to help.

When analyzing importance, ask: (1) Did any new features break into the top 10? (2) Did test performance improve meaningfully (an R² gain above 0.01)? (3) Are the new features interpretable enough for business use? Sometimes new features improve training accuracy but add complexity without production value. The side-by-side visualization shows whether new features displace old ones or only add marginal value at the bottom of the rankings. Business insight: keep only experimental features that both improve test performance and rank in the top 15; otherwise they add computational cost without strategic benefit.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
print("=== EXPERIMENTING WITH NEW FEATURES ===\n")

# Step 1: Create experimental features
df['temp_squared'] = df['temp'] ** 2  # Squared temperature (monotonic in temp if temp >= 0, so trees may gain little)
df['is_midday'] = ((df['hour'] >= 11) & (df['hour'] <= 14)).astype(int)  # Lunch period
df['temp_humidity_interaction'] = df['temp'] * df['humidity']  # Rough discomfort proxy
df['hour_squared'] = df['hour'] ** 2  # Squared hour (monotonic in hour, so trees may gain little)
df['weekend_weather'] = df['is_weekend'] * df['weather_severity']  # Weekend weather sensitivity

# Create expanded feature list
experimental_features = feature_columns + [
    'temp_squared', 'is_midday', 'temp_humidity_interaction',
    'hour_squared', 'weekend_weather'
]

print(f"Original features: {len(feature_columns)}")
print(f"Experimental features: {len(experimental_features)}")
print(f"New features added: {len(experimental_features) - len(feature_columns)}")
print(f"\nNew features: temp_squared, is_midday, temp_humidity_interaction, hour_squared, weekend_weather")
print()

# Step 2: Train Random Forest with expanded feature set
X_experimental = df[experimental_features]
y_experimental = df['count']

X_train_exp = X_experimental.iloc[:split_index]
X_test_exp = X_experimental.iloc[split_index:]

print("Training Random Forest with experimental features...")
rf_experimental = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_experimental.fit(X_train_exp, y_train)

# Step 3: Evaluate performance change
test_pred_exp = rf_experimental.predict(X_test_exp)
test_r2_exp = r2_score(y_test, test_pred_exp)
test_rmse_exp = np.sqrt(mean_squared_error(y_test, test_pred_exp))

print(f"\n=== PERFORMANCE COMPARISON ===")
print(f"Baseline model ({len(feature_columns)} original features):")
print(f"  R² = {test_r2_rf:.4f}, RMSE = {test_rmse_rf:.2f} bikes")
print(f"\nExperimental model ({len(experimental_features)} expanded features):")
print(f"  R² = {test_r2_exp:.4f}, RMSE = {test_rmse_exp:.2f} bikes")
print(f"\nPerformance change:")
print(f"  ΔR² = {(test_r2_exp - test_r2_rf):+.4f} ({((test_r2_exp - test_r2_rf)/test_r2_rf)*100:+.2f}%)")
print(f"  ΔRMSE = {(test_rmse_exp - test_rmse_rf):+.2f} bikes")
print()

# Step 4: Analyze new feature importance rankings
feature_importance_exp = pd.DataFrame({
    'feature': experimental_features,
    'importance': rf_experimental.feature_importances_
}).sort_values('importance', ascending=False)

print("=== TOP 15 FEATURES (WITH EXPERIMENTAL FEATURES) ===")
for rank, (idx, row) in enumerate(feature_importance_exp.head(15).iterrows(), 1):
    is_new = '🆕 NEW' if row['feature'] not in feature_columns else '      '
    bar = '█' * int(row['importance'] * 100)
    percentage = row['importance'] * 100
    print(f"{rank:<3} {is_new} {row['feature']:<30} {row['importance']:.4f} ({percentage:>5.2f}%)  {bar}")

print()

# Step 5: Compare feature importance distributions
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Panel 1: Original model top 10
top_original = feature_importance.head(10)
colors_original = plt.cm.viridis(np.linspace(0.3, 0.9, len(top_original)))
axes[0].barh(range(len(top_original)), top_original['importance'], color=colors_original)
axes[0].set_yticks(range(len(top_original)))
axes[0].set_yticklabels(top_original['feature'])
axes[0].invert_yaxis()
axes[0].set_xlabel('Importance Score', fontsize=11)
axes[0].set_title('Baseline Model - Top 10 Features', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='x')

# Panel 2: Experimental model top 10
top_experimental = feature_importance_exp.head(10)
colors_experimental = plt.cm.plasma(np.linspace(0.3, 0.9, len(top_experimental)))
axes[1].barh(range(len(top_experimental)), top_experimental['importance'], color=colors_experimental)
axes[1].set_yticks(range(len(top_experimental)))
axes[1].set_yticklabels(top_experimental['feature'])
axes[1].invert_yaxis()
axes[1].set_xlabel('Importance Score', fontsize=11)
axes[1].set_title('Experimental Model - Top 10 Features', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

# Step 6: Detailed analysis of experimental features
print("=== EXPERIMENTAL FEATURE INSIGHTS ===")

# Check if any new features made it to top 10
new_features_list = ['temp_squared', 'is_midday', 'temp_humidity_interaction',
                     'hour_squared', 'weekend_weather']
new_features_in_top10 = [f for f in top_experimental['feature'].values[:10] if f in new_features_list]

if new_features_in_top10:
    print(f"✓ {len(new_features_in_top10)} new feature(s) in top 10:")
    for nf in new_features_in_top10:
        # Rank = 1-based position in the sorted table (the index still holds the pre-sort position)
        rank = feature_importance_exp['feature'].tolist().index(nf) + 1
        importance = feature_importance_exp[feature_importance_exp['feature'] == nf]['importance'].values[0]
        print(f"  • {nf}: Rank #{rank} ({importance:.1%} importance)")
    print("  → These features captured previously hidden patterns!")
else:
    print("✗ No new features in top 10")
    print("  → Original features still dominate")

print()

# Individual experimental feature analysis
print("=== INDIVIDUAL EXPERIMENTAL FEATURE RANKINGS ===")
for exp_feat in new_features_list:
    # Rank = 1-based position in the sorted table (the index still holds the pre-sort position)
    rank = feature_importance_exp['feature'].tolist().index(exp_feat) + 1
    importance = feature_importance_exp[feature_importance_exp['feature'] == exp_feat]['importance'].values[0]
    print(f"{exp_feat:<30} Rank: #{rank:2d}/{len(experimental_features)}  |  Importance: {importance:.4f} ({importance*100:.2f}%)")

print()

# Business recommendation
print("=== RECOMMENDATION FOR CAPITAL CITY BIKES ===")
if test_r2_exp > test_r2_rf + 0.01:
    print(f"✓ ADOPT experimental features: {((test_r2_exp - test_r2_rf)/test_r2_rf)*100:+.1f}% improvement justifies added complexity")
    print(f"✓ Focus on: {', '.join(new_features_in_top10[:3]) if new_features_in_top10 else 'top-ranked experimental features'}")
elif test_r2_exp > test_r2_rf:
    print(f"~ MARGINAL improvement ({((test_r2_exp - test_r2_rf)/test_r2_rf)*100:+.1f}%): Consider if complexity is worth small gain")
    print(f"~ New features add minimal predictive value but increase computation")
else:
    print(f"✗ NO improvement: Keep baseline model, experimental features don't help")
    print(f"✗ Original features appear sufficient for this problem")

print()
print(f"Key insight: {'Expanded feature engineering successful!' if test_r2_exp > test_r2_rf + 0.01 else 'Original features sufficient - diminishing returns from additional engineering'}")
```

</details>
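
Two of the experimental features, `temp_squared` and `hour_squared`, are monotonic transforms of a single existing column (assuming `temp` is non-negative). Because a decision tree only uses the ordering of a feature's values when choosing splits, such transforms should partition the training data the same way as the original. The sketch below checks this on synthetic hourly data; the data-generating function is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=500).astype(float)
# Synthetic demand with a non-linear daily pattern plus noise
y = 100 * np.sin(hour / 24 * 2 * np.pi) + rng.normal(0, 5, size=500)

X_raw = hour.reshape(-1, 1)
X_sq = (hour ** 2).reshape(-1, 1)

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_raw, y)
tree_sq = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_sq, y)

# Compare predictions on the training rows, where both trees see the same ordering
pred_raw = tree_raw.predict(X_raw)
pred_sq = tree_sq.predict(X_sq)

print("Identical training predictions:", np.allclose(pred_raw, pred_sq))
```

If the two trees agree, adding the squared column alongside the original mostly divides importance between two equivalent features. Interactions that combine different columns, such as `weekend_weather`, do not have this redundancy.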

---

## Step 5: Validate Random Forest with Time Series Cross-Validation

Let's apply TimeSeriesSplit to evaluate Random Forest performance across multiple temporal windows, ensuring our model generalizes reliably to future periods.

In [None]:
# Import time series cross-validation tools
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
import matplotlib.pyplot as plt

print("=== TIME SERIES CROSS-VALIDATION FOR RANDOM FOREST ===\n")

# Create TimeSeriesSplit with 5 folds (expanding window)
# Each fold trains on all past data and tests on the next time period
tscv = TimeSeriesSplit(n_splits=5)

print("--- Why TimeSeriesSplit for Random Forests? ---")
print("⚠ CRITICAL PRINCIPLE: Time series data requires chronological validation")
print("• Random Forest's bootstrap sampling randomizes WITHIN training set")
print("• Bootstrap does NOT protect against training on future to predict past")
print("• Without chronological splits = DATA LEAKAGE = invalid performance estimates")
print("• TimeSeriesSplit ensures we ALWAYS train on past, validate on future")
print()

# Show fold structure with dates
print("--- Time Series Cross-Validation Fold Structure ---")
print("Expanding window approach: each fold adds more training data\n")

for fold_num, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    train_dates = df.iloc[train_idx]['datetime']
    test_dates = df.iloc[test_idx]['datetime']

    print(f"Fold {fold_num}:")
    print(f"  Training: {len(train_idx):,} obs | {train_dates.min().date()} to {train_dates.max().date()}")
    print(f"  Testing:  {len(test_idx):,} obs | {test_dates.min().date()} to {test_dates.max().date()}")
    print()

# Train Random Forest with same parameters from Step 3
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Run cross-validation using TimeSeriesSplit
print("--- Running Cross-Validation (this may take 1-2 minutes) ---")
cv_scores = cross_val_score(rf_model, X, y, cv=tscv, scoring='r2', n_jobs=-1)

print("=== RANDOM FOREST CROSS-VALIDATION RESULTS ===\n")

# Display fold-by-fold performance
for i, score in enumerate(cv_scores, 1):
    print(f"Fold {i}: R² = {score:.4f}")

# Calculate summary statistics
cv_mean = cv_scores.mean()
cv_std = cv_scores.std()
cv_min = cv_scores.min()
cv_max = cv_scores.max()

print(f"\n--- Performance Summary ---")
print(f"Mean R²:      {cv_mean:.4f}")
print(f"Std Dev:      {cv_std:.4f}")
print(f"Range:        {cv_min:.4f} to {cv_max:.4f}")
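# Rough spread of fold scores (mean ± 1.96·std); with only 5 folds this is not a formal confidence interval
print(f"≈95% range:   {cv_mean - 1.96*cv_std:.4f} to {cv_mean + 1.96*cv_std:.4f}")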
print()

# Interpret consistency
if cv_std < 0.03:
    print("✓ EXCELLENT CONSISTENCY:")
    print("  • Very low variability (std < 0.03)")
    print("  • Model performs reliably across different time periods")
    print("  • Suitable for production deployment with confidence")
elif cv_std < 0.06:
    print("✓ GOOD CONSISTENCY:")
    print("  • Moderate variability (std < 0.06)")
    print("  • Model performs well but some temporal variation exists")
    print("  • Acceptable for production with monitoring")
else:
    print("⚠ HIGH VARIABILITY:")
    print("  • Substantial performance differences across time periods")
    print("  • Investigate causes: seasonality, data drift, feature instability")
    print("  • Consider temporal feature engineering or separate seasonal models")

print()

# Visualize cross-validation performance
plt.figure(figsize=(12, 5))

# Panel 1: Fold performance with mean and ±1 std dev band
plt.subplot(1, 2, 1)
plt.plot(range(1, len(cv_scores) + 1), cv_scores, 'o-', linewidth=2,
         markersize=10, color='darkgreen', label='Fold R²')
plt.axhline(y=cv_mean, color='blue', linestyle='--', linewidth=2, label=f'Mean R² = {cv_mean:.4f}')
plt.axhline(y=cv_mean + cv_std, color='red', linestyle=':', linewidth=1.5, label='±1 Std Dev')
plt.axhline(y=cv_mean - cv_std, color='red', linestyle=':', linewidth=1.5)
plt.xlabel('Fold Number', fontsize=11)
plt.ylabel('R² Score', fontsize=11)
plt.title('Random Forest Cross-Validation Performance', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim([0, 1])

# Panel 2: Performance distribution
plt.subplot(1, 2, 2)
plt.boxplot([cv_scores], labels=['Random Forest'], widths=0.5)
plt.ylabel('R² Score', fontsize=11)
plt.title('Performance Distribution Across Folds', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')
plt.ylim([0, 1])

plt.tight_layout()
plt.show()

print("=== BUSINESS INTERPRETATION FOR CAPITAL CITY BIKES ===")
print(f"✓ Expected production performance: {cv_mean:.1%} R² (±{cv_std:.1%})")
print(f"✓ Worst-case scenario: {cv_min:.1%} R² (prepare for this in capacity planning)")
print(f"✓ Best-case scenario: {cv_max:.1%} R² (demonstrates model potential)")
print(f"✓ Model reliability: {'High' if cv_std < 0.03 else 'Moderate' if cv_std < 0.06 else 'Variable'}")

**What this does:**
- Creates TimeSeriesSplit with 5 folds using expanding window approach (each fold trains on more historical data)
- Explicitly explains why bootstrap sampling in Random Forest doesn't eliminate need for chronological validation
- Displays fold structure with actual dates showing train/test periods for transparency
- Runs cross_val_score() with TimeSeriesSplit to get robust performance estimates across time
- Calculates mean, standard deviation, and an approximate 95% range (mean ± 1.96 × std) of fold scores as a rough guide to expected production performance
- Visualizes fold-by-fold performance with ±1 standard deviation bands and a distribution boxplot
- Provides business-focused interpretation of consistency and worst/best-case scenarios
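
To make the expanding-window behavior concrete, here is a minimal sketch that prints the index sets `TimeSeriesSplit` produces for a tiny, time-ordered array. The 12-row `X_demo` array is a stand-in assumption; the real folds are built from the tutorial's `X`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations
tscv_demo = TimeSeriesSplit(n_splits=5)

folds = list(tscv_demo.split(X_demo))
for fold_num, (train_idx, test_idx) in enumerate(folds, 1):
    print(f"Fold {fold_num}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each training set is a prefix of the data and each test set starts right after it, so no fold ever trains on observations that come after its test period.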

### Challenge 5: Compare Cross-Validation Stability Across Models

Your client asks: "How does Random Forest's cross-validation stability compare to a single Decision Tree? Does the ensemble really provide more reliable predictions across different time periods?"

**Your Task:** Run the same TimeSeriesSplit cross-validation on both an unlimited Decision Tree and a Random Forest, then compare their consistency.

In [None]:
# Your code here - compare CV stability between Decision Tree and Random Forest

# Model 1: Single Decision Tree (unlimited depth)
tree_model = DecisionTreeRegressor(random_state=42)
cv_scores_tree = cross_val_score(_____, X, y, cv=_____, scoring='r2', n_jobs=-1)

# Model 2: Random Forest (100 trees)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
cv_scores_rf = cross_val_score(_____, X, y, cv=_____, scoring='r2', n_jobs=-1)

# Calculate statistics for both models
tree_mean = cv_scores_tree.mean()
tree_std = cv_scores_tree.std()
rf_mean = cv_scores_rf.mean()
rf_std = cv_scores_rf.std()

print("=== MODEL STABILITY COMPARISON ===")
print(f"\nDecision Tree (Unlimited):")
print(f"  Mean R²: {tree_mean:.4f}")
print(f"  Std Dev: {tree_std:.4f}")
print(f"  Range:   {cv_scores_tree.min():.4f} to {cv_scores_tree.max():.4f}")

print(f"\nRandom Forest (100 trees):")
print(f"  Mean R²: {rf_mean:.4f}")
print(f"  Std Dev: {rf_std:.4f}")
print(f"  Range:   {cv_scores_rf.min():.4f} to {cv_scores_rf.max():.4f}")

# Visualize comparison
plt.figure(figsize=(12, 6))

# Side-by-side fold performance
plt.subplot(1, 2, 1)
x_pos = np.arange(1, len(cv_scores_tree) + 1)
width = 0.35
plt.bar(x_pos - width/2, cv_scores_tree, width, label='Decision Tree',
        color='orange', alpha=0.7)
plt.bar(x_pos + width/2, cv_scores_rf, width, label='Random Forest',
        color='darkgreen', alpha=0.7)
plt.xlabel('Fold Number', fontsize=11)
plt.ylabel('R² Score', fontsize=11)
plt.title('Fold-by-Fold Performance Comparison', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')

# Box plot comparison
plt.subplot(1, 2, 2)
plt.boxplot([cv_scores_tree, cv_scores_rf],
            labels=['Decision Tree', 'Random Forest'],
            widths=0.5)
plt.ylabel('R² Score', fontsize=11)
plt.title('Performance Distribution Comparison', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Stability analysis
stability_improvement = (tree_std - rf_std) / tree_std * 100
performance_improvement = (rf_mean - tree_mean) / tree_mean * 100

print(f"\n=== ENSEMBLE ADVANTAGES ===")
print(f"Performance improvement: {performance_improvement:+.1f}%")
print(f"Stability improvement:   {stability_improvement:+.1f}% (lower std dev)")
print(f"\nConclusion: Random Forest provides {'more' if rf_std < tree_std else 'no more'} consistent predictions than the single tree")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Use the same `tscv = TimeSeriesSplit(n_splits=5)` object for both models so the fold splits are identical and the comparison is fair. Call `cross_val_score(model, X, y, cv=tscv, scoring='r2', n_jobs=-1)` for each model; it returns an array with one score per fold. Compute the standard deviation with `.std()`: a lower value means more consistent performance across time periods. Bar charts (fold by fold) and box plots (distribution) together show both mean performance and variability.

The Decision Tree typically shows higher variability (a larger std dev) because each fold can produce a very different tree structure from that period's data, while Random Forest averages 100 trees, which smooths out much of that variation. Keep in mind that a standard deviation computed from only 5 folds is itself a noisy estimate.

Business insight: lower variability means more predictable production performance. Executives prefer models that consistently deliver the promised accuracy over models that sometimes excel and sometimes fail. If Random Forest shows a 30-50% lower standard deviation, that is strong evidence of ensemble reliability worth communicating to stakeholders.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Model 1: Single Decision Tree (unlimited depth)
tree_model = DecisionTreeRegressor(random_state=42)
cv_scores_tree = cross_val_score(tree_model, X, y, cv=tscv, scoring='r2', n_jobs=-1)

# Model 2: Random Forest (100 trees)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
cv_scores_rf = cross_val_score(rf_model, X, y, cv=tscv, scoring='r2', n_jobs=-1)

# Calculate statistics for both models
tree_mean = cv_scores_tree.mean()
tree_std = cv_scores_tree.std()
rf_mean = cv_scores_rf.mean()
rf_std = cv_scores_rf.std()

print("=== MODEL STABILITY COMPARISON ===")
print(f"\nDecision Tree (Unlimited):")
print(f"  Mean R²: {tree_mean:.4f}")
print(f"  Std Dev: {tree_std:.4f}")
print(f"  Range:   {cv_scores_tree.min():.4f} to {cv_scores_tree.max():.4f}")
print(f"  Coefficient of Variation: {(tree_std/tree_mean)*100:.1f}%")

print(f"\nRandom Forest (100 trees):")
print(f"  Mean R²: {rf_mean:.4f}")
print(f"  Std Dev: {rf_std:.4f}")
print(f"  Range:   {cv_scores_rf.min():.4f} to {cv_scores_rf.max():.4f}")
print(f"  Coefficient of Variation: {(rf_std/rf_mean)*100:.1f}%")

# Visualize comparison
plt.figure(figsize=(12, 6))

# Side-by-side fold performance
plt.subplot(1, 2, 1)
x_pos = np.arange(1, len(cv_scores_tree) + 1)
width = 0.35
plt.bar(x_pos - width/2, cv_scores_tree, width, label='Decision Tree',
        color='orange', alpha=0.7)
plt.bar(x_pos + width/2, cv_scores_rf, width, label='Random Forest',
        color='darkgreen', alpha=0.7)
plt.xlabel('Fold Number', fontsize=11)
plt.ylabel('R² Score', fontsize=11)
plt.title('Fold-by-Fold Performance Comparison', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.ylim([0, 1])

# Box plot comparison
plt.subplot(1, 2, 2)
plt.boxplot([cv_scores_tree, cv_scores_rf],
            labels=['Decision Tree', 'Random Forest'],
            widths=0.5)
plt.ylabel('R² Score', fontsize=11)
plt.title('Performance Distribution Comparison', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')
plt.ylim([0, 1])

plt.tight_layout()
plt.show()

# Stability analysis
stability_improvement = (tree_std - rf_std) / tree_std * 100
performance_improvement = (rf_mean - tree_mean) / tree_mean * 100

print(f"\n=== ENSEMBLE ADVANTAGES ===")
print(f"Performance improvement: {performance_improvement:+.1f}%")
print(f"Stability improvement:   {stability_improvement:+.1f}% (reduced variability)")
print()

print("=== KEY INSIGHTS FOR CAPITAL CITY BIKES ===")
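if stability_improvement > 0:
    print(f"✓ Random Forest's fold-to-fold std dev is {stability_improvement:.0f}% lower than the single tree's")
else:
    print(f"⚠ Random Forest's fold-to-fold std dev is {-stability_improvement:.0f}% higher than the single tree's")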
print(f"✓ Single trees are {'highly' if tree_std > 0.10 else 'moderately'} unstable - performance varies dramatically by period")
print(f"✓ Ensemble averaging smooths temporal variation → reliable production forecasts")
print(f"✓ Lower variability = predictable SLA commitments to executives and investors")
print()

print("RECOMMENDATION:")
if rf_std < tree_std and rf_mean > tree_mean:
    print("✓ Random Forest dominates: Higher mean performance AND lower variability")
    print("✓ Deploy Random Forest for production - delivers consistent, reliable forecasts")
elif rf_mean > tree_mean:
    print("✓ Random Forest provides better average performance")
    print("⚠ Variability is not lower than the single tree's; weigh the accuracy gain against it before deployment")
else:
    print("⚠ Unexpected result - investigate data characteristics and model configuration")
```

</details>

---

## Summary: Production-Grade Tree-Based Ensemble Modeling for Competitive Advantage

**What We've Accomplished:**
- **Engineered advanced features** including binary indicators (is_rush_hour, is_weekend), categorical encodings (temp_category, weather_severity), and interaction terms (temp×hour, workingday×hour), yielding 17 features designed for tree-based pattern discovery
- **Implemented decision trees** demonstrating how unlimited depth leads to severe overfitting (training R² 99%, test R² 45%), revealing why single trees struggle with generalization
- **Deployed Random Forest ensembles** achieving test R² ≈85% through bootstrap aggregation and random feature subsets at each split, shrinking the train-test gap from 54 percentage points (single tree) to ~13 (ensemble)
- **Analyzed feature importance** using MDI and permutation methods, identifying hour, temperature, and workingday interactions as dominant drivers (the top 3 features account for ~80% of cumulative importance)
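
The train-test gap and the two importance measures recapped above can be reproduced in a few lines. The sketch below is a minimal illustration on synthetic data, not the tutorial's bike dataset: the data-generating formula, the random split, and the helper name `r2_gap` are all assumptions made for the example, so the printed numbers will not match the figures quoted in this summary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the bike data (assumption): feature 0 carries most of the signal.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(600, 5))
y = 10 * np.sin(2 * np.pi * X[:, 0]) + 5 * X[:, 1] * X[:, 2] + rng.normal(0, 2, size=600)

# Random split for brevity (assumption); time-ordered data needs a chronological split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
forest = RandomForestRegressor(
    n_estimators=100, max_features="sqrt", bootstrap=True, random_state=42
).fit(X_train, y_train)


def r2_gap(model):
    """Return (train R², test R², gap in percentage points) for a fitted model."""
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    return train_r2, test_r2, (train_r2 - test_r2) * 100


gaps = {"tree": r2_gap(tree), "forest": r2_gap(forest)}
for name, (train_r2, test_r2, gap) in gaps.items():
    print(f"{name:>6}: train R²={train_r2:.3f}  test R²={test_r2:.3f}  gap={gap:.1f} points")

# MDI importance comes from the fitted forest; values are normalized to sum to 1.
mdi = forest.feature_importances_

# Permutation importance is measured on held-out data by shuffling one column at a time.
perm = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=42)

# Cumulative MDI importance, ranked from most to least important feature.
order = np.argsort(mdi)[::-1]
cumulative = np.cumsum(mdi[order])
for rank, idx in enumerate(order, start=1):
    print(
        f"{rank}. feature_{idx}: MDI={mdi[idx]:.3f}  "
        f"permutation={perm.importances_mean[idx]:.3f}  cumulative={cumulative[rank - 1]:.3f}"
    )
```

Reading the cumulative column shows how many features are needed to reach a given share of total MDI importance, which is the quantity behind the "top 3 features" statement above. Comparing the MDI and permutation columns is a quick check on whether the two methods agree on the ranking.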

**Key Technical Skills Mastered:**
- **Feature engineering**: Binary encoding (`.astype(int)`), categorical binning (`pd.cut()`), interaction terms (element-wise multiplication), temporal extraction (`.dt.hour`, `.dt.dayofweek`)
- **Decision trees**: DecisionTreeRegressor implementation, tree structure analysis (`.get_depth()`, `.get_n_leaves()`), visualization (`plot_tree()`), overfitting detection (train-test gap calculation)
- **Random Forest ensembles**: RandomForestRegressor with n_estimators, max_features='sqrt', bootstrap=True; accessing individual estimators (`.estimators_[i]`), ensemble diversity demonstration
- **Feature importance**: MDI extraction (`.feature_importances_`), permutation importance (`permutation_importance()`), cumulative importance analysis, business translation of rankings
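
As a compact reminder of the feature-engineering calls listed above, here is a small sketch on a three-row toy DataFrame. The sample values, the rush-hour windows (7-9 and 17-19), and the temperature bin edges are assumptions chosen for illustration; they are not necessarily the definitions used earlier in this tutorial.

```python
import pandas as pd

# Toy sample (made-up values); column names follow the tutorial's conventions.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-06-04 08:00",  # Monday morning
        "2012-06-04 14:00",  # Monday afternoon
        "2012-06-09 18:00",  # Saturday evening
    ]),
    "temp": [18.0, 29.5, 8.0],
    "workingday": [1, 1, 0],
})

# Temporal extraction
df["hour"] = df["datetime"].dt.hour
df["dayofweek"] = df["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6

# Binary indicators: a boolean mask converted to 0/1 with .astype(int)
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)
# Assumed rush-hour windows for this sketch: 7-9 and 17-19 (inclusive)
df["is_rush_hour"] = (
    df["hour"].between(7, 9) | df["hour"].between(17, 19)
).astype(int)

# Categorical binning with pd.cut (bin edges and labels are illustrative)
df["temp_category"] = pd.cut(
    df["temp"],
    bins=[-float("inf"), 10, 25, float("inf")],
    labels=["cold", "mild", "hot"],
)

# Interaction terms via element-wise multiplication
df["temp_hour_interaction"] = df["temp"] * df["hour"]
df["workingday_hour"] = df["workingday"] * df["hour"]

print(df.drop(columns="datetime"))
```

Tree models can split on `hour` and `temp` separately, but explicit interaction columns let a single split capture a combined condition, which is why they were included in the feature set.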

**Next Steps:**
In Module 5, you'll advance to model evaluation and deployment strategies, mastering performance metrics beyond R² (MAE, MAPE for business reporting), residual analysis for error pattern diagnosis, learning curves for dataset size sufficiency, and production deployment considerations including prediction latency, model versioning, and monitoring strategies.

Your Random Forest model moves Capital City Bikes from linear constraints to non-linear intelligence, reaching a test R² of roughly 85% that positions the company competitively against sophisticated rivals. You've demonstrated the ensemble modeling capabilities, interpretable feature importance analysis, and systematic optimization workflows that distinguish senior ML engineers capable of delivering investor-grade predictive systems!