# Module 3 - Exercise 3: Random Forest

<a href="https://colab.research.google.com/github/jumpingsphinx/jumpingsphinx.github.io/blob/main/notebooks/module3-trees/exercise3-random-forest.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Objectives

By the end of this exercise, you will be able to:

- Understand bootstrap aggregating (bagging)
- Implement Random Forest from scratch
- Analyze feature importance using different methods
- Use out-of-bag (OOB) error for model evaluation
- Compare Random Forest with single decision trees
- Tune hyperparameters for optimal performance

## Prerequisites

- Completion of previous Module 3 exercises
- Understanding of ensemble learning
- Familiarity with variance reduction techniques

## Setup

Run this cell first to import required libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import (
    load_breast_cancer, load_wine, fetch_california_housing,
    make_classification
)
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV
)
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score, roc_auc_score
)
from sklearn.inspection import permutation_importance
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("NumPy version:", np.__version__)
print("Setup complete!")

---

## Part 1: Bootstrap Sampling

### Background

**Bootstrap sampling** is sampling with replacement from the original dataset. It creates diversity among ensemble members.

Key properties:
- Sample size = original dataset size
- Some samples appear multiple times
- About 63.2% of the original samples appear at least once
- Remaining ~36.8% are "out-of-bag" (OOB) samples
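
The 36.8% figure comes from the probability that a given sample is never picked in $n$ draws with replacement, $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$ as $n$ grows. A minimal sketch of that calculation (the helper name `oob_fraction` is illustrative and not part of the exercise):

```python
import math

def oob_fraction(n_samples):
    """Probability that a given sample is never drawn in n_samples draws with replacement."""
    return (1 - 1 / n_samples) ** n_samples

# The fraction settles near 1/e even for modest dataset sizes
for n in (10, 100, 1000, 10000):
    print(f"n={n:>5}: OOB fraction = {oob_fraction(n):.4f}")

print(f"Limit 1/e = {math.exp(-1):.4f}")
```

The in-bag fraction is the complement, roughly 63.2%.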

### Exercise 1.1: Implement Bootstrap Sampling

**Task:** Implement a function to create bootstrap samples.

In [None]:
def bootstrap_sample(X, y, random_state=None):
    """
    Create a bootstrap sample from the dataset.
    
    Parameters:
    -----------
    X : np.ndarray
        Features (n_samples, n_features)
    y : np.ndarray
        Labels (n_samples,)
    random_state : int
        Random seed
    
    Returns:
    --------
    X_bootstrap : np.ndarray
        Bootstrap sample features
    y_bootstrap : np.ndarray
        Bootstrap sample labels
    oob_indices : np.ndarray
        Indices of out-of-bag samples
    """
    if random_state is not None:
        np.random.seed(random_state)
    
    n_samples = X.shape[0]
    
    # Your code here
    # 1. Sample indices with replacement
    indices = np.arange(n_samples)
    bootstrap_indices = np.random.choice(indices, size=n_samples, replace=True)
    
    # 2. Get bootstrap samples
    X_bootstrap = X[bootstrap_indices]
    y_bootstrap = y[bootstrap_indices]
    
    # 3. Find OOB indices (samples not in bootstrap)
    oob_indices = np.setdiff1d(indices, bootstrap_indices)
    
    return X_bootstrap, y_bootstrap, oob_indices

# Test bootstrap sampling
X_test = np.arange(100).reshape(-1, 1)
y_test = np.arange(100)

X_boot, y_boot, oob_idx = bootstrap_sample(X_test, y_test, random_state=42)

print(f"Original samples: {len(X_test)}")
print(f"Bootstrap samples: {len(X_boot)}")
print(f"Unique bootstrap samples: {len(np.unique(y_boot))}")
print(f"OOB samples: {len(oob_idx)}")
print(f"OOB percentage: {len(oob_idx) / len(X_test) * 100:.1f}%")

# Verify some samples appear multiple times
counts = Counter(y_boot)
duplicates = sum(1 for count in counts.values() if count > 1)
print(f"Samples appearing multiple times: {duplicates}")

assert len(X_boot) == len(X_test), "Bootstrap sample should have same size"
assert len(oob_idx) > 0, "Should have OOB samples"
assert 30 <= len(oob_idx) <= 40, "OOB should be ~36.8%"
print("\n✓ Bootstrap sampling implemented correctly!")

### Exercise 1.2: Visualize Bootstrap Distribution

**Task:** Analyze the distribution of bootstrap samples.

In [None]:
# Run multiple bootstrap iterations
n_iterations = 1000
n_samples = 100
oob_percentages = []

for i in range(n_iterations):
    X_temp = np.arange(n_samples).reshape(-1, 1)
    y_temp = np.arange(n_samples)
    _, _, oob_idx = bootstrap_sample(X_temp, y_temp, random_state=i)
    oob_percentages.append(len(oob_idx) / n_samples * 100)

# Plot distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(oob_percentages, bins=30, edgecolor='black', alpha=0.7)
plt.axvline(x=np.mean(oob_percentages), color='red', linestyle='--', 
            linewidth=2, label=f'Mean: {np.mean(oob_percentages):.2f}%')
plt.axvline(x=36.8, color='green', linestyle='--', 
            linewidth=2, label='Theoretical: 36.8%')
plt.xlabel('OOB Percentage (%)')
plt.ylabel('Frequency')
plt.title('Distribution of OOB Percentages\n(1000 Bootstrap Iterations)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
# Count frequency of each sample in one bootstrap
sample_counts = Counter(y_boot)
frequencies = [sample_counts.get(i, 0) for i in range(100)]
plt.bar(range(100), frequencies, alpha=0.7)
plt.xlabel('Sample Index')
plt.ylabel('Frequency in Bootstrap Sample')
plt.title('Sample Frequency in a Single Bootstrap')
plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"Mean OOB percentage: {np.mean(oob_percentages):.2f}%")
print(f"Theoretical OOB percentage: {(1 - 1/n_samples)**n_samples * 100:.2f}%")
print(f"Std of OOB percentage: {np.std(oob_percentages):.2f}%")

---

## Part 2: Bagging Implementation

### Background

**Bagging (Bootstrap Aggregating):**
1. Create B bootstrap samples
2. Train a model on each sample
3. Aggregate predictions:
   - Classification: Majority voting
   - Regression: Average

Benefits:
- Reduces variance
- Reduces the risk of overfitting
- Works well with high-variance models (like deep trees)
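
The aggregation step can be sketched in a few lines of plain Python. This is only an illustration of step 3, assuming each model has already produced a prediction for one sample; the function names and the example values are hypothetical.

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Return the most common label among the ensemble's predictions for one sample."""
    return Counter(labels).most_common(1)[0][0]

def average(values):
    """Return the mean of the ensemble's predictions for one sample."""
    return mean(values)

# Hypothetical predictions from five models for a single sample
class_votes = [1, 0, 1, 1, 0]
reg_outputs = [2.0, 2.5, 1.5, 2.0, 2.0]

print(majority_vote(class_votes))  # 1
print(average(reg_outputs))        # 2.0
```

With an even number of models a vote can tie; `Counter.most_common` then returns the label that was seen first, so an odd ensemble size avoids the ambiguity for binary problems.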

### Exercise 2.1: Implement Bagging Classifier

**Task:** Build a bagging classifier from scratch using decision trees.

In [None]:
class BaggingClassifier:
    def __init__(self, n_estimators=10, max_depth=None, random_state=None):
        """
        Bagging classifier using decision trees.
        
        Parameters:
        -----------
        n_estimators : int
            Number of trees in the ensemble
        max_depth : int
            Maximum depth of each tree
        random_state : int
            Random seed
        """
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.random_state = random_state
        self.trees = []
    
    def fit(self, X, y):
        """
        Fit bagging classifier.
        
        Parameters:
        -----------
        X : np.ndarray
            Training features
        y : np.ndarray
            Training labels
        
        Returns:
        --------
        self
        """
        self.trees = []
        
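        for i in range(self.n_estimators):
            # Your code here
            # Derive a distinct seed per tree so each bootstrap sample differs
            seed = self.random_state + i if self.random_state is not None else None
            
            # 1. Create bootstrap sample
            X_boot, y_boot, _ = bootstrap_sample(X, y, random_state=seed)
            
            # 2. Train a decision tree on bootstrap sample
            tree = DecisionTreeClassifier(max_depth=self.max_depth, random_state=seed)
            tree.fit(X_boot, y_boot)
            
            # 3. Store the fitted tree
            self.trees.append(tree)
        
        return self
    
    def predict(self, X):
        """Predict class labels by majority vote across trees."""
        # Shape (n_samples, n_trees) after transposing
        tree_predictions = np.array([tree.predict(X) for tree in self.trees]).T
        # Assumes non-negative integer labels, as np.bincount requires
        return np.array([np.argmax(np.bincount(p)) for p in tree_predictions])
    
    def score(self, X, y):
        """Calculate accuracy."""
        return np.mean(self.predict(X) == y)
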
print("BaggingClassifier class implemented!")

### Exercise 2.2: Test Bagging on Breast Cancer Dataset

**Task:** Compare single decision tree vs bagging.

In [None]:
# Load breast cancer dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

print("Breast Cancer Dataset:")
print(f"Shape: {X_cancer.shape}")
print(f"Classes: {cancer.target_names}")
print(f"Class distribution: {np.bincount(y_cancer)}")
print()

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

# Train single decision tree
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
single_acc = single_tree.score(X_test, y_test)

# Train bagging classifier
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
bagging_acc = bagging.score(X_test, y_test)

print(f"Single Decision Tree Accuracy: {single_acc:.4f}")
print(f"Bagging (50 trees) Accuracy:   {bagging_acc:.4f}")
print(f"Improvement: {(bagging_acc - single_acc) * 100:.2f}%")

# Analyze effect of number of trees
n_trees_list = [1, 5, 10, 20, 50, 100]
train_scores = []
test_scores = []

for n_trees in n_trees_list:
    model = BaggingClassifier(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(n_trees_list, train_scores, 'o-', label='Train Accuracy', linewidth=2)
plt.plot(n_trees_list, test_scores, 's-', label='Test Accuracy', linewidth=2)
plt.axhline(y=single_acc, color='red', linestyle='--', 
            label=f'Single Tree ({single_acc:.4f})', linewidth=2)
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('Bagging Performance vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xscale('log')
plt.show()

assert bagging_acc >= single_acc, "Bagging should improve or match single tree"
print("\n✓ Bagging successfully implemented and tested!")

---

## Part 3: Feature Randomness

### Background

**Random Forest = Bagging + Random Feature Selection**

At each split in each tree:
- Randomly select $m$ features from $p$ total features
- Find best split using only these $m$ features
- Common choice: $m = \sqrt{p}$ for classification, $m = p/3$ for regression

Benefits:
- De-correlates trees
- Prevents dominant features from always being selected
- Further reduces variance
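
To make the per-split feature draw concrete, here is a small sketch in plain Python. The helper names are illustrative, and the defaults simply follow the common choices listed above; it does not reproduce how any particular library implements the draw.

```python
import math
import random

def n_features_per_split(p, task="classification"):
    """Common default for m: sqrt(p) for classification, p/3 for regression (at least 1)."""
    if task == "classification":
        return max(1, int(math.sqrt(p)))
    return max(1, p // 3)

def candidate_features(p, m, rng):
    """Draw m distinct feature indices out of p, without replacement."""
    return sorted(rng.sample(range(p), m))

rng = random.Random(0)  # local generator; the seed is arbitrary
p = 30                  # e.g. the breast cancer dataset has 30 features
m = n_features_per_split(p)

# A fresh subset is drawn at every split, so different splits see different candidates
for split in range(3):
    print(f"split {split}: candidates = {candidate_features(p, m, rng)}")
```

Because a strong feature is absent from many of these subsets, other features get a chance to define splits, which is what de-correlates the trees.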

### Exercise 3.1: Implement Feature Randomness

**Task:** Add random feature selection to the bagging classifier.

In [None]:
class SimpleRandomForest:
    def __init__(self, n_estimators=10, max_depth=None, 
                 max_features='sqrt', random_state=None):
        """
        Simple Random Forest classifier.
        
        Parameters:
        -----------
        n_estimators : int
            Number of trees
        max_depth : int
            Maximum tree depth
        max_features : str or int
            Number of features to consider at each split
            - 'sqrt': sqrt(n_features)
            - 'log2': log2(n_features)
            - int: exact number
        random_state : int
            Random seed
        """
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.random_state = random_state
        self.trees = []
    
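    def _get_max_features(self, n_features):
        """
        Calculate number of features to use at each split.
        """
        # Your code here
        if self.max_features == 'sqrt':
            return int(np.sqrt(n_features))
        elif self.max_features == 'log2':
            return int(np.log2(n_features))
        elif isinstance(self.max_features, int):
            return min(self.max_features, n_features)
        else:
            return n_features
    
    def fit(self, X, y):
        """
        Fit the forest: one tree per bootstrap sample, with a random feature subset at each split.
        """
        self.trees = []
        max_features = self._get_max_features(X.shape[1])
        
        for i in range(self.n_estimators):
            # Your code here
            seed = self.random_state + i if self.random_state is not None else None
            X_boot, y_boot, _ = bootstrap_sample(X, y, random_state=seed)
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_features=max_features,
                random_state=seed
            )
            tree.fit(X_boot, y_boot)
            self.trees.append(tree)
        
        return self
    
    def predict(self, X):
        """
        Predict class labels by majority vote.
        """
        # Your code here
        # 1. Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])
        
        # 2. Majority voting for each sample
        # Transpose to get (n_samples, n_trees)
        tree_predictions = tree_predictions.T
        
        predictions = []
        for sample_preds in tree_predictions:
            # Find most frequent prediction
            counts = np.bincount(sample_preds)
            predictions.append(np.argmax(counts))
        
        return np.array(predictions)
    
    def score(self, X, y):
        """
        Calculate accuracy.
        """
        return np.mean(self.predict(X) == y)
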
print("SimpleRandomForest class implemented!")

### Exercise 3.2: Compare Bagging vs Random Forest

**Task:** Demonstrate the benefit of feature randomness.

In [None]:
# Compare bagging (all features) vs random forest (random features)
n_features = X_train.shape[1]

# Bagging (uses all features)
bagging_full = SimpleRandomForest(
    n_estimators=50, 
    max_features=n_features,  # All features
    random_state=42
)
bagging_full.fit(X_train, y_train)
bagging_full_acc = bagging_full.score(X_test, y_test)

# Random Forest (uses sqrt(features))
rf_sqrt = SimpleRandomForest(
    n_estimators=50, 
    max_features='sqrt',  # Random feature subset
    random_state=42
)
rf_sqrt.fit(X_train, y_train)
rf_sqrt_acc = rf_sqrt.score(X_test, y_test)

# Random Forest (uses log2(features))
rf_log2 = SimpleRandomForest(
    n_estimators=50, 
    max_features='log2',
    random_state=42
)
rf_log2.fit(X_train, y_train)
rf_log2_acc = rf_log2.score(X_test, y_test)

print(f"Total features: {n_features}")
print(f"sqrt(features): {int(np.sqrt(n_features))}")
print(f"log2(features): {int(np.log2(n_features))}")
print()
print(f"Bagging (all {n_features} features):        {bagging_full_acc:.4f}")
print(f"Random Forest (sqrt={int(np.sqrt(n_features))} features): {rf_sqrt_acc:.4f}")
print(f"Random Forest (log2={int(np.log2(n_features))} features): {rf_log2_acc:.4f}")

# Visualize comparison
models = ['Single Tree', 'Bagging\n(all features)', 
          'RF (sqrt)', 'RF (log2)']
scores = [single_acc, bagging_full_acc, rf_sqrt_acc, rf_log2_acc]

plt.figure(figsize=(10, 6))
bars = plt.bar(models, scores, color=['red', 'orange', 'green', 'blue'], alpha=0.7)
plt.ylabel('Test Accuracy')
plt.title('Comparison: Single Tree vs Bagging vs Random Forest')
plt.ylim([0.9, 1.0])
plt.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, score in zip(bars, scores):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'{score:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\n✓ Feature randomness successfully implemented!")

---

## Part 4: Random Forest from Scratch - Complete Implementation

### Exercise 4.1: Build Complete Random Forest

**Task:** Combine everything into a complete Random Forest classifier with OOB scoring.

In [None]:
class RandomForestFromScratch:
    def __init__(self, n_estimators=100, max_depth=None, 
                 max_features='sqrt', min_samples_split=2,
                 random_state=None):
        """
        Complete Random Forest implementation.
        
        Parameters:
        -----------
        n_estimators : int
            Number of trees
        max_depth : int
            Maximum tree depth
        max_features : str or int
            Features to consider at each split
        min_samples_split : int
            Minimum samples required to split
        random_state : int
            Random seed
        """
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.min_samples_split = min_samples_split
        self.random_state = random_state
        self.trees = []
        self.oob_score_ = None
    
    def _get_max_features(self, n_features):
        """Calculate number of features."""
        if self.max_features == 'sqrt':
            return int(np.sqrt(n_features))
        elif self.max_features == 'log2':
            return int(np.log2(n_features))
        elif isinstance(self.max_features, int):
            return min(self.max_features, n_features)
        else:
            return n_features
    
    def fit(self, X, y):
        """
        Fit random forest.
        """
        self.trees = []
        self.oob_predictions_ = np.zeros((X.shape[0], len(np.unique(y))))
        self.oob_counts_ = np.zeros(X.shape[0])
        
        n_features = X.shape[1]
        max_features = self._get_max_features(n_features)
        
        for i in range(self.n_estimators):
            # Bootstrap and fit
            seed = self.random_state + i if self.random_state is not None else None
            X_boot, y_boot, oob_idx = bootstrap_sample(X, y, random_state=seed)
            
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_features=max_features,
                min_samples_split=self.min_samples_split,
                random_state=seed
            )
            tree.fit(X_boot, y_boot)
            self.trees.append(tree)
            
            # OOB predictions
            if len(oob_idx) > 0:
                oob_pred = tree.predict_proba(X[oob_idx])
                self.oob_predictions_[oob_idx] += oob_pred
                self.oob_counts_[oob_idx] += 1
        
        # Calculate OOB score
        oob_mask = self.oob_counts_ > 0
        if np.sum(oob_mask) > 0:
            oob_pred = self.oob_predictions_[oob_mask] / self.oob_counts_[oob_mask, np.newaxis]
            oob_pred_labels = np.argmax(oob_pred, axis=1)
            self.oob_score_ = np.mean(oob_pred_labels == y[oob_mask])
        
        return self
    
    def predict_proba(self, X):
        """
        Predict class probabilities.
        """
        # Average probabilities from all trees
        probas = np.mean([tree.predict_proba(X) for tree in self.trees], axis=0)
        return probas
    
    def predict(self, X):
        """
        Predict class labels.
        """
        probas = self.predict_proba(X)
        return np.argmax(probas, axis=1)
    
    def score(self, X, y):
        """
        Calculate accuracy.
        """
        predictions = self.predict(X)
        return np.mean(predictions == y)

# Test the implementation
rf_scratch = RandomForestFromScratch(
    n_estimators=100, 
    max_features='sqrt',
    random_state=42
)
rf_scratch.fit(X_train, y_train)
scratch_acc = rf_scratch.score(X_test, y_test)
scratch_oob = rf_scratch.oob_score_

print(f"Random Forest from Scratch:")
print(f"Test Accuracy: {scratch_acc:.4f}")
print(f"OOB Accuracy:  {scratch_oob:.4f}")
print("\n✓ Complete Random Forest implemented from scratch!")

---

## Part 5: Scikit-learn Random Forest on Breast Cancer

### Exercise 5.1: Train and Evaluate Sklearn Random Forest

**Task:** Use scikit-learn's production-ready Random Forest.

In [None]:
# Train sklearn Random Forest
rf_sklearn = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',
    oob_score=True,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

rf_sklearn.fit(X_train, y_train)

# Evaluate
train_acc = rf_sklearn.score(X_train, y_train)
test_acc = rf_sklearn.score(X_test, y_test)
oob_acc = rf_sklearn.oob_score_

y_pred = rf_sklearn.predict(X_test)
y_proba = rf_sklearn.predict_proba(X_test)[:, 1]

print("Sklearn Random Forest Results:")
print(f"Train Accuracy: {train_acc:.4f}")
print(f"Test Accuracy:  {test_acc:.4f}")
print(f"OOB Accuracy:   {oob_acc:.4f}")
print()

# Detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
im = axes[0].imshow(cm, cmap='Blues')
plt.colorbar(im, ax=axes[0])
axes[0].set_xticks([0, 1])
axes[0].set_yticks([0, 1])
axes[0].set_xticklabels(cancer.target_names)
axes[0].set_yticklabels(cancer.target_names)
for i in range(2):
    for j in range(2):
        axes[0].text(j, i, cm[i, j], ha="center", va="center", 
                    color="red", fontsize=20)
axes[0].set_title('Confusion Matrix')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# ROC Curve
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

axes[1].plot(fpr, tpr, linewidth=2, label=f'ROC (AUC = {roc_auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', linewidth=2)
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nROC-AUC Score: {roc_auc:.4f}")
print("✓ Sklearn Random Forest successfully applied!")

---

## Part 6: Out-of-Bag (OOB) Error Estimation

### Background

**OOB Error:**
- For each sample, predict using only trees that didn't include it in training
- Provides an approximately unbiased error estimate without a separate validation set
- Similar to cross-validation but "free"
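
The bookkeeping behind the first bullet can be sketched without any model at all: for every tree, record which samples its bootstrap draw missed, then invert that mapping to find which trees may vote on each sample. The function name and the sizes below are illustrative assumptions.

```python
import random

def oob_membership(n_samples, n_trees, seed=0):
    """For each tree, draw a bootstrap sample and record which samples it never saw."""
    rng = random.Random(seed)
    oob_sets = []
    for _ in range(n_trees):
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        oob_sets.append(set(range(n_samples)) - in_bag)
    return oob_sets

n_samples, n_trees = 20, 25
oob_sets = oob_membership(n_samples, n_trees)

# For each sample, list the trees allowed to vote on it (those that did not train on it)
voters = {i: [t for t, oob in enumerate(oob_sets) if i in oob] for i in range(n_samples)}

covered = sum(1 for trees in voters.values() if trees)
print(f"Samples with at least one OOB tree: {covered}/{n_samples}")
print(f"Trees voting on sample 0: {voters[0]}")
```

With very few trees, some samples may have no OOB tree at all, which is why OOB estimates are noisy for small forests.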

### Exercise 6.1: Analyze OOB Score

**Task:** Compare OOB score with test score across different n_estimators.

In [None]:
# Vary number of trees
n_trees_range = [10, 20, 50, 100, 200, 300]
oob_scores = []
test_scores = []

for n_trees in n_trees_range:
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,
        random_state=42,
        n_jobs=-1
    )
    rf.fit(X_train, y_train)
    
    oob_scores.append(rf.oob_score_)
    test_scores.append(rf.score(X_test, y_test))

# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(n_trees_range, oob_scores, 'o-', label='OOB Score', linewidth=2, markersize=8)
plt.plot(n_trees_range, test_scores, 's-', label='Test Score', linewidth=2, markersize=8)
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('OOB Score vs Test Score')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("OOB vs Test Scores:")
print("n_trees  |  OOB Score  |  Test Score  |  Difference")
print("-" * 55)
for n, oob, test in zip(n_trees_range, oob_scores, test_scores):
    diff = abs(oob - test)
    print(f"{n:7d}  |  {oob:.4f}      |  {test:.4f}      |  {diff:.4f}")

print("\nObservations:")
print("- OOB score closely approximates test score")
print("- Both stabilize as number of trees increases")
print("- OOB provides free validation estimate")

### Exercise 6.2: Demonstrate OOB as Validation Alternative

**Task:** Show OOB can replace validation set for model selection.

In [None]:
# Compare: using validation set vs using OOB

# Method 1: Traditional train/val/test split
X_temp, X_test_2, y_temp, y_test_2 = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42
)
X_train_2, X_val, y_train_2, y_val = train_test_split(
    X_temp, y_temp, test_size=0.2, random_state=42
)

print(f"Train set: {len(X_train_2)} samples")
print(f"Val set:   {len(X_val)} samples")
print(f"Test set:  {len(X_test_2)} samples")
print()

# Find best max_depth using validation set
max_depths = [3, 5, 10, 15, 20, None]
val_scores = []

for depth in max_depths:
    rf = RandomForestClassifier(
        n_estimators=100, max_depth=depth, random_state=42, n_jobs=-1
    )
    rf.fit(X_train_2, y_train_2)
    val_scores.append(rf.score(X_val, y_val))

best_depth_val = max_depths[np.argmax(val_scores)]
print(f"Best max_depth (validation): {best_depth_val}")

# Method 2: Using full training set with OOB
X_train_full, X_test_3, y_train_full, y_test_3 = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42
)

oob_scores_depth = []

for depth in max_depths:
    rf = RandomForestClassifier(
        n_estimators=100, max_depth=depth, oob_score=True, 
        random_state=42, n_jobs=-1
    )
    rf.fit(X_train_full, y_train_full)
    oob_scores_depth.append(rf.oob_score_)

best_depth_oob = max_depths[np.argmax(oob_scores_depth)]
print(f"Best max_depth (OOB):        {best_depth_oob}")

# Compare final test performance
rf_val = RandomForestClassifier(
    n_estimators=100, max_depth=best_depth_val, random_state=42, n_jobs=-1
)
rf_val.fit(X_train_2, y_train_2)
test_acc_val = rf_val.score(X_test_2, y_test_2)

rf_oob = RandomForestClassifier(
    n_estimators=100, max_depth=best_depth_oob, random_state=42, n_jobs=-1
)
rf_oob.fit(X_train_full, y_train_full)
test_acc_oob = rf_oob.score(X_test_3, y_test_3)

print()
print(f"Test accuracy (validation approach): {test_acc_val:.4f}")
print(f"Test accuracy (OOB approach):        {test_acc_oob:.4f}")
print()
print("Advantages of OOB:")
print(f"- Uses more training data: {len(X_train_full)} vs {len(X_train_2)}")
print("- No need for separate validation set")
print("- Free validation estimate during training")

print("\n✓ OOB successfully demonstrated as validation alternative!")

---

## Part 7: Feature Importance on Wine Dataset

### Background

**Two Types of Feature Importance:**

1. **MDI (Mean Decrease in Impurity)**: Built into Random Forest
   - Based on reduction in Gini impurity
   - Fast to compute
   - Can be biased toward high-cardinality features

2. **Permutation Importance**: Model-agnostic
   - Measures performance drop when feature is shuffled
   - More reliable but slower
   - Not biased by feature cardinality
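
The permutation idea is simple enough to sketch by hand. The version below is a plain-Python illustration on toy data, not scikit-learn's implementation; `permutation_importance_sketch`, `toy_model`, and the toy dataset are all hypothetical stand-ins.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows where predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: the label depends only on feature 0; feature 1 is noise
data_rng = random.Random(42)
X_toy = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_toy = [int(row[0] > 0.5) for row in X_toy]

def toy_model(row):
    """Stand-in for a fitted model; uses only feature 0."""
    return int(row[0] > 0.5)

print(permutation_importance_sketch(toy_model, X_toy, y_toy))
```

Shuffling the noise feature leaves every prediction unchanged, so its importance is exactly zero, while shuffling feature 0 pushes accuracy toward chance.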

### Exercise 7.1: Compare MDI and Permutation Importance

**Task:** Calculate and compare both importance measures on Wine dataset.

In [None]:
# Load wine dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

print("Wine Dataset:")
print(f"Shape: {X_wine.shape}")
print(f"Classes: {wine.target_names}")
print(f"Features: {len(wine.feature_names)}")
print()

# Split data
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
)

# Train Random Forest
rf_wine = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
rf_wine.fit(X_train_w, y_train_w)

print(f"Test Accuracy: {rf_wine.score(X_test_w, y_test_w):.4f}")
print()

# 1. MDI (Mean Decrease in Impurity) - built-in
mdi_importances = rf_wine.feature_importances_

# 2. Permutation Importance
perm_importance = permutation_importance(
    rf_wine, X_test_w, y_test_w, 
    n_repeats=10, random_state=42, n_jobs=-1
)
perm_importances = perm_importance.importances_mean

# Sort by MDI importance
indices_mdi = np.argsort(mdi_importances)[::-1]
indices_perm = np.argsort(perm_importances)[::-1]

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# MDI Importance
axes[0].barh(range(len(mdi_importances)), mdi_importances[indices_mdi], alpha=0.7)
axes[0].set_yticks(range(len(mdi_importances)))
axes[0].set_yticklabels([wine.feature_names[i] for i in indices_mdi])
axes[0].set_xlabel('MDI Importance')
axes[0].set_title('Feature Importance (Mean Decrease in Impurity)')
axes[0].grid(True, alpha=0.3, axis='x')
axes[0].invert_yaxis()

# Permutation Importance
axes[1].barh(range(len(perm_importances)), perm_importances[indices_perm], 
            color='green', alpha=0.7)
axes[1].set_yticks(range(len(perm_importances)))
axes[1].set_yticklabels([wine.feature_names[i] for i in indices_perm])
axes[1].set_xlabel('Permutation Importance')
axes[1].set_title('Feature Importance (Permutation)')
axes[1].grid(True, alpha=0.3, axis='x')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

# Print top 5 features by each method
print("Top 5 Features by MDI:")
for i in range(5):
    idx = indices_mdi[i]
    print(f"{i+1}. {wine.feature_names[idx]:25s}: {mdi_importances[idx]:.4f}")

print("\nTop 5 Features by Permutation:")
for i in range(5):
    idx = indices_perm[i]
    print(f"{i+1}. {wine.feature_names[idx]:25s}: {perm_importances[idx]:.4f}")

print("\n✓ Feature importance successfully calculated!")

### Exercise 7.2: Feature Selection Using Importance

**Task:** Use feature importance for feature selection.

In [None]:
# Select top k features based on importance
k_values = [3, 5, 7, 10, 13]  # 13 = all features
test_scores_mdi = []
test_scores_perm = []

for k in k_values:
    # Using MDI importance
    top_k_mdi = indices_mdi[:k]
    X_train_selected_mdi = X_train_w[:, top_k_mdi]
    X_test_selected_mdi = X_test_w[:, top_k_mdi]
    
    rf_mdi = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    rf_mdi.fit(X_train_selected_mdi, y_train_w)
    test_scores_mdi.append(rf_mdi.score(X_test_selected_mdi, y_test_w))
    
    # Using Permutation importance
    top_k_perm = indices_perm[:k]
    X_train_selected_perm = X_train_w[:, top_k_perm]
    X_test_selected_perm = X_test_w[:, top_k_perm]
    
    rf_perm = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    rf_perm.fit(X_train_selected_perm, y_train_w)
    test_scores_perm.append(rf_perm.score(X_test_selected_perm, y_test_w))

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(k_values, test_scores_mdi, 'o-', label='MDI Selection', 
         linewidth=2, markersize=8)
plt.plot(k_values, test_scores_perm, 's-', label='Permutation Selection', 
         linewidth=2, markersize=8)
plt.xlabel('Number of Top Features')
plt.ylabel('Test Accuracy')
plt.title('Model Performance vs Number of Features')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(k_values)
plt.show()

print("Performance with Different Feature Counts:")
print("k    |  MDI     |  Perm    ")
print("-" * 30)
for k, mdi_score, perm_score in zip(k_values, test_scores_mdi, test_scores_perm):
    print(f"{k:2d}   |  {mdi_score:.4f}  |  {perm_score:.4f}")

print("\nObservations:")
print("- Small feature subset can achieve good performance")
print("- Feature importance helps identify most informative features")
print("- Can reduce dimensionality and training time")

---

## Part 8: Hyperparameter Tuning

### Background

**Key Random Forest Hyperparameters:**
- `n_estimators`: Number of trees (more is usually better)
- `max_features`: Features considered at each split
- `max_depth`: Maximum tree depth
- `min_samples_split`: Minimum samples to split a node
- `min_samples_leaf`: Minimum samples in a leaf

### Exercise 8.1: Grid Search for Optimal Hyperparameters

**Task:** Use GridSearchCV to find best hyperparameters.

In [None]:
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

print("Parameter Grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")

total_combinations = np.prod([len(v) for v in param_grid.values()])
print(f"\nTotal combinations: {total_combinations}")
print("Running Grid Search (this may take a minute)...\n")

# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

# Results
print("\nBest Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nBest CV Score: {grid_search.best_score_:.4f}")

# Evaluate on test set
best_rf = grid_search.best_estimator_
test_score = best_rf.score(X_test, y_test)
print(f"Test Score:    {test_score:.4f}")

# Compare with default parameters
default_rf = RandomForestClassifier(random_state=42, n_jobs=-1)
default_rf.fit(X_train, y_train)
default_score = default_rf.score(X_test, y_test)

print(f"\nDefault RF Test Score: {default_score:.4f}")
print(f"Tuned RF Test Score:   {test_score:.4f}")
print(f"Improvement:           {(test_score - default_score) * 100:.2f}%")

print("\n✓ Hyperparameter tuning completed!")

### Exercise 8.2: Visualize Hyperparameter Effects

**Task:** Study the effect of individual hyperparameters.

In [None]:
# Study effect of n_estimators and max_depth
n_estimators_range = [10, 25, 50, 100, 150, 200]
max_depth_range = [5, 10, 15, 20, None]

# Effect of n_estimators
scores_n_est = []
for n_est in n_estimators_range:
    rf = RandomForestClassifier(n_estimators=n_est, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    scores_n_est.append(rf.score(X_test, y_test))

# Effect of max_depth
scores_depth = []
for depth in max_depth_range:
    rf = RandomForestClassifier(max_depth=depth, n_estimators=100, 
                                random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    scores_depth.append(rf.score(X_test, y_test))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# n_estimators effect
axes[0].plot(n_estimators_range, scores_n_est, 'o-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Estimators')
axes[0].set_ylabel('Test Accuracy')
axes[0].set_title('Effect of n_estimators')
axes[0].grid(True, alpha=0.3)

# max_depth effect
depth_labels = [str(d) if d is not None else 'None' for d in max_depth_range]
axes[1].plot(range(len(max_depth_range)), scores_depth, 'o-', 
            linewidth=2, markersize=8, color='green')
axes[1].set_xticks(range(len(max_depth_range)))
axes[1].set_xticklabels(depth_labels)
axes[1].set_xlabel('Max Depth')
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Effect of max_depth')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key Insights:")
print("- n_estimators: More trees generally improve performance but with diminishing returns")
print("- max_depth: Too shallow underfits, too deep may overfit")
print("- Random Forest is relatively robust to hyperparameter choices")

---

## Part 9: Random Forest Regression on California Housing

### Background

Random Forest also handles regression, with a few changes from classification:
- Each tree predicts a continuous value
- Final prediction = average of all tree predictions
- Splits minimize MSE (variance reduction) instead of Gini impurity
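
The averaging step can be checked directly against scikit-learn. The sketch below assumes a small synthetic dataset from `make_regression` (not part of this exercise) and hypothetical names such as `rf_demo`; it compares the forest's prediction with the mean of its individual trees' predictions, taken from the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data used only for illustration (an assumption, not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=20, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Stack per-tree predictions: shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X_demo) for tree in rf_demo.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should agree with the manual average
forest_prediction = rf_demo.predict(X_demo)
print("Matches forest prediction:", np.allclose(manual_average, forest_prediction))
```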

### Exercise 9.1: Apply Random Forest Regression

**Task:** Predict housing prices using Random Forest Regressor.

In [None]:
# Load California Housing dataset
housing = fetch_california_housing()
X_housing = housing.data
y_housing = housing.target

print("California Housing Dataset:")
print(f"Shape: {X_housing.shape}")
print(f"Features: {housing.feature_names}")
print(f"Target: Median house value (in $100,000s)")
print(f"Target range: [{y_housing.min():.2f}, {y_housing.max():.2f}]")
print()

# Split data
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42
)

# Train Random Forest Regressor
# Your code here
# The cells below expect three names to be defined:
#   rf_regressor - a fitted RandomForestRegressor
#   y_train_pred - its predictions on X_train_h
#   y_test_pred  - its predictions on X_test_h

# Evaluate
train_mse = mean_squared_error(y_train_h, y_train_pred)
test_mse = mean_squared_error(y_test_h, y_test_pred)
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
train_r2 = r2_score(y_train_h, y_train_pred)
test_r2 = r2_score(y_test_h, y_test_pred)

print("Random Forest Regressor Results:")
print(f"Train RMSE: ${train_rmse * 100000:.2f}")
print(f"Test RMSE:  ${test_rmse * 100000:.2f}")
print(f"Train R²:   {train_r2:.4f}")
print(f"Test R²:    {test_r2:.4f}")

# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Predicted vs Actual
axes[0].scatter(y_test_h, y_test_pred, alpha=0.5, s=10)
axes[0].plot([y_test_h.min(), y_test_h.max()], 
            [y_test_h.min(), y_test_h.max()], 
            'r--', linewidth=2, label='Perfect prediction')
axes[0].set_xlabel('Actual Price ($100k)')
axes[0].set_ylabel('Predicted Price ($100k)')
axes[0].set_title('Predicted vs Actual Prices')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residuals
residuals = y_test_h - y_test_pred
axes[1].scatter(y_test_pred, residuals, alpha=0.5, s=10)
axes[1].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Predicted Price ($100k)')
axes[1].set_ylabel('Residuals ($100k)')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Random Forest Regression successfully applied!")

### Exercise 9.2: Feature Importance for Regression

**Task:** Identify most important features for housing price prediction.

In [None]:
# Get feature importances
importances = rf_regressor.feature_importances_
indices = np.argsort(importances)[::-1]

# Plot
plt.figure(figsize=(10, 6))
plt.barh(range(len(importances)), importances[indices], alpha=0.7)
plt.yticks(range(len(importances)), [housing.feature_names[i] for i in indices])
plt.xlabel('Feature Importance')
plt.title('Feature Importance for Housing Price Prediction')
plt.grid(True, alpha=0.3, axis='x')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("Feature Importances (sorted):")
for i in range(len(importances)):
    idx = indices[i]
    print(f"{i+1}. {housing.feature_names[idx]:15s}: {importances[idx]:.4f}")

print("\nInterpretation:")
print("- MedInc (median income) is typically the most important predictor")
print("- Location features (latitude, longitude) are usually important as well")
print("- These align with domain knowledge about housing prices")

---

## Part 10: Imbalanced Classification

### Background

**Handling Imbalanced Data with Random Forest:**
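- `class_weight='balanced'`: Automatically sets weights inversely proportional to class frequencies
- `class_weight={0: w0, 1: w1}`: Custom weights
- Alternative: Resample the training data (oversample the minority class or undersample the majority class). Stratified splitting keeps class ratios consistent between train and test sets, but does not by itself correct the imbalance.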
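
To make the "inversely proportional" rule concrete, here is a minimal sketch on toy labels (`y_toy` is a made-up array, not the exercise data). It computes the balanced weights by hand as `n_samples / (n_classes * count_per_class)` and compares them with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative only): 90 samples of class 0 and 10 samples of class 1
y_toy = np.array([0] * 90 + [1] * 10)

# Manual formula: n_samples / (n_classes * count_per_class)
counts = np.bincount(y_toy)
manual_weights = len(y_toy) / (len(counts) * counts)

# scikit-learn's helper for the same 'balanced' heuristic
sk_weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_toy
)

print("Manual: ", manual_weights)
print("sklearn:", sk_weights)
```

With a 9:1 imbalance, each minority sample is weighted nine times as heavily as a majority sample, so both classes contribute the same total weight.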

### Exercise 10.1: Create and Handle Imbalanced Dataset

**Task:** Demonstrate class_weight parameter on imbalanced data.

In [None]:
# Create imbalanced dataset
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.9, 0.1],  # 90% class 0, 10% class 1
    flip_y=0.01,
    random_state=42
)

print("Imbalanced Dataset:")
print(f"Total samples: {len(y_imb)}")
print(f"Class 0: {np.sum(y_imb == 0)} ({np.sum(y_imb == 0) / len(y_imb) * 100:.1f}%)")
print(f"Class 1: {np.sum(y_imb == 1)} ({np.sum(y_imb == 1) / len(y_imb) * 100:.1f}%)")
print(f"Imbalance ratio: {np.sum(y_imb == 0) / np.sum(y_imb == 1):.1f}:1")
print()

# Split data
X_train_imb, X_test_imb, y_train_imb, y_test_imb = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

# Model 1: Standard Random Forest (no weighting)
rf_no_weight = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
rf_no_weight.fit(X_train_imb, y_train_imb)
y_pred_no_weight = rf_no_weight.predict(X_test_imb)

# Model 2: Balanced class weights
rf_balanced = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)
rf_balanced.fit(X_train_imb, y_train_imb)
y_pred_balanced = rf_balanced.predict(X_test_imb)

# Compare results
print("Results Without Class Weighting:")
print(classification_report(y_test_imb, y_pred_no_weight, 
                          target_names=['Class 0', 'Class 1']))

print("\nResults With Balanced Class Weighting:")
print(classification_report(y_test_imb, y_pred_balanced, 
                          target_names=['Class 0', 'Class 1']))

# Visualize confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# No weighting
cm_no_weight = confusion_matrix(y_test_imb, y_pred_no_weight)
im1 = axes[0].imshow(cm_no_weight, cmap='Blues')
plt.colorbar(im1, ax=axes[0])
axes[0].set_xticks([0, 1])
axes[0].set_yticks([0, 1])
axes[0].set_xticklabels(['Class 0', 'Class 1'])
axes[0].set_yticklabels(['Class 0', 'Class 1'])
for i in range(2):
    for j in range(2):
        axes[0].text(j, i, cm_no_weight[i, j], ha="center", va="center", 
                    color="red", fontsize=20)
axes[0].set_title('Without Class Weighting')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# Balanced weighting
cm_balanced = confusion_matrix(y_test_imb, y_pred_balanced)
im2 = axes[1].imshow(cm_balanced, cmap='Blues')
plt.colorbar(im2, ax=axes[1])
axes[1].set_xticks([0, 1])
axes[1].set_yticks([0, 1])
axes[1].set_xticklabels(['Class 0', 'Class 1'])
axes[1].set_yticklabels(['Class 0', 'Class 1'])
for i in range(2):
    for j in range(2):
        axes[1].text(j, i, cm_balanced[i, j], ha="center", va="center", 
                    color="red", fontsize=20)
axes[1].set_title('With Balanced Class Weighting')
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

# Calculate and compare metrics for minority class
from sklearn.metrics import precision_score, recall_score, f1_score

print("\nMinority Class (Class 1) Metrics:")
print("Metric          |  No Weight  |  Balanced  |  Improvement")
print("-" * 60)

metrics = ['Precision', 'Recall', 'F1 Score']
no_weight_scores = [
    precision_score(y_test_imb, y_pred_no_weight),
    recall_score(y_test_imb, y_pred_no_weight),
    f1_score(y_test_imb, y_pred_no_weight)
]
balanced_scores = [
    precision_score(y_test_imb, y_pred_balanced),
    recall_score(y_test_imb, y_pred_balanced),
    f1_score(y_test_imb, y_pred_balanced)
]

for metric, no_w, bal in zip(metrics, no_weight_scores, balanced_scores):
    improvement = ((bal - no_w) / no_w * 100) if no_w > 0 else 0
    print(f"{metric:15s} |  {no_w:.4f}     |  {bal:.4f}    |  {improvement:+.1f}%")

print("\nKey Observations:")
print("- Without weighting: the model tends to under-detect the minority class (lower recall)")
print("- With balanced weights: minority class recall often improves; compare the table above")
print("- Trade-off: usually more false positives in exchange for more minority class detections")

print("\n✓ Class weighting successfully demonstrated!")

---

## Challenge Problems (Optional)

### Challenge 1: Implement Random Forest Regressor from Scratch

Extend the classification implementation to handle regression.

In [None]:
class RandomForestRegressorScratch:
    """
    Random Forest Regressor from scratch.
    
    Hint: Main difference from classifier:
    - Use DecisionTreeRegressor instead of DecisionTreeClassifier
    - Predict by averaging tree predictions instead of voting
    """
    def __init__(self, n_estimators=100, max_depth=None, 
                 max_features='sqrt', random_state=None):
        # Your code here
        pass
    
    def fit(self, X, y):
        # Your code here
        # Hint: for each tree, draw a bootstrap sample with
        # bootstrap_sample(X, y, random_state=seed), then fit a DecisionTreeRegressor
        pass
    
    def predict(self, X):
        # Your code here
        pass

print("Challenge 1: Implement Random Forest Regressor from scratch!")

### Challenge 2: Extremely Randomized Trees (Extra-Trees)

Compare Random Forest with Extra-Trees, which draws split thresholds at random instead of searching for the optimal one.
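
The difference in how a threshold is chosen can be illustrated on a single toy feature. This is a simplified sketch, not scikit-learn's implementation: the helpers `gini` and `weighted_gini` and the arrays `x` and `y` are hypothetical, and a real Extra-Trees split also draws random thresholds for several candidate features and keeps the best of those.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a binary label array."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    """Size-weighted Gini impurity of the two children after splitting at threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy feature and labels (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Random Forest style: evaluate every midpoint and keep the best one
candidates = (x[:-1] + x[1:]) / 2
best_threshold = min(candidates, key=lambda t: weighted_gini(x, y, t))

# Extra-Trees style: draw one threshold uniformly between the feature's min and max
rng = np.random.default_rng(0)
random_threshold = rng.uniform(x.min(), x.max())

print(f"Best threshold:   {best_threshold:.2f} "
      f"(impurity {weighted_gini(x, y, best_threshold):.3f})")
print(f"Random threshold: {random_threshold:.2f} "
      f"(impurity {weighted_gini(x, y, random_threshold):.3f})")
```

Skipping the exhaustive search is what makes Extra-Trees cheaper to train, at the cost of individually weaker splits.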

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

# Your task: Compare Random Forest vs Extra-Trees
# Hint: Extra-Trees are faster to train but may have slightly lower accuracy

# Your code here
print("Challenge 2: Compare Random Forest with Extra-Trees!")

### Challenge 3: Implement Weighted Voting

Instead of simple majority voting, weight each tree's vote by its accuracy.
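
As a small numeric illustration of the idea (not a solution to the challenge), the sketch below uses made-up votes and made-up accuracies for three trees on one sample. It shows a case where the weighted vote disagrees with the simple majority.

```python
import numpy as np

# Hypothetical votes from three trees for a single sample (class labels)
tree_votes = np.array([0, 1, 1])
# Hypothetical per-tree validation accuracies, used as vote weights
tree_weights = np.array([0.9, 0.5, 0.3])

# Simple majority: count votes per class
majority_class = np.argmax(np.bincount(tree_votes, minlength=2))

# Weighted vote: sum the weights of the trees voting for each class
weighted_totals = np.bincount(tree_votes, weights=tree_weights, minlength=2)
weighted_class = np.argmax(weighted_totals)

print("Majority vote:  ", majority_class)
print("Weighted totals:", weighted_totals)
print("Weighted vote:  ", weighted_class)
```

Here two weak trees outvote one strong tree under majority voting, while the weighted totals (0.9 for class 0 versus 0.8 for class 1) favor the stronger tree.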

In [None]:
class WeightedRandomForest:
    """
    Random Forest with weighted voting based on tree accuracy.
    
    Hint: 
    - Store validation accuracy for each tree
    - Use accuracy as weight in final voting
    """
    def __init__(self, n_estimators=100, random_state=None):
        # Your code here
        pass
    
    def fit(self, X, y):
        # Your code here
        pass
    
    def predict(self, X):
        # Your code here
        pass

print("Challenge 3: Implement weighted voting!")

### Challenge 4: Proximity Matrix

Calculate and visualize the proximity matrix between samples.

In [None]:
def compute_proximity_matrix(rf_model, X):
    """
    Compute proximity matrix for Random Forest.
    
    Proximity between samples i and j = 
    (Number of trees where i and j end in same leaf) / (Total trees)
    
    Hint: Use .apply() method to get leaf indices
    """
    # Your code here
    pass

# Use proximity for clustering or visualization

print("Challenge 4: Implement proximity matrix calculation!")

---

## Reflection Questions

1. **Why does bootstrap sampling create diversity in Random Forest?**
   - What percentage of samples are OOB on average?
   - How does this affect each tree?

2. **What is the key difference between Bagging and Random Forest?**
   - How does feature randomness help?
   - Why use sqrt(features) instead of all features?

3. **When would you prefer Random Forest over a single Decision Tree?**
   - Consider bias-variance tradeoff
   - What about interpretability?

4. **How does OOB error estimation work?**
   - Why is it similar to cross-validation?
   - When is it most useful?

5. **What are the pros and cons of MDI vs Permutation Importance?**
   - Which is faster?
   - Which is more reliable?

6. **How do you handle imbalanced data with Random Forest?**
   - What does class_weight='balanced' do?
   - What are alternative approaches?

7. **Random Forest for classification vs regression: What changes?**
   - Split criterion?
   - Aggregation method?

8. **Why doesn't Random Forest typically overfit even with many trees?**
   - Think about the ensemble averaging
   - What happens as n_estimators increases?

---

## Summary

In this exercise, you learned:

- **Bootstrap Sampling**: Create diverse training sets through sampling with replacement
- **Bagging**: Reduce variance by training multiple models on bootstrap samples
- **Feature Randomness**: De-correlate trees by using random feature subsets
- **Random Forest**: Powerful ensemble combining bagging and feature randomness
- **OOB Estimation**: Free validation using out-of-bag samples
- **Feature Importance**: MDI and permutation methods for interpretation
- **Hyperparameter Tuning**: Grid search for optimal performance
- **RF Regression**: Apply Random Forest to continuous target variables
- **Imbalanced Data**: Handle class imbalance with class_weight parameter

**Key Takeaways:**

1. **Random Forest Advantages:**
   - Excellent out-of-box performance
   - Robust to overfitting
   - Handles high-dimensional data well
   - Provides feature importance
   - Works for both classification and regression
   - Requires minimal preprocessing

2. **Key Hyperparameters:**
   - `n_estimators`: More is usually better (diminishing returns)
   - `max_features`: a common rule of thumb is sqrt(n) for classification and n/3 for regression (scikit-learn's regressor defaults to all features)
   - `max_depth`: Controls individual tree complexity
   - `min_samples_split/leaf`: Prevents overfitting

3. **Best Practices:**
   - Start with default parameters
   - Use OOB score for quick validation
   - Increase n_estimators if resources allow
   - Use class_weight for imbalanced data
   - Check feature importance for insights
   - Consider permutation importance for reliability

4. **When to Use Random Forest:**
   - Need strong baseline quickly
   - Interpretability less critical than performance
   - Mixed feature types (numerical + categorical, with categorical features encoded first in scikit-learn)
   - Non-linear relationships
   - Moderate dataset sizes (not huge)
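
The "start with defaults, then check the OOB score" practice can be sketched as follows. The data comes from `make_classification` and names such as `rf_demo` are illustrative assumptions; the only non-default setting is `oob_score=True`, which makes the fitted forest expose `oob_score_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not contain it
rf_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf_demo.fit(X_tr, y_tr)

oob_accuracy = rf_demo.oob_score_
test_accuracy = rf_demo.score(X_te, y_te)

print(f"OOB accuracy:  {oob_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

The two numbers are usually close, which is why the OOB score is a convenient check when a separate validation split is not available.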

**Next Steps:**

- Study Gradient Boosting (XGBoost, LightGBM, CatBoost)
- Learn about ensemble stacking
- Explore feature engineering for Random Forest
- Practice on Kaggle competitions

---

**Congratulations!** You now have a solid understanding of Random Forest, from bootstrap sampling to hyperparameter tuning, regression, and imbalanced classification. Random Forest remains one of the most reliable and widely used machine learning algorithms for both classification and regression tasks.

---

**Need help?** Check the solution notebook or open an issue on [GitHub](https://github.com/jumpingsphinx/jumpingsphinx.github.io/issues).