# 1. Introduction
In this notebook, we implement a Random Forest classifier and regressor from scratch on top of the `DecisionTree` and `DecisionTreeRegressor` classes. Random Forest is an ensemble learning technique that combines multiple decision trees to improve prediction accuracy and robustness. Each tree is fitted on a different bootstrap sample of the training data and considers random subsets of the features. For classification, every tree casts a vote; for regression, every tree produces a numeric prediction. The final output aggregates these votes or predictions.

We first import the functions and classes from the notebook on Decision Trees, which are saved in the `Decision_Trees.py` file:

In [14]:
from Decision_Trees import *

### 1.1. Key Concepts of the Random Forest
1. **Ensemble Learning**: Random Forest leverages the principle of ensemble learning, which involves combining multiple models to produce a stronger overall model. Instead of relying on a single decision tree, Random Forest constructs a collection of trees, each trained on different data samples and subsets of features. This approach helps in reducing overfitting and enhancing generalization.


2. **Bootstrapping and Bagging**: Bootstrapping creates multiple resampled versions of the training data by sampling rows with replacement. In this notebook each bootstrap sample has the same size as the original training set, so on average it contains about 63% of the distinct training rows, with the rest being duplicates. Each decision tree in the Random Forest is trained on a different bootstrap sample, which makes the trees diverse. Bagging (Bootstrap Aggregating) then combines the predictions of these trees into the final decision, which reduces the variance of the model and improves its stability.


3. **Feature Subsampling**: When a tree searches for the best split, it considers only a random subset of the features (controlled by `max_features`) rather than all of them. This technique, known as feature subsampling, introduces additional diversity among the trees and prevents them from becoming too similar, for example by all splitting on the same dominant feature first. As a result, the ensemble becomes more robust and less prone to overfitting.


4. **Voting and Averaging**:
   - **Classification**: For classification tasks, each tree in the forest votes for a class label. The final class prediction is determined by majority voting, where the class with the most votes is selected as the output.
   - **Regression**: For regression tasks, each tree provides a continuous prediction. The final output is computed as the average of all individual tree predictions, which helps in smoothing out the effects of noisy data.
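
The two sampling steps described in items 2 and 3 can be sketched with NumPy alone. This is a minimal illustration rather than the code used by the forest: the toy array shapes, the seed, and the variable names (`row_idx`, `feat_idx`, `X_candidates`) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for this illustration

n_samples, n_features, max_features = 1000, 20, 5
X = rng.normal(size=(n_samples, n_features))  # toy feature matrix

# Bootstrapping: draw n_samples row indices WITH replacement
row_idx = rng.integers(0, n_samples, size=n_samples)
X_bootstrap = X[row_idx]

# Rows that were never drawn are "out-of-bag" for this tree
unique_fraction = np.unique(row_idx).size / n_samples
print(f"Distinct rows in the bootstrap sample: {unique_fraction:.1%}")  # about 63% in expectation

# Feature subsampling: draw max_features column indices WITHOUT replacement
feat_idx = rng.choice(n_features, size=max_features, replace=False)
X_candidates = X_bootstrap[:, feat_idx]
print("Candidate features:", np.sort(feat_idx))
```

Sampling rows with replacement but features without replacement is the usual combination: duplicated rows simply reweight the data, whereas a duplicated feature would add nothing to the split search.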


### 1.2. Advantages of Random Forests


- **Reduced Overfitting**: By averaging multiple decision trees, Random Forest reduces the risk of overfitting that is common with single decision trees.


- **Improved Accuracy**: Combining predictions from multiple trees often results in higher accuracy and better generalization to unseen data.


- **Feature Importance**: Random Forest can evaluate the importance of different features, providing insights into which features contribute most to the predictions.
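
The `RandomForest` class built in this notebook does not compute feature importances itself. One model-agnostic way to estimate them is permutation importance: shuffle one feature column at a time and measure how much the prediction error grows. The sketch below is an assumption-laden illustration, not part of the original implementation. The helper `permutation_importance`, its parameters, and the toy model `toy_predict` are hypothetical names, and mean squared error is used as the metric, so it fits the regression case.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when each feature column is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j, breaking its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Toy check: the target depends on feature 0 only; feature 1 is pure noise
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0]

def toy_predict(X):
    # Stand-in for the predict method of a fitted model
    return 3.0 * X[:, 0]

toy_importances = permutation_importance(toy_predict, X_toy, y_toy)
print(toy_importances)  # large value for feature 0, zero for feature 1
```

Any object with a `predict` method can be plugged in by passing that method as the `predict` argument, ideally together with held-out data rather than the training set.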


# 2. Implementation
Let's start by implementing the Random Forest class for both classification and regression. This class will build on the `DecisionTree` and `DecisionTreeRegressor` classes and will include methods for fitting the forest to the data and making predictions.

In [15]:
import numpy as np

class RandomForest:
    def __init__(self, n_trees=100, max_depth=10, min_samples_split=2, max_features=None, regression=False):
        """
        Initialize the Random Forest model.
        
        Parameters:
        - n_trees (int): Number of decision trees in the forest.
        - max_depth (int): Maximum depth of each decision tree.
        - min_samples_split (int): Minimum number of samples required to split an internal node.
        - max_features (int or None): Number of features to consider when looking for the best split.
        - regression (bool): If True, use decision trees for regression; otherwise, use for classification.
        """
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.regression = regression
        self.trees = []

    def fit(self, X, y):
        """
        Fit the Random Forest model to the training data.
        
        Parameters:
        - X (array-like, shape (n_samples, n_features)): Training feature matrix.
        - y (array-like, shape (n_samples,)): Target values.
        """
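        # Convert to NumPy arrays so that integer-array indexing works for any array-like input
        X, y = np.asarray(X), np.asarray(y)
        self.trees = []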
        for _ in range(self.n_trees):
            # Initialize a new decision tree
            if self.regression:
                tree = DecisionTreeRegressor(
                    min_samples_split=self.min_samples_split,
                    max_depth=self.max_depth,
                    max_features=self.max_features
                )
            else:
                tree = DecisionTree(
                    min_samples_split=self.min_samples_split,
                    max_depth=self.max_depth,
                    max_features=self.max_features
                )
            
            # Create a bootstrap sample: draw n_samples row indices with replacement
            n_samples = X.shape[0]
            bootstrap_indices = np.random.randint(0, n_samples, n_samples)
            X_bootstrap, y_bootstrap = X[bootstrap_indices], y[bootstrap_indices]
            
            # Fit the decision tree on the bootstrap sample
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)

    def predict(self, X):
        """
        Predict the target values for the given input data.
        
        Parameters:
        - X (array-like, shape (n_samples, n_features)): Feature matrix for prediction.
        
        Returns:
        - array, shape (n_samples,): Predicted target values.
        """
        X = np.asarray(X)
        # Size the buffers by the number of fitted trees, not by self.n_trees,
        # so a changed n_trees attribute cannot add empty rows to the aggregate
        if self.regression:
            # For regression, aggregate predictions by averaging
            predictions = np.zeros((len(self.trees), X.shape[0]))
            for i, tree in enumerate(self.trees):
                predictions[i] = tree.predict(X)
            return np.mean(predictions, axis=0)
        else:
            # For classification, aggregate predictions by majority voting.
            # np.bincount requires non-negative integer class labels (e.g. 0, 1, 2).
            predictions = np.zeros((len(self.trees), X.shape[0]), dtype=int)
            for i, tree in enumerate(self.trees):
                predictions[i] = tree.predict(X)
            # Majority voting: select the most common class label for each sample
            return np.array([np.bincount(pred).argmax() for pred in predictions.T])
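
The classification branch of `predict` relies on `np.bincount`, which only accepts non-negative integers, so labels such as `-1`/`1` or strings would fail. A possible label-agnostic alternative, sketched here under the assumption that tree predictions are stacked in an array of shape `(n_trees, n_samples)`, counts votes with `np.unique`. The helper name `majority_vote` is hypothetical and not part of the class above.

```python
import numpy as np

def majority_vote(predictions):
    """Most frequent label per column of an (n_trees, n_samples) array (hypothetical helper)."""
    predictions = np.asarray(predictions)
    voted = []
    for column in predictions.T:
        # np.unique returns the sorted distinct labels and how often each occurs
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[counts.argmax()])  # ties go to the smallest label
    return np.array(voted)

# Three trees (rows) voting on three samples (columns)
tree_predictions = np.array([
    ["cat", "dog", "dog"],
    ["cat", "cat", "dog"],
    ["dog", "cat", "dog"],
])
print(majority_vote(tree_predictions))  # ['cat' 'cat' 'dog']
```

The tie-breaking rule matches `np.bincount(...).argmax()`, which also returns the smallest label among those with the highest count.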

### 2.1. Example Classification
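We create a synthetic binary classification dataset to test how well our Random Forest classifier performs. Because the bootstrap sampling in `fit` draws from NumPy's unseeded global random state, the reported accuracy can vary slightly between runs.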

In [17]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic classification dataset
X, y = make_classification(n_samples=500, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Random Forest Classifier
rf_classifier = RandomForest(n_trees=10, max_depth=5, min_samples_split=4, max_features=10, regression=False)
rf_classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf_classifier.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print(f'Classification Accuracy: {accuracy:.2f}')

Classification Accuracy: 0.92
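
As a sanity check, the result can be compared against scikit-learn's `RandomForestClassifier` on the same split. The sketch below maps the hyperparameters used above onto scikit-learn's names (`n_trees` becomes `n_estimators`). The variable names `sk_rf` and `sk_accuracy` and the `random_state` of the forest are assumptions for this illustration. The two implementations differ in detail, so the accuracies are only expected to be in the same range, not identical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the example above
X, y = make_classification(n_samples=500, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Reference forest with matching hyperparameters
sk_rf = RandomForestClassifier(
    n_estimators=10, max_depth=5, min_samples_split=4, max_features=10, random_state=42
)
sk_rf.fit(X_train, y_train)

sk_accuracy = np.mean(sk_rf.predict(X_test) == y_test)
print(f'scikit-learn baseline accuracy: {sk_accuracy:.2f}')
```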


### 2.2. Example Regression
We use the California Housing dataset again to test how well our Random Forest regressor predicts median house values. The target is expressed in units of 100,000 USD.

In [20]:
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
california_data = fetch_california_housing()
X, y = california_data.data, california_data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Random Forest Regressor
rf_regressor = RandomForest(n_trees=10, max_depth=8, min_samples_split=4, max_features=5, regression=True)
rf_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_regressor.predict(X_test)

# Evaluate using Mean Squared Error (MSE)
mse = np.mean((y_pred - y_test) ** 2)
print(f'Regression Mean Squared Error (MSE) on California Housing: {mse:.2f}')

Regression Mean Squared Error (MSE) on California Housing: 0.35
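
The MSE is expressed in squared target units, which makes it hard to read directly. Two complementary metrics are the root mean squared error (RMSE), which is in the units of the target, and the coefficient of determination R². With the target in units of 100,000 USD, an MSE of 0.35 corresponds to an RMSE of about 0.59, or roughly 59,000 USD. The helpers below are a minimal sketch: the function names `rmse` and `r_squared` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Toy values, not taken from the housing data
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.5, 4.0])
print(f'RMSE: {rmse(y_true_toy, y_pred_toy):.3f}')      # RMSE: 0.354
print(f'R^2:  {r_squared(y_true_toy, y_pred_toy):.3f}')  # R^2:  0.900
```

Applied to the regression example, both functions would take `y_test` and `y_pred` as arguments.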
