### Random forest regressor

This notebook covers the theory and implementation of a random forest model for regression problems. It is largely the same as for the random forest classifier, which I assume you are familiar with.

As explained in the classifier notebook, a random forest is an ensemble method that combines many (modified) decision trees, each trained on a bootstrap sample of the data (bagging). Given the (modified) decision tree regressor, the implementation of the random forest regressor is the same as for the classifier, which makes this a fairly simple exercise. We therefore copy most of the code from the random forest classifier notebook. There are only two differences: the final prediction of the forest is the average of the predictions of the individual trees, and the scoring function uses the root mean square error, as for the decision tree regressor.
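
To make the averaging step concrete, here is a minimal sketch with made-up numbers. The per-tree predictions and true values below are hypothetical and do not come from the trees built later in this notebook; they only illustrate how the forest prediction and its root mean square error are computed.

```python
import numpy as np

# Hypothetical predictions: one row per tree, one column per sample
tree_predictions = np.array([
    [1.3, 1.9, 3.2],
    [0.9, 2.3, 2.6],
    [1.1, 1.8, 3.5],
])
y_true = np.array([1.0, 2.0, 3.0])


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# The forest prediction is the mean over the tree axis
forest_prediction = tree_predictions.mean(axis=0)

tree_rmses = [rmse(y_true, row) for row in tree_predictions]
forest_rmse = rmse(y_true, forest_prediction)

print("Forest prediction:", forest_prediction)
print("Per-tree RMSE:", np.round(tree_rmses, 3))
print("Forest RMSE:", round(forest_rmse, 3))
```

In this toy example the errors of the individual trees partly cancel, so the averaged prediction has a lower error than any single tree. That is not guaranteed in general, but it is the effect bagging aims for.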

In [1]:
import numpy as np
import pandas as pd
from decision_tree_regressor_modified import DecisionTreeRegressorModified

In [2]:
def create_bootstrap_samples(X, y, n_estimators, random_state=42):
    """
    Create the bootstrap samples of the given data.
    
    Parameters:
    X (DataFrame): The original dataset.
    y (Series): The target values corresponding to the dataset.
    n_estimators (int): The number of bootstrap samples to create
                        (equal to the number of trees in the forest).
    random_state (int): Seed for reproducibility. Defaults to 42.
    
    Returns:
    list: The bootstrap samples of the data.
    """
    if random_state is not None:
        np.random.seed(random_state)
    sample_size = X.shape[0]
    bootstrap_samples = []

    for _ in range(n_estimators):
        # Generate a random sample with replacement
        sample_indices = np.random.choice(sample_size, size=sample_size, replace=True)
        X_sample, y_sample = X.iloc[sample_indices], y.iloc[sample_indices]
        bootstrap_samples.append( (X_sample, y_sample) )
    
    return bootstrap_samples

def build_forest(bootstrap_samples, n_estimators, max_depth=None, min_samples_split=2, min_samples_leaf=1, 
                 verbose=False, threshold=0.0, num_features=None, random_state=42):
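    """
    Build the random forest from a decision tree for each bootstrap sample.
    
    Parameters:
    bootstrap_samples (list): The list of bootstrap samples.
    n_estimators (int): The number of trees in the forest.
    max_depth, min_samples_split, min_samples_leaf, verbose, threshold,
    num_features, random_state: Passed unchanged to every
                        DecisionTreeRegressorModified.
    
    Returns:
    list: The random forest (list of decision trees).
    """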
    forest = []
    
    for i in range(n_estimators):
        X_sample, y_sample = bootstrap_samples[i]
        
        # Train a decision tree on the bootstrap sample
        tree = DecisionTreeRegressorModified(max_depth=max_depth,
                                             min_samples_split=min_samples_split,
                                             min_samples_leaf=min_samples_leaf,
                                             verbose=verbose,
                                             threshold=threshold,
                                             num_features=num_features,
                                             random_state=random_state)
        tree.fit(X_sample, y_sample)
        forest.append(tree)
    
    return forest

def predict(X, forest):
    """
    Make predictions using the random forest.
    
    Parameters:
    X (DataFrame): The data to make predictions on.
    forest (list): The random forest (list of decision trees).

    Returns:
    array-like: The predicted values for the input data.
    """
    predictions = np.array([tree.predict(X) for tree in forest])
    # Average the per-tree predictions for each sample
    forest_predictions = np.mean(predictions, axis=0)
    return forest_predictions

def score(y_true, y_pred):
    """
    Calculate the root mean square error of predictions.
    
    Parameters:
    y_true (array-like): True values.
    y_pred (array-like): Predicted values.
    
    Returns:
    float: Root mean square error.
    """
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
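
As a side note on the bootstrap step, sampling with replacement means that each sample repeats some rows and leaves others out. The sketch below illustrates this with a hypothetical dataset size. It uses NumPy's `Generator` API rather than the `np.random.seed` / `np.random.choice` calls in `create_bootstrap_samples`, but the sampling idea is the same.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
n_rows = 10_000                  # hypothetical dataset size

# Draw row indices with replacement, one draw per original row
sample_indices = rng.choice(n_rows, size=n_rows, replace=True)

# Fraction of the original rows that appear at least once in the sample
unique_fraction = np.unique(sample_indices).size / n_rows

print(f"Fraction of distinct rows in the sample: {unique_fraction:.3f}")
print(f"Limit for large datasets, 1 - 1/e:       {1 - np.exp(-1):.3f}")
```

Each tree therefore sees only roughly 63% of the distinct training rows, which is one source of the differences between the trees.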

In [3]:
test_data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6],
    'feature2': [0, 1, 0, 1, 0, 1],
    'target': [1.1, 1.2, 0.9, 0.1, -0.1, 0.0]
    })
X = test_data[['feature1', 'feature2']]
y = test_data['target']
n_estimators = 2
bootstrap_samples_list = create_bootstrap_samples(X, y, n_estimators)

forest = build_forest(bootstrap_samples_list, n_estimators, num_features=2)
predictions = predict(X, forest)
print(predictions)

rmse = score(y, predictions)
print(f"Root Mean Square Error: {rmse:.4f}")

[ 1.05  1.05  0.9   0.1  -0.1   0.1 ]
Root Mean Square Error: 0.0764
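
As a sanity check, the reported error can be reproduced by hand from the printed predictions. This sketch only re-enters the target values and the output shown above; it does not rebuild the forest.

```python
import numpy as np

y_true = np.array([1.1, 1.2, 0.9, 0.1, -0.1, 0.0])    # 'target' column
y_pred = np.array([1.05, 1.05, 0.9, 0.1, -0.1, 0.1])  # printed predictions

squared_errors = (y_true - y_pred) ** 2  # 0.0025, 0.0225, 0, 0, 0, 0.01
rmse = np.sqrt(squared_errors.mean())    # sqrt(0.035 / 6)

print(f"Root Mean Square Error: {rmse:.4f}")  # 0.0764
```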


Now for some more rigorous testing.

In [None]:
iris_data = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv', header=None)
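# Regression target: column 0 (sepal length).
# All remaining columns, including the species label, are used as features.
X = iris_data.drop(columns=[0])
y = iris_data[0]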

# Map the labels to integers
label_mapping = {label: idx for idx, label in enumerate(X[4].unique())}
X[4] = X[4].map(label_mapping)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Do the Forest fit
bootstrap_samples_list = create_bootstrap_samples(X_train, y_train, n_estimators=5)
forest = build_forest(bootstrap_samples_list, n_estimators=5, max_depth=5)
y_pred = predict(X_test, forest)
rmse = score(y_test, y_pred)
print(f"Root Mean Square Error on Iris dataset: {rmse:.4f}")

Root Mean Square Error on Iris dataset: 0.3568


In [None]:
for n_estimators in [1, 2, 3, 4, 5]:
    bootstrap_samples_list = create_bootstrap_samples(X_train, y_train, n_estimators=n_estimators)
    forest = build_forest(bootstrap_samples_list, n_estimators=n_estimators, max_depth=5)
    y_pred = predict(X_test, forest)
    rmse = score(y_test, y_pred)
    print(f"Root Mean Square Error with {n_estimators} estimators: {rmse}")
# The error happens to be smallest with 2 estimators here. With so few trees
# and only 30 test samples, differences this small are likely within the noise.

Root Mean Square Error with 1 estimators: 0.4384877452022938
Root Mean Square Error with 2 estimators: 0.33544082648778967
Root Mean Square Error with 3 estimators: 0.3466004329622881
Root Mean Square Error with 4 estimators: 0.3433600806074362
Root Mean Square Error with 5 estimators: 0.3568256541469695


In [None]:
# Compare with sklearn's RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
sklearn_forest = RandomForestRegressor(n_estimators=5, max_depth=5, random_state=42)
sklearn_forest.fit(X_train, y_train)
sklearn_predictions = sklearn_forest.predict(X_test)
sklearn_rmse = np.sqrt(mean_squared_error(y_test, sklearn_predictions))
print(f"Sklearn RandomForestRegressor rmse: {sklearn_rmse:.4f}")
# Sklearn is slightly better, and also a bit faster

Sklearn RandomForestRegressor rmse: 0.3418


In [29]:
# Use the diabetes dataset as a regression example
from sklearn.datasets import load_diabetes
diabetes_data = load_diabetes()
X = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
y = pd.Series(diabetes_data.target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
bootstrap_samples_list = create_bootstrap_samples(X_train, y_train, n_estimators=10)
forest = build_forest(bootstrap_samples_list, n_estimators=10, max_depth=5)
y_pred = predict(X_test, forest)
rmse = score(y_test, y_pred)
print(f"Root Mean Square Error on Diabetes dataset: {rmse:.4f}")

Root Mean Square Error on Diabetes dataset: 56.0641


In [None]:
sklearn_forest = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=42)
sklearn_forest.fit(X_train, y_train)
sklearn_predictions = sklearn_forest.predict(X_test)
sklearn_rmse = np.sqrt(mean_squared_error(y_test, sklearn_predictions))
print(f"Sklearn RandomForestRegressor rmse: {sklearn_rmse:.4f}")
# In this case sklearn is slightly worse, but much faster given the larger number of trees

Sklearn RandomForestRegressor rmse: 56.6336


Just as in the decision tree test, our predictions on the diabetes data are quite poor: the RMSE is about 56 on a target that ranges from 25 to 346. The sklearn predictions are equally poor, so a random forest does not work well on this data either. Our forest also takes about ten seconds to run here, while sklearn is still practically instantaneous, so our implementation is much slower.

Still, these tests again show that our implementation produces good results, comparable to those of sklearn. It is, however, fairly slow, just like the classifier, so it may not be practical for larger datasets.
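
To put a number on the runtime difference rather than estimating it by eye, a small timing helper could be used. The helper name `time_call` is hypothetical and not part of the notebook; the sketch times a stand-in computation so that it runs on its own.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this could be, for example,
# time_call(build_forest, bootstrap_samples_list, n_estimators=10, max_depth=5)
total, seconds = time_call(sum, range(1_000_000))
print(f"Result {total} computed in {seconds:.4f} s")
```

Timing `build_forest` and `sklearn_forest.fit` this way on the same training data would give a direct comparison of the two implementations.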