# LAB 07 - Random Forest for Regression

In this lab we extend the previous lab on decision trees and build a regression model using a random forest.

For simplicity, we will be using the same dataset as the previous lab (you can find it in ECLASS).

**IMPORTANT:** If you haven't finished your code from last week's lab on decision trees, you may use the sklearn implementation of a regression tree for this lab. This doesn't mean you should skip the previous lab; the option exists so that you don't fall behind on the content or spend all of today's session on last week's work.

In [18]:
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

As mentioned before, use the Boston Housing data and prepare your train/val/test split as usual.

## Exercise 1 -- Bootstrap

Bootstrapping is the first part of [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) (bootstrap aggregating). The technique draws several samples with replacement from the original data, trains an estimator on each sample, and then aggregates the predictions by averaging them (this is also a type of model ensemble).

In [19]:
def bootstrap(X, num_bags=10):
    """
    Given a dataset and a number of bags,
    sample the dataset with replacement.
    
    This function does not return a copy
    of the datapoints, but a list of indices
    with compatible dimensionality
    
    Parameters
    ----------
    X : ndarray
        A dataset
    num_bags : int, default 10
        The number of bags to create
    
    Returns
    -------
    list of ndarray
        The list contains `num_bags` integer one-dimensional ndarrays.
        Each of these contains the indices corresponding to the 
        sampled datapoints in `X`
    
    Notes
    -----
    * The number of datapoints in each bag will
      match the number of datapoints in the given
      dataset.
    * The
    """
    rng = np.random.default_rng(0) # you can change the seed, or use 0 to replicate my results
    # Your code here
    
    
    num_samples = len(X)
    bags = []
    for _ in range(num_bags):
        indices = rng.choice(num_samples, size=num_samples, replace=True)
        bags.append(indices)
        
    return bags

In [20]:
rng = np.random.default_rng(0)
X_small = rng.random(size=(100,2))
bags = bootstrap(X_small)
bags[0]

array([85, 63, 51, 26, 30,  4,  7,  1, 17, 81, 64, 91, 50, 60, 97, 72, 63,
       54, 55, 93, 27, 81, 67,  0, 39, 85, 55,  3, 76, 72, 84, 17,  8, 86,
        2, 54,  8, 29, 48, 42, 40,  2,  0, 12,  0, 67, 52, 64, 25, 61, 76,
       38, 46, 99, 80, 98, 37, 68, 95, 65, 84, 68, 70, 38, 87, 13, 57, 72,
       84, 52, 37, 31, 42, 48, 71, 88,  7, 93, 53, 35, 67, 57, 25, 32, 71,
       59, 50, 33, 76, 39, 32, 89, 26, 22, 71, 62,  4,  8, 37, 83])
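
Because each bag is drawn with replacement, some datapoints appear several times (index 63 shows up twice in the first row above) and others do not appear at all. The sketch below is a rough check of this and is not part of the required lab code; the dataset size `n` and the names `fraction_unique` and `out_of_bag` are illustrative assumptions. For a bag of size n, the expected share of distinct datapoints is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # illustrative dataset size (assumption, not the lab's data)

# One bootstrap bag: n indices drawn uniformly with replacement
indices = rng.choice(n, size=n, replace=True)

# Fraction of distinct datapoints that made it into the bag
fraction_unique = np.unique(indices).size / n

# Datapoints that were never drawn for this bag ("out-of-bag")
out_of_bag = np.setdiff1d(np.arange(n), indices)

print(f"unique fraction:     {fraction_unique:.3f}")
print(f"out-of-bag fraction: {out_of_bag.size / n:.3f}")
```

The out-of-bag datapoints are the ones a given estimator never sees during training, which is one reason the individual trees in the ensemble differ from each other.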

## Exercise 2 -- Aggregation

The second part of bagging: combine the predictions of the individual estimators into a single prediction by averaging them.

In [21]:
def aggregate_regression(preds):
    """
    Aggregate predictions by several estimators
    
    Parameters
    ----------
    preds : list of ndarray
        Predictions from multiple estimators.
        All ndarrays in this list should have the same
        dimensionality.
        
    Returns
    -------
    ndarray
        The mean of the predictions
    """
    # Your code here
    return np.mean(preds, axis=0)
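
As a quick sanity check of the aggregation step, the sketch below averages made-up predictions from three estimators for four datapoints. The values in `preds` are hypothetical, and the function is repeated only so that the snippet runs on its own.

```python
import numpy as np

def aggregate_regression(preds):
    # Same averaging rule as above, repeated so this snippet is self-contained
    return np.mean(preds, axis=0)

# Hypothetical predictions: one array per estimator, one entry per datapoint
preds = [
    np.array([1.0, 2.0, 3.0, 4.0]),
    np.array([2.0, 2.0, 5.0, 4.0]),
    np.array([3.0, 2.0, 4.0, 7.0]),
]

aggregated = aggregate_regression(preds)
print(aggregated)        # [2. 2. 4. 5.]
print(aggregated.shape)  # (4,)
```

Averaging over `axis=0` collapses the estimator dimension and keeps one value per datapoint, so the result has the same shape as a single estimator's prediction.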

## Exercise 3 -- Random Forest for regression

Using the functions you implemented above, it is now time to put them together: train several decision trees and then ensemble them to output a single prediction. For the random forest, however, we also need to select a random subset of features at each split of each decision tree.

For this part, you can use the sklearn implementation of a decision tree for regression (`DecisionTreeRegressor`) as the estimator for each bag and set of features. See below an example of how to do this, and always remember to check the relevant documentation when using an external function.

Some parameters you will have to set are:
* num_features: number of features considered at each split of an estimator
* min_samples: minimum number of samples per leaf node
* max_depth: maximum depth of the decision tree (each estimator)
* num_estimators: number of decision trees to train, each on its own bag and random sets of features
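
One way these parameters could map onto sklearn's `DecisionTreeRegressor` is sketched below. The data is synthetic and only for illustration (`X_demo` and `y_demo` are assumptions, not the Boston Housing data), and `random_state=0` is added only to make the sketch repeatable. In sklearn, `max_features` limits the features considered at each split, `min_samples_leaf` sets the minimum leaf size, and `max_depth` caps the depth of the tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the Boston Housing data)
rng = np.random.default_rng(0)
X_demo = rng.random(size=(500, 13))
y_demo = X_demo[:, 0] * 10 + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

num_features, min_samples, max_depth = 4, 20, 6

tree = DecisionTreeRegressor(
    max_features=num_features,     # features considered at each split
    min_samples_leaf=min_samples,  # minimum number of samples in every leaf
    max_depth=max_depth,           # maximum depth of the tree
    random_state=0,
)
tree.fit(X_demo, y_demo)

# Inspect the fitted tree: leaves are the nodes without children
is_leaf = tree.tree_.children_left == -1
smallest_leaf = tree.tree_.n_node_samples[is_leaf].min()

print("depth:", tree.get_depth())
print("smallest leaf size:", smallest_leaf)
```

The fitted tree should respect both limits: its depth does not exceed `max_depth`, and no leaf holds fewer than `min_samples` training datapoints.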

In [23]:
## your code goes here:
class RandomForest:
    def __init__(self, num_features=4, min_samples=20, max_depth=6, num_estimators=10):
        self.num_features = num_features
        self.min_samples = min_samples
        self.max_depth = max_depth
        self.num_estimators = num_estimators
        self.estimators = []

    def fit(self, X_train, y_train):
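        # Start from an empty ensemble so that refitting does not accumulate trees
        self.estimators = []
        # Draw one bootstrap bag of indices per estimator
        bags = bootstrap(X_train, num_bags=self.num_estimators)
        for indices in bags:
            bag_X = X_train[indices]
            bag_y = y_train[indices]
            rf_estimator = self._build_rf_estimator()
            rf_estimator.fit(bag_X, bag_y)
            self.estimators.append(rf_estimator)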

    def predict(self, X_test):
        predictions = []
        for estimator in self.estimators:
            pred = estimator.predict(X_test)
            predictions.append(pred)
        return aggregate_regression(predictions)

    def _build_rf_estimator(self):
        return DecisionTreeRegressor(
            max_depth=self.max_depth,
            min_samples_leaf=self.min_samples,
            max_features=self.num_features
        )

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Load the dataset
data = pd.read_csv("data/BostonHousing.txt")

# Define the features and the target
X = data.drop('medv', axis=1).values
y = data['medv'].values

# Split the data into train, validation, and test sets
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, train_size=0.80, random_state=13)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, test_size=0.5, random_state=13)

rf = RandomForest(num_features=4, min_samples=20, max_depth=6, num_estimators=10)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_val)

# rmse() already takes the square root, so it must not be applied a second time
rmse_rf = rmse(y_val, y_pred)

print(f'Random Forest Validation RMSE: {rmse_rf:.2f}')

Random Forest Validation RMSE: 3.17


In [28]:
from sklearn.ensemble import RandomForestRegressor
rfsk = RandomForestRegressor(n_estimators=10, max_depth=6, max_features = 4, min_samples_leaf= 20, random_state=0)

# Train the model
rfsk.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = rfsk.predict(X_val)

# Compute the root mean squared error (RMSE)
rmse_sk = rmse(y_val, y_pred)
print(f'Random Forest Validation RMSE: {rmse_sk:.2f}')

Random Forest Validation RMSE: 3.87
