# Training and Evaluation for Random Forest Model
This model trains differently than the neural network based models which is why there is a seperate notebook. Training is computationally very expensive which is why we were only able to train on a fraction of the training data used for the other models (6%). Training still takes roughly 1 hour.

In [46]:
import torch
import numpy as np
from data_handling import complete_data_preparation
from benchmarks import RandomForest


## Data Preparation

In [49]:
# Loading all the data
batches_train, batches_val, batches_test = complete_data_preparation(sequence_length=3, batch_size=64)

In [6]:
# Only selecting fraction of the data due to computational cost
batches_train = batches_train[:1000]
batches_val = batches_val[:1000]
batches_test = batches_test[:1000]

In [35]:
# Reshape targets of each batch to be have shape S x 1 (S = Number of Samples) and aggregating them to one vector
reshaped_batches_target = [batch[:,-1,-1].view(batch.shape[0],1) for batch in batches_train]
stacked_targets = torch.cat(reshaped_batches_target, dim=0).numpy().ravel()
print(f'Shape of Target Vector: {stacked_targets.shape}')

Shape of Target Vector: (64000,)


In [36]:
# Reshape features of each batch to be have shape S x 1 (S = Number of Samples) and aggregating them to one matrix
reshaped_batches = [batch[:,:,:-1].reshape(batch.shape[0], -1) for batch in batches_train]
stacked_data = torch.cat(reshaped_batches, dim=0).numpy()
print(f'Shape of Feature Matrix: {stacked_data.shape}')

Shape of Feature Matrix: (64000, 195)


## Training

In [33]:
# Initializing the Model
model = RandomForest(n_estimators=100, max_depth=30)

In [None]:
# Trainign the model
sse_train_unadjusted = model.train(stacked_data, stacked_targets)

For the other models we calculated the mean squared error for the batch and summed over all the batches. For this model we no longer have batches so we adjusted the loss by dividing by 64 to account for this. Additionally, as we are only using 1000 of the 16670 batches for the data set so we adjust by multiplying with $\frac{16670}{1000}$.

In [52]:
sse_train_adjusted = sse_train_unadjusted / 64 * (16670/1000)
print(f'Estimated Train Loss: {sse_train_adjusted}')

Estimated Test Loss: 7.0597633312831505


In [40]:
model_saving_path = "modelDumps/" + "RandomForest" + ".pt"
import os
if not os.path.exists('modelDumps'):
    os.makedirs('modelDumps')
torch.save(model, model_saving_path)

## Evaluation

In [44]:
# Preparing the test data

# Reshape targets of each batch to be have shape S x 1 (S = Number of Samples) and aggregating them to one vector
reshaped_batches_target = [batch[:,-1,-1].view(batch.shape[0],1) for batch in batches_test]
stacked_targets_test = torch.cat(reshaped_batches_target, dim=0).numpy().ravel()
print(f'Shape of Target Vector: {stacked_targets.shape}')

# Reshape features of each batch to be have shape S x 1 (S = Number of Samples) and aggregating them to one matrix
reshaped_batches = [batch[:,:,:-1].reshape(batch.shape[0], -1) for batch in batches_train]
stacked_data_test = torch.cat(reshaped_batches, dim=0).numpy()
print(f'Shape of Feature Matrix: {stacked_data.shape}')


Shape of Target Vector: (64000,)
Shape of Feature Matrix: (64000, 195)


In [55]:
# Getting the predicitons of the model
predictions = model(stacked_data_test)
sse_test_unadjusted = np.sum((predictions - stacked_targets_test)**2)
adjusted_test_loss = sse_test_unadjusted / 64 * (3880/1000)

As before, we have to adjust for the fact that we summed summed the square loss and not the averaque square loss of the batches so we adjusted the loss by dividing by 64 to account for this. Additionally, as we are only using 1000 of the 3880 batches for the data set so we adjust by multiplying with $\frac{3880}{1000}$.

In [56]:
print(f'Estimated Test Loss: {adjusted_test_loss}')

Estimated Test Loss: 5.415680542938186


In [59]:
# Parameters
# Number of trees
print("Number of trees: ", len(model.model.estimators_))

# Depth of each tree
tree_depths = [tree.tree_.max_depth for tree in model.model.estimators_]
print("Tree depths: ", tree_depths)

# Number of leaf nodes in each tree
leaf_counts = [tree.tree_.n_leaves for tree in model.model.estimators_]
print("Number of leaf nodes per tree: ", leaf_counts)

print(f"Total number of leaf nodes: {np.sum(leaf_counts)}")

Number of trees:  100
Tree depths:  [30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30]
Number of leaf nodes per tree:  [8430, 8008, 7450, 7836, 6907, 6237, 6630, 7679, 7921, 6977, 6908, 6294, 4996, 5234, 6063, 7568, 6409, 5964, 8133, 8107, 5926, 6724, 5491, 8287, 6578, 6558, 8393, 7998, 7279, 7080, 6282, 8439, 6311, 8844, 6967, 8507, 6542, 8889, 7198, 7922, 6194, 5516, 6834, 5435, 5841, 6666, 6244, 5830, 6589, 6643, 6316, 6355, 5489, 7468, 7800, 7436, 7424, 9542, 6708, 6282, 6560, 6752, 7102, 7447, 7334, 7820, 10174, 5951, 6985, 5838, 6218, 5605, 9692, 6661, 6984, 8646, 6589, 5813, 6832, 7951, 8165, 7320, 7521, 8140, 5249, 7486, 6514, 7000, 5