# Generalised Forest Tuning - Bayesian Optimisation

This is the code used to run experiments on Bayesian optimisation in the paper "Generalising Random Forest Parameter Optimisation to Include Stability and Cost" by CHB Liu, BP Chamberlain, DA Little, A Cardoso (2017).

Please make sure you are running in the Anaconda environment `gft_env`. If the imports below succeed, the environment is most likely set up correctly. If any import fails, re-run the `./setup_environment.sh` script.

## Library imports

In [None]:
import pandas as pd # You need pandas 0.19+
import numpy as np

from data_loader import *
from evaluator import *
from pybo import solve_bayesopt
from functools import partial

## Bayesian Optimisation Wrapper Function

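The following function takes the training and validation features and labels together with the loss function weights ($\alpha$, $\beta$, and $\gamma$). It runs 20 iterations of Bayesian optimisation over the number of trees (5 to 200), the maximum tree depth (1 to 20), and the training proportion (0.1 to 1). It returns the best forest parameter combination, the Bayesian optimisation model, the intermediate results, and the metrics achieved with the best forest parameter combination.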
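
As a rough illustration of how the three weights might trade off against each other, the sketch below assumes the score is a linear combination that the optimiser maximises: $\alpha$ rewards predictive performance (AUC), while $\beta$ and $\gamma$ penalise instability and cost. The function name `generalised_score`, its argument names, and the numbers are hypothetical. The actual scoring lives in `get_RF_generalised_performance_score` in `evaluator` and may differ.

```python
def generalised_score(auc, rmspd, cost, alpha=1.0, beta=1.0, gamma=1.0):
    """Hypothetical weighted score (higher is better).

    auc   -- predictive performance of the forest
    rmspd -- instability measure (lower is more stable)
    cost  -- training/prediction cost (lower is cheaper)
    """
    return alpha * auc - beta * rmspd - gamma * cost


# Same (made-up) forest metrics scored under two different cost weights
light_cost_weight = generalised_score(auc=0.85, rmspd=0.02, cost=10.0, gamma=0.01)
heavy_cost_weight = generalised_score(auc=0.85, rmspd=0.02, cost=10.0, gamma=0.1)
```

Under this assumed form, raising `gamma` lowers the score of an expensive forest, so the optimiser would be pushed towards cheaper parameter combinations.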

In [None]:
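def bayesopt_RF_performace(
    X_train, X_val, y_train, y_val,
    weight_alpha=1, weight_beta=1, weight_gamma=1):
    """Tune forest parameters with Bayesian optimisation and
    evaluate the best combination found."""

    # Fix the data and loss weights so that the objective depends
    # only on the forest parameters proposed by the optimiser
    objective = \
        partial(get_RF_generalised_performance_score,
                features_train_complete=X_train,
                features_val=X_val,
                labels_train_complete=y_train,
                labels_val=y_val,
                weight_alpha=weight_alpha,
                weight_beta=weight_beta,
                weight_gamma=weight_gamma,
                verbose=False)

    # Search bounds: [no. trees], [max tree depth], [training prop.]
    xbest, model, info = solve_bayesopt(
        objective,
        bounds=[[5, 200], [1, 20], [0.1, 1]],
        niter=20,
        verbose=True)

    # Re-evaluate the best parameter combination (nrun=10)
    best_performance = \
        train_and_get_RF_performance(
            np.array(xbest),
            X_train,
            X_val,
            y_train,
            y_val,
            nrun=10)

    return xbest, model, info, best_performance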

## Bayesian optimisation for Orange small dataset

In [None]:
orange_small_features, _, _, orange_small_upselling_labels = \
    get_and_process_orange_small_data()

# Train/validation split:
# the first <prop> fraction of the rows is used as training data,
# the rest forms the hold-out validation set
orange_small_features_train, orange_small_features_val, \
orange_small_upselling_labels_train, orange_small_upselling_labels_val = \
    split_train_val_data(orange_small_features,
                         orange_small_upselling_labels,
                         prop=0.5)
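
For readers without access to `data_loader`, the split described in the comments above can be sketched as follows. This is a minimal stand-in, assuming a plain positional split with no shuffling; the helper name `split_first_prop` and the toy arrays are hypothetical, and the actual `split_train_val_data` may behave differently.

```python
import numpy as np


def split_first_prop(features, labels, prop=0.5):
    """Hypothetical positional split: first <prop> of rows for training,
    the remaining rows for validation (no shuffling)."""
    n_train = int(len(features) * prop)
    return (features[:n_train], features[n_train:],
            labels[:n_train], labels[n_train:])


# Toy data: 10 rows, 2 feature columns, binary labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

X_train, X_val, y_train, y_val = split_first_prop(X, y, prop=0.5)
```

Because the split is positional, any ordering in the raw data (for example by time) carries over to the train and validation sets.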

In [None]:
# To observe how sensitive the forest parameters are to the weights,
# change the values of <weight_alpha>, <weight_beta> and <weight_gamma>
best_param, _, _, best_performance = \
    bayesopt_RF_performace(
        orange_small_features_train, orange_small_features_val,
        orange_small_upselling_labels_train, 
        orange_small_upselling_labels_val,
        weight_alpha=1, weight_beta=1, weight_gamma=0.01
    )

# Print results
print("")
print("------ BAYESIAN OPTIMISATION RESULT ------")
print("The best parameter combination (with the highest "
      "posterior mean) found by Bayesian optimisation is:")
print("No. trees: " + str(int(best_param[0])) + 
      ", Max tree depth: " + str(int(best_param[1])) +
      ", Training prop.: " + str(best_param[2]) + "\n")

print("The metrics achieved under this parameter combination "
      "are as follows:")
print(best_performance)
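
Rather than editing the weights by hand and re-running the cell, the sensitivity check could be scripted. The sketch below is one possible approach, not part of the original experiments: `sweep_gamma` is a hypothetical helper that accepts any callable returning the same four values as the wrapper function, and `fake_tuner` is a made-up stand-in so the example runs without the data or `pybo`.

```python
def sweep_gamma(tune_fn, gammas):
    """Call tune_fn(weight_gamma=g) for each g and collect the best
    parameters and metrics, keyed by the gamma value used."""
    results = {}
    for g in gammas:
        best_param, _, _, best_performance = tune_fn(weight_gamma=g)
        results[g] = (best_param, best_performance)
    return results


def fake_tuner(weight_gamma):
    """Stand-in tuner with invented behaviour: picks fewer trees
    once the cost weight reaches 0.01."""
    n_trees = 200 if weight_gamma < 0.01 else 50
    return [n_trees, 10, 0.5], None, None, {"weight_gamma": weight_gamma}


results = sweep_gamma(fake_tuner, [0.001, 0.01, 0.1])
for gamma, (params, _) in results.items():
    print("gamma=" + str(gamma) + ", no. trees: " + str(int(params[0])))
```

With the real data, `tune_fn` could be built with `partial(bayesopt_RF_performace, X_train, X_val, y_train, y_val, weight_alpha=1, weight_beta=1)`; each call then runs a full optimisation, so the sweep takes correspondingly longer.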

## Bayesian optimisation for Criteo dataset

Warning: this is a long-running process.

In [None]:
# By default, train and tune parameters on 10% of the Criteo data
# to prevent memory issues.
# If you have >64GB of memory and would like to use the full
# dataset, set the <sample> argument to False
criteo_features, criteo_labels = \
    get_and_process_criteo_data(sample=True)

# Train/validation split:
# the first <prop> fraction of the rows is used as training data,
# the rest forms the hold-out validation set
criteo_features_train, criteo_features_val, \
criteo_labels_train, criteo_labels_val = \
    split_train_val_data(criteo_features,
                         criteo_labels,
                         prop=0.5)

In [None]:
# To observe how sensitive the forest parameters are to the weights,
# change the values of <weight_alpha>, <weight_beta> and <weight_gamma>
best_param, _, _, best_performance = \
    bayesopt_RF_performace(
        criteo_features_train, criteo_features_val,
        criteo_labels_train, criteo_labels_val,
        weight_alpha=1, weight_beta=1, weight_gamma=0.001
    )

# Print results
print("")
print("------ BAYESIAN OPTIMISATION RESULT ------")
print("The best parameter combination (with the highest "
      "posterior mean) found by Bayesian optimisation is:")
print("No. trees: " + str(int(best_param[0])) + 
      ", Max tree depth: " + str(int(best_param[1])) +
      ", Training prop.: " + str(best_param[2]) + "\n")

print("The metrics achieved under this parameter combination "
      "are as follows:")
print(best_performance)