### Experiments
 1. Create 20 random train/test splits of the dataset.
 2. For each split:
      1.   Create a validation (val) set from 20% of the training set.
      2.   Select the best pair of hyperparameters by training on the remaining training data and evaluating on the val set.
      3.   Retrain the model on the entire training set with the best pair of hyperparameters.
      4.   Evaluate performance on the test set.
 3. Report the performance averaged over all 20 splits.
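
The protocol above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the data are synthetic, the model is a polynomial `lm` fit standing in for the random forest, the single hyperparameter (polynomial degree) stands in for the hyperparameter pair, and the 20% test fraction is assumed because the text does not state it. All names (`fit_and_score`, `grid`, `test_frac`, and so on) are hypothetical.

```r
set.seed(1)

# Synthetic regression data (assumption: stands in for a real dataset)
n   <- 200
x   <- runif(n, -2, 2)
dat <- data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.3))

n_splits  <- 20    # number of random train/test splits
test_frac <- 0.2   # assumed test fraction; not stated in the text
val_frac  <- 0.2   # validation share of the training set
grid      <- 1:6   # hypothetical hyperparameter grid: polynomial degree

rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

# Fit on `train` with the given degree and return the RMSE on `test`
fit_and_score <- function(train, test, degree) {
    fit <- lm(y ~ poly(x, degree), data = train)
    rmse(test$y, predict(fit, newdata = test))
}

test_rmse   <- numeric(n_splits)
best_degree <- integer(n_splits)

for (s in seq_len(n_splits)) {
    # Step 1: random train/test split
    test_idx   <- sample(n, size = round(test_frac * n))
    train_full <- dat[-test_idx, ]
    test       <- dat[test_idx, ]

    # Step 2.1: hold out 20% of the training set for validation
    val_idx <- sample(nrow(train_full), size = round(val_frac * nrow(train_full)))
    val     <- train_full[val_idx, ]
    train   <- train_full[-val_idx, ]

    # Step 2.2: pick the hyperparameter with the lowest validation error
    val_scores     <- vapply(grid, function(d) fit_and_score(train, val, d), numeric(1))
    best_degree[s] <- grid[which.min(val_scores)]

    # Steps 2.3 and 2.4: retrain on the full training set, score on the test set
    test_rmse[s] <- fit_and_score(train_full, test, best_degree[s])
}

# Step 3: average over all splits
mean_rmse <- mean(test_rmse)
cat("mean test RMSE over", n_splits, "splits:", round(mean_rmse, 3), "\n")
```

The test set is only touched once per split, after the hyperparameter has been fixed on the validation set, which is what keeps the averaged score an honest estimate.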

In [7]:
# Load external files
ext_files <- c("web_loader.r", "subRF_method.r", "grid_search.r")
for (file in ext_files) source(file)

# Required packages
# Note: randomForestCI and surfin are typically installed from GitHub
# (e.g. with devtools::install_github) rather than CRAN, so install.packages()
# may fail for them.
packages <- c("caret", "devtools", "randomForestCI", "surfin", "grf", "Metrics", "tidyverse")
    # library(rpart) # for kyphosis data
    # library(MASS)
installed <- rownames(installed.packages())

# Install any package that is missing
for (package in packages) {
    if (!(package %in% installed)) {
        install.packages(package)
    }
}

# Load libraries
for (package in packages) suppressMessages(library(package, character.only = TRUE))

# Custom function
pointLL <- function(mean, sigma2, datapoint) {
    # Gaussian log-likelihood of a single data point,
    # given the predicted mean and variance (sigma2)
    -0.5 * log(2 * pi * sigma2) - 0.5 * (datapoint - mean)^2 / sigma2
}

# Print dataset selection
print("Available datasets") 
names(web_loader)

[1] "Available datasets"
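
As a quick sanity check, the `pointLL` helper defined in the cell above should agree with R's built-in Gaussian log-density, `dnorm(..., log = TRUE)`. The sketch below repeats the definition so that it runs on its own; the values of `mu`, `sigma2` and `y` are arbitrary illustrative numbers, not results from the experiments.

```r
# Same formula as pointLL above, repeated so the check is self-contained
pointLL <- function(mean, sigma2, datapoint) {
    -0.5 * log(2 * pi * sigma2) - 0.5 * (datapoint - mean)^2 / sigma2
}

mu     <- 1.5              # illustrative predicted mean
sigma2 <- 4                # illustrative predicted variance
y      <- c(-1, 0.3, 2.7)  # illustrative observations

ll_custom <- pointLL(mu, sigma2, y)
ll_dnorm  <- dnorm(y, mean = mu, sd = sqrt(sigma2), log = TRUE)

# TRUE if both give the same values up to floating-point tolerance
same <- isTRUE(all.equal(ll_custom, ll_dnorm))
print(same)
```

Note that `pointLL` takes the variance, while `dnorm` takes the standard deviation, hence the `sqrt(sigma2)`.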


## Testing script
#### Train and validate the model, then test it and compute confidence-interval and model metrics
Source: [How uncertain are your Random Forest predictions?](http://shftan.github.io/surfin/demo.html)

Typical random forest implementations use bootstrapped trees. The U-statistics-based variance estimate instead relies on subsampled trees, which allows the Central Limit Theorem to be applied. The number of observations subsampled should be on the order of $\sqrt{n}$, where $n$ is the number of observations in the data set. Other parameters of interest are ntree (the number of trees), B (the number of common observations between trees), and L (the number of trees sharing an observation). ntree, B and L are connected: if we use ntree=5000 trees and B=25 common observations between trees, then L=5000/25=200 trees will share an observation, the next 200 trees will share another observation, and so forth. Only two of these three parameters therefore need to be specified; the third follows automatically. Mentch & Hooker found in their experiments that a range of 10 to 50 for B works well empirically.
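
The relationship ntree = B × L and the $\sqrt{n}$ subsample size can be made concrete with a few lines of R. This is only an arithmetic sketch: `complete_forest_params` is a hypothetical helper written for illustration, not a function from surfin or any other package, and rounding $\sqrt{n}$ up with `ceiling` is an assumption, since the text only says "on the order of".

```r
# Hypothetical helper: given exactly two of ntree, B and L, derive the third
complete_forest_params <- function(ntree = NULL, B = NULL, L = NULL) {
    given <- c(ntree = !is.null(ntree), B = !is.null(B), L = !is.null(L))
    if (sum(given) != 2) stop("specify exactly two of ntree, B and L")
    if (!given[["ntree"]]) ntree <- B * L
    if (!given[["B"]])     B     <- ntree / L
    if (!given[["L"]])     L     <- ntree / B
    if (B != round(B) || L != round(L)) {
        stop("ntree must be an integer multiple of B and L")
    }
    list(ntree = ntree, B = B, L = L)
}

# The example from the text: 5000 trees, 25 common observations
params <- complete_forest_params(ntree = 5000, B = 25)
print(params$L)  # 200

# Subsample size on the order of sqrt(n); n = 506 is an illustrative dataset size
n <- 506
subsample_size <- ceiling(sqrt(n))
print(subsample_size)  # 23
```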

In [6]:
# User input
folds <- 20                  # number of random train/test splits
validation_grid_size <- 100
alpha <- .05                 # significance level for the confidence intervals


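# Iterate over all datasets and build the training command for each one
for (dataset_name in names(web_loader)){
    # paste() converts the numeric arguments to strings and joins them with spaces
    command_string <- paste("./training.r",
                            dataset_name,
                            folds,
                            validation_grid_size,
                            alpha)

    # Better to run inside a terminal multiplexer session (e.g. screen);
    # uncomment the call below to launch the command from here instead.
#     system(command=command_string,
#     intern=TRUE, ignore.stdout=FALSE, ignore.stderr=FALSE
#     )

}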

dataset  housing  nrows:  506  ncols  14 

dataset  concrete  nrows:  1030  ncols  9 

dataset  wine  nrows:  1599  ncols  12 

dataset  energy  nrows:  768  ncols  9 

dataset  yacht  nrows:  308  ncols  7 

dataset  gas_sensor  nrows:  3843160  ncols  19 

