# VS-k Small Hipster HPO
This is our hyperparameter optimization (HPO) notebook. A test run takes under 10 minutes on Kubernetes with four r4.rxlarge EC2 instances.

### Parameters
 - input bucket location - Lynn's S3 bucket (us-west-2 / Oregon)
 - input files `hipster...` and feature column (`-fc`) `label`
 - number of Spark executors: 50, with 2 GB RAM each
 - number of RF trees is the value being tested (line 22 in Grid section), with a batch size of 28
 - default `mtry`
 - FIX - return the OOB calculation; the `-ro` parameter is configured and runs (look in the driver logs)

To use this, do the following:
- In the **Grid Search** section, line 23: change the number of trees to be created
- Fix the `spark-submit...` command so that it can run OOB (fix the `-ro` parameter on line 53)

Output should look like this:
 - process finished at 86.14959335327148 seconds using 10 trees
 - process finished at 927.3492593765259 seconds using 100 trees
 - process finished at 117.82438707351685 seconds using 200 trees
 - process finished at 157.99670958518982 seconds using 500 trees
 - process finished at 230.5197627544403 seconds using 1000 trees
 - The ntree value that took the least amount of time was 10 at 86.14959335327148 seconds
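
To compare runs, the timing lines can be pulled back out of the cell output. The following is a minimal sketch, not part of the original notebook: the helper name `parse_timings` and the regular expression are assumptions based only on the line format shown above.

```python
import re

# Matches lines such as "process finished at 86.14 seconds using 10 trees"
LINE_RE = re.compile(r"process finished at ([0-9.]+) seconds using (\d+) trees")


def parse_timings(text):
    """Return {n_trees: seconds} for every timing line found in `text`.

    Hypothetical helper; it assumes the exact wording printed by the grid search cell.
    """
    return {int(m.group(2)): float(m.group(1)) for m in LINE_RE.finditer(text)}


if __name__ == "__main__":
    sample = (
        "process finished at 86.14959335327148 seconds using 10 trees\n"
        "process finished at 117.82438707351685 seconds using 200 trees\n"
    )
    for n_trees, seconds in sorted(parse_timings(sample).items()):
        print("{} trees: {:.1f} s".format(n_trees, seconds))
```

Keying the result by tree count keeps one entry per grid value, which makes it easy to spot a run whose time is out of line with its neighbours.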


# Imports

In [1]:
import shlex
from subprocess import Popen, PIPE
import time 

# Deal with prechecks

In [2]:
%%bash

set -e

MASTER=https://kubernetes.default.svc:443
INPUT_BUCKET=variant-spark

function fatal_error () {
	echo "ERROR: $1" 1>&2
	exit 1
}

if [ -z "${MASTER+x}" ];
    then
        echo "You must set the MASTER environment variable to a kubernetes API endpoint";
        echo "Example: https://ABC.sk1.us-west-2.eks.amazonaws.com:443"
        exit 1
fi

if [ -z "${INPUT_BUCKET+x}" ];
    then
        echo "You must set the INPUT_BUCKET environment variable to a bucket containing input data";
        echo "Example: variant-spark-k-storage"
        exit 1
fi

[[ $(type -P "spark-submit") ]] || fatal_error  "\`spark-submit\` cannot be found. Please make sure it's on your PATH."
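
The same prechecks can also be run from Python before the grid search cell launches anything. The following is a minimal sketch, not part of the original notebook: the helper name `run_prechecks` is hypothetical, and it assumes `MASTER` and `INPUT_BUCKET` are passed in as plain strings.

```python
import shutil


def run_prechecks(master, input_bucket, tool="spark-submit"):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper that mirrors the bash prechecks above.
    """
    problems = []
    if not master:
        problems.append("MASTER must be set to a Kubernetes API endpoint")
    if not input_bucket:
        problems.append("INPUT_BUCKET must be set to a bucket containing input data")
    # shutil.which returns None when the executable is not on PATH
    if shutil.which(tool) is None:
        problems.append("`{}` cannot be found on PATH".format(tool))
    return problems


if __name__ == "__main__":
    for problem in run_prechecks("https://kubernetes.default.svc:443", "variant-spark"):
        print("ERROR: {}".format(problem))
```

Returning the problems as a list, rather than exiting on the first one, lets the notebook report every missing prerequisite in a single pass.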


# Run Grid search

In [25]:
'''
We have a grid of values (trees or batches).
We iterate through the grid and time one run for each value.
We keep the fastest configuration in a variable.
Once the iteration finishes, we print that value.

'''

# Process file system params
MASTER="https://kubernetes.default.svc:443"
INPUT_BUCKET="variant-spark"
input_file = "s3a://{}/datasets/hipsterIndex/hipster.vcf".format(INPUT_BUCKET)
input_features = "s3a://{}/datasets/hipsterIndex/hipster_labels.txt".format(INPUT_BUCKET)
feature_column = "label"

# Process run params
n_iterations = 10
upper_bound = 10000
lower_bound = 100

# grid of tree counts to test (hand-picked values, not evenly spaced)
n_trees = [10,100,200,500,1000]

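# start at infinity so the first measured run always becomes the current minimum;
# a fixed starting value of 100 raises a KeyError below when every run takes longer than 100 s
least_time_val = float("inf")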
param_dict = {}

for tree_val in n_trees:
    
    time_start = time.time()
    
    MASTER="https://kubernetes.default.svc:443"
    INPUT_BUCKET="variant-spark"
    
    input_string = "spark-submit \
        --class au.csiro.variantspark.cli.VariantSparkApp \
        --driver-class-path ./conf \
        --master k8s://{0} \
        --deploy-mode cluster \
        --name VS-Hipster-HPO \
        --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
        --conf spark.executor.instances=50 \
        --conf spark.executor.memory=2g \
        --conf spark.kubernetes.container.image=jamesrcounts/variantspark:002 \
        --jars http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar,http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar,http://central.maven.org/maven2/joda-time/joda-time/2.9.9/joda-time-2.9.9.jar \
        local:///opt/spark/jars/variant-spark_2.11-0.2.0-SNAPSHOT-all.jar importance \
            -if {2} \
            -ff {3} \
            -fc {4} \
            -v \
            -rn {1} \
            -rbs 28 \
            -ro".format(MASTER, tree_val, input_file, input_features, feature_column)
  
    args = shlex.split(input_string)
    
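    process = Popen(args, stdout=PIPE, stderr=PIPE)
    # communicate() drains both pipes and waits for the process to exit.
    # wait() returns the exit code, never None, so a `while process.wait() == None`
    # loop never runs its body, and waiting without reading can block on a full pipe.
    stdout, stderr = process.communicate()
    if process.returncode != 0:
        print(stderr.decode())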
    time_stop = time.time()
    total_time = time_stop - time_start
    
    print("process finished at {} seconds using {} trees".format(total_time, tree_val))
    
    if (total_time < least_time_val):
        least_time_val = total_time
        
    param_dict[total_time] = tree_val
    
print("The ntree value that took the least amount of time was {} at {} seconds".format(param_dict[least_time_val], least_time_val))

process finished at 90.4185721874237 seconds using 10 trees
process finished at 116.50048351287842 seconds using 100 trees
process finished at 133.7640700340271 seconds using 200 trees
process finished at 210.01211428642273 seconds using 500 trees
process finished at 1139.2977166175842 seconds using 1000 trees
The ntree value that took the least amount of time was 10 at 90.4185721874237 seconds
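
The timing loop can also be factored into small functions so that the grid and the command can be changed independently. The following is a minimal sketch under stated assumptions, not the notebook's own implementation: `time_grid`, `fastest`, and the `build_args` callback are hypothetical names, and the demo uses a trivial Python process as a stand-in for `spark-submit`.

```python
import subprocess
import sys
import time


def time_grid(values, build_args):
    """Run one command per grid value and return {value: elapsed_seconds}.

    `build_args` is a hypothetical callback that maps a grid value to an
    argument list (in the notebook this would be the spark-submit arguments).
    """
    timings = {}
    for value in values:
        start = time.time()
        # run() waits for completion and reads both pipes, so they cannot fill up
        subprocess.run(build_args(value), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        timings[value] = time.time() - start
    return timings


def fastest(timings):
    """Return the (value, seconds) pair with the smallest elapsed time."""
    best = min(timings, key=timings.get)
    return best, timings[best]


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere: a Python process that exits at once.
    demo = time_grid([10, 100], lambda n: [sys.executable, "-c", "pass"])
    value, seconds = fastest(demo)
    print("The value that took the least amount of time was {} at {} seconds".format(value, seconds))
```

Keying the dictionary by grid value rather than by elapsed time means two runs with the same duration cannot overwrite each other, and `min` removes the need for a separate running-minimum variable.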
