## Hyperparameter Database

#### Submitted to RISE:<span class="girk">(Approved)</span>

   Hyperparameters are settings specified before a machine learning algorithm is run, and they strongly affect the predictive power of the resulting model. Knowing how important each hyperparameter is to an algorithm, and which range of values works well, is crucial to hyperparameter tuning and to building effective models. For experts and non-experts alike, finding hyperparameters that optimize model performance can be tedious and difficult. We are therefore developing a hyperparameter database that lets users visualize and understand how to choose hyperparameters that maximize the predictive power of their models.
   The database is created by running millions of hyperparameter values over thousands of public datasets and calculating the individual conditional expectation of model quality with respect to every hyperparameter. We analyze the effect of hyperparameters on algorithms such as Distributed Random Forest (DRF), Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), and several more. The database thus aims to be a one-stop platform where data scientists can identify the hyperparameters that most affect their models, speeding up the development of effective predictive models. The same public datasets will also be used to build models that predict good hyperparameters without search, and to visualize and teach concepts such as statistical power and the bias/variance tradeoff. The raw data will also be publicly available to the research community.
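
To make the individual conditional expectation (ICE) idea concrete, here is a minimal sketch in plain Python. It assumes a hypothetical `score` function that stands in for "train a model with this configuration and return its quality"; the function, the configurations, and the grid are illustrative and are not taken from the database. Each ICE curve varies one hyperparameter over a grid while the rest of one configuration stays fixed, and averaging the curves point by point gives the partial dependence.

```python
from statistics import mean

def score(config):
    """Hypothetical stand-in for 'train a model and return its quality'."""
    # Toy response surface: quality peaks at learn_rate = 0.1 and grows with depth.
    return 1.0 - (config["learn_rate"] - 0.1) ** 2 + 0.01 * config["max_depth"]

def ice_curves(configs, name, grid, score_fn):
    """One curve per configuration: vary `name` over `grid`, hold the rest fixed."""
    return [[score_fn({**cfg, name: value}) for value in grid] for cfg in configs]

def partial_dependence(curves):
    """Average the ICE curves point by point."""
    return [mean(points) for points in zip(*curves)]

# Illustrative configurations and grid (assumptions, not real runs)
configs = [{"learn_rate": 0.3, "max_depth": 3}, {"learn_rate": 0.3, "max_depth": 7}]
grid = [0.01, 0.1, 0.5]

curves = ice_curves(configs, "learn_rate", grid, score)
pd_curve = partial_dependence(curves)
best_value = grid[pd_curve.index(max(pd_curve))]
print(best_value)
```

In the real setting each call to `score` would be a full training run, so the curves would be read from stored results rather than recomputed.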

## What are hyperparameters?

Hyperparameters are parameters specified before a machine learning algorithm is run that have a large effect on the predictive power of the resulting model. They are set for tuning purposes, for example:

* `learning_rate` - learning rate
* `n_layers` - number of layers
* `n_neurons` - number of neurons
* Hidden Layers - number of layers and the size of each layer
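
As a small illustration of what tuning over such hyperparameters means, the sketch below enumerates every combination in a search space. The names and candidate values in `search_space` are assumptions chosen for the example, not values used by the database.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
search_space = {
    "learning_rate": [0.01, 0.1],
    "n_layers": [1, 2, 3],
    "n_neurons": [32, 64],
}

def grid(space):
    """Yield every combination of hyperparameter values as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

combinations = list(grid(search_space))
print(len(combinations))
```

The number of combinations is the product of the list lengths, which is why exhaustive search becomes expensive as hyperparameters are added.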

##### Hyperparameters are important because they directly control the behaviour of the training algorithm and have a significant impact on the performance of the model that is being trained.

In [1]:
import h2o
from h2o.automl import H2OAutoML
import random, os, sys
from datetime import datetime
import pandas as pd
import logging
import csv
import optparse
import time
import json
from distutils.util import strtobool
import psutil

import warnings
warnings.filterwarnings('ignore')

In [36]:
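# min_mem_size is not defined in the cells shown; assume half of the available RAM, in GB
min_mem_size=max(1, int(psutil.virtual_memory().available * 0.5 / 1024**3))
port_no=random.randint(5555,55555)
h2o.init(strict_version_check=False,min_mem_size_GB=min_mem_size,port=port_no)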

Checking whether there is an H2O instance running at http://localhost:20426..... not found.
Attempting to start a local H2O server...
  OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)
  Starting server from C:\Users\prabh\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\prabh\AppData\Local\Temp\tmptuxsytfj
  JVM stdout: C:\Users\prabh\AppData\Local\Temp\tmptuxsytfj\h2o_prabh_started_from_python.out
  JVM stderr: C:\Users\prabh\AppData\Local\Temp\tmptuxsytfj\h2o_prabh_started_from_python.err
  Server is running at http://127.0.0.1:20426
Connecting to H2O server at http://127.0.0.1:20426... successful.


| Property | Value |
|---|---|
| H2O cluster uptime: | 04 secs |
| H2O cluster version: | 3.12.0.1 |
| H2O cluster version age: | 1 year, 8 months and 16 days !!! |
| H2O cluster name: | H2O_from_python_prabh_et3bww |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 5.750 Gb |
| H2O cluster total cores: | 12 |
| H2O cluster allowed cores: | 12 |
| H2O cluster status: | accepting new members, healthy |
| H2O connection url: | http://127.0.0.1:20426 |


In [37]:
#importing data to the server
df = h2o.import_file(path="./Dataset/loan.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


#### Using the loan dataset as an example, we try to predict whether a loan is a bad loan

In [38]:
#Checking the heads
df.head()

| loan_amnt | term | int_rate | emp_length | home_ownership | annual_inc | purpose | addr_state | dti | delinq_2yrs | revol_util | total_acc | bad_loan | longest_credit_length | verification_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5000 | 36 months | 10.65 | 10 | RENT | 24000 | credit_card | AZ | 27.65 | 0 | 83.7 | 9 | 0 | 26 | verified |
| 2500 | 60 months | 15.27 | 0 | RENT | 30000 | car | GA | 1.0 | 0 | 9.4 | 4 | 1 | 12 | verified |
| 2400 | 36 months | 15.96 | 10 | RENT | 12252 | small_business | IL | 8.72 | 0 | 98.5 | 10 | 0 | 10 | not verified |
| 10000 | 36 months | 13.49 | 10 | RENT | 49200 | other | CA | 20.0 | 0 | 21.0 | 37 | 0 | 15 | verified |
| 5000 | 36 months | 7.9 | 3 | RENT | 36000 | wedding | AZ | 11.2 | 0 | 28.3 | 12 | 0 | 7 | verified |
| 3000 | 36 months | 18.64 | 9 | RENT | 48000 | car | CA | 5.35 | 0 | 87.5 | 4 | 0 | 4 | verified |
| 5600 | 60 months | 21.28 | 4 | OWN | 40000 | small_business | CA | 5.55 | 0 | 32.6 | 13 | 1 | 7 | verified |
| 5375 | 60 months | 12.69 | 0 | RENT | 15000 | other | TX | 18.08 | 0 | 36.5 | 3 | 1 | 7 | verified |
| 6500 | 60 months | 14.65 | 5 | OWN | 72000 | debt_consolidation | AZ | 16.12 | 0 | 20.6 | 23 | 0 | 13 | not verified |
| 12000 | 36 months | 12.69 | 10 | OWN | 75000 | debt_consolidation | CA | 10.78 | 0 | 67.1 | 34 | 0 | 22 | verified |




In [39]:
# Assume the following are passed by the user from the web interface

'''
Need a user id and project id?

'''
target='bad_loan' 
data_file='loan.csv'
run_time=333
run_id='SOME_ID_20180617_221529' # Just some arbitrary ID
server_path='./Dataset/'
classification=True
scale=False
max_models=None
balance_y=False # balance_classes=balance_y
balance_threshold=0.2
project ="automl_test"  # project_name = project

#### All that we need is the `target`, and our AI software does the rest.

In [40]:
# assign target and inputs for logistic regression
y = target
X = [name for name in df.columns if name != y]
print(y)
print(X)

bad_loan
['loan_amnt', 'term', 'int_rate', 'emp_length', 'home_ownership', 'annual_inc', 'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'revol_util', 'total_acc', 'longest_credit_length', 'verification_status']


In [42]:
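# impute missing values
# `ints` and `reals` are not defined in the cells shown; assume they list the
# integer and real-valued predictor columns, derived here from the frame's types
ints = [name for name in X if df.types[name] == 'int']
reals = [name for name in X if df.types[name] == 'real']
_ = df[reals].impute(method='mean')
_ = df[ints].impute(method='median')

if scale:
    df[reals] = df[reals].scale()
    df[ints] = df[ints].scale()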

In [43]:
# set target to factor for classification by default or if user specifies classification
if classification:
    df[y] = df[y].asfactor()

In [44]:
df[y].levels()

[['0', '1']]

In [46]:
# Use local data file or download from some type of bucket
import os

data_path=os.path.join(server_path,data_file)
data_path

'./Dataset/loan.csv'

In [47]:
if classification:
    class_percentage = y_balance=df[y].mean()[0]/(df[y].max()-df[y].min())
    if class_percentage < balance_threshold:
        balance_y=True
        

print(run_time)
type(run_time)

333


int
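
The balance check in the cell above boils down to asking what fraction of rows falls in the positive class and comparing it with `balance_threshold`. A minimal sketch in plain Python, assuming a binary 0/1 target and using only the `bad_loan` values of the ten preview rows shown earlier (not the full dataset), so the resulting number is purely illustrative:

```python
def minority_fraction(labels):
    """Fraction of rows that belong to the rarest class."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return min(counts.values()) / len(labels)

# bad_loan values of the ten preview rows shown earlier (not the full dataset)
labels = [0, 1, 0, 0, 0, 0, 1, 1, 0, 0]
balance_threshold = 0.2  # same threshold as in the configuration cell

fraction = minority_fraction(labels)
balance_y = fraction < balance_threshold  # request class balancing only for rare classes
print(fraction, balance_y)
```

For a 0/1 target whose positive class is the rarer one, this fraction equals the column mean that the notebook cell computes.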

In [49]:
# automl
# runs for at most run_time seconds, then builds a stacked ensemble

aml = H2OAutoML(max_runtime_secs=run_time, project_name=project)  # init automl; run_time is 333 seconds here
aml.train(x=X,
          y=y,
          training_frame=df)

AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


#### We run thousands of hyperparameter combinations and select the best of them.

In [50]:
# view leaderboard
lb = aml.leaderboard
lb

| model_id | auc | logloss |
|---|---|---|
| StackedEnsemble_0_AutoML_20190223_190631 | 0.700333 | 0.438033 |
| GLM_grid_0_AutoML_20190223_190631_model_1 | 0.694005 | 0.439154 |
| GLM_grid_0_AutoML_20190223_190631_model_0 | 0.694005 | 0.439154 |
| GBM_grid_0_AutoML_20190223_190631_model_0 | 0.691778 | 0.440359 |
| XRT_0_AutoML_20190223_190631 | 0.685696 | 0.442635 |
| DRF_0_AutoML_20190223_190631 | 0.681921 | 0.445599 |
| GBM_grid_0_AutoML_20190223_190631_model_1 | 0.662744 | 0.470498 |




In [63]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
# index 3 is the top-ranked GBM on the leaderboard above, not the overall leader
mod_best=h2o.get_model(model_set[3])
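
The hard-coded index in the cell above depends on the leaderboard order, which can change from run to run. One alternative, sketched here in plain Python, is to pick the best model of an algorithm family by the prefix of its `model_id`. The rows are copied from the leaderboard output; the helper name `best_of_family` is hypothetical and not part of H2O.

```python
# (model_id, auc) pairs copied from the leaderboard output above
leaderboard = [
    ("StackedEnsemble_0_AutoML_20190223_190631", 0.700333),
    ("GLM_grid_0_AutoML_20190223_190631_model_1", 0.694005),
    ("GLM_grid_0_AutoML_20190223_190631_model_0", 0.694005),
    ("GBM_grid_0_AutoML_20190223_190631_model_0", 0.691778),
    ("XRT_0_AutoML_20190223_190631", 0.685696),
    ("DRF_0_AutoML_20190223_190631", 0.681921),
    ("GBM_grid_0_AutoML_20190223_190631_model_1", 0.662744),
]

def best_of_family(rows, prefix):
    """Return the highest-AUC model_id whose name starts with `prefix`."""
    family = [row for row in rows if row[0].startswith(prefix)]
    if not family:
        raise ValueError(f"no model with prefix {prefix!r}")
    return max(family, key=lambda row: row[1])[0]

best_gbm = best_of_family(leaderboard, "GBM")
print(best_gbm)
```

With a running cluster, the returned ID could then be passed to `h2o.get_model`, as the notebook cell does with `model_set[3]`.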

In [66]:
mod_best.params

{'model_id': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'ModelKeyV3',
    'schema_type': 'Key<Model>'},
   'name': 'GBM_grid_0_AutoML_20190223_190631_model_0',
   'type': 'Key<Model>',
   'URL': '/3/Models/GBM_grid_0_AutoML_20190223_190631_model_0'}},
 'training_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_training_py_17_sid_a299',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_training_py_17_sid_a299'}},
 'validation_frame': {'default': None,
  'actual': {'__meta': {'schema_version': 3,
    'schema_name': 'FrameKeyV3',
    'schema_type': 'Key<Frame>'},
   'name': 'automl_validation_py_17_sid_a299',
   'type': 'Key<Frame>',
   'URL': '/3/Frames/automl_validation_py_17_sid_a299'}},
 'nfolds': {'default': 0, 'actual': 5},
 'keep_cross_validation_predictions': {'default': False, 'actual': True},
 'keep_cross_validation_fold_assignment

<br>
<br>
'ntrees': {'default': 50, <span class="girk">'actual'</span>: 33},

'max_depth': {'default': 5, <span class="girk">'actual'</span>: 4},

'learn_rate': {'default': 0.1, <span class="girk">'actual'</span>: 0.8}

<br>
<br>
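
The values highlighted above can be pulled out of the parameter dictionary programmatically. The sketch below works on a small excerpt with the same `default`/`actual` shape as `mod_best.params`, using only values that appear in the output; the helper name `changed_from_default` is hypothetical.

```python
# Excerpt with the same shape as mod_best.params, using values shown in the output above
params = {
    "nfolds": {"default": 0, "actual": 5},
    "keep_cross_validation_predictions": {"default": False, "actual": True},
    "ntrees": {"default": 50, "actual": 33},
    "max_depth": {"default": 5, "actual": 4},
    "learn_rate": {"default": 0.1, "actual": 0.8},
}

def changed_from_default(params, names=None):
    """Map each selected parameter whose actual value differs from its default
    to a (default, actual) pair."""
    if names is None:
        selected = params
    else:
        selected = {name: params[name] for name in names if name in params}
    return {
        name: (entry["default"], entry["actual"])
        for name, entry in selected.items()
        if entry["actual"] != entry["default"]
    }

tuned = changed_from_default(params, names=["ntrees", "max_depth", "learn_rate"])
print(tuned)
```

Storing these pairs for every trained model, together with its score, is the kind of record from which per-hyperparameter plots can be built.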

#### We plot each hyperparameter against its values to find the best range of values.
----------------------
<img src="Images/n_estimator.PNG" width="700"/>

#### We do the same for each of the other hyperparameters
<br>

---------------------
<img src="Images/max_depth.PNG" width="700"/>

#### Beyond value ranges, the plots also show the relative importance of each hyperparameter.

<img src="Images/hy_importance.jpg" width="660"/>
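
One deliberately simple way to rank hyperparameters by importance is sketched below: group the runs by each value of a hyperparameter, average the metric per group, and take the spread between the best and worst group. This is an illustrative stand-in and not necessarily the method behind the plot above; the run records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: hyperparameter values plus the resulting AUC.
runs = [
    {"ntrees": 20, "max_depth": 3, "auc": 0.66},
    {"ntrees": 20, "max_depth": 5, "auc": 0.68},
    {"ntrees": 50, "max_depth": 3, "auc": 0.69},
    {"ntrees": 50, "max_depth": 5, "auc": 0.71},
]

def marginal_range(runs, name, metric="auc"):
    """Spread between the best and worst mean metric across the values of `name`."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[name]].append(run[metric])
    means = [mean(values) for values in groups.values()]
    return max(means) - min(means)

importance = {name: marginal_range(runs, name) for name in ("ntrees", "max_depth")}
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

A measure like this ignores interactions between hyperparameters, which is one reason richer interpretability metrics are of interest.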

### We will develop novel hyperparameter interpretability metrics, inspired by model interpretability metrics, such as:

<font size="6">

* Global surrogate models
* Word embeddings
* Individual conditional expectation (ICE) plots
* K-local interpretable model-agnostic explanations (K-LIME)
* Leave-one-covariate-out (LOCO)
* Local feature importance

</font>

<font size="6">

* Partial dependence plots
* Random forest feature importance
* Standardized coefficient importance
* Visualization of neural network layers
* Generalized low rank estimators
* Feature extraction and ranking
* Accumulated local effects (ALE)
* Shapley values

</font>

## Currently

#### The hyperparameter database analyzes the effect of hyperparameters on the following algorithms:

<center><font size='5'>

* Distributed Random Forest (DRF)
* Generalized Linear Model (GLM)
* Gradient Boosting Machine (GBM)
* Naïve Bayes Classifier
* Stacked Ensembles
* XGBoost
* Deep Learning Models (Neural Networks)

</font></center>

#### Data dump for hyperparameter researchers and Kaggle competitions

<center><img src="Images/data_dump.png" width="700"/></center>

In [1]:
from IPython.display import IFrame

## Database - UML Diagram



In [2]:
IFrame(src='./HP_Database_UML_Diagram.html', width=900, height=700)
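
The UML diagram is embedded from a separate HTML file, so its tables are not visible in this text. As a rough illustration of how a trained model and its hyperparameters could be stored and queried, the sketch below uses a hypothetical two-table schema. The table and column names are assumptions rather than the project's actual design, and `sqlite3` is used in place of MySQL only so the example is self-contained. The inserted values come from the GBM model inspected earlier.

```python
import sqlite3

# Hypothetical minimal schema; the real design is the one in the UML diagram.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model (
    model_id  TEXT PRIMARY KEY,
    algorithm TEXT NOT NULL,
    dataset   TEXT NOT NULL,
    auc       REAL
);
CREATE TABLE hyperparameter (
    model_id TEXT NOT NULL REFERENCES model(model_id),
    name     TEXT NOT NULL,
    value    TEXT NOT NULL,
    PRIMARY KEY (model_id, name)
);
""")

model_id = "GBM_grid_0_AutoML_20190223_190631_model_0"
conn.execute(
    "INSERT INTO model VALUES (?, ?, ?, ?)",
    (model_id, "GBM", "loan.csv", 0.691778),
)
conn.executemany(
    "INSERT INTO hyperparameter VALUES (?, ?, ?)",
    [(model_id, "ntrees", "33"), (model_id, "max_depth", "4"), (model_id, "learn_rate", "0.8")],
)

# All hyperparameters recorded for GBM models, ordered by name
rows = conn.execute(
    "SELECT h.name, h.value FROM hyperparameter h "
    "JOIN model m ON m.model_id = h.model_id "
    "WHERE m.algorithm = ? ORDER BY h.name",
    ("GBM",),
).fetchall()
print(rows)
```

Keeping one row per (model, hyperparameter) pair lets algorithms with different hyperparameter sets share the same tables.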

### Website for the Hyperparameter Database

##### Technology stack:

   Front-End & Back-End: Django Framework

   Database: MySQL

   Hosting Platform: GCP

* Why Django Framework?

* Why GCP?

##### GCP services used:
   Compute Engine
