<a href="https://colab.research.google.com/github/harnalashok/h2o/blob/master/biological_h2o.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
"""
Last amended: 1st Jan, 2021
My folder: C:\\Users\\Administrator\\OneDrive\\Documents\\biological_response
Data Source: https://www.kaggle.com/c/bioresponse/overview

Objectives:
          i) Experiments in neural networks and deep learning
         ii) Understanding weight-initialization strategy
        iii) Learning to work in h2o
         iv) h2o on Google Colab
          v) Drug designing


DO NOT EXECUTE THIS CODE IN SPYDER--IT MAY FAIL

Ref:
Machine Learning with python and H2O
   https://www.h2o.ai/wp-content/uploads/2018/01/Python-BOOKLET.pdf
H2o deeplearning (latest) booklet
   http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/DeepLearningBooklet.pdf

"""

In [None]:
# -3.0 Install java run-time
! apt-get install -y default-jre
!java -version

In [None]:
# https://medium.com/@naeemasvat.na/how-to-use-h2o-in-google-colab-b69ba539ab1a
# -2.0 Install h2o
! pip install h2o

In [None]:
# -1.0 Mount your google drive 
#      so that you can access data files 
#      on your Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# 1.0 Call libraries
%reset -f
import pandas as pd
import h2o
import os
# 1.1
from h2o.estimators.deeplearning import H2ODeepLearningEstimator


In [None]:
# 1.2 Display output of multiple commands from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# 2. Start h2o
h2o.init(max_mem_size = "2G")

In [None]:
# 3. Change working folder and read bio_response data
# os.chdir("C:\\Users\\Administrator\\OneDrive\\Documents\\biological_response")
# os.chdir("D:\\data\\OneDrive\\Documents\\biological_response")


In [None]:
# 3.1 Read data file (colab code)
bio =h2o.import_file("/content/drive/MyDrive/MiscFiles/bio_response.csv")

In [None]:
# 3.2 Explore
type(bio)           #  h2o.frame.H2OFrame

# 3.3
bio.shape
bio.head(3)     # bio.head().as_data_frame()
bio.tail(3)     # bio.tail().as_data_frame()


In [None]:
# 3.4 Transform target to factor column
bio['Activity'] = bio['Activity'].asfactor()

# 3.4.1 List the factor levels of this column
bio['Activity'].levels()


In [None]:
# 3.5 Which are predictors and which one is target column
col = bio.columns
x = col[1:]
y = "Activity"
x[:5]

In [None]:
# 4.0 Split the dataset into train/test

train,test = bio.split_frame(ratios= [0.7])
train.shape
test.shape

# Weight Initialization strategy
Ref: [this Cross Validated answer](https://stats.stackexchange.com/a/47604/78454) and [this question.](https://stats.stackexchange.com/questions/204114/deep-neural-network-weight-initialization)<br>
Ref: For writing formulas in LaTeX notation, [see this reference](http://www.malinc.se/math/latex/basiccodeen.php)<br>
Let us assume you are using sigmoid neurons, i.e. logistic neurons.<br>

The logistic function is close to flat for large positive or negative inputs. The derivative at an input of 2 is about 1/10, but at 10 the derivative is about 1/22000. This means that if the input of a logistic neuron is 10 then, for a given training signal, the neuron will learn about 2200 times slower than if the input were 2.<br>
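
The two derivatives quoted above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `sigmoid` and `sigmoid_grad` are illustrative and not part of h2o.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

grad_at_2 = sigmoid_grad(2.0)     # roughly 1/10
grad_at_10 = sigmoid_grad(10.0)   # roughly 1/22000
slowdown = grad_at_2 / grad_at_10 # how much smaller the gradient is at 10

print(f"derivative at 2  : {grad_at_2:.5f}")
print(f"derivative at 10 : {grad_at_10:.3e}")
print(f"slow-down factor : {slowdown:.0f}")
```

The printed slow-down factor is on the order of a couple of thousand, which is the size of the effect described above.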

If you want the neuron to learn quickly, you either need to produce a huge training signal (such as with a cross-entropy loss function) or you want the derivative to be large. To make the derivative large, you set the initial weights so that you often get inputs in the range [−4,4].<br>

The initial weights you give might or might not work. It depends on how the inputs are normalized. If the inputs are normalized to have mean 0 and standard deviation 1, then the weighted sum of d inputs, with weights uniform on
$ [-1/\sqrt{d}, 1/\sqrt{d} ]$, will have mean 0 and variance 1/3, independent of d. The probability that you get a sum outside of [−4,4] is small. That means that as you increase d, you are not causing the neurons to start out saturated, which would keep them from learning.<br>

With inputs which are not normalized, those weights may not be effective at avoiding saturation.<br><br>
Some Maths:<br>
Assume there are d inputs $x_1, \dots, x_d$ and all are normalized, so each input has mean 0 and variance 1. Let the weight from input $x_i$ to the next neuron be $w_i$, drawn independently of the inputs from a uniform distribution on $ [-1/\sqrt{d}, 1/\sqrt{d} ]$. Then <br><br>
$ E(\sum_{i=1}^{d} w_i x_i) = \sum_{i=1}^{d} E(w_i) E(x_i) = d \cdot 0 = 0 $. And the variance of the weighted sum of inputs at a neuron is:  <br>
$ Var(\sum_{i=1}^{d} w_i x_i) = Var(w_1 x_1) + Var(w_2 x_2) + Var(w_3 x_3) + ... $ (d terms) <br><br>
For a uniform distribution on [a,b], the variance is:
$ (b-a)^2 /12 $ <br>
Therefore, for the uniform distribution on $ [-1/\sqrt{d}, 1/\sqrt{d} ]$, $ Var(w_i) = (2/\sqrt{d})^2/12 = 1/(3d) $.<br>
Since $w_i$ and $x_i$ are independent and both have mean 0, $ Var(w_i x_i) = E(w_i^2) E(x_i^2) = Var(w_i) Var(x_i) = 1/(3d) $.<br>
As all terms are identically distributed, <br>
$ Var(w_1 x_1) + Var(w_2 x_2) + ... $ (d terms) $ = d \cdot 1/(3d) = 1/3 $.<br><br>
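
The 1/3 result can also be checked by simulation. The sketch below is only an illustration under stated assumptions: inputs are drawn from a standard normal distribution (any distribution with mean 0 and variance 1 would serve), and the helper name `presum_samples` is hypothetical. It uses plain Python so that it runs without extra packages.

```python
import math
import random
import statistics

def presum_samples(d, n_trials, rng):
    """Draw n_trials values of sum_i w_i * x_i for a neuron with d inputs."""
    limit = 1.0 / math.sqrt(d)
    sums = []
    for _ in range(n_trials):
        total = 0.0
        for _ in range(d):
            w = rng.uniform(-limit, limit)  # weight ~ U[-1/sqrt(d), 1/sqrt(d)]
            x = rng.gauss(0.0, 1.0)         # assumed input: mean 0, variance 1
            total += w * x
        sums.append(total)
    return sums

rng = random.Random(0)   # fixed seed so the run is repeatable
results = {}
for d in (10, 100):
    s = presum_samples(d, n_trials=5000, rng=rng)
    mean = statistics.mean(s)
    var = statistics.pvariance(s)
    outside = sum(abs(v) > 4 for v in s) / len(s)  # fraction outside [-4, 4]
    results[d] = (mean, var, outside)
    print(f"d={d:4d}  mean={mean:+.3f}  variance={var:.3f}  outside[-4,4]={outside:.4f}")
```

For both values of d the estimated variance should sit close to 1/3, and very few (if any) sums should fall outside [−4,4].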
*Glorot initialization for sigmoid activation*<br>
$fan_{avg} = (fan_{in} + fan_{out})/2 $ <br>
Normal distribution with mean 0 and variance: $ 1/fan_{avg} $<br>
OR, a uniform distribution between -r and +r with r =
$ \sqrt{3/fan_{avg}} $ <br>
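
As a small worked example of the Glorot formulas, the sketch below computes the normal standard deviation and the uniform limit r for one layer, and confirms that both choices give the same variance. The function name `glorot_limits` is illustrative only; the layer sizes 64 and 32 are borrowed from the `hidden = [64,32,16]` network used later in this notebook.

```python
import math

def glorot_limits(fan_in, fan_out):
    """Return (sigma, r) for Glorot initialization of one layer.

    sigma is the standard deviation of the normal variant,
    r is the half-width of the uniform variant on [-r, +r].
    """
    fan_avg = (fan_in + fan_out) / 2.0
    sigma = math.sqrt(1.0 / fan_avg)
    r = math.sqrt(3.0 / fan_avg)
    return sigma, r

sigma, r = glorot_limits(64, 32)

# Variance of a uniform distribution on [-r, r] is (2r)^2 / 12 = r^2 / 3
uniform_variance = (2 * r) ** 2 / 12

print(f"sigma = {sigma:.4f}, r = {r:.4f}")
print(f"normal variance  = {sigma ** 2:.5f}")
print(f"uniform variance = {uniform_variance:.5f}")
```

With fan_avg = 48 the uniform limit is r = 0.25, and both variants have variance 1/48.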

In [None]:
# 4.1 Instantiate a simple deeplearning model
#     We vary the weight-initialization scheme to see its
#     effect on validation error
#     Note: The right weight initialization depends on the activation function being used.

df = []
initial_wt_dist = ["Normal", "Uniform Adaptive", "Uniform" ]

# 4.2
for i in range(3):
    # 4.3 Instantiate the model
    dl =H2ODeepLearningEstimator(
                                   distribution="bernoulli",
                                   activation = "Tanh",
                                   hidden = [64,32,16],
                                   epochs = 100,           # Even though epochs are 100,
                                                           # iterations stop very early.
                                                           # After some time the progress bar
                                                           # suddenly jumps to 100%
                                   score_each_iteration = True,
                                   initial_weight_distribution = initial_wt_dist[i]
                                  )
    # 4.4 Begin training
    dl.train(
              x= x,            # Predictor columns
              y= y,            # Target
              training_frame=train,  # training data
              validation_frame = test
             )

    # 4.5 Append dl object to list
    df.append(dl)       


In [None]:
# 4.6 Get scoring history for each of the three models
df_normal = df[0].scoring_history()
df_ua = df[1].scoring_history()
df_un = df[2].scoring_history()
# 4.7
df_normal.columns
#4.8
df_normal.head(4)
df_normal.tail(4)
df_ua[['validation_classification_error','training_classification_error']].head(3)
df_un[['validation_classification_error','training_classification_error']].head(3)
df_normal[['validation_classification_error','training_classification_error']].head(3)

In [None]:
# 5.0 Plot validation errors for different
#     initialization schemes:

import matplotlib.pyplot as plt
# 5.1
fig = plt.figure()
ax = fig.add_subplot(111)
# 5.2
_=ax.plot(df_ua[['iterations']],df_ua[['validation_classification_error']],label = "Uniform Adaptive", color = "red")
_=ax.plot(df_un[['iterations']],df_un[['validation_classification_error']],label = "Uniform", color = "black")
_=ax.plot(df_normal[['iterations']],df_normal[['validation_classification_error']], label = "Normal",   color = "blue")
_=ax.legend()
_= ax.set_title("Effect of wt-initialization strategy")
ax.grid()

In [None]:
# 6.0 Plot training errors for the three initialization schemes
fig = plt.figure()
ax = fig.add_subplot(111)

_=ax.plot(df_ua[['training_classification_error']]    , label = "Uniform Adaptive", color = "red")
_=ax.plot(df_un[['training_classification_error']]    , label = "Uniform",          color = "black")
_=ax.plot(df_normal[['training_classification_error']], label = "Normal",           color = "blue")
_=ax.legend()
ax.grid()

In [None]:
# 7.0
fig = plt.figure()
ax = fig.add_subplot(111)

_=ax.plot(df_ua[['iterations']], df_ua[['training_classification_error']]    , label = "Training error",    color = "red")
_=ax.plot(df_ua[['iterations']], df_ua[['validation_classification_error']]  , label = "Validation error",color = "black")
_=ax.legend()
_=ax.set_title("Learning Curve--Overfitting is obvious")
ax.grid(which='major', linestyle='-', linewidth='0.5', color='red')
# Turn on the minor TICKS, which are required for the minor GRID
ax.minorticks_on()
ax.grid(which='minor', linestyle=':', linewidth='0.5', color='black')

In [None]:
# 8.0 
dl =H2ODeepLearningEstimator(
                               distribution="bernoulli",
                               activation = "rectifierwithdropout",   # CHANGED
                               hidden = [32,32,32],
                               epochs = 25,                      # CHANGED
                                                                # Iterations stop very early. You can see this from the
                                                                # progress bar, which suddenly jumps to 100% after a time
                               hidden_dropout_ratios = [0.4,0.5,0.5] ,  # ADDED
                               #input_dropout_ratio = 0.2,
                               l1= 1e-5,   
                               l2= 1e-5,
                               score_each_iteration = True,
                               initial_weight_distribution = "Uniform Adaptive",
                               variable_importances = True
                              )

dl.train(
          x= x,            # Predictor columns
          y= y,            # Target
          training_frame=train,  # training data
          validation_frame = test
         )

In [None]:
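# 9.0 Predict on the test frame. Called without arguments, logloss()
#     and auc() report training metrics; pass valid=True for the
#     validation-frame values.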
dl.predict(test)
dl.logloss()
dl.auc()

In [None]:
# 9.1
dx1 = dl.scoring_history()

In [None]:
# 9.2
fig = plt.figure()
ax = fig.add_subplot(111)

_=ax.plot(dx1[['iterations']],dx1[['training_classification_error']]    , label = "Training error",    color = "red")
_=ax.plot(dx1[['iterations']],dx1[['validation_classification_error']]  , label = "Validation error",color = "black")
_=ax.legend()
_= ax.set_title("Learning Curve--No overfitting")
_= ax.set_ylim([0.10,0.40])
ax.minorticks_on()
ax.grid(which = "major", color = "red")
ax.grid(which = "minor", linestyle = "--")

In [None]:
# 9.3
fig = plt.figure()
ax = fig.add_subplot(111)

_=ax.plot(dx1[['iterations']],dx1[['training_classification_error']]    , label = "Training error",    color = "red")
_=ax.plot(dx1[['iterations']],dx1[['validation_classification_error']]  , label = "Validation error",color = "black")
_=ax.legend()
_= ax.set_title("Learning Curve--No overfitting")
_= ax.set_ylim([0.10,0.40])
ax.minorticks_on()
ax.grid(which = "major", color = "red")
ax.grid(which = "minor", linestyle = "--")

In [None]:
# 10.0
# https://stackoverflow.com/q/45442608/3282777
# Feature importance is listed in decreasing order.
#  Variable importance considers the weights connecting
#  the input features to the first two hidden layers.
#  The higher the connecting weights, the more important the feature is.

dl.varimp()

In [None]:
import numpy as np
# https://stackoverflow.com/q/45442608/3282777
f_impt = pd.DataFrame.from_records(dl.varimp(), columns = ["feature", "relative_importance", "scaled_importance", "percentage"])
f_impt
f_impt['scaled_importance']/np.sum(f_impt['scaled_importance'])
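
The division in the last line above should reproduce the `percentage` column. The sketch below illustrates why, under the assumption that `scaled_importance` is the relative importance divided by its maximum and `percentage` is the relative importance divided by its sum. The feature names and importance values are made up purely for illustration and are not results from this dataset.

```python
# Hypothetical (feature, relative_importance) pairs, NOT real model output
relative = [("feat_a", 4.0), ("feat_b", 3.0), ("feat_c", 1.0)]

max_rel = max(v for _, v in relative)
sum_rel = sum(v for _, v in relative)

# Assumed definitions: scaled = relative / max, percentage = relative / sum
rows = [(name, v, v / max_rel, v / sum_rel) for name, v in relative]

# Renormalizing the scaled column by its own sum gives back the percentages
scaled_sum = sum(row[2] for row in rows)
renormalized = [row[2] / scaled_sum for row in rows]
percentages = [row[3] for row in rows]

for (name, rel, scaled, pct), renorm in zip(rows, renormalized):
    print(f"{name}: relative={rel:.2f} scaled={scaled:.2f} "
          f"percentage={pct:.3f} renormalized={renorm:.3f}")
```

Because the maximum cancels out in the ratio, the renormalized scaled importances and the percentages coincide.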

In [None]:
help(dl.varimp)

In [None]:
##################### Examining Regularization ##########################################################

In [None]:
# 11.0
train,test = bio.split_frame(ratios= [0.7])
train.shape
test.shape

In [None]:
# 11.1 No regularization
dl =H2ODeepLearningEstimator(
                               distribution="bernoulli",
                               activation = "rectifier",   # CHANGED
                               hidden = [100,64,32],
                               epochs = 500,                      # CHANGED
                                                                # Iterations stop very early. You can see this from the
                                                                # progress bar, which suddenly jumps to 100% after a time
                               #hidden_dropout_ratios = [0.5,0.5,0.5] ,  # ADDED
                               #l1= 1e-5,   
                               #l2= 1e-5,
                               score_each_iteration = True,
                               initial_weight_distribution = "Uniform Adaptive"
                               #variable_importances = True
                              )

dl.train(
          x= x,            # Predictor columns
          y= y,            # Target
          training_frame=train,  # training data
          validation_frame = test
         )



reg_no = dl.scoring_history()

In [None]:
#11.2 Dropouts only
dl =H2ODeepLearningEstimator(
                               distribution="bernoulli",
                               activation = "rectifierwithdropout",   # CHANGED
                               hidden = [100,64,32],
                               epochs = 500,                      # CHANGED
                                                                # Iterations stop very early. You can see this from the
                                                                # progress bar, which suddenly jumps to 100% after a time
                               hidden_dropout_ratios = [0.5,0.5,0.5] ,  # ADDED
                               #l1= 1e-5,   
                               #l2= 1e-5,
                               score_each_iteration = True,
                               initial_weight_distribution = "Uniform Adaptive"
                               #variable_importances = True
                              )

dl.train(
          x= x,            # Predictor columns
          y= y,            # Target
          training_frame=train,  # training data
          validation_frame = test
         )



reg_drop = dl.scoring_history()

In [None]:
# 11.3  l1/l2
dl =H2ODeepLearningEstimator(
                               distribution="bernoulli",
                               activation = "rectifier",   # CHANGED
                               hidden = [100,64,32],
                               epochs = 500,                      # CHANGED
                                                                # Iterations stop very early. You can see this from the
                                                                # progress bar, which suddenly jumps to 100% after a time
                               #hidden_dropout_ratios = [0.5,0.5,0.5] ,  # ADDED
                               l1= 1e-5,   
                               l2= 1e-5,
                               score_each_iteration = True,
                               initial_weight_distribution = "Uniform Adaptive"
                               #variable_importances = True
                              )

dl.train(
          x= x,            # Predictor columns
          y= y,            # Target
          training_frame=train,  # training data
          validation_frame = test
         )



reg_l1 = dl.scoring_history()

In [None]:
# 11.4 
# dropouts + l1 + l2
dl =H2ODeepLearningEstimator(
                               distribution="bernoulli",
                               activation = "rectifierwithdropout",   # CHANGED
                               hidden = [100,64,32],
                               epochs = 500,                      # CHANGED
                                                                # Iterations stop very early. You can see this from the
                                                                # progress bar, which suddenly jumps to 100% after a time
                               hidden_dropout_ratios = [0.5,0.5,0.5] ,  # ADDED
                               l1= 1e-5,   
                               l2= 1e-5,
                               score_each_iteration = True,
                               initial_weight_distribution = "Uniform Adaptive"
                               #variable_importances = True
                              )

dl.train(
          x= x,            # Predictor columns
          y= y,            # Target
          training_frame=train,  # training data
          validation_frame = test
         )



reg_all = dl.scoring_history()

In [None]:
# 11.5 dropouts + l1 + l2 + input + stronger l1/l2
dl =H2ODeepLearningEstimator(
                               distribution="bernoulli",
                               activation = "rectifierwithdropout",   # CHANGED
                               hidden = [100,64,32],
                               epochs = 500,                      # CHANGED
                                                                # Iterations stop very early. You can see this from the
                                                                # progress bar, which suddenly jumps to 100% after a time
                               hidden_dropout_ratios = [0.5,0.5,0.5] ,  # ADDED
                               l1= 1e-4,   
                               l2= 1e-4,
                               input_dropout_ratio = 0.2,     # Added
                               mini_batch_size = 10,          # Added
                               stopping_rounds= 20,           # Added 
                               score_each_iteration = True,
                               initial_weight_distribution = "Uniform Adaptive"
                               #variable_importances = True
                              )

dl.train(
          x= x,            # Predictor columns
          y= y,            # Target
          training_frame=train,  # training data
          validation_frame = test
         )



reg_all_in = dl.scoring_history()

In [None]:
# 12.0 Draw all five now
fig = plt.figure(figsize=(20,5))

ax = fig.add_subplot(151)

sc = reg_no

_=ax.plot(sc[['iterations']],sc[['training_classification_error']]    , label = "Training error",    color = "red")
_=ax.plot(sc[['iterations']],sc[['validation_classification_error']]  , label = "Validation error",color = "black")
_=ax.legend()
_= ax.set_title("No regularization")
_= ax.set_ylim([0.10,0.40])
ax.minorticks_on()
ax.grid(which = "major", color = "red")
ax.grid(which = "minor", linestyle = "--")

#######################

sc = reg_drop

ax = fig.add_subplot(152)

_=ax.plot(sc[['iterations']],sc[['training_classification_error']]    , label = "Training error",    color = "red")
_=ax.plot(sc[['iterations']],sc[['validation_classification_error']]  , label = "Validation error",color = "black")
_=ax.legend()
_= ax.set_title("Only dropouts")
_= ax.set_ylim([0.10,0.40])
ax.minorticks_on()
ax.grid(which = "major", color = "red")
ax.grid(which = "minor", linestyle = "--")

#######################


ax = fig.add_subplot(153)
sc = reg_l1
_=ax.plot(sc[['iterations']],sc[['training_classification_error']]    , label = "Training error",    color = "red")
_=ax.plot(sc[['iterations']],sc[['validation_classification_error']]  , label = "Validation error",color = "black")
_=ax.legend()
_= ax.set_title("Only l1/l2")
_= ax.set_ylim([0.10,0.40])
ax.minorticks_on()
ax.grid(which = "major", color = "red")
ax.grid(which = "minor", linestyle = "--")

#######################

ax = fig.add_subplot(154)
sc = reg_all
_=ax.plot(sc[['iterations']],sc[['training_classification_error']]    , label = "Training error",    color = "red")
_=ax.plot(sc[['iterations']],sc[['validation_classification_error']]  , label = "Validation error",color = "black")
_=ax.legend()
_= ax.set_title("Dropouts + l1 + l2")
_= ax.set_ylim([0.10,0.40])
ax.minorticks_on()
ax.grid(which = "major", color = "red")
ax.grid(which = "minor", linestyle = "--")

###########################

ax = fig.add_subplot(155)
sc = reg_all_in
_=ax.plot(sc[['iterations']],sc[['training_classification_error']]    , label = "Training error",    color = "red")
_=ax.plot(sc[['iterations']],sc[['validation_classification_error']]  , label = "Validation error",color = "black")
_=ax.legend()
_= ax.set_title("Dropouts + l1 + l2 + in")
_= ax.set_ylim([0.10,0.40])
ax.minorticks_on()
ax.grid(which = "major", color = "red")
ax.grid(which = "minor", linestyle = "--")
