# Crossvalidation in H2O
## Jose M Albornoz
### December 2018

This notebook demonstrates how crossvalidation is performed in H2O, using a gradient boosting model (GBM) trained on an artificial dataset.
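As background, the following is a minimal sketch of the idea behind k-fold crossvalidation: every row is assigned to one of `nfolds` folds, and each fold in turn serves as the holdout set while the remaining rows are used for training. It is plain Python and not H2O's implementation. The helper names `assign_folds` and `kfold_splits` are hypothetical, and the purely random assignment is an assumption made for illustration.

```python
import random


def assign_folds(n_rows, nfolds, seed):
    """Randomly assign each row index to one of `nfolds` folds (illustrative only)."""
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]


def kfold_splits(n_rows, nfolds, seed):
    """Return one (train_indices, holdout_indices) pair per fold."""
    folds = assign_folds(n_rows, nfolds, seed)
    splits = []
    for k in range(nfolds):
        holdout = [i for i, f in enumerate(folds) if f == k]
        train = [i for i, f in enumerate(folds) if f != k]
        splits.append((train, holdout))
    return splits


# Hypothetical toy size: 20 rows split into 9 folds, mirroring nfolds=9 below
splits = kfold_splits(n_rows=20, nfolds=9, seed=801)
for k, (train_idx, holdout_idx) in enumerate(splits):
    print(k, len(train_idx), len(holdout_idx))
```

Each row appears in exactly one holdout set, so every row is scored once by a model that never saw it during training.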

In [1]:
import h2o
import pandas as pd

RANDOM_SEED = 801

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
  Starting server from c:\users\albornoj\appdata\local\programs\python\python37\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\AlbornoJ\AppData\Local\Temp\tmp12rsv_4v
  JVM stdout: C:\Users\AlbornoJ\AppData\Local\Temp\tmp12rsv_4v\h2o_AlbornoJ_started_from_python.out
  JVM stderr: C:\Users\AlbornoJ\AppData\Local\Temp\tmp12rsv_4v\h2o_AlbornoJ_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime:,05 secs
H2O cluster timezone:,Europe/London
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.0.2
H2O cluster version age:,20 days
H2O cluster name:,H2O_from_python_AlbornoJ_v6adm4
H2O cluster total nodes:,1
H2O cluster free memory:,3.531 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


# 1.- Read artificial dataset

In [3]:
df = pd.read_csv("artificial_data.csv")

In [4]:
df.head()

,id,bloodType,age,healthy_eating,active_lifestyle,salary
0,0,A,33.741012,5.0,7.0,34099.525572
1,1,O,46.110061,4.0,6.0,43770.999909
2,2,B,37.318695,4.0,3.0,37496.334083
3,3,O,19.069138,9.0,3.0,28278.922197
4,4,O,47.405131,5.0,5.0,45005.237724


In [None]:
hf = h2o.H2OFrame(df)

In [None]:
hf.shape

# 2.- Train-test split

In [None]:
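# Hold out roughly 10% of the rows as a test set; split_frame splits at random, so the ratio is approximate
train, test = hf.split_frame(ratios=[0.898], destination_frames=['train', 'test'], seed=RANDOM_SEED)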

In [None]:
train.shape

In [None]:
test.shape

# 3.- Model build

In [None]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [None]:
# Predict salary from every column except the target itself and the row identifier
y = "salary"
ignoreFields = [y, 'id']
x = [i for i in train.names if i not in ignoreFields]

In [None]:
# nfolds=9 enables 9-fold crossvalidation; all other hyperparameters keep their defaults
mGBM1 = H2OGradientBoostingEstimator(model_id='cv_9folds', nfolds=9)

In [None]:
mGBM1.train(x, y, train)

In [None]:
mGBM1

# 4.- Model performance

In [None]:
mGBM1.mae(train=True)

In [None]:
mGBM1.mae(xval=True)
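H2O's documentation describes the crossvalidation metrics as being computed on the combined holdout predictions of the fold models. The sketch below illustrates that idea in plain Python with a deliberately trivial "model" that predicts the mean of its training rows. The function name `holdout_mae`, the modulo fold assignment and the toy target values are all assumptions made for illustration; none of this is H2O code.

```python
def holdout_mae(y, nfolds):
    """MAE over the combined holdout predictions of a mean-only model."""
    n = len(y)
    # Deterministic modulo fold assignment (an assumption, chosen for reproducibility)
    folds = [i % nfolds for i in range(n)]
    preds = [0.0] * n
    for k in range(nfolds):
        # "Train" on every row outside fold k: the model is just the mean target
        train_vals = [y[i] for i in range(n) if folds[i] != k]
        fold_mean = sum(train_vals) / len(train_vals)
        # Predict the rows of fold k, which this model has not seen
        for i in range(n):
            if folds[i] == k:
                preds[i] = fold_mean
    # Score all holdout predictions together against the true targets
    return sum(abs(p - t) for p, t in zip(preds, y)) / n


# Hypothetical toy targets, not taken from artificial_data.csv
y_toy = [10.0, 20.0, 30.0, 40.0]
print(holdout_mae(y_toy, nfolds=2))
```

Because each prediction comes from a model that did not train on that row, this figure estimates out-of-sample error in the same spirit as `mae(xval=True)`.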

In [None]:
# Score the model on the held-out test set, which played no part in training or crossvalidation
perf = mGBM1.model_performance(test)
perf.mae()

Both the crossvalidation MAE and the test-set MAE are higher than the training MAE, which is a sign of overfitting.
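One way to put a number on this is the relative gap between the training MAE and a holdout MAE (crossvalidation or test). The sketch below is a minimal illustration: the helper `relative_gap` is hypothetical and the MAE values are made up, not results from this notebook.

```python
def relative_gap(train_mae, holdout_mae):
    """Fractional increase of the holdout MAE over the training MAE."""
    return (holdout_mae - train_mae) / train_mae


# Hypothetical values for illustration only
example_train_mae = 1000.0
example_xval_mae = 1250.0

gap = relative_gap(example_train_mae, example_xval_mae)
print(f"Holdout MAE is {gap:.1%} higher than training MAE")
```

With the fitted model, the same helper could be called as `relative_gap(mGBM1.mae(train=True), mGBM1.mae(xval=True))`. A gap near zero suggests the model generalises about as well as it fits, while a large positive gap points to overfitting.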

# 5.- An overfitting model

In [None]:
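# ntrees=100 and max_depth=10 double the GBM defaults (50 trees, depth 5), giving a more complex model
mGBM2 = H2OGradientBoostingEstimator(model_id='overfit_cv_9folds', nfolds=9, ntrees=100, max_depth=10)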

In [None]:
mGBM2.train(x, y, train)

In [None]:
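# Compare MAE of the first model against the more complex one; %.2f avoids truncating the values to integers
print('Train: %.2f ----> %.2f' % (mGBM1.mae(train=True), mGBM2.mae(train=True)))
print('Crossvalidation: %.2f ----> %.2f' % (mGBM1.mae(xval=True), mGBM2.mae(xval=True)))
print('Test: %.2f ----> %.2f' % (perf.mae(), mGBM2.model_performance(test).mae()))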

As before, increasing the complexity of the model (more and deeper trees) has only made the overfitting worse.