# Overfitting a Gradient Boosting Model in H2O
## Jose M Albornoz
### December 2018

This notebook demonstrates overfitting in a basic Gradient Boosting model in H2O.

In [1]:
import h2o
import pandas as pd

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
  Starting server from c:\users\albornoj\appdata\local\programs\python\python37\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\AlbornoJ\AppData\Local\Temp\tmpbj85m9pn
  JVM stdout: C:\Users\AlbornoJ\AppData\Local\Temp\tmpbj85m9pn\h2o_AlbornoJ_started_from_python.out
  JVM stderr: C:\Users\AlbornoJ\AppData\Local\Temp\tmpbj85m9pn\h2o_AlbornoJ_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime:,05 secs
H2O cluster timezone:,Europe/London
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.0.2
H2O cluster version age:,20 days
H2O cluster name:,H2O_from_python_AlbornoJ_zecrls
H2O cluster total nodes:,1
H2O cluster free memory:,3.531 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


# 1.- Read artificial dataset

In [3]:
df = pd.read_csv("artificial_data.csv")

In [4]:
df.head()

,id,bloodType,age,healthy_eating,active_lifestyle,salary
0,0,A,33.741012,5.0,7.0,34099.525572
1,1,O,46.110061,4.0,6.0,43770.999909
2,2,B,37.318695,4.0,3.0,37496.334083
3,3,O,19.069138,9.0,3.0,28278.922197
4,4,O,47.405131,5.0,5.0,45005.237724


In [5]:
hf = h2o.H2OFrame(df)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
hf.shape

(1000, 6)

# 2.- Train-validation-test split

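We will use an approximate 80-10-10 train-validation-test split. `split_frame` takes the first two ratios and assigns the remaining rows to the third frame.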

In [7]:
train, validation, test = hf.split_frame([0.8, 0.1])

In [8]:
train.shape

(789, 6)

In [9]:
validation.shape

(101, 6)

In [10]:
test.shape

(110, 6)
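
The three frames contain 789, 101 and 110 rows rather than exactly 800, 100 and 100. This is expected: H2O's `split_frame` assigns each row to a split at random according to the requested ratios instead of cutting the frame at exact positions. The following pure-Python sketch imitates that behaviour with the standard `random` module. It is only an illustration of the idea, not H2O's actual implementation, and the helper name `probabilistic_split` is hypothetical.

```python
import random


def probabilistic_split(n_rows, ratios, seed=None):
    """Assign each row index to a split by drawing a uniform random number.

    `ratios` holds the fractions for all splits but the last; the last split
    receives the remaining rows. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)

    # Cumulative thresholds, e.g. [0.8, 0.1] -> [0.8, 0.9]
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)

    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        for k, t in enumerate(thresholds):
            if u < t:
                splits[k].append(row)
                break
        else:
            # No threshold matched: the row goes to the last (remainder) split
            splits[-1].append(row)
    return splits


train_rows, valid_rows, test_rows = probabilistic_split(1000, [0.8, 0.1], seed=42)
sizes = (len(train_rows), len(valid_rows), len(test_rows))
print(sizes)  # close to, but usually not exactly, (800, 100, 100)
```

In H2O itself, passing a `seed` argument to `split_frame` makes the split reproducible across runs.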

# 3.- Model build

In [11]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [12]:
y = "salary"
ignoreFields = [y, 'id']
x = [i for i in train.names if i not in ignoreFields]
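
As a quick check of what the list comprehension above selects, the same filtering can be run on a plain Python list. The column names are copied from the `df.head()` output; `train.names` is replaced by a hard-coded list, an assumption made so that the snippet runs without an H2O cluster.

```python
# Column names as shown by df.head(); stands in for train.names
names = ["id", "bloodType", "age", "healthy_eating", "active_lifestyle", "salary"]

y = "salary"
ignoreFields = [y, "id"]  # the target and the row identifier are not predictors
x = [i for i in names if i not in ignoreFields]

print(x)  # ['bloodType', 'age', 'healthy_eating', 'active_lifestyle']
```

These are the same four variables that appear in the model's variable importance table.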

In [13]:
mGBM1 = H2OGradientBoostingEstimator(model_id='defaults')

In [14]:
mGBM1.train(x, y, train, validation_frame=validation)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [15]:
mGBM1

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  defaults


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 1272749.3281765298
RMSE: 1128.1619246263056
MAE: 955.0245960076046
RMSLE: 0.03129479823926259
Mean Residual Deviance: 1272749.3281765298

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 2120155.8751807795
RMSE: 1456.0755046290626
MAE: 1219.2001869142664
RMSLE: 0.04011191104233286
Mean Residual Deviance: 2120155.8751807795
Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance,validation_rmse,validation_mae,validation_deviance
,2018-12-12 10:03:30,0.085 sec,0.0,10437.5240351,8955.0990287,108941907.9836267,9898.3298675,8508.3946533,97976934.1656044
,2018-12-12 10:03:30,0.328 sec,1.0,9418.8607136,8073.4340690,88714937.1418343,8946.9209296,7681.7212389,80047394.1210322
,2018-12-12 10:03:30,0.374 sec,2.0,8503.6361645,7278.8184856,72311828.0182599,8085.9175740,6938.9273712,65382063.0135877
,2018-12-12 10:03:30,0.398 sec,3.0,7681.6624803,6564.1255248,59007938.4608682,7312.1255647,6267.2976739,53467180.2741729
,2018-12-12 10:03:30,0.436 sec,4.0,6944.0984743,5922.4088344,48220503.6211245,6615.3033688,5663.9587554,43762238.6616235
---,---,---,---,---,---,---,---,---,---
,2018-12-12 10:03:31,1.136 sec,46.0,1138.2649042,964.2208739,1295646.9920268,1463.2778700,1225.5912051,2141182.1249574
,2018-12-12 10:03:31,1.148 sec,47.0,1135.5981430,961.8593973,1289583.1422795,1459.6923417,1222.3018364,2130701.7325190
,2018-12-12 10:03:31,1.173 sec,48.0,1132.6861682,959.3925435,1282977.9556592,1459.5809500,1222.1310437,2130376.5495094



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
age,440635424768.0000000,1.0,0.9855177
healthy_eating,4124273920.0000000,0.0093598,0.0092243
active_lifestyle,1978833920.0000000,0.0044909,0.0044258
bloodType,372084640.0000000,0.0008444,0.0008322




# 4.- Model performance

In [16]:
mGBM1.mae(train=True)

955.0245960076046

In [17]:
mGBM1.mae(valid=True)

1219.2001869142664

In [18]:
perf = mGBM1.model_performance(test)

In [19]:
perf.mae()

1374.9214349220588

It can be seen that the mean absolute error (MAE) is noticeably higher on the validation and test sets (about 1219 and 1375) than on the training set (about 955), a sign of overfitting.
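
For reference, MAE is the average of the absolute differences between actual and predicted values. A minimal sketch in plain Python follows, using made-up salary figures (not taken from the dataset) and a hypothetical helper name `mean_absolute_error`:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values for illustration only
actual = [30000.0, 40000.0, 50000.0]
predicted = [31000.0, 38500.0, 50500.0]

mae = mean_absolute_error(actual, predicted)
print(mae)  # (1000 + 1500 + 500) / 3 = 1000.0
```

A training MAE of about 955 therefore means that, on average, the predicted salary is off by roughly 955 units on the training rows.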

# 5.- An overfitting model

In [20]:
mGBM2 = H2OGradientBoostingEstimator(model_id='overfit', ntrees=100, max_depth=10)

In [21]:
mGBM2.train(x, y, train, validation_frame=validation)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [22]:
print('Train: %d ----> %d' % (mGBM1.mae(train=True), mGBM2.mae(train=True)))
print('Validation: %d ----> %d' % (mGBM1.mae(valid=True), mGBM2.mae(valid=True)))
print('Test: %d ----> %d' % (perf.mae(), mGBM2.model_performance(test).mae()))

Train: 955 ----> 658
Validation: 1219 ----> 1280
Test: 1374 ----> 1321


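Increasing the complexity of the model has only made overfitting worse: the training MAE drops from 955 to 658, while the validation MAE rises from 1219 to 1280 and the test MAE changes only slightly (1374 to 1321).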
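
One way to quantify this is to compare the gap between training and validation MAE for the two models. The sketch below uses the rounded values printed above; the dictionary layout and the helper name `generalisation_gap` are assumptions made for illustration.

```python
# Rounded MAE values copied from the printed comparison above
mae_scores = {
    "defaults": {"train": 955, "validation": 1219, "test": 1374},
    "overfit": {"train": 658, "validation": 1280, "test": 1321},
}


def generalisation_gap(scores, holdout="validation"):
    """Held-out MAE minus training MAE; a larger value suggests more overfitting."""
    return scores[holdout] - scores["train"]


for name, scores in mae_scores.items():
    gap = generalisation_gap(scores)
    ratio = scores["validation"] / scores["train"]
    print(f"{name}: gap = {gap}, validation/train = {ratio:.2f}")
```

The validation gap grows from 264 for the default model to 622 for the deeper one, and the validation MAE goes from roughly 1.3 times to nearly twice the training MAE.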