# Exploratory Analysis with GLMs in H2O
## Jose M Albornoz
### December 2018

This notebook demonstrates how a GLM built with H2O can be used for exploratory analysis.

In [1]:
import h2o
import pandas as pd

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
  Starting server from c:\users\albornoj\appdata\local\programs\python\python37\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\AlbornoJ\AppData\Local\Temp\tmpko8u9zri
  JVM stdout: C:\Users\AlbornoJ\AppData\Local\Temp\tmpko8u9zri\h2o_AlbornoJ_started_from_python.out
  JVM stderr: C:\Users\AlbornoJ\AppData\Local\Temp\tmpko8u9zri\h2o_AlbornoJ_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


| | |
|---|---|
| H2O cluster uptime: | 05 secs |
| H2O cluster timezone: | Europe/London |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.22.0.2 |
| H2O cluster version age: | 20 days |
| H2O cluster name: | H2O_from_python_AlbornoJ_sp300x |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.531 Gb |
| H2O cluster total cores: | 4 |
| H2O cluster allowed cores: | 4 |


# 1.- Read smoking dataset

In [3]:
smoking = h2o.import_file('smoking.csv')

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [4]:
smoking = smoking.drop('idx', axis=1)

In [5]:
smoking['proportion_deaths'] = smoking['dead']*1000./smoking['pop']

In [6]:
smoking.head()

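| age | smoke | pop | dead | proportion_deaths |
|---|---|---|---|---|
| 40-44 | no | 656 | 18 | 27.439 |
| 45-59 | no | 359 | 22 | 61.2813 |
| 50-54 | no | 249 | 19 | 76.3052 |
| 55-59 | no | 632 | 55 | 87.0253 |
| 60-64 | no | 1067 | 117 | 109.653 |
| 65-69 | no | 897 | 170 | 189.521 |
| 70-74 | no | 668 | 179 | 267.964 |
| 75-79 | no | 361 | 120 | 332.41 |
| 80+ | no | 274 | 120 | 437.956 |
| 40-44 | cigarPipeOnly | 145 | 2 | 13.7931 |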
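The `proportion_deaths` column is a death rate per 1000 people. As a quick sanity check, the sketch below repeats the same arithmetic in plain Python for the first three rows of the table above. The helper name `deaths_per_thousand` is hypothetical and only used for illustration; it is not part of H2O.

```python
# (dead, pop) pairs copied from the first three rows shown by smoking.head()
rows = [(18, 656), (22, 359), (19, 249)]


def deaths_per_thousand(dead, pop):
    """Deaths per 1000 people: the same arithmetic as dead*1000./pop."""
    return dead * 1000.0 / pop


rates = [deaths_per_thousand(dead, pop) for dead, pop in rows]

# Print rounded values for comparison with the proportion_deaths column
for (dead, pop), rate in zip(rows, rates):
    print(f"dead={dead:4d} pop={pop:5d} rate={rate:.4f}")
```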




In [7]:
smoking.summary()

| | age | smoke | pop | dead | proportion_deaths |
|---|---|---|---|---|---|
| type | enum | enum | int | int | real |
| mins | | | 98.0 | 2.0 | 13.793103448275861 |
| mean | | | 1558.9444444444443 | 253.61111111111114 | 204.74020216107442 |
| maxs | | | 6052.0 | 1001.0 | 557.5221238938053 |
| sigma | | | 1562.232174887577 | 262.5974951221821 | 161.25943876495234 |
| zeros | | | 0 | 0 | 0 |
| missing | 0 | 0 | 0 | 0 | 0 |
| 0 | 40-44 | no | 656.0 | 18.0 | 27.4390243902439 |
| 1 | 45-59 | no | 359.0 | 22.0 | 61.28133704735376 |
| 2 | 50-54 | no | 249.0 | 19.0 | 76.30522088353413 |


# 2.- Features and target columns

In [8]:
x = ['age', 'smoke']

In [9]:
y = 'proportion_deaths'

# 3.- Model build: age and smoking

In [10]:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

In [11]:
model = H2OGeneralizedLinearEstimator(family='poisson', model_id='smokin_p')

In [12]:
model.train(x=x, y=y, training_frame=smoking)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [13]:
model.model_performance()


ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 2316.4596014435115
RMSE: 48.12961252122763
MAE: 41.99060354583532
RMSLE: 0.5117121251988099
R^2: 0.9083760725920239
Mean Residual Deviance: 16.42413160563875
Null degrees of freedom: 35
Residual degrees of freedom: 24
Null deviance: 4440.504087789642
Residual deviance: 591.268737802995
AIC: 866.00934169979
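Some of the reported metrics are simple functions of each other, which helps when reading the summary. The sketch below uses only the numbers printed above and assumes the number of rows is the null degrees of freedom plus one (36); it checks that RMSE is the square root of MSE and that the mean residual deviance is the residual deviance divided by the number of rows.

```python
import math

# Values copied from the model_performance() output above
mse = 2316.4596014435115
residual_deviance = 591.268737802995
null_degrees_of_freedom = 35

# Assumption: one degree of freedom is used by the intercept of the null model
n_rows = null_degrees_of_freedom + 1

rmse = math.sqrt(mse)
mean_residual_deviance = residual_deviance / n_rows

print(f"RMSE from MSE:          {rmse:.6f}")
print(f"Mean residual deviance: {mean_residual_deviance:.6f}")
```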




## 3.1 - Model coefficients

In [14]:
model.coef()

{'Intercept': 5.1081179755386925,
 'age.40-44': -0.7437710604359609,
 'age.45-59': -0.5184194612646564,
 'age.50-54': -0.38145083908401034,
 'age.55-59': -0.10869163446220398,
 'age.60-64': 0.0,
 'age.65-69': 0.12047755386645775,
 'age.70-74': 0.41754868888224195,
 'age.75-79': 0.7134460662140327,
 'age.80+': 0.9746132421022864,
 'smoke.cigarPipeOnly': -0.054622959488238365,
 'smoke.cigarretteOnly': 0.154081554968311,
 'smoke.cigarrettePlus': 0.0,
 'smoke.no': -0.05683520013173293}
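With `family='poisson'` the default link is the logarithm, so each coefficient acts multiplicatively on the predicted death rate once exponentiated. The sketch below turns the coefficients above into rate multipliers and rebuilds a prediction by hand. It is an illustration under two assumptions: that the model uses the log link, and that the levels with a coefficient of exactly 0.0 were shrunk to zero by H2O's default regularization rather than dropped as reference levels. The helper `predicted_rate` is hypothetical.

```python
import math

# Coefficients copied from the model.coef() output above
coefs = {
    'Intercept': 5.1081179755386925,
    'age.40-44': -0.7437710604359609,
    'age.45-59': -0.5184194612646564,
    'age.50-54': -0.38145083908401034,
    'age.55-59': -0.10869163446220398,
    'age.60-64': 0.0,
    'age.65-69': 0.12047755386645775,
    'age.70-74': 0.41754868888224195,
    'age.75-79': 0.7134460662140327,
    'age.80+': 0.9746132421022864,
    'smoke.cigarPipeOnly': -0.054622959488238365,
    'smoke.cigarretteOnly': 0.154081554968311,
    'smoke.cigarrettePlus': 0.0,
    'smoke.no': -0.05683520013173293,
}

# exp(coefficient) is the multiplier applied to the baseline rate
rate_multipliers = {
    name: math.exp(value) for name, value in coefs.items() if name != 'Intercept'
}

# Baseline rate (deaths per 1000) when every other term contributes zero
baseline = math.exp(coefs['Intercept'])


def predicted_rate(age, smoke):
    """Hand-computed prediction, assuming a log link (hypothetical helper)."""
    linear = coefs['Intercept'] + coefs['age.' + age] + coefs['smoke.' + smoke]
    return math.exp(linear)


# List the multipliers from smallest to largest
for name, multiplier in sorted(rate_multipliers.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {multiplier:.3f}")
```

Read this way, the age terms span a much wider range of multipliers than the smoking terms, which is what the coefficient magnitudes already suggest.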

# 4.- Model build: smoking only

In [15]:
model.train(x=['smoke'], y=y, training_frame=smoking)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [16]:
model.model_performance()


ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 24365.981442552653
RMSE: 156.0960647888109
MAE: 132.7771335966116
RMSLE: 1.006390105569462
R^2: 0.03624180904111585
Mean Residual Deviance: 118.95718508688621
Null degrees of freedom: 35
Residual degrees of freedom: 31
Null deviance: 4440.504087789642
Residual deviance: 4282.4586631279035
AIC: 4543.199267024699
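One way to compare this fit with the age-and-smoking model is the fraction of the null deviance each one removes. The sketch below uses the deviances printed in the two `model_performance()` outputs; the helper `deviance_explained` is hypothetical, and the quantity is offered as a descriptive summary rather than a formal test.

```python
# Both models share the same null deviance (same response, same rows)
null_deviance = 4440.504087789642

# Residual deviances copied from the two model_performance() outputs
residual_deviance = {
    'age + smoke': 591.268737802995,
    'smoke only': 4282.4586631279035,
}


def deviance_explained(residual, null):
    """Fraction of the null deviance removed by the model (hypothetical helper)."""
    return 1.0 - residual / null


explained = {
    name: deviance_explained(value, null_deviance)
    for name, value in residual_deviance.items()
}

for name, fraction in explained.items():
    print(f"{name:12s} {fraction:.4f}")
```

On these numbers, smoking status alone removes only a few percent of the null deviance, while adding age removes most of it.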




## 4.1 - Model coefficients

In [17]:
model.coef()

{'Intercept': 5.314006445846301,
 'smoke.cigarPipeOnly': -0.10903405584308601,
 'smoke.cigarretteOnly': 0.18855640401138096,
 'smoke.cigarrettePlus': 0.0320220371575779,
 'smoke.no': -0.11145324707073101}
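Because the two models have different intercepts, the smoking coefficients are easier to compare as rate ratios against a common level. The sketch below expresses each smoking level relative to `smoke.no` in both models, again assuming the log link; the helper `ratios_vs_non_smokers` is hypothetical. Since the rows have very different population sizes and the fits are unweighted, this is a descriptive comparison only.

```python
import math

# Smoking coefficients copied from the two model.coef() outputs
smoke_with_age = {
    'smoke.cigarPipeOnly': -0.054622959488238365,
    'smoke.cigarretteOnly': 0.154081554968311,
    'smoke.cigarrettePlus': 0.0,
    'smoke.no': -0.05683520013173293,
}
smoke_only = {
    'smoke.cigarPipeOnly': -0.10903405584308601,
    'smoke.cigarretteOnly': 0.18855640401138096,
    'smoke.cigarrettePlus': 0.0320220371575779,
    'smoke.no': -0.11145324707073101,
}


def ratios_vs_non_smokers(coefs):
    """Rate ratio of each level relative to 'smoke.no' (hypothetical helper)."""
    reference = coefs['smoke.no']
    return {level: math.exp(value - reference) for level, value in coefs.items()}


adjusted = ratios_vs_non_smokers(smoke_with_age)
unadjusted = ratios_vs_non_smokers(smoke_only)

# Levels ordered by coefficient in each model
order_with_age = sorted(smoke_with_age, key=smoke_with_age.get)
order_smoke_only = sorted(smoke_only, key=smoke_only.get)

for level in order_smoke_only:
    print(f"{level:22s} with age: {adjusted[level]:.3f}  smoke only: {unadjusted[level]:.3f}")
```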