# Simple H2O.ai tutorial using the airlines dataset

In [1]:
import h2o
h2o.init(ip="localhost", port=54323)

Checking whether there is an H2O instance running at http://localhost:54323. connected.


H2O cluster uptime:,7 days 18 hours 22 mins
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.18.0.8
H2O cluster version age:,27 days
H2O cluster name:,H2O_from_python_hbi16859_w1o7wl
H2O cluster total nodes:,1
H2O cluster free memory:,3.245 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


In [2]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [3]:
# Import the airlines dataset
airlines=h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
airlines.shape

(43978, 31)

In [8]:
airlines[0:5,:]

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,IsArrDelayed,IsDepDelayed
1987,10,14,3,741,730,912,849,PS,1451,,91,79,,23,11,SAN,SFO,447,,,0,,0,,,,,,YES,YES
1987,10,15,4,729,730,903,849,PS,1451,,94,79,,14,-1,SAN,SFO,447,,,0,,0,,,,,,YES,NO
1987,10,17,6,741,730,918,849,PS,1451,,97,79,,29,11,SAN,SFO,447,,,0,,0,,,,,,YES,YES
1987,10,18,7,729,730,847,849,PS,1451,,78,79,,-2,-1,SAN,SFO,447,,,0,,0,,,,,,NO,NO
1987,10,19,1,749,730,922,849,PS,1451,,93,79,,33,19,SAN,SFO,447,,,0,,0,,,,,,YES,YES




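Convert the integer-coded columns to factors (categoricals) so that the model treats them as categories rather than as numeric values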

In [9]:
airlines['Year']=airlines['Year'].asfactor()
airlines['Month']=airlines['Month'].asfactor()
airlines['DayOfWeek']=airlines['DayOfWeek'].asfactor()
airlines['Cancelled']=airlines['Cancelled'].asfactor()
airlines['FlightNum']=airlines['FlightNum'].asfactor()
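
To illustrate what converting to a factor means, here is a minimal pure-Python sketch (not H2O's implementation): a factor keeps a list of distinct levels and stores each value as an index into that list, so `10` in `Month` becomes a label rather than a quantity. The helper name `to_factor` and the sorted level order are assumptions made for this illustration.

```python
def to_factor(values):
    """Encode values as (levels, codes), where codes[i] indexes into levels."""
    # Distinct values become the levels. Sorting by string form is an
    # assumption of this sketch, chosen only to make the result stable.
    levels = sorted({str(v) for v in values})
    index = {level: i for i, level in enumerate(levels)}
    codes = [index[str(v)] for v in values]
    return levels, codes


# DayOfWeek values from the five rows displayed earlier.
levels, codes = to_factor([3, 4, 6, 7, 1])
print(levels)  # ['1', '3', '4', '6', '7']
print(codes)   # [1, 2, 3, 4, 0]
```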

Set the predictor names and the response column name

In [10]:
predictors=["Origin","Dest","Year","UniqueCarrier","DayOfWeek","Month","Distance","FlightNum"]
response="IsDepDelayed"
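
Column names are case-sensitive (the frame has both `DayofMonth` and `DayOfWeek`), so a quick check that every predictor and the response exist can save a failed training run. The sketch below uses the header row shown earlier; `missing_columns` is a hypothetical helper, and with a live frame `airlines.columns` would supply the list instead.

```python
# Header row copied from the frame preview earlier in the notebook.
header = (
    "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,"
    "UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,"
    "ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,"
    "CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,"
    "SecurityDelay,LateAircraftDelay,IsArrDelayed,IsDepDelayed"
).split(",")


def missing_columns(names, columns):
    """Return the names that do not appear in columns (exact, case-sensitive match)."""
    available = set(columns)
    return [name for name in names if name not in available]


predictors = ["Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek",
              "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"

print(missing_columns(predictors + [response], header))  # []
print(missing_columns(["DayOfMonth"], header))           # ['DayOfMonth']
```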

Split the data into training (80%) and validation (20%) sets

In [11]:
train,valid=airlines.split_frame(ratios=[.8],seed=1234)

In [12]:
train.shape

(35251, 31)
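
The training set holds 35,251 of the 43,978 rows, about 80.2% rather than exactly 80%. H2O's documentation describes `split_frame` as a probabilistic split, not an exact cut at a row count. A rough pure-Python sketch of that idea follows; it is illustrative only, `random_split` is a hypothetical helper, and Python's `random` module will not reproduce H2O's row assignment for the same seed.

```python
import random


def random_split(n_rows, ratio, seed):
    """Assign each row index to train with probability `ratio`, else to valid."""
    rng = random.Random(seed)
    train, valid = [], []
    for row in range(n_rows):
        # Each row is decided independently, so the final sizes only
        # approximate the requested ratio.
        (train if rng.random() < ratio else valid).append(row)
    return train, valid


train_rows, valid_rows = random_split(43978, 0.8, seed=1234)
print(len(train_rows), len(valid_rows))
print(round(len(train_rows) / 43978, 3))
```

Because every row is assigned independently, the two parts always add up to the full frame, but the training share varies slightly around 0.8 from seed to seed.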

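Specify the `nbins_cats` values to try. `nbins_cats` is the maximum number of bins in the histogram that H2O builds for each categorical column before choosing the best split point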

In [14]:
bin_num=[8,16,32,64,128,256,512,1024,2048,4096]
label=[str(i) for i in bin_num]
label

['8', '16', '32', '64', '128', '256', '512', '1024', '2048', '4096']
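
H2O's documentation notes that when a categorical column has more levels than `nbins_cats`, adjacent levels (in alphabetical order) share a bin. The sketch below is a simplified illustration of that grouping, not H2O's actual histogram code; `bin_levels` is a hypothetical helper, the even grouping rule is an assumption, and the level names are placeholders.

```python
def bin_levels(levels, nbins_cats):
    """Group alphabetically ordered levels into at most nbins_cats bins."""
    ordered = sorted(levels)
    n_bins = min(nbins_cats, len(ordered))
    bins = [[] for _ in range(n_bins)]
    for i, level in enumerate(ordered):
        # Spread the ordered levels evenly, so neighbours share a bin.
        bins[i * n_bins // len(ordered)].append(level)
    return bins


levels = ["f", "e", "d", "c", "b", "a"]  # placeholder level names
print(bin_levels(levels, 2))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
print(bin_levels(levels, 8))  # [['a'], ['b'], ['c'], ['d'], ['e'], ['f']]
```

With few bins, a split can only separate whole groups of levels; once `nbins_cats` reaches the number of levels, every level can be separated individually, which is why large values give the trees more freedom to fit the training data.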

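Train the models, one per `nbins_cats` value. The first cell only previews the (index, value) pairs that `enumerate` yields for `bin_num`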

In [16]:
for key,num in enumerate(bin_num):
    print("{},{}".format(key,num))

0,8
1,16
2,32
3,64
4,128
5,256
6,512
7,1024
8,2048
9,4096


In [17]:
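models={}
for key,num in enumerate(bin_num):
    airlines_gbm=H2OGradientBoostingEstimator(nbins_cats=num,seed=1234)
    airlines_gbm.train(x=predictors,y=response,training_frame=train,validation_frame=valid)
    models[num]=airlines_gbm #keep every model; airlines_gbm alone only holds the last one trained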

gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


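Print the AUC scores for the training data and validation data. After the loop, `airlines_gbm` and `key` refer to the last iteration, so this cell reports only the model trained with `nbins_cats=4096`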

In [18]:
print(label[key],"training score",airlines_gbm.auc(train=True))
print(label[key],"validation score",airlines_gbm.auc(valid=True))

4096 training score 0.8592800837223833
4096 validation score 0.7309097636276833
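
To compare all ten settings rather than just the last one, the scores can be collected per model. The sketch below assumes the trained models are available in a dict keyed by their `nbins_cats` value, here called `models`; `auc_summary` is a hypothetical helper. So that it runs without a cluster, it uses a small stand-in class (`RecordedScores`, also hypothetical) holding the only two scores reported above.

```python
def auc_summary(models):
    """Return (nbins_cats, train AUC, validation AUC, gap) rows, sorted by nbins_cats."""
    rows = []
    for nbins, model in sorted(models.items()):
        train_auc = model.auc(train=True)
        valid_auc = model.auc(valid=True)
        rows.append((nbins, train_auc, valid_auc, train_auc - valid_auc))
    return rows


class RecordedScores:
    """Stand-in for a trained model, exposing only auc(train=...) / auc(valid=...)."""

    def __init__(self, train_auc, valid_auc):
        self._train, self._valid = train_auc, valid_auc

    def auc(self, train=False, valid=False):
        return self._valid if valid else self._train


# Only the nbins_cats=4096 scores are known from the output above.
models = {4096: RecordedScores(0.8592800837223833, 0.7309097636276833)}
for nbins, train_auc, valid_auc, gap in auc_summary(models):
    print("{},{:.4f},{:.4f},{:.4f}".format(nbins, train_auc, valid_auc, gap))
# 4096,0.8593,0.7309,0.1284
```

With a live cluster, passing the dict of trained estimators instead of the stand-in would give one row per `nbins_cats` value. The gap of about 0.13 for `nbins_cats=4096` suggests that this model fits the training data noticeably better than it generalizes to the validation set.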

