# Predicting Airline Delays in Python

This notebook demonstrates how to predict potential flight delays using a publicly available airlines dataset. The example uses a small sample of more than two decades of flight data so that the download and import take no more than a minute or two.

## The Data

The data comes originally from [RITA](http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp), where it is described in detail. To use more of the 26 years of flight information and predict delays and cancellations more accurately, download one of the following files and change the path to the data in the notebook:

  * [2 Thousand Rows - 4.3MB](https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv)
  * [5.8 Million Rows - 580MB](https://s3.amazonaws.com/h2o-airlines-unpacked/airlines_all.05p.csv)
  * [152 Million Rows (Years: 1987-2013) - 14.5GB](https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.1987.2013.csv)

## Business Benefits

Predicting potential delays and logistical issues has clear benefits for a business: it helps users make contingency plans and corrections to avoid undesirable outcomes. Recommendation engines can forewarn flyers of possible delays and rank flight options accordingly. Other businesses might pay more for a flight to ensure certain shipments arrive on time, and airline carriers can use the information to improve their flight plans. The goal is to have the machine take in the factors that can affect a flight and return the probability of the flight being delayed.

### Load the H2O module and start a local H2O cluster

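A connection to an H2O cluster is established through the `h2o.init` function from the `h2o` module. Called without arguments, it connects to a cluster on the local machine, starting one if none is running. To connect to a pre-existing H2O cluster elsewhere, pass its location through the `ip` and `port` arguments.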


In [26]:
import h2o
import os
import tabulate
import operator 
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [27]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


Property,Value
H2O_cluster_uptime:,1 hour 45 mins
H2O_cluster_timezone:,America/New_York
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.42.0.2
H2O_cluster_version_age:,5 days
H2O_cluster_name:,H2O_from_python_Ravichandran_y31k4h
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,13.53 Gb
H2O_cluster_total_cores:,16
H2O_cluster_allowed_cores:,16


### Import Data into H2O

We will use the `h2o.import_file` function to do a parallel read of the data into the H2O distributed key-value store. In this run the columns keep their auto-detected types, so Year, Month, DayOfWeek, and FlightNum are parsed as integer columns, as the output below shows. To treat them as categorical (enum) columns instead, pass `col_types` to `h2o.import_file` or convert them after import with `asfactor()`. Once the data is in H2O, get a quick overview of the airlines dataset with `describe`.


In [28]:
airlines_hex = h2o.import_file(path = os.path.realpath("../data/allyears2k.csv"),
                               destination_frame = "airlines.hex")
airlines_hex.describe()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,IsArrDelayed,IsDepDelayed
type,int,int,int,int,int,int,int,int,enum,int,enum,int,int,int,int,int,enum,enum,int,int,int,int,enum,int,int,int,int,int,int,enum,enum
mins,1987.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,,1.0,,16.0,17.0,14.0,-63.0,-16.0,,,11.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,,
mean,1997.5,1.4090909090909092,14.601073263904679,3.820614852880986,1345.8466613820756,1313.2228614307153,1504.6341303788884,1485.289167310928,,818.8429896766565,,124.81452913540424,125.0215626066189,114.31611109078271,9.317111936984317,10.007390655600112,,,730.1821905650503,5.381368059530624,14.168634184732058,0.024694165264450407,,0.0024785119832643593,4.047800291055638,0.2893764692712415,4.855031904175528,0.017015560282100075,7.620060450016793,,
maxs,2008.0,10.0,31.0,7.0,2400.0,2359.0,2400.0,2359.0,,3949.0,,475.0,437.0,402.0,475.0,473.0,,,3365.0,128.0,254.0,1.0,,1.0,369.0,201.0,323.0,14.0,373.0,,
sigma,6.344360901710614,1.8747113713439636,9.17579042586145,1.9050131191328963,465.3408991242338,476.25113999259963,484.347487903516,492.750434122701,,777.4043691636348,,73.97444166059017,73.40159463000927,69.63632951506104,29.840221962414844,26.438809042916446,,,578.4380082304243,4.201979939864827,9.905085747204327,0.15519314135784237,,0.049723487218862286,16.2057299044842,4.4167798987341245,18.61977622147568,0.40394018210151167,23.487565874106217,,
zeros,0,0,0,0,0,569,0,569,,0,,0,0,0,1514,6393,,,0,623,557,42892,,43869,7344,8840,7388,8914,7140,,
missing,0,0,0,0,1086,0,1195,0,0,0,32,1195,13,16649,1195,1086,0,0,35,16026,16024,0,9774,0,35045,35045,35045,35045,35045,0,0
0,1987.0,10.0,14.0,3.0,741.0,730.0,912.0,849.0,PS,1451.0,,91.0,79.0,,23.0,11.0,SAN,SFO,447.0,,,0.0,,0.0,,,,,,YES,YES
1,1987.0,10.0,15.0,4.0,729.0,730.0,903.0,849.0,PS,1451.0,,94.0,79.0,,14.0,-1.0,SAN,SFO,447.0,,,0.0,,0.0,,,,,,YES,NO
2,1987.0,10.0,17.0,6.0,741.0,730.0,918.0,849.0,PS,1451.0,,97.0,79.0,,29.0,11.0,SAN,SFO,447.0,,,0.0,,0.0,,,,,,YES,YES
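
The summary rows in the `describe` output (`mins`, `mean`, `maxs`, `sigma`, `zeros`, `missing`) can be reproduced for a single column in plain Python. The following is a minimal sketch, not H2O's implementation: the helper name `summarize` is hypothetical, `None` stands in for a missing value, and it assumes `sigma` is the sample standard deviation.

```python
import math

def summarize(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / n
    # Assumption: sigma is the sample standard deviation (n - 1 denominator).
    sigma = math.sqrt(sum((v - mean) ** 2 for v in present) / (n - 1)) if n > 1 else 0.0
    return {
        "mins": min(present),
        "mean": mean,
        "maxs": max(present),
        "sigma": sigma,
        "zeros": sum(1 for v in present if v == 0),
        "missing": len(values) - n,
    }

# DepDelay of the three rows displayed above, plus one made-up missing entry.
dep_delay = [11.0, -1.0, 11.0, None]
summary = summarize(dep_delay)
print(summary)
```

Missing values are excluded from every statistic and counted separately, which is why a column such as `CarrierDelay` can report a mean even though most of its entries are missing.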


### Building a GLM Model

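Run a logistic regression with `H2OGeneralizedLinearEstimator`, setting `family` to "binomial". Setting `alpha` to 0.5 adds elastic net regularization, an even mix of L1 and L2 penalties. `lambda` is not set in the call below, so H2O picks the regularization strength itself; the model summary in the output reports the value it used (1.507E-4 in this run).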
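
To make the binomial family with a logit link concrete, here is a small sketch of how such a model turns predictor values into a delay probability. The function name, the intercept, and the coefficient values are hypothetical and chosen only for illustration; they are not the fitted model's values.

```python
import math

def predict_delay_probability(intercept, coefficients, features):
    """Inverse logit link: map the linear predictor to a probability in (0, 1)."""
    eta = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for two one-hot encoded categorical levels.
coefficients = {"Origin.MDW": 1.2, "UniqueCarrier.WN": 0.4}
# A flight departing from MDW on a carrier other than WN.
features = {"Origin.MDW": 1.0, "UniqueCarrier.WN": 0.0}

p = predict_delay_probability(-0.5, coefficients, features)
print(f"Predicted probability of delay: {p:.4f}")
```

Categorical predictors such as `Origin` enter the model as one indicator per level, which is why the coefficient tables in the output list names like `Origin.MDW`.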

In [29]:
# Set predictor and response variables
myY = "IsDepDelayed"
myX = ["Dest", "Origin", "DayofMonth", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance"]

# GLM - Predict Delays
glm_model = H2OGeneralizedLinearEstimator(
    family = "binomial", standardize = True, solver = "IRLSM",
    link = "logit", alpha = 0.5, model_id = "glm_model_from_python")
glm_model.train(x               = myX,
                y               = myY,
                training_frame  = airlines_hex)


glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,family,link,regularization,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,binomial,logit,"Elastic Net (alpha = 0.5, lambda = 1.507E-4 )",281,175,7,airlines.hex
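
The regularization entry in the summary above, "Elastic Net (alpha = 0.5, lambda = 1.507E-4 )", refers to a penalty added to the model's objective. A minimal sketch of that penalty follows, using the form given in the H2O GLM documentation, lambda * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2). The coefficient vector is made up for illustration.

```python
def elastic_net_penalty(coefficients, alpha, lam):
    """Elastic net penalty: a weighted mix of L1 (lasso) and L2 (ridge) terms."""
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

# Hypothetical coefficients; alpha and lambda are taken from the model summary.
penalty = elastic_net_penalty([1.0, -2.0, 0.0], alpha=0.5, lam=1.507e-4)
print(penalty)
```

With `alpha = 1` the penalty reduces to the lasso and with `alpha = 0` to ridge regression. The L1 part can drive coefficients to exactly zero, which is consistent with the summary reporting 175 active predictors out of 281.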

actual/predicted,NO,YES,Error,Rate
NO,6207.0,14680.0,0.7028,(14680.0/20887.0)
YES,2429.0,20662.0,0.1052,(2429.0/23091.0)
Total,8636.0,35342.0,0.389,(17109.0/43978.0)
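
The `Error` column of the confusion matrix can be recomputed from the counts. The sketch below assumes rows are actual classes and columns are predicted classes, which matches the row totals in the `Rate` column; the helper name `error_rates` is hypothetical.

```python
# Counts copied from the GLM training confusion matrix above.
confusion = {
    "NO": {"NO": 6207, "YES": 14680},
    "YES": {"NO": 2429, "YES": 20662},
}

def error_rates(matrix):
    """Per-class and overall misclassification rates from a confusion matrix."""
    rates = {}
    total_errors = 0
    total_rows = 0
    for actual, predicted_counts in matrix.items():
        row_total = sum(predicted_counts.values())
        errors = row_total - predicted_counts[actual]
        rates[actual] = errors / row_total
        total_errors += errors
        total_rows += row_total
    rates["Total"] = total_errors / total_rows
    return rates

rates = error_rates(confusion)
for label, rate in rates.items():
    print(f"{label}: {rate:.4f}")
```

The model misclassifies about 70% of on-time flights as delayed while missing only about 10% of delayed flights, so at this threshold it leans heavily toward predicting a delay.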

metric,threshold,value,idx
max f1,0.3833979,0.7072031,283.0
max f2,0.114321,0.8469524,393.0
max f0point5,0.5472912,0.6667238,183.0
max accuracy,0.5010473,0.6494611,211.0
max precision,0.9597631,1.0,0.0
max recall,0.0862649,1.0,398.0
max specificity,0.9597631,1.0,0.0
max absolute_mcc,0.5472912,0.2966244,183.0
max min_per_class_accuracy,0.5279034,0.6470486,194.0
max mean_per_class_accuracy,0.5472912,0.6481534,183.0
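
The `max f1` entry can be checked against the confusion matrix counts. This sketch assumes the confusion matrix printed above was computed at the max-F1 threshold, which the matching value suggests; `f_beta` is a hypothetical helper implementing the standard F-beta definition.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives, and false negatives."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Treating "YES" (delayed) as the positive class:
# tp = actual YES predicted YES, fp = actual NO predicted YES, fn = actual YES predicted NO.
f1 = f_beta(tp=20662, fp=14680, fn=2429)
print(f"F1 = {f1:.7f}")
```

The `max f2` and `max f0point5` rows use the same formula with `beta = 2` and `beta = 0.5`, but they are reached at different thresholds, so they cannot be recomputed from this one confusion matrix.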

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.010005,0.8655052,1.8309666,1.8309666,0.9613636,0.9104087,0.9613636,0.9104087,0.0183188,0.0183188,83.096661,83.096661,0.0175049
2,0.0200555,0.8310881,1.6287794,1.7296438,0.8552036,0.8451658,0.9081633,0.8777133,0.01637,0.0346888,62.8779386,72.9643761,0.0308108
3,0.030015,0.8076817,1.5740814,1.6780253,0.826484,0.8189507,0.8810606,0.8582148,0.0156771,0.0503659,57.4081424,67.8025349,0.0428493
4,0.0400655,0.7947413,1.5296738,1.6408112,0.8031674,0.8006733,0.861521,0.8437804,0.015374,0.0657399,52.9673762,64.081116,0.054058
5,0.0500023,0.7790183,1.5558922,1.6239355,0.8169336,0.7870019,0.8526603,0.832497,0.0154606,0.0812005,55.5892233,62.3935502,0.0656884
6,0.1000045,0.7333656,1.487958,1.5559467,0.7812642,0.7562962,0.8169623,0.7943966,0.0744013,0.1556017,48.7957969,55.5946736,0.117061
7,0.1500068,0.6981523,1.435992,1.5159618,0.7539791,0.7148077,0.7959679,0.7678669,0.0718029,0.2274046,43.5992033,51.5961835,0.1629626
8,0.2000091,0.6705609,1.3156043,1.4658724,0.6907685,0.6840601,0.769668,0.7469152,0.0657832,0.2931878,31.5604281,46.5872447,0.1961897
9,0.3000136,0.6204829,1.2826925,1.4048125,0.6734879,0.6445239,0.737608,0.7127848,0.1282751,0.4214629,28.2692522,40.4812472,0.2557139
10,0.3999955,0.5742274,1.1591039,1.3433958,0.6085968,0.5977075,0.7053607,0.6840204,0.1158893,0.5373522,15.9103927,34.3395811,0.2892074
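
The `lift` and `gain` columns of the gains/lift table follow from the response rates. As a rough sketch, lift is a group's response rate divided by the overall response rate, and gain is the lift expressed as a percentage improvement over that baseline. The overall rate below is taken from the confusion matrix totals (23091 delayed flights out of 43978).

```python
def lift(group_response_rate, overall_response_rate):
    """How much more often the positive class occurs in a group than overall."""
    return group_response_rate / overall_response_rate

def gain(lift_value):
    """Lift expressed as a percentage improvement over the baseline."""
    return 100.0 * (lift_value - 1.0)

overall_rate = 23091 / 43978          # share of delayed flights in the training frame
top_group_lift = lift(0.9613636, overall_rate)   # response_rate of group 1
top_group_gain = gain(top_group_lift)
print(f"lift = {top_group_lift:.4f}, gain = {top_group_gain:.2f}%")
```

In other words, among the 1% of flights the model scores highest, delays are roughly 1.83 times as frequent as in the data overall.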

Unnamed: 0,timestamp,duration,iterations,negative_log_likelihood,objective,training_rmse,training_logloss,training_r2,training_auc,training_pr_auc,training_lift,training_classification_error
,2023-07-30 12:04:20,0.000 sec,0,30427.9757692,0.6918908,,,,,,,
,2023-07-30 12:04:20,0.031 sec,1,27733.1096342,0.635579,,,,,,,
,2023-07-30 12:04:20,0.040 sec,2,27675.8625078,0.6348479,,,,,,,
,2023-07-30 12:04:20,0.046 sec,3,27673.2467699,0.6348358,,,,,,,
,2023-07-30 12:04:20,0.061 sec,4,27638.5288288,0.6343359,,,,,,,
,2023-07-30 12:04:20,0.071 sec,5,27639.3502918,0.6343385,,,,,,,
,2023-07-30 12:04:20,0.084 sec,6,27637.1643359,0.6343207,,,,,,,
,2023-07-30 12:04:20,0.100 sec,7,27636.6605038,0.6343169,0.4682555,0.6284201,0.1207389,0.7019232,0.7155652,1.8309666,0.3890354

variable,relative_importance,scaled_importance,percentage
Origin.MDW,1.7330899,1.0,0.0282491
Origin.AUS,1.3122480,0.7571725,0.0213895
Origin.HNL,1.0303770,0.5945318,0.0167950
UniqueCarrier.WN,0.9742207,0.5621293,0.0158797
Origin.ERI,0.9612517,0.5546462,0.0156683
Origin.ATL,0.8978418,0.5180584,0.0146347
Origin.TLH,0.8958194,0.5168915,0.0146017
Origin.MYR,0.8713567,0.5027764,0.0142030
Origin.BUR,0.8667153,0.5000983,0.0141273
Origin.LIH,0.8663010,0.4998592,0.0141206


In [30]:
print("AUC of the training set : " + str(glm_model.auc()))

# Use the magnitude of the normalized (standardized) GLM coefficients
# as a measure of variable importance
glm_varimp = {name: abs(coef) for name, coef in glm_model.coef_norm().items()}

# Sort in descending order by magnitude
glm_sorted = sorted(glm_varimp.items(), key = operator.itemgetter(1), reverse = True)
table = tabulate.tabulate(glm_sorted, headers = ["Predictor", "Normalized Coefficient"], tablefmt = "orgtbl")
print("Variable Importances:\n\n" + table)

AUC of the training set : 0.7019232382289031
Variable Importances:

| Predictor        |   Normalized Coefficient |
|------------------+--------------------------|
| Origin.MDW       |               1.73309    |
| Origin.AUS       |               1.31225    |
| Origin.HNL       |               1.03038    |
| UniqueCarrier.WN |               0.974221   |
| Origin.ERI       |               0.961252   |
| Origin.ATL       |               0.897842   |
| Origin.TLH       |               0.895819   |
| Origin.MYR       |               0.871357   |
| Origin.BUR       |               0.866715   |
| Origin.LIH       |               0.866301   |
| Origin.PSP       |               0.845205   |
| UniqueCarrier.TW |               0.802473   |
| Origin.BNA       |               0.763528   |
| Dest.LYH         |               0.755978   |
| Origin.HPN       |               0.743651   |
| Origin.ABQ       |               0.739158   |
| Origin.PBI       |               0.718862   |
| Origin.STL       |
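
The AUC printed at the top of this output can be read as the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight. The sketch below computes it by comparing every positive/negative pair on a tiny made-up sample; H2O's own computation may differ in detail, and the quadratic loop is only meant for small inputs.

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = delayed) and predicted probabilities.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.6]
example_auc = auc(labels, scores)
print(example_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the GLM's training AUC of about 0.70 indicates a moderate ability to separate delayed from on-time flights.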

### Building a Deep Learning Model

Build a binary classification model with `H2ODeepLearningEstimator`, setting `distribution` to "bernoulli". Setting `epochs` to 100 requests 100 passes over the data.

In [31]:
# Deep Learning - Predict Delays
deeplearning_model = H2ODeepLearningEstimator(
    distribution = "bernoulli", model_id = "deeplearning_model_from_python",
    epochs = 100, hidden = [200,200],  
    seed = 6765686131094811000, variable_importances = True)
deeplearning_model.train(x               = myX,
                         y               = myY,
                         training_frame  = airlines_hex)

deeplearning Model Build progress: |█████████████████████████████████████████████| (done) 100%


Unnamed: 0,layer,units,type,dropout,l1,l2,mean_rate,rate_rms,momentum,mean_weight,weight_rms,mean_bias,bias_rms
,1,284,Input,0.0,,,,,,,,,
,2,200,Rectifier,0.0,0.0,0.0,0.0750737,0.2449265,0.0,-0.0044054,0.1859598,-0.9848622,0.4898831
,3,200,Rectifier,0.0,0.0,0.0,0.2748816,0.2558143,0.0,-0.0539884,0.267534,-0.722886,1.1734319
,4,2,Softmax,,0.0,0.0,0.0124277,0.0113767,0.0,0.0099541,0.5297868,0.1415117,0.3121001
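
The layer table above describes a fully connected network with 284 input units, two hidden layers of 200 units each, and 2 output units. As a quick illustration of its size, this sketch counts the weights and biases such a network holds; `count_parameters` is a hypothetical helper, not an H2O function.

```python
def count_parameters(layer_units):
    """Count weights and biases of a fully connected feed-forward network."""
    # Each pair of consecutive layers is connected by a units_in x units_out weight matrix.
    weights = sum(a * b for a, b in zip(layer_units, layer_units[1:]))
    # Every non-input unit has one bias term.
    biases = sum(layer_units[1:])
    return weights, biases

weights, biases = count_parameters([284, 200, 200, 2])
print(f"{weights} weights + {biases} biases = {weights + biases} parameters")
```

The input layer is wider than the eight predictors in `myX` because each categorical level, such as an individual origin airport, is expanded into its own input unit.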

actual/predicted,NO,YES,Error,Rate
NO,3560.0,1141.0,0.2427,(1141.0/4701.0)
YES,738.0,4552.0,0.1395,(738.0/5290.0)
Total,4298.0,5693.0,0.1881,(1879.0/9991.0)

metric,threshold,value,idx
max f1,0.4046646,0.8289174,233.0
max f2,0.1666235,0.8940719,323.0
max f0point5,0.6210219,0.8401102,156.0
max accuracy,0.4499335,0.8132319,217.0
max precision,0.9998963,1.0,0.0
max recall,0.0041663,1.0,396.0
max specificity,0.9998963,1.0,0.0
max absolute_mcc,0.4499335,0.6248724,217.0
max min_per_class_accuracy,0.4772917,0.8115312,208.0
max mean_per_class_accuracy,0.4648391,0.8123562,212.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.010009,0.9999993,1.8886578,1.8886578,1.0,0.9999999,1.0,0.9999999,0.0189036,0.0189036,88.8657845,88.8657845,0.0189036
2,0.020018,0.9999489,1.8886578,1.8886578,1.0,0.9999855,1.0,0.9999927,0.0189036,0.0378072,88.8657845,88.8657845,0.0378072
3,0.030027,0.9997672,1.8886578,1.8886578,1.0,0.9998817,1.0,0.9999557,0.0189036,0.0567108,88.8657845,88.8657845,0.0567108
4,0.0401361,0.9991708,1.8886578,1.8886578,1.0,0.9994901,1.0,0.9998385,0.0190926,0.0758034,88.8657845,88.8657845,0.0758034
5,0.050045,0.998088,1.8886578,1.8886578,1.0,0.998652,1.0,0.9996035,0.0187146,0.094518,88.8657845,88.8657845,0.094518
6,0.1000901,0.9849112,1.8848805,1.8867692,0.998,0.992787,0.999,0.9961953,0.0943289,0.1888469,88.4880529,88.6769187,0.1886342
7,0.150035,0.9496541,1.816745,1.8634589,0.9619238,0.9695726,0.9866578,0.9873329,0.0907372,0.2795841,81.6745021,86.3458941,0.2753297
8,0.2000801,0.8953598,1.7451198,1.8338594,0.924,0.9229824,0.9709855,0.9712372,0.0873346,0.3669187,74.5119849,83.3859368,0.3545809
9,0.3000701,0.7818583,1.6145283,1.7607734,0.8548549,0.8404205,0.9322882,0.9276462,0.1614367,0.5283554,61.4528328,76.0773408,0.4851731
10,0.4000601,0.6509172,1.4595034,1.6854747,0.7727728,0.7200721,0.8924193,0.8757657,0.1459357,0.6742911,45.950336,68.5474739,0.5828212

Unnamed: 0,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_logloss,training_r2,training_auc,training_pr_auc,training_lift,training_classification_error
,2023-07-30 12:04:20,0.000 sec,,0.0,0,0.0,,,,,,,
,2023-07-30 12:04:22,2.440 sec,44416 obs/sec,2.2684069,1,99760.0,0.469432,0.629929,0.1154604,0.7091771,0.7322682,1.8319981,0.3731358
,2023-07-30 12:04:28,8.093 sec,77650 obs/sec,13.6450271,6,600081.0,0.4453766,0.580311,0.2037914,0.7650196,0.7829732,1.8319981,0.3233911
,2023-07-30 12:04:33,13.437 sec,84939 obs/sec,25.0116422,11,1099962.0,0.4454378,0.5835513,0.2035726,0.7824397,0.8014845,1.8697713,0.3016715
,2023-07-30 12:04:38,18.669 sec,88683 obs/sec,36.3784392,16,1599851.0,0.4237609,0.5324108,0.2792016,0.8085517,0.8280524,1.8886578,0.2906616
,2023-07-30 12:04:44,23.785 sec,91248 obs/sec,47.7531266,21,2100087.0,0.412482,0.5069729,0.3170609,0.8256994,0.8461476,1.8886578,0.268842
,2023-07-30 12:04:49,28.984 sec,92533 obs/sec,59.1186957,26,2599922.0,0.3973176,0.4753209,0.3663527,0.8503522,0.8672209,1.8886578,0.2458212
,2023-07-30 12:04:54,34.047 sec,93827 obs/sec,70.4995907,31,3100431.0,0.3848477,0.447293,0.4055027,0.8687243,0.8854971,1.8886578,0.2307076
,2023-07-30 12:04:59,39.084 sec,94847 obs/sec,81.868366,36,3600407.0,0.3758412,0.4273738,0.4330031,0.8809501,0.8967838,1.8886578,0.2170954
,2023-07-30 12:05:05,45.054 sec,95916 obs/sec,95.5131429,42,4200477.0,0.3624713,0.3997701,0.4726253,0.8957555,0.9096769,1.8886578,0.1947753
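
The `epochs` and `samples` columns of the scoring history are related through the number of training rows: one epoch is one full pass over the data. The sketch below checks this, taking the row count from the total of the GLM confusion matrix (43978) as an assumption about the frame size.

```python
TRAINING_ROWS = 43978  # assumption: row count taken from the GLM confusion matrix total

def epochs_from_samples(samples, n_rows):
    """Number of passes over the data implied by the count of processed samples."""
    return samples / n_rows

first_epochs = epochs_from_samples(99760.0, TRAINING_ROWS)    # first scoring event
last_epochs = epochs_from_samples(4200477.0, TRAINING_ROWS)   # last scoring event
print(f"{first_epochs:.7f} {last_epochs:.7f}")
```

Both results agree with the `epochs` column of the table, which confirms that the fractional epoch values simply reflect how many rows had been processed when each scoring event occurred.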

variable,relative_importance,scaled_importance,percentage
DayofMonth,1.0,1.0,0.0098923
UniqueCarrier.PS,0.9896858,0.9896858,0.0097903
Year,0.9787593,0.9787593,0.0096822
DayOfWeek,0.9082621,0.9082621,0.0089848
UniqueCarrier.US,0.7447158,0.7447158,0.0073670
Origin.SFO,0.6747527,0.6747527,0.0066749
Dest.PHX,0.6635277,0.6635277,0.0065638
Dest.PHL,0.6591558,0.6591558,0.0065206
UniqueCarrier.UA,0.6531988,0.6531988,0.0064617
UniqueCarrier.PI,0.6325973,0.6325973,0.0062579


In [32]:
print("AUC of the training set : " + str(deeplearning_model.auc()))
# Return the variable importances as a pandas DataFrame
deeplearning_model.varimp(use_pandas = True)

AUC of the training set : 0.8988111969097996


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,DayofMonth,1.000000,1.000000,0.009892
1,UniqueCarrier.PS,0.989686,0.989686,0.009790
2,Year,0.978759,0.978759,0.009682
3,DayOfWeek,0.908262,0.908262,0.008985
4,UniqueCarrier.US,0.744716,0.744716,0.007367
...,...,...,...,...
279,Dest.LBB,0.181902,0.181902,0.001799
280,Dest.AMA,0.178747,0.178747,0.001768
281,Dest.missing(NA),0.000000,0.000000,0.000000
282,Origin.missing(NA),0.000000,0.000000,0.000000


### Shut down the cluster

Shut down the cluster now that we are done using it.

In [33]:
h2o.shutdown(prompt=False)

H2O session _sid_a81b closed.


