# Rulefit demo - Airlines Dataset

## H2O Rulefit algorithm

The Rulefit algorithm combines tree ensembles and linear models to take advantage of both methods: the accuracy of a tree ensemble and the interpretability of a linear model. The general algorithm fits a tree ensemble to the data, builds a rule ensemble by traversing each tree, evaluates the rules on the data to build a rule feature set, and fits a sparse linear model (LASSO) to the rule feature set joined with the original feature set.

For more information, refer to http://statweb.stanford.edu/~jhf/ftp/RuleFit.pdf by Jerome H. Friedman and Bogdan E. Popescu.
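
To make the "evaluates the rules on the data to build a rule feature set" step more concrete, here is a minimal pure-Python sketch. It is not H2O's implementation: the toy rows, the rule names `rule_a` and `rule_b`, and the helper functions are hypothetical and chosen only for illustration. Each rule is treated as a conjunction of conditions and turned into one 0/1 feature column.

```python
# Hypothetical toy data: each row is a flight described by two columns.
rows = [
    {"UniqueCarrier": "UA", "Distance": 300},
    {"UniqueCarrier": "WN", "Distance": 1200},
    {"UniqueCarrier": "DL", "Distance": 900},
]

# A rule is a conjunction of conditions collected along one path of a tree.
# Each condition is a (column, predicate) pair. Names are illustrative only.
rules = {
    "rule_a": [("UniqueCarrier", lambda v: v in {"DL", "UA"})],
    "rule_b": [("UniqueCarrier", lambda v: v in {"DL", "UA"}),
               ("Distance", lambda v: v >= 500)],
}


def rule_fires(rule, row):
    """Return 1 if the row satisfies every condition of the rule, else 0."""
    return int(all(predicate(row[column]) for column, predicate in rule))


def rule_features(rules, rows):
    """Evaluate every rule on every row: one 0/1 feature per rule and row."""
    return [{name: rule_fires(rule, row) for name, rule in rules.items()}
            for row in rows]


features = rule_features(rules, rows)
for row, feats in zip(rows, features):
    print(row, feats)
```

In the full algorithm, these 0/1 columns would be joined with the original features before the sparse linear model is fitted.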

## Demo example

We will train a Rulefit model to find the rules that predict whether or not a flight is delayed at take-off:


In [1]:
import h2o
from h2o.estimators import H2ORuleFitEstimator, H2ORandomForestEstimator

# init h2o cluster
h2o.init(url="http://192.168.59.134:50000", strict_version_check=False)

versionFromGradle='3.31.0',projectVersion='3.31.0.99999',branch='zuzana_rulefit_weights_and_response',lastCommitHash='a4583d2882a2f233e5038a543e1cffa3253e78f7',gitDescribe='jenkins-master-5213-46-ga4583d2882-dirty',compiledOn='2020-10-08 11:41:51',compiledBy='zuzanaolajcova'
Checking whether there is an H2O instance running at http://192.168.59.134:50000 . connected.


H2O_cluster_uptime:,24 secs
H2O_cluster_timezone:,Europe/Prague
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.0.99999
H2O_cluster_version_age:,3 days
H2O_cluster_name:,rulefit_local
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,55.51 Gb
H2O_cluster_total_cores:,40
H2O_cluster_allowed_cores:,40


In [2]:
# Import the airlines dataset into H2O:
airlines = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/bigdata/server/airlines_all.csv")

airlines["Year"] = airlines["Year"].asfactor()
airlines["Month"] = airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines["FlightNum"] = airlines["FlightNum"].asfactor()

predictors = ["Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"

Parse progress: |█████████████████████████████████████████████████████████| 100%


Using the `algorithm` parameter, a user can set whether the algorithm uses DRF or GBM to fit the tree ensemble.

Using the `min_rule_length` and `max_rule_length` parameters, a user can set the interval of tree ensemble depths to be fitted. The bigger this interval is, the more tree ensembles will be fitted (one per depth) and the bigger the rule feature set will be.

Using the `max_num_rules` parameter, the maximum number of rules to return can be set.

Using the `model_type` parameter, the type of base learners in the ensemble can be set (`rules`, `linear`, or `rules_and_linear`).

Using the `rule_generation_ntrees` parameter, the number of trees for the tree ensemble can be set.
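
As a rough illustration of how the depth interval drives the amount of work, the sketch below counts the tree ensembles implied by a given `min_rule_length` and `max_rule_length`. The helper `rule_generation_plan` is hypothetical and not part of the H2O API, and the total tree count assumes that every per-depth ensemble contains `rule_generation_ntrees` trees, which this demo does not state explicitly.

```python
def rule_generation_plan(min_rule_length, max_rule_length, rule_generation_ntrees):
    """Summarise the tree ensembles implied by the rule-length interval.

    Assumption: one ensemble is fitted per depth in the closed interval
    [min_rule_length, max_rule_length], each with rule_generation_ntrees trees.
    """
    if min_rule_length > max_rule_length:
        raise ValueError("min_rule_length must not exceed max_rule_length")
    depths = list(range(min_rule_length, max_rule_length + 1))
    return {
        "depths": depths,
        "n_ensembles": len(depths),
        "total_trees": len(depths) * rule_generation_ntrees,
    }


plan = rule_generation_plan(min_rule_length=1, max_rule_length=4,
                            rule_generation_ntrees=50)
print(plan)
```

Under that assumption, widening the interval by one depth adds one more ensemble's worth of trees, and therefore more candidate rules for the linear model to select from.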

In [3]:
rfit = H2ORuleFitEstimator(algorithm="drf",
                           min_rule_length=1,
                           max_rule_length=4,
                           max_num_rules=30,
                           model_type="rules_and_linear",
                           seed=1234,
                           rule_generation_ntrees=50)
rfit.train(training_frame=airlines, x=predictors, y=response)

rulefit Model Build progress: |███████████████████████████████████████████| 100%


The output for the Rulefit model includes:
- model parameters
- rule importances in tabular form
- training and validation metrics of the underlying linear model

In [4]:
from IPython.display import display, HTML
rule_importance = rfit.rule_importance()
# Make a pretty HTML table printout of the results
(table, nr, is_pandas) = rule_importance._as_show_table()
display(HTML(table.to_html()))

Unnamed: 0,Unnamed: 1,variable,coefficient,rule
0,,M1T11N7,0.949084,"(UniqueCarrier in {DL, F9, PI, PS, UA, US}) & (Year in {1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 2000} or Year is NA)"
1,,M2T14N17,-0.641124,"(UniqueCarrier in {DL, F9, PI, PS, UA, US}) & (UniqueCarrier in {DL, F9, PS, UA, US}) & (Year in {1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 2000} or Year is NA)"
2,,M1T10N10,-0.207439,"(UniqueCarrier in {9E, AA, AQ, AS, B6, CO, DH, EA, EV, FL, HA, HP, ML (1), MQ, NW, OH, OO, PA (1), TW, TZ, WN, XE, YV} or UniqueCarrier is NA) & (Origin in {ABE, ABI, ABQ, ABY, ACT, ACY, AEX, AGS, ALB, ALO, AMA, ANC, ANI, APF, ATW, AUS, AVL, AVP, AZO, BDL, BFF, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BOI, BPT, BQK, BQN, BRO, BTM, BTR, BTV, BUF, BUR, BZN, CAE, CAK, CCR, CDC, CDV, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, DAB, DAY, DBQ, DCA, DET, DHN, DLH, DRO, DSM, EAU, EFD, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FNT, FSD, FSM, FWA, GCN, GEG, GFK, GGG, GJT, GLH, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GTF, GUC, GUM, HDN, HKY, HLN, HNL, HPN, HRL, HSV, HTS, HVN, IAD, ICT, IDA, ILE, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, KOA, KTN, LAN, LAW, LBB, LCH, LEX, LFT, LGB, LIH, LIT, LNK, LNY, LRD, LSE, LWB, LWS, LYH, MAF, MAZ, MBS, MCI, MCN, MCO, MDT, MEM, MFE, MFR, MGM, MHT, MIB, MKC, MKE, MKG, MKK, MLB, MLI, MLU, MOB, MOT, MQT, MRY, MSN, MSO, MSY, MTH, MTJ, MYR, OGG, OKC, OMA, ONT, ORF, ORH, OXR, PBI, PDX, PFN, PHF, PHL, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSP, PUB, PVD, PWM, RAP, RDD, RDM, RDR, RFD, RHI, RIC, RKS, RNO, ROA, ROC, ROP, ROR, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SCK, SDF, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SPN, SPS, SRQ, STT, STX, SUN, SUX, SWF, SYR, TLH, TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, UCA, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YUM})"
3,,M0T9N4,0.188001,"(UniqueCarrier in {9E, AA, AQ, AS, B6, CO, DH, EA, EV, FL, HA, HP, ML (1), MQ, NW, OH, OO, PA (1), TW, TZ, WN, XE, YV} or UniqueCarrier is NA)"
4,,M2T40N15,0.152916,"(Origin in {ACK, ADK, ADQ, AKN, ASE, ATL, BET, BFF, BFI, BOS, BRW, BWI, CAE, CKB, CLT, CVG, CYS, DAY, DEN, DFW, DLG, DTW, DUT, ERI, EWR, EYW, FLL, FMN, HVN, JFK, LAS, LAX, LGA, MIA, MKC, MOD, MTH, OGD, OME, ORD, OTH, OTZ, PHL, PHX, PIR, PIT, PSG, PVU, ROA, SEA, SFO, SLC, SOP, SYR, TEX, UCA, YAP} or Origin is NA) & (Year in {1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 2000, 2001, 2007} or Year is NA) & (UniqueCarrier in {AA, DL, EV, F9, MQ, PI, UA, US, XE} or UniqueCarrier is NA)"
5,,M1T36N7,0.121472,"(Year in {1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 2000, 2001, 2006, 2007, 2008} or Year is NA) & (UniqueCarrier in {DL, EV, PI, PS, UA, US})"
6,,M2T21N15,0.118678,"(Dest in {ABE, ABQ, ACK, ACV, ACY, ADK, AGS, AKN, ALB, AMA, ANC, APF, ASE, ATL, AUS, AVL, AVP, BDL, BFI, BGM, BGR, BHM, BMI, BOI, BOS, BQK, BQN, BTR, BTV, BUF, BWI, BZN, CAE, CAK, CCR, CHA, CHS, CID, CLT, CMH, CMI, COS, CRP, CRW, CSG, CWA, CYS, DAB, DAY, DBQ, DCA, DEN, DHN, DSM, DUT, EGE, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAY, FCA, FLL, FLO, FNT, FWA, GEG, GNV, GRK, GRR, GSO, GSP, GST, GUC, HDN, HHH, HPN, HRL, HSV, HTS, HVN, ICT, ILG, ILM, IND, ISO, ISP, ITH, JAC, JAN, JAX, JFK, LAR, LAS, LAX, LEX, LGA, LIT, LMT, LNK, LYH, MAZ, MCI, MCN, MCO, MDT, MEI, MFE, MFR, MGM, MHT, MIA, MKC, MKE, MLB, MLU, MOB, MQT, MSN, MSY, MYR, OAJ, OAK, OGD, OKC, OMA, ONT, ORD, ORF, ORH, OTH, OTZ, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIR, PIT, PNS, PSC, PSE, PSG, PVD, PWM, RCA, RDU, RFD, RIC, RNO, ROA, ROC, ROW, RSW, SAN, SAT, SAV, SBN, SCK, SDF, SEA, SFO, SHV, SIT, SJU, SKA, SLC, SMF, SOP, SPI, SRQ, STT, SWF, SYR, TLH, TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TYS, UCA, VIS, VLD, VPS, WRG, YAP} or Dest is NA) & (UniqueCarrier in {AA, DL, PI, PS, TW, UA, US} or UniqueCarrier is NA) & (Year in {1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 2000, 2007} or Year is NA)"
7,,M1T18N7,0.116154,"(Year in {1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 2000, 2001, 2006, 2007, 2008} or Year is NA) & (Origin in {ACK, ADK, ATL, BFI, BOS, BWI, CEC, CKB, CLT, CRW, CVG, DEN, DFW, DUT, EWR, HVN, JFK, LAX, MKK, MOD, MTH, ORD, OTH, PHL, PIR, PIT, SFO, SLC, SOP, TEX, UCA})"
8,,M1T2N10,-0.110389,"(UniqueCarrier in {9E, AA, AQ, AS, B6, CO, DH, EA, EV, FL, HA, HP, ML (1), MQ, NW, OH, OO, PA (1), TW, TZ, WN, XE, YV} or UniqueCarrier is NA) & (Origin in {ABE, ABI, ABQ, ABY, ACT, ACY, AEX, AGS, ALB, ALO, AMA, ANC, ANI, APF, ATW, AUS, AVL, AVP, AZO, BDL, BFF, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BPT, BQK, BQN, BRO, BTM, BTR, BTV, BUF, BUR, BZN, CAE, CAK, CCR, CDC, CDV, CHA, CHO, CHS, CIC, CID, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAY, DBQ, DCA, DET, DHN, DLH, DRO, DSM, EAU, EFD, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FNT, FSD, FSM, FWA, GCN, GEG, GFK, GGG, GJT, GLH, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GTF, GTR, GUC, GUM, HDN, HKY, HLN, HNL, HPN, HRL, HSV, HTS, HVN, IAD, ICT, IDA, ILE, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, KOA, KSM, KTN, LAN, LAW, LBB, LCH, LEX, LFT, LGB, LIH, LIT, LNK, LNY, LRD, LSE, LWB, LWS, LYH, MAF, MAZ, MBS, MCI, MCN, MCO, MDT, MEM, MFE, MFR, MGM, MHT, MIB, MKE, MKG, MKK, MLB, MLI, MLU, MOB, MOT, MQT, MRY, MSN, MSO, MSY, MTJ, MYR, OGG, OKC, OMA, ONT, ORF, ORH, OXR, PBI, PDX, PFN, PHF, PHL, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSP, PUB, PVD, PWM, RAP, RDD, RDM, RDR, RFD, RHI, RIC, RKS, RNO, ROA, ROC, ROP, ROR, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SCK, SDF, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SPN, SPS, SRQ, STT, STX, SUN, SUX, SWF, SYR, TLH, TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, UCA, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YUM})"
9,,M1T6N7,0.079796,"(Origin in {ACK, ADK, ADQ, AKN, ASE, ATL, BET, BFF, BFI, BOS, BRW, BWI, CAE, CKB, CLT, CVG, CYS, DAY, DEN, DFW, DLG, DTW, DUT, ERI, EWR, EYW, FLL, FMN, HVN, JFK, LAS, LAX, LGA, MIA, MKC, MOD, MTH, OGD, OME, ORD, OTH, OTZ, PHL, PHX, PIR, PIT, PSG, PVU, ROA, SEA, SFO, SLC, SOP, SYR, TEX, UCA, YAP} or Origin is NA) & (UniqueCarrier in {AA, DL, F9, PI, PS, UA, US} or UniqueCarrier is NA)"
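
The identifiers in the `variable` column appear to follow an `M<model>T<tree>N<node>` pattern. The sketch below splits such an identifier into its parts. The interpretation (index of the per-depth model, index of the tree within it, node number within the tree) is an assumption that is consistent with the table above but is not stated in this demo, and `parse_rule_id` is a hypothetical helper, not an H2O function.

```python
import re

# Assumed layout of a rule identifier: M<model>T<tree>N<node>.
RULE_ID = re.compile(r"^M(\d+)T(\d+)N(\d+)$")


def parse_rule_id(rule_id):
    """Split an identifier such as 'M2T40N15' into integer components."""
    match = RULE_ID.match(rule_id)
    if match is None:
        # Anything that does not follow the assumed pattern is rejected.
        raise ValueError(f"not a rule identifier: {rule_id!r}")
    model, tree, node = (int(part) for part in match.groups())
    return {"model": model, "tree": tree, "node": node}


parsed = parse_rule_id("M2T40N15")
print(parsed)
```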


Note: The rules are additive. That means that if a flight satisfies multiple rules, the coefficients of all of those rules are added together in the underlying linear model to produce its prediction.
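
A small worked example of this additivity, using three rules from the table above that depend only on `UniqueCarrier` and `Year`. This is a sketch under several assumptions: the "is NA" branches are left out, the model intercept and all other rules and linear terms are ignored, and the sum is taken to live on the scale of the underlying linear model (for a binary response presumably a log-odds scale rather than a probability). The helper `rule_contribution` is hypothetical.

```python
# Year sets copied from the rules in the table (NA handling omitted).
YEARS_A = set(range(1989, 1999)) | {2000}
YEARS_B = set(range(1988, 1999)) | {2000, 2001, 2006, 2007, 2008}

# (rule id, coefficient, condition) for three rules from the table.
RULES = [
    ("M1T11N7", 0.949084,
     lambda f: f["UniqueCarrier"] in {"DL", "F9", "PI", "PS", "UA", "US"}
     and f["Year"] in YEARS_A),
    ("M2T14N17", -0.641124,
     lambda f: f["UniqueCarrier"] in {"DL", "F9", "PS", "UA", "US"}
     and f["Year"] in YEARS_A),
    ("M1T36N7", 0.121472,
     lambda f: f["Year"] in YEARS_B
     and f["UniqueCarrier"] in {"DL", "EV", "PI", "PS", "UA", "US"}),
]


def rule_contribution(flight):
    """Return the ids of the rules that fire and the sum of their coefficients."""
    fired = [(rule_id, coef) for rule_id, coef, cond in RULES if cond(flight)]
    return [rule_id for rule_id, _ in fired], sum(coef for _, coef in fired)


fired_ua, total_ua = rule_contribution({"UniqueCarrier": "UA", "Year": 1995})
fired_pi, total_pi = rule_contribution({"UniqueCarrier": "PI", "Year": 1995})
print(fired_ua, round(total_ua, 6))
print(fired_pi, round(total_pi, 6))
```

A 1995 UA flight satisfies all three rules, so its contribution from them is 0.949084 - 0.641124 + 0.121472 = 0.429432. A 1995 PI flight does not satisfy `M2T14N17`, so only the two positive coefficients are added.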