# RuleFit Analysis

The goal of this notebook is to use machine learning to automatically create a set of rules that predict a target column. Because each rule is a readable condition on the input columns, the resulting model is explainable.

## Steps

1. Import and clean data
2. Train rule-fit model
3. Examine rules
4. Customizations

## Step 1: Import and Clean Data

In [1]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "12.0.2" 2019-07-16; Java(TM) SE Runtime Environment (build 12.0.2+10); Java HotSpot(TM) 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)
  Starting server from /Users/megankurka/env2/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmp9tuw0k38
  JVM stdout: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmp9tuw0k38/h2o_megankurka_started_from_python.out
  JVM stderr: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmp9tuw0k38/h2o_megankurka_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,America/New_York
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.0.2
H2O_cluster_version_age:,1 day
H2O_cluster_name:,H2O_from_python_megankurka_gr81uf
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,4 Gb
H2O_cluster_total_cores:,16
H2O_cluster_allowed_cores:,16


In [2]:
df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv",
                    col_types={'pclass': "enum", 'survived': "enum"})
df.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,Allen Miss. Elisabeth Walton,female,29.0,0,0,24160.0,211.338,B5,S,2.0,,St Louis MO
1,1,Allison Master. Hudson Trevor,male,0.9167,1,2,113781.0,151.55,C22 C26,S,11.0,,Montreal PQ / Chesterville ON
1,0,Allison Miss. Helen Loraine,female,2.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,0,Allison Mr. Hudson Joshua Creighton,male,30.0,1,2,113781.0,151.55,C22 C26,S,,135.0,Montreal PQ / Chesterville ON
1,0,Allison Mrs. Hudson J C (Bessie Waldo Daniels),female,25.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,1,Anderson Mr. Harry,male,48.0,0,0,19952.0,26.55,E12,S,3.0,,New York NY
1,1,Andrews Miss. Kornelia Theodosia,female,63.0,1,0,13502.0,77.9583,D7,S,10.0,,Hudson NY
1,0,Andrews Mr. Thomas Jr,male,39.0,0,0,112050.0,0.0,A36,S,,,Belfast NI
1,1,Appleton Mrs. Edward Dale (Charlotte Lamson),female,53.0,2,0,11769.0,51.4792,C101,S,,,Bayside Queens NY
1,0,Artagaveytia Mr. Ramon,male,71.0,0,0,,49.5042,,C,,22.0,Montevideo Uruguay




## Step 2: Train Rule-Fit Model

We will train a RuleFit model to predict survival. The output of the model is a set of rules describing whether or not a passenger is likely to survive. RuleFit works in the following steps:

1. Train a series of random forest models with different depths
2. Extract rules from the random forest models
3. Train a GLM model with Lasso regularization using the rules to predict the target. 
4. Extract the most important rules.
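
To make steps 2 and 3 more concrete, here is a minimal plain-Python sketch of how a rule can be turned into a 0/1 feature that a sparse linear model can then weight. The rules and passengers are hypothetical and are not taken from the H2O model. H2O performs this conversion internally, so this only illustrates the idea.

```python
# Toy illustration: each rule extracted from a tree becomes a 0/1 feature.
# The passengers and rules below are hypothetical, made up for this sketch.
passengers = [
    {"sex": "female", "pclass": 1, "age": 29.0},
    {"sex": "male", "pclass": 3, "age": 40.0},
    {"sex": "male", "pclass": 1, "age": 5.0},
]

# A rule is a conjunction of the conditions along one path in a tree.
rules = {
    "female_upper_class": lambda p: p["sex"] == "female" and p["pclass"] in (1, 2),
    "older_male_lower_class": lambda p: (
        p["sex"] == "male" and p["pclass"] in (2, 3) and p["age"] >= 10
    ),
}

def rule_features(passenger):
    """Return one 0/1 indicator per rule for a single passenger."""
    return {name: int(rule(passenger)) for name, rule in rules.items()}

# One row of indicators per passenger; a Lasso-regularized GLM would be
# trained on columns like these and would drop rules with zero weight.
feature_matrix = [rule_features(p) for p in passengers]
for row in feature_matrix:
    print(row)
```

The third passenger matches neither rule, so both of its indicators are 0 and only the intercept and any linear terms would contribute to its prediction.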

In [3]:
train, test = df.split_frame(ratios=[0.75], destination_frames=["train.hex", "test.hex"], seed=1234)

In [30]:
x =  ["age", "sibsp", "parch", "fare", "sex", "pclass"]

from h2o.estimators import H2ORuleFitEstimator
rfit = H2ORuleFitEstimator(seed=1234)
rfit.train(training_frame=train, x=x, y="survived")

rulefit Model Build progress: |███████████████████████████████████████████| 100%


## Step 3: Examine Rules

In [31]:
rule_importance = rfit.rule_importance()
(table, nr, is_pandas) = rule_importance._as_show_table()

In [32]:
from IPython.display import display, HTML
import pandas as pd
pd.options.display.max_colwidth = 0
display(HTML(table.head().to_html()))

Unnamed: 0,Unnamed: 1,variable,coefficient,rule
0,,M0T18N12,0.766079,"(pclass in {1, 2}) & (age < 60.49951934814453 or age is NA) & (sex in {female})"
1,,M0T44N21,-0.441729,"(sex in {male} or sex is NA) & (pclass in {2, 3} or pclass is NA) & (age >= 9.569365501403809 or age is NA)"
2,,linear.sex.female,0.408274,
3,,linear.sex.male,-0.171397,
4,,M0T28N20,-0.139191,(sex in {male} or sex is NA) & (fare < 51.03279113769531 or fare is NA) & (age >= 6.469268798828125 or age is NA)


The table above shows the five terms with the largest coefficients: three rules and two linear terms for `sex`. They are recapped below:

**Highest Likelihood of Survival**

1. Women
2. Women in class 1 or 2 who are younger than about 60

**Lowest Likelihood of Survival**

1. Men
2. Men aged about 6 or older who paid less than about 51 dollars for their tickets
3. Men aged about 10 or older in class 2 or 3

Note: The rules are additive. If a passenger matches multiple rules, the coefficients of those rules are summed. Because the target is binary, the sum is on the log-odds scale of the underlying GLM, and the model converts it to a probability.
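
As a worked example of this additivity, the sketch below sums the coefficients from the table above for two hypothetical passengers. The intercept is not shown in the notebook, so it is assumed to be 0 here, and any terms beyond the five displayed are ignored. The resulting probabilities are therefore illustrative and will not match the model's actual predictions.

```python
import math

# Coefficients copied from the rule importance table above.
coefficients = {
    "M0T18N12": 0.766079,           # women, class 1 or 2, age < ~60
    "M0T44N21": -0.441729,          # men, class 2 or 3, age >= ~10
    "linear.sex.female": 0.408274,
    "linear.sex.male": -0.171397,
    "M0T28N20": -0.139191,          # men, fare < ~51, age >= ~6
}

ASSUMED_INTERCEPT = 0.0  # assumption: the real intercept is not shown

def log_odds(matched_terms):
    """Sum the coefficients of every term the passenger matches."""
    return ASSUMED_INTERCEPT + sum(coefficients[t] for t in matched_terms)

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 29-year-old woman in class 1: one rule plus one linear term.
woman = log_odds(["M0T18N12", "linear.sex.female"])

# Hypothetical 30-year-old man in class 3 with a cheap ticket:
# two rules plus one linear term.
man = log_odds(["M0T44N21", "linear.sex.male", "M0T28N20"])

print("woman: log-odds {:.6f}, probability {:.3f}".format(woman, sigmoid(woman)))
print("man:   log-odds {:.6f}, probability {:.3f}".format(man, sigmoid(man)))
```

The summed log-odds is positive for the woman and negative for the man, which matches the direction of the rules in the recap.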

We will predict on our test data and see how well our rulefit model performs.

In [33]:
predictions = rfit.predict(test)
predictions = test["survived"].cbind(predictions)
predictions.head()

rulefit prediction progress: |████████████████████████████████████████████| 100%


survived,predict,p0,p1
0,0,0.629429,0.370571
0,0,0.675644,0.324356
1,0,0.678355,0.321645
1,1,0.293804,0.706196
1,1,0.316881,0.683119
0,0,0.64067,0.35933
1,0,0.643983,0.356017
1,1,0.320156,0.679844
1,0,0.638349,0.361651
1,0,0.678833,0.321167




In [34]:
positives = predictions[predictions["predict"] == "1"]
negatives = predictions[predictions["predict"] == "0"]
print("How many times we correctly predicted survived: {:.2%}".format(positives[positives["survived"] == positives["predict"]].nrow/positives.nrow))
print("How many times we correctly predicted not survived: {:.2%}".format(negatives[negatives["survived"] == negatives["predict"]].nrow/negatives.nrow))


How many times we correctly predicted survived: 76.00%
How many times we correctly predicted not survived: 79.82%


In [35]:
print("Accuracy with RuleFit Model: {:.2%}".format(predictions[predictions["survived"] == predictions["predict"]].nrow/predictions.nrow))
print("Accuracy with Constant Model: {:.2%}".format(predictions[predictions["survived"] == "0"].nrow/predictions.nrow))


Accuracy with RuleFit Model: 78.66%
Accuracy with Constant Model: 62.80%
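
To show what these four numbers measure, the sketch below repeats the same calculations in plain Python on only the ten rows printed by `predictions.head()`. Ten rows are far too few to be representative, so the values differ from the full test-set figures above. The point is only to make the definitions explicit.

```python
# (survived, predict) pairs copied from the ten preview rows shown earlier.
rows = [(0, 0), (0, 0), (1, 0), (1, 1), (1, 1),
        (0, 0), (1, 0), (1, 1), (1, 0), (1, 0)]

predicted_pos = [(a, p) for a, p in rows if p == 1]
predicted_neg = [(a, p) for a, p in rows if p == 0]

# Share of "survived" predictions that were right (precision for class 1).
precision_pos = sum(a == p for a, p in predicted_pos) / len(predicted_pos)

# Share of "not survived" predictions that were right (precision for class 0).
precision_neg = sum(a == p for a, p in predicted_neg) / len(predicted_neg)

# Overall share of correct predictions.
accuracy = sum(a == p for a, p in rows) / len(rows)

# Accuracy of a constant model that always predicts "not survived".
constant_accuracy = sum(a == 0 for a, _ in rows) / len(rows)

print("{:.2%} {:.2%} {:.2%} {:.2%}".format(
    precision_pos, precision_neg, accuracy, constant_accuracy))
```

The first two values are per-class precision, while the constant model serves as the baseline that the RuleFit accuracy should beat.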


## Step 4: Customizations

In this section, we train a new RuleFit model to predict `parch`. We customize the model so that it uses the `poisson` distribution and GBM as the tree algorithm. We also restrict it to at most 10 rules, each with a length of at most 2.

In [None]:
x =  ["age", "pclass", "sibsp", "fare", "sex", "survived"]
rulefit_parch_model = H2ORuleFitEstimator(algorithm="GBM", seed=1234,
                                          max_num_rules=10,
                                          max_rule_length=2,
                                          distribution="poisson"
                                         )
rulefit_parch_model.train(training_frame=train, x=x, y="parch")

In [None]:
rule_importance = rulefit_parch_model.rule_importance()
(table, nr, is_pandas) = rule_importance._as_show_table()

In [None]:
from IPython.display import display, HTML
import pandas as pd
pd.options.display.max_colwidth = 0
display(HTML(table.head().to_html()))
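
When reading the coefficients of the `parch` model, keep in mind that they are not on the same scale as in the survival model. Assuming the underlying GLM uses the log link that is standard for Poisson regression, a matched rule multiplies the expected count by the exponential of its coefficient. The sketch below illustrates this with hypothetical numbers, since the notebook does not show the fitted coefficients.

```python
import math

# Hypothetical values for illustration only; they are not taken from
# the fitted parch model.
intercept = -1.0
rule_coefficient = 0.5

# With a log link, the expected count is exp(linear predictor).
mean_without_rule = math.exp(intercept)
mean_with_rule = math.exp(intercept + rule_coefficient)

# Matching the rule scales the expected count by exp(coefficient).
ratio = mean_with_rule / mean_without_rule

print("expected parch without rule: {:.3f}".format(mean_without_rule))
print("expected parch with rule:    {:.3f}".format(mean_with_rule))
print("multiplicative effect:       {:.3f}".format(ratio))
```

A positive coefficient therefore raises the expected number of parents and children aboard by a fixed factor, while a negative one lowers it.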

In [44]:
h2o.cluster().shutdown()

H2O session _sid_8ad8 closed.
