# RuleFit Analysis

The goal of this notebook is to use machine learning to automatically generate a set of rules that predict a target column. Because each rule is human-readable, the resulting model is explainable.

## Steps

1. Import and clean data
2. Train rule-fit model
3. Examine rules

## Step 1: Import and Clean Data

In [18]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "12.0.2" 2019-07-16; Java(TM) SE Runtime Environment (build 12.0.2+10); Java HotSpot(TM) 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)
  Starting server from /Users/megankurka/env2/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpwb0w4s7y
  JVM stdout: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpwb0w4s7y/h2o_megankurka_started_from_python.out
  JVM stderr: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpwb0w4s7y/h2o_megankurka_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime: 01 secs
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.28.1.1
H2O cluster version age: 18 days
H2O cluster name: H2O_from_python_megankurka_e7bqqs
H2O cluster total nodes: 1
H2O cluster free memory: 4 Gb
H2O cluster total cores: 16
H2O cluster allowed cores: 16


In [19]:
df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv",
                    col_types = {'pclass': "enum", 'survived': "enum"})
df.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,Allen Miss. Elisabeth Walton,female,29.0,0,0,24160.0,211.338,B5,S,2.0,,St Louis MO
1,1,Allison Master. Hudson Trevor,male,0.9167,1,2,113781.0,151.55,C22 C26,S,11.0,,Montreal PQ / Chesterville ON
1,0,Allison Miss. Helen Loraine,female,2.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,0,Allison Mr. Hudson Joshua Creighton,male,30.0,1,2,113781.0,151.55,C22 C26,S,,135.0,Montreal PQ / Chesterville ON
1,0,Allison Mrs. Hudson J C (Bessie Waldo Daniels),female,25.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,1,Anderson Mr. Harry,male,48.0,0,0,19952.0,26.55,E12,S,3.0,,New York NY
1,1,Andrews Miss. Kornelia Theodosia,female,63.0,1,0,13502.0,77.9583,D7,S,10.0,,Hudson NY
1,0,Andrews Mr. Thomas Jr,male,39.0,0,0,112050.0,0.0,A36,S,,,Belfast NI
1,1,Appleton Mrs. Edward Dale (Charlotte Lamson),female,53.0,2,0,11769.0,51.4792,C101,S,,,Bayside Queens NY
1,0,Artagaveytia Mr. Ramon,male,71.0,0,0,,49.5042,,C,,22.0,Montevideo Uruguay




## Step 2: Train Rule-Fit Model

We will train a RuleFit model to predict survival. The output of the RuleFit model is a set of rules describing whether or not a passenger is likely to survive. The model is built in the following steps:

1. Train a series of random forest models with different depths
2. Extract rules from the random forest models
3. Train a GLM with Lasso regularization that uses the rules as features to predict the target
4. Extract the most important rules
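
To make steps 2 and 3 more concrete, here is a minimal pure-Python sketch of how rules might be read off a single decision tree and turned into 0/1 features for a GLM. The tree, the helper names (`extract_rules`, `rule_matches`, `rule_features`) and the sample rows are all hypothetical and chosen for illustration; the sketch takes one rule per root-to-leaf path as a simplification and is not the implementation used by the `rulefit` module in this notebook.

```python
# Hypothetical toy tree: internal nodes hold a numeric split, leaves hold a prediction.
toy_tree = {
    "feature": "fare", "threshold": 52.0,
    "left": {"feature": "age", "threshold": 10.0,
             "left": {"leaf": 0.6},
             "right": {"leaf": 0.2}},
    "right": {"leaf": 0.7},
}

def extract_rules(node, conditions=()):
    """Return one rule (a tuple of conditions) per root-to-leaf path."""
    if "leaf" in node:
        return [conditions] if conditions else []
    left = conditions + ((node["feature"], "<", node["threshold"]),)
    right = conditions + ((node["feature"], ">=", node["threshold"]),)
    return extract_rules(node["left"], left) + extract_rules(node["right"], right)

def rule_matches(rule, row):
    """True when the row satisfies every condition in the rule."""
    for feature, op, threshold in rule:
        value = row[feature]
        if op == "<" and not value < threshold:
            return False
        if op == ">=" and not value >= threshold:
            return False
    return True

def rule_features(rules, rows):
    """Encode each row as a 0/1 vector with one entry per rule (the GLM's inputs)."""
    return [[int(rule_matches(rule, row)) for rule in rules] for row in rows]

toy_rules = extract_rules(toy_tree)
toy_rows = [{"fare": 30.0, "age": 8.0}, {"fare": 80.0, "age": 40.0}]
toy_features = rule_features(toy_rules, toy_rows)
```

With many trees the number of candidate rules grows quickly, which is why the Lasso penalty in step 3 matters: it drives most rule coefficients to zero and leaves a short, readable list.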

In [20]:
train, test = df.split_frame(ratios = [0.75], destination_frames=["train.hex", "test.hex"], seed = 1234)

In [None]:
from rulefit import H2ORuleFit

x =  ["age", "sibsp", "parch", "fare", "sex", "pclass"]
rulefit_model = H2ORuleFit(seed = 1234)
rulefit_model.train(training_frame = train, x = x, y = "survived")

drf Model Build progress: |███████████████████████████████████████████████| 100%

## Step 3: Examine Rules

In [None]:
print("Intercept: " + str(round(rulefit_model.intercept.get("Intercept"), 10)))
print("\n\n")

rules = rulefit_model.rule_importance
for i in range(len(rules)):
    print("Coefficient:" + str(round(rules.iloc[i]["coefficient"], 15)) 
          + "\nRule: " + rules.iloc[i]["rule"] + "\n\n")

In [None]:
rulefit_model.varimp_plot()

In [None]:
rulefit_model.coverage_table(df)

The model produces only 5 rules, summarized below:

**Highest Likelihood of Survival**

1. Women in class 1 or 2
2. Women with 2 or fewer siblings/spouses aboard

**Lowest Likelihood of Survival**

1. Men age 9+
2. Men age 10+ in class 2 or 3
3. Men age 6+ in class 2 or 3 with fare < $52

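Note: The rules are additive. If a passenger matches multiple rules, the coefficients of all matching rules are summed with the intercept to produce the prediction. Because the final model is a GLM, this sum is formed on the scale of the linear predictor (log-odds for a binary target), not directly on probabilities.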
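
As a worked illustration of additivity, the sketch below sums an intercept with the coefficients of the matching rules and converts the result to a probability. It assumes a binomial GLM with a logit link. The intercept, the coefficient values and the helper `survival_probability` are hypothetical and are not the values fitted in this notebook.

```python
import math

# Hypothetical intercept and coefficients, chosen only for illustration;
# they are NOT the values fitted in this notebook.
intercept = -0.5
rule_coefficients = {
    "female and pclass in {1, 2}": 1.2,
    "female and sibsp <= 2": 0.8,
    "male and age >= 9": -0.9,
}

def survival_probability(matched_rules):
    """Sum the intercept and matched coefficients, then apply the logistic link."""
    log_odds = intercept + sum(rule_coefficients[r] for r in matched_rules)
    return 1.0 / (1.0 + math.exp(-log_odds))

# A passenger matching both "female" rules versus one matching no rule at all.
p_both = survival_probability(["female and pclass in {1, 2}", "female and sibsp <= 2"])
p_none = survival_probability([])
```

Under these made-up numbers, matching both positive rules raises the log-odds from -0.5 to 1.5, so the predicted probability rises, but not by a fixed amount per rule: the same coefficient shifts the probability more near 0.5 than near 0 or 1.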

We will predict on our test data and see how well our rulefit model performs.

In [None]:
predictions = rulefit_model.predict(test)
predictions = test["survived"].cbind(predictions)
predictions.head()

In [None]:
positives = predictions[predictions["predict"] == "1"]
negatives = predictions[predictions["predict"] == "0"]
print("How many times we correctly predicted survived: {:.2%}".format(positives[positives["survived"] == positives["predict"]].nrow/positives.nrow))
print("How many times we correctly predicted not survived: {:.2%}".format(negatives[negatives["survived"] == negatives["predict"]].nrow/negatives.nrow))


In [None]:
print("Accuracy with RuleFit Model: {:.2%}".format(predictions[predictions["survived"] == predictions["predict"]].nrow/predictions.nrow))
print("Accuracy with Constant Model: {:.2%}".format(predictions[predictions["survived"] == "0"].nrow/predictions.nrow))
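
The quantities printed in the two cells above are the share of correct predictions within each predicted class, the overall accuracy, and the accuracy of a constant model that always predicts "0". The pure-Python sketch below computes the same quantities on a small set of hypothetical labels (made up for illustration, not the notebook's test set); the helper `share_correct` is likewise an assumption, not part of H2O.

```python
# Hypothetical labels, made up for illustration (not the notebook's test set).
actual    = ["1", "0", "0", "1", "0", "0", "1", "0"]
predicted = ["1", "0", "0", "0", "0", "1", "1", "0"]

def share_correct(actual, predicted, among=None):
    """Fraction of correct predictions, optionally restricted to one predicted class."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if among is None or p == among]
    if not pairs:
        return float("nan")  # no rows were predicted as this class
    return sum(a == p for a, p in pairs) / len(pairs)

precision_survived = share_correct(actual, predicted, among="1")
precision_not_survived = share_correct(actual, predicted, among="0")
accuracy = share_correct(actual, predicted)

# Constant model: always predict "0" (did not survive).
baseline = sum(a == "0" for a in actual) / len(actual)

print("Correct among predicted survived: {:.2%}".format(precision_survived))
print("Correct among predicted not survived: {:.2%}".format(precision_not_survived))
print("Accuracy: {:.2%} (constant model: {:.2%})".format(accuracy, baseline))
```

Comparing against the constant model is useful because most passengers in this dataset did not survive, so a model has to beat the majority-class rate before its accuracy says anything about the rules themselves.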


In [None]:
h2o.cluster().shutdown()