# RuleFit Analysis

The goal of the notebook is to use machine learning to automatically create a set of rules to predict a target column.  The rules provide an explainable solution.

## Steps

1. Import and clean data
2. Train rule-fit model
3. Examine rules

## Step 1. Import and Clean Data

In [1]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "12.0.2" 2019-07-16; Java(TM) SE Runtime Environment (build 12.0.2+10); Java HotSpot(TM) 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)
  Starting server from /Users/megankurka/env/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpl11k4l_i
  JVM stdout: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpl11k4l_i/h2o_megankurka_started_from_python.out
  JVM stderr: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpl11k4l_i/h2o_megankurka_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O cluster uptime:,01 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.28.0.3
H2O cluster version age:,12 days
H2O cluster name:,H2O_from_python_megankurka_zt4i9s
H2O cluster total nodes:,1
H2O cluster free memory:,4 Gb
H2O cluster total cores:,16
H2O cluster allowed cores:,16


In [2]:
df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
df.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%


pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,Allen Miss. Elisabeth Walton,female,29.0,0,0,24160.0,211.338,B5,S,2.0,,St Louis MO
1,1,Allison Master. Hudson Trevor,male,0.9167,1,2,113781.0,151.55,C22 C26,S,11.0,,Montreal PQ / Chesterville ON
1,0,Allison Miss. Helen Loraine,female,2.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,0,Allison Mr. Hudson Joshua Creighton,male,30.0,1,2,113781.0,151.55,C22 C26,S,,135.0,Montreal PQ / Chesterville ON
1,0,Allison Mrs. Hudson J C (Bessie Waldo Daniels),female,25.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,1,Anderson Mr. Harry,male,48.0,0,0,19952.0,26.55,E12,S,3.0,,New York NY
1,1,Andrews Miss. Kornelia Theodosia,female,63.0,1,0,13502.0,77.9583,D7,S,10.0,,Hudson NY
1,0,Andrews Mr. Thomas Jr,male,39.0,0,0,112050.0,0.0,A36,S,,,Belfast NI
1,1,Appleton Mrs. Edward Dale (Charlotte Lamson),female,53.0,2,0,11769.0,51.4792,C101,S,,,Bayside Queens NY
1,0,Artagaveytia Mr. Ramon,male,71.0,0,0,,49.5042,,C,,22.0,Montevideo Uruguay




## Step 2: Train Rule-Fit Model

We will train a rule-fit model to predict the survival.  The outcome of the rulefit model is rules defining whether or not someone will survive.  The rulefit model is done with the following steps:

1. Train a series of random forest models with different depths
2. Extract rules from the random forest models
3. Train a GLM model with Lasso regularization using the rules to predict the target. 
4. Extract the most important rules.

In [61]:
def h2o_rulefit(training_frame, x, y, seed = None, min_depth = 2, max_depth = 10):
    
    # Get paths from random forest models
    paths_frame = training_frame[y]
    depths = range(min_depth, max_depth + 1)
    rf_models = []
    for model_idx in range(len(depths)):
        
        # Train random forest models
        from h2o.estimators.random_forest import H2ORandomForestEstimator
        rf_model = H2ORandomForestEstimator(seed = seed, model_id = "rf.hex", max_depth = depths[model_idx])
        rf_model.train(y = y, x = x, training_frame = training_frame)
        rf_models = rf_models + [rf_model]
    
        paths = rf_model.predict_leaf_node_assignment(training_frame)
        paths.col_names = ["rf_" + str(model_idx) +"."+ x for x in paths.col_names]
        paths_frame = paths_frame.cbind(paths)
    
    # Extract important paths
    from h2o.estimators import H2OGeneralizedLinearEstimator
    glm = H2OGeneralizedLinearEstimator(model_id = "glm.hex", nfolds = 5, seed = 1234, 
                                        alpha = 1, lambda_search = True)
    glm.train(y = y, training_frame=paths_frame)
    
    import pandas as pd
    rule_importance = pd.DataFrame.from_dict(glm.coef(), orient = "index").reset_index()
    rule_importance.columns = ["variable", "coefficient"]
    rule_importance = rule_importance[(rule_importance.coefficient.abs() > 0) & (rule_importance.variable != "Intercept")]
    
    # Convert paths to rules
    rules = []
    for i in rule_importance.variable:
        model_num, tree_num, path = i.replace("rf_", "").replace("T", "").split(".")
        tree = H2OTree(rf_models[int(model_num)], int(tree_num)-1)
        rules = rules + [tree_traverser(tree.root_node, path)]

    # Add rules and order by absolute coefficient
    rule_importance["rule"] = rules
    rule_importance["abs_coefficient"] = rule_importance["coefficient"].abs()
    rule_importance = rule_importance.sort_values(by = "abs_coefficient", ascending = False)
    rule_importance = rule_importance.drop("abs_coefficient", axis = 1)
    
    return rule_importance


In [62]:
import numpy as np
from h2o.tree import H2OTree
def tree_traverser(node, split_path):
    rule = []
    splits = [char for char in split_path]
    for i in splits:
        if i == "R":
            if np.isnan(node.threshold):
                rule = rule + [{'split_feature': node.split_feature, 
                                'value': node.right_levels, 
                                'operator': 'in'}]
            else:
                rule = rule + [{'split_feature': node.split_feature, 
                                'value': node.threshold, 
                                'operator': '>='}]

            node = node.right_child
        if i == "L":
            if np.isnan(node.threshold):
                rule = rule + [{'split_feature': node.split_feature, 
                                'value': node.left_levels, 
                                'operator': 'in'}]

            else:
                rule = rule + [{'split_feature': node.split_feature, 
                                'value': node.threshold, 
                                'operator': '<'}]

            node = node.left_child
    consolidated_rules = consolidate_rules(rule)
    consolidated_rules = " AND ".join(consolidated_rules.values())
    return consolidated_rules

def consolidate_rules(rules):
    rules = [x for x in rules if x.get("value")]
    features = set([x.get('split_feature') for x in rules])
    consolidated_rules = {}
    for i in features:
        feature_rules = [x for x in rules if x.get('split_feature') == i]
        if feature_rules[0].get('operator') == 'in':
            cleaned_rules = i + " is in " + ", ".join(sum([x.get('value') for x in feature_rules], []))
        else:
            cleaned_rules = []
            operators = set([x.get('operator') for x in feature_rules])
            for op in operators:
                vals = [x.get('value') for x in feature_rules if x.get('operator') == op]
                if '>' in op:
                    constraint = max(vals)
                else:
                    constraint = min(vals)
                cleaned_rules = " and ".join([op + " " + str(round(constraint, 3))])
                cleaned_rules = i + " " + cleaned_rules
        consolidated_rules[i] = cleaned_rules
    
    return consolidated_rules

In [63]:
train, valid = df.split_frame(seed = 1234)

In [64]:
x =  ["age", "sibsp", "parch", "fare", "sex", "pclass", "survived"]
rules = h2o_rulefit(train, x, "survived", seed = 1234)

drf Model Build progress: |███████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
glm Model Build progress: |███████████████████████████████████████████████| 100%


## Step 3. Examine Rules

Now that we've built the rule-fit model, we can examine the rules which have the greatest impact on the survival probability.

In [60]:
for i in range(len(rules)):
    print("Coefficient:" + str(round(rules.iloc[i]["coefficient"], 5)) + "\nRule: " + rules.iloc[i]["rule"] + "\n\n")

Coefficient:-0.25749
Rule: parch < 0.5 AND sex is in female AND pclass >= 2.5 AND fare < 12.008 AND age < 30.508


Coefficient:0.25358
Rule: fare < 8.187 AND pclass >= 2.5 AND age < 20.562


Coefficient:0.19626
Rule: parch < 0.5 AND sex is in male AND pclass < 1.5 AND fare < 32.673 AND age < 54.414


Coefficient:-0.17832
Rule: fare < 11.58 AND pclass >= 2.5 AND age < 54.25 AND sex is in female


Coefficient:0.16401
Rule: parch < 0.5 AND sex is in male AND pclass < 1.5 AND fare < 33.882 AND age < 53.328


Coefficient:-0.1591
Rule: fare < 21.492 AND pclass >= 2.5 AND age < 30.137 AND parch >= 0.5


Coefficient:-0.15851
Rule: parch >= 0.5 AND pclass >= 2.5 AND sibsp < 1.5 AND fare < 14.923 AND age < 37.369


Coefficient:0.14999
Rule: fare < 22.957 AND pclass >= 2.5 AND age < 31.418 AND parch < 1.5


Coefficient:0.14702
Rule: parch >= 0.5 AND sex is in male AND sibsp < 1.5 AND fare < 50.783 AND age < 12.227


Coefficient:0.14519
Rule: parch >= 0.5 AND sex is in male AND pclass < 2.5 AND fa


Coefficient:-0.04288
Rule: parch < 0.5 AND sex is in female AND pclass < 2.5 AND sibsp < 1.5 AND age >= 37.056


Coefficient:0.04221
Rule: fare < 13.656 AND pclass < 2.5 AND sibsp < 0.5 AND age < 50.693


Coefficient:0.04142
Rule: sex is in female AND pclass < 2.5 AND sibsp < 0.5 AND fare < 26.037 AND age < 52.38


Coefficient:0.04128
Rule: parch < 0.5 AND sex is in male AND pclass >= 2.5 AND fare < 7.89 AND age < 34.429


Coefficient:-0.04112
Rule: pclass >= 1.5 AND age < 60.278


Coefficient:0.04005
Rule: fare >= 21.514 AND pclass < 2.5 AND age < 36.273


Coefficient:-0.03977
Rule: pclass >= 2.5 AND sibsp >= 0.5 AND parch >= 0.5


Coefficient:0.03913
Rule: fare < 7.842 AND pclass >= 2.5 AND age < 29.469 AND sex is in male


Coefficient:0.03848
Rule: sex is in male AND pclass < 1.5 AND sibsp < 0.5 AND fare < 71.446 AND age < 28.115


Coefficient:0.03783
Rule: fare >= 21.514 AND pclass < 2.5 AND sex is in female


Coefficient:-0.03721
Rule: fare < 51.715 AND pclass >= 2.5


Coefficien


Coefficient:-0.00818
Rule: fare < 26.017 AND parch < 1.5 AND sex is in female


Coefficient:0.00767
Rule: fare < 8.656 AND sibsp < 1.5 AND age < 21.03 AND sex is in male


Coefficient:-0.00748
Rule: pclass < 2.5 AND age >= 13.039


Coefficient:-0.00744
Rule: pclass < 2.5 AND sibsp < 1.5 AND parch < 2.0 AND sex is in female


Coefficient:0.00735
Rule: parch < 1.5 AND sex is in male AND sibsp < 1.5 AND fare < 22.014 AND age < 49.083


Coefficient:-0.00732
Rule: sex is in female AND sibsp < 2.5 AND pclass >= 2.5 AND fare < 8.337 AND age >= 16.951


Coefficient:-0.0073
Rule: fare < 50.783 AND pclass >= 2.5 AND age >= 5.567


Coefficient:-0.00682
Rule: fare >= 8.005 AND pclass >= 2.5 AND age >= 39.537


Coefficient:-0.00672
Rule: fare >= 16.01 AND pclass >= 1.5 AND sibsp < 1.5 AND age < 60.278


Coefficient:-0.00655
Rule: sex is in female AND sibsp < 2.5 AND pclass >= 2.5 AND fare < 14.878 AND age >= 16.951


Coefficient:-0.00652
Rule: age < 19.448 AND sex is in male


Coefficient:-0.00644


Coefficient:-0.00025
Rule: parch < 1.5 AND sex is in male AND pclass >= 2.5 AND fare >= 12.328 AND age < 7.495


Coefficient:0.00025
Rule: fare < 99.194 AND age < 53.254 AND parch < 0.5 AND sex is in male


Coefficient:-0.00024
Rule: fare < 22.561 AND age < 17.645 AND parch < 0.5 AND sex is in male


Coefficient:0.00023
Rule: pclass >= 1.5 AND age < 32.164 AND parch < 0.5 AND sex is in male


Coefficient:0.00022
Rule: parch < 3.0 AND sex is in female AND pclass >= 2.5 AND fare < 25.752 AND age >= 3.664


Coefficient:0.00019
Rule: parch < 1.5 AND sex is in female AND pclass < 2.5 AND fare >= 49.032 AND age < 61.238


Coefficient:-0.00018
Rule: fare < 76.696 AND parch < 0.5 AND sex is in male


Coefficient:-0.00017
Rule: fare < 30.075 AND pclass >= 2.5


Coefficient:-0.00016
Rule: fare < 30.075 AND pclass >= 2.5


Coefficient:0.00015
Rule: fare < 7.805 AND pclass >= 1.5 AND age < 32.362 AND parch < 0.5


Coefficient:-0.00014
Rule: sex is in male AND pclass < 1.5 AND sibsp < 0.5 AND fare

The first rule tells us that women in class 3 who were traveling with no parents or children and were under 30 were less likely to survive.

```
Coefficient:-0.25749
Rule: parch < 0.5 AND sex is in female AND pclass >= 2.5 AND fare < 12.008 AND age < 30.508
```

Another rule tells us that men in class 1 who were traveling with no parents or children were more likely to survive.

```
Coefficient:0.19626
Rule: parch < 0.5 AND sex is in male AND pclass < 1.5 AND fare < 32.673 AND age < 54.414
```

## Improvements and Next Steps

Some of the rules are very similar to each other:

* Rule: parch < 0.5 AND sex is in male AND pclass < 1.5 AND fare < 32.673 AND age < 54.414
* Rule: parch < 0.5 AND sex is in male AND pclass < 1.5 AND fare < 33.882 AND age < 53.328

It would be nice to modify the rule-fit code to consolidate these similar rules.