### Prologue

In [1]:
%run nursery_utils.py

%matplotlib inline

Utility code in the associated file performs the following steps:
defines a function to print a pretty confusion matrix: plot_confusion_matrix()
defines a function to get the class code by label: get_class_code()
defines a function to plot a tree inline: tree_to_code()
defines a function to extract all the structural arrays of a tree: get_tree_structure()
defines a function to extract a metrics dictionary from a random forest: explore_forest()
defines a function to map the path of an instance down a tree: tree_path()
defines a function to map the path of an instance down a tree ensemble: forest_path()
defines a function to find the majority predicted class from the object returned by forest_path(): major_class_from_forest_paths()
defines a function to convert a tree into a function: tree_to_code()
defines a function to get a list of all the paths for one instance out of a forest_paths object: get_paths()
defines a function to get basic labels back from one hot encoded label names: decode onehot


In [2]:
%run nursery_dataprep.py

Utility code in the associated file performs the following steps:
set the random seed
import packages and modules
define a custom summary function: rstr()
create the list of variable names: var_names
create the list of features (var_names less class): features
import the nursery.csv file
create the pandas dataframe and print its head: nursery
create the categorical var encoder dictionary: le_dict
create a function to get any code for a column name and label: get_code
create the dictionary of categorical values: categories
create the list of one hot encoded variable names: onehot_features
create the list of class names: class_names
create the pandas dataframe with encoded vars: nursery_pre
create the pandas dataframe containing all features less class: X
create the pandas series containing the class 'decision': y
create the training and test sets: X_train, y_train, X_test, y_test
evaluate the training and test set priors and print them: train_priors, test_priors
create a One Hot Encoder and 

In [3]:
# create the usual RF instances
# get the parameters from previous parameter tuning events
with open('results.json', 'r') as infile:
    results = json.load(infile)

results = pd.DataFrame(results)
best_grid = results.loc[results['score'].idxmax()]
print("Best OOB: " "{:0.4f}".format(best_grid.score))

# get the correct packed arguments for RFClassifier
best_params = {k: int(v) for k, v in best_grid.items() if k not in ('score', 'elapsed_time')}
print("Grid:", best_params)
rf = RandomForestClassifier(random_state=seed, oob_score=True, **best_params)
rf.fit(X_train_enc, y_train)

# helper function for prediction
enc_model = make_pipeline(encoder, rf)

pred = enc_model.predict(X_test)
print(metrics.cohen_kappa_score(y_test, pred))

# this is the very simplified forest, for easy counting and validation
rf_simple = RandomForestClassifier(n_estimators = 10
                                   , max_depth = 3
                                   , min_samples_leaf = 250
                                   , random_state=seed)
rf_simple.fit(X_train_enc, y_train)

# helper function for prediction
enc_model_simple = make_pipeline(encoder, rf_simple)

pred = enc_model_simple.predict(X_test)
print(metrics.cohen_kappa_score(y_test, pred))
cm = metrics.confusion_matrix(y_test, pred)

Best OOB: 0.9881
Grid: {'n_estimators': 1000, 'min_samples_leaf': 1, 'max_depth': 16}
0.987190964529
0.745192629849


In [4]:
# Pull out single trees as examples
tree_idx1 = 0
tree_idx2 = 1

instance_idx1 = 0
instance_idx2 = 1

# from the deep and complex model
rft1 = rf.estimators_[tree_idx1]
rft2 = rf.estimators_[tree_idx2]
# from the simple model
rft_simple1 = rf_simple.estimators_[tree_idx1]
rft_simple2 = rf_simple.estimators_[tree_idx2]

### Scenario
In this workbook, I traverse the individual trees within a forest, extract the path each instance takes, and convert that path to rules. The goal is to combine the rules from all the trees into an "instance master" rule set for each instance. We may also want to discover a "class master" rule set that predicts each class.
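To make the idea of "the path of an instance" concrete, here is a minimal sketch of how such a path can be read off a fitted scikit-learn tree. It is not the `tree_path()` helper from `nursery_utils.py`, whose source is not shown here; `sketch_tree_path`, the toy data, and the `*_demo` names are assumptions for illustration only. It relies on `decision_path()` and the `tree_` arrays (`feature`, `threshold`, `children_left`), which scikit-learn does provide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def sketch_tree_path(tree, feature_names, x):
    """Collect the split decisions one instance meets in one fitted tree.

    Hypothetical stand-in for tree_path(); output keys mimic the printed
    'path' dictionaries in this notebook.
    """
    # scikit-learn trees compare float32 feature values against thresholds
    x = np.asarray(x, dtype=np.float32).reshape(1, -1)
    t = tree.tree_
    # decision_path returns a sparse indicator matrix; for a single row,
    # .indices holds the ids of the visited nodes
    visited = sorted(tree.decision_path(x).indices)
    path = {'feature_name': [], 'feature_idx': [], 'feature_value': [],
            'leq_threshold': [], 'threshold': []}
    for node in visited:
        if t.children_left[node] == -1:  # -1 marks a leaf: no split to record
            continue
        idx = int(t.feature[node])
        value = float(x[0, idx])
        threshold = float(t.threshold[node])
        path['feature_name'].append(feature_names[idx])
        path['feature_idx'].append(idx)
        path['feature_value'].append(value)
        path['leq_threshold'].append(value <= threshold)  # True means "go left"
        path['threshold'].append(threshold)
    return path


# toy data standing in for the encoded nursery set (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=4, random_state=0)
names_demo = ['f{}'.format(i) for i in range(X_demo.shape[1])]
rf_demo = RandomForestClassifier(n_estimators=3, max_depth=3,
                                 random_state=0).fit(X_demo, y_demo)

path_demo = sketch_tree_path(rf_demo.estimators_[0], names_demo, X_demo[0])
print(path_demo)
```

Replaying the `leq_threshold` flags from the root (left when True, right when False) should land on the same leaf that `apply()` reports for the instance, which is a handy sanity check on any path extractor.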

In [5]:
itp_simple_nolabels = tree_path(tree = rft_simple1
                   , feature_names = onehot_features
                   , instances = X_test[0:5]
                   , feature_encoding = encoder)

itp_simple = tree_path(tree = rft_simple1
                   , feature_names = onehot_features
                   , instances = X_test[0:5]
                   , labels = y_test[0:5]
                   , feature_encoding = encoder)

In [6]:
# pd.DataFrame(itp_simple[0]['path'])

In [7]:
ifp_simple = forest_path(forest = rf_simple
                   , feature_names = onehot_features
                   , instances = X_test[0:5]
                   , labels = y_test[0:5]
                   , feature_encoding = encoder)

In [8]:
print(ifp_simple[0][0])
print(ifp_simple[1][0]) # different tree, same instance
print(ifp_simple[2][0]) # different tree, same instance

{'tree_correct': True, 'true_class': 0, 'pred_proba': [0.5057358243198952, 0.3087512291052114, 0.0, 0.15142576204523106, 0.034087184529662404], 'pred_class': 0, 'path': {'feature_name': ['parents_great_pret', 'children_1', 'health_priority'], 'feature_idx': [0, 12, 25], 'feature_value': [0.0, 0.0, 0.0], 'leq_threshold': [True, True, True], 'threshold': [0.5, 0.5, 0.5]}}
{'tree_correct': True, 'true_class': 0, 'pred_proba': [1.0, 0.0, 0.0, 0.0, 0.0], 'pred_class': 0, 'path': {'feature_name': ['has_nurs_very_crit', 'health_not_recom'], 'feature_idx': [7, 24], 'feature_value': [0.0, 1.0], 'leq_threshold': [True, False], 'threshold': [0.5, 0.5]}}
{'tree_correct': True, 'true_class': 0, 'pred_proba': [1.0, 0.0, 0.0, 0.0, 0.0], 'pred_class': 0, 'path': {'feature_name': ['parents_usual', 'health_not_recom'], 'feature_idx': [2, 24], 'feature_value': [1.0, 1.0], 'leq_threshold': [False, False], 'threshold': [0.5, 0.5]}}
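Each printed `path` dictionary is already one step away from a readable rule. The sketch below shows one way to do that conversion; `sketch_path_to_rule` is a hypothetical helper, not part of `nursery_utils.py`, and the input is the path of the second record printed above.

```python
def sketch_path_to_rule(path):
    """Turn one path dictionary into a list of readable conditions.

    Hypothetical helper; it only uses keys visible in the printed output.
    """
    return [
        "{} {} {}".format(name, '<=' if leq else '>', threshold)
        for name, leq, threshold in zip(path['feature_name'],
                                        path['leq_threshold'],
                                        path['threshold'])
    ]


# path copied from the second record printed above
example_path = {'feature_name': ['has_nurs_very_crit', 'health_not_recom'],
                'feature_idx': [7, 24],
                'feature_value': [0.0, 1.0],
                'leq_threshold': [True, False],
                'threshold': [0.5, 0.5]}

example_rule = sketch_path_to_rule(example_path)
print(' AND '.join(example_rule))
# prints: has_nurs_very_crit <= 0.5 AND health_not_recom > 0.5
```

Because every feature is one hot encoded, `<= 0.5` reads as "is not this category" and `> 0.5` as "is this category".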


In [9]:
# do a run for the fully grown forest
ifp = forest_path(forest = rf
                   , feature_names = onehot_features
                   , instances = X_test[0:5]
                   , labels = y_test[0:5]
                   , feature_encoding = encoder)

In [10]:
print(ifp[0][2])
print(ifp[1][2]) # different tree, same instance
print(ifp[2][2]) # different tree, same instance

{'tree_correct': True, 'true_class': 3, 'pred_proba': [0.0, 0.0, 0.0, 1.0, 0.0], 'pred_class': 3, 'path': {'feature_name': ['parents_great_pret', 'children_1', 'health_priority', 'has_nurs_proper', 'has_nurs_less_proper', 'parents_pretentious', 'children_2', 'has_nurs_improper', 'housing_convenient'], 'feature_idx': [0, 12, 25, 6, 5, 1, 13, 4, 16], 'feature_value': [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 'leq_threshold': [True, True, False, True, True, True, True, True, True], 'threshold': [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]}}
{'tree_correct': True, 'true_class': 3, 'pred_proba': [0.0, 0.0, 0.0, 1.0, 0.0], 'pred_class': 3, 'path': {'feature_name': ['has_nurs_very_crit', 'health_not_recom', 'housing_convenient', 'form_complete', 'children_more'], 'feature_idx': [7, 24, 16, 8, 15], 'feature_value': [0.0, 0.0, 0.0, 1.0, 1.0], 'leq_threshold': [True, True, True, False, False], 'threshold': [0.5, 0.5, 0.5, 0.5, 0.5]}}
{'tree_correct': True, 'true_class': 3, 'pred_proba': [0

In [11]:
# class 0 is very easy to distinguish, instance 0 is class 0
print('instance 0, true class 0\n'
      , sum([ifp[i][0]['pred_class'] == 0 for i in range(best_params['n_estimators'])])
      , 'out of'
      , best_params['n_estimators']
      , 'estimators predicted correctly')
# class 1 and 3 get mixed up. instance 1 is class 1
print('instance 1, true class 1\n'
      , sum([ifp[i][1]['pred_class'] == 1 for i in range(best_params['n_estimators'])])
      , 'out of'
      , best_params['n_estimators']
      , 'estimators predicted correctly')
# class 1 and 3 get mixed up. instance 2 is class 3
print('instance 2, true class 3\n'
      , sum([ifp[i][2]['pred_class'] == 3 for i in range(best_params['n_estimators'])])
      , 'out of'
      , best_params['n_estimators']
      , 'estimators predicted correctly')
# class 1 and 3 get mixed up. instance 3 is class 3
print('instance 3, true class 3\n'
      , sum([ifp[i][3]['pred_class'] == 3 for i in range(best_params['n_estimators'])])
      , 'out of'
      , best_params['n_estimators']
      , 'estimators predicted correctly')
# class 1 and 3 get mixed up. instance 4 is class 1
print('instance 4, true class 1\n'
      , sum([ifp[i][4]['pred_class'] == 1 for i in range(best_params['n_estimators'])])
      , 'out of'
      , best_params['n_estimators']
      , 'estimators predicted correctly')

instance 0, true class 0
 993 out of 1000 estimators predicted correctly
instance 1, true class 1
 646 out of 1000 estimators predicted correctly
instance 2, true class 3
 973 out of 1000 estimators predicted correctly
instance 3, true class 3
 993 out of 1000 estimators predicted correctly
instance 4, true class 1
 949 out of 1000 estimators predicted correctly


In [12]:
# class 0 is very easy to distinguish, instance 0 is class 0
print('instance 0, true class 0\n'
      , sum([ifp_simple[i][0]['pred_class'] == 0 for i in range(10)])
      , 'out of'
      , 10
      , 'estimators predicted correctly')
# class 1 and 3 get mixed up. instance 1 is class 1
print('instance 1, true class 1\n'
      , sum([ifp_simple[i][1]['pred_class'] == 1 for i in range(10)])
      , 'out of'
      , 10
      , 'estimators predicted correctly')
# class 1 and 3 get mixed up. instance 2 is class 3
print('instance 2, true class 3\n'
      , sum([ifp_simple[i][2]['pred_class'] == 3 for i in range(10)])
      , 'out of'
      , 10
      , 'estimators predicted correctly')
# class 1 and 3 get mixed up. instance 3 is class 3
print('instance 3, true class 3\n'
      , sum([ifp_simple[i][3]['pred_class'] == 3 for i in range(10)])
      , 'out of'
      , 10
      , 'estimators predicted correctly')
# class 1 and 3 get mixed up. instance 4 is class 1
print('instance 4, true class 1\n'
      , sum([ifp_simple[i][4]['pred_class'] == 1 for i in range(10)])
      , 'out of'
      , 10
      , 'estimators predicted correctly')

instance 0, true class 0
 10 out of 10 estimators predicted correctly
instance 1, true class 1
 3 out of 10 estimators predicted correctly
instance 2, true class 3
 5 out of 10 estimators predicted correctly
instance 3, true class 3
 8 out of 10 estimators predicted correctly
instance 4, true class 1
 5 out of 10 estimators predicted correctly


In [13]:
# just the correct trees (though we might not know this, and have to take the majority)
ifp_simple_correct_paths = [ifp_simple[i][1]['path'] for i in range(10) if ifp_simple[i][1]['pred_class'] == 1]

mc = major_class_from_forest_paths(ifp_simple, 1)
# just the trees that voted in the majority
ifp_simple_majclass_paths = [ifp_simple[i][1]['path'] for i in range(10) if ifp_simple[i][1]['pred_class'] == mc]

print(ifp_simple_correct_paths)
print(ifp_simple_majclass_paths)

pd.DataFrame(ifp_simple_correct_paths)

[{'feature_name': ['parents_great_pret', 'children_1', 'health_priority'], 'feature_idx': [0, 12, 25], 'feature_value': [0.0, 1.0, 1.0], 'leq_threshold': [True, False, False], 'threshold': [0.5, 0.5, 0.5]}, {'feature_name': ['has_nurs_very_crit', 'health_not_recom', 'parents_great_pret'], 'feature_idx': [7, 24, 0], 'feature_value': [0.0, 0.0, 0.0], 'leq_threshold': [True, True, True], 'threshold': [0.5, 0.5, 0.5]}, {'feature_name': ['children_1', 'has_nurs_proper', 'health_not_recom'], 'feature_idx': [12, 6, 24], 'feature_value': [0.0, 0.0, 0.0], 'leq_threshold': [True, True, True], 'threshold': [0.5, 0.5, 0.5]}]
[{'feature_name': ['parents_usual', 'health_recommended', 'health_not_recom'], 'feature_idx': [2, 26, 24], 'feature_value': [1.0, 0.0, 0.0], 'leq_threshold': [False, True, True], 'threshold': [0.5, 0.5, 0.5]}, {'feature_name': ['health_priority', 'has_nurs_proper', 'children_more'], 'feature_idx': [25, 6, 15], 'feature_value': [0.0, 0.0, 0.0], 'leq_threshold': [True, True, Tru

Unnamed: 0,feature_idx,feature_name,feature_value,leq_threshold,threshold
0,"[0, 12, 25]","[parents_great_pret, children_1, health_priority]","[0.0, 1.0, 1.0]","[True, False, False]","[0.5, 0.5, 0.5]"
1,"[7, 24, 0]","[has_nurs_very_crit, health_not_recom, parents...","[0.0, 0.0, 0.0]","[True, True, True]","[0.5, 0.5, 0.5]"
2,"[12, 6, 24]","[children_1, has_nurs_proper, health_not_recom]","[0.0, 0.0, 0.0]","[True, True, True]","[0.5, 0.5, 0.5]"
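The majority vote used in the cell above can be sketched with a `Counter`. This is only an illustration of the idea, not the actual `major_class_from_forest_paths()` from `nursery_utils.py`; `sketch_major_class` and the toy votes are assumptions, and ties here simply go to the class that was seen first.

```python
from collections import Counter


def sketch_major_class(forest_paths, instance_idx):
    """Return the class most trees predicted for one instance.

    Assumes the forest_paths[tree][instance] layout printed earlier.
    """
    votes = Counter(tree_paths[instance_idx]['pred_class']
                    for tree_paths in forest_paths)
    return votes.most_common(1)[0][0]


# toy forest: five trees, one instance, made-up votes
toy_votes = [[{'pred_class': c}] for c in (1, 3, 3, 1, 3)]
toy_major = sketch_major_class(toy_votes, 0)
print(toy_major)  # prints: 3
```

For instance 1 in the simple forest only 3 of 10 trees predicted the true class, so the majority class and the true class can differ, which is why the two path lists above are not the same.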


In [14]:
# ifp structure is a list of objects for each tree.
# each object contains a list of objects for each instance.
# object is a dictionary, one of the keys is 'tree_correct'
# so we would like to filter only for correct trees
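The filter described in these comments could look like the sketch below. The real `get_paths()` is not shown in this notebook, so its return format is unknown; here I assume each path becomes a "transaction" of (feature_name, leq_threshold) pairs, and `sketch_correct_transactions` and `toy_forest` are hypothetical names.

```python
def sketch_correct_transactions(forest_paths, instance_idx):
    """Keep only the paths of trees that predicted the instance correctly.

    Each kept path becomes a list of (feature_name, leq_threshold) pairs.
    """
    transactions = []
    for tree_paths in forest_paths:
        record = tree_paths[instance_idx]
        if not record['tree_correct']:
            continue  # skip trees that voted for the wrong class
        path = record['path']
        transactions.append(list(zip(path['feature_name'],
                                     path['leq_threshold'])))
    return transactions


# toy forest: three trees, one instance, the second tree is wrong
toy_forest = [
    [{'tree_correct': True,
      'path': {'feature_name': ['children_1', 'health_priority'],
               'leq_threshold': [True, False]}}],
    [{'tree_correct': False,
      'path': {'feature_name': ['parents_usual'],
               'leq_threshold': [False]}}],
    [{'tree_correct': True,
      'path': {'feature_name': ['health_not_recom'],
               'leq_threshold': [True]}}],
]

toy_transactions = sketch_correct_transactions(toy_forest, 0)
print(len(toy_transactions), 'correctly predicting trees')
```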


In [15]:
ifp[0][2]

{'path': {'feature_idx': [0, 12, 25, 6, 5, 1, 13, 4, 16],
  'feature_name': ['parents_great_pret',
   'children_1',
   'health_priority',
   'has_nurs_proper',
   'has_nurs_less_proper',
   'parents_pretentious',
   'children_2',
   'has_nurs_improper',
   'housing_convenient'],
  'feature_value': [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
  'leq_threshold': [True, True, False, True, True, True, True, True, True],
  'threshold': [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]},
 'pred_class': 3,
 'pred_proba': [0.0, 0.0, 0.0, 1.0, 0.0],
 'tree_correct': True,
 'true_class': 3}

In [16]:
paths = get_paths(ifp, 2, only_correct_trees=True)
print('There are', len(paths), 'correctly predicting trees.')

print(paths[:2])
freq_patts = apriori(transactions = paths, support = 0.2, max_itemset_size = 6)
print()
freq_patts

NameError: name 'fp' is not defined
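Whatever implementation ends up behind `apriori`, the quantity of interest is the support of each itemset: the share of trees whose path contains all of its items. The sketch below counts that by brute force with `itertools.combinations`; it is not the Apriori algorithm (there is no candidate pruning) and not the function called above. `sketch_frequent_itemsets` and the toy transactions are assumptions, and the approach is only practical for short paths and small itemset sizes.

```python
from collections import Counter
from itertools import combinations


def sketch_frequent_itemsets(transactions, support, max_itemset_size):
    """Brute-force support counting over all itemsets up to a given size."""
    n = len(transactions)
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # each item counts once per transaction
        for size in range(1, max_itemset_size + 1):
            counts.update(combinations(unique, size))
    # keep itemsets found in at least `support` of the transactions
    return {itemset: count / n
            for itemset, count in counts.items()
            if count / n >= support}


# toy transactions: one list of rule strings per tree
toy_rules = [
    ['children_1 <= 0.5', 'health_priority > 0.5', 'parents_usual > 0.5'],
    ['children_1 <= 0.5', 'health_priority > 0.5'],
    ['children_1 <= 0.5', 'parents_usual > 0.5'],
    ['health_priority > 0.5'],
]

toy_patterns = sketch_frequent_itemsets(toy_rules, support=0.5,
                                        max_itemset_size=2)
for itemset, share in sorted(toy_patterns.items()):
    print(itemset, share)
```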

In [None]:
from copy import deepcopy
decoded_paths = deepcopy(paths) # avoids referential update of the forest paths object

decoded_paths = decode_onehot_paths(paths=decoded_paths, labels=features, condense=False)

freq_patts_dec = apriori(transactions = decoded_paths, support = 0.2, max_itemset_size = 8)
print()
freq_patts_dec
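Decoding a one hot encoded name back to its base feature can be sketched as a longest-prefix match against the list of original feature names. This is an assumption about what the decoding step does, not the body of `decode_onehot_paths()`; `sketch_decode_onehot` is a hypothetical helper and `demo_labels` lists only the feature names that appear in the output above.

```python
def sketch_decode_onehot(onehot_name, labels):
    """Split a one hot column name into (base feature, category value).

    Picks the longest label that prefixes the name, so 'has_nurs' is
    handled correctly even though it contains an underscore itself.
    """
    matches = [lab for lab in labels if onehot_name.startswith(lab + '_')]
    if not matches:
        return onehot_name, None  # not a one hot encoded column
    label = max(matches, key=len)
    return label, onehot_name[len(label) + 1:]


demo_labels = ['parents', 'has_nurs', 'form', 'children', 'housing', 'health']
decoded_demo = [sketch_decode_onehot(name, demo_labels)
                for name in ('parents_great_pret', 'has_nurs_very_crit',
                             'health_not_recom', 'children_1')]
print(decoded_demo)
```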


In [None]:
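# copy first, as in the previous cell, to avoid a referential update of the forest paths object
decoded_paths = decode_onehot_paths(paths=deepcopy(paths), labels=features)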

freq_patts_dec = apriori(transactions = decoded_paths, support = 0.2, max_itemset_size = 8)
print()
freq_patts_dec