# prologue

### set up notebook and load package

In [1]:
# for notebook plotting
%matplotlib inline 

# load what we need
import CHIRPS.datasets as ds
import CHIRPS.reproducible as rp

# demo datasets that ship with package. all from UCI unless stated otherwise
# ds.adult_data, ds.adult_samp_data, ds.adult_small_samp_data Large dataset ships with manageable sub samples
# ds.bankmark_data, ds.bankmark_samp_data
# ds.car_data
# ds.cardio_data this is the cardiotocography dataset
# ds.credit_data
# ds.german_data
# ds.lending_data, ds.lending_samp_data, ds.lending_small_samp_data, ds.lending_tiny_samp_data from Kaggle. see datasets_from_source file for links
# ds.nursery_data, ds.nursery_samp_data
# ds.rcdv_data, ds.rcdv_samp_data from US government see datasets_from_source file for links

### common config - can be omitted if defaults are OK

In [2]:
project_dir = 'V:\\whiteboxing' # defaults to a directory "whiteboxing" in the working directory
random_state_splits = 123 # one off for splitting the data into test / train

# data management

### datasets

Several datasets are available as pre-prepared containers that hold the data and some meta-data that is used in the algorithm.
Any dataset can be turned into a container by invoking the constructor found in the file structures.py.

In [3]:
# datasets might be down-sampled to make them easier to work with
# the full sets are available too
# this is a list of constructors that will be used in the benchmarking
rp.datasets

[<function CHIRPS.datasets.adult_small_samp_data>,
 <function CHIRPS.datasets.bankmark_samp_data>,
 <function CHIRPS.datasets.car_data>,
 <function CHIRPS.datasets.cardio_data>,
 <function CHIRPS.datasets.credit_data>,
 <function CHIRPS.datasets.german_data>,
 <function CHIRPS.datasets.lending_tiny_samp_data>,
 <function CHIRPS.datasets.nursery_samp_data>]

In [4]:
# example of one dataset
# note: random_state propagates through other functions and is easily updated to allow alternative runs
ds.cardio_data(random_state=123, project_dir=project_dir)

<CHIRPS.structures.data_container at 0x1c56a7f5780>

### standardising train-test splitting
Some methods are not available in Python. To keep the dataset splits identical on every platform, the train-test split is made once with a one-time random seed and the splits are saved as csv files in the project folders.

In [5]:
# writes external files
rp.export_data_splits(datasets=rp.datasets, project_dir=project_dir, random_state_splits=random_state_splits)

Exported train-test data for 8 datasets.
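
As a rough illustration of the idea behind exporting splits (not the actual `rp.export_data_splits` implementation), the sketch below makes a seeded split with the standard library and writes both halves to csv. The helper names `split_indices` and `export_split`, the 30% test fraction and the file names are all assumptions made for this example.

```python
import csv
import random
import tempfile
from pathlib import Path

def split_indices(n_rows, test_fraction=0.3, seed=123):
    """Return (train_idx, test_idx) from a seeded shuffle of the row indices."""
    rng = random.Random(seed)      # local generator, global random state untouched
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_test = int(round(n_rows * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])

def export_split(rows, header, out_dir, seed=123):
    """Write train.csv and test.csv so that other platforms can load the same split."""
    train_idx, test_idx = split_indices(len(rows), seed=seed)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, idx in (("train", train_idx), ("test", test_idx)):
        with open(out_dir / f"{name}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows[i] for i in idx)
    return train_idx, test_idx

toy_rows = [[i, i % 2] for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    toy_train_idx, toy_test_idx = export_split(toy_rows, ["x", "y"], tmp, seed=123)

# the same seed always reproduces the same split
same_again = split_indices(len(toy_rows), seed=123)
```

Because the seed for splitting is separate from the seed used for model fitting, the split stays fixed while the experiments are repeated with other random states.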


# Experimental Runs
Loop through datasets, actioning the functions in the package to execute a round of experiments and test evaluations.

## 0. Optional Memory and Computation Cost Management
CHIRPS is economical in time but memory intensive when computing explanations for many instances at once.


### Parallel processing
Scikit-learn takes care of parallel processing for the RF construction.
We can parallelise the following:
1. the walk of instances down each tree to collect the paths. The paths for many instances are returned in a single array. This parallelises across trees.
2. building CHIRPS and the final explanation (rule). This is a search optimisation and we can parallelise each instance.

This is especially effective when running batches. For single instances, set both to False to avoid the cost of spinning up the parallel infrastructure.

In [6]:
# control for async processes - each tree walk can be done in its own core
# and so can each explanation (e.g. rule conditions merge by hill-climbing)
# these will default to false if not passed explicitly to the explainer function
# on a multi-core machine there should be a good speed up for large batches
# when the batch_size advantage exceeds the overhead of setting up multi-processing
# timings will be printed to screen so you can see if it helps
forest_walk_async=True
chirps_explanation_async=True
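
The trade-off behind these two flags can be sketched with the standard library. This is only an illustration under assumptions: `explain_one` is a hypothetical stand-in for an expensive per-instance step, and a thread pool is used to keep the example portable (CPU-bound work in CPython would normally need a process pool). It is not how the CHIRPS package dispatches its work.

```python
from concurrent.futures import ThreadPoolExecutor

def explain_one(instance):
    """Hypothetical stand-in for an expensive per-instance explanation."""
    return sum(x * x for x in instance)

def explain_batch(instances, run_async=False, max_workers=4):
    """Process every instance, optionally fanning the work out to a pool."""
    if not run_async or len(instances) < 2:
        # serial path: no pool start-up cost, the better choice for single instances
        return [explain_one(inst) for inst in instances]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with the instances
        return list(pool.map(explain_one, instances))

toy_batch = [[1, 2], [3, 4], [5, 6]]
serial_results = explain_batch(toy_batch, run_async=False)
parallel_results = explain_batch(toy_batch, run_async=True)
```

Both paths return the same results in the same order; only the wall-clock time differs, which is why comparing the printed timings is the practical way to decide whether to enable the flags.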

### Preparing unseen data

Again note:
The test set has never been "seen" by the random forest during training.
The test set has only been used to assess model (random forest) accuracy, with no additional tuning after this.
The test set has not been involved in generating the explainer.

#### Batching
The memory required to hold all the paths can be reduced by dividing the test set into batches. However, this takes longer because there is an overhead to instantiate all the required objects, especially when coupled with parallel processing.
The best compromise may be a small number of larger batches.

In [7]:
# the number of instances can be controlled by
# batch_size - how many instances to explain at one time
# if these are set larger than the size of the test set, the whole test set simply runs in one batch
# in that case it is better to use option 2 (run with the whole test set directly)
batch_size = 5
# how many instances to explain in total from a test/unseen set
n_instances = 10000
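
How an oversized request is capped and divided into batches can be sketched as follows. The helpers `batch_plan` and `iter_batches` are hypothetical names for this example only. They illustrate the arithmetic and are not the package's own `batch_instance_ceiling` routine.

```python
import math

def batch_plan(n_test, n_instances, batch_size):
    """Cap the request at the test-set size and count the batches needed."""
    n_instances = min(n_instances, n_test)
    if n_instances == 0:
        return 0, 0, 0
    batch_size = min(batch_size, n_instances)
    n_batches = math.ceil(n_instances / batch_size)
    return n_instances, batch_size, n_batches

def iter_batches(items, n_instances, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    n_instances, batch_size, _ = batch_plan(len(items), n_instances, batch_size)
    if n_instances == 0:
        return
    for start in range(0, n_instances, batch_size):
        yield items[start:min(start + batch_size, n_instances)]

toy_test_ids = list(range(12))
toy_plan = batch_plan(len(toy_test_ids), n_instances=10000, batch_size=5)
toy_batches = list(iter_batches(toy_test_ids, n_instances=10000, batch_size=5))
```

With 12 test rows, a request for 10000 instances in batches of 5 is reduced to 12 instances in 3 batches, the last of which holds only 2 rows.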

### 1. Data and Forest prep
Use the random state splits to do a one-off data split.
Fit the RF to training data, using the iterating random state.
Save the performance metrics on the test set for later review.

### 2. Prepare Unseen Data and Predictions
Important to note:
Test set never "seen" by RF during training.
Test set not involved in generating the explainer.
Test set used to evaluate model (random forest) accuracy beyond OOBE scores - no additional tuning based on these results.
Test set used to evaluate explanation scores by leave-one-out method removing the specific instance we're explaining.

Important to note:
We will explain predictions directly from the trained RF. Explanation system makes no compromise on model accuracy.

### 3. CHIRPS algorithm
1. Extract Tree Prediction Paths
2. Frequent pattern mining of paths
3. Score and sort mined path segments
4. Merge path segments into one rule

#### CHIRPS 1
Fit a forest_walker object to the dataset and decision forest. This is a wrapper that will extract the paths of all the given instances. Its main method delivers the instance paths for the remaining steps of the algorithm as a new object: a batch_paths_container. It can also report interesting statistics (treating the forest as a set of random tree-structured variables).
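
To make the notion of an instance path concrete, here is a toy walk over hand-built trees. The nested-dict tree layout, the `(feature, threshold, went_left)` tuple format and the function name `walk_tree` are assumptions chosen for this illustration. The real forest_walker works on a fitted scikit-learn forest and returns a batch_paths_container.

```python
def walk_tree(node, instance, path=None):
    """Follow one instance from root to leaf, recording each test it passes through."""
    path = [] if path is None else path
    if "leaf" in node:
        return path, node["leaf"]
    feature, threshold = node["feature"], node["threshold"]
    went_left = instance[feature] <= threshold
    path.append((feature, threshold, went_left))
    next_node = node["left"] if went_left else node["right"]
    return walk_tree(next_node, instance, path)

# two tiny trees standing in for a decision forest
toy_forest = [
    {"feature": "age", "threshold": 30.0,
     "left": {"leaf": 0},
     "right": {"feature": "hours", "threshold": 40.0,
               "left": {"leaf": 0}, "right": {"leaf": 1}}},
    {"feature": "hours", "threshold": 35.0,
     "left": {"leaf": 0},
     "right": {"feature": "age", "threshold": 30.0,
               "left": {"leaf": 0}, "right": {"leaf": 1}}},
]

toy_instance = {"age": 45, "hours": 50}
toy_walks = [walk_tree(tree, toy_instance) for tree in toy_forest]
```

Each walk yields the ordered list of conditions the instance satisfied in that tree together with the tree's predicted class, which is the raw material for the later mining steps.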

#### CHIRPS 2-4
A batch_CHIRPS_container is fitted with the batch_paths_container returned by the forest walker, and with a sample of data. For CHIRPS, we prefer a large sample. The whole training set or another representative sample will do. This wrapper object will execute steps 2-4 on each of the instance paths in the batch_paths_container.
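
The flavour of steps 2-4 can be shown on toy data. This is a deliberately simplified sketch: it substitutes plain support counting for the package's scoring and a tightest-bound merge for its search-based merge, and the names `mine_frequent` and `merge_to_rule` as well as the `(feature, threshold, went_left)` condition format are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def mine_frequent(paths, min_support=0.5, max_len=2):
    """Steps 2-3 (toy): count condition sets across paths, keep and sort the frequent ones."""
    counts = Counter()
    for path in paths:
        items = sorted(set(path))
        for k in range(1, max_len + 1):
            counts.update(combinations(items, k))
    n_paths = len(paths)
    frequent = {p: c / n_paths for p, c in counts.items() if c / n_paths >= min_support}
    # highest support first, longer patterns break ties
    return sorted(frequent.items(), key=lambda kv: (-kv[1], -len(kv[0]), kv[0]))

def merge_to_rule(patterns):
    """Step 4 (toy): collapse conditions to one (lower, upper] interval per feature."""
    rule = {}
    for pattern, _support in patterns:
        for feature, threshold, went_left in pattern:
            lower, upper = rule.get(feature, (float("-inf"), float("inf")))
            if went_left:                      # condition is feature <= threshold
                upper = min(upper, threshold)
            else:                              # condition is feature > threshold
                lower = max(lower, threshold)
            rule[feature] = (lower, upper)
    return rule

toy_paths = [
    [("age", 30.0, False), ("hours", 40.0, False)],
    [("hours", 35.0, False), ("age", 30.0, False)],
    [("age", 30.0, False), ("hours", 40.0, False)],
]
toy_patterns = mine_frequent(toy_paths, min_support=0.6)
toy_rule = merge_to_rule(toy_patterns)
```

Here the rare condition `hours > 35` falls below the support threshold and is dropped, leaving a rule equivalent to `age > 30 AND hours > 40`, with both features unbounded above.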

Important to note:
true_divide warnings are OK! They just mean that a continuous variable is unbounded on one side, i.e. no greater/less than inequality is used in the specific CHIRPS explanation.

Important note:
Here we are using the training set to create the explainers. We could use a different dataset as long as it is representative of the training set that built the decision forest. Most importantly, we must not use the dataset that we wish to explain, so never use the test set, for example.

### 4. Evaluating CHIRPS Explanations
The test set has been used to create an explainer *one instance at a time*, and the rest of the test set was not "seen" during this construction. To score each explainer, we use the test set, leaving out the individual instance being explained. The data_split_container (tt) has a convenience function for doing this. All the results are saved to csv files in the project directory.
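
The leave-one-out scoring idea can be sketched in a few lines. Everything here is illustrative: the rule format (one `(lower, upper]` interval per feature), the helper names `rule_covers` and `score_rule`, and the choice to score against model predictions are assumptions for this example, and the package's own evaluation reports more metrics than the two shown.

```python
def rule_covers(rule, row):
    """True when every feature named in the rule lies inside its (lower, upper] interval."""
    return all(lower < row[f] <= upper for f, (lower, upper) in rule.items())

def score_rule(rule, target_class, rows, predictions, leave_out):
    """Precision and coverage of a rule on all rows except the one being explained."""
    kept = [(r, p) for i, (r, p) in enumerate(zip(rows, predictions)) if i != leave_out]
    covered = [p for r, p in kept if rule_covers(rule, r)]
    coverage = len(covered) / len(kept) if kept else 0.0
    hits = sum(p == target_class for p in covered)
    precision = hits / len(covered) if covered else 0.0
    return precision, coverage

toy_test_rows = [
    {"age": 45, "hours": 50},   # the instance being explained (index 0)
    {"age": 50, "hours": 45},
    {"age": 25, "hours": 60},
    {"age": 35, "hours": 42},
    {"age": 40, "hours": 20},
]
toy_test_preds = [1, 1, 0, 0, 0]    # model predictions for each row
toy_eval_rule = {"age": (30.0, float("inf")), "hours": (40.0, float("inf"))}

toy_precision, toy_coverage = score_rule(
    toy_eval_rule, target_class=1,
    rows=toy_test_rows, predictions=toy_test_preds, leave_out=0)
```

Leaving out row 0 means the explained instance cannot inflate its own score: of the four remaining rows the rule covers two, and one of those two is predicted as the target class.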

In [8]:
run_CHIRPS = False
run_Anchors = True

for random_state in [123]:
    for d_constructor in rp.datasets:
        print('Running experiment for ' + d_constructor.__name__ + ' with random state = ' + str(random_state))
        print()
        # 1. Data and Forest prep
        print('Split data into main train-test and build RF') 
        
        mydata = d_constructor(random_state=random_state, project_dir=project_dir)
        
        meta_data = mydata.get_meta()
        save_path = meta_data['get_save_path']()
        train_index, test_index = mydata.get_tt_split_idx(random_state=random_state_splits)
        tt = mydata.tt_split(train_index, test_index)
        
        rf = rp.forest_prep(X=tt.X_train_enc, y=tt.y_train,
                            save_path=save_path,
                            class_names=meta_data['class_names_label_order'],
                            random_state=meta_data['random_state'])

        print()
        
        if run_CHIRPS: # might assume this is always true! :-)
            rp.CHIRPS_benchmark(forest=rf, ds_container=tt, meta_data=meta_data,
                                batch_size=batch_size, n_instances=n_instances,
                                forest_walk_async=forest_walk_async,
                                chirps_explanation_async=chirps_explanation_async,
                                save_path=save_path,
                                random_state=random_state)
        
        
        if run_Anchors:
            # new copy of ds_container (need to reset the row counters)
            tt_anch = mydata.tt_split(train_index, test_index)
            # preprocessing - discretised continuous X matrix has been added and also needs an updated var_dict 
            # plus returning the fitted explainer that holds the data distribution
            tt_anch, anchors_explainer = rp.Anchors_preproc(ds_container=tt_anch,
                                                             meta_data=meta_data)
    
            # re-fitting the random forest to the discretised data and evaluating
            rf = rp.forest_prep(X=tt_anch.X_train_enc, y=tt_anch.y_train,
                            save_path=save_path,
                            class_names=meta_data['class_names_label_order'],
                            random_state=meta_data['random_state'],
                            identifier='Anchors')

            rp.Anchors_benchmark(forest=rf, ds_container=tt_anch, meta_data=meta_data,
                                anchors_explainer=anchors_explainer,
                                batch_size=batch_size, n_instances=n_instances,
                                save_path=save_path, random_state=meta_data['random_state'])
                


Running experiment for adult_small_samp_data with random state = 123

Split data into main train-test and build RF
using previous tuning parameters

using previous tuning parameters
Prepare Unseen Data and Predictions
Running Anchors on each instance and collecting results
Working on Anchors for instance 263
{'labels': [0, 1], 'counts': array([0, 0], dtype=int64), 'p_counts': array([0., 0.]), 's_counts': array([0., 0.])}
{'labels': [0, 1], 'counts': array([840, 344], dtype=int64), 'p_counts': array([0.70945946, 0.29054054]), 's_counts': array([0.70886076, 0.29029536])}
{'labels': [0, 1], 'counts': array([840, 344], dtype=int64), 'p_counts': array([0.70945946, 0.29054054]), 's_counts': array([0.70886076, 0.29029536])}
{'labels': [0, 1], 'counts': array([651, 272], dtype=int64), 'p_counts': array([0.70530878, 0.29469122]), 's_counts': array([0.70454545, 0.29437229])}
{'labels': [0, 1], 'counts': array([0, 0], dtype=int64), 'p_counts': array([0., 0.]), 's_counts': array([0., 0.])}
Running



Prepare Unseen Data and Predictions
Running Anchors on each instance and collecting results
Working on Anchors for instance 1633
{'labels': [0, 1], 'counts': array([0, 0], dtype=int64), 'p_counts': array([0., 0.]), 's_counts': array([0., 0.])}


KeyboardInterrupt: 

In [None]:
exp

In [None]:
import numpy as np


#                 # train
#                 priors = p_count_corrected(tt['y_train'], [i for i in range(len(mydata.class_names))])
#                 if any(fit_anchor_train):
#                     p_counts = p_count_corrected(enc_rf.predict(tt['X_train'][fit_anchor_train]), [i for i in range(len(mydata.class_names))])
#                 else:
#                     p_counts = p_count_corrected([None], [i for i in range(len(mydata.class_names))])
#                 counts = p_counts['counts']
#                 labels = p_counts['labels']
#                 post = p_counts['p_counts']
#                 p_corrected = np.array([p if p > 0.0 else 1.0 for p in post])
#                 cover = counts.sum() / priors['counts'].sum()
#                 recall = counts/priors['counts'] # recall
#                 r_corrected = np.array([r if r > 0.0 else 1.0 for r in recall]) # to avoid div by zeros
#                 observed = np.array((counts, priors['counts']))
#                 if counts.sum() > 0: # previous_counts.sum() == 0 is impossible
#                     chisq = chi2_contingency(observed=observed[:, np.where(observed.sum(axis=0) != 0)], correction=True)
#                 else:
#                     chisq = np.nan
#                 f1 = [2] * ((post * recall) / (p_corrected + r_corrected))
#                 not_covered_counts = counts + (np.sum(priors['counts']) - priors['counts']) - (np.sum(counts) - counts)
#                 accu = not_covered_counts/priors['counts'].sum()
#                 # to avoid div by zeros
#                 pri_corrected = np.array([pri if pri > 0.0 else 1.0 for pri in priors['p_counts']])
#                 pos_corrected = np.array([pos if pri > 0.0 else 0.0 for pri, pos in zip(priors['p_counts'], post)])
#                 if counts.sum() == 0:
#                     rec_corrected = np.array([0.0] * len(pos_corrected))
#                     cov_corrected = np.array([1.0] * len(pos_corrected))
#                 else:
#                     rec_corrected = counts / counts.sum()
#                     cov_corrected = np.array([counts.sum() / priors['counts'].sum()])

#                 lift = pos_corrected / ( ( cov_corrected ) * pri_corrected )

#                 # capture train
#                 mc = enc_rf.predict(tt['X_test'][i].reshape(1, -1))[0]
#                 mc_lab = mydata.class_names[enc_rf.predict(tt['X_test'][i].reshape(1, -1))[0]]
#                 tc = enc_rf.predict(tt['X_test'][i].reshape(1, -1))[0]
#                 tc_lab = mydata.class_names[enc_rf.predict(tt['X_test'][i].reshape(1, -1))[0]]
#                 vt = np.nan
#                 mvs = np.nan
#                 prior = priors['p_counts'][tc]
#                 prettify_rule = ' AND '.join(exp.names())
#                 rule_len = len(exp.names())
#                 tr_prec = post[tc]
#                 tr_recall = recall[tc]
#                 tr_f1 = f1[tc]
#                 tr_acc = accu[tc]
#                 tr_lift = lift[tc]
#                 tr_coverage = cover

#                 # test
#                 priors = p_count_corrected(tt['y_test'], [i for i in range(len(mydata.class_names))])
#                 if any(fit_anchor_test):
#                     p_counts = p_count_corrected(enc_rf.predict(tt['X_test'][fit_anchor_test]), [i for i in range(len(mydata.class_names))])
#                 else:
#                     p_counts = p_count_corrected([None], [i for i in range(len(mydata.class_names))])
#                 counts = p_counts['counts']
#                 labels = p_counts['labels']
#                 post = p_counts['p_counts']
#                 p_corrected = np.array([p if p > 0.0 else 1.0 for p in post])
#                 cover = counts.sum() / priors['counts'].sum()
#                 recall = counts/priors['counts'] # recall
#                 r_corrected = np.array([r if r > 0.0 else 1.0 for r in recall]) # to avoid div by zeros
#                 observed = np.array((counts, priors['counts']))
#                 if counts.sum() > 0: # previous_counts.sum() == 0 is impossible
#                     chisq = chi2_contingency(observed=observed[:, np.where(observed.sum(axis=0) != 0)], correction=True)
#                 else:
#                     chisq = np.nan
#                 f1 = [2] * ((post * recall) / (p_corrected + r_corrected))
#                 not_covered_counts = counts + (np.sum(priors['counts']) - priors['counts']) - (np.sum(counts) - counts)
#                 # accuracy = (TP + TN) / num_instances formula: https://books.google.co.uk/books?id=ubzZDQAAQBAJ&pg=PR75&lpg=PR75&dq=rule+precision+and+coverage&source=bl&ots=Aa4Gj7fh5g&sig=6OsF3y4Kyk9KlN08OPQfkZCuZOc&hl=en&sa=X&ved=0ahUKEwjM06aW2brZAhWCIsAKHY5sA4kQ6AEIUjAE#v=onepage&q=rule%20precision%20and%20coverage&f=false
#                 accu = not_covered_counts/priors['counts'].sum()
#                 pri_corrected = np.array([pri if pri > 0.0 else 1.0 for pri in priors['p_counts']]) # to avoid div by zeros
#                 pos_corrected = np.array([pos if pri > 0.0 else 0.0 for pri, pos in zip(priors['p_counts'], post)]) # to avoid div by zeros
#                 if counts.sum() == 0:
#                     rec_corrected = np.array([0.0] * len(pos_corrected))
#                     cov_corrected = np.array([1.0] * len(pos_corrected))
#                 else:
#                     rec_corrected = counts / counts.sum()
#                     cov_corrected = np.array([counts.sum() / priors['counts'].sum()])

#                 lift = pos_corrected / ( ( cov_corrected ) * pri_corrected )

#                 # capture test
#                 tt_prec = post[tc]
#                 tt_recall = recall[tc]
#                 tt_f1 = f1[tc]
#                 tt_acc = accu[tc]
#                 tt_lift = lift[tc]
#                 tt_coverage = cover

#                 output_anch[i] = [instance_id,
#                                     'anchors', # result_set
#                                     prettify_rule,
#                                     rule_len,
#                                     mc,
#                                     mc_lab,
#                                     tc,
#                                     tc_lab,
#                                     mvs,
#                                     prior,
#                                     tr_prec,
#                                     tr_recall,
#                                     tr_f1,
#                                     tr_acc,
#                                     tr_lift,
#                                     tr_coverage,
#                                     tt_prec,
#                                     tt_recall,
#                                     tt_f1,
#                                     tt_acc,
#                                     tt_lift,
#                                     tt_coverage,
#                                     acc,
#                                     coka]

#             output = np.concatenate((output, output_anch), axis=0)
#             anch_end_time = timeit.default_timer()
#             anch_elapsed_time = anch_end_time - anch_start_time

#         # save the tabular results to a file
#         output_df = DataFrame(output, columns=headers)
#         output_df.to_csv(mydata.make_save_path(mydata.pickle_dir.replace('pickles', 'results') + '_rnst_' + str(mydata.random_state) + "_addt_" + str(add_trees) + '_timetest.csv'))
#         # save the full rule_acc_lite objects
#         if save_rule_accs:
#             explainers_store = open(mydata.make_save_path('explainers' + '_rnst_' + str(mydata.random_state) + "_addt_" + str(add_trees) + '.pickle'), "wb")
#             pickle.dump(explainers, explainers_store)
#             explainers_store.close()

#         print('Completed experiment for ' + str(dataset) + ':')
#         print('random_state ' + str(mydata.random_state) + ' and ' +str(add_trees) + ' additional trees')
#         # pass the elapsed times up to the caller
#         return(wb_elapsed_time + wbres_elapsed_time, anch_elapsed_time, grid_idx)


#                 anch_start_time = timeit.default_timer()
    
    
    
    # 4. Evaluating CHIRPS Explanations
    print('Evaluating found explanations')

    results_start_time = timeit.default_timer()

    # iterate over all the test instances (based on the ids in the index)
    # scoring will leave out the specific instance by this id.
    rt.evaluate_CHIRPS_explainers(CHIRPS, tt, labels.index,
                                  forest=rf,
                                  print_to_screen=False, # set True when running single instances
                                  save_results_path=save_path,
                                  save_results_file='results' + '_rnst_' + str(random_state),
                                  save_CHIRPS=True)

    results_end_time = timeit.default_timer()
    results_elapsed_time = results_end_time - results_start_time
    print('CHIRPS batch results eval time elapsed:', "{:0.4f}".format(results_elapsed_time), 'seconds')
    # this completes the CHIRPS runs


    print()
    print()

In [None]:
import CHIRPS.routines as rt

n_instances, n_batches = rt.batch_instance_ceiling(ds_container=tt_anch, n_instances=n_instances, batch_size=batch_size)

# this gets the next batch out of the data_split_container according to the required number of instances
# all formats can be extracted, depending on the requirement
# unencoded, encoded (sparse matrix is the type returned by scikit), ordinary dense matrix also available
instances, instances_enc, instances_enc_matrix, labels = tt_anch.get_next(batch_size, which_split='test') # default

# OPTION 2 - just run with whole test set
# instances = tt.X_test; instances_enc = tt.X_test_enc; instances_enc_matrix = tt.X_test_enc_matrix; labels = tt.y_test

# Make all the model predictions from the decision forest
preds = rf.predict(X=instances_enc)

instance = tt_anch.X_test[0]
anchors_explainer.explain_instance(instance, rf.predict, threshold=0.95)


In [None]:
tt.X_train_enc[1].todense()

In [None]:
tt_anch.X_train_enc[1].todense()

In [None]:
from anchor import anchor_tabular as anchtab
from lime import lime_tabular as limtab
from copy import deepcopy
import numpy as np

mydata = rp.datasets[0](random_state=random_state, project_dir=project_dir)

In [None]:
rf, tt, meta_data = rp.data_forest_prep(mydata, project_dir=project_dir,
                              override_tuning=False,
                              random_state=random_state,
                              random_state_splits=random_state_splits)



In [None]:
# dictionary format for Anchors
categorical_names = {v['order_col']: v['labels']
                     for v in anch_meta_data['var_dict'].values() if not v['class_col']}

In [None]:
explainer = anchtab.AnchorTabularExplainer(meta_data['class_names'], meta_data['features'], X_train_disc, categorical_names)
explainer.fit(X_train_disc, tt.y_train, X_test_disc, tt.y_test)

# update the tt object
anch_X_train_enc = explainer.encoder.transform(X_train_disc)

In [None]:
la

In [None]:
lah

In [None]:
tt.y_train.unique()

In [None]:
meta_data['get_label'](meta_data['class_col'], 0)