# prologue

### set up notebook and load package

In [1]:
# for notebook plotting
%matplotlib inline 

# load what we need
import time
import timeit
import CHIRPS.structures as strcts
import CHIRPS.datasets as ds
import CHIRPS.routines as rt
import CHIRPS.reproducible as rp

# demo datasets that ship with package. all from UCI unless stated otherwise
# ds.adult_data, ds.adult_samp_data, ds.adult_small_samp_data Large dataset ships with manageable sub samples
# ds.bankmark_data, ds.bankmark_samp_data
# ds.car_data
# ds.cardio_data this is the cardiotocography dataset
# ds.credit_data
# ds.german_data
# ds.lending_data, ds.lending_samp_data, ds.lending_small_samp_data, ds.lending_tiny_samp_data from Kaggle. see datasets_from_source file for links
# ds.nursery_data, ds.nursery_samp_data
# ds.rcdv_data, ds.rcdv_samp_data from US government see datasets_from_source file for links

### common config - can be omitted if defaults are OK

In [2]:
project_dir = 'V:\\whiteboxing' # defaults to a directory "whiteboxing" in the working directory
random_state_splits = 123 # one off for splitting the data into test / train

# data management

### datasets

Several datasets are available as pre-prepared containers that hold the data and some meta-data used by the algorithm.
Any dataset can be turned into a container by invoking the constructor found in the file structures.py.

In [3]:
# datasets might be down-sampled to make them easier to work with
# the full sets are available too
# this is a list of constructors that will be used in the benchmarking
rp.datasets

[<function CHIRPS.datasets.adult_small_samp_data>,
 <function CHIRPS.datasets.bankmark_samp_data>,
 <function CHIRPS.datasets.car_data>,
 <function CHIRPS.datasets.cardio_data>,
 <function CHIRPS.datasets.credit_data>,
 <function CHIRPS.datasets.german_data>,
 <function CHIRPS.datasets.lending_tiny_samp_data>,
 <function CHIRPS.datasets.nursery_samp_data>]

In [4]:
# example of one dataset
# note: random_state propagates through other functions and is easily updated to allow alternative runs
ds.cardio_data(random_state=123, project_dir=project_dir)

<CHIRPS.structures.data_container at 0x1baac544978>

### standardising train-test splitting
Some methods are not available in Python, and we want to maintain the same dataset splits on every platform. The train-test split is therefore made once, with the one-off random seed, and the splits are saved to CSV in the project folders.

In [5]:
# writes external files
rp.export_data_splits(datasets=rp.datasets, project_dir=project_dir, random_state_splits=random_state_splits)

Exported train-test data for 8 datasets.
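
As a rough illustration of the idea (not the code behind `rp.export_data_splits`), the sketch below shuffles row indices with a fixed seed and writes both partitions to CSV, so that another platform can read exactly the same rows. The helper name `export_split`, the file names and the 70/30 ratio are assumptions made for this example only.

```python
import csv
import os
import random
import tempfile

def export_split(rows, header, out_dir, random_state_splits=123, test_fraction=0.3):
    """Shuffle row indices with a fixed seed and write train/test CSV files."""
    idx = list(range(len(rows)))
    random.Random(random_state_splits).shuffle(idx)  # same seed -> same order
    n_test = int(round(len(rows) * test_fraction))
    parts = {'test': sorted(idx[:n_test]), 'train': sorted(idx[n_test:])}
    for name, ids in parts.items():
        with open(os.path.join(out_dir, name + '.csv'), 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['row_id'] + header)  # keep the original row id
            writer.writerows([i] + list(rows[i]) for i in ids)
    return parts

# toy data: 10 rows, two columns (hypothetical, for illustration)
rows = [(i, i % 2) for i in range(10)]
out_dir = tempfile.mkdtemp()
first = export_split(rows, ['x', 'y'], out_dir)
second = export_split(rows, ['x', 'y'], out_dir)
print(first == second)  # the split is reproducible for a given seed
```

Keeping the original row id in the exported files makes it possible to check on the other platform that the same rows ended up in each partition.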


# Experimental Runs
Loop through datasets, actioning the functions in the package to execute a round of experiments and test evaluations.

## 0. Optional Memory and Computation Cost Management
CHIRPS is economical in time but memory intensive when many instances are explained at once.


### Parallel processing
Scikit-learn takes care of parallelism for the RF construction.
We can also parallelise:
1. the walk of instances down each tree to collect the paths.
2. building CHIRPS and the final explanation (rule).

In [6]:
# control for async processes - each tree walk can be done in its own core
# and so can each explanation (e.g. rule conditions merge by hill-climbing)
# these will default to false if not passed explicitly to the explainer function
# on a multi-core machine there should be a good speed up for large batches
# when the batch_size advantage exceeds the overhead of setting up multi-processing
# timings will be printed to screen so you can see if it helps
forest_walk_async=True
chirps_explanation_async=True
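
The trade-off described in the comments above can be checked with a small timing harness. This is a minimal sketch, not CHIRPS internals: `run_batch` and `toy_explain` are hypothetical names, and a thread pool is used only so that the example runs anywhere. For CPU-bound work such as a tree walk, a process pool would normally be swapped in.

```python
import timeit
from concurrent.futures import ThreadPoolExecutor

def run_batch(func, items, run_async=False, max_workers=4):
    """Apply func to every item, serially or through a worker pool, and time it."""
    start = timeit.default_timer()
    if run_async:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(func, items))  # map preserves input order
    else:
        results = [func(item) for item in items]
    elapsed = timeit.default_timer() - start
    return results, elapsed

def toy_explain(instance_id):
    # stand-in for one tree walk or one explanation
    return instance_id ** 2

serial_results, serial_time = run_batch(toy_explain, range(100), run_async=False)
async_results, async_time = run_batch(toy_explain, range(100), run_async=True)
print('serial: {:0.4f} s, async: {:0.4f} s'.format(serial_time, async_time))
```

For a task as cheap as `toy_explain`, the pool set-up cost will usually dominate, which is the small-batch situation the comments warn about.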

### Preparing unseen data

Again, note:
The test set has never been "seen" by the random forest during training.
The test set has only been used to assess model (random forest) accuracy, with no additional tuning after this.
The test set has not been involved in generating the explainer.

#### Batching
The memory space requirements for all the paths can be reduced by dividing the test set into batches. However, this does take longer, as there is an overhead to instantiate all the required objects, especially if coupled with parallel processing.
The best compromise could be a small number of larger batches.

In [7]:
# the number of instances is controlled by two settings
# batch_size - how many instances to explain at one time
# if both are set larger than the size of the test set, the whole test set simply runs in one batch;
# in that case OPTION 2 in the main loop below is the better choice
batch_size = 2
# n_instances - how many instances to explain in total from a test/unseen set
n_instances = 10000
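
To make the interaction between the two settings concrete, here is a minimal sketch of how requested counts might be clipped to the size of the test set. `normalise_batching` is a hypothetical helper written for this example; it is not `rt.batch_instance_ceiling`, whose exact behaviour may differ.

```python
import math

def normalise_batching(test_size, n_instances, batch_size):
    """Clip the requested counts to the data and return (n_instances, n_batches)."""
    if test_size < 1 or n_instances < 1 or batch_size < 1:
        raise ValueError('all arguments must be positive integers')
    n_instances = min(n_instances, test_size)   # cannot explain more than we have
    batch_size = min(batch_size, n_instances)   # one batch at most covers everything
    n_batches = math.ceil(n_instances / batch_size)
    return n_instances, n_batches

print(normalise_batching(test_size=25, n_instances=10000, batch_size=2))      # (25, 13)
print(normalise_batching(test_size=25, n_instances=10000, batch_size=10000))  # (25, 1)
```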

### 1. Data and Forest prep
Use the random state splits to do a one-off data split.
Fit the RF to training data, using the iterating random state.
Save the performance metrics on the test set for later review.

### 2. Prepare Unseen Data and Predictions
Important to note:
Test set never "seen" by RF during training.
Test set not involved in generating the explainer.
Test set used to evaluate model (random forest) accuracy beyond OOBE scores - no additional tuning based on these results.
Test set used to evaluate explanation scores by leave-one-out method removing the specific instance we're explaining.

Important to note:
We will explain predictions directly from the trained RF. Explanation system makes no compromise on model accuracy.

### 3. CHIRPS algorithm
1. Extract Tree Prediction Paths
2. Frequent pattern mining of paths
3. Score and sort mined path segments
4. Merge path segments into one rule

#### CHIRPS 1
Fit a forest_walker object to the dataset and decision forest. This is a wrapper that will extract the paths of all the given instances. Its main method delivers the instance paths for the remaining steps of the algorithm as a new object: a batch_paths_container. It can also report interesting statistics (treating the forest as a set of random tree-structured variables).
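
To show what "extracting the path" of an instance means, the toy sketch below walks one instance down hand-written trees and records each decision taken. The nested-dict tree format and the names `walk_tree` and `walk_forest` are assumptions for illustration; the real forest_walker works on the fitted scikit-learn forest and returns a batch_paths_container.

```python
def walk_tree(tree, instance):
    """Return the (feature, threshold, went_left) decisions and the leaf label."""
    path = []
    node = tree
    while 'label' not in node:  # internal nodes have no 'label' key
        went_left = instance[node['feature']] <= node['threshold']
        path.append((node['feature'], node['threshold'], went_left))
        node = node['left'] if went_left else node['right']
    return path, node['label']

def walk_forest(forest, instance):
    """Collect one (path, label) pair per tree."""
    return [walk_tree(tree, instance) for tree in forest]

# two hypothetical trees
tree_a = {'feature': 'age', 'threshold': 30,
          'left': {'label': 'no'},
          'right': {'feature': 'income', 'threshold': 50,
                    'left': {'label': 'no'}, 'right': {'label': 'yes'}}}
tree_b = {'feature': 'income', 'threshold': 40,
          'left': {'label': 'no'}, 'right': {'label': 'yes'}}

paths = walk_forest([tree_a, tree_b], {'age': 45, 'income': 60})
for path, label in paths:
    print(label, path)
# yes [('age', 30, False), ('income', 50, False)]
# yes [('income', 40, False)]
```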

#### CHIRPS 2-4
A batch_CHIRPS_explainer is fitted with the batch_paths_container returned by the forest walker, and with a sample of data. For CHIRPS, we prefer a large sample; the whole training set or another representative sample will do. This wrapper object executes steps 2-4 on each of the instance paths in the batch_paths_container.
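
Steps 2 and 3 can be pictured with a much-simplified sketch: count how often each single decision occurs across the paths of the trees that agree with the forest's prediction, keep those above a support threshold, and sort them. This is an illustration only. The function name `mine_decisions`, the restriction to agreeing trees, the use of single decisions rather than longer path segments, and scoring by support alone are all assumptions, not the CHIRPS implementation.

```python
from collections import Counter

def mine_decisions(paths, predicted_label, min_support=0.5):
    """Rank single path decisions by the share of agreeing trees that contain them."""
    agreeing = [path for path, label in paths if label == predicted_label]
    counts = Counter(decision for path in agreeing for decision in set(path))
    ranked = [(decision, count / len(agreeing)) for decision, count in counts.items()
              if count / len(agreeing) >= min_support]
    # highest support first; ties broken by the decision tuple for a stable order
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

# hypothetical (path, tree vote) pairs for one instance
paths = [
    ([('age', 30, False), ('income', 50, False)], 'yes'),
    ([('income', 50, False)], 'yes'),
    ([('age', 30, False), ('hours', 20, True)], 'yes'),
    ([('age', 30, True)], 'no'),  # disagrees with the forest prediction, so it is ignored
]
for decision, support in mine_decisions(paths, 'yes'):
    print(decision, round(support, 2))
# ('age', 30, False) 0.67
# ('income', 50, False) 0.67
```

A rule for the instance could then be assembled from the top-ranked decisions, for example `age > 30 AND income > 50`, which is the role of step 4.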

Important to note:
true_divide warnings are OK! They just mean that a continuous variable is unbounded on one side, i.e. no greater-than or less-than inequality is used for it in the specific CHIRPS explanation.

Important note:
Here we are using the training set to create the explainers. We could use a different dataset as long as it is representative of the training set that built the decision forest. Most importantly, we must not use the dataset that we wish to explain, so never use the test set, for example.

### 4. Evaluating CHIRPS Explanations
The test set has been used to create explainers *one instance at a time*, and the rest of the test set was not "seen" during this construction. To score each explainer, we use the test set, leaving out the individual instance being explained. The data_split_container (tt) has a convenience function for doing this. All the results are saved to CSV files in the project directory.
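
The leave-one-out idea can be sketched as follows. The rule is scored on every test instance except the one it was built to explain. The two metrics used here (coverage and precision against the forest's predictions) and the name `score_rule_leave_one_out` are assumptions for illustration; the scores written by `rt.evaluate_CHIRPS_explainers` may be defined differently.

```python
def score_rule_leave_one_out(rule, instances, predictions, explained_id):
    """Coverage and precision of a rule on all instances except the explained one."""
    target = predictions[explained_id]
    others = [i for i in instances if i != explained_id]  # leave one out
    covered = [i for i in others if rule(instances[i])]
    coverage = len(covered) / len(others) if others else 0.0
    hits = sum(1 for i in covered if predictions[i] == target)
    precision = hits / len(covered) if covered else 0.0
    return coverage, precision

# hypothetical test instances and forest predictions, keyed by index id
instances = {0: {'age': 45}, 1: {'age': 50}, 2: {'age': 25}, 3: {'age': 41}, 4: {'age': 38}}
predictions = {0: 'yes', 1: 'yes', 2: 'no', 3: 'no', 4: 'yes'}

def rule(x):
    # hypothetical explanation for instance 0
    return x['age'] > 40

print(score_rule_leave_one_out(rule, instances, predictions, explained_id=0))  # (0.5, 0.5)
```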

In [8]:
for random_state in [123, 124]:
    for d_constructor in rp.datasets:
        print('Running experiment for ' + d_constructor.__name__ + ' with random state = ' + str(random_state))
        print()
        # 1. Data and Forest prep
        print('Split data into train-test and build RF') 
        rf, tt, meta_data = rp.data_forest_prep(d_constructor, project_dir=project_dir,
                                      override_tuning=False,
                                      random_state=random_state,
                                      random_state_splits=random_state_splits)
        
        save_path = meta_data['get_save_path']()
        print()

        # 2. Prepare Unseen Data and Predictions
        print('Prepare Unseen Data and Predictions')
        # OPTION 1 - batching (to be implemented in the new code, right now it will do just one batch)
        # this will normalise the above parameters to the size of the dataset
        n_instances, n_batches = rt.batch_instance_ceiling(data_split=tt, n_instances=n_instances, batch_size=batch_size)

        # this gets the next batch out of the data_split_container according to the required number of instances
        # all formats can be extracted, depending on the requirement
        # unencoded, encoded (sparse matrix is the type returned by scikit), ordinary dense matrix also available
        instances, instances_enc, instances_enc_matrix, labels = tt.get_next(batch_size, which_split='test') # default

        # OPTION 2 - just run with whole test set
        # instances = tt.X_test; instances_enc = tt.X_test_enc; instances_enc_matrix = tt.X_test_enc_matrix; labels = tt.y_test

        # Make all the model predictions from the decision forest
        preds = rf.predict(X=instances_enc)
        print()
        
        # 3.1 - Extract Tree Prediction Paths

        print('Walking forest for ' + str(len(labels)) + ' instances... (please wait)')

        # wrapper object needs the decision forest itself and the dataset meta data (we have a convenience function for this)
        f_walker = strcts.forest_walker(forest = rf, meta_data=meta_data)

        # set the timer
        forest_walk_start_time = timeit.default_timer()

        # do the walk - returns a batch_paths_container (even for just one instance)
        # requires the X instances in a matrix (dense, ordinary numpy matrix) - this is available in the data_split_container
        bp_container = f_walker.forest_walk(instances = instances_enc_matrix
                                , labels = preds # we're explaining the prediction, not the true label!
                                , forest_walk_async = forest_walk_async)

        # stop the timer
        forest_walk_end_time = timeit.default_timer()
        forest_walk_elapsed_time = forest_walk_end_time - forest_walk_start_time

        print('Forest Walk with async = ' + str(forest_walk_async))
        print('Forest Walk time elapsed:', "{:0.4f}".format(forest_walk_elapsed_time), 'seconds')
        print()
        
        # 3.2-3.4 - Frequent pattern mining of paths, Score and sort mined path segments, Merge path segments into one rule

        # build CHIRPS and a rule for each instance represented in the batch paths container
        CHIRPS = strcts.batch_CHIRPS_explainer(bp_container,
                                        forest=rf,
                                        sample_instances=tt.X_train_enc, # any representative sample can be used
                                        sample_labels=tt.y_train,  # any representative sample can be used
                                        meta_data=meta_data)

        print('Running CHIRPS on a batch of ' + str(len(labels)) + ' instances... (please wait)')
        # start a timer
        ce_start_time = timeit.default_timer()

        CHIRPS.batch_run_CHIRPS(chirps_explanation_async=chirps_explanation_async) # all the defaults

        ce_end_time = timeit.default_timer()
        ce_elapsed_time = ce_end_time - ce_start_time
        print('CHIRPS time elapsed:', "{:0.4f}".format(ce_elapsed_time), 'seconds')
        print('CHIRPS with async = ' + str(chirps_explanation_async))
        print()
        
        # 4. Evaluating CHIRPS Explanations
        print('Evaluating found explanations')
        
        results_start_time = timeit.default_timer()

        # iterate over all the test instances (based on the ids in the index)
        # scoring will leave out the specific instance by this id.
        rt.evaluate_CHIRPS_explainers(CHIRPS, tt, labels.index,
                                      print_to_screen=False, # set True when running single instances
                                      save_results_path=save_path,
                                      save_results_file='results' + '_rnst_' + str(random_state),
                                      save_CHIRPS=True)

        results_end_time = timeit.default_timer()
        results_elapsed_time = results_end_time - results_start_time
        print('CHIRPS batch results eval time elapsed:', "{:0.4f}".format(results_elapsed_time), 'seconds')
        # this completes the CHIRPS runs

        print()
        print()

Running experiment for adult_small_samp_data with random state = 123

Split data into train-test and build RF
using previous tuning parameters

Prepare Unseen Data and Predictions

Walking forest for 2 instances... (please wait)
Forest Walk with async = False
Forest Walk time elapsed: 0.2306 seconds

Running CHIRPS on a batch of 2 instances... (please wait)


  np.histogram(uppers, upper_bins)[0]).round(5)
  np.histogram(lowers, lower_bins)[0]).round(5)


CHIRPS time elapsed: 0.2916 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0283 seconds


Running experiment for bankmark_samp_data with random state = 123

Split data into train-test and build RF
using previous tuning parameters

Prepare Unseen Data and Predictions

Walking forest for 2 instances... (please wait)
Forest Walk with async = False
Forest Walk time elapsed: 0.4378 seconds

Running CHIRPS on a batch of 2 instances... (please wait)


  np.histogram(lowers, lower_bins)[0]).round(5)
  np.histogram(uppers, upper_bins)[0]).round(5)


CHIRPS time elapsed: 0.3702 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0540 seconds


Running experiment for car_data with random state = 123

Split data into train-test and build RF
using previous tuning parameters

Prepare Unseen Data and Predictions

Walking forest for 2 instances... (please wait)
Forest Walk with async = False
Forest Walk time elapsed: 0.1875 seconds

Running CHIRPS on a batch of 2 instances... (please wait)
CHIRPS time elapsed: 0.5497 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0214 seconds


Running experiment for cardio_data with random state = 123

Split data into train-test and build RF
using previous tuning parameters

Prepare Unseen Data and Predictions

Walking forest for 2 instances... (please wait)
Forest Walk with async = False
Forest Walk time elapsed: 0.4320 seconds

Running CHIRPS on a batch of 2 instances... (please wait)


  np.histogram(lowers, lower_bins)[0]).round(5)
  np.histogram(uppers, upper_bins)[0]).round(5)


CHIRPS time elapsed: 0.6077 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0392 seconds


Running experiment for credit_data with random state = 123

Split data into train-test and build RF
using previous tuning parameters

Prepare Unseen Data and Predictions

Walking forest for 2 instances... (please wait)
Forest Walk with async = False
Forest Walk time elapsed: 0.3994 seconds

Running CHIRPS on a batch of 2 instances... (please wait)


  np.histogram(lowers, lower_bins)[0]).round(5)
  np.histogram(uppers, upper_bins)[0]).round(5)


CHIRPS time elapsed: 0.3121 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0338 seconds


Running experiment for german_data with random state = 123

Split data into train-test and build RF
using previous tuning parameters

Prepare Unseen Data and Predictions

Walking forest for 2 instances... (please wait)
Forest Walk with async = False
Forest Walk time elapsed: 0.2240 seconds

Running CHIRPS on a batch of 2 instances... (please wait)


  np.histogram(uppers, upper_bins)[0]).round(5)
  np.histogram(lowers, lower_bins)[0]).round(5)


CHIRPS time elapsed: 0.4213 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0494 seconds


Running experiment for lending_tiny_samp_data with random state = 123

Split data into train-test and build RF
using previous tuning parameters

Prepare Unseen Data and Predictions

Walking forest for 2 instances... (please wait)
Forest Walk with async = False
Forest Walk time elapsed: 0.7110 seconds

Running CHIRPS on a batch of 2 instances... (please wait)


  np.histogram(uppers, upper_bins)[0]).round(5)
  np.histogram(lowers, lower_bins)[0]).round(5)


CHIRPS time elapsed: 1.1822 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0740 seconds


Running experiment for nursery_samp_data with random state = 123

Split data into train-test and build RF
using previous tuning parameters

Prepare Unseen Data and Predictions

Walking forest for 2 instances... (please wait)
Forest Walk with async = False
Forest Walk time elapsed: 0.6595 seconds

Running CHIRPS on a batch of 2 instances... (please wait)
CHIRPS time elapsed: 0.6341 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.1081 seconds


Running experiment for adult_small_samp_data with random state = 124

Split data into train-test and build RF
New grid tuning... (please wait)
Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 500}
Training time: 1.064868337316554
Out of Bag Accuracy Score: 0.8455236980690463

Trying params: {'max_depth': 8, 'min_samples_

  np.histogram(uppers, upper_bins)[0]).round(5)
  np.histogram(lowers, lower_bins)[0]).round(5)


CHIRPS time elapsed: 0.4277 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0689 seconds


Running experiment for bankmark_samp_data with random state = 124

Split data into train-test and build RF
New grid tuning... (please wait)
Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 500}
Training time: 1.1535155219483926
Out of Bag Accuracy Score: 0.9148264984227129

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1000}
Training time: 2.244431250499531
Out of Bag Accuracy Score: 0.9141955835962146

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1500}
Training time: 3.4299912323031094
Out of Bag Accuracy Score: 0.9123028391167193

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 500}
Training time: 1.047445018608073
Out of Bag Accuracy Score: 0.9129337539432176

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 1000}
Train

  np.histogram(lowers, lower_bins)[0]).round(5)
  np.histogram(uppers, upper_bins)[0]).round(5)


CHIRPS time elapsed: 0.5501 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0801 seconds


Running experiment for car_data with random state = 124

Split data into train-test and build RF
New grid tuning... (please wait)
Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 500}
Training time: 0.7874389371075523
Out of Bag Accuracy Score: 0.9752066115702479

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1000}
Training time: 1.5659214825400198
Out of Bag Accuracy Score: 0.9743801652892562

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1500}
Training time: 2.371384376673248
Out of Bag Accuracy Score: 0.9743801652892562

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 500}
Training time: 0.7332540895386757
Out of Bag Accuracy Score: 0.9694214876033058

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 1000}
Training time:

  np.histogram(uppers, upper_bins)[0]).round(5)
  np.histogram(lowers, lower_bins)[0]).round(5)


CHIRPS time elapsed: 0.8277 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0749 seconds


Running experiment for credit_data with random state = 124

Split data into train-test and build RF
New grid tuning... (please wait)
Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 500}
Training time: 0.664097453041137
Out of Bag Accuracy Score: 0.8737060041407867

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1000}
Training time: 1.368747457277749
Out of Bag Accuracy Score: 0.8737060041407867

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1500}
Training time: 2.072724710542815
Out of Bag Accuracy Score: 0.8799171842650103

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 500}
Training time: 0.6291805057861382
Out of Bag Accuracy Score: 0.8757763975155279

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 1000}
Training time

  np.histogram(uppers, upper_bins)[0]).round(5)
  np.histogram(lowers, lower_bins)[0]).round(5)


CHIRPS time elapsed: 0.3632 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0667 seconds


Running experiment for german_data with random state = 124

Split data into train-test and build RF
New grid tuning... (please wait)
Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 500}
Training time: 0.8463240273955819
Out of Bag Accuracy Score: 0.7485714285714286

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1000}
Training time: 1.679281373361448
Out of Bag Accuracy Score: 0.7442857142857143

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1500}
Training time: 2.5716090676855288
Out of Bag Accuracy Score: 0.7457142857142857

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 500}
Training time: 0.769951618418304
Out of Bag Accuracy Score: 0.7285714285714285

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 1000}
Training tim

  np.histogram(uppers, upper_bins)[0]).round(5)
  np.histogram(lowers, lower_bins)[0]).round(5)


CHIRPS time elapsed: 0.7265 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0747 seconds


Running experiment for lending_tiny_samp_data with random state = 124

Split data into train-test and build RF
New grid tuning... (please wait)
Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 500}
Training time: 1.6099684927794158
Out of Bag Accuracy Score: 0.9158180583842498

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1000}
Training time: 3.3501282703839763
Out of Bag Accuracy Score: 0.9137813985064495

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1500}
Training time: 4.830363093224236
Out of Bag Accuracy Score: 0.9158180583842498

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 500}
Training time: 1.5401003447585708
Out of Bag Accuracy Score: 0.9008825526137135

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 1000}


  np.histogram(uppers, upper_bins)[0]).round(5)
  np.histogram(lowers, lower_bins)[0]).round(5)


CHIRPS time elapsed: 0.7559 seconds
CHIRPS with async = False

Evaluating found explanations
CHIRPS batch results eval time elapsed: 0.0865 seconds


Running experiment for nursery_samp_data with random state = 124

Split data into train-test and build RF
New grid tuning... (please wait)
Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 500}
Training time: 1.1386606156136736
Out of Bag Accuracy Score: 0.9366041896361632

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1000}
Training time: 2.2939615266969895
Out of Bag Accuracy Score: 0.9377067254685777

Trying params: {'max_depth': 8, 'min_samples_leaf': 1, 'n_estimators': 1500}
Training time: 3.430476790842704
Out of Bag Accuracy Score: 0.9371554575523704

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 500}
Training time: 1.0601698974137435
Out of Bag Accuracy Score: 0.9250275633958104

Trying params: {'max_depth': 8, 'min_samples_leaf': 5, 'n_estimators': 1000}
Train