# prologue

### set up notebook and load package

In [1]:
# for notebook plotting
%matplotlib inline 

# load what we need
import CHIRPS.datasets as ds
import CHIRPS.reproducible as rp

# demo datasets that ship with package. all from UCI unless stated otherwise
# ds.adult_data, ds.adult_samp_data, ds.adult_small_samp_data Large dataset ships with manageable sub samples
# ds.bankmark_data, ds.bankmark_samp_data
# ds.car_data
# ds.cardio_data this is the cardiotocography dataset
# ds.credit_data
# ds.german_data
# ds.lending_data, ds.lending_samp_data, ds.lending_small_samp_data, ds.lending_tiny_samp_data from Kaggle. see datasets_from_source file for links
# ds.nursery_data, ds.nursery_samp_data
# ds.rcdv_data, ds.rcdv_samp_data from US government see datasets_from_source file for links

### common config - can be omitted if defaults are OK

In [2]:
project_dir = 'V:\\whiteboxing' # defaults to a directory "whiteboxing" in the working directory
random_state_splits = 123 # one off for splitting the data into test / train

# data management

### datasets

Several datasets are available as pre-prepared containers that hold the data together with the meta-data used by the algorithm.
Any dataset can be turned into a container by invoking the constructor found in the file structures.py.

In [3]:
# datasets might be down-sampled to make them easier to work with
# the full sets are available too
# this is a list of constructors that will be used in the benchmarking
rp.datasets

[<function CHIRPS.datasets.adult_small_samp_data>,
 <function CHIRPS.datasets.bankmark_samp_data>,
 <function CHIRPS.datasets.car_data>,
 <function CHIRPS.datasets.cardio_data>,
 <function CHIRPS.datasets.credit_data>,
 <function CHIRPS.datasets.german_data>,
 <function CHIRPS.datasets.lending_tiny_samp_data>,
 <function CHIRPS.datasets.nursery_samp_data>]

In [4]:
# example of one dataset
# note: random_state propagates through other functions and is easily updated to allow alternative runs
ds.cardio_data(random_state=123, project_dir=project_dir)

<CHIRPS.structures.data_container at 0x19ab4d4cda0>

### standardising train-test splitting
Some methods are not available in Python, and we want to maintain the same dataset splits on every platform. So the train-test split is made once with a one-time random seed, and the splits are saved to csv in the project folders.
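
The idea of a one-off, seeded split that is written to disk can be sketched with the standard library alone. This is a minimal illustration and not what `rp.export_data_splits` does internally; the helper names `split_indices` and `export_split`, the toy rows and the file names are all assumptions made for the example.

```python
import csv
import random
import tempfile
from pathlib import Path


def split_indices(n_rows, test_fraction, seed):
    """Return (train_idx, test_idx) from one seeded shuffle of range(n_rows)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # private RNG, global random state untouched
    n_test = int(round(n_rows * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def export_split(rows, header, indices, path):
    """Write the selected rows to a CSV file that any platform can read back."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows[i] for i in indices)


# toy data standing in for a real dataset (hypothetical)
header = ["x1", "x2", "label"]
rows = [[i, i * 2, i % 2] for i in range(10)]

train_idx, test_idx = split_indices(len(rows), test_fraction=0.3, seed=123)

out_dir = Path(tempfile.mkdtemp())
export_split(rows, header, train_idx, out_dir / "toy_train.csv")
export_split(rows, header, test_idx, out_dir / "toy_test.csv")

# the same seed always reproduces the same partition
assert split_indices(len(rows), 0.3, 123) == (train_idx, test_idx)
```

Because the split depends only on the seed and the row count, a method running on another platform can read the exported files and work on exactly the same train and test rows.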

In [5]:
# writes external files
rp.export_data_splits(datasets=rp.datasets, project_dir=project_dir, random_state_splits=random_state_splits)

Exported train-test data for 8 datasets.


# Experimental Runs
Loop through datasets, actioning the functions in the package to execute a round of experiments and test evaluations.

## 0. Optional Memory and Computation Cost Management
CHIRPS is economical in time but memory intensive when computing for many instances at once.


### Parallel processing
Scikit-learn takes care of parallelism for the RF construction.
We can parallelise the following:
1. the walk of instances down each tree to collect the paths. The paths for many instances are returned in a single array. This parallelises across trees.
2. building CHIRPS and the final explanation (rule). This is a search optimisation and we can parallelise each instance.

This is especially effective when running batches. For single instances, set both to false to avoid spinning up the parallel infrastructure.
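
The trade-off between pooled and sequential execution can be sketched as a simple flag. This is only an illustration of the pattern: `explain_instance` and `run_batch` are hypothetical names, not CHIRPS functions, and a thread pool is used to keep the example portable, whereas CPU-bound work would need a process pool to use several cores.

```python
from concurrent.futures import ThreadPoolExecutor


def explain_instance(instance):
    """Hypothetical stand-in for the per-instance work (not a CHIRPS function)."""
    return sum(v * v for v in instance)


def run_batch(instances, use_async, max_workers=4):
    """Map explain_instance over a batch, optionally through a worker pool."""
    if not use_async:
        # a single instance or a tiny batch: skip the pool start-up cost
        return [explain_instance(inst) for inst in instances]
    # a thread pool stands in for the real parallel back end in this sketch
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(explain_instance, instances))  # map preserves input order


batch = [[1, 2], [3, 4], [5, 6]]

# both modes must give identical results; only the wall-clock time differs
assert run_batch(batch, use_async=True) == run_batch(batch, use_async=False)
```

Timing both modes on a realistic batch is the only reliable way to see whether the pool pays for its start-up overhead.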

In [6]:
# control for async processes - each tree walk can be done in its own core
# and so can each explanation (e.g. rule conditions merge by hill-climbing)
# these will default to false if not passed explicitly to the explainer function
# on a multi-core machine there should be a good speed up for large batches
# when the batch_size advantage exceeds the overhead of setting up multi-processing
# timings will be printed to screen so you can see if it helps
forest_walk_async=True
chirps_explanation_async=True

### Preparing unseen data

Again note:
The test set has never been "seen" by the random forest during training.
The test set has only been used to assess model (random forest) accuracy, with no additional tuning after this.
The test set has not been involved in generating the explainer.

#### Batching
The memory requirements for all the paths can be reduced by dividing the test set into batches. However, this takes longer because there is an overhead to instantiate all the required objects, especially when coupled with parallel processing.
The best compromise could be a small number of larger batches.
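
How `batch_size` and `n_instances` interact can be sketched as follows. The generator `iter_batches` and the argument `n_test_rows` are assumptions made for illustration; the package may organise its batches differently.

```python
def iter_batches(n_test_rows, batch_size, n_instances):
    """Yield (start, stop) index ranges covering at most n_instances rows."""
    n_total = min(n_instances, n_test_rows)  # cannot explain more rows than exist
    for start in range(0, n_total, batch_size):
        yield start, min(start + batch_size, n_total)


# a toy test set of 12 rows, explained 5 at a time
batches = list(iter_batches(n_test_rows=12, batch_size=5, n_instances=10000))
print(batches)  # [(0, 5), (5, 10), (10, 12)]

# an oversized batch_size collapses to a single batch over the whole test set
single = list(iter_batches(n_test_rows=12, batch_size=100, n_instances=10000))
print(single)  # [(0, 12)]
```

Only the paths for one batch need to be held in memory at a time, which is where the saving comes from.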

In [7]:
# the number of instances can be controlled by
# batch_size - how many instances to explain at one time
# set these larger than the size of the test set and it will simply run the whole test set in one batch. Better do option 2
batch_size = 5
# how many instances to explain in total from a test/unseen set
n_instances = 10000

### 1. Data and Forest prep
Use the random state splits to do a one-off data split.
Fit the RF to training data, using the iterating random state.
Save the performance metrics on the test set for later review.

### 2. Prepare Unseen Data and Predictions
Important to note:
Test set never "seen" by RF during training.
Test set not involved in generating the explainer.
Test set used to evaluate model (random forest) accuracy beyond OOBE scores - no additional tuning based on these results.
Test set used to evaluate explanation scores by leave-one-out method removing the specific instance we're explaining.

Important to note:
We will explain predictions directly from the trained RF. Explanation system makes no compromise on model accuracy.

### 3. CHIRPS algorithm
1. Extract Tree Prediction Paths
2. Frequent pattern mining of paths
3. Score and sort mined path segments
4. Merge path segments into one rule
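
The four steps can be illustrated on toy data. This is a deliberately simplified sketch under stated assumptions, not the CHIRPS implementation: the condition format `(feature, is_leq, threshold)`, the scoring formula (support times pattern length) and the fixed cap on rule length are all invented for the example, and the cap stands in for the hill-climbing merge mentioned in this notebook.

```python
from collections import Counter
from itertools import combinations

# step 1 (taken as given here): one path per tree for a single instance;
# each condition is (feature, is_leq, threshold) with thresholds already binned
paths = [
    [("age", True, 40), ("income", False, 30)],
    [("age", True, 40), ("hours", True, 50)],
    [("income", False, 30), ("age", True, 40)],
    [("hours", True, 50), ("income", False, 30)],
]


def mine_patterns(paths, min_support, max_len=2):
    """Step 2: count condition sets of up to max_len items, keep frequent ones."""
    counts = Counter()
    for path in paths:
        items = sorted(set(path))  # canonical order so equal sets match
        for k in range(1, max_len + 1):
            counts.update(combinations(items, k))
    n = len(paths)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}


def score_and_sort(patterns):
    """Step 3: toy score = support * pattern length, ties broken by pattern."""
    return sorted(patterns, key=lambda p: (-patterns[p] * len(p), p))


def merge(sorted_patterns, max_terms):
    """Step 4: add unseen conditions in score order up to a fixed cap."""
    rule = []
    for pattern in sorted_patterns:
        for cond in pattern:
            if cond not in rule and len(rule) < max_terms:
                rule.append(cond)
    return rule


patterns = mine_patterns(paths, min_support=0.5)
rule = merge(score_and_sort(patterns), max_terms=2)
print(rule)  # [('age', True, 40), ('income', False, 30)]
```

The resulting rule reads "age <= 40 and income > 30": the pair of conditions that co-occurs most often across the toy paths.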

#### CHIRPS 1
Fit a forest_walker object to the dataset and decision forest. This is a wrapper that will extract the paths of all the given instances. Its main method delivers the instance paths for the remaining steps of the algorithm as a new object: a batch_paths_container. It can also report interesting statistics (treating the forest as a set of random tree-structured variables).
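
What "extracting the path" of an instance means can be shown with a toy tree. The nested-dict tree, the `walk` and `walk_forest` helpers and the condition format are assumptions made for illustration; they do not reflect the internals of the forest_walker object.

```python
# a toy tree: internal nodes test feature <= threshold, leaves hold a class
tree = {
    "feature": "age", "threshold": 40,
    "left": {"feature": "income", "threshold": 30,
             "left": {"leaf": "no"}, "right": {"leaf": "yes"}},
    "right": {"leaf": "no"},
}


def walk(tree, instance):
    """Return (path, prediction); path is a list of (feature, is_leq, threshold)."""
    path = []
    node = tree
    while "leaf" not in node:
        is_leq = instance[node["feature"]] <= node["threshold"]
        path.append((node["feature"], is_leq, node["threshold"]))
        node = node["left"] if is_leq else node["right"]
    return path, node["leaf"]


def walk_forest(forest, instances):
    """Collect one (path, prediction) per tree for every instance."""
    return [[walk(t, inst) for t in forest] for inst in instances]


path, pred = walk(tree, {"age": 35, "income": 52})
print(path, pred)  # [('age', True, 40), ('income', False, 30)] yes
```

Each tree contributes one such path per instance, so the collection for a whole batch grows with both the number of trees and the number of instances.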

#### CHIRPS 2-4
A batch_CHIRPS_container is fitted with the batch_paths_container returned by the forest walker, and with a sample of data. For CHIRPS, we prefer a large sample. The whole training set or other representative sample will do. This wrapper object will execute steps 2-4 on each of the instance paths in the batch_paths_container.

Important to note:
true_divide warnings are OK! It just means that a continuous variable is unbounded on one side i.e. no greater/less than inequality is used in the specific CHIRPS explanation.

Important note: 
Here we are using the training set to create the explainers. We could use a different dataset as long as it is representative of the training set that built the decision forest. Most important that we don't use the dataset that we wish to explain, so never use the test set, for example.

### 4. Evaluating CHIRPS Explanations
The test set has been used to create an explainer *one instance at a time*, and the rest of the test set was not "seen" during this construction. To score each explainer, we use the test set, leaving out the individual instance being explained. The data_split_container (tt) has a convenience function for doing this. All the results are saved to csv files in the project directory.
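
The leave-one-out scoring can be sketched as below. The helpers `rule_applies` and `evaluate_rule`, the rule format and the toy rows are hypothetical, and coverage and precision are used here as two common rule-quality measures; the package reports its own set of scores.

```python
def rule_applies(rule, row):
    """True if the row satisfies every (feature, is_leq, threshold) condition."""
    return all((row[f] <= t) == is_leq for f, is_leq, t in rule)


def evaluate_rule(rule, rows, labels, explained_idx, target):
    """Coverage and precision over all rows except the explained instance."""
    covered = [i for i, row in enumerate(rows)
               if i != explained_idx and rule_applies(rule, row)]
    n_other = len(rows) - 1  # the explained instance is left out
    coverage = len(covered) / n_other if n_other else 0.0
    hits = sum(labels[i] == target for i in covered)
    precision = hits / len(covered) if covered else 0.0
    return coverage, precision


# toy test set (hypothetical); row 0 is the instance being explained
rows = [
    {"age": 35, "income": 52},
    {"age": 30, "income": 45},
    {"age": 38, "income": 20},
    {"age": 50, "income": 60},
    {"age": 25, "income": 80},
]
labels = ["yes", "yes", "no", "no", "no"]
rule = [("age", True, 40), ("income", False, 30)]  # age <= 40 and income > 30

coverage, precision = evaluate_rule(rule, rows, labels, explained_idx=0, target="yes")
print(coverage, precision)  # 0.5 0.5
```

Leaving out the explained instance keeps the score from being flattered by the one row the rule was built to cover.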

In [8]:
run_CHIRPS = True
run_Anchors = True
run_defragTrees = True

for random_state in [123]:
    for d_constructor in [rp.datasets[3]]:
        print('Running experiment for ' + d_constructor.__name__ + ' with random state = ' + str(random_state))
        print()
        # 1. Data and Forest prep
        print('Split data into main train-test and build RF') 
        
        mydata = d_constructor(random_state=random_state, project_dir=project_dir)
        
        meta_data = mydata.get_meta()
        save_path = meta_data['get_save_path']()
        train_index, test_index = mydata.get_tt_split_idx(random_state=random_state_splits)
        tt = mydata.tt_split(train_index, test_index)
        
        # this will train and score the model
        rf = rp.forest_prep(ds_container=tt,
                            meta_data=meta_data,
                            save_path=save_path)

        print()
        
        if run_CHIRPS: # might assume this is always true! :-)
            rp.CHIRPS_benchmark(forest=rf, ds_container=tt, meta_data=meta_data,
                                batch_size=batch_size, n_instances=n_instances,
                                forest_walk_async=True,
                                chirps_explanation_async=False,
                                save_path=save_path,
                                dataset_name=d_constructor.__name__,
                                random_state=random_state)
        
        
        if run_Anchors:
            # new copy of ds_container (need to reset the row counters)
            tt_anch = mydata.tt_split(train_index, test_index)
            # preprocessing - discretised continuous X matrix has been added and also needs an updated var_dict 
            # plus returning the fitted explainer that holds the data distribution
            tt_anch, anchors_explainer = rp.Anchors_preproc(ds_container=tt_anch,
                                                             meta_data=meta_data)
    
            # re-fitting the random forest to the discretised data and evaluating
            rf = rp.forest_prep(ds_container=tt_anch,
                            meta_data=meta_data,
                            save_path=save_path,
                            identifier='Anchors')

            rp.Anchors_benchmark(forest=rf, ds_container=tt_anch, meta_data=meta_data,
                                anchors_explainer=anchors_explainer,
                                batch_size=batch_size, n_instances=n_instances,
                                save_path=save_path, dataset_name=d_constructor.__name__,
                                random_state=meta_data['random_state'])
        
        if run_defragTrees:
            # create a new copy of tt split, because each one keeps track of which instances it has given out.
            # re-using the top one means different instances are passed
            tt_dfrgtrs = mydata.tt_split(train_index, test_index)
            
            # some dfrgtrs specific parameters
            Kmax = 10
            restart = 1
            maxitr = 10
            dfrgtrs = rp.defragTrees_prep(ds_container=tt_dfrgtrs, meta_data=meta_data, forest=rf, 
                                            Kmax=Kmax, restart=restart, maxitr=maxitr,
                                            identifier='defragTrees', save_path=save_path)
            
                        
            rp.defragTrees_benchmark(forest=rf, ds_container=tt_dfrgtrs, meta_data=meta_data, dfrgtrs=dfrgtrs,
                                    batch_size=batch_size, n_instances=n_instances,
                                    save_path=save_path, dataset_name=d_constructor.__name__,
                                    random_state=random_state)
            

            

Running experiment for cardio_data with random state = 123

Split data into main train-test and build RF
using previous tuning parameters

Prepare Unseen Data and Predictions for CHIRPS benchmark
Walking forest for 5 instances... (please wait)
Forest Walk with async = True
Forest Walk time elapsed: 0.8215 seconds

Running CHIRPS on a batch of 5 instances... (please wait)


  np.histogram(lowers, lower_bins)[0]).round(5)
  np.histogram(uppers, upper_bins)[0]).round(5)


CHIRPS time elapsed: 0.4017 seconds
CHIRPS with async = False

Evaluating found explanations





CHIRPS batch results eval time elapsed: 0.1819 seconds
using previous tuning parameters
Prepare Unseen Data and Predictions for Anchors benchmark
Running Anchors on each instance and collecting results
Working on Anchors for instance 63
[8, 7]
[[3. 3.]
 [2. 3.]
 [0. 3.]
 ...
 [1. 3.]
 [3. 2.]
 [1. 3.]]
[[2. 0.]
 [3. 0.]
 [3. 1.]
 [1. 2.]
 [1. 1.]]
[[2. 0.]
 [3. 0.]
 [3. 1.]
 [1. 2.]
 [1. 1.]]
False
[1, 17]
[[2. 0.]
 [2. 1.]
 [0. 0.]
 ...
 [2. 1.]
 [0. 0.]
 [3. 2.]]
[[2. 1.]
 [0. 0.]
 [2. 1.]
 [2. 1.]]
[[2. 1.]
 [0. 0.]
 [2. 1.]
 [2. 1.]]
False
[19, 6, 18, 16, 1, 17, 11, 4]
[[3. 1. 0. ... 0. 3. 1.]
 [2. 0. 2. ... 1. 2. 2.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [2. 0. 2. ... 1. 2. 2.]
 [3. 0. 0. ... 0. 2. 2.]
 [1. 0. 2. ... 2. 1. 0.]]
[[3. 1. 0. 0. 0. 0. 3. 1.]
 [1. 0. 1. 1. 2. 1. 2. 0.]
 [3. 0. 1. 3. 2. 1. 1. 0.]]
[[3. 1. 0. 0. 0. 0. 3. 1.]
 [1. 0. 1. 1. 2. 1. 2. 0.]
 [3. 0. 1. 3. 2. 1. 1. 0.]]
Fa

IndexError: index 56 is out of bounds for axis 1 with size 21