# prologue

### set up notebook and load package

In [1]:
# for notebook plotting
%matplotlib inline 

# load what we need
import numpy as np
import CHIRPS.datasets as ds
import CHIRPS.reproducible as rp
from CHIRPS.routines import extend_path

# demo datasets that ship with package. all from UCI unless stated otherwise
# ds.adult_data, ds.adult_samp_data, ds.adult_small_samp_data Large dataset ships with manageable sub samples
# ds.bankmark_data, ds.bankmark_samp_data
# ds.car_data
# ds.cardio_data this is the cardiotocography dataset
# ds.credit_data
# ds.german_data
# ds.lending_data, ds.lending_samp_data, ds.lending_small_samp_data, ds.lending_tiny_samp_data from Kaggle. see datasets_from_source file for links
# ds.nursery_data, ds.nursery_samp_data
# ds.rcdv_data, ds.rcdv_samp_data from US government see datasets_from_source file for links

### common config - can be omitted if defaults are OK

In [2]:
project_dir = 'V:\\whiteboxing' # defaults to a directory "whiteboxing" in the working directory
random_state_splits = 123 # one off for splitting the data into test / train

# data management

### datasets

Several datasets are available as pre-prepared containers that hold the data and some metadata used by the algorithm.
Any dataset can be turned into a container by invoking the constructor found in the file structures.py.

In [3]:
# datasets might be down-sampled to make them easier to work with
# the full sets are available too
# this is a list of constructors that will be used in the benchmarking
rp.datasets

[<function CHIRPS.datasets.adult_small_samp(random_state=123, project_dir=None)>,
 <function CHIRPS.datasets.bankmark_samp(random_state=123, project_dir=None)>,
 <function CHIRPS.datasets.car(random_state=123, project_dir=None)>,
 <function CHIRPS.datasets.cardio(random_state=123, project_dir=None)>,
 <function CHIRPS.datasets.credit(random_state=123, project_dir=None)>,
 <function CHIRPS.datasets.german(random_state=123, project_dir=None)>,
 <function CHIRPS.datasets.lending_tiny_samp(random_state=123, project_dir=None)>,
 <function CHIRPS.datasets.nursery_samp(random_state=123, project_dir=None)>,
 <function CHIRPS.datasets.rcdv_samp(random_state=123, project_dir=None)>]

In [4]:
# example of one dataset
# note: random_state propagates through other functions and is easily updated to allow alternative runs
ds.cardio(random_state=123, project_dir=project_dir)

<CHIRPS.structures.data_container at 0x218c4392c88>

### standardising train-test splitting
Some methods are not available in Python. To keep the dataset splits identical on every platform, the train-test data is split once with the one-time random seed and the splits are saved to CSV files in the project folders.
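A minimal sketch of the idea, using only the standard library. It is not the implementation behind `rp.export_data_splits`; the function `export_split`, its parameters and the file naming are hypothetical and only illustrate how a fixed seed gives a repeatable split that other platforms can read back from CSV.

```python
import csv
import os
import random
import tempfile

def export_split(rows, header, out_dir, name, seed=123, test_frac=0.3):
    """Shuffle row indices with a fixed seed, then write train/test CSV files."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)  # same seed -> same permutation
    n_test = int(round(test_frac * len(rows)))
    split = {'test': sorted(idx[:n_test]), 'train': sorted(idx[n_test:])}
    paths = {}
    for part, part_idx in split.items():
        paths[part] = os.path.join(out_dir, name + '_' + part + '.csv')
        with open(paths[part], 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['row_id'] + header)  # keep the original row id
            for i in part_idx:
                writer.writerow([i] + list(rows[i]))
    return split, paths

# toy data (hypothetical): ten rows with one feature and a label
toy_rows = [(x, x % 2) for x in range(10)]
out_dir = tempfile.mkdtemp()
split_a, paths = export_split(toy_rows, ['x', 'label'], out_dir, 'toy')
split_b, _ = export_split(toy_rows, ['x', 'label'], out_dir, 'toy')
```

Because the row ids are written alongside the data, a method running on another platform can load the two files and work with exactly the same partition.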

In [5]:
# writes external files
rp.export_data_splits(datasets=rp.datasets, project_dir=project_dir, random_state_splits=random_state_splits)

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


Exported train-test data for 9 datasets.




# Experimental Runs
Loop through datasets, actioning the functions in the package to execute a round of experiments and test evaluations.

## 0. Optional Memory and Computation Cost Management
CHIRPS is time economical but memory intensive to compute for lots of instances at once.


### Parallel processing
scikit-learn takes care of parallelism for the RF construction.
We can parallelise the following:
1. the walk of instances down each tree to collect the paths. The paths for many instances are returned in a single array. This parallelises across trees.
2. building CHIRPS and the final explanation (rule). This is a search optimisation and we can parallelise each instance.

This is especially effective when running batches. For single instances, set both to false to avoid spinning up the parallel infrastructure.
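The switch between serial and parallel execution can be sketched as follows. This is an illustration only, not the package's code: `explain_one` and `explain_batch` are hypothetical names, and a thread pool is used as a stand-in to keep the example short. For CPU-bound work in CPython a process pool (for example `concurrent.futures.ProcessPoolExecutor`) would be the closer match to the multi-processing described here.

```python
from concurrent.futures import ThreadPoolExecutor

def explain_one(instance_id):
    # stand-in for the per-instance work (hypothetical)
    return instance_id, instance_id ** 2

def explain_batch(instance_ids, run_async=False, max_workers=4):
    """Map explain_one over a batch, serially or through a worker pool."""
    if not run_async:
        # no pool is created, so a single instance pays no start-up cost
        return [explain_one(i) for i in instance_ids]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the results
        return list(pool.map(explain_one, instance_ids))

serial = explain_batch(range(8), run_async=False)
pooled = explain_batch(range(8), run_async=True)
```

Both calls return the same results in the same order; only the pooled call pays the cost of setting up workers, which is why the flag is worth turning off for small jobs.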

In [6]:
# control for async processes - each tree walk can be done in its own core
# and so can each explanation (e.g. rule conditions merge by hill-climbing)
# these will default to false if not passed explicitly to the explainer function
# on a multi-core machine there should be a good speed up for large batches
# when the batch_size advantage exceeds the overhead of setting up multi-processing
# timings will be printed to screen so you can see if it helps
forest_walk_async=True
chirps_explanation_async=True

### Preparing unseen data

Again, note:
The test set has never been "seen" by the random forest during training.
The test set has only been used to assess model (random forest) accuracy - no additional tuning after this.
The test set has not been involved in generating the explainer.

#### Batching
The memory space requirements for all the paths can be reduced by dividing the test set into batches. However, this takes longer, as there is an overhead to instantiate all the required objects, especially if coupled with parallel processing.
The best compromise could be a small number of larger batches.
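How the two settings interact can be sketched with a small helper. The function `batch_ranges` is hypothetical and not part of the package; it only shows that the number of instances is capped by the size of the test set and that the last batch may be shorter than the others.

```python
def batch_ranges(n_test, batch_size, n_instances):
    """Yield (start, stop) index pairs covering at most n_instances rows."""
    n_total = min(n_test, n_instances)
    for start in range(0, n_total, batch_size):
        yield start, min(start + batch_size, n_total)

# a test set of 2500 rows explained in batches of 1000
three_batches = list(batch_ranges(n_test=2500, batch_size=1000, n_instances=10000))
# batch_size larger than the test set: everything runs as one batch
one_batch = list(batch_ranges(n_test=2500, batch_size=10000, n_instances=10000))
```

Here `three_batches` is `[(0, 1000), (1000, 2000), (2000, 2500)]` and `one_batch` is `[(0, 2500)]`.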

In [7]:
# the number of instances can be controlled by
# batch_size - how many instances to explain at one time
# set these larger than the size of the test set and it will simply run the whole test set in one batch. Better do option 2
batch_size = 10000
# how many instances to explain in total from a test/unseen set
n_instances = 10000

### 1. Data and Forest prep
Use the random state splits to do a one-off data split.
Fit the RF to training data, using the iterating random state.
Save the performance metrics on the test set for later review.

### 2. Prepare Unseen Data and Predictions
Important to note:
Test set never "seen" by RF during training.
Test set not involved in generating the explainer.
Test set used to evaluate model (random forest) accuracy beyond OOBE scores - no additional tuning based on these results.
Test set used to evaluate explanation scores by leave-one-out method removing the specific instance we're explaining.

Important to note:
We will explain predictions directly from the trained RF. Explanation system makes no compromise on model accuracy.

### 3. CHIRPS algorithm
1. Extract tree prediction paths
2. Frequent pattern mining of paths
3. Score and sort mined path segments
4. Merge path segments into one rule
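Steps 2 and 3 can be illustrated with a toy example. This is a brute-force sketch under stated assumptions, not the miner or the scoring that CHIRPS uses: each path is reduced to a set of hypothetical `(feature, direction)` items, itemsets are counted across the paths, and those below a minimum support are dropped before sorting. The feature names and the function `frequent_itemsets` are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(paths, min_support=0.5, max_len=2):
    """Return {itemset: support} for itemsets found in >= min_support of paths."""
    n = len(paths)
    counts = Counter()
    for path in paths:
        items = sorted(set(path))  # each item counts once per path
        for k in range(1, max_len + 1):
            counts.update(combinations(items, k))
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# four toy paths, one per tree (hypothetical features)
paths = [
    [('age', '<='), ('hours', '>')],
    [('age', '<='), ('hours', '>'), ('edu', '>')],
    [('age', '<='), ('edu', '<=')],
    [('hours', '>')],
]
freq = frequent_itemsets(paths, min_support=0.5)
# sort by support (descending), shorter itemsets first on ties
ranked = sorted(freq.items(), key=lambda kv: (-kv[1], len(kv[0])))
```

With these toy paths, the two single items each have support 0.75 and the pair has support 0.5; everything else falls below the threshold. A rule would then be assembled from the highest ranked segments, which is step 4.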

#### CHIRPS 1
Fit a forest_walker object to the dataset and decision forest. This is a wrapper that will extract the paths of all the given instances. Its main method delivers the instance paths for the remaining steps of the algorithm as a new object: a batch_paths_container. It can also report interesting statistics (treating the forest as a set of random tree-structured variables).
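For readers who want to see what a path is, here is a sketch that uses scikit-learn directly rather than the forest_walker object. The helper `instance_paths` is hypothetical and the iris data is only a stand-in. It reads the nodes that one instance visits in each tree and records the split feature, the branch taken and the threshold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def instance_paths(forest, x):
    """Per tree, list the (feature, went_left, threshold) splits that x passes."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    paths = []
    for est in forest.estimators_:
        tree = est.tree_
        node_ids = est.decision_path(x).indices  # nodes visited, root to leaf
        path = []
        # pair each internal node with the node visited next to get the branch
        for node, nxt in zip(node_ids[:-1], node_ids[1:]):
            went_left = bool(nxt == tree.children_left[node])
            path.append((int(tree.feature[node]), went_left,
                         float(tree.threshold[node])))
        paths.append(path)
    return paths

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=5, random_state=123).fit(X, y)
paths = instance_paths(rf, X[0])
```

In scikit-learn a sample goes left when its feature value is less than or equal to the threshold, so each tuple reads as an inequality on one feature.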

#### CHIRPS 2-4
A batch_CHIRPS_container is fitted with the batch_paths_container returned by the forest walker, and with a sample of data. For CHIRPS, we prefer a large sample. The whole training set or another representative sample will do. This wrapper object executes steps 2-4 on each of the instance paths in the batch_paths_container.

Important to note:
true_divide warnings are OK! It just means that a continuous variable is unbounded on one side i.e. no greater/less than inequality is used in the specific CHIRPS explanation.

Important note: 
Here we are using the training set to create the explainers. We could use a different dataset as long as it is representative of the training set that built the decision forest. Most importantly, we must not use the dataset that we wish to explain, so never use the test set, for example.

### 4. Evaluating CHIRPS Explanations
The test set has been used to create an explainer *one instance at a time*, and the rest of the test set was not "seen" during this construction. To score each explainer, we use the test set, leaving out the individual instance being explained. The data_split_container (tt) has a convenience function for doing this. All the results are saved to CSV files in the project directory.
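The leave-one-out scoring can be sketched with NumPy. This is an assumption-laden illustration, not the convenience function on the data_split_container: the helper `loo_rule_scores` is hypothetical, and coverage and precision are assumed here as typical rule metrics, since this section does not list the scores that are actually saved.

```python
import numpy as np

def loo_rule_scores(covered, y_true, target_class, instance_idx):
    """Coverage and precision of a rule on a test set, excluding one instance."""
    covered = np.asarray(covered, dtype=bool)
    y_true = np.asarray(y_true)
    keep = np.ones(len(y_true), dtype=bool)
    keep[instance_idx] = False  # leave the explained instance out
    cov = covered[keep]
    coverage = cov.mean()
    if cov.sum() > 0:
        precision = (y_true[keep][cov] == target_class).mean()
    else:
        precision = float('nan')  # rule covers nothing else
    return coverage, precision

# toy test set of five rows; the rule was built for row 0 (hypothetical values)
covered = [True, True, False, True, False]  # rows that satisfy the rule
y_true = [1, 1, 0, 0, 1]
coverage, precision = loo_rule_scores(covered, y_true, target_class=1,
                                      instance_idx=0)
```

Leaving row 0 out, the rule covers two of the four remaining rows and one of those two has the target class, so both scores are 0.5. Without the exclusion the explained instance would always count in the rule's favour.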

In [10]:
run_CHIRPS = False
run_Anchors = False
run_defragTrees = True

CHIRPS_sensitivity = False

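# parameter settings for the sensitivity analysis, one array per parameter
# note: the zip below stops at range(36), so only the first 36 entries are used
# with these repeat/tile periods the 36 settings are not a full factorial:
# score_func=5 is only paired with 'chisq' and score_func=1 only with 'nothing'
alpha_paths = np.tile([0.9, 0.5, 0.1], 24)
disc_path_bins = np.tile(np.repeat([4, 8], 3), 12)
score_func = np.tile(np.repeat([5, 3, 1], 6), 4)
weighting = np.tile(np.repeat(['chisq', 'nothing'], 9), 2)
support_paths = np.repeat([0.1, 0.05], 18)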

kwargs_grid = {k : {'alpha_paths' : ap, 'disc_path_bins' : dpb, 'score_func' : sf, 'weighting' : w, 'support_paths' : sp} 
    for k, ap, dpb, sf, w, sp in zip(range(36), alpha_paths, disc_path_bins, score_func, weighting, support_paths)}

kwargs_default = {'support_paths' : 0.1, 'alpha_paths' : 0.5, 'disc_path_bins' : 4, 'score_func' : 1, 'weighting' : 'chisq' }

for random_state in range(123, 124):
    for d_constructor in rp.datasets:
        print('Running experiment for ' + d_constructor.__name__ + ' with random state = ' + str(random_state))
        print()
        # 1. Data and Forest prep
        print('Split data into main train-test and build RF') 
        mydata = d_constructor(random_state=random_state, project_dir=project_dir)
        
        meta_data = mydata.get_meta()
        save_path = meta_data['get_save_path']()
        train_index, test_index = mydata.get_tt_split_idx(random_state=random_state_splits)
        tt = mydata.tt_split(train_index, test_index)
        
        # this will train and score the model
        rf = rp.forest_prep(ds_container=tt,
                            meta_data=meta_data,
                            save_path=save_path)

        print()
        if run_CHIRPS:
            if CHIRPS_sensitivity:
                for kwg in kwargs_grid:
                    
                    rp.CHIRPS_benchmark(forest=rf, ds_container=tt, meta_data=meta_data,
                                        batch_size=batch_size, n_instances=n_instances,
                                        forest_walk_async=forest_walk_async,
                                        chirps_explanation_async=chirps_explanation_async,
                                        save_path=save_path,
                                        save_sensitivity_path=extend_path(stem=save_path, \
                                                              extensions=['sensitivity', \
                                                                'sp_' + str(kwargs_grid[kwg]['support_paths']) + \
                                                                '_ap_' + str(kwargs_grid[kwg]['alpha_paths']) + \
                                                                '_dpb_' + str(kwargs_grid[kwg]['disc_path_bins']) + \
                                                                '_sf_' + str(kwargs_grid[kwg]['score_func']) + \
                                                                '_w_' + str(kwargs_grid[kwg]['weighting']) + '_']),
                                        dataset_name=d_constructor.__name__,
                                        random_state=random_state, **kwargs_grid[kwg])
                    
                    # create a new ds_container as the last one is used up
                    tt = mydata.tt_split(train_index, test_index)
            else:
                rp.CHIRPS_benchmark(forest=rf, ds_container=tt, meta_data=meta_data,
                                    batch_size=batch_size, n_instances=n_instances,
                                    forest_walk_async=forest_walk_async,
                                    chirps_explanation_async=chirps_explanation_async,
                                    save_path=save_path,
                                    dataset_name=d_constructor.__name__,
                                    random_state=random_state, **kwargs_default)
                    
        if run_Anchors:
            # new copy of ds_container (need to reset the row counters)
            tt_anch = mydata.tt_split(train_index, test_index)
            # preprocessing - discretised continuous X matrix has been added and also needs an updated var_dict 
            # plus returning the fitted explainer that holds the data distribution
            tt_anch, anchors_explainer = rp.Anchors_preproc(ds_container=tt_anch,
                                                             meta_data=meta_data)
    
            # re-fitting the random forest to the discretised data and evaluating
            rf = rp.forest_prep(ds_container=tt_anch,
                            meta_data=meta_data,
                            save_path=save_path,
                            identifier='Anchors')

            rp.Anchors_benchmark(forest=rf, ds_container=tt_anch, meta_data=meta_data,
                                anchors_explainer=anchors_explainer,
                                batch_size=batch_size, n_instances=n_instances,
                                save_path=save_path, dataset_name=d_constructor.__name__,
                                random_state=meta_data['random_state'])
        
        if run_defragTrees:
            # create a new copy of tt split, because each one keeps track of which instances it has given out.
            # re-using the top one means different instances are passed
            tt_dfrgtrs = mydata.tt_split(train_index, test_index)
            
            # some dfrgtrs specific parameters
            Kmax = 10
            restart = 20
            maxitr = 100
            dfrgtrs, eval_start_time, defTrees_elapsed_time = rp.defragTrees_prep(ds_container=tt_dfrgtrs,
                                                                                meta_data=meta_data, forest=rf, 
                                                                                Kmax=Kmax, restart=restart, maxitr=maxitr,
                                                                                identifier='defragTrees', save_path=save_path)
            
                        
            rp.defragTrees_benchmark(forest=rf, ds_container=tt_dfrgtrs, meta_data=meta_data,
                                    dfrgtrs=dfrgtrs, eval_start_time=eval_start_time,
                                    defTrees_elapsed_time=defTrees_elapsed_time,
                                    batch_size=batch_size, n_instances=n_instances,
                                    save_path=save_path, dataset_name=d_constructor.__name__,
                                    random_state=random_state)


Running experiment for adult_small_samp with random state = 123

Split data into main train-test and build RF
using previous tuning parameters





Running defragTrees

[Seed 123] TrainingError = 0.24, K = 1
[Seed 124] TrainingError = 0.24, K = 3
[Seed 125] TrainingError = 0.24, K = 2
[Seed 126] TrainingError = 0.24, K = 1
[Seed 127] TrainingError = 0.24, K = 2
[Seed 128] TrainingError = 0.24, K = 3
[Seed 129] TrainingError = 0.24, K = 1
[Seed 130] TrainingError = 0.24, K = 2
[Seed 131] TrainingError = 0.24, K = 2
[Seed 132] TrainingError = 0.24, K = 2
[Seed 133] TrainingError = 0.24, K = 2
[Seed 134] TrainingError = 0.24, K = 2
[Seed 135] TrainingError = 0.24, K = 1
[Seed 136] TrainingError = 0.24, K = 2
[Seed 137] TrainingError = 0.24, K = 2
[Seed 138] TrainingError = 0.24, K = 2
[Seed 139] TrainingError = 0.24, K = 1
[Seed 140] TrainingError = 0.24, K = 1
[Seed 141] TrainingError = 0.24, K = 1
[Seed 142] TrainingError = 0.24, K = 2
Optimal Model >> Seed 123, TrainingError = 0.24, K = 1
Fit defragTrees time elapsed: 1146.8500 seconds

defragTrees test accuracy
Accuracy = 0.769441
Coverage = 1.000000
Overlap = 0.000000
defragTre

  'precision', 'predicted', average, warn_for)


Working on defragTrees for instance 263
Working on defragTrees for instance 905
Working on defragTrees for instance 1205
Working on defragTrees for instance 1727
Working on defragTrees for instance 1866
Working on defragTrees for instance 2087
Working on defragTrees for instance 2357
Working on defragTrees for instance 1935
Working on defragTrees for instance 742
Working on defragTrees for instance 1597
Working on defragTrees for instance 208
Working on defragTrees for instance 1012
Working on defragTrees for instance 1369
Working on defragTrees for instance 21
Working on defragTrees for instance 1383
Working on defragTrees for instance 1191
Working on defragTrees for instance 1234
Working on defragTrees for instance 1620
Working on defragTrees for instance 429
Working on defragTrees for instance 1074
Working on defragTrees for instance 1633
Working on defragTrees for instance 1678
Working on defragTrees for instance 1617
Working on defragTrees for instance 2209
Working on defragTrees 




Running defragTrees

[Seed 123] TrainingError = 0.10, K = 1
[Seed 124] TrainingError = 0.10, K = 2
[Seed 125] TrainingError = 0.10, K = 1
[Seed 126] TrainingError = 0.10, K = 1
[Seed 127] TrainingError = 0.10, K = 2
[Seed 128] TrainingError = 0.10, K = 2
[Seed 129] TrainingError = 0.10, K = 1
[Seed 130] TrainingError = 0.10, K = 2
[Seed 131] TrainingError = 0.10, K = 2
[Seed 132] TrainingError = 0.10, K = 2
[Seed 133] TrainingError = 0.10, K = 2
[Seed 134] TrainingError = 0.10, K = 1
[Seed 135] TrainingError = 0.10, K = 2
[Seed 136] TrainingError = 0.10, K = 1
[Seed 137] TrainingError = 0.10, K = 1
[Seed 138] TrainingError = 0.10, K = 2
[Seed 139] TrainingError = 0.10, K = 2
[Seed 140] TrainingError = 0.10, K = 2
[Seed 141] TrainingError = 0.10, K = 2
[Seed 142] TrainingError = 0.10, K = 1
Optimal Model >> Seed 123, TrainingError = 0.10, K = 1
Fit defragTrees time elapsed: 842.8470 seconds

defragTrees test accuracy
Accuracy = 0.880882
Coverage = 1.000000
Overlap = 0.000000
defragTree

  'precision', 'predicted', average, warn_for)


Working on defragTrees for instance 141
Working on defragTrees for instance 890
Working on defragTrees for instance 1676
Working on defragTrees for instance 1665
Working on defragTrees for instance 1158
Working on defragTrees for instance 1765
Working on defragTrees for instance 2104
Working on defragTrees for instance 1573
Working on defragTrees for instance 1950
Working on defragTrees for instance 1014
Working on defragTrees for instance 691
Working on defragTrees for instance 701
Working on defragTrees for instance 1097
Working on defragTrees for instance 1394
Working on defragTrees for instance 417
Working on defragTrees for instance 1738
Working on defragTrees for instance 809
Working on defragTrees for instance 2067
Working on defragTrees for instance 43
Working on defragTrees for instance 1167
Working on defragTrees for instance 846
Working on defragTrees for instance 75
Working on defragTrees for instance 591
Working on defragTrees for instance 1694
Working on defragTrees for i




Running defragTrees

[Seed 123] TrainingError = 0.29, K = 2
[Seed 124] TrainingError = 0.29, K = 4
[Seed 125] TrainingError = 0.29, K = 3
[Seed 126] TrainingError = 0.29, K = 3
[Seed 127] TrainingError = 0.29, K = 3
[Seed 128] TrainingError = 0.29, K = 3
[Seed 129] TrainingError = 0.29, K = 3
[Seed 130] TrainingError = 0.29, K = 4
[Seed 131] TrainingError = 0.29, K = 2
[Seed 132] TrainingError = 0.29, K = 2
[Seed 133] TrainingError = 0.29, K = 4
[Seed 134] TrainingError = 0.29, K = 3
[Seed 135] TrainingError = 0.29, K = 2
[Seed 136] TrainingError = 0.29, K = 3
[Seed 137] TrainingError = 0.29, K = 2
[Seed 138] TrainingError = 0.28, K = 4
[Seed 139] TrainingError = 0.29, K = 2
[Seed 140] TrainingError = 0.29, K = 2
[Seed 141] TrainingError = 0.29, K = 2
[Seed 142] TrainingError = 0.28, K = 3
Optimal Model >> Seed 142, TrainingError = 0.28, K = 3
Fit defragTrees time elapsed: 183.3384 seconds

defragTrees test accuracy
Accuracy = 0.704633
Coverage = 1.000000
Overlap = 0.355212
defragTree

  'precision', 'predicted', average, warn_for)


Working on defragTrees for instance 63
Working on defragTrees for instance 1726
Working on defragTrees for instance 594
Working on defragTrees for instance 772
Working on defragTrees for instance 1155
Working on defragTrees for instance 1503
Working on defragTrees for instance 1391
Working on defragTrees for instance 1286
Working on defragTrees for instance 1848
Working on defragTrees for instance 433
Working on defragTrees for instance 1298
Working on defragTrees for instance 1702
Working on defragTrees for instance 381
Working on defragTrees for instance 565
Working on defragTrees for instance 1737
Working on defragTrees for instance 1026
Working on defragTrees for instance 680
Working on defragTrees for instance 1640
Working on defragTrees for instance 556
Working on defragTrees for instance 1216
Working on defragTrees for instance 704
Working on defragTrees for instance 550
Working on defragTrees for instance 838
Working on defragTrees for instance 1659
Working on defragTrees for i




Running defragTrees

[Seed 123] TrainingError = 0.43, K = 1
[Seed 124] TrainingError = 0.43, K = 1
[Seed 125] TrainingError = 0.14, K = 2
[Seed 126] TrainingError = 0.43, K = 1
[Seed 127] TrainingError = 0.43, K = 1
[Seed 128] TrainingError = 0.43, K = 1
[Seed 129] TrainingError = 0.43, K = 1
[Seed 130] TrainingError = 0.43, K = 1
[Seed 131] TrainingError = 0.43, K = 1
[Seed 132] TrainingError = 0.43, K = 1
[Seed 133] TrainingError = 0.43, K = 1
[Seed 134] TrainingError = 0.43, K = 1
[Seed 135] TrainingError = 0.43, K = 1
[Seed 136] TrainingError = 0.43, K = 1
[Seed 137] TrainingError = 0.14, K = 2
[Seed 138] TrainingError = 0.43, K = 1
[Seed 139] TrainingError = 0.43, K = 1
[Seed 140] TrainingError = 0.43, K = 1
[Seed 141] TrainingError = 0.43, K = 1
[Seed 142] TrainingError = 0.43, K = 1
Optimal Model >> Seed 125, TrainingError = 0.14, K = 2
Fit defragTrees time elapsed: 141.5855 seconds

defragTrees test accuracy
Accuracy = 0.859903
Coverage = 0.990338
Overlap = 0.463768
defragTree




Running defragTrees

[Seed 123] TrainingError = 0.30, K = 1
[Seed 124] TrainingError = 0.30, K = 2
[Seed 125] TrainingError = 0.30, K = 1
[Seed 126] TrainingError = 0.30, K = 1
[Seed 127] TrainingError = 0.30, K = 2
[Seed 128] TrainingError = 0.30, K = 1
[Seed 129] TrainingError = 0.30, K = 1
[Seed 130] TrainingError = 0.30, K = 1
[Seed 131] TrainingError = 0.30, K = 2
[Seed 132] TrainingError = 0.30, K = 2
[Seed 133] TrainingError = 0.30, K = 1
[Seed 134] TrainingError = 0.30, K = 1
[Seed 135] TrainingError = 0.30, K = 1
[Seed 136] TrainingError = 0.30, K = 2
[Seed 137] TrainingError = 0.30, K = 1
[Seed 138] TrainingError = 0.30, K = 1
[Seed 139] TrainingError = 0.30, K = 3
[Seed 140] TrainingError = 0.30, K = 2
[Seed 141] TrainingError = 0.30, K = 2
[Seed 142] TrainingError = 0.30, K = 2
Optimal Model >> Seed 123, TrainingError = 0.30, K = 1
Fit defragTrees time elapsed: 233.7488 seconds

defragTrees test accuracy
Accuracy = 0.706667
Coverage = 1.000000
Overlap = 0.000000
defragTree

  'precision', 'predicted', average, warn_for)


Working on defragTrees for instance 131
Working on defragTrees for instance 195
Working on defragTrees for instance 372
Working on defragTrees for instance 721
Working on defragTrees for instance 770
Working on defragTrees for instance 161
Working on defragTrees for instance 470
Working on defragTrees for instance 345
Working on defragTrees for instance 437
Working on defragTrees for instance 992
Working on defragTrees for instance 538
Working on defragTrees for instance 827
Working on defragTrees for instance 829
Working on defragTrees for instance 952
Working on defragTrees for instance 298
Working on defragTrees for instance 922
Working on defragTrees for instance 204
Working on defragTrees for instance 758
Working on defragTrees for instance 816
Working on defragTrees for instance 388
Working on defragTrees for instance 511
Working on defragTrees for instance 75
Working on defragTrees for instance 880
Working on defragTrees for instance 13
Working on defragTrees for instance 719
Wo




Running defragTrees

[Seed 123] TrainingError = 0.21, K = 2
[Seed 124] TrainingError = 0.21, K = 2
[Seed 125] TrainingError = 0.21, K = 1
[Seed 126] TrainingError = 0.21, K = 1
[Seed 127] TrainingError = 0.21, K = 2
[Seed 128] TrainingError = 0.21, K = 3
[Seed 129] TrainingError = 0.21, K = 1
[Seed 130] TrainingError = 0.21, K = 1
[Seed 131] TrainingError = 0.21, K = 2
[Seed 132] TrainingError = 0.21, K = 2
[Seed 133] TrainingError = 0.21, K = 1
[Seed 134] TrainingError = 0.21, K = 2
[Seed 135] TrainingError = 0.21, K = 3
[Seed 136] TrainingError = 0.21, K = 2
[Seed 137] TrainingError = 0.21, K = 2
[Seed 138] TrainingError = 0.21, K = 2
[Seed 139] TrainingError = 0.21, K = 1
[Seed 140] TrainingError = 0.21, K = 2
[Seed 141] TrainingError = 0.21, K = 3
[Seed 142] TrainingError = 0.21, K = 2
Optimal Model >> Seed 123, TrainingError = 0.21, K = 2
Fit defragTrees time elapsed: 1331.9734 seconds

defragTrees test accuracy
Accuracy = 0.816456
Coverage = 0.957278
Overlap = 0.549051
defragTre

  'precision', 'predicted', average, warn_for)


Working on defragTrees for instance 2037
Working on defragTrees for instance 698
Working on defragTrees for instance 1516
Working on defragTrees for instance 1758
Working on defragTrees for instance 256
Working on defragTrees for instance 1712
Working on defragTrees for instance 711
Working on defragTrees for instance 1021
Working on defragTrees for instance 292
Working on defragTrees for instance 868
Working on defragTrees for instance 1835
Working on defragTrees for instance 1873
Working on defragTrees for instance 1476
Working on defragTrees for instance 235
Working on defragTrees for instance 2040
Working on defragTrees for instance 356
Working on defragTrees for instance 775
Working on defragTrees for instance 1924
Working on defragTrees for instance 1998
Working on defragTrees for instance 1511
Working on defragTrees for instance 1029
Working on defragTrees for instance 182
Working on defragTrees for instance 1171
Working on defragTrees for instance 28
Working on defragTrees for 




Running defragTrees

[Seed 123] TrainingError = 0.66, K = 2
[Seed 124] TrainingError = 0.66, K = 1
[Seed 125] TrainingError = 0.64, K = 2
[Seed 126] TrainingError = 0.64, K = 2
[Seed 127] TrainingError = 0.66, K = 1
[Seed 128] TrainingError = 0.66, K = 2
[Seed 129] TrainingError = 0.63, K = 3
[Seed 130] TrainingError = 0.64, K = 2
[Seed 131] TrainingError = 0.65, K = 2
[Seed 132] TrainingError = 0.64, K = 2
[Seed 133] TrainingError = 0.65, K = 2
[Seed 134] TrainingError = 0.64, K = 2
[Seed 135] TrainingError = 0.66, K = 1
[Seed 136] TrainingError = 0.66, K = 1
[Seed 137] TrainingError = 0.65, K = 2
[Seed 138] TrainingError = 0.66, K = 1
[Seed 139] TrainingError = 0.64, K = 2
[Seed 140] TrainingError = 0.64, K = 2
[Seed 141] TrainingError = 0.66, K = 2
[Seed 142] TrainingError = 0.64, K = 2
Optimal Model >> Seed 129, TrainingError = 0.63, K = 3
Fit defragTrees time elapsed: 2073.2457 seconds

defragTrees test accuracy
Accuracy = 0.367609
Coverage = 1.000000
Overlap = 0.000000
defragTre

  'precision', 'predicted', average, warn_for)


Working on defragTrees for instance 33
Working on defragTrees for instance 2300
Working on defragTrees for instance 379
Working on defragTrees for instance 2347
Working on defragTrees for instance 403
Working on defragTrees for instance 440
Working on defragTrees for instance 1501
Working on defragTrees for instance 1951
Working on defragTrees for instance 2325
Working on defragTrees for instance 813
Working on defragTrees for instance 673
Working on defragTrees for instance 2423
Working on defragTrees for instance 1000
Working on defragTrees for instance 746
Working on defragTrees for instance 252
Working on defragTrees for instance 2330
Working on defragTrees for instance 1628
Working on defragTrees for instance 132
Working on defragTrees for instance 2110
Working on defragTrees for instance 2525
Working on defragTrees for instance 649
Working on defragTrees for instance 1185
Working on defragTrees for instance 2362
Working on defragTrees for instance 2358
Working on defragTrees for 

  'precision', 'predicted', average, warn_for)


Working on defragTrees for instance 1752
Working on defragTrees for instance 1202
Working on defragTrees for instance 1728
Working on defragTrees for instance 512
Working on defragTrees for instance 827
Working on defragTrees for instance 637
Working on defragTrees for instance 266
Working on defragTrees for instance 1629
Working on defragTrees for instance 717
Working on defragTrees for instance 522
Working on defragTrees for instance 328
Working on defragTrees for instance 1518
Working on defragTrees for instance 746
Working on defragTrees for instance 958
Working on defragTrees for instance 553
Working on defragTrees for instance 600
Working on defragTrees for instance 1587
Working on defragTrees for instance 655
Working on defragTrees for instance 1102
Working on defragTrees for instance 668
Working on defragTrees for instance 155
Working on defragTrees for instance 1412
Working on defragTrees for instance 1259
Working on defragTrees for instance 1501
Working on defragTrees for ins