# Implementation Documentation

Sections:
1. Configs
2. Data
3. Outputs
4. Scripts

## 1. Configs

### Example config

In [1]:
import json
# The output below is the config for the adult dataset
dataset_name = 'adult'

with open('../Configs/' + dataset_name + '.json') as config_file:
    config = json.load(config_file)

In [2]:
print(json.dumps(config, indent=4, sort_keys=False))

{
    "name": "adult",
    "type": "classification",
    "primary_paths": "adult_local_paths_1000.csv",
    "secondary_paths": "adult_local_paths_per_instance_1000.csv",
    "data_with_headers": "adult_headers.csv",
    "filtered_data_with_headers": "adult_filtered.csv",
    "perturbed_data": "adult_perturbed.csv",
    "local_bins": "adult_local_bin_labels_1000.csv",
    "perturbed_paths": "adult_perturbed_paths_1000.csv",
    "perturbed_local_bins": "adult_perturbed_bin_labels_1000.csv",
    "tree_depths": "adult_local_depths_1000.csv",
    "tree_widths": "adult_tree_widths.csv",
    "leaf_nodes": "adult_leaf_nodes.csv",
    "variable_importance": "adult_variable_importance_1000.csv",
    "num_features": 14,
    "target_col": 15,
    "path_regex": "([0-9]{1,2})([A-Z]+)([01])",
    "primary_weight": 0.25,
    "secondary_weight": 0.01,
    "frame_folder": "local_dt_info_adult",
    "size": 30162,
    "sample": 1000,
    "classes": [
        "<=50K",
        ">50K"
    ],
    "columns": 

### Config details

- name: name of the dataset
- type: type of task (accepted values: classification, regression)
- primary_paths: name of csv of primary paths
- secondary_paths: name of csv of secondary paths
- data_with_headers: name of csv of data with header
- filtered_data_with_headers: name of csv of filtered data (no NA) with header
- perturbed_data: name of csv of perturbed data
- local_bins: name of csv of local bins
- perturbed_paths: name of csv with original and perturbed paths
- perturbed_local_bins: name of csv with bins corresponding to the file perturbed_paths
- tree_depths: name of csv with tree depth data
- tree_widths: name of csv with maximum tree width data
- leaf_nodes: name of csv with leaf node data
- variable_importance: name of csv containing variable importance for local trees
- num_features: number of features in the dataset
- target_col: target column
- path_regex: regex used to convert each node of the path from string to tuple form
- primary_weight: weight of the primary instance in its own local tree
- secondary_weight: weight of a secondary instance in the local tree of another instance
- frame_folder: name of folder containing frames for local trees
- size: size of filtered dataset
- sample: number of rows of dataset used
- classes: class labels
- columns: names of dataset columns
- factors: name of csv with factor name to numerical index mapping for each factor (optional: only if dataset has categorical data)
- data_numeric: name of csv of data in numeric format (optional: only if dataset has categorical data)
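
The key list above can be turned into a quick sanity check for a loaded config. This is only a minimal sketch: `validate_config`, `REQUIRED_KEYS` and `ACCEPTED_TYPES` are hypothetical names, and treating every key not marked optional as required is an assumption rather than something the scripts are known to enforce.

```python
# Keys described above without an "optional" note (assumed to be required).
REQUIRED_KEYS = {
    "name", "type", "primary_paths", "secondary_paths", "data_with_headers",
    "filtered_data_with_headers", "perturbed_data", "local_bins",
    "perturbed_paths", "perturbed_local_bins", "tree_depths", "tree_widths",
    "leaf_nodes", "variable_importance", "num_features", "target_col",
    "path_regex", "primary_weight", "secondary_weight", "frame_folder",
    "size", "sample", "classes", "columns",
}
# Accepted values for the "type" key, as listed above.
ACCEPTED_TYPES = {"classification", "regression"}


def validate_config(config):
    """Return a list of problems found in a config dict (empty if none)."""
    problems = []
    missing = sorted(REQUIRED_KEYS - set(config))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    if config.get("type") not in ACCEPTED_TYPES:
        problems.append("type must be one of: " + ", ".join(sorted(ACCEPTED_TYPES)))
    return problems
```

Calling `validate_config(config)` right after `json.load` would surface a misspelled or missing key before any file is read.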

## 2. Data

In [3]:
import pandas as pd

### a. Data with header

In [4]:
pd.read_csv("../Data/"+config['data_with_headers'])

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-num,Marital-status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-country,Target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### b. Filtered data with header

Rows containing ? (missing) values are removed.

In [5]:
pd.read_csv("../Data/"+config['filtered_data_with_headers'])

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30157,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
30158,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
30159,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
30160,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### c. Perturbed data

Column 14 holds the original target class labels; column 15 holds the class labels predicted after perturbation.

In [6]:
pd.read_csv("../Data/"+config['perturbed_data'])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,27.0,Private,0.0,11th,18.0,Never-married,Priv-house-serv,Own-child,Asian-Pac-Islander,Male,26846.0,260.0,36.0,Guatemala,<=50K,>50K
1,83.0,Federal-gov,102036.0,11th,21.0,Married-spouse-absent,Transport-moving,Husband,Asian-Pac-Islander,Male,17371.0,1063.0,48.0,Ecuador,<=50K,>50K
2,0.0,Self-emp-inc,0.0,12th,18.0,Divorced,Protective-serv,Other-relative,White,Male,136862.0,0.0,173.0,Guatemala,<=50K,>50K
3,143.0,Self-emp-not-inc,0.0,7th-8th,15.0,Separated,Machine-op-inspct,Other-relative,Amer-Indian-Eskimo,Female,14175.0,351.0,112.0,India,<=50K,>50K
4,48.0,Without-pay,202380.0,5th-6th,7.0,Married-civ-spouse,Protective-serv,Own-child,White,Female,8591.0,1404.0,69.0,Portugal,<=50K,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,66.0,State-gov,0.0,7th-8th,15.0,Divorced,Priv-house-serv,Unmarried,Black,Female,24498.0,0.0,17.0,El-Salvador,<=50K,>50K
996,3.0,Without-pay,826889.0,Bachelors,0.0,Divorced,Handlers-cleaners,Wife,Amer-Indian-Eskimo,Female,88760.0,0.0,94.0,Peru,<=50K,>50K
997,69.0,Self-emp-not-inc,325515.0,HS-grad,8.0,Divorced,Farming-fishing,Husband,White,Male,9542.0,0.0,80.0,Guatemala,<=50K,>50K
998,41.0,Local-gov,195258.0,10th,10.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K,<=50K
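
To see how often perturbation changes the predicted class, the two label columns can be tallied pairwise. The sketch below is illustrative only: `label_transitions` is a hypothetical helper, and the three pairs are copied from columns 14 and 15 of rows 0, 1 and 998 above.

```python
from collections import Counter


def label_transitions(pairs):
    """Count (original, predicted) label pairs."""
    return Counter((original, predicted) for original, predicted in pairs)


# Columns 14 and 15 of rows 0, 1 and 998 in the output above.
pairs = [("<=50K", ">50K"), ("<=50K", ">50K"), (">50K", "<=50K")]
transitions = label_transitions(pairs)

# A pair counts as a flip when the predicted label differs from the original.
flipped = sum(n for (original, predicted), n in transitions.items()
              if original != predicted)
print(transitions)
print(flipped)
```

With the full DataFrame, the pairs could be supplied as `zip(df['14'], df['15'])`, since `read_csv` takes the column names from the header as strings.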


### d. Numeric Data (Optional)

Data obtained by applying the factor mapping to a dataset.

In [7]:
df = None
if 'data_numeric' in config:
    df = pd.read_csv("../Data/"+config['data_numeric'])
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,8,77516,10,13,5,2,2,5,2,2174,0,40,40,1
1,50,7,83311,10,13,3,5,1,5,2,0,0,13,40,1
2,38,5,215646,12,9,1,7,2,5,2,0,0,40,40,1
3,53,5,234721,2,7,3,7,1,3,2,0,0,40,40,1
4,28,5,338409,10,13,3,11,6,3,1,0,0,40,6,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30157,27,5,257302,8,12,3,14,6,5,1,0,0,38,40,1
30158,40,5,154374,12,9,3,8,1,5,2,0,0,40,40,2
30159,58,5,151910,12,9,7,2,5,5,1,0,0,40,40,1
30160,22,5,201490,12,9,5,2,4,5,2,0,0,20,40,1


## 3. Outputs

### a. Primary paths

In [8]:
pd.read_csv("../Outputs/"+config['primary_paths'])

Unnamed: 0,paths
0,"7A0,14B0,8A0"
1,"11C0,13D0"
2,"6E0,4F0"
3,"7G0,12H0,2I0,7G0"
4,"7J0,4K0,13L0"
...,...
995,"6M0,11BH0,2I0"
996,"11C0,4F0,7I0,13BK0"
997,"4F0,8E1,7M0"
998,"6Q1,7F1,1EL1,13AEL1"
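
Each path is a comma-separated list of nodes, and the `path_regex` from the config splits a node into three parts. The sketch below shows one way to apply it; `parse_path` is a hypothetical helper, the use of `fullmatch` is a choice made here, and the conversion of the first and last groups to integers is an assumption about the intended tuple form.

```python
import re

# Value of "path_regex" in the example config.
PATH_REGEX = re.compile(r"([0-9]{1,2})([A-Z]+)([01])")


def parse_path(path):
    """Split a comma-separated path string into (int, str, int) tuples."""
    nodes = []
    for node in path.split(","):
        match = PATH_REGEX.fullmatch(node)
        if match is None:
            raise ValueError(f"node {node!r} does not match path_regex")
        number, letters, flag = match.groups()
        nodes.append((int(number), letters, int(flag)))
    return nodes


print(parse_path("7A0,14B0,8A0"))
# [(7, 'A', 0), (14, 'B', 0), (8, 'A', 0)]
```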


### b. Secondary paths

In [9]:
pd.read_csv("../Outputs/"+config['secondary_paths'], index_col=0)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X991,X992,X993,X994,X995,X996,X997,X998,X999,X1000
1,"11BC0,8T0,7A0","4L1,6C0,10A0","6C0,2U0,3BH0","8I1,1AX1,7U1,12AY0,2C1","7U0,14B0,8Q0","7U0,4N0,13AS0","11Y0,13Z1,7U0,4A0,6C0,13AB0","7Q0,1BD0,3FS1,2C0","7J1,1D1,4U0","11Y0,13Z1,7V0,4P0,6AC0,13AB0",...,"11BC0,1Z0","7J1,8Q0,12AN0","7A0,2A1","7U0,12AM0,2C0,7U0","11Y0,13Z1,7U0,4N0,6C0,13AB0","11Y0,13Z0","7U0,14B0,8I1,7U0,4N0,2C0",7K0,"6I0,11BX0","11BK0,7J0,8I0"
2,"11BM0,7A0,4E0,2U0","4L1,6C0,10A0","6C0,2U0,3BH0","7U0,2C0,1DI0","7C1,8T0,4P0","11BK0,7Q0,8I0","11Y0,13Z1,7U0,4A0,6C0,13AB0","13AO1,4N0","11ET0,8T0,1CU0","8I1,1AX0,4L0",...,"11BC0,1Z0","7V1,6I1,14B1,8C1","7K0,1EZ0,11DO0","7J1,6C0,1DS0","1BZ0,2A0","7U0,7U1,1CP1","7H0,12AM0,2C0,7H0","7L0,1EZ1,7L1,2C1","7V1,6AC1,14B1,8I1","11FJ0,2A0,7T0"
3,"1BZ0,2C0","11Y0,2H0,4E0,7K0,13W1,3FV0,11FW0,13FX0","13FI0,12GY0,2C0,14B0,4N0,9C0","7A0,4E0,14B0,3DV1,7A0","8A0,4L0","7C1,8C0,1DM0","7K1,1DH0","6I0,7V1,1EP1","7J1,1D1,4U0","11Y0,13Z1,7V0,4P0,6AC0,13AB0",...,"13FL0,11ET0,4E0,2U1,1FK1","11ET0,8C0,1CU1,3EV0","7A0,2A1","13CP1,3EM0,2T1","7K0,8I1,7K1,1DH0","11BC0,1Z0","4K1,6I1",7K0,"6AC1,4E1,1CR1,3CS0,2C0,7K0","11BK0,7J0,8I0"
4,"8C0,4N0,7U0","13FI0,12GY0,2H0,14B0,4N0,9C0","7J1,6AC1,3JQ1,4P1","13CP1,3EM0,2C1","8I1,7Q0,7Q1,1EW1,2C1,13CP0","13CP1,3EM0,2C1","7Q0,7Q0","6I0,12ABM0","11ET0,8T0,1CU0","8I1,1AX0,4L0",...,"4L1,8A1,1DU0,1FK1","7J1,8Q0,12AN0","7K0,1EZ0,11DO0","7J1,6AC1,1FG1","11Y0,13Z1,7U0,4N0,6C0,13AB0","8T0,4E1,13CP0","8A0,4H0","7L0,1EZ1,7L1,2C1","6I0,11BX0","11FJ0,2A0,7T0"
5,"11BC0,8T0,7A0","11IZ0,4L1,6AC1,3JK0","7U0,11FW0,2C0,12NW0,1CO0","7V0,4P0,9C0,7V1,1EP0,3EQ0","6C0,2C0,7Q0","7U0,4N0,13AS0","2C0,6C0,13AH0","8Q0,2C0","7A0,3IO1,3IP0,11IQ0,13IR0,7A0","4P0,6AC1,7C1,7C1",...,"8I1,7V1,1DY0","7V1,6I1,14B1,8C1","7J1,3GP0,8I1,3GR0,7J0,1AV1","7U0,12AM0,2C0,7U0","1BZ0,2A0","8I1,7A1","7U0,14B0,8I1,7U0,4N0,2C0","7R0,1BD0,3FS1,2C0","7V1,6AC1,14B1,8I1","7L0,3IO1,3IP0,11IQ0,13IR0,7L0"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,,,,,,,,,,,...,,,,,,,,,,
97,,,,,,,,,,,...,,,,,,,,,,
98,,,,,,,,,,,...,,,,,,,,,,
99,,,,,,,,,,,...,,,,,,,,,,


### c. Local bins

In [10]:
pd.read_csv("../Outputs/"+config['local_bins'])

Unnamed: 0.1,Unnamed: 0,values.global_bins.
0,A,(2)
1,AB,49
2,ABC,92578.5
3,ABD,30968.5
4,ABE,(21)
...,...,...
1648,XY,341681.5
1649,XZ,276705
1650,Y,4225
1651,YZ,96714


### d. Perturbed paths

In [11]:
pd.read_csv("../Outputs/"+config['perturbed_paths'])

Unnamed: 0,original,perturbed
0,"7A0,14B0,8A0","7P0,14AD1"
1,"11C0,13D0",11C1
2,"6E0,4F0","6E0,4Q1"
3,"7G0,12H0,2I0,7G0","7G0,12H0,2I0,7G0"
4,"7J0,4K0,13L0","7J0,4I0,13L1,13BN1"
...,...,...
995,"6M0,11BH0,2I0","6E0,11BH1"
996,"11C0,4F0,7I0,13BK0",11C1
997,"4F0,8E1,7M0","4F0,8E1,7I1,2I1"
998,"6Q1,7F1,1EL1,13AEL1","6Q1,7J1,1EL1,13AEL1"
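
A simple way to compare the two columns is to find the first node at which the perturbed path leaves the original one. This is a sketch for exploring the file, not part of the scripts; `first_divergence` is a hypothetical helper.

```python
def first_divergence(original, perturbed):
    """Return the index of the first differing node, or None if the paths are identical."""
    a = original.split(",")
    b = perturbed.split(",")
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One path is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Rows 0, 2 and 3 from the output above.
print(first_divergence("7A0,14B0,8A0", "7P0,14AD1"))
# 0
print(first_divergence("6E0,4F0", "6E0,4Q1"))
# 1
print(first_divergence("7G0,12H0,2I0,7G0", "7G0,12H0,2I0,7G0"))
# None
```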


### e. Perturbed local bins

Applicable to large datasets, for which one script produces the primary and secondary paths and another produces the primary and perturbed paths. A separate set of local bins, perturbed_local_bins, is maintained for the latter.

In [12]:
pd.read_csv("../Outputs/"+config['perturbed_local_bins'])

Unnamed: 0.1,Unnamed: 0,values.global_bins.
0,A,(2)
1,AB,1824.5
2,ABC,60.5
3,ABD,40943
4,ABE,60
...,...,...
582,XY,271185.5
583,XZ,98.5
584,Y,(27)
585,YZ,159137


### f. Tree depths

In [13]:
pd.read_csv("../Outputs/"+config['tree_depths'])

Unnamed: 0,x
0,7
1,7
2,6
3,9
4,5
...,...
995,6
996,5
997,5
998,5


### g. Tree widths

Maximum tree width in each local tree and the depth at which the maximum width occurs.

In [14]:
pd.read_csv("../Outputs/"+config['tree_widths'])

Unnamed: 0,width,at-depth
0,4,2
1,8,5
2,4,2
3,8,3
4,8,3
...,...,...
995,6,3
996,4,3
997,2,1
998,6,3


### h. Leaf nodes

Number of leaf nodes in each local tree and the list of leaf nodes.

In [15]:
pd.read_csv("../Outputs/"+config['leaf_nodes'])

Unnamed: 0,num_leaves,leaves
0,13,"[5, 6, 8, 14, 19, 30, 37, 63, 72, 124, 125, 14..."
1,15,"[3, 4, 41, 43, 44, 47, 80, 84, 85, 90, 91, 92,..."
2,10,"[4, 5, 12, 15, 26, 28, 29, 55, 108, 109]"
3,16,"[9, 10, 11, 12, 15, 16, 26, 27, 28, 29, 35, 68..."
4,11,"[8, 10, 11, 12, 13, 15, 18, 28, 29, 38, 39]"
...,...,...
995,13,"[5, 8, 13, 14, 18, 19, 24, 31, 50, 60, 61, 102..."
996,8,"[3, 9, 10, 16, 22, 23, 34, 35]"
997,6,"[3, 4, 10, 23, 44, 45]"
998,10,"[4, 10, 12, 13, 14, 23, 30, 31, 44, 45]"
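
When read from CSV, the `leaves` column arrives as text rather than as a list. Assuming each cell holds a complete Python-style list (the `...` above being display truncation only), it can be parsed and checked against `num_leaves` as sketched below; `parse_leaves` is a hypothetical helper.

```python
import ast


def parse_leaves(cell):
    """Turn a stringified list such as "[3, 4, 10]" into a list of ints."""
    leaves = ast.literal_eval(cell)
    if not isinstance(leaves, list) or not all(isinstance(n, int) for n in leaves):
        raise ValueError(f"unexpected leaves value: {cell!r}")
    return leaves


# Rows 996 and 997 from the output above, as (num_leaves, leaves) pairs.
rows = [
    (8, "[3, 9, 10, 16, 22, 23, 34, 35]"),
    (6, "[3, 4, 10, 23, 44, 45]"),
]
for num_leaves, cell in rows:
    # The leaf count should agree with the length of the parsed list.
    assert len(parse_leaves(cell)) == num_leaves
```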


### i. Variable importance

Matrix of the variable importance of each feature in each local tree (one row per tree). NaN if the value is not available.

In [16]:
pd.read_csv("../Outputs/"+config['variable_importance'])

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,11.411598,3.416068,0.507198,9.673437,3.329528,3.906292,17.319525,4.109172,,,,,2.853881,4.452669
1,8.129641,7.071155,7.541180,8.206366,0.310662,1.795567,7.054799,1.006509,,,6.736842,,7.347508,0.481055
2,6.987798,8.229027,5.301051,14.677562,8.686763,11.032258,13.039755,0.507937,0.649351,0.507937,1.240966,,5.281920,1.240966
3,7.489889,6.433319,1.180088,10.792876,9.636746,6.875077,10.791003,9.423401,0.438336,4.247207,1.521372,2.955690,2.054706,
4,5.178472,1.923206,4.260837,6.287867,4.949216,1.350000,6.039838,2.700000,,,,,3.667484,1.306047
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,7.988674,6.071879,1.529524,6.754286,3.000000,11.931648,8.543184,,,2.727234,2.371030,,4.668138,
996,4.119048,3.435923,2.738024,4.431548,1.620536,2.976190,6.841805,0.270089,,,6.600833,,0.475904,
997,10.807314,11.631777,7.250000,16.694835,9.601538,,14.997314,3.235165,,1.653529,,2.743297,10.502454,1.371648
998,20.825110,10.753428,2.904762,23.128379,5.638824,27.647192,23.531889,2.186759,,8.624887,0.335601,,2.713124,4.373518
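
Because unavailable values are NaN, they need to be skipped before ranking features within a tree. The sketch below ranks the features of row 0 above; `top_features` is a hypothetical helper, and the dictionary is a hand-copied version of that row.

```python
import math


def top_features(importance, k=3):
    """Return the k features with the highest importance, ignoring NaN entries."""
    available = {name: value for name, value in importance.items()
                 if not math.isnan(value)}
    return sorted(available, key=available.get, reverse=True)[:k]


nan = float("nan")
# Row 0 of the output above.
row0 = {
    "age": 11.411598, "workclass": 3.416068, "fnlwgt": 0.507198,
    "education": 9.673437, "education-num": 3.329528,
    "marital-status": 3.906292, "occupation": 17.319525,
    "relationship": 4.109172, "race": nan, "sex": nan,
    "capital-gain": nan, "capital-loss": nan,
    "hours-per-week": 2.853881, "native-country": 4.452669,
}
print(top_features(row0))
# ['occupation', 'age', 'education']
```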


### j. Factors (Optional)

- Mapping of factors to numeric index.
- The number in row 0 indicates the column number of the feature in the dataset (numbered from 1).
- The numeric index used for each level in R is indicated by the row number.
- For example, the mapping for marital-status is:
  {1: Divorced, 2: Married-AF-spouse, ..., 7: Widowed}

In [17]:
df = None
if 'factors' in config:
    df = pd.read_csv("../Outputs/"+config['factors'])
df

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country,target
0,2,4,6,7,8,9,10,14,15
1,?,10th,Divorced,?,Husband,Amer-Indian-Eskimo,Female,?,<=50K
2,Federal-gov,11th,Married-AF-spouse,Adm-clerical,Not-in-family,Asian-Pac-Islander,Male,Cambodia,>50K
3,Local-gov,12th,Married-civ-spouse,Armed-Forces,Other-relative,Black,,Canada,
4,Never-worked,1st-4th,Married-spouse-absent,Craft-repair,Own-child,Other,,China,
5,Private,5th-6th,Never-married,Exec-managerial,Unmarried,White,,Columbia,
6,Self-emp-inc,7th-8th,Separated,Farming-fishing,Wife,,,Cuba,
7,Self-emp-not-inc,9th,Widowed,Handlers-cleaners,,,,Dominican-Republic,
8,State-gov,Assoc-acdm,,Machine-op-inspct,,,,Ecuador,
9,Without-pay,Assoc-voc,,Other-service,,,,El-Salvador,
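
The layout described above (row 0 holds the column number, later rows hold the levels) can be unpacked into a per-factor dictionary. The sketch below uses the standard `csv` module on a two-column excerpt of the table; `factor_mapping` is a hypothetical helper, and the empty header name for the index column is specific to reading the file this way (pandas shows it as `Unnamed: 0`).

```python
import csv
import io

# Excerpt of the factors file above, restricted to two columns.
FACTORS_CSV = """\
,marital-status,sex
0,6,10
1,Divorced,Female
2,Married-AF-spouse,Male
3,Married-civ-spouse,
4,Married-spouse-absent,
5,Never-married,
6,Separated,
7,Widowed,
"""


def factor_mapping(rows, factor):
    """Return (column number, {numeric index: level}) for one factor."""
    column_number = None
    levels = {}
    for row in rows:
        index = int(row[""])  # the unnamed first column holds the row number
        value = row[factor]
        if index == 0:
            column_number = int(value)  # row 0: position of the feature in the dataset
        elif value != "":
            levels[index] = value  # empty cells mean the factor has no such level
    return column_number, levels


rows = list(csv.DictReader(io.StringIO(FACTORS_CSV)))
print(factor_mapping(rows, "sex"))
# (10, {1: 'Female', 2: 'Male'})
```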


## 4. Scripts

### Python script usage

python3 script.py --dataset DATASET

### Datasets supported

- iris
- adult
- auto
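
The command line above could be handled with `argparse` along the following lines. This is a hedged sketch, not the scripts' actual code: `build_parser` and `config_path` are hypothetical names, making `--dataset` required and restricting it to the listed datasets are assumptions, and the config location is taken from the first notebook cell.

```python
import argparse

# Datasets listed above.
SUPPORTED_DATASETS = ("iris", "adult", "auto")


def build_parser():
    """Build a parser that accepts --dataset DATASET."""
    parser = argparse.ArgumentParser(description="Run a script for one dataset.")
    parser.add_argument(
        "--dataset",
        required=True,               # assumption: the flag is mandatory
        choices=SUPPORTED_DATASETS,  # assumption: other names are rejected
        help="name of the dataset; selects the matching config file",
    )
    return parser


def config_path(dataset):
    """Return the config file path, following the pattern used in the Configs section."""
    return "../Configs/" + dataset + ".json"


args = build_parser().parse_args(["--dataset", "adult"])
print(config_path(args.dataset))
# ../Configs/adult.json
```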