cpl

This repository contains the implementation used in our CP23 paper. It generates decision sets that are both interpretable and accurate by compiling a gradient boosted tree model on demand, where each generated rule is equivalent to an abductive explanation for the prediction made by the gradient boosted tree. The experiments compare the proposed implementation with other state-of-the-art decision set learning algorithms in terms of accuracy, scalability, model size, and explanation size.

Instructions

Before using the implementation, extract the datasets stored in datasets.tar.xz. Make sure tar is installed, then run:

$ tar -xvf datasets.tar.xz

If you are interested in the logs, also run:

$ tar -xvf logs.tar.xz
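
If tar is not available on your system, the same archives can be unpacked from Python. The following is a minimal sketch that uses only the standard library's tarfile module; the helper name extract_archive is hypothetical and not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest="."):
    """Unpack a .tar.xz archive into dest and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:xz" opens an LZMA-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:xz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


if __name__ == "__main__":
    # Archive names as shipped in this repository; missing files are skipped.
    for name in ("datasets.tar.xz", "logs.tar.xz"):
        if Path(name).exists():
            extract_archive(name)
```

Run it from the repository root so that the extracted folders end up where the tar commands above would place them.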

Required Packages

The implementation is written as a set of Python scripts. The Python version used in the experiments is 3.8.5. To install the required packages, run:

$ pip install -r requirements.txt

In addition to the packages above, Gurobi with a full licence is also required. To install Gurobi and IDS, please follow their respective installation instructions.

Usage

cpl.py provides a number of parameters, which can be set from the command line. To see the list of parameters, run:

$ cd src/ && python cpl.py -h

Preparing a dataset

cpl handles datasets in CSV format. Before compiling a gradient boosted tree (BT) model into a decision set (DS), we need to prepare the dataset used to train the BT model.

Assume a target dataset is stored in somepath/dataset.csv.
Create an extra file named somepath/dataset.csv.catcol containing the indices of the categorical columns of the target dataset. For example, if columns 0, 3, and 6 are categorical features, the file should be as follows:
```
0
3
6
```
With the two files above, we can run:

$ python cpl.py -p --pfiles dataset.csv,somename somepath/

This creates a new dataset file somepath/somename_data.csv in which the categorical features are properly handled. For example:

$ python cpl.py -p --pfiles iris_train1.csv,iris_train1 ../datasets/train/iris/
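
The .catcol file described above can also be written from Python. The sketch below is only an illustration: write_catcol is a hypothetical helper, and it assumes the format is exactly one column index per line, as in the example above.

```python
from pathlib import Path


def write_catcol(csv_path, categorical_indices):
    """Write one column index per line to <csv_path>.catcol and return its path."""
    catcol_path = Path(str(csv_path) + ".catcol")
    # Sort and de-duplicate so the file is stable across runs.
    lines = [str(i) for i in sorted(set(categorical_indices))]
    catcol_path.write_text("\n".join(lines) + "\n")
    return catcol_path


if __name__ == "__main__":
    # Columns 0, 3, and 6 are categorical, as in the example above.
    print(write_catcol("dataset.csv", [0, 3, 6]))
```

After the file is written next to the CSV file, the -p --pfiles command above can be run as usual.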

Training a gradient boosted tree model

A gradient boosted tree model is required before generating a decision set. Run the following command to train a BT model:

$ python cpl.py -c -t -n 50 -d 3 --testsplit 0 ../datasets/train/iris/iris_train1_data.csv

Here, a boosted tree model consisting of 50 trees per class is trained, where the maximum depth of each tree is 3. ../datasets/train/iris/iris_train1_data.csv is the dataset used for training. The value of --testsplit ranges from 0.0 to 1.0; in this command, 100% of the given dataset is used for training and 0% for testing. By default, the generated model is saved in ./temp/iris_train1_data/iris_train1_data_nbestim_50_maxdepth_3_testsplit_0.0.mod.pkl
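
When scripting several training runs, it helps to predict where the model file will be saved. The sketch below reconstructs the default path from the single example above; the naming pattern is inferred from that example only, and default_model_path is a hypothetical helper, so check the result against what cpl.py actually writes.

```python
from pathlib import PurePosixPath


def default_model_path(dataset_csv, n_estimators, max_depth, testsplit):
    """Guess the default model path, following the pattern of the example above."""
    stem = PurePosixPath(dataset_csv).stem  # e.g. "iris_train1_data"
    name = (
        f"{stem}_nbestim_{n_estimators}_maxdepth_{max_depth}"
        f"_testsplit_{float(testsplit)}.mod.pkl"
    )
    return f"./temp/{stem}/{name}"


if __name__ == "__main__":
    print(default_model_path("../datasets/train/iris/iris_train1_data.csv", 50, 3, 0))
```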

Compiling a boosted tree into a decision set

To generate a decision set via local compilation, i.e. such that the computed decision set covers all instances in the training dataset, run:

$ python cpl.py -f -I -R lin -e mx -s g3 -v --clocal --fsort --fqupdate ./temp/iris_train1_data/iris_train1_data_nbestim_50_maxdepth_3_testsplit_0.0.mod.pkl

-f outputs the compiled decision set in a particular format. -I -R lin activates the compilation process with the standard linear search for rule extraction. -e mx -s g3 selects the MaxSAT encoding and the g3 SAT solver. -v increases the verbosity level. --clocal --fsort --fqupdate activates local compilation together with feature sorting based on feature frequencies.

Lexicographic optimization on each rule, i.e. minimizing misclassifications first and then the number of literals used, can be activated by adding the options --reduce-lit after --reduce-lit-appr maxsat (here, after is the value passed to --reduce-lit):

$ python cpl.py -f -I -R lin -e mx -s g3 -v --clocal --fsort --fqupdate --reduce-lit after --reduce-lit-appr maxsat ./temp/iris_train1_data/iris_train1_data_nbestim_50_maxdepth_3_testsplit_0.0.mod.pkl

To enable the tradeoff between misclassifications and the number of literals used in each rule, additionally pass --lam 0.005 --approx 1:

$ python cpl.py -f -I -R lin -e mx -s g3 -v --clocal --fsort --fqupdate --reduce-lit after --reduce-lit-appr maxsat --lam 0.005 --approx 1 ./temp/iris_train1_data/iris_train1_data_nbestim_50_maxdepth_3_testsplit_0.0.mod.pkl
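
The difference between the two rule-level objectives above can be illustrated with a small sketch. It is not the repository's implementation: the lexicographic order follows the description above, while the weighted cost errors + lam * literals is an assumed, commonly used form of such a tradeoff. The candidate rules and the value lam=0.2 are made up for illustration, and the exact role of --lam and --approx in cpl.py is not shown here.

```python
def lexicographic_key(errors, literals):
    """Compare misclassifications first, then the number of literals."""
    return (errors, literals)


def weighted_cost(errors, literals, lam):
    """Assumed tradeoff: each literal costs lam misclassifications."""
    return errors + lam * literals


def best_rule(candidates, key):
    """Return the name of the candidate rule with the smallest key."""
    return min(candidates, key=lambda name: key(*candidates[name]))


# Hypothetical candidate rules: name -> (misclassifications, literals).
CANDIDATES = {"A": (1, 2), "B": (0, 9)}

if __name__ == "__main__":
    print(best_rule(CANDIDATES, lexicographic_key))
    print(best_rule(CANDIDATES, lambda e, l: weighted_cost(e, l, lam=0.2)))
```

Under the lexicographic order, the rule with fewer misclassifications is always preferred, however long it is. Under the weighted cost, a sufficiently large lam lets a shorter rule with more misclassifications win.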

To activate rule reduction, add --reduce-rule --weighted:

$ python cpl.py -f -I -R lin -e mx -s g3 -v --clocal --fsort --fqupdate --reduce-rule --weighted ./temp/iris_train1_data/iris_train1_data_nbestim_50_maxdepth_3_testsplit_0.0.mod.pkl

To activate both lexicographic optimization and rule reduction, add both `--reduce-lit after --reduce-lit-appr maxsat` and `--reduce-rule --weighted`:

$ python cpl.py -f -I -R lin -e mx -s g3 -v --clocal --fsort --fqupdate --reduce-lit after --reduce-lit-appr maxsat --reduce-rule --weighted ./temp/iris_train1_data/iris_train1_data_nbestim_50_maxdepth_3_testsplit_0.0.mod.pkl

The implementation also supports exhaustive compilation, obtained by omitting --clocal --fsort --fqupdate:

$ python cpl.py -f -I -R lin -e mx -s g3 -v ./temp/iris_train1_data/iris_train1_data_nbestim_50_maxdepth_3_testsplit_0.0.mod.pkl
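
The command variants in this section differ only in a few option groups, so they can be assembled programmatically when running many configurations. The sketch below uses only the flags shown above; build_compile_command and its keyword arguments are hypothetical names, not part of cpl.py, and whether every flag combination is accepted by cpl.py is an assumption.

```python
BASE_FLAGS = ["-f", "-I", "-R", "lin", "-e", "mx", "-s", "g3", "-v"]
LOCAL_FLAGS = ["--clocal", "--fsort", "--fqupdate"]
REDUCE_LIT_FLAGS = ["--reduce-lit", "after", "--reduce-lit-appr", "maxsat"]
REDUCE_RULE_FLAGS = ["--reduce-rule", "--weighted"]


def build_compile_command(model_path, local=True, reduce_lit=False,
                          reduce_rule=False, lam=None, approx=1):
    """Return the argument list for one compilation run of cpl.py."""
    cmd = ["python", "cpl.py"] + BASE_FLAGS
    if local:
        cmd += LOCAL_FLAGS        # local compilation with feature sorting
    if reduce_lit:
        cmd += REDUCE_LIT_FLAGS   # lexicographic optimization on each rule
    if lam is not None:
        cmd += ["--lam", str(lam), "--approx", str(approx)]
    if reduce_rule:
        cmd += REDUCE_RULE_FLAGS  # rule reduction
    cmd.append(model_path)
    return cmd


if __name__ == "__main__":
    model = ("./temp/iris_train1_data/"
             "iris_train1_data_nbestim_50_maxdepth_3_testsplit_0.0.mod.pkl")
    # Print the command; it could be passed to subprocess.run(cmd, cwd="src").
    print(" ".join(build_compile_command(model, reduce_lit=True, reduce_rule=True)))
```

The returned list can be passed to subprocess.run with src/ as the working directory, matching the cd src/ step used earlier.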

Reproducing Experimental Results

Due to the randomization used in the training phase, it is unlikely that the experimental results reported in the paper can be reproduced exactly. Similar results can be obtained with the following script:

$ ./src/experiment/repro_exp.sh

Since the total number of datasets is 295 and 13 decision set competitors are considered, running the experiments will take a while.
