# Frequent Subtree Counting in Random Forests

The goal of this project is to reduce the size of the source code generated for decision tree classifiers on embedded devices.
As a first step, we investigate whether several trained random forests have frequent subtrees in common.
Such a subtree could be implemented once as a function that is called at each place where it occurs in the decision trees.
This can decrease the size of the generated embedded C source files and executables.
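
To make the idea concrete, here is a minimal Python sketch that counts how often identical subtrees occur across a small forest. It is a simplified stand-in, not the mining approach used later: it only detects complete subtrees (a node together with all of its descendants), which is a narrower notion than the patterns a frequent subgraph miner can find. The nested-tuple tree encoding, the feature names, and the function names `canonical` and `count_subtrees` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical tree encoding (NOT the project's JSON format):
# a leaf is the string "leaf"; a split node is a tuple
# (feature, left_subtree, right_subtree).


def canonical(tree):
    """Encode an ordered subtree as a string; child order is preserved."""
    if tree == "leaf":
        return "leaf"
    feature, left, right = tree
    return f"({feature} {canonical(left)} {canonical(right)})"


def count_subtrees(forest):
    """Count how often each complete non-leaf subtree occurs in the forest."""
    counts = Counter()

    def visit(tree):
        if tree == "leaf":
            return
        counts[canonical(tree)] += 1
        _, left, right = tree
        visit(left)
        visit(right)

    for tree in forest:
        visit(tree)
    return counts


# Two toy trees that share one subtree in different positions.
shared = ("age", "leaf", ("hours", "leaf", "leaf"))
forest = [
    ("edu", shared, "leaf"),
    ("gain", "leaf", shared),
]

counts = count_subtrees(forest)
# Subtrees occurring at least twice are candidates for a shared function.
repeated = {s: n for s, n in counts.items() if n >= 2}
print(repeated)
```

Each subtree that occurs at least twice is a candidate for being emitted once as a function and called from every occurrence.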

## Datasets
Several datasets are available.
For now, however, I experiment only with 'adult' and 'wine-quality'.
The data are given as JSON files with the following names:

In [5]:
ls forests/*/text/*.json

forests/adult/text/DT_10.json  forests/wine-quality/text/DT_10.json
forests/adult/text/DT_15.json  forests/wine-quality/text/DT_15.json
forests/adult/text/DT_1.json   forests/wine-quality/text/DT_1.json
forests/adult/text/DT_20.json  forests/wine-quality/text/DT_20.json
forests/adult/text/DT_5.json   forests/wine-quality/text/DT_5.json
forests/adult/text/ET_10.json  forests/wine-quality/text/ET_10.json
forests/adult/text/ET_15.json  forests/wine-quality/text/ET_15.json
forests/adult/text/ET_1.json   forests/wine-quality/text/ET_1.json
forests/adult/text/ET_20.json  forests/wine-quality/text/ET_20.json
forests/adult/text/ET_5.json   forests/wine-quality/text/ET_5.json
forests/adult/text/RF_10.json  forests/wine-quality/text/RF_10.json
forests/adult/text/RF_15.json  forests/wine-quality/text/RF_15.json
forests/adult/text/RF_1.json   forests/wine-quality/text/RF_1.json
forests/adult/text/RF_20.json  forests/wine-quality/text/RF_20.json
forests/adult/text/RF_5.json   forests/wine-quality/t

The filenames XX_n.json mean:
- XX:
  - DT: decision tree
  - RF: random forest with 25 trees
  - ET: 'extremely randomized trees'
- n: depth of the trees
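
The naming scheme can be decoded mechanically. The following sketch assumes the directory layout `forests/<dataset>/text/XX_n.json` seen in the listing above; the function name `parse_forest_filename` is hypothetical and not part of the project's scripts.

```python
import re
from pathlib import Path


def parse_forest_filename(path):
    """Split a path like forests/<dataset>/text/XX_n.json into
    (dataset, model_type, depth)."""
    p = Path(path)
    match = re.fullmatch(r"(DT|RF|ET)_(\d+)", p.stem)
    if match is None:
        raise ValueError(f"unexpected file name: {p.name}")
    # Assumption: the dataset name is the directory two levels above the file.
    dataset = p.parent.parent.name
    return dataset, match.group(1), int(match.group(2))


print(parse_forest_filename("forests/adult/text/RF_15.json"))
```

For the path in the example, the result is the dataset `adult`, the model type `RF`, and the depth `15`.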

Each tree given here (whether as a decision tree or as a component in a random forest) is a rooted, binary, ordered tree.

## Convert the Data

We have written a few scripts that convert the JSON files to the format required by the frequent subgraph mining software.

In [7]:
ls ./*.py

./json2graphNoLeafEdges.py  ./json2graphWithLeafEdges.py


The header comments of the scripts document what they do:

In [12]:
for f in ./*.py; do
    echo "${f}"
    grep '^#' "${f}"
    echo
done

./json2graphNoLeafEdges.py
## This script creates a graph database from the decision trees in Sebastians json Format as follows:
# - the root vertex of the tree has index 1 (counting starts with 1)
# - each vertex is labeled by its split feature or by 'leaf'
# - each edge is labeled either 'leftChild' or 'rightChild'
# - there are no edges containing 'leaf' vertices
#
# It follows that the connected components resulting from a single decision tree are several isolated vertices labeled 'leaf'
# and a tree containing all the split vertices.

./json2graphWithLeafEdges.py
## This script creates a graph database from the decision trees in Sebastians json Format as follows:
# - the root vertex of the tree has index 1 (counting starts with 1)
# - each vertex is labeled by its split feature or by 'leaf'

In [13]:
mkdir forests/adult/WithLeafEdges/
mkdir forests/adult/NoLeafEdges/

mkdir forests/wine-quality/WithLeafEdges/
mkdir forests/wine-quality/NoLeafEdges/

In [28]:
for dataset in adult wine-quality; do
    for f in forests/${dataset}/text/*.json; do
        # strip directory and .json suffix, e.g. DT_10
        name=$(basename "${f}" .json)
        echo "${f}" '->' "${name}.graph"
        python json2graphWithLeafEdges.py "${f}" > "forests/${dataset}/WithLeafEdges/${name}.graph"
        python json2graphNoLeafEdges.py "${f}" > "forests/${dataset}/NoLeafEdges/${name}.graph"
    done
done

forests/adult/text/DT_10.json -> DT_10.graph
forests/adult/text/DT_15.json -> DT_15.graph
forests/adult/text/DT_1.json -> DT_1.graph
forests/adult/text/DT_20.json -> DT_20.graph
forests/adult/text/DT_5.json -> DT_5.graph
forests/adult/text/ET_10.json -> ET_10.graph
forests/adult/text/ET_15.json -> ET_15.graph
forests/adult/text/ET_1.json -> ET_1.graph
forests/adult/text/ET_20.json -> ET_20.graph
forests/adult/text/ET_5.json -> ET_5.graph
forests/adult/text/RF_10.json -> RF_10.graph
forests/adult/text/RF_15.json -> RF_15.graph
forests/adult/text/RF_1.json -> RF_1.graph
forests/adult/text/RF_20.json -> RF_20.graph
forests/adult/text/RF_5.json -> RF_5.graph
forests/wine-quality/text/DT_10.json -> DT_10.graph
forests/wine-quality/text/DT_15.json -> DT_15.graph
forests/wine-quality/text/DT_1.json -> DT_1.graph
forests/wine-quality/text/DT_20.json -> DT_20.graph
forests/wine-quality/text/DT_5.json -> DT_5.graph
forests/wine-quality/text/ET_10.json -> ET_10.graph
forests/wine-quality/text/ET_

## Find Frequent Subtrees

To find frequent subtrees in a meaningful way, the graph mining executable needs two changes:
- enable the algorithm to handle rooted trees in a meaningful way
- 