## Find Frequent Subtrees

To find frequent subtrees in random forests in a meaningful way, the graph mining executable needs two changes:
- the algorithm must be able to handle rooted trees
- the algorithm must output at least the root vertex of an embedding (if one exists for a given transaction tree) instead of just reporting that 'there is a mapping'

### Find Frequent Undirected Trees

Regardless of the above, in this notebook I first check how many frequent undirected trees we can find in the 'unrooted variant' of the random forests.
That is, we consider the undirected graphs obtained from the rooted decision trees by 'forgetting' the root.
If a subgraph isomorphism exists in the rooted variant, then a subgraph isomorphism also exists in this undirected version.

However, this does not make the number of frequent undirected subtrees an upper or lower bound on the number of frequent directed (i.e., rooted) subtrees:
an undirected frequent tree with $k$ vertices has up to $k$ nonisomorphic rooted versions, one per choice of root.

- All or some of them might be frequent. Hence, the number of frequent directed trees might be larger than the number of frequent undirected trees.
- On the other hand, each rooted version might be infrequent on its own in the rooted transaction database, while their occurrences, counted together as instances of the undirected subtree, exceed the frequency threshold. Hence, the number of frequent directed trees might be smaller than the number of frequent undirected trees.

As a result, be careful when interpreting the results.
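
A toy example, not taken from the forests, illustrates both directions. It assumes unlabeled trees and a rooted embedding that preserves the parent-child direction of every edge. Let $P_3$ be the undirected path on three vertices; its two nonisomorphic rooted versions are the chain $C$ (rooted at an endpoint) and the cherry $V$ (rooted at the middle vertex). Take the threshold $t = 3$ and a database $D_1$ consisting of two chains and two cherries:

$$
\mathrm{supp}_{D_1}(P_3) = 4 \ge t, \qquad \mathrm{supp}_{D_1}(C) = 2 < t, \qquad \mathrm{supp}_{D_1}(V) = 2 < t
$$

Here $P_3$ is frequent although neither of its rooted versions is. Conversely, let $D_2$ consist of three copies of the rooted tree whose root has two children, one of which has a child of its own. Every copy contains both $C$ and $V$:

$$
\mathrm{supp}_{D_2}(P_3) = \mathrm{supp}_{D_2}(C) = \mathrm{supp}_{D_2}(V) = 3 \ge t
$$

so a single frequent undirected tree on three vertices corresponds to two frequent rooted ones.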

In [None]:
# create output directories
# (-p also creates missing parent directories and does not fail on reruns)
for dataset in adult wine-quality; do
    for variant in WithLeafEdges NoLeafEdges; do
        mkdir -p "forests/undirectedFrequentTrees/${dataset}/${variant}/"
    done
done

In [None]:
# mine every forest file once per threshold, from 25 down to 2
for dataset in adult wine-quality; do
    for variant in WithLeafEdges NoLeafEdges; do
        for f in "forests/${dataset}/${variant}"/*.graph; do
            # common prefix of the three output files of one run
            out="forests/undirectedFrequentTrees/${dataset}/${variant}/$(basename "${f}" .graph)"
            for threshold in $(seq 25 -1 2); do
                echo "processing threshold ${threshold} for ${f}"
                ./lwg -e subtree -m bfs -t "${threshold}" -p 10 \
                  -o "${out}_t${threshold}.patterns" \
                  < "${f}" \
                  > "${out}_t${threshold}.features" \
                  2> "${out}_t${threshold}.logs"
            done
        done
    done
done

### Next Steps

The results of this mining process are plotted in the Python 3 notebook 'Results for Undirected Frequent Trees.ipynb'.
Note that the mining run that produced 'forests/undirectedFrequentTrees/adult/WithLeafEdges/ER_20_t2.*' did not finish properly; it was probably killed due to excessive memory usage while processing patterns of size 6.
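
One way to make such failures visible in future runs is to record the exit status of every mining call. The following is a minimal sketch, not part of the original notebook: `run_and_record` and `killed_runs` are hypothetical helper names, and the status file format (status, tab, command) is an assumption. It relies on the shell convention that a process terminated by a signal usually yields an exit status of 128 plus the signal number (137 for SIGKILL).

```shell
# Hypothetical helper (not part of lwg): run a command and append its exit
# status and command line, separated by a tab, to a status file.
run_and_record() {
    status_file=$1
    shift
    "$@"
    rc=$?
    printf '%s\t%s\n' "${rc}" "$*" >> "${status_file}"
    return "${rc}"
}

# Print the recorded commands whose exit status exceeds 128, which usually
# means the process was terminated by a signal (e.g. 137 = 128 + SIGKILL).
killed_runs() {
    awk -F '\t' '$1 > 128 { print $2 }' "$1"
}

# Sketch of how the mining call could be wrapped (flags as used above):
#   run_and_record status.tsv ./lwg -e subtree -m bfs -t "${threshold}" -p 10 \
#       -o "${out}.patterns" < "${f}" > "${out}.features" 2> "${out}.logs"
```

Redirections written after the wrapped command apply to the command itself, because the helper appends to the status file through its own redirection. A status above 128 is only a hint: a program can also return such a value on its own.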