Tree Compression with Top Trees Revisited
This implementation accompagnies the paper "Tree Compression with Top Trees Revisited" by Rajeev Raman and me (Lorenz Hübschle-Schneider), to be published in the SEA 2015 proceedings. A technical report is available on the arXiv (arXiv:1506.04499 [cs.DS]).
The central part is the implementation of top tree compression, but this repository also contains RePair compression, the generation of random trees, and some utilities around it.
Some autogenerated documentation can be found at https://lorenzhs.github.io/toptrees/.
Nearly all of the actual code is implemented in the header files (
.h). This allows the compiler to perform extensive optimisations. Each of the
.cpp files in this folder can be compiled into an executable with GNU Make and the provided
The executables are:
codingreads an XML file, compresses it with our method, and computes the size of an encoding that is suitable for storage and unpacking. It does not produce an actual encoded output file. It supports both classical top tree compression as well as our RePair-inspired combiner. Usage information is available with the command line switches
randomEvalapplies the top tree compression algorithm to trees generated uniformly at random. Command line switches specify the number and size of trees to evaluate, the number of trees to evaluate in parallel (as threads), as well as the label alphabet size and the random seed. Help is available with the
randomVerifyworks similarly to
randomEval, but computes the top tree and unpacks it again, comparing the result of that with the input tree. This allows us to experimentally verify the correctness of our implementation, using both classic and RePair-like combining. Parameters are similar to
testapllies the compression algorithm to a single XML file and prints some statistics about the result. In most cases,
codingshould be used. Pass
-wto write output DOT-files for top tree and Top DAG to
/tmpand invoke the GraphViz
dotcommand on them (warning: this can take a very long time for large graphs!). Pass
-rfor RePair-like combiner.
testTTworks similarly to
testbut performs unpacking of the Top DAG to verify correctness. Specify input file with
-i, output folder for the trimmed and recovered XML files with
/tmp), and pass
-rto use the RePair-inspired combiner.
repairapplies the RePair compression algorithm to the input file, printing the grammar and output string to stdout if
randomTreegenerates trees uniformly at random. Tree and alphabet size, seed, and output folder for an XML file (default: don't write) can be specified, as well as DOT graph plotting similar to
--helpfor full usage information.
A Note on Experiments
Some of the binaries output a line starting with "RESULT" and containing key-value pairs. This line is for parsing with Timo Bingmann's SqlPlotTools and was used to automatically update the plots, tables, and numbers in the paper as experiments were re-run after changes in the code. Don't give copy-paste errors a chance!
All algorithms are implemented in C++11 and have been tested with the GNU C++ compiler,
g++, in version 4.9 and
clang++ in version 3.6. To build one of the above executables including debug assertions (which can cause significant overhead!), the executable name serves as
make target, e.g.
make coding. A version with debug information and without optimisations for debugging can be compiled by appending
Debug to the executable name, while appending
NoDebug disables assertions (example:
make randomEvalNoDebug). For some executables, a target for profile guided optimisation builds is available as well, this might require changing of XML file paths in the Makefile. The suffix for these is
PGO. The binaries will be suffixed with
clang++ requires the
gold linker and linker plugin to use link-time optimisations, which are enabled by default on all non-
Debug builds. If this causes problems on your systems, remove
-flto from line 6 and all occurences of
=$(NPROCS) in the
Prior to version 3.6, the
clang++ compiler did not support debug information with the upcoming
c++1y standard. If you require support for
clang++ 3.4 or 3.5, remove
-g from the flags.
make targets are available for static analysis with
cppcheck (warning: many false positives, as it does not seem to have, among others, support for C++11 lambdas) and
The code was tested and run under Debian GNU/Linux in the unstable distribution as of June 2015, but should work on all Linux-based systems. Adaptation for other operating systems will most likely require a change of paths (e.g. '/tmp'), the
makePathRecursive function in
Common.h and the
drawSvg function in
DotGraphExporter.h. Additionally, the Makefile requires adaptation to discover the number of available processors for link-time optimisation. POSIX threads (pthreads) are a requirement for the
Unless otherwise noted, all code in this repository is subject to the GNU General Public License, Version 3. A copy of the licence is provided in the file LICENSE.