Sorts Consistent Tree outputs into contiguous order. Written in C, probably only works on Linux.
Consistent Trees output is spread across multiple tree_*_*_*.dat files, with a locations.dat file
specifying the tree_root_id, fileid and offset needed to read in the data. Sort_Forests re-sorts
the entire dataset so that each forest is stored contiguously in a single file, which speeds up IO.
The code is written to be fast and uses lower-level system calls (pread instead of an fseek/fgets combination). For instance, rewriting the entire 1.2 TB of Bolshoi trees takes 12 hours on a single CPU.
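To illustrate why pread is a convenient fit for this kind of work, here is a minimal sketch of a positioned read. It is not taken from the Sort_Forests source; the function name `read_exact_at` is hypothetical. pread takes the file offset as an argument, so no separate seek call is needed and the file position of the descriptor is left unchanged.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Sort_Forests): read exactly `nbytes`
 * from `fd` starting at byte `offset`, retrying after short reads.
 * Returns 0 on success, -1 on an I/O error or if EOF is hit early.
 */
int read_exact_at(int fd, void *buf, size_t nbytes, off_t offset)
{
    assert(buf != NULL);
    char *dst = buf;

    while (nbytes > 0) {
        ssize_t got = pread(fd, dst, nbytes, offset);
        if (got < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted by a signal: retry */
            }
            return -1;          /* genuine I/O error */
        }
        if (got == 0) {
            return -1;          /* end of file before nbytes were read */
        }
        dst += got;
        offset += got;
        nbytes -= (size_t)got;
    }
    return 0;
}
```

A caller would open a tree file with `open(path, O_RDONLY)` and pass the offset listed for a tree in locations.dat; the loop is there because pread may return fewer bytes than requested.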
- A C compiler (gcc/icc/clang). gcc is default.
$ git clone https://github.com/manodeep/sort_forests/
$ make
$ ./sort_forests <input directory> <output directory>
Input directory should contain forests.list, locations.dat and tree_*_*_*.dat files.
Output directory will contain the new forests.list, locations.dat and tree_*_*_*.dat files.
In addition, the output directory contains a forests_and_locations_new.dat file that lists the
forest id, tree_root_id, fileid, offset and filename, plus a new bytes column that
gives the number of (ASCII) bytes in each tree. The bytes column lets you pre-allocate a
buffer for a tree and read in the entire tree in one go (rather than line by line).
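As a rough sketch of how the bytes column might be used, the function below allocates a buffer of exactly the right size and fills it with positioned reads. The struct `tree_location` and the function `read_whole_tree` are hypothetical names, and the sketch assumes that offset marks where the tree's bytes begin in the file; parsing forests_and_locations_new.dat itself is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record holding two columns of forests_and_locations_new.dat. */
struct tree_location {
    int64_t offset;   /* assumed: byte offset where the tree starts */
    int64_t bytes;    /* number of ASCII bytes in the tree */
};

/*
 * Hypothetical helper: read one whole tree from `path` into a freshly
 * allocated, NUL-terminated buffer. Returns NULL on any failure
 * (open error, allocation failure, read error, or early EOF).
 * The caller owns the buffer and must free() it.
 */
char *read_whole_tree(const char *path, const struct tree_location *loc)
{
    assert(path != NULL && loc != NULL);
    if (loc->offset < 0 || loc->bytes < 0) {
        return NULL;
    }

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }

    size_t want = (size_t)loc->bytes;
    char *buf = malloc(want + 1);          /* +1 for the terminating NUL */
    if (buf == NULL) {
        close(fd);
        return NULL;
    }

    size_t done = 0;
    while (done < want) {
        ssize_t got = pread(fd, buf + done, want - done,
                            (off_t)(loc->offset + (int64_t)done));
        if (got <= 0) {                    /* error or premature EOF */
            free(buf);
            close(fd);
            return NULL;
        }
        done += (size_t)got;
    }
    buf[done] = '\0';

    close(fd);
    return buf;
}
```

The returned buffer can then be split into lines in memory, so each tree costs one allocation and usually one read call instead of one fgets call per halo.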
Sort_Forests is written/maintained by Manodeep Sinha. Please contact the author in
case of any issues.
Sort_Forests is released under the MIT license. Basically, do what you want
with the code, including using it in commercial applications.
- version control (https://github.com/manodeep/sort_forests/)