LZ data generator
Sometimes it is useful to generate data that resembles real data for testing or benchmarking. For instance, it may be impractical to distribute large data sets with an application.
lzdatagen generates data suitable for dictionary compression techniques.
lzdatagen comes with an example application lzdgen that provides a command-line interface for generating data:
usage: lzdgen [options] OUTFILE

Generate compressible data for testing purposes.

options:
  -f, --force             overwrite output file
  -h, --help              print this help and exit
  -l, --literal-exp EXP   literal distribution exponent [3.0]
  -m, --match-exp EXP     match length distribution exponent [3.0]
  -o, --output OUTFILE    write output to OUTFILE
  -r, --ratio RATIO       compression ratio target [3.0]
  -S, --seed SEED         use 64-bit SEED to seed PRNG
  -s, --size SIZE         size with opt. k/m/g suffix [1m]
  -V, --version           print version and exit
  -v, --verbose           verbose mode

If OUTFILE is `-', write to standard output.
Generate 1 MiB of data that should compress to roughly a quarter of its size (ratio 4.0):
lzdgen -r 4.0 foo.bin
Generate 1 MiB of data that is compressible by entropy coding, but contains no LZ repetitions:
lzdgen -r 1.0 foo.bin
Generate 1 GiB of data, piped to zstd:
lzdgen -s 1g - | zstd -o foo.zstd
Data is generated by inserting sequences of either random bytes or repetitions from a buffer of bytes, depending on the ratio parameter. This is based on the paper "SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks" by Raúl Gracia-Tinedo et al.
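The idea can be sketched in a few lines of Python. This is not lzdatagen's implementation, only an illustration under stated assumptions: the function name `generate`, the `buffer_size` and `max_run` parameters, the uniform literal values, and the rule that a fraction of about 1/ratio of the output should be literals are all hypothetical.

```python
import random


def generate(size, ratio, seed=0, buffer_size=4096, max_run=32):
    """Hypothetical sketch: mix runs of random literals with copies from a buffer."""
    rng = random.Random(seed)
    # Fixed buffer of random bytes that the repetitions (matches) are copied from.
    buffer = bytes(rng.randrange(256) for _ in range(buffer_size))
    # Assumption: for a target ratio R, roughly 1/R of the output bytes are literals.
    literal_fraction = 1.0 / ratio
    out = bytearray()
    while len(out) < size:
        # Run length, clipped so the output ends at exactly `size` bytes.
        run = min(rng.randint(1, max_run), size - len(out))
        if rng.random() < literal_fraction:
            # Literal run: fresh random bytes (uniform here, for simplicity).
            out += bytes(rng.randrange(256) for _ in range(run))
        else:
            # Match run: copy a slice from a random position in the buffer.
            start = rng.randrange(buffer_size - run + 1)
            out += buffer[start:start + run]
    return bytes(out)
```

With `ratio=1.0` every run is a literal run, so the sketch produces no repetitions at all; larger ratios copy from the buffer more often.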
Instead of sampling actual data, lzdatagen uses a simple power function to determine the distributions of literal values and match lengths. The exponents used can be set using the -l (--literal-exp) and -m (--match-exp) options.
This simplification means it cannot generate data with a limited alphabet, like DNA sequences.
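As a rough illustration of how a power function can shape a distribution, the sketch below maps a uniform random number u through u**exponent onto an integer range. The helper name `power_sample` and the exact mapping are assumptions made for this example; lzdatagen's own formula may differ.

```python
import random


def power_sample(rng, exponent, low, high):
    """Hypothetical helper: draw an integer in low..high, skewed by a power function."""
    u = rng.random()  # uniform in [0.0, 1.0)
    # u ** exponent stays in [0.0, 1.0); exponents above 1 push it toward 0,
    # so small results become more likely. An exponent of 1.0 is uniform.
    return low + int((high - low + 1) * u ** exponent)


rng = random.Random(42)
# Literal byte values skewed toward 0, using the default exponent of 3.0.
literals = [power_sample(rng, 3.0, 0, 255) for _ in range(10)]
# Match lengths skewed toward short matches; the 3..258 range is illustrative only.
lengths = [power_sample(rng, 3.0, 3, 258) for _ in range(10)]
```

Because every byte value in the range keeps a non-zero probability, a scheme like this cannot restrict the output to a small alphabet.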
The ratio parameter is approximate. Skewed literal distributions may create matches, and the way matches are created from a buffer may affect the distribution of byte values.
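One way to see how close a generated file comes to the target is to compress it and compare sizes. The sketch below uses zlib from the Python standard library as a stand-in compressor; the function name `compression_ratio` is hypothetical, and other compressors will report different ratios for the same data.

```python
import zlib


def compression_ratio(data, level=9):
    """Return uncompressed size divided by zlib-compressed size."""
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)
```

For a file produced by lzdgen, the bytes can be read with `open("foo.bin", "rb").read()` and passed to `compression_ratio`.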
Please note that while data generated in this way may be useful for some kinds of testing and benchmarking, it is no substitute for unit tests that cover the limits of an algorithm.
lzdatagen uses a PCG random number generator. In verbose mode it will print the seed value to stderr. The --seed option can be used to generate the same data again.
A few other projects in this area:
This project is licensed under the Apache License, Version 2.0.