LZ data generator

About

It is sometimes useful to generate data that resembles real data for testing or benchmarking purposes. For instance, it may be impractical to distribute large data sets with an application.

lzdatagen generates data suitable for dictionary compression techniques.

Usage

lzdatagen comes with an example application lzdgen that provides a command-line interface for generating data:

usage: lzdgen [options] OUTFILE

Generate compressible data for testing purposes.

options:
  -f, --force            overwrite output file
  -h, --help             print this help and exit
  -l, --literal-exp EXP  literal distribution exponent [3.0]
  -m, --match-exp EXP    match length distribution exponent [3.0]
  -o, --output OUTFILE   write output to OUTFILE
  -r, --ratio RATIO      compression ratio target [3.0]
  -S, --seed SEED        use 64-bit SEED to seed PRNG
  -s, --size SIZE        size with opt. k/m/g suffix [1m]
  -V, --version          print version and exit
  -v, --verbose          verbose mode

If OUTFILE is `-', write to standard output.

Examples

Generate 1 MiB of data that should compress roughly 1:4:

lzdgen -r 4.0 foo.bin

Generate 1 MiB of data that is compressible by entropy coding, but contains no LZ repetitions:

lzdgen -r 1.0 foo.bin

Generate 1 GiB of data, piped to zstd:

lzdgen -s 1g - | zstd -o foo.zstd

Details

Data is generated by inserting sequences of either random bytes or repetitions from a buffer of bytes, depending on the ratio parameter. This is based on the paper "SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks" by Raúl Gracia-Tinedo et al.
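The idea of mixing literal runs with repetitions can be sketched in a few lines of C. This is only an illustration under stated assumptions, not lzdatagen's actual code: the function name `sketch_generate`, the fixed maximum run length of 16, the use of `rand()`, and the rule that about 1/ratio of the output bytes are literals are all made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch: fill out[0..size) with runs of random literal bytes
 * and copies ("matches") of earlier output. Assumes that roughly 1/ratio of
 * the bytes should be literals, so ratio = 1.0 produces no matches.
 * Returns the number of literal bytes written. */
static size_t sketch_generate(unsigned char *out, size_t size,
                              double ratio, unsigned seed)
{
    size_t pos = 0;
    size_t literals = 0;

    assert(ratio >= 1.0);
    srand(seed);

    while (pos < size) {
        /* Run length between 1 and 16, clipped to the remaining space */
        size_t run = 1 + (size_t) rand() % 16;
        if (run > size - pos) {
            run = size - pos;
        }

        if (pos > 0 && (double) literals * ratio > (double) pos) {
            /* Enough literals so far: repeat bytes from an earlier offset.
             * Copying forward byte by byte allows overlapping matches. */
            size_t src = (size_t) rand() % pos;
            for (size_t i = 0; i < run; i++) {
                out[pos + i] = out[src + i];
            }
        } else {
            /* Too few literals so far: emit random bytes */
            for (size_t i = 0; i < run; i++) {
                out[pos + i] = (unsigned char) (rand() & 0xFF);
            }
            literals += run;
        }
        pos += run;
    }
    return literals;
}
```

Calling `sketch_generate(buf, sizeof buf, 4.0, 1u)` fills `buf` so that about a quarter of its bytes are literals and the rest are repetitions of earlier output.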

Instead of sampling actual data, lzdatagen uses a simple power function to determine the distributions of literal values and match lengths. The exponents used can be set using the --literal-exp and --match-exp options.
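One way a power function can skew a distribution is to raise a uniform random number in [0, 1) to the chosen exponent before scaling it to the value range. The sketch below shows that idea only; how lzdatagen applies its exponents internally is not described here, and the helper name `sketch_power_index` is hypothetical. To stay free of a math library dependency it is restricted to whole-number exponents, whereas the command-line options accept values such as 3.0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a uniform u in [0, 1) to an index in [0, n).
 * Raising u to a power greater than 1 pushes results toward index 0;
 * an exponent of 1 leaves the distribution uniform. */
static size_t sketch_power_index(double u, size_t n, unsigned exponent)
{
    double skewed = 1.0;
    size_t idx;

    assert(u >= 0.0 && u < 1.0);
    assert(n > 0 && exponent >= 1);

    /* Integer power by repeated multiplication */
    for (unsigned i = 0; i < exponent; i++) {
        skewed *= u;
    }

    idx = (size_t) (skewed * (double) n);

    /* Guard against rounding up to n */
    return idx < n ? idx : n - 1;
}
```

With an exponent of 3 and 256 possible values, the median draw u = 0.5 maps to index 32, so half of all draws fall in the lowest eighth of the range. With an exponent of 1 the same draw maps to index 128.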

This simplification means it cannot generate data with a limited alphabet, like DNA sequences.

The ratio parameter is approximate. Skewed literal distributions may create matches, and the way matches are created from a buffer may affect the distribution of byte values.

Please note that while data generated in this way may be useful for some kinds of testing and benchmarking, it is no substitute for unit tests that cover the limits of an algorithm.

lzdatagen uses a PCG random number generator. In verbose mode it will print the seed value to stderr. The --seed option can be used to generate reproducible data.
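The reason a seed makes output reproducible is that a PCG generator is fully deterministic given its initial state. The following stand-alone sketch re-implements the well-known PCG32 step (64-bit state, 32-bit output) to show this. The names `sketch_pcg32`, `sketch_pcg32_seed`, and `sketch_fill` are hypothetical, the stream constant is fixed to 1 as a simplification, and the code is not taken from lzdatagen.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-alone PCG32 state: 64-bit state plus odd increment */
typedef struct {
    uint64_t state;
    uint64_t inc;
} sketch_pcg32;

/* Advance the generator and return 32 output bits (XSH RR variant) */
static uint32_t sketch_pcg32_next(sketch_pcg32 *rng)
{
    uint64_t old = rng->state;
    uint32_t xorshifted = (uint32_t) (((old >> 18u) ^ old) >> 27u);
    uint32_t rot = (uint32_t) (old >> 59u);

    rng->state = old * 6364136223846793005ULL + rng->inc;
    return (xorshifted >> rot) | (xorshifted << ((0u - rot) & 31u));
}

/* Seed the generator; the stream constant is an assumption of this sketch */
static void sketch_pcg32_seed(sketch_pcg32 *rng, uint64_t seed)
{
    rng->state = 0u;
    rng->inc = 1u; /* must be odd */
    sketch_pcg32_next(rng);
    rng->state += seed;
    sketch_pcg32_next(rng);
}

/* Fill a buffer with bytes derived only from the seed */
static void sketch_fill(unsigned char *out, size_t size, uint64_t seed)
{
    sketch_pcg32 rng;

    sketch_pcg32_seed(&rng, seed);
    for (size_t i = 0; i < size; i++) {
        out[i] = (unsigned char) (sketch_pcg32_next(&rng) & 0xFFu);
    }
}
```

Two calls to `sketch_fill` with the same seed write identical bytes, which is the property the --seed option relies on for reproducible data.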

A few other projects in this area:

License

This project is licensed under the Apache License, Version 2.0.