A customisable compression utility dedicated to short inputs. Get the newest release.
Skrót allows you to build a model of your data and use it to compress short byte sequences with predictable contents. It can efficiently compress byte sequences shorter than 200 bytes. It's based on established dictionary-based data compression algorithms: LZMA and LZ4.
Interested in some numbers? Take a look at results of our benchmarks.
Skrót comes in two variants:
- a native library written in portable C, and
- a pure Java library with no native dependencies.
The two implementations share no code, and they are built and used differently.
Native library and a command line tool
A tentative C API is specified in
It's implemented in the form of a shared library
Its only dependencies are liblz4 and liblzma;
they're present in repositories of most contemporary GNU/Linux distributions and
they're available in Homebrew as
int skr_model(const uint8_t* input, size_t input_len,
              uint8_t** output, size_t output_len, skr_opts_t* opts);
int skr_compress(const uint8_t* model, size_t model_len,
                 const uint8_t* input, size_t input_len,
                 uint8_t** output, size_t output_len, skr_opts_t* opts);
int skr_decompress(const uint8_t* model, size_t model_len,
                   const uint8_t* input, size_t input_len,
                   uint8_t** output, size_t output_len, skr_opts_t* opts);
skr and its aliases
mkskr are command line wrappers around
At the time of writing it's written in Python.
They should be invoked in the following way:
- mkskr < raw_model > model: builds a Skrót model file from standard input,
- skr model < input > compressed: compresses contents of standard input using the model,
- unskr model < compressed > uncompressed: decompresses contents of standard input using the model.
libskr can be built by invoking
The API of the Java library has some documentation.
The Java library is built with Leiningen;
install it first if you want to build a JAR.
lein jar in the
The JAR will be placed in the
LZMA is a dictionary-based compression algorithm used in, among others, 7-Zip and xz-utils. LZ4 is a compression algorithm belonging to the same family as LZMA; it's used in, among others, the Linux kernel, Hadoop and the BSD implementation of ZFS. One interesting feature of these algorithms is that changes introduced at the end of an uncompressed input stream result only in changes at the end of the compressed output stream; the rest remains intact. Moreover, LZ4 and the basic form of LZMA, i.e. LZMA1, don't use any checksums.
Let's focus on LZMA1 in the rest of this section
(the same applies to LZ4, though).
Consider the following back-of-the-envelope reasoning.
Let's take two non-empty byte sequences, a and b. Compressing
a with LZMA1 produces comp_a.
Compressing a concatenation of a and b results in comp_ab. If a and
b are similar, the difference between the lengths of comp_ab and
comp_a is small compared to
As an example of similar byte sequences consider a C source file and a
corresponding header, or two JSON documents with an identical schema.
╭───┄─╮     ╭─────────╮
│  a  │  →  │ comp_a  │
╰───┄─╯     ╰─────────╯
╭───┄─┬───┄─╮     ╭───────────╮
│  a  │  b  │  →  │  comp_ab  │
╰───┄─┴───┄─╯     ╰───────────╯
comp_a and comp_ab are identical up to index n,
where the first differing byte has index
n + 1 ≤ length(comp_a).
╭─────────────────╮
│     comp_a      │
╰─────────────────╯
╭───────────────────────────╮
│          comp_ab          │
╰───────────────────────────╯
╭─────────────┬─────────────╮
│ comp_a[:n]  │ comp_ab[n:] │
╰─────────────┴─────────────╯
0             n
Since LZMA1 doesn't use any form of checksumming, in order to recover
we need is
Typically, the difference between
n is significantly
less than 255.
As a consequence,
comp_ab[n:] should be relatively short as long as a and
b are similar.
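The reasoning above can be checked empirically. The following is a minimal sketch, not Skrót's code: it assumes Python's standard lzma module with the legacy .lzma container (FORMAT_ALONE), which holds an LZMA1 stream and carries no integrity check. The two sample JSON documents and the helper names lzma1 and shared_prefix_len are made up for illustration.

```python
import lzma


def lzma1(data: bytes) -> bytes:
    """Compress with LZMA1 in the legacy .lzma container (no checksum)."""
    return lzma.compress(data, format=lzma.FORMAT_ALONE)


def shared_prefix_len(p: bytes, q: bytes) -> int:
    """Return the number of leading bytes that p and q have in common."""
    n = 0
    limit = min(len(p), len(q))
    while n < limit and p[n] == q[n]:
        n += 1
    return n


# Two hypothetical JSON documents with an identical schema.
a = b'{"user": "alice", "id": 17, "active": true, "roles": ["admin"]}'
b = b'{"user": "bob", "id": 42, "active": false, "roles": ["guest"]}'

comp_a = lzma1(a)
comp_b = lzma1(b)
comp_ab = lzma1(a + b)
n = shared_prefix_len(comp_a, comp_ab)

# Compare the tail that has to be stored with compressing b on its own.
print("len(comp_a)      =", len(comp_a))
print("len(comp_b)      =", len(comp_b))
print("len(comp_ab)     =", len(comp_ab))
print("shared prefix n  =", n)
print("len(comp_ab[n:]) =", len(comp_ab) - n)
```

The exact numbers depend on the liblzma version and preset in use, so none are quoted here.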
This allows us to build a specialised compression tool for any byte sequence
x similar to a.
A compressed representation of a,
comp_a, shall be built into the tool.
Given x, in order to compress it the tool
- compresses a concatenation of a and x into comp_ax,
- determines the smallest n such that comp_ax[n] ≠ comp_a[n], and
- returns n and comp_ax[n:] as a compressed representation of x.
In order to obtain x from
n and a byte sequence c, the tool
- uncompresses a concatenation of comp_a[:n] and c, and
- drops the len(a) first elements of the result.
That's the whole algorithm. It tends to work.
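The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Skrót's implementation: it uses the standard lzma module with FORMAT_ALONE as a stand-in for LZMA1, it keeps the raw model bytes around and recompresses them on every call, and the function names and the (n, tail) return shape are hypothetical. Skrót's actual model and output formats may differ.

```python
import lzma


def _lzma1(data: bytes) -> bytes:
    """Compress with LZMA1 in the legacy .lzma container (no checksum)."""
    return lzma.compress(data, format=lzma.FORMAT_ALONE)


def _shared_prefix_len(p: bytes, q: bytes) -> int:
    """Return the number of leading bytes that p and q have in common."""
    n = 0
    limit = min(len(p), len(q))
    while n < limit and p[n] == q[n]:
        n += 1
    return n


def compress_with_model(model: bytes, x: bytes):
    """Return (n, tail): the shared prefix length and the differing tail."""
    comp_a = _lzma1(model)
    comp_ax = _lzma1(model + x)
    n = _shared_prefix_len(comp_a, comp_ax)
    return n, comp_ax[n:]


def decompress_with_model(model: bytes, n: int, tail: bytes) -> bytes:
    """Rebuild the full compressed stream, decompress it, drop the model."""
    comp_a = _lzma1(model)
    restored = lzma.decompress(comp_a[:n] + tail, format=lzma.FORMAT_ALONE)
    return restored[len(model):]


# Hypothetical model and input sharing a JSON schema.
model = b'{"user": "alice", "id": 17, "active": true, "roles": ["admin"]}'
x = b'{"user": "bob", "id": 42, "active": false, "roles": ["guest"]}'

n, tail = compress_with_model(model, x)
print("input:", len(x), "bytes; tail:", len(tail), "bytes; n =", n)
print("round trip ok:", decompress_with_model(model, n, tail) == x)
```

Because n is defined as the length of the prefix shared by both compressed streams, comp_a[:n] followed by the tail is byte-for-byte the compressed concatenation, so the round trip does not depend on how similar the input is to the model. Only the size of the tail does.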
Smaz “is a simple compression library suitable for compressing very short strings”. It excels at compressing English.
Copyright (c) 2014 Jan Stępień

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.