Lempel-Ziv-JodyBruchon compression
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.gitignore Rename lzjb to lzjody to avoid conflict with another algorithm Feb 4, 2015
LICENSE Split lzjb.c to have a header file, add LICENSE/README Nov 24, 2014
Makefile Add compiler warnings and correct reported warnings Nov 3, 2016
README Warn that pthreads mode doesn't work and disable by default Sep 1, 2015
benchmark.sh Rename lzjb to lzjody to avoid conflict with another algorithm Feb 4, 2015
byteplane_xfrm.c Separate byteplane transform into its own files Jan 4, 2015
byteplane_xfrm.h Obey C namespace underscore rules Mar 6, 2015
chroot_build.sh
lzjody.c Add compiler warnings and correct reported warnings Nov 3, 2016
lzjody.h Place C++ guards in header Sep 2, 2015
lzjody_util.c
lzjody_util.h Port lzjb and the example utility to Windows (MinGW-w64) Oct 15, 2015
test.incompressible Add some simple testing stuff Nov 23, 2014
test.input Add more test blocks to test.input Sep 1, 2015
test.sh test.sh: improve testing and expand number of test cases Sep 1, 2015

README

                    LEMPEL-ZIV/JODY BRUCHON COMPRESSION
                    ===================================

     If you have any questions, comments, or patches, send me an email:
                             jody@jodybruchon.com

             This software is released under the terms of the
                    GNU General Public License version 2.


GENERAL INFORMATION
-------------------

This code compresses and decompresses a data stream using a combination of
compression techniques that are optimized for compressing disk image data.

Compression methods used by this program include:

* Run-length encoding (RLE), packing long repetitions of a single byte value
  into a short value:length pair

* Lempel-Ziv (dictionary-based) compression with optimized searching

* Sequential increment compression, where 8-, 16-, and 32-bit values that
  are incremented by 1 are converted to a pair consisting of an  inital value
  and a count

* Byte plane transformation, putting bytes at specific intervals together to
  allow compression of some forms of otherwise incompressible data. This is
  performed on otherwise incompressible data to see if it can be arranged
  differently to produce a compressible pattern.

The included compression utility can use POSIX threads. If you want to try
this out (WARNING: 2015-09-01: currently does not work properly) type:

make THREADED=1

You can also use DEBUG=1 to turn on some very annoying debugging messages.

The LZJODY library accepts blocks for compression up to 4096 bytes in size and
is designed to guarantee no more than four bytes of data expansion for a
block that is 100% incompressible. The compress/decompress functions return
the number of bytes that are output by the function. This return value can
be examined by the calling application and used to determine if it will be
better to store the data uncompressed with an "out-of-band" indicator that
the block is stored raw instead of in the LZJODY compressed format.


KNOWN BUGS AND QUIRKS
---------------------

- lzjody seems to compress the same data slightly differently sometimes. The
  cause is not yet known and the phenomenon is only very rarely observed.
  Thorough testing has shown that decompressed data is consistently correct
  even when this rare glitch in compression occurs, and observed differences
  in final size only seem to deviate approximately 6 bytes for every 600MB of
  compressed data.


COMPRESSED DATA FORMAT
----------------------

All 12-bit lengths are stored in big-endian format. This allows the packing
of the most common compression flags into the upper 4 bits when a 12-bit
length or offset is stored.

The first 2 bytes of a compressed block are the 12-bit length of all of the
compressed data. All data following these bytes are sub-blocks of data
prefixed with a compression command, a command-dependent set of bytes of
metadata, and the compressed data to be processed by the decompressor.

The first byte of every sub-block always contains a compression command
Bit 0x80 is a flag that indicates whether the command is stored in a short
form. Bit 0x10 indicates a long LZ match that requires an additional byte
of storage. Bits 0x20 and 0x40 select between LZ, RLE, and sequential
increment compression. If 0x40, 0x20, and 0x10 are all zero, the lower 4
bits are an "extended" command; the short form flag 0x80 can apply to
extended commands. Refer to lzjody.h for the exact mapping of bits to
algorithms and modifiers.

For "standard" commands, the short forms store a 4-bit length or offset in
the lower 4 bits of the command byte. "Normal" (not short) forms store a
12-bit value instead, with the top 4 bits packed into the bottom 4 bits of
the command byte. If the "LZ long" flag is set, the LZ compressor stored
more than 255 bytes and the length value is an extra byte in length to
accommodate the 12-bit length.

For "extended" commands, the short forms have a slightly different meaning.
Since the lower 4 bits of the command byte contain the extended command,
the short form of an extended command indicates a one-byte offset instead
of a two-byte (12-bit) offset.


LEMPEL-ZIV COMPRESSION
----------------------

LZ compression uses a "dictionary" of previously encountered data to reduce
later occurrences of the same data to (offset,length) references. In this
implementation the dictionary scanning performance is greatly increased by
using a "jump list" of offsets for each byte. The data block is scanned by
an indexer before LZ compression starts to make these lists and the scanner
uses the byte value itself to find the correct list. If a particular byte
results in a list that is very long (the exact threshold for which is in the
lzjody.h header and was chosen through performance profiling) then the LZ
compressor will fall back to the byte-by-byte linear scanner. This is done
because following the jump list entries is more expensive than scanning all
bytes one by one when too many bytes are of the value being scanned for.

The LZ algorithm also performs "fast rejection" checks that prevent entry
into a full LZ scan loop if the last byte of the minimum match length does
not match. This check results in a significant increase in performance.


RUN-LENGTH ENCODING
-------------------

RLE detects long runs of identical bytes and replaces them with a simple
(value,length) pair. There is very little to say about this algorithm; it
is very simple in both concept and implementation.


INCREMENTAL SEQUENCE COMPRESSION
--------------------------------

Incremental sequences are runs of values that increase by 1 for every
successive value. LZJODY scans for sequences of 8, 16, and 32 bit widths and
reduces them to (start,count) pairs. For example, Seq(16) compression on the
following 8-byte data stream:

0x2201, 0x2202, 0x2203, 0x2204 -> (Start at 0x2201, output 4 values)

The 8 bytes would be reduced to 4 bytes: the compression command, a byte-wide
value count, and the initial 16-bit value.


BYTE PLANE TRANSFORMATION
-------------------------

While developing this compression algorithm and examining a data block
which could not be compressed, it was observed in the hex editor that the
data contained easily compressed patterns in the _columns_ even though the
rows could not be compressed. The data could be easily compressed if the
existing algorithms could somehow read the data "vertically." This is how the
idea for "byte plane transformation" was conceived. The name comes from the
way that VGA controllers store video data, where four "planes" of bits are
combined to obtain the 4-bit color value for a pixel.

The best way to see how this works is with some diagrams. Take this data
stream, for example:

                 Input stream position
 00 01 02 03   04 05 06 07   08 09 0a 0b   0c 0d 0e 0f
+--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+
|0a|ff|80|c9| |0b|ff|80|c9| |0c|ff|80|c9| |0d|ff|80|c9|
+--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

After a 4-byte plane transformation, the data stream consists of the data as
if it was arranged "vertically." Look at the patterns that emerge this way:

        Plane
      00 01 02 03
  --++--+--+--+--+
D 00||0a|ff|80|c9|
A 01||0b|ff|80|c9|
T 02||0c|ff|80|c9|
A 03||0d|ff|80|c9|
  --++--+--+--+--+
       |  |  |  \--- RLE (4)
       |  |  |
       |  |  \------ RLE (4)
       |  |
       |  \--------- RLE (4)
       |
       \------------ Seq(8) (4)

The output from the 4-byte plane transform on this data would be this stream:

                 Output stream position
 00 01 02 03   04 05 06 07   08 09 0a 0b   0c 0d 0e 0f
+--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+
|0a|0b|0c|0d| |ff|ff|ff|ff| |80|80|80|80| |c9|c9|c9|c9|
+--+--+--+--+ +--+--+--+--+ +--+--+--+--+ +--+--+--+--+

The result is a data stream that is now compressible for minimal extra cost.


A NOTE OF CAUTION
-----------------

This code has ONLY been tested on Intel i386 and x86_64 Linux platforms.

Endianness or unaligned access issues are likely to be present if run on
other platforms. Please submit patches to me if you have such a platform and
can help fix the bugs without otherwise changing the behavior of the program.


MISCELLANY
----------

lzjody was initially called "lzjb" and the name was changed to avoid confusion
with an existing compression algorithm by the same name that is used in the
ZFS filesystem.