Tatu Saloranta edited this page May 3, 2016 · 10 revisions

LZF format specification / description

This document describes the structure of the LZF data format, as originally used in the LibLZF library. LZF is the format supported by the Java LZF library.

High-level structure

At the file or stream level, LZF compressed content is a simple sequence of chunks. Each chunk consists of a simple header and a payload. There are no chunk separators, sequence numbers, or other kinds of metadata. This means that one can simply keep appending chunks to the end of a stream or file.

Chunk structure

Each chunk consists of two parts: header and payload.


The header is either 5 or 7 bytes long: 5 bytes for uncompressed chunks, 7 bytes for compressed chunks.

Both kinds start with the same 5 bytes, which contain a brief signature for content auto-detection (the first 2 bytes), identify the chunk type (compressed or not), and indicate the length of the stored chunk content:

#0: Letter 'Z' (0x5A)
#1: Letter 'V' (0x56)
#2: Type
   0x00: Non-compressed chunk
   0x01: Compressed chunk

   Other values are reserved for future use -- it is possible that in future this byte may be
   considered a bit field, where least-significant bit determines compression status of the chunk.
#3, #4: `ChunkLength` (uint16), Big-Endian (#3 is MSB, #4 is LSB). Indicates the length of the
   physical chunk content that follows the header.

In addition, compressed chunks (type 0x01) contain 2 additional header bytes:

#5, #6: `OriginalLength` (uint16), Big-Endian -- length of original uncompressed chunk, which is
   typically bigger than the stored (compressed) chunk.

Because both lengths are 16-bit fields, the maximum length of both the uncompressed input and the stored payload of a single chunk is 0xFFFF (65535 bytes).
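The header layout above can be read with a few lines of code. The following is a minimal sketch, not the code of any particular library: the function name `parse_chunk_header` and the shape of its return value are assumptions made for illustration.

```python
import struct

SIGNATURE = b"ZV"          # bytes #0 and #1
TYPE_UNCOMPRESSED = 0x00   # byte #2
TYPE_COMPRESSED = 0x01


def parse_chunk_header(buf, pos=0):
    """Return (chunk_type, chunk_length, original_length, header_size).

    Hypothetical helper: for uncompressed chunks the original length is
    reported as equal to ChunkLength, since the payload is stored as is.
    """
    if len(buf) - pos < 5:
        raise ValueError("truncated header")
    if buf[pos:pos + 2] != SIGNATURE:
        raise ValueError("missing 'ZV' signature")
    chunk_type = buf[pos + 2]
    # ChunkLength: big-endian uint16 in bytes #3 and #4
    (chunk_length,) = struct.unpack_from(">H", buf, pos + 3)
    if chunk_type == TYPE_UNCOMPRESSED:
        return chunk_type, chunk_length, chunk_length, 5
    if chunk_type == TYPE_COMPRESSED:
        if len(buf) - pos < 7:
            raise ValueError("truncated header")
        # OriginalLength: big-endian uint16 in bytes #5 and #6
        (original_length,) = struct.unpack_from(">H", buf, pos + 5)
        return chunk_type, chunk_length, original_length, 7
    raise ValueError("reserved chunk type: 0x%02X" % chunk_type)
```

A reader would skip `header_size` bytes, consume `chunk_length` payload bytes, and then expect the next header (or end of input) immediately after.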



For uncompressed chunks, the header is simply followed by `ChunkLength` source bytes; no additional processing is done.
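Since uncompressed chunks need no processing, they are enough to produce a valid LZF stream. The sketch below wraps arbitrary bytes in type 0x00 chunks and reads them back; the function names are hypothetical, and splitting the input at 65535 bytes is simply the largest size the `ChunkLength` field allows.

```python
import struct

MAX_CHUNK = 0xFFFF  # largest value a uint16 length field can hold


def write_uncompressed_chunks(data):
    """Wrap data in type-0x00 chunks of at most 65535 bytes each."""
    out = bytearray()
    for start in range(0, len(data), MAX_CHUNK):
        piece = data[start:start + MAX_CHUNK]
        # 'Z', 'V', type 0x00, then big-endian ChunkLength
        out += b"ZV" + bytes([0x00]) + struct.pack(">H", len(piece))
        out += piece
    return bytes(out)


def read_uncompressed_chunks(stream):
    """Concatenate payloads of a stream holding only type-0x00 chunks."""
    out = bytearray()
    pos = 0
    while pos < len(stream):
        if len(stream) - pos < 5:
            raise ValueError("truncated header")
        if stream[pos:pos + 2] != b"ZV" or stream[pos + 2] != 0x00:
            raise ValueError("expected an uncompressed chunk")
        (length,) = struct.unpack_from(">H", stream, pos + 3)
        if pos + 5 + length > len(stream):
            raise ValueError("truncated payload")
        out += stream[pos + 5:pos + 5 + length]
        pos += 5 + length  # next chunk starts right after the payload
    return bytes(out)
```

Because chunks carry no separators or sequence numbers, concatenating the output of two `write_uncompressed_chunks` calls yields a stream that reads back as the concatenation of the two inputs.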


For compressed chunks, the payload is a sequence of segments of two types: literal runs and back-references. The first byte of a segment (TypeByte) determines the type as follows:

  • Values 0x00 - 0x1F indicate a literal run
  • Values 0x20 - 0xFF indicate a back-reference; and further
    • Values 0xE0 - 0xFF indicate a long back-reference; all others a short one

Another way to look at this is that the 3 most-significant bits of TypeByte (that is, TypeByte >> 5) determine the type, such that:

  • 0x00: Literal run
  • 0x01 - 0x06: Short back-reference (the value also encodes the run-length, see the back-reference section)
  • 0x07: Long back-reference
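The classification above amounts to a shift and two comparisons. A minimal sketch follows; the function name `segment_kind` and the string labels it returns are made up for illustration.

```python
def segment_kind(type_byte):
    """Classify a segment by the 3 most-significant bits of its TypeByte."""
    if not 0 <= type_byte <= 0xFF:
        raise ValueError("TypeByte must fit in one byte")
    top3 = type_byte >> 5
    if top3 == 0:
        return "literal"      # 0x00 - 0x1F
    if top3 == 7:
        return "long-ref"     # 0xE0 - 0xFF
    return "short-ref"        # 0x20 - 0xDF
```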

Compressed: literal run

Literal runs are simple: TypeByte encodes a length between 1 and 32 bytes (that is, the length is TypeByte + 1), and TypeByte is followed by that many content bytes, stored as is.
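As an illustration of the literal-run rule, the sketch below decodes a single run and encodes arbitrary bytes as literal runs only. Both function names are hypothetical, and the encoder performs no compression; it only shows how the 32-byte limit forces longer input to be split.

```python
def decode_literal_run(payload, pos):
    """Return (literal_bytes, next_pos) for the literal run starting at pos."""
    type_byte = payload[pos]
    if type_byte > 0x1F:
        raise ValueError("not a literal run")
    length = type_byte + 1          # stored value 0..31 means 1..32 bytes
    end = pos + 1 + length
    if end > len(payload):
        raise ValueError("truncated literal run")
    return payload[pos + 1:end], end


def encode_literals(data):
    """Encode data purely as literal runs of at most 32 bytes each."""
    out = bytearray()
    for start in range(0, len(data), 32):
        piece = data[start:start + 32]
        out.append(len(piece) - 1)  # TypeByte = length - 1
        out += piece
    return bytes(out)
```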

Compressed: back-reference

Back-references encode a reference to a sequence of bytes that have already been decoded within this chunk. They consist of two parts: the run-length (the number of bytes to copy) and the offset (the number of bytes to go back from the current output position to find the sequence to copy). Valid distances are between 1 and 8192; the stored `offset` field holds the distance minus 1 (0 to 8191), which fits in 13 bits. The minimum run-length is 3 bytes, and the maximum is 264.

Note that back-references can and do overlap with the bytes they produce: that is, it is legal for the offset to be smaller than the length of the referenced byte sequence. This is useful for repeating patterns: an offset of -1, for example, can be used for basic Run-length Encoding (compressing sequences of the same byte). A decoder has to account for this possibility and may not be able to use bulk copy for handling all back-references.
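The overlap case is easiest to see in code. The following sketch copies one byte at a time, so bytes appended earlier in the same copy can be read again later in it; `copy_back_reference` is a hypothetical helper, and `distance` here means the effective distance (1 to 8192), not the stored field.

```python
def copy_back_reference(out, distance, run_length):
    """Append run_length bytes to bytearray out, reading distance bytes back.

    Copies byte by byte on purpose: when distance < run_length the source
    range overlaps the bytes being written, which a single slice copy
    would not reproduce.
    """
    if not 1 <= distance <= len(out):
        raise ValueError("back-reference points before start of output")
    start = len(out) - distance
    for i in range(run_length):
        out.append(out[start + i])
```

For example, starting from `out = bytearray(b"ab")`, calling `copy_back_reference(out, 2, 5)` leaves `out` holding `abababa`, and a distance of 1 repeats the last byte `run_length` times.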

There are two kinds of back-references: short and long. The difference is that a short reference is encoded using 2 bytes and can have a run-length of at most 8 bytes; a long reference is encoded using 3 bytes and can have a run-length of up to 264 bytes.

The actual offset applied is -(offset + 1), relative to the current output position, which gives effective offsets between -1 and -8192.

Short references are encoded as follows:

#0: `TypeByte`:
  # Bits 0-4 (5 LSB): 5 Most-Significant Bits of `offset`
  # Bits 5-7 (3 MSB): `run-length - 2`: encodes lengths 3 - 8 with values 1 - 6
     -- values 0 and 7 prohibited since those indicate literal runs (0) and long-reference (7)
#1: `Offset`: 8 LSB of `offset`
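The short-reference layout can be sketched as a pair of helpers. The names are hypothetical, and `distance` is the effective distance (1 to 8192), so the stored 13-bit `offset` is `distance - 1`.

```python
def encode_short_ref(distance, run_length):
    """Encode a 2-byte short back-reference."""
    if not 3 <= run_length <= 8:
        raise ValueError("short references cover run-lengths 3..8")
    if not 1 <= distance <= 8192:
        raise ValueError("distance must be 1..8192")
    stored = distance - 1                       # 13-bit stored offset
    type_byte = ((run_length - 2) << 5) | (stored >> 8)
    return bytes([type_byte, stored & 0xFF])


def decode_short_ref(payload, pos):
    """Return (distance, run_length, next_pos) for a short reference at pos."""
    type_byte = payload[pos]
    code = type_byte >> 5                       # 3 MSB: run-length - 2
    if not 1 <= code <= 6:
        raise ValueError("not a short back-reference")
    stored = ((type_byte & 0x1F) << 8) | payload[pos + 1]
    return stored + 1, code + 2, pos + 2
```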

Long references are encoded as follows:

#0: `TypeByte`:
  # Bits 0-4 (5 LSB): 5 Most-Significant Bits of `offset`
  # Bits 5-7 (3 MSB): constant `0x7` (to indicate type as long reference)
#1: `Length`: `run-length - 9` to allow references to sequences of 9 - 264 bytes
#2: `Offset`: 8 LSB of `offset`
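Putting the segment types together, the payload of one compressed chunk can be decoded with a single loop. This is a minimal sketch under the rules described in this document, not the implementation used by LibLZF or the Java library; the function name and its error handling are assumptions.

```python
def decode_compressed_payload(payload, original_length):
    """Decode the payload of one compressed chunk (header already removed).

    Truncated input surfaces as IndexError or as a length mismatch.
    """
    out = bytearray()
    pos = 0
    while pos < len(payload):
        type_byte = payload[pos]
        pos += 1
        code = type_byte >> 5
        if code == 0:
            # Literal run: TypeByte + 1 bytes copied as is
            length = type_byte + 1
            out += payload[pos:pos + length]
            pos += length
            continue
        if code == 7:
            # Long reference: extra Length byte stores run-length - 9
            run_length = payload[pos] + 9
            pos += 1
        else:
            # Short reference: 3 MSB store run-length - 2
            run_length = code + 2
        # 13-bit stored offset: 5 bits from TypeByte, 8 from the last byte
        distance = (((type_byte & 0x1F) << 8) | payload[pos]) + 1
        pos += 1
        if distance > len(out):
            raise ValueError("back-reference points before start of chunk")
        start = len(out) - distance
        for i in range(run_length):   # byte by byte: ranges may overlap
            out.append(out[start + i])
    if len(out) != original_length:
        raise ValueError("decoded length does not match OriginalLength")
    return bytes(out)
```

As a usage example, the payload bytes `01 61 62 E0 01 01 20 00` hold a 2-byte literal run (`ab`), a long reference with distance 2 and run-length 10, and a short reference with distance 1 and run-length 3; they decode to 15 bytes, `ab` repeated six times followed by `bbb`.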