At the file or stream level, LZF-compressed content consists of a simple sequence of chunks. Each chunk consists of a simple header and a payload. There are no chunk separators, sequence numbers, or other kinds of metadata. This means that one can simply keep appending chunks at the end of a stream or file.
Each chunk consists of two parts: header and payload.
The header is either 5 or 7 bytes long: 5 bytes for uncompressed chunks, 7 bytes for compressed chunks.
Both kinds start with the same 5 bytes, which contain a brief signature for content auto-detection (the first 2 bytes), identify the chunk type (compressed or not), and indicate the length of the stored chunk payload:
* #0: Letter 'Z' (0x5A)
* #1: Letter 'V' (0x56)
* #2: Type
  * 0x00: Non-compressed chunk
  * 0x01: Compressed chunk
  * Other values are reserved for future use -- it is possible that in the future this byte may be treated as a bit field, where the least-significant bit determines the compression status of the chunk
* #3, #4: `ChunkLength` (uint16): in big-endian notation (#3 is MSB, #4 is LSB). Indicates the length of the physical chunk content that follows the header.
In addition, compressed chunks (type `0x01`) contain 2 additional header bytes:
* #5, #6: `OriginalLength` (uint16), big-endian -- length of the original uncompressed chunk, which is typically bigger than the stored (compressed) chunk.
As can be seen from the above, the maximum length of both the uncompressed input and the compressed payload of a chunk is `0xFFFF` (65535 bytes).
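The header layout described above can be sketched as a small parser. This is only an illustrative sketch: the names `ChunkHeader` and `parse_chunk_header` are hypothetical and not part of any LZF library, and rejecting reserved type values is an assumption about how a strict reader might behave.

```python
import struct
from typing import NamedTuple, Optional


class ChunkHeader(NamedTuple):
    compressed: bool
    chunk_length: int               # bytes of payload stored after the header
    original_length: Optional[int]  # only present for compressed chunks
    header_size: int                # 5 or 7


def parse_chunk_header(buf: bytes, pos: int = 0) -> ChunkHeader:
    """Parse one chunk header starting at buf[pos] (hypothetical helper)."""
    if len(buf) - pos < 5:
        raise ValueError("truncated chunk header")
    # '>2sBH': 2 signature bytes, 1 type byte, big-endian uint16 ChunkLength
    signature, chunk_type, chunk_length = struct.unpack_from(">2sBH", buf, pos)
    if signature != b"ZV":
        raise ValueError("missing 'ZV' signature")
    if chunk_type == 0x00:
        return ChunkHeader(False, chunk_length, None, 5)
    if chunk_type == 0x01:
        if len(buf) - pos < 7:
            raise ValueError("truncated compressed chunk header")
        # Bytes #5 and #6: big-endian uint16 OriginalLength
        (original_length,) = struct.unpack_from(">H", buf, pos + 5)
        return ChunkHeader(True, chunk_length, original_length, 7)
    # Assumption: a strict reader rejects the reserved type values.
    raise ValueError(f"reserved chunk type 0x{chunk_type:02X}")
```

A reader would skip `header_size + chunk_length` bytes to reach the next chunk header.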
For uncompressed chunks, the header is simply followed by `ChunkLength` source bytes; no additional processing is done.
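Because uncompressed chunks need no processing, any byte string can be turned into a valid LZF stream just by prefixing headers. The sketch below illustrates this; `wrap_uncompressed` is a hypothetical name, and splitting the input at exactly 65535 bytes is one possible choice, since any chunk size up to that limit would satisfy the format.

```python
import struct

MAX_CHUNK = 0xFFFF  # ChunkLength is a uint16, so at most 65535 bytes per chunk


def wrap_uncompressed(data: bytes) -> bytes:
    """Store data as a sequence of non-compressed (type 0x00) chunks."""
    out = bytearray()
    for start in range(0, len(data), MAX_CHUNK):
        piece = data[start:start + MAX_CHUNK]
        # 'Z', 'V', type 0x00, then big-endian uint16 ChunkLength
        out += b"ZV" + struct.pack(">BH", 0x00, len(piece))
        out += piece
    return bytes(out)
```

Since there are no separators or sequence numbers, `wrap_uncompressed(a) + wrap_uncompressed(b)` is itself a valid stream containing `a` followed by `b`.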
For compressed chunks, the payload consists of a sequence of two types of segments: literal runs and back-references. The first byte of a segment (`TypeByte`) determines the type as follows:
* `0x00` - `0x1F`: literal run
* `0x20` - `0xFF`: back-reference; and further
  * `0xE0` - `0xFF` indicate a long back-reference; others a short one
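Another way to look at this is that the 3 most-significant bits of `TypeByte` determine the type, such that:
* `000`: literal run
* `001` - `110`: short back-reference
* `111`: long back-reference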
Literal runs are simple: the first byte indicates a length between 1 and 32 bytes (that is, the length is `TypeByte + 1`), and `TypeByte` is followed by `length` content bytes, stored as is.
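The segment-type rule and the literal-run layout can be sketched as follows. Both function names are hypothetical, and the sketch relies on the top 3 bits of `TypeByte` being 0 for literal runs and 7 for long back-references, with every other value meaning a short back-reference.

```python
from typing import Tuple


def classify_type_byte(type_byte: int) -> str:
    """Return the segment type selected by the 3 most-significant bits."""
    if not 0 <= type_byte <= 0xFF:
        raise ValueError("type byte must fit in 8 bits")
    top3 = type_byte >> 5
    if top3 == 0:
        return "literal"
    if top3 == 7:
        return "long-ref"
    return "short-ref"


def read_literal_run(payload: bytes, pos: int) -> Tuple[bytes, int]:
    """Read a literal run at payload[pos]; return (content, next position)."""
    type_byte = payload[pos]
    if type_byte >> 5 != 0:
        raise ValueError("segment is not a literal run")
    length = type_byte + 1  # stored values 0..31 encode lengths 1..32
    end = pos + 1 + length
    if end > len(payload):
        raise ValueError("truncated literal run")
    return payload[pos + 1:end], end
```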
Back-references encode a reference to a sequence of bytes that has already been decoded within this chunk.
They consist of two parts:
* run-length: the number of bytes to copy, and
* offset: the number of bytes to go back from the current output position to find the sequence to copy.

Valid offset values are between 1 and 8192; they are stored as 0 to 8191, which fits in 13 bits. The minimum run-length is 3 bytes, and the maximum is 264.
Note that back-references can and do overlap with the output being written: that is, it is legal for the offset to be smaller than the length of the referenced byte sequence. This is useful for repeating patterns: an offset of -1, for example, can be used for basic run-length encoding (compressing sequences of identical bytes). A decoder has to account for this possibility and may not be able to use a bulk copy for all back-references.
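A byte-by-byte copy is the simplest way to handle the overlapping case correctly, because each copied byte becomes available as a source for the following ones. The sketch below illustrates this; `copy_back_reference` is a hypothetical helper, and its `distance` argument is assumed to be the positive number of bytes to go back (1 for an effective offset of -1).

```python
def copy_back_reference(out: bytearray, distance: int, run_length: int) -> None:
    """Append run_length bytes copied from `distance` bytes before the end of out."""
    if not 1 <= distance <= len(out):
        raise ValueError("back-reference points before the start of the output")
    src = len(out) - distance
    # Copy one byte at a time: when distance < run_length the source range
    # overlaps the bytes being written, so a single slice copy would not work.
    for i in range(run_length):
        out.append(out[src + i])
```

For example, starting from `bytearray(b"a")`, a copy with `distance=1` and `run_length=5` yields six `a` bytes, which is the run-length-encoding case mentioned above.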
There are two kinds of back-references: short and long. The difference is that a short reference is encoded using 2 bytes and can have a run-length of 3 to 8 bytes, whereas a long reference is encoded using 3 bytes and can have a run-length of 9 to 264 bytes.
The actual offset used is `-(offset + 1)`, calculated from the current output position, which gives effective offsets between -1 and -8192.
Short references are encoded as follows:
* #0: `TypeByte`:
  * Bits 0-4 (5 LSB): 5 most-significant bits of `offset`
  * Bits 5-7 (3 MSB): `run-length - 2`: encodes lengths 3 - 8 with values 1 - 6 -- values 0 and 7 are prohibited since those indicate literal runs (0) and long references (7)
* #1: `Offset`: 8 LSB of `offset`
Long references are encoded as follows:
* #0: `TypeByte`:
  * Bits 0-4 (5 LSB): 5 most-significant bits of `offset`
  * Bits 5-7 (3 MSB): constant `0x7` (to indicate the type as long reference)
* #1: `Length`: `run-length - 9`, so that values 0 - 255 allow references to sequences of 9 - 264 bytes
* #2: `Offset`: 8 LSB of `offset`
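Putting the segment encodings together, a decoder for the payload of one compressed chunk might look like the sketch below. It is a minimal illustration rather than a reference implementation: `decode_payload` is a hypothetical name, the length byte of a long reference is read as `run-length - 9`, and the final comparison against `OriginalLength` is an assumed sanity check.

```python
def decode_payload(payload: bytes, original_length: int) -> bytes:
    """Decode the payload of one compressed chunk (header already removed)."""
    out = bytearray()
    pos = 0
    while pos < len(payload):
        type_byte = payload[pos]
        pos += 1
        top3 = type_byte >> 5
        if top3 == 0:
            # Literal run: TypeByte + 1 content bytes follow as is.
            length = type_byte + 1
            if pos + length > len(payload):
                raise ValueError("truncated literal run")
            out += payload[pos:pos + length]
            pos += length
            continue
        # Back-reference: long ones carry an extra length byte.
        needed = 2 if top3 == 7 else 1
        if pos + needed > len(payload):
            raise ValueError("truncated back-reference")
        if top3 == 7:
            run_length = payload[pos] + 9  # 9..264
            pos += 1
        else:
            run_length = top3 + 2          # 3..8
        # 13-bit offset: 5 MSB from TypeByte, 8 LSB from the last byte.
        offset = ((type_byte & 0x1F) << 8) | payload[pos]
        pos += 1
        distance = offset + 1              # go back 1..8192 bytes
        if distance > len(out):
            raise ValueError("back-reference before start of chunk")
        src = len(out) - distance
        for i in range(run_length):        # byte-wise, so overlaps work
            out.append(out[src + i])
    if len(out) != original_length:
        raise ValueError("decoded size does not match OriginalLength")
    return bytes(out)
```

As a usage example, the five-byte payload `b"\x00a\xe0\x00\x00"` holds a one-byte literal run followed by a long reference with run-length 9 and effective offset -1, so it decodes to ten `a` bytes.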