LZFFormat
This document describes the structure of the LZF data format, as originally used in the LibLZF library. It is the format supported by the Java LZF library.
At the file or stream level, LZF compressed content consists of a simple sequence of chunks. There are no chunk separators, sequence numbers, or other kinds of metadata. This means that one can simply keep appending chunks at the end of a stream or file.
Each chunk consists of two parts: a header and a payload.
The header is either 5 or 7 bytes long: 5 bytes for uncompressed chunks, 7 bytes for compressed chunks.
Both kinds start with the same 5 bytes, which contain a brief signature for content auto-detection (first 2 bytes), identify the chunk type (compressed or not), and indicate the length of the stored chunk payload:
#0: Letter 'Z' (0x5A)
#1: Letter 'V' (0x56)
#2: Type
0x00: Non-compressed chunk
0x01: Compressed chunk
Other values are reserved for future use -- it is possible that in the future this byte may be
considered a bit field, where the least-significant bit determines the compression status of the chunk
#3, #4: `ChunkLength` (uint16): in Big Endian notation (#3 is MSB, #4 LSB). Indicates length of
physical chunk content that follows header.
In addition, compressed chunks (type `0x01`) contain 2 additional header bytes:
#5, #6: `OriginalLength` (uint16), Big-Endian -- length of original uncompressed chunk, which is
typically bigger than the stored (compressed) chunk.
As can be seen from the above, the maximum length of both the uncompressed input and the compressed payload of a single chunk is `0xFFFF` (65535 bytes).
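The header layout above can be read with a few lines of code. The following is a minimal sketch, not code from LibLZF or the Java library; the function name `parse_chunk_header` and its return tuple are hypothetical and chosen only for illustration.

```python
import struct


def parse_chunk_header(buf, pos=0):
    """Parse one LZF chunk header starting at buf[pos].

    Returns (is_compressed, chunk_length, original_length, header_size).
    Hypothetical helper: the name and return shape are illustrative only.
    """
    if buf[pos:pos + 2] != b"ZV":
        raise ValueError("missing 'ZV' signature")
    chunk_type = buf[pos + 2]
    # '>H' reads an unsigned 16-bit big-endian integer (bytes #3 and #4).
    (chunk_length,) = struct.unpack_from(">H", buf, pos + 3)
    if chunk_type == 0x00:
        # Uncompressed: stored length and original length are the same.
        return False, chunk_length, chunk_length, 5
    if chunk_type == 0x01:
        # Compressed: bytes #5 and #6 hold the original length.
        (original_length,) = struct.unpack_from(">H", buf, pos + 5)
        return True, chunk_length, original_length, 7
    raise ValueError("reserved chunk type: 0x%02X" % chunk_type)
```

For example, `parse_chunk_header(b"ZV\x00\x00\x03abc")` reports an uncompressed chunk with 3 payload bytes and a 5-byte header.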
For uncompressed chunks, the header is simply followed by `ChunkLength` source bytes; no additional processing is done.
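Because uncompressed chunks need no processing, any input can be wrapped as valid LZF content by splitting it into blocks of at most 65535 bytes and prefixing each with a 5-byte header. The sketch below illustrates this under the layout described above; `uncompressed_chunks` is a hypothetical name, not an API of any LZF library.

```python
import struct

MAX_CHUNK = 0xFFFF  # largest length that fits the uint16 ChunkLength field


def uncompressed_chunks(data):
    """Wrap data as a sequence of uncompressed (type 0x00) LZF chunks."""
    out = bytearray()
    for start in range(0, len(data), MAX_CHUNK):
        block = data[start:start + MAX_CHUNK]
        # Signature 'ZV', type 0x00, big-endian length, then the raw bytes.
        out += b"ZV" + b"\x00" + struct.pack(">H", len(block)) + block
    return bytes(out)
```

Since chunks carry no sequence numbers or separators, the output of two such calls can be concatenated and still forms a valid chunk sequence.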
For compressed chunks, the payload consists of a sequence of two types of segments: literal runs and back references. The first byte of a segment (`TypeByte`) determines the type as follows:
- Values `0x00` - `0x1F` indicate a literal run
- Values `0x20` - `0xFF` indicate a back reference; and further
  - Values `0xE0` - `0xFF` indicate a long back-reference; others a short one
Another way to look at this is that the 3 most-significant bits of `TypeByte` (that is, `TypeByte >> 5`) determine the type, such that:
- 0x00: Literal run
- 0x01 - 0x06: Short back-reference (and type is used as 'run-length', see next section)
- 0x07: Long back-reference
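The classification by the top 3 bits can be written directly as a shift. This is a small illustrative sketch; the function name `segment_kind` and the string labels are assumptions made for the example.

```python
def segment_kind(type_byte):
    """Classify a segment by the 3 most-significant bits of its TypeByte."""
    top = type_byte >> 5  # value in range 0..7
    if top == 0:
        return "literal"
    if top == 7:
        return "long"
    return "short"  # top is 1..6 and doubles as the encoded run-length
```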
Literal runs are simple: the first byte indicates a length between 1 and 32 bytes (that is, the length is `TypeByte + 1`), and `TypeByte` is followed by `length` content bytes, copied as is.
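Decoding a literal run amounts to reading the length and copying that many bytes. A minimal sketch, assuming the payload is a `bytes` object and the output is a `bytearray`; `copy_literal_run` is a hypothetical helper name.

```python
def copy_literal_run(src, pos, out):
    """Copy the literal run whose TypeByte is at src[pos] into out.

    Returns the position of the next segment in src.
    """
    length = src[pos] + 1  # TypeByte 0x00..0x1F maps to 1..32 bytes
    pos += 1
    out.extend(src[pos:pos + length])
    return pos + length
```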
Back-references encode a reference to a sequence of bytes that have already been decoded within this chunk.
They consist of two parts: `run-length`, the number of bytes to copy, and `offset`, which determines how many bytes to go back from the current output position to find the sequence to copy. The distance to go back is between 1 and 8192 bytes; it is stored as a 13-bit `offset` value (0 to 8191). The minimum run-length is 3 bytes, and the maximum is 264.
Note that back-references can and do overlap with the output being written: that is, it is legal to have a distance smaller than the length of the referenced byte sequence. This is useful for repeating patterns: an offset of -1, for example, can be used for basic Run-length Encoding (compressing sequences of the same byte). A decoder has to account for this possibility and may not be able to use bulk copy for handling all back references.
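The overlap case is easiest to handle with a byte-by-byte copy, so that bytes written earlier in the same copy can be read again. The following sketch assumes the output buffer is a `bytearray`; `copy_back_reference` and its `distance` parameter (the positive number of bytes to go back) are hypothetical names.

```python
def copy_back_reference(out, distance, run_length):
    """Append run_length bytes to out, read from `distance` bytes back.

    Copies one byte at a time so that overlapping references
    (distance < run_length) repeat the pattern as intended.
    """
    start = len(out) - distance
    if distance < 1 or start < 0:
        raise ValueError("back-reference points before start of chunk")
    for i in range(run_length):
        out.append(out[start + i])
```

With `out = bytearray(b"a")`, calling `copy_back_reference(out, 1, 5)` repeats the single byte and leaves `out` equal to `b"aaaaaa"`, which is the run-length encoding case described above.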
There are two kinds of back-references: short and long. The difference is that a short reference is encoded using 2 bytes and can have a run-length of 3 to 8 bytes, whereas a long reference is encoded using 3 bytes and can have a run-length of 9 to 264 bytes.
The actual offset used is `-(offset + 1)`, calculated from the current output position (giving effective offsets between -1 and -8192).
Short references are encoded as follows:
#0: `TypeByte`:
# Bits 0-4 (5 LSB): 5 Most-Significant Bits of `offset`
# Bits 5-7 (3 MSB): `run-length - 2`: encodes lengths 3 - 8 with values 1 - 6
-- values 0 and 7 prohibited since those indicate literal runs (0) and long-reference (7)
#1: `Offset`: 8 LSB of 'offset'
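The two bytes of a short reference can be unpacked with shifts and masks. This is an illustrative sketch; `decode_short_reference` is a hypothetical name, and the returned distance is the positive number of bytes to go back (`offset + 1`).

```python
def decode_short_reference(type_byte, offset_byte):
    """Return (run_length, distance) for a 2-byte short back-reference."""
    run_length = (type_byte >> 5) + 2          # field values 1..6 give 3..8
    offset = ((type_byte & 0x1F) << 8) | offset_byte  # 13-bit stored offset
    return run_length, offset + 1              # distance back is 1..8192
```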
Long references are encoded as follows:
#0: `TypeByte`:
# Bits 0-4 (5 LSB): 5 Most-Significant Bits of `offset`
# Bits 5-7 (3 MSB): constant `0x7` (to indicate type as long reference)
#1: `Length`: `run-length - 9` to allow references to sequences of 9 - 264 bytes
#2: `Offset`: 8 LSB of 'offset'
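Putting the segment rules together, the payload of one compressed chunk can be decoded with a single loop. The sketch below is a straightforward reading of the format as described in this document, not the implementation used by LibLZF or the Java library; `decode_payload` is a hypothetical name, and error handling is kept minimal.

```python
def decode_payload(payload, original_length):
    """Decode the payload of one compressed chunk into its original bytes."""
    out = bytearray()
    pos = 0
    while pos < len(payload):
        type_byte = payload[pos]
        pos += 1
        kind = type_byte >> 5
        if kind == 0:
            # Literal run: TypeByte + 1 bytes follow as is.
            length = type_byte + 1
            out += payload[pos:pos + length]
            pos += length
            continue
        if kind == 7:
            # Long reference: extra Length byte stores run-length - 9.
            run_length = payload[pos] + 9
            pos += 1
        else:
            # Short reference: 3-bit field stores run-length - 2.
            run_length = kind + 2
        # 5 MSB of offset come from TypeByte, 8 LSB from the Offset byte.
        distance = (((type_byte & 0x1F) << 8) | payload[pos]) + 1
        pos += 1
        start = len(out) - distance
        if start < 0:
            raise ValueError("back-reference points before start of chunk")
        # Byte-by-byte copy so overlapping references work.
        for i in range(run_length):
            out.append(out[start + i])
    if len(out) != original_length:
        raise ValueError("decoded length does not match OriginalLength")
    return bytes(out)
```

As a small check, the payload `b"\x01ab\xe0\x01\x01"` is a 2-byte literal run `ab` followed by a long reference with run-length 10 and distance 2, so `decode_payload(b"\x01ab\xe0\x01\x01", 12)` yields `b"abababababab"`.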