Skip to content


Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
tree: ef889f719b
Fetching contributors…

Cannot retrieve contributors at this time

159 lines (105 sloc) 5.563 kb
== bzip2 stream format ==
NOTA BENE: bzip2 is a ''bitstream'' format, not a ''bytestream'' format.
Some stream components are not an integral number of bytes, and those
that are may not be aligned on byte coundaries.
This somewhat complicates processing in typical byte-oriented I/O and
array systems.
Multi-bit integers are written big-endian.
BYTE = BIT*8 ; Big-endian
UINT24 = BIT*24 ; Big-endian
UINT32 = BIT*32 ; Big-endian
stream = stream-header block*N stream-trailer
stream-header = header-signature header-blocksize
header-signature = BYTE*3("BZh")
header-blocksize = BYTE(DIGIT) ; 0 - 9
block = block-header block-data
block-header = block-signature block-crc block-random block-origptr
block-signature = BYTE*6(%x31 %x41 %x59 %x26 %x53 %x59)
block-crc = UINT32 ; CRC32 of this block
block-random = BIT(0) ; Always 0 in current versions of bzip2
block-origptr = UINT24 ; Block sort index for reversing BWT
block-data = BYTE*N
stream-trailer = trail-signature trail-crc
trail-signature = BYTE*6(%x17 %x72 %x45 %x38 %x50 %x90)
trail-crc = UINT32 ; Combined CRC32 of whole stream
== Internal notes ==
n = 100000 * blockSize100k;
s->arr1 = BZALLOC( n * sizeof(UInt32) );
s->arr2 = BZALLOC( (n+BZ_N_OVERSHOOT) * sizeof(UInt32) );
s->ftab = BZALLOC( 65537 * sizeof(UInt32) );
s->nblockMAX = 100000 * blockSize100k - 19;
BZ2_blockSort : performs BWT. sorted row headers in arr1, start index in origPtr
-> how many chunks in the block? s->arr1[0..s->nblock - 1]
generateMTFValues : 'move to front' stream transformation
mapping tables
coding tables
block data
== Block size ==
The size of uncompressed blocks is *not* exactly [0-9]00k bytes as one might
naively expect. On my French Wikipedia dump test file I see block sizes
varying from 899603 to 1216072 bytes, not counting the smaller trailing
According to the docs, bzip2 does some RLE compression on raw input in the
hopes of reducing slow sorting cases. It looks like the block chunking comes
*after* the RLE, so the uncompressed portion can be somewhat smaller or larger.
This somewhat breaks my dream of naively chopping up the file and extracting
the chunks from individually compressed streams; while it *should* still work
if you're careful about input size, producing bit-identical results to normal
bzip2 will probably require moving the RLE stage out to the input thread,
as bzip2-smp does. How much library hacking this will take, we'll see.
== CRC ==
Combined CRC seems to simply be build from each block's CRC. This
should be relatively simple to handle.
bzlib.c: s->combinedCRC = 0;
BZ_FINALISE_CRC ( s->blockCRC );
s->combinedCRC = (s->combinedCRC << 1) | (s->combinedCRC >> 31);
s->combinedCRC ^= s->blockCRC;
== Order of operations ==
Something like:
1) RLE on input data
2) Chunk RLE'd data into blocks
3) Perform BWT block sort on block
4) Write block header
5) MTF (Move To Front) encoding
6) Huffman coding
7) Dump compressed bits to output
bzip2-smp hacks the bzip2 library to interrupt things somwhere between steps
2 and 3. Step 3, the Burrows-Wheeler Transform, is the most CPU-intense part
of the task, and is offloaded to multiple background threads. Sorted blocks
then get further processing and are written out for steps 4-7 on yet another
It feels pretty hackish, with saves and restores of various bits of scary
data structures and no obvious protocol by which data could be transferred.
The docs for it say doing it cleaner would require rewriting parts of the
library... well, that may just be necessary. :)
== RLE ==
On decompression, un-RLE and CRC check are done together in
unRLE_obuf_to_output_(FAST|SMALL). This also has the case to handle
randomization (no longer used in current versions of the compressor).
rle-run = BYTE ; followed by a different value
/ 2*BYTE ; two same
/ 3*BYTE ; three same
/ 4*BYTE rle-run-count ; four same bytes, plus the...
rle-run-count = BYTE ; ...length of run minus four
This will turn 4-byte sequences into 5-byte sequences (4x plus a null byte),
but will be the same or smaller for all other sequence lengths.
The worst case for the RLE encoder would be a series of 4-byte character
sequences, resulting in a 20% decrease in capacity per block. This should
make the smallest possible block something like 720,000 with the 900k
In fact, such a file ends up with a 719,988-byte block. That would RLE out
to 899,985 bytes, 15 bytes short of the declared block size.
On compression, RLE and CRC are done in add_pair_to_block / ADD_CHAR_TO_BLOCK
called from copy_input_until_stop. This checks if we've gone up to or beyond
nblockMAX (set to block size - 19 bytes) before each RLE run.
This would have brought us up to 899,980 (blocksize-20) before the last run;
the final run would then bring it up to 899,985, tripping the condition and
cutting off the block.
== Simplifications ==
Many of the library functions feel kind of... funky because they accept
small chunks of data at a time. Since we expect to be dealing with full
blocks, perhaps we can simplify a version that's tailored to the parallelized
Jump to Line
Something went wrong with that request. Please try again.