It's a Java library based on the zran.c sample from zlib.
You can preprocess a large gzip archive, producing an "index" that can be used for random read access.
You can trade off index size against access speed.
A typical scenario: you've got a very large, compressible file that needs random access: some kind of database, DNA data, an image, video, an XML document, etc.
You scan through it and remember the offsets of whatever is important to you.
Then you compress the file.
Now you can use those offsets to read from the compressed file, as sketched below.
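For example, the offset-recording and compression steps need nothing from this library; a minimal sketch using only standard Java, assuming '\n'-terminated, single-byte-encoded lines and a made-up "IMPORTANT" marker for the interesting records:

import java.io.*;
import java.util.*;
import java.util.zip.GZIPOutputStream;

List<Long> offsets = new ArrayList<>();
long pos = 0;
try (BufferedReader r = new BufferedReader(new FileReader("data.txt"))) {
    String line;
    while ((line = r.readLine()) != null) {
        if (line.startsWith("IMPORTANT"))
            offsets.add(pos);       // remember where this record starts
        pos += line.length() + 1;   // +1 for the '\n' terminator;
                                    // assumes a single-byte encoding
    }
}
// Compress; the saved offsets refer to *uncompressed* positions.
try (InputStream in = new FileInputStream("data.txt");
     OutputStream out = new GZIPOutputStream(new FileOutputStream("data.txt.gz"))) {
    in.transferTo(out);
}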
You give it a SeekableInputStream over the compressed data; it gives you a SeekableInputStream over the decompressed data.
// Wrap the compressed data in a seekable stream.
SeekableInputStream sis = new ByteArraySeekableInputStream(buf);
// Build the index. The second argument is the span between checkpoints
// (here 1 MiB): a smaller span means faster seeks but a larger index.
var index = RandomAccessGZip.index(sis, 1048576);
...
// Attach the compressed data source to the index.
index.open(sis);
...
// The index now acts as a SeekableInputStream over the decompressed data.
index.seek(offset);
byte[] dest = new byte[100];
int n = index.read(dest, 0, dest.length);
You can monitor indexing progress and cancel indexing.
The index is serializable.
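For example, assuming the index implements java.io.Serializable (plain Java serialization; exception handling omitted), saving it lets later runs skip the indexing pass:

import java.io.*;

// Save the index next to the archive.
try (ObjectOutputStream out =
         new ObjectOutputStream(new FileOutputStream("data.gz.idx"))) {
    out.writeObject(index);
}

// Later, or in another process: reload it.
try (ObjectInputStream in =
         new ObjectInputStream(new FileInputStream("data.gz.idx"))) {
    Object index2 = in.readObject(); // cast to the library's index type,
                                     // then re-attach the source via open(...)
}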
You can provide as input (gzip source) a byte[], a ByteBuffer or a RandomAccessFile.
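For example, the byte[] route for a file on disk needs only standard Java plus the ByteArraySeekableInputStream shown above (the wrapper types for the other two sources aren't shown here):

import java.nio.file.*;

// Load the whole archive into memory and wrap it.
byte[] buf = Files.readAllBytes(Paths.get("data.gz"));
SeekableInputStream sis = new ByteArraySeekableInputStream(buf);

For archives too big to hold in memory, the RandomAccessFile source is the natural choice.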
zran just snapshots the decoder's internal state periodically.
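Concretely, each snapshot in the zran.c design is an "access point" recording the decoder's position in both streams plus the last 32 KiB of output, which deflate needs as a dictionary to resume; a conceptual sketch (not this library's actual classes):

// Conceptual only; the field set follows zran.c's access-point struct.
class Checkpoint {
    long compressedOffset;    // byte position in the .gz stream...
    int bitOffset;            // ...plus 0-7 bits, since deflate blocks
                              // need not start on a byte boundary
    long uncompressedOffset;  // corresponding position in the output
    byte[] window;            // last 32 KiB of output: the dictionary
                              // needed to restart inflation here
}
// seek(X): find the last checkpoint at or before X, prime the inflater
// with its window, then inflate forward, discarding until X is reached.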
I haven't done measurements yet, but essentially the seek method is O(span): a seek has to decompress from the nearest checkpoint up to the target offset, on average half a span of data, so the sparser your index, the smaller it is and the slower seeks become. For example, with a 1 MiB span a seek decompresses roughly 512 KiB on average. After a seek, you read at the speed of zlib (modulo a couple of extra memory copies, maybe).