
Start reading from the middle of a stream #187

Closed
Kronuz opened this issue Feb 25, 2016 · 14 comments


Kronuz commented Feb 25, 2016

If I have a list of compressed blocks written in stream mode, with each block depending on the previous one, can I start decompressing from somewhere in the middle? Say I have a hundred blocks and the data I want is compressed in block 70 of the stream. How do I start decompressing at block 70 (or close to it), instead of starting at the first block and discarding all decompressed bytes until I reach the ones I want?

Contributor

t-mat commented Feb 25, 2016

Hi @Kronuz,

can I start uncompressing from somewhere in the middle?

I think you want to make a block list like this: "I,D,D,D,I,D,D,D,I,D,D,D,...". Here, "I" means an independent block, "D" means a block dependent on the previous one, and all "I" and "D" blocks have the same size.

Unfortunately, the standard LZ4 Frame Format and the lz4frame library don't support such a block layout.
So if you want to implement your own dependency rule, you can use the streaming API (LZ4_stream_t and related functions). examples/blockStreaming_ringBuffer.c may be a good starting point.

But IMHO, if you have enough memory and the order of I and D blocks is constant (e.g. each "I" is followed by three "D"s), the easiest way is to mimic the standard LZ4 frame format with large independent blocks. For example, transform four small 64 KiB blocks "I,D,D,D" into a single large independent 256 KiB block "L".
Obviously you can decompress each independent block "L" directly. And since each "I" block isolates the block dependency, the compression ratios of "I,D,D,D" and "L" are essentially the same.
You may also need an offset list for random access.

Author

Kronuz commented Feb 25, 2016

So it is not possible to start decompressing from a block that has a dependency? I have to have independent blocks every now and then and start decompressing from there?

Contributor

t-mat commented Feb 25, 2016

So it is not possible to start decompressing from a block that has a dependency?

Right, you can't decompress a dependent block on its own. A dependent block always needs the previous block's information (i.e. a non-empty LZ4_stream_t).

I have to have independent blocks every now and then and start decompressing from there?

Yes, you can only start decompression (i.e. start (de)compression with an empty LZ4_stream_t) from an independent block.

Author

Kronuz commented Feb 25, 2016

Understood! Thank you @t-mat. I just thought there might be some other way; I found it difficult to find an example or explanation of this fact.

@Cyan4973
Member

If you want to access blocks randomly, you need to compress them independently.

An advanced way to access independent blocks and improve compression is to use a "shared dictionary". LZ4 does support it, although not in the frame format yet: one needs to work directly at the block level (lz4.h).

The shared dictionary will only be useful if there are some "redundancies" between blocks (if they have the same kind of content). If blocks can have any type of content, no useful dictionary can be found.

Creating the shared dictionary can be a tough challenge. Fortunately, the latest version of zstd offers a generic dictionary builder, via the zstd --train command. The created dictionary file is compatible with lz4.

Author

Kronuz commented Feb 25, 2016

@t-mat, I'm working on a project for searching documents (somewhat similar to Elasticsearch, but written in C++ and using Xapian as the indexing engine). I do believe a shared dictionary could make a difference, but I was wondering: are shared dictionaries created through an LZ4 API? Is there an example somewhere? Also, how does the data structure for shared dictionaries work? For write atomicity, it would preferably have to be append-only.

Author

Kronuz commented Feb 25, 2016

Also, what does "Usually, Dependent memory is previous adjacent contiguous memory up to 64KiBytes. LZ4 will not access further memories." mean? Does it mean 64K of previous adjacent contiguous uncompressed memory?

...and for the shared dictionary, could it change as data is being added? Do block dependencies work by using some sort of dictionary from the previous block, or is the dependency structure something different from a shared dictionary?

Contributor

t-mat commented Mar 1, 2016

@Cyan4973, I've created an experimental LZ4 dictionary (de)compression example (a modified line-by-line example). But I'm not sure whether it's optimal. I have 2 questions:

(1) Does the following command have proper options to make dictionary for LZ4?

zstd --train INPUT_FILE -o OUTPUT_DICT --maxdict 65536

Since zstd's dictionary file has a header structure, it seems suspicious to use it as-is. I think I overlooked something.

(2) Does the following snippet have good enough structure to achieve optimal compression speed?

LZ4_stream_t* lz4s = LZ4_createStream();
const char* dictPtr = DICTIONARY_DATA;
const int dictBytes = DICTIONARY_DATA_SIZE_IN_BYTES; // 65536
while(...) {
    LZ4_loadDict(lz4s, dictPtr, dictBytes);
    int c = LZ4_compress_limitedOutput_continue(
                lz4s, src, dst, srcBytes, dstBytes);
    save(dst, c);
}

Maybe this question is related to PR #188.


@Kronuz,

are shared dictionaries created through an lz4 api?

No. Currently LZ4 doesn't have any dictionary-creation API. As @Cyan4973 mentioned, you can create a dictionary with the zstd CLI or zdict.h.

Is there an example somewhere?
Also, what is the way that data structure for shared dictionaries work? As for writing atomicity it'd have to be append only preferably.

I've created example code, but please see my questions above.
Basically, you can use a shared dictionary for each independent block. It takes some time and memory, but it improves the compression ratio.

Also, what does "Usually, Dependent memory is previous adjacent contiguous memory up to 64KiBytes. LZ4 will not access further memories." mean?
Does it mean 64K of previous adjacent contiguous uncompressed memory?

Yes, in the compression procedure it means up to 64 KiB of adjacent contiguous uncompressed (source, raw data) memory.
In the decompression procedure it likewise means up to 64 KiB of adjacent contiguous decompressed (destination, raw data) memory.

...and for the shared dictionary, could it be changing as data is being added?

I'm not sure about your purpose. But since a shared dictionary is shared with past (previous) independent blocks, changing the shared dictionary requires recompressing all previous independent blocks that share it.

Do block dependencies work by using some sort of dictionary from the previous block or is it some different dependency structure other that a shared dictionary?

You can think of the shared dictionary as a kind of "previous block". If you use a shared dictionary, the block list becomes "(S),I,D,D,D" (S: shared dictionary, I: independent block, D: dependent block).

Member

Cyan4973 commented Mar 1, 2016

A very nice initiative @t-mat !

(1) Does the following command have proper options to make dictionary for LZ4?
zstd --train INPUT_FILE -o OUTPUT_DICT --maxdict 65536

Yes, it is correct.
There is indeed a header, but it sits at the very beginning of the file, in an area which is barely useful to the compressor.
If you want to be on the safe side, you can add 200 bytes to --maxdict so that the header information will mostly stand there.

(2) Does the following snippet have good enough structure to achieve optimal compression speed?

Yes, it's a nice and effective example code.

The only area where I would recommend a change is the way samples are provided to zstd --train.
In https://gist.github.com/t-mat/3f6be82a408b60c00312, it's a single file containing 1K lines.
It works, but the dictionary builder will likely prefer 1000 files containing 1 line each (which can be created from the single file using split).
The difference between the two methods is going to be more pronounced for zstd than for lz4, because lz4 doesn't use the dictionary header, so I suspect the improvement for lz4 will be small. It's more a recommendation to follow a method which gives better results for zstd.

Regards

Author

Kronuz commented Mar 1, 2016

@Cyan4973, and how would you train the same dictionary with 1000 different files? Multiple calls to train passing the same dictionary as output and different input buffers? Wouldn't that overwrite the dictionary file with the training of the last data file?

Contributor

t-mat commented Mar 1, 2016

It works, but the dictionary builder will likely prefer 1000 files containing 1 line each (which can be created from the single file using split).

Right, I've fixed it. I forgot about offset.

how would you train the same dictionary with 1000 different files?

zstd --train accepts multiple source files (splittemp-*):

split --lines=1 --suffix-length=10 testfile splittemp-testfile
./zstd/programs/zstd --train splittemp-* -o testfile.dict --maxdict 65536
rm splittemp-*


Behrouz-m commented Aug 20, 2016

zstd --train splittemp-* -o testfile.dict

This does not work on Windows! zstd does not accept wildcards on Windows.

Member

Cyan4973 commented Aug 20, 2016

Indeed, by default, the Windows shell does not perform wildcard expansion.

In the Windows world, it is up to the program to take care of this.
That's why it's necessary to build the program with a special library.

If you build zstd using the Visual Studio project provided in the projects directory, it will include the specific "shell expansion" library, and wildcards should work as intended.

Member

Cyan4973 commented Nov 4, 2016

Regarding @ray-pixar's question:
another way to solve the wildcard-expansion problem on Windows is to use recursive mode, -r.

Say, for example, that all splittemp_* files are in the splittemp_dir directory;
the command then becomes:
zstd -r --train splittemp_dir -o testfile.dict

My understanding of this discussion is that the most important action worth doing is integration of @t-mat 's dictionary example into examples directory.
