
Start reading from the middle of a stream #187

Closed
Kronuz opened this issue Feb 25, 2016 · 14 comments


Kronuz commented Feb 25, 2016

If I have a list of compressed blocks written in stream mode, with each block depending on the previous one, can I start decompressing from somewhere in the middle? Say I have a hundred blocks and the data I want is compressed in block 70 of the stream. How do I start decompressing at block 70 (or close to it), instead of starting at the first block and discarding all decompressed bytes until I reach the ones I want?

Contributor

t-mat commented Feb 25, 2016

Hi @Kronuz,

can I start uncompressing from somewhere in the middle?

I think you want to make a block list like this: "I,D,D,D,I,D,D,D,I,D,D,D,...". Here, "I" means an independent block, "D" means a block dependent on the previous one, and all "I" and "D" blocks have the same size.

Unfortunately, the standard LZ4 Frame Format and the lz4frame library don't support such a block layout.
So if you want to implement your own dependency rule, you can use the streaming API (LZ4_stream_t and related functions). examples/blockStreaming_ringBuffer.c may be a good starting point.

But IMHO, if you have enough memory and the order of I and D blocks is constant (e.g. each "I" is followed by three "D"s), the easiest way is to mimic the standard LZ4 frame format with large independent blocks. For example, transform four small 64 KiB blocks "I,D,D,D" into a single large independent 256 KiB block "L".
Obviously you can decompress each independent block "L" directly. And since each "I" block isolates the block dependency, the compression ratios of "I,D,D,D" and "L" are essentially the same.
You may also need an offset list for random access.

Author

Kronuz commented Feb 25, 2016

So it is not possible to start decompressing from a block that has a dependency? I have to have independent blocks every now and then and start decompressing from there?

Contributor

t-mat commented Feb 25, 2016

So it is not possible to start decompressing from a block that has a dependency?

Right, you can't decompress a dependent block on its own. A dependent block always needs the previous block's information (i.e. a non-empty LZ4_stream_t).

I have to have independent blocks every now and then and start decompressing from there?

Yes, you can only start decompression (i.e. start (de)compression with an empty LZ4_stream_t) from an independent block.

Author

Kronuz commented Feb 25, 2016

Understood! Thank you @t-mat. I just thought there might be some other way; I found it difficult to find an example or explanation of this fact.

@Cyan4973
Member

If you want to access blocks randomly, you need to compress them independently.

An advanced way to access independent blocks and improve compression is to use a "shared dictionary". LZ4 does support it, although not in the frame format yet: one needs to work directly at the block level (lz4.h).

The shared dictionary will only be useful if there are some "redundancies" between blocks (if they have the same kind of content). If blocks can have any type of content, no useful dictionary can be found.

Creating the shared dictionary can be a tough challenge. Fortunately, the latest version of zstd offers a generic dictionary builder, via the zstd --train command. The created dictionary file is compatible with lz4.

Author

Kronuz commented Feb 25, 2016

@t-mat, I'm working on a project for searching documents (somewhat similar to Elasticsearch, but written in C++ and using Xapian as the indexing engine). I do believe a shared dictionary could make a difference, but I was wondering: are shared dictionaries created through an LZ4 API? Is there an example somewhere? Also, how does the data structure for shared dictionaries work? For write atomicity, it would preferably have to be append-only.

Author

Kronuz commented Feb 25, 2016

Also, what does "Usually, Dependent memory is previous adjacent contiguous memory up to 64KiBytes. LZ4 will not access further memories." mean? Does it mean 64K of previous adjacent contiguous uncompressed memory?

...and for the shared dictionary, could it change as data is being added? Do block dependencies work by using some sort of dictionary from the previous block, or is the dependency structure something different from a shared dictionary?

Contributor

t-mat commented Mar 1, 2016

@Cyan4973, I've created an experimental LZ4 dictionary (de)compression example (a modified line-by-line example). But I'm not sure whether it's optimal. I have 2 questions:

(1) Does the following command have proper options to make dictionary for LZ4?

zstd --train INPUT_FILE -o OUTPUT_DICT --maxdict 65536

Since zstd's dictionary file has a header structure, it seems suspicious to use it as-is. I think I overlooked something.

(2) Does the following snippet have good enough structure to achieve optimal compression speed?

LZ4_stream_t* lz4s = LZ4_createStream();
const char* dictPtr = DICTIONARY_DATA;
const int dictBytes = DICTIONARY_DATA_SIZE_IN_BYTES; // 65536
while(...) {
    LZ4_loadDict(lz4s, dictPtr, dictBytes);
    int c = LZ4_compress_limitedOutput_continue(
                lz4s, src, dst, srcBytes, dstBytes);
    save(dst, c);
}

Maybe this question is related to PR #188.


@Kronuz,

are shared dictionaries created through an lz4 api?

No. Currently LZ4 doesn't have any dictionary-creation API. As @Cyan4973 mentioned, you can create a dictionary with the zstd CLI or zdict.h.

Is there an example somewhere?
Also, what is the way that data structure for shared dictionaries work? As for writing atomicity it'd have to be append only preferably.

I've created example code, but please see my questions above.
Basically, you can use a shared dictionary for each independent block. It takes some time and memory, but it improves the compression ratio.

Also, what does "Usually, Dependent memory is previous adjacent contiguous memory up to 64KiBytes. LZ4 will not access further memories." mean?
Does it mean 64K of previous adjacent contiguous uncompressed memory?

Yes, in the compression procedure it means up to 64 KiB of adjacent contiguous uncompressed (source, raw data) memory.
In the decompression procedure it likewise means up to 64 KiB of adjacent contiguous decompressed (destination, raw data) memory.

...and for the shared dictionary, could it be changing as data is being added?

I'm not sure about your purpose. But since a shared dictionary is shared with past (previous) independent blocks, changing the shared dictionary requires recompressing all previous independent blocks that share it.

Do block dependencies work by using some sort of dictionary from the previous block or is it some different dependency structure other that a shared dictionary?

You can think of the shared dictionary as a kind of "previous block". If you use a shared dictionary, the block list becomes "(S),I,D,D,D" (S: shared dictionary, I: independent block, D: dependent block).

Member

Cyan4973 commented Mar 1, 2016

A very nice initiative @t-mat !

(1) Does the following command have proper options to make dictionary for LZ4?
zstd --train INPUT_FILE -o OUTPUT_DICT --maxdict 65536

Yes, it is correct.
There is indeed a header, but it sits at the very beginning of the file, in an area which is barely useful to the compressor.
If you want to be on the safe side, you can add 200 bytes to --maxdict so that the header information will mostly stand there.

(2) Does the following snippet have good enough structure to achieve optimal compression speed?

Yes, it's a nice and effective example code.

The only area where I would recommend a change is the way samples are provided to zstd --train.
In https://gist.github.com/t-mat/3f6be82a408b60c00312, it's a single file containing 1K lines.
It works, but the dictionary builder will likely prefer 1000 files containing 1 line each (which can be created from the single file using split).
The difference between the two methods is going to be more pronounced for zstd than for lz4, because lz4 doesn't use the dictionary header, so I suspect the improvement for lz4 will be small. It's more a recommendation to follow a method which gives better results for zstd.

Regards

Author

Kronuz commented Mar 1, 2016

@Cyan4973, and how would you train the same dictionary with 1000 different files? Multiple calls to train passing the same dictionary as output and different input buffers? Wouldn't that overwrite the dictionary file with the training of the last data file?

Contributor

t-mat commented Mar 1, 2016

It works, but the dictionary builder will likely prefer 1000 files containing 1 line each (which can be created from the single file using split).

Right, I've fixed it. I forgot about offset.

how would you train the same dictionary with 1000 different files?

zstd --train accepts multiple source files (splittemp-*):

split --lines=1 --suffix-length=10 testfile splittemp-testfile
./zstd/programs/zstd --train splittemp-* -o testfile.dict --maxdict 65536
rm splittemp-*


Behrouz-m commented Aug 20, 2016

zstd --train splittemp-* -o testfile.dict

This does not work on Windows! zstd does not accept wildcards on Windows.

Member

Cyan4973 commented Aug 20, 2016

Indeed, by default, the Windows shell does not perform wildcard expansion.

In the Windows world, it is up to the program to take care of this.
That's why it's necessary to build the program with a special library.

If you build zstd using the Visual Studio project provided in the projects directory, it will include the specific "shell expansion" library, and wildcards should work as intended.

Member

Cyan4973 commented Nov 4, 2016

Regarding @ray-pixar's question:
another way to solve the wildcard-expansion problem on Windows is to use recursive mode, -r.

Say, for example, that all splittemp_* files are in the splittemp_dir directory;
the command then becomes:
zstd -r --train splittemp_dir -o testfile.dict

My understanding of this discussion is that the most important action worth doing is integration of @t-mat 's dictionary example into examples directory.
