Start reading from the middle of a stream #187
Comments
Hi @Kronuz,
I think you want to make a block list like this: "I,D,D,D,I,D,D,D,I,D,D,D,...". Here, "I" means an independent block and "D" means a block that depends on the previous one; both "I" and "D" have the same block size. Unfortunately, the standard LZ4 Frame Format and the lz4frame library don't support such a block layout. But IMHO, if you have enough memory and the order of I and D is constant (e.g. each "I" has three trailing "D"s), the easiest way is to mimic the standard LZ4 frame format with large independent blocks. For example, merge four small 64 KiB blocks "I,D,D,D" into a single large independent 256 KiB block "L".
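To make the merged layout concrete, here is a minimal sketch of the index arithmetic, assuming the fixed sizes from the example (64 KiB small blocks, four per group). The names `locate` and `small_block_to_large` are hypothetical helpers, not part of LZ4, and Python is used only for illustration.

```python
KIB = 1024
SMALL_BLOCK = 64 * KIB             # size of each original "I" or "D" block
GROUP = 4                          # one "I" followed by three "D"s
LARGE_BLOCK = SMALL_BLOCK * GROUP  # the merged independent block "L" (256 KiB)


def locate(offset):
    """Map an uncompressed byte offset to (large block index, offset inside it)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, LARGE_BLOCK)


def small_block_to_large(small_index):
    """Map an original small block index to (large block index, position in group)."""
    return divmod(small_index, GROUP)
```

With this layout, reading data from small block 70 means decompressing large block 17 in full and then skipping to the third 64 KiB slice inside it.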
So it is not possible to start decompressing from a block that has a dependency? I have to have independent blocks every now and then and start decompressing from there?
Yes, you can't decompress a block that has a dependency on its own. A dependent block always needs the previous block's information (i.e. non-empty
Yes, you can only start decompression (i.e. start (de)compression with empty
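The restart rule can be tried out with a small sketch. It uses Python's standard `zlib` module as a stand-in, not LZ4 itself: in a raw deflate stream, `Z_FULL_FLUSH` resets the compressor's history, so the block after it behaves like an "I" block, while `Z_SYNC_FLUSH` keeps the history, so the next block behaves like a "D" block. The group size of 4 and the helper names `compress_blocks` and `read_block` are assumptions made for illustration, and blocks are assumed to be non-empty.

```python
import zlib

GROUP = 4  # assumed layout: one independent block, then three dependent ones


def compress_blocks(blocks, group=GROUP):
    """Compress non-empty blocks into one raw deflate stream.

    Returns (stream, offsets), where block k occupies
    stream[offsets[k]:offsets[k + 1]].
    """
    comp = zlib.compressobj(wbits=-15)  # raw deflate: no header to skip on restart
    stream = bytearray()
    offsets = [0]
    for i, block in enumerate(blocks):
        stream += comp.compress(block)
        # A full flush drops the history, which makes the NEXT block independent.
        next_is_independent = (i + 1) % group == 0
        mode = zlib.Z_FULL_FLUSH if next_is_independent else zlib.Z_SYNC_FLUSH
        stream += comp.flush(mode)
        offsets.append(len(stream))
    stream += comp.flush()  # terminate the stream
    return bytes(stream), offsets


def read_block(stream, offsets, wanted, group=GROUP):
    """Decompress block `wanted`, starting at the nearest preceding independent block."""
    start = wanted - wanted % group
    decomp = zlib.decompressobj(wbits=-15)
    data = b""
    for k in range(start, wanted + 1):
        # Earlier blocks of the group are decoded only to rebuild the history.
        data = decomp.decompress(stream[offsets[k]:offsets[k + 1]])
    return data
```

For block 70 of a hundred, `read_block` starts at block 68 and decodes three blocks instead of seventy-one. The same idea applies to LZ4 streaming, but the flush modes shown here are zlib's, not LZ4's.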
Understood! Thank you @t-mat. I just thought there could be some other way. I found it difficult to find an example/explanation about this fact.
If you want to access blocks randomly, you need to compress them independently. An advanced way to access independent blocks and still improve compression is to use a "shared dictionary". LZ4 does support it, although not in the frame format yet: one needs to work directly at block level ( The shared dictionary will only be useful if there are some "redundancies" between blocks (if they have the same kind of content). If blocks can have any type of content, no useful dictionary can be found. Creating the shared dictionary can be a tough challenge. Fortunately, the latest version of zstd offers a generic dictionary builder, using
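As an illustration of the shared-dictionary idea, here is a minimal sketch that again uses Python's standard `zlib` module as a stand-in for LZ4, through its `zdict` preset-dictionary parameter. In C, the block-level LZ4 counterparts would be calls such as `LZ4_loadDict` and `LZ4_decompress_safe_usingDict`; that mapping is only an analogy. The dictionary below is hand-written for the example rather than trained, and `compress_block` and `decompress_block` are hypothetical helper names.

```python
import zlib

# Hand-written dictionary holding text the records are expected to share.
# A real dictionary would normally be trained from sample data instead.
SHARED_DICT = (
    b'{"id": , "title": "document number ", "language": "english", '
    b'"tags": ["search", "index", "xapian"]}'
)


def compress_block(block, shared_dict=SHARED_DICT):
    """Compress one block independently, primed with the shared dictionary."""
    comp = zlib.compressobj(zdict=shared_dict)
    return comp.compress(block) + comp.flush()


def decompress_block(payload, shared_dict=SHARED_DICT):
    """Decompress any single block; only the shared dictionary is required."""
    decomp = zlib.decompressobj(zdict=shared_dict)
    return decomp.decompress(payload) + decomp.flush()


def make_record(i):
    """Build a sample record whose structure matches the dictionary."""
    return (
        b'{"id": %d, "title": "document number %d", "language": "english", '
        b'"tags": ["search", "index", "xapian"]}' % (i, i)
    )
```

Each record compresses to a self-contained payload, so record 70 can be decoded without touching records 0 to 69, while the dictionary still supplies the redundancy that a dependent block would otherwise get from its predecessor.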
@t-mat, I'm working on a project for searching documents (somewhat similar to Elasticsearch, but written in C++ and using Xapian as the indexing engine). I do believe a shared dictionary could make a difference, but I was wondering: are shared dictionaries created through an LZ4 API? Is there an example somewhere? Also, how does the data structure for shared dictionaries work? For write atomicity, it would preferably have to be append-only.
Also, what does "Usually, Dependent memory is previous adjacent contiguous memory up to 64KiBytes. LZ4 will not access further memories." mean? Does it mean 64K of previous adjacent contiguous uncompressed memory? ...and for the shared dictionary, could it change as data is being added? Do block dependencies work by using some sort of dictionary from the previous block, or is it a different dependency structure than a shared dictionary?
@Cyan4973, I've created an experimental LZ4 dictionary (de)compression example (a modified line-by-line example), but I'm not sure whether it's optimal. I have 2 questions: (1) Does the following command have the proper options to make a dictionary for LZ4?
Since zstd's dictionary file has a header structure, it seems suspicious. I think I overlooked something. (2) Does the following snippet have a good enough structure to achieve optimal compression speed?
Maybe this question is related to PR #188.
No. Currently LZ4 doesn't have any dictionary creation API. As @Cyan4973 mentioned, you can create a dictionary with zstdcli or zdict.h.
I've created example code, but please refer to my questions above.
Yes, in the compression procedure, it means up to 64 KiB of adjacent, contiguous, uncompressed (source, raw data) memory.
I'm not sure about your purpose. But since a shared dictionary is shared with past (previous) independent blocks, changing it requires recompressing every previous independent block that uses the same shared dictionary.
You can think of a shared dictionary as a kind of "previous block". If you use a shared dictionary, the block list becomes "(S),I,D,D,D" (S: shared dictionary, I: independent block, D: dependent block).
A very nice initiative @t-mat !
Yes, it is correct.
Yes, it's nice and effective example code. The only area where I would recommend a change is in the way samples are provided to
Regards
@Cyan4973, and how would you train the same dictionary with 1000 different files? Multiple calls to train passing the same dictionary as output and different input buffers? Wouldn't that overwrite the dictionary file with the training of the last data file?
Right, I've fixed it. I forgot about offset.
This does not work on Windows! zstd does not accept wildcards on Windows.
Indeed, by default, the Windows shell does not handle wildcard expansion. In the Windows world, it is up to the program to take care of this. If you build
Regarding @ray-pixar's question: say, for example, that all My understanding of this discussion is that the most important action worth doing is integrating @t-mat's dictionary example into
If one has a list of compressed blocks written in stream mode, with each block depending on the previous one, can one start decompressing from somewhere in the middle? For example, say I have a hundred blocks and the data I want is compressed in block 70 of the stream. How do I start decompressing from block 70 (or close to it) instead of starting at the first block and discarding all decompressed bytes until I reach the block I want?