frame-api: add function to insert uncompressed data #1094

Merged
merged 11 commits into from Jul 5, 2022

Contributor

@alexmohr alexmohr commented Jun 9, 2022

The new function uncompressed_update allows inserting blocks without
compression into the lz4 stream.
The usage is documented in the frameCompress example.

This could be a solution for #814

Alexander Mohr, alexander.m.mohr@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH, imprint

Signed-off-by: Alexander Mohr alexander.m.mohr@mercedes-benz.com

@alexmohr alexmohr force-pushed the add-uncompressed-api branch 2 times, most recently from 9e24dac to c2e0230 Compare June 9, 2022 14:18
@t-mat
Contributor

t-mat commented Jun 9, 2022

Hi, @alexmohr
If you have difficulty passing our compatibility test, please run make c_standards in your local terminal.
It checks compatibility with C90, C99 and C11.

new method `uncompressed_update` allows to insert blocks without
compression into the lz4 stream.
The usage is documented in the frameCompress example

Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
@alexmohr
Contributor Author

alexmohr commented Jun 9, 2022

@t-mat Thanks, I think the compatibility tests are working now. I also had a segfault because I forgot to add a null check if no uncompressed file is passed.

lib/lz4.h Outdated
@@ -346,6 +346,8 @@ LZ4LIB_API int LZ4_loadDict (LZ4_stream_t* streamPtr, const char* dictionary, in
*/
LZ4LIB_API int LZ4_compress_fast_continue (LZ4_stream_t* streamPtr, const char* src, char* dst, int srcSize, int dstCapacity, int acceleration);

LZ4LIB_API int LZ4_DictSize (LZ4_stream_t* LZ4_dict, int dictSize);
Member

A few conventions :

  • function names start in lowercase, excluding the prefix
  • new functions shall be documented. What does it do ? Set a new dictSize ? Get a current dictSize ? What are the limitations ? What is the parameter for ? What happens in case of error ?
  • Generally, function name starts with a verb/action, to better qualify the effect, for example LZ4_setDictSize() or LZ4_reduceDictSize().
  • New symbols do not start their life directly in "stable" area. They have to spend some time in "staging" area below, to prove their worth and collect user feedback. As a consequence, the qualifier changes to LZ4LIB_STATIC_API.

Contributor Author

I think I solved your comments. Also added a fuzzing test to make sure the changes are working properly

Member

Also added a fuzzing test to make sure the changes are working properly

Great !

lib/lz4hc.h Outdated
@@ -173,6 +173,8 @@ LZ4LIB_API int LZ4_compress_HC_continue_destSize(LZ4_streamHC_t* LZ4_streamHCPtr
const char* src, char* dst,
int* srcSizePtr, int targetDstSize);

LZ4LIB_API int LZ4_DictHCSize(LZ4_streamHC_t* LZ4_streamHCPtr, int dictSize);
Member

@Cyan4973 Cyan4973 Jun 9, 2022

same comment regarding new function symbol name

lib/lz4frame.h Outdated
@@ -160,6 +160,11 @@ typedef enum {
LZ4F_OBSOLETE_ENUM(skippableFrame)
} LZ4F_frameType_t;

typedef enum {
Member

Is this enum used / manipulated by the user ?
If not, it doesn't need to be part of the public API,
and can remain private inside lz4frame.c.

Contributor Author

Moved it into lz4frame.c as it's not supposed to be used by the user.

This commit fixes the review findings

Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
This commit fixes the review findings

Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
lib/lz4.h Outdated
@@ -509,6 +509,17 @@ LZ4LIB_STATIC_API int LZ4_compress_fast_extState_fastReset (void* state, const c
*/
LZ4LIB_STATIC_API void LZ4_attach_dictionary(LZ4_stream_t* workingStream, const LZ4_stream_t* dictionaryStream);

/*! LZ4_getDictSize():
Member

@Cyan4973 Cyan4973 Jun 10, 2022

Thanks for adding this documentation. It makes the intention clearer.
Yet, I had a hard time connecting the function's name with its intended objective:

This can be used for adding data without compression to the LZ4 archive.
If linked blocked mode is used the memory of the dictionary is kept free.

I suspect that's because the documentation blends multiple layers of responsibilities in this paragraph.

At this place in the API, LZ4_getDictSize() seems to be just about knowing the current dictionary size of the active LZ4_stream_t* state. And it's likely implied that this is not to be used in a concurrent access scenario.

That this function is then employed in the context of LZ4Frame for a specific mode adding uncompressed data can be interesting information, but it does not define what this function is doing. The size information it provides could be employed for any other usage, so it matters that it's cleanly defined.

This leads me to a few simple questions :

  • what is @dictSize argument for ?
    All it does is cap the reported dictSize, without changing anything to underlying situation ?
    If the point of this function is to return the dictionary size, maybe it should do just that ?
    And if there is a reason to cap the value at the calling site, maybe this should be done at the calling site ?
  • Getters generally do not mutate the state they are looking into. Assuming this is the case here too, the state could be const LZ4_stream_t* instead, which makes it clear that this function has no side effect.
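
The two suggestions in this list (a getter that only reports, and capping done by the caller) could look roughly like the sketch below. It is an illustration only, not the lz4 implementation: `demo_stream_t`, `demo_getDictSize` and `demo_cappedDictSize` are hypothetical names, and the 64 KB limit is the LZ4 window size mentioned elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DICT_SIZE (64 * 1024)  /* LZ4 window size: 64 KB */

/* Hypothetical stand-in for a streaming state; NOT the real LZ4_stream_t. */
typedef struct {
    size_t dictSize;   /* bytes of history currently referenced */
} demo_stream_t;

/* Pure getter: the const qualifier documents that the state is not mutated. */
static size_t demo_getDictSize(const demo_stream_t* s)
{
    assert(s != NULL);
    return s->dictSize;
}

/* Capping happens at the calling site, where the limit is actually needed. */
static size_t demo_cappedDictSize(const demo_stream_t* s, size_t requested)
{
    size_t size = demo_getDictSize(s);
    if (size > DEMO_MAX_DICT_SIZE) size = DEMO_MAX_DICT_SIZE;
    if (size > requested) size = requested;
    return size;
}
```

With this split, the getter stays usable for any other purpose, while a caller such as a save-dictionary routine applies whatever limit it needs.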

Contributor Author

This code just has been moved out of the LZ4_saveDict function (where it's also used now).

I passed @dictSize to be consistent with the previous implementation. I'm fine with removing the dictSize parameter and capping it to 64 KB locally in this function. But then we would either have to change the signature of LZ4_saveDict to remove the dictSize parameter, or restore the old code of LZ4_saveDict to calculate the dict size there, which would lead to duplicated code.

As for your second point: I changed the parameter as well as dict to const to make clear that these are not modified.

As I already wrote in the other thread, all of this was only added so that the dictionary is not modified when adding uncompressed data.
If you think modifying the dictionary is okay when adding uncompressed data, I'd remove the if from https://github.com/lz4/lz4/pull/1094/files#diff-16e71ed5519d7ce479c3a3c3158b3e5b121fd300b78497bb477a6695b6d08b50R969 and restore the old way the dict was calculated here.

If we keep the get_dictSize function, I'll update the documentation again to make the intent of this function clearer.

lib/lz4frame.c Outdated
int const realDictSize = LZ4F_localSaveDict(cctxPtr);
assert(0 <= realDictSize && realDictSize <= 64 KB);
cctxPtr->tmpIn = cctxPtr->tmpBuff + realDictSize;
/* only keep the space of the dictionary, so dict data is kept for the next compressedUpdate
Member

This portion of the code confuses me.
What is the objective ?

Contributor Author

@alexmohr alexmohr Jun 11, 2022

The idea was not to write the dictionary if an uncompressed block is written. That's why I've added the get dict size functions. They are used to keep the space of the dictionary free without putting any new data in.
The alternative would be to remove this and always update the dictionary even if we are writing an uncompressed block.
It would make the dictionary a bit worse but probably simplify this.

Member

I'm concerned that this might not be conformant to the frame specification (though I'm unsure if I do understand the details).

Let's quickly state that independent blocks are unaffected, this part is clear.

For linked blocks though, it's specified that each block uses previous block(s) as a dictionary.

If this flag is set to “0”, each block depends on previous ones (up to LZ4 window size, which is 64 KB). In such case, it’s necessary to decode all blocks in sequence.

Note that each block depends on previous ones, not on previous compressed blocks. This means that, if a block is uncompressed, it's still part of the dictionary for the following block.
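
To make the point about linked blocks concrete, here is a minimal sketch of a 64 KB history window that is fed with the decoded content of every block, whether that block was stored compressed or uncompressed. The names `demo_history_t` and `demo_historyAppend` are hypothetical and this is not how lz4 manages its internal buffers; it only illustrates the rule quoted from the frame specification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_WINDOW (64 * 1024)   /* LZ4 window size: 64 KB */

/* Hypothetical sliding history; holds at most the last 64 KB of content. */
typedef struct {
    unsigned char data[DEMO_WINDOW];
    size_t size;
} demo_history_t;

/* Append the decoded content of a block. How the block was stored in the
 * frame (compressed or uncompressed) is deliberately not a parameter. */
static void demo_historyAppend(demo_history_t* h, const void* block, size_t blockSize)
{
    const unsigned char* src = (const unsigned char*)block;
    assert(h != NULL);
    if (blockSize == 0) return;
    assert(block != NULL);
    if (blockSize >= DEMO_WINDOW) {
        /* The block alone fills the window: keep only its last 64 KB. */
        memcpy(h->data, src + (blockSize - DEMO_WINDOW), DEMO_WINDOW);
        h->size = DEMO_WINDOW;
        return;
    }
    if (h->size + blockSize > DEMO_WINDOW) {
        /* Drop the oldest bytes to make room for the new block. */
        size_t const drop = h->size + blockSize - DEMO_WINDOW;
        memmove(h->data, h->data + drop, h->size - drop);
        h->size -= drop;
    }
    memcpy(h->data + h->size, src, blockSize);
    h->size += blockSize;
}
```

A decoder working on linked blocks would call something like this after every block, which is why an uncompressed block cannot simply be left out of the history on the compression side.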

I'm not sure how this plays out here.

Contributor Author

That's why we keep the dictionary but without modifications. The uncompressed block still contains the dictionary but it's not updated with new data.

lib/lz4frame.c#975

    realDictSize = LZ4F_localDictSize(cctxPtr);
}
assert(0 <= realDictSize && realDictSize <= 64 KB);
cctxPtr->tmpIn = cctxPtr->tmpBuff + realDictSize;

Because realDictSize is now set to the last size of the dictionary, cctxPtr->tmpIn starts behind the dictionary data and the memory of the dict is not modified. When the block is written, it still contains the data.

I should probably update the fuzzing test to make sure it works with both linked and independent blocks.

Contributor Author

The fuzzing test is updated to include both block modes.

alexmohr and others added 2 commits June 11, 2022 22:47
add a fuzzing test for uncompressed frame api

Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
@alexmohr alexmohr changed the title frame-api: add method to insert uncomressed data frame-api: add function to insert uncomressed data Jun 11, 2022
change the context to const to make
clear that the context is not modified
alexmohr and others added 2 commits June 12, 2022 00:41
add static dependency to examples
fuzzing test now tests linked and independent blocks

Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
@alexmohr
Contributor Author

Hi @Cyan4973 do you have further change requests?

@Cyan4973
Member

Hi @Cyan4973 do you have further change requests?

Hi @alexmohr ,
there is nothing needed to add on this PR.
You did a good job, it's a well done PR, it comes with good comments and good tests.

I'm just a bit uneasy about what happens to the dictionary after inserting an uncompressed block in the frame.
Nothing obvious, it's just that I don't fully understand everything.
Also this scenario is apparently tested in the fuzzer, so it should have caught something if there was any obvious flaw.

Essentially I just need to find some time to properly validate this PR.

@Cyan4973
Member

Cyan4973 commented Jul 1, 2022

I made a few tests with the new LZ4F_uncompressedUpdate() method in this PR,
unfortunately they all fail,
resulting in various errors on both the compression or decompression sides.

It's unclear if I did something wrong or if there is a problem with the new entry point.
I need to spend more time to understand the errors, and why they were not found by existing tests.

@Cyan4973
Member

Cyan4973 commented Jul 1, 2022

When trying to compile the ossfuzz test provided in this PR, I'm getting :

round_trip_frame_uncompressed_fuzzer.c:81: undefined reference to `LZ4F_uncompressedUpdate'

Probably some Include path issue.

@alexmohr
Contributor Author

alexmohr commented Jul 1, 2022

I made a few tests with the new LZ4F_uncompressedUpdate() method in this PR, unfortunately they all fail, resulting in various errors on both the compression or decompression sides.

It's unclear if I did something wrong or if there is a problem with the new entry point. I need to spend more time to understand the errors, and why they were not found by existing tests.

Can you share the tests? I suspect there is still an issue with my implementation

@alexmohr
Contributor Author

alexmohr commented Jul 1, 2022

When trying to compile the ossfuzz test provided in this PR, I'm getting :

round_trip_frame_uncompressed_fuzzer.c:81: undefined reference to `LZ4F_uncompressedUpdate'

Probably some Include path issue.

I'm probably doing something different here, because compiling works just fine for me:

[22:58:46] # mohalex @ bob in /tmp 
$ git clone https://github.com/alexmohr/lz4 -b add-uncompressed-api 
Cloning into 'lz4'...
remote: Enumerating objects: 13123, done.
remote: Counting objects: 100% (72/72), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 13123 (delta 40), reused 67 (delta 40), pack-reused 13051
Receiving objects: 100% (13123/13123), 5.92 MiB | 3.63 MiB/s, done.
Resolving deltas: 100% (9122/9122), done.

[22:58:51] # mohalex @ bob in /tmp 
$ cd lz4 

[22:58:53] # mohalex @ bob in /tmp/lz4 on git:add-uncompressed-api o 
$ make -C ossfuzz
make: Entering directory '/tmp/lz4/ossfuzz'
make -C ../lib CFLAGS=" -g -DLZ4_DEBUG=1 " liblz4.a
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION compress_fuzzer.c -o compress_fuzzer.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION lz4_helpers.c -o lz4_helpers.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION fuzz_data_producer.c -o fuzz_data_producer.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION standaloneengine.c -o standaloneengine.o
make[1]: Entering directory '/tmp/lz4/lib'
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION decompress_fuzzer.c -o decompress_fuzzer.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION round_trip_fuzzer.c -o round_trip_fuzzer.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION round_trip_stream_fuzzer.c -o round_trip_stream_fuzzer.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION compress_hc_fuzzer.c -o compress_hc_fuzzer.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION round_trip_hc_fuzzer.c -o round_trip_hc_fuzzer.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION compress_frame_fuzzer.c -o compress_frame_fuzzer.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION round_trip_frame_fuzzer.c -o round_trip_frame_fuzzer.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION round_trip_frame_uncompressed_fuzzer.c -o round_trip_frame_uncompressed_fuzzer.o
cc -c  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION decompress_frame_fuzzer.c -o decompress_frame_fuzzer.o
compiling static library
make[1]: Leaving directory '/tmp/lz4/lib'
g++  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION   compress_fuzzer.o lz4_helpers.o fuzz_data_producer.o ../lib/liblz4.a standaloneengine.o -o compress_fuzzer
g++  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION   decompress_fuzzer.o lz4_helpers.o fuzz_data_producer.o ../lib/liblz4.a standaloneengine.o -o decompress_fuzzer
g++  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION   round_trip_fuzzer.o lz4_helpers.o fuzz_data_producer.o ../lib/liblz4.a standaloneengine.o -o round_trip_fuzzer
g++  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION   round_trip_stream_fuzzer.o lz4_helpers.o fuzz_data_producer.o ../lib/liblz4.a standaloneengine.o -o round_trip_stream_fuzzer
g++  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION   compress_hc_fuzzer.o lz4_helpers.o fuzz_data_producer.o ../lib/liblz4.a standaloneengine.o -o compress_hc_fuzzer
g++  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION   round_trip_hc_fuzzer.o lz4_helpers.o fuzz_data_producer.o ../lib/liblz4.a standaloneengine.o -o round_trip_hc_fuzzer
g++  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION   compress_frame_fuzzer.o lz4_helpers.o fuzz_data_producer.o ../lib/liblz4.a standaloneengine.o -o compress_frame_fuzzer
g++  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION   round_trip_frame_fuzzer.o lz4_helpers.o fuzz_data_producer.o ../lib/liblz4.a standaloneengine.o -o round_trip_frame_fuzzer
g++  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION   round_trip_frame_uncompressed_fuzzer.o lz4_helpers.o fuzz_data_producer.o ../lib/liblz4.a standaloneengine.o -o round_trip_frame_uncompressed_fuzzer
g++  -g -DLZ4_DEBUG=1   -I../lib -DXXH_NAMESPACE=LZ4_ -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION   decompress_frame_fuzzer.o lz4_helpers.o fuzz_data_producer.o ../lib/liblz4.a standaloneengine.o -o decompress_frame_fuzzer
rm compress_frame_fuzzer.o decompress_frame_fuzzer.o decompress_fuzzer.o round_trip_hc_fuzzer.o compress_fuzzer.o round_trip_frame_uncompressed_fuzzer.o standaloneengine.o round_trip_stream_fuzzer.o fuzz_data_producer.o round_trip_frame_fuzzer.o round_trip_fuzzer.o compress_hc_fuzzer.o lz4_helpers.o
make: Leaving directory '/tmp/lz4/ossfuzz'

If you can post the commands you're using, I'll try to find out what's different.

@Cyan4973
Member

Cyan4973 commented Jul 1, 2022

I think I narrowed down the issue to situations where the fuzzer generates a lot of data to pass via the new LZ4F_uncompressedUpdate method.

@Cyan4973
Member

Cyan4973 commented Jul 1, 2022

OK, so this happens specifically when the amount of data to pass via the new LZ4F_uncompressedUpdate is larger than a block size.

@Cyan4973
Member

Cyan4973 commented Jul 1, 2022

Can you share the tests? I suspect there is still an issue with my implementation

I'm using a modified variant of frametest, which is a sort of "poor man's fuzzer" implementation.
In this updated variant, uncompressed blocks are randomly added to the frame using the new LZ4F_uncompressedUpdate entry point.

I could create a feature branch to publish the modified test if you want.

@Cyan4973
Member

Cyan4973 commented Jul 1, 2022

If you can post the commands you're using I'll try to find out whats different.

I was doing something equivalent on my side when link stage failed.
Just to be on the safe side, I decided to copy/paste your proposed list of commands exactly, and it worked.
It seems to show that the build recipe is correct. I guess the issue is on my system.

Anyway, I'm not using this tool for testings currently, but rather a modified variant of frametest, so it did not block me.

@Cyan4973
Member

Cyan4973 commented Jul 1, 2022

Modified variant of frametest posted in feature branch pr1094_frametest

@Cyan4973
Member

Cyan4973 commented Jul 2, 2022

I also now realize that we have been jumping into the implementation details without even talking about the use case.

The initial message mentions #814 as a reason to propose this PR,
but #814 is actually very different (a niche scenario, with unspecified out-of-band capabilities, and focused on the decompression side) that this PR doesn't answer, not even partially.

So the question is :
In which scenario is it desirable to send raw uncompressed blocks inside an LZ4 Frame ?

Asking as:

  • we may possibly have existing ways to provide a solution for the target scenario.
  • any added code is more maintenance and more attack vectors to protect against. So it should be justified by a reasonable scenario to serve.

@alexmohr
Contributor Author

alexmohr commented Jul 4, 2022

I also now realize that we have been jumping into the implementation details without even talking about the use case.

The initial message mentions #814 as a reason to propose this PR, but #814 is actually very different (a niche scenario, with unspecified out-of-band capabilities, and focused on the decompression side) that this PR doesn't answer, not even partially.

So the question is : In which scenario is it desirable to send raw uncompressed blocks inside an LZ4 Frame ?

Asking as:

* we may possibly have existing ways to provide a solution for the target scenario.

* any added code is more maintenance and more attack vectors to protect against. So it should be justified by a reasonable scenario to serve.

Regarding #814, I was referring to this part of the description:

Alternatively, the user could prepend a fake LZ4F block header to the uncompressed data, and pass that to the normal decompression function. This works with the current LZ4 version.

The use case for me is that we're streaming a lz4 compressed tar archive from memory to disk. Tar does not support streaming out of the box, so we have to patch the header by setting the correct file size as soon as we're done with streaming a file.
As our output is lz4 compressed we have to write the header uncompressed so we can seek back in the file and correct the data on disk.
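
The "seek back and correct the data on disk" step can be sketched as below, assuming the bytes to patch were stored uncompressed at a known file offset. `demo_patchAt` is a hypothetical helper, not part of lz4 or tar. The sketch deliberately ignores everything a real implementation would also have to keep consistent, such as the tar header's own checksum field and any LZ4 block or content checksums enabled in the frame.

```c
#include <assert.h>
#include <stdio.h>

/* Overwrite `size` bytes at absolute position `offset`, then restore the
 * previous write position so streaming can continue where it left off.
 * Returns 0 on success, -1 on failure. */
static int demo_patchAt(FILE* f, long offset, const void* bytes, size_t size)
{
    long resumeAt;
    assert(f != NULL && bytes != NULL);
    resumeAt = ftell(f);                       /* remember where we were */
    if (resumeAt < 0) return -1;
    if (fseek(f, offset, SEEK_SET) != 0) return -1;
    if (fwrite(bytes, 1, size, f) != size) return -1;
    return (fseek(f, resumeAt, SEEK_SET) != 0) ? -1 : 0;
}
```

This only works because the patched region is stored verbatim: if the header had been compressed, changing it would change the compressed bytes and possibly their length.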

When the block mode changes, a flush is executed to prevent
mixing compressed and uncompressed data.
Prior to this commit, dstStart, dstPtr and dstCapacity
were not updated to include the offset from bytesWritten.
For inputs > blockSize, this meant the flushed data was
overwritten.

Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
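
The class of bug described in this commit message can be illustrated with a small, self-contained sketch. `demo_emit` and `demo_flushThenWrite` are hypothetical stand-ins, not the lz4frame.c code: the first models any stage that writes bytes into the destination buffer, the second shows the output cursor and remaining capacity being advanced after the flush so that the following block does not overwrite the flushed data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stage that emits `n` bytes of `fill` into dst.
 * Returns the number of bytes written, or 0 if dst is too small. */
static size_t demo_emit(unsigned char* dst, size_t dstCapacity,
                        unsigned char fill, size_t n)
{
    if (n > dstCapacity) return 0;
    memset(dst, fill, n);
    return n;
}

/* Flush `pending` bytes, then write a block of `blockSize` bytes after them.
 * Returns the total number of bytes produced, or 0 on insufficient capacity. */
static size_t demo_flushThenWrite(unsigned char* dstBuffer, size_t dstCapacity,
                                  size_t pending, size_t blockSize)
{
    unsigned char* dstPtr = dstBuffer;
    size_t written;
    assert(dstBuffer != NULL);

    written = demo_emit(dstPtr, dstCapacity, 'F', pending);   /* flush */
    if (written != pending) return 0;
    /* Account for the flushed bytes before writing anything else. */
    dstPtr      += written;
    dstCapacity -= written;

    written = demo_emit(dstPtr, dstCapacity, 'B', blockSize); /* new block */
    if (written != blockSize) return 0;
    dstPtr += written;
    return (size_t)(dstPtr - dstBuffer);
}
```

Without the two update lines after the flush, the second `demo_emit` call would start at `dstBuffer` again and the flushed bytes would be lost.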
@alexmohr
Contributor Author

alexmohr commented Jul 4, 2022

Modified variant of frametest posted in feature branch pr1094_frametest

Thanks, I found the issue using your test and pushed a new commit. Should I cherry-pick your commit on pr1094_frametest into this PR?

@Cyan4973
Member

Cyan4973 commented Jul 4, 2022

The use case for me is that we're streaming a lz4 compressed tar archive from memory to disk. Tar does not support streaming out of the box, so we have to patch the header by setting the correct file size as soon as we're done with streaming a file.
As our output is lz4 compressed we have to write the header uncompressed so we can seek back in the file and correct the data on disk.

OK, thanks for the explanation, that's an important starting point.

It looks to me that your use case doesn't only need to send some data uncompressed. In order to modify this data later,
it also needs this segment to be excluded from history, so that no future block could be based on past data that will be modified afterwards, resulting in corruption.
Such condition is automatically valid when using Block Independence mode, but is more troublesome when blocks are linked: now history must be actively messed up with.

While I understand the use case, there is a balance to find between serving it, and making the general library more complex to maintain and understand for everybody. Sometimes, for very niche use cases, it's acceptable to create a fork to serve it, and keep the "general" library free.

Here are a few proposals that could be employed to serve this use case :

  • The LZ4 Frame format is designed to deal with multiple concatenated frames, and deal with them as if they were a single content. Therefore, one approach could be to generate a "fake" frame with uncompressed content, for the tar header, followed by a normal frame for the file's content. This approach can be employed multiple times, the concatenation of all these frames would still be decompressed as if it was a single content.
  • If, for some reason, the receiving system is unable to deal with multiple frames, an intermediate idea would be to allow the creation of uncompressed data blocks, but only if the block mode is set to Independent. This way, it naturally solves the issue of making the content of the uncompressed block "disappear" from history, with no extra complexity.

I find the second idea attractive because it's likely going to reduce complexity significantly, and if it becomes "simple enough", then there is less "weight" in supporting it in the general library. I also suspect independent blocks are what you had in mind to begin with, so making them a prerequisite for using this capability is not going to hurt your use case.

Regarding #814, I was referring to this part of the description:

Alternatively, the user could prepend a fake LZ4F block header to the uncompressed data, and pass that to the normal decompression function. This works with the current LZ4 version.

This is actually a very different scenario. Here, data is presumed sent "out of band", in order to remove any kind of additional byte, not even a small header. And then it's re-inserted into LZ4F history on the decompression side.
That's a very niche use case, and it's unclear if it's the right move to have dedicated code to support it directly within the general liblz4 library, or if a fork would be more appropriate for that. The proposed solution doesn't need direct contribution from the library, and is likely what is being used currently, since it would work "as is".
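
For readers unfamiliar with the "fake block header" mentioned in the quoted part of #814: in the LZ4 frame format, each data block is preceded by a 4-byte little-endian size field whose highest bit, when set, marks the block as uncompressed. The sketch below encodes and decodes that field. The function names are hypothetical and nothing here comes from the lz4 sources; it also does not check the size against the frame's block maximum size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_UNCOMPRESSED_FLAG 0x80000000U  /* highest bit of the size field */

/* Write the 4-byte block header for an uncompressed block of `size` bytes. */
static void demo_writeRawBlockHeader(unsigned char dst[4], uint32_t size)
{
    uint32_t const field = size | DEMO_UNCOMPRESSED_FLAG;
    assert(size < DEMO_UNCOMPRESSED_FLAG);   /* size must fit in 31 bits */
    dst[0] = (unsigned char)(field & 0xFF);  /* little-endian byte order */
    dst[1] = (unsigned char)((field >> 8) & 0xFF);
    dst[2] = (unsigned char)((field >> 16) & 0xFF);
    dst[3] = (unsigned char)((field >> 24) & 0xFF);
}

/* Read a block header: returns the block size and reports via
 * *isUncompressed whether the block body is stored as-is. */
static uint32_t demo_readBlockHeader(const unsigned char src[4], int* isUncompressed)
{
    uint32_t const field = (uint32_t)src[0]
                         | ((uint32_t)src[1] << 8)
                         | ((uint32_t)src[2] << 16)
                         | ((uint32_t)src[3] << 24);
    assert(isUncompressed != NULL);
    *isUncompressed = (field & DEMO_UNCOMPRESSED_FLAG) != 0;
    return field & ~DEMO_UNCOMPRESSED_FLAG;
}
```

A 5-byte uncompressed block would therefore be stored as the header bytes 05 00 00 80 followed by the 5 raw bytes.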

Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
@alexmohr
Contributor Author

alexmohr commented Jul 5, 2022

  • If, for some reason, the receiving system is unable to deal with multiple frames, an intermediate idea would be to allow the creation of uncompressed data blocks, but only if the block mode is set to Independent. This way, it naturally solves the issue of making the content of the uncompressed block "disappear" from history, with no extra complexity.

I quite like the idea of making this only available for independent blocks. This means we can remove the special dictionary handling which you commented on. It makes everything much simpler and works for my use case just fine. I changed the MR accordingly and ran your modified frame test again and everything seems to be in working order

lib/lz4hc.h Outdated
@@ -405,6 +405,18 @@ LZ4LIB_STATIC_API void LZ4_attach_HC_dictionary(
LZ4_streamHC_t *working_stream,
const LZ4_streamHC_t *dictionary_stream);

/*! LZ4_getDictHCSize():
Member

I presume this entry point is not needed anymore

Contributor Author

Sorry I forgot to remove that.

lib/lz4frame.c Outdated
void* dstBuffer, size_t dstCapacity,
const void* srcBuffer, size_t srcSize,
const LZ4F_compressOptions_t* compressOptionsPtr) {
assert(cctxPtr->prefs.frameInfo.blockMode == LZ4F_blockIndependent);
Member

Here, I would prefer an actual test, followed by an error if the condition is not respected.
Wrong block mode is an easy mistake to make.

Contributor Author

replaced with RETURN_ERROR_IF(cctxPtr->prefs.frameInfo.blockMode != LZ4F_blockIndependent, blockMode_invalid);
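
The difference between the assert and the requested runtime test is that an assert disappears in release builds (when NDEBUG is defined), while an explicit check reports the mistake to the caller in every build. A minimal sketch of the pattern follows, using hypothetical `demo_` types and a simplified macro; the real macro and error codes live in lz4frame.c and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the lz4frame types and error codes. */
typedef enum { DEMO_blockLinked = 0, DEMO_blockIndependent = 1 } demo_blockMode_t;
typedef enum { DEMO_OK = 0, DEMO_ERROR_blockMode_invalid = 1 } demo_error_t;
typedef struct { demo_blockMode_t blockMode; } demo_cctx_t;

/* Simplified counterpart of the macro named in the reply above:
 * leave the function with an error code when the condition holds. */
#define DEMO_RETURN_ERROR_IF(cond, err) \
    do { if (cond) return (err); } while (0)

static demo_error_t demo_uncompressedUpdate(const demo_cctx_t* cctx)
{
    assert(cctx != NULL);   /* programming error, debug builds only */
    /* User-facing precondition: checked in all builds, reported as an error. */
    DEMO_RETURN_ERROR_IF(cctx->blockMode != DEMO_blockIndependent,
                         DEMO_ERROR_blockMode_invalid);
    /* ... the uncompressed block would be written here ... */
    return DEMO_OK;
}
```

A caller that configured linked blocks by mistake then gets an error code back instead of silent misbehavior in a release build.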

@@ -26,7 +26,7 @@ foreach e, src : examples
executable(
e,
lz4_source_root / 'examples' / src,
-dependencies: liblz4_dep,
+dependencies: [liblz4_dep, liblz4_internal_dep],
Member

What is liblz4_internal_dep ?

Contributor Author

I removed liblz4_dep, as liblz4_internal_dep is necessary for static linkage of lz4; having both is redundant.
It's defined here as static_library(...). For example, contrib/meson/meson/programs/meson.build uses static linkage as well.

@alexmohr alexmohr force-pushed the add-uncompressed-api branch 3 times, most recently from ab8c4ee to 25feb4f Compare July 5, 2022 19:12
@@ -43,7 +43,7 @@ liblz4_dep = declare_dependency(
include_directories: include_directories(lz4_source_root / 'lib')
)

-if get_option('tests') or get_option('programs')
+if get_option('tests') or get_option('programs') or get_option('programs')
Member

get_option('programs') seems repeated twice

Contributor Author

I committed it too early. I didn't think you were going to review it right away, but I guess you got a few mails :/

* replace assert with test for LZ4F_uncompressedUpdate
* update documentation to include correct docstring
* remove unnecessary entry point
* remove compress_linked_block_mode from fuzzing test

Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>