New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
frame-api: add function to insert uncomressed data #1094
Conversation
9e24dac
to
c2e0230
Compare
Hi, @alexmohr |
new method `uncompressed_update` allows to insert blocks without compression into the lz4 stream. The usage is documented in the frameCompress example Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
c2e0230
to
4aeb502
Compare
@t-mat Thanks, I think the compatibility tests are working now. I also had a segfault because I forgot to add a null check if no uncompressed file is passed. |
lib/lz4.h
Outdated
@@ -346,6 +346,8 @@ LZ4LIB_API int LZ4_loadDict (LZ4_stream_t* streamPtr, const char* dictionary, in | |||
*/ | |||
LZ4LIB_API int LZ4_compress_fast_continue (LZ4_stream_t* streamPtr, const char* src, char* dst, int srcSize, int dstCapacity, int acceleration); | |||
|
|||
LZ4LIB_API int LZ4_DictSize (LZ4_stream_t* LZ4_dict, int dictSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few conventions :
- function names start in lowercase, excluding the prefix
- new functions shall be documented. What does it do ? Set a new dictSize ? Get a current dictSize ? What are the limitations ? What is the parameter for ? What happens in case of error ?
- Generally, function name starts with a verb/action, to better qualify the effect, for example
LZ4_setDictSize()
orLZ4_reduceDictSize()
. - New symbols do not start their life directly in "stable" area. They have to spend some time in "staging" area below, to prove their worth and collect user feedback. As a consequence, the qualifier changes to
LZ4LIB_STATIC_API
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I solved your comments. Also added a fuzzing test to make sure the changes are working properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also added a fuzzing test to make sure the changes are working properly
Great !
lib/lz4hc.h
Outdated
@@ -173,6 +173,8 @@ LZ4LIB_API int LZ4_compress_HC_continue_destSize(LZ4_streamHC_t* LZ4_streamHCPtr | |||
const char* src, char* dst, | |||
int* srcSizePtr, int targetDstSize); | |||
|
|||
LZ4LIB_API int LZ4_DictHCSize(LZ4_streamHC_t* LZ4_streamHCPtr, int dictSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment regarding new function symbol name
lib/lz4frame.h
Outdated
@@ -160,6 +160,11 @@ typedef enum { | |||
LZ4F_OBSOLETE_ENUM(skippableFrame) | |||
} LZ4F_frameType_t; | |||
|
|||
typedef enum { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this enum
used / manipulated by the user ?
If not, it doesn't need to be part of the public API,
and can remain private inside lz4frame.c
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it into lz4frame.c
as it's not supposed to by used by the user.
This commit fixes the review findings Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
This commit fixes the review findings Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
70c8abb
to
d0a67df
Compare
lib/lz4.h
Outdated
@@ -509,6 +509,17 @@ LZ4LIB_STATIC_API int LZ4_compress_fast_extState_fastReset (void* state, const c | |||
*/ | |||
LZ4LIB_STATIC_API void LZ4_attach_dictionary(LZ4_stream_t* workingStream, const LZ4_stream_t* dictionaryStream); | |||
|
|||
/*! LZ4_getDictSize(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this documentation. It makes the intention clearer.
Yet, I had a hard time connecting the function's name with its intended objective:
This can be used for adding data without compression to the LZ4 archive.
If linked blocked mode is used the memory of the dictionary is kept free.
I suspect that's because the documentation blends multiple layers of responsibilities in this paragraph.
At this place in the API, LZ4_getDictSize()
seems to be just about knowing the current dictionary size of the active LZ4_stream_t*
state. And it's likely implied that this is not to be used in a concurrent access scenario.
That this function is then employed in the context of LZ4Frame
for a specific mode adding uncompressed data can be interesting information, but it does not define what this function is doing. The size information it provides could be employed for any other usage, so it matters that it's cleanly defined.
This leads me to a few simple questions :
- what is
@dictSize
argument for ?
All it does is cap the reporteddictSize
, without changing anything to underlying situation ?
If the point of this function is to return the dictionary size, maybe it should do just that ?
And if there is a reason to cap the value at the calling site, maybe this should be done at the calling site ? - Getter generally to not mutate the state they are looking into. Assuming this is the case here too, the state could be
const LZ4_stream_t*
instead, which makes it clear that this function has no side effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code just has been moved out of the LZ4_saveDict
function (where it's also used now).
I passed @dictSize
to be consistent with the previous implementation. I'm fine with removing the dict size parameter and capping it to 64 KB locally in this function. But than we would either have to change the signature of LZ4_saveDict
to remove the dictSize
parameter or restore the old code of LZ4_saveDict
to calculate the dict size there, which would lead to duplicated code.
As for you second point: I changed the parameter as well as dict
const
to make clear that these are not modified.
As I wrote in the other thread already all this only has been added so do not modify the dictionary when adding uncompressed data.
If you think modifying the dictionary is okay when adding uncompressed data, I'd remove the if
from https://github.com/lz4/lz4/pull/1094/files#diff-16e71ed5519d7ce479c3a3c3158b3e5b121fd300b78497bb477a6695b6d08b50R969 and restore the old way the dict was calculated here.
In case we keep the get_dictSize
function I'll update the documentation again to again make it a bit clearer what the intent of this function is.
lib/lz4frame.c
Outdated
int const realDictSize = LZ4F_localSaveDict(cctxPtr); | ||
assert(0 <= realDictSize && realDictSize <= 64 KB); | ||
cctxPtr->tmpIn = cctxPtr->tmpBuff + realDictSize; | ||
/* only keep the space of the dictionary, so dict data is kept for the next compressedUpdate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This portion of the code confuses me.
What is the objective ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was not to write the dictionary if an uncompressed block is written. That's why I've added the get dict size functions. They are used to keep the space of the dictionary free without putting any new data in.
The alternative would be to remove this and always update the dictionary even if we are writing an uncompressed block.
It would make the dictionary a bit worse but probably simplify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned that this might not be conformant to the frame specification (though I'm unsure if I do understand the details).
Let's quickly state that independent blocks are unaffected, this part is clear.
For linked blocks though, it's specified that the each block uses previous block(s) as a dictionary.
If this flag is set to “0”, each block depends on previous ones (up to LZ4 window size, which is 64 KB). In such case, it’s necessary to decode all blocks in sequence.
Note that each block depends on previous ones, not on previous compressed blocks. This means that, if a block is uncompressed, it's still part of the dictionary for the following block.
I'm not sure how this plays out here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why we keep the dictionary but without modifications. The uncompressed block still contains the dictionary but it's not updated with new data.
lib/lz4frame.c#975
realDictSize = LZ4F_localDictSize(cctxPtr);
}
assert(0 <= realDictSize && realDictSize <= 64 KB);
cctxPtr->tmpIn = cctxPtr->tmpBuff + realDictSize;
as real dict size is now set to the last size of the dictionary cctxPtr->tmpIn
starts behind the dictionary data and the memory of the dict is not modified. When the block is written is still contains the data
I probably should update the fuzzing test to make sure it's working dependent and independent blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fuzzing test is updated to include both blocked modes.
add a fuzzing test for uncompressed frame api Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
d0a67df
to
3c57d2f
Compare
change the context to const to make clear that the context is not modified
c3b5ef7
to
9a42a9d
Compare
add static dependency to examples
fuzzing test now tests linked and independent blocks Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
Hi @Cyan4973 do you have further change requests? |
Hi @alexmohr , I'm just a bit uneasy about what happens to the dictionary after inserting an uncompressed block in the frame. Essentially I just need to find some time to properly validate this PR. |
I made a few tests with the new It's unclear if I did something wrong or if there is a pb with the new entry point. |
When trying to compile the
Probably some Include path issue. |
Can you share the tests? I suspect there is still an issue with my implementation |
I'm probably doing something different here bc compiling works just fine for me
If you can post the commands you're using I'll try to find out whats different. |
I think I narrowed down the issue to situations where the fuzzer generates a lot of data to pass via the new |
OK, so this happens specifically when the amount of data to pass via the new |
I'm using a modified variant of I could create a feature branch to publish the modified test if you want. |
I was doing something equivalent on my side when link stage failed. Anyway, I'm not using this tool for testings currently, but rather a modified variant of |
Modified variant of |
I also now realize that we have been jumping into the implementation details without even talking about the use case. The initial message mentions #814 as a reason to propose this PR, So the question is : Asking as:
|
Regarding #814 I was refering to this part of the description:
The use case for me is that we're streaming a lz4 compressed tar archive from memory to disk. Tar does not support streaming out of the box, so we have to patch the header by setting the correct file size as soon as we're done with streaming a file. |
when the block mode changes a flush is executed, to prevent mixing compressed and uncompressed data. Prior to this commit dstStart, dstPtr, dstCapacity where not updated to include the offset from bytesWritten. For inputs > blockSize this meant the flushed data was overwritten. Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
Thanks, I found the issue using your test and pushed a new commit. Should I cherry-pick your commit on |
OK, thanks for the explanation, that's an important starting point. It looks to me that your use case doesn't only need to send some data uncompressed. In order to modify this data later, While I understand the use case, there is a balance to find between serving it, and making the general library more complex to maintain and understand for everybody. Sometimes, for very niche use cases, it's acceptable to create a fork to serve it, and keep the "general" library free. Here are a few proposals that could be employed to serve this use case :
I find the second idea attractive because it's likely going to reduce complexity significantly, and if becomes "simple enough", then there is less "weight" supporting it into the general library. I also suspect independent blocks is what you had in mind to begin with, so making it a pre-requisite to use this capability is not going to hurt your use case.
This is actually a very different scenario. Here, data is presumed sent "out of band", in order to remove any kind of additional byte, not even a small header. And then it's re-inserted into LZ4F history on the decompression side. |
Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
6610426
to
42eb47d
Compare
I quite like the idea of making this only available for independent blocks. This means we can remove the special dictionary handling which you commented on. It makes everything much simpler and works for my use case just fine. I changed the MR accordingly and ran your modified frame test again and everything seems to be in working order |
lib/lz4hc.h
Outdated
@@ -405,6 +405,18 @@ LZ4LIB_STATIC_API void LZ4_attach_HC_dictionary( | |||
LZ4_streamHC_t *working_stream, | |||
const LZ4_streamHC_t *dictionary_stream); | |||
|
|||
/*! LZ4_getDictHCSize(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume this entry point is not needed anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I forgot to remove that.
lib/lz4frame.c
Outdated
void* dstBuffer, size_t dstCapacity, | ||
const void* srcBuffer, size_t srcSize, | ||
const LZ4F_compressOptions_t* compressOptionsPtr) { | ||
assert(cctxPtr->prefs.frameInfo.blockMode == LZ4F_blockIndependent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I would prefer an actual test, followed by an error if the condition is not respected.
Wrong block mode is an easy mistake to make.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced with RETURN_ERROR_IF(cctxPtr->prefs.frameInfo.blockMode != LZ4F_blockIndependent, blockMode_invalid);
@@ -26,7 +26,7 @@ foreach e, src : examples | |||
executable( | |||
e, | |||
lz4_source_root / 'examples' / src, | |||
dependencies: liblz4_dep, | |||
dependencies: [liblz4_dep, liblz4_internal_dep], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is liblz4_internal_dep
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed liblz4_dep
as liblz4_internal_dep
is necessary for static linkage of lz4. Having both is redundant.
It's defined here as static_library(...)
for example contrib/meson/meson/programs/meson.build
is using the static linkage as well.
ab8c4ee
to
25feb4f
Compare
contrib/meson/meson/lib/meson.build
Outdated
@@ -43,7 +43,7 @@ liblz4_dep = declare_dependency( | |||
include_directories: include_directories(lz4_source_root / 'lib') | |||
) | |||
|
|||
if get_option('tests') or get_option('programs') | |||
if get_option('tests') or get_option('programs') or get_option('programs') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_option('programs')
seems repeated twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commited it too early. I didn't think you're going to review it right away but I guess you got a few mails :/
* replace assert with test for LZ4F_uncompressedUpdate * update documentation to incldue correct docstring * remove unecessary entry point * remove compress_linked_block_mode from fuzzing test Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
25feb4f
to
0ac3c74
Compare
new function
uncompressed_update
allows to insert blocks withoutcompression into the lz4 stream.
The usage is documented in the frameCompress example
This could be a solution for #814
Alexander Mohr, alexander.m.mohr@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH, imprint
Signed-off-by: Alexander Mohr alexander.m.mohr@mercedes-benz.com