Add functions for faster preserving/restoring of streams #188

Closed · wants to merge 2 commits

Conversation

@rkd-msw commented Feb 29, 2016

We have a use case that requires us to create lots of new LZ4 streams, and we found that loading the dictionary for each of them was taking a non-trivial amount of CPU (about 3% of CPU for the whole application).

We investigated whether we could load a dictionary once and preserve the resulting LZ4_stream_t structure. A plain memcpy was actually slower (because of the overhead of copying the whole 16 KiB hash table), so we've implemented a scheme where we create a separate structure recording which entries of the hash table we should copy, and then copy just those entries when cloning the structure (see the sketch below).

This is about 33% faster than using LZ4_loadDict each time.
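
Roughly, the scheme looks like this (a simplified sketch; the names and layout here are illustrative, not the actual patch code):

```c
/* Simplified sketch of the scheme described above; names and layout are
 * illustrative, not the actual patch code.  The default LZ4 hash table
 * is 16 KiB of 32-bit entries; after loading a small dictionary, most
 * of them are still zero. */
#include <stdint.h>
#include <string.h>

#define HASH_ENTRIES 4096               /* 16 KiB / 4 bytes per entry */

typedef struct {
    uint16_t pos[HASH_ENTRIES];         /* indices of non-zero entries */
    uint32_t val[HASH_ENTRIES];         /* their values */
    int count;
} preserved_table_t;

/* Done once, after loading the dictionary: record the non-zero entries. */
static void preserve_table(const uint32_t* table, preserved_table_t* p)
{
    int i;
    p->count = 0;
    for (i = 0; i < HASH_ENTRIES; i++)
        if (table[i] != 0) {
            p->pos[p->count] = (uint16_t)i;
            p->val[p->count] = table[i];
            p->count++;
        }
}

/* Done per new stream: zero the table, then rewrite only the few
 * recorded entries instead of copying all 16 KiB. */
static void restore_table(uint32_t* table, const preserved_table_t* p)
{
    int i;
    memset(table, 0, HASH_ENTRIES * sizeof(uint32_t));
    for (i = 0; i < p->count; i++)
        table[p->pos[i]] = p->val[i];
}
```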

We're happy for this to be under the 2-clause BSD license (from https://github.com/Cyan4973/lz4/blob/d008c87151abf8c36a9f98d28461bf6f3dfdc6ae/lib/LICENSE).

@Cyan4973 (Member) commented Mar 1, 2016

Thanks for this PR, Rob.
This is a very welcome initiative.

Note that a few automated tests fail with this PR (see results above).
They seem to be minor style differences making your code incompatible with g++ and Visual Studio compilation. Nothing too difficult to fix.

Now, regarding the PR itself.
First, I'm surprised that memcpy() accounts for a noticeable share of the computation. Compression is supposed to be much slower than memcpy(), so for it to become noticeable, I assume the amount of data you compress per stream is very small (<< 16 KB).

Now, let's assume this 16 KB memcpy() is a performance hog. The solution you propose is reported to be 3x faster, which is impressive, and I'm even more surprised...

Even though your proposed code is fairly simple, it's still more complex for the CPU than a straight memcpy() operation.
The only situation where I can imagine it providing a speed benefit is in association with a sparse hash table, that is, with a dictionary which is relatively small and consequently leaves the hash table mostly empty. In that case, the number of non-zero references can be small enough that writing them individually is faster than a full-table memcpy(). Note though that you will still have to memset() the table before filling it. That's a very fast operation, but so is memcpy(), so the difference between the two becomes smaller.

If the above hypothesis is right, your method will be faster with small inputs and small dictionaries, while memcpy() should prove faster with larger dictionaries. I would naively expect the two to break even around an ~8 KB dictionary size, but only tests will tell.

So I believe the next stage is to run some tests.
It's important to understand precisely in which cases your proposed patch provides a benefit, and how large it is. It's both promising and surprising, so tests will be fairly important; one possible starting point is sketched below.
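
For instance, a rough micro-benchmark along these lines (our sketch, not part of the PR) could compare a full-table memcpy() against memset() plus sparse rewrites at different fill levels:

```c
/* Rough micro-benchmark sketch (not from the PR): full 16 KiB memcpy()
 * vs. memset() plus sparse rewrites, for a table with `filled` non-zero
 * entries.  Real dictionary entries are scattered rather than
 * contiguous, so treat results as indicative only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ENTRIES 4096                /* 16 KiB of 32-bit entries */

int main(void)
{
    static uint32_t src[ENTRIES], dst[ENTRIES];
    volatile uint32_t sink = 0;     /* keeps the loops from being elided */
    const int filled = 256;         /* simulate a small dictionary */
    const int reps = 1000000;
    int i, r;
    clock_t t0, t1, t2;

    for (i = 0; i < filled; i++) src[i] = (uint32_t)(i + 1);

    t0 = clock();
    for (r = 0; r < reps; r++) {
        memcpy(dst, src, sizeof(src));
        sink += dst[r & (ENTRIES - 1)];
    }
    t1 = clock();
    for (r = 0; r < reps; r++) {
        memset(dst, 0, sizeof(dst));
        for (i = 0; i < filled; i++) dst[i] = src[i];
        sink += dst[r & (ENTRIES - 1)];
    }
    t2 = clock();

    printf("memcpy: %.2fs  memset+sparse: %.2fs  (sink=%u)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, (unsigned)sink);
    return 0;
}
```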

@Cyan4973 (Member) commented Mar 2, 2016

Oh, by the way, one last detail @rkd-msw:
no PRs directly into the "master" branch;
this should go to "dev" or to a dedicated feature branch first.

@rkd-msw (Author) commented Mar 2, 2016

Yann, thanks!

Taking your points one-by-one:

  • Do you have a style guide that describes what you expect? It's not quite clear to me from the error messages what's wrong.
  • Yes, we're only compressing small amounts of data in each stream - perhaps 0.2-4 KB.
  • Yes, we're only using small dictionaries - one is about 350 bytes and the other is about 1000 bytes.

I can easily believe that when compressing more data with larger dictionaries, the benefit would be much smaller.

I have some tests I used during development to prove the benefits in this scenario - let me dig those out, clean them up and attach them.

@Cyan4973 (Member) commented Mar 2, 2016

Do you have a style guide that describes what you expect?

There is no specific style guide. I just follow the compilers' error notifications.

For g++, the problem is that C++ doesn't allow an implicit cast from void* to another type.
Line 1016 of lz4.c should add an explicit cast to preserved_hash_table_entry_t*.

For Visual, the issue is that it doesn't allow variable declarations in the middle of a block.
It requires variable declarations at the beginning of a block.
You can either move int i there, or create a sub-block so the declaration sits at the start of one. Both fixes are sketched below.
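
For concreteness, the two fixes look roughly like this (a sketch with a stand-in struct; only the type name preserved_hash_table_entry_t comes from the patch itself):

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in definition for illustration only. */
typedef struct { uint32_t pos; uint32_t val; } preserved_hash_table_entry_t;

/* g++ fix: C++ forbids implicit conversion from void*, so cast the
 * result of malloc() explicitly (plain C accepts it either way). */
static preserved_hash_table_entry_t* alloc_entries(size_t n)
{
    return (preserved_hash_table_entry_t*)
           malloc(n * sizeof(preserved_hash_table_entry_t));
}

/* Visual fix: C89 requires declarations at the start of a block, so
 * hoist `int i` there (or wrap the loop in its own sub-block). */
static void zero_table(uint32_t* table, int n)
{
    int i;                          /* declared up front, C89-style */
    for (i = 0; i < n; i++)
        table[i] = 0;
}
```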

Another point: if your patch improves speed for a combination of small dictionary + small blocks, it is important to document that (typically using inline comments) and to give good hints about which sizes it concerns (and what gain to expect).

The risk here is confusion for third-party users: they may assume that they must use these new functions while in fact being in a fairly different situation (a large dictionary or larger blocks), for which the patch would hurt performance.
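
For example, the new functions could carry a header comment along these lines (hypothetical name and wording, with the ~8 KB figure taken from the naive break-even guess above):

```c
/*! LZ4_cloneStream() :  (hypothetical name and wording)
 *  Faster alternative to calling LZ4_loadDict() on each new stream,
 *  but only when both the dictionary and the compressed blocks are
 *  small (dictionary well under ~8 KB, blocks << 16 KB).
 *  With larger dictionaries or blocks, LZ4_loadDict() is expected
 *  to be faster; measure before switching. */
```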

@Cyan4973 (Member) commented Nov 7, 2016

We'll proceed differently for this use case.
Instead of modifying / restoring the reference hash table, which is not multi-thread compatible, we'll try to make it immutable and usable "as is".

But it's a significant change, so it's unclear when we will have time to do it.
Still, it's on our todo list now.

Cyan4973 closed this Nov 7, 2016