lj_str_new hash conflict is serious when length larger than 128 #60
Not sure if this fork wants to try this instead:
Thanks for the troubleshooting @stone-wind. It is a valid concern for your specific use case, but unfortunately fixing it properly inside OpenResty would be harder. The original string hash algorithm in LuaJIT is very fast: it has constant time complexity for strings of arbitrary length. The CRC32 variant https://github.com/openresty/luajit2 uses is intended to make hash collision attacks harder by randomizing the positions the hash function samples while calculating the hash; it is slower than the original hash function, but nevertheless still constant time. To actually address your issue of a large number of strings with the same length and almost identical content, we would have to resort to an O(n) hash algorithm, because we have no way of knowing beforehand where the differences in the strings are. The CRC32 hash implementation does use every byte for the CRC32 calculation for sizes up to 127 bytes, so if you can change your program to read only 127 bytes of data from the socket at a time, it should not experience this many collisions. If that is not acceptable, another option is that we may consider allowing users to compile LuaJIT to use all bytes of the string for the CRC32 calculation, downgrading the algorithm to O(n) but making it more resilient against collisions in very similar long strings. Also, a word of caution about LuaJIT/LuaJIT#294: we have never tested it in OpenResty and cannot guarantee it does not introduce bugs in the LuaJIT core.
Just $0.02 but my intuition is that constant factors will be at least as important as asymptotic cost:
Does anybody know what the current state-of-the-art hashing algorithm is and how it performs in these ways? Or what numbers would be acceptable for OpenResty? Or what numbers we achieve today? I am especially wondering whether an afternoon of Googling might turn up a suitable candidate. Here is the little bit of hard data I've been able to turn up right now:
These are only upper bounds on throughput; actual performance may be dominated by fixed costs or memory access. I'm just trying to frame the problem in quantitative terms and establish what level of performance we want and what we can realistically achieve. That might help decide whether it's better to hash whole strings or whether we really have to sample them.
Just an aside thought: could an interesting feature be string creation without interning? I am thinking that the cost of hashing and interning strings of binary data probably outweighs the benefit. The interning would be valuable if the string were already present in the table, or was going to be used as a table key, or was going to be used for equality tests against other strings. However, none of these things seems to be true for strings containing e.g. raw binary data received from a socket. So the idea would be to create those strings using a special reserved hash value (

This would introduce a certain amount of new complexity - probably too much - but it would also simplify some things, e.g. not having to worry about the cost of hashing/interning buffers of binary data, and not having to account for hashing while benchmarking applications, e.g. accidentally sending test data that interns much better/worse than live traffic. (I suppose this is the same distinction as in many other languages, where some string data is interned and some is not, going by names like strings, string buffers, symbols, binaries, etc.)

EDIT: Could also be a step towards being able to use
@lukego
I already made a string buffer system that works in place of strings for some of the string library functions, and experimented with having them pretend to be string objects.
That would be the direct approach, and probably a better idea if it's feasible to migrate to that. I was musing here about an indirect approach of adding a special sub-type of Lua strings that are non-interned and more suitable for binary data while mostly compatible with existing APIs. This would avoid the problem on this issue #60 because you would skip the checksum on binary network data, and that would reduce the pressure to settle for a collision-prone checksum algorithm on other strings. Everybody wins? It might be easy to implement on the VM side but it definitely introduces new complexity for users. (I'm thinking of Erlang, where on the C side they had efficient strings but on the Erlang side they had inefficient ones, and they invented a new type - "binary" - that worked as a good compromise.)
Cool @fsfod! |
I have made an example hash-test.lua for testing:
when cnt is 2000:
when cnt is 20000:
Obviously, conflicts become more frequent as cnt grows. Then I changed the hash function and tested again:
when cnt is 20000:
when cnt is 200000:
Obviously better; you can verify it yourself. The new hash function is wyhash.
Wyhash is better than crc32_hw: it does not need SSE, it is simple (header-only, less C code), fast, and solid, with fewer conflicts. Below is the performance data from smhasher:
LuaJIT has done so much work on interning; I think we should add a function for strings to get the hash value, so we can fully use the results. The hash value is useful, and wyhash is also a good PRNG.
@lukego LuaJIT/LuaJIT#294 has been used in Tarantool's LuaJIT fork in production for several years, since this version of the patch was created. (In fact Nick, the patch co-author who helped me debug the patch and suggested some improvements, was a member of the Tarantool core team at that moment.)
The hash function used in the patch is not essential; you could replace it. The main part is the collision detection and the switch to O(n) hashing.
We won't consider O(n) hashing in this branch either. I'd suggest avoiding Lua strings in the first place for large string buffers (e.g. introducing a string.buffer builtin data structure or using FFI C buffers).
I mean "dynamically switch to O(n) hashing if a long collision chain is detected". 99% of the time, hashing remains constant-time (O(1)). Only if bad things happen and a lot of strings with the same hash value and length are inserted is O(n) hashing used, and then only for strings with a similar hash value (filtered with a Bloom filter). And if such strings are collected by the GC, the filter is cleared as well.
This looks very complex and I don't think we should go down this route. My hunch is that it would be very difficult to debug when bad things happen in such a sophisticated implementation. |
@agentzh, did you read the patch? It is not that sophisticated. Sorry for being noisy; I'll shut up at this point.
@funny-falcon Yes, I read the patch. It's quite far-reaching; it even touches the GC part.
Every loop iteration only moves one chunk - is there a bug here? At `lj_str_hash_128_above`, `chunk_sz`, `chunk_sz_log2`, `pos1`, and `pos2` are always the same, and with `chunk_ptr += 2*chunk_sz` the loop only covers half the string.

When I use `sock:receive(len)` and receive strings of the same length (527) which are similar (36 bytes changed in the second half), hash conflicts are very frequent and CPU usage can reach 100%. It is terrible in my test: it stalls everything (because recv is not blocked) and can only receive 200 msg/s. It runs `str_fastcmp` too many times in `lj_str_new`:

```c
if (sx->len == len && sx->hash == h && str_fastcmp(str, strdata(sx), len) == 0) {
```