
lj_str_new hash conflict is serious when length larger than 128 #60

Closed
stone-wind opened this issue Jun 13, 2019 · 17 comments

Comments

@stone-wind

stone-wind commented Jun 13, 2019

There is a problem in lj_str_hash_128_above:

 lj_str_hash_128_above(const char* str,  uint32_t len)
  chunk_num = 16;
  chunk_sz = len / chunk_num;
  chunk_sz_log2 = log2_floor(chunk_sz);

  pos1 = get_random_pos_unsafe(chunk_sz_log2, 0);
  pos2 = get_random_pos_unsafe(chunk_sz_log2, 1);
 /* loop over 14 chunks, 2 chunks at a time */
  for (i = 0, chunk_ptr = str; i < (chunk_num / 2 - 1);
       chunk_ptr += chunk_sz, i++) {   /* <-- advances one chunk, not two */

    v = *cast_uint64p(chunk_ptr + pos1);
    h1 = _mm_crc32_u64(h1, v);

    v = *cast_uint64p(chunk_ptr + chunk_sz + pos2);
    h2 = _mm_crc32_u64(h2, v);
  }
  1. When len is the same, chunk_sz, chunk_sz_log2, pos1 and pos2 are always the same.
  2. In the loop, shouldn't we set chunk_ptr += 2*chunk_sz? As written, the loop only covers half of the string.

When I use sock:receive(len) to receive strings of the same length (527) that are similar (36 bytes changed in the second half), hash collisions are very frequent and CPU usage can reach 100%.

It is terrible in my test: it stalls everything (because recv does not block) and can only receive 200 msg/s. It runs str_fastcmp far too many times in lj_str_new:
if (sx->len == len && sx->hash == h && str_fastcmp(str, strdata(sx), len) == 0) {

@bungle
Member

bungle commented Jun 13, 2019

Not sure if this fork wants to try this instead:
LuaJIT/LuaJIT#294

@dndx
Member

dndx commented Jun 13, 2019

Thanks for the troubleshooting @stone-wind, it is a valid concern for your specific use case, unfortunately properly fixing it inside OpenResty would be harder.

The original string hash algorithm in LuaJIT is very fast: it has constant time complexity for strings of arbitrary length. The CRC32 variant that https://github.com/openresty/luajit2 uses is intended to make hash collision attacks harder by randomizing the positions the hash function samples while calculating the hash. It is slower than the original hash function, but nevertheless still constant time.

To actually address your issue of a large number of strings with the same length and almost identical content, we would have to resort to an O(n) hash algorithm, because we have no way to know beforehand where the differences in the strings are.

The CRC32 hash implementation does use every byte for the CRC32 calculation for sizes up to 127 bytes. So if you can change your program to read only 127 bytes of data from the socket at once, it should not experience this many collisions.

If that is not acceptable, another option is that we may consider allowing the user to compile LuaJIT to use all bytes of the string for the CRC32 calculation, downgrading the algorithm to O(n) but making it more resilient against collisions in very similar long strings.

Also a word of caution about using LuaJIT/LuaJIT#294: we have never tested it in OpenResty and cannot guarantee it does not introduce bugs into the LuaJIT core.

@lukego
Contributor

lukego commented Jun 14, 2019

we have to resort to an O(n) hash algorithm

Just $0.02 but my intuition is that constant factors will be at least as important as asymptotic cost:

  • Throughput: How many bytes can we hash per CPU cycle?
  • Fixed cost: How many cycles is the per-string cost, excluding the per-byte cost?

Does anybody know what the current state-of-the-art hashing algorithm is and how it performs in these terms? Or what numbers would be acceptable for OpenResty? Or what numbers we achieve today?

I am especially wondering whether an afternoon of Googling might turn up a .h file containing a good all-round hash function that solves the problem for all workloads.

Here is the little bit of hard data I've been able to turn up right now:

  • CRC32 performance is limited in hardware to 8 bytes per cycle. On Intel CPUs the 64-bit CRC32 instruction takes three cycles to complete, and the CPU can execute three of them in parallel, and so the throughput limit is 64 bits per cycle. (Source: Agner).
  • Our current CRC32 implementation issues two CRC32 instructions in parallel, so it should be limited to 5.33 bytes per cycle of throughput, i.e. it completes at most 2 × 64-bit CRC instructions every three cycles.
  • Meow Hash claims to achieve 16 bytes per cycle of throughput. That would be 2x the hardware limit for CRC32 and 3x the limit of our current routine. (Just picked that one from a random google search.)

These are only upper bounds on throughput. Maybe actual performance will be more dominated by fixed costs or memory access. I'm just trying to frame the problem in quantitative terms and establish what level of performance we want and what we should realistically be able to achieve. That might help to decide whether it's better to hash whole strings or if we really have to sample them.

@lukego
Contributor

lukego commented Jun 14, 2019

Just an aside thought:

Could an interesting feature be string creation without interning?

I am thinking that the cost of hashing and interning strings of binary data probably outweighs the benefit. The interning would be valuable if the string were already present in the table, or was going to be used as a table key, or was going to be used for equality tests against other strings. However none of these things seems to be true for strings containing e.g. raw binary data received from a socket.

So the idea would be to create those strings using a special reserved hash value (-1) that means they have not been hashed, have not been interned, cannot be used as table keys (error), cannot be tested for equality using pointer value. However, they can be used for I/O, they can be used for string library functions, and data extracted from them will be ordinary strings.

This would introduce a certain amount of new complexity - probably too much - but it would also simplify some things e.g. not having to worry about the cost of hashing/interning buffers of binary data, and not having to account for hashing while benchmarking applications e.g. accidentally sending test data that will intern much better/worse than live traffic.

(I suppose this is the same distinction as in many other languages, where some string data is interned and some is not, going by names like strings, string buffers, symbols, binaries, etc.)

EDIT: Could also be a step towards being able to use string module functions to operate on strings allocated in FFI memory. I have the feeling when I write OpenResty code that really I'd prefer for binary data to be stored in C memory but that's in conflict with the idiom of operating on it using Lua string routines.

@spacewander
Member

@lukego
Do you mean an FFI version of the cosocket API? Something like socket:recv(bufp, len).
It is possible to implement a string-module equivalent on top of an FFI char buffer; just do the old C job. And we could also implement an FFI version of the ngx.re API.

@fsfod

fsfod commented Jun 14, 2019

So the idea would be to create those strings using a special reserved hash value (-1) that means they have not been hashed, have not been interned, cannot be used as table keys (error), cannot be tested for equality using pointer value. However, they can be used for I/O, they can be used for string library functions, and data extracted from them will be ordinary strings.

I already made a string buffer system that works in place of strings for some of the string library functions, and experimented with having the buffers pretend to be string objects.
I also tried to extend it to work with the FFI system so it could handle binary data.

@lukego
Contributor

lukego commented Jun 14, 2019

Do you mean a FFI version of the cosocket API? Something like socket:recv(bufp, len)

That would be the direct approach and probably a better idea if it's feasible to migrate to that.

I was musing here about an indirect approach of adding a special sub-type of Lua strings that are non-interned and more suitable for binary data while mostly compatible with existing APIs. This would avoid the problem in this issue (#60) because you would skip the checksum on binary network data, and that would reduce the pressure to settle for a collision-prone checksum algorithm for other strings. Everybody wins? It might be easy to implement on the VM side but it definitely introduces new complexity for users.

(I'm thinking about Erlang where on the C side they had efficient strings but on the Erlang side they had inefficient ones and they invented a new type - "binary" - that worked as a good compromise.)

@lukego
Contributor

lukego commented Jun 14, 2019

Cool @fsfod!

@stone-wind
Author

stone-wind commented Oct 12, 2019

I have made an example, hash-test.lua, for testing:


local ori = "I was musing here about an indirect approach of adding a special sub-type of Lua strings that are non-interned and more suitable for binary data while mostly compatible with existing APIs. This would avoid the problem on this issue #60 because you would skip the checksum on binary network data, and that would reduce the pressure to settle for a collision-prone checksum algorithm on other strings. Everybody wins?"
local cnt = 1000
local rm = {}
for i = 1, cnt do
	rm[i] = tostring(math.random(100000000000, 1000000000000)) .. "-" 
	rm[i] = rm[i] .. rm[i] .. rm[i]
end
local s = os.clock()
local pre,cur = s
local raw = {}
for i = 1, cnt do
	raw[i] = ori .. rm[i] .. "The return and break statements can only be written as the last statement of a block."
	if i > cnt - 5 then
		cur = os.clock()
		print(cur - pre)
		pre = cur
	end
end
local e = os.clock()
print("total:", e-s)

when cnt is 2000:

0.154724  
0.000195  
0.000195  
0.00017900000000001  
0.000196  
total:	0.155491  

when cnt is 20000:

14.531935
0.0015210000000003
0.0015210000000003
0.0014779999999988
0.0015999999999998
total:	14.53806

Obviously, collisions become more frequent as cnt gets larger. Then I changed the hash function and tested again:
when cnt is 1000:

0.00488
0.000135
1.2999999999999e-05
9.0000000000003e-06
2.9000000000001e-05
total:	0.005086

when cnt is 20000:

0.030095
2.4000000000003e-05
4.9999999999981e-06
3.9999999999971e-06
3.000000000003e-06
total:	0.030134

when cnt is 200000:

0.163455  
2.9000000000001e-05  
4.000000000004e-06  
2.9999999999752e-06  
3.0000000000308e-06  
total:	0.163497  

Obviously better; you can verify it yourself. The new hash function is wyhash.
I changed lj_str_hash in lj_str_hash_x64.h:

#include "wyhash.h"
static LJ_AINLINE uint32_t lj_str_hash(const char* str, size_t len)
{
	static uint64_t seed;
	if(!seed) seed = wygrand();
	return wyhash(str, len, seed);
}

wyhash is better than crc32_hw: it does not need SSE, it is simple (header-only, less C code), fast, and solid, with fewer collisions. Below is the performance data from SMHasher:

Hash function   MiB/sec    cycles/hash   Quality problems
crc32            544.84      92.65       insecure, 8589.93x collisions, distrib
crc32_hw        9310.60      24.63       insecure, 100% bias, collisions, distrib, machine-specific (x86 SSE4.2)
crc32_hw1      30145.28      30.98       insecure, 100% bias, collisions, distrib, machine-specific (x86 SSE4.2)
crc64_hw       10181.36      23.46       insecure, 100% bias, collisions, distrib, machine-specific (x86_64 SSE4.2)
wyhash         16528.58      17.67
wyhash32lo     15933.47      17.77

LuaJIT has done so much work on interning; I think we should add a function for strings to get the hash value, so that we can fully use those results. The hash value is useful. wyhash is also a good PRNG.

@funny-falcon

@lukego LuaJIT/LuaJIT#294 has been used in Tarantool's LuaJIT fork in production for several years, since this version of the patch was created. (In fact Nick, the patch co-author who helped me debug the patch and suggested some improvements, was a member of the Tarantool core team at that time.)
https://github.com/tarantool/LuaJIT
https://tarantool.io

@funny-falcon

The hash function used in the patch is not that essential; you could replace it. The main part is the collision detection and the switch to O(n) hashing.

@agentzh
Member

agentzh commented Jan 12, 2020

We won't consider O(n) hashing in this branch either. I'd suggest avoiding Lua strings in the first place for large string buffers (e.g. by introducing a string.buffer builtin data structure or by using FFI C buffers).

@agentzh agentzh closed this as completed Jan 12, 2020
@funny-falcon

I mean "dynamically switch to O(n) hashing if a long collision chain is detected". 99% of the time, hashing remains constant-time (O(1)). Only if bad things happen and a lot of strings with the same hash value and length are inserted is O(n) hashing used, and then only for strings with a similar hash value (filtered with a Bloom filter). And if such strings are collected by the GC, the filter is cleared as well.

@agentzh
Member

agentzh commented Jan 13, 2020

This looks very complex and I don't think we should go down this route. My hunch is that it would be very difficult to debug when bad things happen in such a sophisticated implementation.

@funny-falcon

@agentzh did you read the patch? It is not that sophisticated.

Sorry for being noisy. I will shut up at this point.

@agentzh
Member

agentzh commented Jan 13, 2020

@funny-falcon Yes, I read the patch. It's quite far-reaching; it even touches the GC part.

@wymhuster

wymhuster commented Nov 26, 2022

  chunk_num = 16;
  chunk_sz = len / chunk_num;
  chunk_sz_log2 = log2_floor(chunk_sz);

  pos1 = get_random_pos_unsafe(chunk_sz_log2, 0);
  pos2 = get_random_pos_unsafe(chunk_sz_log2, 1);

  h1 = lj_crc32_u32(0, len ^ seed);
  h2 = 0;

  /* loop over 14 chunks, 2 chunks at a time */
  for (i = 0, chunk_ptr = str; i < (chunk_num / 2 - 1);
       chunk_ptr += chunk_sz, i++) {

    v = *cast_uint64p(chunk_ptr + pos1);
    h1 = lj_crc32_u64(h1, v);

    v = *cast_uint64p(chunk_ptr + chunk_sz + pos2);
    h2 = lj_crc32_u64(h2, v);
  }

Every loop iteration only moves forward by one chunk. Is this a bug?
