
lj_str_new hash conflict is serious when length larger than 128 #60

Closed
stone-wind opened this issue Jun 13, 2019 · 17 comments

Comments

@stone-wind

stone-wind commented Jun 13, 2019

There is a problem in lj_str_hash_128_above:

 lj_str_hash_128_above(const char* str,  uint32_t len)
  chunk_num = 16;
  chunk_sz = len / chunk_num;
  chunk_sz_log2 = log2_floor(chunk_sz);

  pos1 = get_random_pos_unsafe(chunk_sz_log2, 0);
  pos2 = get_random_pos_unsafe(chunk_sz_log2, 1);
 /* loop over 14 chunks, 2 chunks at a time */
  for (i = 0, chunk_ptr = str; i < (chunk_num / 2 - 1);
       chunk_ptr += chunk_sz, i++) {   /* <-- advances one chunk, not two */

    v = *cast_uint64p(chunk_ptr + pos1);
    h1 = _mm_crc32_u64(h1, v);

    v = *cast_uint64p(chunk_ptr + chunk_sz + pos2);
    h2 = _mm_crc32_u64(h2, v);
  }
  1. When len is the same, chunk_sz, chunk_sz_log2, pos1 and pos2 are always the same.
  2. In the loop, shouldn't we set chunk_ptr += 2*chunk_sz? As written, the loop only covers half of the string.

When I use sock:receive(len) to receive strings of the same length (527) that are similar (36 bytes changed in the second half), hash collisions are very frequent and CPU usage can reach 100%.

It is terrible in my test: it stalls everything (because recv does not block) and can only receive 200 msg/s. It runs str_fastcmp far too many times in lj_str_new:
if (sx->len == len && sx->hash == h && str_fastcmp(str, strdata(sx), len) == 0) {

@bungle
Member

bungle commented Jun 13, 2019

Not sure if this fork wants to try this instead:
LuaJIT/LuaJIT#294

@dndx
Member

dndx commented Jun 13, 2019

Thanks for the troubleshooting @stone-wind, it is a valid concern for your specific use case, unfortunately properly fixing it inside OpenResty would be harder.

The original string hash algorithm in LuaJIT is very fast: it has constant time complexity for strings of arbitrary length. The CRC32 variant that https://github.com/openresty/luajit2 uses is intended to make hash collision attacks harder by randomizing the positions the hash function samples while calculating the hash. It is slower than the original hash function, but nevertheless still constant time.

To actually address your issue of a large number of strings with the same length and almost identical content, we would have to resort to an O(n) hash algorithm, because we have no way to know beforehand where the differences in the strings are.

The CRC32 hash implementation does use every byte for the CRC32 calculation for sizes up to 127 bytes. So if you can change your program to read only 127 bytes of data from the socket at once, it should not experience this many collisions.

If that is not acceptable, another option is that we may consider allowing the user to compile LuaJIT to use all bytes of the string for the CRC32 calculation, downgrading the algorithm to O(n) but making it more resilient against collisions in very similar long strings.

Also a word of caution about using LuaJIT/LuaJIT#294: we have never tested it in OpenResty and cannot guarantee it does not introduce bugs into the LuaJIT core.

@lukego
Contributor

lukego commented Jun 14, 2019

we have to resort to an O(n) hash algorithm

Just $0.02 but my intuition is that constant factors will be at least as important as asymptotic cost:

  • Throughput: How many bytes can we hash per CPU cycle?
  • Fixed cost: How many cycles is the per-string cost, excluding the per-byte cost?

Does anybody know what the current state-of-the-art hashing algorithm is and how it performs in these terms? Or what numbers would be acceptable for OpenResty? Or what numbers we achieve today?

I am especially wondering whether an afternoon of Googling might turn up a .h file containing a good all-round hash function that solves the problem for all workloads.

Here is the little bit of hard data I've been able to turn up right now:

  • CRC32 performance is limited in hardware to 8 bytes per cycle. On Intel CPUs the 64-bit CRC32 instruction takes three cycles to complete, and the CPU can execute three of them in parallel, and so the throughput limit is 64 bits per cycle. (Source: Agner).
  • Our current CRC32 implementation issues two CRC32 instructions in parallel, so it should be limited to 5.33 bytes per cycle of throughput, i.e. it completes at most 2 × 64-bit CRC instructions every three cycles.
  • Meow Hash claims to achieve 16 bytes per cycle of throughput. That would be 2x the hardware limit for CRC32 and 3x the limit of our current routine. (Just picked that one from a random google search.)

These are only upper bounds on throughput. Maybe actual performance will be more dominated by fixed costs or memory access. I'm just trying to frame the problem in quantitative terms and establish what level of performance we want and what we should realistically be able to achieve. That might help to decide whether it's better to hash whole strings or if we really have to sample them.

@lukego
Contributor

lukego commented Jun 14, 2019

Just an aside thought:

Could an interesting feature be string creation without interning?

I am thinking that the cost of hashing and interning strings of binary data probably outweighs the benefit. The interning would be valuable if the string were already present in the table, or was going to be used as a table key, or was going to be used for equality tests against other strings. However none of these things seems to be true for strings containing e.g. raw binary data received from a socket.

So the idea would be to create those strings using a special reserved hash value (-1) that means they have not been hashed, have not been interned, cannot be used as table keys (error), cannot be tested for equality using pointer value. However, they can be used for I/O, they can be used for string library functions, and data extracted from them will be ordinary strings.

This would introduce a certain amount of new complexity - probably too much - but it would also simplify some things e.g. not having to worry about the cost of hashing/interning buffers of binary data, and not having to account for hashing while benchmarking applications e.g. accidentally sending test data that will intern much better/worse than live traffic.

(I suppose this is the same distinction as in many other languages, where some string data is interned and some is not, going by names like strings, string buffers, symbols, binaries, etc.)

EDIT: Could also be a step towards being able to use string module functions to operate on strings allocated in FFI memory. I have the feeling when I write OpenResty code that really I'd prefer for binary data to be stored in C memory but that's in conflict with the idiom of operating on it using Lua string routines.

@spacewander
Member

@lukego
Do you mean an FFI version of the cosocket API? Something like socket:recv(bufp, len).
It is possible to implement a string-module equivalent on top of an FFI char buffer; just do the old C job. And we could also implement an FFI version of the ngx.re API.

@fsfod

fsfod commented Jun 14, 2019

So the idea would be to create those strings using a special reserved hash value (-1) that means they have not been hashed, have not been interned, cannot be used as table keys (error), cannot be tested for equality using pointer value. However, they can be used for I/O, they can be used for string library functions, and data extracted from them will be ordinary strings.

I already made a string buffer system that works in place of strings for some of the string library functions, and experimented with having the buffers pretend to be string objects.
I also tried to extend it to work with the FFI system so it could handle binary data.

@lukego
Contributor

lukego commented Jun 14, 2019

Do you mean a FFI version of the cosocket API? Something like socket:recv(bufp, len)

That would be the direct approach and probably a better idea if it's feasible to migrate to that.

I was musing here about an indirect approach of adding a special sub-type of Lua strings that are non-interned and more suitable for binary data while mostly compatible with existing APIs. This would avoid the problem in this issue (#60) because you would skip the checksum on binary network data, and that would reduce the pressure to settle for a collision-prone checksum algorithm for other strings. Everybody wins? It might be easy to implement on the VM side but it definitely introduces new complexity for users.

(I'm thinking about Erlang where on the C side they had efficient strings but on the Erlang side they had inefficient ones and they invented a new type - "binary" - that worked as a good compromise.)

@lukego
Contributor

lukego commented Jun 14, 2019

Cool @fsfod!

@stone-wind
Author

stone-wind commented Oct 12, 2019

I have made an example, hash-test.lua, for testing:


local ori = "I was musing here about an indirect approach of adding a special sub-type of Lua strings that are non-interned and more suitable for binary data while mostly compatible with existing APIs. This would avoid the problem on this issue #60 because you would skip the checksum on binary network data, and that would reduce the pressure to settle for a collision-prone checksum algorithm on other strings. Everybody wins?"
local cnt = 1000
local rm = {}
for i = 1, cnt do
	rm[i] = tostring(math.random(100000000000, 1000000000000)) .. "-" 
	rm[i] = rm[i] .. rm[i] .. rm[i]
end
local s = os.clock()
local pre,cur = s
local raw = {}
for i = 1, cnt do
	raw[i] = ori .. rm[i] .. "The return and break statements can only be written as the last statement of a block."
	if i > cnt - 5 then
		cur = os.clock()
		print(cur - pre)
		pre = cur
	end
end
local e = os.clock()
print("total:", e-s)

when cnt is 2000:

0.154724  
0.000195  
0.000195  
0.00017900000000001  
0.000196  
total:	0.155491  

when cnt is 20000:

14.531935
0.0015210000000003
0.0015210000000003
0.0014779999999988
0.0015999999999998
total:	14.53806

Obviously, collisions become more frequent as cnt gets larger. Then I changed the hash function and tested again:
when cnt is 1000:

0.00488
0.000135
1.2999999999999e-05
9.0000000000003e-06
2.9000000000001e-05
total:	0.005086

when cnt is 20000:

0.030095
2.4000000000003e-05
4.9999999999981e-06
3.9999999999971e-06
3.000000000003e-06
total:	0.030134

when cnt is 200000:

0.163455  
2.9000000000001e-05  
4.000000000004e-06  
2.9999999999752e-06  
3.0000000000308e-06  
total:	0.163497  

Obviously better; you can verify it yourself. The new hash function is wyhash.
I changed lj_str_hash in lj_str_hash_x64.h:

#include "wyhash.h"
static LJ_AINLINE uint32_t lj_str_hash(const char* str, size_t len)
{
	static uint64_t seed;
	if(!seed) seed = wygrand();
	return wyhash(str, len, seed);
}

wyhash is better than crc32_hw: it does not need SSE, it is simple (header-only, less C code), fast, and solid, with fewer collisions. Below is the performance data from SMHasher:

Hash function   MiB/sec    cycles/hash   Quality problems
crc32            544.84      92.65       insecure, 8589.93x collisions, distrib
crc32_hw        9310.60      24.63       insecure, 100% bias, collisions, distrib, machine-specific (x86 SSE4.2)
crc32_hw1      30145.28      30.98       insecure, 100% bias, collisions, distrib, machine-specific (x86 SSE4.2)
crc64_hw       10181.36      23.46       insecure, 100% bias, collisions, distrib, machine-specific (x86_64 SSE4.2)
wyhash         16528.58      17.67
wyhash32lo     15933.47      17.77

LuaJIT has done so much work on interning; I think we should add a function for strings to get the hash value, so that we can fully use those results. The hash value is useful. wyhash is also a good PRNG.

@funny-falcon

@lukego LuaJIT/LuaJIT#294 has been used in Tarantool's LuaJIT fork in production for several years, since this version of the patch was created. (In fact Nick, the patch co-author who helped me debug the patch and suggested some improvements, was a member of the Tarantool core team at that time.)
https://github.com/tarantool/LuaJIT
https://tarantool.io

@funny-falcon

The hash function used in the patch is not that essential; you could replace it. The main part is the collision detection and the switch to O(n) hashing.

@agentzh
Member

agentzh commented Jan 12, 2020

We won't consider O(n) hashing in this branch either. I'd suggest avoiding Lua strings in the first place for large string buffers (e.g. by introducing a string.buffer builtin data structure or by using FFI C buffers).

@agentzh agentzh closed this as completed Jan 12, 2020
@funny-falcon

I mean "dynamically switch to O(n) hashing if a long collision chain is detected". 99% of the time, hashing remains constant-time (O(1)). Only if bad things happen and a lot of strings with the same hash value and length are inserted is O(n) hashing used, and then only for strings with a similar hash value (filtered with a Bloom filter). And if such strings are collected by the GC, the filter is cleared as well.

@agentzh
Member

agentzh commented Jan 13, 2020

This looks very complex and I don't think we should go down this route. My hunch is that it would be very difficult to debug when bad things happen in such a sophisticated implementation.

@funny-falcon

@agentzh did you read the patch? It is not that sophisticated.

Sorry for being noisy. I will shut up at this point.

@agentzh
Member

agentzh commented Jan 13, 2020

@funny-falcon Yes, I read the patch. It's quite far-reaching; it even touches the GC part.

@wymhuster

wymhuster commented Nov 26, 2022

  chunk_num = 16;
  chunk_sz = len / chunk_num;
  chunk_sz_log2 = log2_floor(chunk_sz);

  pos1 = get_random_pos_unsafe(chunk_sz_log2, 0);
  pos2 = get_random_pos_unsafe(chunk_sz_log2, 1);

  h1 = lj_crc32_u32(0, len ^ seed);
  h2 = 0;

  /* loop over 14 chunks, 2 chunks at a time */
  for (i = 0, chunk_ptr = str; i < (chunk_num / 2 - 1);
       chunk_ptr += chunk_sz, i++) {

    v = *cast_uint64p(chunk_ptr + pos1);
    h1 = lj_crc32_u64(h1, v);

    v = *cast_uint64p(chunk_ptr + chunk_sz + pos2);
    h2 = lj_crc32_u64(h2, v);
  }

Every loop iteration only moves forward by one chunk. Is this a bug?
