hashing collision: improve performance using linear/pseudorandom probing mix #13440
Conversation
Looks OK to me, but @narimiran is the reviewer. CC @c-blake
I still see not a single benchmark of an alternate workload besides just inserting everything into the hash table. Given that the main point of contention, performance-wise, is the impact of all those tombstones on delete-heavy workloads, that seems inadequate to prove any interesting point. Indeed, all that these workloads really test is how well the indirect hash function encoded as a perturbed sequence works. If the threat model is a user hash that is A) weak in some groups of bits, but B) still has many varying bits (which is the only situation this perturbed probe sequence solves), then there is another natural mitigation which is inherently local: just rehash the output of `hash()`.

I've been testing this in a private library off to the side that is all about search-depth-based triggering of A) first Robin Hood hashing activation, and then B) `hash()` output rehashing and Robin Hood, structured such that a final fallback to a B-tree might also be workable. I'm not as happy with a lot of the code repetition/factoring as I could be, but perhaps it's ready enough to make public. It's not all that well benchmarked, micro-optimized, or tested, but it should exhibit the ideas I'm referring to less abstractly.
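To make the "rehash the output of `hash()`" suggestion concrete, here is a minimal Nim sketch (my own illustration, not code from this PR or from the library mentioned): the user's hash is passed through a 64-bit multiplicative mixer before being masked down to a table index, so weak low bits don't cluster probes. The mixing constant and the `rehash`/`firstIndex` names are assumptions.

```nim
import hashes

# A 64-bit multiplicative finalizer applied on top of the user's hash()
# output; the constant is illustrative, not from this PR.
proc rehash(h: Hash): uint64 =
  var x = cast[uint64](h)
  x = x xor (x shr 33)
  x = x * 0xff51afd7ed558ccd'u64
  x xor (x shr 33)

# Reduce the mixed value to an index for a power-of-two sized table.
proc firstIndex(h: Hash; mask: int): int =
  int(rehash(h) and uint64(mask))

when isMainModule:
  echo firstIndex(hash(123456), 1024 - 1)
```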
lib/pure/collections/tables.nim
Outdated
# using this and creating tombstones as in `Table` would make `remove`
# amortized O(1) instead of O(n); this requires updating remove,
# and changing `val != 0` to what's used in Table. Or simply make CountTable
# a thin wrapper around Table (which would also enable `dec`, etc)
> Or simply make CountTable a thin wrapper around Table (which would also enable `dec`, etc)

I don't understand this comment. Is this a TODO?
I've reworded this comment a bit to clarify. I'm suggesting possible avenues to improve CountTable:

- using `countDeleted` and tombstones, and adapting the algorithms used in `Table`
- or (best option IMO, to be discussed in a separate PR) using something analogous to "MultiTable: thin wrapper around Table with sane API for duplicate keys; deprecate Table.add" (RFCs#200) but for CountTable, where we offer the same API as the current `CountTable` but implemented as a thin wrapper around Table.
Both approaches would enable the following:

- all operations would become at least as efficient as current CountTable operations
- some CountTable operations would turn from O(n) to amortized O(1)
- current API restrictions would be lifted (eg we'd allow `inc` with negative values (or equivalently, `dec`), and we'd allow counts to be 0)
You could call it a TODO, or you could call it something to help the next person reading this code and wondering why some operations are currently O(n). I could do that PR after this PR is merged.
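As a rough illustration of the "thin wrapper around Table" idea, here is a hedged Nim sketch; `CountTableWrap`, `initCountTableWrap`, and `count` are hypothetical names, not a proposed API. The point is only that counting, negative/zero counts, and O(1) removal all fall out of delegating storage to `Table`.

```nim
import tables

type
  CountTableWrap[K] = object   # hypothetical name, not a proposed API
    data: Table[K, int]

proc initCountTableWrap[K](): CountTableWrap[K] =
  CountTableWrap[K](data: initTable[K, int]())

proc inc[K](c: var CountTableWrap[K]; key: K; by = 1) =
  # negative `by` (i.e. a `dec`) and zero counts are allowed
  c.data.mgetOrPut(key, 0) += by

proc count[K](c: CountTableWrap[K]; key: K): int =
  c.data.getOrDefault(key, 0)

proc del[K](c: var CountTableWrap[K]; key: K) =
  c.data.del(key)              # inherits Table's amortized O(1) removal

when isMainModule:
  var c = initCountTableWrap[string]()
  c.inc("a"); c.inc("a"); c.inc("b", -3)
  assert c.count("a") == 2 and c.count("b") == -3
```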
PTAL
Pseudorandom probing is a general collision resolution technique, not limited to "weakness in some group of bits".

The term for that is double hashing.

I'm happy to run the same benchmark once it's public and offers the same Table API (or a subset of it) so it can be plugged in.
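For reference, here is a small sketch of what pseudorandom probing means concretely in the CPython-style "perturb" form this PR borrows (illustrative only, not the PR's exact code): each step folds more high bits of the hash code into the index, so the probe sequence depends on all bits of `hash(key)`.

```nim
# One probe step: fold more high bits of the hash into the index each time.
proc nextTryPerturbed(idx: var int; perturb: var uint64; mask: int) =
  perturb = perturb shr 5
  idx = int((uint64(idx) * 5 + perturb + 1) and uint64(mask))

when isMainModule:
  var idx = 7
  var perturb = 0xdeadbeef'u64   # would normally start as the full hash code
  for step in 0 ..< 4:
    nextTryPerturbed(idx, perturb, 1024 - 1)
    echo idx                     # successive probe positions
```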
I made it public a couple days ago. Tombstones make inserts or lookups after many deletions slow as well. In any case, since that general style of workload is the one thing where one might expect a performance difference, it seems to me you should do it before claiming performance neutrality.

Based on your response to my A), B) points, I think you are getting hung up on terminology, possibly simply don't understand what you're saying, and definitely do not understand what I am saying. I know about double hashing and that is not what I am suggesting; I was careful to not use that term, actually. DH uses a second, truly independent hash of the key to generate an increment interval, and because of the way people usually abstract a user `hash()` as a single proc, no such second hash is available. What I was suggesting is different, and personally I think clear from my prior text: I was trying to say use a rehash of the user's `hash()` output, as in my earlier comment.

Mathematically, the PRNG stuff you're copying from Python is just a fancier, more memory-system-indirection-heavy way to have the probe sequence "depend upon all the bits" a user `hash()` provides. @Araq's initial complaint for a better hash about "what about other distributions" still holds for your PRNG, only with your approach you have hard-coded the way all bits are depended upon into the library. But no hash can be great all the time for all sets of keys, especially if attackers with access to your source code exist. With my suggestion you can preserve the "freedom to fix" by having that rehash be user-replaceable.
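For contrast, a hedged sketch of classic double hashing as characterized above, where a second, independent hash of the key chooses the probe stride; `hash2` and `dhIndex` are illustrative names, not stdlib procs, and this differs from rehashing the single `hash()` output.

```nim
import hashes

# `hash2` is a stand-in for a second, independent hash; it is not a stdlib proc.
proc hash2(key: int): Hash =
  hash(key xor 0x5bd1e995)

proc dhIndex(key, depth, mask: int): int =
  let h1 = hash(key) and mask              # home slot
  let stride = (hash2(key) and mask) or 1  # odd stride visits every slot of a
                                           # power-of-two sized table
  (h1 + depth * stride) and mask

when isMainModule:
  for depth in 0 ..< 4:
    echo dhIndex(123456, depth, 1024 - 1)
```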
Incidentally, while mentioned in the repo referred to above, it deserves reiteration that probe-depth-based resize triggers, as well as probe-depth-triggered mitigations for weak hashes, have other nice properties. For one, if the user can provide a better-than-random hash they might enjoy 100% memory utilization with no performance degradation. Admittedly, providing actual minimal perfect hashes outside of test cases is unlikely, but "close to that" might not be. In any case, reacting to the actual underlying cause of slowness (deep probes) is inherently more robust than trying to guess what load factor to target. I am pretty sure Rust was using such depth-based triggers for something like 4 years, but it's also probably an ancient tactic. It's just sort of looking at those "probe depth vs. load" graphs with flipped axes.
Oh, and it would be easy (though non-portable) to integrate a source of true randomness into the hash.
In an attempt to be more constructive benchmark-wise, this blog post describes at least some things you can do: https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ He makes more of an effort than you to avoid prefetching biases. It's still incomplete: you need to use true randomness that the CPU cannot get by "working ahead" with PRNGs, and Linux provides sources for that.

Another comment would be that I don't think there is any "workload independent best" solution. So, it will never be possible to really decide what benchmarks are most representative. All your "1.06x" kinds of answers are also going to be clearly highly contingent upon which CPUs you tested with, which backend compilers, with or without PGO at the C level, and may well not generalize to many/most other deployment contexts. More abstract/mathematical properties like "about as fast, but can handle situations X, Y, Z gracefully", or arguments in terms of "probe depth" or "unprefetched cache misses", are just less deployment-vague ways to argue about suitability for a general purpose library.
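A sketch of the "true randomness" benchmarking point, assuming a POSIX-like system where `/dev/urandom` exists and Nim's classic file API; `trueRandomKeys` is an illustrative helper name, not something this PR or the benchmark code defines.

```nim
# Draw benchmark keys from the kernel's entropy pool instead of an in-process
# PRNG, so the CPU cannot "work ahead" and prefetch the memory a future
# lookup will touch. `trueRandomKeys` is an illustrative helper name.
proc trueRandomKeys(n: Positive): seq[uint64] =
  result = newSeq[uint64](n)
  var f = open("/dev/urandom")             # raises IOError if unavailable
  defer: close(f)
  let bytes = n * sizeof(uint64)
  if readBuffer(f, addr result[0], bytes) != bytes:
    raise newException(IOError, "short read from /dev/urandom")

when isMainModule:
  echo trueRandomKeys(4)
```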
There are a few easy "proofs" of workload dependence, btw. Consider simply iteration. Something like Similarly, Robin Hood can do "miss" lookups faster than "hits" on dense tables because an order invariant is maintained (at some cost). So, determining two sets have an empty set (or near it) intersection is usually best with Robin Hood. Another not well advertised property of Robin Hood is that I believe if you have In light of all the above use case/workload variability, the only real long-term solution is obviously to provide a set of alternatives (which is what I'm trying to do with |
Another comment on this PR is that though cache line size does vary, with 64 bytes being the most common, it varies much less than "infinity", which is how much per-object sizes in the table's data can vary.

Even with a known cache line size there are system-specific hacks where the memory system can make it effectively larger. E.g., I have a workstation whose BIOS can make mine 128B instead of 64B, and the pre-fetcher can probably successfully hide latency of a tight loop up to at least 256B. That's just one system. It will be hard to set that in general, and it may vary per system.
Also, while whatever handful of "well scrambling but too slow" hashes you may have personally tried were indeed slow, there do exist hashes that scramble well and cost only a few ALU operations. Of course, there is no perfect hash for all situations, and that includes the implicit hash hard-coded into your perturbed probe sequence.

The academic terminology on this is usually that a "hash function" maps "keys" to "the whole probe sequence". Usually that sequence is a permutation of table addresses, not the weird Python thing where table addresses can repeat. Another argument against this whole "just copy Python" direction is that Python tables do not allow duplicate keys, while its probe sequence makes supporting them awkward.
…nges (#13816)

* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable stride double hashing) part of recent sets & tables changes (which has still been causing bugs over a month later (e.g., two days ago #13794) as well as still having several "figure this out" implementation question comments in them (see just diffs of this PR). This topic has been discussed in many places: #13393 #13418 #13440 #13794 Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in identity hashes) are a better solution that is more general (no illusion of one hard-coded sequence solving all problems) while retaining the virtues of linear probing such as cache obliviousness and age-less tables under delete-heavy workloads (still untested after a month of this change). The only real solution for truly adversarial keys is a hash keyed off of data unobservable to attackers. That all fits better with a few families of user-pluggable/define-switchable hashes which can be provided in a separate PR more about `hashes.nim`. This PR carefully preserves the better (but still hard coded!) probing of the `intsets` and other recent fixes like `move` annotations, hash order invariant tests, `intsets.missingOrExcl` fixing, and the move of `rightSize` into `hashcommon.nim`.
* Fix `data.len` -> `dataLen` problem.
* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable stride double hashing) part of recent sets & tables changes (which has still been causing bugs over a month later (e.g., two days ago #13794) as well as still having several "figure this out" implementation question comments in them (see just diffs of this PR). This topic has been discussed in many places: #13393 #13418 #13440 #13794 Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in identity hashes) are a better solution that is more general (no illusion of one hard-coded sequence solving all problems) while retaining the virtues of linear probing such as cache obliviousness and age-less tables under delete-heavy workloads (still untested after a month of this change). The only real solution for truly adversarial keys is a hash keyed off of data unobservable to attackers. That all fits better with a few families of user-pluggable/define-switchable hashes which can be provided in a separate PR more about `hashes.nim`. This PR carefully preserves the better (but still hard coded!) probing of the `intsets` and other recent fixes like `move` annotations, hash order invariant tests, `intsets.missingOrExcl` fixing, and the move of `rightSize` into `hashcommon.nim`. * Fix `data.len` -> `dataLen` problem. * This is an alternate resolution to #13393 (which arguably could be resolved outside the stdlib). Add version1 of Wang Yi's hash specialized to 8 byte integers. This gives simple help to users having trouble with overly colliding hash(key)s. I.e., A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))` in the instantiation context of a `HashSet` or `Table` or B) more globally, compile with `nim c -d:hashWangYi1`. No hash can be all things to all use cases, but this one is A) vetted to scramble well by the SMHasher test suite (a necessarily limited but far more thorough test than prior proposals here), B) only a few ALU ops on many common CPUs, and C) possesses an easy via "grade school multi-digit multiplication" fall back for weaker deployment contexts. Some people might want to stampede ahead unbridled, but my view is that a good plan is to A) include this in the stdlib for a release or three to let people try it on various key sets nim-core could realistically never access/test (maybe mentioning it in the changelog so people actually try it out), B) have them report problems (if any), C) if all seems good, make the stdlib more novice friendly by adding `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1` with some `when defined` rearranging so users can `-d:hashIdentity` if they want the old behavior back. This plan is compatible with any number of competing integer hashes if people want to add them. I would strongly recommend they all *at least* pass the SMHasher suite since the idea here is to become more friendly to novices who do not generally understand hashing failure modes. * Re-organize to work around `when nimvm` limitations; Add some tests; Add a changelog.md entry. * Add less than 64-bit CPU when fork. * Fix decl instead of call typo. * First attempt at fixing range error on 32-bit platforms; Still do the arithmetic in doubled up 64-bit, but truncate the hash to the lower 32-bits, but then still return `uint64` to be the same. So, type correct but truncated hash value. Update `thashes.nim` as well. * A second try at making 32-bit mode CI work. * Use a more systematic identifier convention than Wang Yi's code. 
* Fix test that was wrong for as long as `toHashSet` used `rightSize` (a very long time, I think). `$a`/`$b` depend on iteration order which varies with table range reduced hash order which varies with range for some `hash()`. With 3 elements, 3!=6 is small and we've just gotten lucky with past experimental `hash()` changes. An alternate fix here would be to not stringify but use the HashSet operators, but it is not clear that doesn't alter the "spirit" of the test. * Fix another stringified test depending upon hash order. * Oops - revert the string-keyed test. * Fix another stringify test depending on hash order. * Add a better than always zero `defined(js)` branch. * It turns out to be easy to just work all in `BigInt` inside JS and thus guarantee the same low order bits of output hashes (for `isSafeInteger` input numbers). Since `hashWangYi1` output bits are equally random in all their bits, this means that tables will be safely scrambled for table sizes up to 2**32 or 4 gigaentries which is probably fine, as long as the integer keys are all < 2**53 (also likely fine). (I'm unsure why the infidelity with C/C++ back ends cut off is 32, not 53 bits.) Since HashSet & Table only use the low order bits, a quick corollary of this is that `$` on most int-keyed sets/tables will be the same in all the various back ends which seems a nice-to-have trait. * These string hash tests fail for me locally. Maybe this is what causes the CI hang for testament pcat collections? * Oops. That failure was from me manually patching string hash in hashes. Revert. * Import more test improvements from #13410 * Fix bug where I swapped order when reverting the test. Ack. * Oh, just accept either order like more and more hash tests. * Iterate in the same order. * `return` inside `emit` made us skip `popFrame` causing weird troubles. * Oops - do Windows branch also. * `nimV1hash` -> multiply-mnemonic, type-scoped `nimIntHash1` (mnemonic resolutions are "1 == identity", 1 for Nim Version 1, 1 for first/simplest/fastest in a series of possibilities. Should be very easy to remember.) * Re-organize `when nimvm` logic to be a strict `when`-`else`. * Merge other changes. * Lift constants to a common area. * Fall back to identity hash when `BigInt` is unavailable. * Increase timeout slightly (probably just real-time perturbation of CI system performance).
* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable stride double hashing) part of recent sets & tables changes (which has still been causing bugs over a month later (e.g., two days ago #13794) as well as still having several "figure this out" implementation question comments in them (see just diffs of this PR). This topic has been discussed in many places: #13393 #13418 #13440 #13794 Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in identity hashes) are a better solution that is more general (no illusion of one hard-coded sequence solving all problems) while retaining the virtues of linear probing such as cache obliviousness and age-less tables under delete-heavy workloads (still untested after a month of this change). The only real solution for truly adversarial keys is a hash keyed off of data unobservable to attackers. That all fits better with a few families of user-pluggable/define-switchable hashes which can be provided in a separate PR more about `hashes.nim`. This PR carefully preserves the better (but still hard coded!) probing of the `intsets` and other recent fixes like `move` annotations, hash order invariant tests, `intsets.missingOrExcl` fixing, and the move of `rightSize` into `hashcommon.nim`. * Fix `data.len` -> `dataLen` problem. * Add neglected API call `find` to heapqueue. * Add a changelog.md entry, `since` annotation and rename parameter to be `heap` like all the other procs for consistency. * Add missing import.
This pull request has been automatically marked as stale because it has not had recent activity. If you think it is still a valid PR, please rebase it on the latest devel; otherwise it will be closed. Thank you for your contributions.
TLDR

- compared to right before PR:
- compared to right before #13418: 1.06 slowdown

details

as I mentioned in #13418 we can refine the collision strategy to get the best of both worlds between linear and pseudorandom probing, using a threshold `depthThres` on search depth (ie the number of calls to `nextTry`), which is a parameter to tune; in practice 20 works across a range of applications, but a future PR could expose this single param to users.

performance numbers

I published some easy-to-modify benchmarking code to compare performance of a (customizable) task involving inserting some 20 million keys in a table, retrieving keys, and searching for both existing and nonexisting keys. I tried it on a number of different distributions of keys (english words, oids, various string keys, various random, high order bit keys, consecutive numbers (int32/64), explicit formulas (eg squares, multiples of K), small floats, etc).

The results show that:

- a 1.06 slowdown
- a better hash (fix #13393 better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions, #13410) also avoids the extreme bad cases, but is 2X slower for many simple distributions because of the hash computation overhead, and isn't as robust as pseudorandom probing against adversarial attacks
- the mix still has a bad case (eg the `toHighOrderBits` distribution), but instead of giving an unbounded slowdown (as is the case for linear probing), it just gives a 1.3X slowdown; so this is a very good tradeoff that makes the common cases faster at the expense of edge cases, for which we pay a small penalty (1.3X instead of 100X or unbounded slowdown)
- `depthThres` is a very simple heuristic that helps in almost all cases

I'm comparing 4 algos:

- `hashUInt32` just forwarding to `hashUInt64` (as `hashUInt32` had at least 1 very bad case)

details: performance numbers
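For concreteness, a minimal sketch of the linear/pseudorandom mix described above (assumed shape, not the PR's exact code; only the `depthThres` value of 20 comes from the description): probe linearly while the chain is shallow, and switch to the perturbed, hash-dependent stride once the depth exceeds the threshold.

```nim
const depthThres = 20   # the threshold discussed in the description above

proc nextTryMixed(idx: var int; perturb: var uint64; depth, mask: int) =
  if depth <= depthThres:
    # shallow chain: plain linear probing, cache friendly
    idx = (idx + 1) and mask
  else:
    # deep chain: CPython-style perturbed stride, breaks up clusters
    perturb = perturb shr 5
    idx = int((uint64(idx) * 5 + perturb + 1) and uint64(mask))

when isMainModule:
  var idx = 7
  var perturb = 0xfeedface'u64   # would start as the key's full hash code
  for depth in 0 ..< 25:
    nextTryMixed(idx, perturb, depth, 1024 - 1)
  echo idx
```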