
hashing collision: improve performance using linear/pseudorandom probing mix #13440

Closed

Conversation

timotheecour
Member

@timotheecour timotheecour commented Feb 20, 2020

TLDR

compared to right before this PR:

  • performance improves in almost all cases thanks to better cache locality
  • performance is slightly worse (1.3X) for the edge cases (long collision chains) that caused extreme (100X to unbounded) slowdowns before pseudorandom probing (pseudorandom probing for hash collision #13418)

compared to right before #13418:

  • often quite a bit faster
  • the worst slowdown is 1.06X
  • avoids the extreme slowdowns (100X-unbounded) for edge cases

details

As I mentioned in #13418, we can refine the collision strategy to get the best of both worlds between:

  • cache locality (linear probing, make common case fast)
  • avoiding large collision clusters (pseudorandom probing, prevents edge cases causing extreme slowdowns)

using a threshold `depthThres` on search depth (i.e. the number of calls to `nextTry`), which is a parameter to tune; in practice 20 works across a range of applications, but a future PR could expose this single parameter to users:

  • depthThres=0 will just use pseudorandom probing
  • depthThres=int.high will just use linear probing
  • 0 < depthThres < int.high will switch dynamically from linear to pseudorandom probing for "bad cases" (see the sketch below)
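
Roughly, the switching logic looks like the following minimal sketch (the names, the `depth` bookkeeping, and the exact pseudorandom recurrence are illustrative, not this PR's literal code):

```nim
const depthThres = 20  # the threshold discussed above; a parameter to tune

proc nextTry(h, maxHash: int; perturb: var uint64; depth: int): int =
  ## One probe step: linear while `depth < depthThres`, pseudorandom afterwards.
  if depth < depthThres:
    result = (h + 1) and maxHash                    # linear: cache friendly
  else:
    perturb = perturb shr 5                         # Python-style perturbation
    result = (5 * h + 1 + int(perturb and uint64(maxHash))) and maxHash

when isMainModule:
  # toy probe loop over a power-of-two sized table of int slots (0 = empty)
  var data = newSeq[int](8)
  let hc = 42
  var perturb = uint64(hc)
  var h = hc and (data.len - 1)
  var depth = 0
  while data[h] != 0 and data[h] != hc:
    h = nextTry(h, data.len - 1, perturb, depth)
    inc depth
  data[h] = hc
```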

performance numbers

I published some easy-to-modify benchmarking code to compare the performance of a (customizable) task involving inserting some 20 million keys in a table, retrieving keys, and searching for both existing and nonexisting keys. I tried it on a number of different distributions of keys (English words, oids, various string keys, various random keys, high-order-bit keys, consecutive numbers (int32/64), explicit formulas (e.g. squares, multiples of K), small floats, etc).

The results are summarized below. To reproduce:

git clone https://github.com/timotheecour/vitanim && cd vitanim
git checkout 06a65792d075815e718d2b6b0d31211127dcf8d1
nim c -r -d:danger testcases/tableutils/benchmain.nim

I'm comparing 4 algos:

mix (this PR) | pseudorandom | linear | better hash | mix / linear
-- | -- | -- | -- | --
24.7842 | 28.1233 | 23.3294 | 23.5042 | 1.062359083
28.4544 | 32.4396 | 26.811 | 27.3472 | 1.061295737
11.8106 | 14.9808 | 11.7898 | 12.0099 | 1.001764237
24.7805 | 28.4708 | 24.0842 | 24.5816 | 1.02891107
24.1552 | 27.9138 | 22.6895 | 23.3472 | 1.064598162
22.417 | 26.535 | 22.0931 | 22.2877 | 1.014660686
2.90124 | 2.65254 | 3.03189 | 10.906 | 0.9569080672
3.51184 | 2.7196 | 4.48502 | 11.9838 | 0.7830154604
9.6545 | 16.254 | 15.2375 | 11.3906 | 0.6336013126
9.76797 | 16.0781 | 15.7369 | 11.4528 | 0.6207048402
2.47488 | 3.93578 | 6.82566 | 11.8847 | 0.3625847171
4.06318 | 4.58738 | 8.18277 | 12.073 | 0.4965531232
8.33806 | 14.4545 | 11.2387 | 11.431 | 0.7419060923
10.2709 | 14.9484 | 11.3907 | 11.7426 | 0.901691731
11.1881 | 14.963 | 50.0003 | 11.7045 | 0.2237606574
4.51758 | 2.65223 | 50.0001 | 11.4548 | 0.0903514193
19.213 | 16.0138 | 47.0852 | 8.76856 | 0.4080475394
27.7964 | 19.6391 | 50 | 11.481 | 0.555928

detailed performance numbers

mix: after current PR (mix of linear/pseudo-random probing)
runPerfAll benchmark for nim git hash: 586e7f672a50b4a4d450e8d9cb8b167b6221fa12
[ 1] name: toEnglishWords       num: 20000000 numIter: 1 runtime:              19.7674                      ex[0]: Aaron_0              ex[1]: darted_220           ex[2]: itch_440
[ 2] name: toRandFloatAsString  num: 20000000 numIter: 1 runtime:              28.4858                      ex[0]: 0.8228467011541094   ex[1]: 0.8883906930501309   ex[2]: 0.7158161478612264
[ 3] name: oidHash              num: 20000000 numIter: 1 runtime:              12.0973                      ex[0]: 5980795527476110011  ex[1]: -8770990215615663721 ex[2]: 346373246543934521
[ 4] name: oidHashAsString      num: 20000000 numIter: 1 runtime:              25.3611                      ex[0]: 5e4e013d49519734557bef48 ex[1]: 5e4e013e49519734561485c7 ex[2]: 5e4e013f4951973456ad1c46
[ 5] name: randIint32AsString   num: 20000000 numIter: 1 runtime:              24.4193                      ex[0]: 6286188681263997477  ex[1]: 5241554617367510837  ex[2]: 1048061160398310475
[ 6] name: intAsString          num: 20000000 numIter: 1 runtime:              22.7159                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 7] name: toInt32              num: 20000000 numIter: 1 runtime:              3.07487                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 8] name: toInt64              num: 20000000 numIter: 1 runtime:              3.68188                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 9] name: toSquares            num: 20000000 numIter: 1 runtime:              12.5888                      ex[0]: 1                    ex[1]: 100000000000000      ex[2]: 399999960000001
[10] name: toSquaresSigned      num: 20000000 numIter: 1 runtime:              13.1395                      ex[0]: -1                   ex[1]: 100000000000000      ex[2]: -399999960000001
[11] name: toIntTimes5          num: 20000000 numIter: 1 runtime:              2.68821                      ex[0]: 5                    ex[1]: 50000000             ex[2]: 99999995
[12] name: toIntTimes13         num: 20000000 numIter: 1 runtime:              4.79935                      ex[0]: 13                   ex[1]: 130000000            ex[2]: 259999987
[13] name: toRand               num: 20000000 numIter: 1 runtime:              11.3335                      ex[0]: 8389824593544564917  ex[1]: 1388452581874133311  ex[2]: 5738321176173495635
[14] name: toRandFloat          num: 20000000 numIter: 1 runtime:              11.6262                      ex[0]: 0.005933778906357601 ex[1]: 0.9201098173171902   ex[2]: 0.05296104548725977
[15] name: toSmallFloat         num: 20000000 numIter: 1 runtime:              11.4167                      ex[0]: 9.999999999999999e-21 ex[1]: 9.999999999999999e-14 ex[2]: 1.9999999e-13
[16] name: toHighOrderBits      num: 20000000 numIter: 1 runtime:              3.55724                      ex[0]: 4294967296           ex[1]: 42949672960000000    ex[2]: 85899341625032704
[17] name: toHighOrderBits2     num: 20000000 numIter: 1 runtime:              20.5522                      ex[0]: 2048                 ex[1]: 3300130816           ex[2]: 2305292288
[18] name: toHighOrderBits3     num: 20000000 numIter: 1 runtime:              21.6669                      ex[0]: 8192                 ex[1]: 81920000000          ex[2]: 163839991808

pseudorandom: after https://github.com/nim-lang/Nim/pull/13418 (pseudo-random probing)
runPerfAll benchmark for nim git hash: 87dd19453b9a9aeded5f85714e90e12fc8bab0bf
[ 1] name: toEnglishWords       num: 20000000 numIter: 1 runtime:              28.1233                      ex[0]: Aaron_0              ex[1]: darted_220           ex[2]: itch_440
[ 2] name: toRandFloatAsString  num: 20000000 numIter: 1 runtime:              32.4396                      ex[0]: 0.8228467011541094   ex[1]: 0.8883906930501309   ex[2]: 0.7158161478612264
[ 3] name: oidHash              num: 20000000 numIter: 1 runtime:              14.9808                      ex[0]: 2007510477909153732  ex[1]: -2712027032279094734 ex[2]: -3177538604408694653
[ 4] name: oidHashAsString      num: 20000000 numIter: 1 runtime:              28.4708                      ex[0]: 5e4dcdafb2217733483f66dd ex[1]: 5e4dcdb0b221773348d7fd5c ex[2]: 5e4dcdb0b2217733497093db
[ 5] name: randIint32AsString   num: 20000000 numIter: 1 runtime:              27.9138                      ex[0]: 6286188681263997477  ex[1]: 5241554617367510837  ex[2]: 1048061160398310475
[ 6] name: intAsString          num: 20000000 numIter: 1 runtime:               26.535                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 7] name: toInt32              num: 20000000 numIter: 1 runtime:              2.65254                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 8] name: toInt64              num: 20000000 numIter: 1 runtime:               2.7196                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 9] name: toSquares            num: 20000000 numIter: 1 runtime:               16.254                      ex[0]: 1                    ex[1]: 100000000000000      ex[2]: 399999960000001
[10] name: toSquaresSigned      num: 20000000 numIter: 1 runtime:              16.0781                      ex[0]: -1                   ex[1]: 100000000000000      ex[2]: -399999960000001
[11] name: toIntTimes5          num: 20000000 numIter: 1 runtime:              3.93578                      ex[0]: 5                    ex[1]: 50000000             ex[2]: 99999995
[12] name: toIntTimes13         num: 20000000 numIter: 1 runtime:              4.58738                      ex[0]: 13                   ex[1]: 130000000            ex[2]: 259999987
[13] name: toRand               num: 20000000 numIter: 1 runtime:              14.4545                      ex[0]: 8389824593544564917  ex[1]: 1388452581874133311  ex[2]: 5738321176173495635
[14] name: toRandFloat          num: 20000000 numIter: 1 runtime:              14.9484                      ex[0]: 0.005933778906357601 ex[1]: 0.9201098173171902   ex[2]: 0.05296104548725977
[15] name: toSmallFloat         num: 20000000 numIter: 1 runtime:               14.963                      ex[0]: 9.999999999999999e-21 ex[1]: 9.999999999999999e-14 ex[2]: 1.9999999e-13
[16] name: toHighOrderBits      num: 20000000 numIter: 1 runtime:              2.65223                      ex[0]: 4294967296           ex[1]: 42949672960000000    ex[2]: 85899341625032704
[17] name: toHighOrderBits2     num: 20000000 numIter: 1 runtime:              16.0138                      ex[0]: 2048                 ex[1]: 3300130816           ex[2]: 2305292288
[18] name: toHighOrderBits3     num: 20000000 numIter: 1 runtime:              19.6391                      ex[0]: 8192                 ex[1]: 81920000000          ex[2]: 163839991808


linear: before https://github.com/nim-lang/Nim/pull/13418 (ie, before pseudo-random probing)
runPerfAll benchmark for nim git hash: 273a93581f851d2d16d6891d7dd1d7d9edfbb4ef
[ 1] name: toEnglishWords       num: 20000000 numIter: 1 runtime:              23.3294                      ex[0]: Aaron_0              ex[1]: darted_220           ex[2]: itch_440
[ 2] name: toRandFloatAsString  num: 20000000 numIter: 1 runtime:               26.811                      ex[0]: 0.8228467011541094   ex[1]: 0.8883906930501309   ex[2]: 0.7158161478612264
[ 3] name: oidHash              num: 20000000 numIter: 1 runtime:              11.7898                      ex[0]: 6791954661837144404  ex[1]: 6737600927306182383  ex[2]: -9096880239424528216
[ 4] name: oidHashAsString      num: 20000000 numIter: 1 runtime:              24.0842                      ex[0]: 5e4dcf6da16c6e5948b6e8db ex[1]: 5e4dcf73a16c6e59494f7f5a ex[2]: 5e4dcf74a16c6e5949e815d9
[ 5] name: randIint32AsString   num: 20000000 numIter: 1 runtime:              22.6895                      ex[0]: 6286188681263997477  ex[1]: 5241554617367510837  ex[2]: 1048061160398310475
[ 6] name: intAsString          num: 20000000 numIter: 1 runtime:              22.0931                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 7] name: toInt32              num: 20000000 numIter: 1 runtime:              3.03189                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 8] name: toInt64              num: 20000000 numIter: 1 runtime:              4.48502                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 9] name: toSquares            num: 20000000 numIter: 1 runtime:              15.2375                      ex[0]: 1                    ex[1]: 100000000000000      ex[2]: 399999960000001
[10] name: toSquaresSigned      num: 20000000 numIter: 1 runtime:              15.7369                      ex[0]: -1                   ex[1]: 100000000000000      ex[2]: -399999960000001
[11] name: toIntTimes5          num: 20000000 numIter: 1 runtime:              6.82566                      ex[0]: 5                    ex[1]: 50000000             ex[2]: 99999995
[12] name: toIntTimes13         num: 20000000 numIter: 1 runtime:              8.18277                      ex[0]: 13                   ex[1]: 130000000            ex[2]: 259999987
[13] name: toRand               num: 20000000 numIter: 1 runtime:              11.2387                      ex[0]: 8389824593544564917  ex[1]: 1388452581874133311  ex[2]: 5738321176173495635
[14] name: toRandFloat          num: 20000000 numIter: 1 runtime:              11.3907                      ex[0]: 0.005933778906357601 ex[1]: 0.9201098173171902   ex[2]: 0.05296104548725977
[15] name: toSmallFloat         num: 20000000 numIter: 1 runtime:              50.0003 timeoutAtIter: 0     ex[0]: 9.999999999999999e-21 ex[1]: 9.999999999999999e-14 ex[2]: 1.9999999e-13
[16] name: toHighOrderBits      num: 20000000 numIter: 1 runtime:              50.0001 timeoutAtIter: 0     ex[0]: 4294967296           ex[1]: 42949672960000000    ex[2]: 85899341625032704
[17] name: toHighOrderBits2     num: 20000000 numIter: 1 runtime:              47.0852                      ex[0]: 2048                 ex[1]: 3300130816           ex[2]: 2305292288
[18] name: toHighOrderBits3     num: 20000000 numIter: 1 runtime:                   50 timeoutAtIter: 0     ex[0]: 8192                 ex[1]: 81920000000          ex[2]: 163839991808

better hash
runPerfAll benchmark for nim git hash: e6b517ad36ee4f053fcef8f878cd2c82201c1523
[ 1] name: toEnglishWords       num: 20000000 numIter: 1 runtime:              23.5042                      ex[0]: Aaron_0              ex[1]: darted_220           ex[2]: itch_440
[ 2] name: toRandFloatAsString  num: 20000000 numIter: 1 runtime:              27.3472                      ex[0]: 0.8228467011541094   ex[1]: 0.8883906930501309   ex[2]: 0.7158161478612264
[ 3] name: oidHash              num: 20000000 numIter: 1 runtime:              12.0099                      ex[0]: 8341231083579904102  ex[1]: 8281382965556985093  ex[2]: 2998183081495492116
[ 4] name: oidHashAsString      num: 20000000 numIter: 1 runtime:              24.5816                      ex[0]: 5e4dd45a91112c1649f98781 ex[1]: 5e4dd46091112c164a921e00 ex[2]: 5e4dd46191112c164b2ab47f
[ 5] name: randIint32AsString   num: 20000000 numIter: 1 runtime:              23.3472                      ex[0]: 6286188681263997477  ex[1]: 5241554617367510837  ex[2]: 1048061160398310475
[ 6] name: intAsString          num: 20000000 numIter: 1 runtime:              22.2877                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 7] name: toInt32              num: 20000000 numIter: 1 runtime:               10.906                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 8] name: toInt64              num: 20000000 numIter: 1 runtime:              11.9838                      ex[0]: 1                    ex[1]: 10000000             ex[2]: 19999999
[ 9] name: toSquares            num: 20000000 numIter: 1 runtime:              11.3906                      ex[0]: 1                    ex[1]: 100000000000000      ex[2]: 399999960000001
[10] name: toSquaresSigned      num: 20000000 numIter: 1 runtime:              11.4528                      ex[0]: -1                   ex[1]: 100000000000000      ex[2]: -399999960000001
[11] name: toIntTimes5          num: 20000000 numIter: 1 runtime:              11.8847                      ex[0]: 5                    ex[1]: 50000000             ex[2]: 99999995
[12] name: toIntTimes13         num: 20000000 numIter: 1 runtime:               12.073                      ex[0]: 13                   ex[1]: 130000000            ex[2]: 259999987
[13] name: toRand               num: 20000000 numIter: 1 runtime:               11.431                      ex[0]: 8389824593544564917  ex[1]: 1388452581874133311  ex[2]: 5738321176173495635
[14] name: toRandFloat          num: 20000000 numIter: 1 runtime:              11.7426                      ex[0]: 0.005933778906357601 ex[1]: 0.9201098173171902   ex[2]: 0.05296104548725977
[15] name: toSmallFloat         num: 20000000 numIter: 1 runtime:              11.7045                      ex[0]: 9.999999999999999e-21 ex[1]: 9.999999999999999e-14 ex[2]: 1.9999999e-13
[16] name: toHighOrderBits      num: 20000000 numIter: 1 runtime:              11.4548                      ex[0]: 4294967296           ex[1]: 42949672960000000    ex[2]: 85899341625032704
[17] name: toHighOrderBits2     num: 20000000 numIter: 1 runtime:              8.76856                      ex[0]: 2048                 ex[1]: 3300130816           ex[2]: 2305292288
[18] name: toHighOrderBits3     num: 20000000 numIter: 1 runtime:               11.481                      ex[0]: 8192                 ex[1]: 81920000000          ex[2]: 163839991808


timotheecour referenced this pull request in timotheecour/Nim Feb 24, 2020
@timotheecour timotheecour force-pushed the pr_pseudorand_probing_followup branch from 55050e0 to f084f69 Compare March 6, 2020 02:19
@timotheecour timotheecour changed the title [WIP] hashing collision: use linear/pseudorandom probing mix hashing collision: improve performance using linear/pseudorandom probing mix Mar 6, 2020
@timotheecour timotheecour marked this pull request as ready for review March 6, 2020 03:53
@Araq Araq requested a review from narimiran March 16, 2020 14:06
@Araq
Member

Araq commented Mar 16, 2020

Looks ok to me but @narimiran is the reviewer. CC @c-blake

@c-blake
Contributor

c-blake commented Mar 16, 2020

I still see not a single benchmark of an alternate workload besides just inserting everything into the hash table. Given that the main point of contention, performance-wise, is the impact of all those tombstones on delete-heavy workloads, that seems inadequate to prove any interesting point. Indeed, all that these workloads really test is how well the indirect hash function encoded as a perturbed sequence works.

If the threat model is a user hash that is A) weak in some groups of bits, but B) still has many varying bits (which is the only situation this perturbed probe sequence solves), then there is another natural mitigation which is inherently local - just rehash the output of hash() with better scrambling. Of course, part of that is C) not relying on a user to define their own better proc hash(). If the user hash() just has no diversity in the output bits, there's almost nothing you can do except fall back to a tree (which may even require < or <= to be defined on user keys to work, since with no diversity in hash() values the trees will also be slow).

I've been testing this in a private library off to the side that is all about search depth-based triggering of A) first Robin Hood hashing activation, and then B) hash() output rehashing and Robin Hood that is also structured such that a final fallback to a B-tree might be workable. I'm not as happy with a lot of the code repetition/factoring as I could be, but perhaps it's ready enough to make public. It's not all that well benchmarked, micro-optimized, or tested, but it should exhibit the ideas I'm referring to less abstractly.

lib/pure/collections/hashcommon.nim (two review threads, outdated, resolved)
# using this and creating tombstones as in `Table` would make `remove`
# amortized O(1) instead of O(n); this requires updating remove,
# and changing `val != 0` to what's used in Table. Or simply make CountTable
# a thin wrapper around Table (which would also enable `dec`, etc)
Member

Or simply make CountTable a thin wrapper around Table (which would also enable dec, etc)

I don't understand this comment. Is this a TODO?

Member Author
@timotheecour timotheecour Mar 18, 2020

I've reworded this comment a bit to clarify. I'm suggesting possible avenues to improve CountTable:

Both approaches would enable these:

  • all operations would become at least as efficient as current CountTable operations
  • some CountTable operations would turn from O(n) to amortized O(1)
  • current API restrictions would be lifted (e.g. we'd allow inc with negative values (or, equivalently, dec), and we'd allow counts to be 0)

You could call it a TODO, or you could call it something to help the next person reading this code and wondering why some operations are currently O(n). I could do that PR after this PR is merged.

lib/pure/collections/tables.nim (review thread, outdated, resolved)
@timotheecour
Member Author

timotheecour commented Mar 18, 2020

PTAL

If the threat model is a user hash that is A) weak in some groups of bits, but B) still has many varying bits (which is the only situation this perturbed probe sequence solves)

pseudorandom probing is a general collision resolution technique, not limited to "weakness in some group of bits".

there is another natural mitigation which is inherently local - just rehash the output of hash() with better scrambling

the term for that is double hashing (sketched below for reference); it's not clear at all to me that this would be better than pseudorandom probing, but the good news is that, after the refactorings in this PR, one will easily be able to experiment with alternate collision resolution strategies (including double hashing), as the logic is now refactored in a single place (say, findCell) instead of subtly duplicated in many places. If anything, Python went for pseudorandom instead of double hashing, and double hashing with a good scrambler (e.g. nonlinearHash derived from #13410) is more expensive than pseudo-random probing, per probe.
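
For reference, classic double hashing derives the probe stride from a second hash so that all bits of the key influence the probe sequence; a minimal sketch (illustrative names, not stdlib code):

```nim
import std/hashes

proc probeStart[K](key: K; maxHash: int): (int, int) =
  ## returns (initial slot, stride) for double hashing over a power-of-two table
  let hc = hash(key)
  let start = hc and maxHash
  # an odd stride is coprime with the power-of-two table size, so the
  # probe sequence visits every slot before repeating
  let stride = (hash(hc) and maxHash) or 1
  (start, stride)

proc nextTry(h, maxHash, stride: int): int =
  ## one probe step: jump by the key-dependent stride instead of by 1
  (h + stride) and maxHash
```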

I still see not a single benchmark of an alternate workload besides just inserting everything into the hash table. Given that the main point of contention, performance-wise, is impact of all those tombstones on delete heavy workloads, that seems inadequate to prove any interesting point. Indeed, all these workloads really test is how well the indirect hash function encoded as a perturbed sequence works.

  • PRs are welcome to add deletion workloads to https://github.com/timotheecour/vitanim/blob/master/testcases/tableutils/benchutils.nim ; I intend to add these at some point regardless.
  • As mentioned in this PR, after this PR is merged, we can expose depthThres as a Table parameter so users can simply set depthThres=int.high if they want linear probing only for their data (and this mode can support deletion without tombstones, re-introducing Algorithm R, deletion with linear probing, from Knuth Volume 3; see the sketch after this list). But that would require running a benchmark on a deletion-heavy workload to check whether it would indeed improve performance.
  • I disagree that the benchmarks presented here don't prove anything interesting. A common usage pattern of Table is non-deletion-heavy workloads, and these benchmarks test a number of usage patterns and input distributions. Deletion-heavy workloads add another dimension and can be addressed in future benchmarks; this PR doesn't make anything worse for them, since tombstones are used both before and after this PR. It does, however, make it easier to test alternate collision resolution strategies, since the logic is now all in one place instead of scattered.
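
A minimal sketch of that tombstone-free deletion (Knuth's Algorithm R for linear probing), for an illustrative open-addressing table of ints where 0 means "empty" and the identity hash is used:

```nim
proc delKey(data: var seq[int], key: int) =
  ## Knuth Vol. 3, Algorithm R: delete under pure linear probing, no tombstones.
  let maxHash = data.len - 1              # data.len is a power of two
  var i = key and maxHash                 # identity hash for this sketch
  while data[i] != key:
    if data[i] == 0: return               # key not present
    i = (i + 1) and maxHash
  data[i] = 0                             # empty the slot, then repair the run
  var j = i
  while true:
    j = (j + 1) and maxHash
    if data[j] == 0: break                # reached the end of the collision run
    let home = data[j] and maxHash
    # move data[j] back into the gap unless its home lies cyclically in (i, j]
    let mustMove =
      if j > i: home <= i or home > j
      else: home <= i and home > j
    if mustMove:
      data[i] = data[j]
      data[j] = 0
      i = j
```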

I've been testing this in a private library

I'm happy to run the same benchmark once it's public and offers the same (or a subset of the) Table API so it can be plugged in.

@c-blake
Contributor

c-blake commented Mar 18, 2020

I made it public a couple days ago. Tombstones make inserts or lookups after many deletions slow as well. In any case, since that general style of workload is the one thing where one might expect a performance difference, it seems to me you should do it before claiming performance neutrality.

Based on your response to my A,B) points, I think you are getting hung up on terminology, possibly simply don't understand what you're saying, and definitely do not understand what I am saying. I know about double hashing and that is not what I am suggesting and I was careful to not use that term, actually. DH uses a second truly independent hash of the key to generate an increment interval. Because of the way people usually abstract user == and user hash functions, they usually use different bits of the output of just one hash() and assume they are independent. That is only approximately right and sometimes that approximation breaks down.

What I was suggesting is different, and personally I think it was clear from my prior text. I was trying to say: use hash2(hash()) and mask to generate an initial table address, with linear probing as usual for collision resolution. If the outer hash2 scrambles well then you get all the positives of pure linear probing, including near ageless deletion with no tombstones, and the positives of your more complex probe sequence using all the bits of hash() output.
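
Concretely, something like this minimal sketch (hash2 and firstSlot are illustrative names, not an existing API; the mixer uses the well-known MurmurHash3 64-bit finalizer constants; a 64-bit Hash is assumed for brevity):

```nim
import std/hashes

proc hash2(h: Hash): Hash =
  ## strong integer scrambler applied on top of the user's hash() output
  var x = uint64(cast[uint](h))        # reinterpret the int hash as unsigned
  x = (x xor (x shr 33)) * 0xff51afd7ed558ccd'u64
  x = (x xor (x shr 33)) * 0xc4ceb9fe1a85ec53'u64
  x = x xor (x shr 33)
  cast[Hash](x)                        # 64-bit Hash assumed

proc firstSlot[K](key: K; maxHash: int): int =
  ## initial table address; collisions then resolved by ordinary linear probing
  hash2(hash(key)) and maxHash
```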

Mathematically, the PRNG stuff you're copying from Python is just a fancier, more memory-system-indirection-heavy way to have the probe sequence "depend upon all the bits" a user hash() produces, but it is obviously not the only way to engineer that dependency. Unfortunately "rehashing" is often used to refer to the table resize phase. "hash() output hashing" might be the clearest phrasing but is kind of a mouthful.

@Araq's initial complaint for a better hash about "what about other distributions" still holds for your PRNG, only with your approach you have hard coded the way all bits are depended upon into the library. But no hash can be great all the time for all sets of keys, especially if attackers with access to your source code exist.

With my suggestion you can preserve the "freedom to fix" by having that hash2 effectively redefinable by a caller. You can preserve the "identity is a perfect hash sometimes" by activating that hash2(hash()) construction only after you get an overlong probe depth on an underfull table, meaning once it has proved untrustworthy. The Nim stdlib could easily provide a few vetted, easy, fast hash functions in hashes instead of just one. Any stronger secondary hash for Hash (aka int) itself would be a viable definition of hash2. hash2 could even optionally depend upon the virtual memory addresses of data, making it even harder to attack, or upon a cryptographic secret that even attackers with source code may not have. All this is possible in the standard linear probing framework, too. Or with optional Robin Hood reorganization as in my https://github.com/c-blake/adix (see e.g. lpset.nim).

@c-blake
Contributor

c-blake commented Mar 18, 2020

Incidentally, while mentioned in the repo linked above, it deserves reiteration that probe-depth-based resize triggers, as well as probe-depth-triggered mitigations for weak hashes, have other nice properties. For one, if the user can provide a better-than-random hash they might enjoy 100% memory utilization with no performance degradation. Admittedly, providing actual minimal perfect hashes outside of test cases is unlikely, but "close to that" might not be. In any case, reacting to the actual underlying cause of slowness (deep probes) is inherently more robust than trying to guess what load factor to target. I am pretty sure Rust was using such depth-based triggers for like 4 years, but it's also probably an ancient tactic. It's just sort of looking at those "probe depth vs. load" graphs with flipped axes.
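
A minimal illustration of such a trigger (not adix or stdlib code; identity hash, 0 = empty slot, illustrative names): grow the table whenever a single probe sequence gets deeper than a chosen bound, regardless of how full the table is.

```nim
type DepthSet = object
  data: seq[int]          # power-of-two sized; 0 means "empty slot"
  maxDepth: int           # probe-depth resize trigger (load factor is ignored)

proc rawIncl(data: var seq[int], key: int): int =
  ## plain linear-probing insert; returns the probe depth it needed
  let maxHash = data.len - 1
  var h = key and maxHash                 # identity hash for the sketch
  while data[h] != 0 and data[h] != key:
    h = (h + 1) and maxHash
    inc result
  data[h] = key

proc grow(s: var DepthSet) =
  let old = s.data
  s.data = newSeq[int](old.len * 2)
  for k in old:
    if k != 0: discard s.data.rawIncl(k)

proc incl(s: var DepthSet, key: int) =
  ## react to the underlying cause of slowness: a deep probe doubles the table
  if s.data.rawIncl(key) > s.maxDepth:
    s.grow()

var s = DepthSet(data: newSeq[int](8), maxDepth: 2)
for k in [3, 11, 19, 27]: s.incl(k)   # all share home slot 3; the 4th insert
                                      # reaches depth 3 and triggers a grow
```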

@c-blake
Contributor

c-blake commented Mar 18, 2020

Oh, and it would be easy (though non portable) to integrate a source of true randomness into hash2 to actually make it as cryptographically secure as the randomness was true, but I agree with @Araq (nim-lang/RFCs#201 (comment)) that this should probably not be the default mode for debugging purposes.

@c-blake
Contributor

c-blake commented Mar 18, 2020

In an attempt to be more constructive benchmark-wise, this blog post describes at least some of the things you can do: https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/. He makes more of an effort to avoid prefetching biases than you. It's still incomplete. You need to use true randomness that the CPU cannot get by "working ahead" with PRNGs. Linux has getrandom(2), but using the actual CPU instructions would probably make more sense in a benchmark setting.

Another comment would be that I don't think there is any "workload independent best" solution. So, it will never be possible to really decide what benchmarks are most representative. All your "1.06x" kinds of answers are also going to be clearly highly contingent upon which CPUs you tested with, which backend compilers, with or without PGO at the C level, and may well not generalize to many/most other deployment contexts. More abstract/mathematical properties like "about as fast, but can handle situation X,Y,Z gracefully" or arguments in terms of "probe depth" or "unprefetched cache misses" are just less deployment-vague ways to argue about suitability for a general purpose library.

@c-blake
Contributor

c-blake commented Mar 18, 2020

There are a few easy "proofs" of workload dependence, btw.

Consider simply iteration. Something like adix/olset.nim, which is similar to Python 3.6+'s dict but with linear/Robin Hood probing, is going to provide the fastest possible iteration, period. So, if your workload is to A) build a table and then B) mostly iterate over it several to many times, then nothing will beat that. The best alternatives will always have gaps, tombstones, branches to test for gaps and so on, never as fast as just iterating over a dense array. In main memory, just in DRAM bandwidth terms, the performance will have to differ by the same ratio as the density, which can easily be 2x. However, engineering the speed of that case costs at insert/delete time. That mutate-time cost may be negligible or dominant in any given context.

Similarly, Robin Hood can do "miss" lookups faster than "hits" on dense tables because an order invariant is maintained (at some cost). So, determining that two sets have an empty (or nearly empty) intersection is usually best with Robin Hood.

Another not well advertised property of Robin Hood is that I believe if you have < defined and break depth ties with that inequality then there is exactly one memory representation of a given set. So, regular memcmp could be used to compare two sets for equality which is likely to be 5-10x faster than doing all the lookups because bulk memory compares are very optimizable. But maybe < is not (or cannot be) defined. Nim is great because we can just check with a when compiles(). Even so, it's still hard to know if maintaining that stronger ordering invariant is worthwhile. If the user never calls a==b for sets even once, which is probably a common-ish case then it clearly is not worth it. It might make a huge delta if they do it a lot.

In light of all the above use case/workload variability, the only real long-term solution is obviously to provide a set of alternatives (which is what I'm trying to do with adix). If you would like to contribute there, I welcome help. While I have my conceptual biases, I'm really not sure coding-wise if tables in terms of sets is better than sets in terms of tables, and there is other code duplication I do not like. The library is young and ripe for total overhauls like that. There are other API enhancements like getCap and setCap, depths, better auto-resize controls and "rounding out" the set APIs and table APIs to overlap more fully.

@timotheecour timotheecour force-pushed the pr_pseudorand_probing_followup branch from a0684d6 to 863bb0c Compare March 19, 2020 15:07
@c-blake
Contributor

c-blake commented Mar 19, 2020

Another comment on this PR is that though cache line size does vary with 64 bytes being the most common, it varies much less than "infinity" which is how much per-object sizes in the Table or HashSet can vary. Therefore, the threshold should probably be a number of bytes not a number of elements, unless the threshold can be set per-Table where the instantiator knows the object size. So, depth * s.data[0].sizeof < threshold, not what is there currently, would be a better idea.

Even with a known cache line size there are system-specific hacks where the memory system can make it effectively larger. E.g., I have a workstation whose BIOS can make mine 128B instead of 64B, and the pre-fetcher can probably successfully hide latency of a tight loop up to at least 256B. That's just one system. It will be hard to set that in general and vary per HashSet/Table based on key distribution relative to hash quality.

@c-blake
Contributor

c-blake commented Mar 20, 2020

Also, whatever handful of "well scrambling but too slow" hashes you may have personally tried, like nonlinearHash, is very much just one small sample out of many. I believe that constitutes extraordinarily weak evidence that there is no sufficiently strong & fast hash. "Well, the couple slow hashes I tried were too slow" is just not a persuasive statement to justify any higher level algorithm choices. You should stop trying to use it that way. A great many/most hashes are optimized for very long strings, not short integers. So, you may have to hunt more smartly for them, but there are many.

Of course, there is no perfect hash for all situations, and that includes the implicit hash hard-coded into your perturbed probe sequence. The academic terminology on this is usually that a "hash function" maps "keys" to "the whole probe sequence". Usually that sequence is a permutation of table addresses, not the weird Python thing where table addresses can repeat. Another argument against this whole "just copy Python" direction is that Python tables do not allow duplicate keys, while here the probe sequence makes allItems need to do weird tricks related to that perturbed probe sequence not being a permutation. So, at a minimum, that general approach to incorporating all hash code bits into the probe sequence should probably be modified for Nim to be a strict permutation, and allItems simplified. (I realize from nim-lang/RFCs#200 that you'd rather just ditch duplicate key capability and break 1.0 functionality backward compatibility, but this argument should still be voiced here.)

narimiran pushed a commit that referenced this pull request Mar 31, 2020
…nges (#13816)

* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable
stride double hashing) part of recent sets & tables changes (which has
still been causing bugs over a month later (e.g., two days ago
#13794) as well as still having
several "figure this out" implementation question comments in them (see
just diffs of this PR).

This topic has been discussed in many places:
  #13393
  #13418
  #13440
  #13794

Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in
identity hashes) are a better solution that is more general (no illusion
of one hard-coded sequence solving all problems) while retaining the
virtues of linear probing such as cache obliviousness and age-less tables
under delete-heavy workloads (still untested after a month of this change).

The only real solution for truly adversarial keys is a hash keyed off of
data unobservable to attackers.  That all fits better with a few families
of user-pluggable/define-switchable hashes which can be provided in a
separate PR more about `hashes.nim`.

This PR carefully preserves the better (but still hard coded!) probing
of the  `intsets` and other recent fixes like `move` annotations, hash
order invariant tests, `intsets.missingOrExcl` fixing, and the move of
`rightSize` into `hashcommon.nim`.

* Fix `data.len` -> `dataLen` problem.
Araq pushed a commit that referenced this pull request Apr 15, 2020
* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable
stride double hashing) part of recent sets & tables changes (which has
still been causing bugs over a month later (e.g., two days ago
#13794) as well as still having
several "figure this out" implementation question comments in them (see
just diffs of this PR).

This topic has been discussed in many places:
  #13393
  #13418
  #13440
  #13794

Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in
identity hashes) are a better solution that is more general (no illusion
of one hard-coded sequence solving all problems) while retaining the
virtues of linear probing such as cache obliviousness and age-less tables
under delete-heavy workloads (still untested after a month of this change).

The only real solution for truly adversarial keys is a hash keyed off of
data unobservable to attackers.  That all fits better with a few families
of user-pluggable/define-switchable hashes which can be provided in a
separate PR more about `hashes.nim`.

This PR carefully preserves the better (but still hard coded!) probing
of the  `intsets` and other recent fixes like `move` annotations, hash
order invariant tests, `intsets.missingOrExcl` fixing, and the move of
`rightSize` into `hashcommon.nim`.

* Fix `data.len` -> `dataLen` problem.

* This is an alternate resolution to #13393
(which arguably could be resolved outside the stdlib).

Add version1 of Wang Yi's hash specialized to 8 byte integers.  This gives
simple help to users having trouble with overly colliding hash(key)s.  I.e.,
  A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))`
      in the instantiation context of a `HashSet` or `Table`
or
  B) more globally, compile with `nim c -d:hashWangYi1`.

No hash can be all things to all use cases, but this one is A) vetted to
scramble well by the SMHasher test suite (a necessarily limited but far
more thorough test than prior proposals here), B) only a few ALU ops on
many common CPUs, and C) possesses an easy via "grade school multi-digit
multiplication" fall back for weaker deployment contexts.

Some people might want to stampede ahead unbridled, but my view is that a
good plan is to
  A) include this in the stdlib for a release or three to let people try it
     on various key sets nim-core could realistically never access/test
     (maybe mentioning it in the changelog so people actually try it out),
  B) have them report problems (if any),
  C) if all seems good, make the stdlib more novice friendly by adding
     `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1`
     with some `when defined` rearranging so users can `-d:hashIdentity`
     if they want the old behavior back.
This plan is compatible with any number of competing integer hashes if
people want to add them.  I would strongly recommend they all *at least*
pass the SMHasher suite since the idea here is to become more friendly to
novices who do not generally understand hashing failure modes.

* Re-organize to work around `when nimvm` limitations; Add some tests; Add
a changelog.md entry.

* Add less than 64-bit CPU when fork.

* Fix decl instead of call typo.

* First attempt at fixing range error on 32-bit platforms; Still do the
arithmetic in doubled up 64-bit, but truncate the hash to the lower
32-bits, but then still return `uint64` to be the same.  So, type
correct but truncated hash value.  Update `thashes.nim` as well.

* A second try at making 32-bit mode CI work.

* Use a more systematic identifier convention than Wang Yi's code.

* Fix test that was wrong for as long as `toHashSet` used `rightSize` (a
very long time, I think).  `$a`/`$b` depend on iteration order which
varies with table range reduced hash order which varies with range for
some `hash()`.  With 3 elements, 3!=6 is small and we've just gotten
lucky with past experimental `hash()` changes.  An alternate fix here
would be to not stringify but use the HashSet operators, but it is not
clear that doesn't alter the "spirit" of the test.

* Fix another stringified test depending upon hash order.

* Oops - revert the string-keyed test.

* Fix another stringify test depending on hash order.

* Add a better than always zero `defined(js)` branch.

* It turns out to be easy to just work all in `BigInt` inside JS and thus
guarantee the same low order bits of output hashes (for `isSafeInteger`
input numbers).  Since `hashWangYi1` output bits are equally random in
all their bits, this means that tables will be safely scrambled for table
sizes up to 2**32 or 4 gigaentries which is probably fine, as long as the
integer keys are all < 2**53 (also likely fine).  (I'm unsure why the
infidelity with C/C++ back ends cut off is 32, not 53 bits.)

Since HashSet & Table only use the low order bits, a quick corollary of
this is that `$` on most int-keyed sets/tables will be the same in all
the various back ends which seems a nice-to-have trait.

* These string hash tests fail for me locally.  Maybe this is what causes
the CI hang for testament pcat collections?

* Oops. That failure was from me manually patching string hash in hashes.  Revert.

* Import more test improvements from #13410

* Fix bug where I swapped order when reverting the test.  Ack.

* Oh, just accept either order like more and more hash tests.

* Iterate in the same order.

* `return` inside `emit` made us skip `popFrame` causing weird troubles.

* Oops - do Windows branch also.

* `nimV1hash` -> multiply-mnemonic, type-scoped `nimIntHash1` (mnemonic
resolutions are "1 == identity", 1 for Nim Version 1, 1 for
first/simplest/fastest in a series of possibilities.  Should be very
easy to remember.)

* Re-organize `when nimvm` logic to be a strict `when`-`else`.

* Merge other changes.

* Lift constants to a common area.

* Fall back to identity hash when `BigInt` is unavailable.

* Increase timeout slightly (probably just real-time perturbation of CI
system performance).
dom96 pushed a commit that referenced this pull request Jun 10, 2020
* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable
stride double hashing) part of recent sets & tables changes (which has
still been causing bugs over a month later (e.g., two days ago
#13794) as well as still having
several "figure this out" implementation question comments in them (see
just diffs of this PR).

This topic has been discussed in many places:
  #13393
  #13418
  #13440
  #13794

Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in
identity hashes) are a better solution that is more general (no illusion
of one hard-coded sequence solving all problems) while retaining the
virtues of linear probing such as cache obliviousness and age-less tables
under delete-heavy workloads (still untested after a month of this change).

The only real solution for truly adversarial keys is a hash keyed off of
data unobservable to attackers.  That all fits better with a few families
of user-pluggable/define-switchable hashes which can be provided in a
separate PR more about `hashes.nim`.

This PR carefully preserves the better (but still hard coded!) probing
of the  `intsets` and other recent fixes like `move` annotations, hash
order invariant tests, `intsets.missingOrExcl` fixing, and the move of
`rightSize` into `hashcommon.nim`.

* Fix `data.len` -> `dataLen` problem.

* Add neglected API call `find` to heapqueue.

* Add a changelog.md entry, `since` annotation and rename parameter to be
`heap` like all the other procs for consistency.

* Add missing import.
@stale

stale bot commented Mar 20, 2021

This pull request has been automatically marked as stale because it has not had recent activity. If you think it is still a valid PR, please rebase it on the latest devel; otherwise it will be closed. Thank you for your contributions.

@stale

stale bot commented Mar 22, 2022

This pull request has been automatically marked as stale because it has not had recent activity. If you think it is still a valid PR, please rebase it on the latest devel; otherwise it will be closed. Thank you for your contributions.

@stale stale bot added the stale Staled PR/issues; remove the label after fixing them label Mar 22, 2022
@stale stale bot closed this Apr 24, 2022