hashing collision: improve performance using linear/pseudorandom probing mix #13440
Conversation
Looks OK to me, but @narimiran is the reviewer. CC @c-blake
I still see not a single benchmark of an alternate workload besides just inserting everything into the hash table. Given that the main point of contention, performance-wise, is the impact of all those tombstones on delete-heavy workloads, that seems inadequate to prove any interesting point. Indeed, all that these workloads really test is how well the indirect hash function encoded as a perturbed sequence works. If the threat model is a user hash that is A) weak in some groups of bits, but B) still has many varying bits (which is the only situation this perturbed probe sequence solves), then there is another natural mitigation which is inherently local: just rehash the output of `hash()`.

I've been testing this in a private library off to the side that is all about search-depth-based triggering of A) first Robin Hood hashing activation, and then B) `hash()` output rehashing and Robin Hood, structured such that a final fallback to a B-tree might also be workable. I'm not as happy with a lot of the code repetition/factoring as I could be, but perhaps it's ready enough to make public. It's not all that well benchmarked, micro-optimized, or tested, but it should exhibit the ideas I'm referring to less abstractly.
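To make the "rehash the output of `hash()`" suggestion concrete, here is a minimal Nim sketch (my own illustration, not code from this PR or from the library mentioned): the user's hash is passed through a 64-bit multiplicative mixer before being masked down to a table index, so weak low bits don't cluster probes. The mixing constant and the `rehash`/`firstIndex` names are assumptions.

```nim
import hashes

# A 64-bit multiplicative finalizer applied on top of the user's hash()
# output; the constant is illustrative, not from this PR.
proc rehash(h: Hash): uint64 =
  var x = cast[uint64](h)
  x = x xor (x shr 33)
  x = x * 0xff51afd7ed558ccd'u64
  x xor (x shr 33)

# Reduce the mixed value to an index for a power-of-two sized table.
proc firstIndex(h: Hash; mask: int): int =
  int(rehash(h) and uint64(mask))

when isMainModule:
  echo firstIndex(hash(123456), 1024 - 1)
```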
lib/pure/collections/tables.nim
Outdated
# using this and creating tombstones as in `Table` would make `remove`
# amortized O(1) instead of O(n); this requires updating remove,
# and changing `val != 0` to what's used in Table. Or simply make CountTable
# a thin wrapper around Table (which would also enable `dec`, etc)
> Or simply make CountTable a thin wrapper around Table (which would also enable `dec`, etc)

I don't understand this comment. Is this a TODO?
I've reworded this comment a bit to clarify. I'm suggesting possible avenues to improve CountTable:

- using `countDeleted` and tombstones, and adapting the algorithms used in `Table`
- or (best option IMO, to be discussed in a separate PR) using something analogous to "MultiTable: thin wrapper around Table with sane API for duplicate keys; deprecate Table.add" (RFCs#200) but for CountTable, where we offer the same API as the current `CountTable` but implemented as a thin wrapper around Table.
Both approaches would enable the following:

- all operations would become at least as efficient as current CountTable operations
- some CountTable operations would turn from O(n) to amortized O(1)
- current API restrictions would be lifted (eg we'd allow `inc` with negative values (or equivalently, `dec`), and we'd allow counts to be 0)
You could call it a TODO, or you could call it something to help the next person reading this code and wondering why some operations are currently O(n). I could do that PR after this PR is merged.
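As a rough illustration of the "thin wrapper around Table" idea, here is a hedged Nim sketch; `CountTableWrap`, `initCountTableWrap`, and `count` are hypothetical names, not a proposed API. The point is only that counting, negative/zero counts, and O(1) removal all fall out of delegating storage to `Table`.

```nim
import tables

type
  CountTableWrap[K] = object   # hypothetical name, not a proposed API
    data: Table[K, int]

proc initCountTableWrap[K](): CountTableWrap[K] =
  CountTableWrap[K](data: initTable[K, int]())

proc inc[K](c: var CountTableWrap[K]; key: K; by = 1) =
  # negative `by` (i.e. a `dec`) and zero counts are allowed
  c.data.mgetOrPut(key, 0) += by

proc count[K](c: CountTableWrap[K]; key: K): int =
  c.data.getOrDefault(key, 0)

proc del[K](c: var CountTableWrap[K]; key: K) =
  c.data.del(key)              # inherits Table's amortized O(1) removal

when isMainModule:
  var c = initCountTableWrap[string]()
  c.inc("a"); c.inc("a"); c.inc("b", -3)
  assert c.count("a") == 2 and c.count("b") == -3
```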
PTAL
Pseudorandom probing is a general collision resolution technique, not limited to "weakness in some group of bits".

The term for that is double hashing.

I'm happy to run the same benchmark once it's public and offers the same Table API (or a subset of it) so it can be plugged in.
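For reference, here is a small sketch of what pseudorandom probing means concretely in the CPython-style "perturb" form this PR borrows (illustrative only, not the PR's exact code): each step folds more high bits of the hash code into the index, so the probe sequence depends on all bits of `hash(key)`.

```nim
# One probe step: fold more high bits of the hash into the index each time.
proc nextTryPerturbed(idx: var int; perturb: var uint64; mask: int) =
  perturb = perturb shr 5
  idx = int((uint64(idx) * 5 + perturb + 1) and uint64(mask))

when isMainModule:
  var idx = 7
  var perturb = 0xdeadbeef'u64   # would normally start as the full hash code
  for step in 0 ..< 4:
    nextTryPerturbed(idx, perturb, 1024 - 1)
    echo idx                     # successive probe positions
```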
I made it public a couple days ago. Tombstones make inserts or lookups after many deletions slow as well. In any case, since that general style of workload is the one thing where one might expect a performance difference, it seems to me you should do it before claiming performance neutrality.

Based on your response to my A), B) points, I think you are getting hung up on terminology, possibly simply don't understand what you're saying, and definitely do not understand what I am saying. I know about double hashing and that is not what I am suggesting; I was careful to not use that term, actually. DH uses a second, truly independent hash of the key to generate an increment interval, and because of the way people usually abstract a user `hash()` as a single proc, no such second hash is available. What I was suggesting is different, and personally I think clear from my prior text: I was trying to say use a rehash of the user's `hash()` output, as in my earlier comment.

Mathematically, the PRNG stuff you're copying from Python is just a fancier, more memory-system-indirection-heavy way to have the probe sequence "depend upon all the bits" a user `hash()` provides. @Araq's initial complaint for a better hash about "what about other distributions" still holds for your PRNG, only with your approach you have hard-coded the way all bits are depended upon into the library. But no hash can be great all the time for all sets of keys, especially if attackers with access to your source code exist. With my suggestion you can preserve the "freedom to fix" by having that rehash be user-replaceable.
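For contrast, a hedged sketch of classic double hashing as characterized above, where a second, independent hash of the key chooses the probe stride; `hash2` and `dhIndex` are illustrative names, not stdlib procs, and this differs from rehashing the single `hash()` output.

```nim
import hashes

# `hash2` is a stand-in for a second, independent hash; it is not a stdlib proc.
proc hash2(key: int): Hash =
  hash(key xor 0x5bd1e995)

proc dhIndex(key, depth, mask: int): int =
  let h1 = hash(key) and mask              # home slot
  let stride = (hash2(key) and mask) or 1  # odd stride visits every slot of a
                                           # power-of-two sized table
  (h1 + depth * stride) and mask

when isMainModule:
  for depth in 0 ..< 4:
    echo dhIndex(123456, depth, 1024 - 1)
```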
Incidentally, while mentioned in the repo referred to above, it deserves reiteration that probe-depth-based resize triggers, as well as probe-depth-triggered mitigations for weak hashes, have other nice properties. For one, if the user can provide a better-than-random hash they might enjoy 100% memory utilization with no performance degradation. Admittedly, providing actual minimal perfect hashes outside of test cases is unlikely, but "close to that" might not be. In any case, reacting to the actual underlying cause of slowness (deep probes) is inherently more robust than trying to guess what load factor to target. I am pretty sure Rust was using such depth-based triggers for something like 4 years, but it's also probably an ancient tactic. It's just sort of looking at those "probe depth vs. load" graphs with flipped axes.
Oh, and it would be easy (though non-portable) to integrate a source of true randomness into the hash.
In an attempt to be more constructive benchmark-wise, this blog post describes at least some things you can do: https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ He makes more of an effort than you to avoid prefetching biases. It's still incomplete: you need to use true randomness that the CPU cannot get by "working ahead" with PRNGs, and Linux provides sources for that.

Another comment would be that I don't think there is any "workload independent best" solution. So, it will never be possible to really decide what benchmarks are most representative. All your "1.06x" kinds of answers are also going to be clearly highly contingent upon which CPUs you tested with, which backend compilers, with or without PGO at the C level, and may well not generalize to many/most other deployment contexts. More abstract/mathematical properties like "about as fast, but can handle situations X, Y, Z gracefully", or arguments in terms of "probe depth" or "unprefetched cache misses", are just less deployment-vague ways to argue about suitability for a general purpose library.
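A sketch of the "true randomness" benchmarking point, assuming a POSIX-like system where `/dev/urandom` exists and Nim's classic file API; `trueRandomKeys` is an illustrative helper name, not something this PR or the benchmark code defines.

```nim
# Draw benchmark keys from the kernel's entropy pool instead of an in-process
# PRNG, so the CPU cannot "work ahead" and prefetch the memory a future
# lookup will touch. `trueRandomKeys` is an illustrative helper name.
proc trueRandomKeys(n: Positive): seq[uint64] =
  result = newSeq[uint64](n)
  var f = open("/dev/urandom")             # raises IOError if unavailable
  defer: close(f)
  let bytes = n * sizeof(uint64)
  if readBuffer(f, addr result[0], bytes) != bytes:
    raise newException(IOError, "short read from /dev/urandom")

when isMainModule:
  echo trueRandomKeys(4)
```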
There are a few easy "proofs" of workload dependence, btw. Consider simply iteration. Something like Similarly, Robin Hood can do "miss" lookups faster than "hits" on dense tables because an order invariant is maintained (at some cost). So, determining two sets have an empty set (or near it) intersection is usually best with Robin Hood. Another not well advertised property of Robin Hood is that I believe if you have In light of all the above use case/workload variability, the only real long-term solution is obviously to provide a set of alternatives (which is what I'm trying to do with |
Another comment on this PR is that though cache line size does vary, with 64 bytes being the most common, it varies much less than "infinity", which is how much per-object sizes in the table's data can vary.

Even with a known cache line size there are system-specific hacks where the memory system can make it effectively larger. E.g., I have a workstation whose BIOS can make mine 128B instead of 64B, and the pre-fetcher can probably successfully hide latency of a tight loop up to at least 256B. That's just one system. It will be hard to set that in general, and it may vary per system.
Also, while whatever handful of "well scrambling but too slow" hashes you may have personally tried were indeed slow, there do exist hashes that scramble well and cost only a few ALU operations. Of course, there is no perfect hash for all situations, and that includes the implicit hash hard-coded into your perturbed probe sequence.

The academic terminology on this is usually that a "hash function" maps "keys" to "the whole probe sequence". Usually that sequence is a permutation of table addresses, not the weird Python thing where table addresses can repeat. Another argument against this whole "just copy Python" direction is that Python tables do not allow duplicate keys, while its probe sequence makes supporting them awkward.
…nges (#13816)

* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable stride double hashing) part of recent sets & tables changes (which has still been causing bugs over a month later (e.g., two days ago #13794) as well as still having several "figure this out" implementation question comments in them (see just diffs of this PR). This topic has been discussed in many places: #13393 #13418 #13440 #13794 Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in identity hashes) are a better solution that is more general (no illusion of one hard-coded sequence solving all problems) while retaining the virtues of linear probing such as cache obliviousness and age-less tables under delete-heavy workloads (still untested after a month of this change). The only real solution for truly adversarial keys is a hash keyed off of data unobservable to attackers. That all fits better with a few families of user-pluggable/define-switchable hashes which can be provided in a separate PR more about `hashes.nim`. This PR carefully preserves the better (but still hard coded!) probing of the `intsets` and other recent fixes like `move` annotations, hash order invariant tests, `intsets.missingOrExcl` fixing, and the move of `rightSize` into `hashcommon.nim`.
* Fix `data.len` -> `dataLen` problem.
* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable stride double hashing) part of recent sets & tables changes (which has still been causing bugs over a month later (e.g., two days ago #13794) as well as still having several "figure this out" implementation question comments in them (see just diffs of this PR). This topic has been discussed in many places: #13393 #13418 #13440 #13794 Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in identity hashes) are a better solution that is more general (no illusion of one hard-coded sequence solving all problems) while retaining the virtues of linear probing such as cache obliviousness and age-less tables under delete-heavy workloads (still untested after a month of this change). The only real solution for truly adversarial keys is a hash keyed off of data unobservable to attackers. That all fits better with a few families of user-pluggable/define-switchable hashes which can be provided in a separate PR more about `hashes.nim`. This PR carefully preserves the better (but still hard coded!) probing of the `intsets` and other recent fixes like `move` annotations, hash order invariant tests, `intsets.missingOrExcl` fixing, and the move of `rightSize` into `hashcommon.nim`. * Fix `data.len` -> `dataLen` problem. * This is an alternate resolution to #13393 (which arguably could be resolved outside the stdlib). Add version1 of Wang Yi's hash specialized to 8 byte integers. This gives simple help to users having trouble with overly colliding hash(key)s. I.e., A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))` in the instantiation context of a `HashSet` or `Table` or B) more globally, compile with `nim c -d:hashWangYi1`. No hash can be all things to all use cases, but this one is A) vetted to scramble well by the SMHasher test suite (a necessarily limited but far more thorough test than prior proposals here), B) only a few ALU ops on many common CPUs, and C) possesses an easy via "grade school multi-digit multiplication" fall back for weaker deployment contexts. Some people might want to stampede ahead unbridled, but my view is that a good plan is to A) include this in the stdlib for a release or three to let people try it on various key sets nim-core could realistically never access/test (maybe mentioning it in the changelog so people actually try it out), B) have them report problems (if any), C) if all seems good, make the stdlib more novice friendly by adding `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1` with some `when defined` rearranging so users can `-d:hashIdentity` if they want the old behavior back. This plan is compatible with any number of competing integer hashes if people want to add them. I would strongly recommend they all *at least* pass the SMHasher suite since the idea here is to become more friendly to novices who do not generally understand hashing failure modes. * Re-organize to work around `when nimvm` limitations; Add some tests; Add a changelog.md entry. * Add less than 64-bit CPU when fork. * Fix decl instead of call typo. * First attempt at fixing range error on 32-bit platforms; Still do the arithmetic in doubled up 64-bit, but truncate the hash to the lower 32-bits, but then still return `uint64` to be the same. So, type correct but truncated hash value. Update `thashes.nim` as well. * A second try at making 32-bit mode CI work. * Use a more systematic identifier convention than Wang Yi's code. 
* Fix test that was wrong for as long as `toHashSet` used `rightSize` (a very long time, I think). `$a`/`$b` depend on iteration order which varies with table range reduced hash order which varies with range for some `hash()`. With 3 elements, 3!=6 is small and we've just gotten lucky with past experimental `hash()` changes. An alternate fix here would be to not stringify but use the HashSet operators, but it is not clear that doesn't alter the "spirit" of the test. * Fix another stringified test depending upon hash order. * Oops - revert the string-keyed test. * Fix another stringify test depending on hash order. * Add a better than always zero `defined(js)` branch. * It turns out to be easy to just work all in `BigInt` inside JS and thus guarantee the same low order bits of output hashes (for `isSafeInteger` input numbers). Since `hashWangYi1` output bits are equally random in all their bits, this means that tables will be safely scrambled for table sizes up to 2**32 or 4 gigaentries which is probably fine, as long as the integer keys are all < 2**53 (also likely fine). (I'm unsure why the infidelity with C/C++ back ends cut off is 32, not 53 bits.) Since HashSet & Table only use the low order bits, a quick corollary of this is that `$` on most int-keyed sets/tables will be the same in all the various back ends which seems a nice-to-have trait. * These string hash tests fail for me locally. Maybe this is what causes the CI hang for testament pcat collections? * Oops. That failure was from me manually patching string hash in hashes. Revert. * Import more test improvements from #13410 * Fix bug where I swapped order when reverting the test. Ack. * Oh, just accept either order like more and more hash tests. * Iterate in the same order. * `return` inside `emit` made us skip `popFrame` causing weird troubles. * Oops - do Windows branch also. * `nimV1hash` -> multiply-mnemonic, type-scoped `nimIntHash1` (mnemonic resolutions are "1 == identity", 1 for Nim Version 1, 1 for first/simplest/fastest in a series of possibilities. Should be very easy to remember.) * Re-organize `when nimvm` logic to be a strict `when`-`else`. * Merge other changes. * Lift constants to a common area. * Fall back to identity hash when `BigInt` is unavailable. * Increase timeout slightly (probably just real-time perturbation of CI system performance).
* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable stride double hashing) part of recent sets & tables changes (which has still been causing bugs over a month later (e.g., two days ago #13794) as well as still having several "figure this out" implementation question comments in them (see just diffs of this PR). This topic has been discussed in many places: #13393 #13418 #13440 #13794 Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in identity hashes) are a better solution that is more general (no illusion of one hard-coded sequence solving all problems) while retaining the virtues of linear probing such as cache obliviousness and age-less tables under delete-heavy workloads (still untested after a month of this change). The only real solution for truly adversarial keys is a hash keyed off of data unobservable to attackers. That all fits better with a few families of user-pluggable/define-switchable hashes which can be provided in a separate PR more about `hashes.nim`. This PR carefully preserves the better (but still hard coded!) probing of the `intsets` and other recent fixes like `move` annotations, hash order invariant tests, `intsets.missingOrExcl` fixing, and the move of `rightSize` into `hashcommon.nim`. * Fix `data.len` -> `dataLen` problem. * Add neglected API call `find` to heapqueue. * Add a changelog.md entry, `since` annotation and rename parameter to be `heap` like all the other procs for consistency. * Add missing import.
This pull request has been automatically marked as stale because it has not had recent activity. If you think it is still a valid PR, please rebase it on the latest devel; otherwise it will be closed. Thank you for your contributions.
TLDR

- compared to right before PR:
- compared to right before #13418: 1.06 slowdown

details

as I mentioned in #13418 we can refine the collision strategy to get the best of both worlds between linear and pseudorandom probing, using a threshold `depthThres` on search depth (ie the number of calls to `nextTry`), which is a parameter to tune; in practice 20 works across a range of applications, but a future PR could expose this single param to users.

performance numbers

I published some easy-to-modify benchmarking code to compare performance of a (customizable) task involving inserting some 20 million keys in a table, retrieving keys, and searching for both existing and nonexisting keys. I tried it on a number of different distributions of keys (english words, oids, various string keys, various random, high order bit keys, consecutive numbers (int32/64), explicit formulas (eg squares, multiples of K), small floats, etc).

The results show that:

- a 1.06 slowdown
- a better hash (fix #13393 better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions, #13410) also avoids the extreme bad cases, but is 2X slower for many simple distributions because of the hash computation overhead, and isn't as robust as pseudorandom probing against adversarial attacks
- the mix still has a bad case (eg the `toHighOrderBits` distribution), but instead of giving an unbounded slowdown (as is the case for linear probing), it just gives a 1.3X slowdown; so this is a very good tradeoff that makes the common cases faster at the expense of edge cases, for which we pay a small penalty (1.3X instead of 100X or unbounded slowdown)
- `depthThres` is a very simple heuristic that helps in almost all cases

I'm comparing 4 algos:

- `hashUInt32` just forwarding to `hashUInt64` (as `hashUInt32` had at least 1 very bad case)

details: performance numbers
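For concreteness, a minimal sketch of the linear/pseudorandom mix described above (assumed shape, not the PR's exact code; only the `depthThres` value of 20 comes from the description): probe linearly while the chain is shallow, and switch to the perturbed, hash-dependent stride once the depth exceeds the threshold.

```nim
const depthThres = 20   # the threshold discussed in the description above

proc nextTryMixed(idx: var int; perturb: var uint64; depth, mask: int) =
  if depth <= depthThres:
    # shallow chain: plain linear probing, cache friendly
    idx = (idx + 1) and mask
  else:
    # deep chain: CPython-style perturbed stride, breaks up clusters
    perturb = perturb shr 5
    idx = int((uint64(idx) * 5 + perturb + 1) and uint64(mask))

when isMainModule:
  var idx = 7
  var perturb = 0xfeedface'u64   # would start as the key's full hash code
  for depth in 0 ..< 25:
    nextTryMixed(idx, perturb, depth, 1024 - 1)
  echo idx
```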