Performance of ImmutableSet.contains #1155
Original issue created by Maaartinus on 2012-09-27 at 10:48 AM
In case an ImmutableSet contains many consecutive integers, contains can get extremely slow.
The problem is caused by (1) a big contiguous filled area of the table, and (2) the use of linear reprobing. Any lookup starting near the beginning of such a contiguous area takes a very long time, as it has to traverse all the way to its end.
Preventing a contiguous filled area
The hash code gets processed in Hashing.smear, which propagates bits only to the right.
But there are no collisions here. What happens in the extreme case (of the set containing N consecutive integers where N is a power of two) is that the smeared hashes stay consecutive too, so they completely fill one contiguous block of the table.
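For context, the right-shift-only smear in question is presumably the function Guava borrowed from java.util.HashMap (my reconstruction; treat as an assumption). On small consecutive integers it is the identity, which produces exactly the contiguous filled run described above:

```java
// Reconstruction (assumption) of the old, right-shift-only smear derived
// from java.util.HashMap. Bits only flow toward the low end of the word.
static int oldSmear(int hashCode) {
    hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
    return hashCode ^ (hashCode >>> 7) ^ (hashCode >>> 4);
}
```

For 0 <= i < 16, every shift yields 0, so oldSmear(i) == i: a set of consecutive small integers lands in consecutive table slots.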
With a smearing which propagates bits also to the left this could be avoided. A modification like
hashCode *= MULTIPLIER;
hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
helps here a lot. Funnily enough, MULTIPLIER=2 seems to work perfectly, as it makes every other slot empty. Obviously, an even multiplier is wrong in general, as it increases the number of collisions in other cases.
For any given set size some "optimal" MULTIPLIER can be found, e.g. 0x80001 works nicely for set size = 500000, but not for the other two. Randomly chosen odd values work reasonably well on average. Instead of the multiplication, some xors and/or additions with left shifts could be used, but the multiplication seems to be faster.
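A sketch of the multiplier idea being proposed (the constant here is my own arbitrary odd choice, not one from the thread): multiplying first spreads the low bits leftward, and the right shifts then mix them back down:

```java
// Arbitrary odd constant chosen for illustration (assumption, not from the
// thread); any "random-looking" odd value plays the same role.
static final int MULTIPLIER = 0x9E3779B9;

static int smearWithMultiplier(int hashCode) {
    hashCode *= MULTIPLIER;                            // spread low bits leftward
    hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12); // mix back down
    return hashCode ^ (hashCode >>> 7) ^ (hashCode >>> 4);
}
```

Multiplication by an odd constant and each xor-shift step are invertible on 32-bit ints, so the whole function stays a bijection: it rearranges hashes without introducing any new exact collisions.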
Obviously, the number of collisions for the expression
IMHO, the linear reprobing will always amplify any problems caused by hash collisions or bad distribution. I'd suggest a combination with double hashing: After some number of failed attempts to find a free slot, a jump to some other part of the table should be done. This has absolutely no impact in case everything works fine and can be a huge win in case of problems. It's quite simple (maybe 50 lines).
This is legit, and while I know a lot of work went into ImmutableSet, I do think this demonstrates that it's still too easy to trigger the worst case of linear probing.
I'm currently experimenting with the following smearing function:
(The multiplier constant is borrowed from murmur hash, which I suspected would yield good results.)
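The code block for the experimental function did not survive; based on the murmur-constant and rotation references later in the thread, the function that eventually shipped as Guava's Hashing.smear looks roughly like this (my reconstruction, not a verbatim quote):

```java
// Murmur3 mixing constants, per the "borrowed from murmur hash" remark.
static final int C1 = 0xcc9e2d51;
static final int C2 = 0x1b873593;

// The rotation carries bits leftward past the top of the word and back
// around, so every input bit can influence both high and low output bits.
static int smear(int hashCode) {
    return C2 * Integer.rotateLeft(hashCode * C1, 15);
}
```

Both multiplications use odd constants and rotation is a bijection, so this smear is also a permutation of the 32-bit ints.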
I'm experimenting with quadratic probing at the moment with this smear function, just because I want to avoid additional branches, and quadratic probing retains pretty good locality.
What's the latest rehash function being used in ConcurrentHashMapV8.java and so forth? Doug changes it often and we might improve just by trying whatever he's doing nowadays.
I also think double hashing sounds very worth trying -- all those bits sitting there in the upper reaches of the "smeared" hash code, unused. (Maybe double hashing isn't the right term here, not sure, perhaps this is cuckoo hashing with N=2.)
But honestly, part of the issue is that linear or quadratic open addressing is vulnerable not just to exact hash collisions, but to hashes that differ only in relatively low bits. The smearing methods don't do anything at all to help there, and that's the issue that needs addressing.
(I'm pretty strongly inclined to keep linear or quadratic probing for the first couple of probes at least, just to take advantage of memory locality.)
The experimental smearing method I proposed above is doing pretty well in benchmarks at the moment, which is nice -- and it's doing well with both linear and quadratic probing. Will keep y'all updated.
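A hypothetical sketch of what an open-addressed lookup with quadratic (triangular-step) probing looks like; the names and layout here are mine, not Guava's. The table length is a power of two and null marks an empty slot; offsets that grow by 1 each step land on triangular numbers, which visit every slot of a power-of-two table while still touching a few nearby cache lines first:

```java
// Hypothetical open-addressing lookup; table.length is a power of two and
// null entries mark empty slots. Probe offsets grow by 1 each step, so the
// probe positions follow the triangular numbers: h, h+1, h+3, h+6, ...
static boolean probeContains(Object[] table, Object target, int smearedHash) {
    int mask = table.length - 1;
    int i = smearedHash & mask;
    for (int step = 1; ; step++) {
        Object candidate = table[i];
        if (candidate == null) return false;     // empty slot: target absent
        if (candidate.equals(target)) return true;
        i = (i + step) & mask;                   // triangular probe sequence
    }
}
```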
Original comment posted by Maaartinus on 2012-09-29 at 07:24 AM
@Louis: Note that unlike the right shifts, the rotations give you no guarantee that "hashCodes that differ only by constant multiples at each bit position have a bounded number of collisions". Or maybe they do, but I can't see how.
I don't think so. It has good locality in the first few steps (which is good, as you quite often stop there), but on average it's quite bad. Assuming the probe sequence
The classical double hashing has distances d, d, d, … for some per-key constant d derived from a second hash.
What I'd propose is using distances 1, 1, …, 1, N: blocks of K−1 unit steps followed by a single long jump N.
In order for the probe sequence to cover the whole table it's sufficient that (K−1) + N is an odd multiple of K.
Explanation: A block consists of K−1 ones and a single N, so the distance travelled after one block is (K−1)·1 + 1·N. Moving by an odd multiple of the block length guarantees that all blocks get visited (as an odd number is co-prime to the number of blocks, which is a power of two).
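The coverage argument can be checked mechanically. This little sketch is my own; K=4 and N=9 are illustrative choices satisfying the condition, since (K−1)+N = 12 is an odd multiple of K. It walks the probe sequence and reports whether every slot gets visited before any slot repeats:

```java
// Walks the proposed probe sequence -- (k-1) unit steps, then one jump of
// n, repeating -- and reports whether it covers the whole table.
static boolean coversWholeTable(int tableSize, int k, int n) {
    boolean[] seen = new boolean[tableSize];
    int pos = 0;
    for (int visited = 0; visited < tableSize; visited++) {
        if (seen[pos]) return false;                  // revisited a slot early
        seen[pos] = true;
        int dist = ((visited + 1) % k == 0) ? n : 1;  // every k-th step jumps
        pos = (pos + dist) % tableSize;
    }
    return true;
}
```

With K=4, N=9 the walk over a 16-slot table visits 0..3, 12..15, 8..11, 4..7 and stops: full coverage, exactly as the block argument predicts.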
Agreed; most hashing algorithms use only the low-order bits of the hash code.
I think you've meant double hashing, as cuckoo works quite differently.
Actually, I've just got a different idea, very simple and maybe optimal for the issue I described. I'll try it out and report later.
Original comment posted by Maaartinus on 2012-10-02 at 01:41 PM
Here is the idea I've already announced: During construction, count the iterations of the probing loop and react when a run grows too long.
However, there's a similar problem involving a lot of collisions, and then it can't help at all (see the attachment). The benchmark is here:
Note that with
I think I've got a nice solution for all these problems (soon to come).
I'm not seeing how this would be doable without significantly increased complication in the construction method -- we don't even remember hash codes at all, at the moment.
Hrrrrm. I know you're quoting the HashMap source, but I don't really understand what "differ only by constant multiples at each bit position" could possibly mean. Do the bits differ by only a constant multiple? Of what?
I'm going to poke around Google for advice, maybe.
I observe that the JDK's IdentityHashMap uses open addressing, and its "smearing" function looks like
Worth exploring, perhaps?
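The quoted JDK code block was lost; from OpenJDK's IdentityHashMap (quoted from memory, so treat the details as approximate), the function referenced is:

```java
// IdentityHashMap stores key/value pairs in alternating slots of a single
// array, so the index must be even. (h << 1) - (h << 8) equals h * -254,
// which mixes bits leftward only and is always even.
static int identityHashMapHash(Object x, int length) {
    int h = System.identityHashCode(x);
    // Multiply by -254 to use multiple bits of the hash code, and ensure
    // an even index so a pair occupies two consecutive slots.
    return ((h << 1) - (h << 8)) & (length - 1);
}
```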
Original comment posted by Maaartinus on 2012-10-03 at 01:08 PM
It's trivial and very cheap, just counting the iterations in the inner loop, see the patch.
But in general, there's a tradeoff with unknown parameters: Storing the hashes would speed up reprobing, and so could spending more time in construction.
I can only guess what they mean, too. But whatever property it is, it can't be destroyed by a final multiplication, as there's no additional flow from the higher bits to the lower ones. An initial multiplication (i.e. before the right shift) is a different case, as the two together create such a flow. It's hard to explain what I mean, as it's based on my fuzzy understanding of their stated goal.
You mean personally? I couldn't find anything by googling.
I don't think so. The multiplication puts nearby hashes far apart, which is a good thing because of the linear reprobing, but that's about all. The hash distribution is different and the cost of reprobing is much lower.
Original comment posted by wasserman.louis on 2012-10-03 at 08:49 PM
Perhaps a more accurate statement would be
In any event, I'm getting the impression that we agree that an initial multiplication strictly helps -- it's still possible to cause the worst-case, of course, but it requires being more deliberate about it. Should I go ahead and try to push ahead on that change while we continue the discussion on the other changes?
Original comment posted by Maaartinus on 2012-10-04 at 09:57 AM
Yes. Currently, I know of no better way to find the cases where the multiplication is detrimental than brute-forcing it.
It's very cheap on modern CPUs: an integer multiplication takes only a few cycles.
As it is very cheap and improves the distribution, it might speed the search up even in commonly occurring cases, too.
It depends on the probability that someone gets hit by this problem in the meantime. That's something I can't tell.
Original comment posted by Maaartinus on 2012-10-11 at 09:55 AM
This small program illustrates how I understand what "differ only by constant multiples at each bit position" means:
According to it:
But the program only counts collisions assuming closed hashing. Secondary collisions due to rehashing are a very different beast.
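The program itself did not survive extraction; the following is only my guess at the shape of the experiment being described: take hash codes that are all multiples of one constant, map them into a power-of-two table with a chosen mix function, and report the worst bucket load under closed hashing (ignoring probe sequences entirely):

```java
import java.util.function.IntUnaryOperator;

// Counts, for hash codes 0, m, 2m, ..., how many land in the fullest
// bucket of a (1 << tableBits)-slot table after applying `mix`.
static int maxBucketLoad(IntUnaryOperator mix, int multiple, int count, int tableBits) {
    int[] buckets = new int[1 << tableBits];
    int mask = buckets.length - 1;
    int max = 0;
    for (int i = 0; i < count; i++) {
        int b = mix.applyAsInt(i * multiple) & mask;
        max = Math.max(max, ++buckets[b]);
    }
    return max;
}
```

With no mixing at all, multiples of the table size all collide into bucket 0, which is the pathological case the thread is about; a mix that spreads them evenly drives the worst load down to one per bucket.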
I'm putting this down as fixed by 69ad96b.
Even if the new smearing function can be defeated, it's at least much more difficult to make it blow up accidentally. (Incidentally, it also beats the old Hashing.smear in benchmarks.)