BloomFilter broken when really big #1119

Closed
gissuebot opened this issue Oct 31, 2014 · 23 comments

Original issue created by Maaartinus on 2012-08-24 at 05:40 PM


**BUG 1:**
When the number of bits in a BloomFilter gets high, its FPP is much worse than expected. The culprit is the modular arithmetic in BloomFilterStrategies.MURMUR128_MITZ_32.

You compute x%N where N=bits.size() and x is uniformly distributed in range 0..Integer.MAX_VALUE. For big N, x%N is far from uniform in range 0..N-1. For example, with N=3<<29, values below 1<<29 are twice as probable as the others.
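To make the bias concrete, here is a standalone sketch (illustrative class and method names, not Guava code) that samples uniform non-negative ints and reduces them modulo N = 3<<29. Residues below 1<<29 have two preimages in the range 0..Integer.MAX_VALUE, so about half of all indexes land in the bottom third of the table instead of the uniform one third:

```java
import java.util.Random;

// Standalone sketch of the modulo bias described above (names are illustrative).
public class ModuloBiasDemo {
  // Fraction of uniformly random non-negative ints x for which x % n < limit.
  static double fractionBelow(int n, int limit, int samples, long seed) {
    Random random = new Random(seed);
    int below = 0;
    for (int i = 0; i < samples; i++) {
      int x = random.nextInt() & Integer.MAX_VALUE; // uniform in 0..Integer.MAX_VALUE
      if (x % n < limit) {
        below++;
      }
    }
    return (double) below / samples;
  }

  public static void main(String[] args) {
    int n = 3 << 29;      // table size just under 2^31
    int limit = 1 << 29;  // the bottom third of the table
    double observed = fractionBelow(n, limit, 1_000_000, 42L);
    // A uniform distribution would put ~1/3 of the indexes below limit;
    // the bias puts ~1/2 there, because those residues have two preimages.
    System.out.println("observed fraction below 1<<29: " + observed);
  }
}
```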

This non-uniformity leads to results like this:
desiredFpp 0.000001000000
expectedFpp 0.000000610089
realFpp 0.000003000000

Here, desiredFpp is the value used in BloomFilter.create, and expectedFpp was reported after exactly expectedInsertions were done. Obviously, far fewer bits than expected were set. If this happened once, it might be good luck, but here it's a sign of this bug, as the realFpp shows.

This problem is readily reproducible; it's not a glitch caused by bad luck with particular values. AFAIK it concerns all versions since the switch away from power-of-two sizes.

**BUG 2:**
With "31149b4 The number of bits can reach Integer.MAX_VALUE now, rather than Integer.MAX_VALUE/64" another bug was introduced. The commit message is obviously wrong, as up to Integer.MAX_VALUE longs can be allocated, allowing nearly 2**37 bits. However, the arithmetic is still int-based and can address only 2**31 bits, so most of the allocated memory gets wasted.

Even worse, bits.size() may overflow, leading to all kinds of disaster, like "/ by zero" (e.g. for expectedInsertions=244412641 and desiredFpp=1e-11) or using only 64 bits.
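The overflow is easy to see with the textbook optimal-bits formula m = -n*ln(p)/(ln 2)^2 (a sketch with made-up names; Guava's exact rounding may differ slightly). For the quoted inputs the long result lands almost exactly on 3 * 2^32, so truncating it to int leaves almost nothing of the ~12.9 billion bits:

```java
public class BitSizeOverflowDemo {
  // Optimal number of bits for n insertions at false-positive probability p
  // (the standard formula; Guava's rounding may differ slightly).
  static long optimalNumOfBits(long n, double p) {
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  public static void main(String[] args) {
    long bits = optimalNumOfBits(244412641L, 1e-11);
    int truncated = (int) bits; // what int-based arithmetic effectively sees
    System.out.println("optimal bits: " + bits);      // ~1.29e10, almost exactly 3 * 2^32
    System.out.println("as int:       " + truncated); // tiny in magnitude, possibly negative
  }
}
```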

**INEFFICIENCY:**
In MURMUR128_MITZ_32 there are one modulus operation and one unpredictable branch per hash function. This is quite wasteful, as it's enough to compute the modulus for the two base hashes and then use conditional subtraction.
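A sketch of that trick (illustrative class and method names, not taken from the patch): reduce the two base hashes modulo N once, then keep the running index in range with a subtraction instead of a per-iteration modulus.

```java
public class ConditionalSubtractionDemo {
  // Index sequence h1 + i*h2 (mod n) with only two modulus operations:
  // reduce the base hashes once, then keep the running index in [0, n)
  // with a conditional subtraction.
  static long[] indexes(long hash1, long hash2, int numHashFunctions, long n) {
    long index = (hash1 & Long.MAX_VALUE) % n;
    long increment = (hash2 & Long.MAX_VALUE) % n;
    long[] result = new long[numHashFunctions];
    for (int i = 0; i < numHashFunctions; i++) {
      result[i] = index;
      index += increment;
      if (index >= n) { // index < 2n here, so one subtraction suffices
        index -= n;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Small values for readability: h1=10, h2=7, k=3, table size 5.
    for (long ix : indexes(10L, 7L, 3, 5L)) {
      System.out.println(ix);
    }
  }
}
```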

**ENHANCEMENT 1:**
As the filter may take up to 16 GB, there should be a method to query its memory consumption.

**ENHANCEMENT 2:**
Possibly there could be a strategy using a power-of-two table, which may be faster. If the speed-up is non-negligible, such a strategy makes a lot of sense, as the additional memory (assuming rounding up) is not wasted at all -- you get a better FPP.
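The speed argument is that for a power-of-two table the modulus collapses to a bit mask: x % size == x & (size - 1) for any sign-cleared x. A tiny sketch (hypothetical names):

```java
public class PowerOfTwoIndexDemo {
  // For a power-of-two table size, replace the division with a mask.
  static long index(long hash, long powerOfTwoSize) {
    return hash & (powerOfTwoSize - 1); // equals (hash & Long.MAX_VALUE) % powerOfTwoSize
  }

  public static void main(String[] args) {
    long size = 1L << 30;
    long hash = -6261870919139520595L; // an arbitrary 64-bit hash value
    // Both expressions select the same low 30 bits:
    System.out.println(index(hash, size));
    System.out.println((hash & Long.MAX_VALUE) % size);
  }
}
```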

**QUESTION:**
I see no reason for limiting numHashFunctions to 255. In the SerialForm there's an int, so why?

**PROPOSED SOLUTION:**
Because of serialized-form compatibility, I'd suggest leaving MURMUR128_MITZ_32 alone and creating MURMUR128_MITZ_64, which

  • extracts two longs instead of two ints from the HashCode
  • uses long arithmetic for everything

The BitArray must use long indexes and return long from bitCount() and size(), too. This works with both strategies, and the SerialForm needs no change.

For small filters (up to a few million bits, before the non-uniformity starts to cause problems) it's fine to use the old strategy; for larger ones the new one must be used. I'd suggest using the new one for all new filters.

In order to get maximum speed, the following comes to mind:

  • create package-private HashCode.asSecondLong()
  • compute hash2 only if numHashFunctions>1

The attached patch solves it all.


Original comment posted by kurt.kluever on 2012-08-24 at 06:11 PM


(No comment entered for this change.)


Owner: andreou@google.com
Labels: Package-Hash, Type-Enhancement


Original comment posted by Maaartinus on 2012-09-19 at 09:08 PM


I wouldn't say that an arbitrarily bad FPP and "/ by zero" are an RFE. It happens for big sizes only, but nobody said those aren't allowed.


Original comment posted by andreou@google.com on 2012-09-19 at 09:40 PM


Good suggestions, thanks for reading the code deeply! This hasn't had nearly enough focused eyeballs yet, unfortunately.

Hmm, there was a version where I extracted two longs instead of two ints from the hash; I can't remember why I went with ints after all. I suspect I didn't see any fpp improvement from the longs and went for simpler arithmetic, ignoring the case of really large BFs where longs are preferable anyway.

And yes, the modulo operator is unnecessarily performed too many times, well spotted!

The other bugs related to very large BFs are straightforward as well.

Guava team, can someone pick this up? The patch pretty much sums it up, though probably without the HashCode.asSecondLong(), and the for-loops are written in a slightly weird way. But since the serial form is designed for changes like this, it's easy. It's a bit unlikely that anyone is affected by any of these cases, so it's not quite severe, but it's still nice to get this into better shape in anticipation of such cases.

Maartin, could you file a separate request for the "figure out the memory size of a BF"? I wish we had that as well, especially since there is no constructor that is configured by that.


Original comment posted by Maaartinus on 2012-10-02 at 10:19 AM


Without HashCode.asSecondLong() you have to either use HashCode.asBytes(), which means cloning, or cast and access HashCode.bytes directly. Neither alternative feels nicer than asSecondLong.

Agreed that this bug is unlikely to hit someone soon.

The separate RFE is the issue 1156.


Original comment posted by andreou@google.com on 2012-10-15 at 07:33 PM


Kurt (Louis?), could someone take care of this? Inevitably more people are going to hit these bugs in very big BloomFilters.


Owner: kak@google.com


Original comment posted by kak@google.com on 2012-10-23 at 04:47 PM


(No comment entered for this change.)


Status: Accepted


Original comment posted by spie...@addthis.com on 2014-03-06 at 11:50 PM


Can we revisit this issue? Here is a version of the patch, updated against the latest git master branch. The majority of the implementation is faithful to the original patch. One small bugfix: in BitArray#get() and BitArray#put(), the right-shift operations should be unsigned. I also cleaned up the looping constructs to make them easier to understand. I introduced HashCode#asLeastSignificantLong() and HashCode#asMostSignificantLong() methods; the names are intended to be similar to those in java.util.UUID. With the introduction of a new enum in BloomFilterStrategies, these patches are backwards compatible. This fixes #1656


Original comment posted by kevinb@google.com on 2014-03-12 at 02:19 PM


(No comment entered for this change.)


Labels: Milestone-Release17


Original comment posted by kak@google.com on 2014-03-27 at 08:50 PM


OK, I think we've got a version we're going to check in. Can both spiegel@ and Maaartinus patch in and test this new BF strategy (patch against head, not Guava 16)? Also, if you see anything obvious, please point that out as well.

  MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(T object, Funnel<? super T> funnel,
        int numHashFunctions, BitArray bits) {
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
      long bitSize = bits.bitSize();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      boolean bitsChanged = false;
      long index = hash1 + hash2;
      for (int i = 0; i < numHashFunctions; i++) {
        index &= Long.MAX_VALUE; // make it positive
        index %= bitSize; // make it indexable
        bitsChanged |= bits.set(index);
        index += hash2;
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(T object, Funnel<? super T> funnel,
        int numHashFunctions, BitArray bits) {
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).asBytes();
      long bitSize = bits.bitSize();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long index = hash1 + hash2;
      for (int i = 0; i < numHashFunctions; i++) {
        index &= Long.MAX_VALUE; // make it positive
        index %= bitSize; // make it indexable
        if (!bits.get(index)) {
          return false;
        }
        index += hash2;
      }
      return true;
    }

    private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
    }

    private /* static */ long upperEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
    }
  };

Please let me know.

Thanks,
-kak


Status: Started
Labels: -Type-Enhancement, Type-Defect


Original comment posted by kak@google.com on 2014-03-27 at 09:38 PM


Also we're having a bit of internal debate over the methodology of:
  if (index < 0) {
    index = ~index;
  }
vs.
  index &= Long.MAX_VALUE;

They both accomplish similar goals, but should one be preferable to the other? The branchless approach is more attractive runtime-wise, but it's not flipping the bits (just making the result positive).

Any thoughts?
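For reference, a tiny sketch (illustrative names) of what each transformation does to a negative value — both yield a non-negative result, but by different routes:

```java
public class SignClearingDemo {
  static long viaComplement(long index) {
    return index < 0 ? ~index : index; // branchy: flips all bits of a negative value
  }

  static long viaMask(long index) {
    return index & Long.MAX_VALUE; // branchless: just clears the sign bit
  }

  public static void main(String[] args) {
    long x = -42L;
    System.out.println(viaComplement(x)); // 41, since ~(-42) == 41
    System.out.println(viaMask(x));       // Long.MAX_VALUE - 41
  }
}
```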


Original comment posted by spie...@addthis.com on 2014-03-27 at 09:43 PM


Yes, I don't think it matters what happens to the bits, as long as the transformation doesn't mess up the distribution of values and the result is always positive. Another thought I had was applying index >>>= 1. It removes the sign bit and drops the least significant bit. All three proposals are OK with me. I kind of prefer the two solutions that don't have a branch, but for me that's not a big concern.


Original comment posted by Maaartinus on 2014-03-28 at 05:04 AM


Running spiegel@'s BloomFilterTest says:
Observed false positive rate: 1.4666666666666666E-7
Expected false positive rate: 9.196321345133148E-8
which is much better than it was, but still wrong.

I see no point in if (nextHash < 0) nextHash = ~nextHash;. It gets compiled into a conditional move, or maybe into index ^= index >> 63, but IMHO it's not enough mixing for the buck. If the last mixing operation were a multiplication, the right shift would be clearly better than masking, but given how Murmur3_128Hasher.makeHash ends, I guess the LSb and MSb are about equally good. Since masking needs a 64-bit operand, I'd go for the shift.

Concerning the speed, there's something wrong:

normal: https://microbenchmarks.appspot.com/runs/7f9093a6-52c7-463d-8438-19f3295b4723

scaled: https://microbenchmarks.appspot.com/runs/85f26479-000a-4c06-82c5-cd4254013411#r:scenario.benchmarkSpec.parameters.desiredFpp,scenario.benchmarkSpec.parameters.expectedInsertions,scenario.benchmarkSpec.parameters.strategy&c:scenario.benchmarkSpec.methodName

It may be caused by the 64-bit division or by asBytes. Both could be eliminated and I'd bet that most of my optimizations for 32 bits would apply here as well. I guess I'll give it a try next week.


Original comment posted by Maaartinus on 2014-03-28 at 05:31 AM


There are some errors in my attached BloomFilterStrategies; in particular, many shifts must be unsigned. However, fixing them doesn't change the speed.


Original comment posted by cgdecker@google.com on 2014-03-28 at 04:27 PM


FYI, in the current iteration of this we're avoiding the asBytes() call with a package-private backdoor on HashCode that (for a byte-array based HashCode such as this one) returns the internal array directly without cloning.


Original comment posted by Maaartinus on 2014-03-28 at 05:03 PM


Nice. Still, my 64-bit version is 30-50% faster than the original MURMUR128_MITZ_32 (which needs only asLong()). Currently it's too hacky, but I'll come back when I've polished it a bit.

I found a nice modulus eliminating hack.

I think that the original BloomFilterTest was (slightly) wrong, with my updated version the false positive rates are fine for the 64-bit version.


Original comment posted by kak@google.com on 2014-03-28 at 06:51 PM


First, congrats on using nearly 30 minutes of walltime on my 6-core Xeon workstation with the GuavaBloomTest ;-)

Second, here are my current results:

MURMUR128_MITZ_32
Desired false positive rate: 1.0E-75
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 9.83005825E-316
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.9992650567524934
Factor: Infinity

MURMUR128_MITZ_64
Desired false positive rate: 1.0E-75
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 1.0003117180465422E-75
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0
Factor: 0.00

These numbers seem wrong. Is there something wrong with the tool?


Original comment posted by Maaartinus on 2014-03-28 at 07:23 PM


Oh sorry.... I fixed the formula but blew the inputs. Change it to something sane (even 1e-10 is fine, but 1e-75 was only an exercise with maximum memory).

I believe the numbers are correct: MURMUR128_MITZ_32 is completely broken because of overflow. It sets only a few bits, so it reports an unbelievably small expected fpp and ends up with a terrible real fpp.

For MURMUR128_MITZ_64 the expected fpp was 1e-75, and there was not a single false positive. That's fine. With proper figures you'll see proper data. Sorry again.


Original comment posted by kak@google.com on 2014-03-28 at 07:37 PM


OK, I re-ran the tests with 1e-10. Glad to see we have a reproducible failure case for MITZ_32.

MURMUR128_MITZ_32 (what's currently committed in Guava):
Desired false positive rate: 1.0E-10
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 1.6110489235558792E-16
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0011919304361762886
Factor: 7398474489188.57

MURMUR128_MITZ_64 (my proposed fix, source is attached below):
Desired false positive rate: 1.0E-10
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 9.998270307088567E-11
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0
Factor: 0.00

  MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(T object, Funnel<? super T> funnel,
        int numHashFunctions, BitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal(); // no byte[] clone!
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      boolean bitsChanged = false;
      long combinedHash = hash1 + hash2;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(T object, Funnel<? super T> funnel,
        int numHashFunctions, BitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal(); // no byte[] clone!
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long combinedHash = hash1 + hash2;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
          return false;
        }
        combinedHash += hash2;
      }
      return true;
    }

    private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
    }

    private /* static */ long upperEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
    }
  };

I realize yours may contain a few more small optimizations, but I'd prefer clarity and correctness over speed. If you can confirm that this fix is correct, we can discuss the tradeoffs between clarity and speed.


Original comment posted by spie...@addthis.com on 2014-03-28 at 07:54 PM


+1 this looks good. I'll rerun the tests and I expect them to pass. I'm going to run the latest version (#18) in a profiler and look for any bottlenecks.


Original comment posted by Maaartinus on 2014-03-28 at 08:33 PM


Looks good, maybe except for long combinedHash = hash1 + hash2;, where long combinedHash = hash1; would be both clearer and (a very tiny bit) faster (there's really no point in combining the longs, as Murmur's last two steps are h1 += h2; h2 += h1;).

The problem with later optimizations is serialization. Once people start saving their filters, you can't really change the strategy anymore. You can create a new one, but then you have to maintain them all forever. No nice implementation-independent SerialForm works here.

Unfortunately, there's no single big optimization here, just a couple of 10%-speedup-for-10-minutes-of-headache trade-offs. The good news is that the new version is faster than the broken one: https://microbenchmarks.appspot.com/runs/5d98efb8-9013-460b-9c0f-a576e6d158ba


Original comment posted by kak@google.com on 2014-03-28 at 08:41 PM


I think I'm going to leave the extra addition in. The paper that's referenced in the Javadoc, "Less Hashing, Same Performance: Building a Better Bloom Filter" (http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/rsa.pdf) uses this formula:

gi(x) = h1(x) + i*h2(x)

Which is what we've got now. I'd rather leave it as is to match the paper (for clarity's sake) than tinker with it more.

Thanks for all the help Martin + Spiegel. Submitting shortly!


Original comment posted by Maaartinus on 2014-03-28 at 09:26 PM


Right, but they also write "In our context i will range from 0 up to some number k − 1 to give k hash functions, and ...", so their very first gi is h1 rather than h1+h2.
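A minimal sketch of the paper's scheme with i running from 0 (hypothetical names, not the committed Guava code; the sign-clearing mask follows the proposed strategy), so that g0 is h1 itself:

```java
public class DoubleHashingDemo {
  // g_i(x) = h1(x) + i * h2(x), for i = 0 .. k-1, reduced into the table.
  static long[] gi(long h1, long h2, int k, long tableSize) {
    long[] out = new long[k];
    long combined = h1; // i = 0 yields h1, matching the paper
    for (int i = 0; i < k; i++) {
      out[i] = (combined & Long.MAX_VALUE) % tableSize;
      combined += h2;
    }
    return out;
  }

  public static void main(String[] args) {
    for (long g : gi(123456789L, 987654321L, 3, 1000L)) {
      System.out.println(g);
    }
  }
}
```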


Original comment posted by kak@google.com on 2014-03-28 at 09:42 PM


Ugh, not sure how I missed that... I made the appropriate change and re-ran the GuavaBloomTest. Results:

Desired false positive rate: 1.0E-10
Inserting 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Expected false positive rate: 9.998725915640892E-11
Testing 0% 8% 15% 23% 31% 38% 46% 53% 61% 69% 76% 84% 92% 99% done
Observed false positive rate: 0.0
Factor: 0.00


Status: Fixed

@cgdecker cgdecker modified the milestone: 17.0 Nov 1, 2014