Bloom filter sharing

This is an idea that has been proposed for some time. It has been documented mostly on the mailing list but also on the bug tracker. The principle is to tell our peers what is in our datastore.

Implementation

Originally the proposal was that we share the whole datastore bloom filter with our peers. One problem with this is that it does not become useful until the whole filter has been transferred. Another problem is that we may want to change the datastore bloom filter to some other structure - see bug 3288.

If we use the structure proposed there, the filter becomes usable as soon as any bytes of it have arrived. The catch is that the receiving node then knows, immediately and without waiting for the full filter, enough to construct keys that will likely collide with a target key; because of quadratic probing it may need to send quite a few inserts to be sure of overwriting the target, but this will likely be difficult to detect - possibly impossible if the inserts are spread over a long period. The other option is to hash the keys with a nonce different from the one used for the datastore, use the hashed output to divide the keyspace into roughly even-sized chunks, and keep a separate bloom filter for each chunk. These filters would need rebuilding periodically, and each one would only be usable once that whole chunk's filter had been transferred. A sketch of this chunked approach is given below.
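This is a minimal sketch of the second (chunked) option, assuming SHA-256 for the salted hash and a plain BitSet-backed bloom filter per chunk; the class and parameter names (ChunkedStoreFilter, sharingNonce, and so on) are illustrative only, not Freenet's actual code:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

/**
 * Hypothetical sketch of the chunked option: keys are hashed with a
 * sharing-specific nonce (different from the datastore's own salt), the
 * salted hash picks one of N roughly even keyspace chunks, and each chunk
 * gets its own small bloom filter. Names and parameters are illustrative.
 */
public class ChunkedStoreFilter {

    private final byte[] sharingNonce;   // must differ from the datastore's salt
    private final BitSet[] chunkFilters; // one bloom filter per keyspace chunk
    private final int bitsPerChunk;
    private final int hashFunctions;

    public ChunkedStoreFilter(byte[] sharingNonce, int chunks, int bitsPerChunk, int hashFunctions) {
        if (hashFunctions < 1 || hashFunctions > 7)
            throw new IllegalArgumentException("this sketch derives at most 7 bit indexes from one SHA-256 hash");
        this.sharingNonce = sharingNonce.clone();
        this.chunkFilters = new BitSet[chunks];
        for (int i = 0; i < chunks; i++) chunkFilters[i] = new BitSet(bitsPerChunk);
        this.bitsPerChunk = bitsPerChunk;
        this.hashFunctions = hashFunctions;
    }

    /** Add a key from our datastore to the filter for its chunk. */
    public void add(byte[] routingKey) {
        byte[] hash = saltedHash(routingKey);
        BitSet filter = chunkFilters[chunkFor(hash)];
        for (int i = 0; i < hashFunctions; i++)
            filter.set(bitIndex(hash, i));
    }

    /** True if the key is possibly in the store; bloom filters give false positives but never false negatives. */
    public boolean mightContain(byte[] routingKey) {
        byte[] hash = saltedHash(routingKey);
        BitSet filter = chunkFilters[chunkFor(hash)];
        for (int i = 0; i < hashFunctions; i++)
            if (!filter.get(bitIndex(hash, i))) return false;
        return true;
    }

    /** Salted hash of the routing key; unrelated to the datastore's own probing layout. */
    private byte[] saltedHash(byte[] routingKey) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(sharingNonce);
            md.update(routingKey);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // SHA-256 is always available
        }
    }

    /** The first 4 bytes of the salted hash choose the keyspace chunk. */
    private int chunkFor(byte[] hash) {
        return Math.floorMod(intAt(hash, 0), chunkFilters.length);
    }

    /** The i-th bit index comes from the following 4-byte windows of the hash. */
    private int bitIndex(byte[] hash, int i) {
        return Math.floorMod(intAt(hash, 4 + i * 4), bitsPerChunk);
    }

    private static int intAt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }
}
```

Each chunk's filter can then be serialised and sent separately; at the receiving end a chunk's filter only becomes usable once that chunk has arrived in full, which is the trade-off described above.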

Security

This is relatively safe now because we don't store locally requested data in our datastore. The issues mentioned above could be partly mitigated by only sharing bloom filters with our Darknet peers.

Bloom filter sharing, combined with transmitting data from our datastore when a peer's store is empty, could help to flatten out traffic, which might help to disguise what is really going on - provided there is plenty of bandwidth.

Bandwidth problems

Using this on Opennet may not make sense. Certainly we only want to share bloom filters with nodes that are likely to be online again, so we probably want to wait until a peer has shown some level of uptime.

Resource impact

Ideally we would want to keep all the bloom filters in RAM; this might result in a significant increase in RAM usage, and we need to degrade gracefully. This again suggests that we need the filters to be usable in smaller chunks than a whole 500GB datastore's filter.
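As a rough, illustrative estimate (the figures here are assumptions, not measurements): a 500GB store of 32KiB blocks holds on the order of 15 million keys; at roughly 10 bits per key (about a 1% false-positive rate) that is around 19MB of filter per peer, so keeping filters for a few dozen peers in RAM could cost several hundred megabytes.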

We may be able to periodically read through the on-disk copy of the filters, linearly, checking them against both the keys we want and the keys that other nodes have recently requested. If we find matches, we can then fetch the blocks and offer them via ULPRs. We could do something similar with long-term requests and big-block transports (wireless rendezvous, sneakernet). We would certainly want to re-check the filters when a peer comes online.
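A minimal sketch of such a linear pass, assuming the peer's filters are stored as fixed-size chunks in a single file and that the keys of interest have already been grouped by chunk index; the file layout, class names and the FilterCheck interface are assumptions for illustration:

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/**
 * Hypothetical sketch of a periodic, linear pass over a peer's on-disk
 * filters: read each chunk's filter in sequence, test the keys we want
 * (or have recently seen requested) against it, and collect the probable
 * hits so the blocks can be fetched and offered via ULPRs. The file
 * layout and names here are assumptions, not Freenet's actual format.
 */
public class PeerFilterScan {

    /** Pluggable bit-testing logic, e.g. the same salted-hash scheme used to build the filters. */
    public interface FilterCheck {
        boolean mightContain(BitSet chunkFilter, byte[] routingKey);
    }

    /**
     * @param filterFile    on-disk copy of the peer's filters: consecutive chunks of bytesPerChunk bytes
     * @param keysByChunk   keys of interest, pre-grouped by chunk index (keysByChunk.get(c) maps to chunk c)
     * @param bytesPerChunk size of one serialised chunk filter
     * @return keys that are probably in the peer's store (false positives are possible)
     */
    public static List<byte[]> scan(String filterFile, List<List<byte[]>> keysByChunk,
                                    int bytesPerChunk, FilterCheck check) throws IOException {
        List<byte[]> probableHits = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(filterFile))) {
            byte[] chunkBytes = new byte[bytesPerChunk];
            for (List<byte[]> keys : keysByChunk) {
                in.readFully(chunkBytes);              // strictly sequential read, one chunk at a time
                if (keys.isEmpty()) continue;          // nothing of interest in this chunk
                BitSet chunkFilter = BitSet.valueOf(chunkBytes);
                for (byte[] key : keys) {
                    if (check.mightContain(chunkFilter, key)) {
                        probableHits.add(key);         // candidate: fetch the block, then offer it via ULPR
                    }
                }
            }
        }
        return probableHits;
    }
}
```

Because the file is read strictly sequentially, the pass stays cheap even for large filters, and it can simply be re-run whenever a peer comes back online.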

Performance impact

Simulations suggest that knowing what keys are in adjacent nodes' datastores should give us roughly a 10% overall boost to success rates. However this may underestimate the benefit: for very popular files it saves us a hop or two, and for very unpopular files it may well be the difference between finding the data and not finding it, since it effectively lets us visit a lot more nodes.

Publicity issues

When we asked Google for money in 2009 we said we'd implement Bloom filters. :)
