Enabling O(1) removal from digest #268

Open
yoavweiss opened this Issue Nov 17, 2016 · 9 comments

Projects

None yet

3 participants

@yoavweiss

Current spec is using Golomb-coded sets as the algorithm to create digests.

While they show great space-efficiency, Golomb-coded sets do not enable O(1) removal from the digest, which means from a browser implementation perspective, the browser would have to calculate the hash for each host upon connection creation.

That poses a couple of issues from an implementation perspective:

  • Calculating the hash on each connection establishment may be expensive. That part seems inherent to the algorithm and not likely to be optimized away.
  • Calculating the hash requires per-host indexing. That part is just a limitation of many current cache implementations.

A cache digest algorithm that enables O(1) removal (as well as addition) to the digest would enable us to move away from those limitations:

  • Browsers can calculate a per-host digest once, then keep updating it as resources are added to the cache as well as when resources are removed from the cache. No need for per-host indexing.
    • In order to do that, browsers would need to persist digests along with the cache
  • Upon connection establishment, the browser can just send the ready-made digest to the server. Win!

During the HTTPWS, counting bloom filters were mentioned as an O(1) removal algorithm, but they are extremely inefficient when it comes to space. (~4 times bigger than bloom filters)

Turns out, Cuckoo filters enable O(1) removal while being more space efficient than bloom filters. While they are slightly bigger than Golomb-coded sets based digests, the cheaper runtime costs can make up for that deficiency.

/cc @kazuho @mnot @cbentzel @mcmanus

@kazuho
kazuho commented Nov 17, 2016 edited

Thank you for opening the issue. I think we should adopt a method that can be implemented by web browsers easily, as long as the digest remains fairly compact. Regarding the size of the digest, I have no objection for choosing Cuckoo Filters.

However, I am not sure if the proposed method would work as expected.

Browsers can calculate a per-host digest once, then keep updating it as resources are added to the cache as well as when resources are removed from the cache.

There are two digests (i.e. fresh and stale) for every host. You need to update the digests when the state of a cached response changes from fresh to stale, by evicting the corresponding entry from the fresh digest and inserting it into the stale digest.

Is that possible?

Also, if it is possible to maintain a per-origin Cuckoo Filter that is kept in sync with the cached resources, I think it would also be possible to maintain a per-host list of cached URLs as well, which can be used to build the digest.

Anyways, let me state that I'm very delighted to see people looking into how Cache Digests can be implemented within the web browser. I also hope that I'd be proved wrong, that it would be easier to implement Cache Digest in the browser than I had expected.

PS. I was considering of polishing up my ServiceWorker implementation, but maybe I should spend my time looking into the browsers as well.

@yoavweiss

There are two digests (i.e. fresh and stale) for every host. You need to update the digests when the state of a cached response changes from fresh to stale, by evicting the entry from the fresh digest and inserting it to the stale digest.

Hmm, that would indeed increase complexity as I believe the cache is not aware of resources going into the "stale" state unless they're queried. (@cbentzel & @mcmanus - please correct me if I'm wrong).

Is there any reason we can't use a single digest with URL+ETag (so use the ETags also for fresh resources) to avoid that complexity? Are there any downsides to that?

@yoavweiss

Also, if it is possible to maintain a per-origin Cuckoo Filter that is kept in sync with the cached resources, I think it would also be possible to maintain a per-host list of cached URLs as well.

That is indeed possible, as I mentioned: "Calculating the hash requires per-host indexing. That part is just a limitation of many current cache implementations".
The problem with that is that creating a digest from that list of URLs would mean an O(N) operation.

@kazuho
kazuho commented Nov 17, 2016 edited

@yoavweiss

Is there any reason we can't use a single digest with URL+ETag (so use the ETags also for fresh resources) to avoid that complexity? Are there any downsides to that?

It is true that you could use URL+ETag as a key of fresh digests (though server implementations might become slightly complicated).

However, it is still desirable to be able to distinguish whether if a resource is freshly cached or stale-cached, since the server behavior needs to be different for the two. For freshly cached resources, there is no need to push anything. For stale-cached resources, the server needs to push 304.

If the client sends a single digest covering both freshly and stale-cached resources, then the server would need to either push nothing for both of them (which means that client would issue conditional requests for stale-cached responses that it need to use), or push 304 for both of them (which means that the server would send 304 for even the freshly cached resources).

Note also that we haven't agreed on whether if sending a digest of stale responses is worth the extra bandwidth, or how we should push 304.

Last time I looked into my Firefox's cache, the number of cached objects were as follows:

domain total objects fresh objects
*.facebook.com 2273 790
*.google.com 1003 373

As you can see, fresh objects consists about 1/3 of the total objects cached for a domain. Sending a digest for all objects would roughly triple the size of the digest from when we just send the digest for freshly-cached objects.

Considering the fact that assets critical to the rendering path are more likely to be marked as fresh, it would make sense to just send the digest for freshly cached objects.

@kazuho
kazuho commented Nov 17, 2016

@yoavweiss

Also, if it is possible to maintain a per-origin Cuckoo Filter that is kept in sync with the cached resources, I think it would also be possible to maintain a per-host list of cached URLs as well.

That is indeed possible, as I mentioned: "Calculating the hash requires per-host indexing. That part is just a limitation of many current cache implementations".
The problem with that is that creating a digest from that list of URLs would mean an O(N) operation.

I agree that it is O(N). However, for the browser use-case, we would not want to send a digest covering tens of thousands of responses (since it would not fit into a single TCP packet or two). And from the table in my comment above, it is likely that a digest would not be that big.

So I think N would not be very large that it would have impact on performance. Converting a pre-calculated SHA256 into a GCS entry is a lightweight operation: truncate, emit one or two '0's followed by '1' (corresponds to the quotient), and then the remainder bits.

@kazuho
kazuho commented Nov 18, 2016

@yoavweiss
In addition to my previous comments, I think you cannot expand an existing Cuckoo Filter.

In other words, every time the number of resources that need to be tracked doubles (or halves), you would need to reconstruct a new Cuckoo Filter (of the adjusted size) by iterating through the cache entries. I do not think that is something you would like to do.

@yoavweiss

In addition to my previous comments, I think you cannot expand an existing Cuckoo Filter.

In other words, every time the number of resources that need to be tracked doubles (or halves), you would need to reconstruct a new Cuckoo Filter (of the adjusted size) by iterating through the cache entries. I do not think that is something you would like to do.

My understanding from the Cuckoo filter and Cuckoo hashing papers is that this expansion is taken into consideration in the amortized complexity analysis. Even if you look at the worst case, such expansion's cost is O(N), which is asymptotically similar to the cost GCS incur on every connection establishment.

@kazuho
kazuho commented Nov 19, 2016 edited

@yoavweiss What I would like to discuss is not the amortized cost, but the time required to expand (or shrink) the digest.

In both Cuckoo hashing and Cuckoo filter, expansion or shrinking is done by rebuilding the hash table.

With Cuckoo Hashing, the cost of (re)building a hash table is O(N) (where N is the number of entries), since the entries within the hash table contains references to the objects being stored. During expansion, you iterate through the contained objects by dereferencing them to build a new hash table.

However, in a Cuckoo Filter, there is no such reference. Therefore, to (re)build a Cuckoo Filter, you would need to use some pre-existing structure in the browser cache to iterate through the objects belonging to a specific origin. However, under premise that there is no per-origin index (note: we are considering the use of Cookie Filters since we do not want to introduce such index), we would need to iterate through the entire cache.

Can we afford such an operation (e.g. iterating through the entire cache) running periodically?

@yoavweiss

However, in a Cuckoo Filter, there is no such reference. Therefore, to (re)build a Cuckoo Filter, you would need to use some pre-existing structure in the browser cache to iterate through the objects belonging to a specific origin. However, under premise that there is no per-origin index (note: we are considering the use of Cookie Filters since we do not want to introduce such index), we would need to iterate through the entire cache.

That would indeed be a serious handicap to the Cuckoo filter approach.
Fortunately, browsers can extend the basic Cuckoo filter implementation to also store the full crypto hashes or other references to the stored objects on top of storing fingerprints.

This extra info won't be sent up as part of the Cuckoo filter, but can be used locally to extend it in O(N) complexity.

Such references to the actual stored objects can also allow periodic (potentially out-of-band) O(N) calculation to create a stale resources filter that would make it easier to push 304s.

@mnot mnot added the cache-digest label Nov 21, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment