
Enforce disk size limit more strictly in DiskBasedCache. #196

Merged: 2 commits, Aug 17, 2018

Conversation

jpd236
Collaborator

@jpd236 jpd236 commented May 31, 2018

- Previously, when new entries were written to the cache with put(),
  the size accounted for in those entries did not include cache overhead
  when actually storing them on disk. That overhead is now included.

- We now prune excess entries after writing new entries to disk instead
  of before (so that we know the size-on-disk). This prevents the cache
  from exceeding the size limit due to one large entry (until the next
  put). The cache will now exceed the maximum size when writing an entry
  which pushes it over the limit; however, this excess will be cleared
  before put() returns rather than some indefinite time in the future.

Fixes #57
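As a rough illustration of the write-then-prune ordering described above, here is a minimal sketch. The class, its fields, and the 90% hysteresis threshold mirror the description in this PR, but the code is illustrative, not Volley's actual DiskBasedCache implementation:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class CacheSketch {
    private final long maxSizeBytes;
    private long totalSize = 0;
    // LinkedHashMap preserves insertion order, so iteration visits the
    // oldest entries first when pruning.
    private final Map<String, Long> entrySizes = new LinkedHashMap<>();

    CacheSketch(long maxSizeBytes) {
        this.maxSizeBytes = maxSizeBytes;
    }

    void put(String key, long sizeOnDisk) {
        // Write the entry first (simulated here by recording its size), so
        // the accounted size reflects the real size on disk...
        entrySizes.put(key, sizeOnDisk);
        totalSize += sizeOnDisk;
        // ...then prune, so any excess is cleared before put() returns.
        pruneIfNeeded();
    }

    private void pruneIfNeeded() {
        if (totalSize <= maxSizeBytes) {
            return;
        }
        // Prune down to 90% of the limit (hysteresis), oldest entries first.
        Iterator<Map.Entry<String, Long>> it = entrySizes.entrySet().iterator();
        while (it.hasNext() && totalSize > maxSizeBytes * 0.9) {
            totalSize -= it.next().getValue();
            it.remove();
        }
    }

    long totalSize() {
        return totalSize;
    }

    boolean contains(String key) {
        return entrySizes.containsKey(key);
    }
}
```

With a 100-byte limit, putting entries of 50, 40, and 30 bytes triggers a prune on the last put that evicts only the oldest entry, leaving the cache at 70 bytes.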

Contributor

@joebowbeer joebowbeer left a comment


The change looks simple enough and it's great to see unit tests, but I have two concerns:

  1. How can you be sure that the subsequent prune will never remove the entry that was just added? I'd like to see a mathematical proof for this.

  2. The bug report for this was internally generated. That is, it was not reported by a user but by a developer. I worry that changing the behavior at this time for no certain benefit to a real user is not worth the risk it introduces.

Better to save this for the next release.

@jpd236
Collaborator Author

jpd236 commented May 31, 2018

Thanks for your comments!

Re: 1 - there is absolutely no guarantee that a new entry won't get removed. It will be removed if it either takes over 100% of the cache size on its own, or over 90% of the cache size if other entries were already present (due to hysteresis). To me it seems like a bug that you could put in an entry that was larger than the requested maximum size (with no size limit at all!) and have it stick around until the next put() call. Why expect such a large entry to be retained?

Re: 2 - Yeah, that's fair. It might have been relevant to whatever was happening in #145 but we never really got to the bottom of that. Moving this to the 1.2.0 milestone without evidence of a user hitting bugs due to the current behavior.

@jpd236 jpd236 added this to the 1.2.0 milestone May 31, 2018
@joebowbeer
Contributor

joebowbeer commented May 31, 2018

To be clear, the situation that I most want to avoid is adding a new entry, at great cost, and then immediately pruning it, along with everything else, before the cached entry is ever used.

@jpd236
Collaborator Author

jpd236 commented May 31, 2018

I think this may not have been clear - the cache data structure is a LinkedHashMap, which preserves insertion order. Entries are removed in the order they were inserted, so the newly inserted entry is the last one we'd remove, and we'd only remove it if it alone exceeded the hysteresis mark (90% of the cache size), since every other entry would be removed first. One of the new unit tests verifies the order of removals.
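A small demonstration of why the newest entry is the last pruning candidate, relying only on LinkedHashMap's documented insertion-order iteration (the class and method names here are hypothetical, not from Volley):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InsertionOrderDemo {
    // Removes every entry from an insertion-ordered map, recording the
    // order in which keys are evicted. Oldest keys come out first, so the
    // most recently inserted key is always evicted last.
    static List<String> removalOrder() {
        Map<String, Integer> entries = new LinkedHashMap<>();
        entries.put("oldest", 1);
        entries.put("middle", 2);
        entries.put("newest", 3);
        List<String> order = new ArrayList<>();
        for (String key : new ArrayList<>(entries.keySet())) {
            order.add(key);
            entries.remove(key);
        }
        return order;
    }
}
```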

I don't believe there should be significant churn here as a result of this change compared to what we had before, assuming reasonably sized cache entries relative to the maximum size.

@joebowbeer
Contributor

joebowbeer commented May 31, 2018

I understand the LRU nature.

The 'bug' is that an excessively large entry can be cached, and it won't be removed until the next entry is written.

However, in this same scenario, your solution will clear the entire cache and leave nothing, which is also not good.

I think there's a middle ground that works better: don't cache if entry.data.length is larger than some threshold (max size? hysteresis?). Otherwise, do cache, but don't prune the entry that was just written.

I do think it is better to prune before writing a new entry. This minimizes the peak cache size, and also preserves the newest entry.

In short, a better solution is to simply test entry.data.length at the top of the put method.

@jpd236
Collaborator Author

jpd236 commented May 31, 2018

Yeah, I was considering an up-front check to make sure the new entry being written isn't too large before writing it. I was leaning against it because entry.data.length is technically the wrong length to evaluate (it's not counting the header overhead), and so to get it precisely right would require an up-front calculation of the header size which is what I was hoping to avoid.

If we just use entry.data.length, we might write an entry only to find that with the cache overhead, it's now too large to fit. But perhaps that's reasonable enough - it would allow us to eliminate caching for entries which are definitely too big, while doing a bit more work for entries which are right at the boundary.

If an entry is so large that adding it to the cache would trigger a
prune, and if that prune would clear the new entry as well, then skip
adding it in the first place.

For simplicity, we don't include the cache header overhead in this
calculation; this implies that we may still put and immediately clear
an entry if it is very close to the threshold. But this won't result
in incorrect behavior; it will just lead to a bit of file churn. We
could resolve this in the future by calculating the header overhead.
@jpd236
Collaborator Author

jpd236 commented Aug 16, 2018

PTAL - I added a check to put() which prevents writing a cache entry that would be pruned immediately. This reduces the likelihood of churn, although it's still possible in some edge cases because we don't include the header size in that calculation. I think this is a reasonable compromise: ultimately this is just an optimization to avoid some churn, and calculating the header size up front is non-trivial, although feasible. Let me know if you disagree.
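A hedged sketch of the kind of up-front guard being described. The class and method names are hypothetical; the 0.9 hysteresis factor matches the 90% figure discussed above, and, as noted, header overhead is deliberately ignored:

```java
class PutGuard {
    // Prune target: 90% of the maximum cache size (hysteresis).
    static final float HYSTERESIS_FACTOR = 0.9f;

    // Returns true if an entry of this data length could survive a prune,
    // i.e. it fits within 90% of the maximum cache size on its own. Header
    // overhead is not counted, so entries right at the boundary may still
    // be written and then immediately pruned.
    static boolean wouldBeRetained(long dataLength, long maxCacheSizeBytes) {
        return dataLength <= maxCacheSizeBytes * HYSTERESIS_FACTOR;
    }
}
```

Calling such a check at the top of put() skips the disk write entirely for entries that are definitely too large, at the cost of a little extra churn for borderline sizes.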

Contributor

@joebowbeer joebowbeer left a comment


LGTM
