
Enforce disk size limit more strictly in DiskBasedCache. #196

Merged: 2 commits, Aug 17, 2018

Conversation

jpd236
Collaborator

@jpd236 jpd236 commented May 31, 2018

- Previously, when new entries were written to the cache with put(),
  the size accounted for in those entries did not include cache overhead
  when actually storing them on disk. That overhead is now included.

- We now prune excess entries after writing new entries to disk instead
  of before (so that we know the size-on-disk). This prevents the cache
  from exceeding the size limit due to one large entry (until the next
  put). The cache will now exceed the maximum size when writing an entry
  which pushes it over the limit; however, this excess will be cleared
  before put() returns rather than some indefinite time in the future.

Fixes #57
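As a rough illustration of the write-then-prune ordering described above, here is a minimal sketch. The class, its fields, and the 90% hysteresis threshold mirror the description in this PR, but the code is illustrative, not Volley's actual DiskBasedCache implementation:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class CacheSketch {
    private final long maxSizeBytes;
    private long totalSize = 0;
    // LinkedHashMap preserves insertion order, so iteration visits the
    // oldest entries first when pruning.
    private final Map<String, Long> entrySizes = new LinkedHashMap<>();

    CacheSketch(long maxSizeBytes) {
        this.maxSizeBytes = maxSizeBytes;
    }

    void put(String key, long sizeOnDisk) {
        // Write the entry first (simulated here by recording its size), so
        // the accounted size reflects the real size on disk...
        entrySizes.put(key, sizeOnDisk);
        totalSize += sizeOnDisk;
        // ...then prune, so any excess is cleared before put() returns.
        pruneIfNeeded();
    }

    private void pruneIfNeeded() {
        if (totalSize <= maxSizeBytes) {
            return;
        }
        // Prune down to 90% of the limit (hysteresis), oldest entries first.
        Iterator<Map.Entry<String, Long>> it = entrySizes.entrySet().iterator();
        while (it.hasNext() && totalSize > maxSizeBytes * 0.9) {
            totalSize -= it.next().getValue();
            it.remove();
        }
    }

    long totalSize() {
        return totalSize;
    }

    boolean contains(String key) {
        return entrySizes.containsKey(key);
    }
}
```

With a 100-byte limit, putting entries of 50, 40, and 30 bytes triggers a prune on the last put that evicts only the oldest entry, leaving the cache at 70 bytes.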

Contributor

@joebowbeer joebowbeer left a comment


The change looks simple enough and it's great to see unit tests, but I have two concerns:

  1. How can you be sure that the subsequent prune will never remove the entry that was just added? I'd like to see a mathematical proof for this.

  2. The bug report for this was internally generated. That is, it was not reported by a user but by a developer. I worry that changing the behavior at this time for no certain benefit to a real user is not worth the risk it introduces.

Better to save this for the next release.

@jpd236
Collaborator Author

jpd236 commented May 31, 2018

Thanks for your comments!

Re: 1 - there is absolutely no guarantee that a new entry won't get removed. It will be removed if it either takes over 100% of the cache size on its own, or over 90% of the cache size if other entries were already present (due to hysteresis). To me it seems like a bug that you could put in an entry that was larger than the requested maximum size (with no size limit at all!) and have it stick around until the next put() call. Why expect such a large entry to be retained?

Re: 2 - Yeah, that's fair. It might have been relevant to whatever was happening in #145 but we never really got to the bottom of that. Moving this to the 1.2.0 milestone without evidence of a user hitting bugs due to the current behavior.

@jpd236 jpd236 added this to the 1.2.0 milestone May 31, 2018
@joebowbeer
Contributor

joebowbeer commented May 31, 2018

To be clear, the situation that I most want to avoid is adding a new entry, at great cost, and then immediately pruning it, along with everything else, before the cached entry is ever used.

@jpd236
Collaborator Author

jpd236 commented May 31, 2018

I think this may not have been clear - the cache data structure is a LinkedHashMap, which preserves insertion order. Entries are removed in the order they were inserted, so the newly inserted entry is the last one we'd remove, and we'd only remove it if it alone exceeded the hysteresis mark (90% of the cache size), since every other entry would be removed first. One of the new unit tests verifies the order of removals.
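A small demonstration of why the newest entry is the last pruning candidate, relying only on LinkedHashMap's documented insertion-order iteration (the class and method names here are hypothetical, not from Volley):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InsertionOrderDemo {
    // Removes every entry from an insertion-ordered map, recording the
    // order in which keys are evicted. Oldest keys come out first, so the
    // most recently inserted key is always evicted last.
    static List<String> removalOrder() {
        Map<String, Integer> entries = new LinkedHashMap<>();
        entries.put("oldest", 1);
        entries.put("middle", 2);
        entries.put("newest", 3);
        List<String> order = new ArrayList<>();
        for (String key : new ArrayList<>(entries.keySet())) {
            order.add(key);
            entries.remove(key);
        }
        return order;
    }
}
```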

I don't believe there should be significant churn here as a result of this change compared to what we had before, assuming reasonably sized cache entries relative to the maximum size.

@joebowbeer
Contributor

joebowbeer commented May 31, 2018

I understand the LRU nature.

The 'bug' is that an excessively large entry can be cached, and it won't be removed until the next entry is written.

However, in this same scenario, your solution will clear the entire cache and leave nothing, which is also not good.

I think there's a middle ground that works better: don't cache if entry.data.length is larger than some threshold (max size? hysteresis?). Otherwise, do cache, but don't prune the entry that was just written.

I do think it is better to prune before writing a new entry. This minimizes the peak cache size, and also preserves the newest entry.

In short, a better solution is to simply test entry.data.length at the top of the put method.

@jpd236
Collaborator Author

jpd236 commented May 31, 2018

Yeah, I was considering an up-front check to make sure the new entry being written isn't too large before writing it. I was leaning against it because entry.data.length is technically the wrong length to evaluate (it's not counting the header overhead), and so to get it precisely right would require an up-front calculation of the header size which is what I was hoping to avoid.

If we just use entry.data.length, we might write an entry only to find that with the cache overhead, it's now too large to fit. But perhaps that's reasonable enough - it would allow us to eliminate caching for entries which are definitely too big, while doing a bit more work for entries which are right at the boundary.

If an entry is so large that adding it to the cache would trigger a
prune, and if that prune would clear the new entry as well, then skip
adding it in the first place.

For simplicity, we don't include the cache header overhead in this
calculation; this implies that we may still put and immediately clear
an entry if it is very close to the threshold. But this won't result
in incorrect behavior; it will just lead to a bit of file churn. We
could resolve this in the future by calculating the header overhead.
@jpd236
Collaborator Author

jpd236 commented Aug 16, 2018

PTAL - I added a check to put() which prevents writing a cache entry that would be pruned immediately. This reduces the likelihood of churn, although it's still possible in some edge cases because we don't include the header size in that calculation. I think this is a reasonable compromise: ultimately this is just an optimization to avoid some churn, and calculating the header size up front is non-trivial, although feasible. Let me know if you disagree.
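A hedged sketch of the kind of up-front guard being described. The class and method names are hypothetical; the 0.9 hysteresis factor matches the 90% figure discussed above, and, as noted, header overhead is deliberately ignored:

```java
class PutGuard {
    // Prune target: 90% of the maximum cache size (hysteresis).
    static final float HYSTERESIS_FACTOR = 0.9f;

    // Returns true if an entry of this data length could survive a prune,
    // i.e. it fits within 90% of the maximum cache size on its own. Header
    // overhead is not counted, so entries right at the boundary may still
    // be written and then immediately pruned.
    static boolean wouldBeRetained(long dataLength, long maxCacheSizeBytes) {
        return dataLength <= maxCacheSizeBytes * HYSTERESIS_FACTOR;
    }
}
```

Calling such a check at the top of put() skips the disk write entirely for entries that are definitely too large, at the cost of a little extra churn for borderline sizes.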

Contributor

@joebowbeer joebowbeer left a comment


LGTM
