Enforce disk size limit more strictly in DiskBasedCache. #196
Conversation
- Previously, when new entries were written to the cache with put(), the size accounted for in those entries did not include cache overhead when actually storing them on disk. That overhead is now included.
- We now prune excess entries after writing new entries to disk instead of before (so that we know the size-on-disk). This prevents the cache from staying over the size limit because of one large entry until the next put. The cache will now exceed the maximum size when writing an entry which pushes it over the limit; however, this excess will be cleared before put() returns rather than at some indefinite time in the future.

Fixes google#57
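For illustration, here is a minimal sketch of the bookkeeping order this change describes. The names (SizeTrackingCache, pruneIfNeeded, mMaxSizeBytes) are made up for the example; it models only the accounting, not the real DiskBasedCache or its file I/O:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the put-then-prune ordering described above.
// This is not the actual DiskBasedCache implementation; real file I/O is elided.
class SizeTrackingCache {
    // Insertion-ordered map: key -> size on disk.
    private final LinkedHashMap<String, Long> mEntries = new LinkedHashMap<>();
    private long mTotalSize = 0;
    private final long mMaxSizeBytes;
    // Prune until we are below this fraction of the maximum size.
    private static final float HYSTERESIS_FACTOR = 0.9f;

    SizeTrackingCache(long maxSizeBytes) {
        mMaxSizeBytes = maxSizeBytes;
    }

    void put(String key, long dataLength, long headerOverhead) {
        // The write to disk would happen here. The size we account for is the
        // full size on disk (data plus header overhead), not just dataLength.
        long sizeOnDisk = dataLength + headerOverhead;
        mTotalSize += sizeOnDisk;
        mEntries.put(key, sizeOnDisk);
        // Prune *after* the write, so any excess is cleared before put() returns.
        pruneIfNeeded();
    }

    boolean contains(String key) {
        return mEntries.containsKey(key);
    }

    private void pruneIfNeeded() {
        if (mTotalSize <= mMaxSizeBytes) {
            return;
        }
        Iterator<Map.Entry<String, Long>> it = mEntries.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> oldest = it.next(); // insertion order: oldest first
            mTotalSize -= oldest.getValue();
            it.remove();
            // Stop once we are under the hysteresis mark (90% of the maximum).
            if (mTotalSize < mMaxSizeBytes * HYSTERESIS_FACTOR) {
                break;
            }
        }
    }
}
```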
The change looks simple enough and it's great to see unit tests, but I have two concerns:
1. How can you be sure that the subsequent prune will never remove the entry that was just added? I'd like to see a mathematical proof for this.
2. The bug report for this was internally generated. That is, it was not reported by a user but by a developer. I worry that changing the behavior at this time for no certain benefit to a real user is not worth the risk it introduces. Better to save this for the next release.
Thanks for your comments!

Re: 1 - There is absolutely no guarantee that a new entry won't get removed. It will be removed if it either takes over 100% of the cache size on its own, or over 90% of the cache size if other entries were already present (due to hysteresis). To me it seems like a bug that you could put in an entry that was larger than the requested maximum size (with no size limit at all!) and have it stick around until the next put() call. Why expect such a large entry to be retained?

Re: 2 - Yeah, that's fair. It might have been relevant to whatever was happening in #145, but we never really got to the bottom of that. Moving this to the 1.2.0 milestone without evidence of a user hitting bugs due to the current behavior.
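As a concrete illustration of those two cases, using the hypothetical SizeTrackingCache sketched above (all numbers are made up; assume a 100-byte limit and therefore a 90-byte hysteresis mark):

```java
SizeTrackingCache cache = new SizeTrackingCache(100);

// Case 1: a lone entry whose size on disk (115 + 5 = 120) exceeds the
// 100-byte limit by itself. The prune that follows put() removes it immediately.
cache.put("huge", 115, 5);
// cache.contains("huge") -> false

// Case 2: other entries are already present. The new entry (90 + 5 = 95 bytes)
// pushes the total over the limit, and pruning continues until the total drops
// below the 90-byte hysteresis mark, which removes the new entry as well.
cache.put("a", 15, 5);
cache.put("b", 15, 5);
cache.put("big", 90, 5);
// cache.contains("big") -> false
```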
To be clear, the situation that I most want to avoid is adding a new entry, at great cost, and then immediately pruning it, along with everything else, before the cached entry is ever used.
I think this may not have been clear - the cache data structure is a LinkedHashMap, which preserves insertion order. So entries are removed in the order they were inserted, and the newly inserted entry is going to be the last entry we remove; we'd only remove it if it alone were over the hysteresis mark (90% of the cache size), as we'd remove every other entry before looking at the latest one. One of the new unit tests verifies the order of removals. I don't believe there should be significant churn here as a result of this change compared to what we had before, assuming reasonably sized cache entries relative to the maximum size.
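A quick standalone check of the insertion-order property being relied on here (this is not the unit test from the PR, just a demonstration that a default LinkedHashMap iterates oldest-first, so a prune loop reaches the newest entry last):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InsertionOrderDemo {
    public static void main(String[] args) {
        // A default (insertion-ordered) LinkedHashMap, as used for the cache index.
        Map<String, Long> entries = new LinkedHashMap<>();
        entries.put("oldest", 10L);
        entries.put("middle", 10L);
        entries.put("newest", 10L);
        // Iteration visits entries in insertion order, so a prune loop would
        // consider "oldest" first and "newest" last.
        entries.keySet().forEach(System.out::println); // oldest, middle, newest
    }
}
```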
I understand the LRU nature. The 'bug' is that an excessively large entry can be cached and it won't be removed until the next entry is written. However, in this same scenario, your solution will clear the entire cache and leave nothing, which is also not good. I think there's a middle ground that works better: don't cache if entry.data.length is larger than some threshold (max size? hysteresis?). Otherwise, do cache, but don't prune the entry that was just written. I do think it is better to prune before writing a new entry. This minimizes the peak cache size, and also preserves the newest entry. In short, a better solution is to simply test entry.data.length at the top of the put method.
Yeah, I was considering an up-front check to make sure the new entry being written isn't too large before writing it. I was leaning against it because entry.data.length is technically the wrong length to evaluate (it doesn't count the header overhead), and so getting it precisely right would require an up-front calculation of the header size, which is what I was hoping to avoid. If we just use entry.data.length, we might write an entry only to find that, with the cache overhead, it's now too large to fit. But perhaps that's reasonable enough - it would allow us to eliminate caching for entries which are definitely too big, while doing a bit more work for entries which are right at the boundary.
If an entry is so large that adding it to the cache would trigger a prune, and if that prune would clear the new entry as well, then skip adding it in the first place. For simplicity, we don't include the cache header overhead in this calculation; this implies that we may still put and immediately clear an entry if it is very close to the threshold. But this won't result in incorrect behavior; it will just lead to a bit of file churn. We could resolve this in the future by calculating the header overhead.
PTAL - I added a check to put() which prevents writing a cache entry that would be pruned immediately. This reduces the likelihood of churn, although it's still possible in some edge cases because we don't include the header size in that calculation. I think this is a reasonable compromise, since ultimately this is just an optimization to avoid some churn, and calculating the header size by itself is non-trivial, although feasible. Let me know if you disagree.
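Sketched against the hypothetical SizeTrackingCache above, the added check might look roughly like this (again, illustrative names rather than the actual DiskBasedCache code; note that it deliberately compares the raw data length and ignores header overhead):

```java
void put(String key, long dataLength, long headerOverhead) {
    // If the data alone is over the hysteresis mark, writing it would only
    // trigger a prune that deletes it again, so skip caching it outright.
    // Header overhead is deliberately ignored here, so an entry right at the
    // boundary may still be written and then pruned once (harmless churn).
    if (dataLength > mMaxSizeBytes * HYSTERESIS_FACTOR) {
        return;
    }
    long sizeOnDisk = dataLength + headerOverhead;
    mTotalSize += sizeOnDisk;
    mEntries.put(key, sizeOnDisk);
    pruneIfNeeded();
}
```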
LGTM