
DiskBasedCache rewrite with unit tests #13

Closed
wants to merge 1 commit

Conversation

@joebowbeer (Contributor)

Rewrite of DiskBasedCache to fix Issue #12

Changes:

  1. DiskBasedCache rewritten using DataInputStream and DataOutputStream for serialization, and incorporating a dataLength field in the cached header in order to detect file corruption before encountering the troublesome OutOfMemoryError (see the sketch after this list).

  2. DiskBasedCacheTest unit tests rewritten.
    All tests passing: https://circleci.com/gh/joebowbeer/volley/11
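
A minimal write-side sketch of the dataLength idea from point 1 (the field order, names, and magic value here are illustrative, not the PR's actual layout):

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative only: the real header carries more fields (key, ETag, TTLs, response headers).
class WriteSketch {
    static final int CACHE_MAGIC = 0x20170214; // hypothetical magic value

    static void writeEntry(File file, String key, byte[] data) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(CACHE_MAGIC);
            out.writeUTF(key);
            out.writeInt(data.length); // dataLength: written last, so the reader can
                                       // validate the header before allocating anything
            out.write(data);
        }
    }
}
```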

@jpd236 (Collaborator) commented Feb 14, 2017

First of all - thank you for taking the time to put this together, and in particular for including extensive tests with your PR. In the future, you may want to wait for someone to respond to the initial issue request before putting the PR together, as for something this large it's likely we'll have fairly fundamental concerns/questions to resolve first.

So, my high level concerns:

  • I'm concerned about breaking this in a way that completely invalidates all existing cache entries. While a rewrite might lead to a cleaner end result, I think it would need to be done in a way that attempts to migrate any existing data over rather than blowing it away. As such, it may be best to keep this more incremental and stick to the existing structure, adding new fields as needed but failing gracefully if they're not defined.

  • I'm not sure I understand how a length field helps with corruption (or more precisely, why it wouldn't be possible for corrupt data to lead to similar exceptions). DataInputStream does not offer any intrinsic protection against corrupt data to my knowledge, so using it doesn't seem like it would prevent situations where the cache data causes an allocation attempt that is invalid. That is - if the cache header itself is corrupted, don't you run into the same problems?

  • Fundamentally, if you want to guard against corrupt data here, I think there are two changes you need: a checksum on the data, and a guard against extremely large allocations prior to evaluating that checksum (say, larger than the maximum size of the cache). Of course, the checksum needs to be fast given that the intent of this is to save time vs. fetching data from the network.
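
For illustration, a minimal sketch of what those two guards might look like together, using CRC32 as the fast checksum and a hypothetical maxCacheSizeInBytes bound (this is not Volley's actual code or format):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

class ChecksumGuardSketch {
    static byte[] readValidated(File file, long maxCacheSizeInBytes) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int length = in.readInt();
            long storedCrc = in.readLong();
            // Guard 1: refuse any allocation larger than the cache could legitimately hold,
            // before evaluating the checksum (and before allocating the array).
            if (length < 0 || length > maxCacheSizeInBytes) {
                throw new IOException("Implausible entry length: " + length);
            }
            byte[] data = new byte[length];
            in.readFully(data);
            // Guard 2: a fast integrity check; CRC32 is cheap relative to a network fetch.
            CRC32 crc = new CRC32();
            crc.update(data);
            if (crc.getValue() != storedCrc) {
                throw new IOException("Checksum mismatch; entry should be discarded");
            }
            return data;
        }
    }
}
```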

@yilanl - added you in case you have additional thoughts/concerns/disagreements.

@joebowbeer (Contributor, Author)

I'll try to respond to each of your concerns. Let me know if I don't touch on something.

First, let me discuss the underlying problem in more detail.

OOME in DiskBasedCache.streamToBytes has been reported periodically for almost 4 years.

Here's a fairly thorough report from 2013:

https://code.google.com/p/android/issues/detail?id=61406

If the problem happens during initialization, as it tends to, the app is generally hosed.

If you are able to search for this exception in the Play Store, I expect you'll find that several apps are plagued with these.

The sister exception is NegativeArraySizeException. I think both of these exceptions should be addressed in the same way because they are both symptoms of the same problem: bad or mismatched length. Given that both of these problems occur, there are probably other instances where the problem goes undetected because the computed length is neither negative nor so large as to cause an OOME.

I'm concerned about breaking this in a way that completely invalidates all existing cache entries. [...]

On the other hand, the cache magic has changed a few times over the years without any attempt to migrate existing entries. (Right? Or did I miss something?) In addition, the default size of the disk cache is only 5MB.

For those users/apps that are hitting this problem, it's much better to be able to actually initialize the cache than to retain old entries.

Devs who are seeing these OOME issues want a solution, and I think the best solution is a new cache implementation. I can think of a couple ways to make this available other than replacing the existing class:

  1. Offer this as a sibling implementation. Call it ResilientDiskBasedCache. By default, new instantiations of Volley would use the new implementation, and existing instantiations would use the old. Except that devs can override this choice, as they would want to do if their existing users are being hit with OOME's.

  2. Publish this implementation as a new library. Let's say I call it com.joebowbeer.volley.diskbasedcache. Then devs can choose to use this cache implementation instead of the existing one.

I'm not sure I understand how a length field helps with corruption (or more precisely, why it wouldn't be possible for corrupt data to lead to similar exceptions). [...]

Adding a length field at the end of the header serves to validate the header on disk. If, after reading all of the preceding fields, the length is then read and it matches the remaining file length, then we can be reasonably certain that the header has been read correctly.
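
A read-side sketch of that validation (the field order is illustrative, and the header-size arithmetic assumes an ASCII key, since writeUTF stores a 2-byte length prefix plus modified UTF-8):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class ReadSketch {
    static byte[] readEntry(File file, int expectedMagic) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            if (in.readInt() != expectedMagic) throw new IOException("Bad magic");
            String key = in.readUTF();
            int dataLength = in.readInt(); // the new field, read after the other header fields
            // Header bytes consumed so far: 4 (magic) + 2 + key bytes (writeUTF) + 4 (dataLength).
            long headerSize = 4 + 2 + key.getBytes(StandardCharsets.UTF_8).length + 4;
            // If the length field does not match what is left in the file, the header did not
            // deserialize cleanly, so fail before attempting the allocation.
            if (dataLength < 0 || dataLength != file.length() - headerSize) {
                throw new IOException("Length mismatch; header is corrupt");
            }
            byte[] data = new byte[dataLength];
            in.readFully(data);
            return data;
        }
    }
}
```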

The changes in the new implementation were designed to prevent the kinds of problems that are spelled out in the unit tests. In practice, these changes seem to be sufficient to prevent the OOME and NASE that were previously caused by bad length calculations. Unfortunately, because I don't know the root cause of these OOME's (nor has anyone else been able to determine it in the past 4 years), I don't know if all of these changes are necessary.

As an experiment, try targeting the new unit tests against the old cache implementation, and you will see that the old implementation fails several of the tests, such as testGetWrongKey. The new implementation is also more resilient in the presence of failures. It may be that these failures are the root cause for the OOME's in the field...

If you can find a more compatible implementation that passes these tests, perhaps you will have fixed the troublesome issues in a better way...

Fundamentally, if you want to guard against corrupt data here, I think [...]

I have implemented a poor-man's validation for the header that seems to be effective. I'm not attempting to validate the user's data. The user can do that if they want. (On the other hand, no one has been complaining about corrupt user data, right?) I suspect that the underlying cause of the OOME's is more refined than simply random "corruption".

@jpd236 (Collaborator) commented Feb 16, 2017

Acknowledged re: the underlying problem.

You're right that the magic has changed over the years. In a situation where it would be unavoidable to change it, then it could make sense to do so. But it has not changed since the "official" 1.0.0 release, which makes me hesitant to endorse a change now without a best-effort attempt to migrate existing entries.

I suspect you could probably stamp out the major corruption issues you're seeing here if you just put a sanity check in streamToBytes that verifies that the number of bytes being read is less than the maximum size of the cache (with cushion as needed). That's a much smaller/safer change that doesn't touch the file format, in the same way that the fix for the negative array size exception just catches the invalid data exception and deletes it.
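
Roughly this shape, sketched against a hypothetical streamToBytes signature rather than the actual method:

```java
import java.io.IOException;
import java.io.InputStream;

class StreamToBytesSketch {
    static byte[] streamToBytes(InputStream in, long length, long maxCacheSizeInBytes)
            throws IOException {
        // Reject lengths that cannot possibly be valid before allocating the array;
        // after this check the narrowing cast below is safe.
        if (length < 0 || length > maxCacheSizeInBytes) {
            throw new IOException("Implausible entry length: " + length);
        }
        byte[] bytes = new byte[(int) length];
        int pos = 0;
        int count;
        while (pos < bytes.length
                && (count = in.read(bytes, pos, bytes.length - pos)) != -1) {
            pos += count;
        }
        if (pos != bytes.length) {
            throw new IOException("Expected " + bytes.length + " bytes, read " + pos);
        }
        return bytes;
    }
}
```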

Beyond that, putting in an explicit length could also be fine (it's a form of primitive checksum here). But that could be done in a smaller change, keeping the existing format overall, but adding a new length field to the end and calculating the length for any entries that don't have them rather than discarding them.

I don't think a full rewrite of the class is in order here, especially given your point that you "don't know if all of these changes are necessary". That's not a slight against your work - it's just that Volley is years old and relatively stable, which means I'm going to naturally be averse to significant rewrites of anything without solid evidence that they're needed, vs. very targeted and small changes that address specific issues. The rewrite may fix this issue but bring about other ones.

@joebowbeer (Contributor, Author) commented Feb 16, 2017

Acknowledged re: the underlying problem.

I wish we knew what the root cause was. A lot of the OOME's and NASE's are thrown while reading the string-string map during initialization, but this is not the only location. For all I know, there is a specific input or usage pattern that causes this issue. (In which case, the NASE patch that was recently applied is merely a band-aid.)

If you need more evidence that this is a serious problem, or a variety of stack traces, please let me know. I can probably dig some up.

Before this repo showed up on GitHub (recently), most of these issues were filed against the mcxiaoke clone, where this problem has been reported repeatedly for a few years. This is the most recent incarnation:

mcxiaoke/android-volley#141

Note the large number of me-too's.

I suspect you could probably stamp out the major corruption issues you're seeing here if you just put a sanity check in streamToBytes that verifies that the number of bytes being read is less than the maximum size of the cache (with cushion as needed). [...]

Why do you think that additionally checking for extremely large sizes would be effective? We know that negative sizes are being miscalculated (NASE), and we know that extremely large sizes are being miscalculated (OOME); wouldn't it also be likely that bad intermediate sizes are sometimes being calculated?

If the problem is caused by a specific kind of input or usage, then the kind of patch you are suggesting may only result in deleting these same troublesome entries over and over again.

I don't think a full rewrite of the class is in order here, especially given your point that you "don't know if all of these changes are necessary". [...]

The existing implementation uses its own custom serialization format instead of relying on DataInputStream and DataOutputStream, and the existing unit tests are very minimal. In fact, almost all of the existing tests serve only to test the serialization format, and would contribute nothing if DataInputStream and DataOutputStream were used for serialization.

What I implemented could be viewed as a rewrite, but it really consists of these changes:

  1. Conversion to DataInputStream and DataOutputStream for serialization
  2. Limit the size of the string-string header map to an unsigned short (at most 65,535 headers). In one of the bug reports, the reporter felt that it was the home-brewed long serialization that was buggy, and there is no reason why a larger number of headers needs to be supported.
  3. Implement very rigorous exception handling and recovery. See, for example, testFileIsDeletedWhenWriteHeaderFails.
  4. Add a dataLength field for extra validation of the header and a sanity check for the length calculation.

PTAL. If there is no way forward with this PR, then I will publish it as a standalone library in hopes that the devs who are struggling with this issue will be able to use my DiskBasedCache implementation.

@jpd236 (Collaborator) commented Feb 18, 2017

I would certainly like to get this fixed in Volley in some form given the number of reports of the issue.

Why do you think that additionally checking for extremely large sizes would be effective? We know that negative sizes are being miscalculated (NASE), and we know that extremely large sizes are being miscalculated (OOME); wouldn't it also be likely that bad intermediate sizes are sometimes being calculated?

I'm not attempting to validate the user's data. The user can do that if they want.

I don't think the goal here is clear. As your first point says, yes, corruption is possible in many forms. So we can either try to guard against that fully, or we can throw up our hands and pass it back to the user while avoiding crashes. The approach in this change is somewhere in the middle.

So the first question is to decide on a goal. It seems like the two reasonable options are to either prevent crashes but return data without worrying much about corruption, or to try and ensure full data integrity.

I am thus open to one of the following paths forward:

  • The former can be accomplished with a sanity check on the length to prevent large allocations. That is a small and simple change and a way forward that will stop the crashes. Bad intermediate sizes would just return corrupted data to the client (which is already a possibility, given the current code).
  • A length is insufficient for the latter; you need an integrity checksum of some sort. But even in doing this, as I said, I do not think we should have to rewrite the entire serialization format and logic to do so. "Conversion to DataInputStream and DataOutputStream for serialization" is a change without a clear purpose towards one of these goals, and it makes the change significantly riskier to take. It should be sufficient to add a checksum using the current serialization system while migrating existing entries (and assuming they are not corrupt while adding their checksum during the migration).

the reporter felt that it was the home-brewed long serialization that was buggy

Whether or not this is true can be easily verified on its own merits (the serialization is a simple method... is there an actual example of how it might be going wrong?)

@joebowbeer (Contributor, Author)

My goal in making the changes I did was to prevent, or to at least detect and recover from, the states that, unchecked, were leading to persistent OOME and NASE. This is similar to the first goal of the two you listed.

Regarding the length check, I think there's a tighter comparison: compare the computed length with the file length.

The long field used to hold the number of headers is also a weak spot. I recommend failing/removing if this value is larger than unsigned short. (Unsigned short represents a number of headers that is larger than any http server supports.)
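
A sketch of that bound (the surrounding field layout is illustrative; the point is only that the long count is checked before it is used):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class HeaderCountSketch {
    static final long MAX_HEADERS = 0xFFFF; // unsigned short upper bound

    static Map<String, String> readHeaders(DataInputStream in) throws IOException {
        long count = in.readLong(); // the count as stored on disk (a long, per the discussion above)
        if (count < 0 || count > MAX_HEADERS) {
            // An implausible count means the entry is corrupt; fail so the caller removes it.
            throw new IOException("Implausible header count: " + count);
        }
        Map<String, String> headers = new HashMap<>();
        for (long i = 0; i < count; i++) {
            headers.put(in.readUTF(), in.readUTF());
        }
        return headers;
    }
}
```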

Hopefully the aforementioned changes are enough to prevent unrecoverable OOME.

The other important change, I believe, is to implement very rigorous exception catching and recovery, so that any error reading or writing or validating an entry will remove the bad entry.

With these changes, I think most of the new unit tests will still apply.

If you are interested, I can try to back-port the new unit tests to the old code, making the above changes so that the tests pass - but without changing the serialization format. And maybe this will be sufficient.

@jpd236 (Collaborator) commented Feb 20, 2017

Regarding the length check, I think there's a tighter comparison: compare the computed length with the file length.

Not sure I follow - what computed length? My understanding here is that the stored header doesn't offer any insight into the expected length of the file, so the only piece of information we have here is the file size itself. If you have a tighter constraint in mind than the overall cache size, though, I'm certainly open to it.

The long field used to hold the number of headers is also a weak spot. I recommend failing/removing if this value is larger than unsigned short. (Unsigned short represents a number of headers that is larger than any http server supports.)

This is an interesting thought. I don't see any restriction on the maximum number of headers in the HTTP spec, just de facto restrictions in HTTP server implementations (which is what you're specifically pointing out). While your solution is probably reasonable in any practical case - even if your server does support more, why would you be sending tens of thousands of headers in your responses - it is theoretically possible for someone to send more than that many headers and have it be fully valid HTTP. And I've seen too many examples over time of apps which stretch "valid" to great extents :)

There are some safe but looser restrictions, though, based on the max cache size. That'd restrict you to somewhere on the order of a million headers, which is still three orders of magnitude better than the maximum int size that it uses today.

And, of course, there is always the possibility of adding new fields to the serialization format without dropping old data, as noted earlier. To be concrete - you would bump the magic as before, but keep the old magic around for reference; if you read data with the old magic, you'd calculate the header length and file length and store it in the new format.
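
A toy sketch of the migrate-on-read idea, with a two-field header so it stays short (real entries carry many more fields, and both magic values here are hypothetical):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

class MigrationSketch {
    static final int OLD_MAGIC = 0x20120504; // hypothetical old value, kept around for reference
    static final int NEW_MAGIC = 0x20170316; // hypothetical bumped value

    static byte[] read(File file) throws IOException {
        long fileLength = file.length();
        boolean needsUpgrade = false;
        byte[] data;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int magic = in.readInt();
            if (magic == NEW_MAGIC) {
                int dataLength = in.readInt();
                // New format: 4 bytes magic + 4 bytes length precede the payload.
                if (dataLength != fileLength - 8) throw new IOException("Corrupt entry");
                data = new byte[dataLength];
            } else if (magic == OLD_MAGIC) {
                // Old format has no length field: derive the length from the file size.
                data = new byte[(int) (fileLength - 4)];
                needsUpgrade = true;
            } else {
                throw new IOException("Unrecognized magic: " + magic);
            }
            in.readFully(data);
        }
        if (needsUpgrade) {
            // Rewrite the entry under the new magic so future reads get the full validation.
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(file)))) {
                out.writeInt(NEW_MAGIC);
                out.writeInt(data.length);
                out.write(data);
            }
        }
        return data;
    }
}
```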

The other important change, I believe, is to implement very rigorous exception catching and recovery, so that any error reading or writing or validating an entry will remove the bad entry.

Generally this sounds good to me. My main concern up front would be either:

  • Wrapping everything in a big catch block would mask exceptions that represent something going wrong other than corruption.
  • Wrapping lots of things in small catch blocks would make for harder to maintain code. (Hard to know for sure without trying it).

From an ideal perspective, if we can keep it simple and only try-catch around some core method(s) that are specifically responsible for reading data from disk, I think that's ideal. Does that make sense?

If you are interested, I can try to back-port the new unit tests to the old code, making the above changes so that the tests pass - but without changing the serialization format. And maybe this will be sufficient.

To some extent, it feels to me like a unit test shouldn't be tightly tied to the serialization format. Ideally, the point of a unit test is to catch regressions if you change the code under test; if changing the code requires rewriting the whole unit test, then it serves little value.

So while I definitely (and heartily!) approve of new unit tests, my hope would be that they can be specified in a generic way, e.g. write data, corrupt the underlying data, read data and ensure no crash, such that they still serve a purpose without being tightly tied to the specific implementation.
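
In that spirit, a sketch of a format-agnostic corruption test (JUnit 4 with the public Volley Cache API; it makes no assumptions about the bytes on disk):

```java
import com.android.volley.Cache;
import com.android.volley.toolbox.DiskBasedCache;
import java.io.File;
import java.io.RandomAccessFile;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

public class CorruptionTestSketch {
    @Rule public TemporaryFolder temp = new TemporaryFolder();

    @Test
    public void corruptedEntryDoesNotCrash() throws Exception {
        DiskBasedCache cache = new DiskBasedCache(temp.getRoot());
        cache.initialize();

        Cache.Entry entry = new Cache.Entry();
        entry.data = new byte[] {1, 2, 3};
        cache.put("key", entry);

        // Flip one byte in the middle of the entry's file, knowing nothing about the format.
        File file = temp.getRoot().listFiles()[0];
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            long middle = raf.length() / 2;
            raf.seek(middle);
            int original = raf.read();
            raf.seek(middle);
            raf.write(original ^ 0xFF);
        }

        // The whole point: neither call may crash, whatever the implementation returns.
        cache.initialize();
        cache.get("key");
    }
}
```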

@joebowbeer (Contributor, Author)

I added a lot more information to the report: Issue #12

The exception handling needs to be per-entry so that uncaught exceptions while reading or writing the entry will remove the entry. Otherwise the bad entry persists.
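
For example, something with this shape, where any failure while reading an entry deletes the offending file and is reported as a miss (names and the exact catch list are illustrative):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

class PerEntryRecoverySketch {
    /** Returns the entry's bytes, or null (a miss) after deleting the entry if anything goes wrong. */
    static byte[] getOrRemove(File file) {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int length = in.readInt();
            // Bound the allocation first so a corrupt length cannot trigger an OOME.
            if (length < 0 || length > file.length()) {
                throw new IOException("Implausible length: " + length);
            }
            byte[] data = new byte[length];
            in.readFully(data);
            return data;
        } catch (IOException | RuntimeException e) {
            // Per-entry recovery: the bad entry must not survive, or it will fail again
            // on every subsequent read or initialization.
            file.delete();
            return null;
        }
    }
}
```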

After looking at stack traces and at the repro case attached to the original bug report, one local improvement comes to mind: validate the long string byte length before casting to int.
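
Concretely, something like the following, where the method shape is illustrative and the point is the bounds check before the narrowing cast and the allocation:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class ReadStringSketch {
    static String readString(DataInputStream in, long bytesRemainingInFile) throws IOException {
        long n = in.readLong(); // the on-disk string length is a long
        // Validate before casting to int and before allocating: a corrupt or mismatched
        // length is caught here instead of surfacing later as NASE or OOME.
        if (n < 0 || n > bytesRemainingInFile) {
            throw new IOException("Implausible string length: " + n);
        }
        byte[] bytes = new byte[(int) n];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```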

The unit tests I wrote are in general not tied to serialization format. The three that are tied to serialization format are:

  1. testGetBadMagic
  2. testGetWrongKey
  3. testGetWrongLength // added field

The first two, at least, are necessary tests, and all these tests are OK going forward if you stick to your plan to append new fields to the header.

@wx2013 mentioned this pull request on May 3, 2017
@joebowbeer changed the title from "DiskBasedCache fix with unit tests" to "DiskBasedCache rewrite with unit tests" on Jul 7, 2017
@joebowbeer (Contributor, Author)

See #52
