Error handling for broken ARC/WARC files #234
Comments
Is this in reference to the errors described in #222? If so, @anjackson's comment offers a way forward.
You'll probably hit other oddities that mean you need to be able to skip records -- we certainly have!
Yes and yes, I think.
I tracked down the issue yesterday - for some ARC records, the body content just isn't present for whatever reason (crawler glitch?). The headers say there are going to be n bytes, but the content doesn't appear, so the parser freaks out. I'm catching the exception so this doesn't croak the entire job.
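For reference, the skip logic can be sketched roughly like this. This is a minimal, self-contained sketch, not warcbase's actual parser: `record`, `readRecord`, and `parseAll` are hypothetical names, and the 4-byte length prefix stands in for the real ARC/WARC record headers. The point is just that a truncated body (EOFException) or a corrupt length field (NegativeArraySizeException) gets caught per record instead of killing the job:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class SkipBrokenRecords {

    /** Build a raw record: 4-byte big-endian length header followed by the body. */
    static byte[] record(int declaredLength, byte[] body) {
        return ByteBuffer.allocate(4 + body.length).putInt(declaredLength).put(body).array();
    }

    /** Parse one record; the header's declared length must match the bytes present. */
    static byte[] readRecord(byte[] raw) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
        int declared = in.readInt();       // header claims the body is `declared` bytes
        byte[] body = new byte[declared];  // NegativeArraySizeException if the header is corrupt
        in.readFully(body);                // EOFException if the body is truncated
        return body;
    }

    /** Parse every record, skipping broken ones instead of failing the whole job. */
    static List<byte[]> parseAll(List<byte[]> rawRecords) {
        List<byte[]> ok = new ArrayList<>();
        for (byte[] raw : rawRecords) {
            try {
                ok.add(readRecord(raw));
            } catch (IOException | NegativeArraySizeException e) {
                System.err.println("Skipping broken record: " + e);  // note it and move on
            }
        }
        return ok;
    }
}
```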
We're still running into the problem on a collection of WARC files collected by University of Alberta. Gist error dump can be found here. Maybe worth pushing the ARC changes into the WARC handler too?
Ah yes - I would expect the same issue w/ WARCs as well. I'm in the middle of doing the sub-artifact conversion, which involves a lot of code movement. Let me work on this after everything has stabilized?
Sounds good!
@ianmilligan1 Can you somehow give me access to the collection of WARCs that are breaking? I.e., either scp to trantor (stage on camalon) or copy to rho? Or give me access to whatever machine it's on. Will make it much easier to debug. I have a pretty good idea what's causing the issue, but I need to reproduce the error to decide on the best fix.
Doing so – transfer is a bit slow but should be done by morning, I think.
Fixed. commit 16db934
In many collections, we end up with java.lang.NegativeArraySizeException errors. This is probably because warcbase is expecting X bytes in a given ARC or WARC file but then encounters Y, throwing it off ad nauseam. We should build more robust error handling, perhaps just skipping the broken ARC/WARC and letting us know what file was skipped...
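The "skip the broken file and report it" idea could look roughly like the sketch below. This is not warcbase code: `parseAll` is a hypothetical helper, and the parse function is supplied by the caller. A corrupt input that throws (e.g. NegativeArraySizeException from a bad length field) just gets its file name recorded so the job can report what was dropped:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SkipBrokenFiles {

    /**
     * Parse each (name -> raw bytes) entry with `parse`; when a file throws,
     * skip it and remember its name instead of failing the whole job.
     * Successfully parsed results are appended to `results`.
     */
    static List<String> parseAll(Map<String, byte[]> files,
                                 Function<byte[], ?> parse,
                                 List<Object> results) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            try {
                results.add(parse.apply(e.getValue()));
            } catch (RuntimeException ex) {
                skipped.add(e.getKey());  // record which file was broken
            }
        }
        return skipped;
    }
}
```

The returned list of skipped names is what would let the job log "skipped file X" at the end, as suggested above.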