Error handling for broken ARC/WARC files #234
Comments
Is this in reference to the errors described in #222? If so, @anjackson's comment offers a way forward.
You'll probably hit other oddities that mean you need to be able to skip records -- we certainly have!
Yes and yes, I think.
I tracked down the issue yesterday - for some ARC records, the body content just isn't present for whatever reason (crawler glitch?). The headers say there are going to be n bytes, but the content doesn't appear, so the parser freaks out. I'm catching the exception so this doesn't croak the entire job.
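For reference, the skip logic can be sketched roughly like this. This is a minimal, self-contained sketch, not warcbase's actual parser: `record`, `readRecord`, and `parseAll` are hypothetical names, and the 4-byte length prefix stands in for the real ARC/WARC record headers. The point is just that a truncated body (EOFException) or a corrupt length field (NegativeArraySizeException) gets caught per record instead of killing the job:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class SkipBrokenRecords {

    /** Build a raw record: 4-byte big-endian length header followed by the body. */
    static byte[] record(int declaredLength, byte[] body) {
        return ByteBuffer.allocate(4 + body.length).putInt(declaredLength).put(body).array();
    }

    /** Parse one record; the header's declared length must match the bytes present. */
    static byte[] readRecord(byte[] raw) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
        int declared = in.readInt();       // header claims the body is `declared` bytes
        byte[] body = new byte[declared];  // NegativeArraySizeException if the header is corrupt
        in.readFully(body);                // EOFException if the body is truncated
        return body;
    }

    /** Parse every record, skipping broken ones instead of failing the whole job. */
    static List<byte[]> parseAll(List<byte[]> rawRecords) {
        List<byte[]> ok = new ArrayList<>();
        for (byte[] raw : rawRecords) {
            try {
                ok.add(readRecord(raw));
            } catch (IOException | NegativeArraySizeException e) {
                System.err.println("Skipping broken record: " + e);  // note it and move on
            }
        }
        return ok;
    }
}
```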
We're still running into the problem on a collection of WARC files collected by University of Alberta. Gist error dump can be found here. Maybe worth pushing the ARC changes into the WARC handler too?
Ah yes - I would expect the same issue w/ WARCs as well. I'm in the middle of doing the sub-artifact conversion, which involves a lot of code movement. Let me work on this after everything has stabilized?
Sounds good!
@ianmilligan1 Can you somehow give me access to the collection of WARCs that are breaking? I.e., either scp to trantor (stage on camalon) or copy to rho? Or give me access to whatever machine it's on. Will make it much easier to debug. I have a pretty good idea what's causing the issue, but I need to reproduce the error to decide on the best fix.
Doing so – transfer is a bit slow but should be done by morning, I think.
Fixed. commit 16db934
In many collections, we end up with java.lang.NegativeArraySizeException errors. This is probably because warcbase is expecting X bytes in a given ARC or WARC file but then encounters Y, throwing it off ad nauseam. We should build more robust error handling, perhaps just skipping the broken ARC/WARC and letting us know what file was skipped...
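The "skip the broken file and report it" idea could look roughly like the sketch below. This is not warcbase code: `parseAll` is a hypothetical helper, and the parse function is supplied by the caller. A corrupt input that throws (e.g. NegativeArraySizeException from a bad length field) just gets its file name recorded so the job can report what was dropped:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SkipBrokenFiles {

    /**
     * Parse each (name -> raw bytes) entry with `parse`; when a file throws,
     * skip it and remember its name instead of failing the whole job.
     * Successfully parsed results are appended to `results`.
     */
    static List<String> parseAll(Map<String, byte[]> files,
                                 Function<byte[], ?> parse,
                                 List<Object> results) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            try {
                results.add(parse.apply(e.getValue()));
            } catch (RuntimeException ex) {
                skipped.add(e.getKey());  // record which file was broken
            }
        }
        return skipped;
    }
}
```

The returned list of skipped names is what would let the job log "skipped file X" at the end, as suggested above.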