This repository has been archived by the owner on Apr 27, 2018. It is now read-only.

Error handling for broken ARC/WARC files #234

Closed
ianmilligan1 opened this issue Jun 14, 2016 · 10 comments

@ianmilligan1
Collaborator

In many collections, we end up with java.lang.NegativeArraySizeException errors. This is probably because warcbase is expecting X bytes in a given ARC or WARC file but then encounters Y, throwing it off ad nauseam.

We should build more robust error handling, perhaps just skipping the broken ARC/WARC and letting us know what file was skipped...
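A minimal sketch of the "skip the broken file and report it" idea, not warcbase's actual code. `ArchiveParser`, `Result`, and `processAll` are hypothetical names invented for illustration; the parser interface stands in for whatever really reads one ARC/WARC file.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SkipBrokenArchives {
    /** Hypothetical stand-in for whatever actually parses one ARC/WARC file. */
    public interface ArchiveParser {
        int countRecords(String path) throws IOException;
    }

    /** Total records read, plus the paths of the files that had to be skipped. */
    public static final class Result {
        public int records = 0;
        public final List<String> skipped = new ArrayList<>();
    }

    public static Result processAll(List<String> paths, ArchiveParser parser) {
        Result result = new Result();
        for (String path : paths) {
            try {
                result.records += parser.countRecords(path);
            } catch (IOException | RuntimeException e) {
                // NegativeArraySizeException is a RuntimeException, so it is caught here.
                // Record the file and move on instead of failing the whole run.
                result.skipped.add(path);
                System.err.println("Skipping broken archive " + path + ": " + e);
            }
        }
        return result;
    }
}
```

The `skipped` list is what would let us know which files were dropped. Catching `RuntimeException` this broadly is a deliberate trade-off in the sketch: it keeps the job alive, but it can also hide unrelated bugs.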

@jrwiebe
Collaborator

jrwiebe commented Jun 14, 2016

Is this in reference to the errors described in #222? If so, @anjackson's comment offers a way forward.

@anjackson
Contributor

You'll probably hit other oddities that mean you need to be able to skip records -- we certainly have!

@ianmilligan1
Collaborator Author

Yes and yes, I think.

@lintool
Owner

lintool commented Jun 15, 2016

I tracked down the issue yesterday - for some ARC records, the body content just isn't present for whatever reason (crawler glitch?). The headers say there are going to be n bytes, but the content doesn't appear, so the parser freaks out. I'm catching the exception so this doesn't croak the entire job.
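To illustrate the failure mode described above (the header declares n bytes but the body is missing or short), here is a hedged sketch of a defensive body read. It is not the code from the referenced commit; `RecordBodyReader` and `readBody` are hypothetical names. It rejects implausible lengths before allocating, which is one way to avoid a `NegativeArraySizeException`, and turns a short read into an ordinary `IOException` that a caller can catch per record.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class RecordBodyReader {
    /**
     * Reads exactly declaredLength bytes of record body from the stream.
     * Throws IOException if the length is implausible or the stream ends early.
     */
    public static byte[] readBody(InputStream in, long declaredLength) throws IOException {
        // Validate first: new byte[negative] is what raises NegativeArraySizeException.
        if (declaredLength < 0 || declaredLength > Integer.MAX_VALUE) {
            throw new IOException("Implausible declared length: " + declaredLength);
        }
        byte[] body = new byte[(int) declaredLength];
        try {
            // readFully throws EOFException if fewer bytes are available than requested.
            new DataInputStream(in).readFully(body);
        } catch (EOFException e) {
            throw new IOException(
                "Record truncated: header declared " + declaredLength + " bytes", e);
        }
        return body;
    }
}
```

A caller iterating over records could catch the `IOException`, log the offending record, and continue, so a single bad record does not kill the entire job.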

lintool added a commit that referenced this issue Jun 15, 2016
@ianmilligan1
Collaborator Author

We're still running into the problem on a collection of WARC files collected by University of Alberta.

Gist error dump can be found here.

Maybe worth pushing the ARC changes into the WARC handler too?

@lintool
Owner

lintool commented Jun 20, 2016

Ah yes - I would expect the same issue w/ WARCs as well. I'm in the middle of doing the sub-artifact conversion, which involves a lot of code movement. Let me work on this after everything has stabilized?

@ianmilligan1
Collaborator Author

Sounds good!

@lintool
Owner

lintool commented Jun 28, 2016

@ianmilligan1 Can you somehow give me access to the collection of WARCs that are breaking? I.e., either scp to trantor (stage on camalon) or copy to rho? Or give me access to whatever machine it's on. Will make it much easier to debug. I have a pretty good idea what's causing the issue, but I need to reproduce the error to decide on the best fix.

@ianmilligan1
Collaborator Author

Doing so – transfer is a bit slow but should be done by morning, I think.

@lintool
Owner

lintool commented Jun 29, 2016

Fixed. commit 16db934

@lintool lintool closed this as completed Jun 29, 2016