Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARCRecord entered inconsistent state for some ARC files #40

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

tokee
Copy link

@tokee tokee commented Feb 18, 2015

If no status could be found, ARCReader#readHttpHeader read forward until next status in the originating ARC-Stream, thereby reading past the current record and into the next one. This was combined with a faulty exclusion of status lines containing colons.

This pull request contains a sample ARC file that triggered the two bugs as well as unit tests for the bugs.

@anjackson
Copy link
Member

This 'off spec' hook looks like an ARC format variation edge case to me (see https://github.com/iipc/webarchive-commons/pull/40/files#diff-1a20f5668b210061c0267c7f079c8830L600), so I'd really like @nlevitt or @kngenie to take a look at this and see if they are happy with the change.

@tokee
Copy link
Author

tokee commented Feb 18, 2015

With the check for empty line (i.e. no more HTTP-response lines), keeping the check for colon just means that individual records are wrongly processed instead of the rest of the ARC file. But I do find it curious to discard well-formed status lines. If the colon-check it kept, maybe a rationale could be given as a comment in the code?

@anjackson
Copy link
Member

@tokee And, just to check, the removal of that colon-test is the only material change you're making right? All the other stuff is in the /test/ area, right?

@tokee
Copy link
Author

tokee commented Feb 18, 2015

Removing of colon and adding of newline-check are the only real changes, yes. We are in the process of testing with 95 known problem-ARCs (~100MB each) locally and I am happy to say that the first 9 has passed without ARC-parse-showstoppers with the new ARCReader.

@tokee
Copy link
Author

tokee commented Feb 18, 2015

Out test run finished with 94/95 previously partly-skipped ARC-files being fully processed. The last one seems to fail due to the HTTP-header-parser being too greedy (I'll update issue #41 with a description).

@thomasegense informs me that about 3% of our ARC files gets marked as problematic when indexing without the patch, so it seems that colon is not uncommon in HTTP-status lines.

@johnerikhalse
Copy link
Contributor

Seems to me that this change is all about supporting real arcs which actually conforms to the spec. But since it is possible for this change to let through records to OpenWayback which today would fail, I propose it to be included in a minor release and not the next bugfix release.

@johnerikhalse johnerikhalse added this to the 1.2.0 Release milestone Apr 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants