ARC variants with different interpretations of version-block length #82

Closed
tballison opened this issue Feb 7, 2024 · 7 comments
@tballison

From the unit tests, it looks like jwarc should be able to read ARC files. When I try to read the ARC test files from warcio, I get an exception.

Is this user error in how I'm calling jwarc or are ARC files not supported?

Test files:
https://github.com/webrecorder/warcio/blob/master/test/data/example.arc
https://github.com/webrecorder/warcio/blob/master/test/data/example.arc.gz

My code:

        try (InputStream is = Files.newInputStream(Paths.get("/.../example.arc"))) {
            WarcReader reader = new WarcReader(is);
            for (WarcRecord record : reader) {
                System.out.println(record.type());
            }
        }

"warcinfo" is printed once on the console, then there's an exception:

Exception (is the same for both files):

java.io.UncheckedIOException: org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: <-- HERE -->filedesc://live-web-example.arc.gz 127.0...

	at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:330)
	at org.netpreserve.jwarc.apitests.ArcTest.testMine(ArcTest.java:108)
.....
Caused by: org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: <-- HERE -->filedesc://live-web-example.arc.gz 127.0...
	at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:309)
	at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:159)
	at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:328)
	... 28 more
@ato added the bug label Feb 8, 2024
@ato
Member

ato commented Feb 8, 2024

There seems to be some variation in how the length field in the version block is calculated between different ARC files. jwarc's ARC support was tested against files generated by Heritrix and some other tools from the Internet Archive.

The example.arc file you linked to has a length value of "75" (0x4b) in the version block. This would exclude the two newlines at the end of it:

[image]

Whereas an ARC file in our collection sourced from the Internet Archive includes just the first newline as part of the length "76" (0x4c):

[image]

The ARC file format reference itself seems to introduce two more possible variations! It defines the length for the version block as:

The length specifies the size, in bytes, of the rest of the version block.

and the grammar for version-block defines it as ending with two newlines:

version-block == filedesc://<path><sp><version specific data><sp><length><nl>
                 <version-number><sp><reserved><sp><origin-code><nl>
                 <URL-record-definition><nl>
                 <nl> 

But reading carefully we see that doc is defined as starting with a single <nl>:

arc_file == <version_block><rest_of_arc_file>
rest_of_arc_file == <doc>|<doc><rest_of_arc_file>
doc == <nl><URL-record><nl><network_doc> 

So a strict reading of the grammar implies there should in fact be three newlines between the text "Archive-length" and the URL of the first doc, and the first two of them should count towards length as they're part of the version block.

If we look at the example in that same document, though, it uses a length of "76" (0x4c), has only two newlines, and counts both of them:

[image]
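
The three interpretations above can be told apart mechanically by comparing the declared length with the size of the text that follows the first line. The following is a minimal sketch, not jwarc's parser: the class `VersionBlockLength`, the method `countedNewlines`, and the sample header strings are all made up for illustration, and it assumes a v1-style version block with exactly three lines.

```java
import java.nio.charset.StandardCharsets;

final class VersionBlockLength {
    /**
     * Returns how many trailing newlines the declared length appears to count:
     * the declared length minus the size of lines 2 and 3, measured without
     * the newline that terminates line 3.
     */
    static int countedNewlines(byte[] head) {
        String text = new String(head, StandardCharsets.ISO_8859_1);
        int endOfLine1 = text.indexOf('\n');
        int endOfLine2 = text.indexOf('\n', endOfLine1 + 1);
        int endOfLine3 = text.indexOf('\n', endOfLine2 + 1);
        if (endOfLine1 < 0 || endOfLine2 < 0 || endOfLine3 < 0) {
            throw new IllegalArgumentException("fewer than three lines in version block");
        }
        // The length is the last space-separated field of the first line.
        String firstLine = text.substring(0, endOfLine1);
        int declared = Integer.parseInt(
                firstLine.substring(firstLine.lastIndexOf(' ') + 1).trim());
        // Bytes from the start of line 2 up to (not including) the newline ending line 3.
        int bare = endOfLine3 - (endOfLine1 + 1);
        return declared - bare;
    }

    // Hypothetical header builder used only for the demonstration below.
    static byte[] sample(String line2, int declared) {
        String s = "filedesc://sample.arc 0.0.0.0 20240101000000 text/plain " + declared + "\n"
                + line2 + "\n"
                + "URL IP-address Archive-date Content-type Archive-length\n"
                + "\n";
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(countedNewlines(sample("1 0 InternetArchive", 75)));
        System.out.println(countedNewlines(sample("1 0 InternetArchive", 76)));
        System.out.println(countedNewlines(sample("1 0 Alexa Internet", 76)));
    }
}
```

With these made-up samples the three calls return 0, 1 and 2, matching the "excludes both newlines", "includes just the first newline" and "counts both" cases respectively.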

@ato
Member

ato commented Feb 8, 2024

Have you seen this error with in-the-wild ARC files containing real data as well, or just with the example files from the warcio unit tests? I'm also curious what such files look like if they have more than one document in them, and whether they also have extra linefeeds between documents or if it's just the version-block length that differs.

For reference there's an example Heritrix ARC file here which jwarc can successfully read: https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.arc.gz

@ato changed the title from "Reading ARC files?" to "ARC variants with different interpretations of version-block length" Feb 8, 2024
@gleporeNARA

Greetings - tballison has sourced many of his test files from my agency, the National Archives and Records Administration. The ARC files were actually created by the Internet Archive with Heritrix back in 2004.

I can confirm that the files we received from the Internet Archive appear to have three newlines before the first record (0A 0A 0A). And the record length takes us from the end of the header to the beginning of the first record:

[image]

@gleporeNARA

As for the remaining records, a single newline is the most common separator:
Preceding bytes: b'\n\n' - Occurrences: 228
Preceding bytes: b';\n' - Occurrences: 199
Preceding bytes: b'\xd9\n' - Occurrences: 130
Preceding bytes: b'>\n' - Occurrences: 96
Preceding bytes: b'\x00\n' - Occurrences: 15
Preceding bytes: b' \n' - Occurrences: 10
Preceding bytes: b'\r\n' - Occurrences: 7
Preceding bytes: b'\x82\n' - Occurrences: 5
Preceding bytes: b'}\n' - Occurrences: 4
Preceding bytes: b'\xa9\n' - Occurrences: 2
Preceding bytes: b'\xb0\n' - Occurrences: 1
Preceding bytes: b'F\n' - Occurrences: 1
Preceding bytes: b'l/' - Occurrences: 1
Preceding bytes: b'\x83\n' - Occurrences: 1
Preceding bytes: b'\x7f\n' - Occurrences: 1
Preceding bytes: b'd\n' - Occurrences: 1
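
A tally like the one above could be produced along the following lines. This is only a sketch and not the script that generated those numbers: it assumes the byte offsets at which records start are already known (for example from an index), and the names `PrecedingBytesTally` and `tally` are hypothetical. Keys are written as hex pairs rather than Python-style byte literals.

```java
import java.util.Map;
import java.util.TreeMap;

final class PrecedingBytesTally {
    /**
     * Counts how often each pair of bytes appears immediately before a record start.
     * recordOffsets holds the (assumed known) index of the first byte of each record.
     */
    static Map<String, Integer> tally(byte[] data, int[] recordOffsets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int offset : recordOffsets) {
            if (offset < 2 || offset > data.length) {
                continue; // fewer than two preceding bytes, or offset out of range
            }
            String key = String.format("%02x %02x",
                    data[offset - 2] & 0xff, data[offset - 1] & 0xff);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] data = "a\n\nb;\nc\n\nd".getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
        // Records are assumed to start at 'b' (3), 'c' (6) and 'd' (9).
        System.out.println(tally(data, new int[] {3, 6, 9}));
    }
}
```

In this scheme a key of `0a 0a` means the record was preceded by two newlines, while a key whose first byte is anything else means a single newline followed the previous record's body.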

@tballison
Author

What @gleporeNARA said. LOL. Thank you so much @ato for looking into this! Let me know if I can help in any way.

@ato closed this as completed in c4e3ab7 Feb 9, 2024
ato added a commit that referenced this issue Feb 9, 2024
Bugs fixed

* Improved compatibility with ARC variants (version-block length off by one, v2 version-block, spurious linefeeds) #82
* WarcParser: Context in parse error messages was incorrectly using the parser (file) position instead of buffer position
@ato
Member

ato commented Feb 9, 2024

Fix released as v0.28.6. Should sync to Maven central in an hour or so.

I've updated jwarc to accept 0 to 3 newlines between the end of the previous record's body and the URL of the next record. This should make it compatible with all the variants discussed above and it seems to work with the warcio example.arc:

$ jwarc cdx example.arc
 CDX N b a m s k r M S V g
com,example)/ 20140216050221 http://example.com/ text/html 200 - - - 1658 150 example.arc
$ jwarc extract --payload example.arc 150
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
...

I've also made it understand the "v2" version-block headers and fixed the parsing exception message so the "<-- HERE -->" should show the right context now.
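
For readers writing their own ARC tooling, the "0 to 3 newlines" tolerance described above can be sketched as follows. This is an illustration of the idea only and is not how jwarc implements it; the class `LenientSeparator`, the method `skipNewlines`, and the use of a `ByteBuffer` positioned just after the previous record's body are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LenientSeparator {
    // Upper bound taken from the comment above: at most three newlines are tolerated.
    static final int MAX_NEWLINES = 3;

    /**
     * Advances the buffer past up to MAX_NEWLINES '\n' bytes at its current
     * position and returns how many were skipped.
     */
    static int skipNewlines(ByteBuffer buf) {
        int skipped = 0;
        while (skipped < MAX_NEWLINES
                && buf.hasRemaining()
                && buf.get(buf.position()) == '\n') {
            buf.position(buf.position() + 1);
            skipped++;
        }
        return skipped;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(
                "\n\nhttp://example.com/ ...".getBytes(StandardCharsets.US_ASCII));
        int skipped = skipNewlines(buf);
        // After skipping, the buffer is positioned at the first byte of the URL.
        System.out.println(skipped + " newline(s) skipped, next byte: " + (char) buf.get(buf.position()));
    }
}
```

Capping the count rather than skipping all newlines keeps the reader from silently swallowing an unexpectedly long run of linefeeds, which would more likely indicate a damaged file than a known variant.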

@tballison
Author

Wow. Thank you. I'll upgrade in Tika and see what I find on my local set of files.
