Feat/format warc #81

hroptatyr · 2014-05-21T15:43:22Z

This changeset provides support for ISO 28500 archives (aka The WARC File Format), as produced for instance by wget --warc-file=...

The idea is to be able to use proven tools (i.e. libarchive) and a known workflow to operate on those archives.

While there are dedicated tools (warc-tools, hanzo-warc-tools, etc.), their usability is highly questionable. For example, the author of this changeset couldn't decide between the two tool sets mentioned: the C variant looked a lot more robust and functional than the Python variant, yet the C variant has already been declared obsolete, while the Python tool set does not even offer a means to extract one particular file.

This changeset is by no means complete or fail-safe (in particular, it has not been tested in non-Unix environments), but it should give anyone who's interested a good starting point.

kientzle · 2014-06-07T16:40:02Z

libarchive/test/test_write_format_warc.c

+	archive_entry_copy_pathname(ae, "dir");
+	archive_entry_set_filetype(ae, AE_IFDIR);
+	archive_entry_set_size(ae, 512);
+	assertEqualIntA(a, ARCHIVE_OK, archive_write_header(a, ae));


Your comment says this should fail, but you test for ARCHIVE_OK. If it fails, it should return ARCHIVE_FAILED.

kientzle · 2014-06-07T16:57:11Z

This looks very good. Though I've made a few stylistic comments, my only real concern is that the tests be extended a little bit:

- The write tests should always turn around and re-read to verify.
- Entries that cannot be written should return ARCHIVE_FAILED, not ARCHIVE_OK.
- There should be one or more read tests that read a known-good file. There are lots of examples in the current test suite you can copy. (Basically, find or create a small known-good file, uuencode it for storing in version control, the test decodes, reads, and verifies that the contents are read correctly.)

With the test fixes above, I'm happy to commit this.

in _warc_read(). Also kick __archive_read_consume() because the writer will consume the bytes for us. So for the EOF case, set unconsumed to 0; for the non-EOF case, set unconsumed to the minimum of the number of bytes read and the content length.
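The bookkeeping rule described in this commit message can be sketched in isolation. This is only an illustration: `warc_unconsumed()` and its parameters are hypothetical names, not functions or fields from the changeset.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of libarchive): compute how many bytes
 * should be marked as "unconsumed" after one read step.
 *   at_eof          non-zero if the entry's data is exhausted
 *   nread           bytes made available by the last read-ahead
 *   content_length  bytes still belonging to the current record
 */
size_t
warc_unconsumed(int at_eof, size_t nread, size_t content_length)
{
	if (at_eof)
		return 0;	/* nothing left to hand out */
	/* never hand out more than the record still owns */
	return nread < content_length ? nread : content_length;
}
```

Clamping to the content length keeps a large read-ahead buffer from spilling into the next record's header.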

This changeset adheres to the previously imported read test. The archive format is hard-set to ARCHIVE_FORMAT_WARC, while the format name is the stringified WARC/x.y version designator, which for performance reasons will be cached between calls to the header reader _warc_rdhdr().

This changeset will refuse to extract WARCs that contain filenames a la http://example.com/implicit/content/. There is a todo note in archive_read_support_format_warc.c discussing possible archive options to extract filenames like those, either by explicit user input or by some sort of heuristic, as used in wget for example.
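As an illustration of the kind of name meant here, the following sketch flags names that end in a slash and therefore carry no file component. The trailing-slash criterion is an assumption drawn from the example URL above; the exact test used by the reader is not stated in this conversation, and `warc_name_is_implicit()` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical check (an assumption, not the reader's actual logic):
 * return 1 if `name` is NULL, empty, or ends in '/', i.e. it does not
 * name a concrete file that could be created on extraction.
 */
int
warc_name_is_implicit(const char *name)
{
	size_t len;

	if (name == NULL)
		return 1;
	len = strlen(name);
	return len == 0 || name[len - 1] == '/';
}
```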

hroptatyr · 2014-06-10T13:48:39Z

Ok, I've practically gone through all your points and they should be fixed now.

kientzle · 2014-06-10T23:26:23Z

I've merged this. Thanks for your hard work.

hroptatyr added 11 commits May 21, 2014 08:39

Provide ISO 28500:2009 writer (aka warc, aka web archive)

667961d

Provide ISO 28500:2009 reader (aka warc, aka web archive)

9693801

Implant WARC support in tar's get_format_code()

3937c70

Provide WARC read/write tests

af78448

fix, never write more bytes in _warc_data() than ...

faa12ea

previously announced in _warc_header(). The test suite (as is) is one offender. It populates a 9-byte string, mimicking an IFREG file, but by the time the header makes it into the archive, the size changes from 0 to 9.

fix, request just the bare minimum for a WARC header

fa28baa

Moreover, assume a response of less than the bare minimum header length to be the archive's EOF.

fix, WARC files urgently need the filesize to be known when the header is written

3bb9e31

Obey gcc warnings,

75d16bd

in particular: Don't compare integers of different signedness, always initialise all members of a struct explicitly.
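A minimal sketch of the two habits named in this commit message, using made-up names (`have_at_least()`, `struct span`, `empty_span()`) rather than code from the changeset:

```c
#include <stddef.h>

/*
 * Compare a signed byte count against an unsigned requirement without
 * mixing signedness: reject negatives first, then cast explicitly.
 * A bare `nread >= need` would convert -1 to a huge unsigned value.
 */
int
have_at_least(long nread, size_t need)
{
	if (nread < 0)
		return 0;
	return (size_t)nread >= need;
}

/* Hypothetical struct, used only to show explicit initialisation. */
struct span {
	const char *str;
	size_t len;
};

struct span
empty_span(void)
{
	/* every member is named in the initialiser, none left implicit */
	struct span s = { NULL, 0U };
	return s;
}
```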

Fix, actually consume data between calls to _warc_read()

cb3e79b

Store and read back mtimes through Last-Modified custom header

84114d7

Signed-off-by: Sebastian Freundt <freundt@ga-group.nl>
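For illustration, the writing half of such a header could look like the sketch below. It assumes the timestamp uses the same ISO 8601 UTC layout that the WARC standard prescribes for WARC-Date (YYYY-MM-DDThh:mm:ssZ); whether the changeset formats Last-Modified exactly this way is not stated here, and `format_last_modified()` is a hypothetical name.

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: write "Last-Modified: <UTC timestamp>\r\n" into
 * buf. Returns the number of characters written, or -1 on failure or
 * if buf is too small. The timestamp layout is an assumption.
 */
int
format_last_modified(char *buf, size_t len, time_t t)
{
	char stamp[32];
	struct tm *tm;
	int n;

	tm = gmtime(&t);	/* not thread-safe; fine for a sketch */
	if (tm == NULL)
		return -1;
	if (strftime(stamp, sizeof(stamp), "%Y-%m-%dT%H:%M:%SZ", tm) == 0)
		return -1;
	n = snprintf(buf, len, "Last-Modified: %s\r\n", stamp);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

Reading the value back is the mirror image: locate the header line, parse the six numeric fields, and convert them to a `time_t` in UTC.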

Hygiene, clean up xmemmem() code a little, use xor sums.

09100be

Signed-off-by: Sebastian Freundt <freundt@ga-group.nl>

kientzle reviewed Jun 7, 2014

hroptatyr added 10 commits June 10, 2014 10:42

Tests, heed Tim's advice and emit ARCHIVE_FAILED for entries ...

beb4ff6

that cannot be stored (natively) in WARC format.

Fix, an empty WARC archive needs a bit more than 256 bytes.

e251bcf

Hygiene, protect against NULL filenames in _warc_header()

6e70844

Return ARCHIVE_WARN immediately.

Hygiene, trust __archive_read_ahead() and kick superfluous check.

b3cc3bc

Heeding Tim's advice, a non-NULL from __archive_read_ahead() is guaranteed to be of at least the minimum size, therefore no need to check for this condition again.

Hygiene, use FALLTHROUGH instead of lint's @fallthrough@

dd1c7ee

Hygiene, always use xmemmem() because memmem() is a GNU extension

75b78d5
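To show what a portable stand-in for `memmem()` has to do, here is a deliberately simple version built on `memcmp()`. It is not the `xmemmem()` from this changeset (which, per the earlier commit title, uses xor sums); `portable_memmem()` is a hypothetical name and the scan is the plain quadratic one.

```c
#include <stddef.h>
#include <string.h>

/*
 * Naive substring search over raw bytes, using only standard C.
 * Returns a pointer to the first occurrence of needle in hay,
 * or NULL if there is none. An empty needle matches at the start.
 */
const void *
portable_memmem(const void *hay, size_t hlen,
    const void *needle, size_t nlen)
{
	const unsigned char *h = hay;
	size_t i;

	if (nlen == 0)
		return hay;
	if (hlen < nlen)
		return NULL;
	for (i = 0; i + nlen <= hlen; i++) {
		if (memcmp(h + i, needle, nlen) == 0)
			return h + i;
	}
	return NULL;
}
```

A typical use in a header parser would be to find the blank line (`"\r\n\r\n"`) that ends a record header.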

Tests, provide known-good archive read test.

c81efb9

kientzle added a commit that referenced this pull request Jun 10, 2014

Merge pull request #81 from hroptatyr/feat/format-warc

f684c7d

Feat/format warc

kientzle merged commit f684c7d into libarchive:master Jun 10, 2014

hroptatyr deleted the feat/format-warc branch June 11, 2014 04:31

mxmlnkn mentioned this pull request Mar 19, 2024

Add WARC support mxmlnkn/ratarmount#128

Open
