Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/format warc #81

Merged
merged 21 commits into from
Jun 10, 2014
Merged

Feat/format warc #81

merged 21 commits into from
Jun 10, 2014

Conversation

hroptatyr
Copy link
Contributor

This changeset provides support for ISO 28500 archives (aka The WARC File Format), as produced for instance by wget --warc-file=...

The idea is to be able to use proven tools (i.e. libarchive) and a known workflow to operate on those archives.

While there are dedicated tools (warc-tools), hanzo-warc-tools, etc.) their usability is highly questionable: For example the author of this changeset couldn't decide between the 2 mentioned tool sets; the C variant looked a lot more robust and functional than the python variant and yet the C variant has been declared obsolete already, while the python tool set does not even offer a means to extract one particular file.

This changeset is by no means complete and fail-safe (especially not tested under non-unix environments) but it should give anyone who's interested a good starting point.

previously announce in _warc_header().

The test suite (as is) is one offender.  It populates a 9-byte string, mimicking an IFREG file
but by the time the header makes it into the archive, the size changes from 0 to 9.
Moreover, assume a response of less than the bare minimum header
length to be the archive's EOF.
in particular: Don't compare integers of different signedness,
always initialise all members of a struct explicitly.
Signed-off-by: Sebastian Freundt <freundt@ga-group.nl>
Signed-off-by: Sebastian Freundt <freundt@ga-group.nl>
archive_entry_copy_pathname(ae, "dir");
archive_entry_set_filetype(ae, AE_IFDIR);
archive_entry_set_size(ae, 512);
assertEqualIntA(a, ARCHIVE_OK, archive_write_header(a, ae));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your comment says this should fail, but you test for ARCHIVE_OK. If it fails, it should return ARCHIVE_FAILED.

@kientzle
Copy link
Contributor

kientzle commented Jun 7, 2014

This looks very good. Though I've made a few stylistic comments, my only real concern is that the tests be extended a little bit:

  • the write tests should always turn around and re-read to verify.
  • entries that cannot be written should return ARCHIVE_FAILED, not ARCHIVE_OK
  • there should be one or more read tests that read a known-good file. There are lots of examples in the current test suite you can copy. (Basically, find or create a small known-good file, uuencode it for storing in version control, the test decodes, reads, and verifies that the contents are read correctly.)

With the test fixes above, I'm happy to commit this.

that cannot be stored (natively) in WARC format.
Heeding Tim's advice, a non-NULL from __archive_read_ahead() is
guaranteed to be of at least the minimum size, therefore no need to
check for this condition again.
in _warc_read().  Also kick __archive_read_consume() because the writer
will consume the bytes for us.  So for the EOF case, set unconsumed to 0,
for the non-EOF case set unconsumed to the minimum of the number of bytes
read and the content length.
This changeset adheres to the previously imported read test.
The archive format is hard-set to ARCHIVE_FORMAT_WARC, while the
format name is the stringified WARC/x.y version designator, which
for performance reasons will be cached between calls to the header
reader _warc_rdhdr().
This changeset will refuse to extract WARCs that contain filenames
a la http://example.com/implicit/content/

There is a todo note in archive_read_support_format_warc.c discussing
possible archive options to extract filenames like those either by
explicit user input or by some sort of heuristic as used in wget
for example.
@hroptatyr
Copy link
Contributor Author

Ok, I've practically gone through all your points and they should be fixed now.

@kientzle
Copy link
Contributor

I've merged this. Thanks for your hard work.

kientzle added a commit that referenced this pull request Jun 10, 2014
@kientzle kientzle merged commit f684c7d into libarchive:master Jun 10, 2014
@hroptatyr hroptatyr deleted the feat/format-warc branch June 11, 2014 04:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants