-
Notifications
You must be signed in to change notification settings - Fork 756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/format warc #81
Feat/format warc #81
Conversation
previously announce in _warc_header(). The test suite (as is) is one offender. It populates a 9-byte string, mimicking an IFREG file but by the time the header makes it into the archive, the size changes from 0 to 9.
Moreover, assume a response of less than the bare minimum header length to be the archive's EOF.
in particular: Don't compare integers of different signedness, always initialise all members of a struct explicitly.
Signed-off-by: Sebastian Freundt <freundt@ga-group.nl>
Signed-off-by: Sebastian Freundt <freundt@ga-group.nl>
archive_entry_copy_pathname(ae, "dir"); | ||
archive_entry_set_filetype(ae, AE_IFDIR); | ||
archive_entry_set_size(ae, 512); | ||
assertEqualIntA(a, ARCHIVE_OK, archive_write_header(a, ae)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your comment says this should fail, but you test for ARCHIVE_OK. If it fails, it should return ARCHIVE_FAILED.
This looks very good. Though I've made a few stylistic comments, my only real concern is that the tests be extended a little bit:
With the test fixes above, I'm happy to commit this. |
that cannot be stored (natively) in WARC format.
Return ARCHIVE_WARN immediately.
Heeding Tim's advice, a non-NULL from __archive_read_ahead() is guaranteed to be of at least the minimum size, therefore no need to check for this condition again.
in _warc_read(). Also kick __archive_read_consume() because the writer will consume the bytes for us. So for the EOF case, set unconsumed to 0, for the non-EOF case set unconsumed to the minimum of the number of bytes read and the content length.
This changeset adheres to the previously imported read test. The archive format is hard-set to ARCHIVE_FORMAT_WARC, while the format name is the stringified WARC/x.y version designator, which for performance reasons will be cached between calls to the header reader _warc_rdhdr().
This changeset will refuse to extract WARCs that contain filenames a la http://example.com/implicit/content/ There is a todo note in archive_read_support_format_warc.c discussing possible archive options to extract filenames like those either by explicit user input or by some sort of heuristic as used in wget for example.
Ok, I've practically gone through all your points and they should be fixed now. |
I've merged this. Thanks for your hard work. |
This changeset provides support for ISO 28500 archives (aka The WARC File Format), as produced for instance by
wget --warc-file=...
The idea is to be able to use proven tools (i.e. libarchive) and a known workflow to operate on those archives.
While there are dedicated tools (warc-tools), hanzo-warc-tools, etc.) their usability is highly questionable: For example the author of this changeset couldn't decide between the 2 mentioned tool sets; the C variant looked a lot more robust and functional than the python variant and yet the C variant has been declared obsolete already, while the python tool set does not even offer a means to extract one particular file.
This changeset is by no means complete and fail-safe (especially not tested under non-unix environments) but it should give anyone who's interested a good starting point.