BNF grammar for WARC-Target-URI and WARC-Profile is inconsistent with examples #23
Comments
|
Yes, current tools follow the examples as far as I can tell. I suspect most tools would choke on something that is formatted as per the BNF. |
In the examples and in all popular implementations, URIs in the WARC-Target-URI and WARC-Profile fields are not surrounded by "<" and ">" characters. This change makes the grammar consistent with practice by removing "<" and ">" from the basic `uri` rule and introducing a new `record-id` rule for the fields WARC-Record-ID, WARC-Concurrent-To, WARC-Refers-To, WARC-Warcinfo-ID and WARC-Segment-Origin-ID. Fixes iipc#23
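A minimal sketch of what the revised grammar implies for a validator, assuming only the field split described in this change. The names `RECORD_ID_FIELDS`, `BARE_URI_FIELDS` and `matches_revised_grammar` are hypothetical, field names are compared exactly for simplicity, and the URI itself is not checked against RFC 3986:

```python
# Fields that use the new `record-id` rule: the URI is wrapped in "<" and ">".
RECORD_ID_FIELDS = {
    "WARC-Record-ID",
    "WARC-Concurrent-To",
    "WARC-Refers-To",
    "WARC-Warcinfo-ID",
    "WARC-Segment-Origin-ID",
}

# Fields that use the revised bare `uri` rule: no surrounding angle brackets.
BARE_URI_FIELDS = {"WARC-Target-URI", "WARC-Profile"}


def matches_revised_grammar(name: str, value: str) -> bool:
    """Check only the angle-bracket convention for a header value.

    This is an illustrative check, not a full RFC 3986 URI validator.
    """
    value = value.strip()
    bracketed = len(value) > 2 and value.startswith("<") and value.endswith(">")
    if name in RECORD_ID_FIELDS:
        return bracketed
    if name in BARE_URI_FIELDS:
        return bool(value) and not value.startswith("<") and not value.endswith(">")
    raise ValueError(f"not a URI-valued field covered by this sketch: {name}")
```

Under this sketch, a bracketed `WARC-Target-URI` written by a tool that follows the WARC 1.0 BNF is reported as not matching the revised grammar, which is the inconsistency the issue describes.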
Addresses issue iipc#23
|
I always thought the point was that target-uri and profile were most likely URLs that could be browsed to. And the rest were more likely UUIDs and hence surrounded by "<" / ">". |
|
The following changes were integrated into the revised ISO draft during the ISO working group meeting on November 16-17, 2015: in section 4 (file and record model), change the definition of uri and add a note: NOTE: in WARC 1.0 standard (ISO 28500:2009), uri was defined as "<" <'URI' per RFC3986> ">". This rule has been changed to meet requests from implementers. |
|
Included in WARC 1.1 |
Due to an error in the spec, most implementations of WARC 1.0 don't surround the contents of the WARC-Target-URI field with angle brackets even though they're required to do so. There are some tools that do respect this requirement, such as newer versions of GNU Wget. Since both variants are in use, it's best to just strip angle brackets when they're present. Reference: iipc/warc-specifications#23
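A minimal sketch of the tolerant reading suggested above, assuming the header value has already been split from its field name. The function name `strip_angle_brackets` is hypothetical and not taken from any existing WARC library:

```python
def strip_angle_brackets(value: str) -> str:
    """Return a WARC-Target-URI value without one enclosing "<" ... ">" pair.

    Values that are not fully enclosed in angle brackets are returned
    unchanged apart from surrounding whitespace being trimmed.
    """
    value = value.strip()
    if len(value) >= 2 and value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value
```

Both `strip_angle_brackets("<http://example.com/>")` and `strip_angle_brackets("http://example.com/")` then yield the same string, so downstream code does not need to know which variant the writing tool produced.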
(In WARC 1.0) Sections 5.2 and 5.16 define the grammar for the WARC-Target-URI and WARC-Profile fields in terms of the `uri` rule, which Section 4 defines as "<" <'URI' per RFC3986> ">".
However, none of the examples in sections 12 and 13 include the "<" and ">" characters on the WARC-Target-URI and WARC-Profile fields. All other `uri` fields (WARC-Record-ID, WARC-Refers-To etc.) in the examples do include "<" and ">".
Many (all?) implementations have adopted the form shown in the examples rather than strictly following the grammar.