BNF grammar for WARC-Target-URI and WARC-Profile is inconsistent with examples #23

Closed
ato opened this issue Sep 17, 2015 · 4 comments

@ato
Member

ato commented Sep 17, 2015

(In WARC 1.0) Sections 5.2 and 5.16 define the grammar for these fields as:

WARC-Record-ID = "WARC-Record-ID" ":" uri
WARC-Profile   = "WARC-Profile" ":" uri
WARC-Target-URI = "WARC-Target-URI" ":" uri

Section 4 defines uri as:

uri            = "<" <'URI' per RFC3986> ">"

However, none of the examples in sections 12 and 13 include the "<" and ">" characters in the WARC-Target-URI and WARC-Profile fields. All other uri fields in the examples (WARC-Record-ID, WARC-Refers-To, etc.) do include "<" and ">".

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2007-03-06T00:43:35Z
WARC-Profile: http://netpreserve.org/warc/1.0/server-not-modified
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: message/http
Content-Length: 226

Many (all?) implementations have adopted the form shown in the examples rather than strictly following the grammar.
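
A minimal sketch of the mismatch, checking the example record above against the WARC 1.0 `uri` rule. The helper names (`parse_headers`, `is_bracketed`, `bracket_report`) are hypothetical and not taken from any WARC library, and the field set only covers the uri fields that appear in that example.

```python
# uri-valued fields that appear in the example record (not the full list).
URI_FIELDS = {"WARC-Target-URI", "WARC-Profile", "WARC-Record-ID", "WARC-Refers-To"}

EXAMPLE_RECORD = """\
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2007-03-06T00:43:35Z
WARC-Profile: http://netpreserve.org/warc/1.0/server-not-modified
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: message/http
Content-Length: 226
"""


def parse_headers(block):
    """Split a header block into (name, value) pairs, skipping the version line."""
    headers = []
    for line in block.splitlines()[1:]:
        if not line.strip():
            continue
        # Split on the first colon only; URIs contain further colons.
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return headers


def is_bracketed(value):
    """True if the value has the "<" ... ">" form required by the WARC 1.0 `uri` rule."""
    return len(value) > 2 and value.startswith("<") and value.endswith(">")


def bracket_report(block):
    """Map each uri-valued field in the block to whether it follows the 1.0 grammar."""
    return {
        name: is_bracketed(value)
        for name, value in parse_headers(block)
        if name in URI_FIELDS
    }


for field, ok in bracket_report(EXAMPLE_RECORD).items():
    print(f"{field}: {'bracketed' if ok else 'bare'}")
```

Run against the example, this reports WARC-Target-URI and WARC-Profile as bare and the two record-ID fields as bracketed.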

@kris-sigur
Member

Yes, current tools follow the examples as far as I can tell. The < > form is only used when dealing with UUIDs.

I suspect most tools would choke on something that is formatted as per the BNF.

ato added a commit to ato/warc-specifications that referenced this issue Sep 17, 2015
In the examples and in all popular implementations, URIs in the
WARC-Target-URI and WARC-Profile fields are not surrounded by
"<" and ">" characters.  This change makes the grammar consistent
with practice by removing "<" and ">" from the basic `uri` rule and
introducing a new `record-id` rule for the fields WARC-Record-ID,
WARC-Concurrent-To, WARC-Refers-To, WARC-Warcinfo-ID and
WARC-Segment-Origin-ID.

Fixes iipc#23
ato added a commit to ato/warc-specifications that referenced this issue Sep 18, 2015
@nclarkekb

I always thought the point was that target-uri and profile were most likely URLs that could be browsed to, and the rest were more likely UUIDs and hence surrounded by "<" / ">".
Both being URIs was more of a technicality.

@saraaubry

The following changes have been integrated in the revised ISO draft during the ISO working group meeting on November 16-17, 2015:

In section 4 (file and record model), change the definition of uri and add a note:
uri = <'URI' per RFC3986>

NOTE: in WARC 1.0 standard (ISO 28500:2009), uri was defined as "<" <'URI' per RFC3986> ">". This rule has been changed to meet requests from implementers.

@saraaubry

Included in WARC 1.1

@anjackson anjackson added this to the The WARC Format 1.1 milestone Dec 7, 2017
tsudoko added a commit to tsudoko/XADMaster that referenced this issue Nov 1, 2018
Due to an error in the spec, most implementations of WARC 1.0
don't surround the contents of the WARC-Target-URI field with
angle brackets even though they're required to do so. There are
some tools that do respect this requirement, such as newer
versions of GNU Wget. Since both variants are in use, it's best
to just strip angle brackets when they're present.

Reference: iipc/warc-specifications#23
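
The tolerant-reader approach described in that commit message can be sketched roughly as follows. This is an illustration only, not the code from the referenced commit; `strip_angle_brackets` is a hypothetical helper name.

```python
def strip_angle_brackets(value):
    """Return a WARC-Target-URI value without surrounding "<" and ">".

    Accepts both the bracketed form required by the WARC 1.0 grammar and
    the bare form used in the spec's examples and by most writers.
    """
    value = value.strip()
    # Only strip when both brackets are present as a matching pair.
    if len(value) >= 2 and value[0] == "<" and value[-1] == ">":
        return value[1:-1]
    return value


for raw in ("<http://www.archive.org/images/logoc.jpg>",
            "http://www.archive.org/images/logoc.jpg"):
    print(strip_angle_brackets(raw))
```

Both inputs yield the same bare URI, so a reader built this way does not need to know which variant the writing tool produced.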