proposal to update WARC-Date spec #6

Closed
wants to merge 5 commits into from

Member

nlevitt commented Apr 30, 2015

No description provided.

PsypherPunk commented Apr 30, 2015

Relating to the "reduced precision" issue: according to the Wikipedia page, the ISO 8601 standard already allows for this. If we were to broaden the spec to include ISO 8601-compatible dates, this should technically be covered (and anything that correctly implements it should handle date comparison).

Member Author

nlevitt commented Apr 30, 2015

@PsypherPunk yes, those are exactly what I proposed using, in my "alternative", "not preferred" proposal. http://nlevitt.github.io/warc-specifications/specifications/warc-date/allow-more-precise.html#warc-date-mandatory-2

anjackson modified the milestone: The WARC Format - 1.1 (Jun 11, 2015)

cleymour commented Jun 29, 2015

The "Proposed Revised Spec" looks good to me.
I have a question about the "Alternative Proposed Revised Spec (Not Preferred)": if we allow shorter dates (which may be useful for format conversion, e.g. from HTTrack2WARC), should we help implementers decide how access interfaces should react? For example, should we recommend that the Wayback Machine replace missing digits with a "0", or should we leave implementers free to decide?
Either way, neither solution seems to threaten 1.0 and 1.1 compatibility.

Member Author

nlevitt commented Jun 29, 2015

cleymour commented:

For example, should we recommend that the Wayback Machine replace missing digits with a "0", or should we leave implementers free to decide?

I would hesitate to make any particular recommendation, because I think the correct behavior is not obvious.

Consider the effect of replacing missing digits with zeroes. Suppose you have a WARC record for http://example.com/ with timestamp "2005", and another record with timestamp "2005-01-05T12:34:56.789Z". Someone requests http://example.com/ from "2005-01-01T00:00:00Z". The capture with timestamp "2005-[00-00T00:00:00]" would be chosen in preference to "2005-01-05T12:34:56.789Z", even though it's more likely that the latter was captured closer to the desired timestamp.
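
To make the scenario above concrete, here is a minimal Python sketch. The helper names `pad_down` and `closest` are hypothetical, and "pick the record whose filled-in timestamp is nearest the request" is an assumed selection rule, not the behavior of any particular playback tool. Missing month and day fields are filled with 01 rather than 00 so that the result is a valid date.

```python
from datetime import datetime, timezone

def pad_down(ts: str) -> datetime:
    """Fill the missing fields of a reduced-precision timestamp with their lowest values."""
    main, _, frac = ts.rstrip("Z").partition(".")
    template = "0001-01-01T00:00:00"
    filled = main + template[len(main):]          # "2005" -> "2005-01-01T00:00:00"
    micro = int((frac + "000000")[:6]) if frac else 0
    parsed = datetime.strptime(filled, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micro, tzinfo=timezone.utc)

def closest(records, requested):
    """Assumed rule: the record whose padded timestamp is nearest the request wins."""
    target = pad_down(requested)
    return min(records, key=lambda ts: abs(pad_down(ts) - target))

records = ["2005", "2005-01-05T12:34:56.789Z"]
print(closest(records, "2005-01-01T00:00:00Z"))   # prints "2005"
```

Under these assumptions the year-only record always wins for a request at the very start of 2005, whatever its real capture time was.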

What is the preferred playback behavior for timestamps with reduced precision? It's not at all clear to me. Maybe it's something like "return the record with less precise timestamp if it's definitely? probably? the closest to the requested timestamp". But that could be exceedingly difficult to implement, especially given how wayback indexes currently work, without any special provision for reduced precision timestamps.
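
One way to read "definitely the closest" is to treat each reduced-precision timestamp as the interval of instants it could denote, and to compare worst-case against best-case distances. The Python sketch below does only that; the function names (`interval`, `distance_bounds`, `definitely_closer`) and the set of accepted timestamp shapes are assumptions for illustration, and it says nothing about how a wayback index would support such a query.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

def interval(ts):
    """Return (earliest, latest) instants a timestamp of assumed shape could denote."""
    ts = ts.rstrip("Z")
    if len(ts) == 4:                                   # YYYY
        start = datetime(int(ts), 1, 1, tzinfo=UTC)
        return start, datetime(start.year + 1, 1, 1, tzinfo=UTC)
    if len(ts) == 7:                                   # YYYY-MM
        year, month = int(ts[:4]), int(ts[5:7])
        start = datetime(year, month, 1, tzinfo=UTC)
        return start, datetime(year + month // 12, month % 12 + 1, 1, tzinfo=UTC)
    if len(ts) == 10:                                  # YYYY-MM-DD
        start = datetime.strptime(ts, "%Y-%m-%d").replace(tzinfo=UTC)
        return start, start + timedelta(days=1)
    if len(ts) == 16:                                  # YYYY-MM-DDThh:mm
        start = datetime.strptime(ts, "%Y-%m-%dT%H:%M").replace(tzinfo=UTC)
        return start, start + timedelta(minutes=1)
    # Full precision, with or without fractional seconds: treated as a single point.
    fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in ts else "%Y-%m-%dT%H:%M:%S"
    start = datetime.strptime(ts, fmt).replace(tzinfo=UTC)
    return start, start

def distance_bounds(ts, target):
    """Smallest and largest possible distance from the target instant.

    The upper bound uses the interval's end instant, so it slightly overestimates.
    """
    start, end = interval(ts)
    to_start, to_end = abs(target - start), abs(target - end)
    nearest = timedelta(0) if start <= target <= end else min(to_start, to_end)
    return nearest, max(to_start, to_end)

def definitely_closer(a, b, requested):
    """True only if a's worst case still beats b's best case."""
    target, _ = interval(requested)
    return distance_bounds(a, target)[1] < distance_bounds(b, target)[0]

request = "2005-01-01T00:00:00Z"
print(definitely_closer("2005", "2005-01-05T12:34:56.789Z", request))  # False
print(definitely_closer("2005-01-05T12:34:56.789Z", "2005", request))  # False
```

For the example records neither one is "definitely" closer, which is the ambiguity in question.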

In general, I don't think it's a good idea for standards or specs to make any recommendations, except when they are based on proven implementations. You can never think of everything until you actually do it.

Member

ikreymer commented Jul 1, 2015

FWIW, the current Wayback Machine behavior is to pad up, not down: e.g., a request for /2005/ is actually equivalent to a request for /20051231235959/, not /20050101000000/.
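
As an illustration of padding up versus padding down on 14-digit timestamps, here is a small Python sketch. It is not the Wayback Machine's code: `pad_up` and `pad_down_14` are hypothetical names, and clamping the day to the real end of the month is an assumption added so that the padded value is a valid date.

```python
import calendar

VALID_LENGTHS = (4, 6, 8, 10, 12, 14)   # YYYY, YYYYMM, ... YYYYMMDDhhmmss

def _check(partial: str) -> None:
    if not partial.isdigit() or len(partial) not in VALID_LENGTHS:
        raise ValueError(f"unsupported partial timestamp: {partial!r}")

def pad_down_14(partial: str) -> str:
    """Fill missing fields with their lowest values."""
    _check(partial)
    return partial + "00000101000000"[len(partial):]

def pad_up(partial: str) -> str:
    """Fill missing fields with their highest values."""
    _check(partial)
    padded = partial + "99991231235959"[len(partial):]
    if len(partial) <= 6:
        # The day was filled in as 31; clamp it to the month's actual last day.
        year, month = int(padded[:4]), int(padded[4:6])
        last_day = calendar.monthrange(year, month)[1]
        padded = f"{padded[:6]}{last_day:02d}{padded[8:]}"
    return padded

print(pad_up("2005"))        # 20051231235959
print(pad_down_14("2005"))   # 20050101000000
```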

The merits of this should probably be discussed elsewhere, but just pointing out current default behavior.

Member

anjackson commented Jul 29, 2015

I think this issue can be split into an easy part and a not-easy part. We clearly need microsecond precision as an option, so I think we should deal with that first. The issues around reducing the precision by omitting day/month or whatever would seem to need more discussion.

So, can I propose that we deal with the microsecond-precision option first? Furthermore, can I propose that this is done in a new pull request that directly modifies the proposed version 1.1 of the specification:

https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/warc-format/warc-1.1/index.md

I'd also like to suggest that each proposed change is also noted in the Document History section (at the end) so this can act as a change-log for the specification document. This could link back to relevant issues or pull requests, in the same way as for the CHANGES.md files we tend to use elsewhere.

PsypherPunk commented Jul 29, 2015

I think we'd agreed in the last discussion to adopt the "alternative" proposal as-is...? If that's the case the current pull request should be fine, although the addition to the "Document History" might be a good practice for future amendments.

Member

anjackson commented Jul 29, 2015

@nlevitt's original pull-request was based around adding a new specification document. I am proposing instead that the change is submitted as a modification to the WARC 1.1 standard document itself. However, if the community would rather this is a separate specification, that's fine by me.

nlevitt added a commit to nlevitt/warc-specifications that referenced this pull request Jul 31, 2015
Revise WARC-Date specification to permit values with varying levels of precision. It is the same as the "Alternative Proposed Revised Spec"
from http://nlevitt.github.io/warc-specifications/specifications/warc-date/allow-more-precise.html but with the addition of the sentence "This document recommends no particular algorithm for choosing a record by date when an exact match is not available." I also added an entry to Document History. See also iipc#6

Member Author

nlevitt commented Jul 31, 2015

Took @anjackson's suggestion and created a new pull request against the standard document itself: #21. It replaces the WARC-Date section with my "alternative" proposal, with the addition of the sentence "This document recommends no particular algorithm for choosing a record by date when an exact match is not available." That was my understanding of the consensus established in the phone meeting. Also added a document history entry as requested.

nlevitt closed this Jul 31, 2015
5 participants