Support for robotsflag field based on OpenWayback behaviour #4

Closed
anjackson opened this issue Jul 8, 2016 · 2 comments

Comments

@anjackson
Contributor

We were looking at using the <robotstxt> field, which is currently not supported by tinycdxserver, and were wondering if you'd be happy for us to submit a pull request that enables it?

The implementation in OpenWayback is rather odd, in that it populates this field from the M meta tags (AIF) field (see here). It's not clear why the meta tags field becomes the robotstxt field, but AFAICT this is the only way to populate that field via the CDX format.
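To illustrate the mapping described above, here is a minimal sketch of pulling the `M` column out of a space-delimited CDX line using the field letters declared in the header. The class and method names are hypothetical, the sample line in the usage note is made up, and this is not OpenWayback's actual parser; it only assumes that the value of the `M` column is what ends up as the robots field.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenWayback or tinycdxserver.
class CdxRobotFlagsSketch {

    /** Returns the field letters from a header such as " CDX N b a m s k r M S V g". */
    static List<String> fieldLetters(String headerLine) {
        String[] tokens = headerLine.trim().split("\\s+");
        if (tokens.length < 2 || !tokens[0].equals("CDX")) {
            throw new IllegalArgumentException("not a CDX header: " + headerLine);
        }
        // Drop the leading "CDX" token; the rest are one-letter field codes.
        return Arrays.asList(tokens).subList(1, tokens.length);
    }

    /** Returns the value of the M (meta tags) column, or "-" if the column is absent. */
    static String robotFlags(String headerLine, String dataLine) {
        // Field letters are case sensitive: "M" (meta tags) is distinct from "m" (MIME type).
        int index = fieldLetters(headerLine).indexOf("M");
        String[] fields = dataLine.trim().split(" ");
        if (index < 0 || index >= fields.length) {
            return "-";
        }
        return fields[index];
    }
}
```

With the header ` CDX N b a m s k r M S V g`, the `M` column is the eighth field, so `robotFlags(header, line)` returns whatever sits in that position of the data line (for example `AIF` if the line carries those flags).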

It doesn't look like too difficult a change, but given that it's nearly there but commented out I thought I'd better ask if there's a problem? Presumably the indexes won't be compatible either?

@ato
Member

ato commented Jul 9, 2016

Sure, would be happy to accept a pull request that implements it.

There's no problem with it. The index format includes a per-record (per CDX line) version number. So create a new ddcodd

@ato ato closed this as completed Jul 9, 2016
@ato ato reopened this Jul 9, 2016
@ato
Member

ato commented Jul 9, 2016

Gah. Sorry. 'close issue' is too near the text box on mobile.

... So create a new Capture.decodeValueV2() method for a version 2 record format that supports the robots field and update Capture.encodeValue() to write the new format. Then the index server will happily read both new and old records and you can even mix them in the one index while incrementally reindexing to fill in the robots field data.
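A rough sketch of that approach, for anyone picking this up. The byte layout, the `Record` class and its fields are invented for illustration and are not tinycdxserver's real encoding; the only idea taken from above is a leading version number that selects the decoder, with the encoder always writing the newest version.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical layout, not the actual tinycdxserver record format.
class VersionedRecordSketch {

    static final class Record {
        final int status;
        final String robotFlags; // "-" when unknown, as for old records
        Record(int status, String robotFlags) {
            this.status = status;
            this.robotFlags = robotFlags;
        }
    }

    /** Writes the old (version 1) layout; kept here only to simulate existing records. */
    static byte[] encodeValueV1(int status) {
        return ByteBuffer.allocate(1 + 4).put((byte) 1).putInt(status).array();
    }

    /** Always writes the newest (version 2) layout, which adds the robots field. */
    static byte[] encodeValue(Record r) {
        byte[] flags = r.robotFlags.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + 4 + 2 + flags.length)
                .put((byte) 2)
                .putInt(r.status)
                .putShort((short) flags.length)
                .put(flags)
                .array();
    }

    /** Reads the version byte and dispatches, so old and new records can share one index. */
    static Record decodeValue(byte[] value) {
        ByteBuffer bb = ByteBuffer.wrap(value);
        int version = bb.get();
        switch (version) {
            case 1: return decodeValueV1(bb);
            case 2: return decodeValueV2(bb);
            default: throw new IllegalArgumentException("unknown record version " + version);
        }
    }

    static Record decodeValueV1(ByteBuffer bb) {
        // Version 1 has no robots field, so report it as unknown.
        return new Record(bb.getInt(), "-");
    }

    static Record decodeValueV2(ByteBuffer bb) {
        int status = bb.getInt();
        byte[] flags = new byte[bb.getShort()];
        bb.get(flags);
        return new Record(status, new String(flags, StandardCharsets.UTF_8));
    }
}
```

Because `decodeValue` still understands version 1, records written before the change keep working and simply report `-` for the robots field until they are reindexed.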

It was marked as todo simply because I didn't have any CDX files on hand with that field populated and wasn't sure what the data format was or what exactly it was used for in Wayback.
