bugs.chromium.org reports an incorrect robots.txt restriction #140

Closed
nightpool opened this issue Feb 24, 2017 · 2 comments

@nightpool

Navigate to: https://web.archive.org/web/http://bugs.chromium.org/p/project-zero/issues/detail?id=1139

See that the Wayback Machine reports the page as blocked by robots.txt:

[image]

See that the robots.txt for that domain, while complicated, specifically allows that type of URL:

User-agent: *
# Start by disallowing everything.
Disallow: /
# Some specific things are okay, though.
Allow: /$
Allow: /hosting
Allow: /p/*/adminIntro
# Query strings are hard. We only allow ?id=N, no other parameters.
Allow: /p/*/issues/detail?id=*
Disallow: /p/*/issues/detail?id=*&*
Disallow: /p/*/issues/detail?*&id=*
# 10 second crawl delay for bots that honor it.
Crawl-delay: 10

Expected: complex robots.txt files are parsed and matched correctly by the Wayback Machine.
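For illustration, here is a minimal sketch of the matching this report expects. It assumes the longest-match convention described in RFC 9309 and in Google's robots.txt documentation (the most specific matching rule wins, and Allow wins a tie). It is not the Wayback Machine's actual parser; the names `RULES`, `pattern_to_regex`, and `is_allowed` are made up for this example.

```python
import re

# The rules from the "User-agent: *" group quoted above, in file order.
RULES = [
    ("disallow", "/"),
    ("allow", "/$"),
    ("allow", "/hosting"),
    ("allow", "/p/*/adminIntro"),
    ("allow", "/p/*/issues/detail?id=*"),
    ("disallow", "/p/*/issues/detail?id=*&*"),
    ("disallow", "/p/*/issues/detail?*&id=*"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else (including '?') is treated literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules=RULES):
    """Return True if `path` (including any query string) may be fetched."""
    best = None  # (pattern length, is_allow); longer wins, Allow wins ties
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the path is allowed by default.
    return True if best is None else best[1]


print(is_allowed("/p/project-zero/issues/detail?id=1139"))        # True
print(is_allowed("/p/project-zero/issues/detail?id=1139&can=1"))  # False
print(is_allowed("/p/project-zero/issues/list"))                  # False
```

Under these assumptions the reported URL is allowed, because `Allow: /p/*/issues/detail?id=*` is a longer match than `Disallow: /`, while the same URL with an extra query parameter stays blocked.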

@nightpool
Author

web-beta seems to report a similar but distinct error. Not sure if this is related or not: https://web-beta.archive.org/web/20170224002517/https://bugs.chromium.org/p/project-zero/issues/detail?id=1139

[image]

(I'm sure the page is archived, because it shows up under the search: https://web-beta.archive.org/web/*/https://bugs.chromium.org/p/project-zero/issues/detail?id=1139)

@kngenie
Member

kngenie commented Sep 21, 2017

Java Wayback's robots.txt parser does not understand wildcards or Allow: directives.
The reported URL now plays back on web.archive.org (which uses the new robots.txt parser).
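To see why such a parser would block the page, here is a hypothetical sketch of a matcher that only honors `Disallow:` lines and treats each one as a literal prefix. This is an assumption about the general failure mode, not the actual Java Wayback code; `DISALLOW_PREFIXES` and `naive_is_allowed` are names invented for this example.

```python
# Only the Disallow lines from the robots.txt quoted in this issue;
# a parser that ignores Allow: never sees the exceptions.
DISALLOW_PREFIXES = [
    "/",
    "/p/*/issues/detail?id=*&*",
    "/p/*/issues/detail?*&id=*",
]


def naive_is_allowed(path, disallow_prefixes=DISALLOW_PREFIXES):
    """Literal prefix matching: '*' is not expanded, Allow is not consulted."""
    return not any(path.startswith(prefix) for prefix in disallow_prefixes)


print(naive_is_allowed("/p/project-zero/issues/detail?id=1139"))  # False
```

With `Disallow: /` matching every path and no Allow rules to override it, everything on the host looks blocked, which is consistent with the behavior reported above.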
