bugs.chromium.org reports an incorrect robots.txt restriction #140

nightpool · 2017-02-24T02:35:50Z

Navigate to: https://web.archive.org/web/http://bugs.chromium.org/p/project-zero/issues/detail?id=1139

See that the Wayback Machine says it's blocked by robots.txt.

See that the robots.txt for that domain, while complicated, specifically allows that type of URL:

User-agent: *
# Start by disallowing everything.
Disallow: /
# Some specific things are okay, though.
Allow: /$
Allow: /hosting
Allow: /p/*/adminIntro
# Query strings are hard. We only allow ?id=N, no other parameters.
Allow: /p/*/issues/detail?id=*
Disallow: /p/*/issues/detail?id=*&*
Disallow: /p/*/issues/detail?*&id=*
# 10 second crawl delay for bots that honor it.
Crawl-delay: 10

Expect that complex robots.txt files are parsed and matched correctly by the Wayback Machine.


nightpool · 2017-02-24T02:40:37Z

web-beta seems to report a similar but distinct error. Not sure if this is related or not: https://web-beta.archive.org/web/20170224002517/https://bugs.chromium.org/p/project-zero/issues/detail?id=1139

(I'm sure the page is archived, because it shows up under the search: https://web-beta.archive.org/web/*/https://bugs.chromium.org/p/project-zero/issues/detail?id=1139)

kngenie · 2017-09-21T00:09:17Z

Java Wayback's robots.txt parser does not understand wildcards or Allow: directives.
The reported URL now plays back on web.archive.org, which uses the new robots.txt parser.
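
For anyone following along, here is a minimal sketch of what matching with wildcard and Allow: support could look like. It is not the Wayback parser's code. It assumes the precedence rule described in RFC 9309 and Google's robots.txt documentation (the longest matching pattern wins, and Allow wins a tie), and the names `pattern_to_regex`, `is_allowed`, and `RULES` are made up for illustration. The rules are the `User-agent: *` group quoted in the report.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Return True if `path` (including query string) may be fetched.

    Assumption: the longest matching pattern wins, Allow wins ties,
    and a path that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed


# The rule group from the bugs.chromium.org robots.txt quoted above.
RULES = [
    ("Disallow", "/"),
    ("Allow", "/$"),
    ("Allow", "/hosting"),
    ("Allow", "/p/*/adminIntro"),
    ("Allow", "/p/*/issues/detail?id=*"),
    ("Disallow", "/p/*/issues/detail?id=*&*"),
    ("Disallow", "/p/*/issues/detail?*&id=*"),
]

print(is_allowed(RULES, "/p/project-zero/issues/detail?id=1139"))
print(is_allowed(RULES, "/p/project-zero/issues/detail?id=1139&can=1"))
```

Under these assumptions the reported URL is allowed, because `Allow: /p/*/issues/detail?id=*` is a longer match than `Disallow: /`, while the same URL with an extra query parameter is blocked by the longer `Disallow` pattern. A matcher that only does prefix comparison on Disallow: lines would stop at `Disallow: /` and block everything.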

kngenie closed this as completed Sep 21, 2017

ErikBorra mentioned this issue Nov 10, 2017

Discrepancy between web and api results #168

Open
