
core(is-crawlable): make sure that page is not blocked by robots.txt file #4548

Merged
7 commits merged into GoogleChrome:master on Mar 2, 2018

Conversation

kdzwinel
Collaborator

First part of #4356

Extends the is-crawlable audit so that it also verifies the robots.txt file (in addition to the meta tag and header).

[Screenshot (2018-02-15): is-crawlable audit result showing where the blocking directive comes from]

This change adds a new dependency: robots-parser. It's a lightweight, well-tested robots.txt parser with no dependencies of its own.
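For reference, a minimal sketch of how robots-parser answers the "is this URL blocked?" question; the URL and rules below are made up for illustration and are not taken from this PR:

const robotsParser = require('robots-parser');

// Parse a robots.txt body fetched from the page's origin (contents are illustrative).
const robots = robotsParser('https://example.com/robots.txt',
  'User-agent: *\nDisallow: /private/');

// isAllowed/isDisallowed return booleans for URLs on the same origin.
robots.isAllowed('https://example.com/page.html', 'Googlebot');            // true
robots.isDisallowed('https://example.com/private/page.html', 'Googlebot'); // true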

    return response.text()
      .then(content => ({status: response.status, content}));
  })
  .catch(_ => ({status: null, content: null}));
kdzwinel
Collaborator Author


The second part of #4356 is a new audit that checks whether the robots.txt file is valid. That audit should fail if the fetch returns HTTP 500+, which is why we collect the status here.
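For context, the quoted lines are the tail end of a fetch of the page's /robots.txt. Roughly, the whole expression looks like the sketch below; the pageUrl variable and the exact shape are illustrative, not the literal gatherer code:

const pageUrl = 'https://example.com/article'; // illustrative

fetch(new URL('/robots.txt', pageUrl).href)
  .then(response => {
    // Keep both the HTTP status and the body so a later audit can inspect them.
    return response.text()
      .then(content => ({status: response.status, content}));
  })
  // Network-level failure: record that robots.txt could not be fetched at all.
  .catch(_ => ({status: null, content: null}));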

Member

@paulirish left a comment


I love the PR. Just one question:

Right now we don't really expose where the blocking directive came from: meta tag, response header, or robots.txt. Feels like knowing that, in the case of a failure, would be really useful. wdyt?

@paulirish
Member

Also... looking at the robots-parser test suite, do you think there are any additional cases they should add?

I'm looking at:
https://github.com/ChristopherAkroyd/robots-txt-parser/tree/master/test/test-data
https://github.com/python/cpython/blob/master/Lib/test/test_robotparser.py
https://github.com/meanpath/robots/tree/master/test

@kdzwinel
Collaborator Author

kdzwinel commented Mar 1, 2018

@paulirish we do show where the blocking directive comes from (header, meta, robots.txt) - check out the screenshot at the top. It'd be great to also show which robots.txt directive is blocking, but robots-parser doesn't tell us that.
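As a rough illustration of that per-source reporting (the helper and its flags below are hypothetical names, not the actual Lighthouse audit code), the audit can gather every source that blocks indexing and surface the list on failure:

const robotsParser = require('robots-parser');

// Hypothetical helper: collect every source that blocks indexing so a failing
// audit can say where the directive came from.
function findBlockingSources({headerBlocks, metaTagBlocks, robotsTxt, pageUrl, userAgent}) {
  const sources = [];
  if (headerBlocks) sources.push('x-robots-tag header');
  if (metaTagBlocks) sources.push('robots meta tag');
  if (robotsTxt) {
    const robots = robotsParser(new URL('/robots.txt', pageUrl).href, robotsTxt);
    if (robots.isDisallowed(pageUrl, userAgent)) sources.push('robots.txt');
  }
  return sources; // an empty array means nothing blocks crawling
}

findBlockingSources({
  headerBlocks: false,
  metaTagBlocks: false,
  robotsTxt: 'User-agent: *\nDisallow: /',
  pageUrl: 'https://example.com/',
  userAgent: 'Googlebot',
}); // ['robots.txt']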

Regarding tests - I quickly scanned the test suites you linked and the robots-parser test file, and I don't see any prominent test cases missing. Are you concerned about some specific case?

@GoogleChrome deleted a comment from googlebot on Mar 2, 2018
@paulirish merged commit fd750fc into GoogleChrome:master on Mar 2, 2018