
core(is-crawlable): make sure that page is not blocked by robots.txt file #4548

Merged
7 commits merged into GoogleChrome:master on Mar 2, 2018

Conversation

kdzwinel
Collaborator

First part of #4356

Extends the is-crawlable audit so that it also verifies the robots.txt file (in addition to the meta tag and header).

[Screenshot (2018-02-15): is-crawlable audit result showing where the blocking directive comes from]

This change adds a new dependency: robots-parser. It's a lightweight, well-tested robots.txt parser with no dependencies of its own.
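For reference, a minimal sketch of how robots-parser answers the "is this URL blocked?" question; the URL and rules below are made up for illustration and are not taken from this PR:

const robotsParser = require('robots-parser');

// Parse a robots.txt body fetched from the page's origin (contents are illustrative).
const robots = robotsParser('https://example.com/robots.txt',
  'User-agent: *\nDisallow: /private/');

// isAllowed/isDisallowed return booleans for URLs on the same origin.
robots.isAllowed('https://example.com/page.html', 'Googlebot');            // true
robots.isDisallowed('https://example.com/private/page.html', 'Googlebot'); // true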

    return response.text()
      .then(content => ({status: response.status, content}));
  })
  .catch(_ => ({status: null, content: null}));
kdzwinel
Collaborator Author


The second part of #4356 is a new audit that checks whether the robots.txt file is valid. That audit should fail if the fetch returns HTTP 500+, which is why we collect the status here.
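For context, the quoted lines are the tail end of a fetch of the page's /robots.txt. Roughly, the whole expression looks like the sketch below; the pageUrl variable and the exact shape are illustrative, not the literal gatherer code:

const pageUrl = 'https://example.com/article'; // illustrative

fetch(new URL('/robots.txt', pageUrl).href)
  .then(response => {
    // Keep both the HTTP status and the body so a later audit can inspect them.
    return response.text()
      .then(content => ({status: response.status, content}));
  })
  // Network-level failure: record that robots.txt could not be fetched at all.
  .catch(_ => ({status: null, content: null}));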

Member

@paulirish left a comment


I love the PR. Just one question:

Right now we don't really expose where the blocking directive came from: meta tag, response header, or robots.txt. Feels like knowing that, in the case of a failure, would be really useful. wdyt?

@paulirish
Member

Also... looking at the robots-parser test suite, do you think there are any additional cases they should add?

I'm looking at:
https://github.com/ChristopherAkroyd/robots-txt-parser/tree/master/test/test-data
https://github.com/python/cpython/blob/master/Lib/test/test_robotparser.py
https://github.com/meanpath/robots/tree/master/test

@kdzwinel
Collaborator Author

kdzwinel commented Mar 1, 2018

@paulirish we do show where the blocking directive comes from (header, meta, robots.txt) - check out the screenshot at the top. It'd be great to also show which robots.txt directive is blocking, but robots-parser doesn't tell us that.
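As a rough illustration of that per-source reporting (the helper and its flags below are hypothetical names, not the actual Lighthouse audit code), the audit can gather every source that blocks indexing and surface the list on failure:

const robotsParser = require('robots-parser');

// Hypothetical helper: collect every source that blocks indexing so a failing
// audit can say where the directive came from.
function findBlockingSources({headerBlocks, metaTagBlocks, robotsTxt, pageUrl, userAgent}) {
  const sources = [];
  if (headerBlocks) sources.push('x-robots-tag header');
  if (metaTagBlocks) sources.push('robots meta tag');
  if (robotsTxt) {
    const robots = robotsParser(new URL('/robots.txt', pageUrl).href, robotsTxt);
    if (robots.isDisallowed(pageUrl, userAgent)) sources.push('robots.txt');
  }
  return sources; // an empty array means nothing blocks crawling
}

findBlockingSources({
  headerBlocks: false,
  metaTagBlocks: false,
  robotsTxt: 'User-agent: *\nDisallow: /',
  pageUrl: 'https://example.com/',
  userAgent: 'Googlebot',
}); // ['robots.txt']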

Regarding tests - I quickly scanned the test suites you linked and the robots-parser test file, and I don't see any prominent test cases missing. Are you concerned about some specific case?

@GoogleChrome deleted a comment from googlebot on Mar 2, 2018
@paulirish merged commit fd750fc into GoogleChrome:master on Mar 2, 2018