new_audit(robots-txt): /robots.txt validation #4845

Merged
merged 10 commits into GoogleChrome:master from robots-txt-validator on Mar 27, 2018

Conversation

@kdzwinel (Collaborator) commented Mar 22, 2018

Closes #4356

The validator spec, and the list of resources I used to create it, can be found here.

I've run this validator on the top 1000 domains and it found issues on 39 of them: https://gist.github.com/kdzwinel/b791967eb66d0e2925ea22c8ca14233a

Example websites with errors in their robots.txt: netflix.com, chase.com
Example website with no robots.txt: office.com
Example websites with valid robots.txt: nike.com, brainly.com

Example failure:
[screenshot: example failure, 2018-03-22 at 16:16:44]

@patrickhulce (Collaborator) left a comment

lookin' good! a fileoverview comment with some links to any resources you used to help craft these rules would be ✈️
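For anyone following along, the kind of fileoverview being suggested would look roughly like this (purely illustrative; the actual wording and links are up to the author):

/**
 * @fileoverview Validates a page's /robots.txt file.
 *
 * The parsing and validation rules are based on the resources collected in the
 * validator spec referenced at the top of this PR.
 */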

if (!status) {
  return {
    rawValue: false,
    debugString: 'Lighthouse was unable to download your robots.txt file',
@patrickhulce (Collaborator):
should we mark this as not applicable instead?

@kdzwinel (Collaborator, Author):
This will only happen when a fetch call fails which, AFAIK, can only happen on a network failure (or a CORS failure, but we are fetching a local resource here). IMO that's an extraordinary situation which prevents us from finishing the audit, ergo the error. Let me know if that makes sense to you!

@patrickhulce (Collaborator):
that's fair, sounds fine to me 👍

* @param {!string} line single line from a robots.txt file
* @returns {!{directive: string, value: string}}
*/
function parseLine(line) {
@patrickhulce (Collaborator):
this function is a tad long, think we can break this up into a few subchunks of work? maybe ~45-80 gets split out to parseDirective or something?

@kdzwinel (Collaborator, Author):
good call, I extracted the part that was validating the directives to a separate function
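For readers skimming the thread, a rough sketch of the kind of split being discussed. The names below (validateDirectiveAndValue, KNOWN_DIRECTIVES) are illustrative and the directive list is simplified; this is not the exact code merged in this PR:

// Illustrative sketch only, not the merged implementation.
const KNOWN_DIRECTIVES = new Set(['user-agent', 'allow', 'disallow', 'sitemap']);

// Directive-specific checks, extracted out of parseLine.
function validateDirectiveAndValue(directive, value) {
  if (!KNOWN_DIRECTIVES.has(directive)) {
    throw new Error('Unknown directive');
  }
  if ((directive === 'allow' || directive === 'disallow') &&
      value !== '' && !value.startsWith('/') && !value.startsWith('*')) {
    throw new Error('Pattern should either be empty, start with "/" or "*"');
  }
}

// parseLine now only splits the line apart and delegates validation.
function parseLine(line) {
  const separatorIndex = line.indexOf(':');
  if (separatorIndex === -1) {
    throw new Error('Syntax not understood: no directive separator found');
  }
  const directive = line.slice(0, separatorIndex).trim().toLowerCase();
  const value = line.slice(separatorIndex + 1).trim();
  validateDirectiveAndValue(directive, value);
  return {directive, value};
}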

throw new Error('Pattern should either be empty, start with "/" or "*"');
}

const dolarIndex = directiveValue.indexOf('$');
@patrickhulce (Collaborator):
nit: dollar :)

@kdzwinel (Collaborator, Author):
💸good catch!

}

if (parsedLine.directive === DIRECTIVE_USER_AGENT) {
  inGroup = true;
@patrickhulce (Collaborator):
once you enter a group you never leave it? man robots files are weird 😆

a comment or two explaining how this works would be nice for us less seo-expert maintainers :)

@kdzwinel (Collaborator, Author):
I added a short comment with an explanation and a doc link in this code 👍

BTW I wrote down a list of the rules and resources that I used here
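To make the grouping rule concrete for non-SEO readers: allow/disallow rules are only valid inside a group, a group is opened by a user-agent line, and for validation purposes the flag never needs to be reset. A minimal sketch, assuming lines are already parsed into {directive, value} objects (names and messages are illustrative, not the PR's exact code):

function findGroupingErrors(parsedLines) {
  const errors = [];
  let inGroup = false; // becomes true at the first user-agent line and stays true
  parsedLines.forEach((parsedLine, index) => {
    if (parsedLine.directive === 'user-agent') {
      inGroup = true;
    } else if (!inGroup &&
        (parsedLine.directive === 'allow' || parsedLine.directive === 'disallow')) {
      errors.push({index: (index + 1).toString(), message: 'No user-agent specified'});
    }
  });
  return errors;
}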

@kdzwinel (Collaborator, Author) left a comment
@patrickhulce thank you for your input! I addressed all of your comments.

const SITEMAP_VALID_PROTOCOLS = new Set(['https:', 'http:', 'ftp:']);

/**
* @param {!string} directiveName
@patrickhulce (Collaborator) commented Mar 23, 2018:

no ! necessary with our tsc typechecking anymore :)

mind adding this file to the typechecked files in tsconfig while we're at it? if it leads down a rabbit hole of other files to fix, then don't worry about it IMO
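For reference, opting a JS file into tsc's checking typically means listing it in the tsconfig used for type checking, along these lines (a hypothetical excerpt; Lighthouse's actual tsconfig layout and the audit's path are assumptions here):

{
  "compilerOptions": {
    "allowJs": true,   // let tsc load .js files
    "checkJs": true,   // report type errors found in them
    "noEmit": true     // type-check only, emit nothing
  },
  "include": [
    "lighthouse-core/audits/seo/robots-txt.js"  // assumed path of the new audit
  ]
}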

@patrickhulce (Collaborator) left a comment

LGTM % nits!

throwing a link to that great comment of resources in a fileoverview would be great 👍


return null;
})
.filter(error => error !== null);
@patrickhulce (Collaborator):
nit: can we rename either this variable or the error.error to error.message

name: 'robots-txt',
description: 'robots.txt is valid',
failureDescription: 'robots.txt is not valid',
helpText: 'If your robots.txt file is malformed, crawlers may not be able to understand ' +
@patrickhulce (Collaborator):
do we have a link we can give them? perhaps your favorite validator/resource :)

@kdzwinel (Collaborator, Author) commented Mar 23, 2018:

Since we are trying to stay crawler-agnostic, I think we shouldn't link to any of the official google/bing/yandex/yahoo docs here. The only independent resource that comes to mind is http://www.robotstxt.org/ but TBH it wasn't that useful to me. IMO it would be best to wait for the proper developer.google.com/lighthouse docs for this audit (which are coming soon - #4355).

@patrickhulce (Collaborator):

@kdzwinel were you able to give the typechecking a try by chance?

@kdzwinel (Collaborator, Author) commented Mar 24, 2018:

@patrickhulce yeah, I'm wrestling with it (so far - type checking: 1, Konrad: 0). It will be the first audit in tsconfig, so the path isn't paved yet.

- return {
-   index: index + 1,
+ errors.push({
+   index: (index + 1).toString(),
@kdzwinel (Collaborator, Author):
The Audit.makeTableDetails type definition forces all of these fields to be strings.

.split(/\r\n|\r|\n/)
.map((line, index) => {
@kdzwinel (Collaborator, Author) commented Mar 24, 2018:

I couldn't figure out how to explain to tsc that:

arr.map(() => {
  if (x) { return {}; }
  return null;
})
.filter(x => x !== null)

can't possibly return nulls, because they are filtered out, so I rewrote the whole thing using forEach.
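For anyone hitting the same tsc limitation, a minimal sketch of the forEach-style workaround described above (the checkLine helper and sample content are made up for illustration; the PR's real code differs):

// Push only non-null results into a typed array, so null never enters the
// array's element type and tsc has nothing to complain about.
const content = 'User-agent: *\nDisallow: /private';

// Hypothetical per-line check; returns an error message or null.
function checkLine(line) {
  return line.includes(':') ? null : 'Syntax not understood: no directive separator found';
}

/** @type {Array<{index: string, message: string}>} */
const errors = [];
content
  .split(/\r\n|\r|\n/)
  .forEach((line, index) => {
    const lineError = checkLine(line);
    if (lineError !== null) {
      errors.push({index: (index + 1).toString(), message: lineError});
    }
  });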

@kdzwinel (Collaborator, Author):

@patrickhulce ok, it looks like I figured it out 👍 I had to rewrite parts of the validateRobots function though - PTAL

@patrickhulce (Collaborator) left a comment

LGTM!! great work pushing this through type checking :D

off to @brendankenny in case he has opinions on first typechecked audit


/**
* @param {{RobotsTxt: {status: number, content: string}}} artifacts
* @return {LH.Audit.Product}
@patrickhulce (Collaborator):
heh somewhat strange to see this in real usage :)

@brendankenny (Member) left a comment

Yep, great work on those types :) LGTM

@brendankenny brendankenny merged commit 42d47ba into GoogleChrome:master Mar 27, 2018
@kdzwinel kdzwinel deleted the robots-txt-validator branch March 27, 2018 21:18