new_audit(robots-txt): /robots.txt validation #4845

Merged
merged 10 commits into GoogleChrome:master from robots-txt-validator on Mar 27, 2018

Conversation

@kdzwinel (Collaborator) commented Mar 22, 2018

Closes #4356

The validator spec, and the list of resources I used to create it, can be found here.

I've run this validator on the top 1000 domains and it found issues on 39 of them: https://gist.github.com/kdzwinel/b791967eb66d0e2925ea22c8ca14233a

Example websites with errors in their robots.txt: netflix.com, chase.com
Example website with no robots.txt: office.com
Example websites with valid robots.txt: nike.com, brainly.com

Example failure:
[screenshot: example failure, 2018-03-22 at 16:16:44]

@patrickhulce (Collaborator) left a comment

lookin' good! a fileoverview comment with some links to any resources you used to help craft these rules would be ✈️
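For anyone following along, the kind of fileoverview being suggested would look roughly like this (purely illustrative; the actual wording and links are up to the author):

/**
 * @fileoverview Validates a page's /robots.txt file.
 *
 * The parsing and validation rules are based on the resources collected in the
 * validator spec referenced at the top of this PR.
 */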

if (!status) {
  return {
    rawValue: false,
    debugString: 'Lighthouse was unable to download your robots.txt file',
@patrickhulce (Collaborator):
should we mark this as not applicable instead?

@kdzwinel (Collaborator, Author):
This will only happen when a fetch call fails which, AFAIK, can only happen on a network failure (or a CORS failure, but we are fetching a local resource here). IMO that's an extraordinary situation which prevents us from finishing the audit, ergo the error. Let me know if that makes sense to you!

@patrickhulce (Collaborator):
that's fair, sounds fine to me 👍

* @param {!string} line single line from a robots.txt file
* @returns {!{directive: string, value: string}}
*/
function parseLine(line) {
@patrickhulce (Collaborator):
this function is a tad long, think we can break this up into a few subchunks of work? maybe ~45-80 gets split out to parseDirective or something?

@kdzwinel (Collaborator, Author):
good call, I extracted the part that was validating the directives to a separate function
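For readers skimming the thread, a rough sketch of the kind of split being discussed. The names below (validateDirectiveAndValue, KNOWN_DIRECTIVES) are illustrative and the directive list is simplified; this is not the exact code merged in this PR:

// Illustrative sketch only, not the merged implementation.
const KNOWN_DIRECTIVES = new Set(['user-agent', 'allow', 'disallow', 'sitemap']);

// Directive-specific checks, extracted out of parseLine.
function validateDirectiveAndValue(directive, value) {
  if (!KNOWN_DIRECTIVES.has(directive)) {
    throw new Error('Unknown directive');
  }
  if ((directive === 'allow' || directive === 'disallow') &&
      value !== '' && !value.startsWith('/') && !value.startsWith('*')) {
    throw new Error('Pattern should either be empty, start with "/" or "*"');
  }
}

// parseLine now only splits the line apart and delegates validation.
function parseLine(line) {
  const separatorIndex = line.indexOf(':');
  if (separatorIndex === -1) {
    throw new Error('Syntax not understood: no directive separator found');
  }
  const directive = line.slice(0, separatorIndex).trim().toLowerCase();
  const value = line.slice(separatorIndex + 1).trim();
  validateDirectiveAndValue(directive, value);
  return {directive, value};
}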

throw new Error('Pattern should either be empty, start with "/" or "*"');
}

const dolarIndex = directiveValue.indexOf('$');
@patrickhulce (Collaborator):
nit: dollar :)

@kdzwinel (Collaborator, Author):
💸good catch!

}

if (parsedLine.directive === DIRECTIVE_USER_AGENT) {
  inGroup = true;
@patrickhulce (Collaborator):
once you enter a group you never leave it? man robots files are weird 😆

a comment or two explaining how this works would be nice for us less seo-expert maintainers :)

@kdzwinel (Collaborator, Author):
I added a short comment with an explanation and a doc link in this code 👍

BTW I wrote down a list of the rules and resources that I used here
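To make the grouping rule concrete for non-SEO readers: allow/disallow rules are only valid inside a group, a group is opened by a user-agent line, and for validation purposes the flag never needs to be reset. A minimal sketch, assuming lines are already parsed into {directive, value} objects (names and messages are illustrative, not the PR's exact code):

function findGroupingErrors(parsedLines) {
  const errors = [];
  let inGroup = false; // becomes true at the first user-agent line and stays true
  parsedLines.forEach((parsedLine, index) => {
    if (parsedLine.directive === 'user-agent') {
      inGroup = true;
    } else if (!inGroup &&
        (parsedLine.directive === 'allow' || parsedLine.directive === 'disallow')) {
      errors.push({index: (index + 1).toString(), message: 'No user-agent specified'});
    }
  });
  return errors;
}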

@kdzwinel (Collaborator, Author) left a comment
@patrickhulce thank you for your input! I addressed all of your comments.

const SITEMAP_VALID_PROTOCOLS = new Set(['https:', 'http:', 'ftp:']);

/**
* @param {!string} directiveName
@patrickhulce (Collaborator) commented Mar 23, 2018:

no ! necessary with our tsc typechecking anymore :)

mind adding this file to the typechecked files in tsconfig while we're at it? if it leads down a rabbit hole of other files to fix, then don't worry about it IMO
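For reference, opting a JS file into tsc's checking typically means listing it in the tsconfig used for type checking, along these lines (a hypothetical excerpt; Lighthouse's actual tsconfig layout and the audit's path are assumptions here):

{
  "compilerOptions": {
    "allowJs": true,   // let tsc load .js files
    "checkJs": true,   // report type errors found in them
    "noEmit": true     // type-check only, emit nothing
  },
  "include": [
    "lighthouse-core/audits/seo/robots-txt.js"  // assumed path of the new audit
  ]
}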

@patrickhulce (Collaborator) left a comment

LGTM % nits!

throwing a link to that great comment of resources in a fileoverview would be great 👍


return null;
})
.filter(error => error !== null);
@patrickhulce (Collaborator):
nit: can we rename either this variable or the error.error to error.message

name: 'robots-txt',
description: 'robots.txt is valid',
failureDescription: 'robots.txt is not valid',
helpText: 'If your robots.txt file is malformed, crawlers may not be able to understand ' +
@patrickhulce (Collaborator):
do we have a link we can give them? perhaps your favorite validator/resource :)

@kdzwinel (Collaborator, Author) commented Mar 23, 2018:

Since we are trying to stay crawler-agnostic, I think we shouldn't link to any of the official google/bing/yandex/yahoo docs here. The only independent resource that comes to mind is http://www.robotstxt.org/ but TBH it wasn't that useful to me. IMO it would be best to wait for the proper developer.google.com/lighthouse docs for this audit (which are coming soon - #4355).

@patrickhulce (Collaborator):

@kdzwinel were you able to give the typechecking a try by chance?

@kdzwinel (Collaborator, Author) commented Mar 24, 2018:

@patrickhulce yeah, I'm wrestling with it (so far - type checking: 1, Konrad: 0). It will be the first audit in tsconfig, so the path isn't paved yet.

- return {
-   index: index + 1,
+ errors.push({
+   index: (index + 1).toString(),
@kdzwinel (Collaborator, Author):
The Audit.makeTableDetails type definition forces all of these fields to be strings.

.split(/\r\n|\r|\n/)
.map((line, index) => {
@kdzwinel (Collaborator, Author) commented Mar 24, 2018:

I couldn't figure out how to explain to tsc that:

arr.map(() => {
  if (x) { return {}; }
  return null;
})
.filter(x => x !== null)

can't possibly return nulls, because they are filtered out, so I rewrote the whole thing using forEach.
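For anyone hitting the same tsc limitation, a minimal sketch of the forEach-style workaround described above (the checkLine helper and sample content are made up for illustration; the PR's real code differs):

// Push only non-null results into a typed array, so null never enters the
// array's element type and tsc has nothing to complain about.
const content = 'User-agent: *\nDisallow: /private';

// Hypothetical per-line check; returns an error message or null.
function checkLine(line) {
  return line.includes(':') ? null : 'Syntax not understood: no directive separator found';
}

/** @type {Array<{index: string, message: string}>} */
const errors = [];
content
  .split(/\r\n|\r|\n/)
  .forEach((line, index) => {
    const lineError = checkLine(line);
    if (lineError !== null) {
      errors.push({index: (index + 1).toString(), message: lineError});
    }
  });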

@kdzwinel (Collaborator, Author):

@patrickhulce ok, it looks like I figured it out 👍 I had to rewrite parts of the validateRobots function though - PTAL

@patrickhulce (Collaborator) left a comment

LGTM!! great work pushing this through type checking :D

off to @brendankenny in case he has opinions on first typechecked audit


/**
* @param {{RobotsTxt: {status: number, content: string}}} artifacts
* @return {LH.Audit.Product}
@patrickhulce (Collaborator):
heh somewhat strange to see this in real usage :)

@brendankenny (Member) left a comment

Yep, great work on those types :) LGTM

@brendankenny brendankenny merged commit 42d47ba into GoogleChrome:master Mar 27, 2018
@kdzwinel kdzwinel deleted the robots-txt-validator branch March 27, 2018 21:18