new_audit(robots-txt): /robots.txt validation #4845
Conversation
lookin' good! a fileoverview comment with some links to any resources you used to help craft these rules would be great
```js
if (!status) {
  return {
    rawValue: false,
    debugString: 'Lighthouse was unable to download your robots.txt file',
```
should we mark this as not applicable instead?
This will only happen when a `fetch` call fails which, AFAIK, can only happen on a network failure (or a CORS failure, but we are fetching a local resource here). IMO that's an extraordinary situation which prevents us from finishing the audit, ergo the error. Let me know if that makes sense to you!
that's fair, sounds fine to me 👍
```js
 * @param {!string} line single line from a robots.txt file
 * @returns {!{directive: string, value: string}}
 */
function parseLine(line) {
```
this function is a tad long, think we can break this up into a few subchunks of work? maybe ~45-80 gets split out to parseDirective or something?
good call, I extracted the part that was validating the directives to a separate function
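As an illustration of the kind of split discussed here, a minimal sketch of a line parser that does only the splitting, leaving directive validation to a separate helper (the names and error messages below are hypothetical, not the PR's actual code):

```javascript
// Hypothetical sketch of a parseLine that only splits a robots.txt line into
// a directive and a value; validating the directive itself would live in a
// separate function (e.g. a parseDirective/validateDirective helper).
function parseLine(line) {
  // Strip comments ('#' to end of line) and surrounding whitespace.
  const withoutComments = line.replace(/#.*$/, '').trim();
  const separatorIndex = withoutComments.indexOf(':');

  if (separatorIndex === -1) {
    throw new Error('Syntax not understood');
  }

  const directive = withoutComments.slice(0, separatorIndex).trim().toLowerCase();
  const value = withoutComments.slice(separatorIndex + 1).trim();

  return {directive, value};
}
```

e.g. `parseLine('User-agent: *')` yields `{directive: 'user-agent', value: '*'}`.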
```js
  throw new Error('Pattern should either be empty, start with "/" or "*"');
}

const dolarIndex = directiveValue.indexOf('$');
```
nit: dollar :)
💸good catch!
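For context, the dollar-index check is about the end-of-pattern anchor: `$` is only meaningful as the last character of an allow/disallow pattern. A hedged sketch of that rule (function name and messages are illustrative, not the PR's actual code):

```javascript
// Illustrative sketch: validate an allow/disallow pattern value. Per common
// robots.txt conventions, the pattern must be empty or start with '/' or '*',
// and '$' (the end-of-URL anchor) may only appear at the very end.
function validateAllowDisallowValue(value) {
  if (value !== '' && !value.startsWith('/') && !value.startsWith('*')) {
    throw new Error('Pattern should either be empty, start with "/" or "*"');
  }

  const dollarIndex = value.indexOf('$');
  if (dollarIndex !== -1 && dollarIndex !== value.length - 1) {
    throw new Error('"$" should only be used at the end of the pattern');
  }
}
```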
```js
}

if (parsedLine.directive === DIRECTIVE_USER_AGENT) {
  inGroup = true;
```
once you enter a group you never leave it? man robots files are weird 😆
a comment or two explaining how this works would be nice for us less seo-expert maintainers :)
I added a short comment with explanation and doc link in this code 👍
BTW I wrote down list of rules and resources that I've used here
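To make the group semantics concrete: rules such as `allow`/`disallow` only make sense inside a group, a group is opened by a `user-agent` line, and a group never closes (later `user-agent` lines just start new groups). A hedged sketch of a checker built on that idea (not the PR's actual implementation):

```javascript
// Illustrative sketch of the "group" rule: flag allow/disallow directives
// that appear before any user-agent line has opened a group. Once inGroup
// flips to true it never flips back, matching the semantics described above.
function findRulesOutsideGroup(directives) {
  const orphanIndices = [];
  let inGroup = false;

  directives.forEach((directive, index) => {
    if (directive === 'user-agent') {
      inGroup = true;
    } else if ((directive === 'allow' || directive === 'disallow') && !inGroup) {
      orphanIndices.push(index);
    }
  });

  return orphanIndices;
}
```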
@patrickhulce thank you for your input! I addressed all of your comments.
```js
const SITEMAP_VALID_PROTOCOLS = new Set(['https:', 'http:', 'ftp:']);

/**
 * @param {!string} directiveName
```
no `!` necessary with our tsc typechecking anymore :) mind adding this file to the typechecked files in `tsconfig` while we're at it? if it leads down a rabbit hole of other files to fix, then don't worry about it IMO
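For reference, opting individual JS files into type checking generally means listing them in the `include` (or `files`) array of the typechecking `tsconfig.json`; the audit path below is illustrative, and the real Lighthouse config may be shaped differently:

```json
{
  "compilerOptions": {
    "allowJs": true,
    "checkJs": true,
    "noEmit": true
  },
  "include": [
    "lighthouse-core/audits/seo/robots-txt.js"
  ]
}
```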
LGTM % nits!
throwing a link to that great comment of resources in a fileoverview would be great 👍
```js
    return null;
  })
  .filter(error => error !== null);
```
nit: can we rename either this variable or the `error.error` to `error.message`?
```js
name: 'robots-txt',
description: 'robots.txt is valid',
failureDescription: 'robots.txt is not valid',
helpText: 'If your robots.txt file is malformed, crawlers may not be able to understand ' +
```
do we have a link we can give them? perhaps your favorite validator/resource :)
Since we are trying to stay crawler-agnostic, I think we shouldn't link to any of the official google/bing/yandex/yahoo docs here. The only independent resource that comes to mind is http://www.robotstxt.org/ but TBH it wasn't that useful to me. IMO it would be best to wait for the proper developer.google.com/lighthouse docs for this audit (which are coming soon - #4355).
@kdzwinel were you able to give the typechecking a try by chance?
@patrickhulce yeah, I'm wrestling with it (so far - type checking: 1, Konrad: 0). It will be the first audit in tsconfig, so the path isn't paved yet.
```diff
-return {
-  index: index + 1,
+errors.push({
+  index: (index + 1).toString(),
```
`Audit.makeTableDetails`'s type definition forces all of these fields to be strings
```js
  .split(/\r\n|\r|\n/)
  .map((line, index) => {
```
I couldn't figure out how to explain to tsc that:

```js
arr.map(() => {
  if (x) { return {}; }
  return null;
})
.filter(x => x !== null)
```

can't possibly return `null`s, because those are filtered out, so I rewrote the whole thing using `forEach`.
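A sketch of what the `forEach` rewrite can look like (`lintLine` is a hypothetical per-line checker returning an error message or `null`; the real audit's shape differs):

```javascript
// Illustrative map/filter → forEach rewrite: because items are pushed only
// inside the null check, a type checker can prove `errors` never contains
// null. The index is stringified since table details require string fields.
function collectErrors(content, lintLine) {
  /** @type {Array<{index: string, message: string}>} */
  const errors = [];

  content
    .split(/\r\n|\r|\n/)
    .forEach((line, index) => {
      const message = lintLine(line);
      if (message !== null) {
        errors.push({index: (index + 1).toString(), message});
      }
    });

  return errors;
}
```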
@patrickhulce ok, it looks like I figured it out 👍 I had to rewrite parts of
LGTM!! great work pushing this through type checking :D
off to @brendankenny in case he has opinions on first typechecked audit
```js
/**
 * @param {{RobotsTxt: {status: number, content: string}}} artifacts
 * @return {LH.Audit.Product}
```
heh somewhat strange to see this in real usage :)
Yep, great work on those types :) LGTM
Closes #4356
Validator spec, and list of resources I used to create it, can be found here.
I've run this validator on the top 1000 domains and it found issues on 39 of them: https://gist.github.com/kdzwinel/b791967eb66d0e2925ea22c8ca14233a
Example websites with errors in their robots.txt: netflix.com, chase.com
Example website with no robots.txt: office.com
Example websites with valid robots.txt: nike.com, brainly.com
Example failure:
![screen shot 2018-03-22 at 16 16 44](https://user-images.githubusercontent.com/985504/37783950-03649684-2df7-11e8-8694-f52a39b918ef.png)