Skip to content
This repository has been archived by the owner on Sep 7, 2021. It is now read-only.

Lint pages against recipes #320

Merged
merged 22 commits into from
Mar 2, 2020
Merged

Lint pages against recipes #320

merged 22 commits into from
Mar 2, 2020

Conversation

ddbeck
Copy link
Contributor

@ddbeck ddbeck commented Feb 26, 2020

This PR lints scraped pages against its recognized recipe, if it has one.

This ought to satisfy the user story in mdn/sprints#2784. Running this gets you a nice report for each page. You'll learn whether its missing any expected sections or cannot be linted (due to a lack of identifiable recipe). The summary for a tree of pages gives you an overview of how many pages need to be tagged for recipes and how many pages fail to meet the standards set by their recipes. Once this is merged, I can quickly create a spreadsheet for the work needed to complete mdn/sprints#2785.

At least for now, ingredient matching is pretty minimal. For most ingredients, it's a check for a specific <h2>. For a couple of ingredients (specification and compat data), this PR demonstrates a more in-depth check: a specific heading paired with a specific macro call. These checks also demonstrate using the hast-util-select package to use a CSS-like selector to find content.

There was at least one kind of ingredient I didn't handle that appeared in JS pages: info boxes. I couldn't find a good example to test against. I'd appreciate pointers about this.

To make this work sensibly, there were a few seemingly off-topic changes I ended up making:

  • We can now handle throwing VMessages. This way I can signal that a file can be processed no further. In this PR, I use this to handle the case of a file not having a recipe to lint against: there's no meaningful linting that can be done from that point, so we bail out by throwing.

    (Throwing a VMessage is what VFile.fail() does ordinarily, though its API doesn't support exactly how I use VMessages right now. Something I can deal with in a follow-up PR.)

  • I removed html-require-macros.js. Now that we can require, for example, compat macros in context, we don't need the generic rule.

There are a few changes I would like to make based on what I learned writing this, but I refrained from putting them into this PR. Some possible follow-up tasks include:

  • Some sort of consistent style for naming warnings and errors
  • Defining a real KS macro hast node type, rather than using text literal nodes with data attributes. These would be easier to match against (no need for isMacro()) and I suspect easier to transform into Markdown (when we get there).

At this point, I'm interested in feedback on pretty much anything. This took longer than I expected and I think I'm too close to it now to see what I might've done wrong.

This change will halt the processing of a file if it doesn't have one
and only one valid recipe assigned to it. We can't do recipe-based
linting for such files, so we should we stop processing at that point.
@wbamberg
Copy link

wbamberg commented Feb 28, 2020

There was at least one kind of ingredient I didn't handle that appeared in JS pages: info boxes. I couldn't find a good example to test against. I'd appreciate pointers about this.

info boxes (at least in JS property pages) are supposed to be things like the table immediately after the interactive example on https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/MAX_VALUE.

info boxes are not specific to JS properties though, we currently have something like them for HTML elements and CSS properties and probably some other things too. There's some background on the concept here: https://docs.google.com/document/d/1fwbTs6uoeAfGjjZef3cOOzG1Bx-S8sbLbCBjzx9gBPY/edit#heading=h.98r3vvtunbo and here: #106 (comment).

In general, they're a collection of name/value facts about a type of thing, suitable for a tabular representation, where each recipe could define the set of facts (so JS properties and one, CSS properties have a different set...). So in this case full testing would be:

  • there is a table in the expected place in the page (wherever that is)
  • it contains the 3 rows "Writable", "Enumerable", "Configurable"
  • each row has one of the acceptable values ("yes" or "no" I guess)

But we could also ask:

  • where in the page info boxes should be expected
  • whether they should be mandatory (I don't know how many JS properties are missing them)
  • whether we should even bother structuring these things at all, or just slurp them up as prose content
  • whether we think they are even particularly useful

I think perhaps we should just slurp them for now, and remove them from the recipes...

),
"prose.see_also": requireTopLevelHeading("prose.see_also", "See_also"),
"prose.short_description": (tree, file) => {
if (select("body > p", tree) === null) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too liberal: a page like https://wiki.developer.mozilla.org/en-US/docs/User:wbamberg/test-linter2 passes. I think testing this is tricky: you more or less want the first element of the raw document body to be <p>, except that sidebars will usually precede this, and pages very occasionally (and wrongly) include the interactive example macro in <p> instead of <div> (luckily this is very apparent to humans because it means the example wraps at text-width, instead of getting the full page width).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushing something that hopefully does a little better at this

@wbamberg
Copy link

@ddbeck, thanks so much for this, it seems like a big step forward for us!

I haven't looked at the code much yet, but had some comments on the functional behaviour and output. I will probably have some more comments and wish I'd had more time to think more about the comments I have so far, but I thought it might be helpful to have comments sooner even if they are half-baked.


The output when there are errors is sometimes a bit non-obvious. For example:

/en-US/docs/Web/JavaScript/Reference/Global_Objects/WeakSet/clear
      1:1  error    Page doesn't match recipe javascript-method       html-must-match-recipe:javascript-method
      1:1  warning  Missing ingredient: data.specifications           html-require-recipe-ingredient:data.specifications
      1:1  warning  Missing ingredient: prose.description             html-require-recipe-ingredient:prose.description
      1:6  warning  Macro: JSRef                                      html-warn-on-macros:jsref
     1:16  warning  Macro: obsolete_header                            html-warn-on-macros:obsolete_header
     36:6  warning  Macro: jsxref                                     html-warn-on-macros:jsxref
     37:6  warning  Macro: jsxref                                     html-warn-on-macros:jsxref

This page had an error "Page doesn't match recipe javascript-method". But what was the actual problem with the recipe? Because I know, I know that it's the next two lines "Missing ingredient:". But those lines are listed as warnings, not errors.

It might be better to report missing ingredient (or "Missing mandatory ingredient") as an error. Even better perhaps to somehow report that "Page doesn't match recipe javascript-method" isn't a separate error from "Missing mandatory ingredient" - it's essentially the same error.


Sometimes the linter could provide more precise error information. For example, it took me a little time to understand why https://wiki.developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Locale/caseFirst was giving me:

/en-US/docs/Web/JavaScript/Reference/Global_Objects/Locale/caseFirst
      1:1  error    Page doesn't match recipe javascript-property     html-must-match-recipe:javascript-property
      1:1  warning  No handler for ingredient data.info_box
      1:1  warning  Missing ingredient: data.specifications           html-require-recipe-ingredient:data.specifications
      1:6  warning  Macro: JSRef                                      html-warn-on-macros:jsref
    7:226  warning  Macro: jsxref                                     html-warn-on-macros:jsxref
     78:6  warning  Macro: jsxref                                     html-warn-on-macros:jsxref

...because it has a "Specifications" H2 with a valid-looking table. Actually it's because the section is missing "specname", which the linter knows but doesn't report.


As well as data.info_box, this doesn't understand data.class_constructor. At the moment the linter communicates missing handlers as content problems:

/en-US/docs/Web/JavaScript/Reference/Global_Objects/WebAssembly/Global
      1:1  error    Page doesn't match recipe javascript-class        html-must-match-recipe:javascript-class
      1:1  warning  No handler for ingredient data.class_constructor
      1:6  warning  Macro: JSRef                                      html-warn-on-macros:jsref
    3:178  warning  Macro: jsxref                                     html-warn-on-macros:jsxref
    20:87  warning  Macro: jsxref                                     html-warn-on-macros:jsxref
     77:8  warning  Macro: SpecName                                   html-warn-on-macros:specname

But these aren't content problems, they're pieces of the linter that aren't implemented yet. The linter could actually know this as soon as it looks at the recipe, without looking at the page content at all. So it could report: "I can't check page XYZ, because it reports itself to be javascript-class" and I don't know how to lint with that recipe yet".


It seems cruel to warn for including {{specname}} but also warn for not including it :). Now we have an idea which macros we need to "replace", perhaps we should exclude them all from the linter. Out of scope for this PR though I suppose.

@Elchi3
Copy link
Member

Elchi3 commented Feb 28, 2020

Overall, I really like this PR and the mechanics it introduces! Great work! I've used this already to lint some trees and I'm amazed by the output. I think what Will has to say about the error reporting is true:

This page had an error "Page doesn't match recipe javascript-method". But what was the actual problem with the recipe? Because I know, I know that it's the next two lines "Missing ingredient:". But those lines are listed as warnings, not errors.

This was confusing to me at first as well. To resolve "Page doesn't match recipe javascript-method", I need to resolve the subsequent warnings which can be "missing handler", or "missing ingredient" or something else and only then the error "Page doesn't match recipe javascript-method" disappears.

I've toyed with adding some trivial handlers for the javascript-error ingredients, and it works well. I think we need to deep dive into the individual handlers and make them good, but I think that can happen as follow-ups where we allow ourselves more time and thinking for each ingredient and its handler. Also, this will also come up when fixing the pages anyway, because either the pages need to be fixed for real or the handlers will need to become smarter about things. So, I think, let's not block on that for the initial landing here.

if (ingredient in ingredientHandlers) {
ok = ingredientHandlers[ingredient](tree, file) && ok;
} else {
file.message(`No handler for ingredient ${ingredient}`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has no message.ruleId so it displays null.
Should we add something or are we fine with null in the report?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'm fixing this specifically, but this brought to my attention more issues with ruleIds so I'm going to mention this in a much longer comment, forthcoming.

@ddbeck
Copy link
Contributor Author

ddbeck commented Feb 28, 2020

Thank you for your helpful reviews, @wbamberg and @Elchi3. I've pushed a bunch of changes and I'm re-requesting a review. Some details on the changes I made, and some specific notes and responses after that.

Latest changes

Looking at the commit history may help, but here's what I've done by importance:

  • You'll notice that the structure of this part of the linter has changed. I moved ingredient handlers to a separate file. Since the plugin machinery is now rather distinct from the way handlers work, I thought it was easier to follow this way.
  • Missing ingredients are now reported as errors (not warnings) and the recipe mismatch error is no more. I originally reported a mismatch error so that it was easier to see how many pages were failing, but I guess this was just confusing. Consequently, warnMissingIngredient() is now logMissingIngredient() and error strings and IDs have changed for readability.
  • Ingredients without a handler warn instead of fail.
  • There were a lot of copy-and-pasted strings around, which was getting tiresome to look at. I changed the way that handlers are called now, so they get a third argument: context. This way I can pass around ingredient and recipe names a bit more ergonomically and avoid the copy and paste.
  • More specific errors are generated for missing SpecName and Compat macros. More discussion on this below.
  • More ingredients are handled now, though a few unhandled ingredients persist (see below).
  • For readability, the recipe linting now uses the macro name normalization module instead of lowercase strings.
  • I added macros that are linted by ingredients to the allowed list.
  • I merged master to get those clickable links (thanks Florian)!

Notes

SpecName and Compat errors

There are now specific errors for pages that have an expected section but lack a macro call (or expected string) for that section. This might be like my former recipe mismatch error: there's an error for a missing ingredient and there's an error for the specific way in which it doesn't have the ingredient. I'm not sure if this is a good idea and I'd like you to try it and give me some feedback for it. 😃

Unhandled ingredients

A few ingredients, including info_box, are not yet linted. I think we should deal with the implementation in separate PRs. data.info_box, data.class_constructor, and landing pages have content considerations that I think go beyond the scope of this PR. For example, data.info_box is quite different depending on whether you're talking about CSS or JS pages. It's not clear to me that these are, in fact, the same kind of ingredient.

In the mean time, you get a warning when an ingredient can't be linted. I experimented with failing upon encountering such items, based on Will's suggestion:

"I can't check page XYZ, because it reports itself to be javascript-class" and I don't know how to lint with that recipe yet"

I didn't like this very much. It meant that other lintable ingredients weren't linted. I could add some logic to skip over unlintable ingredients and then fail, but that seemed complicated when we we'll need to figure out how to deal with those ingredients anyway.

"Replace" category macros

Will said:

Now we have an idea which macros we need to "replace", perhaps we should exclude them all from the linter. Out of scope for this PR though I suppose.

This is probably right, but I'd like to have some sort of code to handle such macros before we allow them. The strictest way to do this would be to not allow a macro call until we can scrape them and turn them into Markdown. Compat is a sort of a half-way case: we now have code that acknowledges the requirement for it in Specifications sections, but we don't handle arbitrary Compat macros in the body of, for example, guide pages. Or as Will put it:

It seems cruel to warn for including {{specname}} but also warn for not including it :)

Which is why I've turned off warnings for the SpecName and Compat macros. 😄

In the mean time, you can comment out all macro warnings from preset.js when you're working on something like this. That's what I do. 🤷‍♂

Follow ups needed

Florian mentioned the null rule ID. This specific case is fixed, though I think many of the rule IDs are a mess right now. Work on this PR has clued me into how VFile represents message types. In a separate PR, I need to clean up ruleIds to use namespaces properly. This should enable some better filtering of results, too.

Along similar lines, we really need a JSON output. Linting large trees is tough to do in a console. It'd be really great to generate a JSON report and use that to slice the output. We could even generate a web dashboard to share our progress.

// Drop any part of the subtree that's after the first H2
const sectionStart = 0;
const sectionEnd = subTree.children.indexOf(select("h2", subTree));
subTree.children = subTree.children.slice(sectionStart, sectionEnd);
Copy link

@wbamberg wbamberg Feb 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is buggy because if there are no H2 elements, sectionEnd is -1 so slice() removes the last element ( and maybe the same at https://github.com/mdn/stumptown-content/pull/320/files#diff-ea5f15cea1cdb44e2b23f4437776a67eR95 ?).

Finding short_description is tricky and will probably take some tuning. For JS pages, it's probably something like: "there is a <p> element before the interactive example or an H2". We could also exclude <p class="note" and <p class="warning".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried tuning the short_description handler some more and explaining my assumptions in a comment. There might be more ways in which it could be adjusted (is text content in a <div> OK? right now it's not), but it's pretty complicated as it is. In the interest of preventing this PR from growing larger, if you have any non-blocking feedback on the new short_description handler, please mention it, but I'll address them in a separate PR.

Copy link

@wbamberg wbamberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this quick update @ddbeck , I think this is really great.

I think there are a couple of occurrences of the same bug in the doc-slicing code, and that and some repetition make me think it might be worth factoring out some basic doc-slicing functions.

  • You'll notice that the structure of this part of the linter has changed. I moved ingredient handlers to a separate file. Since the plugin machinery is now rather distinct from the way handlers work, I thought it was easier to follow this way.

Definitely. I was going to mention that. I expect the handlers are going to multiply and potentially get more complex, so might end up wanting to live in their own files too.

It's much easier to read like this.

Notes

SpecName and Compat errors

There are now specific errors for pages that have an expected section but lack a macro call (or expected string) for that section. This might be like my former recipe mismatch error: there's an error for a missing ingredient and there's an error for the specific way in which it doesn't have the ingredient. I'm not sure if this is a good idea and I'd like you to try it and give me some feedback for it. 😃

I like this. I think the main things I didn't like before were warnings that were errors in disguise, and no linkage indicated between linked errors, and these are both addressed here:

https://wiki.developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy/handler/enumerate
           1:1  error    Missing from javascript-method: data.specifications                                    javascript-method/data.specifications                            html-require-ingredient
           1:6  warning  Macro: JSRef                                                                           html-warn-on-macros:jsref
          1:16  warning  Macro: obsolete_header                                                                 html-warn-on-macros:obsolete_header
          3:87  warning  Macro: jsxref                                                                          html-warn-on-macros:jsxref
         28:77  warning  Macro: jsxref                                                                          html-warn-on-macros:jsxref
          36:6  warning  Macro: jsxref                                                                          html-warn-on-macros:jsxref
         41:69  warning  Macro: jsxref                                                                          html-warn-on-macros:jsxref
         49:29  warning  Macro: jsxref                                                                          html-warn-on-macros:jsxref
    77:1-77:44  error    Expected SpecName macro for javascript-method: data.specifications                     javascript-method/data.specifications/expected-macro             html-require-ingredient
          90:6  warning  Macro: jsxref                                                                          html-warn-on-macros:jsxref
          91:6  warning  Macro: jsxref                                                                          html-warn-on-macros:jsxref
          92:6  warning  Macro: jsxref                                                                          html-warn-on-macros:jsxref
          93:6  warning  Macro: jsxref                                                                          html-warn-on-macros:jsxref

Unhandled ingredients

A few ingredients, including info_box, are not yet linted. I think we should deal with the implementation in separate PRs. data.info_box, data.class_constructor, and landing pages have content considerations that I think go beyond the scope of this PR.

Yeah, I agree. As usual, it's only once the PR is done that I see the scope we should have spelled out in the user story :). There is obviously more work to do here like:

  • missing mandatory ingredients
  • checking ingredients at a deeper level
  • checking optional ingredients
  • checking the order of ingredients

But it definitely makes sense to land this and do those as follow-ups.

For example, data.info_box is quite different depending on whether you're talking about CSS or JS pages. It's not clear to me that these are, in fact, the same kind of ingredient.

In the mean time, you get a warning when an ingredient can't be linted. I experimented with failing upon encountering such items, based on Will's suggestion:

"I can't check page XYZ, because it reports itself to be javascript-class" and I don't know how to lint with that recipe yet"

I didn't like this very much. It meant that other lintable ingredients weren't linted. I could add some logic to skip over unlintable ingredients and then fail, but that seemed complicated when we we'll need to figure out how to deal with those ingredients anyway.

I'm fine with what's here. The main thing is that it's not reported as the content's fault.

"Replace" category macros

Will said:

Now we have an idea which macros we need to "replace", perhaps we should exclude them all from the linter. Out of scope for this PR though I suppose.

This is probably right, but I'd like to have some sort of code to handle such macros before we allow them. The strictest way to do this would be to not allow a macro call until we can scrape them and turn them into Markdown.

OK. I think this makes the output a bit misleading because it's not a content problem to have Compat or EmbedInteractiveExample or JSRef even today. But it doesn't matter either way.

Follow ups needed

Along similar lines, we really need a JSON output. Linting large trees is tough to do in a console. It'd be really great to generate a JSON report and use that to slice the output. We could even generate a web dashboard to share our progress.

Yes, JSON output would be really nice. Also a follow-up of course :).

file.message(`Linting ${ingredient} ingredient is unimplemented`, origin);
},
"data.browser_compatibility": (tree, file, context) => {
const id = "Browser_compatibility";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

35-59 here is basically identical to 76-96 - the only difference is because of a bug in the second version, as mentioned at https://github.com/mdn/stumptown-content/pull/320/files#r385966999.

This seems like a common thing to want to do: "get the piece of the doc between this H2 and the next one". Factoring it out would help ensure that it's really solid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good call on this. I've pushed a change that factors out slicing sections (and another change that further generalizes the idea, to find the subtree between a node and the first node that meets some condition).

@ddbeck
Copy link
Contributor Author

ddbeck commented Mar 2, 2020

I've pushed a couple more changes, mostly around slicing up sections.

@wbamberg

I think there are a couple of occurrences of the same bug in the doc-slicing code, and that and some repetition make me think it might be worth factoring out some basic doc-slicing functions.

Yes, I've done this. It encouraged me to do something a little more general helps for more than slicing up sections (take a peek at the short_description ingredient for this).

I expect the handlers are going to multiply and potentially get more complex, so might end up wanting to live in their own files too.

I'm feeling this already. I didn't actually do it in this PR because I thought the reorganization and renames would be hard to follow. I'll do this in a (hopefully very easy to review) follow up PR.

I've got a lot of follow ups to do. 😅

@ddbeck ddbeck requested a review from wbamberg March 2, 2020 16:19
Copy link
Member

@Elchi3 Elchi3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this again and I'm happy with the changes made. I think a lot of things can be follow-ups. If you file them as issues, we can schedule them and also assign me or Will and have you review it, so that we learn the code some more in-depth as well. Really excellent work, thank you! 🚀

Copy link

@wbamberg wbamberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great. Thanks @ddbeck !

@ddbeck
Copy link
Contributor Author

ddbeck commented Mar 3, 2020

Thanks again, @wbamberg and @Elchi3 with your help on this PR. I opened issues for each of the things I mentioned as possible follow ups. Some of them are pretty straight forward (e.g., reorganizing handlers into modules), but a few are kinda gnarly (writing new handlers or linting ingredient order). Feel free to edit the descriptions to fill them out a bit more.

@wbamberg
Copy link

wbamberg commented Mar 3, 2020

I opened issues for each of the things I mentioned as possible follow ups

Thank you for this Daniel! My vote as the highest priority thing here is JSON output.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants