Improved PR link checking, finally #1574

nfagerlund · 2020-12-19T05:01:50Z

Well, I think I've figured out how to make our pull request checks usable:

Use Git to figure out which files were changed.
Spin up a preview server.
Translate the changed file paths to page URLs.
Grab each changed page and use Nokogiri to find all the links in the docs content area (ignoring other areas of the page like the top nav and sidebar nav).
Check every link and report HTTP errors.
For links that include an #anchor, use Nokogiri to make sure the destination page actually includes that anchor.

For reviewers

I need at least two reviewers:

A Ruby reviewer
A CircleCI (with funky Git + Bash stuff) reviewer

If you only want to review one part, go ahead and treat the other as a black box.

I also need the CircleCI reviewer to take a look at the similar-but-more-complicated companion PR in the Terraform core repo:

CI: Add broken link checking job for website content terraform#27404

Screenshot

The output in Circle ends up looking like this, if you have broken links:

The big old commit message

I already wrote all the explanations up for posterity, so I'll just paste it in here.

CI: Add more trustworthy link checking for pull requests (details below)

This commit adds a Ruby script and a CircleCI workflow for checking links in
only the changed files on a branch. The goal is to give a clear signal if you
introduce new broken links in a PR, with a minimum of false positives or (real)
problems that are unrelated to your changes.

DETAILS AND CAVEATS:

We rely on Git to decide which files we're checking, and we calculate it
differently in by repo:
- In terraform-website, we compare against the most recent common ancestor
  of the PR branch and master.
- In terraform core, not everything merges to master (due to long-lived
  branches like v0.14), so we need to figure out which branch you're
  probably PRing to first.
This only checks links in the content area of docs pages. It doesn't check
nav sidebars or the top/bottom navs.
No special handling for the new Vercel routes, so links to those from content
will be false positives. Probably revisiting this later, but should be OK for
an MVP since most docs content doesn't link heavily to the marketing pages.
We don't implement redirects... but you should update those links to their
new destinations anyway.
There are some easy optimization targets in the script that I didn't bother
reaching for; caching, basically. Easy to add later, but right now the script
time is WAY overshadowed by container pull and checkout time, so who cares.

CONTEXT AND REASONING:

We do a global, spidering link check for terraform.io whenever we deploy the
site. But a global link check sucks for pull requests:

It catches links that have nothing to do with your PR, and which you, the PR
author, might not have the power to do anything about.
Since it has to use a one-off build of the site to check changed pages that
aren't live yet, we either get false positives due to not implementing prod
behaviors (redirects, new routes for marketing content) or have to
re-implement those behaviors in a different stack.
Does lots of pointless extra work.

So our current PR link checking is useless, mostly due to the "someone else did
that" issue and the false positives for top nav items and redirects -- it's
always red and never actionable, so we just ignore it.

After investigating a bunch of alternatives, it turned out that the easiest
fix was to just write my own link checker script from semi-scratch in Ruby. I
realize that sounds outrageous, but:

If you have a decent URL library and HTML parsing/scraping library, there's
not much left to write. I got the prototype working in an afternoon.
Re-implementing when we switch to Next shouldn't be much worse; if anything,
the JS platform's tools should be better.
Adding Nokogiri to your dependencies is a legendary recipe for pain, but we
already HAD it because of Middleman, so no extra overhead.
On the other hand, getting any OTHER language ecosystem into our existing
Middleman container would require rearchitecting that container from the
ground up.
It might have been possible to run the site build in one container and an
off-the-shelf link checker in a second container, but networking two
containers together in a CI job seems quite complicated.
It might have been possible to use Vercel or Netlify to do PR previews, ping
their API to wait for the preview to come up, then run a link checker in a
single container. This still might be a reasonable approach for the future,
but required more platform investment (which was hard to justify for Middleman
stuff).
None of the off-the-shelf link checkers I investigated (about five or six of
them) met my requirements for PR checks (accept a list of pages to check,
scrape their links but don't spider any further than one level, check for
broken anchors as well as 404s, etc.), so they would have required some
additional tooling anyway, probably at a similar level of effort/complexity as
this new script.

mdeggies · 2021-01-07T01:32:31Z

.circleci/config.yml

+            #     copied, or had their type changed (file, symlink, etc.). In
+            #     particular, we don't want to check deleted files.
+            # $(git merge-base HEAD origin/master)..HEAD: Only consider files that have
+            #     changed since this branch diverged from master.


Thank you for the comments here -- saved me from a lot of googling time!

mdeggies

The parts that I grok look good to me. :D
@alvin-huang, can you take a look as well since you worked on the previous link checking logic?

acidprime

I went through this code again and everything looks good to me with the additions/linting we went over in Slack. Did not test the code but syntax looks good. I think moving away from open() was a good move 👍🏻

alvin-huang

LGTM great work! 🎉

This commit adds a Ruby script and a CircleCI workflow for checking links in *only the changed files* on a branch. The goal is to give a clear signal if you introduce new broken links in a PR, with a minimum of false positives or (real) problems that are unrelated to your changes. DETAILS AND CAVEATS: - We rely on Git to decide which files we're checking, and we calculate it differently in by repo: - In terraform-website, we compare against the most recent common ancestor of the PR branch and master. - In terraform core, not everything merges to master (due to long-lived branches like v0.14), so we need to figure out which branch you're probably PRing to first. - This only checks links in the *content area* of docs pages. It doesn't check nav sidebars or the top/bottom navs. - No special handling for the new Vercel routes, so links to those from content will be false positives. Probably revisiting this later, but should be OK for an MVP since most docs content doesn't link heavily to the marketing pages. - We don't implement redirects... but you should update those links to their new destinations anyway. - There are some easy optimization targets in the script that I didn't bother reaching for; caching, basically. Easy to add later, but right now the script time is WAY overshadowed by container pull and checkout time, so who cares. CONTEXT AND REASONING: We do a global, spidering link check for terraform.io whenever we deploy the site. But a global link check sucks for pull requests: - It catches links that have *nothing to do* with your PR, and which you, the PR author, might not have the power to do anything about. - Since it has to use a one-off build of the site to check changed pages that aren't live yet, we either get false positives due to not implementing prod behaviors (redirects, new routes for marketing content) or have to re-implement those behaviors in a different stack. - Does lots of pointless extra work. So our current PR link checking is useless, mostly due to the "someone else did that" issue and the false positives for top nav items and redirects -- it's always red and never actionable, so we just ignore it. After investigating a bunch of alternatives, it turned out that the easiest fix was to just write my own link checker script from semi-scratch in Ruby. I realize that sounds outrageous, but: - If you have a decent URL library and HTML parsing/scraping library, there's not much left to write. I got the prototype working in an afternoon. Re-implementing when we switch to Next shouldn't be much worse; if anything, the JS platform's tools should be better. - Adding Nokogiri to your dependencies is a legendary recipe for pain, but we already HAD it because of Middleman, so no extra overhead. - On the other hand, getting any OTHER language ecosystem into our existing Middleman container would require rearchitecting that container from the ground up. - It might have been possible to run the site build in one container and an off-the-shelf link checker in a second container, but networking two containers together in a CI job seems quite complicated. - It might have been possible to use Vercel or Netlify to do PR previews, ping their API to wait for the preview to come up, then run a link checker in a single container. This still might be a reasonable approach for the future, but required more platform investment (which was hard to justify for Middleman stuff). - None of the off-the-shelf link checkers I investigated (about five or six of them) met my requirements for PR checks (accept a list of pages to check, scrape their links but don't spider any further than one level, check for broken anchors as well as 404s, etc.), so they would have required some additional tooling anyway, probably at a similar level of effort/complexity as this new script.

GitHub PR checks show the *job* name, so that should be the more descriptive bit. Workflow name is only visible in Circle itself.

- Replace open-uri with Net::HTTP, because monkey-patching `Kernel#open` is extremely cursed and we're stuck on Ruby 2.3. - URI-escape filenames before constructing URLs with them, just in case. - Accept either null-separated or newline-separated file list, because Git commands are more predictable if you ask for nulls. - Add more comments. - Report the URL we attempted if we couldn't find a top-level input page.

…ommand - `-z` ensures consistent behavior for non-ASCII characters in filenames. Without that, behavior varies depending on the `core.quotePath` Git setting. - That git command is arguably the squirelliest bit of all this, so might as well explain which parts I considered important for the person who comes next.

This adds a CI job for running the new PR link checker for documentation. [terraform-website PR 1574](hashicorp/terraform-website#1574) added a new link checking CI job specifically for warning about broken links in pull requests. This link checker is optimized for: - Running (relatively) quickly. - Only reporting on files that were changed in the current PR, to avoid spamming you with problems you had nothing to do with. - Being transparent and simple to maintain. (Note that this is in conflict with minimizing false positives/negatives! We try to give very few of both, but completely eliminating them would result in an unaffordable maintenance burden. We expect that some PRs will be merged with this job red.) The tool is somewhat specific to our Middleman site builder, and we expect it will be replaced or obviated in the transition to the Next.js platform... but in the meantime, it should help make documentation slightly easier to maintain.

nfagerlund added the enhancement New feature (not docs) label Dec 19, 2020

nfagerlund force-pushed the dec20_pr_linkchecks branch 2 times, most recently from 9c23d5f to d62178c Compare December 19, 2020 05:29

nfagerlund force-pushed the dec20_pr_linkchecks branch 2 times, most recently from 6dd9d5a to 8c30f6a Compare January 5, 2021 02:42

nfagerlund changed the title ~~Draft: PR link checking~~ Improved PR link checking, finally Jan 5, 2021

nfagerlund marked this pull request as ready for review January 5, 2021 02:59

nfagerlund force-pushed the dec20_pr_linkchecks branch from 8c30f6a to f2e63d2 Compare January 5, 2021 17:40

mdeggies reviewed Jan 7, 2021

View reviewed changes

mdeggies approved these changes Jan 7, 2021

View reviewed changes

acidprime approved these changes Jan 7, 2021

View reviewed changes

alvin-huang approved these changes Jan 13, 2021

View reviewed changes

nfagerlund added 6 commits January 13, 2021 13:42

Skip anchor check for links we already know are broken

eace78a

Change job/workflow names for better display

7a8d514

GitHub PR checks show the *job* name, so that should be the more descriptive bit. Workflow name is only visible in Circle itself.

Add note about Vercel routes to link check output

ebf927a

nfagerlund force-pushed the dec20_pr_linkchecks branch from e94ea7f to ebf927a Compare January 13, 2021 21:48

nfagerlund merged commit 580e7fb into master Jan 13, 2021

nfagerlund deleted the dec20_pr_linkchecks branch January 13, 2021 22:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved PR link checking, finally #1574

Improved PR link checking, finally #1574

nfagerlund commented Dec 19, 2020 •

edited

mdeggies Jan 7, 2021

mdeggies left a comment

acidprime left a comment

alvin-huang left a comment

Improved PR link checking, finally #1574

Improved PR link checking, finally #1574

Conversation

nfagerlund commented Dec 19, 2020 • edited

For reviewers

Screenshot

The big old commit message

mdeggies Jan 7, 2021

Choose a reason for hiding this comment

mdeggies left a comment

Choose a reason for hiding this comment

acidprime left a comment

Choose a reason for hiding this comment

alvin-huang left a comment

Choose a reason for hiding this comment

nfagerlund commented Dec 19, 2020 •

edited