Skip to content
This repository has been archived by the owner on Jul 5, 2023. It is now read-only.

Improved PR link checking, finally #1574

Merged
merged 6 commits into from Jan 13, 2021
Merged

Improved PR link checking, finally #1574

merged 6 commits into from Jan 13, 2021

Conversation

nfagerlund
Copy link
Collaborator

@nfagerlund nfagerlund commented Dec 19, 2020

Well, I think I've figured out how to make our pull request checks usable:

  • Use Git to figure out which files were changed.
  • Spin up a preview server.
  • Translate the changed file paths to page URLs.
  • Grab each changed page and use Nokogiri to find all the links in the docs content area (ignoring other areas of the page like the top nav and sidebar nav).
  • Check every link and report HTTP errors.
  • For links that include an #anchor, use Nokogiri to make sure the destination page actually includes that anchor.

For reviewers

I need at least two reviewers:

  • A Ruby reviewer
  • A CircleCI (with funky Git + Bash stuff) reviewer

If you only want to review one part, go ahead and treat the other as a black box.

I also need the CircleCI reviewer to take a look at the similar-but-more-complicated companion PR in the Terraform core repo:

Screenshot

The output in Circle ends up looking like this, if you have broken links:

Screenshot_2021-01-04 Step - website-test (1942) - hashicorp terraform-website

The big old commit message

I already wrote all the explanations up for posterity, so I'll just paste it in here.

CI: Add more trustworthy link checking for pull requests (details below)

This commit adds a Ruby script and a CircleCI workflow for checking links in
only the changed files on a branch. The goal is to give a clear signal if you
introduce new broken links in a PR, with a minimum of false positives or (real)
problems that are unrelated to your changes.

DETAILS AND CAVEATS:

  • We rely on Git to decide which files we're checking, and we calculate it
    differently in by repo:

    • In terraform-website, we compare against the most recent common ancestor
      of the PR branch and master.

    • In terraform core, not everything merges to master (due to long-lived
      branches like v0.14), so we need to figure out which branch you're
      probably PRing to first.

  • This only checks links in the content area of docs pages. It doesn't check
    nav sidebars or the top/bottom navs.

  • No special handling for the new Vercel routes, so links to those from content
    will be false positives. Probably revisiting this later, but should be OK for
    an MVP since most docs content doesn't link heavily to the marketing pages.

  • We don't implement redirects... but you should update those links to their
    new destinations anyway.

  • There are some easy optimization targets in the script that I didn't bother
    reaching for; caching, basically. Easy to add later, but right now the script
    time is WAY overshadowed by container pull and checkout time, so who cares.

CONTEXT AND REASONING:

We do a global, spidering link check for terraform.io whenever we deploy the
site. But a global link check sucks for pull requests:

  • It catches links that have nothing to do with your PR, and which you, the PR
    author, might not have the power to do anything about.

  • Since it has to use a one-off build of the site to check changed pages that
    aren't live yet, we either get false positives due to not implementing prod
    behaviors (redirects, new routes for marketing content) or have to
    re-implement those behaviors in a different stack.

  • Does lots of pointless extra work.

So our current PR link checking is useless, mostly due to the "someone else did
that" issue and the false positives for top nav items and redirects -- it's
always red and never actionable, so we just ignore it.

After investigating a bunch of alternatives, it turned out that the easiest
fix was to just write my own link checker script from semi-scratch in Ruby. I
realize that sounds outrageous, but:

  • If you have a decent URL library and HTML parsing/scraping library, there's
    not much left to write. I got the prototype working in an afternoon.
    Re-implementing when we switch to Next shouldn't be much worse; if anything,
    the JS platform's tools should be better.

  • Adding Nokogiri to your dependencies is a legendary recipe for pain, but we
    already HAD it because of Middleman, so no extra overhead.

  • On the other hand, getting any OTHER language ecosystem into our existing
    Middleman container would require rearchitecting that container from the
    ground up.

  • It might have been possible to run the site build in one container and an
    off-the-shelf link checker in a second container, but networking two
    containers together in a CI job seems quite complicated.

  • It might have been possible to use Vercel or Netlify to do PR previews, ping
    their API to wait for the preview to come up, then run a link checker in a
    single container. This still might be a reasonable approach for the future,
    but required more platform investment (which was hard to justify for Middleman
    stuff).

  • None of the off-the-shelf link checkers I investigated (about five or six of
    them) met my requirements for PR checks (accept a list of pages to check,
    scrape their links but don't spider any further than one level, check for
    broken anchors as well as 404s, etc.), so they would have required some
    additional tooling anyway, probably at a similar level of effort/complexity as
    this new script.

@nfagerlund nfagerlund added the enhancement New feature (not docs) label Dec 19, 2020
@nfagerlund nfagerlund force-pushed the dec20_pr_linkchecks branch 2 times, most recently from 9c23d5f to d62178c Compare December 19, 2020 05:29
@nfagerlund nfagerlund force-pushed the dec20_pr_linkchecks branch 2 times, most recently from 6dd9d5a to 8c30f6a Compare January 5, 2021 02:42
@nfagerlund nfagerlund changed the title Draft: PR link checking Improved PR link checking, finally Jan 5, 2021
@nfagerlund nfagerlund marked this pull request as ready for review January 5, 2021 02:59
# copied, or had their type changed (file, symlink, etc.). In
# particular, we don't want to check deleted files.
# $(git merge-base HEAD origin/master)..HEAD: Only consider files that have
# changed since this branch diverged from master.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comments here -- saved me from a lot of googling time!

Copy link
Member

@mdeggies mdeggies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parts that I grok look good to me. :D
@alvin-huang, can you take a look as well since you worked on the previous link checking logic?

Copy link
Contributor

@acidprime acidprime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this code again and everything looks good to me with the additions/linting we went over in Slack. Did not test the code but syntax looks good. I think moving away from open() was a good move 👍🏻

Copy link
Contributor

@alvin-huang alvin-huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM great work! 🎉

This commit adds a Ruby script and a CircleCI workflow for checking links in
*only the changed files* on a branch. The goal is to give a clear signal if you
introduce new broken links in a PR, with a minimum of false positives or (real)
problems that are unrelated to your changes.

DETAILS AND CAVEATS:

- We rely on Git to decide which files we're checking, and we calculate it
  differently in by repo:

    - In terraform-website, we compare against the most recent common ancestor
      of the PR branch and master.

    - In terraform core, not everything merges to master (due to long-lived
      branches like v0.14), so we need to figure out which branch you're
      probably PRing to first.

- This only checks links in the *content area* of docs pages. It doesn't check
  nav sidebars or the top/bottom navs.

- No special handling for the new Vercel routes, so links to those from content
  will be false positives. Probably revisiting this later, but should be OK for
  an MVP since most docs content doesn't link heavily to the marketing pages.

- We don't implement redirects... but you should update those links to their
  new destinations anyway.

- There are some easy optimization targets in the script that I didn't bother
  reaching for; caching, basically. Easy to add later, but right now the script
  time is WAY overshadowed by container pull and checkout time, so who cares.

CONTEXT AND REASONING:

We do a global, spidering link check for terraform.io whenever we deploy the
site. But a global link check sucks for pull requests:

- It catches links that have *nothing to do* with your PR, and which you, the PR
  author, might not have the power to do anything about.

- Since it has to use a one-off build of the site to check changed pages that
  aren't live yet, we either get false positives due to not implementing prod
  behaviors (redirects, new routes for marketing content) or have to
  re-implement those behaviors in a different stack.

- Does lots of pointless extra work.

So our current PR link checking is useless, mostly due to the "someone else did
that" issue and the false positives for top nav items and redirects -- it's
always red and never actionable, so we just ignore it.

After investigating a bunch of alternatives, it turned out that the easiest
fix was to just write my own link checker script from semi-scratch in Ruby. I
realize that sounds outrageous, but:

- If you have a decent URL library and HTML parsing/scraping library, there's
  not much left to write. I got the prototype working in an afternoon.
  Re-implementing when we switch to Next shouldn't be much worse; if anything,
  the JS platform's tools should be better.

- Adding Nokogiri to your dependencies is a legendary recipe for pain, but we
  already HAD it because of Middleman, so no extra overhead.

- On the other hand, getting any OTHER language ecosystem into our existing
  Middleman container would require rearchitecting that container from the
  ground up.

- It might have been possible to run the site build in one container and an
  off-the-shelf link checker in a second container, but networking two
  containers together in a CI job seems quite complicated.

- It might have been possible to use Vercel or Netlify to do PR previews, ping
  their API to wait for the preview to come up, then run a link checker in a
  single container. This still might be a reasonable approach for the future,
  but required more platform investment (which was hard to justify for Middleman
  stuff).

- None of the off-the-shelf link checkers I investigated (about five or six of
  them) met my requirements for PR checks (accept a list of pages to check,
  scrape their links but don't spider any further than one level, check for
  broken anchors as well as 404s, etc.), so they would have required some
  additional tooling anyway, probably at a similar level of effort/complexity as
  this new script.
GitHub PR checks show the *job* name, so that should be the more descriptive bit.
Workflow name is only visible in Circle itself.
- Replace open-uri with Net::HTTP, because monkey-patching `Kernel#open` is
  extremely cursed and we're stuck on Ruby 2.3.
- URI-escape filenames before constructing URLs with them, just in case.
- Accept either null-separated or newline-separated file list, because Git
  commands are more predictable if you ask for nulls.
- Add more comments.
- Report the URL we attempted if we couldn't find a top-level input page.
…ommand

- `-z` ensures consistent behavior for non-ASCII characters in filenames.
  Without that, behavior varies depending on the `core.quotePath` Git setting.
- That git command is arguably the squirelliest bit of all this, so might as
  well explain which parts I considered important for the person who comes next.
@nfagerlund nfagerlund merged commit 580e7fb into master Jan 13, 2021
@nfagerlund nfagerlund deleted the dec20_pr_linkchecks branch January 13, 2021 22:00
nfagerlund added a commit to hashicorp/terraform that referenced this pull request Jan 13, 2021
This adds a CI job for running the new PR link checker for documentation.

[terraform-website PR 1574](hashicorp/terraform-website#1574)
added a new link checking CI job specifically for warning about broken links in
pull requests. This link checker is optimized for:

- Running (relatively) quickly.
- Only reporting on files that were changed in the current PR, to avoid spamming
  you with problems you had nothing to do with.
- Being transparent and simple to maintain. (Note that this is in conflict with
  minimizing false positives/negatives! We try to give very few of both, but
  completely eliminating them would result in an unaffordable maintenance
  burden. We expect that some PRs will be merged with this job red.)

The tool is somewhat specific to our Middleman site builder, and we expect it
will be replaced or obviated in the transition to the Next.js platform... but in
the meantime, it should help make documentation slightly easier to maintain.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature (not docs)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants