Conversation
9c23d5f
to
d62178c
Compare
6dd9d5a
to
8c30f6a
Compare
8c30f6a
to
f2e63d2
Compare
# copied, or had their type changed (file, symlink, etc.). In | ||
# particular, we don't want to check deleted files. | ||
# $(git merge-base HEAD origin/master)..HEAD: Only consider files that have | ||
# changed since this branch diverged from master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comments here -- saved me from a lot of googling time!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parts that I grok look good to me. :D
@alvin-huang, can you take a look as well since you worked on the previous link checking logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this code again and everything looks good to me with the additions/linting we went over in Slack. Did not test the code but syntax looks good. I think moving away from open()
was a good move 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM great work! 🎉
This commit adds a Ruby script and a CircleCI workflow for checking links in *only the changed files* on a branch. The goal is to give a clear signal if you introduce new broken links in a PR, with a minimum of false positives or (real) problems that are unrelated to your changes. DETAILS AND CAVEATS: - We rely on Git to decide which files we're checking, and we calculate it differently in by repo: - In terraform-website, we compare against the most recent common ancestor of the PR branch and master. - In terraform core, not everything merges to master (due to long-lived branches like v0.14), so we need to figure out which branch you're probably PRing to first. - This only checks links in the *content area* of docs pages. It doesn't check nav sidebars or the top/bottom navs. - No special handling for the new Vercel routes, so links to those from content will be false positives. Probably revisiting this later, but should be OK for an MVP since most docs content doesn't link heavily to the marketing pages. - We don't implement redirects... but you should update those links to their new destinations anyway. - There are some easy optimization targets in the script that I didn't bother reaching for; caching, basically. Easy to add later, but right now the script time is WAY overshadowed by container pull and checkout time, so who cares. CONTEXT AND REASONING: We do a global, spidering link check for terraform.io whenever we deploy the site. But a global link check sucks for pull requests: - It catches links that have *nothing to do* with your PR, and which you, the PR author, might not have the power to do anything about. - Since it has to use a one-off build of the site to check changed pages that aren't live yet, we either get false positives due to not implementing prod behaviors (redirects, new routes for marketing content) or have to re-implement those behaviors in a different stack. - Does lots of pointless extra work. So our current PR link checking is useless, mostly due to the "someone else did that" issue and the false positives for top nav items and redirects -- it's always red and never actionable, so we just ignore it. After investigating a bunch of alternatives, it turned out that the easiest fix was to just write my own link checker script from semi-scratch in Ruby. I realize that sounds outrageous, but: - If you have a decent URL library and HTML parsing/scraping library, there's not much left to write. I got the prototype working in an afternoon. Re-implementing when we switch to Next shouldn't be much worse; if anything, the JS platform's tools should be better. - Adding Nokogiri to your dependencies is a legendary recipe for pain, but we already HAD it because of Middleman, so no extra overhead. - On the other hand, getting any OTHER language ecosystem into our existing Middleman container would require rearchitecting that container from the ground up. - It might have been possible to run the site build in one container and an off-the-shelf link checker in a second container, but networking two containers together in a CI job seems quite complicated. - It might have been possible to use Vercel or Netlify to do PR previews, ping their API to wait for the preview to come up, then run a link checker in a single container. This still might be a reasonable approach for the future, but required more platform investment (which was hard to justify for Middleman stuff). - None of the off-the-shelf link checkers I investigated (about five or six of them) met my requirements for PR checks (accept a list of pages to check, scrape their links but don't spider any further than one level, check for broken anchors as well as 404s, etc.), so they would have required some additional tooling anyway, probably at a similar level of effort/complexity as this new script.
GitHub PR checks show the *job* name, so that should be the more descriptive bit. Workflow name is only visible in Circle itself.
- Replace open-uri with Net::HTTP, because monkey-patching `Kernel#open` is extremely cursed and we're stuck on Ruby 2.3. - URI-escape filenames before constructing URLs with them, just in case. - Accept either null-separated or newline-separated file list, because Git commands are more predictable if you ask for nulls. - Add more comments. - Report the URL we attempted if we couldn't find a top-level input page.
…ommand - `-z` ensures consistent behavior for non-ASCII characters in filenames. Without that, behavior varies depending on the `core.quotePath` Git setting. - That git command is arguably the squirelliest bit of all this, so might as well explain which parts I considered important for the person who comes next.
e94ea7f
to
ebf927a
Compare
This adds a CI job for running the new PR link checker for documentation. [terraform-website PR 1574](hashicorp/terraform-website#1574) added a new link checking CI job specifically for warning about broken links in pull requests. This link checker is optimized for: - Running (relatively) quickly. - Only reporting on files that were changed in the current PR, to avoid spamming you with problems you had nothing to do with. - Being transparent and simple to maintain. (Note that this is in conflict with minimizing false positives/negatives! We try to give very few of both, but completely eliminating them would result in an unaffordable maintenance burden. We expect that some PRs will be merged with this job red.) The tool is somewhat specific to our Middleman site builder, and we expect it will be replaced or obviated in the transition to the Next.js platform... but in the meantime, it should help make documentation slightly easier to maintain.
Well, I think I've figured out how to make our pull request checks usable:
#anchor
, use Nokogiri to make sure the destination page actually includes that anchor.For reviewers
I need at least two reviewers:
If you only want to review one part, go ahead and treat the other as a black box.
I also need the CircleCI reviewer to take a look at the similar-but-more-complicated companion PR in the Terraform core repo:
Screenshot
The output in Circle ends up looking like this, if you have broken links:
The big old commit message
I already wrote all the explanations up for posterity, so I'll just paste it in here.
CI: Add more trustworthy link checking for pull requests (details below)
This commit adds a Ruby script and a CircleCI workflow for checking links in
only the changed files on a branch. The goal is to give a clear signal if you
introduce new broken links in a PR, with a minimum of false positives or (real)
problems that are unrelated to your changes.
DETAILS AND CAVEATS:
We rely on Git to decide which files we're checking, and we calculate it
differently in by repo:
In terraform-website, we compare against the most recent common ancestor
of the PR branch and master.
In terraform core, not everything merges to master (due to long-lived
branches like v0.14), so we need to figure out which branch you're
probably PRing to first.
This only checks links in the content area of docs pages. It doesn't check
nav sidebars or the top/bottom navs.
No special handling for the new Vercel routes, so links to those from content
will be false positives. Probably revisiting this later, but should be OK for
an MVP since most docs content doesn't link heavily to the marketing pages.
We don't implement redirects... but you should update those links to their
new destinations anyway.
There are some easy optimization targets in the script that I didn't bother
reaching for; caching, basically. Easy to add later, but right now the script
time is WAY overshadowed by container pull and checkout time, so who cares.
CONTEXT AND REASONING:
We do a global, spidering link check for terraform.io whenever we deploy the
site. But a global link check sucks for pull requests:
It catches links that have nothing to do with your PR, and which you, the PR
author, might not have the power to do anything about.
Since it has to use a one-off build of the site to check changed pages that
aren't live yet, we either get false positives due to not implementing prod
behaviors (redirects, new routes for marketing content) or have to
re-implement those behaviors in a different stack.
Does lots of pointless extra work.
So our current PR link checking is useless, mostly due to the "someone else did
that" issue and the false positives for top nav items and redirects -- it's
always red and never actionable, so we just ignore it.
After investigating a bunch of alternatives, it turned out that the easiest
fix was to just write my own link checker script from semi-scratch in Ruby. I
realize that sounds outrageous, but:
If you have a decent URL library and HTML parsing/scraping library, there's
not much left to write. I got the prototype working in an afternoon.
Re-implementing when we switch to Next shouldn't be much worse; if anything,
the JS platform's tools should be better.
Adding Nokogiri to your dependencies is a legendary recipe for pain, but we
already HAD it because of Middleman, so no extra overhead.
On the other hand, getting any OTHER language ecosystem into our existing
Middleman container would require rearchitecting that container from the
ground up.
It might have been possible to run the site build in one container and an
off-the-shelf link checker in a second container, but networking two
containers together in a CI job seems quite complicated.
It might have been possible to use Vercel or Netlify to do PR previews, ping
their API to wait for the preview to come up, then run a link checker in a
single container. This still might be a reasonable approach for the future,
but required more platform investment (which was hard to justify for Middleman
stuff).
None of the off-the-shelf link checkers I investigated (about five or six of
them) met my requirements for PR checks (accept a list of pages to check,
scrape their links but don't spider any further than one level, check for
broken anchors as well as 404s, etc.), so they would have required some
additional tooling anyway, probably at a similar level of effort/complexity as
this new script.