Skip to content

CI Token Best Practices Sweep#672

Merged
swahtz merged 15 commits into
mainfrom
ci_improvements
Jun 30, 2026
Merged

CI Token Best Practices Sweep#672
swahtz merged 15 commits into
mainfrom
ci_improvements

Conversation

@swahtz

@swahtz swahtz commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

This PR hardens our GitHub Actions CI, centered on the admin-scoped token used to register self-hosted EC2 runners. It migrates that token to a fine-grained, purpose-scoped credential; completes a zizmor/actionlint audit pass across all workflows (pin actions to commit SHAs, scope id-token permissions, justified trigger exemptions); and adds a new Workflow Security gate that enforces — automatically, on every PR — that the runner token can only ever be used as the github-token input to machulav/ec2-github-runner.

Motivation

secrets.EC2_RUNNER_TOKEN has Administration: Read/Write on the repository — it has to, so machulav/ec2-github-runner can register and de-register runners. Our CI runs on pull_request_target, which is required to provision runners for PRs (by making the EC2_RUNNER_TOKEN available to PRs). This PR reduces the blast radius of that token and makes its safe usage a checked, enforceable invariant rather than a convention.

The specific threat the new gate stops

A single PR cannot exfiltrate the token: under pull_request_target the workflow definition runs from the base branch (not the PR), and the token only ever lives in the runner start/stop jobs, which run on GitHub-hosted runners and never check out or execute PR code. The real risk is a two-stage, time-of-merge attack: a PR adds the token to a code-running job / env: / run: step (dormant — the base definition runs on the PR, so it looks harmless), a maintainer merges it, and the next run executes the now-poisoned base definition and leaks the token. The new scan catches that dangerous edit against the PR's proposed workflow files before it can land on main.

What's in this PR

1. Token hardening

  • Replace GH_PERSONAL_ACCESS_TOKEN with the fine-grained EC2_RUNNER_TOKEN for starting/stopping EC2 runners across tests.yml, cu128.yml, cu130.yml, publish.yml, and nightly-publish.yml.
  • Remove unnecessary token usage from sync-doc-version.yml and only sync doc versions on publish.

2. zizmor / actionlint audit hardening

  • Pin every third-party action to a commit SHA across all workflows.
  • General zizmor audit fixes (credential handling, permissions, template-injection-prone patterns).
  • Scope id-token: write (AWS OIDC) down to only the jobs/steps that actually assume an AWS role, instead of granting it workflow-wide.
  • Add justified zizmor: ignore[dangerous-triggers] annotations on the pull_request_target triggers that are genuinely required to provision runners.

3. New Workflow Security gate

  • .github/workflows/workflow-security.yml — runs on every PR (pull_request_target) and on push to main. Three scans: the repo-specific token policy (+ its unit tests), actionlint, and zizmor.
  • .github/scripts/check_runner_token_policy.py — the repo-specific policy (rules below). Passes cleanly on all current workflows.
  • .github/scripts/test_check_runner_token_policy.py — unit tests for the policy (compliant baseline, each violation class, the real repo workflows, and a fail-closed test for the leak check). Run by the Workflow Security job on every PR; needs only pyyaml + pytest, no fvdb build.
  • Added permissions: contents: read to codestyle.yml and docs-build-test.yml so the whole workflow set passes zizmor at full strictness (no severity-threshold weakening).

The token policy (enforced by check_runner_token_policy.py)

  1. The token name may appear only inside .github/workflows/*.{yml,yaml} (plus the enforcement script and its tests) — never in product source, etc.
  2. Every textual occurrence in a workflow must be exactly github-token: ${{ secrets.EC2_RUNNER_TOKEN }}. This single rule forbids putting the token in env:, GH_TOKEN/GITHUB_TOKEN, with.token, a run: script, or a reusable-workflow secrets: block.
  3. The step that consumes the token must uses: machulav/ec2-github-runner.
  4. A job that references the token must not pull untrusted code into its workspace alongside the privileged context: no local actions (uses: ./...) and no actions/checkout.

Why pull_request_target (not pull_request) for the gate

Under pull_request_target the workflow definition, the policy script, and the scanner config all come from the base branch, so a malicious PR cannot edit the check to make it pass. The job overlays only the PR head's proposed .github/workflows files as inert data to scan — no PR code is executed and no secrets are exposed (permissions: contents: read).

Required follow-up (repository admin — not code)

The scan only stops the two-stage attack if it blocks the merge. After this merges, make the check Scan workflows + enforce runner-token policy a required status check on main (and any release branches), via a branch ruleset or classic branch protection, and do not allow admins to bypass it. Because the workflow triggers on pull_request_target for every branch, it runs on every PR, so requiring it will not leave PRs stuck "waiting for status". Optionally add a CODEOWNERS entry on .github/ so a human also reviews workflow diffs.

@swahtz swahtz requested a review from a team as a code owner June 29, 2026 05:54
@swahtz swahtz requested review from areidmeyer, Copilot and harrism June 29, 2026 05:54

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the repository’s GitHub Actions CI around the privileged EC2 runner registration token by tightening token usage patterns, reducing workflow permissions, pinning third-party actions to immutable SHAs, and introducing a dedicated “Workflow Security” gate that scans workflow changes on every PR.

Changes:

  • Replaced the broad admin PAT usage with a purpose-scoped EC2_RUNNER_TOKEN for EC2 runner lifecycle steps, and reduced token usage elsewhere.
  • Performed a zizmor/actionlint hardening sweep: pin actions by SHA, narrow id-token: write to only AWS-OIDC jobs, and add justified dangerous-trigger exemptions.
  • Added a new workflow-security workflow plus a repo-specific policy script to enforce safe runner-token usage patterns.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
.github/workflows/workflow-security.yml New PR/merge gate to scan workflows (token-policy + actionlint + zizmor).
.github/scripts/check_runner_token_policy.py New repo-specific enforcement script for runner-token safety constraints.
.github/workflows/tests.yml Uses EC2_RUNNER_TOKEN, pins actions, and scopes OIDC permissions to runner lifecycle jobs.
.github/workflows/cu128.yml Same hardening as tests workflow for CUDA 12.8 CI.
.github/workflows/cu130.yml Same hardening as tests workflow for CUDA 13.0 CI.
.github/workflows/publish.yml Pins actions, scopes OIDC permissions, and switches runner token usage to EC2_RUNNER_TOKEN.
.github/workflows/nightly-publish.yml Pins actions, scopes OIDC permissions, and switches runner token usage to EC2_RUNNER_TOKEN.
.github/workflows/nightly.yml Pins actions and reduces template-injection-prone patterns in shell steps.
.github/workflows/sync-doc-version.yml Changes trigger to release-published; removes PAT usage and pins actions.
.github/workflows/load-versions.yml Pins checkout action and disables credential persistence.
.github/workflows/docs-build-test.yml Adds explicit read-only permissions and pins setup/checkout actions.
.github/workflows/codestyle.yml Adds explicit read-only permissions and pins third-party actions.
.github/workflows/docs.yml Pins Pages deployment-related actions to commit SHAs.
.github/workflows/issue-triage.yml Reduces expression interpolation in shell by passing values via env vars.
.github/workflows/check-changes.yml Pins dorny/paths-filter to a commit SHA.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/scripts/check_runner_token_policy.py
Comment thread .github/scripts/check_runner_token_policy.py Outdated
Comment thread .github/scripts/check_runner_token_policy.py Outdated
swahtz added 7 commits June 29, 2026 18:05
Remove unnecessary token usage

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…d EC2_RUNNER_TOKEN for starting EC2 instances

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…RUNNER_TOKEN misuse in a PR

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz swahtz force-pushed the ci_improvements branch from 90efd40 to 4867997 Compare June 29, 2026 06:07
@swahtz swahtz changed the title Ci improvements CI Token Best Practices Sweep Jun 29, 2026
- Add the standard OpenVDB copyright header to check_runner_token_policy.py
- check_no_leaks_outside_workflows() now fails closed on unexpected `git grep`
  exit codes (e.g. when run outside a git worktree) instead of silently
  passing, so Rule 1 cannot be weakened by an environment quirk
- Add test_check_runner_token_policy.py (11 cases: compliant baseline, every
  violation class, the real repo workflows, and the leak check failing closed)
  and run it from the Workflow Security job; it needs only pyyaml + pytest
- Broaden the token-name allow-list to cover .github/scripts/ tooling and the
  security doc (these reference the name, never the value)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 14 comments.

Comment thread .github/workflows/publish.yml Outdated
Comment thread .github/workflows/publish.yml Outdated
Comment thread .github/workflows/nightly-publish.yml Outdated
Comment thread .github/workflows/nightly-publish.yml Outdated
Comment thread .github/workflows/cu128.yml Outdated
Comment thread .github/workflows/codestyle.yml
Comment thread .github/workflows/codestyle.yml
Comment thread .github/workflows/codestyle.yml
Comment thread .github/workflows/codestyle.yml
Comment thread .github/workflows/codestyle.yml
Fix single-quoted shell env-var interpolations introduced by the
template-injection hardening (single quotes prevent shell expansion, so the
literal ${NEEDS_VERSIONS_OUTPUTS_*} was being passed through):
- --cuda-arch-list in cu128, cu130, cu130-nightly, publish, nightly-publish
- the gcc-toolset profile.d snippet in publish and nightly-publish
These were correct when they were ${{ ... }} (render-time) but broke once they
became ${VAR} (shell-time); switch the affected quotes to double quotes.

Workflow Security gate:
- Fetch the PR head via refs/pull/<n>/head instead of by SHA from origin, so
  the overlay works for forked PRs too.
- Make the Rule 1 leak check scan the PR head commit tree (new
  --leak-check-ref, passed as FETCH_HEAD) so it also catches token references
  the PR adds outside .github/workflows, while the policy script still runs
  from the trusted base checkout (git grep only reads blobs). Add tests for
  the ref path (clean repo passes; missing ref fails closed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz swahtz requested a review from Copilot June 29, 2026 06:34
@swahtz swahtz added the CI Issues related to the Github actions CI/CD. For build issues use CMake/Build label Jun 29, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comment thread .github/workflows/publish.yml
Comment thread .github/workflows/nightly-publish.yml

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Comment thread .github/workflows/workflow-security.yml
Comment thread .github/scripts/check_runner_token_policy.py
Comment thread .github/workflows/docs.yml Outdated
…e tag

- workflow-security.yml: pin ZIZMOR_VERSION to an exact 1.26.1 instead of the
  floating "1.*", so the security gate is deterministic and a new zizmor
  release can't start failing unrelated PRs.
- check_runner_token_policy.py: treat a missing `git` (FileNotFoundError) as a
  Rule 1 violation rather than a silent skip, so the leak check always fails
  closed when it cannot verify confinement. Add a monkeypatched regression test.
- docs.yml: correct the actions/deploy-pages SHA annotation from the misleading
  "# v3.0.2-node.24" to "# v5.0.0". The SHA is unchanged and is the same commit
  v5/v5.0.0 point to (it carries all three tags); this is not a downgrade, just
  an accurate annotation matching the prior @v5 usage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Comment thread .github/workflows/workflow-security.yml Outdated
Pin the policy self-test dependencies to exact versions (pyyaml==6.0.3,
pytest==9.0.3) via new PYYAML_VERSION/PYTEST_VERSION env vars, alongside the
zizmor/actionlint pins, instead of the floating `pyyaml==6.*` / `pytest>=8,<9`
ranges. Keeps the security gate deterministic so a new pyyaml/pytest release
can't start failing unrelated PRs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Comment thread .github/scripts/check_runner_token_policy.py
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.

Rules 1-4 are textual and key on the literal name EC2_RUNNER_TOKEN. Dynamic
secret indexing such as `secrets[format('EC2_RUNNER_%s', 'TOKEN')]` or
`secrets[matrix.name]` could resolve the admin token without ever spelling its
name, evading every rule (and the leak grep). Add Rule 5: reject any dynamic
`secrets[...]` access in a workflow. It runs on every workflow file (even ones
that never name the token), uses a \b anchor so identifiers merely ending in
"secrets" aren't matched, and the repo uses no dynamic secret access today.
Adds four tests (format(), matrix index, token-name-absent, and a negative
case for the \b anchor).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Comment thread .github/workflows/sync-doc-version.yml
The workflow now triggers on `release: published`, where actions/checkout
defaults to the tag ref (detached HEAD). peter-evans/create-pull-request then
has no branch to base the PR on and would fail or base it on the tag commit
rather than current main. Pin the checkout to `ref: main` so the doc-version PR
is always created from the default branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Comment thread .github/scripts/check_runner_token_policy.py Outdated
Comment thread .github/scripts/check_runner_token_policy.py
Comment thread .github/scripts/check_runner_token_policy.py Outdated
Rule 1's docstring and the three leak-check messages said the token must be
confined to `.github/workflows/`, but the implementation also allows
`.github/scripts/` (where this script and its tests live). Introduce a single
ALLOWED_PATHS_DESC derived from ALLOWED_PATH_PREFIXES and use it in all messages,
and update the docstring, so failure output reflects the real allowed surface.
No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.

@harrism harrism left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice. Great hardening. Nice tests.

@swahtz swahtz merged commit 53ca3f1 into main Jun 30, 2026
40 checks passed
@swahtz swahtz deleted the ci_improvements branch June 30, 2026 00:39
swahtz added a commit that referenced this pull request Jun 30, 2026
## Problem

The `Workflow Security` gate (merged in #672) is failing on real PRs
(e.g. #673). Its `actionlint` step runs **shellcheck** on `run:` scripts
whenever shellcheck is installed — it is on GitHub-hosted runners but
not in many local setups, so the check passed locally yet fails on every
PR with **157 info/style findings** across pre-existing workflow scripts
repo-wide:

| code | count | what |
|---|---|---|
| SC2086 | 146 | unquoted `$VAR` (word-splitting) |
| SC2174 | 8 | `mkdir -p -m` mode only on deepest dir |
| SC2012 | 2 | use `find` instead of `ls` |
| SC2129 | 1 | grouped redirect style |

There are **no** actual actionlint errors (syntax/type/injection) —
purely the bundled shellcheck nits, almost all pre-existing and
unrelated to the token work.

## Fix

Set `SHELLCHECK_OPTS: "-e SC2086,SC2174,SC2012,SC2129"` on the
actionlint step so the gate fails only on meaningful problems. **Every
other shellcheck check stays active** — including `SC2016` (single-quote
expansion, the exact bug class fixed during #672) — as do actionlint's
own expression/injection checks and the token-policy script. The
excludes can be dropped later once the scripts are properly quoted.

## Verification

Reproduced locally with shellcheck installed: default `actionlint` → 157
findings / exit 1; with the excludes → **0 findings / exit 0**. `zizmor`
and the token policy are unaffected.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI Issues related to the Github actions CI/CD. For build issues use CMake/Build

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants