CI Token Best Practices Sweep#672
Merged
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens the repository’s GitHub Actions CI around the privileged EC2 runner registration token by tightening token usage patterns, reducing workflow permissions, pinning third-party actions to immutable SHAs, and introducing a dedicated “Workflow Security” gate that scans workflow changes on every PR.
Changes:
- Replaced the broad admin PAT usage with a purpose-scoped
EC2_RUNNER_TOKENfor EC2 runner lifecycle steps, and reduced token usage elsewhere. - Performed a zizmor/actionlint hardening sweep: pin actions by SHA, narrow
id-token: writeto only AWS-OIDC jobs, and add justified dangerous-trigger exemptions. - Added a new
workflow-securityworkflow plus a repo-specific policy script to enforce safe runner-token usage patterns.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/workflow-security.yml | New PR/merge gate to scan workflows (token-policy + actionlint + zizmor). |
| .github/scripts/check_runner_token_policy.py | New repo-specific enforcement script for runner-token safety constraints. |
| .github/workflows/tests.yml | Uses EC2_RUNNER_TOKEN, pins actions, and scopes OIDC permissions to runner lifecycle jobs. |
| .github/workflows/cu128.yml | Same hardening as tests workflow for CUDA 12.8 CI. |
| .github/workflows/cu130.yml | Same hardening as tests workflow for CUDA 13.0 CI. |
| .github/workflows/publish.yml | Pins actions, scopes OIDC permissions, and switches runner token usage to EC2_RUNNER_TOKEN. |
| .github/workflows/nightly-publish.yml | Pins actions, scopes OIDC permissions, and switches runner token usage to EC2_RUNNER_TOKEN. |
| .github/workflows/nightly.yml | Pins actions and reduces template-injection-prone patterns in shell steps. |
| .github/workflows/sync-doc-version.yml | Changes trigger to release-published; removes PAT usage and pins actions. |
| .github/workflows/load-versions.yml | Pins checkout action and disables credential persistence. |
| .github/workflows/docs-build-test.yml | Adds explicit read-only permissions and pins setup/checkout actions. |
| .github/workflows/codestyle.yml | Adds explicit read-only permissions and pins third-party actions. |
| .github/workflows/docs.yml | Pins Pages deployment-related actions to commit SHAs. |
| .github/workflows/issue-triage.yml | Reduces expression interpolation in shell by passing values via env vars. |
| .github/workflows/check-changes.yml | Pins dorny/paths-filter to a commit SHA. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Remove unnecessary token usage Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…d EC2_RUNNER_TOKEN for starting EC2 instances Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…RUNNER_TOKEN misuse in a PR Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
- Add the standard OpenVDB copyright header to check_runner_token_policy.py - check_no_leaks_outside_workflows() now fails closed on unexpected `git grep` exit codes (e.g. when run outside a git worktree) instead of silently passing, so Rule 1 cannot be weakened by an environment quirk - Add test_check_runner_token_policy.py (11 cases: compliant baseline, every violation class, the real repo workflows, and the leak check failing closed) and run it from the Workflow Security job; it needs only pyyaml + pytest - Broaden the token-name allow-list to cover .github/scripts/ tooling and the security doc (these reference the name, never the value) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Fix single-quoted shell env-var interpolations introduced by the
template-injection hardening (single quotes prevent shell expansion, so the
literal ${NEEDS_VERSIONS_OUTPUTS_*} was being passed through):
- --cuda-arch-list in cu128, cu130, cu130-nightly, publish, nightly-publish
- the gcc-toolset profile.d snippet in publish and nightly-publish
These were correct when they were ${{ ... }} (render-time) but broke once they
became ${VAR} (shell-time); switch the affected quotes to double quotes.
Workflow Security gate:
- Fetch the PR head via refs/pull/<n>/head instead of by SHA from origin, so
the overlay works for forked PRs too.
- Make the Rule 1 leak check scan the PR head commit tree (new
--leak-check-ref, passed as FETCH_HEAD) so it also catches token references
the PR adds outside .github/workflows, while the policy script still runs
from the trusted base checkout (git grep only reads blobs). Add tests for
the ref path (clean repo passes; missing ref fails closed).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…e tag - workflow-security.yml: pin ZIZMOR_VERSION to an exact 1.26.1 instead of the floating "1.*", so the security gate is deterministic and a new zizmor release can't start failing unrelated PRs. - check_runner_token_policy.py: treat a missing `git` (FileNotFoundError) as a Rule 1 violation rather than a silent skip, so the leak check always fails closed when it cannot verify confinement. Add a monkeypatched regression test. - docs.yml: correct the actions/deploy-pages SHA annotation from the misleading "# v3.0.2-node.24" to "# v5.0.0". The SHA is unchanged and is the same commit v5/v5.0.0 point to (it carries all three tags); this is not a downgrade, just an accurate annotation matching the prior @v5 usage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Pin the policy self-test dependencies to exact versions (pyyaml==6.0.3, pytest==9.0.3) via new PYYAML_VERSION/PYTEST_VERSION env vars, alongside the zizmor/actionlint pins, instead of the floating `pyyaml==6.*` / `pytest>=8,<9` ranges. Keeps the security gate deterministic so a new pyyaml/pytest release can't start failing unrelated PRs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Rules 1-4 are textual and key on the literal name EC2_RUNNER_TOKEN. Dynamic
secret indexing such as `secrets[format('EC2_RUNNER_%s', 'TOKEN')]` or
`secrets[matrix.name]` could resolve the admin token without ever spelling its
name, evading every rule (and the leak grep). Add Rule 5: reject any dynamic
`secrets[...]` access in a workflow. It runs on every workflow file (even ones
that never name the token), uses a \b anchor so identifiers merely ending in
"secrets" aren't matched, and the repo uses no dynamic secret access today.
Adds four tests (format(), matrix index, token-name-absent, and a negative
case for the \b anchor).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
The workflow now triggers on `release: published`, where actions/checkout defaults to the tag ref (detached HEAD). peter-evans/create-pull-request then has no branch to base the PR on and would fail or base it on the tag commit rather than current main. Pin the checkout to `ref: main` so the doc-version PR is always created from the default branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Rule 1's docstring and the three leak-check messages said the token must be confined to `.github/workflows/`, but the implementation also allows `.github/scripts/` (where this script and its tests live). Introduce a single ALLOWED_PATHS_DESC derived from ALLOWED_PATH_PREFIXES and use it in all messages, and update the docstring, so failure output reflects the real allowed surface. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
harrism
approved these changes
Jun 30, 2026
harrism
left a comment
Contributor
There was a problem hiding this comment.
Really nice. Great hardening. Nice tests.
swahtz
added a commit
that referenced
this pull request
Jun 30, 2026
## Problem The `Workflow Security` gate (merged in #672) is failing on real PRs (e.g. #673). Its `actionlint` step runs **shellcheck** on `run:` scripts whenever shellcheck is installed — it is on GitHub-hosted runners but not in many local setups, so the check passed locally yet fails on every PR with **157 info/style findings** across pre-existing workflow scripts repo-wide: | code | count | what | |---|---|---| | SC2086 | 146 | unquoted `$VAR` (word-splitting) | | SC2174 | 8 | `mkdir -p -m` mode only on deepest dir | | SC2012 | 2 | use `find` instead of `ls` | | SC2129 | 1 | grouped redirect style | There are **no** actual actionlint errors (syntax/type/injection) — purely the bundled shellcheck nits, almost all pre-existing and unrelated to the token work. ## Fix Set `SHELLCHECK_OPTS: "-e SC2086,SC2174,SC2012,SC2129"` on the actionlint step so the gate fails only on meaningful problems. **Every other shellcheck check stays active** — including `SC2016` (single-quote expansion, the exact bug class fixed during #672) — as do actionlint's own expression/injection checks and the token-policy script. The excludes can be dropped later once the scripts are properly quoted. ## Verification Reproduced locally with shellcheck installed: default `actionlint` → 157 findings / exit 1; with the excludes → **0 findings / exit 0**. `zizmor` and the token policy are unaffected. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Signed-off-by: Jonathan Swartz <jonathan@jswartz.info> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR hardens our GitHub Actions CI, centered on the admin-scoped token used to register self-hosted EC2 runners. It migrates that token to a fine-grained, purpose-scoped credential; completes a zizmor/actionlint audit pass across all workflows (pin actions to commit SHAs, scope
id-tokenpermissions, justified trigger exemptions); and adds a new Workflow Security gate that enforces — automatically, on every PR — that the runner token can only ever be used as thegithub-tokeninput tomachulav/ec2-github-runner.Motivation
secrets.EC2_RUNNER_TOKENhas Administration: Read/Write on the repository — it has to, somachulav/ec2-github-runnercan register and de-register runners. Our CI runs onpull_request_target, which is required to provision runners for PRs (by making theEC2_RUNNER_TOKENavailable to PRs). This PR reduces the blast radius of that token and makes its safe usage a checked, enforceable invariant rather than a convention.The specific threat the new gate stops
A single PR cannot exfiltrate the token: under
pull_request_targetthe workflow definition runs from thebasebranch (not the PR), and the token only ever lives in the runner start/stop jobs, which run on GitHub-hosted runners and never check out or execute PR code. The real risk is a two-stage, time-of-merge attack: a PR adds the token to a code-running job /env:/run:step (dormant — the base definition runs on the PR, so it looks harmless), a maintainer merges it, and the next run executes the now-poisoned base definition and leaks the token. The new scan catches that dangerous edit against the PR's proposed workflow files before it can land onmain.What's in this PR
1. Token hardening
GH_PERSONAL_ACCESS_TOKENwith the fine-grainedEC2_RUNNER_TOKENfor starting/stopping EC2 runners acrosstests.yml,cu128.yml,cu130.yml,publish.yml, andnightly-publish.yml.sync-doc-version.ymland only sync doc versions on publish.2. zizmor / actionlint audit hardening
id-token: write(AWS OIDC) down to only the jobs/steps that actually assume an AWS role, instead of granting it workflow-wide.zizmor: ignore[dangerous-triggers]annotations on thepull_request_targettriggers that are genuinely required to provision runners.3. New Workflow Security gate
.github/workflows/workflow-security.yml— runs on every PR (pull_request_target) and on push tomain. Three scans: the repo-specific token policy (+ its unit tests),actionlint, andzizmor..github/scripts/check_runner_token_policy.py— the repo-specific policy (rules below). Passes cleanly on all current workflows..github/scripts/test_check_runner_token_policy.py— unit tests for the policy (compliant baseline, each violation class, the real repo workflows, and a fail-closed test for the leak check). Run by the Workflow Security job on every PR; needs onlypyyaml+pytest, no fvdb build.permissions: contents: readtocodestyle.ymlanddocs-build-test.ymlso the whole workflow set passeszizmorat full strictness (no severity-threshold weakening).The token policy (enforced by
check_runner_token_policy.py).github/workflows/*.{yml,yaml}(plus the enforcement script and its tests) — never in product source, etc.github-token: ${{ secrets.EC2_RUNNER_TOKEN }}. This single rule forbids putting the token inenv:,GH_TOKEN/GITHUB_TOKEN,with.token, arun:script, or a reusable-workflowsecrets:block.uses: machulav/ec2-github-runner.uses: ./...) and noactions/checkout.Why
pull_request_target(notpull_request) for the gateUnder
pull_request_targetthe workflow definition, the policy script, and the scanner config all come from thebasebranch, so a malicious PR cannot edit the check to make it pass. The job overlays only the PR head's proposed.github/workflowsfiles as inert data to scan — no PR code is executed and no secrets are exposed (permissions: contents: read).Required follow-up (repository admin — not code)
The scan only stops the two-stage attack if it blocks the merge. After this merges, make the check
Scan workflows + enforce runner-token policya required status check onmain(and any release branches), via a branch ruleset or classic branch protection, and do not allow admins to bypass it. Because the workflow triggers onpull_request_targetfor every branch, it runs on every PR, so requiring it will not leave PRs stuck "waiting for status". Optionally add aCODEOWNERSentry on.github/so a human also reviews workflow diffs.