feat: non-root runner user, --ephemeral, configurable runner version (Phase 4) #18
Merged
kurok merged 1 commit into feat/al2023-support on Apr 21, 2026
…sion (Phase 4)

Closes #10. Biggest compatibility risk in the modernization plan, called out in the #15 tracker as needing a provider-repo dogfood before landing.

## Bootstrap rewrite

The EC2 user-data now:

- set -euo pipefail throughout: a silent useradd / tar / sha256sum failure kills the bootstrap instead of proceeding to a broken ./run.sh.
- Creates a dedicated 'runner' user (idempotent: skipped if it already exists, so re-runs from a crash-loop don't explode).
- Drops to that user via 'sudo -u runner -H bash <<RUNNER_BOOTSTRAP' for every subsequent step. The old 'export RUNNER_ALLOW_RUNASROOT=1' escape hatch is gone.
- Fetches the runner tarball and SHA-256-verifies it against actions/runner's published '.sha256' sidecar before extraction. Same defense-in-depth pattern the provider repo uses for Go and Terraform downloads (namecheap/terraform-provider-namecheap#160).
- Passes '--ephemeral --unattended --disableupdate' to config.sh. GitHub auto-deregisters the runner after one job, so the existing removeRunner() API call in src/gh.js becomes belt-and-braces rather than the primary deregister path. --disableupdate keeps the runner binary stable for the short-lived ephemeral session.

## New 'runner-version' input

Optional, defaults to '2.333.1' (the version this PR is tested against). Consumers can override without waiting for a new action release, which is useful when GitHub gates a JS action on a newer node runtime and we need to move fast. src/config.js reads it with a default fallback so old callers that don't set it continue to work.

## CI adjustment

The existing verify-runner-url job greps the literal version string out of the source to HEAD-check the release asset. With the version now parameterized, the literal lives in action.yml's 'default:', so the extractor is rewritten to read it from there.

## Tests

tests/config.test.js adds two cases:

- defaults to 2.333.1 when runner-version input is unset
- honors an explicit override

Full suite: 23 tests pass across utils + config.

## Consumer impact (terraform-provider-namecheap acctest)

- make testacc is 'go test', so no root is required.
- All setup steps (curl Go / Terraform, extract tarballs, write go-env.sh) write to $GITHUB_WORKSPACE, which is writable by any runner user, not just root.
- actions/checkout@v6 writes to the workspace, no root.
- The workspace directory structure is unchanged beyond its absolute path (/home/runner/actions-runner/_work/... instead of /actions-runner/_work/...). GITHUB_WORKSPACE, HOME, and relative paths all resolve the same way.

The dogfood SHA-pin rotation will be opened on the provider repo after this merges, mirroring the pattern from machulav#158 → machulav#159.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
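The default fallback for the runner-version input could look roughly like the sketch below. It is an illustration, not the actual src/config.js: `resolveRunnerVersion` and its `getInput` parameter are hypothetical names. `getInput` stands in for an input reader such as `core.getInput` from `@actions/core`, which returns an empty string when an input is unset.

```javascript
// Hypothetical sketch; not the actual src/config.js.
const DEFAULT_RUNNER_VERSION = '2.333.1'; // default stated in this PR

// `getInput` is an assumed stand-in for an Actions input reader that
// returns '' for an unset input.
function resolveRunnerVersion(getInput) {
  const raw = (getInput('runner-version') || '').trim();
  // Unset or blank input: old callers keep working with the default.
  return raw === '' ? DEFAULT_RUNNER_VERSION : raw;
}

// An unset input resolves to the default.
console.log(resolveRunnerVersion(() => ''));
```

Passing the reader in as a parameter keeps the sketch runnable without the Actions toolkit installed.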
This was referenced Apr 21, 2026
kurok added a commit that referenced this pull request on Apr 21, 2026
Phase 4 (#18) landed three independent hardenings in one PR:

- New configurable runner-version input (no runtime impact)
- Ephemeral + checksum + set -euo pipefail (additive safety)
- Root to non-root runner user via sudo-heredoc (behavioral change)

The dogfood rotation on terraform-provider-namecheap#182 failed: 'Start self-hosted EC2 runner' timed out at 6m15s waiting for runner registration. The EC2 instance booted fine, but whatever the user-data did inside the instance, it didn't end at './run.sh' polling GitHub. We can't post-mortem directly because the instance is ephemeral and already terminated.

Fix-forward strategy: revert ONLY the non-root transition (highest-probability culprit among the three axes), keep everything else from Phase 4. If the Phase 4 dogfood rotation passes after this revert, the root-to-runner sudo-heredoc is the breaker and can be investigated as an isolated follow-up (likely candidates: sudoers config on the hardened AMI, SELinux context, config.sh writing outside its own directory, or my heredoc quoting). Landing the safer pieces now unblocks Phases 5/6/7.

Kept:

- runner-version input (Phase 4's main feature)
- set -euo pipefail
- --ephemeral + --unattended + --disableupdate on config.sh
- SHA-256 verification of the runner tarball
- Clearer bash syntax ('case "$(uname -m)"', double-quoted vars)

Reverted:

- useradd + sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' heredoc
- RUNNER_ALLOW_RUNASROOT=1 restored (runner executes as root again)

The non-root goal isn't lost. A follow-up issue will propose it again, this time with better instrumentation so we can see what failed inside the instance.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
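One possible shape for that instrumentation is sketched below, assuming the user-data is assembled in JavaScript as an array of bash lines. The `withDiagnostics` helper, the log path `/var/log/user-data.log`, and the `user-data` logger tag are all assumptions for illustration, not taken from src/aws.js. The `exec` line mirrors the script's output to the system console, which is what `aws ec2 get-console-output` reads, and the `ERR` trap names the failing command before `set -e` ends the script.

```javascript
// Hypothetical helper: prepends diagnostic plumbing to a user-data script
// expressed as an array of bash lines. Not the actual src/aws.js.
function withDiagnostics(bootstrapLines) {
  return [
    '#!/bin/bash',
    // Mirror stdout/stderr to a log file (assumed path), syslog, and the console.
    'exec > >(tee /var/log/user-data.log | logger -t user-data -s 2>/dev/console) 2>&1',
    'set -euo pipefail',
    // Report the failing line and command before set -e terminates the script.
    'trap \'echo "bootstrap failed at line $LINENO: $BASH_COMMAND"\' ERR',
    ...bootstrapLines,
  ];
}

// Render the script for a placeholder bootstrap step.
const script = withDiagnostics(['echo "bootstrap steps go here"']).join('\n');
console.log(script);
```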
kurok added a commit that referenced this pull request on Apr 21, 2026
* revert: full rollback of Phase 4 bootstrap changes

Phase 4 attempts #18 (with non-root) and #19 (without non-root but keeping --ephemeral + checksum + set -euo pipefail + runner-version input) BOTH failed the provider dogfood with the same 6m15s runner registration timeout (terraform-provider-namecheap#182 and machulav#183).

The fix-forward in #19 narrowed the suspect set from 'all Phase 4 changes' to 'one of: set -euo pipefail, --ephemeral flag, --disableupdate flag, checksum verify, parameterized bash vars'. Still not isolated.

Full rollback here restores the known-good Phase 1 bootstrap exactly. Everything else from Phase 1 is preserved (aws-sdk v3, ncc 0.38, jest tests, .gitattributes). Phase 4 work is NOT abandoned: it moves to follow-up issues where each change lands on its own with its own dogfood, so the next failure isolates itself to a single axis instead of requiring bisection across five simultaneous changes.

Files reverted to match a1bd2f9 (Phase 1 tip):

- action.yml (drops runner-version input)
- src/aws.js (original 12-line bash array, yum install libicu make, RUNNER_ALLOW_RUNASROOT=1, no --ephemeral, no checksum verify)
- src/config.js (drops runnerVersion field)
- tests/config.test.js (drops runner-version test block, 23 -> 21 tests)

Dist rebuilt against the reverted src (verify-dist will confirm).

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>

* ci: revert verify-runner-url extractor to grep src/aws.js

Paired with the full Phase 4 revert. Now that action.yml no longer has a runner-version default, the Phase 4 version of verify-runner-url that reads action.yml can't find the version. Restore the original extractor that greps the literal URL out of src/aws.js.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>

---------

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
kurok added a commit that referenced this pull request on Apr 21, 2026
…cksum table (#26)

Closes #20. Supersedes the reverted #18 / #19 / #21.

Implements the full Phase 4 bootstrap hardening from issue #10, with the root-cause fix from #20 baked in. Key differences from the earlier failed attempts:

## The fix for the actual failure

Previous attempts died at:

    curl -fsSL <tarball>.sha256 | awk '{print $1}'

with a 404 (actions/runner doesn't publish per-tarball sidecar files, empirically confirmed via aws ec2 get-console-output on a probe instance; see #20).

This PR replaces that with a hardcoded table of expected hashes in src/runner-checksums.js, keyed by 'arch-version'. Two x86_64 / arm64 entries for the currently-pinned v2.333.1, sourced from the release body at github.com/actions/runner/releases/tag/v2.333.1. CI enforces table-vs-upstream consistency on every PR (see pr.yml).

## Everything else from Phase 4

- Non-root 'runner' user (useradd -m, sudo -u runner -H bash heredoc). RUNNER_ALLOW_RUNASROOT=1 escape hatch removed.
- New 'runner-version' input in action.yml (default '2.333.1'). To override, add matching x64+arm64 SHAs to runner-checksums.js in the same PR. verify-runner-url CI will reject the change if the hashes don't match upstream.
- --ephemeral --unattended --disableupdate on config.sh. GitHub auto-deregisters the runner after its job; disableupdate keeps the binary stable during the short ephemeral session.
- set -euo pipefail on both the outer and inner (runner-user) shells. The earlier fatal failure under set -e was the .sha256 404, which no longer exists.
- Parameterized RUNNER_VERSION / TARBALL / BASE bash vars.

## Tests

tests/runner-checksums.test.js: 6 new cases covering the table shape, hex format, x64+arm64 parity per version, and lookup returns for known/unknown keys.

tests/config.test.js: 2 new cases for the runner-version input (default fallback + override).

Total: 36 -> 44 tests.

## CI: verify-runner-url overhaul

The job now parses the runner-version from action.yml, then:

1. HEADs the Linux x64 release asset (unchanged).
2. Fetches the release body via 'gh api'.
3. Greps the BEGIN SHA linux-x64 / linux-arm64 HTML comments.
4. Cross-checks against the values lookup() returns from src/runner-checksums.js.

Drift between the hardcoded table and upstream fails CI at code-review time, not at runtime.

## Dogfood plan (MUCH more careful this time)

Provider SHA-pin rotation after merge, same pattern as prior phases. This time I have full EC2 console-output diagnostic capability via the recipe saved in my notes, so any new bootstrap failure should be trivially diagnosable rather than opaque.

Closing #20 on merge.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
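The checksum table and the CI cross-check described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of src/runner-checksums.js or pr.yml. The digests are placeholders rather than real actions/runner hashes, `extractReleaseSha` and `findDrift` are hypothetical helper names, the `undefined` return for unknown keys is assumed, and so is the `END SHA` marker that closes each `BEGIN SHA` comment.

```javascript
// Sketch only: the digests below are PLACEHOLDERS, not real hashes.
const CHECKSUMS = Object.freeze({
  'x64-2.333.1': 'a'.repeat(64), // placeholder digest
  'arm64-2.333.1': 'b'.repeat(64), // placeholder digest
});

// Returns the expected SHA-256 for an arch/version pair, or undefined when
// the pair is not in the table (assumed behaviour for unknown keys).
function lookup(arch, version) {
  const key = `${arch}-${version}`;
  return Object.prototype.hasOwnProperty.call(CHECKSUMS, key)
    ? CHECKSUMS[key]
    : undefined;
}

// Pulls a digest out of release-body text. The BEGIN marker is named in the
// commit message; the matching END marker is an assumption.
function extractReleaseSha(body, platform) {
  const re = new RegExp(
    `<!-- BEGIN SHA ${platform} -->\\s*([0-9a-f]{64})\\s*<!-- END SHA ${platform} -->`
  );
  const match = body.match(re);
  return match ? match[1] : null;
}

// Lists every arch whose table entry disagrees with the release body.
// A digest missing on either side counts as drift, so the check fails closed.
function findDrift(body, version) {
  return ['x64', 'arm64'].filter(
    (arch) => extractReleaseSha(body, `linux-${arch}`) !== lookup(arch, version)
  );
}
```

In a CI job, a non-empty result from `findDrift` would be the signal to fail the build. Keying the table by a single `arch-version` string keeps the lookup to one property access.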
Closes #10. Part of plan #15.
Biggest compatibility risk in the plan — flagged in #15 as requiring a provider-repo dogfood before landing.
## Changes

### Bootstrap rewrite (`src/aws.js` user-data)

- `set -euo pipefail` throughout: a silent `useradd` / `tar` / `sha256sum` failure kills the bootstrap instead of proceeding to a broken `./run.sh`.
- `runner` user (idempotent) created at instance boot.
- `sudo -u runner -H bash <<RUNNER_BOOTSTRAP` drops to that user for every subsequent step. Old `RUNNER_ALLOW_RUNASROOT=1` escape hatch removed.
- Runner tarball SHA-256-verified against the published `.sha256` sidecar before extraction. Mirrors the pattern we use in the provider for Go/Terraform downloads (ci: verify Go and Terraform download checksums on self-hosted runner, terraform-provider-namecheap#160).
- `--ephemeral --unattended --disableupdate` on `config.sh`:
  - `--ephemeral`: GitHub auto-deregisters after one job. Existing `removeRunner()` in `gh.js` becomes belt-and-braces.
  - `--disableupdate`: runner binary stays fixed for the short-lived session.

### New `runner-version` input (`action.yml`)

Optional, defaults to `'2.333.1'`. Consumers can override without waiting for an action release, which is useful when GitHub gates a JS action on a newer node runtime. `src/config.js` reads it with a default fallback so existing callers continue to work.

### CI adjustment

`verify-runner-url` previously grepped the literal version out of `src/aws.js`. With the version now parameterized, it reads `action.yml`'s `default:` via awk. Smoke-tested locally: it extracts `2.333.1` and HEAD-checks the release asset.

### Tests

`tests/config.test.js` adds `runner-version` coverage: default fallback + explicit override. Full suite: 23 tests.

## Consumer impact (terraform-provider-namecheap acctest)

This is the risky part. Walked through every step of the provider's `acceptance_test` job:

- `actions/checkout@v6` writes to `$GITHUB_WORKSPACE`
- `curl` Go tarball to workspace-local path
- `tar -C .go-instance -xzf …` to workspace
- `go-env.sh` at workspace
- `curl` Terraform zip to workspace
- `unzip -o terraform.zip -d .terraform-bin`
- `make testacc` → `go test ./namecheap -run TestAcc`

Nothing in the provider's acctest needs root. The workspace absolute path shifts from `/actions-runner/_work/…` to `/home/runner/actions-runner/_work/…`, but `$GITHUB_WORKSPACE` / `$HOME` / relative paths all resolve consistently.

## Dogfood plan

After this merges, I'll rotate the SHA pin in `terraform-provider-namecheap/.github/workflows/ci.yml` and run the full acceptance suite against the new bootstrap. Same pattern as machulav#158 → machulav#159 → machulav#181 (Phase 1).

If anything goes sideways in dogfood, I'll investigate before layering Phase 5 retries on top.
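For orientation, the non-root part of the bootstrap rewrite might be assembled along these lines. It is a rough sketch, not the actual `src/aws.js`: `buildNonRootUserData` and its parameter names are hypothetical, only the x64 tarball is handled, and checksum verification is left out. The sketch uses a quoted heredoc delimiter so the outer root shell expands nothing in the body, which differs from the unquoted `<<RUNNER_BOOTSTRAP` form quoted in this PR.

```javascript
// Hypothetical sketch of a non-root bootstrap; not the actual src/aws.js.
// All parameter names are illustrative.
function buildNonRootUserData({ runnerVersion, repoUrl, token, label }) {
  const tarball = `actions-runner-linux-x64-${runnerVersion}.tar.gz`;
  const base = `https://github.com/actions/runner/releases/download/v${runnerVersion}`;
  return [
    '#!/bin/bash',
    'set -euo pipefail',
    // Idempotent: only create the user when it does not exist yet.
    'id -u runner >/dev/null 2>&1 || useradd -m runner',
    // Quoted delimiter: the outer (root) shell expands nothing in the body.
    "sudo -u runner -H bash <<'RUNNER_BOOTSTRAP'",
    'set -euo pipefail',
    // -H sets HOME to the runner user's home, so this is /home/runner/actions-runner.
    'mkdir -p "$HOME/actions-runner" && cd "$HOME/actions-runner"',
    `curl -fsSL -o ${tarball} ${base}/${tarball}`,
    `tar -xzf ${tarball}`,
    `./config.sh --url ${repoUrl} --token ${token} --labels ${label} --ephemeral --unattended --disableupdate`,
    './run.sh',
    'RUNNER_BOOTSTRAP',
  ];
}
```

Because the values are interpolated by JavaScript before the script reaches the instance, the heredoc body needs no outer-shell expansion, which removes one of the quoting hazards a heredoc can introduce.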