Skip to content

feat: non-root runner user, --ephemeral, configurable runner version (Phase 4)#18

Merged
kurok merged 1 commit intofeat/al2023-supportfrom
feat/10-bootstrap-hardening
Apr 21, 2026
Merged

feat: non-root runner user, --ephemeral, configurable runner version (Phase 4)#18
kurok merged 1 commit intofeat/al2023-supportfrom
feat/10-bootstrap-hardening

Conversation

@kurok
Copy link
Copy Markdown

@kurok kurok commented Apr 21, 2026

Closes #10. Part of plan #15.

Biggest compatibility risk in the plan — flagged in #15 as requiring a provider-repo dogfood before landing.

Changes

Bootstrap rewrite (src/aws.js user-data)

  • set -euo pipefail throughout — silent useradd / tar / sha256sum failure kills bootstrap instead of proceeding to a broken ./run.sh.
  • Dedicated runner user (idempotent) created at instance boot.
  • sudo -u runner -H bash <<RUNNER_BOOTSTRAP drops to that user for every subsequent step. Old RUNNER_ALLOW_RUNASROOT=1 escape hatch removed.
  • SHA-256 verification of the runner tarball against actions/runner's published .sha256 sidecar before extraction. Mirrors the pattern we use in the provider for Go/Terraform downloads (ci: verify Go and Terraform download checksums on self-hosted runner terraform-provider-namecheap#160).
  • --ephemeral --unattended --disableupdate on config.sh:
    • --ephemeral: GitHub auto-deregisters after one job. Existing removeRunner() in gh.js becomes belt-and-braces.
    • --disableupdate: runner binary stays fixed for the short-lived session.

New runner-version input (action.yml)

Optional, defaults to '2.333.1'. Consumers can override without waiting for an action release — useful when GitHub gates a JS action on a newer node runtime. src/config.js reads it with a default fallback so existing callers continue to work.

CI adjustment

verify-runner-url previously greped the literal version out of src/aws.js. With the version now parameterized, it reads action.yml's default: via awk. Smoke-tested locally — extracts 2.333.1 and HEAD-checks the release asset.

Tests

tests/config.test.js adds runner-version coverage: default fallback + explicit override. Full suite: 23 tests.

Consumer impact (terraform-provider-namecheap acctest)

This is the risky part. Walked through every step of the provider's acceptance_test job:

Step Root needed?
actions/checkout@v6 writes to $GITHUB_WORKSPACE No
curl Go tarball to workspace-local path No
tar -C .go-instance -xzf … to workspace No
Write go-env.sh at workspace No
curl Terraform zip to workspace No
unzip -o terraform.zip -d .terraform-bin No
make testaccgo test ./namecheap -run TestAcc No

Nothing in the provider's acctest needs root. Workspace absolute path shifts from /actions-runner/_work/… to /home/runner/actions-runner/_work/… but $GITHUB_WORKSPACE / $HOME / relative paths all resolve consistently.

Dogfood plan

After this merges, I'll rotate the SHA pin in terraform-provider-namecheap/.github/workflows/ci.yml and run the full acceptance suite against the new bootstrap. Same pattern as machulav#158machulav#159machulav#181 (Phase 1).

If anything goes sideways in dogfood, I'll investigate before layering Phase 5 retries on top.

…sion (Phase 4)

Closes #10. Biggest compatibility risk in the modernization plan,
called out in the #15 tracker as needing a provider-repo dogfood
before landing.

## Bootstrap rewrite

The EC2 user-data now:

- set -euo pipefail throughout — a silent useradd / tar / sha256sum
  failure kills the bootstrap instead of proceeding to a broken
  ./run.sh.
- Creates a dedicated 'runner' user (idempotent — skipped if it
  already exists, so re-runs from a crash-loop don't explode).
- Drops to that user via 'sudo -u runner -H bash <<RUNNER_BOOTSTRAP'
  for every subsequent step. The old 'export RUNNER_ALLOW_RUNASROOT=1'
  escape hatch is gone.
- Fetches the runner tarball and SHA-256-verifies it against
  actions/runner's published '.sha256' sidecar before extraction.
  Same defense-in-depth pattern the provider repo uses for Go and
  Terraform downloads (namecheap/terraform-provider-namecheap#160).
- Passes '--ephemeral --unattended --disableupdate' to config.sh.
  GitHub auto-deregisters the runner after one job — the existing
  removeRunner() API call in src/gh.js becomes belt-and-braces rather
  than the primary deregister path. --disableupdate keeps the runner
  binary stable for the short-lived ephemeral session.

## New 'runner-version' input

Optional, defaults to '2.333.1' (the version this PR is tested
against). Consumers can override without waiting for a new action
release — useful when GitHub gates a JS action on a newer node
runtime and we need to move fast.

src/config.js reads it with a default fallback so old callers that
don't set it continue to work.

## CI adjustment

The existing verify-runner-url job greps the literal version string
out of the source to HEAD-check the release asset. With the version
now parameterized, the literal lives in action.yml's 'default:',
so the extractor is rewritten to read it from there.

## Tests

tests/config.test.js adds two cases:
- defaults to 2.333.1 when runner-version input is unset
- honors an explicit override

Full suite: 23 tests pass across utils + config.

## Consumer impact (terraform-provider-namecheap acctest)

- make testacc is 'go test' — no root required.
- All setup steps (curl Go / Terraform, extract tarballs, write
  go-env.sh) write to $GITHUB_WORKSPACE which is writable by any
  runner user, not just root.
- actions/checkout@v6 writes to the workspace, no root.
- The workspace directory structure is unchanged beyond its absolute
  path (/home/runner/actions-runner/_work/... instead of
  /actions-runner/_work/...). GITHUB_WORKSPACE, HOME, and relative
  paths all resolve the same way.

The dogfood SHA-pin rotation will be opened on the provider repo
after this merges, mirroring the pattern from machulav#158machulav#159.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
@kurok kurok merged commit 7b949a3 into feat/al2023-support Apr 21, 2026
4 checks passed
@kurok kurok deleted the feat/10-bootstrap-hardening branch April 21, 2026 08:26
kurok added a commit that referenced this pull request Apr 21, 2026
Phase 4 (#18) landed three independent hardenings in one PR:
- New configurable runner-version input (no runtime impact)
- Ephemeral + checksum + set -euo pipefail (additive safety)
- Root to non-root runner user via sudo-heredoc (behavioral change)

The dogfood rotation on terraform-provider-namecheap#182 failed —
'Start self-hosted EC2 runner' timed out at 6m15s waiting for runner
registration. EC2 instance booted fine, but whatever the user-data
did inside the instance, it didn't end at './run.sh' polling GitHub.

We can't post-mortem directly because the instance is ephemeral and
already terminated. Fix-forward strategy: revert ONLY the non-root
transition (highest-probability culprit among the three axes), keep
everything else from Phase 4.

If the Phase 4 dogfood rotation passes after this revert, the
root-to-runner sudo-heredoc is the breaker and can be investigated
as an isolated follow-up (likely candidates: sudoers config on the
hardened AMI, SELinux context, config.sh writing outside its own
directory, or my heredoc quoting). Landing the safer pieces now
unblocks Phases 5/6/7.

Kept:
- runner-version input (Phase 4's main feature)
- set -euo pipefail
- --ephemeral + --unattended + --disableupdate on config.sh
- SHA-256 verification of the runner tarball
- Clearer bash syntax ('case "$(uname -m)"', double-quoted vars)

Reverted:
- useradd + sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' heredoc
- RUNNER_ALLOW_RUNASROOT=1 restored (runner executes as root again)

The non-root goal isn't lost — a follow-up issue will propose it
again, this time with better instrumentation so we can see what
failed inside the instance.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
kurok added a commit that referenced this pull request Apr 21, 2026
* revert: full rollback of Phase 4 bootstrap changes

Phase 4 attempts #18 (with non-root) and #19 (without non-root but
keeping --ephemeral + checksum + set -euo pipefail + runner-version
input) BOTH failed the provider dogfood with the same 6m15s runner
registration timeout (terraform-provider-namecheap#182 and machulav#183).

The fix-forward in #19 narrowed the suspect set from 'all Phase 4
changes' to 'one of: set -euo pipefail, --ephemeral flag, --disableupdate
flag, checksum verify, parameterized bash vars'. Still not isolated.

Full rollback here restores the known-good Phase 1 bootstrap exactly.
Everything else from Phase 1 is preserved (aws-sdk v3, ncc 0.38,
jest tests, .gitattributes).

Phase 4 work is NOT abandoned — it moves to follow-up issues where
each change lands on its own with its own dogfood, so the next
failure isolates itself to a single axis instead of requiring
bisection across five simultaneous changes.

Files reverted to match a1bd2f9 (Phase 1 tip):
- action.yml (drops runner-version input)
- src/aws.js (original 12-line bash array, yum install libicu make,
  RUNNER_ALLOW_RUNASROOT=1, no --ephemeral, no checksum verify)
- src/config.js (drops runnerVersion field)
- tests/config.test.js (drops runner-version test block, 23 -> 21 tests)

Dist rebuilt against the reverted src (verify-dist will confirm).

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>

* ci: revert verify-runner-url extractor to grep src/aws.js

Paired with the full Phase 4 revert — now that action.yml no longer
has a runner-version default, the Phase 4 version of verify-runner-url
that reads action.yml can't find the version. Restore the original
extractor that greps the literal URL out of src/aws.js.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>

---------

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
kurok added a commit that referenced this pull request Apr 21, 2026
…cksum table (#26)

Closes #20. Supersedes the reverted #18 / #19 / #21.

Implements the full Phase 4 bootstrap hardening from issue #10, with
the root-cause fix from #20 baked in. Key differences from the
earlier failed attempts:

## The fix for the actual failure

Previous attempts died at:

    curl -fsSL <tarball>.sha256 | awk '{print }'

with a 404 (actions/runner doesn't publish per-tarball sidecar files,
empirically confirmed via aws ec2 get-console-output on a probe
instance — see #20).

This PR replaces that with a hardcoded table of expected hashes in
src/runner-checksums.js, keyed by 'arch-version'. Two x86_64 / arm64
entries for the currently-pinned v2.333.1, sourced from the release
body at github.com/actions/runner/releases/tag/v2.333.1. CI enforces
table-vs-upstream consistency on every PR (see pr.yml).

## Everything else from Phase 4

- Non-root 'runner' user (useradd -m, sudo -u runner -H bash heredoc).
  RUNNER_ALLOW_RUNASROOT=1 escape hatch removed.

- New 'runner-version' input in action.yml (default '2.333.1'). To
  override, add matching x64+arm64 SHAs to runner-checksums.js in
  the same PR — verify-runner-url CI will reject the change if
  the hashes don't match upstream.

- --ephemeral --unattended --disableupdate on config.sh. GitHub
  auto-deregisters the runner after its job; disableupdate keeps
  the binary stable during the short ephemeral session.

- set -euo pipefail on both the outer and inner (runner-user) shells.
  The earlier fatal failure under set -e was the .sha256 404, which
  no longer exists.

- Paramaterized RUNNER_VERSION / TARBALL / BASE bash vars.

## Tests

tests/runner-checksums.test.js — 6 new cases covering the table
shape, hex format, x64+arm64 parity per version, lookup returns for
known/unknown keys.

tests/config.test.js — 2 new cases for the runner-version input
(default fallback + override).

Total: 36 -> 44 tests.

## CI: verify-runner-url overhaul

The job now parses the runner-version from action.yml, then:
1. HEADs the Linux x64 release asset (unchanged).
2. Fetches the release body via 'gh api'.
3. Greps the BEGIN SHA linux-x64 / linux-arm64 HTML comments.
4. Cross-checks against the values lookup() returns from
   src/runner-checksums.js.

Drift between the hardcoded table and upstream fails CI at code-
review time, not at runtime.

## Dogfood plan (MUCH more careful this time)

Provider SHA-pin rotation after merge, same pattern as prior phases.
This time I have full EC2 console-output diagnostic capability via
the recipe saved in my notes — any new bootstrap failure should be
trivially diagnosable rather than opaque.

Closing #20 on merge.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant