Part of plan #15. Follow-up to Phase 4 (#10) — originally intended to be part of that PR but reverted after the dogfood failure.
Context
Phase 4 (#10 → PR #18) tried to drop the runner from root to a dedicated runner user via sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' wrapping the download + configure + run sequence. The terraform-provider-namecheap#182 dogfood failed at Start self-hosted EC2 runner with the 5-minute registration timeout — the EC2 instance came up but the runner never reached ./run.sh polling GitHub.
PR #19 (the fix-forward) reverted the non-root transition and kept everything else from Phase 4 (runner-version input, --ephemeral, --unattended, --disableupdate, SHA-256 checksum, set -euo pipefail). Dogfood with that pin now passes.
The missing piece
Instrumentation. We can't post-mortem the failed instance — it was ephemeral and got terminated by the stop-runner step before we could SSH in. Any second attempt at non-root needs to leave a breadcrumb trail we can pull up after the fact.
Suggested approach
1. Add an opt-in debug mode (via the Phase 7 debug input)
When debug: true:
set -x on the bootstrap so every command is echoed.
- Stream stdout+stderr to
/var/log/ec2-github-runner-bootstrap.log (world-readable).
- Upload the log to S3 (bucket passed via new
debug-log-bucket input) on exit, pass or fail. Requires an additional s3:PutObject permission on the runner's IAM instance profile.
- Or simpler: write the log to the instance metadata service as a
User-Data/Instance-Identity field — no extra IAM, but size-limited.
2. Reproduce outside the ephemeral flow
Launch one test instance with the same AMI + the failing user-data, and watch the EC2 console output via aws ec2 get-console-output. No runner registration needed — just look at where cloud-init actually stopped.
3. Binary-search the bootstrap
Rather than adding back all the non-root machinery at once, add it in the smallest possible steps:
- Step a:
useradd runner only. Keep root execution. Verify dogfood passes.
- Step b:
sudo -u runner true before anything else. Verify dogfood passes.
- Step c: Move only the tarball download + extraction to the runner user. Keep root for
config.sh + run.sh.
- Step d: Move
config.sh as well.
- Step e: Move
run.sh.
Any step that fails dogfood is the breaker and can be investigated in isolation.
Likely suspects
requiretty or Defaults env_reset in sudoers on the DEVOPS/hardened-amazon-linux2023 AMI. Cloud-init runs user-data in a non-interactive, tty-less context; requiretty would kill sudo -u runner.
- SELinux enforcing a context on
/home/runner that prevents config.sh from writing its .runner and .credentials files. Visible in audit.log / journalctl.
- Heredoc quoting in my JS — unlikely, but the quoted delimiter
<<'RUNNER_BOOTSTRAP' vs unquoted might have interacted weirdly with the token containing special characters.
- Runner user's PATH —
config.sh and run.sh inside /home/runner/actions-runner/ execute via ./ which doesn't need . in PATH, but some internal runner code might.
Acceptance criteria
Severity
Medium — security hardening, not a blocker. Phase 4 shipped the safer pieces; non-root is the outstanding goal.
Part of plan #15. Follow-up to Phase 4 (#10) — originally intended to be part of that PR but reverted after the dogfood failure.
Context
Phase 4 (#10 → PR #18) tried to drop the runner from root to a dedicated
runneruser viasudo -u runner -H bash <<'RUNNER_BOOTSTRAP'wrapping the download + configure + run sequence. The terraform-provider-namecheap#182 dogfood failed atStart self-hosted EC2 runnerwith the 5-minute registration timeout — the EC2 instance came up but the runner never reached./run.shpolling GitHub.PR #19 (the fix-forward) reverted the non-root transition and kept everything else from Phase 4 (runner-version input,
--ephemeral,--unattended,--disableupdate, SHA-256 checksum,set -euo pipefail). Dogfood with that pin now passes.The missing piece
Instrumentation. We can't post-mortem the failed instance — it was ephemeral and got terminated by the stop-runner step before we could SSH in. Any second attempt at non-root needs to leave a breadcrumb trail we can pull up after the fact.
Suggested approach
1. Add an opt-in debug mode (via the Phase 7
debuginput)When
debug: true:set -xon the bootstrap so every command is echoed./var/log/ec2-github-runner-bootstrap.log(world-readable).debug-log-bucketinput) on exit, pass or fail. Requires an additionals3:PutObjectpermission on the runner's IAM instance profile.User-Data/Instance-Identityfield — no extra IAM, but size-limited.2. Reproduce outside the ephemeral flow
Launch one test instance with the same AMI + the failing user-data, and watch the EC2 console output via
aws ec2 get-console-output. No runner registration needed — just look at where cloud-init actually stopped.3. Binary-search the bootstrap
Rather than adding back all the non-root machinery at once, add it in the smallest possible steps:
useradd runneronly. Keep root execution. Verify dogfood passes.sudo -u runner truebefore anything else. Verify dogfood passes.config.sh+run.sh.config.shas well.run.sh.Any step that fails dogfood is the breaker and can be investigated in isolation.
Likely suspects
requirettyorDefaults env_resetin sudoers on theDEVOPS/hardened-amazon-linux2023AMI. Cloud-init runs user-data in a non-interactive, tty-less context;requirettywould killsudo -u runner./home/runnerthat preventsconfig.shfrom writing its.runnerand.credentialsfiles. Visible inaudit.log/journalctl.<<'RUNNER_BOOTSTRAP'vs unquoted might have interacted weirdly with the token containing special characters.config.shandrun.shinside/home/runner/actions-runner/execute via./which doesn't need.in PATH, but some internal runner code might.Acceptance criteria
RUNNER_ALLOW_RUNASROOT=1escape hatch is gone at the end of this work.Severity
Medium — security hardening, not a blocker. Phase 4 shipped the safer pieces; non-root is the outstanding goal.