Skip to content

Phase 4.b: drop runner to non-root user — second attempt with bootstrap debuggability #20

@kurok

Description

@kurok

Part of plan #15. Follow-up to Phase 4 (#10) — originally intended to be part of that PR but reverted after the dogfood failure.

Context

Phase 4 (#10 → PR #18) tried to drop the runner from root to a dedicated runner user via sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' wrapping the download + configure + run sequence. The terraform-provider-namecheap#182 dogfood failed at Start self-hosted EC2 runner with the 5-minute registration timeout — the EC2 instance came up but the runner never reached ./run.sh polling GitHub.

PR #19 (the fix-forward) reverted the non-root transition and kept everything else from Phase 4 (runner-version input, --ephemeral, --unattended, --disableupdate, SHA-256 checksum, set -euo pipefail). Dogfood with that pin now passes.

The missing piece

Instrumentation. We can't post-mortem the failed instance — it was ephemeral and got terminated by the stop-runner step before we could SSH in. Any second attempt at non-root needs to leave a breadcrumb trail we can pull up after the fact.

Suggested approach

1. Add an opt-in debug mode (via the Phase 7 debug input)

When debug: true:

  • set -x on the bootstrap so every command is echoed.
  • Stream stdout+stderr to /var/log/ec2-github-runner-bootstrap.log (world-readable).
  • Upload the log to S3 (bucket passed via new debug-log-bucket input) on exit, pass or fail. Requires an additional s3:PutObject permission on the runner's IAM instance profile.
  • Or simpler: write the log to the instance metadata service as a User-Data/Instance-Identity field — no extra IAM, but size-limited.

2. Reproduce outside the ephemeral flow

Launch one test instance with the same AMI + the failing user-data, and watch the EC2 console output via aws ec2 get-console-output. No runner registration needed — just look at where cloud-init actually stopped.

3. Binary-search the bootstrap

Rather than adding back all the non-root machinery at once, add it in the smallest possible steps:

  • Step a: useradd runner only. Keep root execution. Verify dogfood passes.
  • Step b: sudo -u runner true before anything else. Verify dogfood passes.
  • Step c: Move only the tarball download + extraction to the runner user. Keep root for config.sh + run.sh.
  • Step d: Move config.sh as well.
  • Step e: Move run.sh.

Any step that fails dogfood is the breaker and can be investigated in isolation.

Likely suspects

  • requiretty or Defaults env_reset in sudoers on the DEVOPS/hardened-amazon-linux2023 AMI. Cloud-init runs user-data in a non-interactive, tty-less context; requiretty would kill sudo -u runner.
  • SELinux enforcing a context on /home/runner that prevents config.sh from writing its .runner and .credentials files. Visible in audit.log / journalctl.
  • Heredoc quoting in my JS — unlikely, but the quoted delimiter <<'RUNNER_BOOTSTRAP' vs unquoted might have interacted weirdly with the token containing special characters.
  • Runner user's PATHconfig.sh and run.sh inside /home/runner/actions-runner/ execute via ./ which doesn't need . in PATH, but some internal runner code might.

Acceptance criteria

  • Debug mode exists that makes bootstrap failures diagnosable from outside the instance.
  • Non-root runner user applied in the smallest coherent chunk that passes dogfood.
  • A decision recorded about whether the hardened-AL2023 AMI's sudoers / SELinux config needs changing, or whether the action works around them.
  • The RUNNER_ALLOW_RUNASROOT=1 escape hatch is gone at the end of this work.

Severity

Medium — security hardening, not a blocker. Phase 4 shipped the safer pieces; non-root is the outstanding goal.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions