Skip to content

revert: full rollback of Phase 4 bootstrap to Phase 1 known-good#21

Merged
kurok merged 2 commits intofeat/al2023-supportfrom
fix/4-full-revert
Apr 21, 2026
Merged

revert: full rollback of Phase 4 bootstrap to Phase 1 known-good#21
kurok merged 2 commits intofeat/al2023-supportfrom
fix/4-full-revert

Conversation

@kurok
Copy link
Copy Markdown

@kurok kurok commented Apr 21, 2026

Closes follow-ups: Phase 4's bootstrap changes need a redo. Full rollback here so the provider's acceptance tests work again.

Why full revert

Two dogfood attempts failed with the identical 6m15s registration timeout:

Attempt Changes on top of Phase 1 Dogfood
#18 (Phase 4 original) non-root + --ephemeral + --disableupdate + checksum + set -euo pipefail + runner-version input namecheap/terraform-provider-namecheap#182failed at Start self-hosted EC2 runner
#19 (fix-forward, reverted only non-root) same minus non-root namecheap/terraform-provider-namecheap#183also failed identically

Both failures' Start self-hosted EC2 runner step times out after 5 min of polling for a registered runner, meaning the user-data bootstrap never reached ./run.sh. Since #19 removed the non-root transition and still fails, the breaker is NOT the non-root move — it's somewhere in the rest of Phase 4's changes (--ephemeral, --disableupdate, checksum verify, set -euo pipefail, or parameterized bash vars).

Rather than another narrow-search iteration, revert the whole block to Phase 1 bytes and restart Phase 4 with one change per PR + one dogfood per PR.

What this PR does

Checkout a1bd2f9's (Phase 1 tip) src/aws.js / src/config.js / action.yml / tests/ into the current tree. Rebuild dist/. Everything else from Phase 1 stays intact:

  • aws-sdk v3 migration
  • ncc 0.38 build
  • jest test harness (21 tests, down from 23 — dropped the two runner-version tests)
  • .gitattributes line-ending normalization for dist/

What's lost

All five Phase 4 changes need to re-land via new issues / PRs:

  • runner-version input (additive, zero runtime risk — easy relaunch)
  • set -euo pipefail in user-data (the most likely actual breaker IMO — could be turning a previously-silent non-fatal step into a hard stop)
  • --ephemeral + --unattended + --disableupdate on config.sh (could be flag-combination incompatibility with the installed runner version)
  • SHA-256 verification of the runner tarball (could fail if .sha256 URL returns an unexpected body)
  • Parameterized RUNNER_VERSION / TARBALL / BASE bash vars (lowest risk)

Follow-up issue #20 (non-root move) is unchanged — still waiting on debug instrumentation.

Next steps post-merge

  1. Dogfood this tip on the provider. Expected: pass (it's Phase 1 behavior exactly).
  2. Open five (or fewer if some combine naturally) small PRs, one per reintroduction of a Phase 4 change, each with its own dogfood.
  3. First failure isolates the breaker and points the investigation.

Phase 4 attempts #18 (with non-root) and #19 (without non-root but
keeping --ephemeral + checksum + set -euo pipefail + runner-version
input) BOTH failed the provider dogfood with the same 6m15s runner
registration timeout (terraform-provider-namecheap#182 and machulav#183).

The fix-forward in #19 narrowed the suspect set from 'all Phase 4
changes' to 'one of: set -euo pipefail, --ephemeral flag, --disableupdate
flag, checksum verify, parameterized bash vars'. Still not isolated.

Full rollback here restores the known-good Phase 1 bootstrap exactly.
Everything else from Phase 1 is preserved (aws-sdk v3, ncc 0.38,
jest tests, .gitattributes).

Phase 4 work is NOT abandoned — it moves to follow-up issues where
each change lands on its own with its own dogfood, so the next
failure isolates itself to a single axis instead of requiring
bisection across five simultaneous changes.

Files reverted to match a1bd2f9 (Phase 1 tip):
- action.yml (drops runner-version input)
- src/aws.js (original 12-line bash array, yum install libicu make,
  RUNNER_ALLOW_RUNASROOT=1, no --ephemeral, no checksum verify)
- src/config.js (drops runnerVersion field)
- tests/config.test.js (drops runner-version test block, 23 -> 21 tests)

Dist rebuilt against the reverted src (verify-dist will confirm).

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Paired with the full Phase 4 revert — now that action.yml no longer
has a runner-version default, the Phase 4 version of verify-runner-url
that reads action.yml can't find the version. Restore the original
extractor that greps the literal URL out of src/aws.js.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
@kurok kurok merged commit 249efbd into feat/al2023-support Apr 21, 2026
4 checks passed
@kurok kurok deleted the fix/4-full-revert branch April 21, 2026 08:52
kurok added a commit to namecheap/terraform-provider-namecheap that referenced this pull request Apr 21, 2026
…trap restored) (#184)

namecheap/ec2-github-runner#21 merged. Phase 4 bootstrap changes are
fully reverted to the Phase 1 known-good state. Everything else from
Phase 1 is preserved (aws-sdk v3, ncc 0.38, jest, gitattributes).

Expected: this dogfood passes. If it does, the previous two failures
are confirmed to be caused by something in Phase 4's bootstrap deltas
— most likely 'set -euo pipefail' interacting with an in-flight
failure (`mount -o remount,size=1G /tmp` is the prime suspect per
the hypothesis in ec2-github-runner#20).

Rotation chain:
  54459d6 (DEP0169 filter)
  a1bd2f9 (Phase 1, aws-sdk v3) -- dogfood green
  7b949a3 (Phase 4 original, non-root) -- dogfood failed
  78f98d1 (Phase 4 safe, no non-root) -- dogfood failed
  249efbd (Phase 4 full revert to Phase 1 bootstrap) -- this PR

Phase 4 work pauses here until #20 gets debug instrumentation so
individual change re-introductions can be bisected.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
kurok added a commit that referenced this pull request Apr 21, 2026
…cksum table (#26)

Closes #20. Supersedes the reverted #18 / #19 / #21.

Implements the full Phase 4 bootstrap hardening from issue #10, with
the root-cause fix from #20 baked in. Key differences from the
earlier failed attempts:

## The fix for the actual failure

Previous attempts died at:

    curl -fsSL <tarball>.sha256 | awk '{print }'

with a 404 (actions/runner doesn't publish per-tarball sidecar files,
empirically confirmed via aws ec2 get-console-output on a probe
instance — see #20).

This PR replaces that with a hardcoded table of expected hashes in
src/runner-checksums.js, keyed by 'arch-version'. Two x86_64 / arm64
entries for the currently-pinned v2.333.1, sourced from the release
body at github.com/actions/runner/releases/tag/v2.333.1. CI enforces
table-vs-upstream consistency on every PR (see pr.yml).

## Everything else from Phase 4

- Non-root 'runner' user (useradd -m, sudo -u runner -H bash heredoc).
  RUNNER_ALLOW_RUNASROOT=1 escape hatch removed.

- New 'runner-version' input in action.yml (default '2.333.1'). To
  override, add matching x64+arm64 SHAs to runner-checksums.js in
  the same PR — verify-runner-url CI will reject the change if
  the hashes don't match upstream.

- --ephemeral --unattended --disableupdate on config.sh. GitHub
  auto-deregisters the runner after its job; disableupdate keeps
  the binary stable during the short ephemeral session.

- set -euo pipefail on both the outer and inner (runner-user) shells.
  The earlier fatal failure under set -e was the .sha256 404, which
  no longer exists.

- Paramaterized RUNNER_VERSION / TARBALL / BASE bash vars.

## Tests

tests/runner-checksums.test.js — 6 new cases covering the table
shape, hex format, x64+arm64 parity per version, lookup returns for
known/unknown keys.

tests/config.test.js — 2 new cases for the runner-version input
(default fallback + override).

Total: 36 -> 44 tests.

## CI: verify-runner-url overhaul

The job now parses the runner-version from action.yml, then:
1. HEADs the Linux x64 release asset (unchanged).
2. Fetches the release body via 'gh api'.
3. Greps the BEGIN SHA linux-x64 / linux-arm64 HTML comments.
4. Cross-checks against the values lookup() returns from
   src/runner-checksums.js.

Drift between the hardcoded table and upstream fails CI at code-
review time, not at runtime.

## Dogfood plan (MUCH more careful this time)

Provider SHA-pin rotation after merge, same pattern as prior phases.
This time I have full EC2 console-output diagnostic capability via
the recipe saved in my notes — any new bootstrap failure should be
trivially diagnosable rather than opaque.

Closing #20 on merge.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant