Feat/al2023 support#28
Conversation
…or AL2023 Signed-off-by: Vitaliy Sidorenko <vitaliy.sidorenko@namecheap.com>
Signed-off-by: Vitaliy Sidorenko <vitaliy.sidorenko@namecheap.com>
Signed-off-by: Vitaliy Sidorenko <vitaliy.sidorenko@namecheap.com>
Signed-off-by: Vitaliy Sidorenko <vitaliy.sidorenko@namecheap.com>
Signed-off-by: Vitaliy Sidorenko <vitaliy.sidorenko@namecheap.com>
The user-data bootstrap hardcoded v2.321.0, whose externals/ directory ships only node16 and node20. Any consumer trying to use a JS action that declares 'runs: using: node24' (actions/checkout@v5+, many other updated actions) fails on runners started via this action. v2.333.1 is the current stable GA runner release (2026-03-27) and bundles externals/node20 + externals/node24. No AMI change is needed because the runner brings its own Node binaries. Rebuilt dist/index.js with NODE_OPTIONS=--openssl-legacy-provider npm run package (the pinned @vercel/ncc 0.25.1 doesn't work with modern OpenSSL without that flag). The only diff inside dist/ is the two URL/filename lines that mirror the src change. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
The existing pr.yml only ran 'npm run lint' and only on PRs targeting main, so PRs against feature branches had zero CI coverage. Two new jobs: - verify-dist: runs 'npm run package' and fails if the committed dist/index.js disagrees with a fresh rebuild from src/. Catches the 'forgot to rebuild dist' class of bug permanently so future runner-version bumps can't ship with a stale bundle. - verify-runner-url: extracts the pinned actions/runner version from src/aws.js and HEADs the corresponding release asset. Catches typos in the version string and release-asset naming drift before a bad AMI start wastes an EC2 lifecycle. Also drops the 'branches: [main]' filter so PRs against any branch (feat/al2023-support in particular) get linted and verified. The existing lint-code job is left untouched to keep the diff minimal. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
…code GitHub started auto-failing workflows using actions/cache@v2 in December 2025 (deprecation notice). actions/checkout@v2 is in the same bucket. With this PR removing the 'branches: [main]' filter on pr.yml, the lint-code job now runs on every PR — so the red-flag deprecation caught by the runner causes every new PR to fail until these two pins are rotated forward. Minimal change: v2 → v4 on both uses: lines; everything else in the job stays the same. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
feat: bump actions/runner to v2.333.1 for node24 support
The action.yml was still declaring 'using: node12', the oldest Node runtime GitHub ever supported. GitHub has been auto-shimming forward to node16 / node20 via ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION for years; the current runner (v2.333.1, per #3) surfaces the warning: Node.js 20 actions are deprecated. Actions will be forced to run with Node.js 24 by default starting June 2nd, 2026. Declaring node24 explicitly makes the runtime match what the runner already ships (externals/node24) and removes the deprecation warning from every consumer's CI page. Compatibility: - dist/index.js reviewed for Node-22+ removals (util.isArray, createCipher, etc.) — none present. - Two 'new Buffer(...)' occurrences: one inside a lodash docstring comment (not executed), one in tunnel-agent's proxy-auth path (we never pass proxyAuth). 'new Buffer' emits DEP0005 on node24 but still works; no runtime break. - Verified dist is reproducible: NODE_OPTIONS=--openssl-legacy-provider npm run package produces byte-identical dist/index.js. Consumers pinning this action by SHA (e.g. namecheap/terraform- provider-namecheap) will rotate to the new commit in a follow-up. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
feat: declare action runtime as node24
The bundled @actions/core v1.2.6 still emits the deprecated '::set-output name=X::Y' workflow command, which GitHub has warned about since 2022-10 and has announced will be disabled outright: The `set-output` command is deprecated and will be disabled soon. Please upgrade to using Environment Files. Two viable fixes: - Bump @actions/core to >= 1.10.0 (the version that switched to GITHUB_OUTPUT internally). Tried; transitive deps pull modern JS (private class fields) that @vercel/ncc 0.25.1 can't parse, which would force a separate ncc major bump with much larger dist churn. - Bypass core.setOutput entirely and write directly to the GITHUB_OUTPUT file. Modern runners always set that env var. Going with the second option. 11-line change in src/index.js with an equivalent 11-line diff in the rebuilt dist/index.js. No dependency updates; ncc rebuild is reproducible. Fallback path through core.setOutput() retained for the unlikely case that GITHUB_OUTPUT isn't set (pre-2022 self-hosted runner or local dry-run). Output values (label, ec2InstanceId) never contain newlines, so the plain 'key=value\n' format is safe. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
fix: write outputs to GITHUB_OUTPUT file instead of ::set-output
The bundled aws-sdk v2 (^2.809.0 resolved to 2.814.0) uses url.parse()
internally for endpoint URL parsing. On node24 that triggers DEP0169
on every action invocation, surfacing as a warning on every consumer's
CI page:
(node:PID) [DEP0169] DeprecationWarning: `url.parse()` behavior
is not standardized and prone to errors that have security
implications. Use the WHATWG URL API instead.
Three call sites verified in the bundled dist/: AWS.CloudFront's
getRtmpUrl, AWS.CloudFront.Signer.getSignedUrl, and AWS.util.urlParse.
Only the last runs at runtime (every EC2 API call parses its
endpoint). Upgrading aws-sdk to latest v2.1693 does not fix this —
the v2 SDK is in maintenance mode.
Migrating to aws-sdk v3 would rewrite src/aws.js entirely, far
beyond the scope of silencing a benign deprecation for trusted
EC2 endpoint URLs we control.
Instead, wrap process.emitWarning once at the top of src/index.js
and drop only the DEP0169 code. Every other warning — including
future new deprecations, user-installed 'warning' listeners, and
Node's default stderr formatter — is preserved by delegating to
the original emitWarning. Handles both call shapes:
process.emitWarning(msg, type, code, ctor) // positional
process.emitWarning(msg, { code, type }) // options object
Smoke-tested locally with three call patterns:
- DEP0169 (filtered, silent)
- DEP0123 options form (passed through, Node's default format)
- Plain 2-arg Warning (passed through)
Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
fix: silence DEP0169 url.parse deprecation from bundled aws-sdk v2
Scope note: this PR covers Phase 8 (issue #14) only for src/utils.js and src/config.js. Tests for src/aws.js, src/gh.js, and src/index.js are deferred to Phase 8.b, to be filed as a follow-up issue after Phases 1, 4, and 5 rewrite those files — testing them now against the aws-sdk v2 shape would be thrown away mid-rewrite. What landed: - tests/utils.test.js — sortByCreationDate covering sort order, in-place mutation, single-image, empty-array, and equal-date cases (5 tests). - tests/config.test.js — Config constructor validation and outputs. The module-scope singleton pattern means every test case calls loadConfig(inputs) which jest.resetModules() + mocks @actions/core / @actions/github + re-requires. Validation failures don't throw — the try/catch around 'new Config()' routes them to core.setFailed(), so tests assert against that mock's .calls array via expectValidationFailure() helper. Covers start-mode happy path, tag specifications logic, stop-mode happy path, six validation-failure modes, and generateUniqueLabel output shape (16 tests). Tooling: - jest 29.7.0 added as devDependency (pinned exact). - 'npm test' script wired in package.json. - 'unit-tests' job added to .github/workflows/pr.yml alongside the existing lint-code / verify-dist / verify-runner-url jobs. Verified locally: 'npm test' reports 'Test Suites: 2 passed, Tests: 21 passed' in ~0.3 s. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
* feat: migrate aws-sdk v2 to @aws-sdk/client-ec2 v3 (Phase 1) Completes Phase 1 (issue #7) of the modernization plan. ## Dependency changes - Remove 'aws-sdk' ^2.809.0 (in maintenance mode since late 2024; end-of-support announced for Nov 2025; emits DEP0169 on url.parse() in modern Node). - Add '@aws-sdk/client-ec2' 3.1033.0 pinned exact. Per-service package is dramatically smaller than v2's monolithic bundle. - Bump '@vercel/ncc' 0.25.1 -> 0.38.4. The old ncc couldn't parse modern JS (private class fields in v3's transitive deps); 0.38 is webpack-5-based and handles current syntax. ## Code changes (src/aws.js) Rewrite on the EC2Client + Command pattern: - EC2Client({}) replaces 'new AWS.EC2()'. Reads region + creds from env (same behavior, same source — configure-aws-credentials or instance profile). - client.send(new DescribeImagesCommand(params)) replaces ec2.describeImages(params).promise(). - client.send(new RunInstancesCommand(params)) replaces ec2.runInstances(params).promise(). - client.send(new TerminateInstancesCommand(params)) replaces ec2.terminateInstances(params).promise(). - client.send(new AssociateAddressCommand(params)) replaces ec2.associateAddress(params).promise(). - waitUntilInstanceRunning({client, maxWaitTime: 300}, {InstanceIds}) replaces ec2.waitFor('instanceRunning', ...).promise(). External action contract (inputs + outputs) is unchanged. Consumer workflows (notably terraform-provider-namecheap) do not need any change beyond rotating their SHA pin. ## Bundle + CI - 'npm run package' no longer needs NODE_OPTIONS=--openssl-legacy-provider (ncc 0.38 + webpack 5 don't use the legacy module-hash MD4 path). - dist/ now contains code-split chunk files (136.index.js, 360.index.js, etc.) alongside dist/index.js. All must be committed; the verify-dist CI check in pr.yml is broadened to diff the whole dist/ tree. - Bundle size: 7.9 MB -> 3.4 MB main (+ ~3.3 MB in chunks). Net smaller than v2. ## Backward compatibility The DEP0169 process.emitWarning filter added in #6 stays in place. The v3 bundle doesn't emit DEP0169, so the filter is effectively inert on the current tip — cleanup is a follow-up, not part of this PR's scope. ## Verification - npm test: 21 tests pass across tests/utils.test.js + tests/config.test.js (no aws.js tests yet; those land as Phase 8.b after Phase 4/5 stop rewriting aws.js). - npm run lint: clean. - Bundle builds cleanly on Node 20 without OpenSSL legacy provider. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com> * chore: add .gitattributes to normalize line endings in dist/ ncc 0.38's output contains 451 source-embedded CR bytes inside string literals (aws-sdk transitive deps). When dist/ is committed through git's default line-ending normalization, those CRs are stripped into the blob, but every subsequent 'npm run package' reproduces them — creating a permanent, symmetric 451/451 diff that the verify-dist CI gate correctly flagged as drift. Mark the whole dist/ tree as binary via .gitattributes so git never converts line endings in that path. What ncc writes is what git stores; CI's rebuild produces byte-identical output. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com> --------- Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
…sion (Phase 4) (#18) Closes #10. Biggest compatibility risk in the modernization plan, called out in the #15 tracker as needing a provider-repo dogfood before landing. ## Bootstrap rewrite The EC2 user-data now: - set -euo pipefail throughout — a silent useradd / tar / sha256sum failure kills the bootstrap instead of proceeding to a broken ./run.sh. - Creates a dedicated 'runner' user (idempotent — skipped if it already exists, so re-runs from a crash-loop don't explode). - Drops to that user via 'sudo -u runner -H bash <<RUNNER_BOOTSTRAP' for every subsequent step. The old 'export RUNNER_ALLOW_RUNASROOT=1' escape hatch is gone. - Fetches the runner tarball and SHA-256-verifies it against actions/runner's published '.sha256' sidecar before extraction. Same defense-in-depth pattern the provider repo uses for Go and Terraform downloads (namecheap/terraform-provider-namecheap#160). - Passes '--ephemeral --unattended --disableupdate' to config.sh. GitHub auto-deregisters the runner after one job — the existing removeRunner() API call in src/gh.js becomes belt-and-braces rather than the primary deregister path. --disableupdate keeps the runner binary stable for the short-lived ephemeral session. ## New 'runner-version' input Optional, defaults to '2.333.1' (the version this PR is tested against). Consumers can override without waiting for a new action release — useful when GitHub gates a JS action on a newer node runtime and we need to move fast. src/config.js reads it with a default fallback so old callers that don't set it continue to work. ## CI adjustment The existing verify-runner-url job greps the literal version string out of the source to HEAD-check the release asset. With the version now parameterized, the literal lives in action.yml's 'default:', so the extractor is rewritten to read it from there. ## Tests tests/config.test.js adds two cases: - defaults to 2.333.1 when runner-version input is unset - honors an explicit override Full suite: 23 tests pass across utils + config. ## Consumer impact (terraform-provider-namecheap acctest) - make testacc is 'go test' — no root required. - All setup steps (curl Go / Terraform, extract tarballs, write go-env.sh) write to $GITHUB_WORKSPACE which is writable by any runner user, not just root. - actions/checkout@v6 writes to the workspace, no root. - The workspace directory structure is unchanged beyond its absolute path (/home/runner/actions-runner/_work/... instead of /actions-runner/_work/...). GITHUB_WORKSPACE, HOME, and relative paths all resolve the same way. The dogfood SHA-pin rotation will be opened on the provider repo after this merges, mirroring the pattern from machulav#158 → machulav#159. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Phase 4 (#18) landed three independent hardenings in one PR: - New configurable runner-version input (no runtime impact) - Ephemeral + checksum + set -euo pipefail (additive safety) - Root to non-root runner user via sudo-heredoc (behavioral change) The dogfood rotation on terraform-provider-namecheap#182 failed — 'Start self-hosted EC2 runner' timed out at 6m15s waiting for runner registration. EC2 instance booted fine, but whatever the user-data did inside the instance, it didn't end at './run.sh' polling GitHub. We can't post-mortem directly because the instance is ephemeral and already terminated. Fix-forward strategy: revert ONLY the non-root transition (highest-probability culprit among the three axes), keep everything else from Phase 4. If the Phase 4 dogfood rotation passes after this revert, the root-to-runner sudo-heredoc is the breaker and can be investigated as an isolated follow-up (likely candidates: sudoers config on the hardened AMI, SELinux context, config.sh writing outside its own directory, or my heredoc quoting). Landing the safer pieces now unblocks Phases 5/6/7. Kept: - runner-version input (Phase 4's main feature) - set -euo pipefail - --ephemeral + --unattended + --disableupdate on config.sh - SHA-256 verification of the runner tarball - Clearer bash syntax ('case "$(uname -m)"', double-quoted vars) Reverted: - useradd + sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' heredoc - RUNNER_ALLOW_RUNASROOT=1 restored (runner executes as root again) The non-root goal isn't lost — a follow-up issue will propose it again, this time with better instrumentation so we can see what failed inside the instance. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
* revert: full rollback of Phase 4 bootstrap changes Phase 4 attempts #18 (with non-root) and #19 (without non-root but keeping --ephemeral + checksum + set -euo pipefail + runner-version input) BOTH failed the provider dogfood with the same 6m15s runner registration timeout (terraform-provider-namecheap#182 and machulav#183). The fix-forward in #19 narrowed the suspect set from 'all Phase 4 changes' to 'one of: set -euo pipefail, --ephemeral flag, --disableupdate flag, checksum verify, parameterized bash vars'. Still not isolated. Full rollback here restores the known-good Phase 1 bootstrap exactly. Everything else from Phase 1 is preserved (aws-sdk v3, ncc 0.38, jest tests, .gitattributes). Phase 4 work is NOT abandoned — it moves to follow-up issues where each change lands on its own with its own dogfood, so the next failure isolates itself to a single axis instead of requiring bisection across five simultaneous changes. Files reverted to match a1bd2f9 (Phase 1 tip): - action.yml (drops runner-version input) - src/aws.js (original 12-line bash array, yum install libicu make, RUNNER_ALLOW_RUNASROOT=1, no --ephemeral, no checksum verify) - src/config.js (drops runnerVersion field) - tests/config.test.js (drops runner-version test block, 23 -> 21 tests) Dist rebuilt against the reverted src (verify-dist will confirm). Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com> * ci: revert verify-runner-url extractor to grep src/aws.js Paired with the full Phase 4 revert — now that action.yml no longer has a runner-version default, the Phase 4 version of verify-runner-url that reads action.yml can't find the version. Restore the original extractor that greps the literal URL out of src/aws.js. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com> --------- Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Closes #13. Part of plan #15. ## Motivation Phase 4's dogfood failures took three iterations and a full revert to diagnose because the ephemeral EC2 instance terminates before anyone can look at cloud-init logs. Structured action-side logs are the cheap half of the diagnostic story (see #20 for the AWS-side upload- to-S3 trap that completes it). ## New module: src/log.js JSON-shaped events via core.info / core.warning / core.error: log.info('run_instance', { ami_id: 'ami-0x', instance_type: 't3.medium', subnet_id: 'subnet-aaa', sg_id: 'sg-bbb', label: 'abcde', }); // -> ::notice ::{"step":"run_instance","mode":"start",...} Features: - Lazy require of ./config so validation-phase log calls don't crash when config's try/catch hasn't finished running new Config(). - sanitize() redacts values under a small secret-key allowlist before emitting (githubToken, github-token, token, aws-access-key-id, aws-secret-access-key, GPG_PRIVATE_KEY, password). - log.debug(...) is a no-op unless config.input.debug === 'true'. ## New input: debug action.yml declares 'debug: false' as default. Consumers opt in by setting debug: 'true' in their workflow; the action then emits the extra diagnostic traffic (full DescribeImages response, wait_for_runner poll details, input echo) without changing default output. src/config.js reads the input with a 'false' fallback so old callers that don't set it continue to work. ## Call sites instrumented src/aws.js: - wait_for_instance (start + end + error) - describe_images (request + selection + full-list under debug) - run_instance (request + result + error) - associate_address (attempt + warn on failure) - terminate_instance (start + end + error) src/gh.js: - gh_registration_token (attempt + success + error) - remove_runner (attempt + success + skip + error) - wait_for_runner (final outcome + per-poll details under debug) src/index.js: - start/stop bracketed by core.startGroup/endGroup so the Actions log collapses cleanly. - Top-level 'fatal' log on uncaught failure. - start_inputs / stop_inputs logged under debug (sanitized). ## Tests tests/log.test.js — 7 new cases: - info / warn / error emit through the right core.* channels with step + mode + fields baked into the JSON payload. - debug emits nothing when config.input.debug !== 'true'. - debug emits when config.input.debug === 'true'. - sanitize redacts recursively. - Payload-level redaction works (info with a {githubToken: …} field). tests/config.test.js — 2 new cases for the debug input (default + explicit override). Total: 23 -> 30 tests. ## Unblocks Phase 4.b (non-root retry, #20) now has visibility into what the bootstrap is doing. Combined with the S3-upload trap on the AWS side (also scoped in #20), any future bootstrap regression is diagnosable from outside the instance. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Closes #11. Part of plan #15. ## Scope Add a small retry helper and wrap the two cleanup paths in stop() with it. Make stop() attempt both cleanups independently so a GitHub-side failure doesn't prevent EC2 termination and vice versa. ## New module: src/retry.js withRetry(step, fn, { attempts, baseMs, maxMs }) - Default: 3 attempts, 2s base, doubled per retry, capped at 10s. Worst-case total wait: 2s + 4s + 8s = 14s before giving up. - Logs each retry via log.warn() with the step name suffixed '_retry' so the Actions run log shows attempt sequence. - On final failure, emits log.error with exhausted: true and re-throws the last error. ## Call sites - gh.removeRunner() — the DELETE /runners/{id} request wrapped in withRetry('remove_runner', ...). Idempotent: deleting a runner that's already gone returns 404 which is a permanent failure (not transient) but the existing get-runner-first path filters that out before the DELETE. - aws.terminateEc2Instance() — the TerminateInstances command wrapped in withRetry('terminate_instance', ...). Idempotent: terminating an already-terminated instance is a no-op per EC2. - start() path deliberately not wrapped. RunInstances is not idempotent without a ClientToken (billing implication) and the runner-version-specific AMI lookup retrying makes no sense. If start fails, the whole action fails; the consumer's workflow fallback handles retry at the workflow level. ## stop() independence Before: 'await A; await B'. If A throws, B never runs. After: both wrapped in try/catch, failures accumulated into a list, and at the end — if any failed — one Error is thrown summarizing both. So the EC2 instance gets terminated even when GitHub's API is temporarily flaky, and vice versa. ## Tests tests/retry.test.js — 4 cases: - Resolves on first-try success. - Retries on rejection and returns the eventual success value. - Exhausts attempts and re-throws after emitting warn + error logs. - Backoff caps at maxMs (verified by inspecting the delay values emitted in warn logs). Total: 30 -> 34 tests. ## Consumer impact Transparent. Default behavior matches what the old code did for the happy path. The retry / independent-cleanup changes only affect transient-failure flows that previously surfaced as 'stop-runner failed' jobs. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Part of #12. Phase 6 split: this PR ships IMDSv2 enforcement only. EBS encryption (also scoped in #12) becomes a Phase 6.b follow-up that needs per-AMI root-device lookup to set BlockDeviceMappings safely (wrong DeviceName creates an unused extra volume or clobbers the root). ## Change RunInstances params now include: MetadataOptions: { HttpTokens: config.input.httpTokens, HttpPutResponseHopLimit: 1, HttpEndpoint: 'enabled', } - HttpTokens default 'required' — IMDSv2 session token mandatory on every metadata request. Mitigates SSRF to IAM-credential theft. - HttpPutResponseHopLimit: 1 — token can't cross into a nested container, cap the blast radius of a containerized workload that tries to walk metadata. - HttpEndpoint: 'enabled' — explicit; default but good to pin. ## New input 'http-tokens' in action.yml, default 'required', accepted values 'required' and 'optional'. Consumers who must keep IMDSv1 compatibility set 'optional' explicitly. ## Tests tests/config.test.js — 2 new cases: default fallback + override. Total 34 -> 36. ## Consumer impact Transparent for code running on the runner that uses IMDS via a modern SDK (aws-sdk v2/v3, SSM agent, cloud-init) — all support IMDSv2 without config. Risk: an old tool or hand-rolled script that hits 169.254.169.254/latest/meta-data without first PUTting for a token. None expected on the hardened AL2023 AMI. ## Dogfood Small single-knob change — should pass the provider acctest cleanly once rotated. Phase 6.b (EBS encryption) will be scoped separately. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
#25) Closes #8 and #9. Documentation-only; no code change. ## Phase 2 (#8) — OIDC for AWS credentials README's 'How to start' step 1 now leads with GitHub OIDC as the preferred path and relegates static IAM access keys to a 'legacy' Option B. Rationale: static keys don't rotate, live in GitHub secrets indefinitely (permanent attack surface), and can't be scoped to a specific repo/branch/environment. OIDC issues short-lived STS tokens per run, scoped by repo/branch/environment. Includes: - A Terraform example for the trust-relationship IAM role with a repo-scoped 'sub' StringLike condition. - The minimum permissions policy (unchanged — attaches to role or user). - A workflow snippet showing 'permissions: id-token: write' and 'role-to-assume' instead of access-key secrets. No changes to the action's code — it already reads AWS creds from env, which configure-aws-credentials@v6 populates identically under both paths. ## Phase 3 (#9) — GitHub token type preferences README's 'How to start' step 2 replaced. Three token options ordered by preference: - A (preferred): GitHub App installation token via actions/create-github-app-token. No human identity, short-lived, minimal permission (Repository Administration: read/write). - B: fine-grained PAT scoped to specific repos with just Repository Administration. Better than classic PAT but still tied to a human. - C (deprecated): classic PAT with 'repo' scope. Over-permissive and human-tied; kept in docs as a fallback for environments that don't allow Apps or fine-grained PATs. No code change — the action accepts any token that has permission to manage self-hosted runners on the target repo. Docs change only. ## Not included in this PR - Phase 2's optional 'role-to-assume' input on the action itself (so consumers don't need the separate configure-aws-credentials step). Deferred — the current dual-step pattern is standard and works fine. Convenience feature, not urgency. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
…cksum table (#26) Closes #20. Supersedes the reverted #18 / #19 / #21. Implements the full Phase 4 bootstrap hardening from issue #10, with the root-cause fix from #20 baked in. Key differences from the earlier failed attempts: ## The fix for the actual failure Previous attempts died at: curl -fsSL <tarball>.sha256 | awk '{print }' with a 404 (actions/runner doesn't publish per-tarball sidecar files, empirically confirmed via aws ec2 get-console-output on a probe instance — see #20). This PR replaces that with a hardcoded table of expected hashes in src/runner-checksums.js, keyed by 'arch-version'. Two x86_64 / arm64 entries for the currently-pinned v2.333.1, sourced from the release body at github.com/actions/runner/releases/tag/v2.333.1. CI enforces table-vs-upstream consistency on every PR (see pr.yml). ## Everything else from Phase 4 - Non-root 'runner' user (useradd -m, sudo -u runner -H bash heredoc). RUNNER_ALLOW_RUNASROOT=1 escape hatch removed. - New 'runner-version' input in action.yml (default '2.333.1'). To override, add matching x64+arm64 SHAs to runner-checksums.js in the same PR — verify-runner-url CI will reject the change if the hashes don't match upstream. - --ephemeral --unattended --disableupdate on config.sh. GitHub auto-deregisters the runner after its job; disableupdate keeps the binary stable during the short ephemeral session. - set -euo pipefail on both the outer and inner (runner-user) shells. The earlier fatal failure under set -e was the .sha256 404, which no longer exists. - Paramaterized RUNNER_VERSION / TARBALL / BASE bash vars. ## Tests tests/runner-checksums.test.js — 6 new cases covering the table shape, hex format, x64+arm64 parity per version, lookup returns for known/unknown keys. tests/config.test.js — 2 new cases for the runner-version input (default fallback + override). Total: 36 -> 44 tests. ## CI: verify-runner-url overhaul The job now parses the runner-version from action.yml, then: 1. HEADs the Linux x64 release asset (unchanged). 2. Fetches the release body via 'gh api'. 3. Greps the BEGIN SHA linux-x64 / linux-arm64 HTML comments. 4. Cross-checks against the values lookup() returns from src/runner-checksums.js. Drift between the hardcoded table and upstream fails CI at code- review time, not at runtime. ## Dogfood plan (MUCH more careful this time) Provider SHA-pin rotation after merge, same pattern as prior phases. This time I have full EC2 console-output diagnostic capability via the recipe saved in my notes — any new bootstrap failure should be trivially diagnosable rather than opaque. Closing #20 on merge. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Rounds out Phase 6 (IMDSv2 landed in #24, EBS encryption deferred until a per-AMI root-device lookup could be done safely). ## Change New 'encrypt-ebs' input on action.yml, default 'false' (opt-in). When 'true', the action: 1. Fetches the AMI's DescribeImages result (already needed to resolve image IDs when 'ec2-image-filters' is set). 2. Finds the BlockDeviceMapping matching the AMI's RootDeviceName. 3. Clones that mapping, drops SnapshotId (AWS uses the AMI's snapshot automatically), sets 'Encrypted: true'. 4. Passes the cloned mapping as RunInstances.BlockDeviceMappings. Result: root volume launches with SSE-EBS, key 'alias/aws/ebs' in the launch account. VolumeSize / VolumeType / IOPS / DeleteOnTermination preserved from the AMI — only the encryption bit is new. ## Why opt-in The launch account (not necessarily the AMI owner account) must have either default EBS encryption enabled, or at minimum permission to use the default AWS-managed KMS key. If the AMI's snapshot is encrypted with a customer-managed key that doesn't have a cross- account grant, RunInstances fails. Defaulting to 'true' would regress every consumer whose IAM / KMS policy isn't set up for this. Default 'false' lets each consumer opt in after verifying their account can handle it. ## Why not account-level default encryption AWS supports 'aws ec2 enable-ebs-encryption-by-default' at the account level — and that's the preferred belt-and-suspenders. But not every consumer runs in an AWS account they control (e.g., Namecheap's CI runs in a shared org account). Action-side opt-in is the only portable control. ## Refactor alongside resolveImageId -> resolveImage: now returns both the ID and the full Image metadata. Callers that only need the ID use .id; the EBS-encryption code path uses .image.BlockDeviceMappings to build the encrypted clone. ## Tests tests/ebs.test.js — 6 new cases for buildEncryptedRootMapping: happy path with full EBS config + non-EBS sibling mapping, volume type / size / iops preservation, and five null-return paths for exotic AMI shapes (no RootDeviceName, no mappings, non-EBS root, orphan RootDeviceName). tests/config.test.js — 2 new cases for the encrypt-ebs input (default fallback + override). Total: 44 -> 52 tests. ## Consumer dogfood Separate PR on terraform-provider-namecheap rotates the pin and enables 'encrypt-ebs: true' on the CI job. If the dogfood fails with a KMS / IAM permissions error, we know the account needs policy work before enabling; action-side code is fine either way. Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
| - name: Set up Node.js | ||
| uses: actions/setup-node@v4 | ||
| with: | ||
| node-version: 20 |
dm1tr1yvovk
left a comment
There was a problem hiding this comment.
Code Review: PR #28 feat/al2023-support
Strengths
-
Clean AWS SDK v3 migration — Import-only-what-you-need command pattern (
client.send(new XCommand(...))) is correct and reduces bundle size. (src/aws.js:1-8) -
Robust security hardening — IMDSv2 enforcement with
HttpPutResponseHopLimit: 1is the correct SSRF defense. Default-on withhttp-tokens: optionalescape hatch is a good backward-compat trade-off. (src/aws.js:201-210) -
Non-root runner + ephemeral mode — Running as a dedicated
runneruser with--ephemeral --disableupdateis a significant security win. The heredoc-quoted<<'RUNNER_BOOTSTRAP'correctly prevents shell expansion of untrusted values. (src/aws.js:138-186) -
Checksum verification is enforced, not advisory —
sha256sum -c -runs underset -euo pipefail, killing bootstrap on mismatch. The CI jobverify-runner-urlcross-checks the hardcoded table against upstream releases. Two-layer defense, well designed. (src/aws.js:177,.github/workflows/pr.yml:92-125) -
Independent cleanup in stop mode — Catching errors from
terminateEc2Instance()andremoveRunner()independently prevents a GitHub API outage from blocking EC2 termination (and thus preventing billing runaway). (src/index.js:54-84) -
Structured logging with sanitization — The
log.jsmodule with key-based secret redaction and deferredrequire('./config')insideemit()correctly handles circular dependency with module load order. (src/log.js:28-41) -
Comprehensive test coverage — All 52 tests pass. Covers config validation, EBS mapping edge cases (6 scenarios), retry backoff, checksum integrity, and log sanitization.
-
CI pipeline additions —
verify-dist,unit-tests, andverify-runner-urljobs catch real defect classes at review time.
Issues
Critical (Must Fix)
C1. _.merge() mutates config.githubContext — stale runner_id on subsequent calls
- File:
src/gh.js:55 - What's wrong:
_.merge(config.githubContext, { runner_id: runner.id })mutates its first argument in place. AfterremoveRunner()runs,config.githubContextpermanently gains a stalerunner_idproperty, which will contaminate any subsequent logging or API calls using that object. - Fix: Use non-mutating merge:
{ ...config.githubContext, runner_id: runner.id }or_.merge({}, config.githubContext, { runner_id: runner.id })
C2. Missing ec2:DescribeImages IAM permission in documented policy
- File:
README.md:131-143 - What's wrong:
resolveImage()now always callsDescribeImagesCommandeven whenec2-image-idis provided directly (to fetch BlockDeviceMappings for EBS encryption). Butec2:DescribeImagesis not in the documented minimum IAM policy. Users upgrading will getAccessDeniederrors. - Fix: Add
ec2:DescribeImagesto the README's IAM policy block.
Important (Should Fix)
I1. Stale DEP0169 suppression after SDK v3 migration
- File:
src/index.js:1-14 - What's wrong: The
url.parse()deprecation suppression was added for aws-sdk v2. The comment says "migrating to aws-sdk v3 would require rewriting the entire action" — but that migration is already done. Dead code with a misleading comment. - Fix: Remove lines 1-14 entirely.
I2. waitForRunnerRegistered race — both reject and resolve can fire
- File:
src/gh.js:82-93 - What's wrong: When timeout fires,
reject()on line 86 has noreturn, so execution falls through to line 89 whereresolve()may also fire if the runner registered at the exact timeout boundary. - Fix: Add
return;afterreject(...)on line 86.
I3. New inputs not documented in README inputs table
- File:
README.md:269-282 - What's wrong: Four new inputs (
runner-version,encrypt-ebs,http-tokens,debug) are inaction.ymlbut absent from the README's Inputs table. Users won't discover the new security features.
I4. README still references "Amazon Linux 2" instead of AL2023
- File:
README.md:275 - What's wrong: Says "compatible with Amazon Linux 2 images" — contradicts the entire purpose of this PR.
I5. No validation of http-tokens input value
- File:
src/config.js:20 - What's wrong: Passed directly to
MetadataOptions.HttpTokenswithout validation. A typo likerequireinstead ofrequiredproduces a confusing AWS API error at instance launch rather than a clear validation error at action start. - Fix:
if (!['required', 'optional'].includes(this.input.httpTokens)) throw new Error(...)
I6. ec2Client() creates a new client on every call despite "singleton" comment
- File:
src/aws.js:16-22 - What's wrong: Comment says "A single shared client is fine" but the function creates a new
EC2Clientper invocation. Wastes credential resolution cycles. Comment and behavior are contradictory. - Fix: Either make it a true module-scope singleton or update the comment.
Minor (Nice to Have)
M1. src/gh.js:86 — reject() with a plain string instead of Error object. The error handler in index.js:90 accesses .name and .message, which will be undefined.
M2. package.json:21-22 — @actions/core at ^1.2.6 (2020) and @actions/github at ^4.0.0 (2021) are very outdated. Upgrading to @actions/core v1.10+ would eliminate the custom setOutput() function.
M3. README.md:319,324-325 — Example workflow still uses aws-actions/configure-aws-credentials@v1 and static IAM keys, contradicting the new OIDC-preferred guidance.
M4. src/aws.js:212 — encryptEbs === 'true' string comparison works but core.getBooleanInput() would be more conventional.
M5. src/aws.js:46-51 — Redundant DescribeImages call when ec2-image-id is provided and encrypt-ebs is false. Adds latency and forces the new permission even when not needed. Fixing this also reduces the impact of C2.
Recommendations
-
Add a manual integration test workflow (
workflow_dispatch) that creates/tears down a test runner to catch bootstrap failures, IAM permission gaps, and SDK issues that unit tests can't. -
Pin
@aws-sdk/client-ec2with caret range (^3.1033.0instead of exact3.1033.0) to receive patch/security fixes, or document the rationale for the exact pin. -
Document KMS permissions for
encrypt-ebs— With the default AWS-managed key, the role needskms:CreateGrantandkms:DescribeKeyonalias/aws/ebs. -
Consider removing lodash — The only usages are
_.filterand_.merge, both trivially replaceable with native JS. This would eliminate a transitive dependency.
Assessment
Ready to merge? No — with fixes.
Reasoning: The implementation is architecturally sound and delivers significant security and reliability improvements. The two critical issues must be addressed: C1 (_.merge mutation) is a correctness bug, and C2 (missing IAM permission in docs) is a breaking change for existing users. The stale DEP0169 suppression (I1) and the timeout race condition (I2) should also be fixed before merge. Remaining items can go in follow-ups.
No description provided.