Skip to content

feat: structured logging + opt-in debug mode (Phase 7)#22

Merged
kurok merged 1 commit intofeat/al2023-supportfrom
feat/13-structured-logging
Apr 21, 2026
Merged

feat: structured logging + opt-in debug mode (Phase 7)#22
kurok merged 1 commit intofeat/al2023-supportfrom
feat/13-structured-logging

Conversation

@kurok
Copy link
Copy Markdown

@kurok kurok commented Apr 21, 2026

Closes #13. Part of plan #15.

Motivation

Phase 4's dogfood failures (#18, #19) took three iterations and a full revert to diagnose because the ephemeral EC2 instance is terminated by the stop-runner step before anyone can look at cloud-init logs. Structured action-side logs are the cheap half of the diagnostic story (see #20 for the AWS-side exec > … | tee | aws s3 cp … trap that completes it).

What lands

  • New src/log.js — JSON-shaped events via core.info / core.warning / core.error. Lazy-requires ./config so validation-phase log calls don't crash before Config is constructed. Auto-redacts values under a small secret-key allowlist (githubToken, token, aws-access-key-id, …).
  • New debug input (default false). When set to true, the action emits extra diagnostic traffic — full DescribeImages response, per-poll wait_for_runner details, sanitized input echo. No change to default output.
  • Call sites instrumented across src/aws.js / src/gh.js / src/index.js: run_instance, describe_images, wait_for_instance, associate_address, terminate_instance, gh_registration_token, remove_runner, wait_for_runner.
  • core.startGroup / endGroup around start() and stop() so the Actions run log collapses.
  • Tests: 7 new in tests/log.test.js, 2 new in tests/config.test.js. Total 23 → 30.

Example output

With debug: false (default) — one JSON line per notable event:

::notice::{"step":"describe_images","mode":"start","owner":"699717368611","filters":[...]}
::notice::{"step":"describe_images","mode":"start","match_count":6,"selected_ami":"ami-06150d38a88f1778a"}
::notice::{"step":"run_instance","mode":"start","ami_id":"ami-06150d38a88f1778a","instance_type":"t3.medium","subnet_id":"subnet-01c4ff5a","sg_id":"sg-106ec76d","iam_role":"AmazonSSMRoleForInstancesQuickSetup","label":"abcde"}
::notice::{"step":"run_instance","mode":"start","instance_id":"i-02317df7bd1d30105","elapsed_ms":3140}

With debug: true — adds:

::notice::{"step":"start_inputs","mode":"start","debug":true,"githubToken":"***","label":"","ec2InstanceType":"t3.medium",...}
::notice::{"step":"describe_images_all","mode":"start","debug":true,"images":[{"id":"ami-...","name":"nc-al2023-...","created":"..."}]}
::notice::{"step":"wait_for_runner_poll","mode":"start","debug":true,"label":"abcde","elapsed_s":30,"found":false,"status":null}

Secrets never appear in logs. The sanitize-then-stringify pass in emit() runs on every field payload, including recursively through nested objects.

Compatibility

  • Default behavior unchanged for consumers — debug defaults to false → no additional log traffic.
  • External contract (inputs + outputs) unchanged: all pre-existing inputs + the two outputs (label, ec2-instance-id) work exactly as before.
  • No change to the EC2 bootstrap / user-data — this entire PR operates on the action's JS side, running on the GitHub runner that calls the action.

Unblocks

Dogfood

Behavior change is log-output-only — nothing in this PR affects how the EC2 instance boots or how the runner registers. Rotating the provider's pin to this tip should be risk-free; if the pin rotation's CI passes it validates the full action pipeline under the new logging.

Closes #13. Part of plan #15.

## Motivation

Phase 4's dogfood failures took three iterations and a full revert to
diagnose because the ephemeral EC2 instance terminates before anyone
can look at cloud-init logs. Structured action-side logs are the
cheap half of the diagnostic story (see #20 for the AWS-side upload-
to-S3 trap that completes it).

## New module: src/log.js

JSON-shaped events via core.info / core.warning / core.error:

    log.info('run_instance', {
      ami_id: 'ami-0x',
      instance_type: 't3.medium',
      subnet_id: 'subnet-aaa',
      sg_id: 'sg-bbb',
      label: 'abcde',
    });
    // -> ::notice ::{"step":"run_instance","mode":"start",...}

Features:
- Lazy require of ./config so validation-phase log calls don't crash
  when config's try/catch hasn't finished running new Config().
- sanitize() redacts values under a small secret-key allowlist before
  emitting (githubToken, github-token, token, aws-access-key-id,
  aws-secret-access-key, GPG_PRIVATE_KEY, password).
- log.debug(...) is a no-op unless config.input.debug === 'true'.

## New input: debug

action.yml declares 'debug: false' as default. Consumers opt in by
setting debug: 'true' in their workflow; the action then emits the
extra diagnostic traffic (full DescribeImages response, wait_for_runner
poll details, input echo) without changing default output.

src/config.js reads the input with a 'false' fallback so old callers
that don't set it continue to work.

## Call sites instrumented

src/aws.js:
- wait_for_instance (start + end + error)
- describe_images (request + selection + full-list under debug)
- run_instance (request + result + error)
- associate_address (attempt + warn on failure)
- terminate_instance (start + end + error)

src/gh.js:
- gh_registration_token (attempt + success + error)
- remove_runner (attempt + success + skip + error)
- wait_for_runner (final outcome + per-poll details under debug)

src/index.js:
- start/stop bracketed by core.startGroup/endGroup so the Actions
  log collapses cleanly.
- Top-level 'fatal' log on uncaught failure.
- start_inputs / stop_inputs logged under debug (sanitized).

## Tests

tests/log.test.js — 7 new cases:
- info / warn / error emit through the right core.* channels with
  step + mode + fields baked into the JSON payload.
- debug emits nothing when config.input.debug !== 'true'.
- debug emits when config.input.debug === 'true'.
- sanitize redacts recursively.
- Payload-level redaction works (info with a {githubToken: …} field).

tests/config.test.js — 2 new cases for the debug input (default +
explicit override).

Total: 23 -> 30 tests.

## Unblocks

Phase 4.b (non-root retry, #20) now has visibility into what the
bootstrap is doing. Combined with the S3-upload trap on the AWS side
(also scoped in #20), any future bootstrap regression is diagnosable
from outside the instance.

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
@kurok kurok merged commit b1b8d6d into feat/al2023-support Apr 21, 2026
4 checks passed
@kurok kurok deleted the feat/13-structured-logging branch April 21, 2026 09:18
kurok added a commit to namecheap/terraform-provider-namecheap that referenced this pull request Apr 21, 2026
…185)

namecheap/ec2-github-runner#22 merged. Phase 7 adds JSON-shaped log
events on every lifecycle step, an opt-in 'debug' input, and
core.startGroup/endGroup wrapping so the Actions log collapses
cleanly. No behavior change to the EC2 bootstrap or runner
registration — this is pure action-side observability.

Default-mode output grows by one JSON line per notable event
(~10 lines total for a full start+stop cycle). Setting debug: 'true'
in workflow yaml would add verbose traffic, but that's opt-in.

Rotation: 249efbd (Phase 4 revert) -> b1b8d6d (Phase 7).

Signed-off-by: yuriyryabikov <22548029+kurok@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant