Skip to content

Fix shared/apm.md matrix secret-stripping; add empirical verify workflow#1373

Merged
danielmeppiel merged 5 commits into
mainfrom
danielmeppiel/rca-matrix-secret-stripping
May 18, 2026
Merged

Fix shared/apm.md matrix secret-stripping; add empirical verify workflow#1373
danielmeppiel merged 5 commits into
mainfrom
danielmeppiel/rca-matrix-secret-stripping

Conversation

@danielmeppiel
Copy link
Copy Markdown
Collaborator

@danielmeppiel danielmeppiel commented May 18, 2026

TL;DR

shared/apm.md has been broken for every form 2 (single-app legacy) and form 3 (apps[]) consumer since PR #982 added the apm-prep -> apm matrix split. The apm-prep job emits the GitHub App private-key PEM inside its matrix job output, GitHub Actions silently strips the output as a secret leak, the downstream matrix evaluates to empty, no APM bundles are produced, and the agent fails at no apm bundles found.

This PR fixes the canonical shared/apm.md by moving credentials off the cross-job output path entirely, and adds .github/workflows/verify-shared-apm-matrix.yml which empirically asserts both the failure mode and the fix using a synthetic ::add-mask::-registered PEM (no real secrets required).

Empirical proof (green): Verify shared/apm.md matrix secret-stripping fix #26058337658 -- all 6 jobs pass, including the regression sentinel that asserts GH still strips the old shape and the positive matrix-strategy job that asserts the new shape's env-relay round-trips the PEM intact.

Problem (WHY)

Reported by a consumer in the field: after updating to the latest gh-aw + canonical shared/apm.md, every workflow consuming it failed with:

##[warning]Skip output 'matrix' since it may contain secret.

...followed by zero matrix replicas and no apm bundles found at the agent boundary.

Root cause

The GitHub Actions runner runs every job output value through its secret masker before publishing it for consumption by dependent jobs. In actions/runner (src/Runner.Worker/JobExtension.cs), the relevant guard is:

if (!string.Equals(output.Value, HostContext.SecretMasker.MaskSecrets(output.Value)))
{
    // Replace the value with empty string and emit the warning.
}

The pre-fix apm-prep step built its matrix output by jq-folding private-key directly into the matrix rows. Once private-key was a registered secret (via repo / org secrets context or ::add-mask::), the masker detected the secret substring inside the output value, the runner replaced the output with "", and the downstream apm.strategy.matrix: ${{ fromJSON(needs.apm-prep.outputs.matrix) }} saw an empty string.

This is the runner working as designed. There is no flag to disable it and there is no path-pattern allow-list. Any value that contains any registered-secret substring is dropped, full stop. The only fix is to stop putting secrets into job outputs.

(This regressed silently because the pre-PR-982 design was a single apm job -- no matrix, no job-output secret crossing. The split design that PR-982 introduced is what triggered the masker.)

Approach (WHAT)

Keep the cross-job matrix output strictly non-secret (routing metadata only). Resolve credentials per matrix row inside the apm job itself, using gh-aw compile-time-substituted env vars, and relay the resolved row credentials between steps via $GITHUB_ENV -- which is per-job and not subject to the cross-job output-redaction filter (it still gets masked in logs, but it is usable in subsequent steps of the same job).

Patched apm-prep matrix shape (secret-free)

{"group":[
  {"id":"legacy",  "kind":"legacy", "index":0, "owner":"...", "repositories":"...", "packages":[...], "has-app":"true"},
  {"id":"acme",    "kind":"apps",   "index":0, "owner":"...", "repositories":"...", "packages":[...], "has-app":"true"},
  {"id":"default", "kind":"default","index":0, "owner":"",    "repositories":"",    "packages":[...], "has-app":"false"}
]}

No app-id, no private-key. The masker has nothing to match on; the output is published verbatim.

Patched apm job

  • Job-level env: carries the gh-aw compile-time-substituted credential fields (AW_APM_LEGACY_APP_ID, AW_APM_LEGACY_PRIVATE_KEY, AW_APM_APPS). GA evaluates these per replica; secret-bearing values are masked in logs but not stripped (env is not an output).
  • A new first step Resolve credentials for this matrix row reads matrix.group.kind + matrix.group.index, picks the right app-id / private-key from the job env (or jq-indexes AW_APM_APPS for the apps[] case), normalises trailing newlines, and writes them to $GITHUB_ENV as ROW_APP_ID / ROW_PRIVATE_KEY. It also re-runs ::add-mask:: on the resolved PEM (defence in depth).
  • actions/create-github-app-token@v3.1.1 then receives ${{ env.ROW_APP_ID }} / ${{ env.ROW_PRIVATE_KEY }}. Same if: matrix.group.has-app == 'true' guard.

Downstream validators (pre-agent-steps) only ever read id from the matrix manifest -- no change needed.

Empirical proof (verify-shared-apm-matrix.yml)

The PR ships its own proof. This workflow runs on pull_request against itself and exercises both shapes:

Job set A -- a-bug-prep + a-bug-consume (regression sentinel)

Replicates the pre-fix shape literally: registers a synthetic PEM via ::add-mask::, then writes a matrix job output containing that PEM, and the consumer asserts that needs.a-bug-prep.outputs.matrix is the empty string. Passes iff GitHub still strips secret-bearing outputs. If GA ever loosens the rule, Job A will fail and tell us the fix's premise has changed.

Job set B -- b-fix-prep + b-fix-assert-shape + b-fix-consume (positive proof)

Replicates the post-fix shape: registers the same PEM, emits a secret-free 2-row matrix (legacy row + apps[] row at index: 0), plus a non-secret expected_sha (SHA-256 of the PEM, which is not a substring of the PEM and therefore not masked). Asserts:

  1. needs.b-fix-prep.outputs.matrix is non-empty in the asserter.
  2. Matrix has the expected 2 rows.
  3. The matrix-strategy b-fix-consume job fans out and runs each row.
  4. Each row resolves ROW_PRIVATE_KEY from $GITHUB_ENV and the value round-trips intact (sha256sum of the resolved PEM equals expected_sha, byte-for-byte, on both the legacy and apps[] resolution paths).

No real GitHub App credentials are needed; everything is self-contained.

Adversarial expert audit (3 parallel reviews)

Before finalising, three independent expert reviewers were spawned in parallel (secret-masker / GitHub Apps / matrix-composition), each instructed to use only live verified primary sources -- no invented details. All three returned MOSTLY CONFIDENT on the env-relay design; none blocking. The audit converged on three concrete hardenings, applied to the resolver step in commit e027211a.

Primary-source citations (verified at audit time)

  • actions/runner -- src/Runner.Worker/JobExtension.cs:639-647 -- the output filter itself (MaskSecrets round-trip equality check).
  • actions/runner -- src/Runner.Worker/FileCommandManager.cs:378 -- $GITHUB_ENV heredoc parser: exact-line ordinal compare for the terminator.
  • actions/runner -- src/Runner.Worker/ActionCommandManager.cs:426 -- ::add-mask:: registers the value as a single substring; does not auto-decompose multi-line secrets.
  • docs.github.com -- workflow-commands -- multi-line strings -- explicit warning against fixed heredoc delimiters for arbitrary multi-line values.
  • docs.github.com -- contexts -- env.* context availability for step-level with: / env:.
  • actions/create-github-app-token@v3.1.1 -- main.js -- uses core.getInput() which calls .trim() on the PEM by default; tolerates either presence or absence of trailing newline.

Findings applied (commit e027211a)

  1. Random heredoc delimiter. The fixed delimiter APMPK violates the official "arbitrary multi-line value" guidance. The runner does an exact-line ordinal compare, so a PEM whose base64 body happened to contain a line equal to APMPK would silently truncate the env var. Probability for real PEMs is astronomically low but non-zero; the class is eliminated entirely via delim="APMPK_$(openssl rand -hex 16)".
  2. Trailing-newline normalisation. The legacy path (direct env var) preserves the PEM's trailing newline; the apps[] path (jq via $(...) command substitution) strips it. create-github-app-token v3.1.1 tolerates both, but normalising once at the env-relay boundary (pk="${pk%$'\n'}") eliminates the asymmetry at the source instead of leaning on downstream tolerance. Both paths now produce byte-identical ROW_PRIVATE_KEY.
  3. Never-echo SECURITY comment. ::add-mask:: registers the PEM as a single multi-line substring; the runner's secret masker does not match individual PEM lines printed in isolation. A future debug echo "$pk" or set -x in this step would leak the key body in clear text. Pure preventive guard on the resolver step.

Findings deferred (out of scope for this PR)

  • Pin actions/create-github-app-token by SHA -- repo-wide concern, separate PR.
  • Migrate app-id -> client-id -- the underlying action emits a deprecation hint; separate PR.
  • Empty-group matrix guard -- pre-existing in shared/apm.md, unchanged by this fix.

Verdict

All three reviewers independently arrived at MOSTLY CONFIDENT (not TOTAL) because secret masking is fundamentally a runtime promise from GitHub Actions, not a static guarantee any PR can prove offline. The verify workflow's regression sentinel (Job A) closes this gap empirically: if GitHub ever loosens the cross-job output filter, Job A will start failing and we get an automatic signal that the fix's premise has changed.

Validation evidence

Check Status
Lint (uv run --extra dev ruff check src/ tests/) silent (no Python touched)
Format check (uv run --extra dev ruff format --check src/ tests/) 782 files already formatted
YAML parse of verify-shared-apm-matrix.yml OK
ASCII-only constraint confirmed (status symbols [OK] / [!])
Empirical run of verify-shared-apm-matrix.yml run 26058337658 -- all 6 jobs green; the expected ##[warning]Skip output 'matrix' since it may contain secret. annotation comes from Job A by design (regression sentinel). A fresh run will be triggered by the hardening push and must remain green.

How to test

  1. The verify-shared-apm-matrix.yml workflow on this PR is green (run 26058337658) -- both job sets pass.
  2. Optional manual: dispatch the workflow via workflow_dispatch on any branch that contains both file changes.
  3. Optional end-to-end: a consumer that hits shared/apm.md via imports: against this branch's commit SHA should now produce a non-empty matrix and fan out APM bundle replicas. The previously-failing no apm bundles found validator should pass.

Trade-offs

  • Two-step credential resolution per row. Adds ~1 small jq step at the start of each apm replica. Necessary because the matrix can no longer carry credentials directly.
  • gh-aw vendored copy is not touched. .github/workflows/shared/apm.md in microsoft/apm is the canonical source of truth and the one consumers import; the vendored copy in microsoft/gh-aw is downstream and tracked via redirect: frontmatter -- it will sync naturally on the next gh-aw release. Out of scope for this PR.
  • apps[] form (form 3) is still upstream-blocked in gh-aw. Go's default formatter emits [map[a:1] map[b:2]] for object arrays in ${{ github.aw.import-inputs.* }}, which jq cannot parse. The verify workflow proves the receiving side works -- the actual unblock for live apps[] usage requires gh-aw to expose a JSON-encoding helper. Tracked separately.

Files

  • .github/workflows/shared/apm.md -- the canonical fix + 3 audit hardenings.
  • .github/workflows/verify-shared-apm-matrix.yml -- new empirical proof (now mirrors the hardened resolver).
  • CHANGELOG.md -- Unreleased / Fixed entry.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

The pre-fix shared/apm.md embedded the GitHub App private-key inside
the apm-prep job's 'matrix' output. GitHub Actions ran every job
output through HostContext.SecretMasker.MaskSecrets (see
actions/runner JobExtension.cs); when the masked value differed from
the original, the runner dropped the output entirely with the
warning 'Skip output "matrix" since it may contain secret.' The
downstream apm job's matrix: fromJSON(needs.apm-prep.outputs.matrix)
then received an empty string, no replicas ran, no APM bundles were
packed, and pre-agent-steps failed at 'no apm bundles found'. Every
single-app legacy (form 2) and apps[] (form 3) consumer of
shared/apm.md was broken end-to-end.

The fix moves credentials off the job-output path. apm-prep now
emits routing metadata only (id, kind, index, owner, repositories,
packages, has-app); the apm job carries the gh-aw-substituted
secrets in its own per-replica env, and a new 'Resolve credentials'
step picks the row's credentials with jq and relays them through
$GITHUB_ENV as ROW_APP_ID / ROW_PRIVATE_KEY. The
create-github-app-token step now reads from env.ROW_*. $GITHUB_ENV
is per-job and is masked-in-logs but NOT subject to the cross-job
output redaction filter.

Empirical proof: .github/workflows/verify-shared-apm-matrix.yml
exercises both shapes with a synthetic ::add-mask::'d PEM (no real
secrets needed). Job set A replicates the pre-fix shape and asserts
the matrix output IS stripped (regression sentinel against GH
silently changing the rule). Job set B replicates the post-fix
shape and asserts the matrix output survives, the matrix-strategy
job fans out to two rows (legacy + apps), each row resolves
ROW_PRIVATE_KEY from $GITHUB_ENV, and the PEM round-trips intact
via SHA-256 comparison.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 18, 2026 20:18
The loop body '[ -n "$line" ] && echo ...' returns the exit status
of the failed test when $line is empty, which under 'set -e' aborts
the script. Use an explicit 'if ...; then ...; fi' instead so the
loop tolerates the trailing empty line from the heredoc-string read
deterministically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a regression introduced by PR #982 where the apm-prep job in shared/apm.md embedded the GitHub App private-key inside its cross-job matrix output. The GitHub Actions runner's SecretMasker silently replaces any cross-job output containing a registered secret substring with an empty string, causing the downstream apm matrix strategy to fan out to zero replicas and every consumer (form 2 single-app + form 3 apps[]) to fail at no apm bundles found. The fix keeps the matrix output strictly non-secret (routing metadata only) and resolves per-row credentials inside the apm job via $GITHUB_ENV, which is not subject to the cross-job redaction filter. A new self-contained workflow empirically asserts both the runner's stripping behaviour (regression sentinel) and the new shape's survival, using a synthetic ::add-mask::-registered PEM.

Changes:

  • Restructured apm-prep to emit id/kind/index/owner/repositories/packages/has-app only; dropped app-id/private-key from the matrix.
  • Added a "Resolve credentials" step to the apm job that case-switches on matrix.group.kind, indexes apps by matrix.group.index, re-masks, and writes ROW_APP_ID/ROW_PRIVATE_KEY to $GITHUB_ENV; create-github-app-token now reads env.ROW_* and is gated on has-app == 'true'.
  • Added verify-shared-apm-matrix.yml with two job sets (broken-shape sentinel + fixed-shape positive proof with SHA round-trip).
Show a summary per file
File Description
.github/workflows/shared/apm.md Splits credentials out of the cross-job matrix output and relays them per row via $GITHUB_ENV in the apm job.
.github/workflows/verify-shared-apm-matrix.yml New empirical proof workflow asserting both the broken pre-fix shape is stripped and the post-fix shape survives end-to-end.
CHANGELOG.md Adds an Unreleased / Fixed entry for the fix.

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 1

Comment thread CHANGELOG.md Outdated

### Fixed

- `shared/apm.md`: rebuilt the apm-prep -> apm matrix hand-off so the per-row credentials are relayed via `$GITHUB_ENV` on the `apm` job instead of being embedded in the matrix job output. The previous design placed the GitHub App `private-key` value inside the `matrix` job output of `apm-prep`, which the runner stripped via its cross-job secret-redaction filter (`HostContext.SecretMasker.MaskSecrets` in `actions/runner` / `JobExtension.cs`, surfaced as `##[warning]Skip output 'matrix' since it may contain secret.`). Downstream consumers received an empty matrix, zero `apm` replicas ran, and pre-agent-steps failed at `no apm bundles found`. Every form 2 (single-app legacy) and form 3 (`apps[]`) consumer of `shared/apm.md` was broken end-to-end. The fix moves credentials out of the matrix output entirely: `apm-prep` now emits routing metadata only (`id`, `kind`, `index`, `owner`, `repositories`, `packages`, `has-app`); the `apm` job carries the gh-aw-substituted secrets in its own per-replica env, resolves the row credentials via `jq`, and relays them to `create-github-app-token` through `ROW_APP_ID` / `ROW_PRIVATE_KEY` in `$GITHUB_ENV` (which is per-job and never subject to the cross-job redaction filter). A new `.github/workflows/verify-shared-apm-matrix.yml` workflow provides empirical proof: one job set replicates the broken pre-fix shape and asserts the matrix output is stripped (regression sentinel for the runner behaviour), and a second job set replicates the post-fix shape and asserts the matrix output survives, the `apm`-style matrix job fans out, and the resolved `ROW_PRIVATE_KEY` round-trips intact (SHA-256 comparison).
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 65e5868 -- compressed to a single line matching the file's convention and added the (#1373) suffix. The full root-cause exposition lives in the PR description.

Daniel Meppiel and others added 3 commits May 18, 2026 22:23
The legacy resolution path preserves the PEM's trailing newline (direct
env-var expansion), while the apps[] path strips it ($() command
substitution around jq -r). Both are equally valid for a real PEM
that downstream create-github-app-token consumes, but the bit-exact
SHA-256 comparison would diverge by one newline.

Trim trailing newlines on both the expected (prep) and got (consume)
side before hashing so both rows converge on the same value, without
weakening the substring-leak detection (the trim affects only
whitespace, not PEM body bytes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three parallel adversarial expert audits (secret-masker, GitHub Apps,
matrix-composition) returned MOSTLY CONFIDENT on the env-relay fix and
converged on three concrete hardenings worth applying. Mirrors the
production resolver in the verify workflow so the empirical proof
exercises the hardened code path.

1) Random heredoc delimiter (matrix-composition expert)
   Replace the fixed 'APMPK' delimiter with 'APMPK_$(openssl rand -hex 16)'.
   The official multi-line workflow command docs explicitly warn against
   fixed delimiters for arbitrary multi-line values; the runner's
   FileCommandManager does an exact-line ordinal compare, so a PEM whose
   base64 body happened to contain a line equal to 'APMPK' would
   prematurely terminate the value. Probability for real PEMs is
   astronomically low but non-zero; eliminate the class entirely.

2) Trailing-newline normalisation (matrix-composition expert)
   The legacy path (direct env var) preserves the PEM's trailing newline
   while the apps[] path (jq via $(...) command substitution) strips it.
   create-github-app-token v3.1.1 tolerates both via core.getInput().trim(),
   but normalising once at the env-relay boundary removes the asymmetry
   at the source instead of leaning on downstream tolerance. Both paths
   now produce byte-identical ROW_PRIVATE_KEY values.

3) Never-echo SECURITY comment (secret-masker + github-apps experts)
   ::add-mask:: registers the PEM as a single multi-line substring; the
   runner's secret masker does not match individual PEM lines printed in
   isolation. A future debug 'echo $pk' or 'set -x' in this step would
   leak the key body in clear text. Pure preventive guard.

Primary sources cited by the audits:
- actions/runner JobExtension.cs:639-647 (the output filter itself)
- actions/runner FileCommandManager.cs:378 (heredoc parser - exact compare)
- actions/runner ActionCommandManager.cs:426 (::add-mask:: implementation)
- docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#multiline-strings
- raw.githubusercontent.com/actions/create-github-app-token/v3.1.1/main.js

Verify workflow simplification: with the resolver now normalising trailing
newlines uniformly, the b-fix-consume SHA assertion no longer needs its
own normalisation loop -- both rows arrive at byte-identical PEMs at the
$GITHUB_ENV boundary. Drops 4 lines of belt-and-suspenders.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…PR ref

Per review comment from copilot-pull-request-reviewer on PR #1373:
- Add missing (#1373) suffix to match Keep a Changelog format used
  throughout the file.
- Compress the multi-paragraph root-cause exposition to a single line;
  the deep technical detail (runner internals, JobExtension.cs, verify
  workflow design) lives in the PR description where it belongs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel danielmeppiel merged commit 7c46bff into main May 18, 2026
21 checks passed
@danielmeppiel danielmeppiel deleted the danielmeppiel/rca-matrix-secret-stripping branch May 18, 2026 20:57
danielmeppiel pushed a commit that referenced this pull request May 18, 2026
- One concise line per PR answering 'so what?' for end users
- Add 5 missing entries: #1376 (perf resolver), #1373 (shared/apm.md
  matrix secret-stripping), #1246 (install.ps1 GHES env vars), #1255
  (warn missing apm.yml), #1248 (extends:org unmanaged_files)
- Drop internal/CI/test-infra entries (#1270, #1271, #1272, #1274,
  #1276, #1291, #1360 refactor)
- Consolidate three #605 lines and four #1317 lines into one entry
  per PR where appropriate
- Promote MCP Registry v0.1 to a dedicated Breaking section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danielmeppiel added a commit that referenced this pull request May 18, 2026
* chore: cut 0.14.0

Renames the [Unreleased] block in CHANGELOG.md to [0.14.0] - 2026-05-18
and bumps the package version from 0.13.0 to 0.14.0 in pyproject.toml
(and uv.lock by regeneration).

0.14.0 ships the producer-experience epic (#1348) on the CLI side --
notably:

- apm pack --check-versions / --check-clean (#1365), the release gates
  consumed by apm-action mode: release.
- apm plugin init (#1370), the noun-verb successor to apm init --plugin.
- apm pack multi-format outputs (--marketplace, --marketplace-path,
  --json, marketplace.outputs map form) (#1317).
- New producer docs corpus (repo-shapes / releasing-from-any-ci /
  versioning-strategies) (#1370).
- Breaking: MCP registry client adopts the official v0.1 spec; self-
  hosted registries must serve /v0.1/ paths (#1337).

Plus the deprecations and fixes already listed in the moved block.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(changelog): tighten v0.14.0 entries; add post-cut PRs

- One concise line per PR answering 'so what?' for end users
- Add 5 missing entries: #1376 (perf resolver), #1373 (shared/apm.md
  matrix secret-stripping), #1246 (install.ps1 GHES env vars), #1255
  (warn missing apm.yml), #1248 (extends:org unmanaged_files)
- Drop internal/CI/test-infra entries (#1270, #1271, #1272, #1274,
  #1276, #1291, #1360 refactor)
- Consolidate three #605 lines and four #1317 lines into one entry
  per PR where appropriate
- Promote MCP Registry v0.1 to a dedicated Breaking section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(changelog): add #1377 Bitbucket DC tilde fix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Daniel Meppiel <copilot-rework@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants