Context
In #827 we shipped install-time policy enforcement for APM. When the policy fetch fails — cache miss combined with network failure, malformed YAML response, or garbage HTTP response — the current behavior is fail-open: the install proceeds and a warning is logged.
The three PolicyFetchResult outcomes that currently fail-open are:
cache_miss_fetch_fail — no cached policy and the remote fetch failed (network error, timeout, DNS).
malformed — a response was received but could not be parsed as valid policy YAML.
garbage_response — the HTTP response was non-YAML (HTML error page, empty body, etc.).
This was a deliberate v1 design decision to avoid bricking developer environments when an org policy server is flaky or unreachable. For enterprises that require fail-closed behavior (no install proceeds without a verified policy verdict), we need a configurable knob.
Proposal
Add a fetch_failure key under the policy section of apm-policy.yml:
policy:
fetch_failure: warn # default — current behavior (fail-open)
# fetch_failure: block # fail-closed — install exits non-zero on fetch failure
| Value |
Behavior |
warn (default) |
Log a warning and proceed with install. Current behavior, unchanged. |
block |
Exit non-zero with a clear error message. No packages are installed. |
Important: cache_stale (a cached policy within MAX_STALE_TTL) is unaffected by this knob — a stale-but-valid cache is always used regardless of the fetch_failure setting. This knob only governs the three hard-failure outcomes listed above.
Acceptance criteria
Open questions
apm install --dry-run — Should block mode also prevent dry-run from completing? Or should dry-run always succeed (since it does not actually install anything)?
--no-policy bypass — Should --no-policy still bypass enforcement even in block mode? CEO recommendation: yes. The escape hatch is the escape hatch — if an engineer explicitly opts out with --no-policy, that intent should be respected even in strict mode. The audit trail (CLI logs) already records the bypass.
Out of scope
- Non-GitHub VCS policy hosting (e.g., GitLab, Bitbucket policy sources) — separate feature.
- Signing or integrity verification of the policy file itself — separate security issue.
- Granular per-rule failure modes (e.g., some rules fail-open, others fail-closed) — future iteration if demand materializes.
Follow-up to #827. Filed per CEO mandate during panel review.
Context
In #827 we shipped install-time policy enforcement for APM. When the policy fetch fails — cache miss combined with network failure, malformed YAML response, or garbage HTTP response — the current behavior is fail-open: the install proceeds and a warning is logged.
The three
PolicyFetchResultoutcomes that currently fail-open are:cache_miss_fetch_fail— no cached policy and the remote fetch failed (network error, timeout, DNS).malformed— a response was received but could not be parsed as valid policy YAML.garbage_response— the HTTP response was non-YAML (HTML error page, empty body, etc.).This was a deliberate v1 design decision to avoid bricking developer environments when an org policy server is flaky or unreachable. For enterprises that require fail-closed behavior (no install proceeds without a verified policy verdict), we need a configurable knob.
Proposal
Add a
fetch_failurekey under thepolicysection ofapm-policy.yml:warn(default)blockImportant:
cache_stale(a cached policy withinMAX_STALE_TTL) is unaffected by this knob — a stale-but-valid cache is always used regardless of thefetch_failuresetting. This knob only governs the three hard-failure outcomes listed above.Acceptance criteria
warnandblockas valid values forpolicy.fetch_failure; rejects anything else.warn— no behavioral change for existing users.block, any of the three failure outcomes (cache_miss_fetch_fail,malformed,garbage_response) causes install to exit non-zero with a clear, actionable error message (e.g., "Policy fetch failed and fetch_failure is set to block. Cannot proceed.").cache_stale(withinMAX_STALE_TTL) is unaffected — always uses the cached policy regardless of this setting.warnandblockvalues across all three failure outcomes.blockmode exits non-zero andwarnmode logs warning but exits zero.Open questions
apm install --dry-run— Shouldblockmode also prevent dry-run from completing? Or should dry-run always succeed (since it does not actually install anything)?--no-policybypass — Should--no-policystill bypass enforcement even inblockmode? CEO recommendation: yes. The escape hatch is the escape hatch — if an engineer explicitly opts out with--no-policy, that intent should be respected even in strict mode. The audit trail (CLI logs) already records the bypass.Out of scope
Follow-up to #827. Filed per CEO mandate during panel review.