[policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829

@danielmeppiel

Context

In #827 we shipped install-time policy enforcement for APM. When the policy fetch fails — cache miss combined with network failure, malformed YAML response, or garbage HTTP response — the current behavior is fail-open: the install proceeds and a warning is logged.

The three PolicyFetchResult outcomes that currently fail open are:

  • cache_miss_fetch_fail — no cached policy and the remote fetch failed (network error, timeout, DNS).
  • malformed — a response was received but could not be parsed as valid policy YAML.
  • garbage_response — the HTTP response was non-YAML (HTML error page, empty body, etc.).

This was a deliberate v1 design decision to avoid bricking developer environments when an org policy server is flaky or unreachable. For enterprises that require fail-closed behavior (no install proceeds without a verified policy verdict), we need a configurable knob.

Proposal

Add a fetch_failure key under the policy section of apm-policy.yml:

policy:
  fetch_failure: warn  # default — current behavior (fail-open)
  # fetch_failure: block  # fail-closed — install exits non-zero on fetch failure

Value            Behavior
warn (default)   Log a warning and proceed with install. Current behavior, unchanged.
block            Exit non-zero with a clear error message. No packages are installed.

Important: cache_stale (a cached policy within MAX_STALE_TTL) is unaffected by this knob — a stale-but-valid cache is always used regardless of the fetch_failure setting. This knob only governs the three hard-failure outcomes listed above.
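
A minimal sketch of the decision the knob governs, to make the scoping concrete. The outcome names come from this issue; the FetchOutcome enum, its OK member, and the should_block_install helper are hypothetical and do not reflect the actual code shipped in #827.

```python
from enum import Enum


class FetchOutcome(Enum):
    # Member values mirror the outcome names in this issue; the enum itself
    # and the OK member are assumptions made for illustration.
    OK = "ok"
    CACHE_STALE = "cache_stale"
    CACHE_MISS_FETCH_FAIL = "cache_miss_fetch_fail"
    MALFORMED = "malformed"
    GARBAGE_RESPONSE = "garbage_response"


# The three hard-failure outcomes that the knob applies to.
HARD_FAILURES = frozenset({
    FetchOutcome.CACHE_MISS_FETCH_FAIL,
    FetchOutcome.MALFORMED,
    FetchOutcome.GARBAGE_RESPONSE,
})


def should_block_install(outcome: FetchOutcome, fetch_failure: str = "warn") -> bool:
    """Return True when install must abort before any package is installed."""
    if outcome not in HARD_FAILURES:
        # A fresh policy or a stale-but-valid cache is never affected by the knob.
        return False
    return fetch_failure == "block"
```

With this shape, cache_stale falls through the first branch, so it behaves the same under warn and block.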

Acceptance criteria

  • Schema validation accepts both warn and block as valid values for policy.fetch_failure; rejects anything else.
  • Default value is warn — no behavioral change for existing users.
  • When set to block, any of the three failure outcomes (cache_miss_fetch_fail, malformed, garbage_response) causes install to exit non-zero with a clear, actionable error message (e.g., "Policy fetch failed and fetch_failure is set to block. Cannot proceed.").
  • cache_stale (within MAX_STALE_TTL) is unaffected — always uses the cached policy regardless of this setting.
  • Documentation page updated to describe the knob, default, and enterprise use case.
  • Unit tests cover both warn and block values across all three failure outcomes.
  • Integration test confirms block mode exits non-zero and warn mode logs warning but exits zero.
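
The first two criteria (accept only warn and block, default to warn) could be sketched as below. This assumes the policy section has already been loaded into a plain dict; parse_fetch_failure and the constant names are hypothetical and not part of the existing schema code.

```python
ALLOWED_FETCH_FAILURE = ("warn", "block")
DEFAULT_FETCH_FAILURE = "warn"


def parse_fetch_failure(policy_section: dict) -> str:
    """Return the validated policy.fetch_failure value, defaulting to 'warn'."""
    value = policy_section.get("fetch_failure", DEFAULT_FETCH_FAILURE)
    # Reject non-strings (e.g. a YAML boolean or list) as well as unknown strings.
    if not isinstance(value, str) or value not in ALLOWED_FETCH_FAILURE:
        raise ValueError(
            f"policy.fetch_failure must be one of {ALLOWED_FETCH_FAILURE}, got {value!r}"
        )
    return value
```

Rejecting unknown values at schema-validation time, instead of falling back to warn, keeps a typo such as "blok" from silently leaving an org in fail-open mode.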

Open questions

  1. apm install --dry-run — Should block mode also prevent dry-run from completing? Or should dry-run always succeed (since it does not actually install anything)?
  2. --no-policy bypass — Should --no-policy still bypass enforcement even in block mode? CEO recommendation: yes. The escape hatch is the escape hatch — if an engineer explicitly opts out with --no-policy, that intent should be respected even in strict mode. The audit trail (CLI logs) already records the bypass.
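
If the recommendation in question 2 is adopted, the precedence could look like the sketch below. The function name, its parameters, and the exit code 1 are assumptions (this issue only requires a non-zero exit); the dry-run question is deliberately left out because it is still open.

```python
def enforcement_exit_code(hard_failure: bool, fetch_failure: str, no_policy: bool) -> int:
    """Return the exit code contributed by policy enforcement (0 means proceed)."""
    if no_policy:
        # Explicit opt-out wins, even in block mode.
        return 0
    if hard_failure and fetch_failure == "block":
        # Fail closed: abort before installing anything.
        return 1
    # warn mode, or no hard failure: proceed (a warning is logged elsewhere).
    return 0
```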

Out of scope

  • Non-GitHub VCS policy hosting (e.g., GitLab, Bitbucket policy sources) — separate feature.
  • Signing or integrity verification of the policy file itself — separate security issue.
  • Granular per-rule failure modes (e.g., some rules fail-open, others fail-closed) — future iteration if demand materializes.

Follow-up to #827. Filed per CEO mandate during panel review.
