Skip to content

fix: allow coordination.k8s.io/leases in Project discovery + body-kind probe defensive#617

Open
mattdjenkinson wants to merge 2 commits into
mainfrom
fix/discovery-filter-body-kind-probe
Open

fix: allow coordination.k8s.io/leases in Project discovery + body-kind probe defensive#617
mattdjenkinson wants to merge 2 commits into
mainfrom
fix/discovery-filter-body-kind-probe

Conversation

@mattdjenkinson
Copy link
Copy Markdown
Contributor

@mattdjenkinson mattdjenkinson commented May 15, 2026

Summary

Two related changes:

1. Policy fix (the actual bug that broke staging)

config/discovery/core/discovery-context-policy.yaml listed coordination.k8s.io/leases as Platform-only — alongside genuinely admin-only groups like rbac.authorization.k8s.io, authentication.k8s.io, admissionregistration.k8s.io. That's incorrect. Leases are a namespace-scoped primitive used legitimately by per-project workloads: leader election in any operator targeting project apiservers, or custom heartbeats like Datum Connect's connector health Lease.

Effect of the misconfiguration: project-context aggregated discovery omitted coordination.k8s.io entirely. Any controller-runtime client failed to find Lease in its REST mapper. Downstream symptom: datum-cloud/network-services-operator#160 — connector and iroh-dns controllers' Lease watches failed to engage, Connector status froze, and the operator burned an engage-retry storm against every affected project control plane (221 of them on staging).

Fix: add "Project" to the Lease contexts.

2. Body-kind-probe defensive fix in filterAPIIndex

While debugging the symptom I traced it to a plausible-but-wrong cause: filterAPIIndex used the request Accept header to decide whether to decode the response as apidiscoveryv2.APIGroupDiscoveryList. If the inner apiserver ever emitted legacy APIGroupList JSON in response to an aggregated request, json.Unmarshal would succeed with items: nil and the filter would silently emit an empty discovery list — every group disappears.

That turned out not to be the cause of the user-visible symptom (the policy was), but the silent-empty failure mode is still a real footgun. Probe the response body's kind field first; treat the request Accept header as a tie-break for bodies with no explicit kind; pass unknown shapes through verbatim.

Keeping this defensive change in the same PR so the filter is robust to format mismatches going forward — but the user-visible fix is the policy change.

Why this regressed only on staging

Production runs v0.25.0 with DiscoveryContextFilter feature gate disabled, so the policy is never consulted and discovery passes through unfiltered. Staging enabled the gate (infra commit adfe2b19, 2026-04-17) which loaded the policy and exposed the misconfiguration.

Test plan

New TestFilterAPIIndex_FormatDetection (4 sub-cases) covers the body-kind probe branches: aggregated body + aggregated Accept → filtered; legacy body + aggregated Accept → pass-through (no silent emptying); legacy body + legacy Accept → pass-through; aggregated body + legacy Accept → still filtered (body kind wins).

All existing tests pass.

The policy change should be validated end-to-end against an affected staging project after this PR lands — confirmed via an in-cluster curl against /apis/resourcemanager.miloapis.com/v1alpha1/projects/X/control-plane/apis with the aggregated Accept header, asserting that coordination.k8s.io appears in the response items.

Related

filterAPIIndex previously decided whether to decode the response as
APIGroupDiscoveryList based solely on the request Accept header.
The intent of that change (6ef5237) was to work around Milo not
echoing the as=APIGroupDiscoveryList Content-Type, but the trade-off
silently broke clients when the inner apiserver emitted legacy
APIGroupList JSON in response to an aggregated request: json.Unmarshal
into APIGroupDiscoveryList succeeds with items: nil (Go JSON is
lenient about extra/missing fields), and the filter re-encoded that
empty list and sent it back to the client.

The user-visible effect was every API group disappearing from project-
context discovery — including built-in groups like coordination.k8s.io
that no client has any way to register. controller-runtime's REST
mapper then failed with `no matches for kind "Lease" in version
"coordination.k8s.io/v1"`, and downstream operators that watch Lease
(network-services-operator's connector + iroh-dns controllers, plus
anything similar) couldn't engage. ~221 staging project control
planes were affected.

Probe the response body's `kind` field instead. Body kind is the
only reliable signal for the format that's actually in `cw.body`;
the Accept header is kept as a tie-break when the body lacks an
explicit kind. Unknown or unrecognised bodies pass through verbatim
so the filter never silently emits a malformed response.

New regression test covers the four format/header combinations.

Fixes: #616
Related: datum-cloud/network-services-operator#160 (downstream
manifestation), datum-cloud/network-services-operator#161 (NSO-side
defensive fix that no longer crash-loops on this).
The core-kubernetes-resources DiscoveryContextPolicy listed Lease as
Platform-only alongside genuinely admin-only groups (rbac, authn,
authz, admissionregistration). That was incorrect: Leases are a
namespace-scoped primitive used legitimately by per-project workloads
— leader election in any operator that targets project apiservers,
or custom heartbeats like Datum Connect's connector health Lease.

The Platform-only restriction made Project-context aggregated
discovery omit coordination.k8s.io entirely, which broke any
controller-runtime client (REST mapper resolves the group/kind from
discovery). Downstream impact in
datum-cloud/network-services-operator#160: the connector and
iroh-dns controllers' Lease watches fail to engage, Connector status
freezes, and the operator burns an engage-retry storm against every
affected project control plane.

The companion body-kind-probe fix in this PR is still valuable as
defensive handling for discovery responses where the Accept header
and body format disagree, but the user-visible symptom — Lease
missing from project discovery — is caused entirely by this policy
configuration.
@mattdjenkinson mattdjenkinson changed the title fix: detect discovery response format from body kind, not request Accept fix: allow coordination.k8s.io/leases in Project discovery + body-kind probe defensive May 15, 2026
@mattdjenkinson
Copy link
Copy Markdown
Contributor Author

Confirmed this is working as expected in staging.

@mattdjenkinson mattdjenkinson requested a review from scotwells May 15, 2026 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant