fix: allow coordination.k8s.io/leases in Project discovery + body-kind probe defensive#617
Open
mattdjenkinson wants to merge 2 commits into
Open
fix: allow coordination.k8s.io/leases in Project discovery + body-kind probe defensive#617mattdjenkinson wants to merge 2 commits into
mattdjenkinson wants to merge 2 commits into
Conversation
filterAPIIndex previously decided whether to decode the response as APIGroupDiscoveryList based solely on the request Accept header. The intent of that change (6ef5237) was to work around Milo not echoing the as=APIGroupDiscoveryList Content-Type, but the trade-off silently broke clients when the inner apiserver emitted legacy APIGroupList JSON in response to an aggregated request: json.Unmarshal into APIGroupDiscoveryList succeeds with items: nil (Go JSON is lenient about extra/missing fields), and the filter re-encoded that empty list and sent it back to the client. The user-visible effect was every API group disappearing from project- context discovery — including built-in groups like coordination.k8s.io that no client has any way to register. controller-runtime's REST mapper then failed with `no matches for kind "Lease" in version "coordination.k8s.io/v1"`, and downstream operators that watch Lease (network-services-operator's connector + iroh-dns controllers, plus anything similar) couldn't engage. ~221 staging project control planes were affected. Probe the response body's `kind` field instead. Body kind is the only reliable signal for the format that's actually in `cw.body`; the Accept header is kept as a tie-break when the body lacks an explicit kind. Unknown or unrecognised bodies pass through verbatim so the filter never silently emits a malformed response. New regression test covers the four format/header combinations. Fixes: #616 Related: datum-cloud/network-services-operator#160 (downstream manifestation), datum-cloud/network-services-operator#161 (NSO-side defensive fix that no longer crash-loops on this).
The core-kubernetes-resources DiscoveryContextPolicy listed Lease as Platform-only alongside genuinely admin-only groups (rbac, authn, authz, admissionregistration). That was incorrect: Leases are a namespace-scoped primitive used legitimately by per-project workloads — leader election in any operator that targets project apiservers, or custom heartbeats like Datum Connect's connector health Lease. The Platform-only restriction made Project-context aggregated discovery omit coordination.k8s.io entirely, which broke any controller-runtime client (REST mapper resolves the group/kind from discovery). Downstream impact in datum-cloud/network-services-operator#160: the connector and iroh-dns controllers' Lease watches fail to engage, Connector status freezes, and the operator burns an engage-retry storm against every affected project control plane. The companion body-kind-probe fix in this PR is still valuable as defensive handling for discovery responses where the Accept header and body format disagree, but the user-visible symptom — Lease missing from project discovery — is caused entirely by this policy configuration.
Contributor
Author
|
Confirmed this is working as expected in staging. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two related changes:
1. Policy fix (the actual bug that broke staging)
config/discovery/core/discovery-context-policy.yamllistedcoordination.k8s.io/leasesas Platform-only — alongside genuinely admin-only groups likerbac.authorization.k8s.io,authentication.k8s.io,admissionregistration.k8s.io. That's incorrect. Leases are a namespace-scoped primitive used legitimately by per-project workloads: leader election in any operator targeting project apiservers, or custom heartbeats like Datum Connect's connector health Lease.Effect of the misconfiguration: project-context aggregated discovery omitted
coordination.k8s.ioentirely. Any controller-runtime client failed to findLeasein its REST mapper. Downstream symptom:datum-cloud/network-services-operator#160— connector and iroh-dns controllers' Lease watches failed to engage, Connector status froze, and the operator burned an engage-retry storm against every affected project control plane (221 of them on staging).Fix: add
"Project"to the Lease contexts.2. Body-kind-probe defensive fix in
filterAPIIndexWhile debugging the symptom I traced it to a plausible-but-wrong cause:
filterAPIIndexused the requestAcceptheader to decide whether to decode the response asapidiscoveryv2.APIGroupDiscoveryList. If the inner apiserver ever emitted legacyAPIGroupListJSON in response to an aggregated request,json.Unmarshalwould succeed withitems: niland the filter would silently emit an empty discovery list — every group disappears.That turned out not to be the cause of the user-visible symptom (the policy was), but the silent-empty failure mode is still a real footgun. Probe the response body's
kindfield first; treat the request Accept header as a tie-break for bodies with no explicit kind; pass unknown shapes through verbatim.Keeping this defensive change in the same PR so the filter is robust to format mismatches going forward — but the user-visible fix is the policy change.
Why this regressed only on staging
Production runs
v0.25.0withDiscoveryContextFilterfeature gate disabled, so the policy is never consulted and discovery passes through unfiltered. Staging enabled the gate (infra commitadfe2b19, 2026-04-17) which loaded the policy and exposed the misconfiguration.Test plan
New
TestFilterAPIIndex_FormatDetection(4 sub-cases) covers the body-kind probe branches: aggregated body + aggregated Accept → filtered; legacy body + aggregated Accept → pass-through (no silent emptying); legacy body + legacy Accept → pass-through; aggregated body + legacy Accept → still filtered (body kind wins).All existing tests pass.
The policy change should be validated end-to-end against an affected staging project after this PR lands — confirmed via an in-cluster
curlagainst/apis/resourcemanager.miloapis.com/v1alpha1/projects/X/control-plane/apiswith the aggregatedAcceptheader, asserting thatcoordination.k8s.ioappears in the response items.Related
#616datum-cloud/network-services-operator#160(frozen Connectors),datum-cloud/network-services-operator#161(closed without merge; defensive code that's no longer needed once this fix lands)