fix(bench): wait for Terminating namespace before recreating #95
Merged
Conversation
Multi-profile runs (`--profile=full`, the release-bench workflow) re-use the same namespace names across profiles. The deferred teardown of profile N issues async namespace Deletes and returns immediately; profile N+1 then calls `ensureNamespace`, whose Get succeeded on the still-Terminating namespace, so it skipped the Create. The next CR Create in profile N+1 then raced the finalizer and intermittently saw `namespaces "bench-src-0" not found` once the namespace dropped from etcd.

Surfaced on the very first end-to-end `--profile=full` run on the self-hosted runner: np-typical succeeded; np-stress's bootstrap hit the race on bench-src-0.

Fix: `ensureNamespace` now polls until it sees either NotFound (safe to create) or a non-Terminating phase (reuse), with a 60s deadline. Single point of change; teardown stays best-effort and non-blocking.

The unit-test surface for this is awkward (it requires either a fake client that returns Terminating-then-NotFound or an envtest cluster), so that is deferred. The bench-smoke check covers the np-typical happy path; this race only surfaces in multi-profile sequencing, which the release-bench full matrix exercises end-to-end.
Bench smoke —
| Path | Samples | p50 | p95 | p99 |
|---|---|---|---|---|
| NP single-target | 100 | 15.7ms | 19.7ms | 23.3ms |
| CP-selector earliest | 30 | 53.6ms | 90.6ms | 105.3ms |
| CP-selector slowest | 30 | 125.4ms | 175.0ms | 177.2ms |
| CP-list earliest | 30 | 36.2ms | 47.9ms | 58.4ms |
| CP-list slowest | 30 | 36.2ms | 63.9ms | 76.2ms |
Total wall: 88s • Commit: 38506b8 • Workflow run
be0x74a added a commit that referenced this pull request on May 8, 2026:
PR #95 fixed the namespace teardown→bootstrap race by waiting for any Terminating namespace inside `ensureNamespace`. The next end-to-end run hit the same race class on a different resource: profile np-stress's `installCRDs` got `IsAlreadyExists` from Create on a still-Terminating CRD left by np-typical's teardown, skipped the Create, slept 3s, and during that sleep the CRD finalizer completed. The follow-up `createSource` then saw `the server could not find the requested resource`.

Rather than playing whack-a-mole with one Terminating-aware Ensure per resource type (namespaces today, CRDs tomorrow, ClusterProjections next week), centralize the cleanup-completion wait in teardown itself. After issuing every Delete, teardown now polls until every namespace, CRD, and ClusterProjection it deleted is observed NotFound. Bounded at 120s; on timeout the function returns silently (the next bootstrap will surface genuinely stuck state).

The PR #95 `ensureNamespace` wait stays in place as defense-in-depth: it covers external-actor deletes that happen during a run, not just the inter-profile teardown race this commit closes.
Why
Surfaced by the first end-to-end `--profile=full` run on the self-hosted runner today: np-typical succeeded; np-stress's bootstrap immediately failed with `namespaces "bench-src-0" not found`. Multi-profile runs re-use the same namespace name across profiles, and the bench had a teardown→bootstrap race:

1. The deferred teardown of profile N issues async namespace Deletes and returns immediately.
2. Profile N+1's `ensureNamespace` Get succeeds on the still-Terminating namespace, so it skips the Create.
3. The next CR Create races the namespace finalizer and intermittently fails with `namespaces "bench-src-0" not found` once the namespace drops from etcd.
The smoke-test check (PR #92) only runs np-typical, so it doesn't exercise multi-profile sequencing; that is exactly why the release-bench full matrix is load-bearing.
What
`ensureNamespace` now polls until it sees one of:

- NotFound: the namespace is fully gone from etcd, so it is safe to create, or
- a non-Terminating phase: the namespace is live, so it is safe to reuse,

with a 60s deadline. Single point of change; the existing `teardown` stays best-effort and non-blocking (we don't want teardown to slow down per-profile wall time when it doesn't have to).
Why not unit-test this directly
The race involves a Kubernetes namespace's lifecycle (Active → Terminating → gone-from-etcd), driven by the apiserver finalizer chain. Unit-testing it requires either:

- a fake client scripted to return Terminating, then NotFound, or
- an envtest cluster.
Either is disproportionate. The smoke check covers the np-typical happy path. This race specifically surfaces in multi-profile sequencing, which the release-bench full matrix exercises whenever it runs. If this regression sneaks back in, the next release-bench run will surface it — same way today's run did.
Test plan