
fix(bench): wait for Terminating namespace before recreating #95

Merged

be0x74a merged 1 commit into main from fix/bench-ns-terminating-race on May 8, 2026
Conversation

@be0x74a (Member) commented May 8, 2026

Why

Surfaced by the first end-to-end `--profile=full` run on the self-hosted runner today: np-typical succeeded; np-stress's bootstrap immediately failed with `namespaces "bench-src-0" not found`. Multi-profile runs re-use the same namespace name across profiles, and the bench had a teardown→bootstrap race:

  1. Profile N's deferred `teardown()` issues async `Delete` on shared namespaces and returns immediately
  2. Profile N+1's bootstrap calls `ensureNamespace("bench-src-0")` → `Get` succeeds because the namespace is still in `Terminating` phase → skips `Create` (the bug)
  3. Microseconds later the finalizer completes; the namespace drops from etcd
  4. Profile N+1's first `createSource` call hits an apiserver that no longer knows about `bench-src-0` → "namespace not found"

The smoke-test check (PR #92) only runs np-typical, so it doesn't exercise multi-profile sequencing; that is exactly why the release-bench full matrix is load-bearing.
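To make the failure concrete, the pre-fix `ensureNamespace` behaved roughly like the sketch below. This is illustrative only, assuming a controller-runtime client; the harness's actual types and helper names may differ.

```go
package bench

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Pre-fix shape (illustrative): a bare Get success is treated as
// "namespace exists", even when the namespace is already Terminating.
func ensureNamespace(ctx context.Context, c client.Client, name string) error {
	var ns corev1.Namespace
	err := c.Get(ctx, client.ObjectKey{Name: name}, &ns)
	if apierrors.IsNotFound(err) {
		fresh := corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: name}}
		return c.Create(ctx, &fresh)
	}
	// BUG: the namespace may be mid-Terminating here; skipping Create means
	// it vanishes from etcd under the very next createSource call.
	return err
}
```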

What

`ensureNamespace` now polls until it sees one of:

  • `IsNotFound` → safe to create fresh
  • A namespace whose `status.phase != Terminating` → reuse as-is

Both checks are bounded by a 60s deadline (a sketch of the loop follows). This is a single point of change; the existing `teardown` stays best-effort and non-blocking, since we don't want teardown to inflate per-profile wall-clock time when it doesn't have to.
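For concreteness, a minimal sketch of the new wait, again assuming a controller-runtime client and the apimachinery `wait` helpers; the real harness code may be shaped differently.

```go
package bench

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureNamespace waits out a Terminating namespace (up to 60s) before
// either reusing an Active one or creating a fresh one.
func ensureNamespace(ctx context.Context, c client.Client, name string) error {
	return wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 60*time.Second, true,
		func(ctx context.Context) (bool, error) {
			var ns corev1.Namespace
			err := c.Get(ctx, client.ObjectKey{Name: name}, &ns)
			switch {
			case apierrors.IsNotFound(err):
				// Finalizers finished and the namespace is gone: safe to create fresh.
				fresh := corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: name}}
				return true, c.Create(ctx, &fresh)
			case err != nil:
				return false, err // unexpected apiserver error: abort the wait
			case ns.Status.Phase != corev1.NamespaceTerminating:
				return true, nil // Active namespace from a prior profile: reuse as-is
			default:
				return false, nil // still Terminating: keep polling until the deadline
			}
		})
}
```

Doing the Create inside the same poll closes the window between observing NotFound and recreating; on timeout the error propagates, so a genuinely stuck namespace fails the bootstrap loudly instead of racing the finalizer.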

Why not unit-test this directly

The race involves a Kubernetes namespace's lifecycle (Active → Terminating → gone-from-etcd), driven by the apiserver finalizer chain. Unit-testing it requires either:

  • A fake client that scripts the phase transitions through Get calls (substantial test scaffolding for one race)
  • An envtest cluster (which the bench harness deliberately doesn't use — it's a local diagnostic that runs against a real Kind cluster, not envtest)

Either is disproportionate. The smoke check covers the np-typical happy path. This race specifically surfaces in multi-profile sequencing, which the release-bench full matrix exercises whenever it runs. If this regression sneaks back in, the next release-bench run will surface it — same way today's run did.

Test plan

  • `make build` clean
  • `make lint` 0 issues
  • `go test ./test/bench/ -count=1 -race` passing
  • After merge: re-trigger release-bench (against `main`) and confirm it gets past np-stress

Multi-profile runs (--profile=full, the release-bench workflow) reuse the
same namespace names across profiles. The deferred teardown of profile N
issues async namespace Deletes and returns immediately; profile N+1 then
calls ensureNamespace, whose Get succeeds on the still-Terminating
namespace, so it skips Create. The next CR Create in profile N+1 then
races the finalizer and intermittently sees "namespaces 'bench-src-0'
not found" once the namespace drops from etcd.

Surfaced on the very first end-to-end --profile=full run on the
self-hosted runner: np-typical succeeded, np-stress's bootstrap hit the
race on bench-src-0.

Fix: ensureNamespace now polls until it sees either NotFound (safe to
create) or a non-Terminating phase (reuse), with a 60s deadline. Single
point of change; teardown stays best-effort and non-blocking.

The unit-test surface for this is awkward (requires either a fake client
that returns Terminating-then-NotFound or an envtest cluster); deferring
that. The bench-smoke check covers the np-typical happy path; this race
only surfaces in multi-profile sequencing, which the release-bench full
matrix exercises end-to-end.
@github-actions (Bot) commented May 8, 2026

Bench smoke — mixed-typical

End-to-end source-update latency from a 2-vCPU GHA runner. Treat absolute numbers as a sanity check, not a perf claim — runner noise is high. The point of this check is to catch shape-break regressions on api/v1 / controller / bench changes before merge. (Self-heal and ns-flip distributions are recorded in bench.json but omitted here for signal-to-noise.)

Profile

100 namespaced Projections + 50 CP-selector destinations + 10 CP-list destinations, layered in one bootstrap.

Results — source-update latency

| Path                 | Samples | p50     | p95     | p99     |
|----------------------|---------|---------|---------|---------|
| NP single-target     | 100     | 15.7ms  | 19.7ms  | 23.3ms  |
| CP-selector earliest | 30      | 53.6ms  | 90.6ms  | 105.3ms |
| CP-selector slowest  | 30      | 125.4ms | 175.0ms | 177.2ms |
| CP-list earliest     | 30      | 36.2ms  | 47.9ms  | 58.4ms  |
| CP-list slowest      | 30      | 36.2ms  | 63.9ms  | 76.2ms  |

Total wall: 88s • Commit: 38506b8 • Workflow run

@be0x74a be0x74a merged commit 35a6514 into main May 8, 2026
16 checks passed
@be0x74a be0x74a deleted the fix/bench-ns-terminating-race branch May 8, 2026 14:41
be0x74a added a commit that referenced this pull request May 8, 2026
PR #95 fixed the namespace teardown→bootstrap race by waiting for any
Terminating namespace inside ensureNamespace. The next end-to-end run
hit the same race class on a different resource: profile np-stress's
installCRDs got Create→IsAlreadyExists on a still-Terminating CRD from
np-typical's teardown, skipped the Create, slept 3s, and during that
sleep the CRD finalizer completed. The follow-up createSource then saw
"the server could not find the requested resource".

Rather than playing whack-a-mole with one Terminating-aware Ensure per
resource type (namespaces today, CRDs tomorrow, ClusterProjections next
week), centralize the cleanup-completion wait in teardown itself. After
issuing every Delete, teardown now polls until every namespace, CRD,
and ClusterProjection it deleted is observed NotFound. Bounded at 120s;
on timeout the function returns silently (next bootstrap will surface
genuinely stuck state).

The PR #95 ensureNamespace wait stays in place as defense-in-depth — it
covers external-actor deletes that happen during a run, not just the
inter-profile teardown race this commit closes.
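A minimal sketch of that centralized teardown wait, under the same controller-runtime-client assumption; the `waitForDeleted` name and the way teardown tracks what it deleted are illustrative, not the harness's real API.

```go
package bench

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForDeleted blocks until every object teardown deleted is observed
// NotFound, bounded at 120s. On timeout it returns silently; the next
// bootstrap will surface genuinely stuck state.
func waitForDeleted(ctx context.Context, c client.Client, deleted []client.Object) {
	_ = wait.PollUntilContextTimeout(ctx, time.Second, 120*time.Second, true,
		func(ctx context.Context) (bool, error) {
			for _, obj := range deleted {
				err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj)
				if !apierrors.IsNotFound(err) {
					// Still present (or a transient Get error): keep polling.
					return false, nil
				}
			}
			return true, nil // namespaces, CRDs, and ClusterProjections are all gone
		})
}
```

The poll result is intentionally discarded, matching the description above: teardown never fails a run on its own, and a genuinely stuck resource surfaces at the next bootstrap instead.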
