
fix(bench): wait for Terminating namespace before recreating #95

Merged

be0x74a merged 1 commit into main from fix/bench-ns-terminating-race on May 8, 2026
Conversation

@be0x74a (Member) commented May 8, 2026

Why

Surfaced by the first end-to-end `--profile=full` run on the self-hosted runner today: np-typical succeeded; np-stress's bootstrap immediately failed with `namespaces "bench-src-0" not found`. Multi-profile runs re-use the same namespace name across profiles, and the bench had a teardown→bootstrap race:

  1. Profile N's deferred `teardown()` issues async `Delete` on shared namespaces and returns immediately
  2. Profile N+1's bootstrap calls `ensureNamespace("bench-src-0")` → `Get` succeeds because the namespace is still in `Terminating` phase → skips `Create` (the bug)
  3. Microseconds later the finalizer completes; the namespace drops from etcd
  4. Profile N+1's first `createSource` call hits an apiserver that no longer knows about `bench-src-0` → "namespace not found"

The smoke-test check (PR #92) only runs np-typical, so it doesn't exercise multi-profile sequencing; that is exactly why the release-bench full matrix is load-bearing.
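To make the failure concrete, the pre-fix `ensureNamespace` behaved roughly like the sketch below. This is illustrative only, assuming a controller-runtime client; the harness's actual types and helper names may differ.

```go
package bench

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Pre-fix shape (illustrative): a bare Get success is treated as
// "namespace exists", even when the namespace is already Terminating.
func ensureNamespace(ctx context.Context, c client.Client, name string) error {
	var ns corev1.Namespace
	err := c.Get(ctx, client.ObjectKey{Name: name}, &ns)
	if apierrors.IsNotFound(err) {
		fresh := corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: name}}
		return c.Create(ctx, &fresh)
	}
	// BUG: the namespace may be mid-Terminating here; skipping Create means
	// it vanishes from etcd under the very next createSource call.
	return err
}
```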

What

`ensureNamespace` now polls until it sees one of:

  • `IsNotFound` → safe to create fresh
  • A namespace whose `status.phase != Terminating` → reuse as-is

Both checks are bounded by a 60s deadline (a sketch of the loop follows). This is a single point of change; the existing `teardown` stays best-effort and non-blocking, since we don't want teardown to inflate per-profile wall-clock time when it doesn't have to.
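For concreteness, a minimal sketch of the new wait, again assuming a controller-runtime client and the apimachinery `wait` helpers; the real harness code may be shaped differently.

```go
package bench

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureNamespace waits out a Terminating namespace (up to 60s) before
// either reusing an Active one or creating a fresh one.
func ensureNamespace(ctx context.Context, c client.Client, name string) error {
	return wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 60*time.Second, true,
		func(ctx context.Context) (bool, error) {
			var ns corev1.Namespace
			err := c.Get(ctx, client.ObjectKey{Name: name}, &ns)
			switch {
			case apierrors.IsNotFound(err):
				// Finalizers finished and the namespace is gone: safe to create fresh.
				fresh := corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: name}}
				return true, c.Create(ctx, &fresh)
			case err != nil:
				return false, err // unexpected apiserver error: abort the wait
			case ns.Status.Phase != corev1.NamespaceTerminating:
				return true, nil // Active namespace from a prior profile: reuse as-is
			default:
				return false, nil // still Terminating: keep polling until the deadline
			}
		})
}
```

Doing the Create inside the same poll closes the window between observing NotFound and recreating; on timeout the error propagates, so a genuinely stuck namespace fails the bootstrap loudly instead of racing the finalizer.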

Why not unit-test this directly

The race involves a Kubernetes namespace's lifecycle (Active → Terminating → gone-from-etcd), driven by the apiserver finalizer chain. Unit-testing it requires either:

  • A fake client that scripts the phase transitions through Get calls (substantial test scaffolding for one race)
  • An envtest cluster (which the bench harness deliberately doesn't use — it's a local diagnostic that runs against a real Kind cluster, not envtest)

Either is disproportionate. The smoke check covers the np-typical happy path. This race specifically surfaces in multi-profile sequencing, which the release-bench full matrix exercises whenever it runs. If this regression sneaks back in, the next release-bench run will surface it — same way today's run did.

Test plan

  • `make build` clean
  • `make lint` 0 issues
  • `go test ./test/bench/ -count=1 -race` passing
  • After merge: re-trigger release-bench (against `main`) and confirm it gets past np-stress

Multi-profile runs (--profile=full, the release-bench workflow) reuse the
same namespace names across profiles. The deferred teardown of profile N
issues async namespace Deletes and returns immediately; profile N+1 then
calls ensureNamespace, whose Get succeeds on the still-Terminating
namespace, so it skips Create. The next CR Create in profile N+1 then
races the finalizer and intermittently sees "namespaces 'bench-src-0'
not found" once the namespace drops from etcd.

Surfaced on the very first end-to-end --profile=full run on the
self-hosted runner: np-typical succeeded, np-stress's bootstrap hit the
race on bench-src-0.

Fix: ensureNamespace now polls until it sees either NotFound (safe to
create) or a non-Terminating phase (reuse), with a 60s deadline. Single
point of change; teardown stays best-effort and non-blocking.

The unit-test surface for this is awkward (requires either a fake client
that returns Terminating-then-NotFound or an envtest cluster); deferring
that. The bench-smoke check covers the np-typical happy path; this race
only surfaces in multi-profile sequencing, which the release-bench full
matrix exercises end-to-end.
@github-actions (Bot) commented May 8, 2026

Bench smoke — mixed-typical

End-to-end source-update latency from a 2-vCPU GHA runner. Treat absolute numbers as a sanity check, not a perf claim — runner noise is high. The point of this check is to catch shape-break regressions on api/v1 / controller / bench changes before merge. (Self-heal and ns-flip distributions are recorded in bench.json but omitted here for signal-to-noise.)

Profile

100 namespaced Projections + 50 CP-selector destinations + 10 CP-list destinations, layered in one bootstrap.

Results — source-update latency

| Path                 | Samples | p50     | p95     | p99     |
|----------------------|---------|---------|---------|---------|
| NP single-target     | 100     | 15.7ms  | 19.7ms  | 23.3ms  |
| CP-selector earliest | 30      | 53.6ms  | 90.6ms  | 105.3ms |
| CP-selector slowest  | 30      | 125.4ms | 175.0ms | 177.2ms |
| CP-list earliest     | 30      | 36.2ms  | 47.9ms  | 58.4ms  |
| CP-list slowest      | 30      | 36.2ms  | 63.9ms  | 76.2ms  |

Total wall: 88s • Commit: 38506b8 • Workflow run

@be0x74a be0x74a merged commit 35a6514 into main May 8, 2026
16 checks passed
@be0x74a be0x74a deleted the fix/bench-ns-terminating-race branch May 8, 2026 14:41
be0x74a added a commit that referenced this pull request May 8, 2026
PR #95 fixed the namespace teardown→bootstrap race by waiting for any
Terminating namespace inside ensureNamespace. The next end-to-end run
hit the same race class on a different resource: profile np-stress's
installCRDs got Create→IsAlreadyExists on a still-Terminating CRD from
np-typical's teardown, skipped the Create, slept 3s, and during that
sleep the CRD finalizer completed. The follow-up createSource then saw
"the server could not find the requested resource".

Rather than playing whack-a-mole with one Terminating-aware Ensure per
resource type (namespaces today, CRDs tomorrow, ClusterProjections next
week), centralize the cleanup-completion wait in teardown itself. After
issuing every Delete, teardown now polls until every namespace, CRD,
and ClusterProjection it deleted is observed NotFound. Bounded at 120s;
on timeout the function returns silently (next bootstrap will surface
genuinely stuck state).

The PR #95 ensureNamespace wait stays in place as defense-in-depth — it
covers external-actor deletes that happen during a run, not just the
inter-profile teardown race this commit closes.
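A minimal sketch of that centralized teardown wait, under the same controller-runtime-client assumption; the `waitForDeleted` name and the way teardown tracks what it deleted are illustrative, not the harness's real API.

```go
package bench

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForDeleted blocks until every object teardown deleted is observed
// NotFound, bounded at 120s. On timeout it returns silently; the next
// bootstrap will surface genuinely stuck state.
func waitForDeleted(ctx context.Context, c client.Client, deleted []client.Object) {
	_ = wait.PollUntilContextTimeout(ctx, time.Second, 120*time.Second, true,
		func(ctx context.Context) (bool, error) {
			for _, obj := range deleted {
				err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj)
				if !apierrors.IsNotFound(err) {
					// Still present (or a transient Get error): keep polling.
					return false, nil
				}
			}
			return true, nil // namespaces, CRDs, and ClusterProjections are all gone
		})
}
```

The poll result is intentionally discarded, matching the description above: teardown never fails a run on its own, and a genuinely stuck resource surfaces at the next bootstrap instead.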
