feat(ci): add release-time bench workflow on self-hosted runner by be0x74a · Pull Request #94 · projection-operator/projection

be0x74a · 2026-05-08T11:45:36Z

Why

PR 4 of the bench multi-PR sequence:

#91 (feat(bench): retrofit harness for v0.3 dual-CRD topology) retrofitted the bench for the v0.3 dual-CRD shape
#92 (feat(ci): add bench smoke check on PRs touching api/controller/bench) added a per-PR shape-break smoke check (free-tier 2-vCPU GHA runner)
#93 (feat(bench): add self-heal and ns-flip events, label source-update fields) added self-heal + ns-flip event measurements
This PR (#94) adds a release-time bench that captures full-matrix numbers on a self-hosted runner and persists them to a bench-history orphan branch keyed by release tag.

Earlier in the design conversation we weighed a GHA larger runner (8-core, paid) against a self-hosted one. We picked self-hosted because the release-time bench is rare (~6 runs/year), year-over-year reproducibility matters more than always-on availability, and a dedicated VM removes noisy-neighbor variance from those comparisons.

What

.github/workflows/bench.yml:

Trigger: workflow_dispatch only (NOT pull_request — self-hosted runners on public repos are exposed to fork-PR malicious code; manual-only avoids that class entirely)
Inputs: ref (release tag, branch, or SHA; required) and an optional `label` override for the bench-history filename
Runs on: [self-hosted, bench-runner] — runner provisioning is handled separately on your side (see checklist below)
Steps: checkout target ref → Setup Go from go.mod → create Kind cluster (bench-release, kindest/node v1.32.0) → build & load operator image → kustomize-deploy → wait for ready → go run ./test/bench --profile=full --output=json → jq -s '.' slurp into a JSON array → derive label → render markdown step summary → push to bench-history orphan branch (self-bootstraps on first run with a README) → upload artifact → cleanup Kind cluster
Permissions: top-level contents: read; job adds contents: write for the orphan-branch push (least privilege)
Timeout: 240 min (full matrix typically 60-90 min — leaving a 2-3x cushion for stress profiles on first run)
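The self-bootstrapping behaviour of the bench-history push can be sketched roughly as follows. This is a local-only illustration and not the workflow's actual step: the helper name `ensure_history_branch`, the README text, and the commit identity are assumptions, and a real step would also need to check the remote (for example with `git ls-remote`) and push afterwards.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: put the repo at $1 on the bench-history branch,
# creating it as an orphan (no history shared with main) on first use.
ensure_history_branch() {
  local repo="$1"
  local branch="bench-history"

  # Branch already exists locally: just switch to it.
  if git -C "$repo" show-ref --verify --quiet "refs/heads/$branch"; then
    git -C "$repo" checkout --quiet "$branch"
    return 0
  fi

  # First run: start an orphan branch, then drop the files that the
  # previous branch left in the index and working tree.
  git -C "$repo" checkout --quiet --orphan "$branch"
  git -C "$repo" rm -rf --quiet . >/dev/null 2>&1 || true

  # Seed the branch with a README so it is never empty (text is illustrative).
  printf '# bench-history\n\nOne JSON file per release label.\n' > "$repo/README.md"
  git -C "$repo" add README.md
  git -C "$repo" -c user.name="bench-bot" -c user.email="bench-bot@example.invalid" \
    commit --quiet -m "chore: bootstrap bench-history"
}
```

Because the branch is an orphan, result files never mix with source history, and calling the helper a second time is a no-op apart from the checkout.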

.github/scripts/bench-history-summary.sh (bash + jq):

Renders a markdown step summary with three tables (source-update p99, self-heal p99, ns-flip p99) keyed by profile
Em-dash for missing/zero values; LC_ALL=C for locale-stable ms formatting
Step summary is operator-facing UX so the run output is glanceable without diving into the bench-history JSON
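The missing-value and locale handling described above might look something like this minimal sketch. The helper names `fmt_ms` and `row`, the one-decimal format, and the two-column row layout are assumptions for illustration; the real script uses jq and its exact output format is not shown in this description.

```shell
#!/usr/bin/env bash
set -eu

# Pin the locale so the decimal separator is always "." regardless of the host.
export LC_ALL=C

# Hypothetical helper: format a p99 value in milliseconds for a table cell.
# Empty, "null", or zero values render as an em-dash (UTF-8 bytes E2 80 94).
fmt_ms() {
  local value="${1:-}"
  # awk coerces empty and non-numeric strings to 0, so they take the dash branch.
  if awk -v v="$value" 'BEGIN { exit !(v + 0 == 0) }'; then
    printf '\342\200\224\n'
  else
    printf '%.1f ms\n' "$value"
  fi
}

# Render one markdown table row: profile name plus its formatted p99 value.
row() {
  printf '| %s | %s |\n' "$1" "$(fmt_ms "${2:-}")"
}
```

For example, `row small 250` prints `| small | 250.0 ms |`, while a profile with no measurement gets a dash in the value column instead of a misleading `0.0 ms`.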

Self-hosted runner provisioning checklist

You'll need to provision and register the runner separately. The workflow's `runs-on: [self-hosted, bench-runner]` will only pick it up when both labels match.

VM specs (Proxmox or equivalent):

8 vCPU / 16 GB RAM / 50 GB NVMe-backed disk
Ubuntu 22.04 LTS recommended (matches GHA hosted runner image for parity)

Required tools on the VM:

apt-get update
apt-get install -y docker.io git curl jq
# Go (matches the project's go.mod toolchain — installer URL changes; check go.dev)
# kubectl, kind, helm, gh CLI
# GitHub Actions runner agent

gh CLI specifically is required on the runner (the workflow uses gh auth setup-git to configure git credentials for the bench-history push).
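A quick preflight check on the VM can confirm the tools above are on PATH before the first run. This is a sketch only: the function names are hypothetical, and the tool list is taken from this checklist rather than from the workflow file itself.

```shell
#!/usr/bin/env bash
set -eu

# Print the name of every listed tool that is not on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the tools named in the checklist; adjust the list to whatever the
# workflow actually invokes.
preflight() {
  local missing
  missing="$(missing_tools docker git curl jq go kubectl kind helm gh)"
  if [ -n "$missing" ]; then
    printf 'missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
  echo "all required tools found"
}
```

Running `preflight` as the runner's service user (not just as root) is the more useful check, since PATH can differ between the two.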

Runner registration:

GitHub org settings → Actions → Runners → New self-hosted runner → choose Linux/x64
Follow the install commands shown
When prompted for labels, add bench-runner (in addition to the default self-hosted, Linux, X64)
Configure as a service so it survives VM reboots

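Power profile: the VM doesn't need to be always-on. A self-hosted runner simply shows as offline while the VM is shut down and reconnects when the runner service starts again, so boot the VM before triggering the bench and shut it down afterwards; that keeps power cost negligible. One caveat: GitHub automatically removes runners that have not connected for an extended period, so with ~6 runs a year, boot the VM periodically or be prepared to re-register it before a release.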

Why `pull_request` is omitted

GitHub explicitly recommends self-hosted runners only for private repos, because fork PRs can run arbitrary workflow code on the runner. Our repo is public, so the bench workflow is workflow_dispatch-only. The PR-time shape-break check (bench-smoke.yml, added in #92) stays on the free-tier hosted runner — that's where regression catching happens; this workflow is just for the release-time anchored numbers.

If we ever add a pull_request trigger to this workflow, the runner becomes a supply-chain target. Don't.
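One way to keep that rule from regressing silently is a small lint over the workflow file. The sketch below is a coarse heuristic and an assumption on my part, not something this PR ships: `has_pr_trigger` is a hypothetical name, it does not parse YAML, and it will also flag any other non-comment mention of `pull_request`.

```shell
#!/usr/bin/env bash
set -eu

# Return 0 if the file mentions pull_request or pull_request_target as a
# whole word outside of "#" comments, non-zero otherwise.
has_pr_trigger() {
  sed 's/#.*$//' "$1" \
    | grep -Eq '(^|[^A-Za-z0-9_])pull_request(_target)?([^A-Za-z0-9_]|$)'
}
```

A check such as `if has_pr_trigger .github/workflows/bench.yml; then exit 1; fi` could then run in an existing hosted-runner job.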

Out of scope (future work)

Per-event controller histograms (Approach 2b) — v0.4.0 target
Trend dashboard reading from bench-history (gh-pages or similar) — defer until enough data points accumulate

Test plan

yamllint clean against the project .yamllint.yml (added in #92)
shellcheck clean on .github/scripts/bench-history-summary.sh and on every extracted run: block
Helper script exercised against two stub bench.json arrays (full 8-profile + single-profile) — both render valid markdown
Label-derivation tested for: v0.4.0, v1.0.0-rc1, v0.3.0+local, feat/foo, abc1234567, plus the override case
First real fire — pending runner provisioning on your side
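For the label-derivation cases in the test plan, one plausible mapping is sketched below. The actual rules are not spelled out in this description, so treat this as an assumption: `derive_label` is a hypothetical name, and the rule shown (override wins, then every character outside `[A-Za-z0-9._-]` becomes `-`) is just one way to get a safe filename under bench-history/.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical derivation: use the override ($2) when given, otherwise the
# ref ($1), and replace every character outside [A-Za-z0-9._-] with "-".
derive_label() {
  local raw="${2:-$1}"
  printf '%s' "$raw" | tr -c 'A-Za-z0-9._-' '-'
  printf '\n'
}

derive_label v0.4.0            # v0.4.0
derive_label v0.3.0+local      # v0.3.0-local
derive_label feat/foo          # feat-foo
derive_label v0.4.0 rerun-1    # rerun-1
```

Under this rule, tags and short SHAs pass through unchanged, and only branch names with slashes or build metadata with `+` are rewritten.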

Manually-triggered workflow (workflow_dispatch) that runs the full 8-profile bench matrix against a release tag, branch, or SHA on a self-hosted runner labelled bench-runner. Persists results to a self-bootstrapping bench-history orphan branch under bench-history/<label>.json and renders a markdown summary table of source-update / self-heal / ns-flip p99 latencies into the workflow step summary. pull_request triggers are deliberately omitted because self-hosted runners on public repos are exposed to fork-PR malicious code. The existing per-PR shape-break smoke check on ubuntu-22.04 (bench-smoke.yml) covers regression catching on inbound changes.

* release: v0.3.1 prep (CHANGELOG, chart bump)

Promote Unreleased entries (#90 SourceNotFound distinction, #94 release-time bench workflow, #97 source.version optional + Destination column rename) to the [0.3.1] - 2026-05-10 heading. Bump chart version and appVersion to 0.3.1. Refresh artifacthub.io/crdsExamples to use bare-Kind for the core ConfigMap sources, matching the new lead form shipped in #97.

* docs: bump install/cosign examples to v0.3.1; fix v0.3.0 misattribution

Pre-tag documentation sweep found two classes of issue:

- Install examples in README, getting-started, security (cosign), and the chart README still referenced v0.3.0. Bumped to v0.3.1 to match the release being cut.
- docs/troubleshooting.md attributed the source.version relaxation to v0.3.0 in two places. That was incorrect, since v0.3.0 still carried the CEL rule (rescinded only in v0.3.1, per CHANGELOG and api-stability.md). Fixed by retiming one mention to pre-v0.3.1/v0.3.1 and rephrasing the other to a present-tense, version-neutral statement.

be0x74a merged commit 7d709e4 into main May 8, 2026
15 checks passed

be0x74a deleted the ci/bench-release branch May 8, 2026 11:53

be0x74a mentioned this pull request May 10, 2026

release: v0.3.1 prep (CHANGELOG, chart bump) #98

Merged

4 tasks

be0x74a merged 1 commit into main from ci/bench-release
