feat(ci): add release-time bench workflow on self-hosted runner #94

Merged: be0x74a merged 1 commit into main from ci/bench-release on May 8, 2026

@be0x74a be0x74a commented May 8, 2026

Why

PR 4 of the bench multi-PR sequence:

Earlier in the design conversation we considered GHA larger runners (8-core, paid) vs. self-hosted. We picked self-hosted because release-time benches are rare (~6/year), year-over-year reproducibility matters more than always-on availability, and a dedicated VM eliminates noisy-neighbor variance from those comparisons.

What

.github/workflows/bench.yml:

  • Trigger: workflow_dispatch only (NOT pull_request — self-hosted runners on public repos are exposed to fork-PR malicious code; manual-only avoids that class entirely)
  • Inputs: ref (release tag, branch, or SHA; required) and an optional label override for the bench-history filename
  • Runs on: [self-hosted, bench-runner] — runner provisioning is handled separately on your side (see checklist below)
  • Steps: checkout target ref → Setup Go from go.mod → create Kind cluster (bench-release, kindest/node v1.32.0) → build & load operator image → kustomize-deploy → wait for ready → go run ./test/bench --profile=full --output=json → jq -s '.' to slurp the output into a JSON array → derive label → render markdown step summary → push to bench-history orphan branch (self-bootstraps on first run with a README) → upload artifact → cleanup Kind cluster
  • Permissions: top-level contents: read; job adds contents: write for the orphan-branch push (least privilege)
  • Timeout: 240 min (full matrix typically 60-90 min — leaving a 2-3x cushion for stress profiles on first run)
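
The self-bootstrapping push to the bench-history branch can be sketched roughly as below. This is a minimal illustration, not the workflow's actual step: the function name publish_bench_result, the temporary-clone approach, the README text, and the bench-bot commit identity are all assumptions. Only the branch name and the bench-history/<label>.json layout come from this PR.

```shell
# Hypothetical helper (not the workflow's real step): publish one result file.
# Args: <repo-url-or-path> <label> <path-to-bench.json>
publish_bench_result() {
  local repo="$1" label="$2" json="$3" work
  work=$(mktemp -d) || return 1

  if git ls-remote --exit-code --heads "$repo" bench-history >/dev/null 2>&1; then
    # Branch already exists: fetch only that branch.
    git clone --quiet --single-branch --branch bench-history "$repo" "$work" || return 1
  else
    # First run: start an unrelated root history and seed it with a README.
    git init --quiet "$work" || return 1
    git -C "$work" symbolic-ref HEAD refs/heads/bench-history
    git -C "$work" remote add origin "$repo"
    printf '# bench-history\n\nOne JSON result file per label.\n' > "$work/README.md"
  fi

  mkdir -p "$work/bench-history"
  cp "$json" "$work/bench-history/$label.json" || return 1
  git -C "$work" add -A

  # Commit and push only when the staged tree differs from what is stored.
  if ! git -C "$work" diff --cached --quiet; then
    git -C "$work" -c user.name='bench-bot' -c user.email='bench-bot@example.invalid' \
      commit --quiet -m "bench: add $label" || return 1
    git -C "$work" push --quiet origin bench-history || return 1
  fi
  rm -rf "$work"
}
```

On the runner this would be called after gh auth setup-git has configured credentials, for example publish_bench_result "https://github.com/OWNER/REPO.git" v0.4.0 bench.json, where OWNER/REPO is a placeholder.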

.github/scripts/bench-history-summary.sh (bash + jq):

  • Renders a markdown step summary with three tables (source-update p99, self-heal p99, ns-flip p99) keyed by profile
  • Em-dash for missing/zero values; LC_ALL=C for locale-stable ms formatting
  • Step summary is operator-facing UX so the run output is glanceable without diving into the bench-history JSON
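
The missing/zero handling and locale-stable formatting described above might look something like the following. This is a sketch only: the real script uses bash + jq, and the names fmt_ms and table_row, the one-decimal precision, and the "ms" suffix are assumptions made for illustration.

```shell
# Hypothetical formatter: print a latency in ms, or a dash placeholder when
# the value is missing, the literal string "null", or zero.
fmt_ms() {
  case "${1-}" in
    ''|null)  printf '\342\200\224' ;;              # UTF-8 bytes of the dash placeholder
    *[!0.]*)  LC_ALL=C printf '%.1f ms' "$1" ;;     # contains a non-zero digit
    *)        printf '\342\200\224' ;;              # only zeros and dots, e.g. 0 or 0.0
  esac
}

# Render one markdown table row: | profile | p99 |
table_row() {
  printf '| %s | %s |\n' "$1" "$(fmt_ms "${2-}")"
}
```

Pinning LC_ALL=C for the printf call keeps the decimal separator a dot even when the runner's locale would otherwise use a comma, so rows stay comparable between runs.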

Self-hosted runner provisioning checklist

You'll need to provision and register the runner separately. The workflow's runs-on: [self-hosted, bench-runner] will only pick it up when both labels match.

VM specs (Proxmox or equivalent):

  • 8 vCPU / 16 GB RAM / 50 GB NVMe-backed disk
  • Ubuntu 22.04 LTS recommended (matches GHA hosted runner image for parity)

Required tools on the VM:

apt-get update
apt-get install -y docker.io git curl jq
# Go (matches the project's go.mod toolchain — installer URL changes; check go.dev)
# kubectl, kind, helm, gh CLI
# GitHub Actions runner agent

gh CLI specifically is required on the runner (the workflow uses gh auth setup-git to configure git credentials for the bench-history push).
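
A quick preflight on the VM can confirm the tools listed above are on PATH before the first run. This is a sketch, and the function names missing_tools and bench_runner_preflight are made up for illustration. It checks presence only, not versions, so the Go toolchain still needs to be compared against go.mod by hand.

```shell
# Print the name of every requested tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the tool set from the checklist above; non-zero exit if any is absent.
bench_runner_preflight() {
  missing=$(missing_tools docker git curl jq go kubectl kind helm gh)
  if [ -n "$missing" ]; then
    printf 'missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
}
```

Running bench_runner_preflight as the same user the runner service uses is the more meaningful check, since PATH can differ between an interactive login and the service.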

Runner registration:

  1. GitHub org settings → Actions → Runners → New self-hosted runner → choose Linux/x64
  2. Follow the install commands shown
  3. When prompted for labels, add bench-runner (in addition to the default self-hosted, Linux, X64)
  4. Configure as a service so it survives VM reboots
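
For reference, a non-interactive equivalent of steps 2-4 uses the runner agent's config.sh and svc.sh scripts. The helper below only prints the commands for review rather than running them. Its name is hypothetical, the org URL and token are placeholders for the values shown on the "New self-hosted runner" page, and --unattended is an assumption that you want to skip the interactive prompts.

```shell
# Print (do not run) the registration commands, to be reviewed and then
# executed from the directory where the runner agent was unpacked.
# $1: org URL, $2: registration token (both placeholders in this sketch)
print_runner_setup() {
  printf './config.sh --url %s --token %s --labels bench-runner --unattended\n' "$1" "$2"
  printf 'sudo ./svc.sh install\n'
  printf 'sudo ./svc.sh start\n'
}
```

With --labels, bench-runner is added alongside the default self-hosted, Linux, and X64 labels, which is what the workflow's runs-on needs.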

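Power profile: the VM doesn't need to be always-on. A runner installed as a service shows as offline while the VM is down and reconnects on startup, so boot before triggering bench and shut down after; this keeps power cost negligible. One caveat: GitHub automatically removes a self-hosted runner that has not connected for more than 14 days, so at ~6 runs/year either boot the VM periodically or expect to re-register before a release bench.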

Why pull_request is omitted

GitHub explicitly recommends self-hosted runners only for private repos, because fork PRs can run arbitrary workflow code on the runner. Our repo is public, so the bench workflow is workflow_dispatch-only. The PR-time shape-break check (bench-smoke.yml, added in #92) stays on the free-tier hosted runner — that's where regression catching happens; this workflow is just for the release-time anchored numbers.

If we ever add a pull_request trigger to this workflow, the runner becomes a supply-chain target. Don't.
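
One way to make that rule hard to break by accident is a small lint that fails if a PR trigger ever appears in the workflow file. The check below is a rough sketch and not part of this PR. The function name is hypothetical, and the match is textual, so a mention of pull_request inside an if: expression or a step name would also trip it.

```shell
# Succeeds (exit 0) when the file mentions a pull_request or
# pull_request_target trigger outside of full-line comments.
has_pr_trigger() {
  grep -v '^[[:space:]]*#' "$1" | grep -Eqw 'pull_request|pull_request_target'
}
```

A guard step on a hosted runner could then run: if has_pr_trigger .github/workflows/bench.yml; then echo "bench.yml must stay workflow_dispatch-only" >&2; exit 1; fi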

Out of scope (future work)

  • Per-event controller histograms (Approach 2b) — v0.4.0 target
  • Trend dashboard reading from bench-history (gh-pages or similar) — defer until enough data points accumulate

Test plan

  • yamllint clean against the project .yamllint.yml (added in #92, feat(ci): add bench smoke check on PRs touching api/controller/bench)
  • shellcheck clean on .github/scripts/bench-history-summary.sh and on every extracted run: block
  • Helper script exercised against two stub bench.json arrays (full 8-profile + single-profile) — both render valid markdown
  • Label-derivation tested for: v0.4.0, v1.0.0-rc1, v0.3.0+local, feat/foo, abc1234567, plus the override case
  • First real fire — pending runner provisioning on your side
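
To make the label-derivation cases concrete, here is one plausible shape for the rule. It is an assumption and not the workflow's actual logic: the override wins when set, and any character that is unsafe in a filename is replaced with a hyphen so the result can be used as bench-history/<label>.json. The function name derive_label and the allowed character set are illustrative.

```shell
# Hypothetical label rule: $1 is the ref input, $2 the optional override.
# Characters outside A-Z a-z 0-9 . _ + - are replaced with a hyphen.
derive_label() {
  if [ -n "${2-}" ]; then
    printf '%s\n' "$2"
  else
    printf '%s\n' "$1"
  fi | LC_ALL=C sed 's/[^A-Za-z0-9._+-]/-/g'
}
```

Under this rule the tags and the SHA pass through unchanged and feat/foo becomes feat-foo. The real workflow may differ, for example by shortening SHAs.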

Manually-triggered workflow (workflow_dispatch) that runs the full
8-profile bench matrix against a release tag, branch, or SHA on a
self-hosted runner labelled bench-runner. Persists results to a
self-bootstrapping bench-history orphan branch under
bench-history/<label>.json and renders a markdown summary table of
source-update / self-heal / ns-flip p99 latencies into the workflow
step summary.

pull_request triggers are deliberately omitted because self-hosted
runners on public repos are exposed to fork-PR malicious code. The
existing per-PR shape-break smoke check on ubuntu-22.04
(bench-smoke.yml) covers regression catching on inbound changes.
@be0x74a be0x74a merged commit 7d709e4 into main May 8, 2026
15 checks passed
@be0x74a be0x74a deleted the ci/bench-release branch May 8, 2026 11:53
be0x74a added a commit that referenced this pull request May 10, 2026
* release: v0.3.1 prep (CHANGELOG, chart bump)

Promote Unreleased entries (#90 SourceNotFound distinction, #94 release-time
bench workflow, #97 source.version optional + Destination column rename) to
the [0.3.1] - 2026-05-10 heading. Bump chart version and appVersion to 0.3.1.
Refresh artifacthub.io/crdsExamples to use bare-Kind for the core ConfigMap
sources, matching the new lead form shipped in #97.

* docs: bump install/cosign examples to v0.3.1; fix v0.3.0 misattribution

Pre-tag documentation sweep found two classes of issue:

- Install examples in README, getting-started, security (cosign), and the
  chart README still referenced v0.3.0. Bumped to v0.3.1 to match the
  release being cut.
- docs/troubleshooting.md attributed the source.version relaxation to
  v0.3.0 in two places — incorrect, since v0.3.0 still carried the CEL
  rule (rescinded only in v0.3.1, per CHANGELOG and api-stability.md).
  Fixed by retiming one mention to pre-v0.3.1/v0.3.1 and rephrasing the
  other to a present-tense, version-neutral statement.