CNTRLPLANE-3531: Remove EFS-backed Go build cache from CI runners #8637

Merged
celebdor merged 1 commit into openshift:main from celebdor:CNTRLPLANE-3531/remove-efs-cache
May 29, 2026
@celebdor
Collaborator

@celebdor celebdor commented May 29, 2026

Summary

  • Remove EFS-backed Go build cache infrastructure — benchmarks showed it made CI slower than no cache (4m03s vs 2m05s cold build) due to NFS latency on thousands of small files
  • Delete cache-warming CronJob, warm-go-cache composite action, and EFS PVC mount from runner pods
  • Remove warm-go-cache step from all 5 reusable workflows (verify, lint, test, envtest-ocp, envtest-kube)
  • gocacheprog stays in the runner image for use with the planned node-local DaemonSet cache (CNTRLPLANE-3530)

Benchmark data: https://gist.github.com/celebdor/7c73e9e3aee02d77f8879f251b354606

Cluster cleanup (after merge)

  • Delete CronJob: oc delete cronjob -n arc-runners go-cache-warmer
  • Delete PVC: oc delete pvc -n arc-runners go-cache-pvc
  • Redeploy runner scale set with updated values (no more EFS volume mount)

Test plan

  • Verify CI jobs run successfully without the cache (cold builds, ~2 min per shard)
  • Confirm no workflow references .github/actions/warm-go-cache
  • Confirm runner pods no longer mount /cache/go-build
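
The repository-side check in the test plan can be scripted. The following is a minimal sketch, not part of the PR: `find_warm_go_cache_refs` is a hypothetical helper name, and it simply greps the `.github` directory of a checkout for the string `warm-go-cache`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print every line under .github/
# in the given repository root that still mentions the warm-go-cache action.
# Prints nothing when no references remain.
find_warm_go_cache_refs() {
  local repo_root="${1:-.}"
  # -r: recurse, -n: show line numbers. "|| true" keeps a clean result
  # (grep exits 1 on no match) from aborting callers that use "set -e".
  grep -rn 'warm-go-cache' "$repo_root/.github" 2>/dev/null || true
}
```

Running `find_warm_go_cache_refs .` from the repository root should print nothing once the references are gone. For the pod check, one option (an assumption, reusing the `arc-runners` namespace from the cleanup steps) is `oc get pods -n arc-runners -o yaml | grep /cache/go-build`, which should likewise produce no output.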

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Removed automated Go build cache-warming infrastructure, including the cache-warming GitHub Action, scheduled maintenance job, and related integration across build, test, lint, and verification pipelines.

Benchmarks show reading cache entries from the EFS-backed PVC over NFS
takes 4m03s — 2x slower than compiling from scratch (2m05s). The Go
build cache consists of thousands of small files, and each lookup
requires NFS stat + read round-trips that dominate wall-clock time.
The cp -a fallback was also timing out because the cache grew too large.

Remove all EFS cache infrastructure:
- Delete cache-warming CronJob and warm-go-cache composite action
- Remove EFS PVC volume mount from runner pod spec
- Remove warm-go-cache step from all reusable workflows

gocacheprog stays in the runner image for use with the planned
node-local DaemonSet cache (CNTRLPLANE-3530).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests are triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) May 29, 2026
@openshift-ci-robot

openshift-ci-robot commented May 29, 2026

@celebdor: This pull request references CNTRLPLANE-3531 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci Bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) May 29, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 61e069ec-2cbe-42ae-aa51-537616c5dde4

📥 Commits

Reviewing files that changed from the base of the PR and between 9b67f7b and ab4db84.

📒 Files selected for processing (8)
  • .github/actions/warm-go-cache/action.yaml
  • .github/workflows/envtest-kube-reusable.yaml
  • .github/workflows/envtest-ocp-reusable.yaml
  • .github/workflows/lint-reusable.yaml
  • .github/workflows/test-reusable.yaml
  • .github/workflows/verify-reusable.yaml
  • hack/github-actions-runner/cache-warming-cronjob.yaml
  • hack/github-actions-runner/values.yaml
💤 Files with no reviewable changes (8)
  • hack/github-actions-runner/values.yaml
  • .github/workflows/envtest-ocp-reusable.yaml
  • hack/github-actions-runner/cache-warming-cronjob.yaml
  • .github/actions/warm-go-cache/action.yaml
  • .github/workflows/envtest-kube-reusable.yaml
  • .github/workflows/test-reusable.yaml
  • .github/workflows/verify-reusable.yaml
  • .github/workflows/lint-reusable.yaml

📝 Walkthrough


This PR removes Go build cache warming infrastructure from the Hypershift CI/CD pipeline. The warm-go-cache composite GitHub Action that previously used EFS-backed read-only caching with local overlay is no longer invoked by five reusable workflow files (envtest-kube, envtest-ocp, lint, test, verify). The lint workflow replaces this step with direct provisioning of pre-built lint tools. The associated Kubernetes CronJob responsible for periodic cache population is deleted. The runner container configuration in the Helm values file is simplified by relocating the resources block.

Possibly related PRs

  • openshift/hypershift#8576: Updated the warm-go-cache action to use gocacheprog with EFS-backed caching, which is the component now removed in this PR.

Suggested reviewers

  • jparrill
  • sjenning
🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: removal of EFS-backed Go build cache from CI runners. It is concise, directly related to the changeset, and provides clear context for understanding the primary modification. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies only GitHub Actions workflows and Kubernetes manifests, not Go test files. No Ginkgo test definitions exist in the modified files, so the stable test names check is not applicable. |
| Test Structure And Quality | ✅ Passed | PR contains no Ginkgo test code. Changes are limited to YAML configs and GitHub Actions workflows (removing EFS-backed Go cache infrastructure), making this custom check not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR removes EFS cache infrastructure (CronJob, actions). No new scheduling constraints added. Existing topology spread uses ScheduleAnyway (topology-safe on SNO, TNF, TNA, HyperShift). |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR does not add new Ginkgo e2e tests; it removes cache infrastructure and adds unit tests for CLI commands, making this check not applicable. |
| No-Weak-Crypto | ✅ Passed | PR removes CI cache infrastructure (workflows, Kubernetes CronJob, GitHub Actions composite action) and configuration files; no cryptographic code or weak crypto patterns introduced. |
| Container-Privileges | ✅ Passed | PR removes cache infrastructure with no privileged configs added. No privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN, or allowPrivilegeEscalation: true found. |
| No-Sensitive-Data-In-Logs | ✅ Passed | All logging statements in modified workflows only output boolean flags, workflow results, or filenames; no passwords, tokens, API keys, or PII exposed. |


@openshift-ci Bot added the area/ci-tooling label (Indicates the PR includes changes for CI or tooling) and removed the do-not-merge/needs-area label May 29, 2026
@celebdor celebdor marked this pull request as ready for review May 29, 2026 14:47
@openshift-ci Bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) May 29, 2026
@openshift-ci openshift-ci Bot requested review from clebs and jparrill May 29, 2026 14:47
Member

@bryan-cox bryan-cox left a comment

/lgtm

@openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) May 29, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@openshift-ci
Contributor

openshift-ci Bot commented May 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, celebdor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) May 29, 2026
@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 29, 2026

I now have the complete picture. Here is the analysis:

Test Failure Analysis Complete

Job Information

  • PR: #8637 CNTRLPLANE-3531: Remove EFS-backed Go build cache from CI runners
  • Branch: CNTRLPLANE-3531/remove-efs-cache
  • Commit: ab4db847b507f830c8e7e8c13c24d0ebeafeb654
  • Failed Jobs: envtest-ocp, envtest-kube, test (5 shards: hypershift-operator, cpo-other, cmd-support, other, cpo-hostedcontrolplane)
  • All 7 jobs failed with the identical error

Test Failure Analysis

Error

##[error]Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under
'/home/runner/_work/hypershift/hypershift/.github/actions/warm-go-cache'.
Did you forget to run actions/checkout before running your local action?

Summary

All 7 CI jobs fail instantly at workflow setup — before any Go code compiles or any test runs. The PR deletes .github/actions/warm-go-cache/action.yaml and removes its uses: step from the reusable workflows, but the caller workflows (e.g., envtest-ocp.yaml, test.yaml) are hardcoded to @main, so GitHub Actions resolves the reusable workflow from main (which still contains - uses: ./.github/actions/warm-go-cache), while the checkout step checks out the PR branch (which has the action file deleted). The reusable workflow then tries to run a local action that no longer exists in the checked-out tree.

Root Cause

This is a chicken-and-egg problem caused by how GitHub Actions resolves reusable workflows for PRs in the same repository.

The mechanism:

  1. The caller workflows (envtest-ocp.yaml, envtest-kube.yaml, test.yaml) all reference their reusable counterparts at a fixed ref:

    uses: openshift/hypershift/.github/workflows/envtest-ocp-reusable.yaml@main

  2. When a PR is opened, GitHub resolves the reusable workflow definition from @main (SHA 9b67f7bb2879ff251ea25e078536eefc804eac9c). The main branch reusable workflows still contain:

    - uses: ./.github/actions/warm-go-cache

  3. However, the actions/checkout step checks out the PR branch (CNTRLPLANE-3531/remove-efs-cache), where .github/actions/warm-go-cache/action.yaml has been deleted.

  4. When the runner tries to execute the warm-go-cache step (defined in the @main reusable workflow), it looks for the action in the checked-out PR tree, and the file doesn't exist.

In short: The reusable workflow from main says "run warm-go-cache", but the code on disk (from the PR branch) has that action deleted. These two changes must land atomically, but the @main pinning makes that impossible in a single PR.

This is not a flaky failure or infrastructure issue — it will fail deterministically on every run of this PR as long as main still references warm-go-cache in the reusable workflows.
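
This kind of mismatch can be detected before pushing. As a rough sketch under stated assumptions: `missing_local_actions` is a hypothetical helper (not part of the repository), it takes two directories (a checkout of the ref the callers pin, for example main, and a checkout of the PR branch), and it uses a simple grep for unquoted `uses: ./...` lines rather than parsing the workflow YAML.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list local actions that workflows in one tree
# (e.g. a checkout of main) reference but that are missing from another
# tree (e.g. the PR checkout the runner will actually have on disk).
missing_local_actions() {
  local workflow_root="$1" checkout_root="$2"
  # Extract every unquoted "uses: ./path" reference from the workflow files.
  grep -rhoE 'uses:[[:space:]]*\./[^[:space:]@#]+' \
      "$workflow_root/.github/workflows" 2>/dev/null |
    sed -E 's|^uses:[[:space:]]*\./||' |
    sort -u |
    while read -r action_dir; do
      # Skip local reusable-workflow references; they point at files.
      case "$action_dir" in *.yml|*.yaml) continue ;; esac
      # A local action directory needs one of these three files.
      if [ ! -f "$checkout_root/$action_dir/action.yml" ] &&
         [ ! -f "$checkout_root/$action_dir/action.yaml" ] &&
         [ ! -f "$checkout_root/$action_dir/Dockerfile" ]; then
        printf '%s\n' "$action_dir"
      fi
    done
}
```

For example, after `git worktree add --detach ../main-tree origin/main`, calling `missing_local_actions ../main-tree .` from the PR checkout would, given the analysis above, be expected to report `.github/actions/warm-go-cache`. An empty result suggests the workflows on the pinned ref and the files in the PR tree agree.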

Recommendations

Option A — Two-phase merge (recommended):

  1. First PR: Remove only the - uses: ./.github/actions/warm-go-cache lines from the reusable workflows (*-reusable.yaml). Keep the action file itself. Merge this.
  2. Second PR: Delete .github/actions/warm-go-cache/action.yaml, cache-warming-cronjob.yaml, and the runner values.yaml volume mounts. Since the reusable workflows on main no longer reference the action, this PR's CI will pass.

Option B — Self-referencing reusable workflow ref:
Change the caller workflows to use the PR's head SHA instead of @main (not standard practice and introduces other complexities — not recommended).

Option C — Make the action a no-op first:

  1. First PR: Replace .github/actions/warm-go-cache/action.yaml with a no-op composite action (e.g., runs: {using: composite, steps: [{run: "echo no-op", shell: bash}]}; run steps in a composite action require an explicit shell), and simultaneously remove the EFS mounts and cronjob. Merge this.
  2. Second PR: Remove the uses: lines and delete the now-no-op action file.

Option A is the cleanest approach. Split this PR into two: one that removes the references, and a follow-up that deletes the files.

Evidence
| Evidence | Detail |
|---|---|
| Error message (all 7 jobs) | Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/runner/_work/hypershift/hypershift/.github/actions/warm-go-cache' |
| Reusable workflow ref | openshift/hypershift/.github/workflows/envtest-ocp-reusable.yaml@refs/heads/main resolved to SHA 9b67f7bb2879ff251ea25e078536eefc804eac9c (confirmed = current main HEAD) |
| main reusable workflow | envtest-ocp-reusable.yaml on main still contains - uses: ./.github/actions/warm-go-cache |
| PR branch checkout | actions/checkout checks out CNTRLPLANE-3531/remove-efs-cache where warm-go-cache/action.yaml is deleted |
| Caller workflow pinning | envtest-ocp.yaml: uses: openshift/hypershift/.github/workflows/envtest-ocp-reusable.yaml@main (same for envtest-kube.yaml and test.yaml) |
| PR diff | Deletes .github/actions/warm-go-cache/action.yaml AND removes - uses: ./.github/actions/warm-go-cache from reusable workflows, but these changes don't take effect together because the reusable workflow is resolved from @main |
| Failure timing | All jobs fail at workflow setup, before any compilation or test execution |

@celebdor celebdor merged commit 8b13140 into openshift:main May 29, 2026
11 of 39 checks passed
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/ci-tooling: Indicates the PR includes changes for CI or tooling
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
