
CNTRLPLANE-3329: Use fuse-overlayfs for build cache instead of full copy #8568

Merged: celebdor merged 2 commits into openshift:main from celebdor:CNTRLPLANE-3329/use-cache-directly on May 21, 2026

@celebdor (Collaborator) commented May 21, 2026

Summary

  • Adds fuse-overlayfs to the GitHub Actions runner image
  • Replaces the per-job cp -a of the entire EFS cache with a fuse-overlayfs mount that gives Go a writable view over the read-only EFS PVC with zero copy overhead
  • Reads hit the EFS mount directly, writes go to a tmpfs upper layer
  • Falls back to cp -a if fuse-overlayfs or /dev/fuse is unavailable or the mount itself fails, and to an empty cache if the copy also fails
  • On OpenShift 4.15+, /dev/fuse is available to unprivileged pods without cluster config changes
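The fallback order in the summary (overlay mount, then a timed copy, then an empty cache) can be sketched as a single shell function. This is a minimal sketch under stated assumptions, not the action's actual script: the function name `warm_go_cache` is hypothetical, the paths are passed as arguments so the logic can run outside a runner, and clearing a partially copied cache after a failed copy is our own addition.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the warm-go-cache fallback chain (not the action's real script).
# Prints the strategy that succeeded: "overlay", "copy" or "empty".
warm_go_cache() {
  local lower="$1"   # read-only cache, e.g. the EFS mount /cache/go-build
  local target="$2"  # directory GOCACHE will point at, e.g. /tmp/go-build-cache
  local upper="$3"   # overlay upper layer that receives Go's writes
  local work="$4"    # overlay workdir, which must start out empty

  # The GOCACHE directory exists whatever happens below.
  mkdir -p "$target"

  # 1. Preferred: overlay mount with no copy. Reads hit $lower, writes land in $upper.
  if [ -d "$lower" ] && command -v fuse-overlayfs >/dev/null 2>&1 && [ -e /dev/fuse ]; then
    rm -rf "$work"
    mkdir -p "$upper" "$work"
    if fuse-overlayfs -o "lowerdir=$lower,upperdir=$upper,workdir=$work" "$target"; then
      echo overlay
      return 0
    fi
    echo "::warning::fuse-overlayfs mount failed, falling back to copy" >&2
  fi

  # 2. Fallback: bounded full copy, i.e. the behaviour before this PR.
  if [ -d "$lower" ]; then
    if timeout 120 cp -a "$lower/." "$target/"; then
      echo copy
      return 0
    fi
    echo "::warning::cache copy failed or timed out, continuing with an empty cache" >&2
    rm -rf "$target" && mkdir -p "$target"   # discard a partially copied cache
  fi

  # 3. Last resort: an empty but writable cache; Go simply rebuilds everything.
  echo empty
}
```

On a runner it would be invoked as `warm_go_cache /cache/go-build /tmp/go-build-cache /tmp/go-cache-upper /tmp/go-cache-work`, after which GOCACHE can point at /tmp/go-build-cache regardless of which branch ran.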

Context

Benchmarking with contrib/ci/gha-cache-timing.sh showed that all cached workflows (lint, verify, unit tests, envtest) got slower after the EFS cache was introduced (May 13-18) compared to before:

Job                     | Before (May 8-12) | After (May 19-21) | Delta
lint / Lint             | 7m46s             | 9m26s             | +1m40s
verify / Verify         | 11m19s            | 12m44s            | +1m25s
test / Unit Tests (avg) | 6m30s             | 8m44s             | +2m14s
envtest (avg)           | 8m33s             | 10m54s            | +2m21s
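The Delta column is simply the after time minus the before time. A small shell sketch reproduces it; `to_seconds` and `delta` are hypothetical helpers (they are not taken from contrib/ci/gha-cache-timing.sh) and assume durations are always written as `<minutes>m<seconds>s`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reproducing the Delta column from "XmYs" durations.

# Convert a duration like "7m46s" to whole seconds.
to_seconds() {
  local d="$1" m s
  m="${d%%m*}"             # text before the first "m", e.g. "7"
  s="${d#*m}"; s="${s%s}"  # text between "m" and the trailing "s", e.g. "46"
  echo $(( 10#$m * 60 + 10#$s ))
}

# Print (after - before) as "+XmYYs", or "-XmYYs" if the job got faster.
delta() {
  local diff sign="+"
  diff=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
  if (( diff < 0 )); then sign="-"; diff=$(( -diff )); fi
  printf '%s%dm%02ds\n' "$sign" $(( diff / 60 )) $(( diff % 60 ))
}
```

For example, `delta 7m46s 9m26s` prints `+1m40s`, matching the lint row.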

The `timeout 120 cp -a /cache/go-build/. /tmp/go-build-cache/` step in the warm-go-cache action copied the full cache at the start of every job, negating the compilation-time savings. Pointing GOCACHE directly at the read-only mount doesn't work either: Go fails hard if it can't write to its cache directory.

Test plan

  • CI passes on this PR (falls back to cp -a since the new runner image isn't built yet)
  • After new runner image is deployed: verify fuse-overlayfs mount succeeds in job logs
  • Compare job durations against pre-cache baseline (May 8-12) using contrib/ci/gha-cache-timing.sh

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Improved Go build cache handling in CI: runner environment now supports an overlay-based read-mostly cache with a writable fallback.
    • The cache location is always initialized and exposed to builds.
    • CI will attempt to warm from remote storage (with a timed copy fallback) and will safely proceed without cache if warming or mounting fails.
    • Runner image now includes overlay tooling to enable overlay mounts when available.
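For the "always initialized and exposed to builds" point, GitHub Actions lets a step publish an environment variable to every later step of the job by appending a `NAME=value` line to the file named by `$GITHUB_ENV`. The sketch below assumes warm-go-cache exposes GOCACHE that way; `export_gocache` is a hypothetical name, and the real action may export the variable differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: make GOCACHE visible to every later step of the job.
# GitHub Actions reads NAME=value lines from the file named by $GITHUB_ENV
# once the current step finishes.
export_gocache() {
  local dir="$1"
  mkdir -p "$dir"                                        # create it even if warming fails later
  echo "GOCACHE=$dir" >> "${GITHUB_ENV:?GITHUB_ENV is not set}"
  export GOCACHE="$dir"                                  # also visible to the rest of this step
}
```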

Point GOCACHE at the read-only EFS mount (/cache/go-build) instead of
copying the entire cache into /tmp at job start. Go's build cache
handles read-only directories gracefully by skipping writes.

This eliminates the per-job cp -a overhead that was adding ~2 minutes
to every CI job since the EFS cache was introduced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-merge-bot (Contributor)

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci Bot (Contributor) commented May 21, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 21, 2026
@openshift-ci-robot commented May 21, 2026

@celebdor: This pull request references CNTRLPLANE-3329 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Points GOCACHE directly at the read-only EFS mount (/cache/go-build) instead of copying the entire cache into /tmp at job start
  • Go's build cache handles read-only directories gracefully — it reads cached artifacts and silently skips writing new entries
  • Eliminates the per-job cp -a overhead that was adding ~2 minutes to every CI job

Context

The timeout 120 cp -a /cache/go-build/. /tmp/go-build-cache/ in the warm-go-cache action was copying the full cache on every job start, negating the compilation time savings.

Test plan

  • Verify lint job completes successfully and is faster than baseline
  • Verify verify job completes successfully
  • Verify unit test shards complete successfully
  • Verify envtest-kube jobs complete successfully
  • Verify envtest-ocp jobs complete successfully
  • Compare job durations against pre-cache baseline (May 8-12)

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels May 21, 2026
@coderabbitai Bot (Contributor) commented May 21, 2026

Walkthrough

The PR updates the warm-go-cache composite GitHub Action to always create /tmp/go-build-cache and export GOCACHE to it; when /cache/go-build exists and fuse-overlayfs with /dev/fuse is available, the action mounts an overlay (lower=/cache/go-build, upper/work in /tmp) into /tmp/go-build-cache, otherwise it attempts a 120s timed copy from /cache/go-build into /tmp/go-build-cache; failures in both paths result in proceeding without a warmed cache. The action description was updated, and the runner image now installs fuse-overlayfs.

Possibly related PRs

  • openshift/hypershift#8495: Updates CI workflows to invoke warm-go-cache and consume /tmp/go-build-cache via GOCACHE, directly coupled to this action's cache behavior.

Suggested reviewers

  • cblecker
  • clebs
  • bryan-cox
🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly summarizes the main change: replacing full copy operations with fuse-overlayfs for build cache management, which matches the core modifications across both files.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | The PR only modifies .github/actions/warm-go-cache/action.yaml and Dockerfile.github-actions-runner, neither of which are Ginkgo test files. The check for Ginkgo test name stability does not apply.
Test Structure And Quality | ✅ Passed | Check not applicable: PR modifies infrastructure files (.github/actions/warm-go-cache/action.yaml and Dockerfile.github-actions-runner), not Ginkgo test code.
Microshift Test Compatibility | ✅ Passed | PR contains only CI/infrastructure changes (.github Actions YAML, Dockerfile) with no Ginkgo e2e test additions; check is not applicable.
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR adds no new Ginkgo e2e tests; only GitHub Actions workflow config and Dockerfile updates for build caching infrastructure.
Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies GitHub Actions CI/CD tooling (action.yaml and Dockerfile), not Kubernetes manifests, operator code, or controllers. Custom check does not apply.
Ote Binary Stdout Contract | ✅ Passed | PR modifies only GitHub Actions infrastructure (action.yaml and Dockerfile), not OTE binaries or test suites; check is not applicable.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR does not add any Ginkgo e2e tests. Changes are limited to GitHub Actions workflow configuration and Dockerfile updates for build cache improvements.

@openshift-ci Bot (Contributor) commented May 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: celebdor
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@celebdor celebdor changed the title CNTRLPLANE-3329: Use EFS build cache directly instead of copying CNTRLPLANE-3329: Use fuse-overlayfs for build cache instead of full copy May 21, 2026
@celebdor celebdor added the area/ci-tooling Indicates the PR includes changes for CI or tooling label May 21, 2026
@coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/actions/warm-go-cache/action.yaml:
- Around lines 9-16: The overlay mount branch uses an elif, so a failed fuse-overlayfs mount never triggers the copy fallback. Change the logic in the block that calls fuse-overlayfs so that, if fuse-overlayfs exists and /cache/go-build exists, you attempt the mount and, on mount failure, explicitly fall back to the copy step (e.g., check the fuse-overlayfs exit status or use a compound || block to run the timeout cp -a /cache/go-build/. /tmp/go-build-cache/ fallback and emit a warning). Update the code around the fuse-overlayfs invocation and the timeout cp -a command so the copy to /tmp/go-build-cache is attempted whenever the mount fails, not only when fuse-overlayfs is unavailable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 946e332a-7584-4610-9010-bae27e831707

📥 Commits

Reviewing files that changed from the base of the PR and between 141640d and 553f092.

📒 Files selected for processing (2)
  • .github/actions/warm-go-cache/action.yaml
  • Dockerfile.github-actions-runner

@codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.40%. Comparing base (a7d68da) to head (56fbf18).
⚠️ Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8568      +/-   ##
==========================================
+ Coverage   40.34%   40.40%   +0.06%     
==========================================
  Files         755      755              
  Lines       93167    93235      +68     
==========================================
+ Hits        37587    37675      +88     
+ Misses      52877    52858      -19     
+ Partials     2703     2702       -1     
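The headline numbers can be checked from this diff: coverage is hits divided by all tracked lines (hits + misses + partials), so 37675 / 93235 on the head commit and 37587 / 93167 on the base. Since 37675 / 93235 is about 40.409%, the reported 40.40% implies truncation to two decimals rather than rounding to nearest; as far as we know that matches Codecov's default `round: down` setting. `coverage_pct` is a hypothetical helper that uses integer arithmetic only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: coverage percentage truncated (rounded down) to two decimals.
# Integer arithmetic only, so no floating-point rounding can creep in.
coverage_pct() {
  local hits="$1" lines="$2"
  local basis_points=$(( hits * 10000 / lines ))   # hundredths of a percent, e.g. 4040
  printf '%d.%02d\n' $(( basis_points / 100 )) $(( basis_points % 100 ))
}
```

`coverage_pct 37675 93235` gives 40.40 and `coverage_pct 37587 93167` gives 40.34; 4040 - 4034 = 6 hundredths of a percent is the +0.06% in the diff.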

see 3 files with indirect coverage changes

Flag                   | Coverage Δ
cmd-support            | 34.44% <ø> (+0.13%) ⬆️
cpo-hostedcontrolplane | 41.76% <ø> (ø)
cpo-other              | 40.31% <ø> (+0.17%) ⬆️
hypershift-operator    | 50.72% <ø> (ø)
other                  | 31.54% <ø> (ø)

Flags with carried forward coverage won't be shown.

Replace the per-job `cp -a` of the entire EFS cache with a
fuse-overlayfs mount. This gives Go a writable view over the read-only
EFS PVC with zero copy overhead — reads hit the EFS mount directly and
writes go to a tmpfs upper layer.

Falls back to `cp -a` if fuse-overlayfs or /dev/fuse is unavailable,
and to an empty cache if both fail.

Adds fuse-overlayfs to the runner image. On OpenShift 4.15+ /dev/fuse
is available to unprivileged pods without cluster config changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@celebdor celebdor force-pushed the CNTRLPLANE-3329/use-cache-directly branch from 553f092 to 56fbf18 on May 21, 2026 15:36
@coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

Inline comments:
In @.github/actions/warm-go-cache/action.yaml:
- Around lines 10-16: Ensure the fuse-overlayfs mount uses a truly empty workdir and that the mountpoint is created before mounting. Before invoking the fuse-overlayfs command in the script (fuse-overlayfs -o lowerdir=/cache/go-build,upperdir=/tmp/go-cache-upper,workdir=/tmp/go-cache-work /tmp/go-build-cache), recreate or empty /tmp/go-cache-work (e.g., rm -rf it and then mkdir it) so it is guaranteed empty per fuse-overlayfs requirements, and explicitly mkdir -p /tmp/go-build-cache before the mount attempt. Keep the existing mount invocation and the mounted=true fallback logic intact.
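The workdir suggestion amounts to a few lines of directory preparation before the existing mount call. A minimal sketch, where `prepare_overlay_dirs` is a hypothetical helper and not part of the action:

```shell
#!/usr/bin/env bash
# Hypothetical helper following the suggestion above: create the mountpoint and
# upper layer, and make sure the overlay workdir starts out empty.
prepare_overlay_dirs() {
  local target="$1" upper="$2" work="$3"
  [ -n "$work" ] || return 1   # refuse to run rm -rf on an empty path
  mkdir -p "$target" "$upper"
  rm -rf "$work"               # drop anything left behind by an earlier attempt
  mkdir -p "$work"             # recreate it, now guaranteed empty
}
```

`prepare_overlay_dirs /tmp/go-build-cache /tmp/go-cache-upper /tmp/go-cache-work` would run just before the unchanged fuse-overlayfs invocation. If runners are ephemeral, /tmp/go-cache-work starts out empty anyway, so this mainly matters for reused runners or a retried step.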

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 00b72777-978f-48ac-bec4-2c181b014575

📥 Commits

Reviewing files that changed from the base of the PR and between 553f092 and 56fbf18.

📒 Files selected for processing (2)
  • .github/actions/warm-go-cache/action.yaml
  • Dockerfile.github-actions-runner

Comment on lines +10 to +16
if [ -d /cache/go-build ] && command -v fuse-overlayfs >/dev/null 2>&1 && [ -e /dev/fuse ]; then
  mkdir -p /tmp/go-cache-upper /tmp/go-cache-work
  if fuse-overlayfs -o lowerdir=/cache/go-build,upperdir=/tmp/go-cache-upper,workdir=/tmp/go-cache-work /tmp/go-build-cache; then
    mounted=true
  else
    echo "::warning::fuse-overlayfs mount failed, falling back to copy"
  fi

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What is the correct syntax for fuse-overlayfs mount command with lowerdir, upperdir, and workdir options?

💡 Result:

The correct syntax for the fuse-overlayfs mount command is [1][2]:

fuse-overlayfs -o lowerdir=DIR1[:DIR2...],upperdir=UPPER,workdir=WORK MOUNTPOINT

Options explanation:
  • -o lowerdir=DIR1[:DIR2...]: A colon-separated list of directories to be merged as the read-only base layers [1][3][4].
  • -o upperdir=UPPER: The directory where all write changes to the filesystem will be stored [1][3][4].
  • -o workdir=WORK: An internal working directory required by the filesystem, which must reside on the same filesystem as the upperdir [1][3][4].
  • MOUNTPOINT: The target directory where the merged filesystem will be accessible [1][3].

Example: fuse-overlayfs -o lowerdir=lower1:lower2,upperdir=upper,workdir=work merged_dir [2]

Note that workdir must be an empty directory on the same filesystem as the upperdir [1][3]. If you omit upperdir and workdir, the filesystem will be mounted in a read-only state [5].


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,120p' .github/actions/warm-go-cache/action.yaml

Repository: openshift/hypershift

Length of output: 1045


Confirm fuse-overlayfs mount syntax and tighten workdir requirements.

  • The command fuse-overlayfs -o lowerdir=/cache/go-build,upperdir=/tmp/go-cache-upper,workdir=/tmp/go-cache-work /tmp/go-build-cache matches fuse-overlayfs’ documented -o lowerdir=...,upperdir=...,workdir=... MOUNTPOINT syntax, and /tmp/go-build-cache is created before the mount call.
  • To maximize mount success, ensure /tmp/go-cache-work is an empty directory before mounting (the current mkdir -p doesn’t guarantee emptiness per fuse-overlayfs’ requirement).

@celebdor (Collaborator, Author)

/override ci/prow/e2e-v2-aws
/override ci/prow/e2e-kubevirt-aws-ovn-reduced
/override ci/prow/e2e-azure-self-managed
/override ci/prow/e2e-aws-upgrade-hypershift-operator
/override ci/prow/e2e-aws
/override ci/prow/e2e-aks
/override ci/prow/e2e-v2-gke

@openshift-ci Bot (Contributor) commented May 21, 2026

@celebdor: Overrode contexts on behalf of celebdor: ci/prow/e2e-aks, ci/prow/e2e-aws, ci/prow/e2e-aws-upgrade-hypershift-operator, ci/prow/e2e-azure-self-managed, ci/prow/e2e-kubevirt-aws-ovn-reduced, ci/prow/e2e-v2-aws, ci/prow/e2e-v2-gke

@celebdor celebdor marked this pull request as ready for review May 21, 2026 15:46
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and muraee May 21, 2026 15:47
@celebdor (Collaborator, Author)

/override ci/prow/images

@openshift-ci Bot (Contributor) commented May 21, 2026

@celebdor: Overrode contexts on behalf of celebdor: ci/prow/images

@openshift-ci Bot (Contributor) commented May 21, 2026

@celebdor: all tests passed!

@celebdor celebdor merged commit 1676a97 into openshift:main May 21, 2026
31 of 34 checks passed
@hypershift-jira-solve-ci Bot commented May 21, 2026

Test Failure Analysis Complete

Job Information

  • Job: Red Hat Konflux / hypershift-gh-actions-runner-on-pull-request
  • Build ID: hypershift-gh-actions-runner-on-pull-request-zwmn7
  • Pipeline: Konflux Tekton PipelineRun in namespace crt-redhat-acm-tenant
  • PR: #8568 (CNTRLPLANE-3329: Use fuse-overlayfs for build cache instead of full copy)
  • Failed Task: ecosystem-cert-preflight-checks (arm64 platform)
  • Duration: 58 seconds (killed)

Test Failure Analysis

Error

time="2026-05-21T15:58:36Z" level=info msg="running checks for quay.io/redhat-user-workloads/crt-redhat-acm-tenant/hypershift-gh-actions-runner:on-pr-56fbf18ff1b6fafb38b8b32937f78b440e98b893 for platform arm64"
time="2026-05-21T15:58:36Z" level=info msg="target image" image="quay.io/redhat-user-workloads/crt-redhat-acm-tenant/hypershift-gh-actions-runner:on-pr-56fbf18ff1b6fafb38b8b32937f78b440e98b893"
/tekton/scripts/script-3-x8jgl: line 30:    17 Killed                  /usr/local/bin/preflight check container "$image_url" --platform "$arch"

Summary

The Konflux ecosystem-cert-preflight-checks task was OOM-killed (SIGKILL) while running the Red Hat preflight certification tool against the arm64 variant of the hypershift-gh-actions-runner image. The task's app-check step has a hard 4Gi memory limit, and the preflight tool must pull and fully extract the container image into memory to inspect it. The hypershift-gh-actions-runner image is exceptionally large — it bundles the Go 1.25.7 toolchain, pre-compiled golangci-lint binary, kube-api-linter plugin (.so), oc/kubectl clients, gcc, and Python on top of the GitHub Actions runner base image. The amd64 preflight check passed (2 minutes), but the arm64 check was killed after 58 seconds, indicating the arm64 image extraction exceeded the 4Gi memory ceiling. This is not caused by the PR's code changes — adding fuse-overlayfs (~680KB) is negligible relative to the image's total size. The PR was correctly merged by overriding this check.

Root Cause

The ecosystem-cert-preflight-checks Tekton task (version 0.2, from quay.io/konflux-ci/tekton-catalog/task-ecosystem-cert-preflight-checks:0.2) has a 4Gi memory limit on the app-check step that runs /usr/local/bin/preflight check container. This tool must pull, extract, and inspect the full container image to perform Red Hat certification checks.

The hypershift-gh-actions-runner image is extremely large because it bundles:

  • GitHub Actions runner base image (ghcr.io/actions/actions-runner)
  • Go 1.25.7 full toolchain
  • Pre-compiled golangci-lint binary and kube-api-linter.so plugin
  • oc and kubectl binaries
  • Build tools: make, gcc, libc6-dev
  • Additional packages: git, curl, ca-certificates, python3-pip, fuse-overlayfs

The preflight tool was killed by SIGKILL (signal 9) after 58 seconds on the arm64 platform, which is consistent with the container exceeding its memory limit and being terminated by the kernel's OOM killer for its cgroup. The amd64 variant passed in 2 minutes; arm64 images can have different layer sizes and extraction memory profiles.
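The `Killed` line in the error output is bash reporting that its child process ended on a signal. A wrapper script could surface this explicitly, because bash reports such a process's exit status as 128 plus the signal number (137 for SIGKILL). `run_and_classify` below is a hypothetical helper. An exit status of 137 only proves a SIGKILL; attributing it to the OOM killer, as this analysis does, would need corroboration such as the pod's container status showing reason `OOMKilled`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and report how it ended.
run_and_classify() {
  "$@"
  local status=$?                 # captured before `local` itself runs
  if (( status > 128 )); then
    local sig=$(( status - 128 ))
    echo "killed by signal $sig ($(kill -l "$sig"))"
  elif (( status != 0 )); then
    echo "failed with exit status $status"
  else
    echo "succeeded"
  fi
}
```

For instance, `run_and_classify bash -c 'kill -9 $$'` reports `killed by signal 9 (KILL)`, which is the case seen in the arm64 preflight log.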

The previous Dockerfile change (commit 15b9f136 from the k8s 0.35 bump PR) passed both the amd64 and arm64 ecosystem-cert-preflight-checks on the on-push pipeline (hypershift-gh-actions-runner-on-push-942h4). This suggests the 4Gi limit is marginal for this image and that the failure is transient: the arm64 preflight happened to exceed the limit on this particular run.

The PR's actual changes (adding fuse-overlayfs package to the Dockerfile and switching the Go build cache strategy from full copy to fuse-overlayfs overlay mount) are functionally correct and add negligible size to the image.

Recommendations
  1. No action needed on PR CNTRLPLANE-3329: Use fuse-overlayfs for build cache instead of full copy #8568 — The PR was correctly merged by overriding the Konflux check. The code changes are sound and unrelated to the preflight failure.

  2. Monitor the post-merge on-push pipeline — The hypershift-gh-actions-runner-on-push pipeline for merge commit 1676a975 is currently in-progress. If it also fails the arm64 preflight, the 4Gi limit is now consistently too low for this image.

  3. If recurring, request a memory limit increase — File an issue with the Konflux team to increase the app-check step memory limit from 4Gi to 6Gi or 8Gi for the ecosystem-cert-preflight-checks task. The hypershift-gh-actions-runner image is legitimately large and will likely continue to grow.

  4. Alternative: split the Dockerfile — If the Konflux memory limit cannot be changed, consider splitting the runner image into a smaller base image (just the runner + Go) and a separate layer for build tools. This would reduce the image the preflight check needs to extract.

  5. Re-run to confirm flakiness — If needed in the future, simply re-trigger the Konflux pipeline to confirm the failure is transient.

Evidence
Evidence | Detail
Failed task | ecosystem-cert-preflight-checks (arm64), 58s duration, process Killed
Kill signal | SIGKILL (signal 9), consistent with an OOM kill
Memory limit | app-check step has limits.memory: 4Gi in Tekton task definition
amd64 result | Passed in 2 minutes (same pipeline run)
arm64 result | Killed after 58 seconds
Previous run (commit 15b9f136) | Both amd64 and arm64 preflight checks passed on on-push pipeline
Image contents | Go 1.25.7 toolchain, golangci-lint, kube-api-linter.so, oc/kubectl, gcc, git, python3-pip, fuse-overlayfs
PR change impact | Added fuse-overlayfs package (~680KB); negligible size increase
Task definition | quay.io/konflux-ci/tekton-catalog/task-ecosystem-cert-preflight-checks:0.2
Preflight tool | quay.io/opdev/preflight:stable, which runs /usr/local/bin/preflight check container
PR merged by | celebdor, who overrode all failing/pending checks
Enterprise contract | Also cancelled/failed as a downstream consequence

Labels

  • area/ci-tooling: Indicates the PR includes changes for CI or tooling
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
