OSAC: use pruned vmaas snapshot to fix e2e disk pressure #79304
The osac-vmaas snapshot accumulated ~30 stale OLM catalog index images (~42GB), leaving the disk at 85-86% usage on boot. This triggered kubelet DiskPressure, preventing pod scheduling during the refresh step and causing ~60% of e2e-vmaas runs to fail. The pruned snapshot boots at 54% disk usage with no disk pressure.
No actionable comments were generated in the recent review. Walkthrough: the boot step's flavor reference was changed (Boot Step Flavor switch).
/pj-rehearse pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas
@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
/pj-rehearse pull-ci-osac-project-osac-installer-main-e2e-vmaas
The cluster-tool pull registers the flavor with the image tag name. Update the boot command to match the new tag.
/pj-rehearse pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas
/pj-rehearse pull-ci-osac-project-osac-installer-main-e2e-vmaas
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: akshaynadkarni, omer-vishlitzky
/ph-rehearse
/pj-rehearse ack
@omer-vishlitzky: all tests passed!
Summary
Root Cause
The `osac-vmaas` snapshot boots at 85-86% disk usage (79G/93G). Kubelet's image GC threshold is 85%. When the node exceeds this, kubelet sets the `DiskPressure=True` taint, preventing new pods from being scheduled. The refresh script's `oc rollout restart` creates new pods that can't schedule, and `oc rollout status --timeout=120s` times out.

Exhaustive analysis of all 51 recent e2e-vmaas runs confirmed: every rollout-timeout failure (25/25) had `DiskPressure=True`, every success (19/19) had `DiskPressure=False`.
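The DiskPressure correlation described above can be checked per node from the Node object itself. The following is a minimal sketch, not the analysis actually used for this PR: it assumes the input is a Node object in the shape returned by `oc get node <name> -o json`, and the `sample_node` data and function names are hypothetical.

```python
# Sketch: detect DiskPressure on a Kubernetes Node object (parsed JSON).
# sample_node is hypothetical, trimmed data for illustration only.

DISK_PRESSURE_TAINT = "node.kubernetes.io/disk-pressure"


def has_disk_pressure(node: dict) -> bool:
    """True if status.conditions reports DiskPressure with status "True"."""
    conditions = node.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "DiskPressure" and c.get("status") == "True"
        for c in conditions
    )


def has_disk_pressure_taint(node: dict) -> bool:
    """True if spec.taints contains the disk-pressure taint key."""
    taints = node.get("spec", {}).get("taints") or []
    return any(t.get("key") == DISK_PRESSURE_TAINT for t in taints)


sample_node = {
    "spec": {
        "taints": [{"key": DISK_PRESSURE_TAINT, "effect": "NoSchedule"}],
    },
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "DiskPressure", "status": "True"},
            {"type": "Ready", "status": "True"},
        ],
    },
}

print(has_disk_pressure(sample_node), has_disk_pressure_taint(sample_node))
```

Running a check like this before `oc rollout restart` would make a disk-pressure failure show up as an explicit precondition error rather than as a rollout timeout.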
Fix
Pruned stale OLM catalog index images from the snapshot and re-snapshotted:
- certified-operator-index: 9 stale versions × 1.25GB
- community-operator-index: 12 stale versions × 1.25GB
- redhat-operator-index: 9 stale versions × 1.79GB

Also ran `refresh-after-snapshot.sh` to update OSAC component images to latest.

Verification
Summary
This PR fixes recurring DiskPressure failures in the OSAC e2e-vmaas CI jobs in this repository's OpenShift CI configuration by switching cluster boot to a pruned OSAC snapshot that greatly reduces disk usage at node boot.
Problem
e2e-vmaas CI clusters were booting from a snapshot that contained ~30 stale OLM catalog index images (~42 GB), causing nodes to start at ~85–86% disk usage (≈14 GB free). Kubelet's 85% image-garbage-collection threshold immediately set the DiskPressure=True taint on boot, blocking pod scheduling during the refresh/rollout step and causing ~60% of e2e-vmaas runs to fail.
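As a quick sanity check, the "~30 images (~42 GB)" and "≈14 GB free" figures can be reproduced from the per-index counts and sizes listed under Fix and the 79G/93G usage given under Root Cause. This is only arithmetic on the numbers quoted in this PR; the variable names are illustrative.

```python
# (index name, stale version count, size per image in GB), as quoted in the PR
PRUNED = [
    ("certified-operator-index", 9, 1.25),
    ("community-operator-index", 12, 1.25),
    ("redhat-operator-index", 9, 1.79),
]

total_images = sum(count for _, count, _ in PRUNED)
total_gb = sum(count * size for _, count, size in PRUNED)

# Disk figures quoted in the PR (79G used of 93G)
disk_total_gb = 93
disk_used_gb = 79
free_gb = disk_total_gb - disk_used_gb

print(f"{total_images} stale images, {total_gb:.2f} GB reclaimable")
print(f"{free_gb} GB free before pruning")
```

The totals come to 30 images and 42.36 GB, consistent with the rounded values in the description.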
Solution
The snapshot used for provisioning was pruned to remove stale OLM catalog index images, re-snapshotted, and OSAC component images were refreshed. Boot now uses the pruned flavor so nodes start with ~54% disk usage (≈44 GB free) and DiskPressure=False, preventing the rollout timeouts.
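A guard against regressions of this kind could compare a filesystem's usage with the 85% threshold before a snapshot is published. The sketch below is an assumption about how such a check might look, not something this PR adds; it uses Python's standard `shutil.disk_usage`, and the path and function names are hypothetical. Its percentage can differ slightly from what `df` reports, because `df` excludes reserved blocks from its calculation.

```python
import shutil

# The 85% threshold cited in this PR, expressed as a fraction.
IMAGE_GC_HIGH_THRESHOLD = 0.85


def disk_usage_fraction(path: str = "/") -> float:
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def over_threshold(fraction: float, threshold: float = IMAGE_GC_HIGH_THRESHOLD) -> bool:
    """True if usage exceeds the threshold."""
    return fraction > threshold


# Figures quoted in the PR: about 86% before pruning, about 54% after.
print(over_threshold(0.86))  # old snapshot
print(over_threshold(0.54))  # pruned snapshot
print(f"local usage: {disk_usage_fraction('/'):.0%}")
```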
Changes (practical impact)
Notes