OSAC: use pruned vmaas snapshot to fix e2e disk pressure #79304

Merged: openshift-merge-bot[bot] merged 2 commits into openshift:main from omer-vishlitzky:osac-vmaas-pruned-snapshot on May 14, 2026
@omer-vishlitzky (Contributor) commented May 14, 2026

Summary

  • Switch e2e-vmaas CI jobs to use a pruned snapshot image that boots at 54% disk usage instead of 85-86%
  • The old snapshot accumulated ~30 stale OLM catalog index images (~42GB) that pushed the node above kubelet's disk pressure eviction threshold on boot
  • This caused ~60% of e2e-vmaas runs across all OSAC repos to fail with rollout timeouts in the refresh script's step [7/8] ("1 old replicas are pending termination")

Root Cause

The osac-vmaas snapshot boots at 85-86% disk usage (79G/93G). Kubelet's image GC threshold is 85%. When the node exceeds this, kubelet sets the DiskPressure=True condition and the node is tainted, which prevents new pods from being scheduled. The refresh script's oc rollout restart creates new pods that can't schedule, so oc rollout status --timeout=120s times out.

Exhaustive analysis of all 51 recent e2e-vmaas runs confirmed that every rollout-timeout failure (25/25) had DiskPressure=True and every success (19/19) had DiskPressure=False.
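
The DiskPressure condition discussed above is reported in the node's status.conditions list. The following is a minimal sketch of how that condition could be read from a node object, assuming JSON shaped like the output of oc get node <name> -o json. The function name disk_pressure and the trimmed sample node are hypothetical and are not taken from this PR or from the analysis it describes.

```python
import json


def disk_pressure(node: dict) -> bool:
    """Return True if the node object reports the DiskPressure condition as "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "DiskPressure":
            # Kubernetes encodes condition status as the strings "True", "False" or "Unknown".
            return cond.get("status") == "True"
    # Assumption for this sketch: a missing condition is treated as "no pressure".
    return False


# Hypothetical, heavily trimmed node object used only for illustration.
sample_node = json.loads("""
{"status": {"conditions": [
  {"type": "MemoryPressure", "status": "False"},
  {"type": "DiskPressure", "status": "True"},
  {"type": "Ready", "status": "True"}
]}}
""")

print(disk_pressure(sample_node))
```

With the sample above the function reports pressure; a node object without any conditions is reported as having none.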

Fix

Pruned stale OLM catalog index images from the snapshot and re-snapshotted:

  • certified-operator-index: 9 stale versions × 1.25GB
  • community-operator-index: 12 stale versions × 1.25GB
  • redhat-operator-index: 9 stale versions × 1.79GB

Also ran refresh-after-snapshot.sh to update OSAC component images to latest.
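
As a quick arithmetic check, the per-index figures in the list above can be summed to confirm they match the "~30 images, ~42GB" totals quoted in the summary. This is only a sketch; the sizes are the approximate per-image values stated in the list, not measured values.

```python
# (index name, stale versions pruned, approximate size per image in GB),
# copied from the list in the Fix section.
pruned = [
    ("certified-operator-index", 9, 1.25),
    ("community-operator-index", 12, 1.25),
    ("redhat-operator-index", 9, 1.79),
]

total_images = sum(count for _, count, _ in pruned)
total_gb = sum(count * size for _, count, size in pruned)

print(total_images, round(total_gb, 2))
```

The totals come to 30 images and roughly 42.36 GB, consistent with the figures in the summary.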

Verification

  • Booted from pruned snapshot: DiskPressure=False immediately, 54% disk usage (44G free)
  • Original snapshot: DiskPressure=True on boot, 85% disk usage (14G free)

Summary

This PR fixes recurring DiskPressure failures in the OSAC e2e-vmaas CI jobs in this repository's OpenShift CI configuration by switching cluster boot to a pruned OSAC snapshot, which lowers disk usage at node boot from about 85% to about 54%.

Problem

e2e-vmaas CI clusters were booting from a snapshot that contained ~30 stale OLM catalog index images (~42 GB), causing nodes to start at ~85–86% disk usage (≈14 GB free). Kubelet's 85% image-garbage-collection threshold immediately set the DiskPressure=True taint on boot, blocking pod scheduling during the refresh/rollout step and causing ~60% of e2e-vmaas runs to fail.

Solution

The snapshot used for provisioning was pruned to remove stale OLM catalog index images, re-snapshotted, and OSAC component images were refreshed. Boot now uses the pruned flavor so nodes start with ~54% disk usage (≈44 GB free) and DiskPressure=False, preventing the rollout timeouts.

Changes (practical impact)

  • Affects OpenShift CI step definitions for the OSAC project (e2e-vmaas jobs) in this repo: the cluster boot step now points at the pruned snapshot image.
  • Updated ci-operator step default and boot script to use the pruned flavor image tag:
    • ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-ref.yaml — CLUSTER_TOOL_FLAVOR_IMAGE default changed to quay.io/rh-ee-ovishlit/cluster-flavors:osac-vmaas-pruned
    • ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh — cluster-tool boot command uses the osac-vmaas-pruned flavor
  • Ran refresh-after-snapshot.sh to update OSAC component images against the pruned snapshot.
  • Verification: boot from pruned snapshot yields DiskPressure=False and ~54% disk usage; original snapshot reproduces DiskPressure=True and ~85% usage.

Notes

  • The author posted pj-rehearse commands to rehearse the affected e2e-vmaas jobs across OSAC projects, then ran a narrower rehearsal for the installer job only.
  • One commit updates the boot command to match the new pruned flavor name registered by cluster-tool.

The osac-vmaas snapshot accumulated ~30 stale OLM catalog index
images (~42GB), leaving the disk at 85-86% usage on boot. This
triggered kubelet DiskPressure, preventing pod scheduling during
the refresh step and causing ~60% of e2e-vmaas runs to fail.

The pruned snapshot boots at 54% disk usage with no disk pressure.
@coderabbitai Bot (Contributor) commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 634f1c94-9317-4050-b972-1b3d31739c55

📥 Commits

Reviewing files that changed from the base of the PR and between 2d26a31 and 626db8a.

📒 Files selected for processing (1)
  • ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh

Walkthrough

The boot step’s flavor reference was changed from osac-vmaas to osac-vmaas-pruned in both the step configuration default and the boot command used during provisioning.

Changes

Boot Step Flavor switch

Layer / File(s) | Summary
CLUSTER_TOOL_FLAVOR_IMAGE default (ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-ref.yaml) | Environment variable default updated from quay.io/rh-ee-ovishlit/cluster-flavors:osac-vmaas to quay.io/rh-ee-ovishlit/cluster-flavors:osac-vmaas-pruned.
Boot command invocation (ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh) | Updated python3 /usr/local/bin/cluster-tool boot --flavor ... invocation to use osac-vmaas-pruned instead of osac-vmaas, changing the flavor used during cluster boot.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately summarizes the main change: switching to a pruned vmaas snapshot to resolve disk pressure issues in e2e tests.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | PR does not modify any Ginkgo test files. Changes are only to CI/CD configuration YAML and boot shell scripts. Check not applicable.
Test Structure And Quality | ✅ Passed | Custom check not applicable. PR modifies YAML CI/CD config and bash scripts, not Ginkgo test code.
Microshift Test Compatibility | ✅ Passed | PR does not add any new Ginkgo e2e tests. The check only applies to new test additions. Changes are limited to CI boot configuration and shell scripts for cluster provisioning.
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR does not add any new Ginkgo e2e tests. It only modifies infrastructure configuration files (YAML and bash scripts) to update the cluster boot process. The custom check is not applicable.
Topology-Aware Scheduling Compatibility | ✅ Passed | PR only updates VM snapshot flavor tags for CI provisioning. No deployment manifests, operator code, or scheduling constraints are modified. Check not applicable.
Ote Binary Stdout Contract | ✅ Passed | PR contains only infrastructure config (YAML) and shell scripts for cluster provisioning. No OTE binaries, Go code, or process-level stdout writes affected. Check is not applicable.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR does not add new Ginkgo e2e tests. Changes are to CI/CD infrastructure files (YAML and shell scripts), not test code. Custom check is not applicable.


@openshift-ci openshift-ci Bot requested review from danmanor and jhernand May 14, 2026 14:12
@openshift-ci openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) May 14, 2026
@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-installer-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

The cluster-tool pull registers the flavor with the image tag name.
Update the boot command to match the new tag.
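
The commit message above says the flavor is registered under the image tag name, which is why the --flavor argument has to change together with CLUSTER_TOOL_FLAVOR_IMAGE. The sketch below illustrates that relationship by extracting the tag from an image reference. The helper flavor_from_image is hypothetical and is not part of cluster-tool; it only assumes that the flavor name equals the tag, as the commit message states.

```python
def flavor_from_image(image_ref: str) -> str:
    """Return the tag portion of an image reference of the form registry/repo/name:tag."""
    # Only look at the last path segment so a registry port (host:5000/...) is not
    # mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if "@" in last_segment:
        # Digest references (name@sha256:...) carry no tag; out of scope for this sketch.
        raise ValueError(f"digest reference has no tag: {image_ref}")
    if ":" not in last_segment:
        raise ValueError(f"image reference has no tag: {image_ref}")
    return last_segment.rsplit(":", 1)[1]


image = "quay.io/rh-ee-ovishlit/cluster-flavors:osac-vmaas-pruned"
print(flavor_from_image(image))
```

For the new default image this yields osac-vmaas-pruned, the value the boot command now passes as its flavor.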
@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@omer-vishlitzky: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name | Repo | Type | Reason
pull-ci-osac-project-osac-test-infra-main-e2e-vmaas | osac-project/osac-test-infra | presubmit | Registry content changed
pull-ci-osac-project-fulfillment-service-main-e2e-vmaas | osac-project/fulfillment-service | presubmit | Registry content changed
pull-ci-osac-project-osac-operator-main-e2e-vmaas | osac-project/osac-operator | presubmit | Registry content changed
pull-ci-osac-project-osac-installer-main-e2e-vmaas | osac-project/osac-installer | presubmit | Registry content changed
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@omer-vishlitzky
Contributor Author

/pj-rehearse pull-ci-osac-project-osac-installer-main-e2e-vmaas

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@akshaynadkarni akshaynadkarni left a comment

LGTM

@openshift-ci openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) May 14, 2026
@openshift-ci Bot (Contributor) commented May 14, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akshaynadkarni, omer-vishlitzky

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@omer-vishlitzky
Contributor Author

/ph-rehearse

@omer-vishlitzky
Contributor Author

/pj-rehearse ack

@openshift-merge-bot
Contributor

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack label (Signifies that rehearsal jobs have been acknowledged) May 14, 2026
@openshift-ci Bot (Contributor) commented May 14, 2026

@omer-vishlitzky: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 825e778 into openshift:main May 14, 2026
15 checks passed