OSAC: use pruned vmaas snapshot to fix e2e disk pressure by omer-vishlitzky · Pull Request #79304 · openshift/release

omer-vishlitzky · 2026-05-14T14:10:06Z

Summary

Switch e2e-vmaas CI jobs to use a pruned snapshot image that boots at 54% disk usage instead of 85-86%
The old snapshot accumulated ~30 stale OLM catalog index images (~42GB) that pushed the node above kubelet's disk pressure eviction threshold on boot
This caused ~60% of e2e-vmaas runs across all OSAC repos to fail with rollout timeouts in the refresh script's step [7/8] ("1 old replicas are pending termination")

Root Cause

The osac-vmaas snapshot boots at 85-86% disk usage (79G/93G). Kubelet's image GC high threshold is 85%, and its default hard eviction threshold for the image filesystem (imagefs.available<15%) sits at the same usage level. When the node exceeds this, kubelet sets the DiskPressure=True condition and the node is tainted, preventing new pods from being scheduled. The refresh script's oc rollout restart creates new pods that can't schedule, so oc rollout status --timeout=120s times out.
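
The scheduling effect described above can be modeled in a few lines. This is a minimal sketch, not kubelet or scheduler code: it assumes the standard node.kubernetes.io/disk-pressure taint with the NoSchedule effect, and the function name `can_schedule` and the simplified toleration matching (key and effect only) are illustrative.

```python
# Simplified model of taint-based scheduling; not the real scheduler logic.
DISK_PRESSURE_TAINT = {"key": "node.kubernetes.io/disk-pressure", "effect": "NoSchedule"}


def can_schedule(node_taints, pod_tolerations):
    """Return True if every NoSchedule taint on the node is tolerated by the pod."""
    for taint in node_taints:
        if taint["effect"] != "NoSchedule":
            continue
        tolerated = any(
            tol.get("key") == taint["key"] and tol.get("effect") in (None, taint["effect"])
            for tol in pod_tolerations
        )
        if not tolerated:
            return False
    return True


# Assumption: pods created by a rollout restart carry no disk-pressure toleration.
print(can_schedule([DISK_PRESSURE_TAINT], []))  # False: the new pod stays Pending
print(can_schedule([], []))                     # True: no taint, the pod can be placed
```

With the taint in place the replacement pods never start, so the old replicas are never terminated and the rollout status check runs into its timeout.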

Exhaustive analysis of all 51 recent e2e-vmaas runs confirmed: every rollout-timeout failure (25/25) had DiskPressure=True, and every success (19/19) had DiskPressure=False.
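
One way to check the DiskPressure condition for a run is to read the node's status conditions, for example from the output of `oc get node <name> -o json`. The sketch below assumes that JSON shape; the helper name `disk_pressure` and the sample data are hypothetical and are not taken from the actual analysis.

```python
import json


def disk_pressure(node_json: str) -> bool:
    """Return True if the node reports the DiskPressure condition with status "True"."""
    node = json.loads(node_json)
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "DiskPressure":
            return cond.get("status") == "True"
    return False  # condition not reported at all


# Hypothetical, trimmed-down node object for illustration only.
sample = json.dumps({
    "status": {"conditions": [
        {"type": "MemoryPressure", "status": "False"},
        {"type": "DiskPressure", "status": "True"},
    ]}
})
print(disk_pressure(sample))  # True
```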

Fix

Pruned stale OLM catalog index images from the snapshot and re-snapshotted:

certified-operator-index: 9 stale versions × 1.25GB
community-operator-index: 12 stale versions × 1.25GB
redhat-operator-index: 9 stale versions × 1.79GB
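
As a quick sanity check, the per-index figures listed above add up to the ~30 images and ~42GB quoted in the summary. The snippet only restates the numbers from this description; the variable names are illustrative.

```python
# (stale versions, size per version in GB), as listed in this PR description
stale_images = {
    "certified-operator-index": (9, 1.25),
    "community-operator-index": (12, 1.25),
    "redhat-operator-index": (9, 1.79),
}

total_count = sum(count for count, _ in stale_images.values())
total_gb = sum(count * size for count, size in stale_images.values())

print(total_count)         # 30
print(round(total_gb, 2))  # 42.36
```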

Also ran refresh-after-snapshot.sh to update OSAC component images to latest.

Verification

Booted from pruned snapshot: DiskPressure=False immediately, 54% disk usage (44G free)
Original snapshot: DiskPressure=True on boot, 85% disk usage (14G free)
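
A disk-usage check like the one above can also be scripted on the node itself. This is a minimal sketch using Python's standard `shutil.disk_usage`; the 85% figure is the threshold cited in this PR, while the function name, the default path, and the way the percentage is computed are assumptions and may differ slightly from what `df` reports.

```python
import shutil


def usage_percent(path: str = "/") -> float:
    """Return used space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


THRESHOLD = 85.0  # percent, as cited in this PR

pct = usage_percent("/")
state = "above" if pct >= THRESHOLD else "below"
print(f"{pct:.1f}% used, {state} the {THRESHOLD:.0f}% threshold")
```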

Summary

This PR fixes recurring DiskPressure failures in the OSAC e2e-vmaas CI jobs in this repository's OpenShift CI configuration by switching cluster boot to a pruned OSAC snapshot that greatly reduces disk usage at node boot.

Problem

e2e-vmaas CI clusters were booting from a snapshot that contained ~30 stale OLM catalog index images (~42 GB), causing nodes to start at ~85–86% disk usage (≈14 GB free). This put the node over kubelet's 85% disk threshold, so kubelet reported DiskPressure=True immediately on boot and the resulting taint blocked pod scheduling during the refresh/rollout step, causing ~60% of e2e-vmaas runs to fail.

Solution

The snapshot used for provisioning was pruned to remove stale OLM catalog index images, re-snapshotted, and OSAC component images were refreshed. Boot now uses the pruned flavor so nodes start with ~54% disk usage (≈44 GB free) and DiskPressure=False, preventing the rollout timeouts.

Changes (practical impact)

Affects OpenShift CI step definitions for the OSAC project (e2e-vmaas jobs) in this repo: the cluster boot step now points at the pruned snapshot image.
Updated ci-operator step default and boot script to use the pruned flavor image tag:
- ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-ref.yaml — CLUSTER_TOOL_FLAVOR_IMAGE default changed to quay.io/rh-ee-ovishlit/cluster-flavors:osac-vmaas-pruned
- ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh — cluster-tool boot command uses the osac-vmaas-pruned flavor
Ran refresh-after-snapshot.sh to update OSAC component images against the pruned snapshot.
Verification: boot from pruned snapshot yields DiskPressure=False and ~54% disk usage; original snapshot reproduces DiskPressure=True and ~85% usage.

Notes

The author posted pj-rehearse commands to rehearse affected e2e-vmaas jobs across OSAC projects and then targeted a narrower rehearse for the installer job.
One commit updates the boot command to match the new pruned flavor name registered by cluster-tool.
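
Per the commit message, cluster-tool pull registers the flavor under the image tag name, which is why the boot command has to change together with the image default. The sketch below illustrates that relationship only; `flavor_from_image` is a hypothetical helper, not part of cluster-tool, and it assumes the flavor name is exactly the tag after the final colon.

```python
def flavor_from_image(image_ref: str) -> str:
    """Return the tag portion of an image reference such as repo/name:tag."""
    # Look for the tag only in the last path component, so a registry port
    # (e.g. host:5000/name) is not mistaken for a tag.
    last_component = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_component:
        raise ValueError(f"no tag in image reference: {image_ref!r}")
    return last_component.split(":", 1)[1]


image = "quay.io/rh-ee-ovishlit/cluster-flavors:osac-vmaas-pruned"
print(flavor_from_image(image))  # osac-vmaas-pruned
```

Under that assumption, changing only CLUSTER_TOOL_FLAVOR_IMAGE would leave the boot command asking for a flavor name that was never registered.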

The osac-vmaas snapshot accumulated ~30 stale OLM catalog index images (~42GB), leaving the disk at 85-86% usage on boot. This triggered kubelet DiskPressure, preventing pod scheduling during the refresh step and causing ~60% of e2e-vmaas runs to fail. The pruned snapshot boots at 54% disk usage with no disk pressure.

coderabbitai · 2026-05-14T14:10:41Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 634f1c94-9317-4050-b972-1b3d31739c55

📥 Commits

Reviewing files that changed from the base of the PR and between 2d26a31 and 626db8a.

📒 Files selected for processing (1)

ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh

Walkthrough

The boot step’s flavor reference was changed from osac-vmaas to osac-vmaas-pruned in both the step configuration default and the boot command used during provisioning.

Changes

Boot Step Flavor switch

Layer / File(s)	Summary
CLUSTER_TOOL_FLAVOR_IMAGE default `ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-ref.yaml`	Environment variable default updated from `quay.io/rh-ee-ovishlit/cluster-flavors:osac-vmaas` to `quay.io/rh-ee-ovishlit/cluster-flavors:osac-vmaas-pruned`.
Boot command invocation `ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`	Updated `python3 /usr/local/bin/cluster-tool boot --flavor ...` invocation to use `osac-vmaas-pruned` instead of `osac-vmaas`, changing the flavor used during cluster boot.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 12

✅ Passed checks (12 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: switching to a pruned vmaas snapshot to resolve disk pressure issues in e2e tests.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR does not modify any Ginkgo test files. Changes are only to CI/CD configuration YAML and boot shell scripts. Check not applicable.
Test Structure And Quality	✅ Passed	Custom check not applicable. PR modifies YAML CI/CD config and bash scripts, not Ginkgo test code.
Microshift Test Compatibility	✅ Passed	PR does not add any new Ginkgo e2e tests. The check only applies to new test additions. Changes are limited to CI boot configuration and shell scripts for cluster provisioning.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	This PR does not add any new Ginkgo e2e tests. It only modifies infrastructure configuration files (YAML and bash scripts) to update the cluster boot process. The custom check is not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	PR only updates VM snapshot flavor tags for CI provisioning. No deployment manifests, operator code, or scheduling constraints are modified. Check not applicable.
Ote Binary Stdout Contract	✅ Passed	PR contains only infrastructure config (YAML) and shell scripts for cluster provisioning. No OTE binaries, Go code, or process-level stdout writes affected. Check is not applicable.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR does not add new Ginkgo e2e tests. Changes are to CI/CD infrastructure files (YAML and shell scripts), not test code. Custom check is not applicable.

omer-vishlitzky · 2026-05-14T14:14:34Z

/pj-rehearse pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas

openshift-merge-bot · 2026-05-14T14:14:37Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

omer-vishlitzky · 2026-05-14T14:38:30Z

/pj-rehearse pull-ci-osac-project-osac-installer-main-e2e-vmaas

openshift-merge-bot · 2026-05-14T14:38:33Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

omer-vishlitzky · 2026-05-14T14:51:57Z

/pj-rehearse pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas

openshift-merge-bot · 2026-05-14T14:53:23Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-merge-bot · 2026-05-14T14:55:23Z

[REHEARSALNOTIFIER]
@omer-vishlitzky: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name	Repo	Type	Reason
pull-ci-osac-project-osac-test-infra-main-e2e-vmaas	osac-project/osac-test-infra	presubmit	Registry content changed
pull-ci-osac-project-fulfillment-service-main-e2e-vmaas	osac-project/fulfillment-service	presubmit	Registry content changed
pull-ci-osac-project-osac-operator-main-e2e-vmaas	osac-project/osac-operator	presubmit	Registry content changed
pull-ci-osac-project-osac-installer-main-e2e-vmaas	osac-project/osac-installer	presubmit	Registry content changed

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

omer-vishlitzky · 2026-05-14T15:23:04Z

/pj-rehearse pull-ci-osac-project-osac-installer-main-e2e-vmaas

openshift-merge-bot · 2026-05-14T15:23:07Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

akshaynadkarni

LGTM

openshift-ci · 2026-05-14T15:52:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akshaynadkarni, omer-vishlitzky

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ci-operator/step-registry/osac-project/cluster-tool/boot/OWNERS~~ [akshaynadkarni,omer-vishlitzky]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

omer-vishlitzky · 2026-05-14T17:40:40Z

/ph-rehearse

omer-vishlitzky · 2026-05-14T17:40:51Z

/pj-rehearse ack

openshift-merge-bot · 2026-05-14T17:40:55Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci · 2026-05-14T17:51:36Z

@omer-vishlitzky: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from danmanor and jhernand May 14, 2026 14:12

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2026

fix boot command to use pruned flavor name

626db8a

The cluster-tool pull registers the flavor with the image tag name. Update the boot command to match the new tag.

akshaynadkarni approved these changes May 14, 2026

openshift-ci Bot assigned akshaynadkarni May 14, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 14, 2026

openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 14, 2026

openshift-merge-bot Bot merged commit 825e778 into openshift:main May 14, 2026
15 checks passed

coderabbitai Bot mentioned this pull request May 16, 2026

OSAC-853: add AAP presubmit e2e-vmaas job #79365

Open

2 tasks
