
OCPBUGS-83585: Wait for CRD removal in GenerateCRDInstallTest to fix flaky envtest #8261

Merged
enxebre merged 1 commit into openshift:main from jparrill:OCPBUGS-83585 on Apr 17, 2026

Conversation


@jparrill jparrill commented Apr 16, 2026

What this PR does / why we need it

GenerateCRDInstallTest uninstalls all CRDs after validation but does not wait for the API server to fully remove them. When the individual per-suite tests start immediately afterwards, they find a stale CRD in a transitional/deletion state, causing WaitForCRDs to time out with "context deadline exceeded".

This adds a wait-for-removal loop after UninstallCRDs in GenerateCRDInstallTest, matching the pattern already used in GenerateTestSuite's AfterEach.

Root Cause

The execution order in suite_test.go is:

  1. GenerateCRDInstallTest("Default") — installs Default CRDs
  2. GenerateCRDInstallTest("TechPreviewNoUpgrade") — installs all TechPreviewNoUpgrade CRDs (including hcpetcdbackups), uninstalls without waiting
  3. Individual test suites start — the hcpetcdbackups CustomNoUpgrade variant hits the stale CRD

The flakiness depends on how fast the API server finalizes CRD deletion, which varies by K8s version and CI load.
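
As a rough illustration of this ordering problem, the following toy model replays install, uninstall, install against a fake server whose deletion completes a few polls later. It is not the real API server or envtest code: the toyServer type, its three states, and the assumption that installing over a terminating CRD leaves it terminating are all hypothetical simplifications chosen for the sketch.

```go
package main

// crdState is a hypothetical, simplified view of a CRD's lifecycle.
type crdState int

const (
	absent crdState = iota
	established
	terminating
)

// toyServer models one CRD whose deletion is finalized asynchronously.
type toyServer struct {
	state     crdState
	ticksLeft int // background steps remaining until a terminating CRD is gone
}

// install creates the CRD only if none exists. Assumption for this sketch:
// an object that is still terminating is left untouched.
func (s *toyServer) install() {
	if s.state == absent {
		s.state = established
	}
}

// uninstall marks an established CRD as terminating; removal finishes after delay ticks.
func (s *toyServer) uninstall(delay int) {
	if s.state == established {
		s.state, s.ticksLeft = terminating, delay
	}
}

// tick advances background finalization by one step.
func (s *toyServer) tick() {
	if s.state != terminating {
		return
	}
	if s.ticksLeft > 0 {
		s.ticksLeft--
	}
	if s.ticksLeft == 0 {
		s.state = absent
	}
}

// waitFor polls up to maxPolls times for the wanted state, ticking between polls.
func (s *toyServer) waitFor(want crdState, maxPolls int) bool {
	for i := 0; i < maxPolls; i++ {
		if s.state == want {
			return true
		}
		s.tick()
	}
	return s.state == want
}

// runSuite replays install, uninstall, install and reports whether the second
// install becomes established. waitForRemoval toggles the fix described above.
func runSuite(waitForRemoval bool) bool {
	s := &toyServer{}
	s.install()
	s.uninstall(3)
	if waitForRemoval && !s.waitFor(absent, 10) {
		return false
	}
	s.install()
	return s.waitFor(established, 10)
}
```

In this model, runSuite(false) reports that the second install never becomes established, while runSuite(true) succeeds because the removal wait lets finalization complete first.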

Which issue(s) this PR fixes

Fixes https://issues.redhat.com/browse/OCPBUGS-83585

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Improved CRD uninstall verification in the test suite: tests now actively wait (with retries) for each custom resource definition to be fully removed, confirming cleanup within a 30-second window. This reduces test flakiness by preventing leftover CRDs from affecting subsequent test runs.

@openshift-merge-bot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 16, 2026
@openshift-ci-robot

@jparrill: This pull request references Jira Issue OCPBUGS-83585, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


coderabbitai bot commented Apr 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 75fcf813-08e5-4f6f-8157-4d4678499332

📥 Commits

Reviewing files that changed from the base of the PR and between be244f9 and 2ff5cdc.

📒 Files selected for processing (1)
  • test/envtest/generator.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/envtest/generator.go

Walkthrough

Both GenerateTestSuite and GenerateCRDInstallTest in test/envtest/generator.go were updated to verify CRD removal after uninstall. Instead of only asserting that envtest.UninstallCRDs(...) returned no error, the code now polls the Kubernetes API with k8sClient.Get(...) for each previously installed CustomResourceDefinition and asserts apierrors.IsNotFound(err) within 30s (polling every 1s). In GenerateCRDInstallTest, this takes the form of a new explicit per-CRD loop after uninstall that waits for each CRD to be fully deleted.
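
A dependency-free sketch of this poll-until-NotFound pattern is shown below. It is not the code from generator.go: errNotFound stands in for the apierrors.IsNotFound check, the get callback stands in for k8sClient.Get, fakeServer and demoWait are hypothetical helpers, the CRD name is made up, and the timings are shortened so the demo finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotFound is a stand-in for the NotFound error returned once a CRD is gone.
var errNotFound = errors.New("not found")

// getFunc is a hypothetical lookup; the real suite would call k8sClient.Get here.
type getFunc func(ctx context.Context, name string) error

// waitForRemoval polls get for each name until it reports errNotFound.
// Any other result (success or an unrelated error) counts as "not removed yet".
func waitForRemoval(ctx context.Context, get getFunc, names []string, timeout, interval time.Duration) error {
	for _, name := range names {
		deadline := time.Now().Add(timeout)
		for {
			err := get(ctx, name)
			if errors.Is(err, errNotFound) {
				break // this CRD is fully removed; move on to the next one
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("CRD %q still present after %s (last error: %v)", name, timeout, err)
			}
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(interval):
			}
		}
	}
	return nil
}

// fakeServer simulates an API server that needs a number of polls to finish deleting.
type fakeServer struct {
	pollsUntilGone map[string]int
}

func (s *fakeServer) get(_ context.Context, name string) error {
	if s.pollsUntilGone[name] > 0 {
		s.pollsUntilGone[name]--
		return nil // still present: deletion in progress
	}
	return errNotFound
}

// demoWait runs the loop against a fake CRD that disappears after the given number of polls.
func demoWait(polls int) error {
	s := &fakeServer{pollsUntilGone: map[string]int{"widgets.example.test": polls}}
	return waitForRemoval(context.Background(), s.get, []string{"widgets.example.test"}, 50*time.Millisecond, time.Millisecond)
}
```

Calling demoWait(3) returns nil once the fake CRD disappears, whereas a CRD that never goes away makes the call return an error after the timeout instead of silently passing.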

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Code
    participant K8sClient as k8sClient
    participant API as Kubernetes API Server
    Test->>K8sClient: envtest.UninstallCRDs(...)
    K8sClient->>API: Uninstall CRDs request
    API-->>K8sClient: Uninstall accepted (CRDs deleting)
    loop per CRD (until deleted)
        Test->>K8sClient: Get CRD
        K8sClient->>API: GET /apis/apiextensions.k8s.io/v1/customresourcedefinitions/{name}
        alt CRD exists
            API-->>K8sClient: 200 OK (CRD present)
            K8sClient-->>Test: nil error
            Note right of Test: Wait 1s then retry
        else CRD not found
            API-->>K8sClient: 404 Not Found
            K8sClient-->>Test: apierrors.IsNotFound -> true
            Note right of Test: Proceed to next CRD
        end
    end
    Test-->>Test: All CRDs confirmed removed

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Ote Binary Stdout Contract | ❓ Inconclusive | Cannot locate file test/envtest/generator.go in repository to assess stdout contract violations. | Verify repository path and file existence, then provide modified file content for OTE Binary Stdout Contract assessment. |

✅ Passed checks (9 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: adding a wait-for-removal loop in GenerateCRDInstallTest to fix flaky envtest behavior by ensuring CRDs are fully removed before proceeding. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo test constructs with dynamic or static names found in generator.go file. |
| Test Structure And Quality | ✅ Passed | The test code adheres to all five quality requirements with proper BeforeEach/AfterEach patterns, appropriate timeouts, meaningful assertions, and Ginkgo v2 compliance. |
| Microshift Test Compatibility | ✅ Passed | PR modifies only internal implementation of test generator functions without adding new Ginkgo test definitions. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR modifies cleanup logic of existing test generation functions to ensure CRDs are fully removed before subsequent tests run. No new Ginkgo e2e tests are added. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only test infrastructure code in test/envtest/generator.go, adding CRD removal verification logic without changes to production code. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR modifies internal test utility functions, not new Ginkgo e2e tests. No new It()/Describe()/Context()/When() blocks added. |

@openshift-ci openshift-ci bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 16, 2026

openshift-ci bot commented Apr 16, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 16, 2026
@openshift-ci openshift-ci bot requested review from Nirshal and sdminonne April 16, 2026 14:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/envtest/generator.go`:
- Around line 255-258: The test currently uses
Eventually(...).ShouldNot(Succeed()) which treats any error (timeouts, transient
API errors) as a success; change the Eventually assertion to explicitly check
for NotFound by using the same Get closure but returning whether
apierrors.IsNotFound(err) (or asserting
Expect(apierrors.IsNotFound(err)).To(BeTrue())) so the Eventually only succeeds
when the CRD is actually absent; update the closure that calls
k8sClient.Get(ctx, client.ObjectKeyFromObject(crd), crd) to use
apierrors.IsNotFound on the returned error (importing
k8s.io/apimachinery/pkg/api/errors as apierrors if needed).
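
The difference between the two assertions can be sketched without Gomega. In this simplified illustration, removedLoose approximates a check that only requires the Get to fail, and removedStrict approximates a check that requires the failure to be NotFound. Both helper names and the two sentinel errors are hypothetical stand-ins, not identifiers from generator.go or from the Kubernetes libraries.

```go
package main

import "errors"

var (
	// errNotFound stands in for the error that apierrors.IsNotFound would recognize.
	errNotFound = errors.New("not found")
	// errTransient stands in for any unrelated failure, such as a dropped connection.
	errTransient = errors.New("connection refused")
)

// removedLoose treats any failed Get as proof that the CRD is gone.
func removedLoose(getErr error) bool {
	return getErr != nil
}

// removedStrict accepts only a NotFound error as proof that the CRD is gone.
func removedStrict(getErr error) bool {
	return errors.Is(getErr, errNotFound)
}
```

With a transient error, removedLoose reports the CRD as removed even though nothing is known about it, while removedStrict keeps the poll going until an actual NotFound arrives.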

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: a79c7bcb-90a1-46b7-a7c3-f6f4d2aa2307

📥 Commits

Reviewing files that changed from the base of the PR and between 846f2e9 and be244f9.

📒 Files selected for processing (1)
  • test/envtest/generator.go

Comment thread test/envtest/generator.go Outdated

codecov bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 35.62%. Comparing base (387b8be) to head (2ff5cdc).
⚠️ Report is 15 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8261   +/-   ##
=======================================
  Coverage   35.61%   35.62%           
=======================================
  Files         767      767           
  Lines       93333    93330    -3     
=======================================
  Hits        33245    33245           
+ Misses      57399    57396    -3     
  Partials     2689     2689           

see 1 file with indirect coverage changes



@sdminonne sdminonne left a comment


Comment thread test/envtest/generator.go Outdated
Comment thread test/envtest/generator.go Outdated
Comment thread test/envtest/generator.go Outdated
@jparrill

/jira refresh

@openshift-ci-robot

@jparrill: This pull request references Jira Issue OCPBUGS-83585, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@jparrill

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 16, 2026
@openshift-ci-robot

@jparrill: This pull request references Jira Issue OCPBUGS-83585, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

…flaky envtest failures

GenerateCRDInstallTest uninstalls all CRDs after validation but does not
wait for the API server to fully remove them. When individual per-suite
tests start immediately after, they find a stale CRD in a transitional
state, causing WaitForCRDs to timeout with "context deadline exceeded".

Add a wait-for-removal loop after UninstallCRDs, using explicit
apierrors.IsNotFound checks instead of ShouldNot(Succeed()) to avoid
false positives from transient API errors. Also fix the same pattern
in GenerateTestSuite's AfterEach for consistency.

Fixes: https://issues.redhat.com/browse/OCPBUGS-83585

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
@openshift-ci-robot

@jparrill: This pull request references Jira Issue OCPBUGS-83585, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@jparrill jparrill requested a review from sdminonne April 16, 2026 16:00

openshift-ci bot commented Apr 16, 2026

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sdminonne

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 16, 2026
@jparrill

/verified by UnitTest

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 17, 2026
@openshift-ci-robot

@jparrill: This PR has been marked as verified by UnitTest.


@enxebre enxebre merged commit fa1cecd into openshift:main Apr 17, 2026
35 of 42 checks passed
@openshift-ci-robot

@jparrill: Jira Issue Verification Checks: Jira Issue OCPBUGS-83585
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-83585 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓



Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/testing: Indicates the PR includes changes for e2e testing
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria
