
OCPBUGS-86329: cpo: turn off cluster-api crdmigrator controller #8556

Open
stevekuznetsov wants to merge 1 commit into openshift:main from stevekuznetsov:skuznets/turn-off-capi-crdmigrator


@stevekuznetsov stevekuznetsov commented May 20, 2026

We do not provide the RBAC necessary to make this controller function, and we do not want or need its behavior. Turn off the controller by skipping all phases in order to remove lots of serious-looking 401 spam from logs that misleads humans and agents analyzing the logs.

Summary by CodeRabbit

  • Chores
    • Adjusted control plane operator deployment configuration to skip certain CRD migration phases during controller startup, altering startup migration behavior.

@openshift-merge-bot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode


coderabbitai Bot commented May 20, 2026

📝 Walkthrough

This PR modifies the cluster-api manager Deployment manifest to add a command-line argument that skips specific CRD migration phases during controller initialization. The --skip-crd-migration-phases=StorageVersionMigration,CleanupManagedFields flag is appended to the manager container's arguments, controlling which phases of CRD migration are bypassed at startup. No other deployment configuration is altered.
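As a rough illustration of the change described in the walkthrough, the sketch below appends the flag to a container's argument list, with the Deployment modeled as a plain Python dict. The flag value is quoted from the walkthrough. The container name `manager`, the helper `add_skip_flag`, the trimmed manifest, and the pre-existing `--leader-elect=true` argument are assumptions made for illustration and are not taken from the actual deployment.yaml.

```python
# Flag value quoted from the walkthrough above.
SKIP_FLAG = "--skip-crd-migration-phases=StorageVersionMigration,CleanupManagedFields"


def add_skip_flag(deployment: dict, container_name: str = "manager") -> dict:
    """Append SKIP_FLAG to the named container's args unless one is already set.

    Hypothetical helper: the real change edits the YAML manifest directly.
    """
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        if container.get("name") != container_name:
            continue
        args = container.setdefault("args", [])
        # Treat any existing --skip-crd-migration-phases value as already configured.
        if not any(a.startswith("--skip-crd-migration-phases") for a in args):
            args.append(SKIP_FLAG)
    return deployment


# Hypothetical, heavily trimmed manifest; the real Deployment has many more fields.
example = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "manager", "args": ["--leader-elect=true"]},
    ]}}}
}

add_skip_flag(example)
print(example["spec"]["template"]["spec"]["containers"][0]["args"])
```

Calling the helper a second time leaves the argument list unchanged, which mirrors the intent that the flag appears exactly once in the manifest.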

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies only a Kubernetes YAML manifest file (deployment.yaml) adding container arguments, not test files. No Ginkgo test names are introduced or modified. |
| Test Structure And Quality | ✅ Passed | PR contains no Ginkgo test code - only a YAML deployment manifest change. Custom check for test structure/quality is not applicable. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. The change only modifies a Kubernetes Deployment manifest with container arguments, making the MicroShift test compatibility check not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. The change only modifies a Kubernetes deployment manifest file by adding a container argument for the cluster-api controller. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | The change only adds a container argument without modifying scheduling constraints, affinity rules, replica counts, or nodeSelectors. No topology-aware scheduling issues found. |
| Ote Binary Stdout Contract | ✅ Passed | PR only modifies a deployment manifest to add CLI arguments to skip CRD migration phases. No code changes that could affect OTE binary stdout contract. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR modifies only a Kubernetes Deployment YAML manifest (cluster-api deployment.yaml), adding container arguments. No Ginkgo e2e tests are added, so this check does not apply. |
| Title check | ✅ Passed | The title clearly and specifically references the main change: turning off the cluster-api crdmigrator controller via the deployment argument modification shown in the changeset. |


@openshift-ci openshift-ci Bot added the area/control-plane-operator label (Indicates the PR includes changes for the control plane operator - in an OCP release) May 20, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and sjenning May 20, 2026 13:02

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.34%. Comparing base (a7d68da) to head (a020815).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8556   +/-   ##
=======================================
  Coverage   40.34%   40.34%           
=======================================
  Files         755      755           
  Lines       93167    93167           
=======================================
  Hits        37587    37587           
  Misses      52877    52877           
  Partials     2703     2703           
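
The figures in the diff above can be cross-checked with a few lines of arithmetic. This is only a sanity check, assuming the reported percentage is hits divided by total tracked lines; the variable names are illustrative.

```python
# Counts copied from the Codecov diff above (identical for base and head).
lines, hits, misses, partials = 93167, 37587, 52877, 2703

# Every tracked line is counted as a hit, a miss, or a partial.
assert hits + misses + partials == lines

# Assumed formula: coverage = hits / lines, shown as a percentage.
coverage_pct = round(100 * hits / lines, 2)
print(coverage_pct)  # 40.34, matching the reported project coverage
```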

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 34.30% <ø> (ø) |
| cpo-hostedcontrolplane | 41.76% <ø> (ø) |
| cpo-other | 40.14% <ø> (ø) |
| hypershift-operator | 50.72% <ø> (ø) |
| other | 31.54% <ø> (ø) |

Flags with carried forward coverage won't be shown.

We do not provide the RBAC necessary to make this controller function,
and we do not want or need its behavior. Turn off the controller by
skipping all phases in order to remove lots of serious-looking 401 spam
from logs that misleads humans and agents analyzing the logs.

Signed-off-by: Steve Kuznetsov <stekuznetsov@microsoft.com>
@stevekuznetsov stevekuznetsov force-pushed the skuznets/turn-off-capi-crdmigrator branch from 2386fa6 to a020815 Compare May 20, 2026 13:14
@openshift-ci openshift-ci Bot added the area/hypershift-operator label (Indicates the PR includes changes for the hypershift operator and API - outside an OCP release) May 20, 2026
@openshift-ci openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) May 20, 2026
@openshift-merge-bot

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke


openshift-ci Bot commented May 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: machi1990, stevekuznetsov
Once this PR has been reviewed and has the lgtm label, please assign csrwng for approval. For more information see the Code Review Process.


Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@miguelsorianod

Hi @stevekuznetsov,

To try to understand better whether we would need this at some point:

Could you expand on why we do not want/need that behavior? Isn't it possible that the CAPI CRDs evolve over time and we want to leverage the new ones?


cwbotbot commented May 20, 2026

Test Results

e2e-aws

e2e-aks

@stevekuznetsov

/retest

ERRO[2026-05-20T14:00:29Z] 
  * could not run steps: step [release:n1minor] failed: failed to get CLI image: unable to find the 'cli' image in the provided release image: unable to delete completed pod: could not delete completed pod: Operation cannot be fulfilled on Pod "release-images-n1minor-cli": the UID in the precondition (2c415d7c-258c-4259-a9d0-a6cad4faea5d) does not match the UID in record (62ea9337-faab-4325-8051-5d6d42f195da). The object might have been deleted and then recreated 

unrelated


openshift-ci Bot commented May 20, 2026

@stevekuznetsov: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws | a020815 | link | true | /test e2e-aws |
| ci/prow/e2e-aks | a020815 | link | true | /test e2e-aks |


@hypershift-jira-solve-ci


Test Failure Analysis Complete

Job Information

| Job | Build ID | Failed Step | Release | Error |
| --- | --- | --- | --- | --- |
| e2e-aws | 2057195357032419328 | [release:n2minor] | 4.21.0-0.ci-2026-05-20-024624 | pods "release-images-n2minor-cli" not found |
| e2e-aks | 2057195356956921856 | [release:n3minor] | 4.20.0-0.ci-2026-05-20-103924 | pod produced no output |
| Konflux | hypershift-operator-main-enterprise-contract | verify | | 2 of 256 enterprise-contract checks failed |
| Konflux | hypershift-operator-enterprise-contract | verify | | 2 of 256 enterprise-contract checks failed |

Test Failure Analysis

Error

e2e-aws: could not run steps: step [release:n2minor] failed: failed to get CLI image:
  unable to extract the 'cli' image from the release image: pods "release-images-n2minor-cli" not found

e2e-aks: could not run steps: step [release:n3minor] failed: failed to get CLI image:
  unable to extract the 'cli' image from the release image, pod produced no output

Konflux: Integration test for component hypershift-operator-main snapshot
  hypershift-operator-20260520-202302-000 — verify task: 254 success, 24 warnings, 2 failures

Summary

All four failures are CI infrastructure issues completely unrelated to PR #8556. The two Prow jobs (e2e-aws and e2e-aks) both failed during the ci-operator release import phase — before any test code ever executed — due to transient pod scheduling/lifecycle failures on the build01 CI cluster. The PR changes (adding a single --skip-crd-migration-phases flag to a deployment YAML and its test fixture) were never exercised. The two Konflux enterprise-contract failures are also a pre-existing, repo-wide issue: the identical failures reproduce on PRs #8554 and #8555 (and on the main branch), confirming they are not caused by this PR.

Root Cause

Prow Jobs (e2e-aws, e2e-aks):

Both Prow jobs failed during the [release:*minor] import step of the ci-operator execution graph — this is a CI infrastructure step that imports OCP release payloads into the CI namespace, occurring long before any test steps run. The jobs never reached the multi-stage test phases (pre/test/post).

The specific mechanism:

  1. ci-operator creates a pod (e.g., release-images-n2minor-cli) to extract the cli image from the release payload.
  2. In e2e-aws, the pod release-images-n2minor-cli was never found (likely never scheduled or was garbage-collected before ci-operator could read it).
  3. In e2e-aks, the equivalent pod release-images-n3minor-cli existed but produced no output (the container ran but failed silently).

Key evidence that this is infrastructure-only:

  • Different releases failed in each job: n2minor (4.21) failed in e2e-aws, while n3minor (4.20) failed in e2e-aks — if the PR caused this, the same release would fail consistently.
  • Fast-fail timing: The failed imports completed in ~12-13 seconds vs. 120-150 seconds for successful imports, indicating immediate pod lifecycle failures.
  • Both jobs shared the same CI namespace (ci-op-mgtk9n06 on build01), suggesting localized resource pressure or a scheduling issue on that cluster or namespace.
  • No test code was ever executed: The JUnit XML (junit_operator.xml) contains only ci-operator step-graph entries — no test-phase entries exist.
  • The PR diff is trivial: It adds a single CLI flag (--skip-crd-migration-phases=StorageVersionMigration,CleanupManagedFields) to a deployment YAML and its corresponding test fixture — this cannot affect CI release imports.
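
The evidence above amounts to a simple rule: a job failed before any test ran if its failed step is a release import, the ci-operator reason ends in `importing_release`, and the JUnit XML has no test-phase entries. A minimal sketch of that rule follows. The step names and reason string are taken from this report, while the `JobFailure` type and `failed_before_tests` helper are hypothetical and not part of ci-operator.

```python
import re
from dataclasses import dataclass

# Matches ci-operator release import steps such as "release:n2minor".
RELEASE_STEP = re.compile(r"^release:\w+$")


@dataclass
class JobFailure:
    """Hypothetical record of one failed Prow job, filled in by hand."""
    job: str
    failed_step: str
    reason: str
    test_phase_entries: int  # number of test-phase entries in the JUnit XML


def failed_before_tests(failure: JobFailure) -> bool:
    """Return True when the failure happened during release import, before tests."""
    return (
        RELEASE_STEP.match(failure.failed_step) is not None
        and failure.reason.endswith("importing_release")
        and failure.test_phase_entries == 0
    )


# Values copied from the evidence in this report.
failures = [
    JobFailure("e2e-aws", "release:n2minor",
               "executing_graph:step_failed:importing_release", 0),
    JobFailure("e2e-aks", "release:n3minor",
               "executing_graph:step_failed:importing_release", 0),
]

for failure in failures:
    print(failure.job, failed_before_tests(failure))
```

Both jobs in this report satisfy the rule, which is consistent with the conclusion that the PR's change was never exercised.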

Konflux Enterprise Contract:

Both Konflux checks (hypershift-operator-main-enterprise-contract and hypershift-operator-enterprise-contract) show 2 failures out of 256 checks (with 24 warnings). The same failures reproduce on PRs #8554 and #8555 and on the main branch.

This confirms the Konflux enterprise-contract failures are a pre-existing repo-wide issue unrelated to any specific PR.

Recommendations
  1. Rerun the Prow jobs — Both e2e-aws and e2e-aks failed due to transient CI infrastructure issues. A /retest comment on the PR should resolve these.

  2. Ignore the Konflux enterprise-contract failures — These are pre-existing, repo-wide failures that affect all PRs and main. They require a separate fix by the Konflux/enterprise-contract team or the repo's Konflux configuration maintainers.

  3. No code changes needed in PR #8556: the PR's change (adding the --skip-crd-migration-phases flag) is not involved in any of these failures.

Evidence

| Evidence | Detail |
| --- | --- |
| e2e-aws error | step [release:n2minor] failed: failed to get CLI image: pods "release-images-n2minor-cli" not found |
| e2e-aks error | step [release:n3minor] failed: failed to get CLI image: pod produced no output |
| Failure phase | importing_release, before any test execution |
| e2e-aws failed import time | 13.3s (vs 120-150s for successful imports) |
| e2e-aks failed import time | 12.1s (vs 120-150s for successful imports) |
| CI namespace | Both jobs shared ci-op-mgtk9n06 on build01.ci.openshift.org |
| Different releases failed | n2minor (4.21) in e2e-aws vs n3minor (4.20) in e2e-aks |
| PR diff scope | 2 files changed, 2 insertions; adds --skip-crd-migration-phases flag only |
| Konflux on PR #8555 | Same 2 enterprise-contract failures (not PR-specific) |
| Konflux on PR #8554 | Same 2 enterprise-contract failures (not PR-specific) |
| Konflux on main | hypershift-operator-*-enterprise-contract conclusion: neutral (known failures) |
| JUnit XML | Only ci-operator step-graph entries; no test-phase entries exist |
| ci-operator reason | executing_graph:step_failed:importing_release |

@celebdor celebdor changed the title from "fix: cpo: turn off cluster-api crdmigrator controller" to "OCPBUGS-86329: cpo: turn off cluster-api crdmigrator controller" May 21, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) labels May 21, 2026
@openshift-ci-robot

@stevekuznetsov: This pull request references Jira Issue OCPBUGS-86329, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

