New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-9108: OCPBUGS-24228: Make MCO operator always Available, add retry to applyManifests before degrading #4240
Conversation
Skipping CI for Draft Pull Request. |
/test e2e-gcp-op |
/test unit |
/test periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade |
@djoshy: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
3c703d4
to
305295c
Compare
Refactored ApplyManifests by hoisting the waitForDaemonsetRollout call to the syncMachineConfigxxx function as this is more consistent with the other sync functions. This also makes the retry action cleaner by not multiplying the final wait timeout. Retry only takes place on rpc errors and resource conflicts, for all other cases, an immediate degrade takes place. Operator will no longer report Available=False for any case.
/test unit |
@djoshy: This pull request references Jira Issue OCPBUGS-9108, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/test e2e-gcp-op-single-node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good.
/lgtm
Putting hold for QE testing
/hold
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: djoshy, sinnykumari The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Verified using IPI on GCP Verify that now machine-config CO is available when it becomes degraded
Verify that after breaking the sync loop the machine-config CO is also available. (Without this fix the machine-config CO would be degraded=true and available=false)
All negative test cases were executed without problems. An upgrade was executed, there was a problem in the upgrade but it was not related to this fix, so we can consider this PR qe-approved. /label qe-approved |
@djoshy: This pull request references Jira Issue OCPBUGS-9108, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/unhold |
/retest-required |
1 similar comment
/retest-required |
@djoshy: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/retest-required |
abc3942
into
openshift:master
@djoshy: Jira Issue OCPBUGS-9108: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-9108 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[ART PR BUILD NOTIFIER] This PR has been included in build ose-machine-config-operator-container-v4.16.0-202403260112.p0.gabc3942.assembly.stream.el8 for distgit ose-machine-config-operator. |
Fix included in accepted release 4.16.0-0.nightly-2024-03-28-223620 |
I refactored ApplyManifests by hoisting the waitForDaemonsetRollout call to the syncMachineConfigxxx function as this is more consistent with the other sync functions. This also makes the retry action cleaner by not multiplying the final wait timeout. Retry only takes place on rpc errors and resource conflicts, for all other cases, an immediate degrade takes place. Operator will no longer report Available=False for any case.
This is slightly hard to test right now as these rpc errors/resource conflicts aren't as prevalent on 4.14+.