OCPBUGS-9108: OCPBUGS-24228: Make MCO operator always Available, add retry to applyManifests before degrading #4240

Merged
merged 1 commit into from Mar 26, 2024

Conversation

djoshy
Contributor

@djoshy djoshy commented Mar 4, 2024

I refactored ApplyManifests by hoisting the waitForDaemonsetRollout call into the syncMachineConfigxxx function, which is more consistent with the other sync functions. This also makes the retry cleaner, because the final wait timeout is no longer multiplied by the number of retries. Retries happen only on rpc errors and resource conflicts; in all other cases, the operator degrades immediately. The operator will no longer report Available=False in any case.

This is slightly hard to test right now, as these rpc errors/resource conflicts aren't as prevalent on 4.14+.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Mar 4, 2024
Contributor

openshift-ci bot commented Mar 4, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Mar 4, 2024
@djoshy
Contributor Author

djoshy commented Mar 4, 2024

/test e2e-gcp-op

@djoshy
Contributor Author

djoshy commented Mar 4, 2024

/test unit
/test verify

@djoshy
Contributor Author

djoshy commented Mar 4, 2024

/test periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade

Contributor

openshift-ci bot commented Mar 4, 2024

@djoshy: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test e2e-gcp-op-single-node
  • /test e2e-hypershift
  • /test images
  • /test okd-scos-images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-fips-op
  • /test e2e-aws-ovn-upgrade-out-of-change
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-ovn-upgrade-out-of-change
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-techpreview
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-rt
  • /test e2e-gcp-rt-op
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-dualstack
  • /test e2e-openstack-externallb
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test e2e-vsphere-upi-zones
  • /test e2e-vsphere-zones
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-e2e-aws-ovn
  • /test okd-scos-e2e-gcp-op
  • /test okd-scos-e2e-gcp-ovn-upgrade
  • /test okd-scos-e2e-vsphere
  • /test security

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-bootstrap-unit
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade-out-of-change
  • pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-techpreview
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-security
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@djoshy djoshy force-pushed the add-sync-retry branch 3 times, most recently from 3c703d4 to 305295c Compare March 11, 2024 14:51
Refactored ApplyManifests by hoisting the waitForDaemonsetRollout call into the syncMachineConfigxxx function, which is more consistent with the other sync functions. This also makes the retry cleaner, because the final wait timeout is no longer multiplied by the number of retries. Retries happen only on rpc errors and resource conflicts; in all other cases, the operator degrades immediately. The operator will no longer report Available=False in any case.
@djoshy
Contributor Author

djoshy commented Mar 11, 2024

/test unit
/test verify
/test e2e-gcp-op

@djoshy djoshy changed the title from "DNM: testing adding retry to applyManifests" to "OCPBUGS-9108: OCPBUGS-24228: Make MCO operator always Available, adding retry to applyManifests before degrading" Mar 11, 2024
@openshift-ci-robot openshift-ci-robot added the jira/severity-moderate (Referenced Jira bug's severity is moderate for the branch this PR is targeting.) and jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) labels Mar 11, 2024
@openshift-ci-robot
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-9108, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) Mar 11, 2024
@openshift-ci openshift-ci bot requested a review from rioliu-rh March 11, 2024 14:58
@djoshy djoshy changed the title from "OCPBUGS-9108: OCPBUGS-24228: Make MCO operator always Available, adding retry to applyManifests before degrading" to "OCPBUGS-9108: OCPBUGS-24228: Make MCO operator always Available, add retry to applyManifests before degrading" Mar 11, 2024
@djoshy djoshy marked this pull request as ready for review March 11, 2024 18:10
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Mar 11, 2024
@djoshy
Contributor Author

djoshy commented Mar 12, 2024

/test e2e-gcp-op-single-node

Contributor

@sinnykumari sinnykumari left a comment

Overall looks good.
/lgtm
Putting hold for QE testing
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Mar 13, 2024
@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Mar 13, 2024
Contributor

openshift-ci bot commented Mar 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sergiordlr

Verified using IPI on GCP

Verify that the machine-config CO now stays available when it becomes degraded

  1. Delete all MCD pods in a loop:

watch -n 0.2 oc delete pods --force -l k8s-app=machine-config-daemon

  2. Apply an MC to the master MCP (any MC)

  3. After waiting 10 minutes, we can see that the machine-config CO is degraded but available

$ oc get co machine-config
NAME             VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.15.0-0.ci.test-2024-03-14-104818-ci-ln-4v16t82-latest   True        False         True       127m    Failed to resync 4.15.0-0.ci.test-2024-03-14-104818-ci-ln-4v16t82-latest because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 0, unavailable: 5)]

  4. Once we stop removing the MCD pods, the MachineConfig is applied and the machine-config CO stops being degraded.
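
The "is not ready" part of the message above can be illustrated with a small Go sketch. This is an assumption-laden illustration, not the operator's actual check: `daemonSetStatus` and `rolloutError` are hypothetical names, and the readiness rule (all desired pods updated and ready, none unavailable) is a guess based only on the counters shown in the message.

```go
package main

import "fmt"

// daemonSetStatus is a hypothetical, trimmed-down mirror of the counters in the message.
type daemonSetStatus struct {
	Desired, Updated, Ready, Unavailable int
}

// rolloutError returns nil when the daemonset looks fully rolled out and
// otherwise an error carrying the same counters as the message in the output.
func rolloutError(name string, s daemonSetStatus) error {
	if s.Updated == s.Desired && s.Ready == s.Desired && s.Unavailable == 0 {
		return nil
	}
	return fmt.Errorf("daemonset %s is not ready. status: (desired: %d, updated: %d, ready: %d, unavailable: %d)",
		name, s.Desired, s.Updated, s.Ready, s.Unavailable)
}

func main() {
	// Counters taken from the scenario above: pods are updated but never become ready.
	fmt.Println(rolloutError("machine-config-daemon", daemonSetStatus{5, 5, 0, 5}))
}
```

While the MCD pods keep being deleted, a check like this keeps failing until the wait's deadline expires, which is where the "context deadline exceeded" half of the message comes from.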

Verify that the machine-config CO also stays available after the sync loop is broken. (Without this fix, the machine-config CO would be degraded=true and available=false.)

  1. Make all MCD pods fail with CreateContainerError

while true; do sleep 0.2; oc patch ds machine-config-daemon --type json -p '[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["fake"]}]'; done

  2. Wait 10 minutes until the machine-config CO is degraded. We can see that it is degraded, but Available=true

$ oc get co machine-config
NAME             VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.15.0-0.ci.test-2024-03-15-085727-ci-ln-yfv7ndt-latest   True        False         True       3h      Failed to resync 4.15.0-0.ci.test-2024-03-15-085727-ci-ln-yfv7ndt-latest because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 0, unavailable: 5)]

All negative test cases were executed without problems.

An upgrade was also executed. It hit a problem, but the problem was not related to this fix, so we can consider this PR qe-approved.

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved label (Signifies that QE has signed off on this PR) Mar 15, 2024
@openshift-ci-robot
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-9108, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

The bug has been updated to refer to the pull request using the external bug tracker.


@djoshy
Contributor Author

djoshy commented Mar 21, 2024

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Mar 21, 2024
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD d9ffefd and 2 for PR HEAD f648930 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 4328697 and 1 for PR HEAD f648930 in total

@djoshy
Contributor Author

djoshy commented Mar 23, 2024

/retest-required

@djoshy
Contributor Author

djoshy commented Mar 25, 2024

/retest-required

Contributor

openshift-ci bot commented Mar 25, 2024

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-azure-ovn-upgrade-out-of-change | f648930 | link | false | /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/e2e-aws-ovn-upgrade-out-of-change | f648930 | link | false | /test e2e-aws-ovn-upgrade-out-of-change
ci/prow/okd-scos-e2e-aws-ovn | f648930 | link | false | /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@djoshy
Contributor Author

djoshy commented Mar 25, 2024

/retest-required

@openshift-merge-bot openshift-merge-bot bot merged commit abc3942 into openshift:master Mar 26, 2024
9 of 17 checks passed
@openshift-ci-robot
Contributor

@djoshy: Jira Issue OCPBUGS-9108: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-9108 has been moved to the MODIFIED state.


@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.16.0-202403260112.p0.gabc3942.assembly.stream.el8 for distgit ose-machine-config-operator.
All builds following this will include this PR.

@djoshy djoshy deleted the add-sync-retry branch March 26, 2024 19:40
@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-03-28-223620

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • qe-approved: Signifies that QE has signed off on this PR
6 participants