OCPBUGS-9108: OCPBUGS-24228: Make MCO operator always Available, add retry to applyManifests before degrading #4240

djoshy · 2024-03-04T20:37:17Z

I refactored ApplyManifests by hoisting the waitForDaemonsetRollout call into the syncMachineConfigxxx functions, which is more consistent with the other sync functions. This also makes the retry cleaner, because retrying no longer multiplies the final wait timeout. Retries happen only on rpc errors and resource conflicts; in all other cases the operator degrades immediately. The operator will no longer report Available=False in any case.

This is slightly hard to test right now as these rpc errors/resource conflicts aren't as prevalent on 4.14+.

openshift-ci · 2024-03-04T20:37:29Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

djoshy · 2024-03-04T20:41:21Z

/test e2e-gcp-op

djoshy · 2024-03-04T20:44:13Z

/test unit
/test verify

djoshy · 2024-03-04T20:49:01Z

/test periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade

openshift-ci · 2024-03-04T20:49:22Z

@djoshy: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test 4.12-upgrade-from-stable-4.11-images
/test cluster-bootimages
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op
/test e2e-gcp-op-single-node
/test e2e-hypershift
/test images
/test okd-scos-images
/test unit
/test verify

The following commands are available to trigger optional jobs:

/test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
/test bootstrap-unit
/test e2e-aws-disruptive
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op
/test e2e-aws-ovn-upgrade-out-of-change
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-single-node
/test e2e-aws-upgrade-single-node
/test e2e-aws-workers-rhel8
/test e2e-azure
/test e2e-azure-ovn-upgrade
/test e2e-azure-ovn-upgrade-out-of-change
/test e2e-azure-upgrade
/test e2e-gcp-op-techpreview
/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-rt
/test e2e-gcp-rt-op
/test e2e-gcp-single-node
/test e2e-gcp-upgrade
/test e2e-metal-assisted
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack
/test e2e-openstack-dualstack
/test e2e-openstack-externallb
/test e2e-openstack-parallel
/test e2e-ovirt
/test e2e-ovirt-upgrade
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-upgrade
/test e2e-vsphere-upi
/test e2e-vsphere-upi-zones
/test e2e-vsphere-zones
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-e2e-vsphere
/test okd-images
/test okd-scos-e2e-aws-ovn
/test okd-scos-e2e-gcp-op
/test okd-scos-e2e-gcp-ovn-upgrade
/test okd-scos-e2e-vsphere
/test security

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-config-operator-master-bootstrap-unit
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade-out-of-change
pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-techpreview
pull-ci-openshift-machine-config-operator-master-e2e-hypershift
pull-ci-openshift-machine-config-operator-master-images
pull-ci-openshift-machine-config-operator-master-okd-images
pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-master-okd-scos-images
pull-ci-openshift-machine-config-operator-master-security
pull-ci-openshift-machine-config-operator-master-unit
pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

djoshy · 2024-03-11T14:53:32Z

/test unit
/test verify
/test e2e-gcp-op

openshift-ci-robot · 2024-03-11T14:58:22Z

@djoshy: This pull request references Jira Issue OCPBUGS-9108, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

djoshy · 2024-03-12T13:06:13Z

/test e2e-gcp-op-single-node

pkg/operator/sync.go

sinnykumari

Overall looks good.
/lgtm
Putting this on hold for QE testing
/hold

openshift-ci · 2024-03-13T19:27:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [djoshy,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sergiordlr · 2024-03-15T17:32:55Z

Verified using IPI on GCP

Verify that the machine-config CO now stays available when it becomes degraded

Delete all MCD pods in a loop:

watch -n 0.2 oc delete pods --force -l k8s-app=machine-config-daemon

Apply an MC to the master MCP (any MC will do).
After waiting 10 minutes, we can see that the machine-config CO is degraded but available.

$ oc get co machine-config
NAME             VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.15.0-0.ci.test-2024-03-14-104818-ci-ln-4v16t82-latest   True        False         True       127m    Failed to resync 4.15.0-0.ci.test-2024-03-14-104818-ci-ln-4v16t82-latest because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 0, unavailable: 5)]

Once we stop removing the MCD pods, the machineconfig is applied and the machine-config CO stops being degraded.

Verify that the machine-config CO also stays available after the sync loop is broken. (Without this fix, the machine-config CO would report Degraded=True and Available=False.)

Make all MCD pods fail with CreateContainerError:

while true; do sleep 0.2; oc patch ds machine-config-daemon --type json -p '[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["fake"]}]'; done

Wait 10 minutes until the machine-config CO is degraded. We can see that it is degraded, but Available=True.

$ oc get co machine-config
NAME             VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.15.0-0.ci.test-2024-03-15-085727-ci-ln-yfv7ndt-latest   True        False         True       3h      Failed to resync 4.15.0-0.ci.test-2024-03-15-085727-ci-ln-yfv7ndt-latest because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 0, unavailable: 5)]

All negative test cases were executed without problems.

An upgrade was also executed. It hit a problem, but the problem was not related to this fix, so we can consider this PR qe-approved.

/label qe-approved

openshift-ci-robot · 2024-03-15T17:33:06Z

@djoshy: This pull request references Jira Issue OCPBUGS-9108, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

The bug has been updated to refer to the pull request using the external bug tracker.

djoshy · 2024-03-21T14:39:09Z

/unhold

openshift-ci-robot · 2024-03-22T23:02:44Z

/retest-required

Remaining retests: 0 against base HEAD d9ffefd and 2 for PR HEAD f648930 in total

openshift-ci-robot · 2024-03-23T00:03:04Z

/retest-required

Remaining retests: 0 against base HEAD 4328697 and 1 for PR HEAD f648930 in total

djoshy · 2024-03-23T10:47:35Z

/retest-required

djoshy · 2024-03-25T13:01:05Z

/retest-required

openshift-ci · 2024-03-25T13:04:17Z

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | `f648930` | link | false | `/test e2e-azure-ovn-upgrade-out-of-change` |
| ci/prow/e2e-aws-ovn-upgrade-out-of-change | `f648930` | link | false | `/test e2e-aws-ovn-upgrade-out-of-change` |
| ci/prow/okd-scos-e2e-aws-ovn | `f648930` | link | false | `/test okd-scos-e2e-aws-ovn` |

djoshy · 2024-03-25T19:21:02Z

/retest-required

openshift-ci-robot · 2024-03-26T01:49:22Z

@djoshy: Jira Issue OCPBUGS-9108: All pull requests linked via external trackers have merged:

openshift/machine-config-operator#4240

Jira Issue OCPBUGS-9108 has been moved to the MODIFIED state.

openshift-bot · 2024-03-26T05:51:41Z

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.16.0-202403260112.p0.gabc3942.assembly.stream.el8 for distgit ose-machine-config-operator.
All builds following this will include this PR.

openshift-merge-robot · 2024-03-29T05:28:19Z

Fix included in accepted release 4.16.0-0.nightly-2024-03-28-223620

openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Mar 4, 2024

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 4, 2024

djoshy force-pushed the add-sync-retry branch from d93d2ce to eefacc9 on March 4, 2024 20:40

djoshy force-pushed the add-sync-retry branch 3 times, most recently from 3c703d4 to 305295c, on March 11, 2024 14:51

djoshy force-pushed the add-sync-retry branch from 305295c to f648930 on March 11, 2024 14:52

djoshy changed the title ~~DNM: testing adding retry to applyManifests~~ OCPBUGS-9108: OCPBUGS-24228: Make MCO operator always Available, adding retry to applyManifests before degrading Mar 11, 2024

openshift-ci-robot added the jira/severity-moderate (referenced Jira bug's severity is moderate for the branch this PR is targeting) and jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) labels Mar 11, 2024

openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) Mar 11, 2024

openshift-ci bot requested a review from rioliu-rh March 11, 2024 14:58

djoshy changed the title ~~OCPBUGS-9108: OCPBUGS-24228: Make MCO operator always Available, adding retry to applyManifests before degrading~~ OCPBUGS-9108: OCPBUGS-24228: Make MCO operator always Available, add retry to applyManifests before degrading Mar 11, 2024

djoshy marked this pull request as ready for review March 11, 2024 18:10

openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Mar 11, 2024

openshift-ci bot requested review from jkyros and yuqi-zhang March 11, 2024 18:11

sinnykumari reviewed Mar 13, 2024

sinnykumari reviewed Mar 13, 2024

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Mar 13, 2024

openshift-ci bot assigned sinnykumari Mar 13, 2024

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Mar 13, 2024

openshift-ci bot added the qe-approved label (signifies that QE has signed off on this PR) Mar 15, 2024

openshift-ci bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Mar 21, 2024

openshift-merge-bot bot merged commit abc3942 into openshift:master Mar 26, 2024
9 of 17 checks passed

djoshy deleted the add-sync-retry branch March 26, 2024 19:40
