-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-24061: Keep CSI operators progressing=true during DaemonSet rollout #1734
Conversation
@dobsonj: This pull request references Jira Issue OCPBUGS-24061, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (wduan@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@dobsonj: GitHub didn't allow me to request PR reviews from the following users: openshift/storage. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
case daemonSet.Status.NumberUnavailable > 0: | ||
case daemonSet.Status.NumberUnavailable > 0, | ||
daemonSet.Status.UpdatedNumberScheduled < desiredNumber, | ||
daemonSet.Status.NumberAvailable < desiredNumber: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand from the API docs, the comparison daemonSet.Status.NumberAvailable < desiredNumber
is the same as daemonSet.Status.NumberUnavailable > 0
.
Also, I think daemonSet.Status.UpdatedNumberScheduled < desiredNumber
will always evaluate to true before daemonSet.Status.NumberAvailable < desiredNumber
.
If this is true, we could drop daemonSet.Status.NumberAvailable < desiredNumber
and still get the same behavior.
Makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an idea: if we split this case
we could return more helpful messages, likeWaiting for X nodes to have pods running and available
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense, thanks @bertinatto :)
I pushed a new version, it may be a little too verbose, but let me know what you think:
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from False to True ("VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to act on changes")
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing message changed from "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to act on changes" to "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update node pods, 0 out of 5 scheduled"
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing message changed from "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update node pods, 0 out of 5 scheduled" to "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update node pods, 1 out of 5 scheduled"
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing message changed from "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update node pods, 1 out of 5 scheduled" to "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update node pods, 2 out of 5 scheduled"
1s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing message changed from "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update node pods, 2 out of 5 scheduled" to "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update node pods, 3 out of 5 scheduled"
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing message changed from "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update node pods, 3 out of 5 scheduled" to "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update node pods, 4 out of 5 scheduled"
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing message changed from "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update node pods, 4 out of 5 scheduled" to "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods"
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from True to False ("VSphereCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think it's better not to update the condition once for every node like my example above, because a 1000 node cluster would end up with 1000 status changes for a single DaemonSet rollout.
I pushed a simpler version that just says how many pods it will update:
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from False to True ("VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to act on changes")
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing message changed from "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to act on changes" to "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update 5 node pods"
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing message changed from "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to update 5 node pods" to "VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods"
0s Normal OperatorStatusChanged deployment/cluster-storage-operator Status for clusteroperator/storage changed: Progressing changed from True to False ("VSphereCSIDriverOperatorCRProgressing: All is well\nSHARESCSIDriverOperatorCRProgressing: All is well")
@dobsonj: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dobsonj, jsafrane The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@dobsonj: Jira Issue OCPBUGS-24061: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged: These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with Jira Issue OCPBUGS-24061 has not been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Fix included in accepted release 4.16.0-0.nightly-2024-05-14-095225 |
https://issues.redhat.com/browse/OCPBUGS-24061
Patching the ClusterCSIDriver causes the
progressing
condition to flip-flop between true and false during DaemonSet rollout. For example:$ oc -n openshift-cluster-storage-operator get event --sort-by='.lastTimestamp' --watch
$ oc patch clustercsidriver/csi.vsphere.vmware.com --type=merge -p '{"spec":{"driverConfig":{"vSphere":{"globalMaxSnapshotsPerBlockVolume": 5}}}}'
We should also be evaluating UpdatedNumberScheduled and NumberAvailable and wait until they reach DesiredNumberScheduled to avoid reporting progressing=false until the rollout is complete. This is consistent with what we already do in the CSO deployment controller.
After making these changes, and testing a vmware-vsphere-csi-driver-operator build with this change vendored:
/cc @openshift/storage @RomanBednar