operator: fix logic setting version for progressing status #855
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kikisdeliveryservice The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
On the face of it, this seems like an obvious change. But the logic here is... twisty.
@cgwalters Yeah, I'm not 100% sure on this, but what I think is happening is that syncProgressing is correctly called, but syncVersion, which is called right after (without this change), updates the version to the not-yet-rolled-out operator. Fast-forward, and that ends up as "the new version is available" while the cluster is still updating, when it should be "the old version is available" and the new version only becomes available once the update is complete.
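To make that ordering concrete, here is a minimal, self-contained sketch (simplified names, not the actual MCO code) of the difference between unconditionally copying the new operator version into the ClusterOperator and gating that update on Progressing:

```go
// A minimal sketch of the ordering concern; all names here are simplified assumptions.
package main

import "fmt"

type clusterOperator struct {
	progressing     bool
	reportedVersion string
}

// syncVersionNaive models the behaviour before this PR: it copies the
// operator's new version into the ClusterOperator unconditionally, so
// Available immediately advertises the target version.
func syncVersionNaive(co *clusterOperator, operatorVersion string) {
	co.reportedVersion = operatorVersion
}

// syncVersionGated models the intended behaviour: keep reporting the old
// version until the rollout is no longer progressing.
func syncVersionGated(co *clusterOperator, operatorVersion string) {
	if co.progressing {
		return // still rolling out; keep the previously reported version
	}
	co.reportedVersion = operatorVersion
}

func main() {
	co := &clusterOperator{progressing: true, reportedVersion: "4.1.0"}
	syncVersionNaive(co, "4.1.1")
	fmt.Println("naive:", co.reportedVersion) // "4.1.1" even though the upgrade is unfinished

	co = &clusterOperator{progressing: true, reportedVersion: "4.1.0"}
	syncVersionGated(co, "4.1.1")
	fmt.Println("gated:", co.reportedVersion) // still "4.1.0" until progressing flips to false
}
```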
Cross-linking https://bugzilla.redhat.com/show_bug.cgi?id=1708454
We surely have a faulty check in syncRequiredMachineConfigPools as well. I can't explain how it passes the check to report Available otherwise. I believe the issue lies in a race which I'm still trying to find.
Test timed out, but poking around in the logs, all of the nodes were successfully updated. Going to retest: /test e2e-aws-upgrade
So poking around in the logs, progressing is now correct, since the upgrade did not finish and stopped at 90%:
But I really am not sure about the Available message because it's misleading. Shouldn't the available version be the current version, not the version it is working towards? The Progressing message tells us what we are progressing to, but the Available message should tell us where we are, no?
But we did not officially change the operator version in the operator logs yet (and it timed out before that happened), which is good!
/test e2e-aws-upgrade
So this PR seems to solve the initial problem of the version being set before the upgrade is finished (see the operator logs; it's no longer immediately updating the version), but I'm hitting another problem where the test keeps timing out. I re-kicked e2e-aws-upgrade on another PR here: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/851/pull-ci-openshift-machine-config-operator-master-e2e-aws-upgrade/699 — that test eventually passed but took 3 hours(???). So I really can't tell whether this PR is failing the upgrade e2e because of my change (even though the logs look OK) or because the test is just taking forever and timing out due to other things.
/retest
MCC and MCO logs look sane
OK, now I'm understanding more from the first job failure (see also @kikisdeliveryservice's comments at #855 (comment)).
The above happens because when setting Available we only check whether we're degraded. We should also not report the new version if we're still progressing, I guess.
diff --git a/pkg/operator/status.go b/pkg/operator/status.go
index ec89f98e..05a2b71b 100644
--- a/pkg/operator/status.go
+++ b/pkg/operator/status.go
@@ -61,15 +61,14 @@ func (optr *Operator) syncAvailableStatus() error {
return nil
}
- optrVersion, _ := optr.vStore.Get("operator")
degraded := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded)
- message := fmt.Sprintf("Cluster has deployed %s", optrVersion)
+ message := fmt.Sprintf("Cluster has deployed %s", co.Status.Versions)
available := configv1.ConditionTrue
if degraded {
available = configv1.ConditionFalse
- message = fmt.Sprintf("Cluster not available for %s", optrVersion)
+ message = fmt.Sprintf("Cluster not available for %s", co.Status.Versions)
}
coStatus := configv1.ClusterOperatorStatusCondition{

The above takes care of my last comment. We were wrongly picking up the new version and setting it without waiting for Progressing=False, which is what sets the new version in the clusteroperator object. This happens when we're still Progressing without being Degraded (see syncAvailableStatus).
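For reference, a self-contained sketch of the approach in the diff above, assuming the openshift/api and library-go import paths; the surrounding operator plumbing is omitted:

```go
// A hedged sketch: build the Available message from the versions already
// recorded on the ClusterOperator (co.Status.Versions) instead of the
// operator's new target version.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	cov1helpers "github.com/openshift/library-go/pkg/config/clusteroperator/v1helpers"
)

func availableCondition(co *configv1.ClusterOperator) configv1.ClusterOperatorStatusCondition {
	degraded := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded)

	// co.Status.Versions is only advanced once Progressing flips back to False,
	// so it still names the old release while an upgrade is underway.
	message := fmt.Sprintf("Cluster has deployed %s", co.Status.Versions)
	available := configv1.ConditionTrue
	if degraded {
		available = configv1.ConditionFalse
		message = fmt.Sprintf("Cluster not available for %s", co.Status.Versions)
	}
	return configv1.ClusterOperatorStatusCondition{
		Type:    configv1.OperatorAvailable,
		Status:  available,
		Message: message,
	}
}

func main() {
	co := &configv1.ClusterOperator{}
	fmt.Println(availableCondition(co).Message)
}
```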
@@ -29,7 +29,7 @@ func (optr *Operator) syncVersion() error {
 }

 // keep the old version and progressing if we fail progressing
-if cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing) && cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded) {
+if cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing) || cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded) {
Debugging with Alex: this || makes it impossible to flip from Progressing=True to False, because we now never report the new version to the cluster object.
in other words, there's no way right now to flip from progressing=true & old version to progressing=false & new version reported here
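A minimal sketch of that deadlock (simplified names, and the version-skew rule for Progressing is an assumption about the controller, not a quote of it): with `||`, the version is never written while Progressing is true, but Progressing only clears once the version is written.

```go
// Illustrative only: shows why gating the version update on Progressing OR
// Degraded can wedge the status machine.
package main

import "fmt"

type status struct {
	progressing     bool
	degraded        bool
	reportedVersion string
}

func syncVersion(s *status, target string) {
	// keep the old version if we are progressing OR degraded (the `||` variant)
	if s.progressing || s.degraded {
		return
	}
	s.reportedVersion = target
}

func syncProgressing(s *status, target string) {
	// assumed rule: progressing while the reported version still differs
	// from the target version
	s.progressing = s.reportedVersion != target
}

func main() {
	s := &status{progressing: true, reportedVersion: "4.1.0"}
	for i := 0; i < 3; i++ {
		syncVersion(s, "4.1.1")
		syncProgressing(s, "4.1.1")
	}
	// Neither side can resolve: syncVersion waits on progressing=false,
	// and progressing waits on the new version being reported.
	fmt.Printf("progressing=%v version=%s\n", s.progressing, s.reportedVersion)
}
```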
OK, looking again at https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/23161/pull-ci-openshift-origin-release-4.1-e2e-aws-upgrade/51/artifacts/e2e-aws-upgrade/pods/ — the progressing logic is flawed, and we're thinking about basing that status on the master pool status as well, rather than only on the version skew between the clusteroperator version and the MCO one.
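A hedged sketch of that idea, deriving Progressing from the master pool state in addition to the version skew. The pool fields used here (MachineCount, UpdatedMachineCount, UnavailableMachineCount, status configuration name) exist on MachineConfigPoolStatus, but the helper itself is an illustrative stand-in, not the operator's actual logic:

```go
package main

import (
	"fmt"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// masterPoolSettled reports whether the master pool has fully rolled out the
// rendered config we expect for the target release.
func masterPoolSettled(pool *mcfgv1.MachineConfigPool, wantRendered string) bool {
	if pool.Status.Configuration.Name != wantRendered {
		return false
	}
	return pool.Status.UpdatedMachineCount == pool.Status.MachineCount &&
		pool.Status.UnavailableMachineCount == 0
}

// progressing is true if either the reported version lags the target or the
// master pool has not finished rolling out the matching rendered config.
func progressing(versionSkew bool, pool *mcfgv1.MachineConfigPool, wantRendered string) bool {
	return versionSkew || !masterPoolSettled(pool, wantRendered)
}

func main() {
	pool := &mcfgv1.MachineConfigPool{}
	fmt.Println(progressing(true, pool, "rendered-master-abc"))
}
```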
OK, the race is: the node_controller isn't updating desiredConfig in time.
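To illustrate the race, a hedged sketch using the desiredConfig/currentConfig node annotations the MCO stamps on nodes; the check itself is a simplified stand-in for the real pool/node status logic:

```go
// If the operator checks node state before the node controller has stamped the
// new desiredConfig annotation, every node still shows desired == current and
// the pool looks "done" for the old config -- which is exactly the race.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

const (
	currentConfigAnnotation = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigAnnotation = "machineconfiguration.openshift.io/desiredConfig"
)

// nodeDone is true when the node has converged on its desired config. If the
// node controller has not yet bumped desiredConfig to the new rendered config,
// this returns true prematurely.
func nodeDone(node *corev1.Node) bool {
	cur := node.Annotations[currentConfigAnnotation]
	des := node.Annotations[desiredConfigAnnotation]
	return cur != "" && cur == des
}

func main() {
	node := &corev1.Node{}
	node.Annotations = map[string]string{
		currentConfigAnnotation: "rendered-master-old",
		desiredConfigAnnotation: "rendered-master-old", // not yet updated by the node controller
	}
	fmt.Println(nodeDone(node)) // true, even though an upgrade is pending
}
```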
We really need the full chain to be complete to report progressing=false:
Note that we don't need to wait for the node controller to be idle because that's covered by the last condition.
What's happening is that the Generation of a pool isn't updated (it's always at 1) because we never update its spec; we only update its status.configuration in 4.1. Colin's patch should fix that.
The problem isn't there, though. That code runs fine (though it may be flawed anyway); the failure is in checking the correct generation of the MCP we're comparing against in syncRequiredMachineConfigPools, and I'm going to check that we don't have such an issue in master because of #773.
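A hedged sketch of the kind of generation check being discussed, using the MCO's in-tree MachineConfigPool type; the function is illustrative, not the actual syncRequiredMachineConfigPools implementation:

```go
package main

import (
	"fmt"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// poolStatusIsCurrent ensures the controller has observed the latest spec
// before the pool's status is trusted. If the spec is never updated,
// Generation stays at 1 and a check like this passes vacuously against a
// stale status -- the failure mode described above.
func poolStatusIsCurrent(pool *mcfgv1.MachineConfigPool) error {
	if pool.Status.ObservedGeneration != pool.Generation {
		return fmt.Errorf("pool %s status is stale: observedGeneration %d != generation %d",
			pool.Name, pool.Status.ObservedGeneration, pool.Generation)
	}
	return nil
}

func main() {
	pool := &mcfgv1.MachineConfigPool{}
	fmt.Println(poolStatusIsCurrent(pool))
}
```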
/skip
@kikisdeliveryservice: The following tests failed, say /retest to rerun them all:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
In an effort to clean up the MCO repo, closing old open PRs with no recent activity. Feel free to reopen.
Background: in CI we are seeing a cluster that is in the middle of upgrading report a CVO status of Available at the target version, as opposed to the expected status of Progressing until the upgrade is completed.