
operator: fix logic setting version for progressing status #855

Conversation

kikisdeliveryservice
Contributor

@kikisdeliveryservice kikisdeliveryservice commented Jun 13, 2019

Background: in CI we are seeing a cluster that is in the middle of upgrading report a CVO status of Available at the target version, instead of the expected status of Progressing until the upgrade is completed.

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jun 13, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2019
@kikisdeliveryservice kikisdeliveryservice requested review from runcom and removed request for jlebon and LorbusChris June 13, 2019 22:30
@cgwalters
Member

On the face of it, this seems like an obvious change. But the logic here is... twisty. syncProgressingStatus(), which is responsible for setting Progressing, also looks at the version store, so this feels circular.

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented Jun 13, 2019

@cgwalters Yeah, I'm not 100% sure on this, but what I think is happening is that syncProgressing is correctly called, but then syncVersion, which is called right after (without this change), updates the version to the not-yet-updated operator. Fast-forward, and that ends up as the new version being reported as Available while the cluster is still updating, when it should be the old version that is Available, with the new version only becoming Available once the update is complete.
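
A minimal sketch of the ordering being described, with stand-in types and functions rather than the actual MCO sync loop: if the version is published right after the Progressing check without any guard, the target version appears on the clusteroperator while the rollout is still in flight.

// Illustrative only: stand-ins for the sync ordering described above,
// not the real MCO code.
package main

import "fmt"

type fakeClusterOperator struct {
	progressing     bool
	reportedVersion string
}

func syncProgressingStatus(co *fakeClusterOperator, target string) {
	// Correctly reports Progressing=True because we are not yet at target.
	co.progressing = co.reportedVersion != target
}

func syncVersionWithoutGuard(co *fakeClusterOperator, target string) {
	// Without a guard, the new version is written immediately, even though
	// nothing has actually finished rolling out yet.
	co.reportedVersion = target
}

func main() {
	co := &fakeClusterOperator{reportedVersion: "0.0.1-old"}
	syncProgressingStatus(co, "0.0.1-new")
	syncVersionWithoutGuard(co, "0.0.1-new")
	// Prints: progressing=true reportedVersion=0.0.1-new (the mismatch under discussion).
	fmt.Printf("progressing=%v reportedVersion=%s\n", co.progressing, co.reportedVersion)
}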

@wking
Member

wking commented Jun 13, 2019

@runcom
Member

runcom commented Jun 13, 2019

We surely have a faulty check in syncRequiredMachineConfigPools as well; I can't otherwise explain how it passes the check to report Available. I believe the issue lies in a race which I'm still trying to find.

@kikisdeliveryservice
Contributor Author

Test timed out, but poking around in the logs, all of the nodes were successfully updated. Going to retest:

/test e2e-aws-upgrade

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented Jun 14, 2019

So poking around in the logs, Progressing is now correct, since the upgrade did not finish and stopped at 90%:

{
            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2019-06-13T22:48:43Z",
                "generation": 1,
                "name": "machine-config",
                "resourceVersion": "35095",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/machine-config",
                "uid": "657b3a25-8e2d-11e9-9c98-12d94da3a8e8"
            },
            "spec": {},
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2019-06-13T22:49:22Z",
                        "message": "Cluster has deployed 0.0.1-2019-06-13-223247",
                        "status": "True",
                        "type": "Available"
                    },
                    {
                        "lastTransitionTime": "2019-06-13T23:19:51Z",
                        "message": "Working towards 0.0.1-2019-06-13-223247",
                        "status": "True",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2019-06-13T22:48:43Z",
                        "status": "False",
                        "type": "Degraded"
                    }
                ],
                "extension": {
                    "master": "all 3 nodes are at latest configuration rendered-master-b24a8176c20faebab011f3ed6e554d34",
                    "worker": "all 3 nodes are at latest configuration rendered-worker-18b7b8832dcd836e7d385662b1a4d7cf"
                },

But I'm really not sure about the Available message because it's misleading. Shouldn't the Available version be the current version, not the version it is working towards? The Progressing message tells us what we are progressing to, but the Available message should tell us where we are, no?

message := fmt.Sprintf("Cluster has deployed %s", optrVersion)

But we did not officially change the operator version in the operator logs yet (the test timed out before that happened), which is good:

I0613 23:29:49.332563       1 kubelet_config_controller.go:174] Starting MachineConfigController-KubeletConfigController
I0613 23:29:49.332540       1 container_runtime_config_controller.go:170] Starting MachineConfigController-ContainerRuntimeConfigController
I0613 23:29:54.238992       1 status.go:82] Pool worker: All nodes are updated with rendered-worker-18b7b8832dcd836e7d385662b1a4d7cf
I0613 23:30:07.396318       1 node_controller.go:220] Pool master: node ip-10-0-135-143.ec2.internal has completed update to rendered-master-b24a8176c20faebab011f3ed6e554d34
I0613 23:30:07.447398       1 node_controller.go:213] Pool master: node ip-10-0-135-143.ec2.internal is now reporting ready
I0613 23:30:12.396669       1 status.go:82] Pool master: All nodes are updated with rendered-master-b24a8176c20faebab011f3ed6e554d34

@kikisdeliveryservice
Contributor Author

/test e2e-aws-upgrade

@kikisdeliveryservice
Contributor Author

So this PR seems to solve the initial problem of the version being set before the upgrade is finished (see the operator logs; it no longer immediately updates the version), but I'm hitting another problem where the test keeps timing out. I rekicked e2e-aws-upgrade on another PR here: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/851/pull-ci-openshift-machine-config-operator-master-e2e-aws-upgrade/699; that test eventually passed but took 3 hours(???). So I really can't tell whether this PR is failing the upgrade e2e because of my change (even though the logs look OK) or because the test is just taking forever and timing out due to other things.

cc: @runcom @wking

@runcom
Member

runcom commented Jun 14, 2019

/retest

@runcom
Member

runcom commented Jun 14, 2019

I0614 04:03:02.618360       1 status.go:82] Pool master: All nodes are updated with rendered-master-624af8792aa560921625cd72e57628cf


I0614 04:03:03.005071       1 event.go:209] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"9437fbd5-8e53-11e9-82ab-12f47fb64940", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator version changed from [{operator 0.0.1-2019-06-14-030242}] to [{operator 0.0.1-2019-06-14-030413}]

MCC and MCO logs look sane

@runcom
Member

runcom commented Jun 14, 2019

OK, now I'm understanding more from the first job failure (see also @kikisdeliveryservice's comments at #855 (comment)).

The above happens because, when setting Available, we only check that we're not Degraded. We should also NOT report the new version if we're still Progressing, I guess.
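
A minimal sketch of that extra check, with stand-in parameters rather than the real syncAvailableStatus (which is shown in the diff in the next comment): while Progressing is True, keep advertising whatever version the clusteroperator already reports instead of the one we are working towards.

// Illustrative only: stand-in helper, not the actual syncAvailableStatus.
package status

import "fmt"

// availableMessage picks which version to advertise in the Available
// condition. While Progressing is True we stick with the version already
// reported on the clusteroperator object rather than the target version.
func availableMessage(progressing, degraded bool, reportedVersion, targetVersion string) string {
	version := targetVersion
	if progressing {
		version = reportedVersion
	}
	if degraded {
		return fmt.Sprintf("Cluster not available for %s", version)
	}
	return fmt.Sprintf("Cluster has deployed %s", version)
}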

@runcom
Member

runcom commented Jun 14, 2019

diff --git a/pkg/operator/status.go b/pkg/operator/status.go
index ec89f98e..05a2b71b 100644
--- a/pkg/operator/status.go
+++ b/pkg/operator/status.go
@@ -61,15 +61,14 @@ func (optr *Operator) syncAvailableStatus() error {
                return nil
        }
 
-       optrVersion, _ := optr.vStore.Get("operator")
        degraded := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded)
-       message := fmt.Sprintf("Cluster has deployed %s", optrVersion)
+       message := fmt.Sprintf("Cluster has deployed %s", co.Status.Versions)
 
        available := configv1.ConditionTrue
 
        if degraded {
                available = configv1.ConditionFalse
-               message = fmt.Sprintf("Cluster not available for %s", optrVersion)
+               message = fmt.Sprintf("Cluster not available for %s", co.Status.Versions)
        }
 
        coStatus := configv1.ClusterOperatorStatusCondition{

The above takes care of my last comment. We were wrongly picking up the new version and setting it without waiting for Progressing=False, which is what sets the new version in the clusteroperator object. This happens when we're still Progressing without being Degraded (see syncAvailableStatus).

@@ -29,7 +29,7 @@ func (optr *Operator) syncVersion() error {
        }
 
        // keep the old version and progressing if we fail progressing
-       if cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing) && cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded) {
+       if cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing) || cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded) {
Member

Debugging with Alex. This || makes it impossible to flip from Progressing=True to False, because we now never report the new version to the clusteroperator object.

Member

In other words, there's no way right now to flip from Progressing=True & old version to Progressing=False & new version reported here.
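
A sketch of the deadlock in stand-in form (not the real syncVersion): with ||, the guard that holds back the new version is active exactly while Progressing is True, but Progressing can only flip back to False after the new version has been reported, so neither side ever moves.

// Illustrative only: stand-ins for the two guards being compared in syncVersion.
package status

// keepOldVersionAnd mirrors the original guard: hold back the new version
// only when we are Degraded while Progressing ("if we fail progressing").
func keepOldVersionAnd(progressing, degraded bool) bool {
	return progressing && degraded
}

// keepOldVersionOr mirrors the reviewed change: hold back the new version
// whenever we are Progressing at all. Since Progressing only clears once the
// new version has been reported, this condition never releases: the
// "impossible to flip" situation described above.
func keepOldVersionOr(progressing, degraded bool) bool {
	return progressing || degraded
}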

@runcom
Member

runcom commented Jun 14, 2019

Ok, looking again at https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/23161/pull-ci-openshift-origin-release-4.1-e2e-aws-upgrade/51/artifacts/e2e-aws-upgrade/pods/

The Progressing logic is flawed; we're thinking about basing that status on the master pool's status as well, in addition to the version skew between the clusteroperator version and the MCO one.

@runcom
Member

runcom commented Jun 14, 2019

Ok, the race is:

  • 18:07:56: new Operator goes up
  • start syncing
  • syncRequiredMachineConfigPools passes because the master pool is idle but hasn't started updating to the new rendered MachineConfig
  • 18:09:28: new version reported
  • 18:10:06: the master pool starts rolling the new templates

syncRequiredMachineConfigPools needs a stronger check

The race is that the node_controller isn't updating desiredConfig in time.
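
A sketch of the kind of stronger check being suggested, with hypothetical field names (DesiredConfig, CurrentConfig, UpdatedMachines, and so on are stand-ins, not the real MachineConfigPool API): a pool should only count as done once it has actually been pointed at the new rendered config and every machine has caught up, not merely because it happens to be idle.

// Illustrative only: the struct and field names are hypothetical stand-ins
// for a MachineConfigPool's status, not the real API types.
package status

type poolView struct {
	Generation         int64  // bumped when the pool's spec changes
	ObservedGeneration int64  // last generation the controller acted on
	DesiredConfig      string // rendered MachineConfig the pool should move to
	CurrentConfig      string // rendered MachineConfig the pool is on now
	UpdatedMachines    int
	TotalMachines      int
}

// poolFullyUpdated is the stronger check: "idle" alone is not enough, because
// in the race above the pool looks idle simply because the node controller
// has not pushed the new desiredConfig yet.
func poolFullyUpdated(p poolView, wantConfig string) bool {
	return p.ObservedGeneration >= p.Generation &&
		p.DesiredConfig == wantConfig &&
		p.CurrentConfig == wantConfig &&
		p.UpdatedMachines == p.TotalMachines
}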

@cgwalters
Member

We really need the full chain to be complete to report progressing=false:

  • operator is updated
  • controller has rolled out and is idle (most importantly the template, render sub-controllers)
  • the master pool's spec configuration must equal its status configuration

Note that we don't need to wait for the node controller to be idle because that's covered by the last condition.
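
As a sketch, the three bullets above could collapse into a single predicate; the inputs here are hypothetical summaries of the chain, not actual operator state.

// Illustrative only: hypothetical booleans summarizing the chain above,
// not the actual MCO progressing logic.
package status

type rolloutState struct {
	operatorUpdated        bool // the operator itself is running the new version
	controllerRolledOut    bool // template/render sub-controllers are updated and idle
	masterSpecEqualsStatus bool // master pool's spec configuration == status configuration
}

// doneProgressing reports Progressing=False only when the whole chain is
// complete. The node controller is not checked separately because the last
// condition already implies it has finished.
func doneProgressing(s rolloutState) bool {
	return s.operatorUpdated && s.controllerRolledOut && s.masterSpecEqualsStatus
}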

@runcom
Member

runcom commented Jun 14, 2019

What's happening is that the Generation of a pool isn't updated (it's always at 1) because we never update its spec; we only update its status.configuration in 4.1. Colin's patch should fix that.
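
A short sketch of why a generation-based comparison tells you nothing in that case (field names are stand-ins): if only status.configuration ever changes, metadata.generation never moves past 1, so "observed generation has caught up" is trivially true whether or not the pool has rolled out anything.

// Illustrative only: stand-in fields, not the real MachineConfigPool metadata.
package status

type poolMeta struct {
	Generation         int64 // only bumped when spec changes; stuck at 1 here
	ObservedGeneration int64
}

// generationCaughtUp is trivially satisfied when the spec is never touched,
// so it cannot distinguish an upgraded pool from one that has not started.
func generationCaughtUp(p poolMeta) bool {
	return p.ObservedGeneration >= p.Generation
}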

@runcom
Member

runcom commented Jun 14, 2019

We really need the full chain to be complete to report progressing=false:

The problem isn't there, though. That code runs fine (though it may be flawed anyway); the failure is in checking the correct generation of the MCP we're comparing against in syncRequiredMachineConfigPools, and I'm going to check that we don't have such an issue in master because of #773.

@kikisdeliveryservice
Contributor Author

/skip

@openshift-ci-robot
Contributor

@kikisdeliveryservice: The following tests failed, say /retest to rerun them all:

Test name                   Commit   Rerun command
ci/prow/e2e-aws-disruptive  14bb514  /test e2e-aws-disruptive
ci/prow/e2e-gcp-op          14bb514  /test e2e-gcp-op
ci/prow/e2e-aws-upgrade     14bb514  /test e2e-aws-upgrade
ci/prow/e2e-gcp-upgrade     14bb514  /test e2e-gcp-upgrade
ci/prow/e2e-vsphere         14bb514  /test e2e-vsphere

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kikisdeliveryservice kikisdeliveryservice added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 3, 2019
@kikisdeliveryservice
Contributor Author

In an effort to clean up the MCO repo, closing old open PRs with no recent activity.

Feel free to reopen.
