
operator: fix logic setting version for progressing status #855

Conversation

kikisdeliveryservice
Contributor

@kikisdeliveryservice kikisdeliveryservice commented Jun 13, 2019

Background: in CI we are seeing a cluster that is in the middle of upgrading report a CVO status of Available at the target version, instead of the expected status of Progressing until the upgrade is completed.

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jun 13, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2019
@kikisdeliveryservice kikisdeliveryservice requested review from runcom and removed request for jlebon and LorbusChris June 13, 2019 22:30
@cgwalters
Member

On the face of it, this seems like an obvious change. But the logic here is... twisty. syncProgressingStatus(), which is responsible for setting Progressing, also looks at the version store, so this feels circular.

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented Jun 13, 2019

@cgwalters Yeah, I'm not 100% sure on this, but what I think is happening is that syncProgressing is correctly called, but then syncVersion, which is called right after (without this change), updates the version to the not-yet-updated operator. Fast-forward, and that ends up as the new version being reported as Available while the cluster is still updating, when it should be the old version that is Available, with the new version only becoming Available once the update is complete.
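
A minimal sketch of the ordering being described, with stand-in types and functions rather than the actual MCO sync loop: if the version is published right after the Progressing check without any guard, the target version appears on the clusteroperator while the rollout is still in flight.

// Illustrative only: stand-ins for the sync ordering described above,
// not the real MCO code.
package main

import "fmt"

type fakeClusterOperator struct {
	progressing     bool
	reportedVersion string
}

func syncProgressingStatus(co *fakeClusterOperator, target string) {
	// Correctly reports Progressing=True because we are not yet at target.
	co.progressing = co.reportedVersion != target
}

func syncVersionWithoutGuard(co *fakeClusterOperator, target string) {
	// Without a guard, the new version is written immediately, even though
	// nothing has actually finished rolling out yet.
	co.reportedVersion = target
}

func main() {
	co := &fakeClusterOperator{reportedVersion: "0.0.1-old"}
	syncProgressingStatus(co, "0.0.1-new")
	syncVersionWithoutGuard(co, "0.0.1-new")
	// Prints: progressing=true reportedVersion=0.0.1-new (the mismatch under discussion).
	fmt.Printf("progressing=%v reportedVersion=%s\n", co.progressing, co.reportedVersion)
}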

@wking
Member

wking commented Jun 13, 2019

@runcom
Member

runcom commented Jun 13, 2019

We surely have a faulty check in syncRequiredMachineConfigPools as well; I can't otherwise explain how it passes the check to report Available. I believe the issue lies in a race which I'm still trying to find.

@kikisdeliveryservice
Contributor Author

Test timed out, but poking around in the logs, all of the nodes were successfully updated. Going to retest:

/test e2e-aws-upgrade

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented Jun 14, 2019

So poking around in the logs, Progressing is now correct, since the upgrade did not finish and stopped at 90%:

{
            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2019-06-13T22:48:43Z",
                "generation": 1,
                "name": "machine-config",
                "resourceVersion": "35095",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/machine-config",
                "uid": "657b3a25-8e2d-11e9-9c98-12d94da3a8e8"
            },
            "spec": {},
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2019-06-13T22:49:22Z",
                        "message": "Cluster has deployed 0.0.1-2019-06-13-223247",
                        "status": "True",
                        "type": "Available"
                    },
                    {
                        "lastTransitionTime": "2019-06-13T23:19:51Z",
                        "message": "Working towards 0.0.1-2019-06-13-223247",
                        "status": "True",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2019-06-13T22:48:43Z",
                        "status": "False",
                        "type": "Degraded"
                    }
                ],
                "extension": {
                    "master": "all 3 nodes are at latest configuration rendered-master-b24a8176c20faebab011f3ed6e554d34",
                    "worker": "all 3 nodes are at latest configuration rendered-worker-18b7b8832dcd836e7d385662b1a4d7cf"
                },

But I'm really not sure about the Available message because it's misleading. Shouldn't the Available version be the current version, not the version it is working towards? The Progressing message tells us what we are progressing to, but the Available message should tell us where we are, no?

message := fmt.Sprintf("Cluster has deployed %s", optrVersion)

But we did not officially change the operator version in the operator logs yet (the test timed out before that happened), which is good:

I0613 23:29:49.332563       1 kubelet_config_controller.go:174] Starting MachineConfigController-KubeletConfigController
I0613 23:29:49.332540       1 container_runtime_config_controller.go:170] Starting MachineConfigController-ContainerRuntimeConfigController
I0613 23:29:54.238992       1 status.go:82] Pool worker: All nodes are updated with rendered-worker-18b7b8832dcd836e7d385662b1a4d7cf
I0613 23:30:07.396318       1 node_controller.go:220] Pool master: node ip-10-0-135-143.ec2.internal has completed update to rendered-master-b24a8176c20faebab011f3ed6e554d34
I0613 23:30:07.447398       1 node_controller.go:213] Pool master: node ip-10-0-135-143.ec2.internal is now reporting ready
I0613 23:30:12.396669       1 status.go:82] Pool master: All nodes are updated with rendered-master-b24a8176c20faebab011f3ed6e554d34

@kikisdeliveryservice
Contributor Author

/test e2e-aws-upgrade

@kikisdeliveryservice
Contributor Author

So this PR seems to solve the initial problem of the version being set before the upgrade is finished (see the operator logs; it no longer immediately updates the version), but I'm hitting another problem where the test keeps timing out. I rekicked e2e-aws-upgrade on another PR here: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/851/pull-ci-openshift-machine-config-operator-master-e2e-aws-upgrade/699; that test eventually passed but took 3 hours(???). So I really can't tell whether this PR is failing the upgrade e2e because of my change (even though the logs look OK) or because the test is just taking forever and timing out due to other things.

cc: @runcom @wking

@runcom
Member

runcom commented Jun 14, 2019

/retest

@runcom
Member

runcom commented Jun 14, 2019

I0614 04:03:02.618360       1 status.go:82] Pool master: All nodes are updated with rendered-master-624af8792aa560921625cd72e57628cf


I0614 04:03:03.005071       1 event.go:209] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"9437fbd5-8e53-11e9-82ab-12f47fb64940", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator version changed from [{operator 0.0.1-2019-06-14-030242}] to [{operator 0.0.1-2019-06-14-030413}]

MCC and MCO logs look sane

@runcom
Member

runcom commented Jun 14, 2019

OK, now I'm understanding more from the first job failure (see also @kikisdeliveryservice's comments at #855 (comment)).

The above happens because, when setting Available, we only check that we're not Degraded. We should also NOT report the new version if we're still Progressing, I guess.
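
A minimal sketch of that extra check, with stand-in parameters rather than the real syncAvailableStatus (which is shown in the diff in the next comment): while Progressing is True, keep advertising whatever version the clusteroperator already reports instead of the one we are working towards.

// Illustrative only: stand-in helper, not the actual syncAvailableStatus.
package status

import "fmt"

// availableMessage picks which version to advertise in the Available
// condition. While Progressing is True we stick with the version already
// reported on the clusteroperator object rather than the target version.
func availableMessage(progressing, degraded bool, reportedVersion, targetVersion string) string {
	version := targetVersion
	if progressing {
		version = reportedVersion
	}
	if degraded {
		return fmt.Sprintf("Cluster not available for %s", version)
	}
	return fmt.Sprintf("Cluster has deployed %s", version)
}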

@runcom
Member

runcom commented Jun 14, 2019

diff --git a/pkg/operator/status.go b/pkg/operator/status.go
index ec89f98e..05a2b71b 100644
--- a/pkg/operator/status.go
+++ b/pkg/operator/status.go
@@ -61,15 +61,14 @@ func (optr *Operator) syncAvailableStatus() error {
                return nil
        }
 
-       optrVersion, _ := optr.vStore.Get("operator")
        degraded := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded)
-       message := fmt.Sprintf("Cluster has deployed %s", optrVersion)
+       message := fmt.Sprintf("Cluster has deployed %s", co.Status.Versions)
 
        available := configv1.ConditionTrue
 
        if degraded {
                available = configv1.ConditionFalse
-               message = fmt.Sprintf("Cluster not available for %s", optrVersion)
+               message = fmt.Sprintf("Cluster not available for %s", co.Status.Versions)
        }
 
        coStatus := configv1.ClusterOperatorStatusCondition{

The above takes care of my last comment. We were wrongly picking up the new version and setting it without waiting for Progressing=False, which is what sets the new version in the clusteroperator object. This happens when we're still Progressing without being Degraded (see syncAvailableStatus).

@@ -29,7 +29,7 @@ func (optr *Operator) syncVersion() error {
        }
 
        // keep the old version and progressing if we fail progressing
-       if cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing) && cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded) {
+       if cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing) || cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded) {
Member

Debugging with Alex. This || makes it impossible to flip from Progressing=True to False, because we now never report the new version to the clusteroperator object.

Member

In other words, there's no way right now to flip from Progressing=True & old version to Progressing=False & new version reported here.
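
A sketch of the deadlock in stand-in form (not the real syncVersion): with ||, the guard that holds back the new version is active exactly while Progressing is True, but Progressing can only flip back to False after the new version has been reported, so neither side ever moves.

// Illustrative only: stand-ins for the two guards being compared in syncVersion.
package status

// keepOldVersionAnd mirrors the original guard: hold back the new version
// only when we are Degraded while Progressing ("if we fail progressing").
func keepOldVersionAnd(progressing, degraded bool) bool {
	return progressing && degraded
}

// keepOldVersionOr mirrors the reviewed change: hold back the new version
// whenever we are Progressing at all. Since Progressing only clears once the
// new version has been reported, this condition never releases: the
// "impossible to flip" situation described above.
func keepOldVersionOr(progressing, degraded bool) bool {
	return progressing || degraded
}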

@runcom
Member

runcom commented Jun 14, 2019

Ok, looking again at https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/23161/pull-ci-openshift-origin-release-4.1-e2e-aws-upgrade/51/artifacts/e2e-aws-upgrade/pods/

The Progressing logic is flawed; we're thinking about basing that status on the master pool's status as well, in addition to the version skew between the clusteroperator version and the MCO one.

@runcom
Member

runcom commented Jun 14, 2019

Ok, the race is:

  • 18:07:56: new Operator goes up
  • start syncing
  • syncRequiredMachineConfigPools passes because the master pool is idle but hasn't started updating to the new rendered MachineConfig
  • 18:09:28: new version reported
  • 18:10:06: the master pool starts rolling the new templates

syncRequiredMachineConfigPools needs a stronger check

The race is that the node_controller isn't updating desiredConfig in time.
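
A sketch of the kind of stronger check being suggested, with hypothetical field names (DesiredConfig, CurrentConfig, UpdatedMachines, and so on are stand-ins, not the real MachineConfigPool API): a pool should only count as done once it has actually been pointed at the new rendered config and every machine has caught up, not merely because it happens to be idle.

// Illustrative only: the struct and field names are hypothetical stand-ins
// for a MachineConfigPool's status, not the real API types.
package status

type poolView struct {
	Generation         int64  // bumped when the pool's spec changes
	ObservedGeneration int64  // last generation the controller acted on
	DesiredConfig      string // rendered MachineConfig the pool should move to
	CurrentConfig      string // rendered MachineConfig the pool is on now
	UpdatedMachines    int
	TotalMachines      int
}

// poolFullyUpdated is the stronger check: "idle" alone is not enough, because
// in the race above the pool looks idle simply because the node controller
// has not pushed the new desiredConfig yet.
func poolFullyUpdated(p poolView, wantConfig string) bool {
	return p.ObservedGeneration >= p.Generation &&
		p.DesiredConfig == wantConfig &&
		p.CurrentConfig == wantConfig &&
		p.UpdatedMachines == p.TotalMachines
}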

@cgwalters
Member

We really need the full chain to be complete to report progressing=false:

  • operator is updated
  • controller has rolled out and is idle (most importantly the template, render sub-controllers)
  • the master pool's spec configuration must equal its status configuration

Note that we don't need to wait for the node controller to be idle because that's covered by the last condition.
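
As a sketch, the three bullets above could collapse into a single predicate; the inputs here are hypothetical summaries of the chain, not actual operator state.

// Illustrative only: hypothetical booleans summarizing the chain above,
// not the actual MCO progressing logic.
package status

type rolloutState struct {
	operatorUpdated        bool // the operator itself is running the new version
	controllerRolledOut    bool // template/render sub-controllers are updated and idle
	masterSpecEqualsStatus bool // master pool's spec configuration == status configuration
}

// doneProgressing reports Progressing=False only when the whole chain is
// complete. The node controller is not checked separately because the last
// condition already implies it has finished.
func doneProgressing(s rolloutState) bool {
	return s.operatorUpdated && s.controllerRolledOut && s.masterSpecEqualsStatus
}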

@runcom
Member

runcom commented Jun 14, 2019

What's happening is that the Generation of a pool isn't updated (it's always at 1) because we never update its spec; we only update its status.configuration in 4.1. Colin's patch should fix that.
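
A short sketch of why a generation-based comparison tells you nothing in that case (field names are stand-ins): if only status.configuration ever changes, metadata.generation never moves past 1, so "observed generation has caught up" is trivially true whether or not the pool has rolled out anything.

// Illustrative only: stand-in fields, not the real MachineConfigPool metadata.
package status

type poolMeta struct {
	Generation         int64 // only bumped when spec changes; stuck at 1 here
	ObservedGeneration int64
}

// generationCaughtUp is trivially satisfied when the spec is never touched,
// so it cannot distinguish an upgraded pool from one that has not started.
func generationCaughtUp(p poolMeta) bool {
	return p.ObservedGeneration >= p.Generation
}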

@runcom
Member

runcom commented Jun 14, 2019

We really need the full chain to be complete to report progressing=false:

The problem isn't there, though. That code runs fine (though it may be flawed anyway); the failure is in checking the correct generation of the MCP we're comparing against in syncRequiredMachineConfigPools, and I'm going to check that we don't have such an issue in master because of #773.

@kikisdeliveryservice
Contributor Author

/skip

@openshift-ci-robot
Contributor

@kikisdeliveryservice: The following tests failed, say /retest to rerun them all:

Test name                   Commit   Rerun command
ci/prow/e2e-aws-disruptive  14bb514  /test e2e-aws-disruptive
ci/prow/e2e-gcp-op          14bb514  /test e2e-gcp-op
ci/prow/e2e-aws-upgrade     14bb514  /test e2e-aws-upgrade
ci/prow/e2e-gcp-upgrade     14bb514  /test e2e-gcp-upgrade
ci/prow/e2e-vsphere         14bb514  /test e2e-vsphere

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kikisdeliveryservice kikisdeliveryservice added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 3, 2019
@kikisdeliveryservice
Contributor Author

In an effort to clean up the MCO repo, closing old open PRs with no recent activity.

Feel free to reopen.
