Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

operator: fix status updates on ClusterOperator #351

Merged
merged 3 commits into from
Feb 1, 2019

Conversation

abhinavdahiya
Copy link
Contributor

  • ClusterOperator now reports list of versions, each for an operand and the operator itself.
  • The versions reported for only available conditions, as per defined behavior in [1]
  • Failing and Progressing do not change last Available, as per defined behavor in [2]

[1] https://github.com/openshift/api/blob/677153041bd22c1aa01fef67b68a8502ab1333d5/config/v1/types_cluster_operator.go#L59-L61
[2] https://github.com/openshift/cluster-version-operator/blob/4c2d5e0e3662f9643ed4f0992496539ef6dfdaac/docs/dev/clusteroperator.md#conditions

/cc @kikisdeliveryservice @cgwalters

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 29, 2019
@kikisdeliveryservice
Copy link
Contributor

@abhinavdahiya Thanks for this! This is for #346 right?

I will pull and test tomorrow 😄

Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled this and am testing but wanted to clear up a few questions before it merges:

  1. Was the bug seen in MCO Cluster Operator statuses frequently changing #346 was because the Available status was tracking the Progressing status so that when progressing status was True Available went False?

  2. Are the MCC/MCD/MCO operands of the MCO clusteroperator?

  3. The status will be available so long as the "operand is functional" is this the MCO or the MCO+MCC+MCD? And does this mean that it is in the version it is supposed to be in (not considering any upgrades that may or may not be progressing)?

  4. I notice that Progressing alternates between True and False - and the status says: Running resync for 3.11.0-521-g377bde3b so the MCO resyncs every ~15 seconds? And we will expect Progressing to change all the time regardless of whether an upgrade happened or not?

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only skimmed this...still reading and digesting the CVO docs. Some superficial comments.

type versionStore struct {
*sync.Mutex

versions map[string]string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note, in Rust this seems like it'd just be type Versions = Arc<Mutex<HashMap<String, String>>>. (Although if I were making an operator I'm not sure it'd be multi-threaded to begin with but that's a larger topic)

SetClusterOperatorStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue, LastTransitionTime: now})
cov1helpers.SetStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{
Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue,
Message: fmt.Sprintf("Cluster is available at %s", optrVersion),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Cluster/MCO/? (or s/Cluster/Operator/, or...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MCO is reporting that its view (in terms of ownership) of the cluster is at that version.

Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- lastTransitionTime: 2019-01-29T20:53:23Z
    message: Cluster is available at 3.11.0-521-g377bde3b
    status: "True"
    type: Available
  - lastTransitionTime: 2019-01-30T20:56:32Z
    message: Running resync for 3.11.0-521-g377bde3b
    status: "True"
    type: Progressing
  - lastTransitionTime: 2019-01-30T06:21:34Z
    status: "False"
    type: Failing

@cgwalters it ends up reading this way in the yaml ^^

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds fine to me!

pkg/operator/status.go Outdated Show resolved Hide resolved
pkg/operator/status.go Outdated Show resolved Hide resolved
@cgwalters
Copy link
Member

Was the bug seen in #346 was because the Available status was tracking the Progressing status so that when progressing status was True Available went False?

Yep, I believe that's the core issue.

Are the MCC/MCD/MCO operands of the MCO clusteroperator?

Yes, as I understand it.

I notice that Progressing alternates between True and False - and the status says: Running resync for 3.11.0-521-g377bde3b so the MCO resyncs every ~15 seconds?

Why it changes every 15 seconds is an interesting question...I think we'll try to resynchronize on any status change from a lot of objects, including e.g. the MCD daemonset. It feels like we should be doing more "diff detection" here or so, there are presumably good patterns for this but I am not familiar enough with the ecosystem yet.

- ClusterOperator now reports list of versions, each for an operand and the operator itself.
- The versions reported for only available conditions, as per defined behavior in [1]
- Failing and Progressing do not change last Available, as per defined behavor in [2]

[1] https://github.com/openshift/api/blob/677153041bd22c1aa01fef67b68a8502ab1333d5/config/v1/types_cluster_operator.go#L59-L61
[2] https://github.com/openshift/cluster-version-operator/blob/4c2d5e0e3662f9643ed4f0992496539ef6dfdaac/docs/dev/clusteroperator.md#conditions
@abhinavdahiya
Copy link
Contributor Author

Why it changes every 15 seconds is an interesting question...I think we'll try to resynchronize on any status change from a lot of objects, including e.g. the MCD daemonset. It feels like we should be doing more "diff detection" here or so, there are presumably good patterns for this but I am not familiar enough with the ecosystem yet.

based on

mcoconfigInformer.Informer().AddEventHandler(optr.eventHandler())
controllerConfigInformer.Informer().AddEventHandler(optr.eventHandler())
serviceAccountInfomer.Informer().AddEventHandler(optr.eventHandler())
crdInformer.Informer().AddEventHandler(optr.eventHandler())
deployInformer.Informer().AddEventHandler(optr.eventHandler())
daemonsetInformer.Informer().AddEventHandler(optr.eventHandler())
clusterRoleInformer.Informer().AddEventHandler(optr.eventHandler())
clusterRoleBindingInformer.Informer().AddEventHandler(optr.eventHandler())
cmInformer.Informer().AddEventHandler(optr.eventHandler())

it is reacting to changes to these objects.

@kikisdeliveryservice
Copy link
Contributor

@abhinavdahiya just to clarify: the MCO/Progressing status is reacting to changes in any Informers in lines 124-132 that you linked above?

@kikisdeliveryservice
Copy link
Contributor

Since there were some changes, I'm going to repull this and test again.

@abhinavdahiya
Copy link
Contributor Author

/retest

Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version column values seem to be missing?:

$ oc get clusteroperators machine-config-operator
NAME                      VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config-operator             True        False          False     12m

but in oc get clusteroperator machine-config-operator -o yaml -w :

  - name: operator
    version: 3.11.0-521-gcf207959

@abhinavdahiya
Copy link
Contributor Author

The version column values seem to be missing?:

$ oc get clusteroperators machine-config-operator
NAME                      VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
machine-config-operator             True        False          False     12m

but in oc get clusteroperator machine-config-operator -o yaml -w :

  - name: operator
    version: 3.11.0-521-gcf207959

We do not control the definition CRD of cluster operator. CVO does. MCO only creates the CR.

Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this a few times and no longer see Available changing every few seconds as described in #346 . Progressing does change back and forth depending on the state of mco informers are doing, which is expected, but they no longer have an effect on Available, for ex:

machine-config-operator             True      False     False     15h
machine-config-operator             True      True      False     15h

I'll leave this to @cgwalters to LGTM.

@cgwalters
Copy link
Member

/lgtm

@abhinavdahiya
Copy link
Contributor Author

/lgtm

hmm, looks like effect of the Prow service degradation.

@cgwalters
Copy link
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, cgwalters, kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,cgwalters]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abhinavdahiya
Copy link
Contributor Author

/retest

2 similar comments
@abhinavdahiya
Copy link
Contributor Author

/retest

@abhinavdahiya
Copy link
Contributor Author

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit f31d48f into openshift:master Feb 1, 2019
osherdp pushed a commit to osherdp/machine-config-operator that referenced this pull request Apr 13, 2021
Bug 1914723: SamplesTBRInaccessibleOnBoot Alert has a misspelling
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants