
MCO: clear out failing status on success and add tests #442

Merged: 4 commits, Feb 17, 2019

Conversation

runcom (Member) commented Feb 16, 2019

- What I did

Fix a bug reported in #406 (comment)
where we weren't properly resetting the failing status.

Along with that, the second commit slightly refactors the operator to allow proper unit regression testing.
Regression tests were added for the reported bug, as well as tests for the other conditions.

- How to verify it

Run the unit tests + a custom payload on a cluster, somehow make something fail, and watch the failing status clearing.

- Description for the changelog

@abhinavdahiya could you please review this?

@openshift-ci-robot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 16, 2019
@openshift-ci-robot added the "size/L" label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Feb 16, 2019
@runcom changed the title from "MCO: clear out failing status on success" to "MCO: clear out failing status on success and add a regression test" on Feb 16, 2019
return c.innerConfigClient.ConfigV1().ClusterOperators().Get(name, options)
}

type ClusterOperatorsClientInterface interface {
runcom (Member Author):

This is important for us to be able to control what we use in the MCO itself and allow us to create meaningful mocks for proper unit tests.
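For context, a minimal sketch of the pattern being described, not the PR's exact code: the mock type name is illustrative, while the Get signature and the coc.co receiver/field mirror the hunks shown in this review.

package operator

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClusterOperatorsClientInterface narrows the typed client to what the MCO
// needs, so unit tests can swap in an in-memory implementation.
type ClusterOperatorsClientInterface interface {
	Get(name string, options metav1.GetOptions) (*configv1.ClusterOperator, error)
}

// mockClusterOperatorsClient is a hypothetical test double: it hands back a
// canned ClusterOperator instead of talking to the API server.
type mockClusterOperatorsClient struct {
	co *configv1.ClusterOperator
}

func (coc *mockClusterOperatorsClient) Get(name string, options metav1.GetOptions) (*configv1.ClusterOperator, error) {
	return coc.co, nil
}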

return coc.co, nil
}

func TestOperatorFailingStatusClearsOut(t *testing.T) {
runcom (Member Author):

this test fails correctly w/o the patch that fixes the bug

runcom (Member Author):

I'll also add tests for the other conditions to make sure we never regress on this again...

runcom (Member Author) commented Feb 16, 2019

The unit test failure is #417

/retest

cgwalters (Member) commented:

OK yeah, I was looking at this code recently to add some logging and noticed that we never seemed to set failing to false.

/lgtm

@openshift-ci-robot added the "lgtm" label (Indicates that a PR is ready to be merged.) on Feb 16, 2019
runcom (Member Author) commented Feb 16, 2019

I'm trying to cover the other cases in another PR

@runcom mentioned this pull request on Feb 16, 2019
runcom (Member Author) commented Feb 16, 2019

the MCO cluster operator looks fine though

openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot removed the "lgtm" label (Indicates that a PR is ready to be merged.) on Feb 16, 2019
runcom (Member Author) commented Feb 16, 2019

rebased to solve merge conflicts

runcom (Member Author) commented Feb 16, 2019

Unit test flake tracked in #417

/retest

runcom (Member Author) commented Feb 16, 2019

I'm debugging the latest failure in e2e-aws-op where the machine-config-operator didn't even come up, along with many other pods missing

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/442/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/583

/retest

runcom (Member Author) commented Feb 16, 2019

> I'm debugging the latest failure in e2e-aws-op where the machine-config-operator didn't even come up, along with many other pods missing
> https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/442/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/583

oh, looks like the worker nodes weren't all available https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/442/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/583/artifacts/e2e-aws-op/nodes.json (3/6), not sure how to further debug to understand why 🤔

runcom (Member Author) commented Feb 16, 2019

it looks like the failure @cgwalters reported on Slack (https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/426/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/561), where basically only the masters come up (but with a different failure mode)

@openshift-ci-robot added the "size/XXL" label (Denotes a PR that changes 1000+ lines, ignoring generated files.) and removed the "size/L" label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Feb 16, 2019
runcom (Member Author) commented Feb 17, 2019

Haproxy flake, rest is cool

/retest

runcom (Member Author) commented Feb 17, 2019

green again, retesting

@runcom force-pushed the clear-failing branch 2 times, most recently from d0e2a1a to 987d5de on February 17, 2019 14:32
runcom (Member Author) commented Feb 17, 2019

All green again

Since the last 8 e2e-aws and e2e-aws-op runs (so from when I pushed the final revision of this PR):

  • e2e-aws: 1/8 failure (Haproxy flake)
  • e2e-aws-op: 0/8 failure

Fix a bug reported in openshift#406 (comment)
where we weren't properly resetting the failing status.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
Specifically, add a test to verify that we're clearing out the failing
condition on subsequent sync operations (covering a bug reported here
openshift#406 (comment))

Signed-off-by: Antonio Murdaca <runcom@linux.com>
Signed-off-by: Antonio Murdaca <runcom@linux.com>
Signed-off-by: Antonio Murdaca <runcom@linux.com>
runcom (Member Author) commented Feb 17, 2019

one more time...

runcom (Member Author) commented Feb 17, 2019

/kind bug

@openshift-ci-robot added the "kind/bug" label (Categorizes issue or PR as related to a bug.) on Feb 17, 2019
runcom (Member Author) commented Feb 17, 2019

Haproxy flake again

/retest

failing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorFailing)
message := fmt.Sprintf("Cluster has deployed %s", optrVersion)

available := configv1.ConditionTrue

if failing && !progressing {
Member:

Hm, so failing now always means !available? OK, that looks like what the CVO does as well I think.

runcom (Member Author) commented Feb 17, 2019:

my understanding for the MCO is that if the sync fails while we're progressing, we could have e.g. already applied a new MCO component or something else that can misbehave once we've failed, and thus there's no reason to report available. @abhinavdahiya what do you think?

runcom (Member Author) commented Feb 17, 2019:

so, when looking at https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions, there's an example:

If an error blocks reaching 4.0.1, the conditions might be:

- Failing is true with a detailed message Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.
- Available is true with message Cluster has deployed 4.0.0
- Progressing is true with message Unable to apply 4.0.1: a required object is missing

I'm not sure how to read that: if any of the syncFuncs here fails while progressing, the status of the MCO isn't really available, because we may have e.g. started rolling out a new MCC while the MCD broke, and it makes little sense to report Available.

runcom (Member Author):

actually, maybe that !progressing check is still right, because the bug was really that we weren't clearing failing on a successful sync
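For reference, a minimal self-contained sketch of that clearing behaviour, illustrative rather than the PR's exact code: the cov1helpers import path and the SetStatusCondition helper are assumed to come from library-go's clusteroperator v1helpers, and OperatorFailing is the condition type used by the API version this PR targets.

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	cov1helpers "github.com/openshift/library-go/pkg/config/clusteroperator/v1helpers"
)

// clearFailingOnSuccess writes Failing=false back on a successful sync; before
// this PR only the true value was ever set, so a stale Failing=true stuck around.
func clearFailingOnSuccess(co *configv1.ClusterOperator) {
	cov1helpers.SetStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{
		Type:   configv1.OperatorFailing,
		Status: configv1.ConditionFalse,
	})
}

func main() {
	co := &configv1.ClusterOperator{}
	// simulate a previous failed sync
	cov1helpers.SetStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{
		Type:    configv1.OperatorFailing,
		Status:  configv1.ConditionTrue,
		Message: "sync error",
	})
	clearFailingOnSuccess(co)
	// prints false: the stale Failing condition has been cleared
	fmt.Println(cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorFailing))
}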

runcom (Member Author):

This patch accounts for that case though, and the changed test accounts for my bad assumption (all the other tests still pass with this):

diff --git a/pkg/operator/status.go b/pkg/operator/status.go
index b71370e..3ce2262 100644
--- a/pkg/operator/status.go
+++ b/pkg/operator/status.go
@@ -29,11 +29,12 @@ func (optr *Operator) syncAvailableStatus() error {
 
 	optrVersion, _ := optr.vStore.Get("operator")
 	failing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorFailing)
+	progressing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing)
 	message := fmt.Sprintf("Cluster has deployed %s", optrVersion)
 
 	available := configv1.ConditionTrue
 
-	if failing {
+	if (failing && !progressing) || (failing && optr.inClusterBringup) {
 		available = configv1.ConditionFalse
 		message = fmt.Sprintf("Cluster not available for %s", optrVersion)
 	}
diff --git a/pkg/operator/status_test.go b/pkg/operator/status_test.go
index 1437769..9b24c97 100644
--- a/pkg/operator/status_test.go
+++ b/pkg/operator/status_test.go
@@ -350,8 +350,7 @@ func TestOperatorSyncStatus(t *testing.T) {
 				},
 			},
 		},
-		// 3. test that if progressing fails, we report available=false because state of the operator
-		//    might have changed in the various sync calls
+		// 3. test that if progressing fails, we report available=true for the current version
 		{
 			syncs: []syncCase{
 				{
@@ -390,7 +389,7 @@ func TestOperatorSyncStatus(t *testing.T) {
 						},
 						{
 							Type:   configv1.OperatorAvailable,
-							Status: configv1.ConditionFalse,
+							Status: configv1.ConditionTrue,
 						},
 						{
 							Type:   configv1.OperatorFailing,
@@ -405,6 +404,29 @@ func TestOperatorSyncStatus(t *testing.T) {
 						},
 					},
 				},
+				{
+					// we mock the fact that we are at operator=test-version-2 after the previous sync
+					cond: []configv1.ClusterOperatorStatusCondition{
+						{
+							Type:   configv1.OperatorProgressing,
+							Status: configv1.ConditionFalse,
+						},
+						{
+							Type:   configv1.OperatorAvailable,
+							Status: configv1.ConditionTrue,
+						},
+						{
+							Type:   configv1.OperatorFailing,
+							Status: configv1.ConditionFalse,
+						},
+					},
+					syncFuncs: []syncFunc{
+						{
+							name: "fn1",
+							fn:   func(config renderConfig) error { return nil },
+						},
+					},
+				},
 			},
 		},
 		// 4. test that if progressing fails during bringup, we still report failing and not available
@@ -601,4 +623,4 @@ func TestInClusterBringUpStayOnErr(t *testing.T) {
 	assert.Nil(t, err, "expected syncAll to pass")
 
 	assert.False(t, optr.inClusterBringup)
-}
\ No newline at end of file
+}

runcom (Member Author):

the patch above effectively lets the MCO report available=true, progressing=true, failing=true if a sync fails while progressing but the MCO is still available (assumption from the CVO doc)
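Illustratively, that condition combination as a literal (a hypothetical sketch, not code from this PR):

package operator

import configv1 "github.com/openshift/api/config/v1"

// a sync failed while a rollout was in flight, but the previously deployed
// version is still serving, so the operator stays Available
var failedMidRollout = []configv1.ClusterOperatorStatusCondition{
	{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue},
	{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue},
	{Type: configv1.OperatorFailing, Status: configv1.ConditionTrue},
}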

runcom (Member Author):

opened #450 to further discuss this.

cgwalters (Member) commented:

/lgtm

@openshift-ci-robot added the "lgtm" label (Indicates that a PR is ready to be merged.) on Feb 17, 2019
openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

runcom (Member Author) commented Feb 17, 2019

Flaky tests:

[Conformance][Area:Networking][Feature:Router] The default ClusterIngress should support default wildcard reencrypt routes through external DNS [Suite:openshift/conformance/parallel/minimal]

Failing tests:

[Conformance][Area:Networking][Feature:Router] The HAProxy router should enable openshift-monitoring to pull metrics [Suite:openshift/conformance/parallel/minimal]

HAProxy flake failing + ClusterIngress already marked as a flake

/retest

runcom (Member Author) commented Feb 17, 2019

unit flake #417

/retest

runcom (Member Author) commented Feb 17, 2019

e2e-aws Haproxy flake again

/retest

Labels: approved, kind/bug, lgtm, size/XXL