
MCO: clear out failing status on success and add tests #442

Merged: 4 commits, Feb 17, 2019

Conversation

runcom (Member) commented Feb 16, 2019

- What I did

Fix a bug reported in #406 (comment)
where we weren't properly resetting the failing status.

Along with that, the second commit slightly refactors the operator to allow proper unit regression testing.
Regression tests were added for the reported bug, as well as tests for the other conditions.

- How to verify it

Run the unit tests + a custom payload on a cluster, somehow make something fail, and watch the failing status clearing.

- Description for the changelog

@abhinavdahiya could you please review this?

@openshift-ci-robot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 16, 2019
@openshift-ci-robot added the "size/L" label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Feb 16, 2019
@runcom changed the title from "MCO: clear out failing status on success" to "MCO: clear out failing status on success and add a regression test" on Feb 16, 2019
return c.innerConfigClient.ConfigV1().ClusterOperators().Get(name, options)
}

type ClusterOperatorsClientInterface interface {
runcom (Member Author):

This is important for us to be able to control what we use in the MCO itself and allow us to create meaningful mocks for proper unit tests.
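For context, a minimal sketch of the pattern being described, not the PR's exact code: the mock type name is illustrative, while the Get signature and the coc.co receiver/field mirror the hunks shown in this review.

package operator

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClusterOperatorsClientInterface narrows the typed client to what the MCO
// needs, so unit tests can swap in an in-memory implementation.
type ClusterOperatorsClientInterface interface {
	Get(name string, options metav1.GetOptions) (*configv1.ClusterOperator, error)
}

// mockClusterOperatorsClient is a hypothetical test double: it hands back a
// canned ClusterOperator instead of talking to the API server.
type mockClusterOperatorsClient struct {
	co *configv1.ClusterOperator
}

func (coc *mockClusterOperatorsClient) Get(name string, options metav1.GetOptions) (*configv1.ClusterOperator, error) {
	return coc.co, nil
}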

return coc.co, nil
}

func TestOperatorFailingStatusClearsOut(t *testing.T) {
runcom (Member Author):

this test fails correctly w/o the patch that fixes the bug

runcom (Member Author):

I'll also add tests for the other conditions to make sure we never regress on this again...

runcom (Member Author) commented Feb 16, 2019

The unit test failure is #417

/retest

cgwalters (Member) commented:

OK yeah, I was looking at this code recently to add some logging and noticed that we never seemed to set failing to false.

/lgtm

@openshift-ci-robot added the "lgtm" label (Indicates that a PR is ready to be merged.) on Feb 16, 2019
runcom (Member Author) commented Feb 16, 2019

I'm trying to cover the other cases in another PR

@runcom mentioned this pull request on Feb 16, 2019
runcom (Member Author) commented Feb 16, 2019

the MCO cluster operator looks fine though

openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot removed the "lgtm" label (Indicates that a PR is ready to be merged.) on Feb 16, 2019
runcom (Member Author) commented Feb 16, 2019

rebased to solve merge conflicts

runcom (Member Author) commented Feb 16, 2019

Unit test flake tracked in #417

/retest

runcom (Member Author) commented Feb 16, 2019

I'm debugging the latest failure in e2e-aws-op where the machine-config-operator didn't even come up, along with many other pods missing

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/442/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/583

/retest

runcom (Member Author) commented Feb 16, 2019

> I'm debugging the latest failure in e2e-aws-op where the machine-config-operator didn't even come up, along with many other pods missing
> https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/442/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/583

oh, looks like the worker nodes weren't all available https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/442/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/583/artifacts/e2e-aws-op/nodes.json (3/6), not sure how to further debug to understand why 🤔

runcom (Member Author) commented Feb 16, 2019

it looks like the failure @cgwalters reported on Slack (https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/426/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/561), where basically only the masters come up (but with a different failure mode)

@openshift-ci-robot added the "size/XXL" label (Denotes a PR that changes 1000+ lines, ignoring generated files.) and removed the "size/L" label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Feb 16, 2019
runcom (Member Author) commented Feb 17, 2019

Haproxy flake, rest is cool

/retest

runcom (Member Author) commented Feb 17, 2019

green again, retesting

@runcom force-pushed the clear-failing branch 2 times, most recently from d0e2a1a to 987d5de on February 17, 2019 14:32
runcom (Member Author) commented Feb 17, 2019

All green again

Since the last 8 e2e-aws and e2e-aws-op runs (so from when I pushed the final revision of this PR):

  • e2e-aws: 1/8 failure (Haproxy flake)
  • e2e-aws-op: 0/8 failure

Fix a bug reported in openshift#406 (comment)
where we weren't properly resetting the failing status.

Signed-off-by: Antonio Murdaca <runcom@linux.com>
Specifically, add a test to verify that we're clearing out the failing
condition on subsequent sync operations (covering a bug reported here
openshift#406 (comment))

Signed-off-by: Antonio Murdaca <runcom@linux.com>
Signed-off-by: Antonio Murdaca <runcom@linux.com>
Signed-off-by: Antonio Murdaca <runcom@linux.com>
runcom (Member Author) commented Feb 17, 2019

one more time...

runcom (Member Author) commented Feb 17, 2019

/kind bug

@openshift-ci-robot added the "kind/bug" label (Categorizes issue or PR as related to a bug.) on Feb 17, 2019
runcom (Member Author) commented Feb 17, 2019

Haproxy flake again

/retest

failing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorFailing)
message := fmt.Sprintf("Cluster has deployed %s", optrVersion)

available := configv1.ConditionTrue

if failing && !progressing {
Member:

Hm, so failing now always means !available? OK, that looks like what the CVO does as well I think.

runcom (Member Author) commented Feb 17, 2019:

my understanding for the MCO is that if the sync fails while we're progressing, we could have e.g. already applied a new MCO component or something else that can misbehave once we've failed, and thus there's no reason to report available. @abhinavdahiya what do you think?

runcom (Member Author) commented Feb 17, 2019:

so, when looking at https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions, there's an example:

If an error blocks reaching 4.0.1, the conditions might be:

- Failing is true with a detailed message Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.
- Available is true with message Cluster has deployed 4.0.0
- Progressing is true with message Unable to apply 4.0.1: a required object is missing

I'm not sure how to read that: if any of the syncFuncs here fails while progressing, the status of the MCO isn't really available, because we may have e.g. started rolling out a new MCC while the MCD broke, and it makes little sense to report Available.

runcom (Member Author):

actually, maybe that !progressing check is still right, because the bug was really that we weren't clearing failing on a successful sync
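For reference, a minimal self-contained sketch of that clearing behaviour, illustrative rather than the PR's exact code: the cov1helpers import path and the SetStatusCondition helper are assumed to come from library-go's clusteroperator v1helpers, and OperatorFailing is the condition type used by the API version this PR targets.

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	cov1helpers "github.com/openshift/library-go/pkg/config/clusteroperator/v1helpers"
)

// clearFailingOnSuccess writes Failing=false back on a successful sync; before
// this PR only the true value was ever set, so a stale Failing=true stuck around.
func clearFailingOnSuccess(co *configv1.ClusterOperator) {
	cov1helpers.SetStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{
		Type:   configv1.OperatorFailing,
		Status: configv1.ConditionFalse,
	})
}

func main() {
	co := &configv1.ClusterOperator{}
	// simulate a previous failed sync
	cov1helpers.SetStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{
		Type:    configv1.OperatorFailing,
		Status:  configv1.ConditionTrue,
		Message: "sync error",
	})
	clearFailingOnSuccess(co)
	// prints false: the stale Failing condition has been cleared
	fmt.Println(cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorFailing))
}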

runcom (Member Author):

This patch accounts for that case though, and the changed test accounts for my bad assumption (all the other tests still pass with this):

diff --git a/pkg/operator/status.go b/pkg/operator/status.go
index b71370e..3ce2262 100644
--- a/pkg/operator/status.go
+++ b/pkg/operator/status.go
@@ -29,11 +29,12 @@ func (optr *Operator) syncAvailableStatus() error {
 
 	optrVersion, _ := optr.vStore.Get("operator")
 	failing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorFailing)
+	progressing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing)
 	message := fmt.Sprintf("Cluster has deployed %s", optrVersion)
 
 	available := configv1.ConditionTrue
 
-	if failing {
+	if (failing && !progressing) || (failing && optr.inClusterBringup) {
 		available = configv1.ConditionFalse
 		message = fmt.Sprintf("Cluster not available for %s", optrVersion)
 	}
diff --git a/pkg/operator/status_test.go b/pkg/operator/status_test.go
index 1437769..9b24c97 100644
--- a/pkg/operator/status_test.go
+++ b/pkg/operator/status_test.go
@@ -350,8 +350,7 @@ func TestOperatorSyncStatus(t *testing.T) {
 				},
 			},
 		},
-		// 3. test that if progressing fails, we report available=false because state of the operator
-		//    might have changed in the various sync calls
+		// 3. test that if progressing fails, we report available=true for the current version
 		{
 			syncs: []syncCase{
 				{
@@ -390,7 +389,7 @@ func TestOperatorSyncStatus(t *testing.T) {
 						},
 						{
 							Type:   configv1.OperatorAvailable,
-							Status: configv1.ConditionFalse,
+							Status: configv1.ConditionTrue,
 						},
 						{
 							Type:   configv1.OperatorFailing,
@@ -405,6 +404,29 @@ func TestOperatorSyncStatus(t *testing.T) {
 						},
 					},
 				},
+				{
+					// we mock the fact that we are at operator=test-version-2 after the previous sync
+					cond: []configv1.ClusterOperatorStatusCondition{
+						{
+							Type:   configv1.OperatorProgressing,
+							Status: configv1.ConditionFalse,
+						},
+						{
+							Type:   configv1.OperatorAvailable,
+							Status: configv1.ConditionTrue,
+						},
+						{
+							Type:   configv1.OperatorFailing,
+							Status: configv1.ConditionFalse,
+						},
+					},
+					syncFuncs: []syncFunc{
+						{
+							name: "fn1",
+							fn:   func(config renderConfig) error { return nil },
+						},
+					},
+				},
 			},
 		},
 		// 4. test that if progressing fails during bringup, we still report failing and not available
@@ -601,4 +623,4 @@ func TestInClusterBringUpStayOnErr(t *testing.T) {
 	assert.Nil(t, err, "expected syncAll to pass")
 
 	assert.False(t, optr.inClusterBringup)
-}
\ No newline at end of file
+}

runcom (Member Author):

the patch above effectively lets the MCO report available=true, progressing=true, failing=true if a sync fails while progressing but the MCO is still available (assumption from the CVO doc)
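Illustratively, that condition combination as a literal (a hypothetical sketch, not code from this PR):

package operator

import configv1 "github.com/openshift/api/config/v1"

// a sync failed while a rollout was in flight, but the previously deployed
// version is still serving, so the operator stays Available
var failedMidRollout = []configv1.ClusterOperatorStatusCondition{
	{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue},
	{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue},
	{Type: configv1.OperatorFailing, Status: configv1.ConditionTrue},
}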

runcom (Member Author):

opened #450 to further discuss this.

cgwalters (Member) commented:

/lgtm

@openshift-ci-robot added the "lgtm" label (Indicates that a PR is ready to be merged.) on Feb 17, 2019
openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

runcom (Member Author) commented Feb 17, 2019

Flaky tests:

[Conformance][Area:Networking][Feature:Router] The default ClusterIngress should support default wildcard reencrypt routes through external DNS [Suite:openshift/conformance/parallel/minimal]

Failing tests:

[Conformance][Area:Networking][Feature:Router] The HAProxy router should enable openshift-monitoring to pull metrics [Suite:openshift/conformance/parallel/minimal]

HAProxy flake failing + ClusterIngress already marked as a flake

/retest

runcom (Member Author) commented Feb 17, 2019

unit flake #417

/retest

runcom (Member Author) commented Feb 17, 2019

e2e-aws Haproxy flake again

/retest

Labels: approved, kind/bug, lgtm, size/XXL