Fix update retry logic in TestConfigurableRoute* #689

Miciah · 2021-12-14T23:37:48Z

Fix intermittent failures in the TestConfigurableRouteRBAC, TestConfigurableRouteNoSecretNoRBAC, and TestConfigurableRouteNoConsumingUserNoRBAC tests that were caused by incorrect update retry logic.

These tests use the eventuallyUpdateIngressSpec and eventuallyUpdateIngressStatus helper functions to update the cluster ingress config. Before this change, these helper functions attempted to update the ingress config using a polling loop that gets the current config, assigns a value provided as an argument to the helper function to a field in the config value from the API, and then updates the config in the API using the modified config value. The argument value is a complex data type, and so the assignment causes the argument value and the modified config value to reference the same memory. If the update failed, the next iteration of the polling loop would do another get from the API and overwrite the argument value. As a result, the assignment and update in the second and subsequent iterations of the polling loop would use the value from the API, not the value originally provided in the argument.

This change replaces the assignment in the polling loop with a deep copy so that the original argument value is not overwritten by the get. As a result, the test should be more resilient when updates fail.

Follow-up to #552. @awgreene, FYI.

test/e2e/configurable_route_test.go (eventuallyUpdateIngressSpec, eventuallyUpdateIngressStatus): Use DeepCopyInto instead of an assignment.

To reproduce the error, I changed eventuallyUpdateIngressStatus to simulate an update failure. I changed the following:

func eventuallyUpdateIngressStatus(t *testing.T, ingressStatus configv1.IngressStatus) error {
	t.Helper()
	ingress := &configv1.Ingress{}
	return wait.PollImmediate(1*time.Second, timeout, func() (bool, error) {
		if err := kclient.Get(context.TODO(), types.NamespacedName{Namespace: "", Name: "cluster"}, ingress); err != nil {
			t.Logf("error getting ingress: %v", err)
			return false, nil
		}

		ingress.Status = ingressStatus

		if err := kclient.Status().Update(context.TODO(), ingress); err != nil {
			t.Logf("error updating ingress.status: %v", err)
			return false, nil
		}

		return true, nil
	})
}

...to the following:

var haveTriedUpdate = false
func eventuallyUpdateIngressStatus(t *testing.T, ingressStatus configv1.IngressStatus) error {
	t.Helper()
	ingress := &configv1.Ingress{}
	return wait.PollImmediate(1*time.Second, timeout, func() (bool, error) {
		if err := kclient.Get(context.TODO(), types.NamespacedName{Namespace: "", Name: "cluster"}, ingress); err != nil {
			t.Logf("error getting ingress: %v", err)
			return false, nil
		}

		ingress.Status = ingressStatus

		// Simulate an error on first update.
		if !haveTriedUpdate {
			haveTriedUpdate = true
			t.Logf("error updating ingress.status: %v", "bogus error")
			return false, nil
		}

		if err := kclient.Status().Update(context.TODO(), ingress); err != nil {
			t.Logf("error updating ingress.status: %v", err)
			return false, nil
		}

		return true, nil
	})
}

That is, after making the assignment to ingress.Status, pretend to get an error so the loop starts over with another Get. When I did this, I saw the following:

% make test-e2e TEST='TestConfigurableRouteNoConsumingUserNoRBAC'
GO111MODULE=on GOFLAGS=-mod=vendor go test -timeout 1h -count 1 -v -tags e2e -run "TestConfigurableRouteNoConsumingUserNoRBAC" ./test/e2e
I1214 18:04:56.864670  259306 request.go:665] Waited for 1.049124231s due to client-side throttling, not priority and fairness, request: GET:https://api.ci-ln-7d1q3xk-72292.origin-ci-int-gce.dev.rhcloud.com:6443/apis/apps/v1?timeout=32s
=== RUN   TestConfigurableRouteNoConsumingUserNoRBAC
    configurable_route_test.go:423: error updating ingress.status: bogus error
    configurable_route_test.go:451: unexpected number of items exist: expected (1), actual (0)
[the above line repeated 61 more times]
    configurable_route_test.go:351: Number of roles in list != 1
    configurable_route_test.go:451: unexpected number of items exist: expected (1), actual (0)
 [the above line repeated 61 more times]
   configurable_route_test.go:356: Number of roleBindings in list != 1
--- FAIL: TestConfigurableRouteNoConsumingUserNoRBAC (122.50s)

Changing the assignment to a deep copy prevents the failure even when I have the simulated error:

% make test-e2e TEST='TestConfigurableRouteNoConsumingUserNoRBAC'
GO111MODULE=on GOFLAGS=-mod=vendor go test -timeout 1h -count 1 -v -tags e2e -run "TestConfigurableRouteNoConsumingUserNoRBAC" ./test/e2e
I1214 18:07:54.970503  259905 request.go:665] Waited for 1.049893585s due to client-side throttling, not priority and fairness, request: GET:https://api.ci-ln-7d1q3xk-72292.origin-ci-int-gce.dev.rhcloud.com:6443/apis/config.openshift.io/v1?timeout=32s
=== RUN   TestConfigurableRouteNoConsumingUserNoRBAC
    configurable_route_test.go:423: error updating ingress.status: bogus error
    configurable_route_test.go:451: unexpected number of items exist: expected (1), actual (0)
--- PASS: TestConfigurableRouteNoConsumingUserNoRBAC (3.08s)
PASS
ok      github.com/openshift/cluster-ingress-operator/test/e2e  5.745s
make test-e2e TEST='TestConfigurableRouteNoConsumingUserNoRBAC'  4.39s user 19.37s system 89% cpu 26.532 total

Fix intermittent failures in the TestConfigurableRouteRBAC, TestConfigurableRouteNoSecretNoRBAC, and TestConfigurableRouteNoConsumingUserNoRBAC tests that were caused by incorrect update retry logic. These tests use the eventuallyUpdateIngressSpec and eventuallyUpdateIngressStatus helper functions to update the cluster ingress config. Before this commit, these helper functions attempted to update the ingress config using a polling loop that gets the current config, assigns a value provided as an argument to the helper function to a field in the config value from the API, and then updates the config in the API using the modified config value. The argument value is a complex data type, and so the assignment causes the argument value and the modified config value to reference the same memory. If the update failed, the next iteration of the polling loop would do another get from the API and overwrite the argument value. As a result, the assignment and update in the second and subsequent iterations of the polling loop would use the value from the API, not the value originally provided in the argument. This commit replaces the assignment in the polling loop with a deep copy so that the original argument value is not overwritten by the get. As a result, the test should be more resilient when updates fail. Follow-up to commit baf2d3e. * test/e2e/configurable_route_test.go (eventuallyUpdateIngressSpec) (eventuallyUpdateIngressStatus): Use DeepCopyInto instead of an assignment.

candita · 2021-12-14T23:49:19Z

/lgtm

openshift-ci · 2021-12-14T23:49:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Miciah,candita]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Miciah · 2021-12-15T00:05:34Z

level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: Provider produced inconsistent result after apply
level=error
level=error msg=When applying changes to module.vpc.aws_vpc_dhcp_options_association.main[0],
level=error msg=provider "registry.terraform.io/-/aws" produced an unexpected new value for
level=error msg=was present, but now absent.
level=error
level=error msg=This is a bug in the provider, which should be reported in the provider's own
level=error msg=issue tracker.
level=error
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change

/test e2e-aws-operator

Miciah · 2021-12-15T00:44:53Z

Same Terraform failure. The Terraform issue is being tracked as BZ#2032521.
/test e2e-aws-operator

Miciah · 2021-12-15T01:11:16Z

Must-gather failed.
/test e2e-aws-single-node

openshift-bot · 2021-12-15T01:26:50Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

Miciah · 2021-12-15T01:39:35Z

Failing tests:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

/test e2e-upgrade

Miciah · 2021-12-15T02:26:20Z

Must-gather failed.
/test e2e-aws-operator

Miciah · 2021-12-15T03:39:52Z

Failing tests:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

/test e2e-upgrade

openshift-bot · 2021-12-15T06:01:10Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-12-15T07:01:10Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-12-15T07:13:11Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-ci · 2021-12-15T12:46:55Z

@Miciah: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci bot requested review from ironcladlou and rfredette December 14, 2021 23:38

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2021

openshift-ci bot assigned candita Dec 14, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 14, 2021

openshift-merge-robot merged commit 0e10b5b into openshift:master Dec 15, 2021

Miciah mentioned this pull request Dec 16, 2021

Bug 2039919: NE-542 Router compression E2E test #679

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix update retry logic in TestConfigurableRoute* #689

Fix update retry logic in TestConfigurableRoute* #689

Miciah commented Dec 14, 2021

candita commented Dec 14, 2021

openshift-ci bot commented Dec 14, 2021

Miciah commented Dec 15, 2021

Miciah commented Dec 15, 2021

Miciah commented Dec 15, 2021

openshift-bot commented Dec 15, 2021

Miciah commented Dec 15, 2021

Miciah commented Dec 15, 2021

Miciah commented Dec 15, 2021

openshift-bot commented Dec 15, 2021

openshift-bot commented Dec 15, 2021

openshift-bot commented Dec 15, 2021

openshift-ci bot commented Dec 15, 2021

Fix update retry logic in TestConfigurableRoute* #689

Fix update retry logic in TestConfigurableRoute* #689

Conversation

Miciah commented Dec 14, 2021

candita commented Dec 14, 2021

openshift-ci bot commented Dec 14, 2021

Miciah commented Dec 15, 2021

Miciah commented Dec 15, 2021

Miciah commented Dec 15, 2021

openshift-bot commented Dec 15, 2021

Miciah commented Dec 15, 2021

Miciah commented Dec 15, 2021

Miciah commented Dec 15, 2021

openshift-bot commented Dec 15, 2021

openshift-bot commented Dec 15, 2021

openshift-bot commented Dec 15, 2021

openshift-ci bot commented Dec 15, 2021