Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix update retry logic in TestConfigurableRoute* #689

Conversation

Miciah
Copy link
Contributor

@Miciah Miciah commented Dec 14, 2021

Fix intermittent failures in the TestConfigurableRouteRBAC, TestConfigurableRouteNoSecretNoRBAC, and TestConfigurableRouteNoConsumingUserNoRBAC tests that were caused by incorrect update retry logic.

These tests use the eventuallyUpdateIngressSpec and eventuallyUpdateIngressStatus helper functions to update the cluster ingress config. Before this change, these helper functions attempted to update the ingress config using a polling loop that gets the current config, assigns a value provided as an argument to the helper function to a field in the config value from the API, and then updates the config in the API using the modified config value. The argument value is a complex data type, and so the assignment causes the argument value and the modified config value to reference the same memory. If the update failed, the next iteration of the polling loop would do another get from the API and overwrite the argument value. As a result, the assignment and update in the second and subsequent iterations of the polling loop would use the value from the API, not the value originally provided in the argument.

This change replaces the assignment in the polling loop with a deep copy so that the original argument value is not overwritten by the get. As a result, the test should be more resilient when updates fail.

Follow-up to #552. @awgreene, FYI.

  • test/e2e/configurable_route_test.go (eventuallyUpdateIngressSpec, eventuallyUpdateIngressStatus): Use DeepCopyInto instead of an assignment.

To reproduce the error, I changed eventuallyUpdateIngressStatus to simulate an update failure. I changed the following:

func eventuallyUpdateIngressStatus(t *testing.T, ingressStatus configv1.IngressStatus) error {
	t.Helper()
	ingress := &configv1.Ingress{}
	return wait.PollImmediate(1*time.Second, timeout, func() (bool, error) {
		if err := kclient.Get(context.TODO(), types.NamespacedName{Namespace: "", Name: "cluster"}, ingress); err != nil {
			t.Logf("error getting ingress: %v", err)
			return false, nil
		}

		ingress.Status = ingressStatus

		if err := kclient.Status().Update(context.TODO(), ingress); err != nil {
			t.Logf("error updating ingress.status: %v", err)
			return false, nil
		}

		return true, nil
	})
}

...to the following:

var haveTriedUpdate = false
func eventuallyUpdateIngressStatus(t *testing.T, ingressStatus configv1.IngressStatus) error {
	t.Helper()
	ingress := &configv1.Ingress{}
	return wait.PollImmediate(1*time.Second, timeout, func() (bool, error) {
		if err := kclient.Get(context.TODO(), types.NamespacedName{Namespace: "", Name: "cluster"}, ingress); err != nil {
			t.Logf("error getting ingress: %v", err)
			return false, nil
		}

		ingress.Status = ingressStatus

		// Simulate an error on first update.
		if !haveTriedUpdate {
			haveTriedUpdate = true
			t.Logf("error updating ingress.status: %v", "bogus error")
			return false, nil
		}

		if err := kclient.Status().Update(context.TODO(), ingress); err != nil {
			t.Logf("error updating ingress.status: %v", err)
			return false, nil
		}

		return true, nil
	})
}

That is, after making the assignment to ingress.Status, pretend to get an error so the loop starts over with another Get. When I did this, I saw the following:

% make test-e2e TEST='TestConfigurableRouteNoConsumingUserNoRBAC'
GO111MODULE=on GOFLAGS=-mod=vendor go test -timeout 1h -count 1 -v -tags e2e -run "TestConfigurableRouteNoConsumingUserNoRBAC" ./test/e2e
I1214 18:04:56.864670  259306 request.go:665] Waited for 1.049124231s due to client-side throttling, not priority and fairness, request: GET:https://api.ci-ln-7d1q3xk-72292.origin-ci-int-gce.dev.rhcloud.com:6443/apis/apps/v1?timeout=32s
=== RUN   TestConfigurableRouteNoConsumingUserNoRBAC
    configurable_route_test.go:423: error updating ingress.status: bogus error
    configurable_route_test.go:451: unexpected number of items exist: expected (1), actual (0)
[the above line repeated 61 more times]
    configurable_route_test.go:351: Number of roles in list != 1
    configurable_route_test.go:451: unexpected number of items exist: expected (1), actual (0)
 [the above line repeated 61 more times]
   configurable_route_test.go:356: Number of roleBindings in list != 1
--- FAIL: TestConfigurableRouteNoConsumingUserNoRBAC (122.50s)

Changing the assignment to a deep copy prevents the failure even when I have the simulated error:

% make test-e2e TEST='TestConfigurableRouteNoConsumingUserNoRBAC'
GO111MODULE=on GOFLAGS=-mod=vendor go test -timeout 1h -count 1 -v -tags e2e -run "TestConfigurableRouteNoConsumingUserNoRBAC" ./test/e2e
I1214 18:07:54.970503  259905 request.go:665] Waited for 1.049893585s due to client-side throttling, not priority and fairness, request: GET:https://api.ci-ln-7d1q3xk-72292.origin-ci-int-gce.dev.rhcloud.com:6443/apis/config.openshift.io/v1?timeout=32s
=== RUN   TestConfigurableRouteNoConsumingUserNoRBAC
    configurable_route_test.go:423: error updating ingress.status: bogus error
    configurable_route_test.go:451: unexpected number of items exist: expected (1), actual (0)
--- PASS: TestConfigurableRouteNoConsumingUserNoRBAC (3.08s)
PASS
ok      github.com/openshift/cluster-ingress-operator/test/e2e  5.745s
make test-e2e TEST='TestConfigurableRouteNoConsumingUserNoRBAC'  4.39s user 19.37s system 89% cpu 26.532 total

Fix intermittent failures in the TestConfigurableRouteRBAC,
TestConfigurableRouteNoSecretNoRBAC, and
TestConfigurableRouteNoConsumingUserNoRBAC tests that were caused by
incorrect update retry logic.

These tests use the eventuallyUpdateIngressSpec and
eventuallyUpdateIngressStatus helper functions to update the cluster
ingress config.  Before this commit, these helper functions attempted to
update the ingress config using a polling loop that gets the current
config, assigns a value provided as an argument to the helper function to a
field in the config value from the API, and then updates the config in the
API using the modified config value.  The argument value is a complex data
type, and so the assignment causes the argument value and the modified
config value to reference the same memory.  If the update failed, the next
iteration of the polling loop would do another get from the API and
overwrite the argument value.  As a result, the assignment and update in
the second and subsequent iterations of the polling loop would use the
value from the API, not the value originally provided in the argument.

This commit replaces the assignment in the polling loop with a deep copy so
that the original argument value is not overwritten by the get.  As a
result, the test should be more resilient when updates fail.

Follow-up to commit baf2d3e.

* test/e2e/configurable_route_test.go (eventuallyUpdateIngressSpec)
(eventuallyUpdateIngressStatus): Use DeepCopyInto instead of an assignment.
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2021
@candita
Copy link
Contributor

candita commented Dec 14, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 14, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 14, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Miciah
Copy link
Contributor Author

Miciah commented Dec 15, 2021

level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: Provider produced inconsistent result after apply
level=error
level=error msg=When applying changes to module.vpc.aws_vpc_dhcp_options_association.main[0],
level=error msg=provider "registry.terraform.io/-/aws" produced an unexpected new value for
level=error msg=was present, but now absent.
level=error
level=error msg=This is a bug in the provider, which should be reported in the provider's own
level=error msg=issue tracker.
level=error
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change 

/test e2e-aws-operator

@Miciah
Copy link
Contributor Author

Miciah commented Dec 15, 2021

Same Terraform failure. The Terraform issue is being tracked as BZ#2032521.
/test e2e-aws-operator

@Miciah
Copy link
Contributor Author

Miciah commented Dec 15, 2021

Must-gather failed.
/test e2e-aws-single-node

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@Miciah
Copy link
Contributor Author

Miciah commented Dec 15, 2021

Failing tests:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

/test e2e-upgrade

@Miciah
Copy link
Contributor Author

Miciah commented Dec 15, 2021

Must-gather failed.
/test e2e-aws-operator

@Miciah
Copy link
Contributor Author

Miciah commented Dec 15, 2021

Failing tests:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

/test e2e-upgrade

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 15, 2021

@Miciah: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants