UPSTREAM: <carry>: retry etcd errors #322

deads2k · 2020-08-26T15:30:58Z

This retries all non-mutating requests to etcd. This can be wasteful in cases such as a resourceVersion (RV) that is outside the window, but it is easy to reason about at the moment, and its effect is simply to throttle the watcher or lister.

For mutating requests, this retries only certain errors: the ones that I know mean "this action didn't do anything".
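A minimal sketch of the retry decision described above, not the code in this PR. The two sentinel errors and the `shouldRetry` helper are hypothetical names invented for illustration; the description does not list which etcd errors are treated as "this action didn't do anything".

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors (assumptions, not real etcd client errors).
var (
	// errNotApplied stands for a failure known to mean the request had no effect.
	errNotApplied = errors.New("request was not applied")
	// errOutcomeUnknown stands for a failure where the write may or may not have landed.
	errOutcomeUnknown = errors.New("request outcome unknown")
)

// shouldRetry reports whether a failed storage request is safe to retry.
// Non-mutating requests are always retried; mutating requests are retried
// only when the error is known to mean that nothing was changed.
func shouldRetry(mutating bool, err error) bool {
	if err == nil {
		return false
	}
	if !mutating {
		return true
	}
	return errors.Is(err, errNotApplied)
}

func main() {
	wrapped := fmt.Errorf("create: %w", errNotApplied)
	fmt.Println(shouldRetry(false, errOutcomeUnknown)) // a read is always retried
	fmt.Println(shouldRetry(true, errOutcomeUnknown))  // a write with unknown outcome is not
	fmt.Println(shouldRetry(true, wrapped))            // a write that did nothing is
}
```

Using `errors.Is` keeps the check working when the error has been wrapped on its way up the call stack.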

deads2k · 2020-08-26T18:47:59Z

/retest

hexfusion · 2020-08-27T05:52:59Z

/retest

deads2k · 2020-08-27T14:49:09Z

/refresh

deads2k · 2020-08-27T14:49:46Z

/retest

hexfusion · 2020-08-27T19:51:31Z

/test gcp-upgrade

openshift-ci-robot · 2020-08-27T19:51:48Z

@hexfusion: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

/test artifacts
/test e2e-aws
/test e2e-aws-csi
/test e2e-aws-disruptive
/test e2e-aws-fips
/test e2e-aws-jenkins
/test e2e-aws-multitenant
/test e2e-aws-ovn
/test e2e-aws-serial
/test e2e-azure
/test e2e-cmd
/test e2e-gcp
/test e2e-gcp-upgrade
/test e2e-vsphere
/test images
/test integration
/test k8s-e2e-conformance-aws
/test k8s-e2e-gcp
/test unit
/test verify
/test verify-commits

Use /test all to run the following jobs:

pull-ci-openshift-kubernetes-master-e2e-aws-csi
pull-ci-openshift-kubernetes-master-e2e-aws-fips
pull-ci-openshift-kubernetes-master-e2e-aws-serial
pull-ci-openshift-kubernetes-master-e2e-cmd
pull-ci-openshift-kubernetes-master-e2e-gcp
pull-ci-openshift-kubernetes-master-e2e-gcp-upgrade
pull-ci-openshift-kubernetes-master-images
pull-ci-openshift-kubernetes-master-integration
pull-ci-openshift-kubernetes-master-k8s-e2e-gcp
pull-ci-openshift-kubernetes-master-unit
pull-ci-openshift-kubernetes-master-verify
pull-ci-openshift-kubernetes-master-verify-commits

In response to this:

/test gcp-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

hexfusion · 2020-08-27T19:52:14Z

/test e2e-gcp-upgrade

hexfusion · 2020-08-29T18:51:42Z

/test e2e-gcp-upgrade

deads2k · 2020-09-01T15:38:08Z

/retest

hexfusion · 2020-09-01T15:44:18Z

need more data

/test e2e-gcp-upgrade

hexfusion · 2020-09-01T16:25:21Z

cluster-bot minor upgrade

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1300830597689643008

hexfusion · 2020-09-01T16:33:41Z

minor upgrade #2
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1300833926452875264

tkashem · 2020-09-01T17:07:27Z

/retest

hexfusion · 2020-09-01T18:41:38Z

FTR both upgrade jobs passed

#322 (comment)

tkashem · 2020-09-02T00:34:20Z

taking over the work, moving it to #327

tkashem · 2020-09-02T04:42:54Z

I think I understand what's causing some of the tests to fail: RBAC bootstrapping takes longer than the test timeout.
I0901 23:26:27.121787 21260 healthz.go:239] healthz check failed: poststarthook/rbac/bootstrap-roles

testserver.go:264: failed to launch server: failed to wait for /healthz to return ok: timed out waiting for the condition

This is the wait function in testserver.go:

	// wait until healthz endpoint returns ok
	err = wait.Poll(100*time.Millisecond, 30*time.Second, func() (bool, error) {
		select {
		case err := <-errCh:
			return false, err
		default:
		}

		result := client.CoreV1().RESTClient().Get().AbsPath("/healthz").Do(context.TODO())
		status := 0
		result.StatusCode(&status)
		if status == 200 {
			return true, nil
		}
		return false, nil
	})
	if err != nil {
		return result, fmt.Errorf("failed to wait for /healthz to return ok: %v", err)
	}

I am experimenting with aggressively reducing the retry backoff delay in #327. The integration job passes with a much smaller retry delay: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/327/pull-ci-openshift-kubernetes-master-integration/1301014246380802048

So the next steps are:

Investigate why RBAC bootstrapping takes longer.
Increase the test health endpoint timeout or use a much shorter delay (need to figure out which is the better course of action).
Wire the context into the backed-off retry delay so that the wait never exceeds the request context deadline.
Ensure the delay is jittered to avoid clusters of retries.
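The last two steps could be sketched roughly as follows. This is an illustration only, not the change in #327: `jitter`, `sleepWithContext`, and `retryWithBackoff` are hypothetical helpers, and the doubling backoff and 0.5 jitter factor are assumed values.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// jitter returns a duration in [base, base+maxFactor*base).
// frac is a random value in [0, 1), passed in so the result is deterministic to test.
func jitter(base time.Duration, maxFactor, frac float64) time.Duration {
	return base + time.Duration(frac*maxFactor*float64(base))
}

// sleepWithContext waits for d, but never past the context deadline.
func sleepWithContext(ctx context.Context, d time.Duration) error {
	// Refuse to start a wait that cannot finish before the deadline.
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < d {
		return context.DeadlineExceeded
	}
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		return nil
	}
}

// retryWithBackoff calls fn up to attempts times, doubling a jittered delay
// between calls and giving up as soon as the context cannot cover the next wait.
func retryWithBackoff(ctx context.Context, attempts int, base time.Duration, fn func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		if werr := sleepWithContext(ctx, jitter(delay, 0.5, rand.Float64())); werr != nil {
			return fmt.Errorf("giving up after %d attempt(s): %w (last error: %v)", i+1, werr, err)
		}
		delay *= 2
	}
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	calls := 0
	err := retryWithBackoff(ctx, 5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("transient failure %d", calls)
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

Checking the deadline before starting the timer means a request fails fast instead of sleeping through most of its remaining budget and then timing out anyway.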

deads2k · 2020-09-02T12:05:05Z

> Investigate why rbac bootstrapping takes longer.
>
> Increase the test health endpoint timeout or use much shorter delay (need to figure out which is the best course of action)
>
> wire the context into the backed off retry delay so that the wait never surpasses the request context deadline
>
> ensure the delay is jittered to avoid cluster of retries.

Think ugly. We can do this by having a global to turn it off and on.
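One way such a global could look, as a rough sketch only. The names `retryEnabled`, `SetRetryEnabled`, and `callWithOptionalRetry` are hypothetical and do not come from this PR; the idea is that a test server can switch retries off before it starts.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// retryEnabled is a hypothetical process-wide switch; 1 means retries are on.
var retryEnabled int32 = 1

// errAlwaysFails is a stand-in error used only for demonstration.
var errAlwaysFails = errors.New("always fails")

// SetRetryEnabled turns the retry behavior on or off for the whole process.
func SetRetryEnabled(on bool) {
	var v int32
	if on {
		v = 1
	}
	atomic.StoreInt32(&retryEnabled, v)
}

// RetryEnabled reports whether retries are currently switched on.
func RetryEnabled() bool {
	return atomic.LoadInt32(&retryEnabled) == 1
}

// callWithOptionalRetry runs fn once and, only while the global switch is on,
// retries it up to maxRetries more times. It returns the number of calls made.
func callWithOptionalRetry(maxRetries int, fn func() error) (attempts int, err error) {
	for {
		attempts++
		if err = fn(); err == nil {
			return attempts, nil
		}
		if !RetryEnabled() || attempts > maxRetries {
			return attempts, err
		}
	}
}

func main() {
	fail := func() error { return errAlwaysFails }

	SetRetryEnabled(false) // e.g. set by a test server before launch
	n, _ := callWithOptionalRetry(2, fail)
	fmt.Println("attempts with retries off:", n)

	SetRetryEnabled(true)
	n, _ = callWithOptionalRetry(2, fail)
	fmt.Println("attempts with retries on:", n)
}
```

An atomic integer keeps the switch safe to flip from one goroutine while request handlers read it from others.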

If this significantly improves behavior on azure in 4.6, we can consider taking it upstream. It is authored to be lightweight to carry.

openshift-ci-robot · 2020-09-02T12:14:38Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cmd/kube-apiserver/OWNERS~~ [deads2k]
~~staging/src/k8s.io/apiserver/OWNERS~~ [deads2k]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

deads2k · 2020-09-02T19:48:43Z

/retest

openshift-ci-robot · 2020-09-02T20:26:31Z

@deads2k: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/verify | `2e86cf6` | link | `/test verify` |
| ci/prow/e2e-aws-fips | `2e86cf6` | link | `/test e2e-aws-fips` |
| ci/prow/e2e-gcp | `2e86cf6` | link | `/test e2e-gcp` |
| ci/prow/e2e-gcp-upgrade | `2e86cf6` | link | `/test e2e-gcp-upgrade` |
| ci/prow/k8s-e2e-gcp | `2e86cf6` | link | `/test k8s-e2e-gcp` |
| ci/prow/e2e-cmd | `2e86cf6` | link | `/test e2e-cmd` |
| ci/prow/e2e-aws-csi | `2e86cf6` | link | `/test e2e-aws-csi` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

tkashem · 2020-09-09T22:18:37Z

closing in favor of #327

/close

openshift-ci-robot · 2020-09-09T22:18:54Z

@tkashem: Closed this PR.

In response to this:

closing in favor of #327

/close


openshift-ci-robot requested review from ingvagabund and smarterclayton August 26, 2020 15:32

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 26, 2020

UPSTREAM: <carry>: retry etcd errors

2e86cf6

If this significantly improves behavior on azure in 4.6, we can consider taking it upstream. It is authored to be lightweight to carry.

deads2k force-pushed the retry-etcd branch from 699c25e to 2e86cf6 Compare September 2, 2020 12:14

openshift-ci-robot closed this Sep 9, 2020

tkashem mentioned this pull request Sep 9, 2020

Bug 1874584: UPSTREAM: <carry>: retry etcd errors #327

Merged
