
Set 10 minute timeout on webhook and deployment operations #183

Closed

Conversation

@Danil-Grigorev (Contributor) commented Aug 2, 2020

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign danil-grigorev
You can assign the PR to them by writing /assign @danil-grigorev in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Danil-Grigorev (Contributor Author)

/retest

@Danil-Grigorev changed the title from "Bump webhook rollout timeout" to "Bump webhook rollout timeout, make webhook tests serial" on Aug 5, 2020
@Danil-Grigorev (Contributor Author)

/retest

@@ -81,7 +81,7 @@ func DeleteMutatingWebhookConfiguration(c client.Client, webhookConfiguraiton *a

// UpdateMutatingWebhookConfiguration updates the specified mutating webhook configuration
func UpdateMutatingWebhookConfiguration(c client.Client, updated *admissionregistrationv1.MutatingWebhookConfiguration) error {
- return wait.PollImmediate(RetryShort, WaitShort, func() (bool, error) {
+ return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {
Member

Can you elaborate on what's motivating this change? Can you share links to CI failures that support that motivation and that this change would mitigate?


@@ -97,7 +97,7 @@ func UpdateMutatingWebhookConfiguration(c client.Client, updated *admissionregis

// UpdateValidatingWebhookConfiguration updates the specified mutating webhook configuration
func UpdateValidatingWebhookConfiguration(c client.Client, updated *admissionregistrationv1.ValidatingWebhookConfiguration) error {
- return wait.PollImmediate(RetryShort, WaitShort, func() (bool, error) {
+ return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {
Member

Can you elaborate on what's motivating this change? Can you share links to CI failures that support that motivation and that this change would mitigate?

Contributor Author

Here is one I noticed: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/183/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290937797384867840

In short, here is the workflow I imagine is happening:

We probably need to increase this up to WaitLong, but for now it just mirrors the deployment logic, which works fine:

return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {

Member

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/183/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290937797384867840

Hmm, that link is from a run against this PR... And the failure is for "API operator deployment should [Serial] maintains deployment spec", so I'm not really sure how it's related to this change at all.

Those links are timeouts; if they match, then something went wrong with the operator and no test will pass. Some time to run the pod plus the 3 minutes to check permanent availability should be the norm.

UpdateValidatingWebhookConfiguration and friends are atomic functions; the only reason they have a PollImmediate at the moment is to mitigate transient errors, and we should probably stop doing that: assume there won't be any, and if there are, fail the test. If something needs to wait for a good reason, please create a dedicated WaitBlah function.
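For reference, a minimal sketch of that split, assuming controller-runtime's client and apimachinery's wait package; the constant values and the waitForValidatingWebhookConfiguration helper are placeholders, not this repo's actual framework API:

```go
package framework

import (
	"context"
	"time"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Placeholder durations; the real values live in the test framework.
const (
	RetryShort = 1 * time.Second
	WaitShort  = 1 * time.Minute
)

// The update itself stays atomic: a short PollImmediate only smooths over transient
// API errors, and if it still fails after WaitShort the test fails.
func UpdateValidatingWebhookConfiguration(c client.Client, updated *admissionregistrationv1.ValidatingWebhookConfiguration) error {
	return wait.PollImmediate(RetryShort, WaitShort, func() (bool, error) {
		if err := c.Update(context.Background(), updated); err != nil {
			return false, nil // retry transient errors until the timeout
		}
		return true, nil
	})
}

// Anything that genuinely needs to wait gets its own explicitly named helper with its
// own timeout, instead of inflating the update's timeout.
func waitForValidatingWebhookConfiguration(c client.Client, name string, timeout time.Duration) error {
	return wait.PollImmediate(RetryShort, timeout, func() (bool, error) {
		got := &admissionregistrationv1.ValidatingWebhookConfiguration{}
		if err := c.Get(context.Background(), client.ObjectKey{Name: name}, got); err != nil {
			return false, nil // not found or transient error, keep polling
		}
		return true, nil
	})
}
```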

Contributor Author

That's why enabling the Serial keyword is necessary here. It failed on this line, which is the case I described:

Expect(framework.UpdateDeployment(client, maoManagedDeployment, framework.MachineAPINamespace, changedDeployment)).NotTo(HaveOccurred())

In short, here is the workflow I imagine is happening:

* tests are running in parallel

* webhook is deleted

* MAO is busy with other things and does not yet know about it (1-3 minutes); with the inclusion of [openshift/machine-api-operator#651](https://github.com/openshift/machine-api-operator/pull/651) this is now up to 10 minutes (https://github.com/openshift/machine-api-operator/blob/6e867bf92c3ec296c4c93d73abd93b47af9426a3/pkg/operator/sync.go#L31 + https://github.com/openshift/machine-api-operator/blob/6e867bf92c3ec296c4c93d73abd93b47af9426a3/pkg/operator/sync.go#L28)

Contributor Author

In order to enable this, we need to merge #186 first.

@@ -14,7 +14,7 @@ var (
maoManagedDeployment = "machine-api-controllers"
)

- var _ = Describe("[Feature:Operators] Machine API operator deployment should", func() {
+ var _ = Describe("[Feature:Operators] Machine API operator deployment should [Serial]", func() {
Contributor

Does naming these as [serial] make them run as serial?

Contributor

No, not specifically. These are labels inspired by the kubernetes community[0].

[0] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md#kinds-of-tests

@@ -14,7 +14,7 @@ var (
maoManagedDeployment = "machine-api-controllers"
)

- var _ = Describe("[Feature:Operators] Machine API operator deployment should", func() {
+ var _ = Describe("[Feature:Operators] Machine API operator deployment should [Serial]", func() {
Member

Is this something ginkgo just understands? Can you reference the docs?

Contributor Author

@enxebre @JoelSpeed You are right, not at the moment. Here is a separate PR for serial support: #186

@elmiko (Contributor) Aug 5, 2020

These labels are inspired by the kubernetes community docs[0]. Ginkgo doesn't know them, but I have started to clean them up in the autoscaler tests, as I think it helps people understand how to operate the tests.

Additionally, as demonstrated in Danil's PR, they can be used to filter the tests.

[0] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md#kinds-of-tests
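For context, a small sketch of how such a label is used in practice, assuming ginkgo v1 and gomega; the package name and spec body here are illustrative only:

```go
package operators_test

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

// The label is simply part of the spec's full description, so the runner's focus/skip
// regexes can match it, e.g. `-ginkgo.focus='\[Serial\]'` or `-ginkgo.skip='\[Serial\]'`
// when the compiled suite is run via `go test`.
var _ = Describe("[Feature:Operators] Machine API operator deployment should [Serial]", func() {
	It("maintains deployment spec", func() {
		Expect(true).To(BeTrue()) // placeholder assertion
	})
})
```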

Contributor

Out of interest, do you have any estimate of how much time this might add to each test run by making these serial? Do you think there are any tests within here that can still run in parallel? We could maybe divide them into parallel and serial sets.

@Danil-Grigorev force-pushed the webhook-wait-longer branch 2 times, most recently from e943d39 to 47dd14d on August 6, 2020 09:11
@Danil-Grigorev changed the title from "Bump webhook rollout timeout, make webhook tests serial" to "Make webhook tests serial" on Aug 6, 2020
@@ -66,6 +66,10 @@ var _ = Describe("[Feature:Operators] Machine API operator deployment should", f
Expect(framework.IsDeploymentAvailable(client, maoManagedDeployment, framework.MachineAPINamespace)).To(BeTrue())

})
})

Member

Can you help me understand what this is solving? All the It validations within this Describe still run in parallel, right?

Member

Can you share CI failures/flakes this PR would solve?

@enxebre (Member) commented Aug 6, 2020

> In a typical environment MAO resource provisioning
> occurs only during upgrades and serves as a one-time procedure.
> Frequent manipulation with provisioned resources by the user is not expected.

Anything could manipulate MAO-owned resources, and MAO should still be able to react to that appropriately.
We shouldn't introduce artificial preconditions on how to run the test suite for it to pass. I think these tests should run in parallel and we should use waitForXtoSync funcs.

@JoelSpeed (Contributor)

The tests have passed once for the latest commit; since the primary purpose of this PR is to improve stability, we need to verify that they will repeatedly pass.

/test e2e-aws-operator
/test e2e-azure-operator
/test e2e-gcp-operator

@Danil-Grigorev (Contributor Author)

Those PRs are failing CI without this change:

@enxebre (Member) commented Aug 6, 2020

> Those PRs are failing CI without this change:

It'd be better to link the failing jobs rather than the PRs; otherwise the links will become meaningless over time.

Can you please answer this: #183 (comment)

For https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/184/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290284947541594112, have you dug into the details? Wouldn't we just need to increase the timeout for waitToSync as per #183 (comment)? That's current MAO behaviour, and parallel disruption is totally legit, so we want to test it; we don't want to relax the tests.

@Danil-Grigorev (Contributor Author)

> Those PRs are failing CI without this change:
>
> It'd be better to link the failing jobs rather than the PRs; otherwise the links will become meaningless over time.
>
> Can you please answer this: #183 (comment)
>
> For https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/184/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290284947541594112, have you dug into the details? Wouldn't we just need to increase the timeout for waitToSync as per #183 (comment)? That's current MAO behaviour and so we want to test it; we don't want to relax the tests.

Here are some CI jobs: #183 (comment)
Here is the reason: #183 (comment). The places where waitToSync may still fail are the short-lasting atomic operations like updateWebhooks, as the timeout there should be roughly >= 5 minutes to work all the time.

#183 (comment) is worth a BZ to work out. Reaction time is sometimes really slow.

@Danil-Grigorev (Contributor Author)

/retest

1 similar comment
@Danil-Grigorev (Contributor Author)

/retest

@Danil-Grigorev (Contributor Author) commented Aug 6, 2020

An explanation of why those tests are still failing:

The actual timeout for webhook operations should be the sum of the timeouts for the DaemonSet and Deployment rollouts, i.e. the worst possible case. This applies to all operations: Get, Update, IsSynced.

The reason being: we expect something at some moment, that moment has just passed, and now MAO rolls out the Deployment, then the DaemonSet, and only after that does it actually return to the webhooks. So I'll rewrite these tests to follow that pattern, add a comment, and stop using the [Serial] keyword for now. @enxebre @JoelSpeed
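A rough sketch of that timeout arithmetic, assuming the constant names below (they are placeholders, not the framework's real definitions):

```go
package framework

import "time"

// Assumed per-rollout budget; the real constants live in the e2e framework.
const WaitLong = 5 * time.Minute

// Worst case for a webhook operation as described above: MAO may finish a full
// Deployment rollout and then a full DaemonSet rollout before it returns to the
// webhooks, so the webhook timeout has to cover both, i.e. roughly 10 minutes.
const WaitOverLong = 2 * WaitLong
```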

@JoelSpeed (Contributor) left a comment

Have you tried just extending the wait for the IsFooSynced methods? For Delete and Update, certainly: if they haven't completed within WaitShort I'd be concerned, so I don't think we need to extend those two. Get I'm on the fence about, as it maybe has the potential to be blocked by the rollouts, right?

Comment on lines 23 to 24
// Webhooks may take up to 10 minutes to reconcile, as they could be just changed, but before we know about that,
// a full Deployment (5 minutes) rollout and DaemonSet (another 5 minutes) rollout should occure. Then the
Contributor

I think this will read better like this, wdyt?

Suggested change
- // Webhooks may take up to 10 minutes to reconcile, as they could be just changed, but before we know about that,
- // a full Deployment (5 minutes) rollout and DaemonSet (another 5 minutes) rollout should occure. Then the
+ // Webhooks may take up to 10 minutes to sync. When changed we first have to wait for a full Deployment
+ // rollout (5 minutes) and DaemonSet rollout (another 5 minutes) to occur before they are synced. Then the

Webhooks may take up to 10 minutes to sync. When changed we first have to wait for a full Deployment
rollout (5 minutes) and DaemonSet rollout (another 5 minutes) to occur before they are synced. Then the
process starts all over again. This applies to all webhook operations.
@@ -34,7 +40,12 @@ func GetDeployment(c client.Client, name, namespace string) (*kappsapi.Deploymen

// DeleteDeployment deletes the specified deployment
func DeleteDeployment(c client.Client, deployment *kappsapi.Deployment) error {
Member

Why do we need to wrap this? Now DeleteDeployment is unused. Can't we just have a single waitForDeploymentDeletion?
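For illustration, a minimal sketch of the single helper being suggested, assuming controller-runtime and apimachinery; the function name, poll interval, and package are placeholders:

```go
package framework

import (
	"context"
	"time"

	kappsapi "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const retryShort = 1 * time.Second // placeholder poll interval

// waitForDeploymentDeletion issues the delete and then polls until the object is gone,
// removing the need for a separate DeleteDeployment wrapper.
func waitForDeploymentDeletion(c client.Client, deployment *kappsapi.Deployment, timeout time.Duration) error {
	ctx := context.Background()
	if err := c.Delete(ctx, deployment); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	key := client.ObjectKey{Name: deployment.Name, Namespace: deployment.Namespace}
	return wait.PollImmediate(retryShort, timeout, func() (bool, error) {
		d := &kappsapi.Deployment{}
		err := c.Get(ctx, key, d)
		return apierrors.IsNotFound(err), nil
	})
}
```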

@@ -45,7 +56,13 @@ func DeleteDeployment(c client.Client, deployment *kappsapi.Deployment) error {

// UpdateDeployment updates the specified deployment
func UpdateDeployment(c client.Client, name, namespace string, updated *kappsapi.Deployment) error {
return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {
Member

Why do we need to wrap this? Now UpdateDeployment is unused. Can't we just have a single waitForDeploymentUpdate?

@@ -61,7 +78,12 @@ func UpdateDeployment(c client.Client, name, namespace string, updated *kappsapi

// IsDeploymentAvailable returns true if the deployment has one or more availabe replicas
func IsDeploymentAvailable(c client.Client, name, namespace string) bool {
- if err := wait.PollImmediate(RetryShort, WaitLong, func() (bool, error) {
+ return IsDeploymentAvailableIn(c, name, namespace, WaitLong)
Member

Why do we need to wrap this and have two different functions? Is there any valid case where we want to use a different wait timeout here?

@@ -84,7 +106,14 @@ func IsDeploymentAvailable(c client.Client, name, namespace string) bool {

// IsDeploymentSynced returns true if provided deployment spec matched one found on cluster
func IsDeploymentSynced(c client.Client, dep *kappsapi.Deployment, name, namespace string) bool {
Member

Why this wrap? IsDeploymentSynced is unused now.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2020
@michaelgugino (Contributor) left a comment

/hold

I have no idea what this PR is trying to do. It either needs to link to a BZ with great detail, or the first comment and commit message need to describe exactly the conditions we're trying to resolve. Making some tests run in serial: I have no idea why we're doing that, and I'm not going to try to determine the context from all the comments and code changes. We need clear commit messages describing the purpose of things.

@Danil-Grigorev changed the title from "Make webhook tests serial" to "Set 10 minute timeout on webhook and deployment operations" on Aug 10, 2020
@Danil-Grigorev (Contributor Author)

> I have no idea what this PR is trying to do. It either needs to link to a BZ with great detail, or the first comment and commit message need to describe exactly the conditions we're trying to resolve. Making some tests run in serial: I have no idea why we're doing that, and I'm not going to try to determine the context from all the comments and code changes. We need clear commit messages describing the purpose of things.

Sorry, the PR description was outdated. Right now the PR is simply increasing timeouts on MAO operations, as multiple CI jobs failed recently because of that. I added the links and an explanation of why it will take 10 minutes for each to complete.

@JoelSpeed (Contributor)

@Danil-Grigorev Did you see #183 (review)? I don't think I got a response, but I think it's still relevant.

Thanks for updating the PR description and adding those links; they will help us work out why we did this 6 months down the line when we are doing some archaeology, I'm sure! Nit: the 4th link seems to be an unrelated failure, PTAL.

I wonder if we should try, as an alternative (or maybe complement) to this, refactoring the MAO so that it applies the Deployment, DaemonSet and webhooks, and then performs the waiting process. This would reduce the waiting time since the Deployment and DaemonSet would be rolling out in parallel. If we try this idea first, we may not even need this PR, right? The tests may consistently pass in the allowed time and we would have done some optimisation to the product as well :D
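A hedged sketch of that ordering, not MAO's actual sync code; the apply and wait steps are passed in as placeholder functions:

```go
package operator

import "context"

// syncAll sketches the suggested ordering: apply the Deployment, DaemonSet and webhook
// configurations first so their rollouts progress concurrently, then block waiting on
// each of them. The apply and wait steps are injected because MAO's real sync helpers
// are not reproduced here.
func syncAll(ctx context.Context, apply, waitFor []func(context.Context) error) error {
	// Kick off every rollout before waiting on any of them.
	for _, step := range apply {
		if err := step(ctx); err != nil {
			return err
		}
	}
	// Total waiting time now approaches the slowest rollout rather than the sum of all of them.
	for _, step := range waitFor {
		if err := step(ctx); err != nil {
			return err
		}
	}
	return nil
}
```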

@openshift-ci-robot

@Danil-Grigorev: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-operator | fef754e | link | /test e2e-aws-operator |
| ci/prow/e2e-azure-operator | fef754e | link | /test e2e-azure-operator |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@michaelgugino (Contributor)

> to this, refactoring the MAO so that it applies the Deployment, Daemonset and webhooks, and then performs the waiting process. This would reduce the time waiting since the deployment and daemonset would be rolling out in parallel. If we try this idea first, we may not even need this PR right? The tests may consistently pass in the allowed time and we have done some optimisation to the product as well :D

I endorse this suggestion.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 8, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 8, 2020
@Danil-Grigorev (Contributor Author)

/close

This is no longer relevant, as the issue with webhooks was resolved by openshift/machine-api-operator#707 and later superseded by the upstream implementation, openshift/machine-api-operator#642.

@openshift-ci-robot

@Danil-Grigorev: Closed this PR.

In response to this:

> /close
>
> This is no longer relevant, as the issue with webhooks was resolved by openshift/machine-api-operator#707 and later superseded by the upstream implementation, openshift/machine-api-operator#642.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
