
Set 10 minute timeout on webhook and deployment operations #183

Closed

Conversation

@Danil-Grigorev (Contributor) commented Aug 2, 2020

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign danil-grigorev
You can assign the PR to them by writing /assign @danil-grigorev in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Danil-Grigorev (Contributor Author)

/retest

@Danil-Grigorev changed the title from "Bump webhook rollout timeout" to "Bump webhook rollout timeout, make webhook tests serial" on Aug 5, 2020
@Danil-Grigorev (Contributor Author)

/retest

@@ -81,7 +81,7 @@ func DeleteMutatingWebhookConfiguration(c client.Client, webhookConfiguraiton *a

// UpdateMutatingWebhookConfiguration updates the specified mutating webhook configuration
func UpdateMutatingWebhookConfiguration(c client.Client, updated *admissionregistrationv1.MutatingWebhookConfiguration) error {
- return wait.PollImmediate(RetryShort, WaitShort, func() (bool, error) {
+ return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {
Member

Can you elaborate on what's motivating this change? Can you share links to CI failures that support that motivation and that this change would mitigate?


@@ -97,7 +97,7 @@ func UpdateMutatingWebhookConfiguration(c client.Client, updated *admissionregis

// UpdateValidatingWebhookConfiguration updates the specified mutating webhook configuration
func UpdateValidatingWebhookConfiguration(c client.Client, updated *admissionregistrationv1.ValidatingWebhookConfiguration) error {
- return wait.PollImmediate(RetryShort, WaitShort, func() (bool, error) {
+ return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {
Member

Can you elaborate on what's motivating this change? Can you share links to CI failures that support that motivation and that this change would mitigate?

Contributor Author

Here is one I noticed: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/183/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290937797384867840

In short, here is the workflow I imagine is happening:

We probably need to increase this up to WaitLong, but for now it just mirrors the deployment logic, which works fine:

return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {

Member

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/183/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290937797384867840

Hmm, that link is from a run against this PR... And the failure is for "API operator deployment should [Serial] maintains deployment spec", so I'm not really sure how it's related to this change at all.

Those links are timeouts; if they match, then something went wrong with the operator and no test will pass. Some time to run the pod plus the 3 minutes to check permanent availability should be the norm.

UpdateValidatingWebhookConfiguration and friends are atomic functions; the only reason they have a PollImmediate at the moment is to mitigate transient errors, and we should probably stop doing that: assume there won't be any, and if there are, fail the test. If something needs to wait for a good reason, please create a dedicated WaitBlah function.
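For reference, a minimal sketch of that split, assuming controller-runtime's client and apimachinery's wait package; the constant values and the waitForValidatingWebhookConfiguration helper are placeholders, not this repo's actual framework API:

```go
package framework

import (
	"context"
	"time"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Placeholder durations; the real values live in the test framework.
const (
	RetryShort = 1 * time.Second
	WaitShort  = 1 * time.Minute
)

// The update itself stays atomic: a short PollImmediate only smooths over transient
// API errors, and if it still fails after WaitShort the test fails.
func UpdateValidatingWebhookConfiguration(c client.Client, updated *admissionregistrationv1.ValidatingWebhookConfiguration) error {
	return wait.PollImmediate(RetryShort, WaitShort, func() (bool, error) {
		if err := c.Update(context.Background(), updated); err != nil {
			return false, nil // retry transient errors until the timeout
		}
		return true, nil
	})
}

// Anything that genuinely needs to wait gets its own explicitly named helper with its
// own timeout, instead of inflating the update's timeout.
func waitForValidatingWebhookConfiguration(c client.Client, name string, timeout time.Duration) error {
	return wait.PollImmediate(RetryShort, timeout, func() (bool, error) {
		got := &admissionregistrationv1.ValidatingWebhookConfiguration{}
		if err := c.Get(context.Background(), client.ObjectKey{Name: name}, got); err != nil {
			return false, nil // not found or transient error, keep polling
		}
		return true, nil
	})
}
```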

Contributor Author

That's why enabling the Serial keyword is necessary here. It failed on this line, which is the case I described:

Expect(framework.UpdateDeployment(client, maoManagedDeployment, framework.MachineAPINamespace, changedDeployment)).NotTo(HaveOccurred())

In short, here is the workflow I imagine is happening:

* tests are running in parallel

* webhook is deleted

* MAO is busy with other things and does not yet know about it (1-3 minutes); with the inclusion of [openshift/machine-api-operator#651](https://github.com/openshift/machine-api-operator/pull/651) this is now up to 10 minutes (https://github.com/openshift/machine-api-operator/blob/6e867bf92c3ec296c4c93d73abd93b47af9426a3/pkg/operator/sync.go#L31 + https://github.com/openshift/machine-api-operator/blob/6e867bf92c3ec296c4c93d73abd93b47af9426a3/pkg/operator/sync.go#L28)

Contributor Author

In order to enable this, we need to merge #186 first.

@@ -14,7 +14,7 @@ var (
maoManagedDeployment = "machine-api-controllers"
)

- var _ = Describe("[Feature:Operators] Machine API operator deployment should", func() {
+ var _ = Describe("[Feature:Operators] Machine API operator deployment should [Serial]", func() {
Contributor

Does naming these as [serial] make them run as serial?

Contributor

No, not specifically. These are labels inspired by the kubernetes community[0].

[0] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md#kinds-of-tests

@@ -14,7 +14,7 @@ var (
maoManagedDeployment = "machine-api-controllers"
)

- var _ = Describe("[Feature:Operators] Machine API operator deployment should", func() {
+ var _ = Describe("[Feature:Operators] Machine API operator deployment should [Serial]", func() {
Member

Is this something ginkgo just understands? Can you reference the docs?

Contributor Author

@enxebre @JoelSpeed You are right, not at the moment. Here is a separate PR for serial support: #186

@elmiko (Contributor) Aug 5, 2020

These labels are inspired by the kubernetes community docs[0]. Ginkgo doesn't know them, but I have started to clean them up in the autoscaler tests, as I think it helps people understand how to operate the tests.

Additionally, as demonstrated in Danil's PR, they can be used to filter the tests.

[0] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md#kinds-of-tests
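For context, a small sketch of how such a label is used in practice, assuming ginkgo v1 and gomega; the package name and spec body here are illustrative only:

```go
package operators_test

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

// The label is simply part of the spec's full description, so the runner's focus/skip
// regexes can match it, e.g. `-ginkgo.focus='\[Serial\]'` or `-ginkgo.skip='\[Serial\]'`
// when the compiled suite is run via `go test`.
var _ = Describe("[Feature:Operators] Machine API operator deployment should [Serial]", func() {
	It("maintains deployment spec", func() {
		Expect(true).To(BeTrue()) // placeholder assertion
	})
})
```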

Contributor

Out of interest, do you have any estimate of how much time this might add to each test run by making these serial? Do you think there are any tests within here that can still run in parallel? We could maybe divide them into parallel and serial sets.

@Danil-Grigorev force-pushed the webhook-wait-longer branch 2 times, most recently from e943d39 to 47dd14d on August 6, 2020 09:11
@Danil-Grigorev changed the title from "Bump webhook rollout timeout, make webhook tests serial" to "Make webhook tests serial" on Aug 6, 2020
@@ -66,6 +66,10 @@ var _ = Describe("[Feature:Operators] Machine API operator deployment should", f
Expect(framework.IsDeploymentAvailable(client, maoManagedDeployment, framework.MachineAPINamespace)).To(BeTrue())

})
})

Member

Can you help me understand what this is solving? All the It validations within this Describe still run in parallel, right?

Member

Can you share CI failures/flakes this PR would solve?

@enxebre (Member) commented Aug 6, 2020

> In a typical environment MAO resource provisioning
> occurs only during upgrades and serves as a one-time procedure.
> Frequent manipulation with provisioned resources by the user is not expected.

Anything could manipulate MAO-owned resources, and MAO should still be able to react to that appropriately.
We shouldn't introduce artificial preconditions on how to run the test suite for it to pass. I think these tests should run in parallel and we should use waitForXtoSync funcs.

@JoelSpeed (Contributor)

The tests have passed once for the latest commit; since the primary purpose of this PR is to improve stability, we need to verify that they will repeatedly pass.

/test e2e-aws-operator
/test e2e-azure-operator
/test e2e-gcp-operator

@Danil-Grigorev (Contributor Author)

Those PRs are failing CI without this change:

@enxebre (Member) commented Aug 6, 2020

> Those PRs are failing CI without this change:

It'd be better to link the failing jobs rather than the PRs; otherwise the links will become meaningless over time.

Can you please answer this: #183 (comment)

For https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/184/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290284947541594112, have you dug into the details? Wouldn't we just need to increase the timeout for waitToSync as per #183 (comment)? That's current MAO behaviour, and parallel disruption is totally legit, so we want to test it; we don't want to relax the tests.

@Danil-Grigorev (Contributor Author)

> Those PRs are failing CI without this change:
>
> It'd be better to link the failing jobs rather than the PRs; otherwise the links will become meaningless over time.
>
> Can you please answer this: #183 (comment)
>
> For https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/184/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290284947541594112, have you dug into the details? Wouldn't we just need to increase the timeout for waitToSync as per #183 (comment)? That's current MAO behaviour and so we want to test it; we don't want to relax the tests.

Here are some CI jobs: #183 (comment)
Here is the reason: #183 (comment). The places where waitToSync may still fail are the short-lasting atomic operations like updateWebhooks, as the timeout there should be roughly >= 5 minutes to work all the time.

#183 (comment) is worth a BZ to work out. Reaction time is sometimes really slow.

@Danil-Grigorev (Contributor Author)

/retest

1 similar comment
@Danil-Grigorev (Contributor Author)

/retest

@Danil-Grigorev (Contributor Author) commented Aug 6, 2020

An explanation of why those tests are still failing:

The actual timeout for webhook operations should be the sum of the timeouts for the DaemonSet and Deployment rollouts, i.e. the worst possible case. This applies to all operations: Get, Update, IsSynced.

The reason being: we expect something at some moment, that moment has just passed, and now MAO rolls out the Deployment, then the DaemonSet, and only after that does it actually return to the webhooks. So I'll rewrite these tests to follow that pattern, add a comment, and stop using the [Serial] keyword for now. @enxebre @JoelSpeed
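A rough sketch of that timeout arithmetic, assuming the constant names below (they are placeholders, not the framework's real definitions):

```go
package framework

import "time"

// Assumed per-rollout budget; the real constants live in the e2e framework.
const WaitLong = 5 * time.Minute

// Worst case for a webhook operation as described above: MAO may finish a full
// Deployment rollout and then a full DaemonSet rollout before it returns to the
// webhooks, so the webhook timeout has to cover both, i.e. roughly 10 minutes.
const WaitOverLong = 2 * WaitLong
```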

@JoelSpeed (Contributor) left a comment

Have you tried just extending the wait for the IsFooSynced methods? For Delete and Update, certainly: if they haven't completed within WaitShort I'd be concerned, so I don't think we need to extend those two. Get I'm on the fence about, as it maybe has the potential to be blocked by the rollouts, right?

Comment on lines 23 to 24
// Webhooks may take up to 10 minutes to reconcile, as they could be just changed, but before we know about that,
// a full Deployment (5 minutes) rollout and DaemonSet (another 5 minutes) rollout should occure. Then the
Contributor

I think this will read better like this, wdyt?

Suggested change
- // Webhooks may take up to 10 minutes to reconcile, as they could be just changed, but before we know about that,
- // a full Deployment (5 minutes) rollout and DaemonSet (another 5 minutes) rollout should occure. Then the
+ // Webhooks may take up to 10 minutes to sync. When changed we first have to wait for a full Deployment
+ // rollout (5 minutes) and DaemonSet rollout (another 5 minutes) to occur before they are synced. Then the

Webhooks may take up to 10 minutes to sync. When changed we first have to wait for a full Deployment
rollout (5 minutes) and DaemonSet rollout (another 5 minutes) to occur before they are synced. Then the
process starts all over again. This applies to all webhook operations.
@@ -34,7 +40,12 @@ func GetDeployment(c client.Client, name, namespace string) (*kappsapi.Deploymen

// DeleteDeployment deletes the specified deployment
func DeleteDeployment(c client.Client, deployment *kappsapi.Deployment) error {
Member

Why do we need to wrap this? Now DeleteDeployment is unused. Can't we just have a single waitForDeploymentDeletion?
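For illustration, a minimal sketch of the single helper being suggested, assuming controller-runtime and apimachinery; the function name, poll interval, and package are placeholders:

```go
package framework

import (
	"context"
	"time"

	kappsapi "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const retryShort = 1 * time.Second // placeholder poll interval

// waitForDeploymentDeletion issues the delete and then polls until the object is gone,
// removing the need for a separate DeleteDeployment wrapper.
func waitForDeploymentDeletion(c client.Client, deployment *kappsapi.Deployment, timeout time.Duration) error {
	ctx := context.Background()
	if err := c.Delete(ctx, deployment); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	key := client.ObjectKey{Name: deployment.Name, Namespace: deployment.Namespace}
	return wait.PollImmediate(retryShort, timeout, func() (bool, error) {
		d := &kappsapi.Deployment{}
		err := c.Get(ctx, key, d)
		return apierrors.IsNotFound(err), nil
	})
}
```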

@@ -45,7 +56,13 @@ func DeleteDeployment(c client.Client, deployment *kappsapi.Deployment) error {

// UpdateDeployment updates the specified deployment
func UpdateDeployment(c client.Client, name, namespace string, updated *kappsapi.Deployment) error {
return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {
Member

Why do we need to wrap this? Now UpdateDeployment is unused. Can't we just have a single waitForDeploymentUpdate?

@@ -61,7 +78,12 @@ func UpdateDeployment(c client.Client, name, namespace string, updated *kappsapi

// IsDeploymentAvailable returns true if the deployment has one or more availabe replicas
func IsDeploymentAvailable(c client.Client, name, namespace string) bool {
- if err := wait.PollImmediate(RetryShort, WaitLong, func() (bool, error) {
+ return IsDeploymentAvailableIn(c, name, namespace, WaitLong)
Member

Why do we need to wrap this and have two different functions? Is there any valid case where we want to use a different wait timeout here?

@@ -84,7 +106,14 @@ func IsDeploymentAvailable(c client.Client, name, namespace string) bool {

// IsDeploymentSynced returns true if provided deployment spec matched one found on cluster
func IsDeploymentSynced(c client.Client, dep *kappsapi.Deployment, name, namespace string) bool {
Member

Why this wrap? IsDeploymentSynced is unused now.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2020
@michaelgugino (Contributor) left a comment

/hold

I have no idea what this PR is trying to do. It either needs to link to a BZ with great detail, or the first comment and commit message need to describe exactly the conditions we're trying to resolve. Making some tests run in serial: I have no idea why we're doing that, and I'm not going to try to determine the context from all the comments and code changes. We need clear commit messages describing the purpose of things.

@Danil-Grigorev changed the title from "Make webhook tests serial" to "Set 10 minute timeout on webhook and deployment operations" on Aug 10, 2020
@Danil-Grigorev (Contributor Author)

> I have no idea what this PR is trying to do. It either needs to link to a BZ with great detail, or the first comment and commit message need to describe exactly the conditions we're trying to resolve. Making some tests run in serial: I have no idea why we're doing that, and I'm not going to try to determine the context from all the comments and code changes. We need clear commit messages describing the purpose of things.

Sorry, the PR description was outdated. Right now the PR is simply increasing timeouts on MAO operations, as multiple CI jobs failed recently because of that. I added the links and an explanation of why it will take 10 minutes for each to complete.

@JoelSpeed (Contributor)

@Danil-Grigorev Did you see #183 (review)? I don't think I got a response, but I think it's still relevant.

Thanks for updating the PR description and adding those links; they will help us work out why we did this 6 months down the line when we are doing some archaeology, I'm sure! Nit: the 4th link seems to be an unrelated failure, PTAL.

I wonder if we should try, as an alternative (or maybe complement) to this, refactoring the MAO so that it applies the Deployment, DaemonSet and webhooks, and then performs the waiting process. This would reduce the waiting time since the Deployment and DaemonSet would be rolling out in parallel. If we try this idea first, we may not even need this PR, right? The tests may consistently pass in the allowed time and we would have done some optimisation to the product as well :D
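A hedged sketch of that ordering, not MAO's actual sync code; the apply and wait steps are passed in as placeholder functions:

```go
package operator

import "context"

// syncAll sketches the suggested ordering: apply the Deployment, DaemonSet and webhook
// configurations first so their rollouts progress concurrently, then block waiting on
// each of them. The apply and wait steps are injected because MAO's real sync helpers
// are not reproduced here.
func syncAll(ctx context.Context, apply, waitFor []func(context.Context) error) error {
	// Kick off every rollout before waiting on any of them.
	for _, step := range apply {
		if err := step(ctx); err != nil {
			return err
		}
	}
	// Total waiting time now approaches the slowest rollout rather than the sum of all of them.
	for _, step := range waitFor {
		if err := step(ctx); err != nil {
			return err
		}
	}
	return nil
}
```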

@openshift-ci-robot

@Danil-Grigorev: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-operator | fef754e | link | /test e2e-aws-operator |
| ci/prow/e2e-azure-operator | fef754e | link | /test e2e-azure-operator |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@michaelgugino (Contributor)

> to this, refactoring the MAO so that it applies the Deployment, Daemonset and webhooks, and then performs the waiting process. This would reduce the time waiting since the deployment and daemonset would be rolling out in parallel. If we try this idea first, we may not even need this PR right? The tests may consistently pass in the allowed time and we have done some optimisation to the product as well :D

I endorse this suggestion.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 8, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 8, 2020
@Danil-Grigorev (Contributor Author)

/close

This is no longer relevant, as the issue with webhooks was resolved by openshift/machine-api-operator#707 and later superseded by the upstream implementation, openshift/machine-api-operator#642.

@openshift-ci-robot

@Danil-Grigorev: Closed this PR.

In response to this:

> /close
>
> This is no longer relevant, as the issue with webhooks was resolved by openshift/machine-api-operator#707 and later superseded by the upstream implementation, openshift/machine-api-operator#642.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
