Set 10 minute timeout on webhook and deployment operations #183
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
Force-pushed from 980be25 to fda5f57 (compare)
/retest
pkg/framework/webhooks.go
Outdated
@@ -81,7 +81,7 @@ func DeleteMutatingWebhookConfiguration(c client.Client, webhookConfiguraiton *a

 // UpdateMutatingWebhookConfiguration updates the specified mutating webhook configuration
 func UpdateMutatingWebhookConfiguration(c client.Client, updated *admissionregistrationv1.MutatingWebhookConfiguration) error {
-	return wait.PollImmediate(RetryShort, WaitShort, func() (bool, error) {
+	return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {
can you elaborate on what's motivating this change? can you share CI failure links that support that motivation and that this change would mitigate?
pkg/framework/webhooks.go
Outdated
@@ -97,7 +97,7 @@ func UpdateMutatingWebhookConfiguration(c client.Client, updated *admissionregis

 // UpdateValidatingWebhookConfiguration updates the specified mutating webhook configuration
 func UpdateValidatingWebhookConfiguration(c client.Client, updated *admissionregistrationv1.ValidatingWebhookConfiguration) error {
-	return wait.PollImmediate(RetryShort, WaitShort, func() (bool, error) {
+	return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {
can you elaborate on what's motivating this change? can you share CI failure links that support that motivation and that this change would mitigate?
Here is one I noticed - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/183/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290937797384867840
But shortly, the workflow I imagine is happening:
- tests are running in parallel
- webhook is deleted
- MAO is busy on other things and does not yet know (1-3 minutes); with the inclusion of BUG 1859221: Wait for resources to roll out on every sync machine-api-operator#651 it is now up to 10 minutes (https://github.com/openshift/machine-api-operator/blob/6e867bf92c3ec296c4c93d73abd93b47af9426a3/pkg/operator/sync.go#L31 + https://github.com/openshift/machine-api-operator/blob/6e867bf92c3ec296c4c93d73abd93b47af9426a3/pkg/operator/sync.go#L28)

Probably this needs to be increased up to WaitLong. But for now it is just mirroring the deployment logic, which works fine:

return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {
mm, that link happened running against this PR... And the failure is for "API operator deployment should [Serial] maintains deployment spec", so I'm not really sure how it's related to this change at all.
Those links are timeouts; if they match, then something went wrong with the operator and no test will pass. Some time to run the pod plus the 3 min to check permanent availability should be the norm.
Functions like UpdateValidatingWebhookConfiguration are atomic; the only reason they have a PollImmediate atm is to mitigate transient errors, and they should probably stop doing that: assume there won't be any, and if there is, we fail the test. If something needs to wait for a good reason, please create a WaitBlah function.
That's why enabling the Serial keyword is necessary here. It failed on this line, and that's the case I described:
Expect(framework.UpdateDeployment(client, maoManagedDeployment, framework.MachineAPINamespace, changedDeployment)).NotTo(HaveOccurred())
But shortly, the workflow I imagine is happening:
- tests are running in parallel
- webhook is deleted
- MAO is busy on other things and does not yet know (1-3 minutes); with the inclusion of [openshift/machine-api-operator#651](https://github.com/openshift/machine-api-operator/pull/651) it is now up to 10 minutes (https://github.com/openshift/machine-api-operator/blob/6e867bf92c3ec296c4c93d73abd93b47af9426a3/pkg/operator/sync.go#L31 + https://github.com/openshift/machine-api-operator/blob/6e867bf92c3ec296c4c93d73abd93b47af9426a3/pkg/operator/sync.go#L28)
In order to enable this, we need to merge #186 first.
@@ -14,7 +14,7 @@ var (
 	maoManagedDeployment = "machine-api-controllers"
 )

-var _ = Describe("[Feature:Operators] Machine API operator deployment should", func() {
+var _ = Describe("[Feature:Operators] Machine API operator deployment should [Serial]", func() {
Does naming these as [serial] make them run as serial?
not, specifically. these are labels inspired by the kubernetes community[0]
@@ -14,7 +14,7 @@ var (
 	maoManagedDeployment = "machine-api-controllers"
 )

-var _ = Describe("[Feature:Operators] Machine API operator deployment should", func() {
+var _ = Describe("[Feature:Operators] Machine API operator deployment should [Serial]", func() {
is this something ginkgo just understands? can you reference the docs?
@enxebre @JoelSpeed You are right, not at the moment. Here is a separate PR for serial support: #186
these labels are inspired by the kubernetes community docs[0]. ginkgo doesn't know them, but i have started to clean them up in the autoscaler tests as i think it helps people understand how to operate the tests.
additionally, as demonstrated in Danil's PR, they can be used to filter the tests.
Out of interest, do you have any estimate how much time this might add to each test run by making these serial? Do you think there are any tests within here that can still run in parallel? We could maybe divide them into parallel and serial sets
Force-pushed from e943d39 to 47dd14d (compare)
@@ -66,6 +66,10 @@ var _ = Describe("[Feature:Operators] Machine API operator deployment should", f
 	Expect(framework.IsDeploymentAvailable(client, maoManagedDeployment, framework.MachineAPINamespace)).To(BeTrue())

 	})
 })
can you help me understand what this is solving? All the IT validations within this describe still run in parallel right?
Can you share CI failures/flakes this PR would solve?
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/184/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290284947541594112
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/184/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-gcp-operator/1290284947608702976
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/180/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-azure-operator/1286272804777365504
- Guess this one as well: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/180/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-gcp-operator/1286272804827697152
Anything could manipulate MAO-owned resources, and MAO should still be able to react to that appropriately.
The tests have passed once for the latest commit. Since the primary purpose of this PR is to improve stability, we need to verify that it will repeatedly pass. /test e2e-aws-operator
Those PRs are failing CI without this change:
It'd be better to link the failing jobs rather than the PRs, otherwise the links will become meaningless in time. Can you please answer this: #183 (comment). For https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/184/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1290284947541594112, have you dug into the details? Wouldn't we just need to increase the timeout for
Here are some CI jobs: #183 (comment) #183 (comment) - worth a BZ to work out. Reaction time is sometimes really slow.
/retest
1 similar comment
/retest
An explanation of why those tests are still failing: the actual timeout for webhook operations should be a sum of timeouts for … The reason being: we expect something at some moment. This moment just passed, and now MAO rolls out the …
Force-pushed from 125196a to 57ebec5 (compare)
Have you tried just extending the wait for the IsFooSynced methods? For the Delete and Update, certainly: if they haven't completed within WaitShort I'd be concerned, so I don't think we need to extend those two. Get I'm on the fence about, as this maybe has the potential to be blocked by the rollouts, right?
pkg/framework/webhooks.go
Outdated
// Webhooks may take up to 10 minutes to reconcile, as they could be just changed, but before we know about that,
// a full Deployment (5 minutes) rollout and DaemonSet (another 5 minutes) rollout should occure. Then the
I think this will read better like this, wdyt?
-// Webhooks may take up to 10 minutes to reconcile, as they could be just changed, but before we know about that,
-// a full Deployment (5 minutes) rollout and DaemonSet (another 5 minutes) rollout should occure. Then the
+// Webhooks may take up to 10 minutes to sync. When changed we first have to wait for a full Deployment
+// rollout (5 minutes) and DaemonSet rollout (another 5 minutes) to occur before they are synced. Then the
Webhooks may take up to 10 minutes to sync. When changed we first have to wait for a full Deployment rollout (5 minutes) and DaemonSet rollout (another 5 minutes) to occur before they are synced. Then the process starts all over again. This applies to all webhook operations.
Force-pushed from 57ebec5 to c401014 (compare)
pkg/framework/deployment.go
Outdated
@@ -34,7 +40,12 @@ func GetDeployment(c client.Client, name, namespace string) (*kappsapi.Deploymen

 // DeleteDeployment deletes the specified deployment
 func DeleteDeployment(c client.Client, deployment *kappsapi.Deployment) error {
why do we need to wrap this? now DeleteDeployment is unused. Can't we just have a single waitForDeploymentDeletion?
@@ -45,7 +56,13 @@ func DeleteDeployment(c client.Client, deployment *kappsapi.Deployment) error {

 // UpdateDeployment updates the specified deployment
 func UpdateDeployment(c client.Client, name, namespace string, updated *kappsapi.Deployment) error {
 	return wait.PollImmediate(RetryShort, WaitMedium, func() (bool, error) {
why do we need to wrap this? now UpdateDeployment is unused. Can't we just have a single waitForDeploymentUpdate?
pkg/framework/deployment.go
Outdated
@@ -61,7 +78,12 @@ func UpdateDeployment(c client.Client, name, namespace string, updated *kappsapi

 // IsDeploymentAvailable returns true if the deployment has one or more availabe replicas
 func IsDeploymentAvailable(c client.Client, name, namespace string) bool {
-	if err := wait.PollImmediate(RetryShort, WaitLong, func() (bool, error) {
+	return IsDeploymentAvailableIn(c, name, namespace, WaitLong)
why do we need to wrap this and have two different functions? is there any valid case where we want to use a different wait timeout here?
pkg/framework/deployment.go
Outdated
@@ -84,7 +106,14 @@ func IsDeploymentAvailable(c client.Client, name, namespace string) bool { | |||
|
|||
// IsDeploymentSynced returns true if provided deployment spec matched one found on cluster | |||
func IsDeploymentSynced(c client.Client, dep *kappsapi.Deployment, name, namespace string) bool { |
why this wrap? IsDeploymentSynced is unused now.
/hold
I have no idea what this PR is trying to do. It either needs to link to a BZ with great detail, or the first comment and commit message describe exactly the conditions we're trying to resolve. Making some tests run in serial, I have no idea why we're doing that, I'm not going to try to determine the context from all the comments and code changes. We need clear commit messages describing the purpose of things.
Force-pushed from c401014 to fef754e (compare)
Sorry, the PR description was outdated. Right now the PR simply increases timeouts on MAO operations, as multiple CI jobs failed recently due to that. I added the links and an explanation of why it will take 10 minutes for each to complete.
@Danil-Grigorev Did you see #183 (review)? Don't think I got a response, but I think it's still relevant. Thanks for updating the PR description and adding those links; they will help us work out why we did this 6 months down the line when we are doing some archaeology, I'm sure! Nit: the 4th link seems to be an unrelated failure, PTAL. I wonder if we should try, as an alternative (or maybe complement) to this, refactoring the MAO so that it applies the Deployment, DaemonSet and webhooks, and then performs the waiting process. This would reduce the time waiting since the Deployment and DaemonSet would be rolling out in parallel. If we try this idea first, we may not even need this PR, right? The tests may consistently pass in the allowed time and we have done some optimisation to the product as well :D
@Danil-Grigorev: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
I endorse this suggestion.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting … If this issue is safe to close now please do so. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting … If this issue is safe to close now please do so. /lifecycle rotten
/close
This is no longer relevant, as the issue with webhooks was resolved by openshift/machine-api-operator#707 and later substituted with an upstream implementation: openshift/machine-api-operator#642
@Danil-Grigorev: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Webhooks may take up to 10 minutes to sync. When changed we first have to wait for a full Deployment rollout (5 minutes) and DaemonSet rollout (another 5 minutes) to occur before they are synced. Then the process starts all over again. This applies to all webhook operations. For simplicity, the same applies to the machine-api-controllers Deployment, which may take up to 5 minutes to successfully reconcile, or even be re-provisioned after deletion.
This PR is a response to multiple CI failures: