Extend timeout before breaking node in autoscaling test #71362
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: aleksandra-malinowska. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
/sig autoscaling
```diff
@@ -881,6 +881,10 @@ var _ = SIGDescribe("Cluster size autoscaling [Slow]", func() {
 		clusterSize = manuallyIncreaseClusterSize(f, originalSizes)
 	}

+	// If new nodes are disconnected too soon, they'll be considered not started
+	// instead of unready.
+	time.Sleep(2 * time.Minute)
```
Sleeps with hardcoded values are rarely the best choice in tests. Can we do this in a more robust way?
If a node's ready condition transitions less than 2 minutes after it was created, Cluster Autoscaler will consider this node as upcoming (= not yet started), not as unready. I don't see an easy way to work around this without fixing how node readiness works across the system. Assuming this is how we want to handle readiness for now, the autoscaler's behavior is correct: we don't want to back off from scaling the cluster just because we've added some nodes. Before, this test probably worked because nodes started slowly (so by the time they were ready, they were older than 2 minutes).
What do you propose?
I suggest expanding the comment explaining why we are doing this. And please explain why a 2-minute sleep is more than enough to avoid flakes.
Done
(force-pushed from af8ccf4 to 2f278a3)
/lgtm
@aleksandra-malinowska please remove hold once you confirm with the release team that they are ok with the merge.
/lgtm
```diff
@@ -881,6 +881,19 @@ var _ = SIGDescribe("Cluster size autoscaling [Slow]", func() {
 		clusterSize = manuallyIncreaseClusterSize(f, originalSizes)
 	}

+	// If new nodes are disconnected too soon, they'll be considered not started
+	// instead of unready, and cluster won't be considered unhealthy.
```
healthy
We want the cluster to become unhealthy in this scenario, to test back-off from adding more nodes.
@aleksandra-malinowska a couple of questions:
These tests are blocking the Cluster Autoscaler release. Since the default GCE config lives in this repo under /cluster/gce and is bundled with the Kubernetes release, we need to release this component just before the Kubernetes release and update the image at the last moment. Unfortunately, our tests also happen to live in this repo. If we're no longer allowed to fix them, we won't get a test signal for Cluster Autoscaler, and won't be sure it works on 1.13. This is far from ideal, but moving tests out of this repository will take time, and I don't know if there are even any plans to do it with the GCE config.
Sounds good, please go ahead with the merge then.
/priority critical-urgent
Thanks!
/hold cancel
This will hopefully fix

[sig-autoscaling] Cluster size autoscaling [Slow] Shouldn't perform scale up operation and should list unhealthy status if most of the cluster is broken [Feature:ClusterSizeAutoscalingScaleUp]

by extending the timeout between creating nodes and disconnecting them. If the nodes are broken too soon, Cluster Autoscaler considers them upcoming and doesn't back off from scaling the cluster.

Testgrid: https://k8s-testgrid.appspot.com/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling
/assign @mwielgus
/kind failing-test