Extend timeout before breaking node in autoscaling test #71362
@@ -881,6 +881,19 @@ var _ = SIGDescribe("Cluster size autoscaling [Slow]", func() {
	clusterSize = manuallyIncreaseClusterSize(f, originalSizes)
}

// If new nodes are disconnected too soon, they'll be considered not started
// instead of unready, and cluster won't be considered unhealthy.
//
// More precisely, Cluster Autoscaler compares last transition time of
// several readiness conditions to node create time. If it's within
// 2 minutes, it'll assume node is just starting and not unhealthy.
//
// Nodes become ready in less than 1 minute after being created,
// so waiting extra 2 minutes before breaking them (which triggers
// readiness condition transition) should be sufficient, while
// making no assumptions about minimal node startup time.
time.Sleep(2 * time.Minute)
Review discussion on the added sleep:

Reviewer: Sleeps with hardcoded values are rarely the best choice in tests. Can we do this in a more robust way?

Author: If a node's ready condition transitions less than 2 minutes after it was created, Cluster Autoscaler will consider the node as upcoming (= not yet started), not as unready. I don't see an easy way to work around this without fixing how node readiness works across the system. Assuming this is how we want to handle readiness for now, the autoscaler's behavior is correct: we don't want to back off from scaling the cluster just because we've added some nodes. Before this change, the test probably worked because nodes started slowly (so by the time they were ready, they were older than 2 minutes). What do you propose?

Reviewer: I suggest expanding the comment explaining why we are doing this. And please explain why a 2-minute sleep is more than enough to avoid flakes.

Author: Done.
By("Block network connectivity to some nodes to simulate unhealthy cluster")
nodesToBreakCount := int(math.Ceil(math.Max(float64(unhealthyClusterThreshold), 0.5*float64(clusterSize))))
nodes, err := f.ClientSet.CoreV1().Nodes().List(metav1.ListOptions{FieldSelector: fields.Set{
Review discussion on this block:

Reviewer: healthy

Author: We want the cluster to become unhealthy in this scenario, to test back-off from adding more nodes.