
Extend timeout before breaking node in autoscaling test #71362

Merged
13 changes: 13 additions & 0 deletions test/e2e/autoscaling/cluster_size_autoscaling.go
@@ -881,6 +881,19 @@ var _ = SIGDescribe("Cluster size autoscaling [Slow]", func() {
clusterSize = manuallyIncreaseClusterSize(f, originalSizes)
}

// If new nodes are disconnected too soon, they'll be considered not started
// instead of unready, and cluster won't be considered unhealthy.

Contributor

healthy

Contributor Author

We want the cluster to become unhealthy in this scenario, to test back-off from adding more nodes.

//
// More precisely, Cluster Autoscaler compares last transition time of
// several readiness conditions to node create time. If it's within
// 2 minutes, it'll assume node is just starting and not unhealthy.
//
// Nodes become ready in less than 1 minute after being created,
// so waiting extra 2 minutes before breaking them (which triggers
// readiness condition transition) should be sufficient, while
// making no assumptions about minimal node startup time.
time.Sleep(2 * time.Minute)

Contributor

Sleeps with hardcoded values are rarely the best choice in tests. Can we do this in a more robust way?

Contributor Author

If a node's ready condition transitions less than 2 minutes after the node was created, Cluster Autoscaler will consider the node upcoming (= not yet started), not unready. I don't see an easy way to work around this without fixing how node readiness works across the system. Assuming this is how we want to handle readiness for now, the autoscaler's behavior is correct - we don't want to back off from scaling the cluster just because we've added some nodes. Before, this test probably worked because nodes started slowly (so by the time they were ready, they were older than 2 minutes).
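
For reference, a minimal sketch of the kind of check I'm describing; the names (maxNodeStartupTime, isUpcoming) are hypothetical, and the real Cluster Autoscaler logic differs in detail:

// Hypothetical sketch of the startup-grace check described above; not the
// actual Cluster Autoscaler code.
package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// maxNodeStartupTime is the assumed 2-minute grace period after node
// creation during which readiness transitions are attributed to startup.
const maxNodeStartupTime = 2 * time.Minute

// isUpcoming reports whether the node's Ready condition transitioned so
// soon after creation that the node would be treated as still starting
// (upcoming) rather than unready.
func isUpcoming(node *v1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			sinceCreate := cond.LastTransitionTime.Time.Sub(node.CreationTimestamp.Time)
			return sinceCreate < maxNodeStartupTime
		}
	}
	// No Ready condition recorded yet: the node has not started.
	return true
}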

What do you propose?

Contributor

I suggest expanding the comment explaining why we are doing this. And please explain why a 2-minute sleep is more than enough to avoid flakes.

Contributor Author

Done


By("Block network connectivity to some nodes to simulate unhealthy cluster")
nodesToBreakCount := int(math.Ceil(math.Max(float64(unhealthyClusterThreshold), 0.5*float64(clusterSize))))
nodes, err := f.ClientSet.CoreV1().Nodes().List(metav1.ListOptions{FieldSelector: fields.Set{
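
(For illustration, assuming a hypothetical unhealthyClusterThreshold of 4 and a clusterSize of 6, this breaks ceil(max(4, 3)) = 4 nodes, enough for the cluster to be reported unhealthy; the actual threshold value is defined elsewhere in the test file.)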