test/openshift/e2e: Smoke test for scale down #32
Conversation
/hold
As we're only concerned with delta values when validating the size of a machine set and the resultant number of nodes, we only need to consider one additional node. This commit reduces MaxReplicas from 12 => 2.
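A minimal sketch of what that bound change amounts to; the constant names here are illustrative, not the test's actual identifiers:

```go
// Hypothetical test bounds; names are illustrative, not the repo's identifiers.
const (
	scalingMinReplicas = 1 // steady-state size of the machine set
	scalingMaxReplicas = 2 // one extra node is enough to observe the delta
)
```

Keeping the ceiling one above the floor keeps each scale-up/scale-down cycle short while still exercising both paths.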
Force-pushed from e80dd46 to ef88007
/hold cancel
This PR would also benefit from openshift/cluster-autoscaler-operator#37
/lgtm
/approve
```go
if err != nil {
	return err
}
// As we have just deleted the workload the autoscaler will
```
I'm not sure if deleting the workload (job) will remove the pods as well; if not, the autoscaler will still try to allocate new resources instead of scaling down.
I believe it does. That's how I've done 100% of my manual testing over the last few months.
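For reference, a minimal sketch of the deletion step under discussion, assuming a standard client-go clientset (the function and variable names are illustrative); a foreground propagation policy guarantees the job's pods are deleted along with the job, so the autoscaler sees the load disappear rather than trying to reschedule it:

```go
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteWorkload removes the batch job and, via foreground cascading
// deletion, its pods, leaving the autoscaler nothing left to schedule.
func deleteWorkload(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	policy := metav1.DeletePropagationForeground
	return client.BatchV1().Jobs(namespace).Delete(ctx, name, metav1.DeleteOptions{
		PropagationPolicy: &policy,
	})
}
```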
/refresh
/retest
We were adjusting the replica count while the cluster-autoscaler was still running, which meant that the test would occasionally flake. It is likely that this bug was introduced in PR #32. For example, you would occasionally see the following:

```console
I0214 12:51:41.034700 1 scale_up.go:584] Scale-up: setting group size to 3
I0214 12:51:51.129943 1 scale_up.go:584] Scale-up: setting group size to 3
```

Between these two calls we were adjusting the replica count in an attempt to clean up at the end of the test. But occasionally the autoscaler would do its scan of the state of the cluster and would add new nodes because the replica count was less than desired. When this condition occurred, the node count could never drop below the initial node count, as we had just added a further max - min nodes. The test would eventually time out trying to assert that the node count matched the initial node count.

The fix here is to not reset the replica count but instead rely on the autoscaler to scale down and adjust the replica count naturally; this change further helps to verify that scale down is working properly.

There are additional smaller fixes here too:

- we set cascading delete in the batch job (i.e., workload)
- we assert that the replica count == the initial replica count
- we explicitly set the clusterautoscaler's ScaleDown config (see the sketch after this list)
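A hedged sketch of that last item, written against the operator's v1 API types as I recall them (the exact field names and the short delays are assumptions, tuned for test speed rather than production):

```go
package e2e

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	autoscalingv1 "github.com/openshift/cluster-autoscaler-operator/pkg/apis/autoscaling/v1"
)

// strPtr is a small helper for the optional duration fields.
func strPtr(s string) *string { return &s }

// clusterAutoscaler builds a ClusterAutoscaler with ScaleDown set
// explicitly instead of relying on defaults; field names follow the
// operator's v1 API as I recall them, so treat this as illustrative.
func clusterAutoscaler() *autoscalingv1.ClusterAutoscaler {
	return &autoscalingv1.ClusterAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "default"},
		Spec: autoscalingv1.ClusterAutoscalerSpec{
			ScaleDown: &autoscalingv1.ScaleDownConfig{
				Enabled:           true,
				DelayAfterAdd:     strPtr("10s"),
				DelayAfterDelete:  strPtr("10s"),
				DelayAfterFailure: strPtr("10s"),
				UnneededTime:      strPtr("10s"),
			},
		},
	}
}
```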
Reworked the e2e test to delete the workload so that the autoscaler scales down and adjusts the replica count naturally.
The smoke test asserts that the node count and the replica count both return to their initial values.
I also reduced the number of replicas we spin up from 12 => 2 as we're only testing for a delta of 1.
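A minimal sketch of how that final assertion might look, assuming a client-go clientset and an `initialNodeCount` captured before the workload was created (names and timeouts are illustrative):

```go
package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForScaleDown polls until the autoscaler has drained the extra node
// and the cluster is back at its pre-workload size.
func waitForScaleDown(ctx context.Context, client kubernetes.Interface, initialNodeCount int) error {
	return wait.PollImmediate(10*time.Second, 20*time.Minute, func() (bool, error) {
		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, nil // tolerate transient API errors and retry
		}
		return len(nodes.Items) == initialNodeCount, nil
	})
}
```

Polling rather than asserting once matters here: scale down only happens after the autoscaler's unneeded-time window elapses, so the test has to wait for the node count to converge instead of resetting the replica count itself.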