
test/openshift/e2e: Smoke test for scale down #32

Merged

Conversation

@frobware frobware commented Feb 8, 2019

Reworked the e2e test to delete the workload. As a result:

  • all workload-based pods are deleted
  • the autoscaler notices that there are nodes without any utilisation
  • the autoscaler scales down

The smoke test asserts that:

  • the MachineSet's replica count drops to its original value
  • the number of nodes in the cluster drops back to the pre-scale-up count

I also reduced the number of replicas we spin up from 12 => 2, as we're only testing for a delta of 1.
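For orientation, here is a minimal sketch of that flow. It assumes hypothetical helpers: waitForScaleDown, deleteWorkload, machineSetReplicas, nodeCount and the polling interval/timeout are illustrative placeholders, not the actual functions in this repository.

```go
package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForScaleDown deletes the workload and then waits until both the
// MachineSet replica count and the cluster node count return to their
// pre-scale-up values. The accessor funcs are passed in to keep this
// sketch self-contained; they stand in for the test's real helpers.
func waitForScaleDown(
	deleteWorkload func() error,
	machineSetReplicas func() (int32, error),
	nodeCount func() (int, error),
	initialReplicas int32,
	initialNodes int,
) error {
	if err := deleteWorkload(); err != nil {
		return err
	}
	return wait.PollImmediate(10*time.Second, 15*time.Minute, func() (bool, error) {
		replicas, err := machineSetReplicas()
		if err != nil {
			return false, nil // tolerate transient API errors and retry
		}
		nodes, err := nodeCount()
		if err != nil {
			return false, nil
		}
		// Done once both values match their pre-scale-up counts.
		return replicas == initialReplicas && nodes == initialNodes, nil
	})
}
```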

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 8, 2019

frobware commented Feb 8, 2019

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 8, 2019
As we're only concerned with delta values when validating the size of
a machine set and the resultant number of nodes, we only need to
consider one additional node. This commit reduces MaxReplicas from
12 => 2.

frobware commented Feb 8, 2019

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 8, 2019

frobware commented Feb 8, 2019

This PR would also benefit from openshift/cluster-autoscaler-operator#37

frobware commented Feb 8, 2019

/cc @ingvagabund @bison @enxebre

@bison bison left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 8, 2019
@ingvagabund (Member)

/approve

```go
if err != nil {
	return err
}
// As we have just deleted the workload the autoscaler will
```
A reviewer (Member) commented on this hunk:

I'm not sure whether deleting the workload (job) will remove the pods as well; if not, the autoscaler will still try to allocate new resources instead of scaling down.

@frobware frobware (Author) replied Feb 8, 2019:

I believe it does. That's how I've done 100% of my manual testing over the last few months.
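Whether deleting a Job also removes its pods depends on the deletion propagation policy; a follow-up commit referenced further down explicitly sets cascading delete on the batch job. A minimal sketch of a cascading delete, assuming a current client-go (the Delete signature differed in 2019-era client libraries):

```go
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteJobCascading deletes a batch Job and asks the API server to
// cascade the deletion to the Job's pods. Without an explicit
// propagation policy, orphaned pods could keep the autoscaler busy
// instead of letting it scale down.
func deleteJobCascading(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	policy := metav1.DeletePropagationForeground
	return client.BatchV1().Jobs(namespace).Delete(ctx, name, metav1.DeleteOptions{
		PropagationPolicy: &policy,
	})
}
```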

frobware commented Feb 8, 2019

/refresh

frobware commented Feb 8, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit 0261cba into openshift:master Feb 8, 2019
openshift-merge-robot pushed a commit that referenced this pull request Feb 15, 2019
We were adjusting the replica count while the cluster-autoscaler was
still running, which meant that the test would occasionally flake. It
is likely that this bug was introduced in PR #32.

For example, you would occasionally see the following:

```console
I0214 12:51:41.034700 1 scale_up.go:584] Scale-up: setting group size to 3
I0214 12:51:51.129943 1 scale_up.go:584] Scale-up: setting group size to 3
```

Between these two calls we were adjusting the replica count in an
attempt to clean up at the end of the test. But occasionally the
autoscaler would run its scan of the state of the cluster and add new
nodes because the replica count was less than desired. When this
condition occurred the node count could never drop back to the
initial node count, as we had just added a further max-min nodes. The
test would eventually time out trying to assert that the node count
matched the initial node count.

The fix here is to not reset the replica count but instead rely on the
autoscaler to scale down and adjust the replica count naturally; this
change further helps to verify that scale down is working properly.

There are additional smaller fixes here too:

- we set cascading delete in the batch job (i.e., workload)
- we assert that the replica count == the initial replica count
- we explicitly set the ClusterAutoscaler's ScaleDown config (a sketch of these settings follows)
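For reference, the last bullet refers to the ClusterAutoscaler resource's spec.scaleDown settings. The snippet below is only an illustrative sketch: the field names mirror that stanza, but the Go type and the example durations are assumptions, not the operator's API package or the values the test actually uses.

```go
package e2e

// scaleDownConfig is an illustrative stand-in for the ClusterAutoscaler
// resource's spec.scaleDown stanza; durations are strings such as "10s".
type scaleDownConfig struct {
	Enabled           bool
	DelayAfterAdd     string
	DelayAfterDelete  string
	DelayAfterFailure string
	UnneededTime      string // how long a node must be unneeded before removal
}

// Short delays keep an e2e run fast: the extra node is removed soon
// after the workload's pods are gone. (Example values only.)
var e2eScaleDown = scaleDownConfig{
	Enabled:           true,
	DelayAfterAdd:     "10s",
	DelayAfterDelete:  "10s",
	DelayAfterFailure: "10s",
	UnneededTime:      "10s",
}
```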
@frobware frobware deleted the smoke-test-for-scale-down branch March 22, 2019 06:22
openshift-merge-robot pushed a commit that referenced this pull request Apr 4, 2019
…a count
