Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend timeout before breaking node in autoscaling test #71362

Merged

Conversation

aleksandra-malinowska
Copy link
Contributor

@aleksandra-malinowska aleksandra-malinowska commented Nov 22, 2018

This will hopefully fix [sig-autoscaling] Cluster size autoscaling [Slow] Shouldn't perform scale up operation and should list unhealthy status if most of the cluster is broken[Feature:ClusterSizeAutoscalingScaleUp] by extending timeout between creating nodes and disconnecting them. If the nodes are broken too soon, Cluster Autoscaler considers them upcoming and doesn't back-off from scaling cluster.

Testgrid: https://k8s-testgrid.appspot.com/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling

/assign @mwielgus
/kind failing-test

NONE

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Nov 22, 2018
@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 22, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aleksandra-malinowska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 22, 2018
@aleksandra-malinowska
Copy link
Contributor Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Nov 22, 2018
@@ -881,6 +881,10 @@ var _ = SIGDescribe("Cluster size autoscaling [Slow]", func() {
clusterSize = manuallyIncreaseClusterSize(f, originalSizes)
}

// If new nodes are disconnected too soon, they'll be considered not started
// instead of unready.
time.Sleep(2 * time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sleeps with hardcoded values are rarely the best choice in tests. Can we do in a more robust way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If node's ready condition transitions less than 2 minutes after it was created, Cluster Autoscaler will consider this node as upcoming (= not yet started), not as unready. I don't see an easy way to work around this without fixing how node readiness works across the system. Assuming this is how we want to handle readiness for now, autoscaler's behavior is correct - we don't want to back-off from scaling cluster just because we've added some nodes. Before, this test probably worked because nodes started slowly (so by the time they were ready, they were older than 2 minutes).

What do you propose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest expanding the comment explaining why we are doing this. And please explain why 2 minute sleep is more than enough to avoid flakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 23, 2018
@mwielgus
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 23, 2018
@mwielgus mwielgus added this to the v1.12 milestone Nov 23, 2018
@mwielgus mwielgus added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2018
@mwielgus mwielgus modified the milestones: v1.12, v1.13 Nov 23, 2018
@mwielgus
Copy link
Contributor

@aleksandra-malinowska please remove hold once you confirm with the release team that they are ok with the merge.

@nikopen
Copy link
Contributor

nikopen commented Nov 23, 2018

/lgtm

@AishSundar

@@ -881,6 +881,19 @@ var _ = SIGDescribe("Cluster size autoscaling [Slow]", func() {
clusterSize = manuallyIncreaseClusterSize(f, originalSizes)
}

// If new nodes are disconnected too soon, they'll be considered not started
// instead of unready, and cluster won't be considered unhealthy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

healthy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want the cluster to become unhealthy in this scenario, to test back-off from adding more nodes

@AishSundar
Copy link
Contributor

@aleksandra-malinowska a couple of ques
(i) Is this test part of any release blocking dashboard? I dont think so, but correct me if wrong
(ii) If not, then any reason you need it fixed right away, during freeze? Are you missing signal on any other feature/ functionality due to this failure?
If its a pure test fix , then can it wait for a couple more days until after freeze?

@aleksandra-malinowska
Copy link
Contributor Author

These tests are blocking Cluster Autoscaler release. Since default GCE config lives in this repo under /cluster/gce and is bundled with Kubernetes release, we need to release this component just before Kubernetes release and update the image at the last moment.

Unfortunately, our tests also happen to live in this repo. If we're no longer allowed to fix them, we won't get test signal for Cluster Autoscaler, and won't be sure it works on 1.13.

This is far from ideal, but moving tests out of this repository will take time, and I don't know if there are even any plans to do it with GCE config.

@AishSundar
Copy link
Contributor

Sounds good please go ahead with the merge then.

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 24, 2018
@aleksandra-malinowska
Copy link
Contributor Author

Thanks!

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 24, 2018
@k8s-ci-robot k8s-ci-robot merged commit 610f48f into kubernetes:master Nov 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants