Extend timeout before breaking node in autoscaling test #71362

aleksandra-malinowska · 2018-11-22T20:28:57Z

This will hopefully fix [sig-autoscaling] Cluster size autoscaling [Slow] Shouldn't perform scale up operation and should list unhealthy status if most of the cluster is broken[Feature:ClusterSizeAutoscalingScaleUp] by extending the timeout between creating nodes and disconnecting them. If the nodes are broken too soon, Cluster Autoscaler considers them upcoming and doesn't back off from scaling the cluster.

Testgrid: https://k8s-testgrid.appspot.com/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling

/assign @mwielgus
/kind failing-test

Release note: NONE

k8s-ci-robot · 2018-11-22T20:31:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aleksandra-malinowska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/e2e/autoscaling/OWNERS~~ [aleksandra-malinowska]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aleksandra-malinowska · 2018-11-22T20:31:41Z

/sig autoscaling

mwielgus · 2018-11-23T12:13:20Z

test/e2e/autoscaling/cluster_size_autoscaling.go

@@ -881,6 +881,10 @@ var _ = SIGDescribe("Cluster size autoscaling [Slow]", func() {
 			clusterSize = manuallyIncreaseClusterSize(f, originalSizes)
 		}

+		// If new nodes are disconnected too soon, they'll be considered not started
+		// instead of unready.
+		time.Sleep(2 * time.Minute)


Sleeps with hardcoded values are rarely the best choice in tests. Can we do this in a more robust way?

If a node's ready condition transitions less than 2 minutes after the node was created, Cluster Autoscaler will consider this node as upcoming (= not yet started), not as unready. I don't see an easy way to work around this without fixing how node readiness works across the system. Assuming this is how we want to handle readiness for now, the autoscaler's behavior is correct: we don't want to back off from scaling the cluster just because we've added some nodes. Before, this test probably worked because nodes started slowly (so by the time they were ready, they were older than 2 minutes).

What do you propose?

I suggest expanding the comment to explain why we are doing this. And please explain why a 2-minute sleep is more than enough to avoid flakes.

mwielgus · 2018-11-23T12:49:19Z

/lgtm

mwielgus · 2018-11-23T12:52:05Z

@aleksandra-malinowska please remove hold once you confirm with the release team that they are ok with the merge.

nikopen · 2018-11-23T16:05:20Z

/lgtm

@AishSundar

AishSundar · 2018-11-23T16:19:22Z

test/e2e/autoscaling/cluster_size_autoscaling.go

@@ -881,6 +881,19 @@ var _ = SIGDescribe("Cluster size autoscaling [Slow]", func() {
 			clusterSize = manuallyIncreaseClusterSize(f, originalSizes)
 		}

+		// If new nodes are disconnected too soon, they'll be considered not started
+		// instead of unready, and cluster won't be considered unhealthy.


We want the cluster to become unhealthy in this scenario, to test back-off from adding more nodes.

AishSundar · 2018-11-23T17:32:57Z

@aleksandra-malinowska a couple of questions:
(i) Is this test part of any release-blocking dashboard? I don't think so, but correct me if I'm wrong.
(ii) If not, is there any reason you need it fixed right away, during freeze? Are you missing signal on any other feature or functionality due to this failure?
If it's a pure test fix, can it wait a couple more days until after freeze?

aleksandra-malinowska · 2018-11-23T18:30:55Z

These tests are blocking Cluster Autoscaler release. Since default GCE config lives in this repo under /cluster/gce and is bundled with Kubernetes release, we need to release this component just before Kubernetes release and update the image at the last moment.

Unfortunately, our tests also happen to live in this repo. If we're no longer allowed to fix them, we won't get test signal for Cluster Autoscaler, and won't be sure it works on 1.13.

This is far from ideal, but moving tests out of this repository will take time, and I don't know if there are even any plans to do it with GCE config.

AishSundar · 2018-11-24T17:45:58Z

Sounds good, please go ahead with the merge then.

/priority critical-urgent

aleksandra-malinowska · 2018-11-24T17:52:29Z

Thanks!

/hold cancel

k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Nov 22, 2018

k8s-ci-robot assigned mwielgus Nov 22, 2018

k8s-ci-robot requested review from DirectXMan12 and jszczepkowski November 22, 2018 20:31

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 22, 2018

k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Nov 22, 2018

mwielgus reviewed Nov 23, 2018

Extend timeout before breaking node in autoscaling test

2f278a3

aleksandra-malinowska force-pushed the autoscaling-test-fix-26 branch from af8ccf4 to 2f278a3 Compare November 23, 2018 12:47

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 23, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 23, 2018

mwielgus added this to the v1.12 milestone Nov 23, 2018

mwielgus added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2018

mwielgus modified the milestones: v1.12, v1.13 Nov 23, 2018

k8s-ci-robot assigned nikopen Nov 23, 2018

AishSundar reviewed Nov 23, 2018

k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 24, 2018

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 24, 2018

k8s-ci-robot merged commit 610f48f into kubernetes:master Nov 24, 2018
