
Failing/Flaking Test: E2E: [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController Should scale from 5 pods to 3 pods and from 3 to 1 and verify decision stability #69444

Closed
jberkus opened this Issue Oct 4, 2018 · 39 comments

@jberkus commented Oct 4, 2018

Test: https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gke-serial&show-stale-tests=

Example: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-serial/7169

The HPA tests for gce-serial are flaking rather regularly; they have failed about 2/3 of the time over the last week and are often the cause of a failed test run. Can we de-flake these?

Over multiple test runs, the problem seems to be the number of replicas unexpectedly jumping up to 4:

Oct  3 11:39:13.153: INFO: ConsumeCPU URL: {https   35.232.126.216 /api/v1/namespaces/e2e-tests-horizontal-pod-autoscaling-qjvjw/services/rc-ctrl/proxy/ConsumeCPU  false durationSec=30&millicores=250&requestSizeMillicores=100 }
Oct  3 11:39:22.808: INFO: expecting there to be 3 replicas (are: 3)
Oct  3 11:39:32.778: INFO: expecting there to be 3 replicas (are: 4)
Oct  3 11:39:32.778: INFO: Unexpected error occurred: number of replicas changed unexpectedly
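
For context, here is a minimal sketch (not the e2e framework's own helper) of how the resource consumer's ConsumeCPU endpoint is reached through the API server proxy, using the query parameters from the log line above; the host, namespace, service name, and the bare http.Post call are illustrative assumptions:

// consumecpu_sketch.go -- a sketch of driving the resource consumer's ConsumeCPU
// endpoint through the API server proxy. Parameter names are taken from the log
// line above; everything else (host, namespace, service, unauthenticated POST)
// is an illustrative assumption.
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func consumeCPUURL(apiServer, namespace, service string, millicores, durationSec, requestSizeMillicores int) string {
	q := url.Values{}
	q.Set("millicores", fmt.Sprint(millicores))
	q.Set("durationSec", fmt.Sprint(durationSec))
	q.Set("requestSizeMillicores", fmt.Sprint(requestSizeMillicores))
	u := url.URL{
		Scheme:   "https",
		Host:     apiServer,
		Path:     fmt.Sprintf("/api/v1/namespaces/%s/services/%s/proxy/ConsumeCPU", namespace, service),
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	// Values copied from the ConsumeCPU log line above.
	u := consumeCPUURL("35.232.126.216", "e2e-tests-horizontal-pod-autoscaling-qjvjw", "rc-ctrl", 250, 30, 100)
	resp, err := http.Post(u, "application/json", nil) // the real test goes through an authenticated client
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("ConsumeCPU responded with", resp.Status)
}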

Is the limit not getting set correctly here? Is this an actual bug?

/sig autoscaling
/priority important-soon
/kind failing-test
/kind flake

@mwielgus commented Oct 5, 2018

@jbartosik commented Oct 9, 2018

Recent failures fall into three categories:

  1. Pods not starting (failure 1, failure 2, failure 3, failure 4, failure 5, failure 6, failure 7). They fail with error message:
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:64
Expected error:
    <*errors.errorString | 0xc0026423d0>: {
        s: "Only 0 pods started out of 5",
    }
    Only 0 pods started out of 5
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:460
  2. Size increasing unexpectedly (failure 1, failure 2, failure 3). They fail with error message:
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:64
Expected error:
    <*errors.errorString | 0xc421627780>: {
        s: "number of replicas changed unexpectedly",
    }
    number of replicas changed unexpectedly
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:125
  3. Timeout waiting for scale down (failure). This one failed with error message:
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:64
timeout waiting 15m0s for 3 replicas
Expected error:
    <*errors.errorString | 0xc4200996a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:123

I suspect 2 and 3 may be caused by the test using the correct amount of CPU when averaged over a minute-long window, but in a spiky way (for example using 100% CPU for part of the minute and 0% for the rest). I saw similar behavior when validating recent changes manually. This would result in HPA:

  • behaving as the test expects before the recent changes (because it ignored any spikes during the scale-up/scale-down forbidden windows, and longer sample windows masked the spike),
  • (correctly) increasing the size if spikes align, after those changes (the failures in group 2; see the sketch below),
  • (correctly) not decreasing the size if spikes align, after those changes (the failure in group 3).
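
For reference, the group 2 scale-up is consistent with the documented HPA rule desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization); a small sketch with illustrative numbers (not the test's actual targets):

// hpa_scaleup_sketch.go -- illustrates, with made-up numbers, how a sampled CPU
// spike can push the HPA recommendation from 3 to 4 replicas even though the
// minute-averaged usage matches the target. The formula is the documented HPA
// rule: desired = ceil(currentReplicas * currentUtilization / targetUtilization).
package main

import (
	"fmt"
	"math"
)

func desiredReplicas(currentReplicas int, currentUtilization, targetUtilization float64) int {
	return int(math.Ceil(float64(currentReplicas) * currentUtilization / targetUtilization))
}

func main() {
	const target = 50.0 // illustrative per-pod target utilization, in percent

	// Usage exactly at the target keeps the current size.
	fmt.Println(desiredReplicas(3, 50.0, target)) // 3

	// A sampled spike (say 60% during part of the window) recommends an extra
	// replica, which shows up as the "number of replicas changed unexpectedly"
	// failure in group 2.
	fmt.Println(desiredReplicas(3, 60.0, target)) // 4
}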

I'll take a look at the code of those tests and try to address this.

Category 1 is responsible for most of the flakes, but I don't have any ideas about its cause yet.

@jberkus commented Oct 9, 2018

/remove-kind failing-test

@jbartosik commented Oct 18, 2018

I was busy this week but I think I'll be able to prepare a fix for this tomorrow.

@AishSundar commented Oct 18, 2018

/milestone v1.13
/kind failing-test

@k8s-ci-robot added this to the v1.13 milestone Oct 18, 2018

@AishSundar commented Oct 18, 2018

This test is failing consistently in

@jbartosik commented Oct 22, 2018

On Friday I verified that the CPU usage generated by the resource consumer stays very close to the target value but oscillates slightly. To fix this I will:

  • Increase the CPU usage in the test (the resource consumer is implemented in a way that suggests deviations from the target are of a fixed size, so a higher target makes the deviation a smaller percentage of the target).
  • Lower the generated load (the test currently targets the border between a recommendation of 3 instances and 4 instances; I will change the load to something between 2 and 3 instances, as in the sketch below).
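
A rough sketch of the arithmetic behind that second point, using illustrative numbers rather than the test's real request and target values:

// load_tuning_sketch.go -- illustrative arithmetic (made-up targets, not the
// test's real values) for why a load sitting right on a replica boundary flakes
// while a load between two boundaries does not. For an average-value target the
// documented HPA rule is desired = ceil(totalUsage / perPodTarget).
package main

import (
	"fmt"
	"math"
)

func recommended(totalMillicores, perPodTargetMillicores float64) int {
	return int(math.Ceil(totalMillicores / perPodTargetMillicores))
}

func main() {
	const perPodTarget = 100.0 // illustrative per-pod target, in millicores

	// On the 3/4 boundary: 300m recommends exactly 3, but a +10m wobble recommends 4.
	fmt.Println(recommended(300, perPodTarget), recommended(310, perPodTarget)) // 3 4

	// Load worth "between 2 and 3 instances": the same ±10m wobble still recommends 3.
	fmt.Println(recommended(240, perPodTarget), recommended(260, perPodTarget)) // 3 3
}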

I'm checking if this helps the test.

@jbartosik commented Oct 23, 2018

It seems to have helped. I've sent a PR with those changes for review.

@jberkus commented Oct 23, 2018

Re-opening until we have at least 3 green runs:

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-cos-master-serial

/reopen

@k8s-ci-robot commented Oct 23, 2018

@jberkus: Reopening this issue.

In response to this:

Re-opening because we just had another flake last night:

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-cos-master-serial

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this Oct 23, 2018

@jbartosik commented Oct 24, 2018

But it doesn't work on the testgrid; the test now fails with:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:64
Expected error:
    <*errors.errorString | 0xc000566de0>: {
        s: "Only 3 pods started out of 5",
    }
    Only 3 pods started out of 5
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/autoscaling_utils.go:460

I'll look into this.

@jberkus commented Oct 24, 2018

Yes, and it's now causing another test to fail as well:

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster-new

Escalating to critical-urgent. This test has been failing for 3 weeks now, and is potentially blocking visibility into other failures.

/priority critical-urgent
/remove-priority important-soon

@mariantalla commented Oct 26, 2018

This test had a couple of successful runs yesterday but is now flaking/failing in master-blocking:

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-cos-master-serial

It seems to be passing in master-upgrade.

Any chance we could prioritize work on it? It's preventing us from seeing whether fixes for other test failures are working.

@jbartosik commented Oct 29, 2018

There has been only 1 flake (and ~15 green runs) since the PR changing CPU requests in the HPA tests back to 500 mCPU merged on Thursday. The flake was an unexpected scale-up during stabilization.

I'll check whether lowering CPU utilization a bit further deflakes the test more, but I think it's OK to close this issue now.

@jbartosik commented Oct 29, 2018

Lowering CPU utilization didn't help.

@jberkus commented Oct 29, 2018

SIG-Autoscaling's boards are also showing flakiness: https://k8s-testgrid.appspot.com/sig-autoscaling-hpa, although that looks better than the sig-release test boards. Any idea why?

@jbartosik commented Oct 30, 2018

Thanks for the links; I had been looking at only one of those dashboards. I've sent another PR that should reduce the flakiness of our tests.

@jberkus commented Oct 30, 2018

Thanks!

@jbartosik commented Oct 31, 2018

The PR merged. Now let's wait and see if it solves the problem.

@jbartosik commented Nov 2, 2018

The test named in the title of this issue [1] got better. Judging by the last failures on the dashboards you listed, it looks like the test didn't flake in the last ~2 days.

Another two tests [2] are flaking here; I'll send a PR that should fix that. I think those flakes are the result of HPA behavior intentionally changing [3], so the PR will just relax the test's expectations.

[1] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController Should scale from 5 pods to 3 pods and from 3 to 1 and verify decision stability
[2] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: Custom Metrics from Stackdriver) should scale down with External Metric with target value from Stackdriver [Feature:CustomMetricsAutoscaling] and [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: Custom Metrics from Stackdriver) should scale down with External Metric with target average value from Stackdriver [Feature:CustomMetricsAutoscaling]
[3] HPA no longer delays scale-ups based on metrics other than CPU. The flakes occur when HPA scales the deployment to 2 or 3, so the test doesn't observe it at size 1 and times out waiting for that size.

@jbartosik commented Nov 2, 2018

I somehow misread the titles of the custom metric tests as tests for scaling up. Those flaking tests are actually tests for scaling down, and the scale-downs sometimes don't happen. I'm looking into why.

@jberkus commented Nov 4, 2018

While this test has gotten better, it's still flaking some of the time:

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-master

@jbartosik commented Nov 5, 2018

I looked at "Latest failures" in the top two categories according to this dashboard (all the other categories have almost no recent flakes).

All of the failures log as if they were not affected by this PR: they print waiting for %d replicas (current: %d) instead of expecting there to be in [%d, %d] replicas (are: %d). The test results also say they are running version 1.11... or 1.12... (or, more rarely, 1.9... or 1.10..., but I'm ignoring those for now; they are not related to any changes I made). So it looks like I need to cherry-pick the PR into the 1.11 and 1.12 branches.
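
For reference, a sketch (with a hypothetical helper name, not the e2e framework's actual function) of the relaxed check implied by the new log line: during stabilization the replica count may move within [min, max] instead of having to equal a single expected value.

// stability_check_sketch.go -- hypothetical helper illustrating a range-based
// stability check matching the "expecting there to be in [%d, %d] replicas"
// log format quoted above.
package main

import (
	"fmt"
	"time"
)

func ensureReplicasInRange(getReplicas func() int, min, max int, window, interval time.Duration) error {
	deadline := time.Now().Add(window)
	for time.Now().Before(deadline) {
		replicas := getReplicas()
		fmt.Printf("expecting there to be in [%d, %d] replicas (are: %d)\n", min, max, replicas)
		if replicas < min || replicas > max {
			return fmt.Errorf("number of replicas changed unexpectedly: %d not in [%d, %d]", replicas, min, max)
		}
		time.Sleep(interval)
	}
	return nil
}

func main() {
	// Toy stand-in for reading the ReplicationController's current replica count.
	current := func() int { return 3 }
	if err := ensureReplicasInRange(current, 3, 4, 2*time.Second, 500*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}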

@jbartosik commented Nov 5, 2018

... when I look at the git branches, it looks like my PR is there.

There was one failure whose logs look like this PR affected it.

I'll wait to see whether there are more failures in test runs that include that change. If there are, I'll add some more logging (I have no ideas yet for the cause of the one failure I've seen so far).

@jbartosik commented Nov 5, 2018

/assign @jbartosik

@k8s-ci-robot commented Nov 5, 2018

@jbartosik: GitHub didn't allow me to assign the following users: jbartosik.

Note that only kubernetes members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @jbartosik

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@AishSundar commented Nov 6, 2018

@jbartosik any further update on this? It's turning into more of a failure than a flake, and I see another test failing as well in the latest runs:

https://testgrid.k8s.io/sig-release-master-upgrade#gce-new-master-upgrade-master

@AishSundar commented Nov 6, 2018

We are due to cut Beta for 1.13 today and would like to understand if this should be a blocker or not.

@AishSundar commented Nov 6, 2018

Also, are there cherry-pick PRs to backport your change to the 1.11 and 1.12 branches?

@mariantalla commented Nov 6, 2018

Potentially related: #70655

@jbartosik commented Nov 6, 2018

The 6 most recent (consecutive) failures of the test [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController Should scale from 1 pod to 3 pods and from 3 to 5 and verify decision stability do not include PR #70579, judging by their logs. That PR would change them from failures (unexpected scale from 3 to 4 instances during stabilisation) to successes (3 to 4 instances are expected during stabilisation).

One of the 2 recent failures of [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController Should scale from 5 pods to 3 pods and from 3 to 1 and verify decision stability looks wrong: it scaled up immediately to 5 instances. The other was a scale-up to 4 instances (OK; I'll send a PR relaxing that condition tomorrow).

I didn't send PRs cherry-picking the test changes to 1.12 (because when I checked, the change looked like it was already there) nor to 1.11 (my changes aren't there, so the old tests should keep working).

PR #70649 merged; it adds more logging, which will help me debug this.

The failures in issue #70655 look like the PR relaxing the stability requirement should have helped there too.

@AishSundar commented Nov 6, 2018

@jbartosik the gce-new-master-upgrade-master job, which has the 6 most recent (consecutive) failures of [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicationController Should scale from 1 pod to 3 pods and from 3 to 5 and verify decision stability, upgrades the master alone to 1.13 (master), keeps the nodes on the old version, and then runs the old test suite.

The latest failing runs show that the tests are run from the 1.11.5 branch:
https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-master/1867

Doesn't this mean your changes need to be backported to the 1.11 branch? That would be why you don't see your change in those test logs.

@AishSundar commented Nov 6, 2018

FWIW, the same test also flakes in the gce-new-master-upgrade-cluster job, which upgrades both master and nodes to 1.13 and runs the 1.11 tests.

@AishSundar commented Nov 7, 2018

@jbartosik this failure is currently blocking the beta for 1.13. Can you please investigate and let us know the path forward?

@MaciekPytel commented Nov 7, 2018

@AishSundar The fixes done by @jbartosik are all test fixes, not feature fixes, so the problem should be resolved once the cherry-pick gets to 1.11.

That being said, why are we running 1.11 e2e tests against a 1.13 cluster? That assumes full compatibility of both API and implementation.
We can never remove a feature without breaking this suite (the tests for the feature will still exist on older branches, after all), and the tests are often tweaked to the specifics of the implementation. Whenever we make any change to the autoscaling algorithm, the tests need to be updated accordingly. Making the tests on 1.11 work well with 1.13 risks them no longer working well with 1.11.

Shouldn't the upgrade job create a 1.11 cluster, upgrade it to 1.13, and run the 1.13 e2e tests on that cluster?

@AishSundar commented Nov 14, 2018

I see these 2 tests pass in the latest run of https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-cos-master-serial
Over to @jberkus to close when he sees fit.

@jberkus commented Nov 14, 2018

We have other flakes on that job now, but HPA is no longer flaking.

Thanks so much for the diligence in getting these flakes cleared up, @jbartosik

/close

@k8s-ci-robot commented Nov 14, 2018

@jberkus: Closing this issue.

In response to this:

We have other flakes on that job now, but HPA is no longer flaking.

Thanks so much for the diligence in getting these flakes cleared up, @jbartosik

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
