
[test flakes] master-scalability suites #60589

Closed
krzyzacy opened this issue Feb 28, 2018 · 164 comments


@krzyzacy (Member) commented Feb 28, 2018

Failing release-blocking suites:

All three suites have been flaking a lot recently; mind triaging?

/sig scalability
/priority failing-test
/kind bug
/status approved-for-milestone

cc @jdumars @jberkus
/assign @shyamjvs @wojtek-t

@shyamjvs (Member) commented Feb 28, 2018

  • The correctness job is failing mostly due to timeout (we need to adjust our schedule accordingly), as a bunch of e2es were added to the suite recently (e.g. #59391).
  • For the 100-node flakes, we have #60500 (and I believe that's related; need to check).
  • For the performance job, I believe there's a regression (from the last few runs it seems to be around pod-startup latency). Maybe something more too.

I'll try getting to them sometime this week (free cycles are scarce at the moment).

@jdumars jdumars added this to the v1.10 milestone Feb 28, 2018

@krzyzacy (Member, Author) commented Mar 2, 2018

@shyamjvs is there any update on this issue?

@wojtek-t (Member) commented Mar 2, 2018

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-scale-correctness

I took a brief look into that. Either some test(s) are extremely slow or something is hanging somewhere. Part of the logs from the last run:

62571 I0301 23:01:31.360] Mar  1 23:01:31.348: INFO: Running AfterSuite actions on all node
62572 I0301 23:01:31.360]
62573 W0302 07:32:00.441] 2018/03/02 07:32:00 process.go:191: Abort after 9h30m0s timeout during ./hack/ginkgo-e2e.sh --ginkgo.flakeAttempts=2 --ginkgo.skip=\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]|\[DisabledForLargeClusters\] --allowed-not-ready-nodes=50 --node-schedulable-timeout=90m --minStartupPods=8 --gather-resource-usage=master --gather-metrics-at-teardown=master --logexporter-gcs-path=gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/80/artifacts --report-dir=/workspace/_artifacts --disable-log-dump=true --cluster-ip-range=10.64.0.0/11. Will terminate in another 15m
62574 W0302 07:32:00.445] SIGABRT: abort

No test finished within 8h30m

@wojtek-t (Member) commented Mar 2, 2018

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-scale-performance

Indeed, this seems like a regression. I think the regression happened somewhere between runs:
105 (which was still ok)
108 (which had visibly higher startup time)

We can try looking into kubemark-5000 to see if it is visible there too.

@wojtek-t (Member) commented Mar 2, 2018

Kubemark-5000 is pretty stable. The 99th percentile is shown on this graph (maybe the regression happened even earlier, but I think it is somewhere between 105 and 108):

[screenshot from 2018-03-02 14-36-54]

@wojtek-t (Member) commented Mar 2, 2018

Regarding the correctness tests: gce-large-correctness is also failing.
Maybe some test that is extremely long was added at that time?

@shyamjvs (Member) commented Mar 2, 2018

Thanks a lot for looking, @wojtek-t. Wrt the performance job, I too strongly suspect there's a regression (though I haven't been able to look into it properly yet).

Maybe some test that is extremely long was added at that time?

I was looking into this a while ago, and I found two suspicious changes:

@shyamjvs (Member) commented Mar 2, 2018

@msau42 (Member) commented Mar 2, 2018

/assign

Some of the local storage tests will try to use every node in the cluster, thinking that cluster sizes are not that big. I'll add a fix to cap the max number of nodes.
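For illustration only (this is not the actual PR, and the helper name is hypothetical): a minimal sketch of the kind of cap being described.

```go
package storage

import v1 "k8s.io/api/core/v1"

// maxNodesForTest is a hypothetical helper sketching the fix described above:
// even on a 5000-node cluster, a local-storage test would spread across at
// most maxNodes nodes instead of every node it can see.
func maxNodesForTest(nodes []v1.Node, maxNodes int) []v1.Node {
	if maxNodes > 0 && len(nodes) > maxNodes {
		return nodes[:maxNodes]
	}
	return nodes
}
```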

@wojtek-t (Member) commented Mar 2, 2018

Some of the local storage tests will try to use every node in the cluster, thinking that cluster sizes are not that big. I'll add a fix to cap the max number of nodes.

Thanks @msau42 - that would be great.

@wojtek-t (Member) commented Mar 2, 2018

Going back to https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-scale-performance suite

I took a closer look at runs up to 105 and at runs 108 and later.
The biggest difference wrt pod startup time seems to appear at the step:

10% worst watch latencies: 

[the name is misleading; I'll explain below]

Up to run 105, it generally looked like this:

I0129 21:17:43.450] Jan 29 21:17:43.356: INFO: perc50: 1.041233793s, perc90: 1.685463015s, perc99: 2.350747103s

Starting with run 108, it looks more like:

Feb 12 10:08:57.123: INFO: perc50: 1.122693874s, perc90: 1.934670461s, perc99: 3.195883331s

That is roughly a ~0.85s increase at the 99th percentile, which is about what we observe in the end result.

Now, what that "watch lag" is.
It is basically the time between "kubelet observed that the pod is running" and "the test observed the pod update setting its state to running".
There are a couple of possibilities for where we could have regressed:

  • kubelet is slower in reporting status
  • kubelet is qps-starved (and thus slower in reporting status)
  • apiserver is slower (e.g. cpu-starved) and thus processing requests slower (either write, watch, or both)
  • test is cpu-starved and thus processing incoming events slower

Since we don't really observe a difference between "schedule -> start" of a pod, that suggests it most probably isn't the apiserver (because processing requests and watch is on that path too), and it's most probably not a slow kubelet either (because it does start the pod).

So I think the most likely hypotheses are:

  • kubelet is qps-starved (or something else prevents it from quickly sending the status update)
  • the test is cpu-starved (or something like that)

The test didn't change at all around that time, so I think it's most probably the first one.

That said, I went through PRs merged between 105 and 108 runs and didn't find anything useful so far.
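For concreteness, here is a rough sketch of the "watch lag" measurement described above. It is an illustration only, not the density test's actual code, and it assumes the kubelet-side "running" time is taken from the container's Running.StartedAt timestamp.

```go
package density

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// watchLag approximates the lag discussed above: the delay between the moment
// kubelet saw the pod running (the container's StartedAt timestamp) and the
// moment the test's watch handler received the corresponding update.
func watchLag(pod *v1.Pod, observedAt time.Time) time.Duration {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Running != nil {
			return observedAt.Sub(cs.State.Running.StartedAt.Time)
		}
	}
	return 0
}
```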

@wojtek-t (Member) commented Mar 2, 2018

I think the next step is to:

  • look into the slowest pods (there seems to be an O(1s) difference between the slowest ones too) and see whether the difference is "before" or "after" the update-status request was sent
@wojtek-t (Member) commented Mar 2, 2018

So I looked into some example pods, and I'm already seeing this:

I0209 10:01:19.960823       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-16-5pl6z/pods/density-latency-pod-3115-fh7hl/status: (1.615907ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.200.5:37414]
...
I0209 10:01:22.464046       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-16-5pl6z/pods/density-latency-pod-3115-fh7hl/status: (279.153µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.200.5:37414]
I0209 10:01:23.468353       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-16-5pl6z/pods/density-latency-pod-3115-fh7hl/status: (218.216µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.200.5:37414]
I0209 10:01:24.470944       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-16-5pl6z/pods/density-latency-pod-3115-fh7hl/status: (1.42987ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.200.5:37414]
I0209 09:57:01.559125       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-35-gjzmd/pods/density-latency-pod-2034-t7c7h/status: (1.836423ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.229.43.12:37782]
...
I0209 09:57:04.452830       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-35-gjzmd/pods/density-latency-pod-2034-t7c7h/status: (231.2µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.229.43.12:37782]
I0209 09:57:05.454274       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-35-gjzmd/pods/density-latency-pod-2034-t7c7h/status: (213.872µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.229.43.12:37782]
I0209 09:57:06.458831       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-35-gjzmd/pods/density-latency-pod-2034-t7c7h/status: (2.13295ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.229.43.12:37782]
I0209 10:01:53.063575       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-5-6w4kv/pods/density-latency-pod-3254-586j7/status: (1.410064ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.212.60:3391
...
I0209 10:01:55.259949       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-5-6w4kv/pods/density-latency-pod-3254-586j7/status: (10.4894ms) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.212.60:33916]
I0209 10:01:56.266377       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-5-6w4kv/pods/density-latency-pod-3254-586j7/status: (233.931µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.212.60:33916]
I0209 10:01:57.269427       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-5-6w4kv/pods/density-latency-pod-3254-586j7/status: (182.035µs) 429 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.212.60:33916]
I0209 10:01:58.290456       1 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-density-30-5-6w4kv/pods/density-latency-pod-3254-586j7/status: (13.44863ms) 200 [[kubelet/v1.10.0 (linux/amd64) kubernetes/05944b1] 35.196.212.60:33916]

So it seems pretty clear that the problem is related to "429"s.
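To spot this pattern at scale (rather than eyeballing excerpts like the one above), a throwaway scan along these lines could count the 429'd status PUTs per kubelet IP. This is a hypothetical helper written for this thread, not part of the test framework.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// Reads an apiserver log shaped like the excerpt above on stdin and counts
// 429 responses to pod-status PUTs per source IP. Purely illustrative.
func main() {
	re := regexp.MustCompile(`PUT \S+/status: \([^)]*\) 429 .* (\d+\.\d+\.\d+\.\d+):\d+`)
	counts := map[string]int{}
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // log lines can be long
	for scanner.Scan() {
		if m := re.FindStringSubmatch(scanner.Text()); m != nil {
			counts[m[1]]++
		}
	}
	for ip, n := range counts {
		fmt.Printf("%s\t%d\n", ip, n)
	}
}
```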

@jdumars (Member) commented Mar 2, 2018

Are those throttled API calls due to a quota on the owner account?

@wojtek-t (Member) commented Mar 2, 2018

Are those throttled API calls due to a quota on the owner account?

This isn't the throttling I initially thought it was. These are 429s from the apiserver (either the apiserver has become slower for some reason, or more requests are coming to the apiserver).
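For context: the apiserver returns 429 when its in-flight request limits (--max-requests-inflight / --max-mutating-requests-inflight) are exceeded, and clients are expected to back off. A hedged sketch of how a client could recognize that using the apimachinery error helpers follows; kubelet's real status manager retries via its own sync loop rather than like this.

```go
package main

import (
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// retryDelayFor sketches a client-side reaction to the 429s seen above:
// detect the error, honor any server-suggested delay, otherwise fall back to
// a small default. Illustrative only.
func retryDelayFor(err error) (time.Duration, bool) {
	if err == nil || !apierrors.IsTooManyRequests(err) {
		return 0, false
	}
	if seconds, ok := apierrors.SuggestsClientDelay(err); ok {
		return time.Duration(seconds) * time.Second, true
	}
	return time.Second, true
}

func main() {
	// Hypothetical usage with an error returned by a status-update call.
	var err error
	if delay, retry := retryDelayFor(err); retry {
		fmt.Printf("apiserver is overloaded, retrying in %v\n", delay)
	}
}
```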

@jdumars (Member) commented Mar 2, 2018

Oh, ok. That's not great news.

@Random-Liu (Member) commented Mar 21, 2018

@jdumars np~ @yujuhong helped me a lot!

@yujuhong (Member) commented Mar 22, 2018

But we do want to ask what PodScheduled means for pods which are not scheduled by the scheduler. Do we really need this condition for those pods? /cc @kubernetes/sig-autoscaling-bugs Because @yujuhong told me that PodScheduled is used by autoscaling now.

I still think letting kubelet set the PodScheduled condition is somewhat strange (as I noted in the original PR). Even if kubelet didn't set this condition, it wouldn't affect the cluster autoscaler, since the autoscaler ignores pods without that specific condition. Anyway, the fix we eventually came up with has a very minor footprint and maintains the current behavior (i.e., always setting the PodScheduled condition), so we'll go with that.
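For readers following along, a minimal sketch (mine, not the PR's actual code) of what "always setting the PodScheduled condition" amounts to on a pod's status:

```go
package podstatus

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ensurePodScheduled keeps the behavior described above: the pod status always
// carries a PodScheduled=True condition, whether or not the default scheduler
// placed the pod. Sketch only; the real fix lives in kubelet's status code.
func ensurePodScheduled(status *v1.PodStatus) {
	for i := range status.Conditions {
		if status.Conditions[i].Type == v1.PodScheduled {
			status.Conditions[i].Status = v1.ConditionTrue
			return
		}
	}
	status.Conditions = append(status.Conditions, v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionTrue,
		LastTransitionTime: metav1.Now(),
	})
}
```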

@yujuhong (Member) commented Mar 22, 2018

Also, I revived the really old issue about adding tests for the steady-state pod update rate: #14391

@wojtek-t (Member) commented Mar 22, 2018

Anyway, the fix we eventually came up with has a very minor footprint and maintains the current behavior (i.e., always setting the PodScheduled condition), so we'll go with that.

@yujuhong - are you talking about this one: #61504 (or do I misunderstand it)?

@wasylkowski @shyamjvs - can you please run 5000-node tests with that PR patched locally (before we merge it) to ensure that this really helps?

@shyamjvs (Member) commented Mar 22, 2018

I ran the test against 1.10 HEAD + #61504, and the pod-startup latency seems to be fine:

INFO: perc50: 2.594635536s, perc90: 3.483550118s, perc99: 4.327417676s

Will re-run once more to confirm.

@wojtek-t (Member) commented Mar 22, 2018

@shyamjvs - thanks a lot!

mfojtik pushed a commit to mfojtik/kubernetes that referenced this issue Mar 22, 2018

Kubernetes Submit Queue
Merge pull request kubernetes#61478 from shyamjvs/capture-pod-startup-phases-as-metrics

Automatic merge from submit-queue (batch tested with PRs 61378, 60915, 61499, 61507, 61478). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Capture pod startup phases as metrics

Learning from kubernetes#60589, we should also start collecting and graphing sub-parts of pod-startup latency.

/sig scalability
/kind feature
/priority important-soon
/cc @wojtek-t 

```release-note
NONE
```
@shyamjvs (Member) commented Mar 22, 2018

Second run also seems good:

INFO: perc50: 2.583489146s, perc90: 3.466873901s, perc99: 4.380595534s

I'm fairly confident now that the fix did the trick. Let's get it into 1.10 ASAP.

@wojtek-t (Member) commented Mar 22, 2018

Thanks @shyamjvs

As we discussed offline, I think we had one more regression in the last month or so, but that one shouldn't block the release.

@yujuhong (Member) commented Mar 22, 2018

@yujuhong - are you talking about this one: #61504 (or do I misunderstand it)?

Yep. The current fix in that PR is not among the options initially proposed in #60589 (comment).

k8s-github-robot pushed a commit that referenced this issue Mar 24, 2018

Kubernetes Submit Queue
Merge pull request #61504 from Random-Liu/fix-pod-scheduled-bug
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Fix `PodScheduled` bug for static pod.

Fixes #60589.

This is an implementation of option 2 in #60589 (comment).
I've validated this in my own cluster, and there are no longer continuous status updates for static pods.

Signed-off-by: Lantao Liu <lantaol@google.com>




**Release note**:

```release-note
none
```
@jberkus commented Mar 25, 2018

re-opening until we have a good performance test result.

@jberkus jberkus reopened this Mar 25, 2018

@jdumars (Member) commented Mar 26, 2018

@yujuhong @krzyzacy @shyamjvs @wojtek-t @Random-Liu @wasylkowski-a any updates on this? This is still blocking 1.10 at the moment.

@shyamjvs (Member) commented Mar 26, 2018

So the only part of this bug that was blocking the release is the 5k-node performance job. Unfortunately, we lost today's run due to a different reason (ref: #61190 (comment)).

That said, we're fairly confident the fix works based on my manual runs (results pasted in #60589 (comment)). So IMHO we don't need to block the release on it (the next run is going to be on Wednesday).

@wojtek-t (Member) commented Mar 26, 2018

+1
@jdumars - I think we can treat this as a non-blocker.

@wojtek-t (Member) commented Mar 26, 2018

Sorry, I edited my post above. I meant that we should treat it as "non-blocker".

@jdumars (Member) commented Mar 26, 2018

Ok, thank you very much. This conclusion represents a tremendous amount of hours you have invested, and I cannot possibly thank you all enough for the work you have done. While we talk in the abstract about "community" and "contributors", you and the others who have worked on this issue represent it in concrete terms. You are the very heart and soul of this project, and I know I speak for everyone involved when I say it is an honor to work alongside such passion, commitment, and professionalism.

@k8s-github-robot (Contributor) commented Apr 11, 2018

[MILESTONENOTIFIER] Milestone Issue: Up-to-date for process

@krzyzacy @msau42 @shyamjvs @wojtek-t

Issue Labels
  • sig/api-machinery sig/autoscaling sig/node sig/scalability sig/scheduling sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

k8s-github-robot pushed a commit that referenced this issue Apr 12, 2018

Kubernetes Submit Queue
Merge pull request #61198 from jpbetz/etcd-3.2-upgrade-reattempt
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Update default etcd server to 3.2 for kubernetes 1.11

Reapply #59836, but with the latest etcd 3.2 patch version (3.2.18, which includes the mvcc fix and the leader election timeout fix) and with `--snapshot-count` defaulted to 10k, to resolve the performance regression seen in the previous etcd 3.2 server upgrade attempt (#60589 (comment)).

See #60589 (comment) for details on the root cause of the performance regression and scalability test results of setting `--snapshot-count` to 10k.

```release-note
Upgrade the default etcd server version to 3.2.18
```
@gyuho @shyamjvs @jdumars @timothysc

k8s-publishing-bot added a commit to kubernetes/sample-apiserver that referenced this issue Apr 12, 2018


k8s-publishing-bot added a commit to kubernetes/apiextensions-apiserver that referenced this issue Apr 12, 2018

@shyamjvs (Member) commented May 25, 2018

This issue was resolved with the relevant fixes in 1.10.
For 1.11, we're tracking the failures under #63030.

/close

prameshj added a commit to prameshj/kubernetes that referenced this issue Jun 1, 2018

KIVagant added a commit to KIVagant/kubernetes that referenced this issue Jul 4, 2018

openshift-publish-robot pushed a commit to openshift/kubernetes-sample-apiserver that referenced this issue Jan 14, 2019

Merge pull request #60891 from shyamjvs/go-back-to-etcd-3.1.10
Automatic merge from submit-queue (batch tested with PRs 60891, 60935). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Rollback etcd server version to 3.1.11 due to #60589

Ref kubernetes/kubernetes#60589 (comment)

The dependencies were a bit complex (so many things rely on it), and the version was updated to 3.2.16 on top of the original bump.
So I mostly had to make manual reverting changes on a case-by-case basis, which means errors are likely :)

/cc @wojtek-t @jpbetz 

```release-note
Downgrade default etcd server version to 3.1.11 due to #60589
```

(I'm not sure if we should instead remove release-notes of the original PRs)

Kubernetes-commit: 56195fd1d329e8fb6c3c6cba59e1bc1eb4a2c998

openshift-publish-robot pushed a commit to openshift/kubernetes-sample-apiserver that referenced this issue Jan 14, 2019
