
Failing Test : [sig-scalability] Density [Feature:Performance] should allow starting 30 pods per node using { ReplicationController} with 0 secrets, 0 configmaps and 0 daemons #63030

Closed · AishSundar opened this issue on Apr 23, 2018 · 57 comments


@k8s-ci-robot (Contributor) commented Apr 23, 2018

@AishSundar: GitHub didn't allow me to assign the following users: for, triage.

Note that only kubernetes members and repo collaborators can be assigned.

In response to this:

Failing Job

sig-release-master-blocking#gce-scale-performance

Failing Test
[sig-scalability] Density [Feature:Performance] should allow starting 30 pods per node using { ReplicationController} with 0 secrets, 0 configmaps and 0 daemons

Triage results
https://storage.googleapis.com/k8s-gubernator/triage/index.html?sig=scalability&job=ci-kubernetes-e2e-gce-scale-performance

/kind bug
/priority failing-test
/priority important-soon
/sig scalability
/milestone v1.11
@kubernetes/sig-scalability-bugs

cc @jberkus @tpepper

/assign @shyamjvs for triage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@AishSundar (Contributor Author) commented Apr 23, 2018

/status approved-for-milestone

@shyamjvs (Member) commented Apr 23, 2018

This issue is being taken care of in #62064 and #62808.

@AishSundar (Contributor Author) commented May 2, 2018

@AishSundar (Contributor Author) commented May 7, 2018

@shyamjvs @wojtek-t is there an ETA on fixing this test failure? This is currently one of the few tests failing in our sig-release master-blocking suite. If it is not fixed, we will have to delay the 1.11 beta.

@tpepper @jberkus as FYI

@shyamjvs (Member) commented May 7, 2018

The load capacity test failures were mostly because of #60372 IIUC, and that was fixed by its revert PR a few days later.
The longer-term density test failures seem to be another regression (I'm currently looking into it). At this point, I'd say end of this week is a reasonable ETA.

@shyamjvs (Member) commented May 9, 2018

So I took a look into the density test failures, and ~all the failures in the streak were due to high latencies of 'DELETE pods' requests (and sometimes 'POST configmaps' as well).
E.g. from run-150:

WARNING Top latency metric: {Resource:pods Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:7.023ms Perc90:1.232998s Perc99:1.846284s Perc100:0s} Count:310025}
WARNING Top latency metric: {Resource:configmaps Subresource: Verb:POST Scope:namespace Latency:{Perc50:2.119ms Perc90:13.776ms Perc99:1.260952s Perc100:0s} Count:764}

From the graph of the latency across runs, there was a jump between runs 141 and 142:

[Graph: 'DELETE pods' latency across test runs, jumping between run 141 and run 142]
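For context, the density test treats latencies like the ones above as failures because they exceed the API-call latency SLO. A rough Go sketch of that check, assuming a 1s 99th-percentile threshold; the type below is illustrative, not the actual e2e framework code:

```go
package main

import (
	"fmt"
	"time"
)

// latencyPercentiles mirrors the shape of the metric lines above;
// the real e2e framework types may differ.
type latencyPercentiles struct {
	resource, verb         string
	perc50, perc90, perc99 time.Duration
}

func main() {
	// Assumed SLO: 99th-percentile latency for these API calls must stay under 1s.
	const slo = 1 * time.Second

	metrics := []latencyPercentiles{
		{"pods", "DELETE", 7023 * time.Microsecond, 1232998 * time.Microsecond, 1846284 * time.Microsecond},
		{"configmaps", "POST", 2119 * time.Microsecond, 13776 * time.Microsecond, 1260952 * time.Microsecond},
	}
	for _, m := range metrics {
		if m.perc99 > slo {
			fmt.Printf("SLO violation: %s %s perc99=%v > %v\n", m.verb, m.resource, m.perc99, slo)
		}
	}
}
```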

@shyamjvs (Member) commented May 9, 2018

That's a big jump (more than 2x), and luckily a similar effect is also visible on our 2k-node cluster job, where it shows up consistently. So I ran a few rounds of bisection across the diff on the 2k-node cluster and root-caused it to the following PR, which made changes around admission plugins - #62659

cc @enj @deads2k @liggitt - Any idea why this happened? Also, can we revert the change?

@liggitt (Member) commented May 9, 2018

> cc @enj @deads2k @liggitt - Any idea why this happened? Also, can we revert the change?

Does the scale test environment enable the OwnerReferencesPermissionEnforcement admission plugin? That's the only plugin using that REST mapper that I can see, and I didn't think it was on by default.

@enj (Member) commented May 9, 2018

It is the only plugin we have that uses that REST mapper...

@shyamjvs (Member) commented May 9, 2018

Let me run twice with and without the change to confirm.

@liggitt (Member) commented May 9, 2018

It looks like the QPS on the client used for internal loopback requests is set to 50 here:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/config_selfclient.go#L40-L45

Using that client for discovery adds ~25 requests every 10 seconds to maintain the list of active resources.

That client is also used by the node restriction admission plugin in pod deletion cases:
https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/noderestriction/admission.go#L185-L187

During a deletion storm, it seems likely that client is throttling itself already, and also using it for discovery could have affected it slightly, though I wouldn't have expected nearly a doubling effect (50 QPS == 500 available requests per 10-second period, of which the newly added discovery would have taken 5%).

As a first step, I'd likely try:

  • bumping QPS for that client by ~10
  • lengthening the resource rediscovery period to 20-30 seconds

As a next step, I'll look into reworking the node restriction plugin to work off the node->pod graph we already have in place rather than requiring live lookups on deletion.
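For intuition about the throttling described above, here is a minimal client-go sketch. It is a sketch only, not the apiserver's actual loopback wiring; the 50 QPS, 10-second window, and ~25-request rediscovery figures come from the comment above, while the burst value and everything else are illustrative:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// The loopback client discussed above is limited to ~50 QPS; the rest client
	// enforces this with a token-bucket limiter like the one below.
	limiter := flowcontrol.NewTokenBucketRateLimiter(50, 100) // qps, burst

	// Budget math from the comment: 50 QPS over a 10s window ~= 500 requests,
	// of which a ~25-request rediscovery pass consumes about 5%.
	window := 10 * time.Second
	budget := 50 * window.Seconds()
	fmt.Printf("discovery share of budget: %.1f%%\n", 100*25/budget)

	// During a pod-deletion storm, admission-time lookups and discovery draw
	// from the same bucket, so both end up waiting here when tokens run out.
	for i := 0; i < 3; i++ {
		limiter.Accept() // blocks until a token is available
	}
}
```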

@shyamjvs (Member) commented May 9, 2018

On a somewhat related note (though from a different time - around yesterday), we started seeing failures even in our 100-node performance tests - https://k8s-testgrid.appspot.com/sig-scalability-gce#gce

And the reason there is exactly the same, i.e. increased 'DELETE pods' latency:

WARNING Top latency metric: {Resource:pods Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:65.177ms Perc90:2.010244s Perc99:2.051307s Perc100:0s} Count:6207}

That's a very high bump for such a small cluster. And the diff points to this change around the GC client - #63444

I'm not sure if this is the right hypothesis, but using the dynamic client seems to hurt performance?

@liggitt @deads2k
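For context on what the dynamic-client switch means in practice: the typed clientset works with compiled-in Go types, while the dynamic client addresses resources by GroupVersionResource and round-trips everything through unstructured objects. A rough sketch of a dynamic-client pod deletion, using the current client-go API; the kubeconfig path and pod name are illustrative, and this is not the GC controller's actual code:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative config loading; in-cluster components wire this differently.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The dynamic client addresses resources by GroupVersionResource and works
	// with unstructured objects instead of typed Go structs.
	pods := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
	if err := client.Resource(pods).Namespace("default").
		Delete(context.TODO(), "example-pod", metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
}
```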

@shyamjvs (Member) commented May 9, 2018

FYI - I filed another issue, #63610, for the GC regression, which is a separate problem.

@wojtek-t (Member) commented May 10, 2018

@liggitt - it seems that your #63606 helps (which, looking at the code, is expected, given that you're reducing the number of requests generated).
I took a look and I'm happy to approve it once you remove the "do-not-submit" commit from it :)

@wojtek-t (Member) commented Jun 5, 2018

> Maybe experiment by setting LOGROTATE_MAX_SIZE high enough that rotation never takes place? (I don't know what the consequences of this will be though)

No - we've been there already. You end up with files so large that you can't easily open them anywhere :)

BTW - the first point (fixing retries) is significantly more important than fixing log rotation.

@shyamjvs (Member) commented Jun 5, 2018

SGTM - let's set log rotation aside for now (I think it's a hard problem in general, judging from the stories @crassirostris used to tell me :))

/assign @krzysied

@wojtek-t (Member) commented Jun 5, 2018

Given that the last run also succeeded (well, the test-infra part failed to update the build log, but the test passed), I'm removing it from the release-blockers list.

@wojtek-t (Member) commented Jun 5, 2018

Actually, it's not marked as a release blocker...

@shyamjvs (Member) commented Jun 5, 2018

I've removed all the milestone-related labels (as they're not really relevant for the remainder of this issue, imo).

@kubernetes/test-infra-maintainers - Could someone explain what went wrong with this run - https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/162 ? (I see that not all artifacts are present.)

@wojtek-t (Member) commented Jun 5, 2018

I filed #64760 for the log rotation issue.
Let's fix the retries and close this bug once that's done.

@krzysied (Contributor) commented Jun 5, 2018

@wojtek-t I've created a PR with the retries fix.

@krzyzacy (Member) commented Jun 5, 2018

@shyamjvs the last line of the log:

I0605 01:32:07.654] Call:  gsutil -m -q -o GSUtil:use_magicfile=True cp -r -c -z log,txt,xml /workspace/artifacts gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/162

I think that gsutil cp call did not finish. Again, I saw logexporter fail on >1k nodes - is this normal? We should probably retry.
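A sketch of the kind of retry being suggested here, written in Go with os/exec; the timeout, attempt count, and destination bucket path are illustrative assumptions, not what test-infra actually does:

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

// uploadArtifacts bounds each gsutil attempt with a timeout so a hung upload
// can't stall the job, and retries a few times before giving up.
func uploadArtifacts(src, dst string) error {
	var err error
	for attempt := 1; attempt <= 3; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute)
		cmd := exec.CommandContext(ctx, "gsutil", "-m", "-q",
			"-o", "GSUtil:use_magicfile=True",
			"cp", "-r", "-c", "-z", "log,txt,xml", src, dst)
		err = cmd.Run()
		cancel()
		if err == nil {
			return nil
		}
		log.Printf("artifact upload attempt %d failed: %v", attempt, err)
	}
	return err
}

func main() {
	if err := uploadArtifacts("/workspace/artifacts", "gs://example-bucket/logs/some-run"); err != nil {
		log.Fatalf("artifact upload failed after retries: %v", err)
	}
}
```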

@k8s-github-robot (Contributor) commented Jun 8, 2018

[MILESTONENOTIFIER] Milestone Removed From Issue

@AishSundar @krzysied @shyamjvs @kubernetes/sig-node-misc @kubernetes/sig-scalability-misc

Important: Code freeze is in effect and only issues with priority/critical-urgent may remain in the v1.11 milestone.


k8s-github-robot removed this from the v1.11 milestone on Jun 8, 2018

@shyamjvs (Member) commented Jun 11, 2018

Seems like we're hitting this artifacts-upload failure a bit too frequently - https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-scale-performance
I'll send out a change increasing the logexporter step timeout to see if that helps.

@shyamjvs (Member) commented Jun 11, 2018

It seems like we already have a long enough timeout for logexporter (~18 min on the 5k-node job).

@krzyzacy - I'm unable to find the build log for those runs. I need it to better understand what exactly is happening (is the job timing out during log upload, or is the gsutil call failing?). The mitigation strategy depends on that.

@krzyzacy (Member) commented Jun 11, 2018

@shyamjvs that gsutil call I pasted above never returns

k8s-github-robot pushed a commit that referenced this issue Jun 12, 2018

Kubernetes Submit Queue
Merge pull request #64768 from krzysied/scale_retries
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Adding scale error retries

**What this PR does / why we need it**:
ScaleWithRetries will retry all retryable errors, not only conflict errors.
ref #63030

**Release note**:

```release-note
NONE
```
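A rough illustration of the "retry all retryable errors, not only conflict" idea, using client-go's retry helper and apimachinery error predicates. This is a sketch, not the actual PR code; scaleWithRetries and scaleOnce are hypothetical names, and the set of predicates is an assumption:

```go
package scale

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
)

// isRetryable treats more than just Conflict as worth retrying.
func isRetryable(err error) bool {
	return apierrors.IsConflict(err) ||
		apierrors.IsServerTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsInternalError(err)
}

// scaleWithRetries keeps calling scaleOnce with backoff until it succeeds or
// hits a non-retryable error.
func scaleWithRetries(scaleOnce func() error) error {
	return retry.OnError(retry.DefaultBackoff, isRetryable, scaleOnce)
}
```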

k8s-github-robot pushed a commit that referenced this issue Jun 12, 2018

Kubernetes Submit Queue
Merge pull request #65018 from shyamjvs/add-debug-logs-to-logexporter
Automatic merge from submit-queue (batch tested with PRs 64974, 65009, 65018). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Increase logexporter timeout and add debug logs

Ref - #63030 (comment)

So it seems that logexporter isn't running on a large fraction of the nodes on our 5k-node cluster (~40% of them). As a result we fall back to ssh-based copying for those nodes, which is slow, and hence the job times out. My feeling is that it's because of slow scheduling of the logexporter pods (so quite a few nodes never got a chance to run them before we deleted the daemonset).

/cc @wojtek-t @krzyzacy 

```release-note
NONE
```

/sig scalability
/kind bug
/priority important-soon
/milestone v1.11
/status approved-for-milestone
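A sketch of the kind of check that would avoid deleting the daemonset before the logexporter pods have had a chance to run: poll the DaemonSet status until the ready count matches the desired count, with a deadline. The package and function names and the use of a typed clientset are assumptions, not the actual test code:

```go
package logdump

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForLogexporter polls the logexporter DaemonSet until every desired node
// runs a ready pod, or the timeout expires. Only after that should the
// daemonset be deleted and the ssh fallback used for any remaining nodes.
func waitForLogexporter(cs kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollImmediate(15*time.Second, timeout, func() (bool, error) {
		ds, err := cs.AppsV1().DaemonSets(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // treat errors as transient and keep polling
		}
		fmt.Printf("logexporter pods ready: %d/%d\n", ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
		done := ds.Status.DesiredNumberScheduled > 0 &&
			ds.Status.NumberReady == ds.Status.DesiredNumberScheduled
		return done, nil
	})
}
```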
@shyamjvs (Member) commented Jun 13, 2018

There's no active work left here. The scaling-related flakes should be fixed now that #64768 is merged.
And for the logexporter issue, I've filed #65018 (we're looking into it, but it's not very critical).

/close
