
Deployment Integration Test Goroutine Limit Exceeded #53617

Closed
crimsonfaith91 opened this issue Oct 9, 2017 · 19 comments

crimsonfaith91 (Contributor) commented Oct 9, 2017

What happened:
When the number of deployment integration tests grows beyond a threshold, running them locally with bazel fails with the error "race: limit on 8192 simultaneously alive goroutines is exceeded, dying". The error does not happen when the number of tests is small.

What you expected to happen:
The integration tests should not keep so many goroutines alive (more than 8192).

How to reproduce it (as minimally and precisely as possible):
(1) Duplicate each deployment test under the test/integration/deployment directory twice, using a digit identifier to distinguish the copies
(2) bazel build //test/integration/deployment/...
(3) bazel test //test/integration/deployment/...

Anything else we need to know?:
The error also happens for ReplicaSet. It may be related to how the integration test environment is set up.
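
For reference, a minimal sketch of how the leak can be observed from inside a test by logging the live goroutine count before and after each run (the helper name and placement are hypothetical, not part of the existing test code):

// goroutinecheck_test.go (hypothetical helper, for illustration only)
package deployment

import (
	"runtime"
	"testing"
)

// logGoroutines records the number of live goroutines; a leak across
// tests shows up as a steadily growing count.
func logGoroutines(t *testing.T, label string) {
	t.Logf("%s: %d live goroutines", label, runtime.NumGoroutine())
}

func TestDeploymentExample(t *testing.T) {
	logGoroutines(t, "before")
	// ... run the actual deployment integration test here ...
	logGoroutines(t, "after")
}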

/kind bug
/sig apps

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Oct 9, 2017
@crimsonfaith91 crimsonfaith91 changed the title Controller Integration Test Goroutine Limit Exceeded Deployment Integration Test Goroutine Limit Exceeded Oct 9, 2017
enisoc (Member) commented Oct 9, 2017

@kubernetes/sig-api-machinery-bugs Is it expected that an integration test would exceed 8192 goroutines (mostly started in apiserver code) if it starts a number of apiservers? That seems excessive to me, but if it's normal we should probably limit the concurrency of integration tests. If it's not normal, it seems like we are leaking goroutines.

Some examples of what those 8192 goroutines are doing:

k8s.io/client-go/tools/cache.(*Reflector).watchHandler(0xc420f625a0, 0xa8a2700, 0xc4210f5800, 0xc423939ba0, 0xc4210f55c0, 0xc420f485a0, 0x0, 0x0)
        vendor/k8s.io/client-go/tools/cache/reflector.go:366 +0x16f2
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc420f625a0, 0xc420f485a0, 0x0, 0x0)
        vendor/k8s.io/client-go/tools/cache/reflector.go:332 +0x1560
k8s.io/apiserver/pkg/storage.(*Cacher).startCaching(0xc4209121c0, 0xc420f485a0)
        vendor/k8s.io/apiserver/pkg/storage/cacher.go:276 +0x1a4
k8s.io/apiserver/pkg/storage.NewCacherFromConfig.func1.1()
        vendor/k8s.io/apiserver/pkg/storage/cacher.go:245 +0x80
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42003e7a8)
        vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x70
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc423939fa8, 0x3b9aca00, 0x0, 0xc42003e701, 0xc420f485a0)
        vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xce
k8s.io/apimachinery/pkg/util/wait.Until(0xc42003e7a8, 0x3b9aca00, 0xc420f485a0)
        vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x5b
k8s.io/apiserver/pkg/storage.NewCacherFromConfig.func1(0xc4209121c0, 0xc420f485a0)
        vendor/k8s.io/apiserver/pkg/storage/cacher.go:248 +0xe3
created by k8s.io/apiserver/pkg/storage.NewCacherFromConfig
        vendor/k8s.io/apiserver/pkg/storage/cacher.go:249 +0xfa7
k8s.io/apiserver/pkg/storage.(*Cacher).dispatchEvents(0xc4209121c0)
        vendor/k8s.io/apiserver/pkg/storage/cacher.go:595 +0x24a
created by k8s.io/apiserver/pkg/storage.NewCacherFromConfig
        vendor/k8s.io/apiserver/pkg/storage/cacher.go:237 +0xf54
github.com/coreos/etcd/clientv3.(*lessor).deadlineLoop(0xc4206948c0)
        vendor/github.com/coreos/etcd/clientv3/lease.go:434 +0x2fd
created by github.com/coreos/etcd/clientv3.NewLease
        vendor/github.com/coreos/etcd/clientv3/lease.go:156 +0x4da
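
For context, the 8192 figure is a hard cap in the Go race detector on simultaneously alive goroutines; the following standalone sketch (unrelated to Kubernetes, purely illustrative) triggers the same "limit on 8192 simultaneously alive goroutines is exceeded, dying" failure when run with go test -race:

package example

import (
	"testing"
	"time"
)

// Run with: go test -race
// More than 8192 goroutines are alive at once, so the race detector aborts
// with "race: limit on 8192 simultaneously alive goroutines is exceeded, dying".
// Without -race the test simply passes and the goroutines exit after close().
func TestGoroutineLimit(t *testing.T) {
	block := make(chan struct{})
	for i := 0; i < 10000; i++ {
		go func() { <-block }()
	}
	time.Sleep(time.Second) // keep the goroutines alive long enough to be counted
	close(block)
}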

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Oct 9, 2017
ncdc (Member) commented Oct 10, 2017

Do you have a full stack dump of all the goroutines? We could run them through panicparse to get some summarized details.
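
For anyone reproducing this, one way to capture such a dump from within the test process is the runtime/pprof goroutine profile (a minimal sketch; where to hook it into the integration framework is left open):

package example

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the stack trace of every live goroutine to the
// given file; the output can then be fed to panicparse for aggregation.
func dumpGoroutines(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// Debug level 2 prints goroutine stacks in the same format as an
	// unrecovered panic, which is what panicparse expects.
	return pprof.Lookup("goroutine").WriteTo(f, 2)
}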

crimsonfaith91 (Contributor, Author) commented

@ncdc Yes, but the file is very big (around 4MB). Most of the goroutines produce the same output; the partial stack dump above covers most of it.

enisoc (Member) commented Oct 11, 2017

The goroutines sampled above seem to be part of the client sitting between the REST Store and etcd. Perhaps it will help to incorporate calls to DestroyFunc somewhere in the integration framework?

// Called to cleanup clients used by the underlying Storage; optional.
DestroyFunc func()
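
A possible shape for that, sketched under the assumption that the integration framework could collect the DestroyFunc of each storage backend it creates and invoke them at teardown (the type and method names below are hypothetical, not existing framework APIs):

package framework

import "sync"

// storageCleanup accumulates the DestroyFunc closures handed out by the
// storage layer so a test can release etcd clients and cacher goroutines
// when it finishes.
type storageCleanup struct {
	mu       sync.Mutex
	destroys []func()
}

func (c *storageCleanup) Add(destroy func()) {
	if destroy == nil {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.destroys = append(c.destroys, destroy)
}

// TearDown runs the collected DestroyFuncs in reverse order, mirroring
// the order in which the backends were created.
func (c *storageCleanup) TearDown() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for i := len(c.destroys) - 1; i >= 0; i-- {
		c.destroys[i]()
	}
	c.destroys = nil
}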

mml (Contributor) commented Oct 12, 2017

cc @jpbetz

enisoc (Member) commented Oct 18, 2017

This may be related:

#50690
#49489

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 16, 2018
MHBauer (Contributor) commented Jan 19, 2018

To me this looks like a duplicate of the root cause in #49489.

MHBauer (Contributor) commented Jan 19, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2018
MHBauer (Contributor) commented Jan 19, 2018

@ncdc we have a similar problem in service-catalog. Goroutines are not being reclaimed.

Not sure how to get an appropriate stack dump to help, but there is some information in this gist.

crimsonfaith91 (Contributor, Author) commented Feb 1, 2018

I also encountered the error when working on a DaemonSet integration test: #59013

@kow3ns kow3ns added this to Backlog in Workloads Feb 27, 2018
@sttts sttts added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 1, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 30, 2018
nikhita (Member) commented Jun 13, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 13, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2018
nikhita (Member) commented Sep 13, 2018

/remove-lifecycle stale

Removing help-wanted because the direction is not clear.

/remove-help

@k8s-ci-robot k8s-ci-robot removed help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 13, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 12, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 11, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Workloads automation moved this from Backlog to Done Feb 10, 2019