
Trusted Cluster goroutine leak #10648

Closed
rosstimothy opened this issue Feb 25, 2022 · 2 comments · Fixed by #10871
Assignees
rosstimothy
Labels
bug, test-plan-problem (Issues which have been surfaced by running the manual release test plan)

Comments

@rosstimothy
Contributor

Description

It looks like watchCertAuthorities does not return when trusted clusters are removed, which prevents the underlying CertAuthorityWatcher from ever being closed and causes a memory leak.

Both leaks can be seen in the screenshot below:
Screenshot 2022-02-25 at 10-23-13 Teleport Health Stats - Grafana

The goroutine profile here was taken after deleting the clusters:
profile001

What happened:

While running the 500 trusted clusters tests I noticed that after the trusted clusters were deleted there was a leak of both memory and goroutines.

What you expected to happen:

When the trusted clusters are removed, all resources should return to the levels they were at prior to adding the clusters.

Reproduction Steps

As minimally and precisely as possible, describe step-by-step how to reproduce the problem.

  1. Add a bunch of trusted clusters
  2. Delete them all
  3. Monitor goroutine and heap metrics

Server Details

  • Teleport version (run teleport version): 9.0.0-beta.1

Client Details

  • Tsh version (tsh version): 9.0.0-beta.1
@zmb3 zmb3 added the test-plan-problem Issues which have been surfaced by running the manual release test plan label Feb 25, 2022
@rosstimothy rosstimothy self-assigned this Mar 1, 2022
@rosstimothy
Contributor Author

Did a bit more digging on this and it ended up being caused by two things:

Blocking indefinitely when writing to c.CertAuthorityC

CertAuthorityWatcher tries to write to a channel that is no longer being read from when the remoteSite is closed.

c.CertAuthorityC <- casToSlice(c.host, c.user)

The correct behavior would be to select on both writing to c.CertAuthorityC and reading from ctx.Done. Note that the only other watchers that currently do this are the ProxyWatcher and LockWatcher. Both the DatabaseWatcher and AppWatcher have the same blocking bug.

select {
case c.CertAuthorityC <- casToSlice(c.host, c.user):
case <-ctx.Done():
}
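
The same guarded send could be factored into a small helper and reused by the other watchers. A minimal sketch of the pattern only; sendOrDone is a hypothetical helper, not the actual Teleport code:

import "context"

// sendOrDone attempts the channel send, but gives up as soon as ctx is
// cancelled, so a closed watcher can never wedge the sending goroutine.
func sendOrDone[T any](ctx context.Context, out chan<- T, v T) bool {
	select {
	case out <- v:
		return true
	case <-ctx.Done():
		return false
	}
}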

HTTPClient hanging indefinitely waiting for a response

CertAuthorityWatcher hangs waiting for a response from GetCertAuthorities and RotateExternalCertAuthority. This is caused by two things. First, we don't set a Timeout on the http.Client used by HTTPClient.

roundtrip.HTTPClient(&http.Client{Transport: transport}),

While we do set some timeouts on the http.Transport used by said http.Client, that doesn't appear to be enough to prevent hanging forever in some situations.
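
For illustration only, an overall deadline on the client itself bounds the whole request regardless of transport-level settings. The constructor name and the 30-second value below are assumptions, not what Teleport actually uses:

import (
	"net/http"
	"time"
)

// newHTTPClient wraps the transport with a whole-request timeout covering
// dial, TLS handshake, headers, and body, so a hung server can't block the
// caller forever. The 30s value is purely illustrative.
func newHTTPClient(transport http.RoundTripper) *http.Client {
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second,
	}
}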

Second, we don't properly propagate the caller's context all the way down to the outgoing requests, so when the caller's context is cancelled there is nothing to stop the request.

return httplib.ConvertResponse(c.Client.PostJSON(context.TODO(), endpoint, val))

return httplib.ConvertResponse(c.Client.Get(context.TODO(), u, params))
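
The fix is to accept the caller's ctx and bind it to the outgoing request instead of using context.TODO. A standard-library-only sketch of the pattern (not the Teleport client code; the helper name is hypothetical):

import (
	"context"
	"io"
	"net/http"
)

// get ties the outgoing request to the caller's ctx via
// http.NewRequestWithContext; cancelling ctx aborts the request rather than
// leaving the goroutine parked waiting on a response that may never arrive.
func get(ctx context.Context, client *http.Client, url string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}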

@rosstimothy
Contributor Author

Here is the goroutine dump diff from before and after running the 500 Trusted Cluster scaling test. It captures both the goroutines blocked writing to the channel and those stuck waiting for the HTTP response.

profile001

rosstimothy added a commit that referenced this issue Mar 4, 2022
The CA Watcher was blocking both on writing to a channel when the watcher
was closed and on HTTP calls that had no request timeout or context passed
to cause cancellation.

All resourceWatcher implementations that had a bug which may cause them to block
on writing to a channel forever were fixed by selecting on the write and ctx.Done.

Add context.Context to all Get/Put/Post/Delete methods on the auth HTTPClient to
force callers to propagate context. Previously all calls used context.TODO, which
prevented requests from being properly cancelled.

Add context propagation to RotateCertAuthority, RotateExternalCertAuthority,
GetCertAuthority, and GetCertAuthorities. This is needed to get the correct ctx
from the CertAuthorityWatcher all the way down to the HTTPClient that makes
the call.

Closes #10648
rosstimothy added a commit that referenced this issue Mar 10, 2022
rosstimothy added a commit that referenced this issue Mar 10, 2022
rosstimothy added a commit that referenced this issue Mar 10, 2022
rosstimothy added a commit that referenced this issue Mar 10, 2022
rosstimothy added a commit that referenced this issue Mar 14, 2022
rosstimothy added a commit that referenced this issue Mar 14, 2022
rosstimothy added a commit that referenced this issue Mar 15, 2022
rosstimothy added a commit that referenced this issue Mar 15, 2022