Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integrations/operator: re-use the teleport client instead of creating a new one #34050

Merged
merged 3 commits into from
Nov 7, 2023

Conversation

hugoShaka
Copy link
Contributor

@hugoShaka hugoShaka commented Oct 30, 2023

Fixes #24110

This PR addresses several major issues of the Teleport Operator:

  • potentially high CPU and memory usage when reconciling broken resources
  • the operator caused a high log volume on the Teleport container
  • teleport clients were not properly closed, causing a memory leak

This should the ongoing memory issues several users reported, should largely reduce the impact of broken reconciliation and reduce the memory spikes when reconciling many resources. Another PR from @tigrato will reduce the amount of unnecessary reconciliations. With both PRs we should be in a much better place in terms of CPU/memory load on the operator side.

How it works

With this PR the embedded tbot now caches the client and can skip the whole connection dance if the certs have not changed. In the future, tbot will return us a single client with rolling certificates, which will simplify the whole thing.

This PR also wraps the teleport client in a new structure that contains an RWLock and tracks who is using the client. This allows us to ensure no one is using the client before closing it.

Finally, this PR adds an RWLock on tbot's memory destination to ensure safe reads from the sidecar (we don't want to read while tbot is writing the renewed cert, this would end badly).

changelog: The operator reuses its connection to Teleport. Reduces CPU usage, logs, and fixes a memory leak.

@hugoShaka hugoShaka added bug kube-operator Issues related to Kube Operator labels Oct 30, 2023
Copy link
Contributor

@strideynet strideynet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - let's put that RWMutex in to the Memory Destination though.

Copy link
Contributor

@strideynet strideynet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See slack, just realised where the root of your memory leak is.

return b.cachedClient, nil
}

freshClient, err := b.clientBuilder(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we ever closing the client?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the leak I spotted - the problem also exists in the existing code. We can probably adjust this PR to fix this since the other content of the PR is still valuable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 13de098

This is not super clean, and we'll definitely want to get rid of this when tbot will send us clients with in-place cert-renewal. However, tbot changes won't be backported to v12/v13, so we need the current fix for those versions.

integrations/operator/sidecar/tbot.go Outdated Show resolved Hide resolved
integrations/operator/sidecar/tbot.go Outdated Show resolved Hide resolved
lib/tbot/config/destination_memory.go Outdated Show resolved Hide resolved
integrations/operator/sidecar/tbot.go Outdated Show resolved Hide resolved
integrations/operator/sidecar/tbot.go Outdated Show resolved Hide resolved
@hugoShaka hugoShaka force-pushed the hugo/operator-reuse-client branch 2 times, most recently from 360cf2b to b3b6690 Compare October 31, 2023 19:38
Copy link
Contributor

@strideynet strideynet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@hugoShaka hugoShaka added this pull request to the merge queue Nov 7, 2023
Merged via the queue into master with commit df257be Nov 7, 2023
32 checks passed
@hugoShaka hugoShaka deleted the hugo/operator-reuse-client branch November 7, 2023 22:24
@public-teleport-github-review-bot

@hugoShaka See the table below for backport results.

Branch Result
branch/v12 Failed
branch/v13 Failed
branch/v14 Create PR

hugoShaka added a commit that referenced this pull request Nov 9, 2023
… a new one (#34050)

* integrations/operator: re-use the teleport client instead of creating a new one

* fix race condition

* address feedback + add godocs
hugoShaka added a commit that referenced this pull request Nov 10, 2023
logand22 pushed a commit that referenced this pull request Nov 10, 2023
… a new one (#34050)

* integrations/operator: re-use the teleport client instead of creating a new one

* fix race condition

* address feedback + add godocs
github-merge-queue bot pushed a commit that referenced this pull request Nov 13, 2023
…eating a new one (#34431)

* integrations/operator: re-use the teleport client instead of creating a new one (#34050)

* integrations/operator: re-use the teleport client instead of creating a new one

* fix race condition

* address feedback + add godocs

* fixup! integrations/operator: re-use the teleport client instead of creating a new one (#34050)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Huge memory and cpu usage for teleport operator
4 participants