Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: optimize memory usage of secret/configmap informers #5993

Conversation

simonpasquier
Copy link
Contributor

@simonpasquier simonpasquier commented Oct 11, 2023

Description

The secrets and configmaps informers only need to store the name, namespace and resource version of the objects: the event handlers don't look at the other metadata fields. By stripping the rest of the metadata, we can save a fair share of memory on clusters with lots of secrets/configmaps.

Example with a kind cluster running 16,384 secrets (in 128 namespaces), each secret has 4 annotations of 1,024 chars each.

Before the change

image

After this change

image

In all cases, we see a big spike at the start of the operator which is expected because it fetches all data for the informers from the Kubernetes API. Depending on when the GC kicks in, the max allocated memory can remain high for longer:

image

The maximum memory usage with/without the change is similar but once the situation settles, we see a ~80% decrease in memory usage with this PR (from 300MB -> 50MB).

To take full advantage of this optimization, we can also tune the GOGC variable to reclaim the memory faster after startup.

With GOGC=50:

image

With GOGC=30:

image

I didn't observe significant changes to the CPU usage.

The optimization has been inspired by cert-manager/cert-manager#5966

Type of change

What type of changes does your code introduce to the Prometheus operator? Put an x in the box that apply.

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry

Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.

Reduce memory usage for clusters with large number of secrets and/or configmaps.

The secrets and configmaps informers only need to store the name,
namespace and resource version of the objects: the event handlers don't
look at the other metadata fields. By stripping the rest of the
metadata, we can save a fair share of memory on clusters with lots of
secrets/configmaps.

Experiments show that the maximum memory usage with/without the change
is similar but once the situation settles, we have a ~80% decrease in
memory usage with this commit (from 300MB -> 50MB). More details (with
graphs) in the pull request.

To take full advantage of this optimization, we also tune the GOGC
variable to "30" to reclaim the memory faster after startup.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
@simonpasquier simonpasquier changed the title WIP: optimize metadata informers feat: optimize memory usage of secret/configmap informers Nov 15, 2023
@simonpasquier simonpasquier marked this pull request as ready for review November 15, 2023 08:43
@simonpasquier simonpasquier requested a review from a team as a code owner November 15, 2023 08:43
@@ -10,6 +10,7 @@ local defaults = {
requests: { cpu: '', memory: '' },
},
enableReloaderProbes: false,
goGC: '30',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per https://pkg.go.dev/runtime the default is 100. Should we retain the default value here while allowing it to be configurable or is the idea that we try (and push users to) to take advantage of the optimisation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter. My observation is that the memory usage increases a lot initially when the operator populates its local cache but it's transient because most of the data is discarded by the transform func and reclaiming more aggressively than the default is beneficial in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah makes sense. Feels pretty low risk so all good with me :)

Copy link
Contributor

@philipgough philipgough left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question on the default setting but other than that lgtm. Awesome improvement.

@nicolastakashi
Copy link
Contributor

Impressive work!
LGTM 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants