-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: optimize memory usage of secret/configmap informers #5993
feat: optimize memory usage of secret/configmap informers #5993
Conversation
9393561
to
3a520bd
Compare
3a520bd
to
50cad8a
Compare
The secrets and configmaps informers only need to store the name, namespace and resource version of the objects: the event handlers don't look at the other metadata fields. By stripping the rest of the metadata, we can save a fair share of memory on clusters with lots of secrets/configmaps. Experiments show that the maximum memory usage with/without the change is similar but once the situation settles, we have a ~80% decrease in memory usage with this commit (from 300MB -> 50MB). More details (with graphs) in the pull request. To take full advantage of this optimization, we also tune the GOGC variable to "30" to reclaim the memory faster after startup. Signed-off-by: Simon Pasquier <spasquie@redhat.com>
50cad8a
to
2c67ca6
Compare
@@ -10,6 +10,7 @@ local defaults = { | |||
requests: { cpu: '', memory: '' }, | |||
}, | |||
enableReloaderProbes: false, | |||
goGC: '30', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per https://pkg.go.dev/runtime the default is 100. Should we retain the default value here while allowing it to be configurable or is the idea that we try (and push users to) to take advantage of the optimisation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter. My observation is that the memory usage increases a lot initially when the operator populates its local cache but it's transient because most of the data is discarded by the transform func and reclaiming more aggressively than the default is beneficial in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah makes sense. Feels pretty low risk so all good with me :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question on the default setting but other than that lgtm. Awesome improvement.
Impressive work! |
Description
The secrets and configmaps informers only need to store the name, namespace and resource version of the objects: the event handlers don't look at the other metadata fields. By stripping the rest of the metadata, we can save a fair share of memory on clusters with lots of secrets/configmaps.
Example with a kind cluster running 16,384 secrets (in 128 namespaces), each secret has 4 annotations of 1,024 chars each.
Before the change
After this change
In all cases, we see a big spike at the start of the operator which is expected because it fetches all data for the informers from the Kubernetes API. Depending on when the GC kicks in, the max allocated memory can remain high for longer:
The maximum memory usage with/without the change is similar but once the situation settles, we see a ~80% decrease in memory usage with this PR (from 300MB -> 50MB).
To take full advantage of this optimization, we can also tune the GOGC variable to reclaim the memory faster after startup.
With
GOGC=50
:With
GOGC=30
:I didn't observe significant changes to the CPU usage.
The optimization has been inspired by cert-manager/cert-manager#5966
Type of change
What type of changes does your code introduce to the Prometheus operator? Put an
x
in the box that apply.CHANGE
(fix or feature that would cause existing functionality to not work as expected)FEATURE
(non-breaking change which adds functionality)BUGFIX
(non-breaking change which fixes an issue)ENHANCEMENT
(non-breaking change which improves existing functionality)NONE
(if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)Changelog entry
Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.