
Kube-state-metrics 20x spikes in memory usage at restart #2302

Closed

zhoujoetan opened this issue Jan 12, 2024 · 5 comments
Labels: kind/bug, triage/accepted


zhoujoetan commented Jan 12, 2024

What happened:
A few of our kube-state-metrics instances (single-instance, no sharding) recently had OOM issues after restart. Memory usage spiked to 2.5 GB (see attachment) for a few minutes before stabilizing at 131 MB. We tried increasing the CPU limit from the default 0.1 to 1, and even to 5, but it did not seem to help much.

Here is the pprof profile I captured:

File: kube-state-metrics
Type: inuse_space
Time: Jan 11, 2024 at 4:12pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1503.72MB, 99.10% of 1517.33MB total
Dropped 50 nodes (cum <= 7.59MB)
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
  753.11MB 49.63% 49.63%   753.11MB 49.63%  io.ReadAll
  748.12MB 49.30% 98.94%   748.12MB 49.30%  k8s.io/apimachinery/pkg/runtime.(*Unknown).Unmarshal
    2.49MB  0.16% 99.10%     8.99MB  0.59%  k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add
         0     0% 99.10%   753.11MB 49.63%  io/ioutil.ReadAll (inline)
         0     0% 99.10%   749.71MB 49.41%  k8s.io/apimachinery/pkg/runtime.WithoutVersionDecoder.Decode
         0     0% 99.10%   749.71MB 49.41%  k8s.io/apimachinery/pkg/runtime/serializer/protobuf.(*Serializer).Decode
         0     0% 99.10%     8.99MB  0.59%  k8s.io/apimachinery/pkg/util/wait.BackoffUntil
         0     0% 99.10%     8.99MB  0.59%  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
         0     0% 99.10%  1502.32MB 99.01%  k8s.io/client-go/kubernetes/typed/core/v1.(*configMaps).List
         0     0% 99.10%   753.61MB 49.67%  k8s.io/client-go/rest.(*Request).Do

Looks like heap memory usage does not represent 100% of the container_memory_usage_bytes metric.
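
For reference, a heap profile like the one above can be grabbed through the pprof endpoints, assuming they are exposed on the kube-state-metrics telemetry port (8081 by default; deployment name and namespace below are assumptions, adjust to your setup):

    kubectl -n kube-system port-forward deploy/kube-state-metrics 8081:8081
    go tool pprof -top http://localhost:8081/debug/pprof/heap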

What you expected to happen:
memory usage to not spike 20x at restart

How to reproduce it (as minimally and precisely as possible):
Kill/restart the KSM pod.
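For example (assuming a standard Deployment named kube-state-metrics in kube-system; names are illustrative):

    kubectl -n kube-system rollout restart deployment/kube-state-metrics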

Anything else we need to know?:

Environment:

  • kube-state-metrics version: v2.3.0
  • Kubernetes version (use kubectl version): v1.24.17-eks-8cb36c9
  • Cloud provider or hardware configuration: AWS EKS
  • Other info:

[Attachment: memory usage graph showing the spike at restart]

zhoujoetan added the kind/bug label Jan 12, 2024
k8s-ci-robot added the needs-triage label Jan 12, 2024
@dgrisonnet (Member)

/triage accepted
/assign @rexagod

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Jan 25, 2024
mindw (Contributor) commented Feb 6, 2024

@zhoujoetan try excluding configmaps and secrets from the list of exported resources (the --resources= command-line option).
At least for me it dropped initial memory usage from ~400 MiB to 24 MiB.
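
For example, something like this (an illustrative subset; --resources takes an allowlist, so name everything you do want to keep, and the full default set is in the docs):

    kube-state-metrics --resources=pods,deployments,statefulsets,daemonsets,nodes,namespaces,services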

Both the CLI and the Helm chart include them by default.

In my case, Helm release history (Helm stores release manifests in Secrets by default) was the main culprit for the high initial memory usage. I could not confirm whether pagination is used, which might mitigate this issue.
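
If that is the case for you too, capping release history keeps the number of those Secrets down (a sketch; --history-max is a Helm 3 flag, and the release/chart names are placeholders):

    helm upgrade my-release ./my-chart --history-max 3

Helm 3 labels its release Secrets with owner=helm, so you can also check how many have piled up:

    kubectl get secrets -A -l owner=helm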

Hope this helps.

rexagod (Member) commented Feb 25, 2024

kube-state-metrics version: v2.3.0

@zhoujoetan It seems you're on an outdated version that's no longer supported. Could you switch to one of the supported versions (preferably the latest release) and verify this issue still persists for you?

@zhoujoetan (Author)

I have figured out the issue. We had a ton of ConfigMap objects that KSM read at startup. Trimming those objects brought the memory usage back down. I am closing the issue now.
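
For anyone hitting the same wall, here is a rough way to spot oversized ConfigMaps (a hypothetical one-liner; requires jq, and sizes are approximate since it measures the serialized data field):

    kubectl get configmaps -A -o json \
      | jq -r '.items[] | [(.data | tostring | length), .metadata.namespace + "/" + .metadata.name] | @tsv' \
      | sort -rn | head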

@nalshamaajc
@zhoujoetan When you say ConfigMap objects, do you mean cluster-wide, or was it something specific? Also, when you say trimming, was it deleting unwanted ConfigMaps or removing data from them?
