
Kube-state-metrics 20x spikes in memory usage at restart #2302

Closed

zhoujoetan opened this issue Jan 12, 2024 · 5 comments
Labels: kind/bug, triage/accepted


zhoujoetan commented Jan 12, 2024

What happened:
A few of our kube-state-metrics instances (single-instance, no sharding) recently had OOM issues after restart. Memory usage spiked to 2.5 GB (see attachment) for a few minutes before stabilizing at 131 MB. We tried increasing the CPU limit from the default 0.1 to 1, and even to 5, but it did not seem to help much.

Here is the pprof profile I captured:

File: kube-state-metrics
Type: inuse_space
Time: Jan 11, 2024 at 4:12pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1503.72MB, 99.10% of 1517.33MB total
Dropped 50 nodes (cum <= 7.59MB)
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
  753.11MB 49.63% 49.63%   753.11MB 49.63%  io.ReadAll
  748.12MB 49.30% 98.94%   748.12MB 49.30%  k8s.io/apimachinery/pkg/runtime.(*Unknown).Unmarshal
    2.49MB  0.16% 99.10%     8.99MB  0.59%  k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add
         0     0% 99.10%   753.11MB 49.63%  io/ioutil.ReadAll (inline)
         0     0% 99.10%   749.71MB 49.41%  k8s.io/apimachinery/pkg/runtime.WithoutVersionDecoder.Decode
         0     0% 99.10%   749.71MB 49.41%  k8s.io/apimachinery/pkg/runtime/serializer/protobuf.(*Serializer).Decode
         0     0% 99.10%     8.99MB  0.59%  k8s.io/apimachinery/pkg/util/wait.BackoffUntil
         0     0% 99.10%     8.99MB  0.59%  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
         0     0% 99.10%  1502.32MB 99.01%  k8s.io/client-go/kubernetes/typed/core/v1.(*configMaps).List
         0     0% 99.10%   753.61MB 49.67%  k8s.io/client-go/rest.(*Request).Do

Looks like heap memory usage does not represent 100% of the container_memory_usage_bytes metric.
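
For reference, a heap profile like the one above can be grabbed through the pprof endpoints, assuming they are exposed on the kube-state-metrics telemetry port (8081 by default; deployment name and namespace below are assumptions, adjust to your setup):

    kubectl -n kube-system port-forward deploy/kube-state-metrics 8081:8081
    go tool pprof -top http://localhost:8081/debug/pprof/heap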

What you expected to happen:
memory usage to not spike 20x at restart

How to reproduce it (as minimally and precisely as possible):
Kill/restart the KSM pod.
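For example (assuming a standard Deployment named kube-state-metrics in kube-system; names are illustrative):

    kubectl -n kube-system rollout restart deployment/kube-state-metrics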

Anything else we need to know?:

Environment:

  • kube-state-metrics version: v2.3.0
  • Kubernetes version (use kubectl version): v1.24.17-eks-8cb36c9
  • Cloud provider or hardware configuration: AWS EKS
  • Other info:

[Attachment: memory usage graph showing the spike at restart]

zhoujoetan added the kind/bug label Jan 12, 2024
k8s-ci-robot added the needs-triage label Jan 12, 2024
@dgrisonnet (Member)

/triage accepted
/assign @rexagod

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Jan 25, 2024
mindw (Contributor) commented Feb 6, 2024

@zhoujoetan try excluding configmaps and secrets from the list of exported resources (the --resources= command-line option).
At least for me it dropped initial memory usage from ~400 MiB to 24 MiB.
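
For example, something like this (an illustrative subset; --resources takes an allowlist, so name everything you do want to keep, and the full default set is in the docs):

    kube-state-metrics --resources=pods,deployments,statefulsets,daemonsets,nodes,namespaces,services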

Both the CLI and the Helm chart include them by default.

In my case, Helm release history (Helm stores release manifests in Secrets by default) was the main culprit for the high initial memory usage. I could not confirm whether pagination is used, which might mitigate this issue.
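
If that is the case for you too, capping release history keeps the number of those Secrets down (a sketch; --history-max is a Helm 3 flag, and the release/chart names are placeholders):

    helm upgrade my-release ./my-chart --history-max 3

Helm 3 labels its release Secrets with owner=helm, so you can also check how many have piled up:

    kubectl get secrets -A -l owner=helm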

Hope this helps.

rexagod (Member) commented Feb 25, 2024

kube-state-metrics version: v2.3.0

@zhoujoetan It seems you're on an outdated version that's no longer supported. Could you switch to one of the supported versions (preferably the latest release) and verify this issue still persists for you?

@zhoujoetan (Author)

I have figured out the issue. We had a ton of ConfigMap objects that KSM read at startup. Trimming those objects brought the memory usage back down. I am closing the issue now.
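
For anyone hitting the same wall, here is a rough way to spot oversized ConfigMaps (a hypothetical one-liner; requires jq, and sizes are approximate since it measures the serialized data field):

    kubectl get configmaps -A -o json \
      | jq -r '.items[] | [(.data | tostring | length), .metadata.namespace + "/" + .metadata.name] | @tsv' \
      | sort -rn | head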

@nalshamaajc
@zhoujoetan When you say ConfigMap objects, do you mean cluster-wide, or was it something specific? Also, when you say trimming, was it deleting unwanted ConfigMaps or removing data from them?
