
kubectl shows node memory >100%? #86499

Closed
roy-work opened this issue Dec 20, 2019 · 19 comments · Fixed by #102917
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@roy-work

roy-work commented Dec 20, 2019

What happened:

We ran the following, and got the following:

$ kubectl top nodes
NAME                       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-nodepool1-35944238-0   320m         8%     15080Mi         54%
aks-nodepool1-35944238-1   355m         9%     28048Mi         101%
aks-nodepool1-35944238-2   544m         13%    24650Mi         88%
aks-nodepool1-35944238-3   325m         8%     2170Mi          7%
aks-nodepool1-35944238-4   511m         13%    14992Mi         54%
aks-nodepool1-35944238-5   516m         13%    25332Mi         91%

How/why do we have a node reporting >100% memory usage? (There seems to be plenty of memory on the host, multiple gigabytes, as reported by the kernel's MemAvailable statistic.)

What you expected to happen:

Memory usage can't exceed 100%, no?

How to reproduce it (as minimally and precisely as possible): we unfortunately don't know

Anything else we need to know?: No swap on these VMs. We're curious which kernel memory statistic goes into computing the total for Kubernetes; it's my understanding that there are various ways to go over 100%, e.g., by summing RSS over several processes (shared and resident pages would get double-counted).

Environment:

  • Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.12", GitCommit:"524c3a1238422529d62f8e49506df658fa9c8b8c", GitTreeState:"clean", BuildDate:"2019-11-14T05:26:24Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Azure
  • OS (e.g: cat /etc/os-release): AKS instance?
  • Kernel (e.g. uname -a): (unsure, since this is Azure AKS; we don't have good access to this piece of data…)
  • Install tools: Azure?
  • Network plugin and version (if this is a network-related bug): N/A / Azure
  • Others: None
@roy-work roy-work added the kind/bug Categorizes issue or PR as related to a bug. label Dec 20, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 20, 2019
@roy-work
Author

(Taking my best shot here, but if this is wrong please freely adjust it.)

/sig instrumentation

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 20, 2019
@tedyu
Contributor

tedyu commented Dec 21, 2019

Other than the 101% reporting issue, do you observe any other abnormality?

@roy-work
Author

Well, yes, the node didn't seem to be actually at 100% memory use; as mentioned, it seemed to have significant headroom.

@haosdent
Member

How about kubectl describe node aks-nodepool1-35944238-1?

@serathius
Contributor

serathius commented Feb 24, 2020

Node memory utilization is the ratio of the node's working set bytes to the node's allocatable memory.

Allocatable memory is available on the node object:
kubectl describe node aks-nodepool1-35944238-1

Node working set bytes are available from the Metrics API:
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/aks-nodepool1-35944238-1

Please provide the results of those commands so we can distinguish whether the problem is in kubectl top or in the metrics pipeline.
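
For reference, here is a minimal way to pull those two numbers directly (assuming jq is available); kubectl top reports the second value divided by the first:

# Allocatable memory, from the node object
kubectl get node aks-nodepool1-35944238-1 -o jsonpath='{.status.allocatable.memory}'

# Working set ("usage"), from the Metrics API
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/aks-nodepool1-35944238-1 | jq -r '.usage.memory'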

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 24, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pdabrowski-it-solutions

I have a similar problem:

NAME                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%     
default-pool-2meb4gv3qy     357m         8%     1926Mi          72%         
secondary-pool-dqiqzzikb5   155m         3%     1061Mi          150%        
secondary-pool-kypbaua5an   82m          2%     884Mi           125% 

kubectl describe node secondary-pool-dqiqzzikb5

CreationTimestamp:  Mon, 19 Apr 2021 23:33:10 +0200
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  secondary-pool-dqiqzzikb5
  AcquireTime:     <unset>
  RenewTime:       Tue, 20 Apr 2021 22:24:21 +0200
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 19 Apr 2021 23:33:51 +0200   Mon, 19 Apr 2021 23:33:51 +0200   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 20 Apr 2021 22:22:37 +0200   Tue, 20 Apr 2021 01:01:50 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 20 Apr 2021 22:22:37 +0200   Tue, 20 Apr 2021 01:01:50 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 20 Apr 2021 22:22:37 +0200   Tue, 20 Apr 2021 01:01:51 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 20 Apr 2021 22:22:37 +0200   Tue, 20 Apr 2021 01:01:51 +0200   KubeletReady                 kubelet is posting ready status
Addresses:
  ExternalIP:  ***
  Hostname:    secondary-pool-dqiqzzikb5
Capacity:
  cpu:                4
  ephemeral-storage:  41218368Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1872412Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  37986847886
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             721436Ki
  pods:               110
System Info:
  Machine ID:                 ***
  System UUID:                ***
  Boot ID:                    ***
  Kernel Version:             3.10.0-1160.24.1.el7.x86_64
  OS Image:                   CentOS Linux 7 (Core)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.4
  Kubelet Version:            v1.18.15
  Kube-Proxy Version:         v1.18.15

kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/secondary-pool-dqiqzzikb5

{"kind":"NodeMetrics","apiVersion":"metrics.k8s.io/v1beta1","metadata":{"name":"secondary-pool-dqiqzzikb5","selfLink":"/apis/metrics.k8s.io/v1beta1/nodes/secondary-pool-dqiqzzikb5","creationTimestamp":"2021-04-20T20:28:21Z"},"timestamp":"2021-04-20T20:27:14Z","window":"30s","usage":{"cpu":"197390796n","memory":"1085572Ki"}}

I have no idea what's causing it or how to fix it :/

@k8s-ci-robot
Contributor

@pdabrowski-it-solutions: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

bysnupy added a commit to bysnupy/kubernetes that referenced this issue Jun 28, 2021
The memory usage result can sometimes exceed 100%, because the memory usage calculation is based on the logical "Allocatable" node memory, which is defined as "[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]", not the actual host total memory.
* Fix: kubernetes#86499
* Reference: kubernetes#100222
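
As a worked illustration of the formula above, using the numbers reported earlier in this thread (capacity 1872412Ki, allocatable 721436Ki; the split between system-reserved and eviction thresholds is not shown, so only the combined amount can be inferred):

[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]
     721436Ki =      1872412Ki  - ~1150976Ki (reserved memory plus eviction thresholds, combined)

Usage is therefore measured against 721436Ki rather than 1872412Ki, so the reported 1085572Ki of usage comes out to roughly 150%.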
bysnupy added a commit to bysnupy/kubernetes that referenced this issue Jul 8, 2021
…l node memory total usage.

If "Allocatable" is used to a node total memory size, under high memory pressure or pre-reserved memory value is bigger, the "MEMORY%" can be bigger than 100%.
For suppressing the confusing, add a option to show node real memory usage based on "Capacity".
* Reference: kubernetes#86499
@ydcool

ydcool commented Aug 6, 2021

/reopen
We have the same issue.

@k8s-ci-robot
Contributor

@ydcool: Reopened this issue.

In response to this:

/reopen
We have the same issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Aug 6, 2021
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Aug 6, 2021
@dashpole
Contributor

/remove-lifecycle rotten
/triage accepted
/assign @serathius

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 12, 2021
@GrumpyRainbow

Seeing the same issue here too.

NAME                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-web-30425474-vmss000007   79m          4%     2357Mi          109%
CreationTimestamp:  Fri, 23 Jul 2021 09:16:07 -0500
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  aks-web-30425474-vmss000007
  AcquireTime:     <unset>
  RenewTime:       Mon, 30 Aug 2021 09:06:12 -0500
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 23 Jul 2021 09:16:41 -0500   Fri, 23 Jul 2021 09:16:41 -0500   RouteCreated                 RouteController created a route
  MemoryPressure       False   Mon, 30 Aug 2021 09:04:12 -0500   Fri, 23 Jul 2021 09:16:07 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 30 Aug 2021 09:04:12 -0500   Fri, 23 Jul 2021 09:16:07 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 30 Aug 2021 09:04:12 -0500   Fri, 23 Jul 2021 09:16:07 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 30 Aug 2021 09:04:12 -0500   Fri, 23 Jul 2021 09:16:17 -0500   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  Hostname:    aks-web-30425474-vmss000007
  InternalIP:  ***
Capacity:
  attachable-volumes-azure-disk:  4
  cpu:                            2
  ephemeral-storage:              129900528Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         4030180Ki
  pods:                           110
Allocatable:
  attachable-volumes-azure-disk:  4
  cpu:                            1900m
  ephemeral-storage:              119716326407
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         2213604Ki
  pods:                           110
System Info:
  Machine ID:                 ***
  System UUID:                ***
  Boot ID:                    ***
  Kernel Version:             5.4.0-1049-azure
  OS Image:                   Ubuntu 18.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.4+azure
  Kubelet Version:            v1.20.5
  Kube-Proxy Version:         v1.20.5
{"kind":"NodeMetrics","apiVersion":"metrics.k8s.io/v1beta1","metadata":{"name":"aks-web-30425474-vmss000007","selfLink":"/apis/metrics.k8s.io/v1beta1/nodes/aks-web-30425474-vmss000007","creationTimestamp":"2021-08-30T14:06:22Z"},"timestamp":"2021-08-30T14:05:45Z","window":"30s","usage":{"cpu":"78528126n","memory":"2413824Ki"}}

@AnthonyWC

Another data point (on AWS EKS):

It seems to happen only on smaller node types (I see it with both micro and nano; the others are medium).

NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-x-x-x-x.us-west-2.compute.internal           117m         6%     864Mi           25%
ip-x-x-x-x.us-west-2.compute.internal            45m         2%     700Mi           128%
ip-x-x-x-x.us-west-2.compute.internal           118m         6%     1265Mi          37%

Usage is displayed correctly with kubectl describe node

Labels:             alpha.eksctl.io/nodegroup-name=nodegroup-spot-t4g-micro
                    beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/instance-type=t4g.micro
                    beta.kubernetes.io/os=linux
                    eks.amazonaws.com/capacityType=SPOT
                    eks.amazonaws.com/nodegroup-image=ami-08b947bfaa7dc4093
                    eks.amazonaws.com/sourceLaunchTemplateId=lt-0a1422f9743d31441
                    eks.amazonaws.com/sourceLaunchTemplateVersion=1
                    failure-domain.beta.kubernetes.io/region=us-west-2
                    failure-domain.beta.kubernetes.io/zone=us-west-2b
                    kubernetes.io/arch=arm64
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=t4g.micro
                    topology.kubernetes.io/region=us-west-2
                    topology.kubernetes.io/zone=us-west-2b
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 28 Sep 2021 14:28:32 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-192-168-60-44.us-west-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 29 Sep 2021 17:23:20 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 29 Sep 2021 17:18:45 -0400   Tue, 28 Sep 2021 14:28:30 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 29 Sep 2021 17:18:45 -0400   Tue, 28 Sep 2021 14:28:30 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 29 Sep 2021 17:18:45 -0400   Tue, 28 Sep 2021 14:28:30 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 29 Sep 2021 17:18:45 -0400   Tue, 28 Sep 2021 14:28:52 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  <redacted>
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         2
  ephemeral-storage:           83864556Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  hugepages-32Mi:              0
  hugepages-64Ki:              0
  memory:                      968424Ki
  pods:                        4
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         1930m
  ephemeral-storage:           76215832858
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  hugepages-32Mi:              0
  hugepages-64Ki:              0
  memory:                      559848Ki
  pods:                        4
System Info:
  Machine ID:                 **
  System UUID:                **
  Boot ID:                    **
  Kernel Version:             5.4.141-67.229.amzn2.aarch64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               arm64
  Container Runtime Version:  docker://19.3.13
  Kubelet Version:            v1.21.2-eks-55daa9d
  Kube-Proxy Version:         v1.21.2-eks-55daa9d
ProviderID:                   aws:///us-west-2b/i-0a3b1e6e4c0b5a1a6
Non-terminated Pods:          (3 in total)
  Namespace                   Name                                 CPU Requests  CPU Limits   Memory Requests  Memory Limits  Age
  ---------                   ----                                 ------------  ----------   ---------------  -------------  ---
  kube-system                 aws-node-wgnm6                   10m (0%)        0 (0%)           0 (0%)               0 (0%)         26h
  kube-system                 kube-proxy-xxl86                 100m (5%)        0 (0%)           0 (0%)                0 (0%)         26h
  ##                           ##-7dcb4c47ff-t5mbp            1740m (90%)   1880m (97%)   480Mi (87%)      540Mi (98%)    26h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         1850m (95%)  1880m (97%)
  memory                      480Mi (87%)  540Mi (98%)
  ephemeral-storage           0 (0%)       0 (0%)
  hugepages-1Gi               0 (0%)       0 (0%)
  hugepages-2Mi               0 (0%)       0 (0%)
  hugepages-32Mi              0 (0%)       0 (0%)
  hugepages-64Ki              0 (0%)       0 (0%)
  attachable-volumes-aws-ebs  0            0
Events:                       <none>

NodeMetrics:

  "usage": {
    "cpu": "45283753n",
    "memory": "717664Ki"
  }

@serathius
Contributor

serathius commented Sep 30, 2021

Why can node memory utilization exceed 100% (the same can happen for CPU)? Simple: because it is not the utilization of the physical device, but the utilization of the resources allocated for pods and system daemons.

How is the utilization of allocated resources calculated? Simple: it is the sum of resources used on the node divided by all resources allocated.

How does the kubelet allocate resources for pods? It takes all the resources available on the VM and subtracts the resources it reserves for itself, the kernel, etc.

How does the kubelet know how many resources it needs to reserve? The user provides them in flags.

There is no error and no Kubernetes magic that gets us over 100%; it is just a question of how you define utilization. The kubelet reserves some resources for the system, and those are not included when calculating node utilization, which means it can go above 100% when pods start using the reserved resources (this is what over-committing means in the "kubectl describe node" output).
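
For illustration, reservations like these are typically set via kubelet flags (or the equivalent fields in the kubelet config file); the values below are made-up placeholders, not taken from any node in this thread:

--system-reserved=memory=512Mi,cpu=250m
--kube-reserved=memory=512Mi,cpu=250m
--eviction-hard='memory.available<100Mi'

Allocatable is then capacity minus these reservations and eviction thresholds, and that is the denominator kubectl top uses for MEMORY%.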

Some math based on your kubectl describe node output:
Node capacity: 968424Ki
Node allocatable: 559848Ki
Node usage: 717664Ki

Node utilization: Node usage / node allocatable = 717664Ki / 559848Ki = 128%
VM utilization: Node usage / node capacity = 717664Ki / 968424Ki = 74%
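
A quick way to double-check those ratios from a shell (assuming a standard awk is available):

awk 'BEGIN { printf "%.0f%%\n", 717664/559848*100 }'   # 128%
awk 'BEGIN { printf "%.0f%%\n", 717664/968424*100 }'   # 74%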

@dashpole
Contributor

dashpole commented Oct 4, 2021

Based on my reading of the metrics-server implementation, I don't think "Node Usage" includes only usage from pods. The node_memory_working_set_bytes metric includes usage by system daemons, so I'm not sure it makes sense to compare it to allocatable, which is meant to exclude system daemons.
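
For anyone who wants to inspect this on their own cluster, one way to see the node-level working set (which covers system daemons as well as pods) is the kubelet Summary API through the API server proxy; the node name below is a placeholder, and jq is assumed to be installed:

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.node.memory'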

@serathius
Contributor

Hmm, I was not aware of that. This is a good point; since utilization was implemented in kubectl by a different SIG (SIG CLI), it's possible that it was not properly reviewed by other stakeholders (SIG Instrumentation / SIG Node). It would make sense to revisit what should be displayed as node utilization.

k8s-publishing-bot pushed a commit to kubernetes/kubectl that referenced this issue Nov 5, 2021
…l node memory total usage.

If "Allocatable" is used to a node total memory size, under high memory pressure or pre-reserved memory value is bigger, the "MEMORY%" can be bigger than 100%.
For suppressing the confusing, add a option to show node real memory usage based on "Capacity".
* Reference: kubernetes/kubernetes#86499

Kubernetes-commit: 862937bf1c7975d3f54ae47a2958e47f2c50150f