Application crash due to k8s 1.9.x enabling kernel memory accounting by default #61937

Closed
wzhx78 opened this issue Mar 30, 2018 · 120 comments
Labels
kind/support Categorizes issue or PR as a support question. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@wzhx78

wzhx78 commented Mar 30, 2018

When we upgraded k8s from 1.6.4 to 1.9.0, after a few days our production environment reported machines hanging and JVMs crashing randomly inside containers. We found that the cgroup memory css IDs were not being released; once the css ID count grows larger than 65535, the machine hangs and we have to restart it.

We found that runc's libcontainer/cgroups/fs/memory.go vendored into k8s 1.9.0 had deleted the if condition, which enables kernel memory accounting by default. But we are using kernel 3.10.0-514.16.1.el7.x86_64, and on this version kernel memory limits are not stable, which leaks memory cgroups and crashes applications randomly.

when we run "docker run -d --name test001 --kernel-memory 100M " , docker report
WARNING: You specified a kernel memory limit on a kernel older than 4.0. Kernel memory limits are experimental on older kernels, it won't work as expected and can cause your system to be unstable.

k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go

-		if d.config.KernelMemory != 0 {
+			// Only enable kernel memory accouting when this cgroup
+			// is created by libcontainer, otherwise we might get
+			// error when people use `cgroupsPath` to join an existed
+			// cgroup whose kernel memory is not initialized.
 			if err := EnableKernelMemoryAccounting(path); err != nil {
 				return err
 			}

I want to know why kernel memory accounting is now enabled by default. Can k8s take the different kernel versions into account?
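
For anyone checking whether this applies to their node, the kernel-memory accounting state of a container's cgroup can be inspected under the cgroup v1 memory hierarchy. A rough sketch (the POD_UID and CONTAINER_ID variables are placeholders for values from your own node, not from this issue; the exact path depends on the cgroup driver and QoS class):

# assumes POD_UID and CONTAINER_ID hold a real pod UID and container ID from this node
CG=/sys/fs/cgroup/memory/kubepods/pod${POD_UID}/${CONTAINER_ID}
cat $CG/memory.kmem.limit_in_bytes   # typically 9223372036854771712 ("unlimited"), but accounting may still be switched on
cat $CG/memory.kmem.usage_in_bytes   # stays 0 while accounting is off; reports real usage once accounting has been enabled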

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

Uncomment only one, leave it on its own line:

/kind bug
/kind feature

What happened:
Applications crash and memory cgroups leak.

What you expected to happen:
Applications stay stable and memory cgroups do not leak.

How to reproduce it (as minimally and precisely as possible):
Install k8s 1.9.x on a machine with kernel 3.10.0-514.16.1.el7.x86_64, then create and delete pods repeatedly. After more than 65535/3 creations, the kubelet reports "cgroup no space left on device" errors, and after the cluster has been running for a few days the containers crash.

Anything else we need to know?:

Environment: kernel 3.10.0-514.16.1.el7.x86_64

  • Kubernetes version (use kubectl version): k8s 1.9.x
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a): 3.10.0-514.16.1.el7.x86_64
  • Install tools: rpm
  • Others:
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 30, 2018
@qkboy

qkboy commented Mar 30, 2018

The test case below reproduces this error.
First, fill the memory cgroup hierarchy up to its limit:

# uname -r
3.10.0-514.10.2.el7.x86_64
# kubelet --version
Kubernetes 1.9.0
# mkdir /sys/fs/cgroup/memory/test
# for i in `seq 1 65535`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done
# cat /proc/cgroups |grep memory
memory  11      65535   1

Then release 99 memory cgroup slots that can be used for the next creations:

# for i in `seq 1 100`;do rmdir /sys/fs/cgroup/memory/test/test-${i} 2>/dev/null 1>&2; done 
# mkdir /sys/fs/cgroup/memory/stress/
# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done 
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device <-- note that number 100 cannot be created
# for i in `seq 1 100`;do rmdir /sys/fs/cgroup/memory/test/test-${i}; done <-- delete the 100 memory cgroups
# cat /proc/cgroups |grep memory
memory  11      65436   1

Second, create a new pod on this node.
Each pod creates 3 memory cgroup directories, for example:

# ll /sys/fs/cgroup/memory/kubepods/pod0f6c3c27-3186-11e8-afd3-fa163ecf2dce/
total 0
drwxr-xr-x 2 root root 0 Mar 27 14:14 6d1af9898c7f8d58066d0edb52e4d548d5a27e3c0d138775e9a3ddfa2b16ac2b
drwxr-xr-x 2 root root 0 Mar 27 14:14 8a65cb234767a02e130c162e8d5f4a0a92e345bfef6b4b664b39e7d035c63d1

So when we recreate the 100 memory cgroup directories, 4 of them fail:

# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done    
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-97’: No space left on device <-- 3 directories used by the pod
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-98’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-99’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device
# cat /proc/cgroups 
memory  11      65439   1

Third, delete the test pod. After confirming that all of the test pod's containers have been destroyed, recreate the 100 memory cgroup directories.
The correct result we would expect is that only directory number 100 cannot be created:

# cat /proc/cgroups 
memory  11      65436   1
# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done 
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device

But the actual, incorrect result is that all the memory cgroup directories created by the pod are leaked:

# cat /proc/cgroups 
memory  11      65436   1 <-- current total number of memory cgroups
# for i in `seq 1 100`;do mkdir /sys/fs/cgroup/memory/test/test-${i}; done    
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-97’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-98’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-99’: No space left on device
mkdir: cannot create directory ‘/sys/fs/cgroup/memory/test/test-100’: No space left on device

Notice that the memory cgroup count has already been reduced by 3, but the slots those cgroups occupied have not been released.
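
Since /proc/cgroups does not show these leaked "dying" cgroups, one way to see how many usable memory cgroup slots a node still has is to probe with mkdir until ENOSPC, just as above. A sketch of such a probe, for a test node only since it temporarily creates a large number of cgroups (plain mkdir does not enable kmem accounting, so the probe cgroups themselves are not leaked):

probe=/sys/fs/cgroup/memory/probe
mkdir $probe
free=0
while mkdir $probe/slot-$free 2>/dev/null; do free=$((free + 1)); done
echo "creatable memory cgroups before ENOSPC: $free"
# clean up the probe cgroups again
for i in $(seq 0 $((free - 1))); do rmdir $probe/slot-$i; done
rmdir $probe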

@wzhx78
Author

wzhx78 commented Mar 30, 2018

/sig container
/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 30, 2018
@wzhx78
Author

wzhx78 commented Mar 30, 2018

@kubernetes/sig-cluster-container-bugs

@feellifexp

This bug seems to be related: opencontainers/runc#1725

Which docker version are you using?

@qkboy

qkboy commented Mar 30, 2018

@feellifexp with docker 1.13.1

@frol

frol commented Mar 30, 2018

There is indeed a kernel memory leak in kernels up to the 4.0 release. You can follow this link for details: moby/moby#6479 (comment)

@wzhx78
Author

wzhx78 commented Mar 31, 2018

@feellifexp the kernel log also shows this message after the upgrade to k8s 1.9.x:

kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x8020)

@wzhx78
Author

wzhx78 commented Mar 31, 2018

I want to know why k8s 1.9 deleted the line `if d.config.KernelMemory != 0 {` in k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go.

@feellifexp

I am not an expert here, but this seems to be a change from runc, and it was introduced into k8s in v1.8.
After reading the code, it appears to affect the cgroupfs cgroup driver, while the systemd driver is unchanged. But I have not tested that theory yet.
Maybe experts on the kubelet and containers can chime in further.

@kevin-wangzefeng
Member

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 31, 2018
@kevin-wangzefeng
Member

I want to know why k8s 1.9 deleted the line `if d.config.KernelMemory != 0 {` in
k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go

I guess opencontainers/runc#1350 is the one you are looking for, which is actually an upstream change.

/cc @hqhq

@wzhx78
Author

wzhx78 commented Mar 31, 2018

Thanks @kevin-wangzefeng, runc upstream did change this, and now I understand why; the change is hqhq/runc@fe898e7. But with kernel memory accounting enabled on the root cgroup by default, the child cgroups enable it as well, and this causes memory cgroups to leak on kernel 3.10.0. @hqhq, is there any way to let us enable or disable kernel memory accounting ourselves, or at least to get a warning in the log when the kernel is < 4.0?
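
To illustrate the inheritance described here, below is a sketch (test machine only, cgroup v1) of what runc's EnableKernelMemoryAccounting effectively does, setting a kernel memory limit and then resetting it to unlimited, and of how a child cgroup created afterwards is accounted as well:

parent=/sys/fs/cgroup/memory/kmem-demo
mkdir $parent
# set a kmem limit, then reset it to "unlimited"; accounting stays enabled for this cgroup
echo 1 > $parent/memory.kmem.limit_in_bytes
echo -1 > $parent/memory.kmem.limit_in_bytes
mkdir $parent/child
echo $$ > $parent/child/cgroup.procs          # run the current shell inside the child
cat $parent/child/memory.kmem.usage_in_bytes  # typically non-zero now: kernel memory is being charged to the child
echo $$ > /sys/fs/cgroup/memory/cgroup.procs  # move the shell back to the root memory cgroup
rmdir $parent/child $parent                   # on an affected 3.10 kernel the child may now linger as a "dying" cgroup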

@hqhq

hqhq commented Apr 1, 2018

@wzhx78 The root cause is that there are kernel memory limit bugs in 3.10. If you don't want to use kernel memory limits because they are not stable on your kernel, the best solution would be to disable kernel memory limits in your kernel.

I can't think of a way to work around this on the runc side without causing issues like opencontainers/runc#1083 and opencontainers/runc#1347, unless we add some ugly logic that does different things for different kernel versions, and I'm afraid that won't be an option.

@wzhx78
Author

wzhx78 commented Apr 1, 2018

@hqhq It is indeed a kernel 3.10 bug, but it took us a lot of time to find, and it caused us serious trouble in production, since all we did was upgrade k8s from 1.6.x to 1.9.x. In k8s 1.6.x, kernel memory accounting is not enabled by default because runc still had the if condition; since 1.9.x, runc enables it by default. We don't want others who upgrade to k8s 1.9.x to run into the same trouble. And since runc is a popular container runtime, we think it needs to consider different kernel versions, or at least report an error in the kubelet log when the kernel is not suitable for enabling kernel memory accounting by default.

@wzhx78
Author

wzhx78 commented Apr 2, 2018

@hqhq any comments ?

@hqhq

hqhq commented Apr 2, 2018

Maybe you can add an option like --disable-kmem-limit for both k8s and runc to make runc disable kernel memory accounting.

@warmchang
Contributor

v1.8 and all later versions are affected by this.
e5a6a79#diff-17daa5db16c7d00be0fe1da12d1f9165L39


@wzhx78
Author

wzhx78 commented Apr 3, 2018

@warmchang yes.

Is it reasonable to add a --disable-kmem-limit flag to k8s? Can anyone discuss this with us?

@like-inspur
Contributor

I can't find a config named disable-kmem-limit in k8s. How would this flag be added? @wzhx78

@wzhx78
Author

wzhx78 commented Apr 14, 2018

k8s doesn't support it yet; we need to discuss with the community whether it is reasonable to add this flag to the kubelet startup options.

@gyliu513
Contributor

Not only 1.9, but also 1.10 and master have the same issue. This is a very serious issue for production; I think providing a parameter to disable the kmem limit would be good.

/cc @dchen1107 @thockin any comments for this? Thanks.

@wzhx78
Author

wzhx78 commented Apr 23, 2018

@thockin @dchen1107 any comments for this?

@gyliu513
Contributor

gyliu513 commented May 21, 2018

@dashpole is there any reason memory.go was updated as follows in e5a6a79#diff-17daa5db16c7d00be0fe1da12d1f9165L39? This is seriously impacting Kubernetes 1.8, 1.9, 1.10, 1.11, etc.

-		if d.config.KernelMemory != 0 {
+			// Only enable kernel memory accouting when this cgroup
+			// is created by libcontainer, otherwise we might get
+			// error when people use `cgroupsPath` to join an existed
+			// cgroup whose kernel memory is not initialized.
 			if err := EnableKernelMemoryAccounting(path); err != nil {
 				return err
 			}

@gjkim42
Member

gjkim42 commented Apr 23, 2021

This issue still reproduces in the following environment when the kernel parameter cgroup.memory=nokmem is not set:

$ uname -r
3.10.0-1160.24.1.el7.x86_64
$ cat /etc/centos-release
CentOS Linux release 7.6.1810 (Core)

xref: docker/for-linux#841

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 23, 2021
@gjkim42
Member

gjkim42 commented Apr 23, 2021

/reopen

@k8s-ci-robot
Contributor

@gjkim42: Reopened this issue.

In response to this:

/reopen


@k8s-ci-robot k8s-ci-robot reopened this Apr 23, 2021
@k8s-ci-robot
Contributor

@wzhx78: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 23, 2021
@TvdW

TvdW commented Apr 23, 2021

CentOS 7.6 went EOL in 2019. Since the main fix for this problem was in CentOS 7.8, please check with a more recent version.

@gjkim42
Member

gjkim42 commented Apr 23, 2021

@TvdW

Also reproduced in the following environment.
After we create a pod and delete it, 3 memory cgroup directories are leaked.

$ uname -r
3.10.0-1160.11.1.el7.x86_64
$ cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)

@gjkim42
Member

gjkim42 commented Apr 23, 2021

#61937 (comment)

It seems that the kernel bug which causes this error is finally fixed now, and will be released in kernel-3.10.0-1075.el7, which is due in RHEL 7.8, but goodness knows when that will be, as RHEL 7.7 only came out on August 6th, ~3 weeks ago.

https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c101

There may be other bugs.

@JoshuaAndrew
Contributor

This problem is caused by both the OS and Docker:
(1) the kernel side was fixed in RHEL/CentOS 7.8;
(2) kmem accounting was disabled in runc on RHEL/CentOS (docker/escalation#614, docker/escalation#692) via docker/engine#121 in Docker CE 18.09.1.

So you should upgrade both CentOS and Docker.
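
A quick way to check whether a node already has both pieces in place (a sketch; the version thresholds are taken from the comments in this thread):

uname -r                                       # want kernel-3.10.0-1075.el7 or later (shipped with RHEL/CentOS 7.8)
docker version --format '{{.Server.Version}}'  # want Docker CE 18.09.1 or later (its runc build disables kmem accounting)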

@gjkim42
Member

gjkim42 commented Apr 23, 2021

@JoshuaAndrew
Do you mean that either of them alone addresses the issue, or do we need both of them?

@JoshuaAndrew
Contributor

@gjkim42
need both of them

@gjkim42
Member

gjkim42 commented Apr 23, 2021

Actually, I am using containerd directly (not via docker).

$ containerd --version
containerd containerd.io 1.3.9 ea765aba0d05254012b0b9e595e995c09186427f
$ runc --version
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev

@gjkim42
Member

gjkim42 commented Apr 23, 2021

Also reproduced in the following environment.

# cat /etc/centos-release
CentOS Linux release 7.8.2003 (Core)
# uname -r
3.10.0-1160.24.1.el7.x86_64
# containerd --version
containerd containerd.io 1.3.9 ea765aba0d05254012b0b9e595e995c09186427f
# runc --version
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev

@gjkim42
Member

gjkim42 commented Apr 23, 2021

#61937 (comment)

It seems that the kernel bug which causes this error is finally fixed now, and will be released in kernel-3.10.0-1075.el7, which is due in RHEL 7.8, but goodness knows when that will be, as RHEL 7.7 only came out on August 6th, ~3 weeks ago.

https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c101

I am not sure what they fixed in CentOS 7.8, but it didn't solve the problem.

According to docker-archive/engine@8486ea1,
I think CentOS 7 gave up solving this problem at the kernel level.

@chilicat

We have not seen the issue anymore since CentOS 7.7. We set the kernel option cgroup.memory=nokmem.

Our current environment is as follows:

# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)


# uname -r
3.10.0-1160.21.1.el7.x86_64

# docker version
Client:
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.6
 Git commit:        18.09.2
 Built:             Wed Mar  6 12:37:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Wed Mar  6 12:32:48 2019
  OS/Arch:          linux/amd64
  Experimental:     false



# runc --version
runc version 1.0.0-rc6+dev

# containerd --version
containerd github.com/containerd/containerd 1.2.2 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
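
For reference, on CentOS 7 the cgroup.memory=nokmem parameter is usually added via the kernel command line. A sketch for a BIOS system with GRUB2 (the grub.cfg path differs on EFI installs):

# 1. append cgroup.memory=nokmem to GRUB_CMDLINE_LINUX in /etc/default/grub
# 2. regenerate the GRUB configuration and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
# 3. after reboot, verify the parameter is active
grep -o 'cgroup.memory=nokmem' /proc/cmdline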

@gjkim42
Member

gjkim42 commented Apr 23, 2021

Thanks @chilicat
I also confirmed that setting the kernel parameter can resolve the issue.

However, I am wondering whether it is safe to set the kernel parameter cgroup.memory=nokmem, or whether there is any other way than setting the kernel parameter.

@wmealing

wmealing commented Apr 24, 2021

I misread your post @gjkim42, sorry, please ignore.

@gjkim42
Member

gjkim42 commented May 20, 2021

cc @ehashman @bobbypage @dims

Is sig-node aware of this issue?
I think every cluster hosted on CentOS 7 has hit this issue.

@ehashman
Member

CentOS 7 is a much older kernel than what we test CI on in SIG Node/upstream Kubernetes (currently the 5.4.x series). People are welcome to experiment with kernel parameters and share workarounds for their own distributions/deployments but any support will be best effort.

@kolyshkin
Contributor

I strongly suggest employing the workaround described at #61937 (comment).

Also, since v1.0.0-rc94 runc never sets kernel memory limits, so upgrading to runc >= v1.0.0-rc94 should solve the problem.

@ffromani
Contributor

Kubernetes does not use issues on this repo for support requests. If you have a question on how to use Kubernetes or to debug a specific issue, please visit our forums.

/remove-kind bug
/kind support
/close

Extra rationale: this issue affects CentOS 7, which is indeed much older than what we test in CI, and a workaround exists (see runc v1.0.0-rc94).

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jun 24, 2021
@k8s-ci-robot
Contributor

@fromanirh: Closing this issue.

In response to this:

Kubernetes does not use issues on this repo for support requests. If you have a question on how to use Kubernetes or to debug a specific issue, please visit our forums.

/remove-kind bug
/kind support
/close

Extra rationale: this issue affects CentOS 7, which is indeed much older than what we test in CI, and a workaround exists (see runc v1.0.0-rc94).


@andrewzrant

andrewzrant commented Mar 27, 2024

Thanks @chilicat I also confirmed that setting the kernel parameter can resolve the issue.

However, I am wondering whether it is safe to set the kernel parameter cgroup.memory=nokmem, or whether there is any other way than setting the kernel parameter.

Yes, I also want to know whether cgroup.memory=nokmem can cause bad results, and how cgroup kmem accounting is designed.
