
Number of "loaded inactive dead" systemd transient mount units continues to grow #57345

Closed
saad-ali opened this issue Dec 18, 2017 · 40 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

saad-ali (Member) commented Dec 18, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

The number of systemd transient mount units continues to grow unchecked on nodes.

I see a massive number of loaded inactive dead secret transient mount units, for example:

home-kubernetes-containerized_mounter-rootfs-var-lib-kubelet-pods-947b5769\x2de3d0\x2d11e7\x2d874f\x2d42010a8002d1-volumes-kubernetes.io\x7e...-default\x2dtoken\x2dvfqcv.mount loaded    inactive dead      /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/94f62320-e319-11e7-874f-42010a8002d1/volumes/kubernetes.io~secret/default-token-vfqcv

We suspect (but have not yet verified) that once there are too many transient mount units, subsequent mounts will fail with:

Warning FailedMount 28m (x40 over 1h) kubelet, ... MountVolume.SetUp failed for volume "myvol" : mount failed: exit status 1 
Mounting command: systemd-run 
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/8d3db903-e053-11e7-b0ce-42010a800024/volumes/kubernetes.io~secret/myvol --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/8d3db903-e053-11e7-b0ce-42010a800024/volumes/kubernetes.io~secret/myvol 
Output: Failed to start transient scope unit: Argument list too long

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

I left my one-node cluster running over the weekend with a single cron job (based on the example cronjob in https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/).

On my test machine I saw the following on Friday evening:

$ mount | grep -i kube | wc -l
37
$ ls -l /run/systemd/transient/ | wc -l
1
$ systemctl list-units --all | wc -l
825

And the following on Monday morning:

$ mount | grep -i kube | wc -l
37
$ ls -l /run/systemd/transient/ | wc -l
1
$ systemctl list-units --all | wc -l
16154

Anything else we need to know?:

We suspect PR #49640, which forces mounts to run in their own systemd scope. It went into k8s 1.8, so all 1.8+ versions running systemd may be affected.

CC @jsafrane @derekwaynecarr

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 18, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 18, 2017
@saad-ali saad-ali added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 18, 2017
saad-ali (Member Author):

/sig storage

saad-ali (Member Author):

CC @kubernetes/sig-storage-bugs

saad-ali (Member Author):

Mitigation: I can verify that rebooting the machine ("sudo shutdown -r") cleans up the transient mounts.
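
A quick way to watch for the leak (a sketch; this just counts mount units that systemctl reports as "inactive dead", which is how the leaked units appear above):

$ systemctl list-units --all --type=mount | grep -c "inactive dead"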

rootfs (Contributor) commented Dec 18, 2017

gnufied (Member) commented Dec 18, 2017

I ran some tests locally: systemd treats these mounts as mount units, so they show up in systemctl list-units, but once the secret has been unmounted the unit disappears from the list. @saad-ali can you verify whether the unit remains in the list even after the secret has been unmounted?

gnufied (Member) commented Dec 18, 2017

@saad-ali was this cluster started via ./cluster/local-up-cluster.sh? If yes, this environment defaults KeepTerminatedVolumes to true, which will cause mount points to persist even after the pod using them has been terminated...

gnufied (Member) commented Dec 18, 2017

I have opened a PR to change the default for local clusters - #57355

rootfs (Contributor) commented Dec 18, 2017

Is kubelet started by systemd too? What's the resource limit in the kubelet unit file?

saad-ali (Member Author):

@gnufied No, I was able to repro against a standard GKE/GCE cluster. Once the secret is unmounted, the unit appears to stick around as loaded inactive dead in systemctl list-units --all. The key to reproing was setting up a cluster with 1 node and a simple Kubernetes cron job (which causes a new container to be created every minute).

gnufied (Member) commented Dec 18, 2017

@saad-ali do you still have that cluster around? Can you check whether those secrets are still mounted, or whether they just show up in systemctl list-units while the mount points are gone?

saad-ali (Member Author):

@saad-ali do you still have that cluster around? Can you check whether those secrets are still mounted, or whether they just show up in systemctl list-units while the mount points are gone?

They don't appear to be mounted, they just show up in the systemctl list-units.

Example entry:

var-lib-kubelet-pods-4cb56507\x2de42f\x2d11e7\x2da1b6\x2d42010a80022d-volumes-kubernetes.io\x7esecret-default\x2dtoken\x2d5zqkr.mount                                                  loaded    inactive dead      /var/lib/kubelet/pods/4cb56507-e42f-11e7-a1b6-42010a80022d/volumes/kubernetes.io~secret/default-token-5zqkr

And no associated entry in mount table:

$ mount | grep -i "4cb56507-e42f-11e7-a1b6-42010a80022d"

The number of entries in the mount table remains static:

$ mount | grep -i kube | wc -l
35

saad-ali (Member Author):

I'm also seeing the systemd transient units growing unchecked on Kubernetes 1.6. Over the course of an hour:

$ systemctl list-units --all | wc -l
369
$ systemctl list-units --all | wc -l
549

So my hypothesis is that this has been happening for a while, but the reason it's becoming an issue now is that in k8s 1.8+ (with PR #49640), all k8s mount operations are executed as scoped transient units, and once the max number of units is hit, all subsequent Kubernetes-triggered mounts fail.

rootfs (Contributor) commented Dec 19, 2017

It might be a platform issue: @mtanino tested it with 16K transient mounts on Fedora 25 without hitting this issue.

mkdir -p /mnt/test; for i in $(seq 1 16384); do echo $i; mkdir /mnt/test/$i; systemd-run --scope -- mount -t tmpfs tmpfs /mnt/test/$i; done

msau42 (Member) commented Dec 19, 2017

It would be better to test on different platforms with the full Kubernetes stack, i.e., the cron job example. It could be something in Kubernetes or docker that is causing the leak.

gnufied (Member) commented Dec 19, 2017

I think this is indeed a GKE/GCE problem. The /home/kubernetes/containerized_mounter/ directory appears to recursively bind mount the rootfs, including /var/lib/kubelet. All mounts inside /var/lib/kubelet also propagate into the /home/kubernetes/containerized_mounter/xxx directory (because containerized_mounter uses a shared mount, I guess).

I can confirm that recursively bind mounting a directory with the shared option in another place and then mounting tmpfs inside the directory causes the tmpfs to propagate to the bind mount directory as well (so you get two systemd units per mount; see the sketch below). But on umount, both systemd units are removed from the unit listing. So the bug isn't entirely because of the rootfs being mounted in multiple places, though it does exacerbate the problem somewhat, because for each mount two systemd units are created.
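
A sketch of that propagation behavior (hypothetical paths; on most systemd hosts / is already shared, so the explicit make-shared may be redundant):

$ sudo mkdir -p /src /dst
$ sudo mount --bind /src /src && sudo mount --make-shared /src   # make /src a shared mount point
$ sudo mount --rbind /src /dst                                   # the bind mount joins the same peer group
$ sudo mkdir -p /src/vol
$ sudo mount -t tmpfs tmpfs /src/vol                             # propagates to /dst/vol as well
$ mount | grep /vol                                              # two entries, hence two systemd mount units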

The GCE image also appears to be using overlay2 and has a weird bind mount of /var/lib/docker on itself. Things to investigate next:

In isolation, check whether this is somehow related to overlay2. The mount error that @saad-ali posted above is because of the way layers are mounted in overlay2: overlay2 uses symlinks to reduce the number of arguments supplied to the mount command.
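
For illustration, a way to see those symlinks on a node (assuming a default Docker overlay2 layout, where the l/ directory holds shortened symlinks to layer directories so the lowerdir= mount option stays short):

$ sudo ls -l /var/lib/docker/overlay2/l/ | head   # short names, each a symlink to a layer directory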

gnufied (Member) commented Dec 19, 2017

@msau42 I tested the full stack locally on Ubuntu and then the full stack on EBS, and I can't reproduce the problem.

saad-ali (Member Author):

The limit, based on what @wonderfly dug up, is 131072 transient units (https://github.com/systemd/systemd/blob/v232/src/core/unit.c#L229), so you won't hit the issue with 16k units.
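
Back-of-the-envelope: at the ~180 leaked units/hour observed on the 1.6 node above (549 − 369 over an hour), 131072 / 180 ≈ 728 hours, i.e. roughly 30 days for a node to hit the ceiling; heavier pod churn, or the doubled units created under /home/kubernetes/containerized_mounter, would get there considerably faster.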

That said, it does look like the containerized_mounter is causing the leaks:

$ sudo mkdir -p /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmnt/alpha; sudo systemd-run --scope -- /home/kubernetes/containerized_mounter/mounter mount -t tmpfs tmpfs /var/lib/kubelet/testmnt/alpha
Running scope as unit: run-r5bde6edc9a5d4529bae2a560d81c8025.scope
$ systemctl list-units --all | grep -i "testmnt"
  home-kubernetes-containerized_mounter-rootfs-var-lib-kubelet-testmnt-alpha.mount                                                                                                       loaded    inactive dead      /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmnt/alpha                                                                              
  var-lib-kubelet-testmnt-alpha.mount                                                                                                                                                    loaded    inactive dead      /var/lib/kubelet/testmnt/alpha                                                                                                                           
$ mount | grep -i "testmnt"
tmpfs on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmnt/alpha type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/testmnt/alpha type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/testmnt/alpha type tmpfs (rw,relatime)
tmpfs on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmnt/alpha type tmpfs (rw,relatime)
$ sudo umount /var/lib/kubelet/testmnt/alpha/
$ mount | grep -i "testmnt"
$ systemctl list-units --all | grep -i "testmnt"
  home-kubernetes-containerized_mounter-rootfs-var-lib-kubelet-testmnt-alpha.mount                                                                                                       loaded    inactive dead      /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmnt/alpha                                                                              
  var-lib-kubelet-testmnt-alpha.mount                                                                                                                                                    loaded    inactive dead      /var/lib/kubelet/testmnt/alpha                                                                                                                           

Mounts created directly with the host mount utility do not appear to have the same issue.

saad-ali (Member Author):

That said, it does look like the containerized_mounter is causing the leaks:

@gnufied pointed out offline that this is a bit misleading:

[12:45] saad, I think the problem isn't that containerized_mounter is causing the leak. The problem is that anything that gets mounted in /home/kubernetes/containerized_mounter/ is creating an additional inactive/dead systemd unit
[12:45] even if you use regular host mount
[12:46] or specifically - if you mount anything in /var/lib/kubelet it propagates and creates another unit for /home/kubernetes/containerized_mounter
[12:46] just using the regular mount command, I am not even using systemd-run

So to be clear: the problem is with the way the containerized_mounter is set up (specifically its mount propagation), not with containerized_mounter triggering the mounting.

saad-ali (Member Author):

So to be clear: the problem is with the way the containerized_mounter is set up (specifically its mount propagation), not with containerized_mounter triggering the mounting.

To expand on this, the mount does not need to be created by the containerized_mounter for the inactive dead systemd transient mount unit to be created. Any mount created in the /var/lib/kubelet/ dir will do this:

$ sudo mount -t tmpfs tmpfs /var/lib/kubelet/testmntsaad1/
$ mount | grep -i "testmntsaad"
tmpfs on /var/lib/kubelet/testmntsaad1 type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/testmntsaad1 type tmpfs (rw,relatime)
tmpfs on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmntsaad1 type tmpfs (rw,relatime)
tmpfs on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmntsaad1 type tmpfs (rw,relatime)
$ systemctl list-units --all | grep -i "testmntsaad"
  home-kubernetes-containerized_mounter-rootfs-var-lib-kubelet-testmntsaad1.mount   loaded    inactive dead      /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmntsaad1
  var-lib-kubelet-testmntsaad1.mount                                                loaded    inactive dead      /var/lib/kubelet/testmntsaad1
$ sudo umount /var/lib/kubelet/testmntsaad1/
$ mount | grep -i "testmntsaad"
$ systemctl list-units --all | grep -i "testmntsaad"
  home-kubernetes-containerized_mounter-rootfs-var-lib-kubelet-testmntsaad1.mount   loaded    inactive dead      /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/testmntsaad1
  var-lib-kubelet-testmntsaad1.mount                                                loaded    inactive dead      /var/lib/kubelet/testmntsaad1

Any mounts created in /var/lib/docker will also do this:

$ sudo mkdir /var/lib/docker/saaddockertest1
$ mount | grep -i "saaddockertest1"
$ systemctl list-units --all | grep -i "saaddockertest1"
$ sudo mount -t tmpfs tmpfs /var/lib/docker/saaddockertest1
$ mount | grep -i "saaddockertest1"
tmpfs on /var/lib/docker/saaddockertest1 type tmpfs (rw,relatime)
tmpfs on /var/lib/docker/saaddockertest1 type tmpfs (rw,relatime)
$ systemctl list-units --all | grep -i "saaddockertest1"
  var-lib-docker-saaddockertest1.mount                                                                                                                                                   loaded    inactive dead      /var/lib/docker/saaddockertest1                                                                                                                          
$ sudo umount /var/lib/docker/saaddockertest1
$ mount | grep -i "saaddockertest1"
$ systemctl list-units --all | grep -i "saaddockertest1"
  var-lib-docker-saaddockertest1.mount                                                                                                                                                   loaded    inactive dead      /var/lib/docker/saaddockertest1                                                                                                                          

Mounts created outside those directories do not appear to have this issue:

$ sudo mkdir /tmp/mnttestsaad1tmp
$ sudo mount -t tmpfs tmpfs /tmp/mnttestsaad1tmp/
$ mount | grep -i "mnttestsaad1tmp"
tmpfs on /tmp/mnttestsaad1tmp type tmpfs (rw,relatime)
$ systemctl list-units --all | grep -i "mnttestsaad1tmp"
  tmp-mnttestsaad1tmp.mount                                                                                                                                                              loaded    active   mounted   /tmp/mnttestsaad1tmp                                                                                                                                     
$ sudo umount /tmp/mnttestsaad1tmp/
$ mount | grep -i "mnttestsaad1tmp"
$ systemctl list-units --all | grep -i "mnttestsaad1tmp"

These directories are both set up with shared mount propagation:

/var/lib/docker                                                  /dev/sda1[/var/lib/docker]    shared
/var/lib/kubelet                                                 /dev/sda1[/var/lib/kubelet]   shared
/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet    /dev/sda1[/var/lib/kubelet]   shared

Will follow up with COS team.
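
A quick way to check those flags on a node is findmnt's PROPAGATION column, e.g.:

$ findmnt -o TARGET,SOURCE,PROPAGATION /var/lib/kubelet
$ findmnt -o TARGET,SOURCE,PROPAGATION /var/lib/docker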

gnufied (Member) commented Dec 20, 2017

BTW, an easier mitigation might be systemctl daemon-reload, which will remove those dead/inactive units.
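
For example, on a node that already shows leaked units (a quick sanity check):

$ systemctl list-units --all | wc -l   # large before the reload
$ sudo systemctl daemon-reload         # drops the leaked inactive/dead units
$ systemctl list-units --all | wc -l   # count should fall back to normal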

saad-ali (Member Author):

Ya, that's the mitigation we are using at the moment.

msau42 (Member) commented Dec 20, 2017

@rootfs @mtanino would you guys be able to try the mount experiment again in a directory that is set up with shared mount propagation? Trying to narrow down if this is a general issue with shared mount propagation + systemd.

mtanino commented Dec 23, 2017

@msau42 @rootfs
ok. Will do.

Is your expected command line like this?

for i in $(seq 1 16384); do echo $i; mkdir -p /mnt/test/$i; mkdir -p /mnt/test-bind/$i; systemd-run --scope -- mount --make-shared -t tmpfs tmpfs /mnt/test/$i; systemd-run --scope -- mount --bind /mnt/test/$i /mnt/test-bind/$i; done
for i in $(seq 1 16384); do sudo umount /mnt/test/$i; sudo umount /mnt/test-bind/$i; done

msau42 (Member) commented Dec 27, 2017

@mtanino actually, just the /mnt/test directory should be bind mounted shared. The underlying mounts are normal mounts.
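
In other words, something like this (a sketch of the intended setup; the count and paths are arbitrary):

$ mkdir -p /mnt/test
$ mount --bind /mnt/test /mnt/test   # bind /mnt/test onto itself
$ mount --make-shared /mnt/test      # give the bind mount shared propagation
$ for i in $(seq 1 1000); do mkdir -p /mnt/test/$i; mount -t tmpfs tmpfs /mnt/test/$i; done   # plain mounts underneath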

mcluseau (Contributor) commented Jan 9, 2018

Hi, good catch, I'm seeing this in my cluster too:

# systemctl list-units --all | wc -l
131079

The problem is that this starves systemd's resources, causing errors like:

# systemctl daemon-reload
Failed to reload daemon: No buffer space available

It seems that waiting a bit allows the daemon-reload to work though:

# systemctl list-units --all | wc -l
1095

Another consequence is that socket-activated services, like sshd on CoreOS Container Linux, can't start anymore.

[edit] some references that I think are in the same error class:

mtanino commented Jan 12, 2018

@msau42 @saad-ali @gnufied

Sorry for my late response. I tried the shared bind mount using the following command, as @msau42 mentioned, but I didn't see any "loaded inactive dead" entries in systemctl list-units.

for i in $(seq 1 32767); do echo $i; mkdir -p /var/lib/kubelet/test/$i; mkdir -p /var/lib/kubelet/test-sharedbind/$i; mount -t tmpfs tmpfs /var/lib/kubelet/test/$i; systemd-run --scope -- mount --make-shared --bind /var/lib/kubelet/test/$i /var/lib/kubelet/test-sharedbind/$i; done

Here are my test results.

I tried the same steps as @saad-ali but didn't see loaded inactive dead.

root# uname -a
Linux bl-k8sbuild 4.13.12-100.fc25.x86_64 #1 SMP Wed Nov 8 18:13:25 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root# mkdir /var/lib/kubelet/testmntsaad1/
root# sudo mount -t tmpfs tmpfs /var/lib/kubelet/testmntsaad1/
root# mount | grep -i "testmntsaad"
tmpfs on /var/lib/kubelet/testmntsaad1 type tmpfs (rw,relatime)
root# systemctl list-units --all | grep -i "testmntsaad"
  var-lib-kubelet-testmntsaad1.mount                                                                             loaded    active   mounted   /var/lib/kubelet/testmntsaad1
root# sudo umount /var/lib/kubelet/testmntsaad1/
root# mount | grep -i "testmntsaad"
root# systemctl list-units --all | grep -i "testmntsaad"
root#

I also tried the shared bind mount with 32767 mount points but didn't see the problem.

root# uname -a
Linux bl-k8sbuild 4.13.12-100.fc25.x86_64 #1 SMP Wed Nov 8 18:13:25 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root# cat transient_mount_v2.sh 
for i in $(seq 1 32767); do echo $i; mkdir -p /var/lib/kubelet/test/$i; mkdir -p /var/lib/kubelet/test-sharedbind/$i; mount -t tmpfs tmpfs /var/lib/kubelet/test/$i; systemd-run --scope -- mount --make-shared --bind /var/lib/kubelet/test/$i /var/lib/kubelet/test-sharedbind/$i; done
root# ./transient_mount_v2.sh
1
Running scope as unit: run-r8dd78c69477f4a5d99fa327575769464.scope
2
Running scope as unit: run-r3f44a91e916f4b869484232af8c017aa.scope
3
....


root# systemctl list-units --all | grep -i var-lib-kubelet-test | less
root# systemctl list-units --all | grep -i var-lib-kubelet-test | wc
  65534  327670 11479005
root# systemctl list-units --all | grep -i var-lib-kubelet-test | grep inactive
root# systemctl list-units --all | grep -i var-lib-kubelet-test | head -n 10
  var-lib-kubelet-test-1.mount                                                                                   loaded    active   mounted   /var/lib/kubelet/test/1
  var-lib-kubelet-test-10.mount                                                                                  loaded    active   mounted   /var/lib/kubelet/test/10
  var-lib-kubelet-test-100.mount                                                                                 loaded    active   mounted   /var/lib/kubelet/test/100
  var-lib-kubelet-test-1000.mount                                                                                loaded    active   mounted   /var/lib/kubelet/test/1000
  var-lib-kubelet-test-10000.mount                                                                               loaded    active   mounted   /var/lib/kubelet/test/10000
  var-lib-kubelet-test-10001.mount                                                                               loaded    active   mounted   /var/lib/kubelet/test/10001
  var-lib-kubelet-test-10002.mount                                                                               loaded    active   mounted   /var/lib/kubelet/test/10002
  var-lib-kubelet-test-10003.mount                                                                               loaded    active   mounted   /var/lib/kubelet/test/10003
  var-lib-kubelet-test-10004.mount                                                                               loaded    active   mounted   /var/lib/kubelet/test/10004
  var-lib-kubelet-test-10005.mount                                                                               loaded    active   mounted   /var/lib/kubelet/test/10005
root#

honnix commented Jan 14, 2018

Just to confirm this is still happening on GKE 1.8.5 nodes.

saad-ali (Member Author):

At this point, we believe this to be a systemd issue tracked by systemd/systemd#7798:

In systemd, if a directory is bind mounted to itself (it doesn't matter how this is done) and something is then mounted at a subdirectory of that directory (by anything other than systemd-mount), systemd will create a bad mount unit for that mount point (the mount point at the subdirectory). This mount unit appears to be cleaned up only when either the parent bind mount is unmounted or systemctl daemon-reload is run.
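
A minimal reproduction sketch based on that description (hypothetical paths; it assumes a systemd version that still has the bug):

$ mkdir -p /mnt/self
$ sudo mount --bind /mnt/self /mnt/self                 # bind the directory onto itself
$ mkdir -p /mnt/self/sub
$ sudo mount -t tmpfs tmpfs /mnt/self/sub               # mount under it with plain mount, not systemd-mount
$ sudo umount /mnt/self/sub
$ systemctl list-units --all | grep mnt-self-sub        # the bad unit lingers as "loaded inactive dead"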

The suggested mitigation is to periodically run systemctl daemon-reload to clear the bad transient mount units.
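
One way to apply that periodically (a sketch only; the interval is arbitrary, and daemon-reload is not free on a busy node):

# /etc/cron.d/k8s-systemd-reload (hypothetical)
*/30 * * * * root /bin/systemctl daemon-reload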

Once it is fixed in systemd, the changes must be picked up in the OS you are using.

tomoe (Contributor) commented Apr 2, 2018

For the record, the systemd issue has been addressed with systemd/systemd#7811 and is included in tag v237.

artemyarulin:

Does anyone know when systemd will be updated on GKE? We hit this issue every other month :(

honnix commented Aug 6, 2018

If I'm not mistaken, the update has been done in GKE, at least since version 1.8.8-gke.0. @artemyarulin

artemyarulin:

Thank you @honnix, but we have v1.9.2-gke.1 and just hit the issue today :(

honnix commented Aug 6, 2018

Interesting; then I'm not sure. A regression? Or maybe it was ported to a later 1.9.x than 1.9.2?

artemyarulin:

Yeah, I'll try to migrate to the latest one today and see how it behaves, thank you!

msau42 (Member) commented Aug 6, 2018

@honnix this is fixed in GKE 1.9.3+

msau42 (Member) commented Aug 6, 2018

oops and @artemyarulin ^

artemyarulin:

@msau42 This is awesome, thank you! Doing migration now :)

martinnaughton:

I had the issue on Azure with Kubernetes 1.9.9. I restarted the node and the issue disappeared. It might appear again after a few days.

cofyc (Member) commented Oct 5, 2018

@martinnaughton You need to upgrade your systemd to v237+ or patch your current systemd with the change from systemd PR 7811.
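
To check what a node is running (the first line of output shows the version, which should be 237 or later):

$ systemctl --version | head -n 1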

zhangjiarui130:

I have tested the command "systemctl daemon-reload"; it doesn't work.
My issue is that frequent creation and termination of Pods by Jobs causes the time to create and stop Pods to gradually increase.
