
Run mount in its own systemd scope. #49640

Merged: 2 commits merged into kubernetes:master on Aug 9, 2017

Conversation

@jsafrane (Member) commented Jul 26, 2017

Kubelet needs to run /bin/mount in its own cgroup.

  • When kubelet runs as a systemd service, "systemctl restart kubelet" may kill all processes in the same cgroup and thus terminate fuse daemons that are needed for gluster and cephfs mounts.

  • When kubelet runs in a docker container, restart of the container kills all fuse daemons started in the container.

Killing fuse daemons is bad: it effectively unmounts volumes from running pods.

This patch runs mount via "systemd-run --scope /bin/mount ...", which ensures that any fuse daemons are forked into their own systemd scope (= cgroup) and will survive a restart of kubelet's systemd service or docker container.

This helps with #34965

As a downside, each new fuse daemon will run in its own transient systemd service and systemctl output may be cluttered.

@kubernetes/sig-storage-pr-reviews
@kubernetes/sig-node-pr-reviews
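The wrapping described above can be sketched in Go roughly as follows. This is an illustrative sketch, not the kubelet's actual code: the function name makeMountCommand and its signature are invented here, though the argument order (--description, --scope, --, mount) matches the kubelet log lines quoted later in this thread.

```go
package main

import "fmt"

// makeMountCommand wraps mount in a transient systemd scope when systemd-run
// is available, so fuse daemons forked by mount land in their own cgroup and
// survive a kubelet restart. An empty systemdRunPath models the fallback to
// plain mount on systems without systemd-run.
func makeMountCommand(systemdRunPath, target string, mountArgs []string) (string, []string) {
	if systemdRunPath != "" {
		wrapped := []string{
			"--description=Kubernetes transient mount for " + target,
			"--scope",
			"--",
			"mount",
		}
		return systemdRunPath, append(wrapped, mountArgs...)
	}
	// Fallback: no systemd-run installed, call mount directly.
	return "mount", mountArgs
}

func main() {
	cmd, args := makeMountCommand("/usr/bin/systemd-run", "/mnt/vol",
		[]string{"-t", "tmpfs", "tmpfs", "/mnt/vol"})
	fmt.Println(cmd, args)
}
```

The real code would hand the resulting command and arguments to exec.Command and inspect its combined output, exactly as the unwrapped mount path already does.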

fuse daemons for GlusterFS and CephFS are now run in their own systemd scope when Kubernetes runs on a system with systemd.
@jsafrane (Member Author) commented Jul 26, 2017

@sjenning (Contributor) commented Jul 26, 2017

Thanks! lgtm on the surface but I'll have to try it out and see how it operates IRL. I'll also be interested to see what @saad-ali thinks about this. Not sure if systemd dependence is already hardcoded into kube or not.

@jsafrane (Member Author) commented Jul 26, 2017

It's a soft dependency; there is a fallback to plain /bin/mount if systemd-run is not installed.

@jsafrane (Member Author) commented Jul 26, 2017

/retest

@@ -115,6 +117,36 @@ func doMount(mounterPath string, mountCmd string, source string, target string,
mountCmd = mounterPath
}

if _, err := os.Stat(systemdRunPath); err == nil {

@rootfs (Member) commented Jul 26, 2017

Does it work if the kubelet container has /bin/systemd-run in it?

@jsafrane (Author, Member) commented Jul 26, 2017

Kubelet in a container runs NsEnterMounter, and that runs systemd-run in the host mount namespace.

@@ -41,6 +41,8 @@ const (
expectedNumFieldsPerLine = 6
// Location of the mount file to use
procMountsPath = "/proc/mounts"
// Location of systemd-run binary
systemdRunPath = "/bin/systemd-run"

@rootfs (Member) commented Jul 26, 2017

On Ubuntu, systemd-run is at /usr/bin/systemd-run.

@jsafrane (Author, Member) commented Jul 27, 2017

Good catch, fixed

@jingxu97 (Contributor) commented Jul 26, 2017

@jsafrane How serious is the downside you mentioned?
So the idea here is: if the systemd-run binary is available, kubelet will use systemd-run --scope /bin/mount ... to mount. Otherwise, use the original mount binary and the problem mentioned in the commit will not be solved in this case. Is that right?

@euank (Member) left a comment

This seems like a needlessly complex way of moving a process into a cgroup.

cgexec is a reasonably common unix utility that executes a command in a cgroup, and it is a far simpler solution than "systemd-run" for doing so. Similarly, writing the pid of a given mount process to the cgroup's "tasks" file is much simpler.

If we do wish to use systemd for this, the right way of creating a transient unit from go is to bindmount in systemd's socket and use the go-systemd library to talk to it and create said unit. It could even be a mount unit instead of a service file executing mount instead.

However, I question whether this is the right solution to the problem in the first place. This seems like quite a hack, and I'd rather we decide to properly fix the feature than just pile on more hacks.

cc @lucab, we've had some discussion about the limitations of --containerized elsewhere (e.g. #46597 (review)).

// does not run as a systemd service.
// No code here, mountCmd and mountArgs are already populated.
}

glog.V(4).Infof("Mounting cmd (%s) with arguments (%s)", mountCmd, mountArgs)
command := exec.Command(mountCmd, mountArgs...)
output, err := command.CombinedOutput()

@euank (Member) commented Jul 27, 2017

This change breaks error handling here.

Before, err would be set if the mount command exited nonzero. Now it won't, since systemd-run will create a transient service and exit 0 once the service is created, regardless of whether the service actually executes successfully.

We still get errors if the referenced mount binary does not exist, but any errors after that are lost to the kubelet.

@jsafrane (Author, Member) commented Jul 27, 2017

No, systemd-run --scope waits for the command (/bin/mount) to exit and returns the right exit code and stdout/stderr:

$ systemd-run --scope /bin/sh -c "echo hello; exit 42"
Running as unit run-12305.scope.
hello
$ echo $?
42
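The same exit-code propagation can be checked from Go: exec.Command's CombinedOutput reports a nonzero exit status via *exec.ExitError, so as long as systemd-run --scope passes mount's exit status through (as the shell demonstration above shows), the kubelet's error handling keeps working. A sketch, with plain /bin/sh standing in for the systemd-run invocation so it runs anywhere:

```go
package main

import (
	"fmt"
	"os/exec"
)

// runAndCheck runs a command and returns its combined output plus its exit
// code. A nonzero exit surfaces as *exec.ExitError, mirroring how the kubelet
// would detect a failed mount even when wrapped by systemd-run --scope.
func runAndCheck(name string, args ...string) (string, int) {
	out, err := exec.Command(name, args...).CombinedOutput()
	code := 0
	if exitErr, ok := err.(*exec.ExitError); ok {
		code = exitErr.ExitCode()
	}
	return string(out), code
}

func main() {
	out, code := runAndCheck("/bin/sh", "-c", "echo hello; exit 42")
	fmt.Printf("output=%q exit=%d\n", out, code) // output="hello\n" exit=42
}
```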
@euank (Member) commented Jul 27, 2017

@jingxu97

The problems mentioned in the commit only impact someone if they're running the kubelet via systemd without KillMode=process. I believe most configurations of the kubelet today run with KillMode=process, so the problems in the original commit don't happen in practice regardless.

@jsafrane (Member Author) commented Jul 27, 2017

Otherwise, use the original mount binary and the problem mentioned in the commit will not be solved in this case. Is that right?

Not exactly. If there is no systemd-run, then I would assume that the system does not run systemd and it runs another service manager. AFAIK, only systemd kills all processes in service's cgroup on restart.

How serious is the downside you mentioned?

Well, if Kubernetes does not use glusterfs or cephfs, these new transient services will be very short-lived: they appear, mount a filesystem, and are deleted. I don't think anybody will notice the change. For gluster and ceph... they will pollute systemctl output. Apart from that, the allocated resources should IMO be negligible.

@jsafrane (Member Author) commented Jul 27, 2017

@euank, I can't be sure that cgexec is installed, and it would be a new kubelet dependency. OTOH, I can be reasonably sure that systemd-run is installed on a system with systemd.

And while talking to systemd via dbus is possible, it would add a lot of code to watch the transient service until the mount exits and to collect stdout/stderr. go-systemd is fairly low level. All this is already solved by systemd-run.

I believe most configurations of the kubelet today run with KillMode=process, so the problems in the original commit don't happen in practice regardless.

OpenShift runs KillMode=cgroup to kill faulty cadvisor probes.

@jsafrane jsafrane force-pushed the jsafrane:systemd-mount-service branch from 84bb078 to d6dad06 Jul 27, 2017

@jsafrane (Member Author) commented Jul 27, 2017

cgexec is a reasonably common unix utility

Did my homework... cgexec is not in the default installation on Ubuntu Xenial, Debian Jessie, Fedora 25, CentOS 7, nor CoreOS 1478. Which is pretty sad, as I am one of its authors... But I must admit, systemd now owns cgroups and it's far better at it than cgexec / cgclassify ever was.

@jsafrane jsafrane force-pushed the jsafrane:systemd-mount-service branch from d6dad06 to 5a8a611 Jul 27, 2017

@jsafrane (Member Author) commented Jul 27, 2017

/retest

@euank (Member) commented Jul 27, 2017

Thanks for clarifying my misunderstanding; re-reading the docs around scope does make this seem more sane.

I withdraw my technical objections, though I do still consider this as a hack rather than a proper solution.

@rootfs (Member) commented Jul 27, 2017

I wouldn't characterize it as a hack: transient systemd scopes are used elsewhere for cgroup isolation and system initialization.
/lgtm

@jsafrane (Member Author) commented Aug 2, 2017

/retest

@jsafrane (Member Author) commented Aug 3, 2017

/retest

@jsafrane (Member Author) commented Aug 3, 2017

/retest

@jsafrane (Member Author) commented Aug 3, 2017

This PR breaks hollow node:

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/49640/pull-kubernetes-kubemark-e2e-gce/43372/artifacts/e2e-43372-minion-group-x9fv/kubelet-hollow-node-gjps9.log

Mounting command: /usr/bin/systemd-run
Mounting arguments: --description=Kubernetes transient mount for /tmp/hollow-kubelet.437755395/pods/34409e14-781f-11e7-b1f0-42010a800004/volumes/kubernetes.io~secret/default-token-llw5w --scope -- mount -t tmpfs tmpfs /tmp/hollow-kubelet.437755395/pods/34409e14-781f-11e7-b1f0-42010a800004/volumes/kubernetes.io~secret/default-token-llw5w
Output: Failed to create bus connection: No such file or directory

kubelet runs on a system with systemd but apparently without dbus.

@jsafrane (Member Author) commented Aug 8, 2017

Diving deeper into the hollow kubelet: it uses Jessie as its base image, i.e. systemd-run is there but systemd itself is not running. @wojtek-t @gmarek, why is the hollow kubelet mounting anything at all? Shouldn't it be mocked out? And why is nsenter-mounter not used when the hollow kubelet runs in a container?

@gmarek (Member) commented Aug 8, 2017

No reason - it's probably just a natural evolution. I'm fine with changing that.

@k8s-github-robot k8s-github-robot added size/L and removed lgtm size/M labels Aug 8, 2017

@jsafrane (Member Author) commented Aug 8, 2017

I added a commit that checks whether systemd is really running, not just installed. I am open to ideas on how to run kubelet with KillMode=cgroup without killing fuse mounts on restart; so far systemd-run seems to be the best way.

@jsafrane (Member Author) commented Aug 9, 2017

/retest

@rootfs (Member) commented Aug 9, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 9, 2017

@jsafrane (Member Author) commented Aug 9, 2017

Just as proof that this PR works on both systemd and non-systemd OSes: pull-kubernetes-node-e2e was successful, and here is its output on containerVM (no systemd): https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/49640/pull-kubernetes-node-e2e/43973/artifacts/tmp-node-e2e-689fcb18-e2e-node-containervm-v20161208-image/kubelet.log

I0808 16:17:45.604137    3372 mount_linux.go:160] Mounting cmd (mount) with arguments ([-t tmpfs tmpfs /var/lib/kubelet/pods/1ca1b9d9-7c55-11e7-ab53-42010a800014/volumes/kubernetes.io~projected/projected-configmap-volume])

and GCI (with systemd):
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/49640/pull-kubernetes-node-e2e/43973/artifacts/tmp-node-e2e-689fcb18-cos-beta-60-9592-70-0/kubelet.log

Aug 08 16:17:47 tmp-node-e2e-689fcb18-cos-beta-60-9592-70-0 kubelet[1687]: I0808 16:17:47.463959    1687 mount_linux.go:160] Mounting cmd (systemd-run) with arguments ([--description=Kubernetes transient mount for /var/lib/kubelet/pods/1dc258ed-7c55-11e7-8cd5-42010a800005/volumes/kubernetes.io~projected/projected-secret-volume --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/1dc258ed-7c55-11e7-8cd5-42010a800005/volumes/kubernetes.io~projected/projected-secret-volume])

Also, kubemark seems to be happy.

@jingxu97 (Contributor) commented Aug 9, 2017

/approve

@k8s-github-robot (Contributor) commented Aug 9, 2017

/test all [submit-queue is verifying that this PR is safe to merge]

@saad-ali (Member) commented Aug 9, 2017

Gets the job done. Thanks.

/lgtm
/approve

@k8s-github-robot (Contributor) commented Aug 9, 2017

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jingxu97, jsafrane, rootfs, saad-ali

Associated issue: 34965

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot (Contributor) commented Aug 9, 2017

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 68ac78a into kubernetes:master Aug 9, 2017

9 of 10 checks passed:

- Submit Queue: Required Github CI test is not green: pull-kubernetes-bazel
- cla/linuxfoundation: jsafrane authorized
- pull-kubernetes-bazel: Job succeeded.
- pull-kubernetes-e2e-gce-etcd3: Jenkins job succeeded.
- pull-kubernetes-e2e-kops-aws: Jenkins job succeeded.
- pull-kubernetes-federation-e2e-gce: Jenkins job succeeded.
- pull-kubernetes-kubemark-e2e-gce: Jenkins job succeeded.
- pull-kubernetes-node-e2e: Jenkins job succeeded.
- pull-kubernetes-unit: Jenkins job succeeded.
- pull-kubernetes-verify: Jenkins job succeeded.