K8s projected volumes don't work when using systemd inside the container #767

Closed
jojonium opened this issue Jan 16, 2024 · 18 comments

Comments

@jojonium

Using an Ubuntu 20.04 (kernel 5.15.0) node running Kubernetes 1.26 and sysbox 0.6.3.

I'm trying to inject a ServiceAccount token into my pod using a Kubernetes projected volume.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sysbox-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: sysbox-test
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        io.kubernetes.cri-o.userns-mode: auto:size=65536
      labels:
        app.kubernetes.io/name: sysbox-test
    spec:
      containers:
      - command: ["sh", "-c", "exec /sbin/init"]
        image: nestybox/ubuntu-bionic-systemd:latest
        name: dev
        securityContext:
          allowPrivilegeEscalation: true
          privileged: false
          readOnlyRootFilesystem: false
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - mountPath: /var/run/secrets/serviceaccount
          name: token
          readOnly: true
      runtimeClassName: sysbox-runc
      securityContext:
        fsGroup: 0
        runAsNonRoot: false
        runAsUser: 0
      serviceAccount: my-service-account
      serviceAccountName: my-service-account
      volumes:
      - name: token
        projected:
          defaultMode: 420
          sources:
          - serviceAccountToken:
              expirationSeconds: 86400
              path: token

Systemd works fine from inside the container as expected, but if I try to get the injected token:

root@sysbox-test-d5846cdcb-7m75r:/# cat /var/run/secrets/serviceaccount/token
cat: /var/run/secrets/serviceaccount/token: No such file or directory

If I simply change the command to ["sh", "-c", "sleep 1000"] so it doesn't start systemd as PID 1 the token is injected successfully and I can read it.

I can see the mount with findmnt so I'm not sure why it's failing to actually get mounted:

root@sysbox-test-d5846cdcb-7m75r:/# findmnt | grep serviceaccount
|-/run/secrets/serviceaccount                  /var/lib/sysbox/shiftfs/ef954be7-d6f7-492e-b448-f3b412a7399f  shiftfs  ro,relatime

I came across issue #728 while looking into this so I thought to check the logs from sysbox-mgr in case shiftfs wasn't working properly but it doesn't seem to be the same issue reported there:

level=info msg="Starting ..."
level=info msg="Sysbox data root: /var/lib/sysbox"
level=info msg="Shiftfs module found in kernel: yes"
level=info msg="Shiftfs works properly: yes"
level=info msg="Shiftfs-on-overlayfs works properly: yes"
level=info msg="ID-mapped mounts supported by kernel: yes"
level=info msg="Overlayfs on ID-mapped mounts supported by kernel: no"
level=info msg="Operating in system container mode."
level=info msg="Inner container image preloading disabled."
level=info msg="Listening on /run/sysbox/sysmgr.sock"
level=info msg="Ready ..."

I don't know why this would only happen when systemd is started as the container's PID 1, any insight is appreciated.

@ctalledo
Member

Hi @jojonium,

Thanks for giving Sysbox a shot.

If I simply change the command to ["sh", "-c", "sleep 1000"] so it doesn't start systemd as PID 1 the token is injected successfully and I can read it.

That's so strange; whether systemd is PID 1 or not should not make a difference at all.

I can see the mount with findmnt so I'm not sure why it's failing to actually get mounted:

root@sysbox-test-d5846cdcb-7m75r:/# findmnt | grep serviceaccount
|-/run/secrets/serviceaccount /var/lib/sysbox/shiftfs/ef954be7-d6f7-492e-b448-f3b412a7399f shiftfs ro,relatime

That looks normal to me; what do you see under /var/run/secrets/serviceaccount?

Also, do you have access to the K8s node where the Sysbox pod is running? If so, please re-launch the Sysbox pod and do a findmnt on that node while the pod is running, so I can see how the shiftfs mount is set up by Sysbox.

Thanks!

@jojonium
Author

what do you see under /var/run/secrets/serviceaccount?

There's no serviceaccount directory created under /var/run/

root@sysbox-test-58648986c9-qnl2z:~# ls -a /var/run/
.  ..  dbus  initctl  lock  log  sendsigs.omit.d  shm  sudo  systemd  tmpfiles.d  user  utmp

This is what findmnt looks like from the node with the pod running:

├─/var/lib/kubelet/pods/68720154-346b-422e-a319-fde2c39e6a9d/volumes/kubernetes.io~projected/token       tmpfs                                                                                   tmpfs       rw,relatime,size=1048576k,inode64
├─/var/lib/sysbox/shiftfs/28a5860f-a260-49aa-a419-1f2221ed6d91                                           /var/lib/kubelet/pods/68720154-346b-422e-a319-fde2c39e6a9d/volumes/kubernetes.io~projected/token                     shiftfs     rw,relatime,mark

@ctalledo
Member

Thanks @jojonium, that helps.

At host level, is /var/lib/kubelet/pods/68720154-346b-422e-a319-fde2c39e6a9d/volumes/kubernetes.io~projected/token a file or a directory?

The way it should work is:

  1. If token is a file:
  • At host level, Sysbox should have created a shiftfs mount from the parent dir (/var/lib/kubelet/pods/68720154-346b-422e-a319-fde2c39e6a9d/volumes/kubernetes.io~projected) onto /var/lib/sysbox/shiftfs/28a5860f-a260-49aa-a419-1f2221ed6d9. That's because shiftfs mounts only work on dirs, not files.

  • Then, Sysbox should have mounted /var/lib/sysbox/shiftfs/28a5860f-a260-49aa-a419-1f2221ed6d9 into /run/secrets/serviceaccount.

  2. If token is a directory:
  • At host level, Sysbox should have created a shiftfs mount from that token (/var/lib/kubelet/pods/68720154-346b-422e-a319-fde2c39e6a9d/volumes/kubernetes.io~projected/token) onto /var/lib/sysbox/shiftfs/28a5860f-a260-49aa-a419-1f2221ed6d9.

  • Then, Sysbox should have mounted /var/lib/sysbox/shiftfs/28a5860f-a260-49aa-a419-1f2221ed6d9 into /run/secrets/serviceaccount/token.
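The source-selection rule described above (shiftfs mounts only work on directories, so a file is served through its parent directory) can be sketched as follows. This is a minimal illustration, not Sysbox's actual code; the function name `shiftfs_source` is hypothetical.

```python
import os
import tempfile


def shiftfs_source(host_path: str) -> str:
    """Return the host directory a shiftfs mount would be created from.

    Hypothetical helper: a directory is used as-is, while a file is
    replaced by its parent directory, since shiftfs cannot mount a file.
    """
    host_path = os.path.normpath(host_path)
    if os.path.isdir(host_path):
        return host_path
    return os.path.dirname(host_path)


if __name__ == "__main__":
    # Build a throwaway layout that mimics a volume directory holding a token file.
    with tempfile.TemporaryDirectory() as volume_dir:
        token_file = os.path.join(volume_dir, "token")
        with open(token_file, "w") as f:
            f.write("dummy")
        print(shiftfs_source(volume_dir))  # the directory itself
        print(shiftfs_source(token_file))  # the parent directory of the file
```

In the case reported in this thread, the host-side `token` path turned out to be a directory, so the second branch of the list applies.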

Now, I don't know yet why systemd being PID 1 causes a problem, except that I did notice that when systemd is PID 1, sysbox automatically mounts tmpfs on /run and /run/lock. That tmpfs mount must be related to the problem somehow.

Another question: how does findmnt | grep serviceaccount look inside the container without systemd?

@jojonium
Author

At the host level, /var/lib/kubelet/pods/<uuid>/volumes/kubernetes.io~projected/token is a directory, which contains a file called token (the path of the projected volume). It has created the shiftfs mount at /var/lib/sysbox/shiftfs/<uuid>, and I can see the token there too. But the volume still does not appear inside the container.

You're right about systemd mounting /run as a tmpfs but I'm not sure why that would cause problems.

This is from inside the container with systemd enabled:

root@sysbox-test-6f7b77dbd8-rc478:/# findmnt | grep run
|-/run                                                       tmpfs                                                                                                    tmpfs    rw,nosuid,nodev,mode=755,uid=296608,gid=296608,inode64
| `-/run/lock                                                tmpfs                                                                                                    tmpfs    rw,nosuid,nodev,noexec,relatime,size=5120k,uid=296608,gid=296608,inode64
|-/run/.containerenv                                         /var/lib/sysbox/shiftfs/d8d04a15-949c-4f42-a2aa-9a42f286ed27[/.containerenv]                             shiftfs  rw,relatime
|-/run/secrets/serviceaccount                                /var/lib/sysbox/shiftfs/a8354163-5638-4430-900a-e0ffba0dbc6a                                             shiftfs  ro,relatime

And without:

root@sysbox-test-6dbd94ddb7-hdg4v:/# findmnt | grep /run
|-/run/.containerenv                                         /var/lib/sysbox/shiftfs/138a1f02-f20b-4848-bcb9-5d4f0ef5278b[/.containerenv]                             shiftfs  rw,relatime
|-/run/secrets/serviceaccount                                /var/lib/sysbox/shiftfs/8dd39980-16cb-4efb-846f-2d712fbea6c1                                             shiftfs  ro,relatime

This is also without systemd:

root@sysbox-test-6dbd94ddb7-hdg4v:/# findmnt | grep serviceaccount
|-/run/secrets/serviceaccount                                /var/lib/sysbox/shiftfs/8dd39980-16cb-4efb-846f-2d712fbea6c1                                             shiftfs  ro,relatime

@ctalledo
Member

Thanks @jojonium, very helpful info.

This is from inside the container with systemd enabled:

root@sysbox-test-6f7b77dbd8-rc478:/# findmnt | grep run
|-/run tmpfs tmpfs rw,nosuid,nodev,mode=755,uid=296608,gid=296608,inode64
| `-/run/lock tmpfs tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,uid=296608,gid=296608,inode64
|-/run/.containerenv /var/lib/sysbox/shiftfs/d8d04a15-949c-4f42-a2aa-9a42f286ed27[/.containerenv] shiftfs rw,relatime
|-/run/secrets/serviceaccount /var/lib/sysbox/shiftfs/a8354163-5638-4430-900a-e0ffba0dbc6a shiftfs ro,relatime

I think I see the problem; in the above output, the /run/secrets/serviceaccount mount should have been a submount of the /run mount (similar to /run/lock), but it does not appear to be.

I suspect that Sysbox (incorrectly) did the /run mount after the /run/secrets/serviceaccount mount and thus it's hiding it. Let me check the code to see where the bug is.

@ctalledo
Member

I suspect that Sysbox (incorrectly) did the /run mount after the /run/secrets/serviceaccount mount and thus it's hiding it. Let me check the code to see where the bug is.

Seems the bug is here in sysbox-runc.

That code ensures the mounts are ordered such that they don't hide each other (e.g., mount /foo before /foo/bar). But it's not doing it for a scenario where we have a tmpfs mount on /run and a bind-mount on /run/some/path.

Normally the higher level container manager (e.g., Docker or K8s) sends the mounts in the correct order, but because Sysbox implicitly adds some mounts of its own (e.g., tmpfs on /run when systemd is PID 1), it needs to do the ordering again to take into account the implicit mounts. Seems like it's not doing it right for /run in systemd scenarios.
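To illustrate the ordering problem described above, here is a small sketch of how a later mount on a parent path hides an earlier mount beneath it, and how sorting by path depth avoids that. It is an assumption-laden illustration, not the sysbox-runc implementation; `order_mounts` and `hidden_mounts` are hypothetical names, and mounts are modeled as plain `(destination, kind)` tuples.

```python
from pathlib import PurePosixPath


def order_mounts(mounts):
    """Sort (destination, kind) pairs so parent paths come before children.

    Sorting by the number of path components is enough for this purpose,
    and sorted() is stable, so mounts at equal depth keep their order.
    """
    return sorted(mounts, key=lambda m: len(PurePosixPath(m[0]).parts))


def hidden_mounts(mounts):
    """Return destinations that a later mount on an ancestor path would cover."""
    hidden = []
    for i, (dest, _kind) in enumerate(mounts):
        for later_dest, _later_kind in mounts[i + 1:]:
            if PurePosixPath(later_dest) in PurePosixPath(dest).parents:
                hidden.append(dest)
                break
    return hidden


if __name__ == "__main__":
    # The runtime-supplied bind-mount comes first, followed by the implicit
    # tmpfs mounts that the thread says are added when systemd is PID 1.
    mounts = [
        ("/run/secrets/serviceaccount", "bind"),
        ("/run", "tmpfs"),
        ("/run/lock", "tmpfs"),
    ]
    print(hidden_mounts(mounts))                # the bind-mount is covered by /run
    print(hidden_mounts(order_mounts(mounts)))  # nothing is covered after sorting
```

With the unsorted list, the tmpfs on /run lands on top of the bind-mount, which matches the symptom in the findmnt output where /run/secrets/serviceaccount is not a submount of /run.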

If it's OK, I can try patching it and send you a new sysbox-runc binary that you can then use on the K8s node, to see if it fixes the problem. I've not been able to reproduce locally with Docker yet unfortunately.

@jojonium
Author

Sure, I can try out the patched binary and see if that fixes it.

@ctalledo
Member

Hi @jojonium,

OK, I've attached a patched sysbox-runc binary. It's based on this PR.

Please stop all sysbox pods on the K8s node, then gunzip the patched sysbox-runc and copy it to the K8s node, to the location where the original sysbox-runc is located (I suggest you back up the original one just in case). Then relaunch the sysbox pod and let me know if it fixes the problem.

I've done basic testing on it, but haven't run the full sysbox test suite on the patch yet. Should work fine though 🤞 .

sysbox-runc.gz

@jojonium
Author

I copied the new binary onto the node and restarted the sysbox pod but it doesn't seem to have fixed the issue:

From the node:

# sysbox-runc --version
sysbox-runc
        edition:        Community Edition (CE)
        version:        0.6.4-dev
        commit:
        built at:       Tue Jan 23 18:38:26 UTC 2024
        built by:       Cesar Talledo
        oci-specs:      1.1.0+dev
# findmnt | grep token
├─/var/lib/kubelet/pods/58e11f9a-20fd-4840-a1ab-1eedfdae9d77/volumes/kubernetes.io~projected/token                                        tmpfs                                                                                                                tmpfs       rw,relatime,size=1572864k,inode64
├─/var/lib/sysbox/shiftfs/d661c085-1a0d-4b02-b068-fbaa3089858c                                                                            /var/lib/kubelet/pods/58e11f9a-20fd-4840-a1ab-1eedfdae9d77/volumes/kubernetes.io~projected/token                     shiftfs     rw,relatime,mark

And from the container:

root@sysbox-test-64cc5bff86-whg99:/# findmnt | grep /run
|-/run                                                       tmpfs                                                                                                    tmpfs    rw,nosuid,nodev,mode=755,uid=493216,gid=493216,inode64
| `-/run/lock                                                tmpfs                                                                                                    tmpfs    rw,nosuid,nodev,noexec,relatime,size=5120k,uid=493216,gid=493216,inode64
|-/run/.containerenv                                         /var/lib/sysbox/shiftfs/efdf4359-0748-45d8-a336-865ab4bb3bb7[/.containerenv]                             shiftfs  rw,relatime
|-/run/secrets/serviceaccount                                /var/lib/sysbox/shiftfs/d661c085-1a0d-4b02-b068-fbaa3089858c                                             shiftfs  ro,relatime
root@sysbox-test-64cc5bff86-wvb94:/# ls /run/secrets/serviceaccount
ls: cannot access '/run/secrets/serviceaccount': No such file or directory

I will note that the easy workaround is to set the token's mountPath to anywhere outside of /run, but I can think of situations where you might want/need to mount a volume under that path.

@ctalledo
Member

Hi @jojonium,

I copied the new binary onto the node and restarted the sysbox pod but it doesn't seem to have fixed the issue

Oh too bad, thanks. So strange that even with the fix I provided, the /run/secrets/serviceaccount mount is still not a sub-mount of /run, since I tested it locally with Docker and it worked fine:

$ docker run --runtime=sysbox-runc -v /some/fake/token:/run/secrets/serviceaccount/token nestybox/ubuntu-jammy-systemd

admin@15d25893315b:~$ findmnt
├─/run                              tmpfs                                                        tmpfs   rw,nosuid,nodev,relatime,size=65536k,mode=755,uid=165536,gid=165536,inode64
│ ├─/run/lock                       tmpfs                                                        tmpfs   rw,nosuid,nodev,noexec,relatime,size=4096k,uid=165536,gid=165536,inode64
│ └─/run/secrets/serviceaccount/token                                                                                                                         
│                                   /var/lib/sysbox/shiftfs/ef4b9249-99a4-4fd5-8bcf-7f88c253b11a shiftfs rw,relatime 

Silly question, but just to double check: did you update sysbox-runc on all nodes of the K8s cluster (to make sure the pod is in fact using the updated sysbox-runc)?

@jojonium
Author

After experimenting some more I found that the token mount only fails when I have command: ["sh", "-c", "exec /sbin/init"] or command: ["/sbin/init"]. If I omit the command it presumably just uses ENTRYPOINT [ "/sbin/init", "--log-level=err" ] and the mount works. In both cases systemd is started as PID 1. Is there some difference between starting it as the docker entrypoint as opposed to the container's command? And it's the same behavior with both the 0.6.3 and the 0.6.4-dev sysbox-runc binary.

@ctalledo
Member

ctalledo commented Jan 31, 2024

After experimenting some more I found that the token mount only fails when I have command: ["sh", "-c", "exec /sbin/init"] or command: ["/sbin/init"]. If I omit the command it presumably just uses ENTRYPOINT [ "/sbin/init", "--log-level=err" ] and the mount works.

Ah ... interesting; since ENTRYPOINT always executes, then I believe the command must be creating a redundant execution of /sbin/init somehow. But I could not reproduce it with Docker:

$ docker run --runtime=sysbox-runc -v /some/fake/token:/run/secrets/serviceaccount/token nestybox/ubuntu-jammy-systemd /sbin/init

From Sysbox's perspective, it knows nothing about ENTRYPOINT or command; it's simply told by the higher level runtime (e.g., K8s or Docker) what program to start the container with and with what arguments. It then checks if the program is systemd (/sbin/init) and if so sets up the mounts properly for systemd to run. Thus, command: ["sh", "-c", "exec /sbin/init"] will definitely not work since the first command is sh, not /sbin/init. But not sure why command: ["/sbin/init"] does not work either.
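The detection behavior described above can be sketched roughly as a check on the first element of the container's start command. This is only an illustration of the idea: the function name `starts_with_systemd` and the set of recognized paths are assumptions (the thread names only /sbin/init), and Sysbox's real check may differ.

```python
# Assumption: only /sbin/init is named in this thread as the systemd entry
# point, so it is the only path this sketch recognizes.
SYSTEMD_PATHS = {"/sbin/init"}


def starts_with_systemd(argv):
    """Return True if the container's first program is a known systemd path.

    Only argv[0] is inspected, so systemd launched indirectly (for example
    through a shell that later execs it) is not recognized.
    """
    return bool(argv) and argv[0] in SYSTEMD_PATHS


if __name__ == "__main__":
    for argv in (
        ["/sbin/init"],
        ["/sbin/init", "--log-level=err"],
        ["sh", "-c", "exec /sbin/init"],
    ):
        print(argv, starts_with_systemd(argv))
```

Under this model, the `sh -c "exec /sbin/init"` form is not detected because the first program is `sh`, even though systemd ends up as PID 1 after the exec.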

How does findmnt look when things work?

@jojonium
Author

Does the ENTRYPOINT really always run? For example if I set the command to ["sh", "-c", "sleep infinity"] then systemd doesn't start in the pod:

root@sysbox-test-96484895f-lphff:/# systemctl
System has not been booted with systemd as init system (PID 1). Can't operate.

Whereas if I omit the command systemctl gives me a list of running services as expected.

And I think I got mixed up before, command: ["/sbin/init"] DOES work. This is what findmnt looks like when it's working:

root@sysbox-test-67f888b9f9-xmc5r:/# findmnt | grep run
|-/run                                                       tmpfs                                                                                                    tmpfs    rw,nosuid,nodev,noexec,relatime,uid=296608,gid=296608,inode64
| |-/run/lock                                                tmpfs                                                                                                    tmpfs    rw,nosuid,nodev,noexec,relatime,uid=296608,gid=296608,inode64
| |-/run/.containerenv                                       /var/lib/sysbox/shiftfs/b07183e2-781f-47a5-b836-cdbbf9aa2b18[/.containerenv]                             shiftfs  rw,relatime
| |-/run/secrets/serviceaccount                              /var/lib/sysbox/shiftfs/85c6aae5-4026-4f85-8c14-e3d7ebd64a57                                             shiftfs  ro,relatime

What doesn't work is command: ["sh", "-c", "exec /sbin/init"]. As you say I guess it doesn't detect systemd because the first command is sh. I'm not sure if there's any way to detect that.

@ctalledo
Member

ctalledo commented Jan 31, 2024

Does the ENTRYPOINT really always run?

I see, my mistake: in K8s, setting command overrides the image's ENTRYPOINT; the ENTRYPOINT is only used when command is omitted.

And I think I got mixed up before, command: ["/sbin/init"] DOES work

Makes sense.

This is what findmnt looks like when it's working ...

That looks good, thanks.

What doesn't work is command: ["sh", "-c", "exec /sbin/init"]. As you say I guess it doesn't detect systemd because the first command is sh. I'm not sure if there's any way to detect that.

Yes that explains it; I guess we could improve the detection logic in Sysbox, but in general there's no need to use the shell to execute systemd, so it's not something we would prioritize.

So I think this resolves the issue: command: ["sh", "-c", "exec /sbin/init"] had to be command: ["/sbin/init"], as otherwise Sysbox won't detect that systemd is in the container, and that in turn causes bind-mounts under /run to not be set up properly.

Thanks for getting to the bottom of it!

@jojonium
Author

jojonium commented Feb 1, 2024

Alright, I think I fully understand it now. My use case is that I want to do some initial setup in the container with a shell script in command and then exec /sbin/init at the end to start systemd. I guess there's no general way to detect that and prevent systemd from clobbering mounts under /run. It's easy enough to work around by using a different mount location or doing the setup by a different method, so if this is a low priority I'm fine with closing the issue. Thanks for all your help.

@ctalledo
Member

ctalledo commented Feb 1, 2024

My use case is that I want to do some initial setup in the container with a shell script in command and then exec /sbin/init at the end to start systemd.

Oh I see; is that setup something you could do as a systemd service unit, or does it need to be done before systemd starts?

If the latter, is it something you could do in the Dockerfile for the image (such that when the container starts the setup is already in place), or does it need to be done at runtime?

@jojonium
Author

jojonium commented Feb 1, 2024

Yeah it can either be baked into the image or run as a systemd service, running shell commands at container startup was just the easiest first method I thought of, which led me to this issue.

@ctalledo
Member

ctalledo commented Feb 1, 2024

Yeah it can either be baked into the image or run as a systemd service, running shell commands at container startup was just the easiest first method I thought of, which led me to this issue.

OK cool, glad we got to the bottom of the issue then.

Closing the issue now. Thanks again.
