Podman does not support the mount propagation needed by CSI plugins #192

Closed
ejweber opened this issue Sep 20, 2022 · 18 comments · Fixed by #204

Comments

@ejweber
Contributor

ejweber commented Sep 20, 2022

In #169 @tgross added the ability for the Podman driver to create bind mounts from a TaskConfig. While this didn't allow adding a mount block through the external API (that's left to #142), it did allow Podman to set up the bind mounts requested by the CSI Plugin Prestart Hook, enabling CSI support for the Podman driver.

As we were experimenting with this functionality for the BeeGFS CSI driver, we noticed that our driver container seemed unable to see volumes it had already mounted. In NodeUnpublishVolume and NodeUnstageVolume, this manifested as repeated errors in which Nomad told the driver to unstage or unpublish a volume while the driver insisted there was no mount to operate on. Inspecting the driver container shows the following:

 "Binds": [
                "/opt/nomad/alloc/edb56316-6229-5b3a-e5eb-6053a698d2b7/alloc:/alloc:rw,rprivate,rbind",
                "/opt/nomad/alloc/edb56316-6229-5b3a-e5eb-6053a698d2b7/node/local:/local:rw,rprivate,rbind",
                "/opt/nomad/alloc/edb56316-6229-5b3a-e5eb-6053a698d2b7/node/secrets:/secrets:rw,rprivate,noexec,rbind",
                "/:/host:ro,rshared,rbind",
                "/opt/nomad/client/csi/plugins/edb56316-6229-5b3a-e5eb-6053a698d2b7:/csi:rw,rprivate,rbind",
                "/opt/nomad/client/csi/node/beegfs-csi-plugin:/opt/nomad/client/csi/node/beegfs-csi-plugin:rw,rprivate,rbind",
                "/dev:/dev:rw,rprivate,nosuid,rbind"
            ],

There's a lot there, but the important bit is that /opt/nomad/client/csi/node/beegfs-csi-plugin is bind mounted to /opt/nomad/client/csi/node/beegfs-csi-plugin as rprivate (for most plugins this would be /opt/nomad/client/csi/node/beegfs-csi-plugin to /local/csi; mounting it back to the same path is just a BeeGFS CSI driver implementation detail). The CSI Plugin Prestart Hook wants this to be a bidirectional mount (rshared), but the changes in #169 don't take PropagationMode into account. Unless the bind has rshared, it's not clear how most CSI drivers would be able to propagate the mounts they create inside their containers to Nomad tasks.

I think a fix for this would be as easy as adding a conversion like the Docker driver already does, but the Docker driver has some extra logic in it that gives me a bit of pause (can the Podman driver run on Windows?).
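
Roughly, the conversion I'm picturing looks like the sketch below (illustrative only, not actual driver code: the helper name is made up, and it assumes an import alias like nstructs "github.com/hashicorp/nomad/nomad/structs"). It just maps Nomad's propagation mode onto the bind option Podman understands and falls back to rprivate, which is Podman's default anyway:

// Sketch only: map Nomad's mount propagation mode to a Podman bind option.
func propagationOption(mode string) string {
	switch mode {
	case nstructs.VolumeMountPropagationPrivate:
		return "rprivate"
	case nstructs.VolumeMountPropagationHostToTask:
		return "rslave"
	case nstructs.VolumeMountPropagationBidirectional:
		return "rshared"
	default:
		// anything else (including unset) keeps Podman's default
		return "rprivate"
	}
}

// ...and when building each bind from task.Mounts:
// bind.Options = append(bind.Options, propagationOption(m.PropagationMode))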

I can probably find some time to make a quick change and do some sanity checking, but am I thinking about this issue correctly?

@jdoss
Contributor

jdoss commented Sep 25, 2022

I think I am seeing the same issue when using the JuiceFS CSI plugin.

https://juicefs.com/docs/csi/csi-in-nomad/

Binds on the JuiceFS CSI Node job:

"HostConfig": {
               "Binds": [
                    "/opt/nomad/data/alloc/823858e6-bc4b-fcb2-1c46-34f1ea6903e2/alloc:/alloc:rw,rprivate,rbind",
                    "/opt/nomad/data/alloc/823858e6-bc4b-fcb2-1c46-34f1ea6903e2/juicefs-plugin/local:/local:rw,rprivate,rbind",
                    "/opt/nomad/data/alloc/823858e6-bc4b-fcb2-1c46-34f1ea6903e2/juicefs-plugin/secrets:/secrets:rw,rprivate,noexec,rbind",
                    "/opt/nomad/data/client/csi/plugins/823858e6-bc4b-fcb2-1c46-34f1ea6903e2:/csi:rw,rprivate,rbind",
                    "/opt/nomad/data/client/csi/node/juicefs0:/local/csi:rw,rprivate,rbind",
                    "/dev:/dev:rw,rprivate,nosuid,rbind"
               ],

Inside the JuiceFS node container you can see it is mounting correctly.

root@5217e137dc82:/app# mount |grep Juice
JuiceFS:juicefs-volume-test on /var/lib/jfs/juicefs-volume-test type fuse.juicefs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other)
JuiceFS:juicefs-volume-test on /local/csi/per-alloc/903b0491-8af6-6ac9-28e8-0e7b83c1f9cb/juicefs-volume-test/rw-file-system-multi-node-multi-writer type fuse.juicefs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other)

But in the task it has the bind mount:

"HostConfig": {
               "Binds": [
                    "/opt/nomad/data/alloc/903b0491-8af6-6ac9-28e8-0e7b83c1f9cb/alloc:/alloc:rw,rprivate,rbind",
                    "/opt/nomad/data/alloc/903b0491-8af6-6ac9-28e8-0e7b83c1f9cb/nginx/local:/local:rw,rprivate,rbind",
                    "/opt/nomad/data/alloc/903b0491-8af6-6ac9-28e8-0e7b83c1f9cb/nginx/secrets:/secrets:rw,rprivate,noexec,rbind",
                    "/opt/nomad/data/client/csi/node/juicefs0/per-alloc/903b0491-8af6-6ac9-28e8-0e7b83c1f9cb/juicefs-volume-test/rw-file-system-multi-node-multi-writer:/data/job:rw,rprivate,rbind"
               ],

which writes files to the allocation directory and not into the CSI volume mount:

$ nomad alloc exec -i -t 903b0491-8af6-6ac9-28e8-0e7b83c1f9cb bash
root@d1d9e6db27b1:/# touch foo data/job/
root@d1d9e6db27b1:/# ls data/job/
foo
# ls -lah /opt/nomad/data/client/csi/node/juicefs0/per-alloc/903b0491-8af6-6ac9-28e8-0e7b83c1f9cb/juicefs-volume-test/rw-file-system-multi-node-multi-writer/
total 0
drwxr-xr-x. 2 root root 17 Sep 25 20:28 .
drwx------. 3 root root 52 Sep 25 19:54 ..
-rw-r--r--. 1 root root  0 Sep 25 20:25 foo

And the JuiceFS CSI plugin doesn't think things are mounted correctly:

2022-09-25T19:51:32.576538220+00:00 stderr F 2022/09/25 19:51:32.549878 juicefs[119] <WARNING>: AOF is not enabled, you may lose data if Redis is not shutdown properly. [info.go:83]
2022-09-25T19:51:32.576538220+00:00 stderr F 2022/09/25 19:51:32.550093 juicefs[119] <INFO>: Ping redis: 147.259µs [redis.go:2750]
2022-09-25T19:51:32.576538220+00:00 stderr F 2022/09/25 19:51:32.550461 juicefs[119] <INFO>: Data use minio://192.168.1.11:9000/test/juicefs-volume-test/ [format.go:412]
2022-09-25T19:51:32.576538220+00:00 stderr F 2022/09/25 19:51:32.572245 juicefs[119] <INFO>: Volume is formatted as {Name:juicefs-volume-test UUID:9122f69e-bd69-46cd-8e09-3c61046739dc Storage:minio Bucket:http://192.168.1.11:9000/minio/test AccessKey:admin SecretKey:removed BlockSize:4096 Compression:none Shards:0 HashPrefix:false Capacity:0 Inodes:0 EncryptKey: KeyEncrypted:true TrashDays:1 MetaVersion:1 MinClientVersion: MaxClientVersion:} [format.go:450]
2022-09-25T19:51:32.576649791+00:00 stderr F I0925 19:51:32.576630       2 juicefs.go:499] Mount: mounting "redis://192.168.1.11:6379/0" at "/var/lib/jfs/juicefs-volume-test" with options []
2022-09-25T19:51:32.576726977+00:00 stderr F I0925 19:51:32.576716       2 process_mount.go:122] ceMount: mount redis://192.168.1.11:6379/0 at /var/lib/jfs/juicefs-volume-test
2022-09-25T19:51:32.576821906+00:00 stderr F I0925 19:51:32.576810       2 process_mount.go:170] Mount point /var/lib/jfs/juicefs-volume-test is not ready
2022-09-25T19:51:32.621139602+00:00 stderr F 2022/09/25 19:51:32.621032 juicefs[128] <INFO>: Meta address: redis://192.168.1.11:6379/0 [interface.go:385]
2022-09-25T19:51:32.622341835+00:00 stderr F 2022/09/25 19:51:32.622304 juicefs[128] <WARNING>: AOF is not enabled, you may lose data if Redis is not shutdown properly. [info.go:83]
2022-09-25T19:51:32.622542855+00:00 stderr F 2022/09/25 19:51:32.622522 juicefs[128] <INFO>: Ping redis: 143.732µs [redis.go:2750]
2022-09-25T19:51:32.623010189+00:00 stderr F 2022/09/25 19:51:32.622991 juicefs[128] <INFO>: Data use minio://192.168.1.11:9000/test/juicefs-volume-test/ [mount.go:289]
2022-09-25T19:51:32.623156785+00:00 stderr F 2022/09/25 19:51:32.623128 juicefs[128] <INFO>: Disk cache (/var/jfsCache/9122f69e-bd69-46cd-8e09-3c61046739dc/): capacity (102400 MB), free ratio (10%), max pending pages (15) [disk_cache.go:90]
2022-09-25T19:51:32.624494414+00:00 stderr F 2022/09/25 19:51:32.624475 juicefs[128] <INFO>: create session 9 OK [base.go:185]
2022-09-25T19:51:32.624883359+00:00 stderr F 2022/09/25 19:51:32.624864 juicefs[128] <INFO>: Prometheus metrics listening on 127.0.0.1:9567 [mount.go:157]
2022-09-25T19:51:32.624951337+00:00 stderr F 2022/09/25 19:51:32.624938 juicefs[128] <INFO>: Mounting volume juicefs-volume-test at /var/lib/jfs/juicefs-volume-test ... [mount_unix.go:177]
2022-09-25T19:51:33.146749962+00:00 stderr F 2022/09/25 19:51:33.146684 juicefs[128] <INFO>: OK, juicefs-volume-test is ready at /var/lib/jfs/juicefs-volume-test [mount_unix.go:45]
2022-09-25T19:51:33.577677258+00:00 stderr F I0925 19:51:33.577632       2 node.go:146] NodePublishVolume: binding /var/lib/jfs/juicefs-volume-test/juicefs-volume-test at /local/csi/per-alloc/c706f0b0-527b-d7b9-ae4f-df3447f7bd67/juicefs-volume-test/rw-file-system-multi-node-multi-writer with options []
2022-09-25T19:51:33.582400265+00:00 stderr F I0925 19:51:33.582377       2 node.go:152] NodePublishVolume: mounted juicefs-volume-test at /local/csi/per-alloc/c706f0b0-527b-d7b9-ae4f-df3447f7bd67/juicefs-volume-test/rw-file-system-multi-node-multi-writer with options []
2022-09-25T19:52:05.700807920+00:00 stderr F I0925 19:52:05.700704       2 node.go:166] NodeUnpublishVolume: volume_id is juicefs-volume-test
2022-09-25T19:52:05.701786079+00:00 stderr F I0925 19:52:05.701555       2 process_mount.go:195] ProcessUmount: /local/csi/per-alloc/3fb95667-d73f-cae3-3c69-8dbdb1cfe13e/juicefs-volume-test/rw-file-system-multi-node-multi-writer target not mounted
2022-09-25T19:52:05.702180916+00:00 stderr F I0925 19:52:05.702163       2 process_mount.go:232] ProcessUmount: /local/csi/per-alloc/3fb95667-d73f-cae3-3c69-8dbdb1cfe13e/juicefs-volume-test/rw-file-system-multi-node-multi-writer target not mounted
2022-09-25T19:53:05.703816341+00:00 stderr F I0925 19:53:05.703780       2 node.go:166] NodeUnpublishVolume: volume_id is juicefs-volume-test

@ejweber if you take a crack at adding what is needed to make this work, I would be down to help you test it. @tgross what do you think?

@tgross
Member

tgross commented Sep 26, 2022

I think a fix for this would be as easy as adding a conversion like the Docker driver already does, but the Docker driver has some extra logic in it that gives me a bit of pause (can the Podman driver run on Windows?).

Yeah, this sounds like a reasonable approach to me.

As far as Podman on non-Linux goes, it looks like they're intending to run only on Linux, using a VM on macOS/Windows (see https://podman.io/getting-started/installation.html#installing-on-mac--windows), and I don't see any suggestion in their GitHub issues that the developers (mostly Red Hat) intend to do otherwise. So I don't think we'll need OS-specific logic there.

@jdoss
Contributor

jdoss commented Sep 27, 2022

I was able to get things working with the following hack:

// create binds for host volumes, CSI plugins, and CSI volumes
for _, m := range task.Mounts {
  bind := spec.Mount{
    Type:        "bind",
    Destination: m.TaskPath,
    Source:      m.HostPath,
  }
  if m.Readonly {
    bind.Options = append(bind.Options, "ro")
  }

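  // hack: unconditionally mark every bind rshared so mounts created inside the
  // plugin container propagate back to the host (this applies to all binds, not just CSI ones)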
  bind.Options = append(bind.Options, "rshared")

  binds = append(binds, bind)
}

@ejweber
Contributor Author

ejweber commented Oct 7, 2022

It's good to see that works, @jdoss.

I was hoping to do something a bit more flexible, so I tried the following:

d.logger.Info("in containerMounts", "propagationMode", m.PropagationMode)
switch m.PropagationMode {
case nstructs.VolumeMountPropagationPrivate:
	bind.Options = append(bind.Options, "rprivate")
case nstructs.VolumeMountPropagationHostToTask:
	bind.Options = append(bind.Options, "rslave")
case nstructs.VolumeMountPropagationBidirectional:
	bind.Options = append(bind.Options, "rshared")
default:
	// If PropagationMode is something else or unset, Podman defaults to rprivate.
}

Unfortunately, it isn't working. The reason is that PropagationMode is coming in empty (confirmed by the log line I added). I also added a log statement at the top of StartTask to see exactly what I'm dealing with. I get the following in the taskConfig:

Mounts: [
  map[HostPath:/opt/nomad/client/csi/plugins/91865d02-e94c-bca1-a984-17688bdaaa1e PropagationMode: Readonly:false TaskPath:/csi] 
  map[HostPath:/opt/nomad/client/csi/node/beegfs-csi-plugin PropagationMode: Readonly:false TaskPath:/opt/nomad/client/csi/node/beegfs-csi-plugin] 
  map[HostPath:/dev PropagationMode: Readonly:false TaskPath:/dev]
]

This looks completely consistent with what I expect from the CSI Plugin Prestart Hook, except with PropagationMode empty where I expect it to be "bidirectional".

I have been trying to understand where in the Nomad plugin infrastructure we might be dropping the PropagationMode, but I'm at a bit of a loss.

@jdoss
Contributor

jdoss commented Oct 20, 2022

@tgross do you have any thoughts here to push @ejweber and me in the right direction?

@tgross
Member

tgross commented Oct 20, 2022

Sorry about the delay there, it's been a busy last few weeks. I've assigned myself this issue so that I remember to keep coming back to it.

I traced where the propagation mode is being set in the driver all the way up to StartTask and I think it looks fine, which means I suspect we've got a Nomad bug here and not a driver bug. If you want to verify that with a log line at StartTask, that'd be a good place to split the work.

I looked at the Nomad side of things and it looks like we're reading the mounts set by the hook correctly for the driver.TaskConfig at task_runner.go#L1143.

I thought I'd found a spot in the volume_hook.go#L173-L176 where it looks like we're forgetting to add the propagation field, but that's only serving the volume_mount block which doesn't have a propagation mode field. Is there any chance the mounts set by the plugin supervisor hook are getting overwritten in the volume_hook?

@tgross tgross self-assigned this Oct 20, 2022
@tgross
Member

tgross commented Oct 21, 2022

hashicorp/nomad#15014 just got opened, pointing out that we're entirely missing the code needed to serialize the mount propagation. So that explains that.

@ejweber
Contributor Author

ejweber commented Oct 21, 2022

Good find @tgross! I imagine the fix I wrote for the Podman driver will "just work" once that one is handled. I'll hold off submitting a PR for this issue until that one is resolved and I can test against it (though no hard feelings if someone else beats me to it 😄).

@lgfa29
Contributor

lgfa29 commented Nov 17, 2022

@ejweber #15096 is now merged and will go out in our next Nomad release. If you want to test your changes before then you can pull the latest code from main 🙂

@jdoss
Contributor

jdoss commented Nov 22, 2022

Looks like hashicorp/nomad#15096 was just released in Nomad 1.4.3. @ejweber can you push up your PR when you get a chance?

@ejweber
Contributor Author

ejweber commented Nov 22, 2022

Sorry all. I didn't get to testing while the code was sitting out there on main, but I did have an opportunity to play with it this afternoon (using the released Nomad v1.4.3). The changes from hashicorp/nomad#15096 plus my new PR (#204) yield the following:

"Binds": [
                "/opt/nomad/alloc/9764ce7a-d3a5-e3cf-a249-58684205eaaa/alloc:/alloc:rw,rprivate,rbind",
                "/opt/nomad/alloc/9764ce7a-d3a5-e3cf-a249-58684205eaaa/node/local:/local:rw,rprivate,rbind",
                "/opt/nomad/alloc/9764ce7a-d3a5-e3cf-a249-58684205eaaa/node/secrets:/secrets:rw,rprivate,noexec,rbind",
                "/:/host:ro,rshared,rbind",
                "/opt/nomad/client/csi/plugins/9764ce7a-d3a5-e3cf-a249-58684205eaaa:/csi:rshared,rw,rbind",
                "/opt/nomad/client/csi/node/beegfs-csi-plugin:/opt/nomad/client/csi/node/beegfs-csi-plugin:rshared,rw,rbind",
                "/dev:/dev:rw,rprivate,nosuid,rbind"
            ],

Note the rshared in "/opt/nomad/client/csi/node/beegfs-csi-plugin:/opt/nomad/client/csi/node/beegfs-csi-plugin:rshared,rw,rbind". Looks like things are working to me!

@quentin9696

Hi!

I'm facing the same issue with democratic-csi.

I'm sure your fix will be the solution to my similar issue, but I don't know where I should put the bind mount propagation.

Is it in the csi_plugin stanza?

Since /opt/nomad/client/csi/plugins and /opt/nomad/client/csi/node are automatically mounted, I don't know how to use it. I can't find any documentation on it.

Thank you for your help and your fixes.

@ejweber
Contributor Author

ejweber commented Jan 11, 2023

Hello @quentin9696,

You shouldn't have to do anything at all to make use of these changes. Nomad itself requests certain bind mounts for CSI driver tasks, and the combined changes made by @tgross, @lgfa29, and me ensure that these bind mounts are set up with appropriate propagation in a Podman-orchestrated container. Assuming the bind mounts Nomad sets up are sufficient for your driver's operation (and I think they will be for most use cases), you should be good to go.

One big caveat is versioning. It looks like my changes haven't made it into a released version of nomad-driver-podman yet, so you still have to build it from source off main and deploy it to your Nomad nodes for them to take effect. Hopefully a new version of the task driver is spun sometime soon, but, to be honest, I'm not that familiar with the release process. You also need to make sure you're using at least Nomad version 1.4.3.

@quentin9696

quentin9696 commented Jan 12, 2023

Hi @ejweber,

Yes, I finally found my answer: I just need to set up my bind mount with this:

volume_mount {
  volume           = "test"
  destination      = "/srv"
  read_only        = false
  propagation_mode = "bidirectional"
}

after upgrading my Nomad to 1.4.3 and my plugin to the snapshot version.

I think there is a lack of documentation for this plugin, as well as for the Nomad Docker driver, because there is no mention of the propagation_mode attribute in the docs.

@lgfa29
Contributor

lgfa29 commented Jan 14, 2023

Oh, you are right @quentin9696, that parameter is missing from the docs. I've opened hashicorp/nomad#15785 to update it.

Please note the warning about bidirectional mode 😄

@lgfa29
Contributor

lgfa29 commented Jan 14, 2023

Also thank you for the reply @ejweber. I will see if we can cut a release in the coming days. There are a few open PRs that we need to take care of.

@quentin9696

Thank you for your documentation PR, @lgfa29. Yes, it would be nice to have an official release with @ejweber's fix, since it's an important feature for the plugin, especially for CSI plugins.

@ejweber
Contributor Author

ejweber commented Jan 17, 2023

My mistake, @quentin9696. I assumed you were only looking to have bidirectional mount propagation on the default bind mounts created for a CSI driver task (as opposed to a custom bind mount you need). Thanks for the documentation, @lgfa29.
