Add support for io.bus and io.cache for shared filesystems with VMs #599

Closed
stgraber opened this issue Mar 9, 2024 · 17 comments
Labels: Easy (Good for new contributors), Feature (New feature, not a bug)

@stgraber
Member

stgraber commented Mar 9, 2024

Currently io.bus and io.cache are restricted to block volumes being exposed to VMs.
io.bus offers nvme, virtio-scsi and virtio-blk while io.cache offers none, writeback and unsafe.

That's all fine for blocks, but we've recently had requests to do something similar for filesystems.

In those cases, io.bus should be:

  • auto (default, both 9p and virtiofs but silently skip virtiofs if not available)
  • 9p (9p only)
  • virtiofs (virtiofs only, fail if not available)

And io.cache should be made to map to virtiofsd cache options:

  • none (default, map to cache=never)
  • metadata (map to cache=metadata)
  • unsafe (map to cache=always)
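
For illustration, once implemented, a filesystem share using these options could look something like the sketch below (VM-NAME, the device name and the paths are placeholders):

incus config device add VM-NAME shared disk source=/srv/shared path=/mnt/shared io.bus=virtiofs io.cache=metadata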
@stgraber added the Feature (New feature, not a bug) and Easy (Good for new contributors) labels Mar 9, 2024
@stgraber added this to the soon milestone Mar 9, 2024
@gl-yziquel

I was told that incus did not support passthrough of block devices into VMs, which is why I've been using filesystems instead.

I'd love to try switching between 9p and virtiofs this way anyway.

@stgraber
Member Author

stgraber commented Mar 9, 2024

@gl-yziquel Not sure who told you that :)

You can definitely do:

incus config device add VM-NAME xyz disk source=/dev/XYZ io.bus=nvme

Which will expose the disk at /dev/XYZ as an NVME device inside of the VM.

Just the usual caution around not using the same disk from both host and guest or with multiple guests, unless you're using a clustered filesystem or use it read-only everywhere.
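
For the read-only case mentioned above, the same kind of command can carry the readonly option (a sketch using the standard disk device syntax; VM-NAME and /dev/XYZ are placeholders):

incus config device add VM-NAME xyz disk source=/dev/XYZ io.bus=nvme readonly=true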

@gl-yziquel

gl-yziquel commented Mar 9, 2024

@gl-yziquel Not sure who told you that :)

Various blog posts. Don't remember precisely which.

You can definitely do:

incus config device add VM-NAME xyz disk source=/dev/XYZ io.bus=nvme

Thank you. I'll try it as soon as possible, though I've been reluctant to do so, as I fear using a raw block device through a stack I don't fully understand. And I'd need to buy hardware to back up the disk behind that block device if I had to do things properly.

Any "risk" of corrupting a 40 TiB USB ZFS drive by passing it through as a block device into incus without a backup? (Why do I believe this is a silly question, all the more an off-topic one...)

Which will expose the disk at /dev/XYZ as an NVME device inside of the VM.

Not quite an NVME...

Just the usual caution around not using the same disk from both host and guest or with multiple guests, unless you're using a clustered filesystem or use it read-only everywhere.

I don't plan to.

Anyhow, the io.bus and io.cache stuff is great news. If it works on VMs and not only containers, it'll be really great.

Thank you very much.

Sidenote: when playing with what will become io.cache=none on virtiofs together with Rust code, one runs into an mmap MAP_SHARED issue replacing the usual MAP_PRIVATE for mmap'd files, which ends up not being supported by such a setup. One more reason to be able to play with io.cache values. The problem is:

  1. With io.cache=none, I can do IO-intensive stuff (sudo updatedb, big svn checkouts, and Chromium-sized builds; the number of file descriptors on the host is stable, I tested), but Rust code tends to fail as soon as cargo build is launched because of MAP_SHARED/MAP_PRIVATE issues.

  2. With io.cache!=none, I can build and run Rust code, as MAP_SHARED is not coercive enough when a cache is there. But the number of file descriptors blows up.

Bummer: I need both rust code and big IO... can't have both.

I'm really not an expert in such issues, but I'm under the impression that this mmap stuff over virtiofs is what is called the "DAX window" allowing qemu's process to allocate memory for such an mmap. Some kind of dance played by qemu and virtiofsd there that I do not understand. Links to where I analysed this issue very much superficially:

cross-rs/cross#1313 (comment)
Byron/gitoxide#1312

(If I can't have both, I'll have to try the block device passthrough.)
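
To make the MAP_SHARED point above concrete, here is a minimal Go sketch (the file path is a placeholder for somewhere on the shared mount) that attempts a shared writable mapping and prints whatever error comes back; it's just one way to probe how a given io.cache setting behaves, not anything incus-specific:

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical test file on the virtiofs mount; adjust the path as needed.
	f, err := os.OpenFile("/mnt/shared/mmap-test", os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := f.Truncate(4096); err != nil {
		panic(err)
	}

	// A writable MAP_SHARED mapping is the case reported to fail with
	// virtiofsd --cache=never; MAP_PRIVATE usually works either way.
	data, err := unix.Mmap(int(f.Fd()), 0, 4096, unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		fmt.Println("MAP_SHARED mmap failed:", err)
		return
	}
	defer unix.Munmap(data)
	data[0] = 42
	fmt.Println("MAP_SHARED mmap succeeded")
}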

@stgraber
Member Author

Any "risk" of corrupting a 40 TiB USB ZFS drive by passing it through as a block device into incus without a backup? (Why do I believe this is a silly question, all the more an off-topic one...)

So long as it's not mounted anywhere but inside the VM, it'll be fine.

Not quite an NVME...

That doesn't matter; io.bus refers to how it will appear inside of the VM, not how it's connected to the host. So you can totally use NVME for a floppy disk if you want ;)

I'm really not an expert in such issues, but I'm under the impression that this mmap stuff over virtiofs is what is called the "DAX window" allowing qemu's process to allocate memory for such an mmap. Some kind of dance played by qemu and virtiofsd there that I do not understand. Links to where I analysed this issue very much superficially:

I'm also no expert on virtiofs, but I know that they have a special dax mount option for the guest which then works along with a special reserved chunk of memory to directly access the host cache. That sounds pretty good in theory and should solve a lot of problems; the issue is that this has not been merged into QEMU upstream, so even though virtiofsd supports it, it's not actually usable without running out-of-tree patches.

@gl-yziquel

gl-yziquel commented Mar 11, 2024

I'm also no expert on virtiofs, but I know that they have a special dax mount option for the guest which then works along with a special reserved chunk of memory to directly access the host cache. That sounds pretty good in theory and should solve a lot of problems; the issue is that this has not been merged into QEMU upstream, so even though virtiofsd supports it, it's not actually usable without running out-of-tree patches.

Any reference to where these patches are would be highly appreciated. Here?

https://gitlab.com/virtio-fs/qemu/-/tree/dax-2022-05-17-qemu7.0

If so, 2022 indeed seems not quite up to date.

Trying out a quick build of it, just to see...
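
For reference, a quick build of that branch would look roughly like this (standard QEMU build steps; the branch name is taken from the URL above):

git clone -b dax-2022-05-17-qemu7.0 https://gitlab.com/virtio-fs/qemu.git
cd qemu
./configure --target-list=x86_64-softmmu
make -j"$(nproc)"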

@gl-yziquel

Sidenote: when playing with what will become io.cache=none on virtiofs together with Rust code, one runs into an mmap MAP_SHARED issue replacing the usual MAP_PRIVATE for mmap'd files, which ends up not being supported by such a setup.

I just pulled out a fix for rustc itself:

rust-lang/rust#122262

We'll see if it gets merged. A member of the Rust team seems pretty hostile to "buggy filesystems" (i.e. virtio-fs) and seems to believe it's OK if rustc doesn't run with virtiofsd --cache=never.

Anyhow, MAP_SHARED in mmap() is not supported with virtiofsd --cache=never, which is a wider issue than Rust (where it just so happens that it materialises more often because it's more of a default).

@stgraber
Member Author

Any reference to where these patches are would be highly appreciated. Here?

https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg05680.html is what I had found, yours looks a bit more recent but still quite out of date :)

@gl-yziquel

Which will expose the disk at /dev/XYZ as an NVME device inside of the VM.

No, I do not see a /dev/XYZ. I do see a /dev/nvme0n1p1, but nowhere do I see this xyz device name. I'm wondering whether that xyz even means something and is observable in the guest; I can't spot any xyz in the guest.

@stgraber
Member Author

The source path isn't visible in the guest; you can, however, see the device name in the guest.
So with:

incus config device add VM-NAME xyz disk source=/dev/XYZ io.bus=nvme

The source path (/dev/XYZ) isn't visible to the guest, but the device name (xyz) is part of the disk name visible in /dev/disk/by-id.
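
For example, something along these lines should show the entry carrying the device name (the exact ID string depends on the bus and on QEMU's naming, so treat this as a sketch):

ls -l /dev/disk/by-id/ | grep xyz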

@gl-yziquel

gl-yziquel commented Mar 12, 2024

The source path isn't visible in the guest; you can, however, see the device name in the guest. So with:

incus config device add VM-NAME xyz disk source=/dev/XYZ io.bus=nvme

The source path (/dev/XYZ) isn't visible to the guest, but the device name (xyz) is part of the disk name visible in /dev/disk/by-id.

I have multiple devices that I'm trying to pass through into that VM. The device names that get assigned, like /dev/nvme[012]n1, are assigned randomly with respect to the mapping to the host. How do I get them assigned predictably? Is it even possible?

EDIT: ah! yes... that is what /dev/disk/by-id can be used for...

@sharkman424
Contributor

Hi, my group and I are looking to take on this issue for our Virtualization class. Can you please assign it to me and offer any insight as to how to best approach this implementation?

@stgraber
Member Author

stgraber commented Apr 5, 2024

This should be reasonably straightforward; it's all changes to be made in internal/server/device/disk.go. Look for where virtiofsd is started: that's where both options need to come into play, both for what gets started and for what arguments are passed to virtiofsd.
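
As a rough sketch of the io.cache side of it (a hypothetical helper, not the actual incus code; the mapping follows the issue description above):

package main

import "fmt"

// virtiofsdCacheArg maps the proposed io.cache values for filesystem shares
// to the corresponding virtiofsd --cache argument.
func virtiofsdCacheArg(ioCache string) (string, error) {
	switch ioCache {
	case "", "none":
		return "--cache=never", nil
	case "metadata":
		return "--cache=metadata", nil
	case "unsafe":
		return "--cache=always", nil
	default:
		return "", fmt.Errorf("unsupported io.cache value %q for filesystem shares", ioCache)
	}
}

func main() {
	for _, v := range []string{"none", "metadata", "unsafe"} {
		arg, _ := virtiofsdCacheArg(v)
		fmt.Println(v, "->", arg)
	}
}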

@stgraber
Member Author

stgraber commented May 9, 2024

@sharkman424 do you still intend to work on this one or should I clear the assignee?

@sharkman424
Contributor

You can clear the assignee, apologies.

@sharkman424 removed their assignment May 10, 2024
@stgraber
Member Author

No worries!

@SpiffyEight77
Contributor

Hey @stgraber, I'd like to work on this issue; could you assign it to me? 🙇🏻‍♂️

@stgraber
Member Author

Done!

stgraber added a commit to SpiffyEight77/incus that referenced this issue Jun 27, 2024
Closes lxc#599

Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
@stgraber modified the milestones: soon, incus-6.3 Jun 27, 2024
stgraber added a commit that referenced this issue Jun 28, 2024
Closes #599

Signed-off-by: Stéphane Graber <stgraber@stgraber.org>