Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime-rs: update device pci info for vfio and virtio-blk devices #8284

Merged

Conversation

amshinde
Copy link
Member

@amshinde amshinde commented Oct 21, 2023

runtime-rs: change hypervisor add_device trait to return device copy

Block(virtio-blk) and vfio devices are currently not handled correctly
by the agent as the agent is not provided with correct PCI paths for
these devices.

The PCI paths for these devices can be inferred from the PCI information
provided by the hypervisor when the device is added.
Hence changing the add_device trait function to return a device copy
with PCI info potentially provided by the hypervisor. This can then be
provided to the agent to correctly detect devices within the VM.

This commit includes implementation for PCI info update for
cloud-hupervisor for virtio-blk devices with stubs provided for other
hypervisors.

This PR fixes virtio-block device addition bug. The following works now:

$ sudo ctr run --runtime "io.containerd.kata.v2" --snapshotter devmapper --rm -t quay.io/prometheus/busybox:latest foo-dm /bin/true;echo $?
0

as well as

sudo ctr run --runtime "io.containerd.kata.v2" --device=/dev/block1 --rm -t quay.io/prometheus/busybox:latest foo-dm /bin/true;echo $?
0

The PR also fixes bugs around vfio device addition. End to end vfio device add now works:

sudo ctr run --runtime "io.containerd.kata.v2" --device=/dev/vfio/{vfio_group} --rm -t quay.io/prometheus/busybox:latest foo-dm /bin/true;echo $?
0

Fixes: #8283

@katacontainersbot katacontainersbot added the size/large Task of significant size label Oct 21, 2023
@amshinde amshinde added the wip Work in Progress (PR incomplete - needs more work or rework) label Oct 21, 2023
@Apokleos
Copy link
Contributor

Hey @amshinde

Nice work on this PR! I'm also working on this feature, and I've submitted a WIP PR as well.

My PR mainly implements the following features:

(1) Due to the different ways that different VMMs handle PCI devices, we hope to provide a general PCIe topology
processing framework that is as compatible as possible with VMMs such as dragonball, cloud-hyperivor, qemu, and fc.
(2) Provide the correct PCI path for PCI devices, including VFIO-PCI devices and virtio-PCI(virtio-blk/vhost-user-blk...) devices.
(3) Fill in the correct correspondence between host path and guest path for kata-agent devices.

Would you be interested in discussing the difference and a merge?
We can compare the pros and cons of both approaches and see if we can come up with a solution that works for everyone.

Let me know what you think.

Thx.

Copy link
Contributor

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amshinde - a few comments.

Also, although this works with the default pmem driver, it doesn't wfm using DM:

$ sudo ctr run --runtime "io.containerd.kata.v2"  --rm -t quay.io/prometheus/busybox:latest foo /bin/true;echo $?
0

$ sudo ctr run --runtime "io.containerd.kata.v2" --snapshotter devmapper --rm -t quay.io/prometheus/busybox:latest foo-dm /bin/true;echo $?
ctr: failed to create shim task: Others("failed to handler message create container\n\nCaused by:\n    0: create\n    1: handler rootfs\n    2: new block rootfs\n    3: do handle device failed.\n    4: failed to add deivce\n    5: Unexpected PCI address \"0000:00:05.0\" for clh device add"): unknown
1

@@ -7,16 +7,19 @@
use super::inner::CloudHypervisorInner;
use crate::device::DeviceType;
use crate::BlockDevice;
use crate::HybridVsockConfig;
use crate::HybridVsockDevice;
//use crate::HybridVsockConfig;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented out line.

Comment on lines 79 to 82
//DeviceType::ShareFs(sharefs) => self.handle_share_fs_device(sharefs.config).await,
//DeviceType::HybridVsock(hvsock) => self.handle_hvsock_device(&hvsock.config).await,
//DeviceType::Block(block) => self.handle_block_device(block.config).await,
//_ => Err(anyhow!("unhandled device: {:?}", device)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove these commented out lines.

Ok(())
}
Err(e) => {
self.decrease_attach_count().await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fails, we'll return the count error, not the original one (line below).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as previous behaviour. If we want to change this, lets discuss. It could go as part of a separate change.

@jodh-intel
Copy link
Contributor

/cc @quanweiZhou, @studychao for input on the Hypervisor trait change on this PR.

@amshinde
Copy link
Member Author

Would you be interested in discussing the difference and a merge? We can compare the pros and cons of both approaches and see if we can come up with a solution that works for everyone.

Let me know what you think.

@Apokleos Yes, lets discuss this and figure out a common approach.

@amshinde amshinde removed the wip Work in Progress (PR incomplete - needs more work or rework) label Oct 25, 2023
@amshinde
Copy link
Member Author

hey @Apokleos I just saw your PR #7932, where you introduce update-device trait with similar idea of the hypervisor providing pcie info. Wanted to discuss with you if we could arrive and agree on a way to do this.

@amshinde amshinde force-pushed the runtime-rs-update-device-pci-info branch 2 times, most recently from fb382a4 to dce711c Compare October 25, 2023 08:13
Copy link
Contributor

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amshinde - Tested and wfm.

We still need input from @Apokleos about his PR and also comments on the very minor hypervisor trait tweak on this PR.

lgtm

.with_context(|| format!("fail to attach passthrough fs {:?}", source));

match result {
Ok(__) => Ok(()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
Ok(__) => Ok(()),
Ok(_) => Ok(()),

? 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed :) It still compiled with that!

@amshinde amshinde force-pushed the runtime-rs-update-device-pci-info branch from dce711c to 46c6795 Compare October 25, 2023 20:23
@amshinde
Copy link
Member Author

/test

@amshinde
Copy link
Member Author

Also, although this works with the default pmem driver, it doesn't wfm using DM:

This is now fixed with this PR. This PR fixes the device addition bug seen with:

$ sudo ctr run --runtime "io.containerd.kata.v2" --snapshotter devmapper --rm -t quay.io/prometheus/busybox:latest foo-dm /bin/true;echo $?
0

as well as

sudo ctr run --runtime "io.containerd.kata.v2" --device=/dev/block1 --rm -t quay.io/prometheus/busybox:latest foo-dm /bin/true;echo $?
0

@amshinde amshinde requested a review from lifupan October 26, 2023 21:30
Copy link
Contributor

@Apokleos Apokleos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @amshinde comments as below.
It's mainly for the removed code generate_guest_pci_path.

}
if self.bus_mode == VfioBusMode::PCI {
for hostdev in self.devices.iter() {
if hostdev.guest_pci_path.is_some() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @amshinde, in case of dragonball, the guest_pci_path is always None, as dragonball doesn't return device info as clh does.
IMO, the None case of hostdev.guest_pci_path needs be handled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -47,29 +44,6 @@ lazy_static! {
Arc::new(RwLock::new(HashMap::new()));
}

// map host/guest bdf and the mapping saved into `HOST_GUEST_MAP`,
// and return PciPath.
pub fn generate_guest_pci_path(bdf: String) -> Result<PciPath> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removed generate_guest_pci_path which will help deduce the correct pci Path, it should be reserved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I guess this would be used by dragonball. I'll add it back.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Apokleos I have added generate_guest_pci_path back in case it is usefull for dragonball. Plese take a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @amshinde

@amshinde amshinde force-pushed the runtime-rs-update-device-pci-info branch from 46c6795 to ae321c1 Compare November 2, 2023 03:42
@amshinde
Copy link
Member Author

amshinde commented Nov 2, 2023

/test

@amshinde
Copy link
Member Author

amshinde commented Nov 2, 2023

@Apokleos I have addressed your comments.

@@ -40,7 +40,7 @@ pub enum DeviceType {
ShareFs(ShareFsDevice),
ShareFsMount(ShareFsMountDevice),
HybridVsock(HybridVsockDevice),
Vsock(VsockDevice),
//Vsock(VsockDevice),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now taken out the Vsock value entirely from the enum. I have added an explanation for it in the commit message. Please take a look at it and let me know if it makes sense.
To summarize, Vsock value was removed as the Vsock device is not really used anywhere nor can it be used in its current state as it does not implement the Device trait. This means that it cannot be attached or detached from the hypervisor. What I can deduce from the code is that one of the reasons is that Vsock does not really implement the Clone trait which looks like a prerequisite for the Device trait, as the async trait functions require Device copies to be passed between them.(Vsock struct currently has a File handle which does not implement Clone, so this would require some rework) This is also the reason that I decided to remove Vsock for now, as the change I am introducing here requires devices to implement Clone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amshinde Vsock is required for QEMU support. Could you add it back? Or @pmores would need to add it back in his PR. Vsock device is not fully implemented but it does not mean that it is not needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bergwolf The implementation is still present, I just removed it from the DeviceType enum. With its current implementation it neither supports attach or detach Device traits, meaning it cant really be hotplugged to a hypervisor. @pmores can add it back in the Device list once he adds support for qemu.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amshinde qemu support is added by PR #8185 which is broken by this change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amshinde It would have been greatly appreciated to hold on a bit, so that @pmores had at least a chance to discuss some more before this change was merged.

Copy link
Member Author

@amshinde amshinde Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gkurz Was not aware of this change drastiaclly breaking @pmores PR. Apologies, I was eager to get this landed to unblock folks to get started with implementing a CI for clh which is borked without this change.I had a look at @pmores PR for qemu support and here are my suggestions to handle the conflicts:
Like I mentioned I have removed VsockDevice from the DeviceType enum list. I think these enums should only specify devices that implement the Device trait (attach/detach) and are passed down by the DeviceManager. None of these two are true for VsockDevice. VsockDevice is a special case device that is useful internally for qemu for agent communication(currently) and will never be handled/created by the Device Manager. Nor will the device really need to implement the Device trait with attach and detach since it does not make sense to hotplug/unplug this device. What we really want is initializing code to get the guest cid.

From what I have seen from a quick walkthrough of @pmores PR the breakage is due to the fact that he currently maintains a device vector of type DeviceType and stores the VsockDevice inside it which would not work due to change in my PR (and my argument above for the change). My suggestion would be since we only need Vsock for agent communication(for now), would be to store this as an expicit reference to VsockDevice in the QemuInner struct and not handle this in the device list that is semantically meant to handle devices that can be attached/detached. This can then be used in the get_agent_socket and get_agent_vsock_dev methods. Something like:

#[derive(Debug)]
  pub struct QemuInner {
      /// sandbox id
      id: String,
    
      qemu_process: Option<Child>,
  
      config: HypervisorConfig,
      devices: Vec<DeviceType>,
      vsock: VsockDevice,
  }

Copy link
Contributor

@pmores pmores Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @amshinde ! I'm aware that the problem this PR creates in the qemu driver can be solved by special-casing, like just about any problem for that matter. I'm not sure though that this is warranted.

I'm not aware of the close semantic ties in the runtime-rs design and current practice that you describe between DeviceType, the Device trait and DeviceManager. Also from Rust standpoint, DeviceType is an enum so no relations among the variant types are implied. In short, I don't see that a DeviceType variant implies cloneability or attach/detach functionality.

Also, could you explain the specific problem that DeviceType::Vsock causes to you? If add_device() now takes and returns a DeviceType by value, I think you can do that with copyable and non-copyable types just as well.

What do other runtime-rs devs think, @bergwolf , @studychao , @Tim-0731-Hzt perhaps?

@amshinde amshinde force-pushed the runtime-rs-update-device-pci-info branch 2 times, most recently from 444b5c1 to f665869 Compare November 3, 2023 23:36
@amshinde
Copy link
Member Author

amshinde commented Nov 3, 2023

/test

Block(virtio-blk) and vfio devices are currently not handled correctly
by the agent as the agent is not provided with correct PCI paths for
these devices.

The PCI paths for these devices can be inferred from the PCI information
provided by the hypervisor when the device is added.
Hence changing the add_device trait function to return a device copy
with PCI info potentially provided by the hypervisor. This can then be
provided to the agent to correctly detect devices within the VM.

This commit includes implementation for PCI info update for
cloud-hupervisor for virtio-blk devices with stubs provided for other
hypervisors.

Removing Vsock from the DeviceType enum as Vsock currently does not
implement the Device Trait, it has no attach and detach trait functions
among others. Part of the reason is because these functions require Vsock
to implement Clone trait as these functions need cloned copies to be
passed down the hypervisor.

The change introduced for returning a device copy from the add_device
hypervisor trait explicitly requires a device to implement
Copy trait. Hence removing Vsock from the DeviceType enum for now, as
its implementation is incomplete and not currently used.

Note, one of the blockers for adding the Clone trait to Vsock is that it
currently includes a file handle which cannot be cloned. For Clone and
Device Traits to be implemented for Vsock, it requires an implementation
change in the future for it to be cloneable.

Fixes: kata-containers#8283

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
If PCI path for block device is not empty for a block device, use
that as identifier for agent instead of virt path which is valid only
for mmio devices.

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Remove earlier functionality that tries to assign PCI path to vfio
devices from the host assuming pci slots to start from 1.
Get this from the hypervisor instead.

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
@amshinde amshinde force-pushed the runtime-rs-update-device-pci-info branch from f665869 to 036b778 Compare November 6, 2023 06:00
Copy link
Contributor

@Apokleos Apokleos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @amshinde LGTM!

@amshinde
Copy link
Member Author

amshinde commented Nov 6, 2023

/test

Copy link
Contributor

@cmaf cmaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm @amshinde

@amshinde amshinde merged commit 3b2fb6a into kata-containers:main Nov 6, 2023
137 of 142 checks passed
pmores added a commit to pmores/kata-containers that referenced this pull request Nov 9, 2023
This basically just a partial revert of PR kata-containers#8284, the only new code is
adding Vsock support to DeviceType::try_clone().

At this point, if any DeviceType but Vsock is handled by an add_device()
implementation it will be cloned via try_clone() if necessary as before.
If Vsock is ever passed to add_device() the call will properly return
an error since Vsock devices apparently aren't meant to be handled by
add_device() anyway.

Fixes: kata-containers#8413

Signed-off-by: Pavel Mores <pmores@redhat.com>
@pmores pmores mentioned this pull request Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test size/large Task of significant size
Projects
None yet
Development

Successfully merging this pull request may close these issues.

runtime-rs: Update PCI path information for VFIO and virtio-blk devices
8 participants