This repository has been archived by the owner on May 12, 2021. It is now read-only.

[qemu] q35 VFIO passthrough fails on both bridge and pcie-root-port #2664

Closed
amorenoz opened this issue May 6, 2020 · 7 comments

Comments

@amorenoz
Contributor

amorenoz commented May 6, 2020

I have recently tried VFIO passthrough with and without SR-IOV and detected a number of problems that make it fail.
Reporting them as a single issue since they all contribute to "VFIO passthrough not working". Let me know if you prefer to split them.

The tests below were performed with the following device:
65:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)

Description of the problem

In general, trying to add a PF or VF via VFIO passthrough fails.

Problem 1: pci-rescan

The pci-rescan triggered by kata-agent makes shpchp fail. This is basically the same problem as reported in #2460.
@devimc pointed me to the origin of the rescan, which seems to be related to the lack of ACPI hotplug support in old versions of qemu.

Question: Could we implement a mechanism to tell the agent whether a rescan is needed?
Should we keep the rescan as the default behaviour in that case?

Alternatively, could the mechanism that was implemented to fix #2460 be extended to support more than just devices with large BARs?

Failing dmesg:

[    2.826270] shpchp 0000:00:02.0: Latch close on Slot(1)
[    2.826282] shpchp 0000:00:02.0: Button pressed on Slot(1)
[    2.826290] shpchp 0000:00:02.0: Card present on Slot(1)
[    2.827662] shpchp 0000:00:02.0: PCI slot #1 - powering on due to button press
[    2.827970] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[    2.834453] pci 0000:01:01.0: [8086:1572] type 00 class 0x020000
[    2.834706] pci 0000:01:01.0: reg 0x10: [mem 0x00000000-0x00ffffff 64bit pref]
[    2.834909] pci 0000:01:01.0: reg 0x1c: [mem 0x00000000-0x00007fff 64bit pref]
[    2.836799] pci 0000:01:01.0: PME# supported from D0 D3hot D3cold
[    2.846095] pci 0000:01:01.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
[    2.846100] pci 0000:01:01.0: BAR 0: failed to assign [mem size 0x01000000 64bit pref]
[    2.846106] pci 0000:01:01.0: BAR 3: assigned [mem 0xfe800000-0xfe807fff 64bit pref]
[    2.920303] audit: type=1334 audit(1588753077.748:6): prog-id=8 op=LOAD
[    9.277048] shpchp 0000:00:02.0: Device 0000:01:01.0 already exists at 0000:01:01, cannot hot-add
[    9.277053] shpchp 0000:00:02.0: Cannot add device at 0000:01:01
[    9.285780] shpchp 0000:00:02.0: Latch open on Slot(1)
[    9.285800] shpchp 0000:00:02.0: Button pressed on Slot(1)
[    9.285818] shpchp 0000:00:02.0: Card not present on Slot(1)
[    9.287833] shpchp 0000:00:02.0: PCI slot #1 - powering on due to button press
[   14.397081] shpchp 0000:00:02.0: No adapter on slot(1)

And, lspci -vv reports:

01:01.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev ff) (prog-if ff)
        !!! Unknown header type 7f

FWIW: forcing isLargeBarSpace to return true eliminates the issue.

This problem is reproducible with both the PF and one of its VFs.
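
To check a guest dmesg capture for the same symptom, something like the following could be used. It is a minimal sketch, assuming the kernel keeps the exact "already exists ... cannot hot-add" wording shown in the log above; `hot_add_conflicts` is a hypothetical helper name and is not part of kata.

```python
import re

# Matches the shpchp message logged when the hot-added device was already
# enumerated (here, by the concurrent bus rescan).
HOT_ADD_CONFLICT_RE = re.compile(
    r"shpchp (?P<bridge>\S+): Device (?P<device>\S+) already exists at \S+, cannot hot-add"
)

def hot_add_conflicts(dmesg_text):
    """Return (bridge BDF, device BDF) for every 'cannot hot-add' conflict found."""
    conflicts = []
    for line in dmesg_text.splitlines():
        match = HOT_ADD_CONFLICT_RE.search(line)
        if match:
            conflicts.append((match.group("bridge"), match.group("device")))
    return conflicts

# Lines copied from the failing dmesg above.
SAMPLE = """\
[    2.846106] pci 0000:01:01.0: BAR 3: assigned [mem 0xfe800000-0xfe807fff 64bit pref]
[    9.277048] shpchp 0000:00:02.0: Device 0000:01:01.0 already exists at 0000:01:01, cannot hot-add
[    9.277053] shpchp 0000:00:02.0: Cannot add device at 0000:01:01
"""

conflicts = hot_add_conflicts(SAMPLE)
print(conflicts)
```

An empty result only means this particular message is absent; it does not prove the hotplug succeeded.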

Problem 2: isPcieDevice does not account for pcie-root-port

When trying to work around problem 1, I tried to take the pcie-pci-bridge out of the picture, so I enabled:

hotplug_vfio_on_root_bus = true
pcie_root_port = 2

and tried to add the PF:

$ ls /sys/kernel/iommu_groups/32/devices/
0000:65:00.0
$ lspci -s 0000:65:00.0
65:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
$ sudo podman run -it --device /dev/vfio/32  --runtime=/usr/local/bin/kata-runtime fedorat sh
Error: QMP command failed: Bus 'pcie.0' does not support hotplugging: OCI runtime error

Looking at the logs, the following caught my eye:

May 06 05:37:51 time="2020-05-06T05:37:51.615333266-04:00" level=info msg="Start hot-plug VFIO device" arch=amd64 command=create container=a1a265ed847721c4009b9a7c767c16466e025a0e5bef19aaf145b8b82408b050 device-info="{\"IsPCIe\":false,\"Type\":1,\"ID\":\"vfio-4d57823e41cd64510\",\"BDF\":\"65:00.0\",\"SysfsDev\":\"/sys/bus/pci/devices/0000:65:00.0\",\"VendorID\":\"\",\"DeviceID\":\"\",\"Class\":\"0x020000\",\"Bus\":\"\"}" hotplug-vfio-on-root-bus=true machine-type=q35 name=kata-runtime pcie-root-port=2 pid=25464 source=virtcontainers subsystem=qemu

especially the part where it says it's not a PCIe device.
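
For anyone wanting to pull that flag out of the runtime logs programmatically, here is a minimal sketch. It assumes the `device-info` value is a quoted string whose only escapes are `\"` and `\\` (true for the line above, not guaranteed in general); `device_info_from_log` is a hypothetical helper, and the sample line is a shortened copy of the log entry.

```python
import json
import re

# Captures the quoted value of the device-info field, allowing escaped characters.
DEVICE_INFO_RE = re.compile(r'device-info="((?:[^"\\]|\\.)*)"')

def device_info_from_log(line):
    """Decode the JSON carried in the device-info field, or return None if absent."""
    match = DEVICE_INFO_RE.search(line)
    if match is None:
        return None
    # First undo the log-level quoting, then parse the embedded JSON document.
    inner = json.loads('"' + match.group(1) + '"')
    return json.loads(inner)

# Shortened copy of the log line above (fields trimmed for readability).
SAMPLE = (
    r'level=info msg="Start hot-plug VFIO device" '
    r'device-info="{\"IsPCIe\":false,\"Type\":1,\"BDF\":\"65:00.0\",\"Class\":\"0x020000\"}" '
    r'hotplug-vfio-on-root-bus=true machine-type=q35 pcie-root-port=2'
)

info = device_info_from_log(SAMPLE)
print(info["BDF"], info["IsPCIe"])
```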

Looking at isPCIeDevice and at my PCI tree:

+-[0000:64]-+-00.0-[65-66]--+-00.0  Intel Corporation Ethernet Controller X710 for 10GbE SFP+                                                                                                                                                  
 |           |               \-00.1  Intel Corporation Ethernet Controller X710 for 10GbE SFP+                                                                                                                                                  
 |           +-05.0  Intel Corporation Sky Lake-E VT-d                                                                                                                                                                                          
 |           +-05.2  Intel Corporation Sky Lake-E RAS Configuration Registers                                                                                                                                                                   
 |           +-05.4  Intel Corporation Sky Lake-E IOxAPIC Configuration Registers                                                                                                                                                               
 |           +-08.0  Intel Corporation Sky Lake-E Integrated Memory Controller                                                                                                                                                                  
...
$ lspci -s 0000:64:00 
64:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)

it seems it's not being detected as a PCIe device despite being connected to a pcie-root-port.

Note that even forcing the hotplug into the pci-root-device does not work. However, repeating the process manually (via QMP) does work, which makes me think the use of pci-root-device is still affected by the rescan race condition described above.

Expected result

We should be able to add both PFs and VFs to a kata container, using either a bridge or a pcie-root-port.

Actual result

It is not possible to add a PF or VF to a kata container, either on a bridge or on a pcie-root-port.

Credits:

Thanks to @devimc for his help troubleshooting the issues

@amorenoz amorenoz added bug Incorrect behaviour needs-review Needs to be assessed by the team. labels May 6, 2020
@amorenoz
Contributor Author

amorenoz commented May 7, 2020

Confirmed that forcing isLargeBarSpace to return true also fixes the root-port scenario, having the following log:

[    4.193056] pcieport 0000:00:05.0: pciehp: Slot(0): Attention button pressed
[    4.193193] pcieport 0000:00:05.0: pciehp: Slot(0) Powering on due to button press
[    4.193418] pcieport 0000:00:05.0: pciehp: Slot(0): Card present
[    4.193533] pcieport 0000:00:05.0: pciehp: Slot(0): Link Up
[    4.318660] pci 0000:02:00.0: [8086:1572] type 00 class 0x020000
[    4.319333] pci 0000:02:00.0: reg 0x10: [mem 0x00000000-0x00ffffff 64bit pref]
[    4.319776] pci 0000:02:00.0: reg 0x1c: [mem 0x00000000-0x00007fff 64bit pref]
[    4.320288] pci 0000:02:00.0: Max Payload Size set to 128 (was 256, max 2048)
[    4.321631] pci 0000:02:00.0: PME# supported from D0 D3hot D3cold
[    4.323035] pci 0000:02:00.0: BAR 0: no space for [mem size 0x01000000 64bit pref]
[    4.323243] pci 0000:02:00.0: BAR 0: failed to assign [mem size 0x01000000 64bit pref]
[    4.323445] pci 0000:02:00.0: BAR 3: assigned [mem 0xfe600000-0xfe607fff 64bit pref]
[    4.323882] pcieport 0000:00:05.0: PCI bridge to [bus 02]
[    4.324047] pcieport 0000:00:05.0:   bridge window [io  0x1000-0x1fff]
[    4.328618] pcieport 0000:00:05.0:   bridge window [mem 0xfd800000-0xfd9fffff]
[    4.330752] pcieport 0000:00:05.0:   bridge window [mem 0xfe600000-0xfe7fffff 64bit pref]
[    4.334643] PCI: No. 2 try to assign unassigned res
[    4.334791] release child resource [mem 0xfe600000-0xfe607fff 64bit pref]
[    4.334956] pcieport 0000:00:05.0: resource 15 [mem 0xfe600000-0xfe7fffff 64bit pref] released
[    4.335180] pcieport 0000:00:05.0: PCI bridge to [bus 02]
[    4.340221] pcieport 0000:00:05.0: BAR 15: assigned [mem 0xec0000000-0xec17fffff 64bit pref]
[    4.340474] pci 0000:02:00.0: BAR 0: assigned [mem 0xec0000000-0xec0ffffff 64bit pref]
[    4.340877] pci 0000:02:00.0: BAR 3: assigned [mem 0xec1000000-0xec1007fff 64bit pref]
[    4.341276] pcieport 0000:00:05.0: PCI bridge to [bus 02]
[    4.341432] pcieport 0000:00:05.0:   bridge window [io  0x1000-0x1fff]
[    4.344018] pcieport 0000:00:05.0:   bridge window [mem 0xfd800000-0xfd9fffff]
[    4.345904] pcieport 0000:00:05.0:   bridge window [mem 0xec0000000-0xec17fffff 64bit pref]
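
The sizes in this log can be checked with a bit of arithmetic. The sketch below only uses addresses copied from the lines above; `span` is a hypothetical helper. It shows why BAR 0 initially gets "no space": the first 64-bit prefetchable bridge window is smaller than the BAR, while the reassigned window is large enough for both BARs.

```python
def span(start, end):
    """Size in bytes of an inclusive [start, end] address range."""
    return end - start + 1

MIB = 1 << 20
KIB = 1 << 10

old_window = span(0xfe600000, 0xfe7fffff)    # initial 64-bit prefetchable bridge window
new_window = span(0xec0000000, 0xec17fffff)  # window after BAR 15 was reassigned
bar0 = 0x01000000                            # "mem size" reported for BAR 0
bar3 = span(0xfe600000, 0xfe607fff)          # BAR 3 as first assigned

print(old_window // MIB, new_window // MIB, bar0 // MIB, bar3 // KIB)
# prints "2 24 16 32": a 16 MiB BAR cannot fit in a 2 MiB window,
# but 16 MiB + 32 KiB fits in the 24 MiB one.
```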

So, I tried to reproduce the race condition generically. Running while true; do echo 1 > /sys/bus/pci/rescan; done, we can:

  • Reproduce the race when plugging and unplugging on the bridge with all kinds of devices, including, for instance, e1000.
  • Reproduce the race on hot-unplug on the pcie-root-port.

I don't think this depends on the size of the BAR. Even if the problem was solved in the kernel, the result would be to serialize the hotplug and the rescan (making the rescan hold the slot lock?), which does not solve the 5s delay that justified the rescan in the first place.

There is still something I'm missing because I cannot reproduce the race on hotplug on the root-port. However, forcing kata to delay the plug (forcing isLargeBarSpace to return true) solves the issue. If I don't do that I simply see:

[    3.630707] pcieport 0000:00:05.0: pciehp: Slot(0): Attention button pressed
[    3.630872] pcieport 0000:00:05.0: pciehp: Slot(0) Powering on due to button

and nothing else...
which seems to indicate the hotplug has been interrupted. A subsequent rescan does not make the device appear though. Any idea?

Regardless of the mysterious disappearance of the device, I see the following ways forward:

  • kata runtime/agent ensures the rescan does not happen during a hotplug or hot-unplug
  • a generic (i.e. not associated with the size of the BAR) way of disabling the rescan. The price of this is either the 5-second delay or having the orchestration preallocate pcie-root-ports.
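
The first option amounts to mutual exclusion between the rescan and any hotplug or hot-unplug. A minimal sketch of the idea, assuming both operations can be funnelled through one shared lock; `BusGate`, `fake_hotplug` and `fake_rescan` are hypothetical stand-ins, and where such a lock would live (runtime or agent) is left open:

```python
import threading
import time

class BusGate:
    """Hypothetical helper: one lock shared by rescans and hot(un)plug operations."""

    def __init__(self):
        self._lock = threading.Lock()
        self.events = []  # (operation name, "start" | "end"), in execution order

    def run(self, name, operation):
        # Only one bus operation may be in flight at a time.
        with self._lock:
            self.events.append((name, "start"))
            operation()
            self.events.append((name, "end"))

def fake_hotplug():
    time.sleep(0.05)  # stand-in for the slot power-on sequence

def fake_rescan():
    time.sleep(0.01)  # stand-in for `echo 1 > /sys/bus/pci/rescan`

gate = BusGate()
threads = [
    threading.Thread(target=gate.run, args=("hotplug", fake_hotplug)),
    threading.Thread(target=gate.run, args=("rescan", fake_rescan)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Whichever operation takes the lock first finishes before the other starts.
print(gate.events)
```

This only serializes the two operations; it does nothing about the 5s delay mentioned above.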

What do you think @devimc @amshinde?

@devimc

devimc commented May 7, 2020

@amorenoz I'm running some tests to see if we can get rid of pci-rescan; that way, lazy attach won't be needed.

devimc pushed a commit to devimc/kata-agent that referenced this issue May 7, 2020
PCI bus rescan code was added a long time ago in Clear Containers due to the lack of
ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing up PCIe hotplug
in Kata Containers. A workaround for this issue is the "lazy attach"
mechanism [2], which hotplugs LBS (Large BAR space) devices after re-scanning the
PCI bus. Unfortunately, some non-LBS devices are affected too, for
instance SR-IOV devices. It would not make sense to lazy-attach non-LBS
devices, because kata would end up lazy-attaching all the devices. Therefore,
the PCI bus rescan code and the "lazy attach" mechanism should be removed.

fixes kata-containers#781
fixes kata-containers/runtime#2664

[1]: clearcontainers/agent#139
[2]: kata-containers/runtime#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
devimc pushed a commit to devimc/kata-runtime that referenced this issue May 7, 2020
The "lazy attach" mechanism [1] was added to hotplug LBS (Large BAR space)
devices after re-scanning the PCI bus, fixing LBS hotplug in kata containers.
Since PCI rescan is removed in kata-containers/agent#782, lazy attach is no
longer needed.

Depends-on: github.com/kata-containers/agent#782
fixes kata-containers#2664

[1] kata-containers#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
devimc pushed a commit to devimc/kata-agent that referenced this issue May 7, 2020
PCI bus rescan code was added a long time ago in Clear Containers due to the lack of
ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing up PCIe hotplug
in Kata Containers. A workaround for this issue is the "lazy attach"
mechanism [2], which hotplugs LBS (Large BAR space) devices after re-scanning the
PCI bus. Unfortunately, some non-LBS devices are affected too, for
instance SR-IOV devices. It would not make sense to lazy-attach non-LBS
devices, because kata would end up lazy-attaching all the devices. Therefore,
the PCI bus rescan code and the "lazy attach" mechanism should be removed.

Depends-on: github.com/kata-containers/runtime#2670
fixes kata-containers#781
fixes kata-containers/runtime#2664

[1] clearcontainers/agent#139
[2] kata-containers/runtime#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
devimc pushed a commit to devimc/kata-runtime that referenced this issue May 8, 2020
@amorenoz
Contributor Author

amorenoz commented May 8, 2020

Thanks @devimc. In the tests I ran, disabling the rescan worked fine. Also, the pcie-root-port hotplug did not incur the 5s delay.
Let me know if I can assist in any way.

devimc pushed a commit to devimc/kata-agent that referenced this issue May 8, 2020
@amorenoz
Contributor Author

Splitting Problem 2 described above into a new issue: #2678

@dagrh
Contributor

dagrh commented Jun 10, 2020

The presence of /sys/bus/pci/slots seems to be very random; looking at 4 machines, 2 of them have it and 2 of them have it completely empty, even though there are PCIe devices in there.
Frustratingly /sys/bus/pci/devices/.../max_link_speed doesn't say PCIe or not.
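
One alternative that does not rely on /sys/bus/pci/slots would be to walk the capability list in the device's config space and look for the PCI Express capability (ID 0x10). A minimal sketch, assuming the bytes come from /sys/bus/pci/devices/<BDF>/config; `has_pcie_capability` and `fake_config` are hypothetical helpers, this is not the runtime's isPCIeDevice, and the buffers below are synthetic rather than read from real hardware.

```python
PCI_STATUS = 0x06            # status register offset
PCI_STATUS_CAP_LIST = 0x10   # status bit: capability list present
PCI_CAPABILITY_LIST = 0x34   # offset of the first capability pointer
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID

def has_pcie_capability(config):
    """Return True if the config space bytes contain a PCI Express capability."""
    if len(config) < 64:
        return False
    status = int.from_bytes(config[PCI_STATUS:PCI_STATUS + 2], "little")
    if not status & PCI_STATUS_CAP_LIST:
        return False
    ptr = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against a malformed, looping list
    while ptr >= 0x40 and ptr + 1 < len(config) and ptr not in seen:
        seen.add(ptr)
        if config[ptr] == PCI_CAP_ID_EXP:
            return True
        ptr = config[ptr + 1] & 0xFC  # follow the "next capability" pointer
    return False

def fake_config(cap_ids):
    """Build a synthetic 256-byte config space with the given capability IDs chained."""
    cfg = bytearray(256)
    if cap_ids:
        cfg[PCI_STATUS] = PCI_STATUS_CAP_LIST
        cfg[PCI_CAPABILITY_LIST] = 0x40
    for i, cap_id in enumerate(cap_ids):
        off = 0x40 + 0x10 * i
        cfg[off] = cap_id
        cfg[off + 1] = off + 0x10 if i + 1 < len(cap_ids) else 0
    return bytes(cfg)

pcie_like = has_pcie_capability(fake_config([0x01, PCI_CAP_ID_EXP]))
pci_like = has_pcie_capability(fake_config([0x01, 0x05]))
print(pcie_like, pci_like)
```

On a real host the input would be the bytes read from the device's config file. As far as I know, unprivileged reads of that file are limited to the first 64 bytes, in which case the walk stops early and returns False, so this would need to run with sufficient privileges.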

@jodh-intel jodh-intel added this to To do in Issue backlog Aug 10, 2020
devimc pushed a commit to devimc/kata-agent that referenced this issue Sep 4, 2020
@dgibson dgibson self-assigned this Sep 22, 2020
@dgibson
Contributor

dgibson commented Sep 22, 2020

The original report covers two problems. The isPCIeDevice problem is already fixed with #2889, so I'm updating the title to reflect that this issue now only tracks the rescan problem.

@jodh-intel jodh-intel moved this from To do to In progress in Issue backlog Sep 23, 2020
bpradipt pushed a commit to bpradipt/runtime that referenced this issue Nov 27, 2020
We send information about several kinds of devices to the agent so that it
can apply specific handling.  We don't currently do this with VFIO devices.
However we need to do that so that the agent can properly wait for VFIO
devices to be ready (previously it did that using a PCI rescan which may
not be reliable and has some very bad side effects).

This patch collates and sends the relevant information.

Depends-on: github.com/kata-containers/agent#850
fixes kata-containers#2664

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
bpradipt pushed a commit to bpradipt/runtime that referenced this issue Nov 27, 2020
The "lazy attach" mechanism [1] was added to hotplug LBS (Large BAR space)
devices after re-scanning the PCI bus, fixing LBS hotplug in kata containers.
Since PCI rescan is removed in kata-containers/agent#782, lazy attach is no
longer needed.

fixes kata-containers#2664

[1] kata-containers#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
@dgibson
Contributor

dgibson commented Dec 18, 2020

I'm no longer planning to pursue this in Kata 1; I'll be following up in Kata 2 instead.

@dgibson dgibson closed this as completed Dec 18, 2020
Issue backlog automation moved this from In progress to Done Dec 18, 2020