
VM disk hotplug issue (running out of hotplug slots) #1086

Closed
srd424 opened this issue Aug 7, 2024 · 16 comments


srd424 commented Aug 7, 2024

Required information

  • Distribution: incus in docker container running on Balena (!)

  • The output of "incus info" - incus-info.txt

    • Kernel version: 6.1.35
    • Incus version: 6.0.0
    • Storage backend in use: btrfs, lvm, custom

Issue description

virtiofs hotplug seems to fail whenever I try to use it:

steved@xubuntu:~$ incus config device add vchost1 ostree3 disk source=/vol/ostree path=/vol/contpool/ostree2
Error: Failed to start device "ostree3": Failed to add the virtiofs device: Bus 'qemu_pcie21' not found

I'm wondering if this is a logic error in the code here:

// Iterate through all the instance devices in the same sorted order as is used when allocating the
// boot time devices in order to find the PCI bus slot device we would have used at boot time.
// Then attempt to use that same device, assuming it is available.
for _, dev := range d.expandedDevices.Sorted() {
	if dev.Name == deviceName {
		break // Found our device.
	}

	pciDevID++
}

pciDeviceName := fmt.Sprintf("%s%d", busDevicePortPrefix, pciDevID)
d.logger.Debug("Using PCI bus device to hotplug virtiofs into", logger.Ctx{"device": deviceName, "port": pciDeviceName})

qemuDev := map[string]string{
	"driver":  "vhost-user-fs-pci",
	"bus":     pciDeviceName,
	"addr":    "00.0",
	"tag":     mountTag,
	"chardev": mountTag,
	"id":      deviceID,
}

I don't fully understand it, but if I look in the raw QEMU config for this VM, the existing entries for all the virtiofs/9p devices show the bus as "qemu_pcie2", which makes me think this code should be doing... something else!
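
To make the failure mode concrete, here is a minimal, self-contained sketch of that counting behaviour. This is illustrative only, not the actual Incus code: the device names, the starting offset, and the list contents are made up.

package main

import "fmt"

func main() {
	// Hypothetical sorted device list and base offset, for illustration only.
	devices := []string{"cpool", "eth0", "iso", "ostree", "ostree3", "root"}
	deviceName := "ostree3"
	pciDevID := 4 // assumed index of the first hotpluggable root port

	// Mirrors the loop quoted above: the bus index is simply the device's
	// position in the sorted list, whether or not the devices before it
	// actually consumed a qemu_pcieN root port at boot.
	for _, name := range devices {
		if name == deviceName {
			break // Found our device.
		}

		pciDevID++
	}

	// With many devices ahead of ours, this can name a port that QEMU never
	// created, which is exactly the "Bus 'qemu_pcie21' not found" failure.
	fmt.Printf("qemu_pcie%d\n", pciDevID)
}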

Member

stgraber commented Aug 8, 2024

stgraber@dakara:~$ incus launch images:ubuntu/24.04 v1 --vm
Launching v1
stgraber@dakara:~$ incus config device add v1 etc disk source=/etc/ path=/mnt/etc
Device etc added to v1
stgraber@dakara:~$ incus exec v1 -- df -h /mnt/etc
Filesystem      Size  Used Avail Use% Mounted on
incus_etc        90G   25G   61G  29% /mnt/etc
stgraber@dakara:~$ 

Can you show a full incus config show --expanded ostree3?

Also, any chance you can test on an up to date version of Incus (6.0.1 for LTS, 6.3 for non-LTS)?

stgraber added the Incomplete (Waiting on more information from reporter) label on Aug 8, 2024
Author

srd424 commented Aug 8, 2024 via email

Author

srd424 commented Aug 8, 2024

I can't easily upgrade to 6.0.1 - I have some VMs running I really don't want to stop. I will look at spinning up another incus install somewhere though.

For the moment, I did create a fresh VM, started it, and added two virtiofs mounts OK. I shut it down and restarted it, added one more OK, then the fourth failed with: Error: Failed to start device "test2": Failed to add the virtiofs device: Bus 'qemu_pcie9' not found

This might be a question of working backwards from the code to work out what the failing condition is...

Author

srd424 commented Aug 8, 2024

Here's the PCI topology after the tests described above:

           +-01.0-[01]--+-00.0  Red Hat, Inc. Virtio memory balloon [1af4:1045]
           |            +-00.1  Red Hat, Inc. Virtio RNG [1af4:1044]
           |            +-00.2  Red Hat, Inc. Virtio input [1af4:1052]
           |            +-00.3  Red Hat, Inc. Virtio input [1af4:1052]
           |            +-00.4  Red Hat, Inc. Virtio socket [1af4:1053]
           |            +-00.5  Red Hat, Inc. Virtio console [1af4:1043]
           |            \-00.6  Red Hat, Inc. QEMU XHCI Host Controller [1b36:000d]
           +-01.1-[02]----00.0  Red Hat, Inc. Virtio SCSI [1af4:1048]
           +-01.2-[03]--+-00.0  Red Hat, Inc. Virtio filesystem [1af4:1049]
           |            +-00.1  Red Hat, Inc. Virtio filesystem [1af4:1049]
           |            +-00.2  Red Hat, Inc. Virtio file system [1af4:105a]
           |            +-00.3  Red Hat, Inc. Virtio filesystem [1af4:1049]
           |            +-00.4  Red Hat, Inc. Virtio file system [1af4:105a]
           |            \-00.5  Red Hat, Inc. Virtio filesystem [1af4:1049]
           +-01.3-[04]----00.0  Red Hat, Inc. Virtio GPU [1af4:1050]
           +-01.4-[05]----00.0  Red Hat, Inc. Virtio network device [1af4:1041]
           +-01.5-[06]--
           +-01.6-[07]----00.0  Red Hat, Inc. Virtio file system [1af4:105a]
           +-01.7-[08]--
           +-02.0-[09]--
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller [8086:2918]
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922]
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller [8086:2930]

Author

srd424 commented Aug 8, 2024

Can you show a full incus config show --expanded ostree3?

Oops, sorry, missed this. I assume you meant for the VM, not the (failed) virtiofs disk device:

architecture: x86_64
config:
  limits.cpu: "4"
  limits.memory: 8GiB
  raw.qemu.conf: |
    [machine]
    kernel = /var/lib/incus/kernels/vchost-vmlinuz
    initrd = /var/lib/incus/kernels/vchost-initrd.img
    append = "root=LABEL=root rootflags=subvol=/rootA console=ttyS0 SYSTEMD_FSTAB=/config/fstab systemd.log_level=info systemd.hostname=vchost1 ip=192.168.160.161::192.168.128.1:255.255.128.0:vchost1:lan0:off debug=y"
  volatile.cloud-init.instance-id: 95ccb25d-8eaf-495e-b32f-80a28f534d06
  volatile.eth0.host_name: tapc34f7557
  volatile.eth0.hwaddr: 00:16:3e:f4:99:73
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: b9d00062-f6cb-4779-b7c5-1da216171475
  volatile.uuid.generation: b9d00062-f6cb-4779-b7c5-1da216171475
  volatile.vsock_id: "2799179731"
devices:
  cachepermpool:
    source: /dev/inthdd/cachepermpool
    type: disk
  cpool:
    source: /dev/ssd/vchost1-cpool
    type: disk
  eth0:
    nictype: bridged
    parent: balbr0
    type: nic
  home-data-net:
    source: /dev/inthdd/home-data-net
    type: disk
  iso:
    source: /dev/inthdd/iso
    type: disk
  nixstore:
    source: /dev/inthdd/nixstore-img
    type: disk
  ostree:
    readonly: "true"
    source: /dev/ssd/ostree
    type: disk
  pip-cache:
    source: /dev/inthdd/pip-cache
    type: disk
  root:
    path: /
    pool: vchost
    type: disk
  rootimg:
    readonly: "true"
    source: /dev/inthdd/vchost-rootfs
    type: disk
  sd-dropbox:
    source: /dev/ssd/sd-dropbox
    type: disk
  sharedconf:
    path: /vol/sharedconf
    source: /vol/clusterconf/shared
    type: disk
  user-cache:
    source: /dev/inthdd/user-cache
    type: disk
  vcconfig:
    path: /media/root-ro/config
    source: /vol/clusterconf
    type: disk
  xu-build:
    source: /dev/inthdd/xu-build
    type: disk
  xu-home:
    source: /dev/inthdd/xu-home
    type: disk
  xu-home-ssd:
    source: /dev/inthdd/xu-home-ssd
    type: disk
  xu-spool:
    source: /dev/inthdd/xu-spool
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

Member

stgraber commented Aug 8, 2024

I can't easily upgrade to 6.0.1 - I have some VMs running I really don't want to stop. I will look at spinning up another incus install somewhere though.

We don't restart workloads on upgrade; only the control plane (API) goes down during the upgrade.

Member

stgraber commented Aug 8, 2024

Anyway, it's likely caused by the high-ish number of devices.
We have a reserve of, I believe, 8 hotplug slots, but it's supposed to be a sliding thing, basically allowing you to hotplug 8 more devices than whatever you started the VM with. The above suggests that this logic may not be behaving as intended and you're running out of slots somehow.

stgraber added the Bug (Confirmed to be a bug) label and removed the Incomplete (Waiting on more information from reporter) label on Aug 8, 2024
stgraber changed the title from error on virtiofs hotplug: "Failed to add the virtiofs device: Bus 'qemu_pcie21' not found" to VM disk hotplug issue (running out of hotplug slots) on Aug 8, 2024
stgraber added this to the incus-6.5 milestone on Aug 8, 2024
Author

srd424 commented Aug 8, 2024

We don't restart workloads on upgrade; only the control plane (API) goes down during the upgrade.

Oh, worth knowing, thanks! I did wonder if that was the case but couldn't quickly turn up the right docs.

Anyway, it's likely caused by the high-ish number of devices. We have a reserve of, I believe, 8 hotplug slots, but it's supposed to be a sliding thing, basically allowing you to hotplug 8 more devices than whatever you started the VM with. The above suggests that this logic may not be behaving as intended and you're running out of slots somehow.

My brain isn't working brilliantly at the moment, but poking around in the code I did wonder if it should be trying to add devices to one of the existing buses rather than create a new bus? I also noticed the block devices don't seem to end up in qemu.conf - I assume they're now set up using the qemu monitor when the vm is started? I wondered if the virtiofs stuff could take the same approach - at the moment I think there are different code paths for hotplug vs pre-configured mounts? But I very much did get lost in the code ...

Author

srd424 commented Aug 8, 2024

BTW, can confirm this does happen on 6.0.1 too.

Member

stgraber commented Aug 9, 2024

My brain isn't working brilliantly at the moment, but poking around in the code I did wonder if it should be trying to add devices to one of the existing buses rather than create a new bus? I also noticed the block devices don't seem to end up in qemu.conf - I assume they're now set up using the qemu monitor when the vm is started? I wondered if the virtiofs stuff could take the same approach - at the moment I think there are different code paths for hotplug vs pre-configured mounts? But I very much did get lost in the code ...

I'll need to look at the logic again, but I thought we made all the disks be hotplug as we want the ability to add/remove them.

The way things are supposed to work is that at startup we allocate PCIe root addresses for all the stuff that we're going to hotplug through QMP. Then we allocate an additional 8 PCIe root addresses to allow for things to be added later on.

We can't alter the PCIe root once the VM is running, so given that limited hotplug/hot-remove works, the core of the logic seems fine. I suspect we just have an issue where we're somehow not properly pre-allocating some stuff, basically making your boot-time disks already use the "spare" slots, at which point you'd have run out of slots and get the error.

So basically a few different things:

  • Need to recheck the boot time logic to make sure we allocate and use the correct addresses
  • Need to confirm that we're left with our 8 spare slots
  • Need to make the error less awful when we run out of hotplug slots so the user can understand what's going on and what to do (basically hot-remove something or reboot the VM)
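
To make the intended scheme described above concrete, here's a minimal sketch of the accounting; the numbers and names are assumptions for illustration, not the actual Incus implementation.

package main

import "fmt"

const spareSlots = 8 // the reserve of hotplug slots mentioned above

func main() {
	bootPCIePorts := 10 // hypothetical: root ports consumed by boot-time devices

	total := bootPCIePorts + spareSlots

	// Hotplug is only supposed to hand out ports from the spare range...
	for i := bootPCIePorts; i < total; i++ {
		fmt.Printf("spare hotplug port: qemu_pcie%d\n", i)
	}

	// ...so if boot-time devices are miscounted and end up counted against
	// that range, a later hotplug asks for a port index >= total and QEMU
	// reports "Bus 'qemu_pcieN' not found".
}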

stgraber self-assigned this on Aug 29, 2024
Member

stgraber commented Sep 4, 2024

Looking into this one now

Member

stgraber commented Sep 5, 2024

Starting with an empty VM and attempting to add 10 disks which require a PCIe address (using io.bus=nvme to force that), I'm getting:

stgraber@castiana:~$ for i in $(seq -w 10); do incus config device add v1 disk$i disk source=/tmp/test.img io.bus=nvme; done
Device disk01 added to v1
Device disk02 added to v1
Device disk03 added to v1
Error: Failed to start device "disk04": Failed to call monitor hook for block device: Failed adding block device for disk device "disk04": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk05": Failed to call monitor hook for block device: Failed adding block device for disk device "disk05": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk06": Failed to call monitor hook for block device: Failed adding block device for disk device "disk06": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk07": Failed to call monitor hook for block device: Failed adding block device for disk device "disk07": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk08": Failed to call monitor hook for block device: Failed adding block device for disk device "disk08": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk09": Failed to call monitor hook for block device: Failed adding block device for disk device "disk09": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk10": Failed to call monitor hook for block device: Failed adding block device for disk device "disk10": Failed adding device: Bus 'qemu_pcie9' not found
stgraber@castiana:~$ incus restart v1
stgraber@castiana:~$ for i in $(seq -w 10); do incus config device add v1 disk$i disk source=/tmp/test.img io.bus=nvme; done
Error: The device already exists
Error: The device already exists
Error: The device already exists
Device disk04 added to v1
Device disk05 added to v1
Device disk06 added to v1
Error: Failed to start device "disk07": Failed to call monitor hook for block device: Failed adding block device for disk device "disk07": Failed adding device: Bus 'qemu_pcie12' not found
Error: Failed to start device "disk08": Failed to call monitor hook for block device: Failed adding block device for disk device "disk08": Failed adding device: Bus 'qemu_pcie12' not found
Error: Failed to start device "disk09": Failed to call monitor hook for block device: Failed adding block device for disk device "disk09": Failed adding device: Bus 'qemu_pcie12' not found
Error: Failed to start device "disk10": Failed to call monitor hook for block device: Failed adding block device for disk device "disk10": Failed adding device: Bus 'qemu_pcie12' not found
stgraber@castiana:~$ incus restart v1
stgraber@castiana:~$ for i in $(seq -w 10); do incus config device add v1 disk$i disk source=/tmp/test.img io.bus=nvme; done
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Device disk07 added to v1
Device disk08 added to v1
Device disk09 added to v1
Error: Failed to start device "disk10": Failed to call monitor hook for block device: Failed adding block device for disk device "disk10": Failed adding device: Bus 'qemu_pcie15' not found
stgraber@castiana:~$ incus restart v1
stgraber@castiana:~$ for i in $(seq -w 10); do incus config device add v1 disk$i disk source=/tmp/test.img io.bus=nvme; done
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Device disk10 added to v1
stgraber@castiana:~$ 

So we can see that we can add at most 3 additional devices before running out of slots.

I'll have to tweak things a bit because the number is supposed to be 4, not 3, and we definitely want a much nicer error when hitting the limit of remaining hotplug slots.

Member

stgraber commented Sep 5, 2024

Doing a quick test here after doubling the number of hotplug slots to 8, we can see that there's something off in the logic, as the first hotplug slot isn't used:

root@v1:~# lspci -tnnnvvv
-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller [8086:29c0]
           +-01.0-[01]--+-00.0  Red Hat, Inc. Virtio 1.0 memory balloon [1af4:1045]
           |            +-00.1  Red Hat, Inc. Virtio 1.0 RNG [1af4:1044]
           |            +-00.2  Red Hat, Inc. Virtio 1.0 input [1af4:1052]
           |            +-00.3  Red Hat, Inc. Virtio 1.0 input [1af4:1052]
           |            +-00.4  Red Hat, Inc. Virtio 1.0 socket [1af4:1053]
           |            +-00.5  Red Hat, Inc. Virtio 1.0 console [1af4:1043]
           |            \-00.6  Red Hat, Inc. QEMU XHCI Host Controller [1b36:000d]
           +-01.1-[02]----00.0  Red Hat, Inc. Virtio 1.0 SCSI [1af4:1048]
           +-01.2-[03]--+-00.0  Red Hat, Inc. Virtio 1.0 filesystem [1af4:1049]
           |            \-00.1  Red Hat, Inc. Virtio 1.0 filesystem [1af4:1049]
           +-01.3-[04]----00.0  Red Hat, Inc. Virtio 1.0 GPU [1af4:1050]
           +-01.4-[05]----00.0  Red Hat, Inc. Virtio 1.0 network device [1af4:1041]
           +-01.5-[06]--
           +-01.6-[07]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-01.7-[08]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-02.0-[09]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-02.1-[0a]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-02.2-[0b]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-02.3-[0c]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-02.4-[0d]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller [8086:2918]
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922]
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller [8086:2930]
root@v1:~# 

Member

stgraber commented Sep 5, 2024

Okay, so that's where we get to what you were pointing out earlier: the logic is a bit limited in that it basically assumes that every device in the devices list uses a PCIe slot.

It doesn't consider the fact that devices that were present at boot time don't count towards the hotplug quota, nor that a number of devices simply don't need a PCIe address at all.

The cleanest option would be to fetch a list of addresses from QEMU directly; I'm going to look at what query-pci may be able to get us in that regard.
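
For reference, here's a rough sketch of what pulling that information straight from QEMU over QMP could look like. This is an assumption-laden illustration: the monitor socket path is hypothetical, and Incus talks QMP through its own internal client rather than a raw socket like this.

package main

import (
	"bufio"
	"fmt"
	"net"
)

func main() {
	// Hypothetical path to the VM's QMP monitor socket.
	conn, err := net.Dial("unix", "/run/incus/v1/qemu.monitor")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	r := bufio.NewReader(conn)

	// QMP sends a greeting banner first; read and discard it.
	if _, err := r.ReadString('\n'); err != nil {
		panic(err)
	}

	// Negotiate capabilities, then ask for the PCI topology. The query-pci
	// reply lists every bus and the devices behind it, which is enough to
	// work out which root ports are actually free for hotplug.
	for _, cmd := range []string{
		`{"execute": "qmp_capabilities"}`,
		`{"execute": "query-pci"}`,
	} {
		if _, err := fmt.Fprintln(conn, cmd); err != nil {
			panic(err)
		}

		resp, err := r.ReadString('\n')
		if err != nil {
			panic(err)
		}

		fmt.Println(resp)
	}
}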

Member

stgraber commented Sep 5, 2024

Interesting, we actually do have a mapping for query-pci already; we're just not using it anywhere.

Member

stgraber commented Sep 5, 2024

Got a reliable way to handle things which is also much simpler than the current logic. Win-win.

stgraber added a commit to stgraber/incus that referenced this issue Sep 5, 2024
Use QMP PCI slot information rather than guessing at usable PCIe slots.

Closes lxc#1086

Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
hallyn closed this as completed in e9c361c on Sep 5, 2024
stgraber added a commit that referenced this issue Sep 10, 2024
Use QMP PCI slot information rather than guessing at usable PCIe slots.

Closes #1086

Signed-off-by: Stéphane Graber <stgraber@stgraber.org>