
[BUG] ENGINE v2 : disk /dev/xxxx is already used by AIO bdev disk-x #8129

Closed

arsiesys opened this issue Mar 7, 2024 · 4 comments

Labels
area/v2-data-engine v2 data engine (SPDK) backport/1.6.1 kind/bug priority/0 Must be fixed in this release (managed by PO) require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage

arsiesys commented Mar 7, 2024

Describe the bug

When two identical (same model and version) blank disks are added to the node spec, only one of the two can be provisioned:

spec:
  allowScheduling: true
  disks:
    disk-1:
      allowScheduling: true
      diskType: block
      evictionRequested: false
      path: /dev/nvme2n1
      storageReserved: 0
      tags: []
    disk-2:
      allowScheduling: true
      diskType: block
      evictionRequested: false
      path: /dev/nvme3n1
      storageReserved: 0
      tags: []

The instance-manager logs:

[2024-03-07 17:59:42.459699] bdev_rpc.c: 874:rpc_bdev_get_bdevs: *ERROR*: bdev 'disk-2' does not exist
[longhorn-instance-manager] time="2024-03-07T17:59:42Z" level=error msg="Failed to get disk info" func=spdk.svcDiskGet.func1 file="disk.go:120" diskName=disk-2 error="rpc error: code = NotFound desc = cannot find AIO bdev with name disk-2"
[longhorn-instance-manager] time="2024-03-07T17:59:42Z" level=info msg="Disk Server: Creating disk" func="disk.(*Server).DiskCreate" file="disk.go:120" blockSize=4096 diskName=disk-2 diskPath=/dev/nvme3n1 diskType=block
[longhorn-instance-manager] time="2024-03-07T17:59:42Z" level=info msg="Creating disk" func=spdk.svcDiskCreate file="disk.go:40" blockSize=4096 diskName=disk-2 diskPath=/dev/nvme3n1 diskUUID=
[longhorn-instance-manager] time="2024-03-07T17:59:42Z" level=error msg="Failed to create disk" func=spdk.svcDiskCreate.func1 file="disk.go:43" blockSize=4096 diskName=disk-2 diskPath=/dev/nvme3n1 diskUUID= error="rpc error: code = InvalidArgument desc = failed to validate disk create request: disk /dev/nvme3n1 is already used by AIO bdev disk-1"

It looks like there is some confusion: both disks are treated as the same device, so the second one is not provisioned. However, they have distinct major:minor numbers:

nvme list | grep Micron_7450_MTFDKCC3T2TFS
/dev/nvme2n1          2248422D4868         Micron_7450_MTFDKCC3T2TFS                1          12.50  GB /   3.20  TB      4 KiB +  0 B   E2MU200 
/dev/nvme3n1          2248422D49A4         Micron_7450_MTFDKCC3T2TFS                1           0.00   B /   3.20  TB      4 KiB +  0 B   E2MU200 

lsblk -l -n /dev/nvme2n1 -o NAME,MAJ:MIN
nvme2n1 259:7  
lsblk -l -n /dev/nvme3n1 -o NAME,MAJ:MIN
nvme3n1 259:8  
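The MAJ:MIN pairs above (259:7 and 259:8) are distinct, so any disk ID derived from them should be unique per device. A minimal Go sketch of extracting those pairs from the lsblk output (the `devNumbers` type and `parseMajMin` function are illustrative, not Longhorn's actual code):

```go
package main

import "fmt"

// devNumbers holds the device major/minor pair that
// `lsblk -o NAME,MAJ:MIN` prints as e.g. "259:7".
type devNumbers struct {
	Major, Minor int
}

// parseMajMin splits a "major:minor" field into its two integers,
// using the same "%d:%d" verb style seen later in this thread.
func parseMajMin(field string) (devNumbers, error) {
	var d devNumbers
	if _, err := fmt.Sscanf(field, "%d:%d", &d.Major, &d.Minor); err != nil {
		return devNumbers{}, err
	}
	return d, nil
}

func main() {
	for _, f := range []string{"259:7", "259:8"} {
		d, err := parseMajMin(f)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%d-%d\n", d.Major, d.Minor) // prints 259-7, then 259-8
	}
}
```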

To Reproduce

  • Install Longhorn
  • Set up one node with two NVMe disks to be used as block disks by Longhorn

Expected behavior

Both disks can be provisioned on the Longhorn node.

Support bundle for troubleshooting

Environment

  • Longhorn version: v1.6.0
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Kubeadm / 1.28
    • Number of control plane nodes in the cluster: 2
    • Number of worker nodes in the cluster: 2
  • Node config
    • OS type and version: Ubuntu 22.04.4
    • Kernel version: 6.5.0-21-generic
    • CPU per node: 64
    • Memory per node: 128
    • Disk type (e.g. SSD/NVMe/HDD): Nvme
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 0

Additional context

This is a fresh experimental setup of longhorn, never used it before.

@arsiesys arsiesys added kind/bug require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels Mar 7, 2024
@arsiesys arsiesys changed the title [BUG] ENGINE v2 : disk /dev/xxxx is already used by AIO bdev disk-x [BUG][1.6.0] ENGINE v2 : disk /dev/xxxx is already used by AIO bdev disk-x Mar 7, 2024

arsiesys commented Mar 7, 2024

I suspect something related to this:
https://github.com/longhorn/longhorn-spdk-engine/blob/a6cee5b6febac0a58b5ba2f4b9b5a4d06d9343b2/pkg/spdk/disk.go#L172

return fmt.Sprintf("%d-%d", dev.Export.Major, dev.Export.Minor), nil

Meanwhile, the SPDK helper fills dev.Nvme instead of dev.Export:
https://github.com/longhorn/go-spdk-helper/blob/ab0344a0c192095de0130f6c5121105be1ab6b33/pkg/util/device.go#L124

if _, err := fmt.Sscanf(f[1], "%d:%d", &dev.Nvme.Major, &dev.Nvme.Minor); err != nil {

The final consumer of this ID:
https://github.com/longhorn/longhorn-spdk-engine/blob/a6cee5b6febac0a58b5ba2f4b9b5a4d06d9343b2/pkg/spdk/disk.go#L222

		if id == diskID {
			return fmt.Errorf("disk %v is already used by AIO bdev %v", diskPath, bdev.Name)
		}

Maybe I missed something that ensures dev.Export contains the same values as dev.Nvme. If not, that may explain the issue.
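If that reading is right, dev.Export is never populated, so every device's ID collapses to the zero value. A minimal Go sketch of the suspected mechanism (the types here are illustrative stand-ins for the Longhorn structs referenced above, not the real ones):

```go
package main

import "fmt"

// BlockDevice mimics the shape implied by the snippets above: the
// helper writes the lsblk MAJ:MIN pair into Nvme, while the disk ID
// is built from Export, which is never assigned.
type BlockDevice struct {
	Nvme   struct{ Major, Minor int }
	Export struct{ Major, Minor int }
}

// diskIDFromExport mirrors the ID construction in disk.go:
// fmt.Sprintf("%d-%d", dev.Export.Major, dev.Export.Minor).
func diskIDFromExport(dev BlockDevice) string {
	return fmt.Sprintf("%d-%d", dev.Export.Major, dev.Export.Minor)
}

func main() {
	var nvme2, nvme3 BlockDevice
	// The helper populates Nvme (as in device.go), leaving Export at
	// its zero value for every device.
	fmt.Sscanf("259:7", "%d:%d", &nvme2.Nvme.Major, &nvme2.Nvme.Minor)
	fmt.Sscanf("259:8", "%d:%d", &nvme3.Nvme.Major, &nvme3.Nvme.Minor)

	// Both IDs collapse to "0-0", so the duplicate check reports the
	// second disk as already used by the first AIO bdev.
	fmt.Println(diskIDFromExport(nvme2), diskIDFromExport(nvme3)) // prints: 0-0 0-0
}
```

With every disk mapping to the same ID "0-0", the `id == diskID` check in disk.go fires for the second disk regardless of which physical device it is, producing the "already used by AIO bdev" error above.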

@derekbit derekbit added the area/v2-data-engine v2 data engine (SPDK) label Mar 7, 2024

derekbit commented Mar 7, 2024


Thanks @arsiesys
Yes, I checked the code, and the bug is caused by exactly the mismatch you described. Thank you.


longhorn-io-github-bot commented Mar 8, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
  1. Enable v2-data-engine
  2. Create two block devices such as loop devices
  3. Add the two disks to block-type disks
  4. One of the two disks shows the error message ... is already used by AIO bdev ... in node.status.diskStatus

After the fix, the two disks can be added successfully.
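Step 2 above can be sketched as follows (file names and sizes are illustrative; `losetup` needs root, so it is skipped when not running as root, and the loop device paths it prints will vary):

```shell
# Create two sparse backing files to serve as block devices.
truncate -s 1G disk1.img disk2.img

# Attach them as loop devices (root only); each command prints the
# assigned device, e.g. /dev/loop5 and /dev/loop6, which can then be
# added as block-type disks in the Longhorn node spec.
if [ "$(id -u)" -eq 0 ]; then
  losetup -f --show disk1.img
  losetup -f --show disk2.img
fi
```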

  • Does the PR include the explanation for the fix or the feature?

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

longhorn/longhorn-spdk-engine#121

  • Which areas/issues this PR might have potential impacts on?
    Area: v2 volume
    Issues

@derekbit derekbit changed the title [BUG][1.6.0] ENGINE v2 : disk /dev/xxxx is already used by AIO bdev disk-x [BUG] ENGINE v2 : disk /dev/xxxx is already used by AIO bdev disk-x Mar 11, 2024
@chriscchien chriscchien self-assigned this Mar 12, 2024
@chriscchien
Contributor

Verified passed on longhorn master (longhorn-instance-manager dce3a1) with the test steps.

Two block devices can be added as block-type disks on the same Longhorn node without problems:

    disk-2:
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2024-03-12T02:28:38Z"
        message: Disk disk-2(/dev/loop5) on node ip-172-31-38-210 is ready
        reason: ""
        status: "True"
        type: Ready
      - lastProbeTime: ""
        lastTransitionTime: "2024-03-12T02:28:38Z"
        message: Disk disk-2(/dev/loop5) on node ip-172-31-38-210 is schedulable
        reason: ""
        status: "True"
        type: Schedulable
      diskType: block
      diskUUID: 37c78eb0-1a15-4fc2-9898-81a672e6039c
      filesystemType: ""
      scheduledReplica: {}
      storageAvailable: 10590617600
      storageMaximum: 10694426624
      storageScheduled: 0
    disk-3:
      conditions:
      - lastProbeTime: ""
        lastTransitionTime: "2024-03-12T02:29:08Z"
        message: Disk disk-3(/dev/loop6) on node ip-172-31-38-210 is ready
        reason: ""
        status: "True"
        type: Ready
      - lastProbeTime: ""
        lastTransitionTime: "2024-03-12T02:29:08Z"
        message: Disk disk-3(/dev/loop6) on node ip-172-31-38-210 is schedulable
        reason: ""
        status: "True"
        type: Schedulable
      diskType: block
      diskUUID: 3924c40a-4377-49c1-9730-571c1f35809a
      filesystemType: ""
      scheduledReplica: {}
      storageAvailable: 10590617600
      storageMaximum: 10694426624
      storageScheduled: 0

Projects
Status: Resolved/Scheduled