POD fails to attach correct sriov device on ungraceful node reboot #107928
Comments
/sig node |
This bug doesn't seem to have enough details to reproduce or further investigate. You must include your container runtime, full kubelet logs, and any relevant Kubernetes manifests, along with clear steps to reproduce, so we can help with your issue. Once more details are provided, the bug will be accepted. /triage needs-information |
I have attached the logs for this issue. Kubelet continues to attach 0000:00:07.0 as the device after reboot even though the PCI assignment has changed. |
/remove-triage needs-information |
/triage accepted |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
/cc |
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/ /remove-triage accepted |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
What happened?
A pod with an SR-IOV NIC device attached fails to attach the correct SR-IOV device when the node is hard rebooted after additional volumes have been attached to it. The node is a VM in an OpenStack cloud provider environment, and the PCI address of the SR-IOV VF changes on a hard reboot when additional volumes have been attached to the VM.
The same scenario works with a graceful node reboot.
This is seen in the logs:
Warning FailedCreatePodSandBox 91s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a74bd117e5aba36e9edfed421360e78cc68799886d1dfd32f0888567bd611774": [ejiazeh-pcg/eric-pc-up-data-plane-5b6b49bd86-6457v:eric-pc-up-data-plane-net0]: error adding container to network "eric-pc-up-data-plane-net0": error with host device: lstat /sys/bus/pci/devices/0000:00:15.0: no such file or directory
Normal AddedInterface 79s multus Add eth0 [192.168.242.212/32] from k8s-pod-network
Normal AddedInterface 64s multus Add eth0 [192.168.242.208/32] from k8s-pod-network
Warning FailedCreatePodSandBox 63s (x2 over 78s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b5e0dd33781642c85e343b14f9209a91980ebd99ed77baa595dffaf9c60ef62b": [ejiazeh-pcg/eric-pc-up-data-plane-5b6b49bd86-6457v:eric-pc-up-data-plane-net0]: error adding container to network "eric-pc-up-data-plane-net0": error with host device: lstat /sys/bus/pci/devices/0000:00:15.0: no such file or directory
Normal SandboxChanged 50s (x14 over 3m41s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal AddedInterface 49s multus Add eth0 [192.168.242.196/32] from k8s-pod-network
Normal AddedInterface 34s multus Add eth0 [192.168.242.217/32] from k8s-pod-network
Normal AddedInterface 21s multus Add eth0 [192.168.242.230/32] from k8s-pod-network
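The PCI address that the network setup failed to find can be pulled out of these events mechanically. The following is a minimal Python sketch, assuming the message format shown in the events above; `missing_pci_addresses` is a hypothetical helper name for illustration and is not part of kubelet or multus.

```python
import re

# Matches the sysfs path in the "lstat ... no such file or directory" error.
# Address layout assumed: domain:bus:device.function, e.g. 0000:00:15.0
_MISSING_DEV = re.compile(
    r"lstat /sys/bus/pci/devices/"
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
    r": no such file or directory"
)


def missing_pci_addresses(event_text):
    """Return the sorted, de-duplicated PCI addresses reported as missing."""
    return sorted(set(_MISSING_DEV.findall(event_text)))
```

Feeding the output of `kubectl describe pod` for the failing pod into this function would list each address only once, even when the same failure repeats across sandbox retries.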
PCI addresses on node before reboot:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:09.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0a.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0b.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0c.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0d.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:11.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:12.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:13.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:14.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:15.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:16.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:17.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
PCI addresses on node after reboot:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:09.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0a.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0b.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0c.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:0d.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:0e.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:0f.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:10.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:11.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:12.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
00:13.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:14.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:15.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:16.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:17.0 SCSI storage controller: Red Hat, Inc. Virtio block device
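To make the shift between the two listings easier to see, the before and after outputs can be compared slot by slot. This is a small Python sketch, assuming both inputs are `lspci`-style text with one `address description` pair per line as shown above; the function names are hypothetical.

```python
def parse_lspci(text):
    """Map each PCI address to its description from lspci-style lines."""
    devices = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The address is everything before the first space.
        address, _, description = line.partition(" ")
        devices[address] = description.strip()
    return devices


def changed_addresses(before, after):
    """Return {address: (old_description, new_description)} for every slot
    whose occupant differs between the two listings. A slot that exists in
    only one listing is reported with None on the other side."""
    old, new = parse_lspci(before), parse_lspci(after)
    return {
        addr: (old.get(addr), new.get(addr))
        for addr in sorted(set(old) | set(new))
        if old.get(addr) != new.get(addr)
    }
```

Applied to the two listings above, this reports, among other slots, that 00:0c.0 and 00:0d.0 change from Virtio block devices to Mellanox VFs, and that 00:13.0 through 00:16.0 change from Mellanox VFs to Virtio block devices.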
What did you expect to happen?
The pod should attach the correct SR-IOV devices after an ungraceful node reboot. Currently, kubelet assumes that the PCI addresses of the devices won't change.
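One way to picture the missing check is to verify, before reusing a recorded PCI address, that the slot still holds the kind of device that was allocated. The Python sketch below is an illustration only and does not describe how kubelet or any device plugin works. It assumes the standard sysfs `vendor` and `class` attribute files, the Mellanox PCI vendor ID `0x15b3`, and the Ethernet controller class prefix `0x0200`; `is_expected_vf` and the `sysfs_root` parameter are hypothetical names.

```python
from pathlib import Path

MELLANOX_VENDOR_ID = "0x15b3"      # PCI vendor ID for Mellanox Technologies
ETHERNET_CLASS_PREFIX = "0x0200"   # PCI class code prefix for Ethernet controllers


def is_expected_vf(address, sysfs_root="/sys/bus/pci/devices"):
    """Return True if the PCI slot at `address` currently holds a Mellanox
    Ethernet device, and False if the slot is missing or holds something else."""
    dev = Path(sysfs_root) / address
    try:
        vendor = (dev / "vendor").read_text().strip().lower()
        pci_class = (dev / "class").read_text().strip().lower()
    except OSError:
        # Slot does not exist (or is unreadable): the recorded address is stale.
        return False
    return vendor == MELLANOX_VENDOR_ID and pci_class.startswith(ETHERNET_CLASS_PREFIX)
```

A check of this kind only detects that a recorded address no longer points at a VF. Mapping the allocation to the VF's new address would need a stable identifier that survives the reboot, which this sketch does not attempt.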
How can we reproduce it (as minimally and precisely as possible)?
Create an OpenStack VM with volumes and SR-IOV VFs attached to it.
Create a pod with an SR-IOV device attached.
Attach additional volumes to the VM and hard reboot the node.
The pod comes up with the same PCI address as before, but the PCI address of the device has changed.
Logs
container-inspect-output.txt
failing-pod-describe.txt
kubelet-logs.txt
pci-after-reboot.txt
pci-before-reboot.txt
pod manifest.yml.txt
Kubernetes version
Cloud provider
openstack
OS version