
POD fails to attach correct sriov device on ungraceful node reboot #107928

Open
rthakur-est opened this issue Feb 3, 2022 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@rthakur-est
Contributor

rthakur-est commented Feb 3, 2022

What happened?

A pod with an SR-IOV NIC device attached fails to attach the correct SR-IOV device when the node is hard-rebooted after additional volumes have been attached to it. The node is a VM in an OpenStack cloud-provider environment, and the PCI address of the SR-IOV VF changes on a hard node reboot when additional volumes are attached to the VM.

Moreover, the same scenario works with a graceful node reboot.

This is seen in the logs:
Warning FailedCreatePodSandBox 91s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a74bd117e5aba36e9edfed421360e78cc68799886d1dfd32f0888567bd611774": [ejiazeh-pcg/eric-pc-up-data-plane-5b6b49bd86-6457v:eric-pc-up-data-plane-net0]: error adding container to network "eric-pc-up-data-plane-net0": error with host device: lstat /sys/bus/pci/devices/0000:00:15.0: no such file or directory
Normal AddedInterface 79s multus Add eth0 [192.168.242.212/32] from k8s-pod-network
Normal AddedInterface 64s multus Add eth0 [192.168.242.208/32] from k8s-pod-network
Warning FailedCreatePodSandBox 63s (x2 over 78s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b5e0dd33781642c85e343b14f9209a91980ebd99ed77baa595dffaf9c60ef62b": [ejiazeh-pcg/eric-pc-up-data-plane-5b6b49bd86-6457v:eric-pc-up-data-plane-net0]: error adding container to network "eric-pc-up-data-plane-net0": error with host device: lstat /sys/bus/pci/devices/0000:00:15.0: no such file or directory
Normal SandboxChanged 50s (x14 over 3m41s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal AddedInterface 49s multus Add eth0 [192.168.242.196/32] from k8s-pod-network
Normal AddedInterface 34s multus Add eth0 [192.168.242.217/32] from k8s-pod-network
Normal AddedInterface 21s multus Add eth0 [192.168.242.230/32] from k8s-pod-network
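
The error above is the CNI host-device plugin failing an lstat on /sys/bus/pci/devices/0000:00:15.0, i.e. the PCI address kubelet handed to the pod no longer points at an SR-IOV VF after the reboot. As a minimal sketch (assuming shell access on the node and standard lspci/sysfs tooling; the address is taken from the event above), the device can be checked like this:

```bash
# Address taken from the FailedCreatePodSandBox event above.
ADDR=0000:00:15.0

# Does the sysfs entry exist at all? (This is the lstat the host-device plugin performs.)
ls -ld /sys/bus/pci/devices/$ADDR || echo "no device at $ADDR"

# If it exists, what is it now, and which driver claims it?
lspci -s "$ADDR"                                           # human-readable description
readlink /sys/bus/pci/devices/$ADDR/driver 2>/dev/null     # e.g. mlx5_core (VF) vs virtio-pci (block device)
```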

PCI addresses on node before reboot:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:09.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0a.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0b.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0c.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0d.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:11.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:12.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:13.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:14.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:15.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:16.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:17.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon

PCI addresses on node after reboot:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:09.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0a.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0b.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0c.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:0d.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:0e.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:0f.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:10.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:11.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:12.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
00:13.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:14.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:15.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:16.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:17.0 SCSI storage controller: Red Hat, Inc. Virtio block device
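
As a convenience, the two captures (attached below as pci-before-reboot.txt and pci-after-reboot.txt) can be diffed to show the VFs and Virtio block devices swapping slots; this is just a sketch over the attached files:

```bash
# Compare Ethernet (VF) and SCSI (Virtio block) entries before and after the reboot.
diff <(grep -E 'Ethernet|SCSI' pci-before-reboot.txt) \
     <(grep -E 'Ethernet|SCSI' pci-after-reboot.txt)
```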

What did you expect to happen?

The pod should attach the correct SR-IOV devices after an ungraceful node reboot. Currently, kubelet assumes that the PCI addresses of the devices won't change.

How can we reproduce it (as minimally and precisely as possible)?

1. Create an OpenStack VM with volumes and SR-IOV VFs attached to it.
2. Create a pod with an SR-IOV device attached.
3. Attach additional volumes to the VM and hard-reboot the node.
4. The pod comes up with the same PCI address as before, even though the PCI address of the device has changed (see the checkpoint-inspection sketch below).
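
To confirm that kubelet is re-using the pre-reboot device IDs, the device manager checkpoint can be inspected on the node. A hedged sketch, assuming the default kubelet root directory (the checkpoint moves if kubelet runs with a non-default --root-dir):

```bash
# Kubelet records device-plugin allocations per pod in this checkpoint file.
CHECKPOINT=/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint

# Pretty-print it and look at the recorded resource names and device IDs;
# the DeviceIDs are the PCI addresses kubelet keeps handing back to the pod
# after an ungraceful reboot, even if the hardware has moved.
python3 -m json.tool "$CHECKPOINT" | grep -E -A3 'ResourceName|DeviceIDs'
```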

Logs

container-inspect-output.txt
failing-pod-describe.txt
kubelet-logs.txt
pci-after-reboot.txt
pci-before-reboot.txt
pod manifest.yml.txt

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"49499222b0eb0349359881bea01d8d5bd78bf444", GitTreeState:"clean", BuildDate:"2021-12-14T12:50:25Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"49499222b0eb0349359881bea01d8d5bd78bf444", GitTreeState:"clean", BuildDate:"2021-12-14T12:41:40Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

openstack

OS version

# On Linux:
$ cat /etc/os-release
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"
$ uname -a
# paste output here





### Install tools



### Container runtime (CRI) and version (if applicable)
containerd




### Related plugins (CNI, CSI, ...) and versions (if applicable)

@rthakur-est rthakur-est added the kind/bug Categorizes issue or PR as related to a bug. label Feb 3, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 3, 2022
@rthakur-est
Contributor Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 3, 2022
@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Feb 3, 2022
@ehashman
Member

ehashman commented Feb 9, 2022

This bug doesn't seem to have enough details to reproduce or investigate further. You must include the container runtime, full kubelet logs, and any relevant Kubernetes manifests, along with clear steps to reproduce, so we can help with your issue.

Once more details are provided, the bug will be accepted.

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Feb 9, 2022
@ehashman ehashman moved this from Triage to Needs Information in SIG Node Bugs Feb 9, 2022
@rthakur-est rthakur-est changed the title POD with sriov device failed on ungraceful node reboot POD fails to attach correct sriov device on ungraceful node reboot Feb 16, 2022
@rthakur-est
Contributor Author

I have attached the logs for this issue.
Before the node reboot, this was the device address attached to the pod:
00:07.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
After the node reboot, that address is assigned to a block device:
00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device

But kubelet continues to attach 0000:00:07.0 as the device after the reboot, even though the PCI assignment has changed.
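
A minimal sketch for cross-checking this from the node, assuming the sriov-network-device-plugin convention of exposing allocated addresses to the container as PCIDEVICE_* environment variables (container name taken from the events above; adjust for your deployment):

```bash
# Which PCI address was the container actually handed?
CID=$(crictl ps -a --name eric-pc-up-data-plane -q | head -n1)
crictl inspect "$CID" | grep -i 'PCIDEVICE'

# And what currently sits at that address on the node?
lspci -s 00:07.0
```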

@SergeyKanzhelev
Member

/remove-triage needs-information

@k8s-ci-robot k8s-ci-robot removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Feb 16, 2022
@SergeyKanzhelev
Member

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 16, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Needs Information to Triaged in SIG Node Bugs Feb 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2022
@rthakur-est
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 14, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 3, 2022
@swatisehgal
Contributor

/cc

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2024