
POD fails to attach correct sriov device on ungraceful node reboot #107928

Open
rthakur-est opened this issue Feb 3, 2022 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@rthakur-est
Contributor

rthakur-est commented Feb 3, 2022

What happened?

A pod with an SR-IOV NIC device attached fails to attach the correct SR-IOV device when the node is hard-rebooted after additional volumes have been attached to it. The node is a VM in an OpenStack cloud-provider environment, and the PCI address of the SR-IOV VF changes on a hard node reboot when additional volumes are attached to the VM.

Moreover, the same scenario works with a graceful node reboot.

This is seen in the logs:
Warning FailedCreatePodSandBox 91s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a74bd117e5aba36e9edfed421360e78cc68799886d1dfd32f0888567bd611774": [ejiazeh-pcg/eric-pc-up-data-plane-5b6b49bd86-6457v:eric-pc-up-data-plane-net0]: error adding container to network "eric-pc-up-data-plane-net0": error with host device: lstat /sys/bus/pci/devices/0000:00:15.0: no such file or directory
Normal AddedInterface 79s multus Add eth0 [192.168.242.212/32] from k8s-pod-network
Normal AddedInterface 64s multus Add eth0 [192.168.242.208/32] from k8s-pod-network
Warning FailedCreatePodSandBox 63s (x2 over 78s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b5e0dd33781642c85e343b14f9209a91980ebd99ed77baa595dffaf9c60ef62b": [ejiazeh-pcg/eric-pc-up-data-plane-5b6b49bd86-6457v:eric-pc-up-data-plane-net0]: error adding container to network "eric-pc-up-data-plane-net0": error with host device: lstat /sys/bus/pci/devices/0000:00:15.0: no such file or directory
Normal SandboxChanged 50s (x14 over 3m41s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal AddedInterface 49s multus Add eth0 [192.168.242.196/32] from k8s-pod-network
Normal AddedInterface 34s multus Add eth0 [192.168.242.217/32] from k8s-pod-network
Normal AddedInterface 21s multus Add eth0 [192.168.242.230/32] from k8s-pod-network
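
The error above is the CNI host-device plugin failing an lstat on /sys/bus/pci/devices/0000:00:15.0, i.e. the PCI address kubelet handed to the pod no longer points at an SR-IOV VF after the reboot. As a minimal sketch (assuming shell access on the node and standard lspci/sysfs tooling; the address is taken from the event above), the device can be checked like this:

```bash
# Address taken from the FailedCreatePodSandBox event above.
ADDR=0000:00:15.0

# Does the sysfs entry exist at all? (This is the lstat the host-device plugin performs.)
ls -ld /sys/bus/pci/devices/$ADDR || echo "no device at $ADDR"

# If it exists, what is it now, and which driver claims it?
lspci -s "$ADDR"                                           # human-readable description
readlink /sys/bus/pci/devices/$ADDR/driver 2>/dev/null     # e.g. mlx5_core (VF) vs virtio-pci (block device)
```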

PCI addresses on node before reboot:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:09.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0a.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0b.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0c.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0d.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:11.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:12.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:13.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:14.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:15.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:16.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:17.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon

PCI addresses on node after reboot:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:09.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0a.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0b.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:0c.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:0d.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:0e.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:0f.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:10.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:11.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
00:12.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
00:13.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:14.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:15.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:16.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:17.0 SCSI storage controller: Red Hat, Inc. Virtio block device
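
As a convenience, the two captures (attached below as pci-before-reboot.txt and pci-after-reboot.txt) can be diffed to show the VFs and Virtio block devices swapping slots; this is just a sketch over the attached files:

```bash
# Compare Ethernet (VF) and SCSI (Virtio block) entries before and after the reboot.
diff <(grep -E 'Ethernet|SCSI' pci-before-reboot.txt) \
     <(grep -E 'Ethernet|SCSI' pci-after-reboot.txt)
```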

What did you expect to happen?

The pod should attach the correct SR-IOV devices after an ungraceful node reboot. Currently, kubelet assumes that the PCI addresses of the devices won't change.

How can we reproduce it (as minimally and precisely as possible)?

1. Create an OpenStack VM with volumes and SR-IOV VFs attached to it.
2. Create a pod with an SR-IOV device attached.
3. Attach additional volumes to the VM and hard-reboot the node.
4. The pod comes up with the same PCI address as before, even though the PCI address of the device has changed (see the checkpoint-inspection sketch below).
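
To confirm that kubelet is re-using the pre-reboot device IDs, the device manager checkpoint can be inspected on the node. A hedged sketch, assuming the default kubelet root directory (the checkpoint moves if kubelet runs with a non-default --root-dir):

```bash
# Kubelet records device-plugin allocations per pod in this checkpoint file.
CHECKPOINT=/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint

# Pretty-print it and look at the recorded resource names and device IDs;
# the DeviceIDs are the PCI addresses kubelet keeps handing back to the pod
# after an ungraceful reboot, even if the hardware has moved.
python3 -m json.tool "$CHECKPOINT" | grep -E -A3 'ResourceName|DeviceIDs'
```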

Logs

container-inspect-output.txt
failing-pod-describe.txt
kubelet-logs.txt
pci-after-reboot.txt
pci-before-reboot.txt
pod manifest.yml.txt

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"49499222b0eb0349359881bea01d8d5bd78bf444", GitTreeState:"clean", BuildDate:"2021-12-14T12:50:25Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"49499222b0eb0349359881bea01d8d5bd78bf444", GitTreeState:"clean", BuildDate:"2021-12-14T12:41:40Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

openstack

OS version

# On Linux:
$ cat /etc/os-release
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"
$ uname -a
# paste output here





### Install tools



### Container runtime (CRI) and version (if applicable)
containerd




### Related plugins (CNI, CSI, ...) and versions (if applicable)

@rthakur-est rthakur-est added the kind/bug Categorizes issue or PR as related to a bug. label Feb 3, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 3, 2022
@rthakur-est
Contributor Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 3, 2022
@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Feb 3, 2022
@ehashman
Member

ehashman commented Feb 9, 2022

This bug doesn't seem to have enough details to reproduce or investigate further. You must include the container runtime, full kubelet logs, and any relevant Kubernetes manifests, along with clear steps to reproduce, so we can help with your issue.

Once more details are provided, the bug will be accepted.

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Feb 9, 2022
@ehashman ehashman moved this from Triage to Needs Information in SIG Node Bugs Feb 9, 2022
@rthakur-est rthakur-est changed the title POD with sriov device failed on ungraceful node reboot POD fails to attach correct sriov device on ungraceful node reboot Feb 16, 2022
@rthakur-est
Contributor Author

I have attached the logs for this issue.
Before the node reboot, this was the device address attached to the pod:
00:07.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
After the node reboot, that address is assigned to a block device:
00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device

But kubelet continues to attach 0000:00:07.0 as the device after the reboot, even though the PCI assignment has changed.
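
A minimal sketch for cross-checking this from the node, assuming the sriov-network-device-plugin convention of exposing allocated addresses to the container as PCIDEVICE_* environment variables (container name taken from the events above; adjust for your deployment):

```bash
# Which PCI address was the container actually handed?
CID=$(crictl ps -a --name eric-pc-up-data-plane -q | head -n1)
crictl inspect "$CID" | grep -i 'PCIDEVICE'

# And what currently sits at that address on the node?
lspci -s 00:07.0
```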

@SergeyKanzhelev
Member

/remove-triage needs-information

@k8s-ci-robot k8s-ci-robot removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Feb 16, 2022
@SergeyKanzhelev
Member

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 16, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Needs Information to Triaged in SIG Node Bugs Feb 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2022
@rthakur-est
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 14, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 3, 2022
@swatisehgal
Contributor

/cc

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2024