
[BACKPORT][v1.5.5][BUG][v1.6.0-rc1] Failed to run instance-manager in storage network environment #8305

Closed · github-actions bot opened this issue Apr 4, 2024 · 5 comments

Labels: area/storage-network (storage network for control plane or data plane) · kind/backport (backport request) · kind/bug · priority/1 (highly recommended to fix in this release, managed by PO) · reproduce/always (100% reproducible)
Milestone: v1.5.5
Assignee: ejweber

github-actions bot commented Apr 4, 2024

backport #7640

longhorn-io-github-bot commented Apr 4, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?

To reproduce in v1.5.x before fix:

  1. Ensure no volumes are attached.
  2. Set the storage-network setting to a random value (e.g. longhorn-system/notexist); a command sketch follows the transcript below.
  3. Wait for the instance-manager pods to be recreated (as appropriate).
  4. Delete all longhorn-manager pods.
  5. Observe the instance-manager pods being recreated again, for no reason: the storage-network setting was merely re-synced, not changed.
eweber@laptop:~/longhorn> kl delete pod -l app=longhorn-manager 
pod "longhorn-manager-cl472" deleted
pod "longhorn-manager-cnfkv" deleted
pod "longhorn-manager-pdtj2" deleted

eweber@laptop:~/longhorn> kl get pod | grep instance-manager
instance-manager-e-bbd0405d8fa87cc0209520b4c3262577   0/1     ContainerCreating   0          2s
instance-manager-e-c84e8856027c5474944ab4efd990f514   0/1     ContainerCreating   0          2s
instance-manager-e-e2c3e593061b7d2b3dbc3d36d2b2290a   0/1     Terminating         0          3s
instance-manager-r-bbd0405d8fa87cc0209520b4c3262577   0/1     Terminating         0          3s
instance-manager-r-c84e8856027c5474944ab4efd990f514   0/1     ContainerCreating   0          2s
instance-manager-r-e2c3e593061b7d2b3dbc3d36d2b2290a   0/1     Terminating         0          3s
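
For step 2, the setting can be changed by patching the Longhorn Setting custom resource. A minimal sketch, assuming kubectl access and the default longhorn-system namespace (kl in the transcripts above appears to be a shell alias for a similar kubectl invocation):

    # Point storage-network at a NetworkAttachmentDefinition that does not
    # exist, to trigger the repro (value format: <namespace>/<name>).
    kubectl -n longhorn-system patch settings.longhorn.io storage-network \
      --type=merge -p '{"value": "longhorn-system/notexist"}'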

To test the backported fix in v1.5.x:

  1. Repeat the steps above and observe that the instance-manager pods are not recreated again; a watch sketch follows the transcript below.
eweber@laptop:~/longhorn> kl delete pod -l app=longhorn-manager 
pod "longhorn-manager-cl472" deleted
pod "longhorn-manager-cnfkv" deleted
pod "longhorn-manager-pdtj2" deleted

eweber@laptop:~/longhorn> kl get pod | grep instance-manager
instance-manager-e-4b443d2688949e932fe861a5e08f2a42   1/1     Running   0               5m23s
instance-manager-e-5b946f5536c20defcce3ba51560a1dee   1/1     Running   0               4m53s
instance-manager-e-5d61ec2e1803a0a52a407f466c402633   1/1     Running   0               5m5s
instance-manager-r-4b443d2688949e932fe861a5e08f2a42   1/1     Running   0               5m23s
instance-manager-r-5b946f5536c20defcce3ba51560a1dee   1/1     Running   0               4m54s
instance-manager-r-5d61ec2e1803a0a52a407f466c402633   1/1     Running   0               5m5s
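
To watch for unexpected recreations directly, something like the following can be used (a sketch; the longhorn.io/component=instance-manager label selector is an assumption based on Longhorn's usual pod labels, not taken from this thread):

    # Delete the longhorn-manager pods, then watch the instance-manager pods;
    # their AGE should keep increasing and no replacements should appear.
    kubectl -n longhorn-system delete pod -l app=longhorn-manager
    kubectl -n longhorn-system get pod -l longhorn.io/component=instance-manager --watch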

yangchiu (Member) commented Apr 8, 2024

@ejweber I've tested it on v1.5.x-head, but the behavior is a little different from #8305 (comment). If the storage-network setting is set to longhorn-system/notexist, the instance-manager pods get stuck in the ContainerCreating state instead of reaching the Running state. Could you help confirm this is expected behavior? Thank you!

ejweber (Contributor) commented Apr 9, 2024

@yangchiu, this is not the behavior I observe in my cluster, but it may make sense.

My cluster does not have Multus or the NetworkAttachmentDefinition CRD installed. So, Longhorn sets the annotation on the instance-manager pod, but no component in the cluster actually attempts to set up a secondary network.

    k8s.v1.cni.cncf.io/networks: '[{"namespace": "longhorn-system", "name": "notexist",
      "interface": "lhnet1"}]'

This is fine for testing, as I only want to verify whether setting the Longhorn storage-network setting causes instance-manager pods to be continuously restarted.
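
To check what was actually set, the annotation can be read back from a pod (a sketch, assuming kubectl; the pod name is one from the transcript above):

    # Read the Multus networks annotation back from an instance-manager pod.
    kubectl -n longhorn-system get pod instance-manager-e-bbd0405d8fa87cc0209520b4c3262577 \
      -o yaml | grep -A1 'k8s.v1.cni.cncf.io/networks'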

Do you have Multus installed in your test cluster? If so, I think it is likely that the container fails to start because there is no longhorn-system/notexist NetworkAttachmentDefinition in the cluster, so network setup fails for the instance-manager pods.

If you DO have Multus installed, can we test by EITHER:

  1. Using a cluster that does NOT have Multus installed.
  2. Changing the storage-network setting to match an actual NetworkAttachmentDefinition that works in the cluster instead of longhorn-system/notexist (a sketch of such a definition follows below).

If you DON'T have Multus installed, can you please send a support bundle for evaluation?
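
For option 2, a working NetworkAttachmentDefinition might look roughly like the following. This is a minimal sketch; the name, CNI type, master interface, and IPAM range are illustrative assumptions, not values from this thread:

    apiVersion: k8s.cni.cncf.io/v1
    kind: NetworkAttachmentDefinition
    metadata:
      name: demo-storage-net          # hypothetical name
      namespace: longhorn-system
    spec:
      config: |
        {
          "cniVersion": "0.3.1",
          "type": "macvlan",
          "master": "eth1",
          "mode": "bridge",
          "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
        }

The storage-network setting would then be set to longhorn-system/demo-storage-net.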

yangchiu (Member) commented

> @yangchiu, this is not the behavior I observe in my cluster, but it may make sense.
>
> Do you have Multus installed in your test cluster? If so, I think it is likely that the container fails to start because there is no longhorn-system/notexist NetworkAttachmentDefinition in the cluster, so network setup fails for the instance-manager pods.

Yes, I tested with Multus installed in my cluster. The instance managers get stuck in ContainerCreating with this error message:

Warning  FailedCreatePodSandBox  4s    kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "06f9c589bde0ff43f6673441968b2dde448d465b40c59c93191e4f78c134c1eb": plugin type="multus" failed (add): Multus: [longhorn-system/instance-manager-d6ec2eb00e44dab155201d755760d661/5af79f42-d03a-4dfd-921c-33b766622adf]: error loading k8s delegates k8s args: TryLoadPodDelegates: error in getting k8s network for pod: GetNetworkDelegates: failed getting the delegate: getKubernetesDelegate: cannot find a network-attachment-definition (notexist) in namespace (longhorn-system): network-attachment-definitions.k8s.cni.cncf.io "notexist" not found

So it's expected. Thank you for the clarification!
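
For reference, the kubelet event above can be surfaced with kubectl describe (the pod name is taken from the error message):

    # Show status and recent events for a stuck instance-manager pod.
    kubectl -n longhorn-system describe pod instance-manager-d6ec2eb00e44dab155201d755760d661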

> If you DO have Multus installed, can we test by EITHER:
>
>   1. Using a cluster that does NOT have Multus installed.
>   2. Changing the storage-network setting to match an actual NetworkAttachmentDefinition that works in the cluster instead of longhorn-system/notexist.

Yes, after changing the storage-network setting to an existing NetworkAttachmentDefinition, the instance managers work without problems.

yangchiu (Member) commented

Verified and passed on v1.5.x-head (longhorn-manager 63ba2f7) following the test plan.

Longhorn also runs without problems on the storage network pipeline: https://ci.longhorn.io/job/private/job/longhorn-storage-network-test/15/
