
[BACKPORT][v1.5.5][BUG][v1.6.0-rc1] Failed to run instance-manager in storage network environment #8305

Closed · github-actions bot opened this issue Apr 4, 2024 · 5 comments

Labels: area/storage-network (storage network for control plane or data plane) · kind/backport (backport request) · kind/bug · priority/1 (highly recommended to fix in this release, managed by PO) · reproduce/always (100% reproducible)
Milestone: v1.5.5
Assignee: ejweber

github-actions bot commented Apr 4, 2024

backport #7640

longhorn-io-github-bot commented Apr 4, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?

To reproduce in v1.5.x before fix:

  1. Ensure no volumes are attached.
  2. Set the storage-network setting to a random value (e.g. longhorn-system/notexist); a command sketch follows the transcript below.
  3. Wait for the instance-manager pods to be recreated (as appropriate).
  4. Delete all longhorn-manager pods.
  5. Observe the instance-manager pods being recreated again, for no reason: the storage-network setting was merely re-synced, not changed.
eweber@laptop:~/longhorn> kl delete pod -l app=longhorn-manager 
pod "longhorn-manager-cl472" deleted
pod "longhorn-manager-cnfkv" deleted
pod "longhorn-manager-pdtj2" deleted

eweber@laptop:~/longhorn> kl get pod | grep instance-manager
instance-manager-e-bbd0405d8fa87cc0209520b4c3262577   0/1     ContainerCreating   0          2s
instance-manager-e-c84e8856027c5474944ab4efd990f514   0/1     ContainerCreating   0          2s
instance-manager-e-e2c3e593061b7d2b3dbc3d36d2b2290a   0/1     Terminating         0          3s
instance-manager-r-bbd0405d8fa87cc0209520b4c3262577   0/1     Terminating         0          3s
instance-manager-r-c84e8856027c5474944ab4efd990f514   0/1     ContainerCreating   0          2s
instance-manager-r-e2c3e593061b7d2b3dbc3d36d2b2290a   0/1     Terminating         0          3s
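
For step 2, the setting can be changed by patching the Longhorn Setting custom resource. A minimal sketch, assuming kubectl access and the default longhorn-system namespace (kl in the transcripts above appears to be a shell alias for a similar kubectl invocation):

    # Point storage-network at a NetworkAttachmentDefinition that does not
    # exist, to trigger the repro (value format: <namespace>/<name>).
    kubectl -n longhorn-system patch settings.longhorn.io storage-network \
      --type=merge -p '{"value": "longhorn-system/notexist"}'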

To test the backported fix in v1.5.x:

  1. Repeat the steps above and observe that the instance-manager pods are not recreated again; a watch sketch follows the transcript below.
eweber@laptop:~/longhorn> kl delete pod -l app=longhorn-manager 
pod "longhorn-manager-cl472" deleted
pod "longhorn-manager-cnfkv" deleted
pod "longhorn-manager-pdtj2" deleted

eweber@laptop:~/longhorn> kl get pod | grep instance-manager
instance-manager-e-4b443d2688949e932fe861a5e08f2a42   1/1     Running   0               5m23s
instance-manager-e-5b946f5536c20defcce3ba51560a1dee   1/1     Running   0               4m53s
instance-manager-e-5d61ec2e1803a0a52a407f466c402633   1/1     Running   0               5m5s
instance-manager-r-4b443d2688949e932fe861a5e08f2a42   1/1     Running   0               5m23s
instance-manager-r-5b946f5536c20defcce3ba51560a1dee   1/1     Running   0               4m54s
instance-manager-r-5d61ec2e1803a0a52a407f466c402633   1/1     Running   0               5m5s
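
To watch for unexpected recreations directly, something like the following can be used (a sketch; the longhorn.io/component=instance-manager label selector is an assumption based on Longhorn's usual pod labels, not taken from this thread):

    # Delete the longhorn-manager pods, then watch the instance-manager pods;
    # their AGE should keep increasing and no replacements should appear.
    kubectl -n longhorn-system delete pod -l app=longhorn-manager
    kubectl -n longhorn-system get pod -l longhorn.io/component=instance-manager --watch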

yangchiu (Member) commented Apr 8, 2024

@ejweber I've tested it on v1.5.x-head, but the behavior is a little different from #8305 (comment). If the storage-network setting is set to longhorn-system/notexist, the instance-manager pods get stuck in the ContainerCreating state instead of reaching the Running state. Could you help confirm this is expected behavior? Thank you!

ejweber (Contributor) commented Apr 9, 2024

@yangchiu, this is not the behavior I observe in my cluster, but it may make sense.

My cluster does not have Multus or the NetworkAttachmentDefinition CRD installed. So, Longhorn sets the annotation on the instance-manager pod, but no component in the cluster actually attempts to set up a secondary network.

    k8s.v1.cni.cncf.io/networks: '[{"namespace": "longhorn-system", "name": "notexist",
      "interface": "lhnet1"}]'

This is fine for testing, as I only want to verify whether setting the Longhorn storage-network setting causes instance-manager pods to be continuously restarted.
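
To check what was actually set, the annotation can be read back from a pod (a sketch, assuming kubectl; the pod name is one from the transcript above):

    # Read the Multus networks annotation back from an instance-manager pod.
    kubectl -n longhorn-system get pod instance-manager-e-bbd0405d8fa87cc0209520b4c3262577 \
      -o yaml | grep -A1 'k8s.v1.cni.cncf.io/networks'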

Do you have Multus installed in your test cluster? If so, I think it is likely that the container fails to start because there is no longhorn-system/notexist NetworkAttachmentDefinition in the cluster, so network setup fails for the instance-manager pods.

If you DO have Multus installed, can we test by EITHER:

  1. Using a cluster that does NOT have Multus installed.
  2. Changing the storage-network setting to match an actual NetworkAttachmentDefinition that works in the cluster instead of longhorn-system/notexist (a sketch of such a definition follows below).

If you DON'T have Multus installed, can you please send a support bundle for evaluation?
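
For option 2, a working NetworkAttachmentDefinition might look roughly like the following. This is a minimal sketch; the name, CNI type, master interface, and IPAM range are illustrative assumptions, not values from this thread:

    apiVersion: k8s.cni.cncf.io/v1
    kind: NetworkAttachmentDefinition
    metadata:
      name: demo-storage-net          # hypothetical name
      namespace: longhorn-system
    spec:
      config: |
        {
          "cniVersion": "0.3.1",
          "type": "macvlan",
          "master": "eth1",
          "mode": "bridge",
          "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
        }

The storage-network setting would then be set to longhorn-system/demo-storage-net.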

yangchiu (Member) commented

> @yangchiu, this is not the behavior I observe in my cluster, but it may make sense.
>
> Do you have Multus installed in your test cluster? If so, I think it is likely that the container fails to start because there is no longhorn-system/notexist NetworkAttachmentDefinition in the cluster, so network setup fails for the instance-manager pods.

Yes, I tested with Multus installed in my cluster. The instance managers get stuck in ContainerCreating with this error message:

Warning  FailedCreatePodSandBox  4s    kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "06f9c589bde0ff43f6673441968b2dde448d465b40c59c93191e4f78c134c1eb": plugin type="multus" failed (add): Multus: [longhorn-system/instance-manager-d6ec2eb00e44dab155201d755760d661/5af79f42-d03a-4dfd-921c-33b766622adf]: error loading k8s delegates k8s args: TryLoadPodDelegates: error in getting k8s network for pod: GetNetworkDelegates: failed getting the delegate: getKubernetesDelegate: cannot find a network-attachment-definition (notexist) in namespace (longhorn-system): network-attachment-definitions.k8s.cni.cncf.io "notexist" not found

So it's expected. Thank you for the clarification!
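
For reference, the kubelet event above can be surfaced with kubectl describe (the pod name is taken from the error message):

    # Show status and recent events for a stuck instance-manager pod.
    kubectl -n longhorn-system describe pod instance-manager-d6ec2eb00e44dab155201d755760d661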

> If you DO have Multus installed, can we test by EITHER:
>
>   1. Using a cluster that does NOT have Multus installed.
>   2. Changing the storage-network setting to match an actual NetworkAttachmentDefinition that works in the cluster instead of longhorn-system/notexist.

Yes, after changing the storage-network setting to an existing NetworkAttachmentDefinition, the instance managers work without problems.

yangchiu (Member) commented

Verified and passed on v1.5.x-head (longhorn-manager 63ba2f7) following the test plan.

Longhorn also runs without problems on the storage network pipeline: https://ci.longhorn.io/job/private/job/longhorn-storage-network-test/15/
