What happened:
Pod stuck in ContainerCreating. Events show a "failed to setup network for sandbox" loop.
Warning FailedCreatePodSandBox 2m2s (x16 over 3m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a6092ae23ccfc6d2eadf18642437053ede456a3a45d1f1a748b9fcd827a92c85": plugin type="multus" name="multus-cni-network" failed (add): [abc-09f7b-jp4rb/mango-01/f84c6bb9-d487-4c6f-a22b-fae50463e461:multus-cni-network]: error adding container to network "multus-cni-network": [abc-09f7b-jp4rb/mango-01/f84c6bb9-d487-4c6f-a22b-fae50463e461:abc-09f7b-jp4rb-main-mesh]: error adding container to network "abc-09f7b-jp4rb-main-mesh": DelegateAdd: cannot set "ovn-k8s-cni-overlay" interface name to "pod21e2223678f": validateIfName: interface name pod21e2223678f already exists
Normal AddedInterface 2m multus Add eth0 [100.64.9.221/32] from aws-cni
Normal AddedInterface 2m multus Add pod21e2223678f [] from abc-09f7b-jp4rb/abc-09f7b-jp4rb-main-mesh
Normal AddedInterface 2m multus Add pod1f9bb198e14 [] from abc-09f7b-jp4rb/abc-09f7b-jp4rb-lemon-mesh
Normal AddedInterface 119s multus Add pod42c79b6cd76 [] from abc-09f7b-jp4rb/abc-09f7b-jp4rb-mango-mesh-01
Normal AddedInterface 119s multus Add eth0 [100.64.9.221/32] from multus-cni-network
Normal AddedInterface 116s multus Add eth0 [100.64.0.250/32] from aws-cni
Normal AddedInterface 116s multus Add pod21e2223678f [] from abc-09f7b-jp4rb/abc-09f7b-jp4rb-main-mesh
Normal AddedInterface 116s multus Add pod1f9bb198e14 [] from abc-09f7b-jp4rb/abc-09f7b-jp4rb-lemon-mesh
Normal AddedInterface 115s multus Add pod42c79b6cd76 [] from abc-09f7b-jp4rb/abc-09f7b-jp4rb-mango-mesh-01
Normal AddedInterface 115s multus Add eth0 [100.64.0.250/32] from multus-cni-network
Normal AddedInterface 112s multus Add eth0 [100.64.46.81/32] from aws-cni
Normal AddedInterface 112s multus Add pod21e2223678f [] from abc-09f7b-jp4
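For reference, the event loop above is visible straight from the pod's events; the namespace and pod name below are the ones from the sandbox error, adjust for your cluster:
kubectl -n abc-09f7b-jp4rb describe pod mango-01        # full status plus the FailedCreatePodSandBox events
kubectl -n abc-09f7b-jp4rb get events --field-selector involvedObject.name=mango-01 --sort-by=.lastTimestamp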
Manual resolution:
Deleting the /etc/cni/net.d/00-multus.conf file and the /etc/cni/net.d/multus.d directory and restarting the daemonset pod resolves this issue for subsequent pods being scheduled. Existing pods finish creating but leave the networking in a bad state (e.g., separate from the above example, after resolving I've seen interface pod6c270ef2f25 not found: route ip+net: no such network interface).
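In shell form, the workaround is roughly the following; the kube-system namespace and app=multus label are assumptions based on the upstream daemonset manifest, adjust to your deployment:
# On the affected node:
sudo rm /etc/cni/net.d/00-multus.conf
sudo rm -rf /etc/cni/net.d/multus.d
# Then delete the multus daemonset pod running on that node so it gets recreated:
kubectl -n kube-system delete pod -l app=multus --field-selector spec.nodeName=<node-name>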
How to reproduce it (as minimally and precisely as possible):
Not sure how to reproduce unfortunately. I believe it's a race condition happening less than 1% of the time. My EKS cluster has instances constantly scaling up and down throughout the day, but I've only seen this 2-3x in the past few months.
Possibly related? #1221, though this isn't happening after a node reboot, and I'm not using the thick plugin.
Anything else we need to know?:
Below is the /etc/cni/net.d/00-multus.conf on a bad node, which I suspect is wrong. delegates is nested within delegates, and all the top-level fields are repeated again. It's like a bad merge happened. This is not the same as what's on a working node. (Sorry, I trimmed out some of the config when sharing with my team and the node doesn't exist anymore, so this is all I have):
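For comparison, a freshly generated thin-plugin 00-multus.conf normally has a single level of delegates, roughly like the sketch below; the field values are assumptions pieced together from the aws-cni setup and kubeconfig path mentioned elsewhere in this report, not the actual contents of either node's file:
cat /etc/cni/net.d/00-multus.conf
# Expected shape on a healthy node (sketch):
# {
#   "cniVersion": "0.3.1",
#   "name": "multus-cni-network",
#   "type": "multus",
#   "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
#   "delegates": [
#     { ...contents of 10-aws.conflist (aws-cni) as the default delegate... }
#   ]
# }
# On the bad node, this whole structure showed up again nested inside "delegates",
# with the top-level fields repeated, as described above.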
Potentially another clue: it's the norm for multus daemonset pods to fail 2x on startup with:
Defaulted container "kube-multus" out of: kube-multus, install-multus-binary (init)
kubeconfig is created in /host/etc/cni/net.d/multus.d/multus.kubeconfig
kubeconfig file is created.
panic: runtime error: index out of range [0] with length 0
goroutine 1 [running]:
main.(*Options).createMultusConfig(0xc000236200)
/usr/src/multus-cni/cmd/thin_entrypoint/main.go:297 +0x1f45
main.main()
/usr/src/multus-cni/cmd/thin_entrypoint/main.go:539 +0x445
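For anyone trying to grab the same crash output, --previous pulls the log of the crashed attempt; the app=multus label is an assumption from the upstream manifest:
kubectl -n kube-system get pods -l app=multus -o wide           # find the multus pod on the new node
kubectl -n kube-system logs <multus-pod-name> --previous        # log from the crashed attempt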
Found an issue related to this: #1092. I am not using OVN-Kind, but I am running ovn-kubernetes as a secondary CNI. Not sure why ovn-kubernetes would affect this.
Environment:
Kubernetes version (use kubectl version): v1.25.16-eks-3af4770
seastco changed the title from "Race condition on node startup is causing Pods to get stuck in ContainerCreating" to "Race condition on node startup causing Pods to get stuck in ContainerCreating" on Jul 19, 2024.
OK, still not totally sure about the root cause of the config being a mess, but I've learned more since raising this issue and can work around it.
The multus pod restart right at startup happens because there's no CNI config file yet. I'm running on EKS, where the AWS CNI daemonset pod creates 10-aws.conflist. Because that file takes a second to be created, and these daemonset pods are starting up at the same time on a new node, multus will fail and restart. OK, that's fine; it's a red herring.
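A quick way to see this on a freshly provisioned node is to watch the CNI config directory; right after boot it can be empty, so multus has nothing to use as a master config until aws-cni writes its conflist:
ls -l /etc/cni/net.d/
# empty immediately after boot; 10-aws.conflist appears once the aws-cni pod is ready,
# and 00-multus.conf is generated by multus from it afterwards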
In my cluster, multus pods were being evicted from nodes with high memory utilization. THIS restart is what caused the weird 00-multus.conf result highlighted above. I set resource requests equal to limits on the init container to give the multus pod Guaranteed QoS and stop it from being evicted.
I upgraded to v4.1.0 and set --cleanup-config-on-exit=true. Now the 00-multus.conf is removed on pod teardown.
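A rough sketch of both changes as a single patch against the daemonset. The daemonset name kube-multus-ds and the init-container resource values are assumptions based on the upstream thin-plugin manifest (the container names come from the log above); note that a strategic merge replaces the whole args list, so repeat whatever args your manifest already sets:
kubectl -n kube-system patch daemonset kube-multus-ds --type=strategic -p '
spec:
  template:
    spec:
      initContainers:
      - name: install-multus-binary
        resources:
          requests: { cpu: "10m", memory: "15Mi" }
          limits:   { cpu: "10m", memory: "15Mi" }
      containers:
      - name: kube-multus
        args:
        # keep any existing args here; --multus-conf-file=auto is shown as a typical default
        - "--multus-conf-file=auto"
        - "--cleanup-config-on-exit=true"
'
With requests equal to limits on every container, the pod gets the Guaranteed QoS class and is last in line for node-pressure eviction.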
So again, I'm not sure why the 00-multus.conf file isn't resilient to restarts, but if you're running into this issue, consider giving your multus pods Guaranteed QoS and/or setting --cleanup-config-on-exit=true.
EDIT - I've dumped a lot of irrelevant info into this thread, so I'm going to close this issue and create a new one about 00-multus.conf not being resilient to restarts.