Error setting up networking: error adding host side routes for interface: xxx, error: failed to add route file exists #352

Closed
ijumps opened this Issue Jul 11, 2017 · 24 comments

@ijumps
Contributor

ijumps commented Jul 11, 2017

If an old route exists, setting up networking will fail. This may happen in the situation described in #275.

Some logs:

Jul 11 20:20:37 c8v224 kubelet[11183]: time="2017-07-11T20:20:37+08:00" level=info msg="Extracted identifiers" Node=kube-node-52 Orchestrator=k8s Workload=infra.console-web-2894439173-nk9cj
Jul 11 20:20:37 c8v224 kubelet[11183]: time="2017-07-11T20:20:37+08:00" level=info msg="Loaded CNI NetConf" NetConfg={ k8s-pod-network calico { host-local usePodCidr <nil> <nil> [] []} 0  kube-node-52 kubernetes   info {k8s  eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VNlYWNjb3VudC9zydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJkZWZhdWx0LXRva2VuLXpveThxIiwia3ViZXJuZXRlcy5pby9zZXJ2aWZXJ2aWNlLWFjY291bnQubmFtZSI6ImRlZmF1bHQiLCJrMDBkOTg3YmdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJlNWRlZTJhNi1kZmYwLTExZTYtOGQwNS01MjU0QiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZS1zeXN0ZW06ZGVmYXVsdCJ9.l3aHAY1WNj_rS-lZ4sM6O6ssJBuELLBb-xFi2jVzeWleAeFBbgo9KX0SQ_KglcI58XDNVopNzSqaequdbIck0tubinvtksBL0D_tMT7C_kRcSxf_3k3MyVwLr3TKilNW94Hs-6ani7ox2Iwo2AUUthGzI48zo_qMufMVy48qiN1fFpGGfCwRbl5Ax4aXaEQUDTxL8-34EpHwFUdiPB626YLWzTaUWWqFqbXC3DQJMimWLIMXmSE5Bt1siOBxTv1RqQlJ1RowAwfZ9xvQOnRtj8lYhfP0bzJXQouMmxDFuNiCFM6_hyHeDo5tPXM6cpysIz7XuU521lNko0sEEAwHuA   } {https://10.254.0.1:443 /etc/cni/net.d/calico-kubeconfig } {{{ {[]}}}}    } Workload=infra.console-web-2894439173-nk9cj
Jul 11 20:20:37 c8v224 kubelet[11183]: time="2017-07-11T20:20:37+08:00" level=info msg="Configured environment: [LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin KUBE_LOGTOSTDERR=--logtostderr=true KUBE_LOG_LEVEL=--v=0 KUBE_ALLOW_PRIV=--allow-privileged=true KUBE_MASTER=--master=https://192.168.18.60:443 KUBE_CLOUD_PROVIDER= KUBELET_ADDRESS=--address=192.168.16.224 KUBELET_HOSTNAME=--hostname-override=kube-node-52 KUBELET_API_SERVER= KUBELET_ARGS=--require-kubeconfig --kubeconfig=/etc/kubernetes/kubelet.kubeconfig --pod-manifest-path=/etc/kubernetes/manifests --pod-infra-container-image=cargo.caicloudprivatetest.com/caicloudgcr/google_containers_pause-amd64:3.0 --node-ip=192.168.16.224 --cluster-dns=10.254.0.100 --cluster-domain=cluster.local --network-plugin=cni --feature-gates=Accelerators=true CNI_COMMAND=ADD CNI_CONTAINERID=69cfce426f2d757adc2d8e4389b96829f3b8155c9e0c4f5cf4d2bd3c056b3334 CNI_NETNS=/proc/21837/ns/net CNI_ARGS=IgnoreUnknown=1;IgnoreUnknown=1;K8S_POD_NAMESPACE=infra;K8S_POD_NAME=console-web-2894439173-nk9cj;K8S_POD_INFRA_CONTAINER_ID=69cfce426f2d757adc2d8e4389b96829f3b8155c9e0c4f5cf4d2bd3c056b3334 CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin:/opt/calico/bin DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/cni/net.d/calico-kubeconfig K8S_API_ENDPOINT=https://10.254.0.1:443 K8S_API_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJkZWZhdWx0LXRva2VuLXpveThxIiwiaZXRlcy5pby9zZXJ2aWNl3ViZXJuYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6ImRlZmF1bNlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJlNWRlZTJhNi1kZmYwLTExZTYtOGQwNS01MjU0MDBkOTg3YmQiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZSHQiLCJrdWJlcm5ldGVzLmlvL31zeXN0ZW06ZGVmYXVsdCJ9.l3aHAY1WNj_rS-lZ4sM6O6ssJBuELLBb-xFi2jVzeWleAeFBbgo9KX0SQ_Kg4aXaEQUlcI58XDNVopNzSqaequdbIck0tubinvtksBL0D_tMT7C_kRcSxf_3k3MyVwLr3TKilNW94Hs-6ani7ox2Iwo2AUUthGzI48zo_qMufMVy48qiN1fFpGGfCwRbl5AxDTxL8-34EpHwFUdiPB626YLWzTaUWWqFqbXC3DQJMimWLIM
Jul 11 20:20:37 c8v224 kubelet[11183]: XmSE5Bt1siOBxTv1RqQlJ1RowAwfZ9xvQOnP0bzJXQouMmxDFuNiCFRtj8lYhfM6_hyHeDo5tPXM6cpysIz7XuU521lNko0sEEAwHuA]"
Jul 11 20:20:37 c8v224 kubelet[11183]: time="2017-07-11T20:20:37+08:00" level=info msg="Loading config from environment"
Jul 11 20:20:37 c8v224 kubelet[11183]: Calico CNI checking for existing endpoint: <nil>
Jul 11 20:20:37 c8v224 kubelet[11183]: time="2017-07-11T20:20:37+08:00" level=info msg="Extracted identifiers for CmdAddK8s" Node=kube-node-52 Orchestrator=k8s Workload=infra.console-web-2894439173-nk9cj
Jul 11 20:20:37 c8v224 kubelet[11183]: Calico CNI fetching podCidr from Kubernetes
Jul 11 20:20:37 c8v224 kubelet[11183]: time="2017-07-11T20:20:37+08:00" level=info msg="Fetched podCidr" Workload=infra.console-web-2894439173-nk9cj podCidr="10.100.2.0/24"
Jul 11 20:20:37 c8v224 kubelet[11183]: Calico CNI passing podCidr to host-local IPAM: 10.100.2.0/24
Jul 11 20:20:37 c8v224 kubelet[11183]: time="2017-07-11T20:20:37+08:00" level=info msg="Populated endpoint" Workload=infra.console-web-2894439173-nk9cj endpoint=&{{workloadEndpoint v1} {{<nil>} eth0 infra.console-web-2894439173-nk9cj k8s kube-node-52 69cfce426f2d757adc2d8e4389b96829f3b8155c9e0c4f5cf4d2bd3c056b3334 map[kubernetes-admin.caicloud.io/application:console-web kubernetes-admin.caicloud.io/select-by:service_infra_console-web kubernetes-admin.caicloud.io/type:application pod-template-hash:2894439173 calico/k8s_ns:infra]} {[10.100.2.42/32] [] <nil> <nil> [k8s_ns.infra]  <nil>}}
Jul 11 20:20:37 c8v224 kubelet[11183]: Calico CNI using IPs: [10.100.2.42/32]
Jul 11 20:20:37 c8v224 kubelet[11183]: I0711 20:20:37.253621   11183 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/840b78b1-e1dd-11e6-b4ef-525400d987bd-default-token-zoy8q" (spec.Name: "default-token-zoy8q") pod "840b78b1-e1dd-11e6-b4ef-525400d987bd" (UID: "840b78b1-e1dd-11e6-b4ef-525400d987bd").
Jul 11 20:20:37 c8v224 kubelet[11183]: time="2017-07-11T20:20:37+08:00" level=error msg="Error setting up networking: error adding host side routes for interface: cali5ad046dac08, error: failed to add route file exists" Workload=infra.console-web-2894439173-nk9cj
Jul 11 20:20:37 c8v224 kubelet[11183]: time="2017-07-11T20:20:37+08:00" level=info msg="Cleaning up IP allocations for failed ADD" Workload=infra.console-web-2894439173-nk9cj
Jul 11 20:20:37 c8v224 kubelet[11183]: E0711 20:20:37.312002   11183 cni.go:257] Error adding network: error adding host side routes for interface: cali5ad046dac08, error: failed to add route file exists
Jul 11 20:20:37 c8v224 kubelet[11183]: E0711 20:20:37.312049   11183 cni.go:211] Error while adding to cni network: error adding host side routes for interface: cali5ad046dac08, error: failed to add route file exists

The code here (https://github.com/projectcalico/cni-plugin/blob/master/utils/network.go#L170) could check whether err is the route "file exists" error and tolerate it:

	// Now that the host side of the veth is moved, state set to UP, and configured with sysctls, we can add the routes to it in the host namespace.
	err = setupRoutes(hostVeth, result)
	if err != nil {
		return "", "", fmt.Errorf("error adding host side routes for interface: %s, error: %s", hostVeth.Attrs().Name, err)
	}
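
For illustration only, a minimal sketch of that kind of tolerance (not the project's actual code), assuming the underlying netlink error for an already-programmed route surfaces as syscall.EEXIST; the helper name addRouteTolerant is made up here:

	package main

	import (
		"errors"
		"fmt"
		"net"
		"syscall"

		"github.com/vishvananda/netlink"
	)

	// addRouteTolerant adds a route but treats "file exists" (EEXIST) as
	// success, since another actor (Felix, another CNI plugin, or a manual
	// `ip route add`) may have programmed the same route already.
	func addRouteTolerant(route *netlink.Route) error {
		if err := netlink.RouteAdd(route); err != nil {
			if errors.Is(err, syscall.EEXIST) {
				// Route is already present; nothing to do.
				return nil
			}
			return fmt.Errorf("failed to add route %v: %v", route, err)
		}
		return nil
	}

	func main() {
		// Example usage (needs CAP_NET_ADMIN and a valid interface index).
		_, dst, _ := net.ParseCIDR("10.100.2.42/32")
		r := &netlink.Route{LinkIndex: 1, Dst: dst, Scope: netlink.SCOPE_LINK}
		if err := addRouteTolerant(r); err != nil {
			fmt.Println("error:", err)
		}
	}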
@YYGCui

YYGCui commented Jul 13, 2017

I met the same problem when I use two CNI plugins and set calico as the second one.
The log:

ERRO[0000] Error creating veth: failed to add route file exists  Workload=xxxx.udptest-3970579046-19tjp
ERRO[0000] Error setting up networking: failed to add route file exists  Workload=xxxx.udptest-3970579046-19tjp

@pgnaleen

pgnaleen commented Jul 13, 2017

I am getting this error when trying to create a new pod.

rpc error: code = 2 desc = failed to start container "cdbefe529700685ff3b73e3d8b69507e1b31861d5b676e2d40b6fc234713017d": Error response from daemon: {"message":"cannot join network of a non running container: 48832d315a4cee520da22d9c5d9a0c79878ca596da32d40d91d8dab9d8c1ca7f"}

@caseydavenport

Member

caseydavenport commented Jul 13, 2017

@pgnaleen that sounds like a different issue, and probably belongs in the Kubernetes repo (sounds like a k8s issue not a Calico issue)

@caseydavenport

Member

caseydavenport commented Jul 13, 2017

As for this issue, I think the Calico CNI plugin needs to be resilient to the route already existing, since there are other actors that might be setting it up.

I think the fix in the description is probably a good one - check the returned error to see if we care about it.

@matthewdupre WDYT?

Anyone care to have a go at implementing this?

caseydavenport added this to the next-milestone milestone Jul 13, 2017

@YYGCui

YYGCui commented Jul 14, 2017

I read some snippets of the calico cni code; it seems this CNI is designed to be exclusive, so it cannot coexist with other CNI plugins.

@caseydavenport

Member

caseydavenport commented Jul 14, 2017

@YYGCui which bits do you think make the plugin exclusive?

The intention is not to make the Calico CNI plugin exclusive - so if there are obvious places where we don't play nice we should look to fix them up.

@YYGCui

YYGCui commented Jul 17, 2017

@caseydavenport Here are some snippets of the code:

	// Always check if there's an existing endpoint.
	endpoints, err := calicoClient.WorkloadEndpoints().List(api.WorkloadEndpointMetadata{
		Node:         nodename,
		Orchestrator: orchestrator,
		Workload:     workload})
	if err != nil {
		return err
	}

	logger.Debugf("Retrieved endpoints: %v", endpoints)

	var endpoint *api.WorkloadEndpoint
	if len(endpoints.Items) == 1 {
		endpoint = &endpoints.Items[0]
	}
  • Another is the route setup, which is what this issue is about.
@caseydavenport

Member

caseydavenport commented Jul 17, 2017

Yep, looks like we make some assumptions about there only being a single interface.

I think this issue is occurring even when only using a single interface.

since there are other actors that might be setting it up

This doesn't just mean other CNI plugins, it could also be other agents on the node (e.g. Felix) or simply a user typing ip route add ...

@gunjan5

Contributor

gunjan5 commented Jul 26, 2017

@YYGCui Can you explain your setup a bit more in detail so we can see how to accommodate that kind of situation? Which other plugin are you using and what route does it add for the same pod?

I met the same problem when I use two CNI plugins and set calico as the second one.

@YYGCui

YYGCui commented Jul 27, 2017

@gunjan5
I use the CNI-Genie CNI plugin to set up networking. If I choose canal,calico to assign multiple IPs to a pod, the second network replaces the first one, as I mentioned above.
I modified the code (https://github.com/projectcalico/cni-plugin/blob/master/calico.go#L93) to make calico check whether the exact endpoint (such as eth0) exists, so that multiple endpoints can be added (such as eth1).

Alternatively, if I choose weave,calico, the failed to add route file exists error is raised.
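
For illustration, a hypothetical variant of the endpoint lookup quoted earlier (not the project's actual code): filter on the endpoint name as well, so only the interface currently being added is matched and a second plugin can add eth1 without colliding with an existing eth0 endpoint. The Name field on the WorkloadEndpointMetadata filter and the args.IfName value (from CNI_IFNAME) are assumptions here:

	// Look up only the endpoint for the interface being added (e.g. "eth0"),
	// instead of assuming the workload has at most one endpoint.
	endpoints, err := calicoClient.WorkloadEndpoints().List(api.WorkloadEndpointMetadata{
		Name:         args.IfName, // assumed: CNI_IFNAME, e.g. "eth0" or "eth1"
		Node:         nodename,
		Orchestrator: orchestrator,
		Workload:     workload})
	if err != nil {
		return err
	}

	var endpoint *api.WorkloadEndpoint
	if len(endpoints.Items) == 1 {
		endpoint = &endpoints.Items[0]
	}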

@klynch

klynch commented Aug 7, 2017

I have encountered this issue (Calico CNI only). It appears my issue may be related to a Pod that has an assigned IP but no corresponding Calico WEP. Additionally, I had many, many WEPs without corresponding pods.

The latter may be related to an old issue on Kubernetes 1.6 with Calico not deleting the WEP when a node was hard rebooted (I believe that issue was fixed). I don't have a lot of data on the former, but now that I have a clean cluster I will keep a closer watch.

@bradbehle

bradbehle commented Aug 10, 2017

I hit this problem. It looks like kubernetes is trying to create a pod that existed previously but wasn't cleaned up in calico properly. In my case, I got this error creating a pod where the pod didn't exist, but there was an existing workload endpoint in etcd that specified the pod name, namespace, and also the worker node that kubernetes was trying to create the pod on. I was able to work around it by manually deleting the orphan workload endpoint and then trying to create the pod again.

calicoctl delete --workload=<NS>.<POD_NAME> --orchestrator=k8s --node=<NODE_NAME> wep <NAME>

@rbjorklin

rbjorklin commented Aug 15, 2017

I'm seeing this error as well, and deleting the leftover WEP as @bradbehle suggests does not work. I'm using Kubernetes 1.7.2 and Calico 2.4.1. Logs can be found here. I'm currently unable to deploy any applications outside of the kube-system namespace (dns-addon and dashboard work). I'm assuming I've misconfigured something, but I have no idea what.

@caseydavenport

Member

caseydavenport commented Aug 15, 2017

After discussing with @gunjan5, our current best guess as to what is going on here is that it's a race between Felix and the CNI plugin, both of which try to program the route. The race only exists when the kubelet is restarting a Pod that has already been started on that node (i.e. the WEP exists prior to the Pod starting).

Gunjan has a WIP fix here: #358

@bradbehle

bradbehle commented Sep 5, 2017

@caseydavenport @gunjan5 Any update on this? We are hitting this often enough that we are looking for an outlook on a fix.

@ijumps

Contributor

ijumps commented Sep 6, 2017

@caseydavenport @gunjan5 I use the following script as a workaround: it removes all routes created by calico and restarts calico-felix so felix adds the routes back.

for r in `ip r | awk '/cali/ {print $1}'`; do ip r del "$r" ; done; docker restart `docker ps | awk '/start_runit/ {print $1}'`

Most of the time this works. But I hit a new situation where I ran this script and restarted kubelet and docker, and none of that worked.

I noticed many dirty (no longer used) IPs in /var/lib/cni/networks/k8s-pod-network/; removing all the IP files and restarting docker resolved it.

Hope this can help you dig into this issue.

@gunjan5

Contributor

gunjan5 commented Sep 7, 2017

@ijumps I have a possible fix and have made a debug image. Can you try out this binary, change the CNI logging level to debug, and see if it fixes the problem? If not, the logs would help.
The binary is at https://transfer.sh/HnfYy/calico (if you have any nodes that are not production critical, you can try it out there). Replace /opt/cni/bin/calico with this one (make sure you chmod +x it) and change the logging level in /etc/cni/net.d/10-calico.conf to debug.

@bradbehle this is a slightly different image from the one I gave you yesterday, in case you haven't started testing it yet.

thanks!

@klynch

klynch commented Sep 7, 2017

@gunjan5 is this binary the changes made in #358? Also, is it known to be compatible with Calico v2.3.0, or should we prioritize an upgrade first before testing with this?

@gunjan5

Contributor

gunjan5 commented Sep 8, 2017

@klynch yes it's from #358. There have been minimal changes to CNI since v1.9.1 which is what's in Calico v2.3, but it should be ok for testing. I would try it on a non-critical host first.

@ijumps

Contributor

ijumps commented Sep 10, 2017

@gunjan5 I already fixed it with the script above. I will try this binary if it happens again.

#358 should fix this, and I'm interested in how this happened.

Most of the time this works. But I hit a new situation where I ran this script and restarted kubelet and docker, and none of that worked.

I noticed many dirty (no longer used) IPs in /var/lib/cni/networks/k8s-pod-network/; removing all the IP files and restarting docker resolved it.

The special situation I faced may be due to another issue. I use the canal network plugin on those nodes. Rarely, a node runs out of IPs: /var/lib/cni/networks/k8s-pod-network/ is full for the node CIDR, but from docker ps and ip r I can see that only 10+ pods exist, so it seems some unused IPs are not released.

@gunjan5

Contributor

gunjan5 commented Sep 11, 2017

@ijumps our working theory is that this is a race condition between the CNI plugin and Felix: you see this when Felix programs the route to the pod before the CNI plugin does, and since the CNI plugin doesn't expect the route to be there, it exits with the error.
As for /var/lib/cni/networks/k8s-pod-network/ being full, I have seen that happen when kubelet wasn't sending a CNI DEL command to the CNI plugin(s); some cases are described in kubernetes/kubernetes#14940.

@msavlani

msavlani commented Nov 8, 2017

Hi @gunjan5 @caseydavenport

I am facing exactly the same error as mentioned in this issue.
I applied the v2.6.2 cni plugin in my environment and now I see the error below:

5s 5s 1 kubelet, node1 Warning FailedSync Error syncing pod, skipping: failed to "SetupNetwork" for "env-test1-1193689166-258nb_namespace1" with SetupNetworkError: "NetworkPlugin cni failed to set up pod "env-test1-1193689166-258nb_namespace1" network: error adding host side routes for interface: calie936bc601e5, error: failed to add route (Dst: 172.40.107.158/32, Scope: %!!(MISSING)s(netlink.Scope=253), Iface: calie936bc601e5): file exists"

Any idea what is wrong here?

Thanks.

@Random-Liu

Random-Liu commented Nov 27, 2017

Same error with kubernetes + containerd + calico v2.6.1:

E1127 17:09:30.561609    1520 kuberuntime_manager.go:647] createPodSandbox for pod "downwardapi-volume-b8f42546-d395-11e7-889d-0a580a3c9115_e2e-tests-projected-8ww7r(b8f68e63-d395-11e7-9874-42010a800002)" failed: rpc error: code = Unknown desc = failed to setup network for sandbox "bf37c1b2c041d7ef0892f6d140086e6c2374581280bb1eb7c61f757e2f51d803": error adding host side routes for interface: cali12f59090da4, error: failed to add route (Dst: 10.64.4.8/32, Scope: %!!(MISSING)s(netlink.Scope=253), Iface: cali12f59090da4): file exists
@gunjan5

Contributor

gunjan5 commented Nov 27, 2017

@Random-Liu we had a couple more PRs for the fix that went into the latest CNI release v1.11.1. See @fasaxc's comment projectcalico/calico#1253 (comment) explaining the procedure. Latest discussion is in projectcalico/calico#1406
