
vxlan interface always down, causing calico-node to fail to add routes for pods on other nodes #3271

Closed
weizhoublue opened this issue Feb 21, 2020 · 1 comment

weizhoublue commented Feb 21, 2020

calico version: 3.11.2
k8s version: 1.17

I have a k8s cluster with 3 nodes, which uses the Calico CNI.

2 Calico IPPools are set up, both of them in VXLAN mode:

# calicoctl get ippool -o wide
NAME                      CIDR            NAT    IPIPMODE   VXLANMODE   DISABLED   SELECTOR   
default-ipv4-ippool       172.29.0.0/16   true   Never      Always      false      all()      
kube-system-ipv4-ippool   172.28.0.0/16   true   Never      Always      false      all()
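
For reference, a pool in that mode corresponds to a manifest like the following (a sketch reconstructed from the get output above; the actual manifests used in this cluster are not shown):

# recreate the default pool shown above
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 172.29.0.0/16
  natOutgoing: true
  ipipMode: Never
  vxlanMode: Always
EOF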

The strange thing: the vxlan.calico interface on one node always stays down. I tried to set it up, but after about 3 seconds the tunnel interface goes down again. That makes calico-node fail to add routes for the other 2 nodes.
After I delete the calico-node daemonset, I can set the vxlan.calico interface up. However, when I apply the calico-node daemonset again, the vxlan.calico interface on the same node stays down again, and I cannot set it up. So I think calico-node found something wrong and keeps forcing the vxlan.calico interface down.
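
To see what calico-node might be objecting to, the device's VXLAN attributes (VNI, destination port, local address) can be dumped with plain iproute2 (a diagnostic sketch, nothing Calico-specific):

# dump detailed VXLAN attributes of the tunnel device
ip -d link show vxlan.calico

# bring it up by hand and watch netlink events as calico-node reacts
ip link set vxlan.calico up
ip monitor link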

However, the other 2 nodes work fine.
BTW, my other cluster with the same Calico configuration does not have this issue.

Finally, I found a workaround: delete the tunnel interface, which triggers calico-node to recreate it, and then everything works fine.

This cluster environment may have had nodes scaled up or Calico reinstalled. Maybe the tunnel interface was previously created with a VXLAN configuration that is wrong for the current setup, so deleting it triggers Calico to regenerate the right one, as sketched below.
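
Concretely, the workaround amounts to the following (assuming calico-node is running on the node so it recreates the device):

# delete the stale tunnel device; calico-node notices and recreates it
# with the VXLAN settings it currently expects
ip link delete vxlan.calico

# verify the recreated device and the returning pod routes
ip -d link show vxlan.calico
ip route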

The calico-node log of the bad node is below:

2020-02-21 13:51:39.739 [INFO][50] int_dataplane.go 849: Received interface update msg=&intdataplane.ifaceUpdate{Name:"vxlan.calico", State:"up"}
2020-02-21 13:51:39.740 [INFO][50] int_dataplane.go 962: Applying dataplane updates
2020-02-21 13:51:39.740 [INFO][50] int_dataplane.go 597: Linux interface state changed. ifaceName="vxlan.calico" state="down"
2020-02-21 13:51:39.745 [INFO][50] route_table.go 577: Syncing routes: adding new route. ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.29.122.0/26
2020-02-21 13:51:39.745 [WARNING][50] route_table.go 604: Failed to add route error=network is down ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.29.122.0/26
2020-02-21 13:51:39.745 [INFO][50] route_table.go 577: Syncing routes: adding new route. ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.28.122.0/26
2020-02-21 13:51:39.745 [WARNING][50] route_table.go 604: Failed to add route error=network is down ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.28.122.0/26
2020-02-21 13:51:39.745 [INFO][50] route_table.go 577: Syncing routes: adding new route. ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.29.193.64/26
2020-02-21 13:51:39.745 [WARNING][50] route_table.go 604: Failed to add route error=network is down ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.29.193.64/26
2020-02-21 13:51:39.745 [INFO][50] route_table.go 247: Trying to connect to netlink

The route table of the bad node is below; it only has local pod routes and is missing the subnet routes for pods on other nodes:

# ip r
default via 10.6.0.1 dev dce-mng proto static metric 100 
default via 10.7.0.1 dev ens224 proto static metric 101 
10.6.0.0/16 dev dce-mng proto kernel scope link src 10.6.150.57 metric 100 
10.7.0.0/16 dev ens224 proto kernel scope link src 10.7.117.177 metric 100 
169.254.0.0/16 dev parcel-vip proto kernel scope link src 169.254.232.109 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
172.28.70.128 dev calia1e9256f877 scope link 
172.28.70.129 dev cali23d0198f573 scope link 
172.28.70.130 dev cali63579a15e5c scope link 
172.28.70.131 dev calie07fbc9b479 scope link 

NetworkManager has no configuration for this tunnel:

# nmcli con show
NAME                 UUID                                  TYPE            DEVICE
Wired connection 1  a7eba6a4-1e06-3217-a476-8fb0326c9a38  802-3-ethernet  ens224  
dce-mng             7adf786c-6968-4d2c-aba0-8f0e89606097  802-3-ethernet  dce-mng 
docker0             d0d7ca77-316e-4260-8b52-8bf6a811ea3f  bridge          docker0 

Expected Behavior

The tunnel interface should stay up, and routes should be added for pods on other nodes.

Current Behavior

The tunnel interface stays down, and pods on the bad node cannot access pods on other nodes.

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

  • Calico version: 3.11.2
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes 1.17
  • Operating System and version:
  • Link to your project (optional):
rgarcia89 commented

@weizhoublue
Try adding the following to /etc/NetworkManager/conf.d/calico.conf (the unmanaged-devices key belongs in the [keyfile] section):

[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*;interface-name:vxlan.calico

After that, restart NetworkManager and check whether the vxlan.calico interface comes up.
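
For completeness, applying that on a systemd-based node looks like this (a sketch; assumes systemd manages NetworkManager):

# restart NetworkManager so it re-reads conf.d and stops managing the tunnel
systemctl restart NetworkManager

# the device should now show as unmanaged and stay up once calico-node brings it up
nmcli device status
ip link show vxlan.calico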

nightkr added a commit to Appva/kubespray that referenced this issue Dec 15, 2020
See projectcalico/calico#3271

Otherwise Calico can get into a fight with NM about who "owns" the vxlan.calico
interface, breaking all pod traffic.

Cherry-pick of 2715041c1bbef92cbc2b796eee40c1fc4a51f523
nightkr added a commit to Appva/kubespray that referenced this issue Dec 22, 2020
See projectcalico/calico#3271

Otherwise Calico can get into a fight with NM about who "owns" the vxlan.calico
interface, breaking all pod traffic.
k8s-ci-robot pushed a commit to kubernetes-sigs/kubespray that referenced this issue Dec 23, 2020
See projectcalico/calico#3271

Otherwise Calico can get into a fight with NM about who "owns" the vxlan.calico
interface, breaking all pod traffic.
LuckySB pushed a commit to southbridgeio/kubespray that referenced this issue Jan 17, 2021
…gs#7037)

See projectcalico/calico#3271

Otherwise Calico can get into a fight with NM about who "owns" the vxlan.calico
interface, breaking all pod traffic.