
vxlan interface always down, causing calico-node to fail to add routes for pods on other nodes #3271

Closed
weizhoublue opened this issue Feb 21, 2020 · 1 comment

weizhoublue commented Feb 21, 2020

calico version: 3.11.2
k8s version: 1.17

I have a k8s cluster with 3 nodes, which uses the Calico CNI.

2 Calico IPPools are set up, both of them in VXLAN mode:

# calicoctl get ippool -o wide
NAME                      CIDR            NAT    IPIPMODE   VXLANMODE   DISABLED   SELECTOR   
default-ipv4-ippool       172.29.0.0/16   true   Never      Always      false      all()      
kube-system-ipv4-ippool   172.28.0.0/16   true   Never      Always      false      all()
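
For reference, a pool in that mode corresponds to a manifest like the following (a sketch reconstructed from the get output above; the actual manifests used in this cluster are not shown):

# recreate the default pool shown above
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 172.29.0.0/16
  natOutgoing: true
  ipipMode: Never
  vxlanMode: Always
EOF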

The strange thing: the vxlan.calico interface on one node always stays down. I tried to set it up, but after about 3 seconds the tunnel interface goes down again. That makes calico-node fail to add routes for the other 2 nodes.
After I delete the calico-node daemonset, I can set the vxlan.calico interface up. However, when I apply the calico-node daemonset again, the vxlan.calico interface on the same node stays down again, and I cannot set it up. So I think calico-node found something wrong and keeps forcing the vxlan.calico interface down.
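
To see what calico-node might be objecting to, the device's VXLAN attributes (VNI, destination port, local address) can be dumped with plain iproute2 (a diagnostic sketch, nothing Calico-specific):

# dump detailed VXLAN attributes of the tunnel device
ip -d link show vxlan.calico

# bring it up by hand and watch netlink events as calico-node reacts
ip link set vxlan.calico up
ip monitor link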

However, the other 2 nodes work fine.
BTW, my other cluster with the same Calico configuration does not have this issue.

Finally, I found a workaround: delete the tunnel interface, which triggers calico-node to recreate it, and then everything works fine.

This cluster environment may have had nodes scaled up or Calico reinstalled. Maybe the tunnel interface was previously created with a VXLAN configuration that is wrong for the current setup, so deleting it triggers Calico to regenerate the right one, as sketched below.
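
Concretely, the workaround amounts to the following (assuming calico-node is running on the node so it recreates the device):

# delete the stale tunnel device; calico-node notices and recreates it
# with the VXLAN settings it currently expects
ip link delete vxlan.calico

# verify the recreated device and the returning pod routes
ip -d link show vxlan.calico
ip route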

The calico-node log of the bad node is below:

2020-02-21 13:51:39.739 [INFO][50] int_dataplane.go 849: Received interface update msg=&intdataplane.ifaceUpdate{Name:"vxlan.calico", State:"up"}
2020-02-21 13:51:39.740 [INFO][50] int_dataplane.go 962: Applying dataplane updates
2020-02-21 13:51:39.740 [INFO][50] int_dataplane.go 597: Linux interface state changed. ifaceName="vxlan.calico" state="down"
2020-02-21 13:51:39.745 [INFO][50] route_table.go 577: Syncing routes: adding new route. ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.29.122.0/26
2020-02-21 13:51:39.745 [WARNING][50] route_table.go 604: Failed to add route error=network is down ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.29.122.0/26
2020-02-21 13:51:39.745 [INFO][50] route_table.go 577: Syncing routes: adding new route. ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.28.122.0/26
2020-02-21 13:51:39.745 [WARNING][50] route_table.go 604: Failed to add route error=network is down ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.28.122.0/26
2020-02-21 13:51:39.745 [INFO][50] route_table.go 577: Syncing routes: adding new route. ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.29.193.64/26
2020-02-21 13:51:39.745 [WARNING][50] route_table.go 604: Failed to add route error=network is down ifaceName="vxlan.calico" ipVersion=0x4 targetCIDR=172.29.193.64/26
2020-02-21 13:51:39.745 [INFO][50] route_table.go 247: Trying to connect to netlink

The route table of the bad node is below; it only has local pod routes and is missing the subnet routes for pods on other nodes:

# ip r
default via 10.6.0.1 dev dce-mng proto static metric 100 
default via 10.7.0.1 dev ens224 proto static metric 101 
10.6.0.0/16 dev dce-mng proto kernel scope link src 10.6.150.57 metric 100 
10.7.0.0/16 dev ens224 proto kernel scope link src 10.7.117.177 metric 100 
169.254.0.0/16 dev parcel-vip proto kernel scope link src 169.254.232.109 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
172.28.70.128 dev calia1e9256f877 scope link 
172.28.70.129 dev cali23d0198f573 scope link 
172.28.70.130 dev cali63579a15e5c scope link 
172.28.70.131 dev calie07fbc9b479 scope link 

NetworkManager has no configuration for this tunnel:

# nmcli con show
NAME                 UUID                                  TYPE            DEVICE
Wired connection 1  a7eba6a4-1e06-3217-a476-8fb0326c9a38  802-3-ethernet  ens224  
dce-mng             7adf786c-6968-4d2c-aba0-8f0e89606097  802-3-ethernet  dce-mng 
docker0             d0d7ca77-316e-4260-8b52-8bf6a811ea3f  bridge          docker0 

Expected Behavior

The tunnel interface should stay up, and routes should be added for pods on other nodes.

Current Behavior

The tunnel interface stays down, and pods on the bad node cannot access pods on other nodes.

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

  • Calico version: 3.11.2
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes 1.17
  • Operating System and version:
  • Link to your project (optional):
rgarcia89 commented

@weizhoublue
Try adding the following to /etc/NetworkManager/conf.d/calico.conf (the unmanaged-devices key belongs in the [keyfile] section):

[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*;interface-name:vxlan.calico

After that, restart NetworkManager and check whether the vxlan.calico interface comes up.
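
For completeness, applying that on a systemd-based node looks like this (a sketch; assumes systemd manages NetworkManager):

# restart NetworkManager so it re-reads conf.d and stops managing the tunnel
systemctl restart NetworkManager

# the device should now show as unmanaged and stay up once calico-node brings it up
nmcli device status
ip link show vxlan.calico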

nightkr added a commit to Appva/kubespray that referenced this issue Dec 15, 2020
See projectcalico/calico#3271

Otherwise Calico can get into a fight with NM about who "owns" the vxlan.calico
interface, breaking all pod traffic.

Cherry-pick of 2715041c1bbef92cbc2b796eee40c1fc4a51f523
nightkr added a commit to Appva/kubespray that referenced this issue Dec 22, 2020
See projectcalico/calico#3271

Otherwise Calico can get into a fight with NM about who "owns" the vxlan.calico
interface, breaking all pod traffic.
k8s-ci-robot pushed a commit to kubernetes-sigs/kubespray that referenced this issue Dec 23, 2020
See projectcalico/calico#3271

Otherwise Calico can get into a fight with NM about who "owns" the vxlan.calico
interface, breaking all pod traffic.
LuckySB pushed a commit to southbridgeio/kubespray that referenced this issue Jan 17, 2021
…gs#7037)

See projectcalico/calico#3271

Otherwise Calico can get into a fight with NM about who "owns" the vxlan.calico
interface, breaking all pod traffic.