Failure in Pod connectivity when an additional IP address on the primary interface in the same subnet is added and removed #8739
Comments
This does seem strange, especially considering the interface still has an IP address within that subnet even after the VIP is removed. Seems likely to be related to BIRD's / BGP next hop calculation rather than Felix though. Perhaps worth looking into whether or not the remote nodes have changed the next hop address on the advertised BGP routes as well, in case it's a peer issue rather than an issue with the local route resolution.
@caseydavenport yes, you are right, it seems to be an issue in BIRD. The peers are not impacted; it's only the local route resolution. For now, as a workaround, I am adding a static route for the subnet.
Our BIRD fork is based on an upstream BIRD version (v1.6.8) that is now a little old, and it's possible that this has been fixed in upstream BIRD since v1.6.8. If an interested party would like to investigate that and identify the relevant change (if there is one), we could certainly look at cherry-picking that to our fork.
@nelljerram thank you for pointing out the possible bug. It is indeed a bug in v1.6.8.
Below is the result from bird2
The setup can be recreated here
Felix is incorrectly removing the directly connected route when it detects that an IP address is deleted, even if there are additional addresses in the same subnet on the interface.
This is causing critical failures in the field.
Expected Behavior
Pod connectivity should not break.
I have a 3 node IPv6 Kubernetes cluster with a VIP managed via Keepalived. When things are stable, the routing table is intact: the pod subnets for the other nodes have their next hop correctly set to the node IP address.
VIP - fd74:ca9b:3a09:868c:10:9:121:136
Primary Node IP - fd74:ca9b:3a09:868c:10:9:61:181
Routing Table on the Host
-------------------------
Bird routing table in the Calico Pod
-----------------------------------
If the VIP now moves to a different node, the directly connected route goes missing from the BIRD routing table in the Calico Pod. Because of this, the Pod subnet routes are configured with the default gateway as the next hop, which breaks Pod connectivity.
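To make the failure mode concrete, here is a simplified model of next hop resolution by longest-prefix match. It is only a sketch and not BIRD's actual code. The peer node address and the default gateway `fe80::1` are hypothetical values chosen for illustration; the /64 subnet is the one from this report.

```python
import ipaddress

def resolve_next_hop(routes, bgp_next_hop):
    """Pick the gateway used to reach bgp_next_hop via longest-prefix match.

    routes is a list of (prefix, gateway) pairs; gateway is None for a
    directly connected route. Returns None if nothing matches.
    """
    target = ipaddress.ip_address(bgp_next_hop)
    matches = []
    for prefix, gateway in routes:
        network = ipaddress.ip_network(prefix)
        if target in network:
            matches.append((network, gateway))
    if not matches:
        return None
    # The most specific matching prefix wins.
    _, gateway = max(matches, key=lambda m: m[0].prefixlen)
    # A connected route means the next hop is reachable directly.
    return bgp_next_hop if gateway is None else gateway

PEER = "fd74:ca9b:3a09:868c:10:9:61:182"   # hypothetical peer node address
DEFAULT_GW = "fe80::1"                      # hypothetical default gateway

# Stable state: the connected /64 route is present.
stable = [("fd74:ca9b:3a09:868c::/64", None), ("::/0", DEFAULT_GW)]
# After the VIP is removed: the connected /64 route has disappeared.
broken = [("::/0", DEFAULT_GW)]

print(resolve_next_hop(stable, PEER))
print(resolve_next_hop(broken, PEER))
```

With the connected route present, the peer is used as the next hop directly. Without it, only the default route matches, so the lookup falls back to the default gateway, which matches the symptom described above.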
VIP is now moved to a different node
Host Routing Table - Note that the next hop for the pod subnets is configured as the default gateway
-----------------------------------------------------------------------------------------------------
Bird Routing table in the Calico Pod - The directly connected route for subnet fd74:ca9b:3a09:868c::/64 is missing
-------------------------------------------------------------------------------------------------------------------
I am able to reproduce the issue even without VIP movement. I just have to add an additional IP address in the same subnet to the primary interface and then remove it. It looks like when Felix detects that an IP address has been removed, it incorrectly removes the directly connected route entry even if there are additional IP addresses in the same subnet on the interface.
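The behavior I would expect can be sketched as follows. This is a minimal illustration and not code from Felix or BIRD; the function names are hypothetical. The /64 prefix length is taken from the subnet mentioned above. A connected prefix should only be withdrawn when the last address in that prefix leaves the interface.

```python
import ipaddress

def connected_prefixes(addresses):
    """Return the set of connected networks implied by interface addresses."""
    return {ipaddress.ip_interface(a).network for a in addresses}

def prefixes_to_withdraw(before, removed):
    """Return the networks that lose their last address when `removed` goes away."""
    after = [a for a in before if a != removed]
    return connected_prefixes(before) - connected_prefixes(after)

PRIMARY = "fd74:ca9b:3a09:868c:10:9:61:181/64"
VIP = "fd74:ca9b:3a09:868c:10:9:121:136/64"

# Removing the VIP while the primary address remains: nothing to withdraw.
print(prefixes_to_withdraw([PRIMARY, VIP], VIP))

# Removing the only address in the subnet: the /64 route should go.
print(prefixes_to_withdraw([PRIMARY], PRIMARY))
```

In the first case the result is an empty set, so the directly connected route for fd74:ca9b:3a09:868c::/64 should stay in place.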
Your Environment