A customer reported that a calico/node got into a state where the
expected iptables programming for a HostEndpoint was missing. They created a
bridge device, and a HostEndpoint to protect that, and initially the
expected iptables programming was present. However, after some period
of churn, which we believe included
- deleting and re-adding the bridge device
- setting the state of the bridge device down and later up again
- creating and deleting many real or simulated pods/containers on that
node,
it was observed that the HostEndpoint iptables rules were missing, and
that setting the bridge device down and up again made no difference. The
missing rules could only be recreated by restarting the calico/node.
We identified a bug that would account for this if all of the following
points are also true:
1. On one occasion when the bridge device is deleted, the Netlink event
for that is lost (or delayed for a long time), and the removal of the
device is instead detected by the periodic resync that Felix performs.
The bug is that the handling on this resync path differs from the
handling when processing an individual Netlink event.
2. When the bridge device is added again, it has the same addresses and
interface index as before it was deleted, and either there is no Netlink
event reporting the addresses, or Felix sees that event before it sees
the event about the device reappearing.
3. Later, when the bridge device is set down and up again, either there
are no address changes that ensue from that (which is unusual, because
Linux devices normally do IPv6 autoconfiguration, which means that an
IPv6 address is added when the device goes up, and taken away when the
device goes down), or there is some other reason why Netlink address
events are not generated, or don't reach Felix.
This change adds a unit test (UT) case that reproduces that scenario,
and demonstrates that we are missing a non-nil address callback when the
bridge device is added again. In Felix as a whole, the absence of that
callback means that HostEndpoint iptables are not reprogrammed.
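
For illustration only, the scenario can be sketched with a toy model in
Go. This is not Felix's code: the toyMonitor type, its methods and the
replay helper are all hypothetical, and the specific mechanism shown (the
resync path leaving a stale cached address set behind) is an assumption
made so that the sketch has something concrete to demonstrate. The
description above only says that the resync handling differs from the
per-event handling.

```go
package main

import "fmt"

// toyMonitor is a hypothetical, heavily simplified interface monitor.
// It is NOT Felix's implementation; it only models the sequence of
// events described in the text.
type toyMonitor struct {
	addrs map[int][]string // last address set reported, keyed by ifindex
	calls []string         // record of address callbacks, in order
}

func newToyMonitor() *toyMonitor {
	return &toyMonitor{addrs: map[int][]string{}}
}

// notify records an address callback; nil means "interface gone".
func (m *toyMonitor) notify(name string, addrs []string) {
	if addrs == nil {
		m.calls = append(m.calls, name+": nil")
		return
	}
	m.calls = append(m.calls, fmt.Sprintf("%s: %v", name, addrs))
}

// linkDeletedEvent models the per-event path: all cached state for the
// interface is dropped and a nil callback is made.
func (m *toyMonitor) linkDeletedEvent(idx int, name string) {
	delete(m.addrs, idx)
	m.notify(name, nil)
}

// linkDeletedByResync models the resync path. ASSUMPTION: it makes the
// nil callback but leaves the cached address set in place.
func (m *toyMonitor) linkDeletedByResync(idx int, name string) {
	m.notify(name, nil)
}

// linkAdded reports the addresses only if they differ from the cache.
func (m *toyMonitor) linkAdded(idx int, name string, addrs []string) {
	if old, ok := m.addrs[idx]; ok && sameAddrs(old, addrs) {
		return // looks like "no change", so no callback is made
	}
	m.addrs[idx] = addrs
	m.notify(name, addrs)
}

func sameAddrs(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// replay adds a bridge, deletes it (seen via an event or via resync),
// then re-adds it with the same ifindex and the same addresses.
func replay(viaResync bool) []string {
	m := newToyMonitor()
	m.linkAdded(7, "br0", []string{"10.0.0.1"})
	if viaResync {
		m.linkDeletedByResync(7, "br0")
	} else {
		m.linkDeletedEvent(7, "br0")
	}
	m.linkAdded(7, "br0", []string{"10.0.0.1"})
	return m.calls
}

func main() {
	fmt.Println("delete seen via event: ", replay(false))
	fmt.Println("delete seen via resync:", replay(true))
}
```

In this toy model the resync variant records one callback fewer: the
final non-nil address callback for the re-added bridge never happens,
which corresponds to the symptom that the new UT case checks for.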