Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix missing iptables after a HostEndpoint-protected interface is deleted and re-added #1883

Merged
merged 5 commits into from
Aug 23, 2018

Commits on Aug 22, 2018

  1. Clarify ifacemonitor_test logging

    Neil Jerram committed Aug 22, 2018
    Configuration menu
    Copy the full SHA
    78a00e0 View commit details
    Browse the repository at this point in the history
  2. UT when resync spots that a link has been deleted

    A customer reported a calico/node having got into a state where expected
    iptables programming for a HostEndpoint was missing.  They created a
    bridge device, and a HostEndpoint to protect that, and initially the
    expected iptables programming was present.  However, after some period
    of churn, which we believe included
    
    - deleting and re-adding the bridge device
    - setting the state of the bridge device down and later up again
    - creating and deleting many real or simulated pods/containers on that
      node,
    
    it was observed that the HostEndpoint iptables were missing, and that
    setting the bridge device down and up made no difference to this.  The
    missing iptables could only be recreated by restarting the calico/node.
    
    We identified a bug that would account for this if all of the following
    points are also true:
    
    1. One time when the bridge device is deleted, the Netlink event for
    that is lost (or delayed for a long time), and the removal of the device
    is then detected by a period resync that the Felix code does.  The bug
    is that the exact handling here is different than when processing an
    individual Netlink event.
    
    2. When the bridge device is added again, it has the same addresses and
    interface index as before it was deleted, and either there is no Netlink
    event reporting the addresses, or Felix sees that event before it sees
    the event about the device reappearing.
    
    3. Later, when the bridge device is set down and up again, either there
    are no address changes that ensue from that (which is unusual, because
    Linux devices normally do IPv6 autoconfiguration, which means that an
    IPv6 address is added when the device goes up, and taken away when the
    device goes down), or there is some other reason why Netlink address
    events are not generated, or don't reach Felix.
    
    This change adds a UT case to repro that scenario, and demonstrates the
    result that we are missing a non-nil address callback when the bridge
    device is added again.  In Felix as a whole, the absence of that
    callback means that HostEndpoint iptables are not reprogrammed.
    Neil Jerram committed Aug 22, 2018
    Configuration menu
    Copy the full SHA
    47921d6 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    6adc271 View commit details
    Browse the repository at this point in the history

Commits on Aug 23, 2018

  1. Code review markups

    Neil Jerram committed Aug 23, 2018
    Configuration menu
    Copy the full SHA
    4199045 View commit details
    Browse the repository at this point in the history
  2. Fix ifacemonitor UT flake

    The test code's intention is to test when a deleted link is detected in
    a resync.  However there was a race, and the following was also
    possible:
    
    1. Kick off a resync.
    2. Kick off link deletion.
    3. Resync queries links and finds that the link is question is still
    present.
    4. Link deletion now proceeds, and state for that link deleted.
    5. Resync queries addresses for that link and finds that there aren't
    any.
    
    resulting in an addr callback with set.New(), instead of an addr
    callback with nil.
    
    We can now use delLinkNoSignal to ensure that the original intended
    scenario reliably happens, so that's what this change does.
    Neil Jerram committed Aug 23, 2018
    Configuration menu
    Copy the full SHA
    0db5b57 View commit details
    Browse the repository at this point in the history