Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't call DeleteHostSubnetRules on a replayed(?) Deleted event #18617

Merged

Conversation

danwinship
Copy link
Contributor

In some as-yet-undiagnosed set of circumstances, HostSubnet Deleted events are being replayed:

I0210 19:29:59.006531   36434 sdn_controller.go:268] DeleteHostSubnetRules for ip-172-31-52-121.us-west-2.compute.internal (host: "ip-172-31-52-121.us-west-2.compute.internal", ip: "172.31.52.121", subnet: "10.130.10.0/23")
I0210 19:30:49.860255   36434 sdn_controller.go:261] AddHostSubnetRules for ip-172-31-52-121.us-west-2.compute.internal (host: "ip-172-31-52-121.us-west-2.compute.internal", ip: "172.31.52.121", subnet: "10.130.18.0/23")
...
I0210 19:59:20.717721   36434 sdn_controller.go:268] DeleteHostSubnetRules for ip-172-31-52-121.us-west-2.compute.internal (host: "ip-172-31-52-121.us-west-2.compute.internal", ip: "172.31.52.121", subnet: "10.130.10.0/23")

(Note that when the HostSubnet was recreated, it got a different subnet (18 vs 10), but the second Deleted event references the old subnet again.)

Currently, when we receive the second Deleted event, we don't notice that it's a replay, and just call DeleteHostSubnetRules again. That does:

otx.DeleteFlows("table=10, tun_src=%s", subnet.HostIP)
otx.DeleteFlows("table=50, arp, nw_dst=%s", subnet.Subnet)
otx.DeleteFlows("table=90, ip, nw_dst=%s", subnet.Subnet)

Since the HostIP is still the same, the first DeleteFlows call ends up deleting the rule that was added by the earlier AddHostSubnetRules call. (The second and third DeleteFlows calls are no-ops, because they're trying to delete flows to the 10 subnet, but the current OVS flows refer to the 18 subnet.) As a result, we end up with no rule to accept VXLAN traffic from that node, even though all the other rules for that node are still present.

Anyway, the fix is easy: don't run DeleteHostSubnetRules if the event doesn't correspond to a currently-known-about HostSubnet.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1544903

@danwinship danwinship added kind/bug Categorizes issue or PR as related to a bug. component/networking sig/networking labels Feb 14, 2018
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 14, 2018
Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
Thanks Dan!

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2018
@knobunc
Copy link
Contributor

knobunc commented Feb 14, 2018

/retest

@@ -103,6 +103,10 @@ func (node *OsdnNode) handleDeleteHostSubnet(obj interface{}) {
if hs.HostIP == node.localIP {
return
}
if _, exists := node.hostSubnetMap[string(hs.UID)]; !exists {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we still delete the rules for the subnet portion? Just skip the host part of the openflow?
I am not sure if those rules are really non-existent, while we clearly do not want the host part to be deleted of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SubnetAllocator always returns the lowest-numbered available subnet, so deleted subnets will get reused right away, so it's possible you could get a replayed delete that mentioned a subnet that was already in use again by a different node.

(And we can't easily check both HostIP and Subnet when doing the deletes, because the table 10 rules don't include the subnet, and the table 50/90 rules only include the hostIP in a non-matchable-against field.)

Yeah, I'm vaguely concerned that this patch will cause us to ignore some delete that we shouldn't ignore, but that really shouldn't be possible... if we added OVS flows, then we also added the subnet to hostSubnetMap, so if it's not in hostSubnetMap, then we didn't add OVS flows, so we shouldn't delete them...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I feel that this fix make things better. May be not ideal, but better than what we see today.

@rajatchopra
Copy link
Contributor

/lgtm
/approve

@dcbw
Copy link
Member

dcbw commented Feb 14, 2018

/lgtm though this will be interesting if/when we start supporting multiple hostsubnets per node, right?

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, dcbw, knobunc, rajatchopra

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@dcbw
Copy link
Member

dcbw commented Feb 14, 2018

/test gcp

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@pravisankar
Copy link

pravisankar commented Feb 14, 2018

This may have more repercussions. We need to look at how replaying of these events affect in other handleDeleteResource() methods.

For example: Consider handleDeleteNode() (code snippet below)

func (master *OsdnMaster) handleDeleteNode(obj interface{}) {
    node := obj.(*kapi.Node)
    glog.V(5).Infof("Watch %s event for Node %q", watch.Deleted, node.Name)
    delete(master.hostSubnetNodeIPs, node.UID)

    if err := master.deleteNode(node.Name); err != nil {
        utilruntime.HandleError(fmt.Errorf("Error deleting node %s: %v", node.Name, err))
        return
    }
}

If we get replay of these node events:

  1. Node deleted event: Node{ name: N1, UID: XXX }
  2. Node added event: Node{ name: N1, UID: YYY } (same node name but different UID)
  3. Node deleted event: Node{ name: N1, UID: XXX }

Step-3 will delete the node created in step-2 which is incorrect (as these are different nodes with same name)?

@danwinship @knobunc @rajatchopra @dcbw

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue (batch tested with PRs 18505, 18617, 18604).

@openshift-merge-robot openshift-merge-robot merged commit 7189b62 into openshift:master Feb 15, 2018
@openshift-merge-robot
Copy link
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@danwinship
Copy link
Contributor Author

This may have more repercussions. We need to look at how replaying of these events affect in other handleDeleteResource() methods.

Yeah, there's more going on here, but I think the question is "why are we getting replayed HostSubnet delete events", not "what happens when other resource types get replayed delete events", because the evidence seems to suggest that they just don't.

@danwinship danwinship deleted the hostsubnet-tracking branch February 15, 2018 12:59
@danwinship danwinship changed the title Don't call DeleteHostSubnetRules on a replayed Deleted event Don't call DeleteHostSubnetRules on a replayed(?) Deleted event Feb 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. component/networking kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. sig/networking size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants