fix bz1389267 - release subnet leases upon hostsubnet delete #11628
Conversation
@openshift/networking Review please
LGTM.
lgtm
[test]
if err != nil {
	return fmt.Errorf("Error parsing subnet %q for node %q for deletion: %v", sub.Subnet, nodeName, err)
}
master.subnetAllocator.ReleaseNetwork(ipnet)
deleteNode() is called from watchNodes(), and this will break the normal case when the node is deleted.
Also, I don't like the fact that addNode() assigns the subnet but deleteNode() doesn't revoke it.
Maybe separate subnet assign/revoke from the addNode/removeNode methods; then you can:
- call revoke subnet from watchSubnets, and
- call revoke subnet + delete node object from watchNodes
@pravisankar It will not break the normal case when the node is deleted, because the hostsubnet is also deleted, and watchSubnets in the master will actually release the subnet.
addNode assigns the subnet because it is the place where the hostsubnet is created. If we always end up creating the hostsubnet without the 'subnet' field in it, then we will have to change the hostsubnet validator too. That is something we can consider later on; it isn't necessary now.
Yeah, I don't think it "breaks" the normal case; it just means we don't release the subnet until a few seconds later, which should not be a problem at all.
Is there a chance a rapid delete/recreate of a node could accidentally delete the wrong subnet because it observes the node delete?
Remember event delivery can be spaced very differently than the API calls that caused those events. Is it possible for something else to delete/recreate a node and subnet, and could this observe the original deletion and delete the wrong subnet? How do we recover in that circumstance?
hostsubnet is an object by itself. The actual subnet it refers to just needs to be checked back into the pool, and its release can be delayed indefinitely with no problem (except that we may run out of subnets in the CIDR if release is super slow).
If the node is reborn, it will be given a new subnet with a new hostsubnet object. And if the original subnet has still not been released (because the hostsubnet delete event was not delivered), the node will just never be able to get its old lease back. There was never a guarantee of that anyway.
I'm not describing the scenario clearly :)
t=0, node exists, subnet exists
t=1, node is deleted
t=2, node is created (tries to allocate/create new subnet? gets AlreadyExists error, keeps existing subnet?)
t=3, node deletion event is delivered
t=4, deleteNode() looks up and deletes the subnet
t=5, watchSubnets() sees the subnet deletion event, calls DeleteHostSubnetRules
It looks like we can get into a situation where we have a node without an allocated subnet. Is that the case? How does the system recover from that?
HostSubnets aren't created synchronously when Nodes are created, they're created by the same Node watch that does HostSubnet deletion. So I think we're ok:
t=0, node exists, subnet exists
t=1, node is deleted
t=2, node is created
t=3, node deletion event is delivered, SDN master deletes the HostSubnet but does not yet release the IP range
t=4, node creation event is delivered, SDN master calls HostSubnets.Get(), sees that no HostSubnet exists for the node, and creates a new one, allocating a new IP range
t=5, HostSubnet deletion event is delivered, SDN master releases the (old) IP range, SDN nodes delete OVS rules for that node
t=6, HostSubnet creation event is delivered, SDN nodes add new OVS rules for the node
(Except that if you have used up your entire clusterNetworkCIDR, then this sequence of events would fail, because at t=4 there wouldn't be any IP subnets available yet. But in the non-racy case where you actually let the deletion get fully processed before recreating the node, it's fine.)
flake is #10988 (I think). [test]
flake #10327 [test]
Evaluated for origin test up to 546183c
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10892/) (Base Commit: 899f139)
[merge]
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10892/) (Image: devenv-rhel7_5305)
Evaluated for origin merge up to 546183c
@danwinship Could you please take a look at this one?
I removed the CIDR release from 'deleteNode' and put it in the actual loop that watches for hostsubnets.
We now have hostsubnets that are created/deleted outside of node creation/deletion.
Solves https://bugzilla.redhat.com/show_bug.cgi?id=1389267