Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple auto-egress-IP fixes #18808

Merged
merged 5 commits into from Mar 8, 2018

Conversation

@danwinship
Copy link
Contributor

commented Mar 2, 2018

OK, 3 "setup" commits followed by 2 commits with fixes:

  • the first and second commits only change egressip_test.go, so clearly don't break the actual functioning of auto-egress-IP
  • the third commit ("Have multiple egress IP ovscontroller methods rather than one confusing one") is small and fairly self-explanatory.
  • the fourth commit ("Fix egressip handling when a NetNamespace is updated") involves a small fix to egressip.go and then a larger update to egressip_test.go to show that the fix works. Specifically: if the EgressIPs field on a NetNamespace was changed while that egress IP was active, then we did not clean up all of the state associated with the old egress IP, because we were accidentally cleaning up the state associated with the new egress IP instead.
  • the last commit ("Rearrange egressip internals, add duplication tests") is very large and complicated and needs the most review, but in particular note that all of the existing tests still pass with the new code, and it adds two more tests (which don't pass with the old code). This fixes the problem that we were previously mostly doing the right thing when the user screwed up and added a duplicate EgressIP, but we weren't doing the right thing when they removed the duplicates, because we weren't keeping enough information about the global merged node/namespace EgressIP state. I think the end result is actually simpler than the old code.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1551028
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1547899
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1543786
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1520363

@openshift/sig-networking PTAL

danwinship added 4 commits Feb 27, 2018
This makes the tests a little bit more self-documenting, and makes it
simpler to add new tests. Also, change it to only look at OVS flows in
the egress table, so that when there is a mismatch, the error message
is of a manageable length.
Split out one test case, add some new (already-passing) test cases,
add some comments, and remove some double-checking-of-internal-state
that will break with upcoming rewrites.
(We handled delete+recreate correctly, but not update.)
@knobunc
knobunc approved these changes Mar 2, 2018
Copy link
Contributor

left a comment

LGTM

@knobunc

This comment has been minimized.

Copy link
Contributor

commented Mar 2, 2018

/assign @dcbw

@knobunc knobunc self-assigned this Mar 2, 2018
ns.assignedIP = ""
ns.nodeIP = ""
if ns.requestedIP != "" {
eg := eip.egressIPs[ns.requestedIP]

This comment has been minimized.

Copy link
@dcbw

dcbw Mar 2, 2018

Member

Are we sure eg isn't going to be nil here?

if oldNode := eip.nodesByEgressIP[ip]; oldNode != nil {
utilruntime.HandleError(fmt.Errorf("Multiple nodes claiming EgressIP %q (nodes %q, %q)", ip, node.nodeIP, oldNode.nodeIP))
continue
eg := eip.egressIPs[ip]

This comment has been minimized.

Copy link
@dcbw

dcbw Mar 2, 2018

Member

I almost want:

	eg := eip.ensureEgressIPInfo(ip)
	if eg == nil {
		return
	}
	...
}

func (eip *egressIPWatcher) ensureEgressIPInfo(ip string) *egressIPInfo {
	if ip == "" {
		return nil
	}
	eg := eip.egressIPs[ip]
	if eg == nil {
		eg = &egressIPInfo{ip: ip}
		eip.egressIPs[ip] = eg
	}
	return eg
}

which you'd also use down below.

}

eip.nodesByEgressIP[ip] = node
eip.maybeAddEgressIP(ip)
if len(eg.nodes) != 0 {

This comment has been minimized.

Copy link
@dcbw

dcbw Mar 2, 2018

Member

Also kinda want:

func (eg *egressIPInfo) addNode(node *nodeEgress, ip string) {
	if len(eg.nodes) != 0 {
		utilruntime.HandleError(fmt.Errorf("Multiple nodes claiming EgressIP %q (nodes %q, %q)", ip, node.nodeIP, eg.nodes[0].nodeIP))
	}
	eg.nodes = append(eg.nodes, node)
}

func (eg *egressIPInfo) delNode(node *nodeEgress) {
	for i := range eg.nodes {
		if eg.nodes[i] == node {
			eg.nodes = append(eg.nodes[:i], eg.nodes[i+1:]...)
			return
 		}
 	}
}

and the corresponding functions for namespaces.

// The egressIPInfo should have an assigned node IP if and only if the
// egress IP is active (ie, it is assigned to exactly 1 node and exactly
// 1 namespace).
egressIPActive := (len(eg.nodes) == 1 && len(eg.namespaces) == 1)

This comment has been minimized.

Copy link
@dcbw

dcbw Mar 2, 2018

Member

And if you do choose to go with struct egressIPInfo functions, some of the stuff here could be done as functions too. Obviously it can go too far, but maybe a bit cleaner. Your call.

maybe even

changed := eg.syncNodeIP(eip.masqueradeBit, eip.localIP, eip.assignEgressIP, eip.releaseEgressIP)

func (eg *egressIPInfo) syncNodeIP(masqBit uint32, localIP string, assign func(string, string) error, func(string, string) error) bool {
   ...
}
@danwinship danwinship force-pushed the danwinship:egressip-fixes branch from b23627e to 7e5a3c0 Mar 5, 2018
@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 5, 2018

I took most of your suggestions. I didn't change syncEgressIP much other than to split it into two subfunctions, one for iptables stuff and one for OVS stuff. The OVS stuff doesn't really split further very well because the blockedVNIDs state has to flow through it so they'd just have to be passed around if we did split it up further

There should never be multiple HostSubnets or multiple NetNamespaces
claiming the same egress IP, but if there are, we need to track them
carefully so we don't get out sync with reality after things are
fixed.
@danwinship danwinship force-pushed the danwinship:egressip-fixes branch from 7e5a3c0 to a21509b Mar 5, 2018
@knobunc

This comment has been minimized.

Copy link
Contributor

commented Mar 5, 2018

The changes LGTM too. @dcbw?

@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 5, 2018

/retest

@dcbw

This comment has been minimized.

Copy link
Member

commented Mar 5, 2018

/lgtm

@openshift-ci-robot

This comment has been minimized.

Copy link

commented Mar 5, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, dcbw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 5, 2018

/retest

Copy link

left a comment

Added some comments, rest of the changes LGTM

ns.requestedIP = egressIP
eip.namespacesByEgressIP[egressIP] = ns
eip.maybeAddEgressIP(egressIP)
if egressIP == "" {

This comment has been minimized.

Copy link
@pravisankar

pravisankar Mar 5, 2018

I think this is to handle delete case where ns is no longer referenced in eip.namespacesByVNID?
To be clear, may be we could assign ns.requestedIP = egressip after this check

This comment has been minimized.

Copy link
@danwinship

danwinship Mar 6, 2018

Author Contributor

yes, you're right, the ns.requestedIP assignment is irrelevant (but harmless) in this case

@@ -266,44 +233,120 @@ func (eip *egressIPWatcher) updateNamespaceEgress(vnid uint32, egressIP string)

ns := eip.namespacesByVNID[vnid]

This comment has been minimized.

Copy link
@pravisankar

pravisankar Mar 5, 2018

Need to consider join and isolate project operations otherwise there will be some problems.
Example:

  1. ns1 with vnid v1 requested egrees ip e1
  2. ns2 with vnid v2 requested egress ip e2
  3. ns2 joined with ns1, now ns2 has vnid v1
  4. eip.namespacesByVNID[v1] will now have requestedIP=e2
  5. ns2 got isolated

Now, v1 has only one namespace ns1 but the requested ip will be set as ep2 instead of ep1 which is incorrect.

This comment has been minimized.

Copy link
@danwinship

danwinship Mar 6, 2018

Author Contributor

filed #18852


func (eg *egressIPInfo) addNode(node *nodeEgress) {
if len(eg.nodes) != 0 {
utilruntime.HandleError(fmt.Errorf("Multiple nodes claiming EgressIP %q (nodes %q, %q)", eg.ip, node.nodeIP, eg.nodes[0].nodeIP))

This comment has been minimized.

Copy link
@pravisankar

pravisankar Mar 5, 2018

Can we clarify that only the first node that claimed the egress IP will have egress rules?
(syncEgressIPTablesState() checks eg.nodes[0].nodeIP)

This comment has been minimized.

Copy link
@danwinship

danwinship Mar 6, 2018

Author Contributor

actually none of them will; syncEgressIPTablesState() only looks at eg.nodes[0] when len(eg.nodes)==1

This comment has been minimized.

Copy link
@pravisankar
// 1 namespace).
egressIPActive := (len(eg.nodes) == 1 && len(eg.namespaces) == 1)
assignedNodeIPChanged := false
if egressIPActive && eg.assignedNodeIP != eg.nodes[0].nodeIP {

This comment has been minimized.

Copy link
@pravisankar

pravisankar Mar 5, 2018

There could be very rare corner case when F5 is involved.
Scenario:

  1. F5 node n1(no real backing node only hostSubnet object) requested egress ip e1
  2. node n2 requested same egress ip e1
  3. eg.nodes == [n1, n2], n1 claims the egress ip e1
  4. SDN controller calls delete + create on F5 node n1 (removes AssignHostSubnetAnnoation and assigns vnid)
  5. eg.nodes == [n2, n1] now n2 claims the egress ip e1 (side effect of how we handle F5 case)

This is theoretical case but may never happen in practice. I don't think we need to fix this case.

This comment has been minimized.

Copy link
@danwinship

danwinship Mar 6, 2018

Author Contributor

as above, there's no actual corner case here because egressIPActive would be false at both step 3 and step 5

@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 5, 2018

/retest


func (eg *egressIPInfo) addNode(node *nodeEgress) {
if len(eg.nodes) != 0 {
utilruntime.HandleError(fmt.Errorf("Multiple nodes claiming EgressIP %q (nodes %q, %q)", eg.ip, node.nodeIP, eg.nodes[0].nodeIP))

This comment has been minimized.

Copy link
@pravisankar
@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 6, 2018

/test end_to_end

@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 6, 2018

/retest

3 similar comments
@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 6, 2018

/retest

@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 6, 2018

/retest

@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 6, 2018

/retest

@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 6, 2018

/retest

1 similar comment
@knobunc

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

/retest

@knobunc

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

/hold

@knobunc

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

/hold cancel

@danwinship

This comment has been minimized.

Copy link
Contributor Author

commented Mar 7, 2018

/retest

@openshift-merge-robot

This comment has been minimized.

Copy link
Contributor

commented Mar 8, 2018

Automatic merge from submit-queue (batch tested with PRs 18780, 18802, 18391, 18832, 18808).

@openshift-merge-robot openshift-merge-robot merged commit 6b99107 into openshift:master Mar 8, 2018
16 checks passed
16 checks passed
Submit Queue Queued to run github e2e tests a second time.
Details
ci/openshift-jenkins/cmd Jenkins job succeeded.
Details
ci/openshift-jenkins/cross Skipped
ci/openshift-jenkins/end_to_end Jenkins job succeeded.
Details
ci/openshift-jenkins/extended_builds Skipped
ci/openshift-jenkins/extended_clusterup Skipped
ci/openshift-jenkins/extended_conformance_install Jenkins job succeeded.
Details
ci/openshift-jenkins/extended_image_ecosystem Skipped
ci/openshift-jenkins/extended_image_registry Skipped
ci/openshift-jenkins/extended_networking_minimal Jenkins job succeeded.
Details
ci/openshift-jenkins/gcp Jenkins job succeeded.
Details
ci/openshift-jenkins/integration Jenkins job succeeded.
Details
ci/openshift-jenkins/service-catalog Skipped
ci/openshift-jenkins/unit Jenkins job succeeded.
Details
ci/openshift-jenkins/verify Jenkins job succeeded.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@danwinship danwinship deleted the danwinship:egressip-fixes branch Jul 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.