Bug 1825355: node/vnids: Correctly handle case where NetNamespace watch is far behind #134

squeed · 2020-04-27T15:51:44Z

When adding a pod, if the NetNamespace isn't found, we'll issue a GET directly to the apiserver and treat it as an ADD. Except we didn't actually handle it correctly, and caused NetworkPolicy to ignore this NetNS forever.

Fixes: rhbz 1825355

openshift-ci-robot · 2020-04-27T15:51:51Z

@squeed: This pull request references Bugzilla bug 1825355, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1825355: node/vnids: Correctly handle case where NetNamespace watch is far behind

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

squeed · 2020-04-27T15:52:16Z

@JacobTanenbaum @danwinship you've both touched this recently

danwinship · 2020-04-27T19:45:21Z

pkg/network/node/vnids.go

@@ -119,7 +119,7 @@ func (vmap *nodeVNIDMap) WaitAndGetVNID(name string) (uint32, error) {
 			return 0, fmt.Errorf("failed to find netid for namespace: %s, %v", name, err)
 		}
 		klog.Warningf("Netid for namespace: %s exists but not found in vnid map", name)
-		vmap.setVNID(netns.Name, netns.NetID, netnsIsMulticastEnabled(netns))
+		vmap.handleAddOrUpdateNetNamespace(netns, nil, watch.Added)


I don't think it's legitimate to call handleAddOrUpdateNetNamespace from here. In fact, it's definitely not, as seen by the fact that you had to change a bunch of other places to make it work. But we can't just change places that call WaitAndGetVNID to call getVNID instead and expect everything will keep working.

Maybe the fix is to just remove the setVNID call here. Though I think if we were going to do that I'd want to make the backoff shorter...

I had considered that. We cache NetNamespaces in two places (networkpolicy.go and vnids.go), and doing that would make those caches diverge. It's not clear what the implication of such a divergence is, given that the code is so tightly coupled.
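The divergence concern can be illustrated with a small, self-contained sketch. All types and method names below (`vnidMap`, `policyCache`, `setDirect`, `setAndNotify`) are hypothetical stand-ins, not the actual code in vnids.go or networkpolicy.go. It only shows that writing to one cache without notifying the other leaves the second cache unaware of the namespace.

```go
package main

import (
	"fmt"
	"sync"
)

// vnidMap is a hypothetical stand-in for the first cache.
type vnidMap struct {
	mu       sync.Mutex
	ids      map[string]uint32
	onChange []func(name string, id uint32) // subscribers to updates
}

func newVNIDMap() *vnidMap {
	return &vnidMap{ids: make(map[string]uint32)}
}

// setDirect updates only this cache; subscribers never hear about it.
func (m *vnidMap) setDirect(name string, id uint32) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.ids[name] = id
}

// setAndNotify updates this cache and then informs every subscriber.
func (m *vnidMap) setAndNotify(name string, id uint32) {
	m.mu.Lock()
	m.ids[name] = id
	subs := append([]func(string, uint32){}, m.onChange...)
	m.mu.Unlock()
	// Callbacks run without the lock held so they may call back into m.
	for _, fn := range subs {
		fn(name, id)
	}
}

// policyCache is a hypothetical stand-in for the second cache.
type policyCache struct {
	mu    sync.Mutex
	known map[string]uint32
}

func newPolicyCache(m *vnidMap) *policyCache {
	p := &policyCache{known: make(map[string]uint32)}
	m.mu.Lock()
	m.onChange = append(m.onChange, func(name string, id uint32) {
		p.mu.Lock()
		defer p.mu.Unlock()
		p.known[name] = id
	})
	m.mu.Unlock()
	return p
}

func (p *policyCache) has(name string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	_, ok := p.known[name]
	return ok
}

func main() {
	m := newVNIDMap()
	p := newPolicyCache(m)
	m.setDirect("ns-a", 10)    // second cache is never told
	m.setAndNotify("ns-b", 11) // both caches agree
	fmt.Println(p.has("ns-a"), p.has("ns-b")) // false true
}
```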

I actually think it's an error for networkPolicyPlugin.initNamespaces() to call WaitAndGetVNID(), because we're still in startup and haven't even added our handlers yet.

To clarify, the flow in npp.Start() is:

1. vmap.Start(), which calls vmap.populateVNIDs(), which does a synchronous List and calls vmap.setVNID()
2. npp.initNamespaces(), which does a synchronous List
3. The rest of the informers are configured

Not the prettiest. So I'm not surprised we have deadlocks. But that's why I think it's wrong to call an informer handler before we "expect" to see informers running.
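One way to make that expectation explicit is a guard that rejects handler calls until startup has finished. The sketch below is purely illustrative and is not what this patch does: `plugin`, `populate`, `startInformers`, `handleAdd`, and `errNotStarted` are hypothetical names, and the synchronous List is simulated with a plain map.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNotStarted is a hypothetical error for handlers invoked too early.
var errNotStarted = errors.New("handler invoked before informers were started")

// plugin is a hypothetical component with a startup phase and a running phase.
type plugin struct {
	mu      sync.Mutex
	started bool
	vnids   map[string]uint32
}

func newPlugin() *plugin {
	return &plugin{vnids: make(map[string]uint32)}
}

// populate stands in for the synchronous List at startup.
// It fills the cache directly and does not go through any handler.
func (p *plugin) populate(initial map[string]uint32) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for name, id := range initial {
		p.vnids[name] = id
	}
}

// startInformers marks the point after which handlers are allowed to run.
func (p *plugin) startInformers() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.started = true
}

// handleAdd is what an informer would call for an Added event.
func (p *plugin) handleAdd(name string, id uint32) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if !p.started {
		return errNotStarted
	}
	p.vnids[name] = id
	return nil
}

func main() {
	p := newPlugin()
	p.populate(map[string]uint32{"default": 0})
	fmt.Println(p.handleAdd("early", 7)) // rejected: still in startup
	p.startInformers()
	fmt.Println(p.handleAdd("late", 8)) // <nil>
}
```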

ok, but you need to separate out the startup time vs non-startup time behavior more. Currently, at startup the behavior is:

if a VNID is missing from the cache, pointlessly wait 5 seconds, then fetch it manually

With this patch, it becomes

if a VNID is missing from the cache, abort openshift-sdn startup

> if a VNID is missing from the cache, abort openshift-sdn startup

We swallow errors (and always have), so that's not a risk.

squeed · 2020-04-28T18:52:30Z

@danwinship I switched the locking around a bit, to make the difference between startup and running clearer.

squeed · 2020-04-29T13:54:20Z

@danwinship any final thoughts on this?

danwinship · 2020-04-29T14:35:59Z

/lgtm
but the change to WaitAndGetVNID is tricky and this patch should not simply be backported as fast as possible.

openshift-ci-robot · 2020-04-29T14:36:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, squeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [danwinship,squeed]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2020-04-29T16:13:20Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-04-29T18:23:26Z

/retest

Please review the full test history for this PR and help us cut down flakes.

nee1esh · 2020-04-29T19:45:40Z

/retest

openshift-bot · 2020-04-29T21:12:30Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2020-04-29T22:41:12Z

@squeed: All pull requests linked via external trackers have merged: openshift/sdn#134. Bugzilla bug 1825355 has been moved to the MODIFIED state.

This is a backport of openshift#134 When adding a pod, if the NetNamespace isn't found, we'll issue a GET directly to the apiserver and treat it as an ADD. Except we didn't actually handle it correctly, and caused NetworkPolicy to ignore this NetNS forever. Fixes: rhbz 1839107

Backport of openshift#134 When adding a pod, if the NetNamespace isn't found, we'll issue a GET directly to the apiserver and treat it as an ADD. Except we didn't actually handle it correctly, and caused NetworkPolicy to ignore this NetNS forever. Fixes: rhbz 1389109

openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 27, 2020

openshift-ci-robot requested review from JacobTanenbaum and pecameron April 27, 2020 15:52

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2020

squeed force-pushed the fix-slow-netns-watch branch from 8eae9c1 to a97f507 Compare April 27, 2020 15:52

danwinship suggested changes Apr 27, 2020

squeed force-pushed the fix-slow-netns-watch branch from a97f507 to b5f89a6 Compare April 28, 2020 12:21

openshift-ci-robot assigned danwinship Apr 29, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 29, 2020

openshift-merge-robot merged commit c0456d4 into openshift:master Apr 29, 2020

squeed mentioned this pull request May 26, 2020

Bug 1839107: node/vnids: Correctly handle case where NetNamespace watch is far behind #143

Merged

squeed mentioned this pull request May 26, 2020

Bug 1839109: [backport-4.3] node/vnids: Correctly handle case where NetNamespace watch is far behind #144

Merged
