Swap out `nsenter` with Go-native code #264

nicholascioli · 2023-07-17T10:26:23Z

As described in this issue, this commit removes the need for the nsenter binary on the host for linkerd to function in CNI mode. Read-only filesystems and OSs without the nsenter binary (such as Talos OS) cannot currently run linkerd without this change.

I tried running the integration tests, but couldn't get the complete integration test to work even with no changes. That could be because I was running docker on Windows and passing the socket into WSL which then passed it into the dev container, but I don't see why that wouldn't work.

Also, if you prefer that I use the code mentioned here instead, please let me know. I'm not sure how to test this on Talos itself, but am definitely willing to try!

kflynn · 2023-10-26T15:39:54Z

Augh, so sorry for the lack of response here! @nicholascioli, can you fix the conflicts and DCO? I'll see if I can get a maintainer looking at this in the meantime. 🙂

This commit swaps out the previous implementation of using the command `nsenter` to switch networking namespaces. By using the Go impl of network namespace switching, read-only filesystems and OSs without the `nsenter` command (such as Talos) should still work. Signed-off-by: Nicholas Cioli <nicholascioli@users.noreply.github.com>

nicholascioli · 2023-11-04T22:37:47Z

No worries! I've rebased and added the DCO, but I am not a go expert and had to run go mod tidy to add the direct dependency on CNI plugins. Let me know if you prefer something else for the go.mod / go.sum files.

mateiidavid

@nicholascioli thanks a lot for the change! I'll be honest, this change is kinda scary on paper. Normally we require only one approval for proxy-init changes but I'll want someone else to review this too.

I spent a bit of time going through write-ups on this topic to better understand the problem. Leaving this as a summary for other folks interested in reviewing this change.

For its goroutine model, Go ships with a userspace scheduler. The userspace scheduler is responsible for distributing N routines on M threads (N:M multiplexing model). To achieve this, each OS thread will receive a scheduling context. The scheduler is greedy, it may steal work from other threads. In particular when a thread will block (e.g. syscall) its work will be stolen by another kernel thread. If no thread exists (in the cache or actively doing work) then a new one will be created.

As the folks from Weave have noticed this doesn't go well (no pun intended) with network namespace switching logic. Most of the network namespace or cgroup operations require system calls. If a program repeatedly executes in arbitrary namespaces, it is possible that during syscall execution the runtime will create a new thread. Each thread operates in its own network namespace. This increases the chances that a thread may operate in a network namespace it is not supposed to (yuck). To make matters worse, once a goroutine is unblocked, it goes back into a global queue which means any thread may pick it back up. A go routine may switch the network namespace it is operating in by virtue of trapping into the kernel (double yuck).

There are some options here, such as using runtime.LockOSThread() which effectively pins a goroutine to an OS thread. This doesn't mean the operation is not volatile, each bit of code needs to ensure it locks the thread and doesn't unlock it midpoint in the execution.

I'm split about this change. On the one hand, I see the use of bringing this functionality into the CNI plugin to unblock operating systems that do not ship with nsenter. On the other, I'm reticent to rubber stamp a change that can cause UB for users running this in production on operating systems that support nsenter. To me the problem doesn't seem to be entirely fixed, we just have more determinism than before.

This looks fine, but perhaps before we go further we should answer some questions:

Should this be feature flagged? Can this break anyone's system? If it does, how do we go back to behaviour that doesn't break?
How can we improve logging here? If this happens in a system, will we spend weeks debugging this? Can we rely on people to run strace on their hosts? Should we include more logs?
Are there any automated tests we can come up with to increase our confidence in shipping this?

I know that's a lot to put on you @nicholascioli, I think some of these decisions can be offloaded onto the team (especially the first) but your input is highly valued as the person submitting the changeset. Lastly, for tests, I'm happy to shepherd or to give more concrete ideas. It is also something we might be able to do as a follow-up.

mateiidavid · 2023-11-10T14:47:51Z

internal/iptables/iptables.go

+			log.Errorf("could not switch to target network namespace \"%s\": %s", firewallConfiguration.NetNs, err.Error())
+			return nil, err
+		}
+


Can we instrument the logger with the current network namespace we're executing in? I'd err on the side of caution here since ub in this space will be incredibly painful to debug without good logging.

mateiidavid · 2023-11-10T14:50:03Z

internal/iptables/iptables.go

+		//
+		// Note: Result needs to be defined here since `netNs.Do` only returns
+		// an error.
+		result := make([]byte, 0)


can we do var result []byte? Even if it's a zero length array we still allocate a pointer that is then set to whatever cmd.CombinedOutput returns, so it might be redundant?

mateiidavid · 2023-11-10T14:51:24Z

internal/iptables/iptables.go

+
+		return result, err
+	} else {
+		return doCommand()


Should we log at the debug level that we're executing in the root network namespace?

alpeb

Thanks @nicholascioli for tackling this and @mateiidavid for the detailed writeup.

It seems this follows the recommended instructions from the ns package docs. The Weave article is indeed scary, in particular about the OS thread lock workaround:

One could argue that locking a goroutine with runtime.LockOSThread could help, but a) the goroutine might spawn a new goroutine which would run in a wrong namespace b) locking does not prevent the runtime from creating a new OS thread for scheduling.

But as pointed out in the ns docs and in golang/go#20676, new threads won't be spawned from locked Ms after go 1.10. And IIUC, the linkerd-cni binary is executed from scratch for each event (ADD pod, which is what we only care about), so it's not a long-lived process where concurrent runtime.LockOSThread could conflict.

I second @mateiidavid's recommendations to fully log whatever we can here. I would also love to have linkerd/linkerd2#8892 resuscitated to actually have some logs to look at! 😉

I think putting this behind a flag disabled by default would make sense while we build more confidence about it. And I also need to address linkerd/linkerd2#11567 so we can actually run this in CI 😣

nicholascioli requested a review from a team as a code owner July 17, 2023 10:26

olix0r assigned mateiidavid Nov 2, 2023

nicholascioli force-pushed the main branch from 1ab66a5 to 2ea820c Compare November 4, 2023 22:34

nicholascioli force-pushed the main branch from 2ea820c to 993182c Compare November 4, 2023 22:36

mateiidavid reviewed Nov 10, 2023

View reviewed changes

alpeb reviewed Nov 10, 2023

View reviewed changes

alpeb mentioned this pull request Dec 12, 2023

Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. linkerd/linkerd2#11735

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Swap out `nsenter` with Go-native code #264

Swap out `nsenter` with Go-native code #264

nicholascioli commented Jul 17, 2023

kflynn commented Oct 26, 2023

nicholascioli commented Nov 4, 2023

mateiidavid left a comment

mateiidavid Nov 10, 2023

mateiidavid Nov 10, 2023

mateiidavid Nov 10, 2023

alpeb left a comment

Swap out nsenter with Go-native code #264

Are you sure you want to change the base?

Swap out nsenter with Go-native code #264

Conversation

nicholascioli commented Jul 17, 2023

kflynn commented Oct 26, 2023

nicholascioli commented Nov 4, 2023

mateiidavid left a comment

Choose a reason for hiding this comment

mateiidavid Nov 10, 2023

Choose a reason for hiding this comment

mateiidavid Nov 10, 2023

Choose a reason for hiding this comment

mateiidavid Nov 10, 2023

Choose a reason for hiding this comment

alpeb left a comment

Choose a reason for hiding this comment

Swap out `nsenter` with Go-native code #264

Swap out `nsenter` with Go-native code #264