Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swap out nsenter with Go-native code #264

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

nicholascioli
Copy link

As described in this issue, this commit removes the need for the nsenter binary on the host for linkerd to function in CNI mode. Read-only filesystems and OSs without the nsenter binary (such as Talos OS) cannot currently run linkerd without this change.

I tried running the integration tests, but couldn't get the complete integration test to work even with no changes. That could be because I was running docker on Windows and passing the socket into WSL which then passed it into the dev container, but I don't see why that wouldn't work.

Also, if you prefer that I use the code mentioned here instead, please let me know. I'm not sure how to test this on Talos itself, but am definitely willing to try!

@nicholascioli nicholascioli requested a review from a team as a code owner July 17, 2023 10:26
@kflynn
Copy link
Member

kflynn commented Oct 26, 2023

Augh, so sorry for the lack of response here! @nicholascioli, can you fix the conflicts and DCO? I'll see if I can get a maintainer looking at this in the meantime. 🙂

This commit swaps out the previous implementation of using the
command `nsenter` to switch networking namespaces. By using the
Go impl of network namespace switching, read-only filesystems and
OSs without the `nsenter` command (such as Talos) should still
work.

Signed-off-by: Nicholas Cioli <nicholascioli@users.noreply.github.com>
@nicholascioli
Copy link
Author

No worries! I've rebased and added the DCO, but I am not a go expert and had to run go mod tidy to add the direct dependency on CNI plugins. Let me know if you prefer something else for the go.mod / go.sum files.

Copy link
Member

@mateiidavid mateiidavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicholascioli thanks a lot for the change! I'll be honest, this change is kinda scary on paper. Normally we require only one approval for proxy-init changes but I'll want someone else to review this too.

I spent a bit of time going through write-ups on this topic to better understand the problem. Leaving this as a summary for other folks interested in reviewing this change.

For its goroutine model, Go ships with a userspace scheduler. The userspace scheduler is responsible for distributing N routines on M threads (N:M multiplexing model). To achieve this, each OS thread will receive a scheduling context. The scheduler is greedy, it may steal work from other threads. In particular when a thread will block (e.g. syscall) its work will be stolen by another kernel thread. If no thread exists (in the cache or actively doing work) then a new one will be created.

As the folks from Weave have noticed this doesn't go well (no pun intended) with network namespace switching logic. Most of the network namespace or cgroup operations require system calls. If a program repeatedly executes in arbitrary namespaces, it is possible that during syscall execution the runtime will create a new thread. Each thread operates in its own network namespace. This increases the chances that a thread may operate in a network namespace it is not supposed to (yuck). To make matters worse, once a goroutine is unblocked, it goes back into a global queue which means any thread may pick it back up. A go routine may switch the network namespace it is operating in by virtue of trapping into the kernel (double yuck).

There are some options here, such as using runtime.LockOSThread() which effectively pins a goroutine to an OS thread. This doesn't mean the operation is not volatile, each bit of code needs to ensure it locks the thread and doesn't unlock it midpoint in the execution.

I'm split about this change. On the one hand, I see the use of bringing this functionality into the CNI plugin to unblock operating systems that do not ship with nsenter. On the other, I'm reticent to rubber stamp a change that can cause UB for users running this in production on operating systems that support nsenter. To me the problem doesn't seem to be entirely fixed, we just have more determinism than before.

This looks fine, but perhaps before we go further we should answer some questions:

  • Should this be feature flagged? Can this break anyone's system? If it does, how do we go back to behaviour that doesn't break?
  • How can we improve logging here? If this happens in a system, will we spend weeks debugging this? Can we rely on people to run strace on their hosts? Should we include more logs?
  • Are there any automated tests we can come up with to increase our confidence in shipping this?

I know that's a lot to put on you @nicholascioli, I think some of these decisions can be offloaded onto the team (especially the first) but your input is highly valued as the person submitting the changeset. Lastly, for tests, I'm happy to shepherd or to give more concrete ideas. It is also something we might be able to do as a follow-up.

log.Errorf("could not switch to target network namespace \"%s\": %s", firewallConfiguration.NetNs, err.Error())
return nil, err
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instrument the logger with the current network namespace we're executing in? I'd err on the side of caution here since ub in this space will be incredibly painful to debug without good logging.

//
// Note: Result needs to be defined here since `netNs.Do` only returns
// an error.
result := make([]byte, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do var result []byte? Even if it's a zero length array we still allocate a pointer that is then set to whatever cmd.CombinedOutput returns, so it might be redundant?


return result, err
} else {
return doCommand()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we log at the debug level that we're executing in the root network namespace?

Copy link
Member

@alpeb alpeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nicholascioli for tackling this and @mateiidavid for the detailed writeup.

It seems this follows the recommended instructions from the ns package docs. The Weave article is indeed scary, in particular about the OS thread lock workaround:

One could argue that locking a goroutine with runtime.LockOSThread could help, but a) the goroutine might spawn a new goroutine which would run in a wrong namespace b) locking does not prevent the runtime from creating a new OS thread for scheduling.

But as pointed out in the ns docs and in golang/go#20676, new threads won't be spawned from locked Ms after go 1.10. And IIUC, the linkerd-cni binary is executed from scratch for each event (ADD pod, which is what we only care about), so it's not a long-lived process where concurrent runtime.LockOSThread could conflict.

I second @mateiidavid's recommendations to fully log whatever we can here. I would also love to have linkerd/linkerd2#8892 resuscitated to actually have some logs to look at! 😉

I think putting this behind a flag disabled by default would make sense while we build more confidence about it. And I also need to address linkerd/linkerd2#11567 so we can actually run this in CI 😣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants