New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libnetwork: fix restoring thread network namespaces #44356
libnetwork: fix restoring thread network namespaces #44356
Conversation
testutils.SetupTestOSContext() sets the calling thread's network namespace but neglected to restore it on teardown. This was not a problem in practice as it called runtime.LockOSThread() twice but runtime.UnlockOSThread() only once, so the tampered threads would be terminated by the runtime when the test case returned and replaced with a clean thread. Correct the utility so it restores the thread's network namespace during teardown and unlocks the goroutine from the thread on success. Remove unnecessary runtime.LockOSThread() calls peppering test cases which leverage testutils.SetupTestOSContext(). Signed-off-by: Cory Snider <csnider@mirantis.com>
077e43e
to
055a4cc
Compare
// /proc/self/ns/net, which would break any code which (incorrectly) | ||
// assumes that that file is a handle to the network namespace for the | ||
// thread it is currently executing on. | ||
runtime.LockOSThread() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right.
I get what it is doing here, but I think doing it in a func init()
is not desirable.
I'm also not sure that it's strictly necessary to lock the main goroutine to its thread, but if we want to do that it should go in func main()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling runtime.LockOSThread()
in func main()
, even as the first line, is no guarantee that it will be locked to the main thread. Calling it from an init function is the only supported way to do so under the Go 1 Compatibility Promise. Since it has to go into an init function anyway, locating that init function inside the very source file which contains the code that necessitates it follows the principles of locality and encapsulation.
I'm also not sure that it's strictly necessary to lock the main goroutine to its thread, but if we want to do that it should go in
func main()
It would not be strictly necessary if I was to fix up up all existing code which opens "/proc/self/ns/net"
when it really means fmt.Sprintf("/proc/self/task/%d/ns/net", unix.Gettid())
. But someone else might come along who is unaware of the subtleties hidden away in the libnetwork implementation details and writes new code which opens "/proc/self/ns/net"
. The code will seem to work...most of the time. It might happen to work reliably inside the daemon because some other package has locked func main()
to the main thread. Or the author might be lucky for a while. But there will be a subtle race condition nonetheless, waiting to be unleashed in the form of flaky tests and irreproducible bugs which might manifest inside libnetwork or in some unrelated code.
The runtime cost of locking func main()
to the main thread when not strictly necessary is, at most, one extra idle OS thread. In return, we don't have to worry about the aforementioned subtleties w.r.t. procfs or the risk of race-condition bugs when referencing /proc/self
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing it in a func init()
in a package main
would also be ok.
Also, checkout out /proc/thread-self
for simpler lookup. 😸
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing it in a
func init()
in apackage main
would also be ok.
Why is it not ok to do it in package libnetwork
? I have explained my reasoning; care to rebut any of those points or elaborate on why it "doesn't seem right" to do it my way?
Also, checkout out
/proc/thread-self
for simpler lookup. 😸
/proc/thread-self
was added in Linux 3.17. I am avoiding it to retain compatibility with older kernels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it not ok to do it in package libnetwork?
I have an aversion to func init()
in library code because it forces things to happen for anyone who ever imports it, which I get is specifically what you kinda want here, but I just find it a bit heavy-handed.
The parallel tests were unconditionally unlocking the test case goroutine from the OS thread, irrespective of whether the thread's network namespace was successfully restored. This was not a problem in practice as the unpaired calls to runtime.LockOSThread() peppered through the test case would have prevented the goroutine from being unlocked. Unlock the goroutine from the thread iff the thread's network namespace is successfully restored. Signed-off-by: Cory Snider <csnider@mirantis.com>
func (*network) watchMiss() correctly locks its goroutine to an OS thread before changing the thread's network namespace, but neglects to restore the thread's network namespace before unlocking. Fix this oversight by unlocking iff the thread's network namespace is successfully restored. Prevent the watchMiss goroutine from being locked to the main thread to avoid the issues which would arise if such a situation was to occur. Signed-off-by: Cory Snider <csnider@mirantis.com>
Aside from unconditionally unlocking the OS thread even if restoring the thread's network namespace fails, func (*networkNamespace).InvokeFunc() correctly implements invoking a function inside a network namespace. This is far from obvious, however. func InitOSContext() does much of the heavy lifting but in a bizarre fashion: it restores the initial network namespace before it is changed in the first place, and the cleanup function it returns does not restore the network namespace at all! The InvokeFunc() implementation has to restore the network namespace explicitly by deferring a call to ns.SetNamespace(). func InitOSContext() is a leaky abstraction taped to a footgun. On the one hand, it defensively resets the current thread's network namespace, which has the potential to fix up the thread state if other buggy code had failed to maintain the invariant that an OS thread must be locked to a goroutine unless it is interchangeable with a "clean" thread as spawned by the Go runtime. On the other hand, it _facilitates_ writing buggy code which fails to maintain the aforementioned invariant because the cleanup function it returns unlocks the thread from the goroutine unconditionally while neglecting to restore the thread's network namespace! It is quite scary to need a function which fixes up threads' network namespaces after the fact as an arbitrary number of goroutines could have been scheduled onto a "dirty" thread and run non-libnetwork code before the thread's namespace is fixed up. Any number of (not-so-)subtle misbehaviours could result if an unfortunate goroutine is scheduled onto a "dirty" thread. The whole repository has been audited to ensure that the aforementioned invariant is never violated, making after-the-fact fixing up of thread network namespaces redundant. Make InitOSContext() a no-op on Linux and inline the thread-locking into the function (singular) which previously relied on it to do so. func ns.SetNamespace() is of similarly dubious utility. It intermixes capturing the initial network namespace and restoring the thread's network namespace, which could result in threads getting put into the wrong network namespace if the wrong thread is the first to call it. Delete it entirely; functions which need to manipulate a thread's network namespace are better served by being explicit about capturing and restoring the thread's namespace. Rewrite InvokeFunc() to invoke the closure inside a goroutine to enable a graceful and safe recovery if the thread's network namespace could not be restored. Avoid any potential race conditions due to changing the main thread's network namespace by preventing the aforementioned goroutines from being eligible to be scheduled onto the main thread. Signed-off-by: Cory Snider <csnider@mirantis.com>
The function is a no-op on all platforms. Signed-off-by: Cory Snider <csnider@mirantis.com>
055a4cc
to
22529b8
Compare
Welp, guess it was just wishful thinking. The networkdb tests are still flaky |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM to me overall, I don't really have anything to add except I'm not thrilled by runtime.LockOSThread()
in an init function, but as established in Cory's other PR, this is likely the best way to do things and we just need to get over our squeamishness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, as discussed in the maintainers meeting; thanks!
// I'm not sure what exactly the *intent* of this code was, but it looks very broken. | ||
// Anyway that's why I'm only logging the error and not failing the test. | ||
t.Log(err) | ||
t.Fatalf("Error restoring the current thread's netns: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When running tests locally, TestParallel1
seem to fails for me every time:
# make BIND_DIR=. TESTFLAGS='-v' TESTDIRS='github.com/docker/docker/libnetwork' test-unit
...
=== FAIL: libnetwork TestParallel1 (2.51s)
libnetwork_linux_test.go:988: Error restoring the current thread's netns: bad file descriptor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Letting the test pass would violate the unlocked-thread invariant so the change can't be simply reverted. I'm on it.
- What I did
I ensured that libnetwork consistently maintains the invariant that all unlocked OS threads are "clean" in that they are interchangeable with threads freshly spawned by the Go runtime. I also ensured that threads are restored to a "clean" state and unlocked from goroutines whenever possible.
- How I did it
I audited every instance where
runtime.LockOSThread
,runtime.UnlockOSThread
,unix.Unshare
and packagenetns
are referenced in libnetwork code and fixed any bugs discovered.- How to verify it
CI is green. Maybe the bugs fixed in this PR are the root cause of the flaky libnetwork tests?
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)