runtime: LockOSThread: didn't exit the main thread without Unlock #34031

Closed
fuweid opened this issue Sep 3, 2019 · 2 comments

@fuweid fuweid commented Sep 3, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.9 linux/amd64
$  uname -a
Linux ubuntu-xenial 4.4.0-159-generic #187-Ubuntu SMP Thu Aug 1 16:28:06 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build476618851=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I call runtime.LockOSThread and then syscall.Unshare(syscall.CLONE_NEWNET) (via netns.New) without ever calling UnlockOSThread. The locked thread should be terminated when the goroutine exits, but it is not. The runtime then appears to schedule goroutines on an unsafe thread whose network namespace differs from that of pid 1. Is it possible for goroutines to be scheduled onto the main thread in this state?

I use the following code to reproduce the issue:

  • concurrently create 100 network namespaces (one round)
  • after each round, compare the network namespace of every task with that of pid 1
  • repeat
package main

import (
	"fmt"
	"io/ioutil"
	"runtime"
	"sync"
	"syscall"
	"time"

	"github.com/vishvananda/netns"
)

const (
	rounds = 20000
	groups = 100
)

func main() {
	roundCh := make(chan struct{})
	waitCh := make(chan struct{})

	go monitorNSLeak(roundCh, waitCh)

	var wg sync.WaitGroup
	for i := 0; i < rounds; i++ {
		fmt.Printf("round %v\n", i)

		wg.Add(groups)
		for j := 0; j < groups; j++ {
			go func() {
				// without unlock, the thread should be killed.
				runtime.LockOSThread()

				ns, err := netns.New()
				panicIfErr(err)
				fmt.Printf("task id: %v, ns: %v\n", syscall.Gettid(), ns)
				panicIfErr(ns.Close())
				wg.Done()
			}()
		}
		wg.Wait()

		roundCh <- struct{}{}
		<-waitCh
		time.Sleep(50 * time.Millisecond)
	}
}

// monitorNSLeak checks the ns leak for each run.
func monitorNSLeak(roundCh, waitCh chan struct{}) {
	initNs, err := netns.GetFromPid(1)
	panicIfErr(err)

	pid := syscall.Getpid()
	for {
		<-roundCh
		tasks, err := ioutil.ReadDir(fmt.Sprintf("/proc/%v/task", pid))
		panicIfErr(err)

		for _, task := range tasks {
			pathNs := fmt.Sprintf("/proc/%v/task/%v/ns/net", pid, task.Name())
			gotNs, err := netns.GetFromPath(pathNs)
			panicIfErr(err)

			// check the ns leak
			if !gotNs.Equal(initNs) {
				fmt.Printf("ns leaking on task id %s: expected %s, but got %s\n", task.Name(), initNs, gotNs)
				time.Sleep(10 * time.Minute)
				panic("ns leaking")
			}
			gotNs.Close()
		}

		waitCh <- struct{}{}
	}
}

func panicIfErr(err error) {
	if err != nil {
		panic(err)
	}
}
$ go build . && ./nsleak
....
ns leaking on task id 18922: expected NS(3: 3, 4026531957), but got NS(4: 3, 4026532132)

----------
# other shell
$ ll /proc/1/ns/net                                                                                     
lrwxrwxrwx 1 root root 0 Sep  3 10:55 /proc/1/ns/net -> net:[4026531957]

$ ll /proc/$(pidof nsleak)/task/*/ns/net
lrwxrwxrwx 1 root root 0 Sep  3 12:19 /proc/18922/task/18922/ns/net -> net:[4026532132] <--- here
lrwxrwxrwx 1 root root 0 Sep  3 12:19 /proc/18922/task/18923/ns/net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 Sep  3 12:19 /proc/18922/task/18927/ns/net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 Sep  3 12:19 /proc/18922/task/19024/ns/net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 Sep  3 12:19 /proc/18922/task/19025/ns/net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 Sep  3 12:19 /proc/18922/task/19026/ns/net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 Sep  3 12:19 /proc/18922/task/19027/ns/net -> net:[4026531957]

What did you expect to see?

The locked threads should be terminated, and the remaining threads should have the same network namespace as pid 1.

What did you see instead?

The main thread has a different network namespace. If I understand correctly, goexit0 calls gogo, which ends up in mexit. I checked the code at https://github.com/golang/go/blob/release-branch.go1.12/src/runtime/proc.go#L1247-L1266 and it seems that the main thread is not killed but blocked instead.

As the comment on mexit says, the main thread will not be scheduled to run any goroutine. But is it possible for it to run other goroutines in this case?
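
One way to observe this behavior from user code is sketched below. It is Linux-only and assumes /proc is mounted; exitLocked and threadAlive are hypothetical helper names. A goroutine locks its thread, reports its thread ID, and returns without unlocking. For an ordinary thread the entry under /proc/self/task should disappear shortly afterwards. If the goroutine happened to run on the main thread (tid equal to pid), the entry stays, which matches what the listing above shows.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
	"time"
)

// threadAlive reports whether tid is still listed under /proc/self/task.
func threadAlive(tid int) bool {
	_, err := os.Stat(fmt.Sprintf("/proc/self/task/%d", tid))
	return err == nil
}

// exitLocked starts a goroutine that locks its OS thread and returns
// without unlocking it. It returns the ID of the thread that was locked.
func exitLocked() int {
	tidCh := make(chan int)
	go func() {
		runtime.LockOSThread()
		tidCh <- syscall.Gettid()
		// No UnlockOSThread: the runtime should terminate this thread,
		// unless it is the main thread.
	}()
	return <-tidCh
}

func main() {
	tid := exitLocked()
	isMain := tid == os.Getpid()

	// Thread teardown is asynchronous, so poll for a short while.
	deadline := time.Now().Add(2 * time.Second)
	for threadAlive(tid) && time.Now().Before(deadline) {
		time.Sleep(10 * time.Millisecond)
	}
	fmt.Printf("tid=%d mainThread=%v stillAlive=%v\n", tid, isMain, threadAlive(tid))
}
```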

I can't yet prove with a script that the main thread blocked in mexit has been scheduled to run another goroutine. But I did see connection-refused errors when there were many concurrent requests to create network namespaces.

net ns create requests -> Daemon A <---(client) socket connection (server) ---> Daemon B 

Daemon B is still running, but Daemon A gets connection-refused errors after the concurrent requests. We also found that the main thread of Daemon A has a different network namespace. So I guess the runtime might still schedule goroutines onto the main thread.

Hope this information helps.

@ianlancetaylor ianlancetaylor changed the title runtime.LockOSThread: didn't exit the main thread without Unlock runtime: LockOSThread: didn't exit the main thread without Unlock Sep 3, 2019
@ianlancetaylor ianlancetaylor added this to the Go1.14 milestone Sep 3, 2019

@fuweid fuweid commented Sep 5, 2019

I found the root cause. Daemon A starts Daemon B from a sub-thread with Pdeathsig: syscall.SIGKILL. We don't unlock the thread when we create a new network namespace, so the thread is killed after goexit. If the thread that was used to start the Daemon B process is later scheduled to create a new namespace, it is killed, and Daemon B is killed along with it.

That is why we had connection issues. I said Daemon B was still working because we have a goroutine that monitors it and restarts it if it goes down. Sorry for leaving this part out.

I read the Go runtime code, and it proves my previous assumption wrong. If the main thread were used to run a goroutine after the mexit call, the runtime would panic.

Sorry for any inconvenience!

@fuweid fuweid closed this Sep 5, 2019

@ianlancetaylor ianlancetaylor commented Sep 5, 2019

Thanks for following up.
