os: on unix Process.Kill() can kill the wrong process #13987

Open
ibukanov opened this Issue Jan 17, 2016 · 12 comments


ibukanov commented Jan 17, 2016

This is somewhat similar to issue #9382. On Unix, Process.Kill() just calls Process.Signal(Kill). As https://golang.org/src/os/exec_unix.go#L39 indicates, the Signal function invokes syscall.Kill(p.Pid) outside any lock, after checking whether the process is still running. So between that check and the kill call, the process can finish and Process.Wait() in another thread can return. The OS is then free to reuse the pid for another, unrelated process. If that happens, Process.Kill() kills that process.
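The interleaving described above can be replayed deterministically with a toy model. This is only a sketch: `fakeKernel`, `process`, and `raceVictim` are hypothetical names made up for illustration, and the model does not reflect how the os package or a real kernel is implemented.

```go
package main

import "fmt"

// fakeKernel is a toy stand-in for the OS process table (an assumption made
// for illustration only).
type fakeKernel struct {
	procs map[int]string // pid -> program name
}

// spawn registers a program under pid; calling it again models pid reuse.
func (k *fakeKernel) spawn(pid int, name string) { k.procs[pid] = name }

// reap models a successful wait: the entry is removed and the pid is free.
func (k *fakeKernel) reap(pid int) { delete(k.procs, pid) }

// kill removes whatever currently owns pid and reports its name.
func (k *fakeKernel) kill(pid int) (string, bool) {
	name, ok := k.procs[pid]
	if ok {
		delete(k.procs, pid)
	}
	return name, ok
}

// process models the bookkeeping kept on the Go side.
type process struct {
	pid  int
	done bool
}

// raceVictim replays the problematic interleaving step by step and returns
// the name of the program that ends up receiving the signal.
func raceVictim() string {
	k := &fakeKernel{procs: map[int]string{}}
	p := &process{pid: 4242}
	k.spawn(p.pid, "child")

	// Signal, step 1: the "already done?" check passes.
	proceed := !p.done

	// Meanwhile the child exits, Wait reaps it, and the pid is reused.
	k.reap(p.pid)
	p.done = true
	k.spawn(p.pid, "unrelated")

	// Signal, step 2: it acts on the stale check result.
	if !proceed {
		return ""
	}
	victim, _ := k.kill(p.pid)
	return victim
}

func main() {
	fmt.Println("signal delivered to:", raceVictim())
}
```

In the model, the signal lands on the process named "unrelated" because nothing ties the check in step 1 to the kill in step 2.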

Because of this race, it is impossible to write correct platform-independent Go code that kills a process if it does not terminate within a timeout. In particular, the code fragment from #9382 is not correct on Unix:

func spawnAndKill(exePath string, counter int) error {
    cmd := exec.Command(exePath, fmt.Sprintf("%d", counter))
    err := cmd.Start()
    if err != nil {
        return err
    }
    go func() {
        time.Sleep(1000 * time.Millisecond)
        cmd.Process.Kill()
    }()
    cmd.Wait()
    return nil
}

Member

minux commented Jan 17, 2016


ibukanov commented Jan 18, 2016

@minux On Windows the above code is not racy and works correctly thanks to the fixes that went into #9382. On Linux I can also implement the same semantics correctly by waiting for SIGCHLD and using the non-blocking wait4 syscall, which makes it possible to synchronize with kill attempts. However, this is non-portable.

What I would like to see is a way to write portable Go code that implements the kill-if-not-finished-after-timeout pattern. If this requires a new API, so be it, but at least document loudly that calling Process.Kill and Process.Wait from different threads at the same time can race and should be avoided in portable code.

ianlancetaylor added this to the Unplanned milestone Jan 18, 2016


Member

minux commented Jan 18, 2016


Contributor

ianlancetaylor commented Jan 19, 2016

I think we can do it without a race, and without an extra goroutine, if we add something to the runtime package to count SIGCHLD signals, and provide a way for a goroutine to retrieve the current SIGCHLD count and a way to sleep until the SIGCHLD counter is greater than N.
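The counter idea above can be modelled in user space. The sketch below is hypothetical: `sigchldCounter`, `Inc`, `Load`, and `WaitAbove` are invented names, no such API exists in package runtime, and a plain goroutine stands in for the signal handler.

```go
package main

import (
	"fmt"
	"sync"
)

// sigchldCounter is a hypothetical model of the proposed runtime counter.
type sigchldCounter struct {
	mu   sync.Mutex
	cond *sync.Cond
	n    uint64
}

func newSigchldCounter() *sigchldCounter {
	c := &sigchldCounter{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// Inc records one SIGCHLD delivery and wakes every sleeper.
func (c *sigchldCounter) Inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
	c.cond.Broadcast()
}

// Load returns the current count.
func (c *sigchldCounter) Load() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// WaitAbove sleeps until the count is greater than n, then returns it.
func (c *sigchldCounter) WaitAbove(n uint64) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.n <= n {
		c.cond.Wait()
	}
	return c.n
}

func main() {
	c := newSigchldCounter()
	seen := c.Load()
	go c.Inc() // stands in for the SIGCHLD handler
	fmt.Println("count after wake-up:", c.WaitAbove(seen))
}
```

Under this model, a waiter would presumably read the count, try a non-blocking wait on its child, and sleep in `WaitAbove` only if the child has not exited yet, so a signal arriving between the read and the sleep is not lost.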


ibukanov commented Jan 19, 2016

@minux The point is precisely not to wait for the timeout, since the process typically finishes much faster. In my case the timeout is really for runaway cases, when something bad happens or the system is too busy. I also cannot just start another goroutine that does the kill/wait, as I need the process's return status to decide whether everything is OK.

@ianlancetaylor Note that I am not suggesting changing the current implementation. Rather, I suggest documenting loudly that one should avoid the Wait()/Kill() race in portable code, and then adding a new API that makes the desired functionality possible. For example, a version of exec.Cmd.Wait that takes a channel and waits until either the child finishes or there is input on the channel would be enough to implement all my use cases.


Contributor

ianlancetaylor commented Jan 19, 2016

If we can change the existing API so that it avoids a race, I think that is clearly better than introducing a new API.


Member

minux commented Jan 20, 2016

I still think user code should be able to avoid this race as much as possible.

For example, something like this should be pretty safe without any complications in the runtime, and it also has the benefit that the timeout goroutine exits soon after the process exits, rather than waiting until the timer expires.

func spawnAndKill(exePath string, counter int) error {
    cmd := exec.Command(exePath, fmt.Sprintf("%d", counter))
    err := cmd.Start()
    if err != nil {
        return err
    }
    ch := make(chan bool)
    go func() {
        select {
        case <-time.After(1000 * time.Millisecond):
            cmd.Process.Kill()
        case <-ch:
            return
        }
    }()
    cmd.Wait()
    close(ch)
    return nil
}


ibukanov commented Jan 20, 2016

The above code only reduces the chance of the race; it is still there. Since its probability cannot be made arbitrarily small, one should avoid this pattern in, say, production code.


gopherbot commented Jun 10, 2016

CL https://golang.org/cl/23967 mentions this issue.

gopherbot pushed a commit that referenced this issue Jun 10, 2016

os: on GNU/Linux use waitid to avoid wait/kill race
On systems that support the POSIX.1-2008 waitid function, we can use it
to block until a wait will succeed. This avoids a possible race
condition: if a program calls p.Kill/p.Signal and p.Wait from two
different goroutines, then it is possible for the wait to complete just
before the signal is sent. In that case, it is possible that the system
will start a new process using the same PID between the wait and the
signal, causing the signal to be sent to the wrong process. The
Process.isdone field attempts to avoid that race, but there is a small
gap of time between when wait returns and isdone is set when the race
can occur.

This CL avoids that race by using waitid to wait until the process has
exited without actually collecting the PID. Then it sets isdone, then
waits for any active signals to complete, and only then collects the PID.

No test because any plausible test would require starting enough
processes to recycle all the process IDs.

Update #13987.
Update #16028.

Change-Id: Id2939431991d3b355dfb22f08793585fc0568ce8
Reviewed-on: https://go-review.googlesource.com/23967
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
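The ordering described in the commit message (block until the child has exited without collecting it, mark it done, let in-flight signals drain, then collect the pid) can be sketched with a read/write lock. This is a toy model with illustrative names; it is not copied from package os, and a channel stands in for the waitid call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errDone = errors.New("process already finished")

// proc is a toy model of the wait/signal ordering. Names are illustrative.
type proc struct {
	sigMu  sync.RWMutex  // read-held while a signal is in flight
	done   bool          // set once the child is known to have exited
	exited chan struct{} // closed when the simulated child exits
	kill   sync.Once     // makes the simulated kill idempotent
	reaped bool          // true once the pid has been collected
}

// Signal delivers a (fatal) signal only while the pid cannot be reaped.
func (p *proc) Signal() error {
	p.sigMu.RLock()
	defer p.sigMu.RUnlock()
	if p.done {
		return errDone
	}
	// Wait needs the write lock before reaping, so the pid is still ours.
	p.kill.Do(func() { close(p.exited) })
	return nil
}

// Wait follows the order from the commit message.
func (p *proc) Wait() {
	<-p.exited     // stands in for waitid without collecting the pid
	p.sigMu.Lock() // blocks until every in-flight Signal has finished
	p.done = true
	p.sigMu.Unlock()
	p.reaped = true // stands in for the final wait; the pid may now be reused
}

func main() {
	p := &proc{exited: make(chan struct{})}
	fmt.Println(p.Signal()) // child still running: signal is delivered
	p.Wait()
	fmt.Println(p.Signal()) // child collected: signal is refused
}
```

The point of the model is that `done` is set before the pid is collected, so a Signal call either finishes before the collection starts or sees `done` and returns an error.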

Contributor

ianlancetaylor commented Jun 10, 2016

I believe this race is now fixed on GNU/Linux.

Leaving the issue open as the fix will have to be implemented on other Unix systems. Systems that provide waitid (which should be most of them, since it is in POSIX.1-2008) need to implement that; it's slightly painful since it's not in the syscall package.


gopherbot commented Jun 11, 2016

CL https://golang.org/cl/24021 mentions this issue.


gopherbot commented Jun 11, 2016

CL https://golang.org/cl/24020 mentions this issue.

gopherbot pushed a commit that referenced this issue Jun 13, 2016

Mikio Hara
os: use waitid to avoid wait/kill race on darwin
This change is a followup to https://go-review.googlesource.com/23967
for Darwin.

Updates #13987.
Updates #16028.

Change-Id: Ib1fb9f957fafd0f91da6fceea56620e29ad82b00
Reviewed-on: https://go-review.googlesource.com/24020
Reviewed-by: Ian Lance Taylor <iant@golang.org>

gopherbot pushed a commit that referenced this issue Jun 13, 2016

Mikio Hara
os: use wait6 to avoid wait/kill race on freebsd
This change is a followup to https://go-review.googlesource.com/23967
for FreeBSD.

Updates #13987.
Updates #16028.

Change-Id: I0f0737372fce6df89d090fe9847305749b79eb4c
Reviewed-on: https://go-review.googlesource.com/24021
Reviewed-by: Ian Lance Taylor <iant@golang.org>