In loving memory of Prof. Stan Eisenstat and his legendary course, CS 323.
This document describes the colorful variety of issues that come up when you use child processes, and the solutions that Duct chooses for them. It's intended for users who want to understand Duct's behavior better, and also for library authors who want to compare notes on their own behavior in these cases.
Duct is currently implemented in both Python and Rust, and it aims to be easily portable to other languages. Duct's behavior is generally identical across languages, but this document comments on cases where language differences affect the implementation.
- Reporting errors by default
- Catching pipe errors when writing to standard input
- Cleaning up zombie children
- Making kill thread-safe
- Adding ./ to program names given as relative paths
- Preventing dir from affecting relative program paths on Unix
- Preventing pipe inheritance races on Windows
- Preventing pipe inheritance races on macOS
- Matching platform case-sensitivity for environment variables
- Using IO threads to avoid blocking children
- Killing grandchild processes?
Most programming languages make error checking the default, either by crashing
your program with an exception, or by emitting warnings or compiler errors for
unchecked results. But the child process APIs in most standard libraries
(including Python and Rust) do the opposite, ignoring non-zero exit statuses by
default. That's unfortunate, because many command line utilities helpfully
distinguish between success and failure in their exit status. For example, if
you give the wrong path to a tar command:
> tar xf misspelled_filename.txt
tar: misspelled_filename.txt: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
> echo $?
2
Duct treats a non-zero exit status as an error and propagates it to the caller
by default. For suppressing these errors, Duct provides the unchecked method.
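To make the standard-library default concrete, here is a small sketch that uses only Python's subprocess module. The child command is a stand-in that exits with status 2, like the tar example above. It is not Duct code.

```python
import subprocess
import sys

# Stand-in for a failing command: a child that exits with status 2.
FAILING = [sys.executable, "-c", "import sys; sys.exit(2)"]

# By default, subprocess.run returns normally and the failure is easy to miss.
ignored = subprocess.run(FAILING)

# Opting in to checking turns the same failure into an exception.
try:
    subprocess.run(FAILING, check=True)
    raised = False
except subprocess.CalledProcessError as error:
    raised = True
    status = error.returncode
```

Duct's default corresponds to the second call, and its unchecked method corresponds to the first.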
When writing to a child's stdin, Duct catches and ignores broken pipe errors
(EPIPE). That means it's not an error for the child to exit early without
reading all of its input. Most standard libraries get this right.
Notably on Unix, this requires the process to suppress SIGPIPE.
Implementations in languages that don't suppress SIGPIPE by default (C/C++?)
have no choice but to set a signal handler from library code, which might
conflict with application code or other libraries. There is no good solution to
this problem.
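As a rough sketch of what catching broken pipe errors means, the following uses Python's subprocess module directly. The helper name feed_stdin is hypothetical, and the sketch assumes a Unix-like platform, where Python already ignores SIGPIPE and reports EPIPE as BrokenPipeError. On Windows the same situation may surface as a different OSError, which this sketch does not handle.

```python
import subprocess
import sys

def feed_stdin(proc, data):
    """Write data to proc.stdin, treating an early-exiting reader as success."""
    try:
        proc.stdin.write(data)
    except BrokenPipeError:
        pass  # The child closed its end without reading everything.
    finally:
        try:
            proc.stdin.close()
        except BrokenPipeError:
            pass  # Flushing buffered bytes on close can hit the same error.

# This child exits immediately and never reads its input.
child = subprocess.Popen([sys.executable, "-c", "pass"], stdin=subprocess.PIPE)
feed_stdin(child, b"x" * 1_000_000)
status = child.wait()
```

The input is deliberately larger than a typical pipe buffer, so the write cannot complete before the child exits.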
On Unix platforms (but not Windows) child processes hold some OS resources even
after they exit, until their parent process waits on them and receives their
exit status. The OS will do this cleanup automatically if the parent exits, but
not as long as the parent is alive. These exited-but-un-"reaped" children are
called zombie processes, and
they're a common type of resource leak if you run child processes in the
background (start as opposed to run).
The Python subprocess
module mitigates this by keeping a global list of leaked child
processes
and polling each of them whenever it's about to spawn a new child
process.
The Rust implementation of Duct uses the same strategy. The downside of this
strategy is that it makes process spawning O(n^2) in the worst case,
if the caller leaks lots of long-lived child processes. Children don't enter
the global list as long as you retain a Handle, so most applications won't hit
this case.
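The global-list strategy can be sketched in a few lines of Python. This is an illustration of the idea rather than the code in subprocess or in Duct, and the names remember_leaked, reap_leaked and spawn are hypothetical.

```python
import subprocess
import sys
import threading
import time

_leaked = []                      # children that nobody will wait on
_leaked_lock = threading.Lock()

def remember_leaked(proc):
    """Record a child whose handle is being dropped without a wait."""
    with _leaked_lock:
        _leaked.append(proc)

def reap_leaked():
    """Poll every remembered child and forget the ones that have exited."""
    with _leaked_lock:
        # poll() reaps an exited child and returns None while it still runs.
        _leaked[:] = [p for p in _leaked if p.poll() is None]
        return len(_leaked)

def spawn(argv):
    """Spawn a child, first cleaning up any leaked children that exited."""
    reap_leaked()
    return subprocess.Popen(argv)

remember_leaked(spawn([sys.executable, "-c", "pass"]))
deadline = time.monotonic() + 30
while reap_leaked() and time.monotonic() < deadline:
    time.sleep(0.01)
remaining = reap_leaked()
```

Each spawn polls the whole list, so n spawns with n long-lived leaked children cost on the order of n^2 polls in total.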
An alternative could be to spawn a waiter thread for each leaked child, but
that's more expensive in the common case, and also spawning a thread can fail.
It would be better to share a global waiter thread, but the historical options
for implementing something like that (SIGCHLD or waitpid(-1)) are
off-limits to library code that doesn't own the whole parent process. Polling
Linux pidfds might be the
best modern option, but that API is still new by kernel standards (2019), and
most other Unix platforms have no equivalent.
On Unix-like platforms there's a race condition between kill and waitpid.
If a process exits right before you signal it, a waiting thread might clean it
up and free its PID, and then an unrelated process might immediately reuse that
PID. It's not likely, but all of that could happen just before the call to
kill, and you might end up killing the unrelated process. This race condition
is why the Rust standard library doesn't allow shared access to child
processes.
It's possible to avoid this race using a newer POSIX API called
waitid.
That function has a WNOWAIT flag that leaves the child in its zombie state,
so that its PID isn't freed for reuse. That gives the waiting thread a chance
to set a flag to block further kills, before reaping the child. Duct uses this
approach on Unix-like platforms. Windows doesn't have this problem.
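Here is a minimal sketch of the WNOWAIT technique in Python, assuming a Unix-like platform where os.waitid is available. The SharedChild class and its state names are made up for this illustration and are not Duct's actual implementation. The sketch signals with os.kill rather than Popen.kill, so that nothing reaps the child behind its back.

```python
import os
import signal
import subprocess
import sys
import threading

class SharedChild:
    """Hypothetical wrapper letting several threads wait on and kill one child."""

    def __init__(self, proc):
        self._proc = proc
        self._cond = threading.Condition()
        self._state = "running"  # "running" -> "waiting" -> "exited"

    def wait(self):
        with self._cond:
            while self._state == "waiting":
                self._cond.wait()  # another thread is already in waitid
            if self._state == "exited":
                return self._proc.returncode
            self._state = "waiting"
        # Block without holding the lock. WNOWAIT leaves the child a zombie,
        # so its PID cannot be reused yet and concurrent kills stay harmless.
        os.waitid(os.P_PID, self._proc.pid, os.WEXITED | os.WNOWAIT)
        with self._cond:
            self._state = "exited"      # block further kills first...
            status = self._proc.wait()  # ...and only then reap the child
            self._cond.notify_all()
            return status

    def kill(self):
        with self._cond:
            if self._state != "exited":
                os.kill(self._proc.pid, signal.SIGKILL)

child = SharedChild(
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
)
waiter = threading.Thread(target=child.wait)
waiter.start()
child.kill()   # safe even while the other thread is blocked in wait
waiter.join()
child.kill()   # does nothing: the child has already been reaped
status = child.wait()
```

The important ordering is in wait: the flag is set and the child is reaped under the same lock that kill takes, so a kill can never land after the PID has been freed.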
As part of a best-effort check for this bug, Python 3.9 changed the
behavior of Popen.kill to reap child
processes that have already exited. That interacts
poorly
with code that calls os.waitid or os.waitpid directly.
When you run the command foo, it can be ambiguous whether you mean ./foo in
the current directory or e.g. /usr/bin/foo in the PATH. Different platforms do
different things here: Unix-like platforms usually require the leading ./ for
programs in the current directory, but Windows will accept a bare filename.
Duct defers to the platform for interpreting program names that are given as
strings, but it prepends ./ to program names that are given as path types
(pathlib in Python, std::path in Rust) when the path is relative.
This solves two problems:
- It prevents "command not found" errors on Unix-like platforms for paths to
  programs in the current directory. This is especially important in Python,
  where pathlib.Path automatically strips leading dots.
- It prevents paths to a nonexistent local file, which should result in
  "command not found", from instead matching a program in the %PATH% on
  Windows.
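The rule above is small enough to sketch directly. The function name program_name is hypothetical. The sketch relies on os.path.join leaving absolute paths untouched.

```python
import os
from pathlib import Path, PurePath

def program_name(program):
    """Turn a program argument into the string handed to the OS."""
    if isinstance(program, PurePath):
        # os.path.join prefixes relative paths and leaves absolute ones alone.
        return os.path.join(".", str(program))
    return program  # plain strings are left for the platform to interpret

stripped = str(Path("./foo"))            # pathlib has dropped the leading "./"
restored = program_name(Path("./foo"))   # the prefix is back
bare = program_name("foo")               # strings are passed through
absolute = program_name(Path(os.path.abspath("foo")))
```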
Note that Rust 1.58 changed the
behavior
of std::process::Command to exclude the current directory from the search
path on Windows.
Windows and Unix take different approaches to setting a child's working
directory. The CreateProcess function on Windows has an explicit
lpCurrentDirectory argument, while most Unix platforms call chdir in
between fork and exec. Unfortunately, those two approaches give different
results when you have a relative path to the child executable. On Windows the
path is interpreted from the parent's working directory, but on Unix it's
interpreted from the child's.
The Windows behavior is preferable, because it lets you add a dir argument
without breaking any existing relative program paths. Duct provides this
behavior on all platforms, by canonicalizing relative program paths on
Unix-like platforms when the dir method is in use.
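A sketch of that rewrite might look like the following. The name resolve_program is hypothetical, and to stay short the sketch applies the rewrite on every platform rather than only on Unix-like ones. It treats a name as a path only if it contains a separator, so bare names still go through the usual PATH lookup.

```python
import os

def resolve_program(program, dir=None):
    """Return the program string to execute when the child runs in `dir`."""
    separators = [os.path.sep] + ([os.path.altsep] if os.path.altsep else [])
    is_path = any(sep in program for sep in separators)
    if dir is not None and is_path and not os.path.isabs(program):
        # Resolve against the parent's working directory, before any chdir.
        return os.path.realpath(program)
    return program

# "subdir" is only a placeholder here; nothing changes directory.
with_dir = resolve_program("./tool", dir="subdir")
without_dir = resolve_program("./tool")
bare = resolve_program("tool", dir="subdir")
```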
Spawning child processes on both Unix and Windows involves marking their pipes
"inheritable" in some way. In the Unix fork-exec model, the forked child
process gets copies of all the parent's pipes, and it marks the ones that it
wants to remain open after exec. Windows doesn't use forking, however, and
making pipes inheritable happens within the parent process itself.
Unfortunately, that means that any child spawned while a pipe is inheritable
will inherit it, which is a race condition in multithreaded
programs.
One child might accidentally receive a copy of another child's stdin pipe,
preventing the other child from reading EOF and leading to deadlocks. The Rust
standard library has an internal
mutex
to prevent this race, but the Python standard library does
not. In Python, Duct uses its own
internal mutex to prevent this race. That doesn't prevent races with other
libraries, but at least multiple Duct callers on different threads are
protected.
Update: Windows 7 added
PROC_THREAD_ATTRIBUTE_HANDLE_LIST,
a whitelist/allowlist for handles that a child process will inherit. As of
Python 3.7, setting close_fds=True (also now the default) in
subprocess.Popen uses this feature to support inheriting stdin/stdout/stderr
while avoiding unintentional inheritance. This does prevent races with other
libraries, as long as they go through Python's subprocess module or use a
similar technique. As of v1.0.1, Duct no longer has its own workarounds for
these issues.
Unix pipes are "inheritable" by default, but this is an old default from before
multithreading was common, and most applications (that aren't single-threaded
shells) need to override it. Standard library functions like os.pipe() in
Python and std::io::pipe() in Rust take care of this for you. On Linux and
most other Unixes, the mechanism for this is the O_CLOEXEC flag to the
pipe2 system call, which is "atomic": there's no chance for other threads to
observe a pipe after it's created but before CLOEXEC is set. Unfortunately,
macOS doesn't support pipe2. On macOS, opening a pipe and setting CLOEXEC
are two separate syscalls, and there's a brief race condition in between where
other threads spawning child processes can accidentally inherit unrelated
pipes.
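The non-inheritable default is easy to observe from Python. This sketch only inspects the flags on a fresh pipe. It does not demonstrate the macOS race itself.

```python
import os

read_end, write_end = os.pipe()

# Both ends start out non-inheritable (close-on-exec on Unix).
defaults = (os.get_inheritable(read_end), os.get_inheritable(write_end))

# A caller that wants a child to inherit an end has to opt in explicitly.
os.set_inheritable(write_end, True)
opted_in = os.get_inheritable(write_end)

os.close(read_end)
os.close(write_end)
```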
In Python, setting close_fds=True in subprocess.Popen works around this
problem. Accidental inheritance is still possible, but it gets cleaned up right
before exec. Rust doesn't have a similar feature in std::process::Command.
Rather than replicating it, the Rust implementation of Duct uses a global mutex
to prevent pipe opening from overlapping with child spawning on macOS.
Unfortunately, this can only protect pipes that Duct opens itself. Callers who
open their own pipes (to pass them to stdout_file for example), and who might
race with unrelated threads on macOS, need to make their own global mutexes.
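The locking idea can be sketched as follows. It is written in Python only to match the other examples in this document. Python's close_fds already covers this case, so the sketch illustrates the shape of the Rust-side workaround rather than something Python callers need. The names open_pipe and spawn are hypothetical.

```python
import os
import subprocess
import sys
import threading

# One process-wide lock shared by pipe creation and child spawning.
_pipe_and_spawn_lock = threading.Lock()

def open_pipe():
    """Open a pipe while no spawn is in progress."""
    with _pipe_and_spawn_lock:
        return os.pipe()

def spawn(argv, **kwargs):
    """Spawn a child while no pipe is being opened."""
    with _pipe_and_spawn_lock:
        return subprocess.Popen(argv, **kwargs)

read_end, write_end = open_pipe()
child = spawn([sys.executable, "-c", "print('hi')"], stdout=write_end)
os.close(write_end)  # the parent's copy must close for the reader to see EOF
with os.fdopen(read_end, "rb") as reader:
    output = reader.read()
status = child.wait()
```

The lock only helps code that goes through these two functions, which is the same limitation described above for pipes that callers open themselves.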
Environment variable names are case-sensitive on Unix but case-insensitive on
Windows, and Duct tries to respect each platform's behavior. Windows variable
names are also case-preserving, and some system variables are mixed-case by
default (including Path and SystemRoot). Python's os.environ map
uppercases all variable names, and the Python implementation of Duct does the
same. Rust's std::env API preserves casing, and the Rust implementation of
Duct also preserves casing, using an internal OsString wrapper type for
case-insensitive equality and hashing.
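The uppercasing approach on the Python side can be sketched like this. The helpers env_key and set_var are hypothetical, and the case_insensitive flag is exposed as a parameter only so that both platform behaviors can be shown on one machine.

```python
import os

def env_key(name, case_insensitive=(os.name == "nt")):
    """Normalize a variable name the way the platform compares names."""
    return name.upper() if case_insensitive else name

def set_var(env, name, value, case_insensitive=(os.name == "nt")):
    """Set a variable in an environment dict, respecting platform casing."""
    env[env_key(name, case_insensitive)] = value

# Windows-style: "Path" and "PATH" are the same variable.
windows_env = {"PATH": "a"}
set_var(windows_env, "Path", "b", case_insensitive=True)

# Unix-style: they are two different variables.
unix_env = {"PATH": "a"}
set_var(unix_env, "Path", "b", case_insensitive=False)
```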
Duct makes no guarantees about non-ASCII environment variable names. Their behavior is implementation-dependent, platform-dependent, programming language-dependent, and probably also human-language-dependent.
When input bytes are supplied or output bytes are captured, Duct's start
method uses background threads to do IO, so that IO makes progress even if
wait is never called. Duct's reader method doesn't use a thread for
standard output, since that's left to the caller, but it still uses background
threads to supply input bytes or to capture standard error.
Consider the following scenario. You want to spawn two child processes that will talk to each other somehow, for example using the local network. You also want to capture the output of each process. Your code might look like this:
handle1 = cmd("child1").stdout_capture().start()
handle2 = cmd("child2").stdout_capture().start()
output1 = handle1.wait().stdout
output2 = handle2.wait().stdout
If Duct handled captured output without threads, e.g. using a read loop inside
of wait, that code could have a deadlock once the output grew large enough.
(So of course it would pass tests but fail occasionally in production.) Suppose
that the messages the children exchanged with each other were synchronous
somehow, such that blocking one child would eventually block the other. And
suppose that both children had enough output that they could also block if the
parent didn't clear space in their stdout pipe buffers by reading. The call to
handle1.wait would block until child1 was finished. Then child2 would
block writing to stdout, because the parent wouldn't be reading it yet.
Finally, child1 would block on child2, waiting for messages. That would be
a deadlock, and it would probably be difficult to reproduce and debug.
For this reason, the start method must use threads to supply input and
capture output. That guarantees that the parent will never cause its children
to block on output, regardless of its order of operations after start.
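A stripped-down version of the threaded approach, using only the Python standard library, might look like this. CapturingHandle is a hypothetical stand-in for illustration, not Duct's Handle type, and the two children here do not talk to each other. They just produce more output than a pipe buffer usually holds.

```python
import subprocess
import sys
import threading

class CapturingHandle:
    """Hypothetical handle that drains stdout on a background thread."""

    def __init__(self, argv):
        self._proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
        self._chunks = []
        # Reading starts immediately, whether or not wait is ever called.
        self._reader = threading.Thread(target=self._drain)
        self._reader.start()

    def _drain(self):
        self._chunks.append(self._proc.stdout.read())

    def wait(self):
        self._reader.join()
        self._proc.wait()
        return b"".join(self._chunks)

NOISY = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1000000)"]
handle1 = CapturingHandle(NOISY)
handle2 = CapturingHandle(NOISY)
output1 = handle1.wait()  # the second child keeps being drained meanwhile
output2 = handle2.wait()
```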
Also, note that observing that a child process has exited does not guarantee
that its IO pipes will close or that any IO threads using those pipes will
exit. If the child process spawns any grandchild processes (more on those
below), the grandchildren usually inherit copies of the child's IO pipes, and
they can outlive the child and keep those pipes open indefinitely. Non-blocking
methods like
Handle.poll in
Python or
Handle::try_wait
in Rust need to explicitly check whether IO threads have exited before doing
any blocking joins. These situations are also why we can't use files instead of
pipes to capture output: we'd have no way to know whether grandchild processes
were finished writing.
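The check before a blocking join can be as simple as the following sketch. The helper try_join is hypothetical, and the reader thread here waits on an event as a stand-in for an IO thread blocked on a pipe that a grandchild still holds open.

```python
import threading

def try_join(thread):
    """Join the thread only if it has already finished; never block."""
    if thread.is_alive():
        return False
    thread.join()  # returns immediately because the thread is done
    return True

release = threading.Event()
reader = threading.Thread(target=release.wait)  # stand-in for a blocked reader
reader.start()

while_blocked = try_join(reader)  # the "pipe" is still open
release.set()                     # the "pipe" closes
reader.join()
after_exit = try_join(reader)
```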
Killing grandchild processes is currently unsolved. This is something of a disaster area in Unix process
management. Consider the following two scripts. Here's test1.py:
import subprocess
subprocess.run(["sleep", "100"])
And here's test2.py:
import subprocess
import time
p = subprocess.Popen(["python", "./test1.py"])
time.sleep(1)
p.kill()
p.wait()
That is, test1.py starts a sleep child process and then waits on it. And
test2.py starts test1.py, waits for a second, and then kills it. The
question is, if you run test2.py, what happens to the sleep process? If
you look at something like pgrep sleep after test2.py exits, you'll see
that sleep is still running. Maybe that's not entirely surprising, since we
only killed test1.py and didn't explicitly kill sleep. But compare that to
what happens if you start test2.py and then quickly press Ctrl-C. In that
case, sleep is killed. What the hell!
What's going on is that there's a difference between signaling a process ID and
signaling a process group ID.
The kill function in Python (and Bash and pretty much every other language)
does the former, which only kills a single process. Ctrl-C in the shell does
the latter, which kills a whole tree of child processes at once. Process group
signaling is a great way to cancel an "entire job" reliably, even if that job
has spawned more child processes. So why do existing kill functions use the
surprisingly weak sauce that is individual process signaling?
The sad truth is that process group signaling basically only works for shells.
When the shell forks a child process, before it calls exec, it calls
setpgid to set a new process group ID. Because child processes typically do
not call setpgid themselves, the child process and all of its transitive
children end up in the same process group (which typically has a group ID equal
to the process ID of the original child). However, if one of those child
processes does call setpgid, the relationship between it and the other
children gets lost. Ctrl-C and Ctrl-Z stop working properly. The fundamental
issue is that each process only has a single process group ID. Process groups
do not form a tree.
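Within those limits, process group signaling from Python looks roughly like the sketch below. It assumes a Unix-like platform and that neither the child nor the grandchild changes its own process group. It uses start_new_session, which calls setsid and so also creates a new process group, as a substitute for the setpgid call that a shell would make.

```python
import os
import signal
import subprocess
import sys

GRANDCHILD = "import time; time.sleep(100)"
CHILD = (
    "import subprocess, sys\n"
    f"p = subprocess.Popen([sys.executable, '-c', {GRANDCHILD!r}])\n"
    "print('ready', flush=True)\n"
    "p.wait()\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", CHILD],
    stdout=subprocess.PIPE,
    start_new_session=True,  # the child leads a new group; its PID is the group ID
)
ready = proc.stdout.readline()  # the grandchild is running once this arrives

# Signal the whole group, not just the child.
os.killpg(proc.pid, signal.SIGKILL)
status = proc.wait()

# The grandchild inherited the stdout pipe, so EOF means it is gone too.
leftover = proc.stdout.read()
proc.stdout.close()
```

A side effect of this choice is that the child no longer shares the parent's terminal session, so a Ctrl-C in the shell stops reaching it. That is one reason a library can't simply do this by default.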
What does form a tree, however, is process IDs themselves. Each process knows the ID of its parent, so it's possible to query a process's full transitive tree of children. The problem with using such a query for signaling purposes is that it's racy. In the time between when you run the query and when you send signals, any process in the tree may have spawned new children. (Even worse, some processes might've exited, and those PIDs might've been reused for processes that aren't in the tree.) We can just barely almost solve that problem by killing a child process, not reaping it yet, and querying the child processes of the zombie. But alas, that strategy only works for one level of the tree, as the OS automatically reaps any zombie whose parent is also a zombie. So close!
The modern solution for all of this on Linux is supposed to be
cgroups. But as if to rub salt in
our wounds, it turns out there's no way to atomically signal a
cgroup. Systemd
works around this problem with a kill loop that repeatedly queries the PIDs in
a cgroup and tries to kill all of them individually. And it's still
vulnerable to the PID reuse race. Update: As of Linux 5.14 (August 2021)
cgroups support an atomic cgroup.kill operation that looks robust. The last
major holdout might be macOS.
Windows has a cleaner solution (job objects), but even there it sounds like some important features aren't supported on Windows 7. Realistically, there won't be good techniques for Duct to use to solve this problem for many years.