waitfd: new syscall implementing waitpid() over fds
This syscall, originally due to Casey Dahlin but significantly modified
since, is called quite like waitid():
    fd = waitfd(P_PID, some_pid, WEXITED | WSTOPPED, 0);
This returns a file descriptor which becomes ready whenever waitpid()
would return and which, when read, yields the wait status waitpid() would
have reported. (Alternatively, you can use it as a pure indication that
waitpid() is callable without hanging, and then call waitpid().) See the
example in tools/testing/selftests/waitfd/.
The original reason for rejection of this patch back in 2009 was that it
was redundant to waitpid()ing in a separate thread and transmitting
process information to another thread that polls: but this is only the
case for the conventional child-process use of waitpid(). Other
waitpid() uses, such as ptrace() returns, are targeted at a single
thread, so without waitfd or something like it, it is impossible to have
a thread that both accepts requests for servicing from other threads
over an fd *and* manipulates the state of a ptrace()d process in
response to those requests without ugly CPU-chewing polling (accepting
requests requires blocking in poll() or select(): handling the ptraced
process requires blocking in waitpid()).
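
For the conventional child-process case, the thread-based workaround mentioned above can be sketched in user space roughly as follows. This is an illustration only and not part of the patch: struct wait_report, struct waiter_args, waiter_main() and wait_via_thread() are hypothetical names. It works because any thread of a process may reap that process's children, which is exactly what does not hold for ptrace() stops.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical message sent from the waiter thread to the poller. */
struct wait_report {
	pid_t pid;	/* return value of waitpid() */
	int status;	/* status word filled in by waitpid() */
};

struct waiter_args {
	pid_t pid;	/* child to wait for */
	int wfd;	/* write end of the pipe */
};

/* Blocks in waitpid(), then forwards the result over the pipe. */
static void *waiter_main(void *arg)
{
	struct waiter_args *a = arg;
	struct wait_report r = { 0 };

	r.pid = waitpid(a->pid, &r.status, 0);
	/* A failed write simply shows up as EOF on the read end. */
	if (write(a->wfd, &r, sizeof(r)) < 0)
		perror("write");
	close(a->wfd);
	return NULL;
}

/*
 * Waits for a child by polling a pipe fed by a helper thread.
 * Returns 0 and fills *out on success, -1 on error.
 * A real server would put its request fds in the same pollfd array.
 */
static int wait_via_thread(pid_t pid, struct wait_report *out)
{
	int p[2];
	pthread_t tid;
	struct waiter_args args;
	struct pollfd pfd;
	int ret = -1;

	assert(out != NULL);
	if (pipe(p) < 0)
		return -1;
	args.pid = pid;
	args.wfd = p[1];
	if (pthread_create(&tid, NULL, waiter_main, &args) != 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}

	pfd.fd = p[0];
	pfd.events = POLLIN;
	if (poll(&pfd, 1, -1) == 1 &&
	    read(p[0], out, sizeof(*out)) == (ssize_t)sizeof(*out))
		ret = 0;

	pthread_join(tid, NULL);
	close(p[0]);
	return ret;
}
```

The cost is one extra thread and one pipe per waited-for process, and the approach breaks down as soon as the wait has to happen on the same thread that services the requests.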
There is one ugliness in this patch which I would appreciate suggestions
to improve (due to me, not due to Casey, don't blame him). The poll()
machinery expects to be used with files, or things enough like files
that the wake_up key contains an indication as to whether this wakeup
corresponds to a POLLIN / POLLOUT / POLLERR event on this fd. You can
override this in your poll_queue_proc, but the poll() and epoll() queue
procs both have this interpretation.
Unfortunately, this is not true for waitfds, which wait on the
wait_chldexit waitqueue, whose key is a pointer to the task_struct of
the task being killed. We can't do anything with this key, but we
certainly don't want the poll machinery treating it as a bitmask and
checking it against poll events!
So we introduce a new poll_wait() analogue, poll_wait_fixed(). This is used
for poll_wait() calls which know they must wait on waitqueues whose keys are
not a typecast representation of poll events, and passes in an extra
argument to the poll_queue_proc, which if nonzero is the event which a
wakeup on this waitqueue should be considered as equivalent to. The
poll_queue_proc can then skip adding entirely if that fixed event is not
included in the set to be caught by this poll().
We also add a new poll_table_entry.fixed_key. The poll_queue_proc can
record the fixed key it is passed in here, and reuse it at wakeup time to
track that a nonzero fixed key was passed in to poll_wait_fixed() and that
the key should be ignored in preference to fixed_key.
With this in place, you can say, e.g. (as waitfd does)
    poll_wait_fixed(file, &current->signal->wait_chldexit, wait,
                    POLLIN);
and the key passed to wakeups on the wait_chldexit waitqueue will be
ignored: the fd will always be treated as having raised POLLIN, waking
up poll()s and epoll()s that have specified that event. (Obviously, a
poll function that calls this should return the same value from the poll
function as was passed to poll_wait_fixed(), or, as usual, zero if this
was a spurious wakeup.)
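
The queueing and wakeup rules described above can be modelled in user space with a small sketch. This is not the kernel code: struct model_entry, model_queue() and model_wake() are hypothetical stand-ins, and the rule used for entries without a fixed key (a NULL key wakes unconditionally, anything else is read as an event mask) is an assumption about how the existing poll wake callbacks behave.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a poll_table_entry; not the kernel structure. */
struct model_entry {
	unsigned long wanted;		/* events the poller asked for */
	unsigned long fixed_key;	/* nonzero: ignore the wakeup key */
};

/*
 * Queueing step: returns 1 if the entry should be added to the waitqueue.
 * If a fixed event is given and the poller is not interested in it,
 * adding is skipped entirely.
 */
static int model_queue(struct model_entry *e, unsigned long wanted,
		       unsigned long fixed_event)
{
	assert(e != NULL);
	if (fixed_event && !(fixed_event & wanted))
		return 0;
	e->wanted = wanted;
	e->fixed_key = fixed_event;
	return 1;
}

/*
 * Wakeup step: returns 1 if the poller should be woken.
 * Without a fixed key, the wakeup key is read as a poll event mask.
 * With a fixed key, the wakeup key is never interpreted at all, so it
 * may safely be an unrelated pointer.
 */
static int model_wake(const struct model_entry *e, const void *key)
{
	unsigned long events = e->fixed_key ? e->fixed_key
					    : (unsigned long)(uintptr_t)key;

	if (events && !(events & e->wanted))
		return 0;
	return 1;
}
```

In this model, an entry queued with a fixed POLLIN is woken whatever pointer arrives as the key, while an entry queued without one is subject to the usual mask check.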
I do not like this scheme: it's sufficiently arcane that I had to go
back to my old commit messages to figure out what it was doing and
why. But I don't see another way to cause poll() to return on
appropriate activity on waitqueues that do not actually correspond to
files. (I do wonder how signalfd works. It doesn't seem to need any of
this and I don't understand why not. I would be overjoyed to remove the
whole invasive poll_wait_fixed() mess, but I'm not sure what to replace
it with.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Signed-off-by: Eugene Loh <eugene.loh@oracle.com>
Signed-off-by: David Mc Lean <david.mclean@oracle.com>
Signed-off-by: Vincent Lim <vincent.lim@oracle.com>
Showing 22 changed files with 376 additions and 18 deletions.
130
fs/waitfd.c
@@ -0,0 +1,130 @@

    /* SPDX-License-Identifier: GPL-2.0 */
    /*
     *  fs/waitfd.c
     *
     *  Copyright (C) 2008  Red Hat, Casey Dahlin <cdahlin@redhat.com>
     *
     *  Largely derived from fs/signalfd.c
     */

    #include <linux/file.h>
    #include <linux/poll.h>
    #include <linux/init.h>
    #include <linux/fs.h>
    #include <linux/sched.h>
    #include <linux/slab.h>
    #include <linux/kernel.h>
    #include <linux/signal.h>
    #include <linux/list.h>
    #include <linux/anon_inodes.h>
    #include <linux/syscalls.h>

    long kernel_wait4(pid_t upid, int __user *stat_addr,
    		  int options, struct rusage __user *ru);

    struct waitfd_ctx {
    	int	options;
    	pid_t	upid;
    };

    static int waitfd_release(struct inode *inode, struct file *file)
    {
    	kfree(file->private_data);
    	return 0;
    }

    static unsigned int waitfd_poll(struct file *file, poll_table *wait)
    {
    	struct waitfd_ctx *ctx = file->private_data;
    	long value;

    	poll_wait_fixed(file, &current->signal->wait_chldexit, wait,
    		POLLIN);

    	value = kernel_wait4(ctx->upid, NULL, ctx->options | WNOHANG | WNOWAIT,
    			     NULL);
    	if (value > 0 || value == -ECHILD)
    		return POLLIN | POLLRDNORM;

    	return 0;
    }

    /*
     * Returns a multiple of the size of a stat_addr, or a negative error code. The
     * "count" parameter must be at least sizeof(int).
     */
    static ssize_t waitfd_read(struct file *file, char __user *buf, size_t count,
    			   loff_t *ppos)
    {
    	struct waitfd_ctx *ctx = file->private_data;
    	int __user *stat_addr = (int *)buf;
    	int flags = ctx->options;
    	ssize_t ret, total = 0;

    	count /= sizeof(int);
    	if (!count)
    		return -EINVAL;

    	if (file->f_flags & O_NONBLOCK)
    		flags |= WNOHANG;

    	do {
    		ret = kernel_wait4(ctx->upid, stat_addr, flags, NULL);
    		if (ret == 0)
    			ret = -EAGAIN;
    		if (ret == -ECHILD)
    			ret = 0;
    		if (ret <= 0)
    			break;

    		stat_addr++;
    		total += sizeof(int);
    	} while (--count);

    	return total ? total : ret;
    }

    static const struct file_operations waitfd_fops = {
    	.release	= waitfd_release,
    	.poll		= waitfd_poll,
    	.read		= waitfd_read,
    	.llseek		= noop_llseek,
    };

    SYSCALL_DEFINE4(waitfd, int __maybe_unused, which, pid_t, upid, int, options,
    		int __maybe_unused, flags)
    {
    	int ufd;
    	struct waitfd_ctx *ctx;

    	/*
    	 * Options validation from kernel_wait4(), minus WNOWAIT, which is
    	 * only used by our polling implementation.  If WEXITED or WSTOPPED
    	 * are provided, silently remove them (for backward compatibility with
    	 * older callers).
    	 */
    	options &= ~(WEXITED | WSTOPPED);
    	if (options & ~(WNOHANG|WUNTRACED|WCONTINUED|
    			__WNOTHREAD|__WCLONE|__WALL))
    		return -EINVAL;

    	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
    	if (!ctx)
    		return -ENOMEM;

    	ctx->options = options;
    	ctx->upid = upid;

    	ufd = anon_inode_getfd("[waitfd]", &waitfd_fops, ctx,
    			       O_RDWR | flags | ((options & WNOHANG) ?
    						 O_NONBLOCK | 0 : 0));
    	/*
    	 * Use the fd's nonblocking state from now on, since that can change.
    	 */
    	ctx->options &= ~WNOHANG;

    	if (ufd < 0)
    		kfree(ctx);

    	return ufd;
    }