waitfd: new syscall implementing waitpid() over fds
This syscall, originally due to Casey Dahlin but significantly modified
since, is called quite like waitid():

	fd = waitfd(P_PID, some_pid, WEXITED | WSTOPPED, 0);

This returns a file descriptor which becomes ready whenever waitpid()
would return, and which, when read(), yields the wait status that
waitpid() would have reported.  (Alternatively, you can use it as a pure
indication that waitpid() is callable without hanging, and then call
waitpid().)  See the example in tools/testing/selftests/waitfd/.
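
As a rough userspace sketch of the calling pattern (not the selftest
itself): waitfd_sketch() is a hypothetical wrapper, since libc has no
waitfd(), and 473 is the temporary number from this patch's syscall
tables, so the wrapper only works on a kernel carrying this patch.
read_wait_status_sketch() is likewise a made-up helper name; it polls
the fd and reads one int-sized wait status, matching what waitfd_read()
in this patch copies out.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Temporary syscall number taken from this patch; not an upstream ABI. */
#define WAITFD_NR_SKETCH 473

/* Hypothetical wrapper: only meaningful on a kernel with this patch. */
int waitfd_sketch(int which, pid_t upid, int options, int flags)
{
	return (int)syscall(WAITFD_NR_SKETCH, which, upid, options, flags);
}

/*
 * Wait up to timeout_ms for wfd to become readable, then read one int
 * wait status from it.  Returns 1 with *status filled in, 0 on timeout
 * or when read() returns fewer bytes than one status, -1 on error.
 */
int read_wait_status_sketch(int wfd, int timeout_ms, int *status)
{
	struct pollfd pfd = { .fd = wfd, .events = POLLIN };
	ssize_t n;
	int ready = poll(&pfd, 1, timeout_ms);

	if (ready <= 0)
		return ready;
	n = read(wfd, status, sizeof(*status));
	if (n < 0)
		return -1;
	return n == (ssize_t)sizeof(*status) ? 1 : 0;
}
```

The status read this way can be decoded with the usual WIFEXITED() and
WEXITSTATUS() macros, exactly as after waitpid().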

The original reason for rejection of this patch back in 2009 was that it
was redundant to waitpid()ing in a separate thread and transmitting
process information to another thread that polls: but this is only the
case for the conventional child-process use of waitpid().  Other
waitpid() uses, such as ptrace() returns, are targeted at a single
thread, so without waitfd or something like it, it is impossible to have
a thread that both accepts requests for servicing from other threads
over an fd *and* manipulates the state of a ptrace()d process in
response to those requests without ugly CPU-chewing polling (accepting
requests requires blocking in poll() or select(): handling the ptraced
process requires blocking in waitpid()).
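
For illustration, the polling workaround this paragraph complains about
looks roughly like the sketch below.  It uses an ordinary forked child
rather than a ptrace()d process to stay short, and the function name,
the 10ms interval and the retry bound are all arbitrary choices made
for the example.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, then alternate between a short
 * poll() on a request pipe and a non-blocking waitpid() until the child
 * is reaped.  Returns the child's exit status, or -1 on failure.
 */
int timeout_polling_sketch(int code)
{
	int req[2], status = 0, tries, result = -1;
	pid_t pid;

	if (pipe(req) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(req[0]);
		close(req[1]);
		return -1;
	}
	if (pid == 0)
		_exit(code);

	for (tries = 0; tries < 1000; tries++) {
		struct pollfd pfd = { .fd = req[0], .events = POLLIN };
		pid_t w;

		/* Wakes every 10ms even when no request has arrived. */
		if (poll(&pfd, 1, 10) > 0)
			break;	/* a real server would service the request */

		/* Cannot block here, or requests would go unanswered. */
		w = waitpid(pid, &status, WNOHANG);
		if (w == pid) {
			if (WIFEXITED(status))
				result = WEXITSTATUS(status);
			break;
		}
		if (w < 0)
			break;
	}

	close(req[0]);
	close(req[1]);
	return result;
}
```

The loop wakes up a hundred times a second whether or not anything has
happened, which is the cost a pollable wait fd is meant to remove.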

There is one ugliness in this patch which I would appreciate suggestions
to improve (due to me, not due to Casey, don't blame him).  The poll()
machinery expects to be used with files, or things enough like files
that the wake_up key contains an indication as to whether this wakeup
corresponds to a POLLIN / POLLOUT / POLLERR event on this fd.  You can
override this in your poll_queue_proc, but the poll() and epoll() queue
procs both have this interpretation.

Unfortunately, this is not true for waitfds, which wait on the
wait_chldexit waitqueue, whose key is a pointer to the task_struct of
the task being killed.  We can't do anything with this key, but we
certainly don't want the poll machinery treating it as a bitmask and
checking it against poll events!

So we introduce a new poll_wait() analogue, poll_wait_fixed().  This is used
for poll_wait() calls which know they must wait on waitqueues whose keys are
not a typecast representation of poll events, and passes in an extra
argument to the poll_queue_proc, which if nonzero is the event which a
wakeup on this waitqueue should be considered as equivalent to.  The
poll_queue_proc can then skip adding entirely if that fixed event is not
included in the set to be caught by this poll().

We also add a new poll_table_entry.fixed_key.  The poll_queue_proc can
record the fixed key it is passed here, and consult it at wakeup time to
learn that a nonzero fixed key was passed in to poll_wait_fixed() and that
the wakeup key should be ignored in favor of fixed_key.
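
A userspace model of the two checks may make this easier to follow.  It
is only a sketch: entry_model, queue_model() and wake_model() are
invented names, EV_IN and EV_OUT stand in for POLLIN and POLLOUT, and
the logic mirrors the __pollwait() and pollwake() changes in this patch
rather than reproducing kernel code.

```c
#include <stdint.h>

#define EV_IN	0x001UL	/* stand-in for POLLIN */
#define EV_OUT	0x004UL	/* stand-in for POLLOUT */

struct entry_model {
	unsigned long key;		/* events the poller asked for */
	unsigned long fixed_key;	/* nonzero: what every wakeup means */
	int queued;			/* 1 if added to the waitqueue */
};

/* Queue time: skip queueing if the fixed event was not asked for. */
int queue_model(struct entry_model *e, unsigned long wanted,
		unsigned long fixed_event)
{
	e->queued = 0;
	if (fixed_event && !(wanted & fixed_event))
		return 0;

	e->key = wanted;
	e->fixed_key = fixed_event;
	e->queued = 1;
	return 1;
}

/* Wake time: returns 1 if the poller should be woken for this key. */
int wake_model(const struct entry_model *e, const void *key)
{
	uintptr_t k = (uintptr_t)key;

	/* A recorded fixed key replaces whatever the waker passed. */
	if (e->fixed_key)
		k = e->fixed_key;

	if (k && !(k & e->key))
		return 0;
	return 1;
}
```

With fixed_key left at zero, a pointer-valued key such as a task_struct
address is tested as if it were an event mask, so whether the poller
wakes depends on which address bits happen to be set.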

With this in place, you can say, e.g. (as waitfd does)

        poll_wait_fixed(file, &current->signal->wait_chldexit, wait,
                POLLIN);

and the key passed to wakeups on the wait_chldexit waitqueue will be
ignored: the fd will always be treated as having raised POLLIN, waking
up poll()s and epoll()s that have specified that event.  (Obviously, a
poll function that calls this should return the same value from the poll
function as was passed to poll_wait_fixed(), or, as usual, zero if this
was a spurious wakeup.)
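
The readiness check behind that return value can be approximated from
userspace.  waitfd_poll() in this patch peeks with kernel_wait4() and
WNOHANG | WNOWAIT, and treats -ECHILD as ready; waitid() offers a
similar non-consuming peek.  The sketch below assumes a plain exiting
child, and both function names are made up for the example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* POLLIN if waiting on pid would not block (or would fail), else 0. */
short ready_mask_sketch(pid_t pid)
{
	siginfo_t info;

	memset(&info, 0, sizeof(info));
	if (waitid(P_PID, (id_t)pid, &info,
		   WEXITED | WNOHANG | WNOWAIT) < 0)
		return errno == ECHILD ? POLLIN : 0;

	/* si_pid stays zero when no child was in a waitable state. */
	return info.si_pid != 0 ? POLLIN : 0;
}

/* Fork a child exiting with `code`, peek until ready, then reap it. */
int peek_then_reap_sketch(int code)
{
	struct timespec ts = { 0, 10 * 1000 * 1000 };	/* 10ms */
	int status, tries;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);

	for (tries = 0; tries < 1000 && !ready_mask_sketch(pid); tries++)
		nanosleep(&ts, NULL);

	/* WNOWAIT left the child waitable, so a second peek agrees. */
	if (!ready_mask_sketch(pid))
		return -1;

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

A zero return corresponds to the spurious-wakeup case: something woke
the waitqueue, but no state change matching the request is pending.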

I do not like this scheme: it's sufficiently arcane that I had to go
back to my old commit messages to figure out what it was doing and
why.  But I don't see another way to cause poll() to return on
appropriate activity on waitqueues that do not actually correspond to
files.  (I do wonder how signalfd works.  It doesn't seem to need any of
this and I don't understand why not.  I would be overjoyed to remove the
whole invasive poll_wait_fixed() mess, but I'm not sure what to replace
it with.)

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Signed-off-by: Eugene Loh <eugene.loh@oracle.com>
Signed-off-by: David Mc Lean <david.mclean@oracle.com>
Signed-off-by: Vincent Lim <vincent.lim@oracle.com>
nickalcock committed Nov 7, 2021
1 parent b6d6896 commit 0fd5e3924d69c1dc81e7c875fc9889b52c50248e
Showing 22 changed files with 376 additions and 18 deletions.
@@ -453,3 +453,6 @@
446 i386 landlock_restrict_self sys_landlock_restrict_self
447 i386 memfd_secret sys_memfd_secret
448 i386 process_mrelease sys_process_mrelease
# This one is a temporary number, designed for no clashes.
# Nothing but DTrace should use it.
473 i386 waitfd sys_waitfd
@@ -370,6 +370,9 @@
446 common landlock_restrict_self sys_landlock_restrict_self
447 common memfd_secret sys_memfd_secret
448 common process_mrelease sys_process_mrelease
# This one is a temporary number, designed for no clashes.
# Nothing but DTrace should use it.
473 common waitfd sys_waitfd

#
# Due to a historical design error, certain syscalls are numbered differently
@@ -79,7 +79,8 @@ static int virqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void
}

static void virqfd_ptable_queue_proc(struct file *file,
-wait_queue_head_t *wqh, poll_table *pt)
+wait_queue_head_t *wqh, poll_table *pt,
+unsigned long unused)
{
struct virqfd *virqfd = container_of(pt, struct virqfd, pt);
add_wait_queue(wqh, &virqfd->wait);
@@ -152,7 +152,7 @@ static void vhost_flush_work(struct vhost_work *work)
}

static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
-poll_table *pt)
+poll_table *pt, unsigned long unused)
{
struct vhost_poll *poll;

@@ -95,7 +95,7 @@ static int hsm_irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
}

static void hsm_irqfd_poll_func(struct file *file, wait_queue_head_t *wqh,
-poll_table *pt)
+poll_table *pt, unsigned long fixed_event)
{
struct hsm_irqfd *irqfd;

@@ -31,6 +31,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
obj-$(CONFIG_WAITFD) += waitfd.o
obj-$(CONFIG_AIO) += aio.o
obj-$(CONFIG_IO_URING) += io_uring.o
obj-$(CONFIG_IO_WQ) += io-wq.o
@@ -1717,7 +1717,7 @@ struct aio_poll_table {

static void
aio_poll_queue_proc(struct file *file, struct wait_queue_head *head,
-struct poll_table_struct *p)
+struct poll_table_struct *p, unsigned long fixed_event)
{
struct aio_poll_table *pt = container_of(p, struct aio_poll_table, pt);

@@ -153,6 +153,9 @@ struct epitem {
/* The file descriptor information this item refers to */
struct epoll_filefd ffd;

/* fd always raises this fixed event. */
unsigned long fixed_event;

/* List containing poll wait queues */
struct eppoll_entry *pwqlist;

@@ -1206,6 +1209,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
if (!(epi->event.events & EPOLLEXCLUSIVE))
ewake = 1;

/*
* If this fd type has a hardwired event which should override the key
* (e.g. if it is waiting on a non-file waitqueue), jam it in here.
*/
if (epi->fixed_event)
key = (void *)epi->fixed_event;

if (pollflags & POLLFREE) {
/*
* If we race with ep_remove_wait_queue() it can miss
@@ -1230,7 +1240,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
* target file wakeup lists.
*/
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
-poll_table *pt)
+poll_table *pt, unsigned long fixed_event)
{
struct ep_pqueue *epq = container_of(pt, struct ep_pqueue, pt);
struct epitem *epi = epq->epi;
@@ -1239,6 +1249,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
if (unlikely(!epi)) // an earlier allocation has failed
return;

if (fixed_event && !(epi->event.events & fixed_event))
return;

if (fixed_event)
epi->fixed_event = fixed_event;

pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL);
if (unlikely(!pwq)) {
epq->epi = NULL;
@@ -1463,6 +1479,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->fixed_event = 0;
epi->next = EP_UNACTIVE_PTR;

if (tep)
@@ -5492,7 +5492,8 @@ static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt,
}

static void io_async_queue_proc(struct file *file, struct wait_queue_head *head,
-struct poll_table_struct *p)
+struct poll_table_struct *p,
+unsigned long fixed_event)
{
struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
struct async_poll *apoll = pt->req->apoll;
@@ -5804,7 +5805,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
}

static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
-struct poll_table_struct *p)
+struct poll_table_struct *p, unsigned long fixed_event)
{
struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);

@@ -116,7 +116,7 @@ struct poll_table_page {
* poll table.
*/
static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
-poll_table *p);
+poll_table *p, unsigned long fixed_event);

void poll_initwait(struct poll_wqueues *pwq)
{
@@ -212,22 +212,37 @@ static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key
struct poll_table_entry *entry;

entry = container_of(wait, struct poll_table_entry, wait);

/*
* If this fd type has a hardwired key which should override the key
* (e.g. if it is waiting on a non-file waitqueue), jam it in here.
*/
if (entry->fixed_key)
key = (void *)entry->fixed_key;

if (key && !(key_to_poll(key) & entry->key))
return 0;
return __pollwake(wait, mode, sync, key);
}

/* Add a new entry */
static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
-poll_table *p)
+poll_table *p, unsigned long fixed_event)
{
struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
-struct poll_table_entry *entry = poll_get_entry(pwq);
+struct poll_table_entry *entry;

if (fixed_event && !(p->_key & fixed_event))
return;

entry = poll_get_entry(pwq);
if (!entry)
return;

entry->filp = get_file(filp);
entry->wait_address = wait_address;
entry->key = p->_key;
entry->fixed_key = fixed_event;
init_waitqueue_func_entry(&entry->wait, pollwake);
entry->wait.private = pwq;
add_wait_queue(wait_address, &entry->wait);
@@ -0,0 +1,130 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
* fs/waitfd.c
*
* Copyright (C) 2008 Red Hat, Casey Dahlin <cdahlin@redhat.com>
*
* Largely derived from fs/signalfd.c
*/

#include <linux/file.h>
#include <linux/poll.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/kernel.h>
#include <linux/signal.h>
#include <linux/list.h>
#include <linux/anon_inodes.h>
#include <linux/syscalls.h>

long kernel_wait4(pid_t upid, int __user *stat_addr,
int options, struct rusage __user *ru);

struct waitfd_ctx {
int options;
pid_t upid;
};

static int waitfd_release(struct inode *inode, struct file *file)
{
kfree(file->private_data);
return 0;
}

static unsigned int waitfd_poll(struct file *file, poll_table *wait)
{
struct waitfd_ctx *ctx = file->private_data;
long value;

poll_wait_fixed(file, &current->signal->wait_chldexit, wait,
POLLIN);

value = kernel_wait4(ctx->upid, NULL, ctx->options | WNOHANG | WNOWAIT,
NULL);
if (value > 0 || value == -ECHILD)
return POLLIN | POLLRDNORM;

return 0;
}

/*
* Returns a multiple of the size of a stat_addr, or a negative error code. The
* "count" parameter must be at least sizeof(int).
*/
static ssize_t waitfd_read(struct file *file, char __user *buf, size_t count,
loff_t *ppos)
{
struct waitfd_ctx *ctx = file->private_data;
int __user *stat_addr = (int __user *)buf;
int flags = ctx->options;
ssize_t ret, total = 0;

count /= sizeof(int);
if (!count)
return -EINVAL;

if (file->f_flags & O_NONBLOCK)
flags |= WNOHANG;

do {
ret = kernel_wait4(ctx->upid, stat_addr, flags, NULL);
if (ret == 0)
ret = -EAGAIN;
if (ret == -ECHILD)
ret = 0;
if (ret <= 0)
break;

stat_addr++;
total += sizeof(int);
} while (--count);

return total ? total : ret;
}

static const struct file_operations waitfd_fops = {
.release = waitfd_release,
.poll = waitfd_poll,
.read = waitfd_read,
.llseek = noop_llseek,
};

SYSCALL_DEFINE4(waitfd, int __maybe_unused, which, pid_t, upid, int, options,
int __maybe_unused, flags)
{
int ufd;
struct waitfd_ctx *ctx;

/*
* Options validation from kernel_wait4(), minus WNOWAIT, which is
* only used by our polling implementation. If WEXITED or WSTOPPED
* are provided, silently remove them (for backward compatibility with
* older callers).
*/
options &= ~(WEXITED | WSTOPPED);
if (options & ~(WNOHANG|WUNTRACED|WCONTINUED|
__WNOTHREAD|__WCLONE|__WALL))
return -EINVAL;

ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
if (!ctx)
return -ENOMEM;

ctx->options = options;
ctx->upid = upid;

ufd = anon_inode_getfd("[waitfd]", &waitfd_fops, ctx,
O_RDWR | flags | ((options & WNOHANG) ?
O_NONBLOCK | 0 : 0));
/*
* Use the fd's nonblocking state from now on, since that can change.
*/
ctx->options &= ~WNOHANG;

if (ufd < 0)
kfree(ctx);

return ufd;
}
@@ -34,7 +34,8 @@ struct poll_table_struct;
/*
* structures and helpers for f_op->poll implementations
*/
-typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);
+typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *,
+struct poll_table_struct *, unsigned long fixed_event);

/*
* Do not touch the structure directly, use the access functions
@@ -48,7 +49,15 @@ typedef struct poll_table_struct {
static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
if (p && p->_qproc && wait_address)
-p->_qproc(filp, wait_address, p);
+p->_qproc(filp, wait_address, p, 0);
}

static inline void poll_wait_fixed(struct file *filp,
wait_queue_head_t *wait_address, poll_table *p,
unsigned long fixed_event)
{
if (p && p->_qproc && wait_address)
p->_qproc(filp, wait_address, p, fixed_event);
}

/*
@@ -93,6 +102,7 @@ static inline __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
struct poll_table_entry {
struct file *filp;
__poll_t key;
unsigned long fixed_key;
wait_queue_entry_t wait;
wait_queue_head_t *wait_address;
};
@@ -1376,6 +1376,9 @@ long compat_ksys_semtimedop(int semid, struct sembuf __user *tsems,
long __do_semtimedop(int semid, struct sembuf *tsems, unsigned int nsops,
const struct timespec64 *timeout,
struct ipc_namespace *ns);
#ifdef CONFIG_DTRACE
asmlinkage long sys_waitfd(int which, pid_t upid, int options, int flags);
#endif

int __sys_getsockopt(int fd, int level, int optname, char __user *optval,
int __user *optlen);
@@ -880,8 +880,11 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
#define __NR_process_mrelease 448
__SYSCALL(__NR_process_mrelease, sys_process_mrelease)

#define __NR_waitfd 473
__SYSCALL(__NR_waitfd, sys_waitfd)

#undef __NR_syscalls
-#define __NR_syscalls 449
+#define __NR_syscalls 474

/*
* 32 bit systems traditionally used different
@@ -1615,6 +1615,22 @@ config EPOLL
Disabling this option will cause the kernel to be built without
support for epoll family of system calls.

config WAITFD
bool "Enable waitfd() system call" if EXPERT
select ANON_INODES
default n
help
Enable the waitfd() system call that allows receiving child state
changes from a file descriptor. This permits use of poll() to
monitor waitpid() output simultaneously with other fd state changes,
even if the waitpid() output is coming from thread-targeted sources
such as ptrace().

Note: this system call is not upstream: its syscall number is not
finalized, so the call itself should only be used with caution.

If unsure, say N.

config SIGNALFD
bool "Enable signalfd() system call" if EXPERT
default y
