Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a test for an epoll+signalfd issue encountered in wayland #32

Merged
merged 1 commit into from
Sep 6, 2021

Conversation

arichardson
Copy link
Contributor

The wayland event_loop_signal test fails non-deterministically on FreeBSD
due to a missing SIGUSR1 notification. This test case is a minimal
reproducer for the issue: it passes on Linux but fails when run on FreeBSD.

The wayland event_loop_signal test fails non-deterministically on FreeBSD
due to a missing SIGUSR1 notification. This test case is a minimal
reproducer for the issue: it passes on Linux but fails when run on FreeBSD.
@arichardson
Copy link
Contributor Author

I tried to debug this and include a fix, but so far I've failed to narrow it down further than the poll call in epollfd_ctx_wait sometimes failing. It also only reproduces 1/10 times and so far never when ran under truss/ktrace

@jiixyj
Copy link
Owner

jiixyj commented Aug 2, 2021

Thanks for the test! It looks like we are hitting a race in the FreeBSD kernel. I've managed to reproduce it with kqueue:

#ifdef NDEBUG
#undef NDEBUG
#endif

#define _GNU_SOURCE

#include <sys/types.h>

#include <sys/event.h>

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>

#include <poll.h>
#include <unistd.h>

int
main(void)
{
	int rv;

	sigset_t set;
	rv = sigemptyset(&set);
	assert(rv == 0);
	rv = sigaddset(&set, SIGUSR1);
	assert(rv == 0);

	rv = sigprocmask(SIG_BLOCK, &set, NULL);
	assert(rv == 0);

	int skq = kqueue();
	assert(skq >= 0);

	struct kevent kev;
	EV_SET(&kev, SIGUSR1, EVFILT_SIGNAL, EV_ADD, 0, 0, 0);
	rv = kevent(skq, &kev, 1, NULL, 0, NULL);
	assert(rv == 0);

	int kq = kqueue();
	assert(kq >= 0);

	EV_SET(&kev, skq, EVFILT_READ, EV_ADD | EV_CLEAR, 0, 0, 0);
	rv = kevent(kq, &kev, 1, NULL, 0, NULL);
	assert(rv == 0);

	for (;;) {
		rv = kill(getpid(), SIGUSR1);
		assert(rv == 0);

		/* Turn this into `#if 1` to avoid the race. */
#if 1
		rv = kevent(kq, NULL, 0, &kev, 1, NULL);
#else
		rv = kevent(kq, NULL, 0, &kev, 1, &(struct timespec) { 0, 0 });
#endif
		assert(rv == 1);
		rv = kevent(kq, NULL, 0, &kev, 1, &(struct timespec) { 0, 0 });
		assert(rv == 0);

		rv = kevent(skq, NULL, 0, &kev, 1, &(struct timespec) { 0, 0 });
		assert(rv == 1);
		rv = kevent(skq, NULL, 0, &kev, 1, &(struct timespec) { 0, 0 });
		assert(rv == 0);

		siginfo_t siginfo;

		rv = sigtimedwait(&set, &siginfo, &(struct timespec) { 0, 0 });
		assert(rv == SIGUSR1);

		rv = sigtimedwait(&set, &siginfo, &(struct timespec) { 0, 0 });
		assert(rv < 0);
		assert(errno == EAGAIN);
	}
}
  • skq is the "signalfd", essentially -- a kqueue with EVFILT_SIGNAL registered.
  • kq is the "epollfd"

So the notification from the kill(getpid(), SIGUSR1); call makes its way through both skq and kq. Polling kq directly after the kill may result in nothing when the signal wasn't "fast" enough.

I think I recall from having looked at the FreeBSD kernel sources some time ago that this happens because various parts of the involved signal/notification mechanisms can be scheduled on different CPU cores. So it isn't 100% guaranteed that right after the kill(getpid(), SIGUSR1); call all listeners have been notified.

@jiixyj
Copy link
Owner

jiixyj commented Aug 2, 2021

I'm curious if Linux guarantees that the signalfd/epoll equivalent of the above code will always work, or if the race window is so short that it will always "appear" to work correctly.

@arichardson
Copy link
Contributor Author

I'm curious if Linux guarantees that the signalfd/epoll equivalent of the above code will always work, or if the race window is so short that it will always "appear" to work correctly.

I would assume that at least a single-thread program always sees the signal immediately (even if it was migrated to another CPU between the kill and epoll_wait call). Similarly I'd also assume it to be visible immediately for a signal sent to the current thread using pthread_kill/thr_kill/tkill.
Have you by any chance reported this race in the FreeBSD bug tracker?

@jiixyj jiixyj merged commit f84eab9 into jiixyj:master Sep 6, 2021
@jiixyj
Copy link
Owner

jiixyj commented Sep 6, 2021

Have you by any chance reported this race in the FreeBSD bug tracker?

No, I haven't -- but I'm also not sure if this is an actual bug in the kernel or "by design". One might interpret this behavior as violating the "spirit of POSIX", because they say:

If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread and if no other thread has sig unblocked or is waiting in a sigwait() function for sig, either sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns.

Of course there is no kqueue/epoll in POSIX, but to me it reads that all side effects of the signal should be observable after the kill() returned.

@arichardson arichardson deleted the wayland-testcase branch September 6, 2021 10:42
@jiixyj
Copy link
Owner

jiixyj commented Sep 6, 2021

I merged your test case, but I changed it so that it won't fail outright, but only skip. I guess this falls under the category of edge cases that cannot be 100% faithfully emulated right now.

@jiixyj
Copy link
Owner

jiixyj commented Sep 6, 2021

Thanks again!

@arichardson
Copy link
Contributor Author

Reported as https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=258310 since I'm not sure if this is intended or an implementation bug.

@arichardson
Copy link
Contributor Author

Fixed in freebsd/freebsd-src@98168a6

@jiixyj
Copy link
Owner

jiixyj commented Sep 7, 2021

Fantastic! I'll update the text in the atf_tc_skip accordingly. I guess skipping the test is better for now, because I don't know yet how the other BSD's handle this scenario.

freebsd-git pushed a commit to freebsd/freebsd-src that referenced this pull request Sep 13, 2021
This test (based on jiixyj/epoll-shim#32 (comment))
fails reproducibly on QEMU with KVM and `-smp 2` prior to D31858 (committed
as 98168a6) and passes with the patch applied.

Reviewed By:	kib, imp
Differential Revision: https://reviews.freebsd.org/D31862
bsdjhb pushed a commit to bsdjhb/cheribsd that referenced this pull request Dec 30, 2021
This test (based on jiixyj/epoll-shim#32 (comment))
fails reproducibly on QEMU with KVM and `-smp 2` prior to D31858 (committed
as 98168a6) and passes with the patch applied.

Reviewed By:	kib, imp
Differential Revision: https://reviews.freebsd.org/D31862
freebsd-git pushed a commit to freebsd/freebsd-src that referenced this pull request May 7, 2022
This test (based on jiixyj/epoll-shim#32 (comment))
fails reproducibly on QEMU with KVM and `-smp 2` prior to D31858 (committed
as 98168a6) and passes with the patch applied.

Reviewed By:	kib, imp
Differential Revision: https://reviews.freebsd.org/D31862

(cherry picked from commit d7d962e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants