Meson crashes while running GLib gdbus-proxy-threads test #3967
Comments
Looks like we're not killing process groups correctly after each test. Not sure what's going on because it seems to be BSD-specific?
After spending time reading When
The ideal fix of the problem will be fixing the |
In file gio/gtestdbus.c, function watch_parent, there is a loop which waits for commands sent from the parent process and kills all processes recorded in the 'pids_to_kill' array when the parent process exits. Parent exit is detected by calling g_poll and checking whether the returned event is G_IO_HUP. However, 'revents' is a bit mask, so the check should be a bitwise AND rather than an equality comparison. The equality check seems to work fine on Linux, but it fails on FreeBSD because g_poll returns both G_IO_IN and G_IO_HUP on pipe close. This means the watcher process continues waiting for commands after the parent process exits, and g_io_channel_read_line returns G_IO_STATUS_EOF with 'command' set to NULL. The watcher process then crashes with a segfault when calling sscanf because 'command' is NULL.

Since the test result has already been reported by the parent process as 'OK', this kind of crash is likely to go unnoticed unless someone checks dmesg messages after the test:

pid 57611 (defaultvalue), uid 1001: exited on signal 11
pid 57935 (actions), uid 1001: exited on signal 11
pid 57945 (gdbus-bz627724), uid 1001: exited on signal 11
pid 57952 (gdbus-connection), uid 1001: exited on signal 11
pid 57970 (gdbus-connection-lo), uid 1001: exited on signal 11
pid 57976 (gdbus-connection-sl), uid 1001: exited on signal 11
pid 58039 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58043 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58047 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58051 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58055 (gdbus-export), uid 1001: exited on signal 11
pid 58059 (gdbus-introspection), uid 1001: exited on signal 11
pid 58065 (gdbus-names), uid 1001: exited on signal 11
pid 58071 (gdbus-proxy), uid 1001: exited on signal 11
pid 58079 (gdbus-proxy-threads), uid 1001: exited on signal 11
pid 58083 (gdbus-proxy-well-kn), uid 1001: exited on signal 11
pid 58091 (gdbus-test-codegen), uid 1001: exited on signal 11
pid 58095 (gdbus-threading), uid 1001: exited on signal 11
pid 58104 (gmenumodel), uid 1001: exited on signal 11
pid 58108 (gnotification), uid 1001: exited on signal 11
pid 58112 (gdbus-test-codegen-), uid 1001: exited on signal 11
pid 58116 (gapplication), uid 1001: exited on signal 11
pid 58132 (dbus-appinfo), uid 1001: exited on signal 11

If the watcher process crashes before killing the dbus-daemon process spawned by the parent process, the dbus-daemon process keeps running after all tests complete. Because of how the 'communicate' function in Python's subprocess module is implemented, this causes meson to crash. 'communicate' assumes the stdout and stderr pipes are closed when the child process exits, but that is not true if processes forked by the child process don't exit. As a result, 'communicate' blocks on the call to poll until the timeout expires, even if the test finishes in a few seconds. Meson assumes the timeout exception always means the test is still running. It calls 'communicate' again and crashes because the pipes no longer exist.

https://gitlab.gnome.org/Infrastructure/GitLab/issues/286
mesonbuild/meson#3967
https://bugs.python.org/issue30154
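The difference between the equality check and the bitwise-AND check described above can be illustrated with a small sketch. It is written in Python rather than C, and it uses the `select` module's `POLLIN` and `POLLHUP` constants as stand-ins for `G_IO_IN` and `G_IO_HUP`; the function names are hypothetical and this is not the GLib code itself. It assumes a POSIX system where `select.poll` is available.

```python
import os
import select


def parent_exited_buggy(revents):
    # Equality only matches when HUP is the *only* bit set in the mask.
    return revents == select.POLLHUP


def parent_exited_fixed(revents):
    # Bitwise AND matches whenever the HUP bit is set, whatever else is set.
    return bool(revents & select.POLLHUP)


def poll_closed_pipe():
    """Poll the read end of a pipe whose write end has already been closed."""
    read_fd, write_fd = os.pipe()
    os.close(write_fd)  # simulates the parent going away
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    events = poller.poll(1000)  # timeout in milliseconds
    os.close(read_fd)
    return events[0][1] if events else 0


if __name__ == "__main__":
    both = select.POLLIN | select.POLLHUP  # what the commit message reports for FreeBSD
    print("buggy check on IN|HUP:", parent_exited_buggy(both))
    print("fixed check on IN|HUP:", parent_exited_fixed(both))
    print("fixed check on a real closed pipe:", parent_exited_fixed(poll_closed_pipe()))
```

The fixed check gives the same answer whether the platform reports the hang-up alone or together with the readable bit, which is why it does not depend on the Linux versus FreeBSD difference.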
The leftover process problem is now fixed in GLib: https://gitlab.gnome.org/GNOME/glib/merge_requests/251. I still hope the crash problem can be fixed in meson. I don't think users expect a misbehaving test to crash the build system. Meson is still not reliable enough to run GLib tests on FreeBSD. Meson puts the wrong RPATH on the generated executable, so the test still fails on the CI machine because it uses the system-installed GLib instead of the one built under the build directory: https://gitlab.gnome.org/GNOME/glib/-/jobs/86836.
Are you sure this is not caused by them using the "other" style of rpath or runpath or whatever it was where system paths override entries in RPATH instead of the other way around?
There is no LD_LIBRARY_PATH set in the environment and LDFLAGS already includes |
I think I got an easy reproduction, at least on Linux, using this main.c source:

    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (!fork()) {
            freopen("/dev/null", "w", stdout);
            sleep(5);
        }
        return 0;
    }

and this meson.build:

    project('communicate bug', 'c')
    test('bug',
         executable('bug', 'main.c', install : false),
         timeout : 3)

The problem is that the test process died, standard error is still "alive" but standard output is not (note the sleep(5) in the code and the 3 seconds timeout for Meson test).
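The same effect can be observed from Python without Meson. The sketch below is an illustration under assumptions, not Meson's code: it assumes a POSIX system with `sh` and `sleep` on the PATH, and the function name `run_with_lingering_grandchild` is hypothetical. The shell exits at once, but the backgrounded `sleep` inherits the stdout pipe, so `communicate` keeps waiting for end-of-file until its timeout expires.

```python
import subprocess


def run_with_lingering_grandchild(timeout=0.5, linger=2):
    """Return (timed_out, returncode) for a child that exits at once but
    leaves a background process holding its stdout pipe open."""
    # The shell exits immediately; the backgrounded sleep inherits stdout.
    proc = subprocess.Popen(["sh", "-c", f"sleep {linger} &"],
                            stdout=subprocess.PIPE)
    try:
        proc.communicate(timeout=timeout)
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True
    # poll() reaps the shell if it has already exited; None would mean
    # that it is still running.
    returncode = proc.poll()
    proc.stdout.close()
    return timed_out, returncode


if __name__ == "__main__":
    print(run_with_lingering_grandchild())
```

If the timeout fires while `poll()` already reports an exit status, the timeout does not mean that the test is still running, which is the assumption the commit message above attributes to Meson.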
Can you test if #4129 fixes things for you?
With the #4129 change, Meson reports a timeout and exits much more cleanly. The Meson code has this comment:

    # Python does not provide multiplatform support for
    # killing a process and all its children so we need
    # to roll our own.

which suggests that you are attempting to kill all children. However, my small test program uses a plain fork call (no setsid calls), so it is not clear why the killpg used by Meson is not working under Linux.
It's not the kill that is the problem but the closing of stdin et al (I think). I just put that in the same MR because they would conflict with each other if they had been in different MRs.
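For reference, one way a test runner could combine a timeout with process-group cleanup is sketched below. This is an assumption-laden illustration, not necessarily what #4129 or Meson does: the helper name `run_and_reap_group` is hypothetical, and it assumes a POSIX system where the child can be started in its own session so that its pid doubles as the process group id.

```python
import os
import signal
import subprocess


def run_and_reap_group(cmd, timeout):
    """Run cmd in its own session; on timeout, kill the whole process group."""
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            start_new_session=True)
    timed_out = False
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        timed_out = True
        # The child is a session and process group leader, so its pid is
        # also the id of the group that its forked children belong to.
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # every member of the group is already gone
        # Once all writers are dead the pipes reach end-of-file, so this
        # second call can collect the remaining output and the exit status.
        out, err = proc.communicate()
    return timed_out, proc.returncode, out, err


if __name__ == "__main__":
    print(run_and_reap_group(["sh", "-c", "sleep 30 &"], timeout=0.5))
```

A process that calls setsid itself would leave the group and keep the pipes open, so the second `communicate` call in this sketch could still block in that case.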
This is currently a blocker for adding FreeBSD CI to GLib. It is easily reproducible and it makes it almost impossible to complete the GLib test on FreeBSD.
It took 30 seconds for meson to crash, so it looked like a timeout issue. However, the gdbus-proxy-threads test itself completed in less than 5 seconds. It exited and became a zombie process before meson crashed.
Interestingly, meson doesn't crash when running the same test with the -v flag.
Running meson test with -v avoids the crash, but it still doesn't work with GitLab runner. It seems there is a leftover dbus-daemon process when running meson test with -v, and GitLab runner waits forever for it. The test https://gitlab.gnome.org/GNOME/glib/-/jobs/70974 has already been running for a week and GitLab runner is still waiting for it.
Related comments on GNOME GitLab: