Meson crashes while running GLib gdbus-proxy-threads test #3967

lantw44 · 2018-08-01T18:31:33Z

This is currently a blocker for adding FreeBSD CI to GLib. It is easily reproducible and it makes it almost impossible to complete the GLib test on FreeBSD.

$ ~/gnome/source/meson/meson.py test gdbus-proxy-threads
ninja: Entering directory `/tmp/ram/glib-build'
ninja: no work to do.
Traceback (most recent call last):
  File "/home/lantw44/gnome/source/meson/meson.py", line 29, in <module>
    sys.exit(mesonmain.main())
  File "/home/lantw44/gnome/source/meson/mesonbuild/mesonmain.py", line 379, in main
    return run(sys.argv[1:], launcher)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mesonmain.py", line 278, in run
    return mtest.run(remaining_args)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 759, in run
    return th.run_special()
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 692, in run_special
    self.run_tests(tests)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 666, in run_tests
    self.drain_futures(futures)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 682, in drain_futures
    self.process_test_result(result.result())
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 262, in run
    return self._run_cmd(wrap + cmd + self.test.cmd_args + self.options.test_args)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 353, in _run_cmd
    (stdo, stde) = p.communicate()
  File "/usr/local/lib/python3.6/subprocess.py", line 843, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/local/lib/python3.6/subprocess.py", line 1505, in _communicate
    selector.register(self.stdout, selectors.EVENT_READ)
  File "/usr/local/lib/python3.6/selectors.py", line 351, in register
    key = super().register(fileobj, events, data)
  File "/usr/local/lib/python3.6/selectors.py", line 237, in register
    key = SelectorKey(fileobj, self._fileobj_lookup(fileobj), events, data)
  File "/usr/local/lib/python3.6/selectors.py", line 224, in _fileobj_lookup
    return _fileobj_to_fd(fileobj)
  File "/usr/local/lib/python3.6/selectors.py", line 39, in _fileobj_to_fd
    "{!r}".format(fileobj)) from None
ValueError: Invalid file object: <_io.BufferedReader name=6>

It took 30 seconds for meson to crash, so it looked like a timeout issue. However, gdbus-proxy-threads test itself completed in less than 5 seconds. It exited and became a zombie process before meson crashed.

 └─┬= 93390 lantw44 python3 /home/lantw44/gnome/source/meson/meson.py test gdbus-proxy-threads (python3.6)
   └──= 93399 lantw44 <defunct>

Interestingly, meson doesn't crash when running the same test with -v flag.

$ ~/gnome/source/meson/meson.py test -v gdbus-proxy-threads
ninja: Entering directory `/tmp/ram/glib-build'
ninja: no work to do.
/gdbus/proxy/vs-threads: OK
1/1 glib:gio / gdbus-proxy-threads          OK       2.58 s 

OK:         1
FAIL:       0
SKIP:       0
TIMEOUT:    0

Running meson test with -v avoids the crash, but it still doesn't work with GitLab runner. It seems there is a leftover dbus-daemon process when running meson test with -v and GitLab runner waits forever for it. The test https://gitlab.gnome.org/GNOME/glib/-/jobs/70974 already runs for a week and GitLab runner still waits for it.

Related comments on GNOME GitLab:

The text was updated successfully, but these errors were encountered:

nirbheek · 2018-08-02T12:16:12Z

Looks like we're not killing process groups correctly after each test. Not sure what's going on because it seems to be BSD-specific?

lantw44 · 2018-08-13T14:40:27Z

After spending time reading mtest.py and subprocess.py, I think I understand the problem better now. Although it cannot be reproduced on GNU/Linux, it is not a BSD-specific issue but a bug of Python subprocess communicate function: https://bugs.python.org/issue30154.

When communicate is called, Python blocks itself on a call to poll to wait for data written to stdout and stderr. It exits the poll loop when all monitored file descriptors are closed. After it exits the loop, it waits for the child process to exit and return. It works most of the time because file descriptors are closed on process exit, so Python can exit the loop when the child process exits. However, if the child process forks a grandchild process and the grandchild process never exits, both stdout and stderr will still be open after the child process exits. This causes communicate to block forever until the timeout is expired.

gdbus-proxy-threads is a test which forks at least two child processes. One of them is dbus-daemon and another one is a watcher which is responsible for killing dbus-daemon when the test is done. On GNU/Linux, the test works correctly, so dbus-daemon is terminated when the test exits. On FreeBSD, the watcher process crashes and doesn't kill the dbus-daemon process, causing stdout and stderr to be kept open after the main process exits. This triggers the problem described above and seems to cause the second call to communicate to fail.

The ideal fix of the problem will be fixing the gdbus-proxy-threads crash on FreeBSD. However, it is still nice if meson itself can handle the case and explain the problem to users, so users will be able to easily understand what the problem is.

In file gio/gtestdbus.c, function watch_parent, there is a loop which waits for commands sent from the parent process and kills all processes recorded in 'pids_to_kill' array on parent process exit. The detection of parent process exit is done by calling g_poll and checking whether the returned event is G_IO_HUP. However, 'revents' is a bit mask, and we should use a bitwise-AND check instead of the equality check here. It seems to work fine on Linux, but it fails on FreeBSD because the g_poll returns both G_IO_IN and G_IO_HUP on pipe close. This means the watcher process continues waiting for commands after the parent process exit, and g_io_channel_read_line returns G_IO_STATUS_EOF with 'command' set to NULL. Then the watcher process crashes with segfault when calling sscanf because 'command' is NULL. Since the test result is already reported by the parent process as 'OK', this kind of crash is likely to be unnoticed unless someone checks dmesg messages after the test: pid 57611 (defaultvalue), uid 1001: exited on signal 11 pid 57935 (actions), uid 1001: exited on signal 11 pid 57945 (gdbus-bz627724), uid 1001: exited on signal 11 pid 57952 (gdbus-connection), uid 1001: exited on signal 11 pid 57970 (gdbus-connection-lo), uid 1001: exited on signal 11 pid 57976 (gdbus-connection-sl), uid 1001: exited on signal 11 pid 58039 (gdbus-exit-on-close), uid 1001: exited on signal 11 pid 58043 (gdbus-exit-on-close), uid 1001: exited on signal 11 pid 58047 (gdbus-exit-on-close), uid 1001: exited on signal 11 pid 58051 (gdbus-exit-on-close), uid 1001: exited on signal 11 pid 58055 (gdbus-export), uid 1001: exited on signal 11 pid 58059 (gdbus-introspection), uid 1001: exited on signal 11 pid 58065 (gdbus-names), uid 1001: exited on signal 11 pid 58071 (gdbus-proxy), uid 1001: exited on signal 11 pid 58079 (gdbus-proxy-threads), uid 1001: exited on signal 11 pid 58083 (gdbus-proxy-well-kn), uid 1001: exited on signal 11 pid 58091 (gdbus-test-codegen), uid 1001: exited on signal 11 pid 58095 (gdbus-threading), uid 1001: exited on signal 11 pid 58104 (gmenumodel), uid 1001: exited on signal 11 pid 58108 (gnotification), uid 1001: exited on signal 11 pid 58112 (gdbus-test-codegen-), uid 1001: exited on signal 11 pid 58116 (gapplication), uid 1001: exited on signal 11 pid 58132 (dbus-appinfo), uid 1001: exited on signal 11 If the watcher process crashes before killing the dbus-daemon process spawned by the parent process, the dbus-daemon process will keep running after all tests complete. Due to the implementation of 'communicate' function in Python subprocess, it causes meson to crash. 'communicate' assumes the stdout and stderr pipes are closed when the child process exits, but it is not true if processes forked by the child process doesn't exit. It causes Python subprocess 'communicate' function to block on the call to poll until the timeout expires even if the test finishes in a few seconds. Meson assumes the timeout exception always means the test is still running. It calls 'communicate' again and crashes because pipes no longer exist. https://gitlab.gnome.org/Infrastructure/GitLab/issues/286 mesonbuild/meson#3967 https://bugs.python.org/issue30154

lantw44 · 2018-08-22T16:10:26Z

The leftover process problem is now fixed in GLib: https://gitlab.gnome.org/GNOME/glib/merge_requests/251. I still hope the crash problem can be fixed in meson. I think having a misbehaving test crashes the build system isn't something users expect to happen.

Meson is still not reliable enough to run GLib tests on FreeBSD. Meson puts wrong RPATH on the generated executable, so the test still fails on the CI machine because it uses the system-installed GLib instead of the one built under the build directory: https://gitlab.gnome.org/GNOME/glib/-/jobs/86836.

jpakkane · 2018-08-22T16:22:25Z

est still fails on the CI machine because it uses the system-installed GLib instead of the one built under the build directory

Are you sure this is not caused by them using the "other" style of rpath or runpath or whatever it was where system paths override entries in RPATH instead of the other way around?

lantw44 · 2018-08-22T17:52:35Z

Are you sure this is not caused by them using the "other" style of rpath or runpath or whatever it was where system paths override entries in RPATH instead of the other way around?

There is no LD_LIBRARY_PATH set in the environment and LDFLAGS already includes -Wl,--disable-new-dtags. I think it is really a RPATH problem because the first item in RPATH is $ORIGIN/../../../../../../../../../../../usr/local/lib. You can see the output of ldd and readelf in this comment: https://gitlab.gnome.org/GNOME/glib/merge_requests/251#note_296739. The test was done with meson 0.47.1.

freddy77 · 2018-09-05T14:49:19Z

I think I got an easy reproduction, at least on Linux, using this main.c source:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (!fork()) {
		freopen("/dev/null", "w", stdout);
		sleep(5);
	}
	return 0;
}

and this meson.build:

project('communicate bug', 'c')
test('bug',
     executable('bug', 'main.c', install : false),
     timeout : 3)

The problem is that the test process died, standard error is still "alive" but standard output is not (note the sleep(5) in the code and the 3 seconds timeout for Meson test).

jpakkane · 2018-09-05T17:29:06Z

Can you test if #4129 fixes things for you?

freddy77 · 2018-09-06T10:06:03Z

With #4129 change Meson give timeout and a much better exit.
Question is: is it the right behaviour? The test program exited successfully and autoconf and different test environment are perfectly fine with this.

On the Meson code there's this comment:

            # Python does not provide multiplatform support for
            # killing a process and all its children so we need
            # to roll our own.

which looks like you are attempting to kill all children. However if you look at my small test program there's a simple fork call (no setsid calls) so is not clear why the killpg (used by Meson) is not working under Linux.
Looking at code the behaviour will change to a way similar to autoconf if you use -v (verbose). In this case normal output/error are used and when the test program exit test give success.
So, in the end, I think the correct behaviour would be to act like autoconf not waiting for all children but only for the test process.

jpakkane · 2018-09-06T16:16:11Z

It's not the kill that is the problem but the closing of stdin et al (I think). I just put that in the same MR because they would conflict with each other if they had been in different MRs.

nirbheek added bug OS:*BSD Issues specific to one of the BSD Operating Systems test targets labels Aug 2, 2018

jpakkane added a commit that referenced this issue Sep 5, 2018

Guard against the process redirecting stdout. Closes #3967.

277f39d

jpakkane added a commit that referenced this issue Sep 5, 2018

Guard against the process redirecting stdout. Closes #3967.

dc594d8

jpakkane closed this as completed in 8ca463f Sep 6, 2018

lantw44 mentioned this issue Sep 26, 2018

rpaths_for_bundled_shared_libraries adds unnecessary RPATH for pkg-config dependencies and causes linking failure #4270

Open

mrc0mmand mentioned this issue May 29, 2023

test-async is unstable systemd/systemd#27833

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Meson crashes while running GLib gdbus-proxy-threads test #3967

Meson crashes while running GLib gdbus-proxy-threads test #3967

lantw44 commented Aug 1, 2018

nirbheek commented Aug 2, 2018

lantw44 commented Aug 13, 2018

lantw44 commented Aug 22, 2018

jpakkane commented Aug 22, 2018

lantw44 commented Aug 22, 2018

freddy77 commented Sep 5, 2018

jpakkane commented Sep 5, 2018

freddy77 commented Sep 6, 2018

jpakkane commented Sep 6, 2018

Meson crashes while running GLib gdbus-proxy-threads test #3967

Meson crashes while running GLib gdbus-proxy-threads test #3967

Comments

lantw44 commented Aug 1, 2018

nirbheek commented Aug 2, 2018

lantw44 commented Aug 13, 2018

lantw44 commented Aug 22, 2018

jpakkane commented Aug 22, 2018

lantw44 commented Aug 22, 2018

freddy77 commented Sep 5, 2018

jpakkane commented Sep 5, 2018

freddy77 commented Sep 6, 2018

jpakkane commented Sep 6, 2018