Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Meson crashes while running GLib gdbus-proxy-threads test #3967

Closed
lantw44 opened this issue Aug 1, 2018 · 9 comments
Closed

Meson crashes while running GLib gdbus-proxy-threads test #3967

lantw44 opened this issue Aug 1, 2018 · 9 comments
Labels
bug OS:*BSD Issues specific to one of the BSD Operating Systems test targets

Comments

@lantw44
Copy link
Contributor

lantw44 commented Aug 1, 2018

This is currently a blocker for adding FreeBSD CI to GLib. It is easily reproducible and it makes it almost impossible to complete the GLib test on FreeBSD.

$ ~/gnome/source/meson/meson.py test gdbus-proxy-threads
ninja: Entering directory `/tmp/ram/glib-build'
ninja: no work to do.
Traceback (most recent call last):
  File "/home/lantw44/gnome/source/meson/meson.py", line 29, in <module>
    sys.exit(mesonmain.main())
  File "/home/lantw44/gnome/source/meson/mesonbuild/mesonmain.py", line 379, in main
    return run(sys.argv[1:], launcher)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mesonmain.py", line 278, in run
    return mtest.run(remaining_args)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 759, in run
    return th.run_special()
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 692, in run_special
    self.run_tests(tests)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 666, in run_tests
    self.drain_futures(futures)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 682, in drain_futures
    self.process_test_result(result.result())
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 262, in run
    return self._run_cmd(wrap + cmd + self.test.cmd_args + self.options.test_args)
  File "/home/lantw44/gnome/source/meson/mesonbuild/mtest.py", line 353, in _run_cmd
    (stdo, stde) = p.communicate()
  File "/usr/local/lib/python3.6/subprocess.py", line 843, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/local/lib/python3.6/subprocess.py", line 1505, in _communicate
    selector.register(self.stdout, selectors.EVENT_READ)
  File "/usr/local/lib/python3.6/selectors.py", line 351, in register
    key = super().register(fileobj, events, data)
  File "/usr/local/lib/python3.6/selectors.py", line 237, in register
    key = SelectorKey(fileobj, self._fileobj_lookup(fileobj), events, data)
  File "/usr/local/lib/python3.6/selectors.py", line 224, in _fileobj_lookup
    return _fileobj_to_fd(fileobj)
  File "/usr/local/lib/python3.6/selectors.py", line 39, in _fileobj_to_fd
    "{!r}".format(fileobj)) from None
ValueError: Invalid file object: <_io.BufferedReader name=6>

It took 30 seconds for meson to crash, so it looked like a timeout issue. However, gdbus-proxy-threads test itself completed in less than 5 seconds. It exited and became a zombie process before meson crashed.

 └─┬= 93390 lantw44 python3 /home/lantw44/gnome/source/meson/meson.py test gdbus-proxy-threads (python3.6)
   └──= 93399 lantw44 <defunct>

Interestingly, meson doesn't crash when running the same test with -v flag.

$ ~/gnome/source/meson/meson.py test -v gdbus-proxy-threads
ninja: Entering directory `/tmp/ram/glib-build'
ninja: no work to do.
/gdbus/proxy/vs-threads: OK
1/1 glib:gio / gdbus-proxy-threads          OK       2.58 s 

OK:         1
FAIL:       0
SKIP:       0
TIMEOUT:    0

Running meson test with -v avoids the crash, but it still doesn't work with GitLab runner. It seems there is a leftover dbus-daemon process when running meson test with -v and GitLab runner waits forever for it. The test https://gitlab.gnome.org/GNOME/glib/-/jobs/70974 already runs for a week and GitLab runner still waits for it.

Related comments on GNOME GitLab:

@nirbheek nirbheek added bug OS:*BSD Issues specific to one of the BSD Operating Systems test targets labels Aug 2, 2018
@nirbheek
Copy link
Member

nirbheek commented Aug 2, 2018

Looks like we're not killing process groups correctly after each test. Not sure what's going on because it seems to be BSD-specific?

@lantw44
Copy link
Contributor Author

lantw44 commented Aug 13, 2018

After spending time reading mtest.py and subprocess.py, I think I understand the problem better now. Although it cannot be reproduced on GNU/Linux, it is not a BSD-specific issue but a bug of Python subprocess communicate function: https://bugs.python.org/issue30154.

When communicate is called, Python blocks itself on a call to poll to wait for data written to stdout and stderr. It exits the poll loop when all monitored file descriptors are closed. After it exits the loop, it waits for the child process to exit and return. It works most of the time because file descriptors are closed on process exit, so Python can exit the loop when the child process exits. However, if the child process forks a grandchild process and the grandchild process never exits, both stdout and stderr will still be open after the child process exits. This causes communicate to block forever until the timeout is expired.

gdbus-proxy-threads is a test which forks at least two child processes. One of them is dbus-daemon and another one is a watcher which is responsible for killing dbus-daemon when the test is done. On GNU/Linux, the test works correctly, so dbus-daemon is terminated when the test exits. On FreeBSD, the watcher process crashes and doesn't kill the dbus-daemon process, causing stdout and stderr to be kept open after the main process exits. This triggers the problem described above and seems to cause the second call to communicate to fail.

The ideal fix of the problem will be fixing the gdbus-proxy-threads crash on FreeBSD. However, it is still nice if meson itself can handle the case and explain the problem to users, so users will be able to easily understand what the problem is.

gnomesysadmins pushed a commit to GNOME/glib that referenced this issue Aug 19, 2018
In file gio/gtestdbus.c, function watch_parent, there is a loop which
waits for commands sent from the parent process and kills all processes
recorded in 'pids_to_kill' array on parent process exit. The detection
of parent process exit is done by calling g_poll and checking whether
the returned event is G_IO_HUP. However, 'revents' is a bit mask, and
we should use a bitwise-AND check instead of the equality check here.

It seems to work fine on Linux, but it fails on FreeBSD because the
g_poll returns both G_IO_IN and G_IO_HUP on pipe close. This means the
watcher process continues waiting for commands after the parent process
exit, and g_io_channel_read_line returns G_IO_STATUS_EOF with 'command'
set to NULL. Then the watcher process crashes with segfault when calling
sscanf because 'command' is NULL. Since the test result is already
reported by the parent process as 'OK', this kind of crash is likely to
be unnoticed unless someone checks dmesg messages after the test:

pid 57611 (defaultvalue), uid 1001: exited on signal 11
pid 57935 (actions), uid 1001: exited on signal 11
pid 57945 (gdbus-bz627724), uid 1001: exited on signal 11
pid 57952 (gdbus-connection), uid 1001: exited on signal 11
pid 57970 (gdbus-connection-lo), uid 1001: exited on signal 11
pid 57976 (gdbus-connection-sl), uid 1001: exited on signal 11
pid 58039 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58043 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58047 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58051 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58055 (gdbus-export), uid 1001: exited on signal 11
pid 58059 (gdbus-introspection), uid 1001: exited on signal 11
pid 58065 (gdbus-names), uid 1001: exited on signal 11
pid 58071 (gdbus-proxy), uid 1001: exited on signal 11
pid 58079 (gdbus-proxy-threads), uid 1001: exited on signal 11
pid 58083 (gdbus-proxy-well-kn), uid 1001: exited on signal 11
pid 58091 (gdbus-test-codegen), uid 1001: exited on signal 11
pid 58095 (gdbus-threading), uid 1001: exited on signal 11
pid 58104 (gmenumodel), uid 1001: exited on signal 11
pid 58108 (gnotification), uid 1001: exited on signal 11
pid 58112 (gdbus-test-codegen-), uid 1001: exited on signal 11
pid 58116 (gapplication), uid 1001: exited on signal 11
pid 58132 (dbus-appinfo), uid 1001: exited on signal 11

If the watcher process crashes before killing the dbus-daemon process
spawned by the parent process, the dbus-daemon process will keep running
after all tests complete. Due to the implementation of 'communicate'
function in Python subprocess, it causes meson to crash. 'communicate'
assumes the stdout and stderr pipes are closed when the child process
exits, but it is not true if processes forked by the child process
doesn't exit. It causes Python subprocess 'communicate' function to
block on the call to poll until the timeout expires even if the test
finishes in a few seconds. Meson assumes the timeout exception always
means the test is still running. It calls 'communicate' again and
crashes because pipes no longer exist.

https://gitlab.gnome.org/Infrastructure/GitLab/issues/286
mesonbuild/meson#3967
https://bugs.python.org/issue30154
gnomesysadmins pushed a commit to GNOME/glib that referenced this issue Aug 20, 2018
In file gio/gtestdbus.c, function watch_parent, there is a loop which
waits for commands sent from the parent process and kills all processes
recorded in 'pids_to_kill' array on parent process exit. The detection
of parent process exit is done by calling g_poll and checking whether
the returned event is G_IO_HUP. However, 'revents' is a bit mask, and
we should use a bitwise-AND check instead of the equality check here.

It seems to work fine on Linux, but it fails on FreeBSD because the
g_poll returns both G_IO_IN and G_IO_HUP on pipe close. This means the
watcher process continues waiting for commands after the parent process
exit, and g_io_channel_read_line returns G_IO_STATUS_EOF with 'command'
set to NULL. Then the watcher process crashes with segfault when calling
sscanf because 'command' is NULL. Since the test result is already
reported by the parent process as 'OK', this kind of crash is likely to
be unnoticed unless someone checks dmesg messages after the test:

pid 57611 (defaultvalue), uid 1001: exited on signal 11
pid 57935 (actions), uid 1001: exited on signal 11
pid 57945 (gdbus-bz627724), uid 1001: exited on signal 11
pid 57952 (gdbus-connection), uid 1001: exited on signal 11
pid 57970 (gdbus-connection-lo), uid 1001: exited on signal 11
pid 57976 (gdbus-connection-sl), uid 1001: exited on signal 11
pid 58039 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58043 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58047 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58051 (gdbus-exit-on-close), uid 1001: exited on signal 11
pid 58055 (gdbus-export), uid 1001: exited on signal 11
pid 58059 (gdbus-introspection), uid 1001: exited on signal 11
pid 58065 (gdbus-names), uid 1001: exited on signal 11
pid 58071 (gdbus-proxy), uid 1001: exited on signal 11
pid 58079 (gdbus-proxy-threads), uid 1001: exited on signal 11
pid 58083 (gdbus-proxy-well-kn), uid 1001: exited on signal 11
pid 58091 (gdbus-test-codegen), uid 1001: exited on signal 11
pid 58095 (gdbus-threading), uid 1001: exited on signal 11
pid 58104 (gmenumodel), uid 1001: exited on signal 11
pid 58108 (gnotification), uid 1001: exited on signal 11
pid 58112 (gdbus-test-codegen-), uid 1001: exited on signal 11
pid 58116 (gapplication), uid 1001: exited on signal 11
pid 58132 (dbus-appinfo), uid 1001: exited on signal 11

If the watcher process crashes before killing the dbus-daemon process
spawned by the parent process, the dbus-daemon process will keep running
after all tests complete. Due to the implementation of 'communicate'
function in Python subprocess, it causes meson to crash. 'communicate'
assumes the stdout and stderr pipes are closed when the child process
exits, but it is not true if processes forked by the child process
doesn't exit. It causes Python subprocess 'communicate' function to
block on the call to poll until the timeout expires even if the test
finishes in a few seconds. Meson assumes the timeout exception always
means the test is still running. It calls 'communicate' again and
crashes because pipes no longer exist.

https://gitlab.gnome.org/Infrastructure/GitLab/issues/286
mesonbuild/meson#3967
https://bugs.python.org/issue30154
@lantw44
Copy link
Contributor Author

lantw44 commented Aug 22, 2018

The leftover process problem is now fixed in GLib: https://gitlab.gnome.org/GNOME/glib/merge_requests/251. I still hope the crash problem can be fixed in meson. I think having a misbehaving test crashes the build system isn't something users expect to happen.

Meson is still not reliable enough to run GLib tests on FreeBSD. Meson puts wrong RPATH on the generated executable, so the test still fails on the CI machine because it uses the system-installed GLib instead of the one built under the build directory: https://gitlab.gnome.org/GNOME/glib/-/jobs/86836.

@jpakkane
Copy link
Member

est still fails on the CI machine because it uses the system-installed GLib instead of the one built under the build directory

Are you sure this is not caused by them using the "other" style of rpath or runpath or whatever it was where system paths override entries in RPATH instead of the other way around?

@lantw44
Copy link
Contributor Author

lantw44 commented Aug 22, 2018

Are you sure this is not caused by them using the "other" style of rpath or runpath or whatever it was where system paths override entries in RPATH instead of the other way around?

There is no LD_LIBRARY_PATH set in the environment and LDFLAGS already includes -Wl,--disable-new-dtags. I think it is really a RPATH problem because the first item in RPATH is $ORIGIN/../../../../../../../../../../../usr/local/lib. You can see the output of ldd and readelf in this comment: https://gitlab.gnome.org/GNOME/glib/merge_requests/251#note_296739. The test was done with meson 0.47.1.

@freddy77
Copy link

freddy77 commented Sep 5, 2018

I think I got an easy reproduction, at least on Linux, using this main.c source:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (!fork()) {
		freopen("/dev/null", "w", stdout);
		sleep(5);
	}
	return 0;
}

and this meson.build:

project('communicate bug', 'c')
test('bug',
     executable('bug', 'main.c', install : false),
     timeout : 3)

The problem is that the test process died, standard error is still "alive" but standard output is not (note the sleep(5) in the code and the 3 seconds timeout for Meson test).

@jpakkane
Copy link
Member

jpakkane commented Sep 5, 2018

Can you test if #4129 fixes things for you?

@freddy77
Copy link

freddy77 commented Sep 6, 2018

With #4129 change Meson give timeout and a much better exit.
Question is: is it the right behaviour? The test program exited successfully and autoconf and different test environment are perfectly fine with this.

On the Meson code there's this comment:

            # Python does not provide multiplatform support for
            # killing a process and all its children so we need
            # to roll our own.

which looks like you are attempting to kill all children. However if you look at my small test program there's a simple fork call (no setsid calls) so is not clear why the killpg (used by Meson) is not working under Linux.
Looking at code the behaviour will change to a way similar to autoconf if you use -v (verbose). In this case normal output/error are used and when the test program exit test give success.
So, in the end, I think the correct behaviour would be to act like autoconf not waiting for all children but only for the test process.

@jpakkane
Copy link
Member

jpakkane commented Sep 6, 2018

It's not the kill that is the problem but the closing of stdin et al (I think). I just put that in the same MR because they would conflict with each other if they had been in different MRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug OS:*BSD Issues specific to one of the BSD Operating Systems test targets
Projects
None yet
Development

No branches or pull requests

4 participants