Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make runtest: Lost processes & ERROR: socket connect failed (host: localhost, port: 48920): error: 111 #1866

Closed
vt-alt opened this issue Dec 24, 2023 · 7 comments

Comments

@vt-alt
Copy link

vt-alt commented Dec 24, 2023

When make runtest is finished there are still uftrace processes remaining:

[00:00:08] + ps axuwf
[00:00:08] USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
[00:00:08] builder  1138152  0.0  0.0   3528  2612 ?        SNs+ 22:25   0:00 /bin/sh -e /.host/entry
[00:00:08] builder  1138155  0.0  0.0   2424   808 ?        SN+  22:25   0:00  \_ time rpmbuild --rebuild --target=i586 --define _binary_payload w5T.xzdio --define _debuginfo_payload w7T.xzdio --define _source_payload w7T.xzdio uftrace-0.14-alt2.src.rpm
[00:00:08] builder  1138158  0.9  0.0   7300  5332 ?        SN+  22:25   0:00      \_ rpmbuild --rebuild --target=i586 --define _binary_payload w5T.xzdio --define _debuginfo_payload w7T.xzdio --define _source_payload w7T.xzdio uftrace-0.14-alt2.src.rpm
[00:00:08] builder  1140836  0.0  0.0   3528  2592 ?        SN+  22:25   0:00          \_ /bin/sh -e /usr/src/tmp/rpm-tmp.53502
[00:00:08] builder  1141240  0.0  0.0   4028  2596 ?        RN+  22:25   0:00              \_ ps axuwf
[00:00:08] + timeout 600 make -j8 runtest V=1
[00:00:08] ./runtest.py   -j8  
[00:00:18] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:18]  ERROR: socket connect failed (host: localhost, port: 48920): error: 111
[00:00:18] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:18]  ERROR: socket connect failed (host: localhost, port: 43147): error: 111
[00:00:18] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:18]  ERROR: socket connect failed (host: localhost, port: 40651): error: 111
[00:00:19] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:19]  ERROR: socket connect failed (host: localhost, port: 42775): error: 111
[00:00:19] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:19]  ERROR: socket connect failed (host: localhost, port: 48226): error: 111
[00:00:19] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:19]  ERROR: socket connect failed (host: localhost, port: 41528): error: 111
[00:00:19] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:19]  ERROR: socket connect failed (host: localhost, port: 43940): error: 111
[00:00:19] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:19]  ERROR: socket connect failed (host: localhost, port: 44453): error: 111
[00:00:19] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:19]  ERROR: socket connect failed (host: localhost, port: 40054): error: 111
[00:00:19] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:19]  ERROR: socket connect failed (host: localhost, port: 45349): error: 111
[00:00:19] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:19]  ERROR: socket connect failed (host: localhost, port: 43537): error: 111
[00:00:19] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:19]  ERROR: socket connect failed (host: localhost, port: 49222): error: 111
[00:00:20] Start 287 tests with 8 worker

...

[00:01:12] runtime test stats
[00:01:12] ====================
[00:01:12] total  2870  Tests executed (success: 46.20%)
[00:01:12]   OK:  1304  Test succeeded
[00:01:12]   OK:    22  Test succeeded (with some fixup)
[00:01:12]   NG:   566  Different test result
[00:01:12]   NZ:    12  Non-zero return value
[00:01:12]   SG:    16  Abnormal exit by signal
[00:01:12]   TM:     0  Test ran too long
[00:01:12]   BI:   355  Build failed
[00:01:12]   LA:     0  Unsupported Language
[00:01:12]   SK:   595  Skipped
[00:01:12] + ps axuwf
[00:01:12] USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
[00:01:12] builder  1138152  0.0  0.0   3528  2612 ?        SNs+ 22:25   0:00 /bin/sh -e /.host/entry
[00:01:12] builder  1138155  0.0  0.0   2424   808 ?        SN+  22:25   0:00  \_ time rpmbuild --rebuild --target=i586 --define _binary_payload w5T.xzdio --define _debuginfo_payload w7T.xzdio --define _source_payload w7T.xzdio uftrace-0.14-alt2.src.rpm
[00:01:12] builder  1138158  0.0  0.0   7300  5332 ?        SN+  22:25   0:00      \_ rpmbuild --rebuild --target=i586 --define _binary_payload w5T.xzdio --define _debuginfo_payload w7T.xzdio --define _source_payload w7T.xzdio uftrace-0.14-alt2.src.rpm
[00:01:12] builder  1140836  0.0  0.0   3528  2596 ?        SN+  22:25   0:00          \_ /bin/sh -e /usr/src/tmp/rpm-tmp.53502
[00:01:12] builder  1231903  0.0  0.0   4028  2892 ?        RN+  22:26   0:00              \_ ps axuwf
[00:01:12] builder  1190639  0.0  0.0   9880  1420 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 44231 t-abc
[00:01:12] builder  1188389  0.0  0.0   9880  1544 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 49222 -N memcpy t-abc
[00:01:12] builder  1188011  0.0  0.0   9880  1612 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 43537 -N memcpy t-abc
[00:01:12] builder  1187707  0.0  0.0   9880  1676 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 45349 -N memcpy t-abc
[00:01:12] builder  1187590  0.0  0.0   9880  1724 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 40054 -N memcpy t-abc
[00:01:12] builder  1186853  0.0  0.0   9880  1672 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 44453 -N memcpy t-abc
[00:01:12] builder  1186776  0.0  0.0   9880  1484 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 43940 -N memcpy t-abc
[00:01:12] builder  1186539  0.0  0.0   9880  1548 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 41528 -N memcpy t-abc
[00:01:12] builder  1186299  0.0  0.0   9880  1484 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 48226 -N memcpy t-abc
[00:01:12] builder  1186069  0.0  0.0   9880  1676 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 42775 -N memcpy t-abc
[00:01:12] builder  1185924  0.0  0.0   9880  1548 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 40651 -N memcpy t-abc
[00:01:12] builder  1185852  0.0  0.0   9880  1612 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 43147 -N memcpy t-abc
[00:01:12] builder  1185436  0.0  0.0   9880  1728 ?        SN   22:26   0:00 /usr/src/RPM/BUILD/uftrace-0.14/uftrace record --no-pager --no-event --libmcount-path=/usr/src/RPM/BUILD/uftrace-0.14 --host localhost --port 48920 -N memcpy t-abc

(This causes our build system to choke.)

@namhyung
Copy link
Owner

It seems read on eventfd from child process got stuck. Maybe I need to use a pipe instead of eventfd to handle failure cases well.

@namhyung
Copy link
Owner

Can you please test this?

diff --git a/cmds/record.c b/cmds/record.c
index 81de11af..e4707603 100644
--- a/cmds/record.c
+++ b/cmds/record.c
@@ -10,7 +10,6 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/epoll.h>
-#include <sys/eventfd.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
 #include <sys/personality.h>
@@ -1835,7 +1834,7 @@ static void setup_writers(struct writer_data *wd, struct uftrace_opts *opts)
        wd->writers = xmalloc(opts->nr_thread * sizeof(*wd->writers));

        if (pipe(thread_ctl) < 0)
-               pr_err("cannot create an eventfd for writer thread");
+               pr_err("cannot create a pipe for writer thread");
 }

 static void start_tracing(struct writer_data *wd, struct uftrace_opts *opts, int ready_fd)
@@ -2078,16 +2077,17 @@ static void write_symbol_files(struct writer_data *wd, struct uftrace_opts *opts
                chown_directory(opts->dirname);
 }
-int do_main_loop(int ready, struct uftrace_opts *opts, int pid)
+int do_main_loop(int ready[], struct uftrace_opts *opts, int pid)
 {
        int ret;
        struct writer_data wd;
        char *channel = NULL;

+       close(ready[0]);
        if (opts->nop) {
                setup_writers(&wd, opts);
-               start_tracing(&wd, opts, ready);
-               close(ready);
+               start_tracing(&wd, opts, ready[1]);
+               close(ready[1]);

                wait(NULL);
                uftrace_done = true;
@@ -2110,8 +2110,8 @@ int do_main_loop(int ready, struct uftrace_opts *opts, int pid)
                pr_out("uftrace: install signal handlers to task %d\n", pid);

        setup_writers(&wd, opts);
-       start_tracing(&wd, opts, ready);
-       close(ready);
+       start_tracing(&wd, opts, ready[1]);
+       close(ready[1]);

        while (!uftrace_done) {
                struct pollfd pollfd = {
@@ -2139,7 +2139,7 @@ int do_main_loop(int ready, struct uftrace_opts *opts, int pid)
        return ret;
 }

-int do_child_exec(int ready, struct uftrace_opts *opts, int argc, char *argv[])
+int do_child_exec(int ready[], struct uftrace_opts *opts, int argc, char *argv[])
 {
        uint64_t dummy;
        char *shebang = NULL;
@@ -2148,6 +2148,7 @@ int do_child_exec(int ready, struct uftrace_opts *opts, int argc, char *argv[])
        struct strv new_args = STRV_INIT;
        bool is_python = false;

+       close(ready[1]);
        if (opts->no_randomize_addr) {
                /* disable ASLR (Address Space Layout Randomization) */
                if (personality(ADDR_NO_RANDOMIZE) < 0)
@@ -2223,8 +2224,9 @@ int do_child_exec(int ready, struct uftrace_opts *opts, int argc, char *argv[])
        setup_child_environ(opts, argc, argv);

        /* wait for parent ready */
-       if (read(ready, &dummy, sizeof(dummy)) != (ssize_t)sizeof(dummy))
+       if (read(ready[0], &dummy, sizeof(dummy)) != (ssize_t)sizeof(dummy))
                pr_err("waiting for parent failed");
+       close(ready[0]);

        if (is_python) {
                char *python_path = NULL;
@@ -2257,7 +2259,7 @@ int do_child_exec(int ready, struct uftrace_opts *opts, int argc, char *argv[])
 int command_record(int argc, char *argv[], struct uftrace_opts *opts)
 {
        int pid;
-       int ready;
+       int ready[2];
        int ret = -1;
        char *channel = NULL;

@@ -2279,9 +2281,8 @@ int command_record(int argc, char *argv[], struct uftrace_opts *opts)

        fflush(stdout);

-       ready = eventfd(0, EFD_CLOEXEC | EFD_SEMAPHORE);
-       if (ready < 0)
-               pr_dbg("creating eventfd failed: %d\n", ready);
+       if (pipe(ready) < 0)
+               pr_err("creating pipe failed");

        pid = fork();
        if (pid < 0)

@vt-alt
Copy link
Author

vt-alt commented Dec 25, 2023

Thanks, I will test this.

@vt-alt
Copy link
Author

vt-alt commented Dec 25, 2023

Well, patch does not apply to the master branch. (I used clipboard copy button on your quotation to copy it correctly.)

$ patch -p1 --merge  < test.ptch
patching file cmds/record.c
Hunk #2 NOT MERGED at 1837-1843.
patch: **** malformed patch at line 42: @@ -2110,8 +2110,8 @@ int do_main_loop(int ready, struct uftrace_opts *opts, int pid)

After I add empty line before 42:

$ patch -p1 --merge  < test.patch
patching file cmds/record.c
Hunk #2 NOT MERGED at 1837-1843.
Hunk #3 NOT MERGED at 2085-2090,2092-2107.
Hunk #4 NOT MERGED at 2125-2131.
Hunk #5 merged at 2173.
Hunk #6 NOT MERGED at 2183-2189.
Hunk #7 NOT MERGED at 2262-2269.
Hunk #8 NOT MERGED at 2305-2312.
Hunk #9 NOT MERGED at 2334-2341.

Fuzzy apply:

$ patch -p1 --merge -l < test.patch
patching file cmds/record.c
Hunk #3 NOT MERGED at 2081-2086, merged at 2092, merged at 2095-2096.

It seems that tabs are lost (when copied from terminal?). But last unmerged hunk is obscure. Last hunk does not apply because of empty line before unmerged line in source but not in patch.

So nevermind, I can manually fix all this just for testing.

@vt-alt
Copy link
Author

vt-alt commented Dec 25, 2023

Just tested this two times with TESTARG=basic and it does not create lost processes even when there are Connection refused errors.

[00:00:08] + timeout 600 make -j8 runtest V=1 TESTARG=basic
[00:00:08] ./runtest.py   -j8 basic 
[00:00:08] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:08]  ERROR: socket connect failed (host: localhost, port: 45093): Connection refused
[00:00:08] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/record.c:2228:do_child_exec
[00:00:08]  ERROR: waiting for parent failed: Invalid argument
[00:00:08] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/recv.c:124:setup_client_socket
[00:00:08]  ERROR: socket connect failed (host: localhost, port: 48638): Connection refused
[00:00:08] uftrace: /usr/src/RPM/BUILD/uftrace-0.14/cmds/record.c:2228:do_child_exec
[00:00:08]  ERROR: waiting for parent failed: Invalid argument
[00:00:25] Start 5 tests with 5 worker
[00:00:25] 
[00:00:25] Compiler                  gcc                                         
[00:00:25] Runtime test case         pg             finstrument-fu fpatchable-fun
[00:00:25] ------------------------: O0 O1 O2 O3 Os O0 O1 O2 O3 Os O0 O1 O2 O3 Os
[00:00:25] 001 basic               : OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK
[00:00:25] 024 report_basic        : OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK
[00:00:25] 097 dump_basic          : OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK
[00:00:25] 141 recv_basic          : OK OK NZ OK OK NZ OK OK OK OK OK OK OK OK OK
[00:00:25] 273 agent_basic         : OK OK OK OK OK OK OK OK OK OK NZ OK OK OK OK
[00:00:25] 
[00:00:25] + ps axuwf
[00:00:25] USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
[00:00:25] builder  3407700  0.0  0.0   3364  2632 ?        SNs+ 20:51   0:00 /bin/sh -e /.host/entry
[00:00:25] builder  3407703  0.0  0.0   2356  1112 ?        SN+  20:51   0:00  \_ time rpmbuild --rebuild --target=x86_64 --define _binary_payload w5T.xzdio --define _debuginfo_payload w7T.xzdio --define _source_payload w7T.xzdio uftrace-0.14-alt3.test.src.rpm
[00:00:25] builder  3407706  0.1  0.0   8328  6580 ?        SN+  20:51   0:00      \_ rpmbuild --rebuild --target=x86_64 --define _binary_payload w5T.xzdio --define _debuginfo_payload w7T.xzdio --define _source_payload w7T.xzdio uftrace-0.14-alt3.test.src.rpm
[00:00:25] builder  3410338  0.0  0.0   3364  2640 ?        SN+  20:51   0:00          \_ /bin/sh -e /usr/src/tmp/rpm-tmp.17519
[00:00:25] builder  3413962  0.0  0.0   4184  2860 ?        RN+  20:52   0:00              \_ ps axuwf

@vt-alt
Copy link
Author

vt-alt commented Dec 25, 2023

Just tested without TESTARG=basic and even though it causes No space left on device it finishes without lost processes too. So this change seems good. Thanks!

@namhyung
Copy link
Owner

Thanks for the test!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants