tests time out on ppc64el architecture #955

Closed
drew-parsons opened this issue Aug 6, 2023 · 42 comments · Fixed by #959

Comments

@drew-parsons
Contributor

drew-parsons commented Aug 6, 2023

Hypre tests (in debian builds) have started timing out on ppc64el architecture, see https://buildd.debian.org/status/logs.php?pkg=hypre&arch=ppc64el. The timeouts are happening for both hypre v2.28.0 and v2.29.0. The timeout is somewhat random, happening in a different test in different attempts.

Previously (April 2023) 2.28.0 passed tests. Since then Debian has made a new stable release (using gcc-12, openmpi 4.1.4). The failing environment now builds with gcc-13 and openmpi 4.1.5. I'm not sure how relevant that is, but evidently the timeout is related to the build environment.

For reference, there is also a recent timeout on hppa, https://buildd.debian.org/status/logs.php?pkg=hypre&arch=hppa
But 2.29.0 recently built successfully on hppa, so I expect success with 2.28.0 after trying again. The ppc64el failure has been more reproducible (4 recent failed attempts).

In general there's been a degree of flakiness in the hypre tests. Occasionally I've had timeouts locally on my own machine (amd64); I just run the tests again and they pass the next time. Is there anything that can be done to make the tests more robust?

Possibly there's nothing hypre can do about it. But if you have any ideas, I'll be glad to try them.

@victorapm
Contributor

Hi @drew-parsons, thank you for your message.

Before addressing the timeout concerns, let me mention another way to check if the library works:

make check performs a simple check:

hypre/src/Makefile

Lines 158 to 165 in 03b9d3d

check:
@ \
echo "Checking the library ..."; \
(cd test; $(MAKE) all); \
(cd test; $(CHECKRUN) ./ij $(PARMS) 2> ij.err); \
(cd test; $(CHECKRUN) ./struct $(PARMS) 2> struct.err); \
(cd test; cp -f TEST_sstruct/sstruct.in.default .; $(CHECKRUN) ./sstruct $(PARMS) 2> sstruct.err); \
(cd test; ls -l ij.err struct.err sstruct.err)

make checkpar does more extensive checks targeting parallel runs:

hypre/src/Makefile

Lines 167 to 182 in 03b9d3d

checkpar:
@ \
echo "Checking the library ..."; \
(cd test; $(MAKE) all); \
echo "Testing IJ ..."; \
(cd test; ./runtest.sh -tol 1.e-06 -mpi "$(CHECKRUN)" TEST_ij/solvers.sh); \
(cd test; ./checktest.sh); \
(cd test; ./cleantest.sh); \
echo "Testing Struct ..."; \
(cd test; ./runtest.sh -tol 1.e-06 -mpi "$(CHECKRUN)" TEST_struct/solvers.sh); \
(cd test; ./checktest.sh); \
(cd test; ./cleantest.sh); \
echo "Testing SStruct ..."; \
(cd test; ./runtest.sh -tol 1.e-06 -mpi "$(CHECKRUN)" TEST_sstruct/solvers.sh); \
(cd test; ./checktest.sh); \
(cd test; ./cleantest.sh);

Perhaps those options are enough for the purpose of making sure hypre works on your platforms?
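
For reference, from the source tree the whole sequence is roughly the following (a sketch; CHECKRUN is optional and can point at your MPI launcher):

cd src
./configure
make -j4
make check        # quick smoke test: ij, struct, sstruct with default options
make checkpar     # parallel regression tests driven by runtest.sh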

Hypre tests (in debian builds) have started timing out on ppc64el architecture

I looked at https://buildd.debian.org/status/fetch.php?pkg=hypre&arch=ppc64el&ver=2.28.0-2&stamp=1691288180&raw=0 and I see the line: E: Build killed with signal TERM after 150 minutes of inactivity. It would help to understand why execution stalled here. Please share if you have additional info.

Occasionally I've had timeouts locally on my own machine (amd64)

I haven't observed this behavior on my amd64 machine, but please share any additional info you have, e.g.: did you have an executable hanging? Did you get any error messages during execution or as output in any *.err files?

Is there anything that can be done to make the tests more robust?

As mentioned above, with some additional info about the timed out runs, we can hopefully make the tests more robust.

Thanks!

@drew-parsons
Contributor Author

drew-parsons commented Aug 6, 2023

That's a fair suggestion. make check and make checkpar run a subset of the available tests, so they'll reduce the probability of hitting the condition that's triggering the timeout. Currently Debian runs ./runtest.sh -t TEST_*/*.sh on every single test in src/tests/. In one sense I'm reluctant to "hide" the problem by not running the full set, but pragmatically it makes sense to use the tool you've provided (make check) rather than complaining it times out on tests you didn't intend me to run! In practice the make check* tests are sufficient for Debian CI.

Because the timeout is not reliably reproducible, at least on amd64, it's hard to provide more information. If I catch it again locally I'll try to capture the build state and logs. Since it's been reproducible on ppc64el, I can try to generate some logs from there. It was a problem on armhf too (cf. the red marks at https://ci.debian.net/packages/h/hypre/unstable/armhf/); eventually we just switched off tests on armhf. I'll see if I can get some armhf logs also.

@drew-parsons
Contributor Author

make check passes but make checkpar fails. Manually running the test it invokes,

src/test$ ./runtest.sh -tol 1.e-06 -mpi TEST_ij/solvers.sh
./runtest.sh: test command file -tol does not exist
can not find .sh file

Sure enough, ./runtest.sh -h indicates

    -rtol <tol>    use relative tolerance 'tol' to compare numeric test values
    -atol <tol>    use absolute tolerance 'tol' to compare numeric test values

There are -rtol and -atol options but not the -tol option used by make checkpar.
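
A sketch of the kind of Makefile change I have in mind (whether -atol or -rtol is the right replacement is your call; only the IJ line is shown):

# src/Makefile, checkpar rule (sketch; the struct/sstruct lines would change the same way)
-(cd test; ./runtest.sh -tol 1.e-06 -mpi "$(CHECKRUN)" TEST_ij/solvers.sh); \
+(cd test; ./runtest.sh -atol 1.e-06 -mpi "$(CHECKRUN)" TEST_ij/solvers.sh); \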

@drew-parsons
Contributor Author

There is a separate problem with make checkpar, compounding the error described above. Even though make checkpar fails, it still returns 0 so scripts running make checkpar will not detect that the failure occurred.

@victorapm
Contributor

Thanks for the info, @drew-parsons! Let me investigate what's happening with the checkpar target and I'll get back to you

@drew-parsons
Contributor Author

drew-parsons commented Aug 6, 2023

I think part of the problem detecting the error state is in test/checktest.sh. It reports FAILED to stdout (echo) when there is a problem, but does not return an error code.

For reference, I generated a checkpar error by trying to use CHECKRUN,

CHECKRUN="mpirun -n 4"  make checkpar

Investigating the problem, I learnt that I don't need to provide CHECKRUN for checkpar. It automatically invokes a -np flag (and CHECKRUN="mpirun -np" make checkpar passes successfully after patching -tol to -atol, the same as checkpar without CHECKRUN).

set -e helps catch errors in the check rule, though it still misses the checkpar error because of the checktest.sh issue.
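
Something along these lines in checktest.sh would let make (with set -e) see the failure; this is a hypothetical sketch, not the actual script:

#!/bin/sh
# sketch: exit nonzero if any test left a non-empty .err file behind
status=0
for err in TEST_*/*.err; do
   if [ -s "$err" ]; then
      echo "FAILED : $err"
      status=1
   fi
done
exit $status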

@drew-parsons
Contributor Author

drew-parsons commented Aug 7, 2023

Well huh. It's given me a timeout locally, right in the middle of testing make check (more precisely checkpar).

I was running a build inside a chroot (pbuilder). The Debian build tests both a shared library build and a static build; this timeout occurred testing the static build. The original src dir is copied as src-static, and the build (and tests) in this case take place in src-static. That is probably not directly relevant, since the timeout seems to be random (it succeeded just before).

The build log (i.e. stdout) just shows

...
== running make checkpar ==
make[2]: Entering directory '/build/hypre-2.28.0/src-static'
Checking the library ...
make[3]: Entering directory '/build/hypre-2.28.0/src-static/test'
make[3]: Nothing to be done for 'all'.
make[3]: Leaving directory '/build/hypre-2.28.0/src-static/test'
Testing IJ ...

The timeout is in make checkpar, in the IJ tests. ps shows

$ ps aux | grep ij
1234     1186429  0.0  0.0   2584  1536 pts/4    S    03:06   0:00 /bin/sh -c \         echo "Checking the library ..."; \         set -e; \         (cd test; make all); \         echo "Testing IJ ..."; \         (cd test; ./runtest.sh -atol 1.e-06 -mpi "" TEST_ij/solvers.sh); \         (cd test; ./checktest.sh); \         (cd test; ./cleantest.sh); \         echo "Testing Struct ..."; \         (cd test; ./runtest.sh -atol 1.e-06 -mpi "" TEST_struct/solvers.sh); \         (cd test; ./checktest.sh); \         (cd test; ./cleantest.sh); \         echo "Testing SStruct ..."; \         (cd test; ./runtest.sh -atol 1.e-06 -mpi "" TEST_sstruct/solvers.sh); \         (cd test; ./checktest.sh); \         (cd test; ./cleantest.sh); 
1234     1186431  0.0  0.0   4336  3200 pts/4    S    03:06   0:00 /bin/bash ./runtest.sh -atol 1.e-06 -mpi  TEST_ij/solvers.sh
1234     1187696  0.0  0.0 208568 17472 pts/4    Sl   03:07   0:00 /usr/bin/mpirun -np 8 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
1234     1187706  0.0  0.0 220080 20980 pts/4    Sl   03:07   0:00 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
drew     1204786  0.0  0.0   6340  2168 pts/3    S+   09:16   0:00 grep ij

Inspecting src-static/test/TEST_ij, all solvers.err.* files are empty. The last entry in the dir is .113:

src-static/test/TEST_ij$ ls -lrt | tail
-rw-r--r-- 1 1234 1234    5212 Aug  7 03:07 solvers.out.109
-rw-r--r-- 1 1234 1234       0 Aug  7 03:07 solvers.err.110
-rw-r--r-- 1 1234 1234    5115 Aug  7 03:07 solvers.out.110
-rw-r--r-- 1 1234 1234       0 Aug  7 03:07 solvers.err.111
-rw-r--r-- 1 1234 1234    5688 Aug  7 03:07 solvers.out.111
-rw-r--r-- 1 1234 1234       0 Aug  7 03:07 solvers.err.112
-rw-r--r-- 1 1234 1234    5816 Aug  7 03:07 solvers.out.112
-rwxr-xr-x 1 1234 1234 8623816 Aug  7 03:07 ij
-rw-r--r-- 1 1234 1234       0 Aug  7 03:07 solvers.err.113
-rw-r--r-- 1 1234 1234    5601 Aug  7 03:07 solvers.out.113

Test 113 does not seem to show any problem, e.g.

$ tail -n25  solvers.out.113
   20    3.403224e-05    0.549558   3.804920e-07
   21    1.925875e-05    0.565897   2.153194e-07
   22    9.773671e-06    0.507493   1.092730e-07
   23    3.836149e-06    0.392498   4.288945e-08
   24    1.568592e-06    0.408898   1.753739e-08
   25    7.820921e-07    0.498595   8.744056e-09


Final L2 norm of residual: 7.820921e-07



=============================================
Solve phase times:
=============================================
GMRES Solve:
  wall clock time = 0.006753 seconds
  wall MFLOPS     = 0.000000
  cpu clock time  = 0.006754 seconds
  cpu MFLOPS      = 0.000000


GMRES Iterations = 25
Final GMRES Relative Residual Norm = 8.744056e-09

I figure the race condition (or whatever the problem is) must be happening in "test 114".

In this case the pbuilder build doesn't kill the process. That is, it hasn't "timed out"; it's just hung. It's currently still "live" on my system. Is there anything specific you'd like me to poke before I kill it?

@drew-parsons
Contributor Author

drew-parsons commented Aug 7, 2023

If I copy the stalled build directory and run the stalled ij command manually, it generates the same output as solvers.out.113 i.e.

GMRES Iterations = 25
Final GMRES Relative Residual Norm = 8.744056e-09

I figure the residuals in "test 114" would not be exactly the same as test 113. That suggests the problem is that test 113 did not close successfully. solvers.out.113 shows the calculation itself finished, so the race condition must happen while exiting the processes.

@drew-parsons
Contributor Author

Separate from my captured hanging test, some make check build logs are now available at https://buildd.debian.org/status/package.php?p=hypre. The checkpar tolerance (1e-6) is too tight for half of the architectures. I patched checktest.sh to echo the error. For instance arm64 fails (https://buildd.debian.org/status/fetch.php?pkg=hypre&arch=arm64&ver=2.28.0-4&stamp=1691405154&raw=0) with

== running make checkpar ==
make[2]: Entering directory '/<<PKGBUILDDIR>>/src'
Checking the library ...
make[3]: Entering directory '/<<PKGBUILDDIR>>/src/test'
make[3]: Nothing to be done for 'all'.
make[3]: Leaving directory '/<<PKGBUILDDIR>>/src/test'
Testing IJ ...

FAILED : TEST_ij/solvers.err  (775)

(128) -  Average Convergence Factor = 0.119746
(128) +  Average Convergence Factor = 0.119752      (err 6.00e-06)

(131) -                 operator = 2.860205
(131) +                 operator = 2.860093      (err 1.12e-04)

(132) -                    cycle = 5.720112
(132) +                    cycle = 5.719888      (err 2.24e-04)

(142) -  Average Convergence Factor = 0.419936
(142) +  Average Convergence Factor = 0.419934      (err 2.00e-06)

(144) -      Complexity:    grid = 1.614937
(144) +      Complexity:    grid = 1.614875      (err 6.20e-05)

(145) -                 operator = 2.866563
(145) +                 operator = 2.866222      (err 3.41e-04)

(146) -                    cycle = 5.732747
(146) +                    cycle = 5.732066      (err 6.81e-04)


make[2]: *** [Makefile:169: checkpar] Error 1

I can easily patch the tolerance used by checkpar, e.g. use 1e-3 instead of 1e-6, but it'd be valuable to hear what you think about it.
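
For concreteness, the patch could be as small as something like this (a sketch; the exact tolerance is up for discussion):

# relax the checkpar tolerance in src/Makefile and switch to the real -atol flag
sed -i 's/-tol 1\.e-06/-atol 1.e-03/g' src/Makefile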

@victorapm
Contributor

Hi @drew-parsons, thank you for investigating this carefully!!!

Would you be able to open a PR with your changes so that the team can evaluate?

Changing the tolerance to 1e-3 makes sense to me, for the purposes of make check or make checkpar.

@drew-parsons
Contributor Author

drew-parsons commented Aug 7, 2023

I'll open a PR for the tests.

Is there anything specific you can extract from my hanging test while it's still "running"? Otherwise I should kill it. (That's a separate issue from the test tolerances; it's really the original issue.)

@victorapm
Contributor

victorapm commented Aug 7, 2023

Thank you!

I cannot extract any info from your hanging test myself, unfortunately. But I suspect gdb would help.

Did you compile hypre in debug mode (--enable-debug)? If so, one idea would be to find out the process ID and attach gdb to it if the process is still running. Then we would have more details on exactly where the program is hanging. From what you described, I suspect this might be an issue with MPI, though... Another idea is trying out another MPI implementation such as mpich.
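
For example, a debug build would look roughly like this (a sketch; adapt to your packaging):

cd src
./configure --enable-debug
make -j4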

Thank you again!

@drew-parsons
Contributor Author

Eh, the process id is right in front of us, given by ps! It's pid 1187706; the other 7 in the mpirun -np 8 run have already closed. I've never attached gdb to a running process before, but I'm happy to try it.

Wouldn't surprise me if it's an MPI implementation issue. Other projects have also suggested mpich might be more reliable. Debian has been using OpenMPI as the default MPI for historical reasons. From time to time we raise the question of whether we should switch, but haven't convinced ourselves it will necessarily improve things (it might just swap one set of implementation bugs for another).

@victorapm
Contributor

Sounds good! You can do: gdb -p 1187706. gdb will stop the process, and at that point you can look at the backtrace with bt. I expect to see MPI_Finalize in there. It will also help to have debug symbols for openmpi, so you can see where it stopped inside the call to MPI_Finalize.
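
The session would look roughly like this (sketch):

gdb -p 1187706
(gdb) bt                     # backtrace of the thread gdb stops in
(gdb) thread apply all bt    # in case the interesting frame is in another thread
(gdb) detach                 # lets the process continue afterwards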

Just adding another idea: maybe running through valgrind?

Thank you!

@drew-parsons
Contributor Author

drew-parsons commented Aug 7, 2023

Hmm, it seems that in the process of trying to attach to it, I accidentally shut it down and missed the backtrace :( Got so close, but the moment passed. Might have to go back to a ppc64el box.

@drew-parsons
Contributor Author

drew-parsons commented Aug 7, 2023

Attaching a screenshot of the Debian build summary here, before I patch around it. Green indicates check and checkpar passed (with the default tolerance 1e-6). Red indicates checkpar failed the tolerance check; those should pass when patched to 1e-3.

[screenshot: Debian build summary by architecture]

@victorapm
Contributor

Out of curiosity, are those executed on virtual machines (VM)?

@drew-parsons
Contributor Author

drew-parsons commented Aug 7, 2023

Some of them I think. It's a mixture. We try to build on metal where possible but use VMs in some cases. For instance i386 builds are done in VM on amd64 machines. The build machines (buildd) are listed at https://db.debian.org/machines.cgi

@victorapm
Contributor

Thanks for the info!

PS: You might have noticed the compilation warnings with gcc-13. I have a branch that I'm slowly working on to fix all those warnings: #896

@drew-parsons
Contributor Author

drew-parsons commented Aug 7, 2023

The pbuilder chroot has obligingly given me another sample. Got the backtrace this time:

Reading symbols from /var/cache/pbuilder/build/1478835/build/hypre-2.28.0/src-static/test/TEST_ij/ij...
Attaching to program: /var/cache/pbuilder/build/1478835/build/hypre-2.28.0/src-static/test/TEST_ij/ij, process 1510478
[New LWP 1510490]

warning: .dynamic section for "/usr/lib/x86_64-linux-gnu/liblapack.so.3" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/usr/lib/x86_64-linux-gnu/libblas.so.3" is not at the expected address (wrong library or version mismatch?)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
futex_wait (private=0, expected=2, futex_word=0x155ce8000000) at ../sysdeps/nptl/futex-internal.h:146
146	../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x155ce8000000) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x155ce8000000, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007fb0294a9682 in lll_mutex_lock_optimized (mutex=0x155ce8000000) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x155ce8000000) at ./nptl/pthread_mutex_lock.c:93
#4  0x00007fb0280e3632 in pmix_shmem_segment_detach () from /usr/lib/x86_64-linux-gnu/libpmix.so.2
#5  0x00007fb02814293b in ?? () from /usr/lib/x86_64-linux-gnu/libpmix.so.2
#6  0x00007fb02813fcc1 in ?? () from /usr/lib/x86_64-linux-gnu/libpmix.so.2
#7  0x00007fb02813d3ab in ?? () from /usr/lib/x86_64-linux-gnu/libpmix.so.2
#8  0x00007fb0280d917c in pmix_mca_base_framework_close () from /usr/lib/x86_64-linux-gnu/libpmix.so.2
#9  0x00007fb0280a754d in pmix_rte_finalize () from /usr/lib/x86_64-linux-gnu/libpmix.so.2
#10 0x00007fb028051a2c in PMIx_Finalize () from /usr/lib/x86_64-linux-gnu/libpmix.so.2
#11 0x00007fb02826f61a in ext3x_client_finalize () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext3x.so
#12 0x00007fb028a8e4e5 in ?? () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so
#13 0x00007fb0292ead7d in orte_finalize () from /usr/lib/x86_64-linux-gnu/libopen-rte.so.40
#14 0x00007fb029f49770 in ompi_mpi_finalize () from /usr/lib/x86_64-linux-gnu/libmpi.so.40
#15 0x000055b04d4bb1fb in main (argc=<optimized out>, argv=<optimized out>) at ./src-static/test/ij.c:8690

pmix. It's always causing trouble.

Current versions
OpenMPI 4.1.5-2
pmix 5.0.0~rc1-2

@victorapm
Contributor

OK! Good that you managed to get the backtrace. It does stop in MPI_Finalize as we expected. I'm afraid I don't have the knowledge to debug this further, but at least we've identified that the issue is in MPI. Maybe the pmix or openmpi folks can help?

@drew-parsons
Contributor Author

Install debug libraries (with symbols) + https://stackoverflow.com/questions/31069258/

(gdb) nosharedlibrary
(gdb) sharedlibrary
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x155ce8000000) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x155ce8000000, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007fb0294a9682 in lll_mutex_lock_optimized (mutex=0x155ce8000000) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x155ce8000000) at ./nptl/pthread_mutex_lock.c:93
#4  0x00007fb0280e3632 in update_ref_count (inc=-1, header=0x155ce8000000) at ../../../../src/util/pmix_shmem.c:46
#5  pmix_shmem_segment_detach (shmem=0x7fb0180014c0) at ../../../../src/util/pmix_shmem.c:213
#6  0x00007fb02814293b in job_destruct (job=0x7fb018001250) at ../../../../../../src/mca/gds/shmem/gds_shmem.c:475
#7  0x00007fb02813fcc1 in pmix_obj_run_destructors (object=0x7fb018001250) at ./src/class/pmix_object.h:680
#8  module_finalize () at ../../../../../../src/mca/gds/shmem/gds_shmem.c:1035
#9  0x00007fb02813d3ab in pmix_gds_close () at ../../../../../src/mca/gds/base/gds_base_frame.c:72
#10 pmix_gds_close () at ../../../../../src/mca/gds/base/gds_base_frame.c:58
#11 0x00007fb0280d917c in pmix_mca_base_framework_close (framework=0x7fb0282010e0 <pmix_gds_base_framework>) at ../../../../../src/mca/base/pmix_mca_base_framework.c:213
#12 0x00007fb0280a754d in pmix_rte_finalize () at ../../../src/runtime/pmix_finalize.c:98
#13 0x00007fb028051a2c in PMIx_Finalize (info=info@entry=0x0, ninfo=ninfo@entry=0) at ../../../src/client/pmix_client.c:1099
#14 0x00007fb02826f61a in ext3x_client_finalize () at ./debian/build-gfortran/opal/mca/pmix/ext3x/ext3x_client.c:199
#15 0x00007fb028a8e4e5 in rte_finalize () at ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c:650
#16 0x00007fb0292ead7d in orte_finalize () at ../../../orte/runtime/orte_finalize.c:77
#17 0x00007fb029f49770 in ompi_mpi_finalize () at ../../../ompi/runtime/ompi_mpi_finalize.c:478
#18 0x000055b04d5ae5b5 in hypre_MPI_Finalize () at ./src-static/utilities/mpistubs.c:772
#19 0x000055b04d4bb1fb in main (argc=<optimized out>, argv=<optimized out>) at ./src-static/test/ij.c:8690
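
For anyone following along, installing the symbols would look roughly like this (hypothetical package names; the debian-debug apt source needs to be enabled):

# hypothetical package names; check apt for the exact dbgsym packages on your release
apt install libopenmpi3-dbgsym libpmix2-dbgsym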

@drew-parsons
Contributor Author

The pmix (openmpi) developers might be able to help. Worth drawing it to their attention.

@drew-parsons
Contributor Author

openpmix/openpmix@fc56068
might be relevant

@drew-parsons
Contributor Author

drew-parsons commented Aug 8, 2023

Worth noting that hypre 2.26.0 continues to pass ppc64el CI tests with pmix 5.0.0~rc1-2, see https://ci.debian.net/packages/h/hypre/unstable/ppc64el/
e.g. https://ci.debian.net/data/autopkgtest/unstable/ppc64el/h/hypre/36600838/log.gz
That's the same pmix version causing the build test problems with 2.28 (and 2.29). That suggests hypre itself (via changes between 2.26 and 2.28) is not uninvolved in the problem.

ppc64el continues to hang consistently, https://buildd.debian.org/status/fetch.php?pkg=hypre&arch=ppc64el&ver=2.28.0-5&stamp=1691466990&raw=0

The final release of pmix v5 was made yesterday; hopefully that will bring more stability in any case.

@victorapm
Contributor

Thanks @drew-parsons! Can you try make check CHECKRUN="valgrind --leak-check=full" on the ppc64el build just to see if we get any additional info?

The fact that the tests are passing with hypre 2.26.0, but not with hypre 2.28.0, for the same pmix version sounds strange to me. I cannot think of a change between these two versions that would impact MPI finalization.

@drew-parsons
Contributor Author

drew-parsons commented Aug 9, 2023

Even on a ppc64el machine the race condition is hard to catch. I couldn't reproduce the hung test in a test build. Instead I had to put make checkpar in a loop; it finally caught something on repeat 27. Here is the redacted gdb backtrace, probing each process one by one:

(sid_ppc64el-dchroot)dparsons@platti:~/hypre$ nproc
4
(sid_ppc64el-dchroot)dparsons@platti:~/hypre$ ps aux | grep ij
_lldpd      1022  0.0  0.0  28416   768 ?        S    Aug01   0:03 lldpd: connected to pijper.debian.org.
dparsons 2180975  0.0  0.0   3712  1472 pts/1    S+   11:49   0:00 /bin/sh -c \         echo "Checking the library ..."; \         set -e; \         (cd test; make all); \         echo "Testing IJ ..."; \         (cd test; ./runtest.sh -atol 1d-3 -mpi "" TEST_ij/solvers.sh); \         (cd test; ./checktest.sh); \         (cd test; ./cleantest.sh); \         echo "Testing Struct ..."; \         (cd test; ./runtest.sh -atol 1d-3 -mpi "" TEST_struct/solvers.sh); \         (cd test; ./checktest.sh); \         (cd test; ./cleantest.sh); \         echo "Testing SStruct ..."; \         (cd test; ./runtest.sh -atol 1d-3 -mpi "" TEST_sstruct/solvers.sh); \         (cd test; ./checktest.sh); \         (cd test; ./cleantest.sh); 
dparsons 2180977  0.0  0.0   6016  4416 pts/1    S+   11:49   0:00 /bin/bash ./runtest.sh -atol 1d-3 -mpi  TEST_ij/solvers.sh
dparsons 2182250  0.0  0.3 221824 26432 pts/1    Sl+  11:49   0:00 /usr/bin/mpirun -np 8 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
dparsons 2182254  2.7  0.3 243712 33344 pts/1    Sl   11:49   6:43 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
dparsons 2182255  2.7  0.3 243712 33280 pts/1    Sl   11:49   6:43 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
dparsons 2182256  2.7  0.3 243712 33344 pts/1    Sl   11:49   6:43 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
dparsons 2182257  2.7  0.3 243712 33280 pts/1    Sl   11:49   6:44 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
dparsons 2182258  0.0  0.2 191232 19072 pts/1    Sl   11:49   0:00 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
dparsons 2182262  0.0  0.2 191232 19136 pts/1    Sl   11:49   0:00 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
dparsons 2182265  2.7  0.3 243712 33344 pts/1    Sl   11:49   6:44 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
dparsons 2182268  2.7  0.3 243712 33280 pts/1    Sl   11:49   6:43 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5
dparsons 2216851  0.0  0.0   4864  2432 pts/7    S+   15:50   0:00 grep ij
(sid_ppc64el-dchroot)dparsons@platti:~/hypre$ gdb /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij
GNU gdb (Debian 13.2-1) 13.2
Reading symbols from /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij...
(gdb) attach 2182250
Attaching to program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182250
[New LWP 2182251]
[New LWP 2182252]
[New LWP 2182253]
warning: Build ID mismatch between current exec-file /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij
and automatically determined exec-file /usr/bin/orterun
exec-file-mismatch handling is currently "ask"
Load new symbol table from "/usr/bin/orterun"? (y or n) y
Reading symbols from /usr/bin/orterun...
(No debugging symbols found in /usr/bin/orterun)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
0x00007fffaa0012d0 in __GI___poll (fds=0x204, nfds=20, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
29	../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0  0x00007fffaa0012d0 in __GI___poll (fds=0x204, nfds=20, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fffaa1540b0 in ?? () from /usr/lib/powerpc64le-linux-gnu/libevent_core-2.1.so.7
#2  0x00007fffaa14730c in event_base_loop () from /usr/lib/powerpc64le-linux-gnu/libevent_core-2.1.so.7
#3  0x0000000103e012f4 in ?? ()
#4  0x0000000103e00e18 in ?? ()
#5  0x00007fffa9ee4bac in __libc_start_call_main (main=main@entry=0x103e00e00, argc=argc@entry=20, argv=argv@entry=0x7ffff0526858, auxvec=auxvec@entry=0x7ffff05269e8)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#6  0x00007fffa9ee4dec in generic_start_main (init=0x0, fini=0x0, stack_end=<optimized out>, rtld_fini=<optimized out>, auxvec=0x7ffff05269e8, argv=0x7ffff0526858, argc=20, main=0x103e00e00)
    at ../csu/libc-start.c:360
#7  __libc_start_main_impl (argc=20, argv=0x7ffff0526858, ev=<optimized out>, auxvec=0x7ffff05269e8, rtld_fini=<optimized out>, stinfo=<optimized out>, stack_on_entry=<optimized out>)
    at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:109
#8  0x0000000000000000 in ?? ()
(gdb) detach
Detaching from program: /usr/bin/orterun, process 2182250
[Inferior 1 (process 2182250) detached]
(gdb) attach 2182254
Attaching to program: /usr/bin/orterun, process 2182254
[New LWP 2182259]
[New LWP 2182260]
warning: Build ID mismatch between current exec-file /usr/bin/orterun
and automatically determined exec-file /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij
exec-file-mismatch handling is currently "ask"
Load new symbol table from "/home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij"? (y or n) y
Reading symbols from /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
0x00007fffb5e320f0 in __GI___clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>, req=0x7fffee08e7b8, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
48	../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) bt
#0  0x00007fffb5e320f0 in __GI___clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>, req=0x7fffee08e7b8, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007fffb5e39aa8 in __GI___nanosleep (req=<optimized out>, rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007fffb5e76e54 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:31
#3  0x00007fffb62067e0 in ompi_mpi_init (argc=<optimized out>, argv=<optimized out>, requested=<optimized out>, provided=0x7fffee08e954, reinit_ok=<optimized out>) at ../../../ompi/runtime/ompi_mpi_init.c:713
#4  0x00007fffb6173b80 in PMPI_Init (argc=0x7fffee08edcc, argv=0x7fffee08edc0) at pinit.c:67
#5  0x00007fffb65a5168 in hypre_MPI_Init (argc=<optimized out>, argv=<optimized out>) at mpistubs.c:766
#6  0x000000012122e2f0 in main (argc=<optimized out>, argv=<optimized out>) at ij.c:532
(gdb) detach
Detaching from program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182254
[Inferior 1 (process 2182254) detached]
(gdb) attach 2182255
Attaching to program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182255
[New LWP 2182261]
[New LWP 2182263]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
0x00007fffa77420f0 in __GI___clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>, req=0x7fffe5ff4698, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
48	../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) bt
#0  0x00007fffa77420f0 in __GI___clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>, req=0x7fffe5ff4698, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007fffa7749aa8 in __GI___nanosleep (req=<optimized out>, rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007fffa7786e54 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:31
#3  0x00007fffa7b167e0 in ompi_mpi_init (argc=<optimized out>, argv=<optimized out>, requested=<optimized out>, provided=0x7fffe5ff4834, reinit_ok=<optimized out>) at ../../../ompi/runtime/ompi_mpi_init.c:713
#4  0x00007fffa7a83b80 in PMPI_Init (argc=0x7fffe5ff4cac, argv=0x7fffe5ff4ca0) at pinit.c:67
#5  0x00007fffa7eb5168 in hypre_MPI_Init (argc=<optimized out>, argv=<optimized out>) at mpistubs.c:766
#6  0x00000001346ee2f0 in main (argc=<optimized out>, argv=<optimized out>) at ij.c:532
(gdb) detach
Detaching from program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182255
[Inferior 1 (process 2182255) detached]
(gdb) attach 2182256
Attaching to program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182256
[New LWP 2182264]
[New LWP 2182267]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
0x00007fffab3d20f0 in __GI___clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>, req=0x7fffffa7fb88, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
48	../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) bt
#0  0x00007fffab3d20f0 in __GI___clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>, req=0x7fffffa7fb88, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007fffab3d9aa8 in __GI___nanosleep (req=<optimized out>, rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007fffab416e54 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:31
#3  0x00007fffab7a67e0 in ompi_mpi_init (argc=<optimized out>, argv=<optimized out>, requested=<optimized out>, provided=0x7fffffa7fd24, reinit_ok=<optimized out>) at ../../../ompi/runtime/ompi_mpi_init.c:713
#4  0x00007fffab713b80 in PMPI_Init (argc=0x7fffffa8019c, argv=0x7fffffa80190) at pinit.c:67
#5  0x00007fffabb45168 in hypre_MPI_Init (argc=<optimized out>, argv=<optimized out>) at mpistubs.c:766
#6  0x0000000133a7e2f0 in main (argc=<optimized out>, argv=<optimized out>) at ij.c:532
(gdb) detach
Detaching from program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182256
[Inferior 1 (process 2182256) detached]
(gdb) attach 2182257
Attaching to program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182257
[New LWP 2182266]
[New LWP 2182269]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
0x00007fff8f9420f0 in __GI___clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>, req=0x7fffd29e3ef8, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
48	../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) bt
#0  0x00007fff8f9420f0 in __GI___clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>, req=0x7fffd29e3ef8, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007fff8f949aa8 in __GI___nanosleep (req=<optimized out>, rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007fff8f986e54 in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:31
#3  0x00007fff8fd167e0 in ompi_mpi_init (argc=<optimized out>, argv=<optimized out>, requested=<optimized out>, provided=0x7fffd29e4094, reinit_ok=<optimized out>) at ../../../ompi/runtime/ompi_mpi_init.c:713
#4  0x00007fff8fc83b80 in PMPI_Init (argc=0x7fffd29e450c, argv=0x7fffd29e4500) at pinit.c:67
#5  0x00007fff900b5168 in hypre_MPI_Init (argc=<optimized out>, argv=<optimized out>) at mpistubs.c:766
#6  0x000000010853e2f0 in main (argc=<optimized out>, argv=<optimized out>) at ij.c:532
(gdb) detach
Detaching from program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182257
[Inferior 1 (process 2182257) detached]
(gdb) attach 2182258
Attaching to program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182258
[New LWP 2182270]
[New LWP 2182271]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
0x00007fffb328ea60 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=<optimized out>, expected=<optimized out>, futex_word=0x200) at futex-internal.c:57
57	futex-internal.c: No such file or directory.
(gdb) bt
#0  0x00007fffb328ea60 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=<optimized out>, expected=<optimized out>, futex_word=0x200) at futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=<optimized out>, expected=<optimized out>, futex_word=0x200) at futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=0x200, expected=<optimized out>, clockid=<optimized out>, abstime=0x0, private=<optimized out>) at futex-internal.c:139
#3  0x00007fffb3293004 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fffe4c85c20, cond=0x7fffe4c85c48) at pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7fffe4c85c48, mutex=0x7fffe4c85c20) at pthread_cond_wait.c:618
#5  0x00007fffb0d33f4c in PMIx_Init (proc=<optimized out>, info=0x0, ninfo=0) at ../../../src/client/pmix_client.c:820
#6  0x00007fffb0fad0c0 in ext3x_client_init (ilist=<optimized out>) at ext3x_client.c:106
#7  0x00007fffb1802590 in rte_init () at ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c:152
#8  0x00007fffb27d2394 in orte_init (pargc=<optimized out>, pargv=<optimized out>, flags=<optimized out>) at ../../../orte/runtime/orte_init.c:271
#9  0x00007fffb36c5ff4 in ompi_mpi_init (argc=<optimized out>, argv=<optimized out>, requested=<optimized out>, provided=0x7fffe4c86ad4, reinit_ok=<optimized out>) at ../../../ompi/runtime/ompi_mpi_init.c:518
#10 0x00007fffb3633b80 in PMPI_Init (argc=0x7fffe4c86f4c, argv=0x7fffe4c86f40) at pinit.c:67
#11 0x00007fffb3a65168 in hypre_MPI_Init (argc=<optimized out>, argv=<optimized out>) at mpistubs.c:766
#12 0x000000011872e2f0 in main (argc=<optimized out>, argv=<optimized out>) at ij.c:532
(gdb) detach
Detaching from program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182258
[Inferior 1 (process 2182258) detached]
(gdb) attach 2182262
Attaching to program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182262
[New LWP 2182272]
[New LWP 2182274]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
0x00007fffbb8eea60 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=<optimized out>, expected=<optimized out>, futex_word=0x200) at futex-internal.c:57
57	futex-internal.c: No such file or directory.
(gdb) bt
#0  0x00007fffbb8eea60 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=<optimized out>, expected=<optimized out>, futex_word=0x200) at futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=<optimized out>, expected=<optimized out>, futex_word=0x200) at futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=0x200, expected=<optimized out>, clockid=<optimized out>, abstime=0x0, private=<optimized out>) at futex-internal.c:139
#3  0x00007fffbb8f3004 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fffce9f05f0, cond=0x7fffce9f0618) at pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7fffce9f0618, mutex=0x7fffce9f05f0) at pthread_cond_wait.c:618
#5  0x00007fffb9393f4c in PMIx_Init (proc=<optimized out>, info=0x0, ninfo=0) at ../../../src/client/pmix_client.c:820
#6  0x00007fffb960d0c0 in ext3x_client_init (ilist=<optimized out>) at ext3x_client.c:106
#7  0x00007fffb9e62590 in rte_init () at ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c:152
#8  0x00007fffbae32394 in orte_init (pargc=<optimized out>, pargv=<optimized out>, flags=<optimized out>) at ../../../orte/runtime/orte_init.c:271
#9  0x00007fffbbd25ff4 in ompi_mpi_init (argc=<optimized out>, argv=<optimized out>, requested=<optimized out>, provided=0x7fffce9f14a4, reinit_ok=<optimized out>) at ../../../ompi/runtime/ompi_mpi_init.c:518
#10 0x00007fffbbc93b80 in PMPI_Init (argc=0x7fffce9f191c, argv=0x7fffce9f1910) at pinit.c:67
#11 0x00007fffbc0c5168 in hypre_MPI_Init (argc=<optimized out>, argv=<optimized out>) at mpistubs.c:766
#12 0x000000012c24e2f0 in main (argc=<optimized out>, argv=<optimized out>) at ij.c:532
(gdb) detach
Detaching from program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182262
[Inferior 1 (process 2182262) detached]
(gdb) attach 2182265 
Attaching to program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182265
ptrace: No such process.
(gdb) attach 2182268 
Attaching to program: /home/dparsons/hypre/hypre-2.28.0/src/test/TEST_ij/ij, process 2182268
ptrace: No such process.
(gdb) q
(sid_ppc64el-dchroot)dparsons@platti:~/hypre$ ps aux | grep ij
_lldpd      1022  0.0  0.0  28416   768 ?        S    Aug01   0:03 lldpd: connected to pijper.debian.org.
dparsons 2224034  0.0  0.0   3712  1472 pts/1    S+   15:56   0:00 /bin/sh -c \         echo "Checking the library ..."; \         set -e; \         (cd test; make all); \         echo "Testing IJ ..."; \         (cd test; ./runtest.sh -atol 1d-3 -mpi "" TEST_ij/solvers.sh); \         (cd test; ./checktest.sh); \         (cd test; ./cleantest.sh); \         echo "Testing Struct ..."; \         (cd test; ./runtest.sh -atol 1d-3 -mpi "" TEST_struct/solvers.sh); \         (cd test; ./checktest.sh); \         (cd test; ./cleantest.sh); \         echo "Testing SStruct ..."; \         (cd test; ./runtest.sh -atol 1d-3 -mpi "" TEST_sstruct/solvers.sh); \         (cd test; ./checktest.sh); \         (cd test; ./cleantest.sh); 
dparsons 2224036  0.9  0.0   5824  4288 pts/1    S+   15:56   0:00 /bin/bash ./runtest.sh -atol 1d-3 -mpi  TEST_ij/solvers.sh
dparsons 2225167 42.8  0.3 221696 26048 pts/1    Sl+  15:56   0:00 /usr/bin/mpirun -np 2 ./ij -n 20 20 20 -P 2 1 1 -agg_nl 1 -solver 1 -CF 0 -rlx 0 -w 0.7 -rlx_coarse 0 -ns_coarse 2
dparsons 2225171 75.0  0.4 248320 36352 pts/1    Rl   15:56   0:00 ./ij -n 20 20 20 -P 2 1 1 -agg_nl 1 -solver 1 -CF 0 -rlx 0 -w 0.7 -rlx_coarse 0 -ns_coarse 2
dparsons 2225172 75.0  0.4 248320 36160 pts/1    Rl   15:56   0:00 ./ij -n 20 20 20 -P 2 1 1 -agg_nl 1 -solver 1 -CF 0 -rlx 0 -w 0.7 -rlx_coarse 0 -ns_coarse 2
dparsons 2225178  0.0  0.0   4864  2496 pts/7    S+   15:56   0:00 grep ij

The hang seems to have occurred between processes 2182258 and 2182262 (not sure what the other processes were doing at this time). They both stop at __futex_abstimed_wait_common64 inside PMIx_Init. Touching them with gdb seems to have released the deadlock: after that interference the test finished and continued on to the next. Later, repeat 43 stalled in the same way, again out of hypre_MPI_Init, and then repeat 56 stalled too. I stopped the cycle of repeats at that point.

Note that this hang shows up in hypre_MPI_Init, not hypre_MPI_Finalize.

My guess is that the threads are checking for each other in PMIx_Init, seeing that the others are not ready yet, and taking a nano-nap (via __futex_abstimed_wait_common64). Evidently there's no mechanism to guarantee they wake up in time to catch each other, so they keep sleeping past each other indefinitely. Unless that's the behaviour fixed in pmix 5.

You can get a qualitative sense of the uncontrolled waiting by running checkpar by hand. Even without an hours-long hang, sometimes it completes immediately and sometimes it pauses for several seconds before completing.
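
The loop I used to catch it was essentially this (reconstructed from memory, so a sketch):

# repeat checkpar until it fails; a hang simply leaves the loop sitting there
i=0
while make checkpar; do
   i=$((i+1))
   echo "checkpar repeat $i passed"
done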

@drew-parsons
Contributor Author

drew-parsons commented Aug 9, 2023

make check CHECKRUN="/usr/bin/valgrind --leak-check=full" found some things. I can't say if they're significant. I'm attaching ij.err here.
ij.err.hypre_ppc64el.txt

@drew-parsons
Contributor Author

Might as well attach sstruct.err and struct.err too
sstruct.err.hypre_ppc64el.txt
struct.err.hypre_ppc64el.txt

@victorapm
Contributor

Thank you for the additional tests!

My guess is that the threads are checking for each other in PMIx_Init, seeing each other are not ready yet, and having a nano-nap

This sounds plausible to me. Do you have enough (8) CPU threads on that machine? I'm not sure if having fewer threads would be a problem for the MPI implementation you are using, but it might be worth checking. Again, we need help from the OpenMPI folks.

make check valgrind found some things

These seem to be false positives from OpenMPI; see https://www.open-mpi.org/faq/?category=debugging#valgrind_clean
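
If you want to silence those, OpenMPI ships a valgrind suppression file you could pass along; something like this (the path below is a guess for a typical install):

# the suppression file path may differ on your system
make check CHECKRUN="valgrind --leak-check=full --suppressions=/usr/share/openmpi/openmpi-valgrind.supp"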

@drew-parsons
Contributor Author

That ppc64el machine reported 4 processors available (nproc), so it would be running oversubscribed. I was going to comment that I could well believe the problem is exacerbated when the available CPUs are oversubscribed. I didn't draw attention to it before, though, since my own local machine has 8 CPUs and so shouldn't be oversubscribed as such, yet it sometimes experiences the hang (in that case the problem might be exacerbated by running in a pbuilder chroot).
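
For what it's worth, a quick way to make the oversubscription explicit when poking at this by hand (assuming OpenMPI's mpirun; just a sketch):

# nproc reports 4 on that ppc64el box, so 8 ranks oversubscribe it
mpirun --oversubscribe -np 8 ./ij -n 20 20 20 -P 2 2 2 -agg_nl 1 -solver 3 -mult_add 0 -add_Pmx 5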

@victorapm
Contributor

Thanks, @drew-parsons! It's still a bit weird that you experienced the hang with your machine, which has enough CPU cores for running the test. I suppose that doesn't happen with mpich? I'm afraid I can't help further with this issue, but hopefully the pmix/openmpi folks can provide further guidance?

Lastly, would you be able to create a PR with your changes in hypre so that we can incorporate them in the library?

Thank you!

@victorapm
Contributor

@drew-parsons, would you be able to create a PR with your changes in hypre or share any patch files you produced?

@drew-parsons
Contributor Author

Thanks for the reminder. I'll review our patches and push a PR.

@drew-parsons
Contributor Author

PR #959 is now ready. I took the liberty of adding a timings patch; let me know if you'd rather not include it here.

@victorapm
Contributor

Thanks @drew-parsons! That was a good idea to include contents from other patches as well.

@drew-parsons
Contributor Author

Please check that the timings patch is correct too :)

@victorapm
Contributor

Sure, I will!

@drew-parsons
Contributor Author

Just want to mention here that there's a different issue running checkpar on the installed system. It passes during builds, but I'm also trying to run it against the installed libraries as part of Debian CI testing. In that context it detects a couple of differences from solver.saved,
e.g.

Number of numeric entries do not match!!
Saved file (142 entries)  Output file (202 entries)

Numeric type mismatch in floating point or integer comparison!!
(128) -  Average Convergence Factor = 0.119746 
(128) +     Cycle  3   2.063758e-01    0.117502     1.631544e-03  

(130) -      Complexity:    grid = 1.613875
(128) +     Cycle  3   2.063758e-01    0.117502     1.631544e-03       (err 1.41e+00)

Not sure why it's not matching. Maybe I hacked the Makefile the wrong way when enabling the tests to build against the installed libraries.

@victorapm
Contributor

victorapm commented Sep 22, 2023

@drew-parsons sorry I missed your last message. I hope your last issue has been solved; let me know if I can help.

Thanks for your PR contribution! I've just merged it on master.

@drew-parsons
Contributor Author

Thanks. The other problem running make checkpar afterwards is still there. I'll file a separate bug once I understand how to frame the problem.

@victorapm
Contributor

Thanks @drew-parsons. I'll close this issue, but feel free to open a new one about your new problem. Thanks again!
