tests time out on ppc64el architecture #955
Hi @drew-parsons, thank you for your message. Before addressing the timeout concerns, let me mention another way to check if the library works:
(Embedded code snippets from the repository: lines 158 to 165 and lines 167 to 182 in 03b9d3d.)
Perhaps those options are enough for the purpose of making sure hypre works on your platforms?
I looked at https://buildd.debian.org/status/fetch.php?pkg=hypre&arch=ppc64el&ver=2.28.0-2&stamp=1691288180&raw=0 and I see the line:
I haven't observed this behavior on my amd64 machine, but it would help if you have additional info to share. E.g., did you have an executable hanging? Did you get any error messages during execution, or written to any output files?
As mentioned above, with some additional info about the timed-out runs, we can hopefully make the tests more robust. Thanks!
That's a fair suggestion. Because the timeout is not reliably reproducible, at least on amd64, it's hard to provide more information. If I catch it again locally I'll try to capture the build state and logs. Since it has been reproducible on ppc64el, I can try to generate some logs from there. It was a problem on armhf too (cf. the red marks at https://ci.debian.net/packages/h/hypre/unstable/armhf/); eventually we just switched off tests on armhf. I'll see if I can get some armhf logs also.
Sure enough,
There are -rtol and -atol options but not the -tol option used by
There is a separate problem with
Thanks for the info, @drew-parsons! Let me investigate what's happening with the
I think part of the problem detecting the error state is in test/checktest.sh. It reports FAILED to stdout (echo) when there is a problem, but does not return an error code. For reference, I generated a checkpar error by trying to use CHECKRUN,
Investigating the problem, I learnt that I don't need to provide CHECKRUN for checkpar. It automatically invokes a -np flag (and
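To illustrate the exit-code point above, something along these lines in checktest.sh would let make (and the Debian build) notice the failure. This is only a sketch: it assumes failures are detected from non-empty .err files, which may not match the script's actual logic.

```sh
#!/bin/sh
# Sketch only: report failures via the exit status as well as via echo.
failed=0
for errfile in *.err*; do
    if [ -s "$errfile" ]; then        # a non-empty error file means a test failed
        echo "FAILED: $errfile"
        failed=1
    fi
done
exit $failed                          # currently the script always exits 0
```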
Well huh. It's given me a timeout locally, right in the middle of testing. I was running a build inside a chroot (pbuilder). The Debian build tests both a shared-library build and a static build; this timeout occurred while testing the static build. The original src dir is copied as src-static, and the build (and tests) in this case take place in src-static. That is probably not directly relevant, since the timeout seems to be random (it succeeded just before). The build log (i.e. stdout) just shows
The timeout is in
Inspecting src-static/test/TEST_ij, all solvers.err.* files are empty. The last entry in the dir is .113:
Test 113 does not seem to show any problem, e.g.
I figure the race condition (or whatever the problem is) must be happening in "test 114". In this case the pbuilder build doesn't kill the process. That is, it hasn't "timed out", it's just hung. It's currently still "live" on my system. Is there anything specific you'd like me to poke before I kill it?
If I copy the stalled build directory and run the stalled ij command manually, it generates the same output as solvers.out.113 i.e.
I figure the residuals in "test 114" would not be exactly the same as test 113. That suggests the problem is that test 113 did not close successfully. solvers.out.113 shows the calculation itself finished, so the race condition must happen while exiting the processes.
Separate from my captured hanging test, some
I can easily patch the tolerance used by checkpar, e.g. use 1e-3 instead of 1e-6, but it'd be valuable to hear what you think about it.
Hi @drew-parsons, thank you for investigating this carefully!!! Would you be able to open a PR with your changes so that the team can evaluate them? Changing the tolerance to
I'll open a PR for the tests. Is there anything specific you can extract from my hanging test while it's still "running"? I should kill it otherwise. (That's a separate issue from the test tolerances, the original issue, really.)
Thank you! I cannot extract any info from your hanging test, unfortunately. But I suspect the hang happens during MPI finalization. Did you compile hypre in debug mode (--enable-debug)? Thank you again!
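For reference, a debug build of hypre is configured roughly like this (a sketch; the compiler and any Debian-specific configure options are up to you):

```sh
# Sketch of a hypre debug build; --enable-debug adds debugging flags.
cd src
./configure --enable-debug CC=mpicc
make -j4
```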
Eh, the process id is right in front of us, given by ps! pid 1187706, the other 7 in the ps output. Wouldn't surprise me if it's an MPI implementation issue. Other projects have also suggested mpich might be more reliable. Debian has been using OpenMPI as the default MPI for historical reasons. From time to time we raise the question of whether we should switch, but haven't convinced ourselves it will necessarily improve things (might just swap one set of implementation bugs for another).
Sounds good! You can do:

Just adding another idea: maybe running it through valgrind? Thank you!
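The gdb route ("You can do:" above) is presumably something along these lines; this is a guess at the missing snippet, using the pid from the ps output and assuming debug symbols are installed:

```sh
# Attach to the hung rank, dump every thread's stack, then detach and exit.
gdb -p 1187706 -batch -ex "thread apply all bt"
```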
Hmm, it seems that in the process of trying to attach to the process, I managed to shut it down accidentally, and missed the backtrace :( Got so close, but the moment passed. Might have to go back to a ppc64el box.
Attaching here a screenshot of the Debian build summary before I patch around it. Green indicates check and checkpar passed (with the default tolerance 1e-6). Red indicates checkpar failed the tolerance check; it should pass when patched to 1e-3.
Out of curiosity, are those executed on virtual machines (VMs)?
Some of them, I think. It's a mixture. We try to build on metal where possible but use VMs in some cases. For instance, i386 builds are done in VMs on amd64 machines. The build machines (buildd) are listed at https://db.debian.org/machines.cgi
Thanks for the info! PS: You might have noticed the compilation warnings with gcc-13. I have a branch that I'm slowly working on to fix all those warnings: #896
The pbuilder chroot has obligingly given me another sample. Got the backtrace this time:
pmix. It's always causing trouble. Current versions
OK! Good that you managed to get the backtrace. It does stop on mpi_finalize as we expected. I'm afraid I don't have the knowledge to debug this further, but at least we identified that the issue is in MPI. Maybe the pmix or openmpi folks can help?
Install debug libraries (with symbols) + https://stackoverflow.com/questions/31069258/
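On Debian that would mean installing the dbgsym packages for the MPI stack from the debug archive; the package names below are my assumption about the relevant ones:

```sh
# Assumed names: Debian auto-generates <package>-dbgsym packages; the
# debian-debug apt source must be enabled for these to be installable.
sudo apt install libopenmpi3-dbgsym libpmix2-dbgsym gdb
```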
The pmix (openmpi) developers might be able to help. Worth drawing it to their attention.
openpmix/openpmix@fc56068
Worth noting that hypre 2.26.0 continues to pass ppc64el CI tests with pmix 5.0.0~rc1-2, see https://ci.debian.net/packages/h/hypre/unstable/ppc64el/. hypre 2.28.0 on ppc64el continues to hang consistently, see https://buildd.debian.org/status/fetch.php?pkg=hypre&arch=ppc64el&ver=2.28.0-5&stamp=1691466990&raw=0. The final release of pmix v5 was made yesterday; hopefully that will bring more stability in any case.
Thanks @drew-parsons! Can you try

The fact that the tests are passing with hypre 2.26.0, but not with hypre 2.28.0, for the same pmix version sounds strange to me. I cannot think of a change between these two versions that would impact MPI finalization.
Even on a ppc64el machine the race condition is hard to catch. I couldn't reproduce the hung test in a test build. Instead I had to put
The hang seems to have occurred between processes 2182258 and 2182262 (not sure what the other processes were doing at this time). They both stop at __futex_abstimed_wait_common64 inside PMIx_Init. Touching them with gdb seems to have released the deadlock: the test finished and continued on to the next. After that, repeat 43 seems to have stalled next, again in hypre_MPI_Init, and after that it stalled on repeat 56. I stopped the cycle of repeats at that point.

Note that this test exposes hypre_MPI_Init, not hypre_MPI_Finalize. My guess is that the threads are checking for each other in PMIx_Init, seeing that the other is not ready yet, and taking a nano-nap (via __futex_abstimed_wait_common64). Evidently there's no mechanism to guarantee they wake up in time to catch each other, so they continue to sleep past each other indefinitely. Unless that's the behaviour fixed in pmix 5.

You can get a qualitative sense of the uncontrolled waiting by running checkpar by hand. Even without an hours-long hang, sometimes it completes immediately, sometimes it pauses several seconds before completing.
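For concreteness, the cycle of repeats was essentially a loop of this shape (a sketch only; the mpirun command is a placeholder for the actual test invocation, not the exact one used):

```sh
# Hypothetical repeat loop to provoke the race; the command being repeated
# is illustrative, not the real TEST_ij/checkpar invocation.
for i in $(seq 1 100); do
    echo "repeat $i"
    mpirun -np 8 ./ij -solver 1 > repeat.out.$i 2> repeat.err.$i
done
# When the "repeat N" lines stop advancing, find the stuck ranks with ps
# and attach gdb to inspect their stacks.
```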
Might as well attach sstruct.err and struct.err too.
Thank you for the additional tests!
This sounds plausible to me. Do you have enough (8) CPU threads on that machine? I'm not sure if having fewer threads would be a problem for the MPI implementation you are using, but it might be worth checking. Again, we'd need help from the OpenMPI folks here.
These seem to be false positives from OpenMPI; see https://www.open-mpi.org/faq/?category=debugging#valgrind_clean
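If you do run it under valgrind again, OpenMPI ships a suppression file for exactly these reports; the path below is where a typical install puts it and may differ on Debian, and the test command is again only illustrative:

```sh
# openmpi-valgrind.supp silences the known OpenMPI/PMIx false positives.
mpirun -np 2 valgrind \
    --suppressions=/usr/share/openmpi/openmpi-valgrind.supp \
    ./ij -solver 1
```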
That ppc64el machine reported 4 processors available (nproc), so it would be running oversubscribed. I was going to comment that I could well believe the problem is exacerbated when the available CPUs are oversubscribed. I didn't draw attention to it before, though, since my own local machine has 8 CPUs and so shouldn't be oversubscribed as such, but it sometimes experiences the hang (in that case the problem might be exacerbated by running in a pbuilder chroot).
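For hand-testing on the 4-core box, the oversubscription can at least be made explicit rather than implicit (the flag is OpenMPI-specific; the test command is a placeholder):

```sh
# Explicitly allow 8 ranks on 4 cores; without --oversubscribe OpenMPI may
# refuse the request or behave differently depending on its slot configuration.
mpirun --oversubscribe -np 8 ./ij -solver 1
```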
Thanks, @drew-parsons! It's still a bit weird that you experienced the hang on your machine, which has enough CPU cores for running the test. I suppose that doesn't happen with mpich? I'm afraid I can't help further with this issue, but hopefully the pmix/openmpi folks can provide further guidance? Lastly, would you be able to create a PR with your changes in hypre so that we can incorporate them in the library? Thank you!
@drew-parsons, would you be able to create a PR with your changes in hypre or share any patch files you produced?
Thanks for the reminder. I'll review our patches and push a PR.
PR #959 is now ready. I took the liberty of adding a timings patch; let me know if you'd rather not include it here.
Thanks @drew-parsons! That was a good idea to include contents from other patches as well.
Please check that the timings patch is correct too :)
Sure, I will!
Just want to mention here, there's a different issue running checkpar on the installed system. It passes during builds, but I'm also trying to run it against the installed libraries as part of Debian CI testing. In this context it detects a couple of differences from solver.saved
Not sure why it's not matching. Maybe I hacked the Makefile the wrong way when enabling building the tests against the installed libraries.
@drew-parsons sorry I missed your last message. I hope your last issue has been solved; let me know if I can help. Thanks for your PR contribution! I've just merged it on
Thanks. The other problem running make checkpar afterwards is still there. I'll file a separate bug once I understand how to frame the problem.
Thanks @drew-parsons. I'll close this issue, but feel free to open a new one about your new problem. Thanks again!
Hypre tests (in debian builds) have started timing out on ppc64el architecture, see https://buildd.debian.org/status/logs.php?pkg=hypre&arch=ppc64el. The timeouts are happening for both hypre v2.28.0 and v2.29.0. The timeout is somewhat random, happening in a different test in different attempts.
Previously (April 2023) 2.28.0 passed tests. Since then Debian has made a new stable release (using gcc-12, openmpi 4.1.4). The failing environment now builds with gcc-13 and openmpi 4.1.5. I'm not sure how relevant that is, but evidently the timeout is related to the build environment.
For reference, there is also a recent timeout on hppa, https://buildd.debian.org/status/logs.php?pkg=hypre&arch=hppa
But 2.29.0 recently built successfully on hppa, so I expect success with 2.28.0 after trying again. The ppc64el failure has been more reproducible (4 recent failed attempts)
In general there's been a degree of flakiness in the hypre tests. Occasionally I've had timeouts locally on my own machine (amd64). I just run the tests again and they pass the next time. Is there anything that can be done to make the tests more robust?
Possibly there's nothing hypre can do about it. But if you have any ideas, I'll be glad to try them.