Zombie/defunct processes caused by xpmem? #45
Comments
@hjelmn not sure if this is important, but I've noticed that I get the deadlock less often when I don't call |
@angainor have you found a solution or root cause for this? We are experiencing very similar crashes resulting in zombie/defunct processes on our AMD cluster running RHEL 8.6 and MLNX_OFED_LINUX-5.8-1.0.1.1:
|
@cvmeq Unfortunately no, I still see those issues sometimes, mostly when you kill / interrupt a large job, or at job cleanup. The only solution for me was not to use the xpmem transport in OpenMPI / UCX.
KERNEL: Also support kernel 6.5+
I'm using xpmem in our home-brew application (OpenMPI + our own xpmem for in-node comm), on an AMD EPYC cluster, 7.7 (Maipo), kernel 3.10.0-1062.9.1.el7.x86_64. Sometimes, after the application finishes, multiple compute nodes are left with plenty of zombie/defunct processes that never die. Looking at the stack of some of those processes, I see this:
So they seem to be hanging in some XPMEM-related process cleanup. This is strange for a few reasons: I checked, and in the code I match each `xpmem_attach` with an `xpmem_detach`. Also, it seems strange that the kernel would be unable to end a process because xpmem is unable to perform cleanup. Does anyone have any ideas as to what might be the problem here?
Thanks a lot!
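For anyone trying to reproduce this, here is a minimal sketch of how the stuck processes and their kernel stacks could be collected on a node. It is not part of xpmem or OpenMPI; the helper names (`parse_stat`, `find_stuck`, `kernel_stack`) are made up for illustration. It assumes a Linux `/proc` filesystem, and reading `/proc/<pid>/stack` usually requires root and a kernel built with stack tracing support.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) parsed from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = stat_text.index("(")
    rpar = stat_text.rindex(")")
    pid = int(stat_text[:lpar].strip())
    comm = stat_text[lpar + 1:rpar]
    state = stat_text[rpar + 1:].split()[0]
    return pid, comm, state


def find_stuck(proc_root="/proc", states=("Z", "D")):
    """List (pid, comm, state) for processes that are zombies (Z) or in
    uninterruptible sleep (D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while scanning, or unreadable entry
        if state in states:
            found.append((pid, comm, state))
    return found


def kernel_stack(pid, proc_root="/proc"):
    """Return the kernel stack of a process, or a note if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "stack")) as f:
            return f.read()
    except OSError as exc:
        return f"<unavailable: {exc}>"


if __name__ == "__main__":
    for pid, comm, state in find_stuck():
        print(f"{pid} {comm} state={state}")
        print(kernel_stack(pid))
```

Running this as root on an affected compute node after a job ends should print each defunct or blocked process together with whatever kernel stack is readable, which makes it easier to compare traces across nodes.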