Zombie/defunct processes caused by xpmem? #45

angainor · 2021-03-22T16:49:40Z

I'm using xpmem in our home-brew application (OpenMPI + our own xpmem for in-node comm) on an AMD EPYC cluster running RHEL 7.7 (Maipo), kernel 3.10.0-1062.9.1.el7.x86_64. Sometimes, after the application finishes, multiple compute nodes are left with many zombie/defunct processes that never die. Looking at the kernel stack of some of those processes, I see this:

[<ffffffffb6acbf5e>] __synchronize_srcu+0xfe/0x150
[<ffffffffb6acbfcd>] synchronize_srcu+0x1d/0x20
[<ffffffffb6c1c10d>] mmu_notifier_unregister+0xad/0xe0
[<ffffffffc0b5e614>] xpmem_mmu_notifier_unlink+0x54/0x97 [xpmem]
[<ffffffffc0b5a13d>] xpmem_flush+0x13d/0x1c0 [xpmem]
[<ffffffffb6c47ce7>] filp_close+0x37/0x90
[<ffffffffb6c6b0b8>] put_files_struct+0x88/0xe0
[<ffffffffb6c6b1b9>] exit_files+0x49/0x50
[<ffffffffb6aa2022>] do_exit+0x2b2/0xa50
[<ffffffffb6aa283f>] do_group_exit+0x3f/0xa0
[<ffffffffb6aa28b4>] SyS_exit_group+0x14/0x20
[<ffffffffb718dede>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

So they seem to be hanging in XPMEM-related process cleanup. This is strange for a few reasons: I checked, and in the code I match each xpmem_attach with an xpmem_detach. It also seems strange that the kernel would be unable to end a process because xpmem is unable to perform its cleanup.

Does anyone have any ideas as to what might be the problem here?

Thanks a lot!
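For anyone who wants to check their own nodes for the same symptom, here is a minimal sketch (not part of xpmem or OpenMPI) that walks /proc, lists tasks in the D (uninterruptible sleep) or Z (zombie) state, and prints their kernel stacks. The helper names `parse_stat`, `mentions_xpmem` and `stuck_tasks` are made up for this example. It assumes a Linux /proc layout and that /proc/<pid>/task/<tid>/stack is readable, which normally requires root and a kernel built with stack trace support; when the file cannot be read, the stack is reported as unavailable.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of a /proc/.../stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = stat_line.index("(")
    rpar = stat_line.rindex(")")
    pid = int(stat_line[:lpar])
    comm = stat_line[lpar + 1:rpar]
    state = stat_line[rpar + 1:].split()[0]
    return pid, comm, state


def mentions_xpmem(stack_text):
    """True if a kernel stack dump contains a frame from the xpmem module."""
    return stack_text is not None and "[xpmem]" in stack_text


def _read_text(path):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None


def stuck_tasks(proc_root="/proc", states=("D", "Z")):
    """Yield (pid, tid, comm, state, stack) for every task in one of `states`."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return
    for entry in entries:
        if not entry.isdigit():
            continue
        task_dir = os.path.join(proc_root, entry, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            stat = _read_text(os.path.join(task_dir, tid, "stat"))
            if stat is None:
                continue
            _, comm, state = parse_stat(stat)
            if state in states:
                stack = _read_text(os.path.join(task_dir, tid, "stack"))
                yield int(entry), int(tid), comm, state, stack


if __name__ == "__main__":
    for pid, tid, comm, state, stack in stuck_tasks():
        tag = " (xpmem frames present)" if mentions_xpmem(stack) else ""
        print(f"pid={pid} tid={tid} comm={comm} state={state}{tag}")
        print(stack if stack else "  <stack unavailable>")
```

Scanning per thread rather than per process is a deliberate choice here, since a single thread blocked in the kernel is enough to keep the whole process from being reaped.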

angainor · 2021-03-23T11:38:07Z

@hjelmn not sure if this is important, but I've noticed that I get the deadlock less often when I don't call xpmem_remove explicitly in my code. This makes me wonder: is there a possible cleanup problem when the publisher calls xpmem_remove on a region that is still attached by the peers? In other words, the publisher calls xpmem_remove, and only then do the peers call xpmem_detach and xpmem_release.

cvmeq · 2023-06-14T16:00:30Z

@angainor have you found a solution or root cause for this? We are experiencing very similar crashes, resulting in zombie/defunct processes, on our AMD cluster running RHEL 8.6 and MLNX_OFED_LINUX-5.8-1.0.1.1:

[Wed Jun 14 17:28:43 2023] Call Trace:
[Wed Jun 14 17:28:43 2023]  __schedule+0x2d1/0x840
[Wed Jun 14 17:28:43 2023]  schedule+0x35/0xa0
[Wed Jun 14 17:28:43 2023]  schedule_timeout+0x278/0x300
[Wed Jun 14 17:28:43 2023]  ? number+0x324/0x360
[Wed Jun 14 17:28:43 2023]  ? get_futex_key+0x98/0x3e0
[Wed Jun 14 17:28:43 2023]  wait_for_completion+0x96/0x100
[Wed Jun 14 17:28:43 2023]  __synchronize_srcu.part.17+0x83/0xb0
[Wed Jun 14 17:28:43 2023]  ? __bpf_trace_rcu_utilization+0x10/0x10
[Wed Jun 14 17:28:43 2023]  ? synchronize_srcu+0xad/0xf0
[Wed Jun 14 17:28:43 2023]  mmu_notifier_unregister+0xa6/0xe0
[Wed Jun 14 17:28:43 2023]  xpmem_flush+0x14a/0x170 [xpmem]
[Wed Jun 14 17:28:43 2023]  filp_close+0x31/0x70
[Wed Jun 14 17:28:43 2023]  put_files_struct+0x70/0xc0
[Wed Jun 14 17:28:43 2023]  do_exit+0x32f/0xb10
[Wed Jun 14 17:28:43 2023]  do_group_exit+0x3a/0xa0
[Wed Jun 14 17:28:43 2023]  get_signal+0x158/0x870
[Wed Jun 14 17:28:43 2023]  do_signal+0x36/0x690
[Wed Jun 14 17:28:43 2023]  ? do_send_sig_info+0x63/0x90
[Wed Jun 14 17:28:43 2023]  ? recalc_sigpending+0x17/0x60
[Wed Jun 14 17:28:43 2023]  exit_to_usermode_loop+0x89/0x100
[Wed Jun 14 17:28:43 2023]  do_syscall_64+0x19c/0x1b0
[Wed Jun 14 17:28:43 2023]  entry_SYSCALL_64_after_hwframe+0x61/0xc6

angainor · 2023-06-27T07:43:51Z

@cvmeq Unfortunately no. I still see those issues sometimes, mostly when a large job is killed or interrupted, or at job cleanup. The only solution for me was not to use the xpmem transport in OpenMPI / UCX.

angainor mentioned this issue Dec 10, 2021

Unkillable processes caused by xpmem openucx/xpmem#12

Open

rljacob mentioned this issue Apr 19, 2024

Turn off xpmem in OFED 5.8 on Chrysalis E3SM-Project/E3SM#6359

Merged

tzafrir-mellanox pushed a commit to tzafrir-mellanox/xpmem that referenced this issue Sep 11, 2024

Merge pull request hpc#45 from tvegas1/build_kernel_65

e39bb4d

KERNEL: Also support kernel 6.5+
