
Run time failures on Cori supercomputer at NERSC #5

Closed

cdaley opened this issue May 6, 2017 · 2 comments

cdaley commented May 6, 2017

Hello SNAP developers,

I am using the Cori KNL partition and running the version of SNAP at http://www.nersc.gov/research-and-development/apex/apex-benchmarks/snap/. I have compiled SNAP with Intel compiler version 17.0.3.191 and am using the "extra large" benchmark problem. I have encountered failures when using 41,472 and 82,944 MPI ranks.
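For reference, my batch script looks roughly like the following. This is a minimal sketch rather than my exact APEX setup: the node count assumes 64 ranks per node, and the gsnap binary name, thread count, and input/output file names are placeholders.

#!/bin/bash
#SBATCH -N 648
#SBATCH -C knl
# 41,472 ranks = 648 nodes x 64 ranks per node; file names are placeholders
export OMP_NUM_THREADS=4
srun -n 41472 -c 4 --cpu-bind=cores ./gsnap input.xl output.xl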

My job output shows the following:

MPICH2 ERROR [Rank 1683] [job id 10004815419] [Thu May  4 04:50:24 2017] [c6-1c2s11n1] [nid03629] - xpmem_seglist_lookup(): failed lookup for src rank 2

Rank 1683 [Thu May  4 04:50:24 2017] [c6-1c2s11n1] xpmem_seglist_lookup failed

The core file shows that both xpmem_seglist_lookup and PMPI_Recv are in the call path:

(gdb) bt
#0  0x000000002022a8fb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00000000203eb945 in abort () at abort.c:99
#2  0x000000002032aa1e in for.issue_diagnostic ()
#3  0x000000002032e8d4 in for.signal_handler ()
#4  <signal handler called>
#5  0x000000002022a8fb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#6  0x00000000203eb888 in abort () at abort.c:78
#7  0x00000000200b2f22 in MPID_Abort ()
#8  0x00000000200ca693 in xpmem_seglist_lookup ()
#9  0x00000000200ca776 in do_xpmem_attach ()
#10 0x00000000200cb88d in MPID_nem_lmt_xpmem_start_recv ()
#11 0x00000000200c91a3 in do_cts ()
#12 0x00000000200c9f58 in pkt_RTS_handler ()
#13 0x00000000200c1f2c in MPIDI_CH3I_Progress ()
#14 0x00000000200814f0 in PMPI_Recv ()
#15 0x000000002006deed in pmpi_recv__ ()
#16 0x000000002000c1a6 in plib_module_mp_precv_d_3d_ ()
#17 0x0000000020058ff1 in thrd_comm_module_mp_sweep_recv_bdry_ ()
#18 0x0000000020048de4 in dim3_sweep_module_mp_dim3_sweep_ ()
#19 0x000000002003be4d in octsweep_module_mp_octsweep_ ()
#20 0x000000002003a966 in sweep_module_mp_sweep_ ()
#21 0x00000000200337be in inner_module_mp_inner_ ()
#22 0x000000002002b8f8 in outer_module_mp_outer_ ()
#23 0x000000002002050b in translv_ ()
#24 0x0000000020295573 in __kmp_invoke_microtask ()
#25 0x000000002024fba0 in __kmp_invoke_task_func ()
#26 0x000000002024ee35 in __kmp_launch_thread ()
#27 0x00000000202959c1 in _INTERNAL_26_______src_z_Linux_util_cpp_47afea4b::__kmp_launch_worker(void*) ()
#28 0x0000000020214134 in start_thread (arg=0x2aab163d5800) at pthread_create.c:309
#29 0x0000000020450f69 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

It looks like memory may be getting corrupted. Have you seen this error before? Do you have any suggestions for fixing it?
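In case it is relevant: frames #14 through #27 show the receive happening on an OpenMP worker thread, so I also plan to double-check the thread support level that the MPI library actually grants at run time. A minimal standalone check along these lines (requesting MPI_THREAD_MULTIPLE is my assumption about what the threaded sweep needs; I have not confirmed what SNAP's plib module actually requests):

program check_thread_level
  use mpi
  implicit none
  integer :: provided, ierr
  ! Ask for full thread support and report what the library actually grants.
  call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
  if (provided < MPI_THREAD_MULTIPLE) then
    print *, 'MPI grants thread level', provided, &
             'but MPI_THREAD_MULTIPLE is', MPI_THREAD_MULTIPLE
  end if
  call MPI_Finalize(ierr)
end program check_thread_level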

Thanks,
Chris

zerr commented May 7, 2017 via email

cdaley commented May 11, 2017

Thanks for your reply.

My runs at low node count for the "small" problem worked as expected. I encountered the same failure once with the "large" problem on 10,368 MPI ranks; however, the resulting core file did not contain any useful information.

I'll let you know if I find out more information. I submitted this ticket in the hope that someone had seen something similar. I understand that I have not really provided enough information to get to the bottom of the problem.
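One thing I may try next, on the assumption that the Cray MPICH single-copy settings documented in the intro_mpi man page apply to this failure, is turning off XPMEM single-copy transfers to see whether the lookup failure disappears:

export MPICH_SMP_SINGLE_COPY_OFF=1

If the failure goes away with single-copy disabled, that would at least narrow the problem to the XPMEM path rather than to SNAP itself.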

cdaley closed this as completed May 11, 2017