
Run time failures on Cori supercomputer at NERSC #5

Closed

cdaley opened this issue May 6, 2017 · 2 comments

cdaley commented May 6, 2017

Hello SNAP developers,

I am using the Cori KNL partition and running the version of SNAP at http://www.nersc.gov/research-and-development/apex/apex-benchmarks/snap/. I have compiled SNAP with Intel compiler version 17.0.3.191 and am using the "extra large" benchmark problem. I have encountered failures when using 41,472 and 82,944 MPI ranks.
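For reference, my batch script looks roughly like the following. This is a minimal sketch rather than my exact APEX setup: the node count assumes 64 ranks per node, and the gsnap binary name, thread count, and input/output file names are placeholders.

#!/bin/bash
#SBATCH -N 648
#SBATCH -C knl
# 41,472 ranks = 648 nodes x 64 ranks per node; file names are placeholders
export OMP_NUM_THREADS=4
srun -n 41472 -c 4 --cpu-bind=cores ./gsnap input.xl output.xl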

My job output shows the following:

MPICH2 ERROR [Rank 1683] [job id 10004815419] [Thu May  4 04:50:24 2017] [c6-1c2s11n1] [nid03629] - xpmem_seglist_lookup(): failed lookup for src rank 2

Rank 1683 [Thu May  4 04:50:24 2017] [c6-1c2s11n1] xpmem_seglist_lookup failed

The core file shows that both xpmem_seglist_lookup and PMPI_Recv are in the call path:

(gdb) bt
#0  0x000000002022a8fb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00000000203eb945 in abort () at abort.c:99
#2  0x000000002032aa1e in for.issue_diagnostic ()
#3  0x000000002032e8d4 in for.signal_handler ()
#4  <signal handler called>
#5  0x000000002022a8fb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#6  0x00000000203eb888 in abort () at abort.c:78
#7  0x00000000200b2f22 in MPID_Abort ()
#8  0x00000000200ca693 in xpmem_seglist_lookup ()
#9  0x00000000200ca776 in do_xpmem_attach ()
#10 0x00000000200cb88d in MPID_nem_lmt_xpmem_start_recv ()
#11 0x00000000200c91a3 in do_cts ()
#12 0x00000000200c9f58 in pkt_RTS_handler ()
#13 0x00000000200c1f2c in MPIDI_CH3I_Progress ()
#14 0x00000000200814f0 in PMPI_Recv ()
#15 0x000000002006deed in pmpi_recv__ ()
#16 0x000000002000c1a6 in plib_module_mp_precv_d_3d_ ()
#17 0x0000000020058ff1 in thrd_comm_module_mp_sweep_recv_bdry_ ()
#18 0x0000000020048de4 in dim3_sweep_module_mp_dim3_sweep_ ()
#19 0x000000002003be4d in octsweep_module_mp_octsweep_ ()
#20 0x000000002003a966 in sweep_module_mp_sweep_ ()
#21 0x00000000200337be in inner_module_mp_inner_ ()
#22 0x000000002002b8f8 in outer_module_mp_outer_ ()
#23 0x000000002002050b in translv_ ()
#24 0x0000000020295573 in __kmp_invoke_microtask ()
#25 0x000000002024fba0 in __kmp_invoke_task_func ()
#26 0x000000002024ee35 in __kmp_launch_thread ()
#27 0x00000000202959c1 in _INTERNAL_26_______src_z_Linux_util_cpp_47afea4b::__kmp_launch_worker(void*) ()
#28 0x0000000020214134 in start_thread (arg=0x2aab163d5800) at pthread_create.c:309
#29 0x0000000020450f69 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

It looks like memory may be getting corrupted. Have you seen this error before? Do you have any suggestions for fixing it?
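In case it is relevant: frames #14 through #27 show the receive happening on an OpenMP worker thread, so I also plan to double-check the thread support level that the MPI library actually grants at run time. A minimal standalone check along these lines (requesting MPI_THREAD_MULTIPLE is my assumption about what the threaded sweep needs; I have not confirmed what SNAP's plib module actually requests):

program check_thread_level
  use mpi
  implicit none
  integer :: provided, ierr
  ! Ask for full thread support and report what the library actually grants.
  call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
  if (provided < MPI_THREAD_MULTIPLE) then
    print *, 'MPI grants thread level', provided, &
             'but MPI_THREAD_MULTIPLE is', MPI_THREAD_MULTIPLE
  end if
  call MPI_Finalize(ierr)
end program check_thread_level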

Thanks,
Chris

zerr commented May 7, 2017 via email

cdaley commented May 11, 2017

Thanks for your reply.

My runs at low node count for the "small" problem worked as expected. I encountered the same failure once with the "large" problem on 10,368 MPI ranks; however, the resulting core file did not contain any useful information.

I'll let you know if I find out more information. I submitted this ticket in the hope that someone had seen something similar. I understand that I have not really provided enough information to get to the bottom of the problem.
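One thing I may try next, on the assumption that the Cray MPICH single-copy settings documented in the intro_mpi man page apply to this failure, is turning off XPMEM single-copy transfers to see whether the lookup failure disappears:

export MPICH_SMP_SINGLE_COPY_OFF=1

If the failure goes away with single-copy disabled, that would at least narrow the problem to the XPMEM path rather than to SNAP itself.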

cdaley closed this as completed May 11, 2017