Error occurs when multiple processes exchange large messages via CH3 nemesis shared memory. #2860

Closed
minsii opened this issue Nov 12, 2017 · 1 comment

minsii commented Nov 12, 2017

When multiple processes (>= 4) exchange large messages (>= 80000 bytes) repeatedly (100 times) via the CH3 nemesis shared memory LMT routine, the following error occurs on both osx and ubuntu. I will put together a simplified test program.
On osx (mpich-mac1.mcs.anl.gov), np=8.

Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(392)................: MPI_Waitall(count=200, req_array=0x7ffee33bc3e0, status_array=0x7ffee33bb440) failed
MPIR_Waitall_impl(221)..........:
MPIDI_CH3I_Progress(547)........:
pkt_CTS_handler(350)............:
MPID_nem_lmt_shm_start_send(271):
MPID_nem_delete_shm_region(958).:
MPIU_SHMW_Seg_detach(707).......: unable to remove shared memory - unlink No such file or directory
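
The last frame of the osx trace carries the operating system's ENOENT text from an `unlink` call. As a minimal illustration of where that text comes from, the sketch below unlinks an ordinary temporary file twice. It is not MPICH's shared memory code, and the helper name `unlink_twice` is hypothetical.

```python
import errno
import os
import tempfile


def unlink_twice():
    """Unlink a temporary file twice and return the errno of the second call."""
    # Hypothetical stand-in for a shared memory backing file.
    fd, path = tempfile.mkstemp()
    os.close(fd)

    os.unlink(path)  # first removal succeeds
    try:
        os.unlink(path)  # the name no longer exists
    except FileNotFoundError as exc:
        return exc.errno
    return 0


if __name__ == "__main__":
    code = unlink_twice()
    # Compare against ENOENT and show the platform's message for it.
    print(code == errno.ENOENT, os.strerror(code))
```

Whether the segment in this issue was removed twice, or removed first by another process, is not established by the trace.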

On ubuntu (14.04.5, travis), np=4.

INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPID_nem_delete_shm_region:957
Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(392)...............: MPI_Waitall(count=200, req_array=0x7fff9b74ff10, status_array=0x7fff9b750230) failed
MPIR_Waitall_impl(221)..........: 
MPIDI_CH3I_Progress(547)........: 
pkt_CTS_handler(350)............: 
MPID_nem_lmt_shm_start_send(270): 
MPID_nem_delete_shm_region(957).: 
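
The ubuntu run prints the error code as `ffffffff`. Read as a 32-bit signed integer, that bit pattern is -1, the conventional failure return of many system calls. The sketch below only decodes the value. Treating it as a raw return code that reached MPICH's error reporting is an assumption, not something the trace confirms, and the helper name is hypothetical.

```python
import struct


def as_signed_32(hex_digits):
    """Interpret 8 hex digits as a two's-complement 32-bit integer."""
    raw = bytes.fromhex(hex_digits)
    # ">i" reads the 4 bytes as a big-endian signed 32-bit integer.
    return struct.unpack(">i", raw)[0]


if __name__ == "__main__":
    print(as_signed_32("ffffffff"))
```
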
minsii added a commit to minsii/casper that referenced this issue Nov 12, 2017
Test isendirecv_waitall fails with MPICH (3.3a2)/ch3/tcp at
waitall->MPID_nem_lmt_shm_start_send->MPID_nem_delete_shm_region
when exchanging a large number of large messages (100 messages of
8*10000 bytes each in the original test) on both osx and ubuntu. This
issue can be reproduced without Casper.

We reduce the message count in this test as a workaround so that we can
still check the offloading functionality. The count will be restored to
100 once the problem is fixed in MPICH.
See pmodels/mpich#2860
minsii added a commit to pmodels/casper that referenced this issue Nov 12, 2017
@pavanbalaji

ch3 specific issue.
