When multiple processes (>= 4) exchange large messages (>= 80000 bytes) multiple times (100) via the CH3 nemesis shared-memory LMT routine, the following errors occur on both osx and ubuntu. A simplified test program will follow.
On osx (mpich-mac1.mcs.anl.gov), np=8.
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(392)................: MPI_Waitall(count=200, req_array=0x7ffee33bc3e0, status_array=0x7ffee33bb440) failed
MPIR_Waitall_impl(221)..........:
MPIDI_CH3I_Progress(547)........:
pkt_CTS_handler(350)............:
MPID_nem_lmt_shm_start_send(271):
MPID_nem_delete_shm_region(958).:
MPIU_SHMW_Seg_detach(707).......: unable to remove shared memory - unlink No such file or directory
On ubuntu (14.04.5, travis), np=4.
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPID_nem_delete_shm_region:957
Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(392)...............: MPI_Waitall(count=200, req_array=0x7fff9b74ff10, status_array=0x7fff9b750230) failed
MPIR_Waitall_impl(221)..........:
MPIDI_CH3I_Progress(547)........:
pkt_CTS_handler(350)............:
MPID_nem_lmt_shm_start_send(270):
MPID_nem_delete_shm_region(957).:
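The osx stack above ends with `unlink` failing with "No such file or directory" (ENOENT). As a minimal sketch of that symptom only, and not of MPICH's actual code path, the hypothetical helper `second_unlink_errno` below shows that unlinking a path that has already been removed returns ENOENT. The temporary file merely stands in for a file-backed shared-memory segment; whether `MPID_nem_delete_shm_region` really hits a double removal is an assumption the traces do not confirm.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not MPICH code.
 * Creates a temporary file (a stand-in for a file-backed shared-memory
 * segment), unlinks it twice, and returns the errno of the second unlink.
 * Returns 0 if the second unlink unexpectedly succeeds, -1 on setup failure. */
int second_unlink_errno(void)
{
    char path[] = "/tmp/shm_demo_XXXXXX";
    int fd = mkstemp(path);          /* create a uniquely named file */
    if (fd < 0)
        return -1;
    close(fd);

    if (unlink(path) != 0)           /* first removal: expected to succeed */
        return -1;

    if (unlink(path) == 0)           /* second removal: name is already gone */
        return 0;

    int err = errno;                 /* save errno before calling stdio */
    fprintf(stderr, "unable to remove %s - unlink %s\n", path, strerror(err));
    return err;
}
```

Calling `second_unlink_errno()` from a small `main` and comparing the result with `ENOENT` is enough to see the same message text as in the trace; it says nothing about which process removed the segment first in the real failure.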
minsii added a commit to minsii/casper that referenced this issue Nov 12, 2017
Test isendirecv_waitall fails with MPICH (3.3a2)/ch3/tcp at
waitall->MPID_nem_lmt_shm_start_send->MPID_nem_delete_shm_region
when exchanging a large number of large messages (100 messages of
8*10000 bytes in the original test) on both osx and ubuntu. This issue
can be reproduced without Casper.
We reduce the message count in this test as a workaround so that the
offloading functionality can still be checked. The count will be raised
back to 100 once this is fixed in MPICH.
See pmodels/mpich#2860
minsii added a commit to pmodels/casper that referenced this issue Nov 12, 2017 (same commit message as above)