Closed
Description
The loop_spawn test fails for me on both the latest 1.8 and trunk. I think this has been broken for months. With 1.8, I get the following:
mpirun --mca btl_openib_if_include mlx5_0:1 --mca btl_openib_cpc_include udcm -np 2 -mca btl self,sm,openib loop_spawn 10
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #0 return : 0
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #0 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:10637): exiting
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #1 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:10641): exiting
Child (drossetti-ivy4.nvidia.com): launch
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #2 rank 0, size 2
Child (drossetti-ivy4.nvidia.com:10645): exiting
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #3 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:10649): exiting
Child (drossetti-ivy4.nvidia.com): launch
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #4 rank 0, size 2
Child (drossetti-ivy4.nvidia.com:10653): exiting
Child (drossetti-ivy4.nvidia.com): launch
[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
[drossetti-ivy4.nvidia.com:10657] too many retries sending message to 0x000b:0x00437583, giving up
-------------------------------------------------------
Child job 7 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[39866,7],0]
Exit code: 255
With trunk, I get the following:
mpirun --mca btl_openib_if_include mlx5_0:1 --mca btl_openib_cpc_include udcm -np 2 -mca btl self,sm,openib loop_spawn 10
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #0 return : 0
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #0 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:24936): exiting
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #1 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:24941): exiting
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #2 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:24946): exiting
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #3 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:24951): exiting
Child (drossetti-ivy4.nvidia.com): launch
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #4 rank 0, size 2
Child (drossetti-ivy4.nvidia.com:24956): exiting
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #5 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:24961): exiting
Child (drossetti-ivy4.nvidia.com): launch
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #6 rank 0, size 2
Child (drossetti-ivy4.nvidia.com:24966): exiting
Child (drossetti-ivy4.nvidia.com): launch
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #7 rank 0, size 2
Child (drossetti-ivy4.nvidia.com:24971): exiting
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #8 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:24976): exiting
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #9 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:24981): exiting
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #10 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:24986): exiting
Child (drossetti-ivy4.nvidia.com): launch
parent (drossetti-ivy4.nvidia.com): MPI_Comm_spawn #11 rank 0, size 2
Child (drossetti-ivy4.nvidia.com) merged rank = 1, size = 2
Child (drossetti-ivy4.nvidia.com:24991): exiting
Child (drossetti-ivy4.nvidia.com): launch
[warn] opal_libevent2022_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
[warn] opal_libevent2022_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
[...snip... about 1000 of the previous messages and then this]
[drossetti-ivy4.nvidia.com:24996] too many retries sending message to 0x000b:0x00437425, giving up
[rvandevaart@drossetti-ivy4 dynamic]$
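To compare runs like the two above without reading the output by eye, a small log summarizer can help. The following is only a sketch: the `summarize` function and its result keys are hypothetical names, not part of Open MPI, and the regular expressions assume the exact message wording shown in the logs in this report.

```python
import re

# Patterns copied from the wording of the log lines quoted in this report.
# Note: the "MPI_Comm_spawn #0 return : 0" line is intentionally not matched;
# only the post-merge "rank R, size S" lines count as completed iterations.
SPAWN_RE = re.compile(r"MPI_Comm_spawn #(\d+) rank \d+, size \d+")
REENTRANT_RE = re.compile(r"event_base_loop: reentrant invocation")
GIVE_UP_RE = re.compile(
    r"too many retries sending message to (0x[0-9a-f]+:0x[0-9a-f]+), giving up"
)


def summarize(log_text):
    """Summarize one captured loop_spawn run (hypothetical helper)."""
    last_spawn = None   # highest spawn index the parent reported after merging
    warnings = 0        # number of libevent reentrant-invocation warnings seen
    failed_peer = None  # address from the "giving up" line, if any
    for line in log_text.splitlines():
        m = SPAWN_RE.search(line)
        if m:
            idx = int(m.group(1))
            last_spawn = idx if last_spawn is None else max(last_spawn, idx)
        if REENTRANT_RE.search(line):
            warnings += 1
        m = GIVE_UP_RE.search(line)
        if m:
            failed_peer = m.group(1)
    return {
        "last_spawn": last_spawn,
        "reentrant_warnings": warnings,
        "failed_peer": failed_peer,
    }
```

Fed the 1.8 output quoted above, this reports spawn #4 as the last completed iteration and `0x000b:0x00437583` as the peer that could not be reached; for the trunk output the last completed iteration is #11. The warning count only reflects lines actually present in the captured text, so it undercounts when messages have been snipped.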
bot:milestone:v1.8.5
bot:label:bug
bot:assign: @hjelmn