Description
When a job is scheduled under LSF, and LSF specifies the bindings (via the LSB_AFFINITY_HOSTFILE file), running mpirun with multiple app contexts results in incorrect bindings.
For example, suppose the mpirun command line looks like:

mpirun -np 2 myprog1 : -np 2 myprog2 : -np 2 myprog3

Let's assume that we are mapping by socket, and have 4 sockets per node across 2 nodes (nodeA, nodeB). Open MPI will map:
nodeA: socket0: myprog1 (rank 0)
nodeA: socket1: myprog1 (rank 1)
nodeA: socket0: myprog2 (rank 2)
nodeA: socket1: myprog2 (rank 3)
nodeA: socket0: myprog3 (rank 4)
nodeA: socket1: myprog3 (rank 5)
Instead of what might be expected:
nodeA: socket0: myprog1 (rank 0)
nodeA: socket1: myprog1 (rank 1)
nodeA: socket2: myprog2 (rank 2)
nodeA: socket3: myprog2 (rank 3)
nodeB: socket0: myprog3 (rank 4)
nodeB: socket1: myprog3 (rank 5)
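For reference, the LSB_AFFINITY_HOSTFILE that LSF hands to Open MPI contains one line per task, naming the host and the CPUs that task should be bound to. A purely illustrative example for an allocation like the one above (the CPU IDs and exact formatting here are invented, not taken from a real LSF job) might look like:

```
nodeA 0,1
nodeA 2,3
nodeA 4,5
nodeA 6,7
nodeB 0,1
nodeB 2,3
```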
What is happening is that the lsf RAS associates the same LSB_AFFINITY_HOSTFILE with each of the app_contexts. The seq RMAPS then processes the app_contexts one at a time. Within a single app_context it keeps track of its position in the file (which is why the two processes of the first program get different bindings), but when it switches to the next app_context it resets to the beginning, assuming it is a new file (which is why it walks over those same bindings again).
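To make that reset behavior concrete, here is a minimal, self-contained C sketch. It is not the actual ras_lsf/rmaps_seq code; the slot names and counts are invented for illustration. It simply walks the same list of hostfile slots twice, once resetting the cursor per app_context (the buggy pattern) and once sharing the cursor across app_contexts (the expected pattern):

```c
/* Simplified illustration (not Open MPI source) of the seq mapper's
 * cursor handling across app_contexts. */
#include <stdio.h>

#define NSLOTS 8   /* lines in the shared affinity hostfile   */
#define NAPPS  3   /* app contexts on the mpirun command line */

int main(void)
{
    const char *slots[NSLOTS] = {        /* one entry per hostfile line */
        "nodeA/socket0", "nodeA/socket1", "nodeA/socket2", "nodeA/socket3",
        "nodeB/socket0", "nodeB/socket1", "nodeB/socket2", "nodeB/socket3"
    };
    int procs_per_app = 2;
    int rank = 0;

    /* Buggy behavior: the file position is reset for every app_context,
     * so every app reuses the first two slots. */
    printf("-- observed --\n");
    for (int app = 0; app < NAPPS; app++) {
        int cursor = 0;                      /* reset on each app_context */
        for (int p = 0; p < procs_per_app; p++)
            printf("rank %d (app %d) -> %s\n", rank++, app, slots[cursor++]);
    }

    /* Expected behavior: one cursor shared across all app_contexts, so
     * the mapping keeps advancing through the hostfile. */
    printf("-- expected --\n");
    rank = 0;
    int cursor = 0;                          /* persists across app_contexts */
    for (int app = 0; app < NAPPS; app++)
        for (int p = 0; p < procs_per_app; p++)
            printf("rank %d (app %d) -> %s\n", rank++, app, slots[cursor++]);

    return 0;
}
```

The first loop prints nodeA/socket0 and nodeA/socket1 three times over, matching the observed mapping; the second walks on to nodeA/socket2, nodeA/socket3 and then nodeB, matching the expected mapping.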
Note: Affinity support was added back in 3f9d9ae (Nov. 2014), which updated both the RAS and RMAPS.
This was noticed on v1.10.2, but I would assume it would still happen on the v2.x and master branches.