LSF binding is incorrect if multiple app contexts #1570

@jjhursey

When a job is scheduled under LSF and LSF specifies the bindings (via the LSB_AFFINITY_HOSTFILE file), running mpirun with multiple app contexts produces incorrect bindings.

For example, if the mpirun looks like:

mpirun -np 2 myprog1 : -np 2 myprog2 : -np 2 myprog3

Let's assume that we are mapping by socket, with 4 sockets per node across 2 nodes (nodeA, nodeB). Open MPI will map:

 nodeA: socket0: myprog1 (rank 0)
 nodeA: socket1: myprog1 (rank 1)
 nodeA: socket0: myprog2 (rank 2)
 nodeA: socket1: myprog2 (rank 3)
 nodeA: socket0: myprog3 (rank 4)
 nodeA: socket1: myprog3 (rank 5)

Instead of what might be expected:

 nodeA: socket0: myprog1 (rank 0)
 nodeA: socket1: myprog1 (rank 1)
 nodeA: socket2: myprog2 (rank 2)
 nodeA: socket3: myprog2 (rank 3)
 nodeB: socket0: myprog3 (rank 4)
 nodeB: socket1: myprog3 (rank 5)

What is happening is that the lsf RAS associates the same LSB_AFFINITY_HOSTFILE with each of the app_contexts. The seq RMAPS then processes the app_contexts one at a time. While it processes an app_context it keeps track of where it is in the file (which is why the 2 processes of the first program get different bindings), but when it switches app_contexts it resets to the beginning of the file, since it assumes each app_context has its own file (which is why it goes back over those same bindings).
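To make the reset behavior concrete, here is a minimal sketch in Python. It is a simplified model, not the actual Open MPI code: the function `map_sequentially`, the `reset_per_app` flag, and the flat list of (node, socket) slots are all hypothetical stand-ins for what the seq RMAPS does with the entries in the affinity file. The real LSB_AFFINITY_HOSTFILE format is not reproduced here.

```python
from typing import List, Tuple

Slot = Tuple[str, int]                  # (node, socket): hypothetical stand-in for one affinity file entry
Placement = Tuple[int, str, str, int]   # (rank, program, node, socket)


def map_sequentially(apps: List[Tuple[str, int]],
                     slots: List[Slot],
                     reset_per_app: bool) -> List[Placement]:
    """Assign ranks to slots in file order, one app context at a time.

    reset_per_app=True models the behavior described above: the position
    in the file is rewound at the start of every app context.
    reset_per_app=False carries one position across all app contexts.
    """
    placements: List[Placement] = []
    rank = 0
    cursor = 0
    for program, nprocs in apps:
        if reset_per_app:
            cursor = 0  # treats the shared file as if it were a new file
        for _ in range(nprocs):
            node, socket = slots[cursor]
            placements.append((rank, program, node, socket))
            cursor += 1
            rank += 1
    return placements


# Assumed layout from the example: 2 nodes, 4 sockets each, listed in order.
slots = [(node, s) for node in ("nodeA", "nodeB") for s in range(4)]
apps = [("myprog1", 2), ("myprog2", 2), ("myprog3", 2)]

observed = map_sequentially(apps, slots, reset_per_app=True)
expected = map_sequentially(apps, slots, reset_per_app=False)

for label, result in (("observed", observed), ("expected", expected)):
    print(label)
    for rank, program, node, socket in result:
        print(f" {node}: socket{socket}: {program} (rank {rank})")
```

With the reset, every app context starts again at nodeA socket0; with a single position carried across app contexts, the ranks walk through the file once, which matches the mapping I would expect.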

Note: Affinity was added back in 3f9d9ae (Nov. 2014) where both the RAS and RMAPS were updated.

This was noticed on v1.10.2, but I assume it still happens on the v2.x and master branches.
