Add process binding arguments to horovodrun #1767

nvcastet · 2020-03-04T14:26:20Z

To maintain compute/memory locality, it is often good practice to
bind processes to NUMA nodes.
On ppc64le with large models, I see a 20% performance gain
because tensors are swapped between GPU memory and system memory, and
the CPU/GPU link (NVLink) is faster than the inter-socket link.
This PR also makes socket binding the default for SMPI (ppc64le). Even
with those flags, a process with a GPU may end up bound to a
non-local CPU socket. That usually happens when GPUs are not split evenly across sockets or
when you don't use all the GPUs of a node. In those cases, it
is recommended to use a rankfile to get more control: --binding-args="--rankfile myrankfile"

Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
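
For the uneven-GPU case mentioned in the description, a rankfile can be generated rather than written by hand. The snippet below is only a rough sketch and not part of this PR. It assumes the Open MPI style rankfile syntax (`rank N=host slot=socket:cores`), that rank i on the node uses the i-th GPU in sorted order, and that cores are numbered from 0 within each socket. The helper name `build_rankfile`, the host name `node1`, and the GPU-to-socket map are all hypothetical; check the real topology of your nodes before using anything like this.

```python
from collections import defaultdict


def build_rankfile(host, gpu_to_socket, cores_per_socket):
    """Return rankfile text with one rank per GPU in use, bound to that GPU's socket.

    gpu_to_socket is a hypothetical map {gpu_index: socket_index}; the real
    mapping has to come from the node's topology.
    """
    # Group the GPUs in use by the socket they are attached to.
    by_socket = defaultdict(list)
    for gpu in sorted(gpu_to_socket):
        by_socket[gpu_to_socket[gpu]].append(gpu)

    lines = []
    # Assumption: rank i uses the i-th GPU in sorted order.
    for rank, gpu in enumerate(sorted(gpu_to_socket)):
        socket = gpu_to_socket[gpu]
        peers = by_socket[socket]
        # Split the socket's cores evenly among the ranks that share it.
        share = cores_per_socket // len(peers)
        if share == 0:
            raise ValueError(f"socket {socket} has more ranks than cores")
        first = peers.index(gpu) * share
        lines.append(f"rank {rank}={host} slot={socket}:{first}-{first + share - 1}")
    return "\n".join(lines) + "\n"


# Hypothetical node: GPUs 0 and 1 on socket 0, GPU 3 on socket 1, GPU 2 unused.
print(build_rankfile("node1", {0: 0, 1: 0, 3: 1}, cores_per_socket=4), end="")
```

The resulting text would be saved as `myrankfile` and passed through `--binding-args="--rankfile myrankfile"` as described above.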

tgaddair

LGTM!

nvcastet requested review from tgaddair, romerojosh and alsrgv March 4, 2020 14:26

nvcastet force-pushed the custom_binding branch from 0140c05 to e5d22f5 Compare March 5, 2020 20:38

tgaddair approved these changes Mar 6, 2020

tgaddair merged commit 167aa00 into horovod:master Mar 6, 2020

tgaddair mentioned this pull request Mar 9, 2020

Example to run training on multi-sockets CPU with TF2.0 #1775

Closed
