The current way we assign a GPU to each MPI rank in TPS is (see here):
```
device_id = mpi_rank % numGpusPerRank
```
where `numGpusPerRank` is set from the `.ini` file.
The default value of this variable is `1`; see here. None of the `*.ini` input files in our test suite changes the default value, so I assume all local jobs run on a single GPU.
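For concreteness, here is a minimal sketch of what the current scheme amounts to, assuming a HIP build and MPI already initialized; the function name is a placeholder and the TPS plumbing (ini parsing, error handling) is omitted, with variable names mirroring the snippet above.

```cpp
#include <mpi.h>
#include <hip/hip_runtime.h>

// Hypothetical helper illustrating the current device assignment.
void assignDeviceCurrent(int numGpusPerRank) {
  int mpi_rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

  // Each rank picks a device from its *global* rank, so this only works as
  // intended when numGpusPerRank matches what the scheduler actually exposes
  // on the node and ranks are packed per node.
  int device_id = mpi_rank % numGpusPerRank;
  hipSetDevice(device_id);
}
```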
This makes TPS hard to port across different clusters and local machines. Some schedulers (e.g. those on TACC) make all GPUs on a node available to all tasks on that node, while other schedulers (e.g. flux) restrict which GPUs are visible to each task (e.g. through the variables `ROCR_VISIBLE_DEVICES`, `HIP_VISIBLE_DEVICES`, `NVIDIA_VISIBLE_DEVICES`, or `CUDA_VISIBLE_DEVICES`).
I propose a more flexible way to handle this by introducing the command-line argument `--gpu-affinity` (shorthand `-ga`).
Three affinity policies will be available:
- `default`: Set the deviceID to 0. This is perfect for local resources with a single GPU or when the scheduler restricts which devices are visible to each task (as flux does).
- `direct` (default): Set the deviceID equal to the MPI rank. This is perfect on a single node (local or on a cluster) when the number of MPI tasks is less than or equal to the number of GPUs.
- `env-localid`: The device id is set through an environment variable named by `--localid-varname`. Many schedulers set an environment variable that provides a local numbering of the tasks running on a specific node: in slurm this variable is called `SLURM_LOCALID`, in flux `FLUX_TASK_LOCAL_ID`. See also https://docs.nersc.gov/jobs/affinity/#gpus. A sketch of how these policies could be implemented is given below.
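As a rough illustration of the three policies, the selection logic could look like the following; the enum, function name, and option names (`--gpu-affinity`, `--localid-varname`) are placeholders for whatever TPS's option parser would actually expose, and the HIP call is just one possible backend.

```cpp
#include <cstdlib>
#include <string>
#include <mpi.h>
#include <hip/hip_runtime.h>

// Hypothetical affinity policies matching the list above.
enum class GpuAffinity { Default, Direct, EnvLocalId };

int selectDeviceId(GpuAffinity policy, const std::string &localidVarname) {
  switch (policy) {
    case GpuAffinity::Default:
      // Single GPU, or the scheduler (e.g. flux) already restricted
      // visibility: always use device 0.
      return 0;
    case GpuAffinity::Direct: {
      // One GPU per rank on a single node: device id equals the MPI rank.
      int mpi_rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
      return mpi_rank;
    }
    case GpuAffinity::EnvLocalId: {
      // Node-local task id exported by the scheduler, e.g. SLURM_LOCALID or
      // FLUX_TASK_LOCAL_ID, selected via --localid-varname.
      const char *localid = std::getenv(localidVarname.c_str());
      return localid ? std::atoi(localid) : 0;
    }
  }
  return 0;
}
```

The call site would then reduce to something like `hipSetDevice(selectDeviceId(policy, varname))` right after MPI initialization, and a slurm job could run with `--gpu-affinity env-localid --localid-varname SLURM_LOCALID` (again, assuming those exact option names).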