Support Nvidia GPU-MPS #5

Closed
lsawade opened this issue Mar 5, 2022 · 1 comment

lsawade commented Mar 5, 2022

Hi,

I would like to think about how to implement GPU-MPS. Right now, job.py hardcodes the -a flag to -a 1. Given the [nprocs, cpus-per-task, gpus-per-task] setup, supporting MPS looks like it would require quite a lot of recoding on your end. To show how I usually implement GPU-MPS, I attached a Specfem example below.

My main worry is that, on the one hand, creating a new LSF job class to support this would be quite easy, but it starts making the package cluttered. On the other hand, incorporating the GPU-MPS capability into the current LSF(Job) class may overcomplicate it.

What do you think?

Maybe an add_special_mpi() in node.py and a special_mpiexec() in job.py?
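
To make the idea concrete, here is a rough standalone sketch of what such a pair could look like. The method names follow the proposal above, but the signatures, the tasks_per_gpu parameter, and the class internals are my assumptions, not existing nnodes code:

# Hypothetical sketch only -- not part of nnodes.
class LSF:
    jsrun = 'jsrun'

    def mpiexec(self, cmd, nprocs, cpus_per_proc, gpus_per_proc):
        # Current behaviour: one task per resource set, hence the hardcoded -a 1.
        return (f'{self.jsrun} -n {nprocs} -a 1 '
                f'-c {cpus_per_proc} -g {gpus_per_proc} {cmd}')

    def special_mpiexec(self, cmd, nprocs, cpus_per_proc, tasks_per_gpu):
        # MPS path: pack `tasks_per_gpu` tasks into each resource set,
        # give the set one full GPU, and shrink the number of sets.
        return (f'{self.jsrun} -n {nprocs // tasks_per_gpu} -a {tasks_per_gpu} '
                f'-c {cpus_per_proc * tasks_per_gpu} -g 1 {cmd}')

class Node:
    def __init__(self, job):
        self.job = job

    def add_special_mpi(self, cmd, nprocs, cpus_per_proc, tasks_per_gpu):
        # In nnodes this would queue a task; here it just returns the command.
        return self.job.special_mpiexec(cmd, nprocs, cpus_per_proc, tasks_per_gpu)

# Specfem case from the example below: 24 tasks, 4 MPS slices per GPU.
print(Node(LSF()).add_special_mpi('./bin/xspecfem3D', 24, 1, 4))
# jsrun -n 6 -a 4 -c 4 -g 1 ./bin/xspecfem3D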


Specfem example

I compile Specfem for 6 chunks and NEX_*=2, so a total of 24 MPI tasks. Now I want to run Specfem on a single node using 6 GPUs. GPU-MPS has to be enabled at the job-request level using the line

#BSUB -alloc_flags "gpumps"

Then, to run Specfem, you have to assign 4 tasks to a single GPU. The way I do it is to ask for 6 resource sets, each with 4 tasks and 4 CPUs but only 1 GPU:

jsrun -n 6 -a 4 -c 4 -g 1 ./bin/xspecfem3D
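
As a quick sanity check of those numbers (my own standalone sketch, not nnodes code; the variable names are illustrative), the resource-set layout follows directly from the task and GPU counts:

# Derive the jsrun resource-set flags from the task and GPU counts above.
total_tasks = 24       # 6 chunks compiled into 24 MPI tasks
gpus = 6               # GPUs used on the node
cpus_per_task = 1

tasks_per_set = total_tasks // gpus           # -a 4: tasks sharing one GPU via MPS
cpus_per_set = cpus_per_task * tasks_per_set  # -c 4: CPUs for the whole set
print(f'jsrun -n {gpus} -a {tasks_per_set} -c {cpus_per_set} -g 1 ./bin/xspecfem3D')
# -> jsrun -n 6 -a 4 -c 4 -g 1 ./bin/xspecfem3D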

lsawade commented Mar 6, 2022

As a fix, @icui implemented support for gpus_per_task being a floating-point number, so a fractional value packs several MPI tasks onto a single GPU.

Compare the 24-GPU setup with the 6-GPU setup using 4 MPS slices per GPU:

  1. 24 GPUs

    node.add_mpi('bin/xspecfem3D', 24, (1, 1), cwd=specfemdir)

    results in the following jsrun command:

    jsrun -n 24 -a 1 -c 1 -g 1 ./bin/xspecfem3D

  2. 6 GPUs

    node.add_mpi('bin/xspecfem3D', 24, (1, 0.25), cwd=specfemdir)

    results in the following jsrun command:

    jsrun -n 6 -a 4 -c 4 -g 1 ./bin/xspecfem3D

Source in nnodes/job.py:

class LSF(Job):
    ...
    def mpiexec(...):
        ...
        a = 1  # tasks per resource set (jsrun -a), default one task per GPU

        if isinstance(gpus_per_proc, float):
            # Fractional gpus_per_proc, e.g. 0.25 -> 4 MPS slices per GPU:
            # pack `a` tasks into each resource set, give the set one full GPU,
            # and reduce the number of resource sets accordingly.
            a = round(1 / gpus_per_proc)
            cpus_per_proc *= a
            gpus_per_proc = 1
            nprocs //= a

        return f'{jsrun} -n {nprocs} -a {a} -c {cpus_per_proc} -g {gpus_per_proc} {cmd}'
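
As a usage check, a standalone re-derivation of that transformation (my own sketch, not the nnodes source) reproduces both commands from the comparison above:

# Standalone re-derivation of the fractional-gpus_per_proc transformation.
def jsrun_command(cmd, nprocs, cpus_per_proc, gpus_per_proc):
    a = 1
    if isinstance(gpus_per_proc, float):
        a = round(1 / gpus_per_proc)   # e.g. 0.25 -> 4 tasks share one GPU
        cpus_per_proc *= a
        gpus_per_proc = 1
        nprocs //= a
    return f'jsrun -n {nprocs} -a {a} -c {cpus_per_proc} -g {gpus_per_proc} {cmd}'

assert jsrun_command('./bin/xspecfem3D', 24, 1, 1) == \
    'jsrun -n 24 -a 1 -c 1 -g 1 ./bin/xspecfem3D'
assert jsrun_command('./bin/xspecfem3D', 24, 1, 0.25) == \
    'jsrun -n 6 -a 4 -c 4 -g 1 ./bin/xspecfem3D'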

lsawade closed this as completed Mar 6, 2022