Refactor directives. #785
Comments
With #784, we could store the maximum memory per cpu that a partition allows without allocating extra CPUs and use that information to provide the user with an error. Alternately, we could remove |
Here are some example directives.
@joaander I want to give my vote of support for this idea. The landscape of HPC clusters has continued to evolve since I was last actively involved in signac-flow's cluster templates. It seems things have solidified a bit more around core concepts and "directives" that are aligned with the above proposal. I am also generally appreciative and supportive of you proposing and pursuing significant changes like this. 👍
@bdice Thank you for reviewing the proposal and your positive comments.
To support this on GPU partitions we would also need a
vs.
In SLURM, |
@joaander Thanks for adding this, as it will help support the Georgia Tech HPCs! For Georgia Tech HPCs the |
@bcrawford39GT |
Thank you! I was always confused by differences in how flow does things and how user guides for SLURM etc describe things...processes, threads, ranks, oh my!
If there is usually no cost in amount of memory requested, I highly support this change to make using flow easier for users. I know people who have had jobs confusingly canceled due to running out of memory. Flow could print that it automatically selected the maximum allowed for the allocation, for instance:
|
@cbkerr Signac should likely support the There may be processes that need this. Additionally, there may be a cost to asking for more RAM, because they may charge you for it, depending on the HPC system or Cloud compute system you are using. For Georgia Tech HPCs the |
We discussed this offline. We plan to make

Users who request more than the maximum will not only incur extra charges but may also end up with broken SLURM scripts. For example, I recently tested Purdue Anvil with

Note that because Anvil automatically scales the CPU request with the memory request, there is no reason to ever request anything less than the maximum. By doing so, you risk out-of-memory errors in your job. The same goes for Bridges-2, which errors at submission time when you request more than the maximum.

On systems that both default to less than the maximum and allow users to oversubscribe memory and undersubscribe CPUs (Georgia Tech, UMich Great Lakes, Expanse shared queue), users may wish to request less than the maximum (without incurring extra charges). However, the best a user can ever hope to achieve by this is to gain some goodwill with the rest of the system's user community, especially those that request more than the maximum.
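One possible reading of the memory handling discussed in this thread is sketched below. It assumes a hypothetical per-partition table of the maximum memory per CPU and a hypothetical helper `resolve_memory_per_cpu`; neither is part of flow's API, and the partition name and the 2.0 GB value are invented for illustration rather than taken from any real cluster.

```python
from typing import Optional

# Hypothetical table: maximum memory per CPU (in GB) that a partition allows
# without allocating extra CPUs. The name and value are invented.
MAX_MEMORY_PER_CPU_GB = {"example-partition": 2.0}


def resolve_memory_per_cpu(partition: str, requested_gb: Optional[float] = None) -> float:
    """Return the memory per CPU to request, in GB.

    Falls back to the partition maximum when nothing is requested and raises
    an error when the request exceeds that maximum.
    """
    maximum = MAX_MEMORY_PER_CPU_GB[partition]
    if requested_gb is None:
        # No explicit request: select the maximum allowed for the allocation.
        return maximum
    if requested_gb <= 0:
        raise ValueError("memory_per_cpu must be positive")
    if requested_gb > maximum:
        raise ValueError(
            f"memory_per_cpu={requested_gb} GB exceeds the {maximum} GB per CPU "
            f"allowed on partition '{partition}'"
        )
    return requested_gb


print(resolve_memory_per_cpu("example-partition"))
print(resolve_memory_per_cpu("example-partition", 1.5))
```

Raising at script-generation time, rather than letting the scheduler reject or silently rescale the job, is only one design choice; a warning under verbose output would be another.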
Yes, this is standard SLURM behavior. I do not recommend the use of |
Please, only when verbose output is requested - if at all. This information is in the |
@joaander Yeah, I think what you are saying makes sense. Just need to get rid of Signac's auto printing of |
Feature description
Proposed solution
Replace directives with the new schema:

- `executable`: Corresponds to the current `executable`.
- `walltime`: Corresponds to the current `walltime`.
- `launcher`: `None` (the default) or `'mpi'`.
- `processes`: Replaces `np` when `launcher is None` and `nranks` when `launcher == 'mpi'`.
- `threads_per_process`: Replaces `omp_num_threads` with a more general term. Flow will always set `OMP_NUM_THREADS` when `threads_per_process` is greater than 1.
- `gpus_per_process`: Replaces `gpu`.
- `memory_per_cpu`: Replaces `memory` with a more naturally expressible quantity and one that is easier to set appropriately based on the machine configuration.

`processor_fraction` is not present in the new schema. It is not implementable in any batch scheduler currently in production use. If users desire to oversubscribe resources with many short tasks, they can use an interactive job and `run --parallel`.

`fork` should also be removed. Flow automatically decides to fork when needed.

Additional context
This design would solve #777, provide a more understandable schema for selecting resources, and reduce the effort needed to develop future cluster job templates.
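As a rough illustration of the proposed schema, the sketch below models the directives as a plain Python dataclass. The class name `Directives`, the default values, the walltime unit, the memory string format, and the derived totals (`processes` multiplied by the per-process counts) are all assumptions made for this example; this is not flow's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Directives:
    """Hypothetical container mirroring the proposed schema (not flow's API)."""

    executable: str = "python"  # default is an assumption
    walltime: Optional[float] = None  # unit (hours) is an assumption
    launcher: Optional[str] = None  # None (the default) or 'mpi'
    processes: int = 1
    threads_per_process: int = 1
    gpus_per_process: int = 0
    memory_per_cpu: Optional[str] = None  # e.g. "4g"; format is an assumption

    def __post_init__(self) -> None:
        if self.launcher not in (None, "mpi"):
            raise ValueError(f"launcher must be None or 'mpi', got {self.launcher!r}")
        if self.processes < 1 or self.threads_per_process < 1:
            raise ValueError("processes and threads_per_process must be >= 1")
        if self.gpus_per_process < 0:
            raise ValueError("gpus_per_process must be >= 0")

    @property
    def total_cpus(self) -> int:
        # Assumed: one CPU core per thread of every process.
        return self.processes * self.threads_per_process

    @property
    def total_gpus(self) -> int:
        return self.processes * self.gpus_per_process

    def environment(self) -> Dict[str, str]:
        """Environment variables implied by the directives."""
        env: Dict[str, str] = {}
        # The proposal sets OMP_NUM_THREADS whenever threads_per_process > 1.
        if self.threads_per_process > 1:
            env["OMP_NUM_THREADS"] = str(self.threads_per_process)
        return env


d = Directives(launcher="mpi", processes=4, threads_per_process=8, gpus_per_process=1)
print(d.total_cpus, d.total_gpus, d.environment())
```

Expressing every resource per process (or per CPU) means the totals follow from `processes`, which is the property that makes the schema easier to map onto scheduler options.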
When `launcher is None`:

When `launcher == 'mpi'`:

... `srun`, `mpirun`, or the appropriate machine-specific MPI launcher to distribute processes, threads, memory, and GPUs to the appropriate resources.

... `launcher`, `processes`, `threads_per_process`, `gpus_per_process`, and `memory_per_cpu`. Flow will raise an error for any invocation of `--bundle --parallel`.

`launcher` is a string to allow for potential future expansion to some non-MPI launcher capable of distributing processes to multiple nodes: see #220.

This refactor solves issues discussed in #777, #455, #115, #235.
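To make the launcher behavior more concrete, here is a minimal sketch of how a command prefix could be derived from the new directives. The function `launcher_prefix` and its `use_srun` switch are hypothetical and not part of flow; the `srun` options shown (`--ntasks`, `--cpus-per-task`, `--gpus-per-task`, `--mem-per-cpu`) are standard SLURM options, but the templates flow actually generates may differ. The sketch also assumes that no prefix is needed when `launcher is None`.

```python
from typing import List, Optional


def launcher_prefix(
    launcher: Optional[str],
    processes: int = 1,
    threads_per_process: int = 1,
    gpus_per_process: int = 0,
    memory_per_cpu: Optional[str] = None,
    use_srun: bool = True,
) -> List[str]:
    """Build the command prefix that distributes processes to resources."""
    if launcher is None:
        # Assumption: without a launcher the operation runs directly.
        return []
    if launcher != "mpi":
        raise ValueError(f"Unsupported launcher: {launcher!r}")
    if not use_srun:
        # Generic MPI launcher; only the rank count is passed in this sketch.
        return ["mpirun", "-n", str(processes)]
    args = [
        "srun",
        f"--ntasks={processes}",
        f"--cpus-per-task={threads_per_process}",
    ]
    if gpus_per_process > 0:
        args.append(f"--gpus-per-task={gpus_per_process}")
    if memory_per_cpu is not None:
        args.append(f"--mem-per-cpu={memory_per_cpu}")
    return args


prefix = launcher_prefix("mpi", processes=4, threads_per_process=2, memory_per_cpu="4G")
print(" ".join(prefix + ["python", "project.py"]))
```

Keeping `launcher` a string rather than a boolean leaves room for the non-MPI launchers mentioned in #220: a new value would add a branch here without changing the other directives.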