Restore job-based granularity #37
Comments
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser). To make my intention clearer, here's my current workaround using job arrays, where the nth array job executes the nth available job. I now realize the developers may have intended to enable a more work-queue-based approach in signac-flow, with operations as the smallest units, but if that is the case, it should be possible to give signac hints about how to map those operations onto real scheduler jobs.
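The array-based workaround could be sketched roughly as follows. This is a hypothetical reconstruction, not the script from the original comment; the environment variable name depends on the scheduler, and `nth_job` is an illustrative helper:

```python
import os

def nth_job(job_ids, idx):
    """Map the idx-th array task to the idx-th job id.

    Sorting gives every array task the same deterministic ordering,
    so task n always picks the same job, without naming job ids
    explicitly in the submission script.
    """
    return sorted(job_ids)[idx]

# The scheduler exports the array index at run time, e.g.
# SLURM_ARRAY_TASK_ID under SLURM or PBS_ARRAYID under PBS/Torque.
idx = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
```

Each array task would then run all operations for its assigned job.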
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser). Some additional thoughts: you can really only guarantee independent execution between independent jobs, so the smallest unit of work that a more fine-grained scheduler like flow could submit in parallel is always a job. Inside a job, there are dependencies. If the goal is to exploit parallelism between (possibly) independent nodes in the execution graph of a single job, those should be extracted and submitted as separate cluster jobs, though I currently don't have an example of parallelism inside a job. Otherwise, the default mode of operation should be to bundle as many dependent operations (that share the same resource requests) as possible into a single cluster job, in order to generate only the minimum number of cluster jobs necessary.
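The bundling rule described above amounts to emitting a statepoint's dependent operations in dependency order within one script. A minimal sketch (the operation names and the `deps` structure are illustrative, not signac-flow API):

```python
def execution_order(deps):
    """Order operations so every dependency runs first.

    deps maps each operation name to the set of operations it requires;
    this is a simple topological sort suitable for one cluster job.
    """
    order, done = [], set()
    while len(order) < len(deps):
        ready = [op for op, pre in deps.items()
                 if op not in done and pre <= done]
        if not ready:
            raise ValueError("cyclic dependencies")
        for op in sorted(ready):
            order.append(op)
            done.add(op)
    return order

# Hypothetical three-stage workflow for a single statepoint:
deps = {"init": set(), "equilibrate": {"init"}, "analyze": {"equilibrate"}}
```

Running `execution_order(deps)` back to back inside one script yields the minimal single cluster job the comment asks for.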
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser). The only reason I can see to break cluster jobs up into even smaller ones is if they differ in the resources they need. For example, one portion of the execution flow may run fast on 8 CPU cores, whereas another may need a GPU. For this, one could introduce resource decorators that map onto real job script options (such as "-l nodes=X:gpus=Y"), so that operations are never considered for inclusion in the same job script if they have different resource tags. However, this would add complexity, as the scheduler would then have to resolve the dependencies between the sub-jobs (possibly using options like PBS's "-W depend=afterok:...").
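The proposed resource decorator could look something like this. This is a speculative sketch of the idea, not an existing signac-flow feature; the decorator name and attribute are invented for illustration:

```python
def resources(**req):
    """Hypothetical decorator attaching resource requests to an operation."""
    def wrap(fn):
        fn._resources = req
        return fn
    return wrap

@resources(nodes=1, cores=8, gpus=0)
def analyze(job):
    pass

@resources(nodes=1, gpus=1)
def simulate(job):
    pass

def may_share_script(op_a, op_b):
    """Two operations go into the same job script only if their
    resource tags are identical."""
    return getattr(op_a, "_resources", {}) == getattr(op_b, "_resources", {})
```

Here `analyze` and `simulate` would land in separate cluster jobs, linked by a scheduler-level dependency.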
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser). Side remark: the approach outlined above offers some additional flexibility. Because it doesn't specify the job ids/statepoints to be executed explicitly, they can still be changed before the job actually runs, without resubmission; only their number is fixed.
Original comment by Vyas Ramasubramani (Bitbucket: vramasub, GitHub: vyasr). As I just discussed with Jens, the natural way to implement this right now is using the `cmd` decorator. Concretely:

```python
@FlowProject.operation
@flow.cmd
def auto(job):
    return "python project.py run -j {}".format(job)
```

Currently our use of directives does give some idea of the resource requirements for each operation, so inferring some of the information that you want is already possible. However, I don't think it's true that there are always dependencies inside a job. The purpose of having operations behave this way is precisely to support workflows where there are no dependencies of that form. For example, if you had a branched workflow that wasn't completely linear, then once you hit the branch point you would have at least two different operations eligible to run for that job, and they would be able to run independently. So I think it's important to support that functionality.
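The branched case described above can be illustrated with a small eligibility check. This is a toy model of the idea, not signac-flow's actual condition machinery; the operation names are made up:

```python
def eligible(deps, completed):
    """Return all operations whose preconditions are satisfied.

    deps maps each operation to the set of operations it requires;
    completed is the set of operations already finished.
    """
    return sorted(op for op, pre in deps.items()
                  if op not in completed and pre <= completed)

# A branched workflow: after 'init', two independent analyses open up.
branched = {"init": set(), "analyze_a": {"init"}, "analyze_b": {"init"}}
```

Once `init` completes, both analyses are eligible at the same time and could run in parallel, which is why there need not be a dependency between every pair of operations in a job.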
This seems to be related to issue #30.
The original question raised in this issue (the ability to submit multiple operations within a single cluster job) is resolved by the addition of groups (#114). Addressing #35 will make it even easier to ensure that all preconditions are satisfied, but that is a separate issue, so this one can be closed.
Original report by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
I have to admit that I don't quite understand the submission interface. Currently it maps operations onto real cluster jobs. However, that may be a finer level of granularity than one really wants if the operations are quick but the job scheduler's throughput is slow: one may end up waiting for the next cluster slot to free up between two trivial tasks. I don't see an option to generate scripts that basically contain
In other words, bundle all job-specific tasks into one single cluster job.
Alternatively, the execution graph could be traversed, as proposed in issue #35, and only the last operation would be submitted, automatically fulfilling previous dependencies.
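A generated script of the kind requested might simply chain flow's per-operation run commands for one statepoint into a single submission. A minimal sketch, assuming a `run -j <id> -o <op>` style command line and made-up operation names:

```python
def bundled_commands(job_id, operations):
    """Emit one shell command per operation, to be run back to back
    inside a single cluster job instead of one cluster job each."""
    return ["python project.py run -j {} -o {}".format(job_id, op)
            for op in operations]

# Hypothetical usage for one statepoint:
script_lines = bundled_commands("abc123", ["init", "equilibrate", "analyze"])
```

Writing `script_lines` into one job script gives exactly the coarse, job-based granularity the report asks for.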