
Restore job-based granularity #37

Closed
csadorf opened this issue Sep 30, 2018 · 8 comments
Labels: enhancement, groups

csadorf commented Sep 30, 2018

Original report by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


I have to admit that I don't quite understand the submission interface. Currently it maps operations one-to-one onto real cluster jobs. However, that may be a finer level of granularity than one really wants if the operations are quick but the scheduler's throughput is slow: one may end up waiting for the next cluster slot to free up between two trivial tasks. I don't see an option to generate scripts that basically contain

python project.py run -j <my_job_id>

In other words, an option that bundles all tasks for a specific signac job into one single cluster job.

Alternatively, the execution graph could be traversed, as proposed for issue #35, and only the last operation would be issued, automatically fulfilling previous dependencies.
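For illustration, here is a minimal sketch of that traversal idea in plain Python. None of these names are part of flow's API; the dependency table and the three operations are invented for this example.

def run_with_dependencies(op, dependencies, done=None):
    # Recursively run all of op's dependencies first, then op itself.
    if done is None:
        done = set()
    if op in done:
        return
    for dep in dependencies.get(op, []):
        run_with_dependencies(dep, dependencies, done)
    op()
    done.add(op)

def initialize():
    print("initialize")

def simulate():
    print("simulate")

def analyze():
    print("analyze")

# analyze depends on simulate, which depends on initialize.
dependencies = {analyze: [simulate], simulate: [initialize]}

# Issuing only the last operation executes the whole chain:
# initialize, simulate, analyze.
run_with_dependencies(analyze, dependencies)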

csadorf commented Sep 30, 2018

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


To make my intention clearer, here's my current workaround using job arrays:

# Request one node with one GPU, 96 hours of walltime, and 4 GB of memory
# per process.
#PBS -l nodes=1:gpus=1,walltime=96:00:00,pmem=4g
#PBS -A sglotzer1_fluxoe
#PBS -l qos=flux
#PBS -q fluxoe
#PBS -N seed_hp
# Submit as a job array with tasks 1 through 10.
#PBS -t 1-10

cd ${PBS_O_WORKDIR}

# Each array task runs one signac job: sed "Nq;d" prints only the Nth line
# of the output of signac find, i.e. the Nth job id.
python project_nocharge.py run -j `signac find | sed "${PBS_ARRAYID}q;d"`

where the nth array task executes the nth job listed by signac find.

I now realize that the developers may have intended a more work-queue-based approach in signac-flow, with operations as the smallest units of work. If that is the case, it should at least be possible to give signac-flow hints on how to map those operations onto real scheduler jobs.

csadorf commented Oct 1, 2018

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


Some additional thoughts: independent execution can really only be guaranteed between independent signac jobs, so the smallest unit of work that a more fine-grained scheduler like flow could submit in parallel is always a job; inside a job, there are dependencies. If the goal is to exploit parallelism between (possibly) independent nodes in the execution graph of a single job, those nodes should be extracted and submitted as separate cluster jobs, though I currently don't have an example of parallelism inside a job. Otherwise, the default mode of operation should be to bundle as many dependent operations (that share the same resource requests) as possible into a single cluster job, so that only the minimum number of cluster jobs is generated.
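To make the bundling rule concrete, here is a minimal sketch in plain Python (the operation chain and resource tags are invented for this example): consecutive operations sharing a resource tag are merged, so the number of cluster jobs equals the number of resource changes along the chain.

from itertools import groupby

# Hypothetical ordered chain of (operation, resource tag) pairs for one
# signac job; consecutive operations with the same tag share a cluster job.
operations = [
    ("initialize", "cpu"),
    ("minimize", "cpu"),
    ("equilibrate", "gpu"),
    ("sample", "gpu"),
    ("analyze", "cpu"),
]

bundles = [
    (tag, [name for name, _ in group])
    for tag, group in groupby(operations, key=lambda op: op[1])
]

for tag, names in bundles:
    print("one cluster job [{}]: {}".format(tag, " && ".join(names)))
# one cluster job [cpu]: initialize && minimize
# one cluster job [gpu]: equilibrate && sample
# one cluster job [cpu]: analyze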

csadorf commented Oct 1, 2018

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


The only reason I can see for breaking a cluster job up into smaller ones is that the operations differ in the resources they need. For example, one portion of the execution flow may run fast on 8 CPU cores, whereas another may need a GPU. For this, one could introduce resource decorators that map onto real job script options (such as "-l nodes=X:gpus=Y"), so that operations with different resource tags are never considered for inclusion in the same job script. However, this would make things more complicated, as the scheduler would now have to resolve the dependencies between the sub-jobs (possibly using options like PBS's "-W depend=afterok:<job_id>").
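A sketch of how such sub-jobs could be chained, again purely illustrative (the resource table, script contents, and the run -o invocation are assumptions, not flow behavior): each bundle from the previous sketch is submitted with qsub, and every submission after the first carries a depend=afterok option referring to the previous sub-job's id.

import subprocess

# Hypothetical mapping from resource tags to job script options.
RESOURCES = {
    "cpu": ["-l", "nodes=1:ppn=8"],
    "gpu": ["-l", "nodes=1:gpus=1"],
}

def submit_chained(bundles):
    # Submit one cluster job per bundle; each waits for the previous one.
    previous_id = None
    for tag, names in bundles:
        cmd = ["qsub"] + RESOURCES[tag]
        if previous_id is not None:
            # Start this sub-job only after the previous one succeeded.
            cmd += ["-W", "depend=afterok:{}".format(previous_id)]
        script = "\n".join(
            "python project.py run -o {}".format(name) for name in names
        )
        # qsub reads the job script from stdin and prints the new job id.
        result = subprocess.run(
            cmd, input=script, capture_output=True, text=True, check=True
        )
        previous_id = result.stdout.strip()
    return previous_id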

csadorf commented Oct 1, 2018

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


Side remark: the approach outlined above offers some additional flexibility, since it does not specify the job ids/statepoints to be executed explicitly; they can still be changed before the job actually runs, without resubmission. Only their number is fixed.

csadorf commented Oct 1, 2018

Original comment by Vyas Ramasubramani (Bitbucket: vramasub, GitHub: vyasr).


As I just discussed with Jens, the natural way to implement this right now is using the flow.cmd decorator to construct an auto operation that runs everything.

Concretely:

import flow
from flow import FlowProject

@FlowProject.operation
@flow.cmd
def auto(job):
    # Run all eligible operations for this job; the job formats as its id.
    return "python project.py run -j {}".format(job)

Currently our use of directives does give some idea of the resource requirements of each operation, so inferring some of the information you want is already possible. However, I don't think it's true that there are always dependencies inside a job. The purpose of having operations behave this way is precisely to support workflows where there are no dependencies of that form. For example, in a branched workflow that isn't completely linear, once you hit the branch point you would have at least two different operations eligible for that job that could run independently. So I think it's important to support that functionality.
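To illustrate the branch-point case, here is a sketch using flow's pre/post condition decorators (the operation and file names are invented): after initialize completes, both analysis operations become eligible for the same job at the same time and have no dependency on each other.

from flow import FlowProject

class Project(FlowProject):
    pass

@Project.operation
@Project.post.isfile("init.gsd")
def initialize(job):
    pass  # would write init.gsd

# Both branches depend only on initialize, not on each other, so both
# are eligible to run for the same job once initialize has completed.
@Project.operation
@Project.pre.after(initialize)
@Project.post.isfile("a.txt")
def analyze_a(job):
    pass  # would write a.txt

@Project.operation
@Project.pre.after(initialize)
@Project.post.isfile("b.txt")
def analyze_b(job):
    pass  # would write b.txt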

csadorf added the minor and bug labels on Jan 30, 2019
mikemhenry added this to the v0.7 milestone on Feb 16, 2019
csadorf added the proposal label and removed the bug label on Feb 26, 2019

csadorf commented Feb 26, 2019

This seems to be related to issue #30.

csadorf removed this from the v0.7 milestone on Feb 26, 2019
csadorf removed the proposal label on Jul 5, 2019

vyasr commented Jul 5, 2019

This issue should be solved by #114, and is also closely related to #35 if we want to make this even easier.

vyasr added the enhancement label on Jul 5, 2019
vyasr modified the milestones: v0.8, v0.9 on Jul 5, 2019
csadorf added the groups label on Sep 6, 2019
bdice modified the milestones: v0.9.0, v0.10.0 on Dec 20, 2019

vyasr commented Feb 26, 2020

The original question raised in this issue -- the ability to submit multiple operations within a single cluster job -- is resolved by the addition of groups (#114). Addressing #35 will make it even easier to ensure that all preconditions are satisfied, but that is a separate issue so this one can be closed.
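For reference, a minimal sketch of the groups approach, assuming the make_group API added by #114 (the group and operation names are invented):

from flow import FlowProject

class Project(FlowProject):
    pass

# Every operation added to this group can be submitted together,
# restoring job-based granularity with one cluster job per signac job.
whole_job = Project.make_group(name="whole_job")

@whole_job
@Project.operation
def initialize(job):
    pass

@whole_job
@Project.operation
def sample(job):
    pass

if __name__ == "__main__":
    Project().main()

Submitting with a command along the lines of "python project.py submit -o whole_job" would then generate a single scheduler job per signac job covering every operation in the group.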

vyasr closed this as completed on Feb 26, 2020