
Restore job-based granularity #37

Closed
csadorf opened this issue Sep 30, 2018 · 8 comments
Labels: enhancement, groups

csadorf commented Sep 30, 2018

Original report by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


I have to admit that I don't quite understand the submission interface. Currently it maps operations one-to-one onto real cluster jobs. However, that may be a finer level of granularity than one really wants if the operations are quick but the scheduler's throughput is slow: one may end up waiting for the next cluster slot to free up between two trivial tasks. I don't see an option to generate scripts that basically contain

python project.py run -j <my_job_id>

In other words, an option that bundles all tasks for a specific signac job into one single cluster job.

Alternatively, the execution graph could be traversed, as proposed for issue #35, and only the last operation would be issued, automatically fulfilling previous dependencies.
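For illustration, here is a minimal sketch of that traversal idea in plain Python. None of these names are part of flow's API; the dependency table and the three operations are invented for this example.

def run_with_dependencies(op, dependencies, done=None):
    # Recursively run all of op's dependencies first, then op itself.
    if done is None:
        done = set()
    if op in done:
        return
    for dep in dependencies.get(op, []):
        run_with_dependencies(dep, dependencies, done)
    op()
    done.add(op)

def initialize():
    print("initialize")

def simulate():
    print("simulate")

def analyze():
    print("analyze")

# analyze depends on simulate, which depends on initialize.
dependencies = {analyze: [simulate], simulate: [initialize]}

# Issuing only the last operation executes the whole chain:
# initialize, simulate, analyze.
run_with_dependencies(analyze, dependencies)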

csadorf commented Sep 30, 2018

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


To make my intention clearer, here's my current workaround using job arrays:

# Request one node with one GPU, 96 hours of walltime, and 4 GB of memory
# per process.
#PBS -l nodes=1:gpus=1,walltime=96:00:00,pmem=4g
#PBS -A sglotzer1_fluxoe
#PBS -l qos=flux
#PBS -q fluxoe
#PBS -N seed_hp
# Submit as a job array with tasks 1 through 10.
#PBS -t 1-10

cd ${PBS_O_WORKDIR}

# Each array task runs one signac job: sed "Nq;d" prints only the Nth line
# of the output of signac find, i.e. the Nth job id.
python project_nocharge.py run -j `signac find | sed "${PBS_ARRAYID}q;d"`

where the nth array task executes the nth job listed by signac find.

I now realize that the developers may have intended a more work-queue-based approach in signac-flow, with operations as the smallest units of work. If that is the case, it should at least be possible to give signac-flow hints on how to map those operations onto real scheduler jobs.

csadorf commented Oct 1, 2018

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


Some additional thoughts: independent execution can really only be guaranteed between independent signac jobs, so the smallest unit of work that a more fine-grained scheduler like flow could submit in parallel is always a job; inside a job, there are dependencies. If the goal is to exploit parallelism between (possibly) independent nodes in the execution graph of a single job, those nodes should be extracted and submitted as separate cluster jobs, though I currently don't have an example of parallelism inside a job. Otherwise, the default mode of operation should be to bundle as many dependent operations (that share the same resource requests) as possible into a single cluster job, so that only the minimum number of cluster jobs is generated.
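To make the bundling rule concrete, here is a minimal sketch in plain Python (the operation chain and resource tags are invented for this example): consecutive operations sharing a resource tag are merged, so the number of cluster jobs equals the number of resource changes along the chain.

from itertools import groupby

# Hypothetical ordered chain of (operation, resource tag) pairs for one
# signac job; consecutive operations with the same tag share a cluster job.
operations = [
    ("initialize", "cpu"),
    ("minimize", "cpu"),
    ("equilibrate", "gpu"),
    ("sample", "gpu"),
    ("analyze", "cpu"),
]

bundles = [
    (tag, [name for name, _ in group])
    for tag, group in groupby(operations, key=lambda op: op[1])
]

for tag, names in bundles:
    print("one cluster job [{}]: {}".format(tag, " && ".join(names)))
# one cluster job [cpu]: initialize && minimize
# one cluster job [gpu]: equilibrate && sample
# one cluster job [cpu]: analyze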

csadorf commented Oct 1, 2018

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


The only reason I can see for breaking a cluster job up into smaller ones is that the operations differ in the resources they need. For example, one portion of the execution flow may run fast on 8 CPU cores, whereas another may need a GPU. For this, one could introduce resource decorators that map onto real job script options (such as "-l nodes=X:gpus=Y"), so that operations with different resource tags are never considered for inclusion in the same job script. However, this would make things more complicated, as the scheduler would now have to resolve the dependencies between the sub-jobs (possibly using options like PBS's "-W depend=afterok:<job_id>").
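A sketch of how such sub-jobs could be chained, again purely illustrative (the resource table, script contents, and the run -o invocation are assumptions, not flow behavior): each bundle from the previous sketch is submitted with qsub, and every submission after the first carries a depend=afterok option referring to the previous sub-job's id.

import subprocess

# Hypothetical mapping from resource tags to job script options.
RESOURCES = {
    "cpu": ["-l", "nodes=1:ppn=8"],
    "gpu": ["-l", "nodes=1:gpus=1"],
}

def submit_chained(bundles):
    # Submit one cluster job per bundle; each waits for the previous one.
    previous_id = None
    for tag, names in bundles:
        cmd = ["qsub"] + RESOURCES[tag]
        if previous_id is not None:
            # Start this sub-job only after the previous one succeeded.
            cmd += ["-W", "depend=afterok:{}".format(previous_id)]
        script = "\n".join(
            "python project.py run -o {}".format(name) for name in names
        )
        # qsub reads the job script from stdin and prints the new job id.
        result = subprocess.run(
            cmd, input=script, capture_output=True, text=True, check=True
        )
        previous_id = result.stdout.strip()
    return previous_id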

csadorf commented Oct 1, 2018

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


Side remark: the approach outlined above offers some additional flexibility, since it does not specify the job ids/statepoints to be executed explicitly; they can still be changed before the job actually runs, without resubmission. Only their number is fixed.

csadorf commented Oct 1, 2018

Original comment by Vyas Ramasubramani (Bitbucket: vramasub, GitHub: vyasr).


As I just discussed with Jens, the natural way to implement this right now is using the flow.cmd decorator to construct an auto operation that runs everything.

Concretely:

import flow
from flow import FlowProject

@FlowProject.operation
@flow.cmd
def auto(job):
    # Run all eligible operations for this job; the job formats as its id.
    return "python project.py run -j {}".format(job)

Currently our use of directives does give some idea of the resource requirements of each operation, so inferring some of the information you want is already possible. However, I don't think it's true that there are always dependencies inside a job. The purpose of having operations behave this way is precisely to support workflows where there are no dependencies of that form. For example, in a branched workflow that isn't completely linear, once you hit the branch point you would have at least two different operations eligible for that job that could run independently. So I think it's important to support that functionality.
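To illustrate the branch-point case, here is a sketch using flow's pre/post condition decorators (the operation and file names are invented): after initialize completes, both analysis operations become eligible for the same job at the same time and have no dependency on each other.

from flow import FlowProject

class Project(FlowProject):
    pass

@Project.operation
@Project.post.isfile("init.gsd")
def initialize(job):
    pass  # would write init.gsd

# Both branches depend only on initialize, not on each other, so both
# are eligible to run for the same job once initialize has completed.
@Project.operation
@Project.pre.after(initialize)
@Project.post.isfile("a.txt")
def analyze_a(job):
    pass  # would write a.txt

@Project.operation
@Project.pre.after(initialize)
@Project.post.isfile("b.txt")
def analyze_b(job):
    pass  # would write b.txt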

csadorf added the minor and bug labels on Jan 30, 2019
mikemhenry added this to the v0.7 milestone on Feb 16, 2019
csadorf added the proposal label and removed the bug label on Feb 26, 2019

csadorf commented Feb 26, 2019

This seems to be related to issue #30.

csadorf removed this from the v0.7 milestone on Feb 26, 2019
csadorf removed the proposal label on Jul 5, 2019

vyasr commented Jul 5, 2019

This issue should be solved by #114, and is also closely related to #35 if we want to make this even easier.

vyasr added the enhancement label on Jul 5, 2019
vyasr modified the milestones: v0.8, v0.9 on Jul 5, 2019
csadorf added the groups label on Sep 6, 2019
bdice modified the milestones: v0.9.0, v0.10.0 on Dec 20, 2019

vyasr commented Feb 26, 2020

The original question raised in this issue -- the ability to submit multiple operations within a single cluster job -- is resolved by the addition of groups (#114). Addressing #35 will make it even easier to ensure that all preconditions are satisfied, but that is a separate issue so this one can be closed.
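For reference, a minimal sketch of the groups approach, assuming the make_group API added by #114 (the group and operation names are invented):

from flow import FlowProject

class Project(FlowProject):
    pass

# Every operation added to this group can be submitted together,
# restoring job-based granularity with one cluster job per signac job.
whole_job = Project.make_group(name="whole_job")

@whole_job
@Project.operation
def initialize(job):
    pass

@whole_job
@Project.operation
def sample(job):
    pass

if __name__ == "__main__":
    Project().main()

Submitting with a command along the lines of "python project.py submit -o whole_job" would then generate a single scheduler job per signac job covering every operation in the group.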

vyasr closed this as completed on Feb 26, 2020