Controlling Grid Concurrency #756

Closed
halfwayBraindead opened this Issue Jan 16, 2018 · 2 comments

@halfwayBraindead

halfwayBraindead commented Jan 16, 2018

Our site cluster that we use to run Canu for large genome assemblies does not possess a parallel filesystem, and from previous experience (#705), we've found it best to limit the number of compute tasks during the correction stage, particularly for overlap bucketizing and sorting; this tends to maximize CPU usage while avoiding thrashing of the single RAID Canu writes output to.

Previously, via Slurm, we've used gridOptions="--exclude=chosenNodeListInverse" to target the nodes we want the correction steps to run on; this is somewhat clumsy for us, as our site has many users from many labs and departments accessing this shared resource (i.e., the cluster).

Is there any way to more directly control grid concurrency (e.g. limiting the number of tasks initiated) for Canu? The parameter reference for "{prefix}Concurrency" does not appear to be grid-aware.

Thanks!

@brianwalenz

brianwalenz commented Jan 18, 2018

You can stage the input data used for correction to node-local storage; see http://canu.readthedocs.io/en/latest/parameter-reference.html#file-staging
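In a spec file (or on the command line) that would look roughly like the line below; the path is a placeholder for whatever node-local disk your compute nodes provide, and the parameter name comes from the linked file-staging section, so check it against your Canu version:

    # placeholder path -- point this at the node-local scratch on your compute nodes
    stageDirectory=/local/scratch/canu-stage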

There's no direct support in Canu for limiting the number of grid jobs running at the same time; that's up to the grid scheduler. Under SGE, the -tc option limits the 'task concurrency' (the last paragraph at https://arc.liv.ac.uk/SGE/howto/sge-array.html talks about this). For Slurm, the limit is specified in the array option itself (the --array option at https://slurm.schedmd.com/sbatch.html).
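For reference, the scheduler-side syntax looks like this (the job script name is illustrative):

    # SGE: a 200-task array with at most 4 tasks running at once
    qsub -t 1-200 -tc 4 overlap.sh

    # Slurm: the %4 suffix on the array range caps concurrent tasks at 4
    sbatch --array=1-200%4 overlap.sh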

I think -- but haven't tested -- that setting gridEngineArrayOption="-a ARRAY_JOBS%4" will limit all canu job arrays to running only 4 tasks at a time.
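On the command line that would be something like the following; it's untested, and the assembly options are placeholders for your real settings:

    # untested sketch: ARRAY_JOBS is presumably replaced by canu with the actual task range,
    # and %4 caps each array at 4 concurrent tasks under Slurm
    canu -p asm -d asm-run genomeSize=1g \
         gridEngineArrayOption="-a ARRAY_JOBS%4" \
         -pacbio-raw reads.fastq.gz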

If that works, a one-line change should hard-code the behavior for just the correction jobs. Before line 1062 (before the submitOrRunParallelJobs() call) in src/pipelines/canu/CorrectReads.pm, add:

    setGlobal("gridEngineArrayOption", "-a ARRAY_JOBS%4");
@outpaddling

outpaddling commented Jan 19, 2018

To clarify, the issue here is not the lack of a parallel filesystem, but the data quality. Good quality reads run just fine on our NFS servers, which are mounted over FDR Infiniband and capable of sustaining over 800 MB/sec. I don't want other readers to get the impression that you must have a parallel filesystem to run canu.

The job that's killing us is forced to utilize some very short reads that would normally be thrown out, so I/O is going through the roof. We worked around it in the last run by increasing the per-process memory limit. But with the latest run, canu grabbed about 1,500 available cores for the early stages and the file server became a bottleneck, causing low CPU utilization. I think if we limit canu to a couple hundred cores, it should be more balanced and should be able to churn through the data in a reasonable time despite the quality issues.

Thanks for the tip on gridEngineArrayOption - we'll give that a try!
