local executor cannot resume due to thread pool limit (1000) reached while checking cache #1871
Comments
Hi @gpertea, thanks for reporting this issue! This could be because the
With a quick search, I could see that a couple of possible solutions have been mentioned in an earlier, similar thread: #92
I checked that thread before posting this bug report and tried the solutions suggested there on the user side, without success. It seems to me that the 1000-thread limit is hard-coded in the GPars scheduler package. Also, in the Nextflow documentation I could not find a way to specify the maximum number of threads, so I have no idea where that limit could be raised.
The limit of 1000 local threads is obviously a reasonable maximum (at least until we get servers with more than 1000 cores); the problem seems to be the lack of throttling/scheduling, of proper queuing of the tasks in the case of the local executor, such that the number of Actor Threads is kept under control. I also tried limiting cpus, submitRateLimit and queueSize in the configuration, without success.
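A minimal sketch of the kind of throttling configuration referred to above; the option names are Nextflow `executor` scope settings, but the values and the way they are written to `nextflow.config` here are illustrative, not taken from the original report:

```bash
# Illustrative sketch only: the values below are placeholders, not the ones
# actually tried in this report. It writes the throttling options mentioned
# above (cpus, submitRateLimit, queueSize) into a local nextflow.config.
cat > nextflow.config <<'EOF'
executor {
    cpus            = 64          // max CPUs the local executor may use
    queueSize       = 100         // max tasks handled in parallel
    submitRateLimit = '50/1min'   // at most 50 task submissions per minute
}
EOF
```

As reported above, these settings limit how tasks are submitted and run, but they did not prevent the thread pool from filling up during the cache checks on resume.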
@gpertea, unfortunately this is beyond my knowledge of the dark art of threading. However, based on the following comment
... from an earlier Gitter chat, I believe that NF does provide a mechanism to tweak the threading behavior. On further investigation, I came across the
It looks promising. I'll perform further testing and close the issue if I get consistently positive results with this option.
After multiple tests I can confirm that this option resolved the issue.
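The name of the option is not preserved above; as a hedged sketch only, these are the pool-related JVM properties suggested in #92, which are commonly passed to Nextflow through the NXF_OPTS environment variable. Whether this exact setting is the one confirmed here is an assumption:

```bash
# Hedged sketch only: the pool-related JVM properties mentioned in #92; whether
# they match the option confirmed above is an assumption. 'main.nf' is a
# placeholder for the actual pipeline entry script.
export NXF_OPTS='-Dnxf.pool.type=sync'          # use a synchronous pool instead of the GPars default
# export NXF_OPTS='-Dnxf.pool.maxThreads=4000'  # alternatively, raise the pool size limit
nextflow run main.nf -profile local -resume
```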
Great to know that this resolved your issue :)
Regarding what you mentioned about
I had a similar problem with the thread pool limit (1000) running nf-core/sarek, @maxulysse. The node running Nextflow has 128 cores/256 threads, and the jobs were deployed on our SGE executor.
Bug report
On a server with 144 cores, when resuming a bioinformatics workflow processing 800 samples with multiple processes/stages, the local executor hits the upper limit of the thread pool (1000, as hard-coded in GPars) while checking the cached tasks in parallel for multiple processes at once.
Expected behavior and actual behavior
Resuming (using the `-resume` option) should be possible for the local executor; checking the caches for the 800 × (number of outputs from multiple processes) tasks should be throttled/scheduled/limited to a lower number of threads in order to prevent hitting the hard-coded thread pool limit of 1000.
Steps to reproduce the problem
Use the `-resume` flag after the pipeline crashed or was interrupted at the final stage (after most of the previous steps and outputs had been generated), when the SPEAQeasy pipeline is run with the `-profile local` or `-profile docker_local` option (which invokes the local executor in Nextflow) and an input of 800 samples in the `samples.manifest` file.
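A hedged sketch of the reproduction scenario; the entry-script name is a placeholder and pipeline-specific parameters are omitted, since only the profile and input manifest are named in the report:

```bash
# Hypothetical sketch: 'main.nf' stands in for the SPEAQeasy entry script.
# The run uses the local executor via the docker_local profile and is resumed
# after an interruption near the end of the workflow.
nextflow run main.nf -profile docker_local -resume
```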
Program output
Here is the output of SPEAQeasy showing the progress of the cached tasks when the pipeline fails:
nextflow.log
Environment