Improve foreign files download #265
Comments
And to confirm, as discussed on Gitter: it would be great if such files were downloaded only once when used in multiple places by a workflow (e.g. the same foreign input file used as an input to a process that is run for many other files).
One caveat: this may be impractical in the case of distributed execution. I'm running Nextflow in a shared SLURM environment with limited networked storage but dedicated fast scratch space per node. I definitely would not want large remote files to be staged into the shared mount; rather, they should go into scratch. It would certainly be ideal to schedule jobs dependent on the same file onto the same node (into the same scratch dir?) so they don't need to transfer the file twice, but I'd rather have two transfers from object store to scratch than have the shared filer become a bottleneck.
Is that a cloud SLURM deployment? Yes, ideally it should be able to cover this use case as well.
@pditommaso yes - it is a SLURM cluster implemented on Google Compute Engine. Each node has local SSD mounted at /scratch and a shared mount at /home.
Hello all. Looking at related issues, I found #686 (comment) and #686 (comment) by @ewels. I totally agree, and I think we would cover most of the execution cases this way.
This commit adds the ability to cache foreign input files so that they are staged in the pipeline work directory. This brings two main benefits: 1) Multiple processes using the same remote input file share the same downloaded copy without triggering multiple downloads of the same file; 2) When resuming the pipeline execution, all remote files previously downloaded are retrieved from the execution cache. Moreover, this commit fixes a bug in the download thread pool that was limiting downloads to one file at a time. Solves #265, #686. Merge #1006 Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
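The two benefits described in the commit message (one download shared by many processes, and reuse on resume) can be illustrated with a small sketch. This is not Nextflow's implementation: the `ForeignFileCache` class, its `stage` method, and the `fetch(uri, destination)` callback are hypothetical names chosen for illustration, and keying the cache on a hash of the URI is an assumption.

```python
import hashlib
import threading
from pathlib import Path
from typing import Callable, Dict


class ForeignFileCache:
    """Stage each remote URI at most once under a cache directory (illustrative only)."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str, Path], None]):
        self.cache_dir = cache_dir
        self.fetch = fetch  # hypothetical downloader: fetch(uri, destination)
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _target(self, uri: str) -> Path:
        # Assumption: the cache key is a hash of the URI, so the same URI
        # always maps to the same path, including after a restart.
        digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()[:16]
        name = uri.rstrip("/").rsplit("/", 1)[-1]
        return self.cache_dir / digest / name

    def stage(self, uri: str) -> Path:
        target = self._target(uri)
        # One lock per URI: concurrent requests for the same file wait for
        # the first download, while different files download in parallel.
        with self._guard:
            lock = self._locks.setdefault(uri, threading.Lock())
        with lock:
            if not target.exists():
                target.parent.mkdir(parents=True, exist_ok=True)
                partial = target.with_name(target.name + ".part")
                self.fetch(uri, partial)
                # Rename only after a complete download, so an interrupted
                # run never leaves a truncated file under the final name.
                partial.replace(target)
        return target
```

Calling `stage("s3://foo/bar")` twice returns the same path and invokes `fetch` only once. A new `ForeignFileCache` pointed at the same directory finds the existing file and skips the download, which mirrors the resume behaviour.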
Currently foreign input files (e.g. s3://foo/bar) are downloaded by the driver application even when using the distributed Ignite executor. Moreover, an extra copy is needed when using a scratch directory. This needs to be improved so that foreign files are downloaded by the remote execution nodes and files are created in the target destination folder without any intermediate copy.
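As a rough illustration of "created in the target destination folder without any intermediate copy", the sketch below streams a remote file straight into the destination directory (for example a node-local scratch dir) and publishes it with a rename instead of a second copy. The `stage_into` function and the `open_remote` callback are hypothetical; a real executor would plug in its own storage client here.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable


def stage_into(open_remote: Callable[[str], BinaryIO], uri: str, dest_dir: Path) -> Path:
    """Stream `uri` directly into `dest_dir` and return the final path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    name = uri.rstrip("/").rsplit("/", 1)[-1]
    final = dest_dir / name
    # The temporary file lives in the destination directory itself, so the
    # final step is a rename on the same filesystem, not a copy.
    fd, tmp = tempfile.mkstemp(dir=dest_dir, prefix=name + ".")
    try:
        with os.fdopen(fd, "wb") as out, open_remote(uri) as src:
            shutil.copyfileobj(src, out)
        os.replace(tmp, final)
    except BaseException:
        # Remove the partial file if the transfer or the rename failed.
        os.unlink(tmp)
        raise
    return final
```

Run on the execution node with `dest_dir` pointing at that node's scratch space, the data goes from the object store to scratch in a single transfer and never touches the shared mount.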