Improve foreign files download #265

Closed
pditommaso opened this issue Dec 14, 2016 · 5 comments

@pditommaso
Member

Currently, foreign input files (e.g. s3://foo/bar) are downloaded by the driver application even when using the distributed Ignite executor. Moreover, an extra copy is needed when using a scratch directory.

This needs to be improved so that foreign files are downloaded by the remote execution nodes and the files are created in the target destination folder without any intermediate copy.
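
For illustration, a minimal DSL1-style sketch of the scenario (the bucket, file names and process are hypothetical): a foreign S3 object is declared as a process input and Nextflow stages it locally before the task script runs; this issue is about where that staging happens.

```groovy
// Hypothetical example: a remote S3 object used as a process input.
// Nextflow downloads ("stages") the foreign file before running the task;
// currently that download is performed by the driver application.
remote_ch = Channel.fromPath('s3://my-bucket/data/sample.fastq.gz')

process countLines {
    input:
    file reads from remote_ch

    output:
    file 'count.txt' into counts_ch

    """
    zcat ${reads} | wc -l > count.txt
    """
}
```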

@ewels
Member

ewels commented May 4, 2017

And to confirm, as discussed on Gitter: it would be great if such files were downloaded only a single time when used in multiple places in a workflow (e.g. the same foreign input file used as the input to a process that is run for many other files).
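
For example (all names are hypothetical), a single remote reference file paired with many per-sample inputs; ideally the reference would be downloaded once and reused by every task of the process, rather than once per task.

```groovy
// Hypothetical example: one foreign reference file, many local samples.
// The same s3:// object feeds every task of the 'align' process, so it
// should ideally be downloaded a single time.
genome_ref = file('s3://my-bucket/ref/genome.fa')
samples_ch = Channel.fromPath('data/*.fastq.gz')

process align {
    input:
    file ref from genome_ref
    file sample from samples_ch

    output:
    file '*.sam' into aligned_ch

    """
    bwa mem ${ref} ${sample} > ${sample.baseName}.sam
    """
}
```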

@ihaque-freenome

would be great if such files are only downloaded a single time if used in multiple places by a workflow

One caveat: this may be impractical in the case of distributed execution. I'm running Nextflow in a shared SLURM environment with limited networked storage but dedicated fast scratch space per node. I definitely would not want large remote files to be staged into the shared mount; rather, they should go into scratch. It would certainly be ideal to schedule jobs dependent on the same file onto the same node (into the same scratch dir?) so they don't need to transfer the file twice, but I'd rather have two transfers from object store to scratch than have the shared filer become a bottleneck.
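
As a rough sketch of that kind of setup (the '/scratch' path is an assumption based on the description above), the executor and scratch directory can be set in nextflow.config so that task work lands on node-local storage rather than the shared mount:

```groovy
// nextflow.config sketch: submit tasks via SLURM and run each task in
// node-local scratch instead of the shared filesystem. The path below
// is an assumption; adjust it to the actual per-node scratch mount.
process {
    executor = 'slurm'
    scratch  = '/scratch'
}
```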

@pditommaso
Member Author

pditommaso commented Jul 8, 2017

Is that a cloud SLURM deployment? Yes, ideally it should be able to cover this use case as well.

@ihaque-freenome

@pditommaso yes, it is a SLURM cluster running on Google Compute Engine. Each node has a local SSD mounted at /scratch and a shared mount at /home.

@emi80
Contributor

emi80 commented Jan 14, 2019

Hello all,
I am reviving this thread since I am now facing the same problem.

Looking at related issues, I found #686 (comment) and #686 (comment) by @ewels. I fully agree, and I think we would cover most of the execution cases this way.

pditommaso added a commit that referenced this issue Jan 29, 2019
This commit adds the ability to cache foreign input files
so that they are staged in the pipeline work directory.

This brings two main benefits:
1) Multiple processes using the same remote input file
   will use the same downloaded copy, without triggering
   multiple downloads of the same file;
2) When resuming the pipeline execution, all remote files
   previously downloaded are retrieved from the execution
   cache.

Moreover, this commit fixes a bug in the download thread pool
that was limiting downloads to one file at a time.

Solves #265, #686. Merge #1006

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>