Hash of file included in repository changes between runs on Batch when using 'lenient' cache #1989
Comments
Using […], I think I've narrowed down the issue to be that the […]. I can see the […].

The […]. When I use the files from s3, this is what the FileHolder looks like instead: […]

Where I'm looking now is at the […].
My understanding is that your NF execution runs remotely, and therefore the pipeline project is downloaded twice, causing the cache to be invalidated. Is this correct?
That's correct. It happens when using the AWS Batch executor. Your patch looks good; thanks for the quick action! My one question is whether this will work in conjunction with the 'lenient' mode of caching, which is required for resume on shared filesystems (like s3). It looks like […]
lenient mode should not be affected by this problem, because the issue is caused by the file timestamps changing when the repository is downloaded across multiple runs
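To make the timestamp argument concrete, here is a minimal Python sketch of the idea (this is illustrative only, not Nextflow's actual implementation; the function name and key composition are assumptions). The standard cache mode conceptually folds a file's last-modified time into its cache key, so a freshly downloaded copy of the repository, with new timestamps, produces a different key even when the content is unchanged:

```python
import hashlib
import os

def default_cache_key(path: str) -> str:
    # Hypothetical sketch: combine path, size, and last-modified time,
    # roughly as a timestamp-sensitive cache mode would.
    st = os.stat(path)
    h = hashlib.md5()
    h.update(path.encode())
    h.update(str(st.st_size).encode())
    h.update(str(int(st.st_mtime)).encode())  # mtime resets on each fresh clone
    return h.hexdigest()
```

Touching the file (as a fresh clone effectively does) changes the key, which is why caching breaks across remote runs under this scheme.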
It does happen in lenient mode for me, unfortunately. The only mode not affected is 'deep', which I could use, but that slows down my workflow quite substantially since my other inputs are large files. You can see that the 'lenient' mode is in use in my output from the first/second run with […].

First run: […]

Second run: […]
I've double-checked this and lenient works as well. You have to make sure the pipeline is run with […]
I am setting […]. The issue is that lenient mode uses the path to the file along with its size, and the path to the file includes the staging path, which differs between runs on Batch. Is there something else I can do to help demonstrate the behaviour to you? The example workflow I provided here is as minimal as I could get it. If not, that's fine; I'll see if I can find some extra cycles to work on a patch for what I believe to be the issue. I appreciate you looking at this!
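The reported behaviour can be sketched in a few lines of Python (illustrative only, not Nextflow's actual code; the function name and staging paths are hypothetical). If the lenient key is built from path plus size, and the path embeds a per-run staging directory, the key changes every run even though the file content does not:

```python
import hashlib

def lenient_cache_key(path: str, size: int) -> str:
    # Hypothetical sketch of the reported behaviour: key on path and size
    # only, ignoring timestamps and content.
    h = hashlib.md5()
    h.update(path.encode())
    h.update(str(size).encode())
    return h.hexdigest()

# The same repository file staged under two per-run directories (hypothetical
# paths) yields two different keys, even though the content is identical:
run1_key = lenient_cache_key("/staging/run-a/bin/script.py", 1024)
run2_key = lenient_cache_key("/staging/run-b/bin/script.py", 1024)
```

Under this scheme the two keys differ solely because of the staging prefix, which matches the cache misses observed on Batch.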
But why are you using lenient in the first place? With the uploaded patch, the cache will work for Batch across restarts without the need to specify any cache directive. Lenient mode is mainly intended for HPC shared file systems.
Good point, I guess I misunderstood the caching modes in the first place. Thanks for the explanation and patch! |
Good. Thanks for reporting the problem. |
Bug report
Expected behavior and actual behavior
I have a workflow that uses files tracked in the same repository, which are passed to my process along with other inputs (files stored on s3).
In my nextflow.config file, I specify `cache = 'lenient'` so that `-resume` works on Batch, and this works for all my channels that are sourced from s3. However, all my processes were being re-run on Batch, whereas caching was working locally.

Steps to reproduce the problem
I have an example repository here that uses a local file as input to a process. Running it twice with `nextflow run main.nf -resume` shows that caching works on a local filesystem, but running it twice on Batch results in the whole workflow being re-executed.

https://github.com/brandoncazander/nextflow-file-caching-example
Program output
I ran the above workflow on Batch using `-resume` and `-dump-hashes` (thank you very much for this post) to figure out which hash was different; here are the relevant sections:

First run
Second run
Diff
Here is the hash that differs for the process.
Environment
Additional context
Let me know if there's any more context that I can provide. If you have an idea of where to look or how I can start troubleshooting the code around this, I would be happy to give that a shot as well. Thank you!