
Hash of file included in repository changes between runs on Batch when using 'lenient' cache #1989

Closed
brandoncazander opened this issue Mar 24, 2021 · 10 comments

Comments

@brandoncazander
Contributor

Bug report

Expected behavior and actual behavior

I have a workflow that uses files tracked in the same repository, which are passed to my process along with other inputs (files stored on S3).

manifest = file("${baseDir}/manifests/my_manifest.tsv")
process_foo(manifest, other_inputs)

In my nextflow.config file, I specify cache = 'lenient' so that -resume works on Batch, and this works for all of my channels sourced from S3. However, all of my processes are re-run on Batch, whereas caching works locally.
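For context, the relevant setting looks like this (a minimal sketch of a nextflow.config; the actual file contains more):

```groovy
// nextflow.config (sketch): request lenient cache keys for every process
process.cache = 'lenient'
```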

Steps to reproduce the problem

I have an example repository here that uses a local file for input to a process. Running it with nextflow run main.nf -resume twice shows that the caching works on a local filesystem, but running it twice on Batch results in the whole workflow being re-executed.
https://github.com/brandoncazander/nextflow-file-caching-example

Program output

I ran the above workflow on Batch using -resume and -dump-hashes (thank you very much for this post) to figure out which hash was different. Here are the relevant sections:

First run

Mar-24 18:53:19.626 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [foo (Fifi)] cache hash: fd7dbfb25e5913cbca2ac1ad607946b3; mode: LENIENT; entries: 
  e43788d1332df5bc69827108419d7150 [java.util.UUID] 4e5df450-1c24-44a3-b348-70b22a2c0e05 
  2c5b2cb099d94748fe659e4cfc1d63f3 [java.lang.String] foo 
  defd25ba45902fd53b8dcfcb143f3925 [java.lang.String] """
    sha1sum ${template_file}
    cat ${template_file} > ${name}.txt
    echo ${name} >> ${name}.txt
    """
 
  3b67b3bb12729f568e831e72b2a90b6e [java.lang.String] opensuse/leap:latest 
  804956c6e764ab30963d196077092d8b [java.lang.String] name 
  101ad52bd9390c666a6ee58c083e782b [java.lang.String] Fifi 
  c4e99fc600782c59d03c0ef3c504dc17 [java.lang.String] template_file 
  3c662c3f258f87c7f8567dcafb28a78a [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, stageName:template.txt)] 
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ 
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true 

Second run

Mar-24 19:23:05.228 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [foo (Fifi)] cache hash: d2333aa2f61a1251c9442d04252e86de; mode: LENIENT; entries: 
  e43788d1332df5bc69827108419d7150 [java.util.UUID] 4e5df450-1c24-44a3-b348-70b22a2c0e05 
  2c5b2cb099d94748fe659e4cfc1d63f3 [java.lang.String] foo 
  defd25ba45902fd53b8dcfcb143f3925 [java.lang.String] """
    sha1sum ${template_file}
    cat ${template_file} > ${name}.txt
    echo ${name} >> ${name}.txt
    """
 
  3b67b3bb12729f568e831e72b2a90b6e [java.lang.String] opensuse/leap:latest 
  804956c6e764ab30963d196077092d8b [java.lang.String] name 
  101ad52bd9390c666a6ee58c083e782b [java.lang.String] Fifi 
  c4e99fc600782c59d03c0ef3c504dc17 [java.lang.String] template_file 
  f00084daa79510d22546014ec7cb01b1 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, stageName:template.txt)] 
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ 
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true 

Diff

Here is the hash entry that differs between the two runs.

135c135
<   3c662c3f258f87c7f8567dcafb28a78a [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, stageName:template.txt)] 
---
>   f00084daa79510d22546014ec7cb01b1 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, stageName:template.txt)] 

Environment

  • Nextflow version: 20.10.0.5430
  • Java version: openjdk 1.8.0_282
  • Operating system: Linux
  • Bash version: 5.0.17

Additional context

Let me know if there's any more context that I can provide. If you have an idea of where to look or how I can start troubleshooting the code around this, I would be happy to give that a shot as well. Thank you!

@brandoncazander
Contributor Author

Using cache = 'deep' does get around the issue, but I can't set this per input, and my process takes other input files that are very large.

I think I've narrowed down the issue: the hashFileMetadata method in the CacheHelper class is adding the absolute path of the file to the hash. When run locally, the absolute path of the file remains the same between runs, so the cache hit succeeds.

I can see the FileHolder object in the output of -dump-hashes and that the sourceObj and storePath attributes contain the full path to the staging area.

FileHolder(
    sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt,
    storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt,
    stageName:template.txt
)

The sourceObj is what is hashed.

When I use the files from s3, this is what the FileHolder looks like instead:

FileHolder(
    sourceObj:/bcazander-orchestration-ch/assets/template.txt,
    storePath:/bcazander-orchestration-ch/assets/template.txt,
    stageName:template.txt
)

Where I'm looking now is at the normalizeInputToFiles method in TaskProcessor, but I'm not sure if this is the correct place for a change. My hunch is that the FileHolder class will need to store a relative path as well, but I would love guidance from someone more experienced with the codebase here.
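This is not Nextflow's implementation, but a minimal Python sketch of the failure mode described above: a lenient-style cache key built from the absolute path plus the file size changes whenever the staging directory changes, even though the file contents do not (paths and sizes below are illustrative):

```python
import hashlib

def lenient_key(path: str, size: int) -> str:
    # Lenient-style cache key: absolute path plus file size (timestamps ignored).
    h = hashlib.md5()
    h.update(path.encode())
    h.update(str(size).encode())
    return h.hexdigest()

# The same template.txt, staged under two different run directories:
run1 = lenient_key("/workdir/stage/a7/template.txt", 42)
run2 = lenient_key("/workdir/stage/03/template.txt", 42)
print(run1 == run2)  # False: the path difference alone changes the key
```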

@pditommaso
Member

but running it twice on Batch results in the whole workflow being re-executed.

My understanding is that your NF execution is run remotely, and therefore the pipeline project is downloaded twice, causing the cache to be invalidated. Is this correct?

@brandoncazander
Contributor Author

My understanding is that your NF execution is run remotely, and therefore the pipeline project is downloaded twice, causing the cache to be invalidated. Is this correct?

That's correct. It happens when using the AWS Batch executor.

Your patch looks good; thanks for the quick action!

My one question is whether this will work in conjunction with the 'lenient' mode of caching, which is required for resume on shared filesystems (like S3). From my reading, it looks like hashFileAsset will only be called in the 'standard' mode of caching. I will see if I can find a way to test this.

@pditommaso
Member

lenient mode should not be affected by this problem, because the issue is caused by the file timestamps changing when the repository is downloaded across multiple runs

@brandoncazander
Contributor Author

lenient mode should not be affected by this problem, because the issue is caused by the file timestamps changing when the repository is downloaded across multiple runs

It does happen in lenient mode for me, unfortunately. The only mode not affected is 'deep', which I could use, but it slows down my workflow substantially, as my other inputs are large files.

You can see that the 'lenient' mode is in use in my output from the first/second run with -dump-hashes:

First run:

Mar-24 18:53:19.626 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [foo (Fifi)] cache hash: fd7dbfb25e5913cbca2ac1ad607946b3; mode: LENIENT; entries: 
(snip)
  3c662c3f258f87c7f8567dcafb28a78a [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, stageName:template.txt)] 
(snip)

Second run:

Mar-24 19:23:05.228 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [foo (Fifi)] cache hash: d2333aa2f61a1251c9442d04252e86de; mode: LENIENT; entries: 
(snip)
  f00084daa79510d22546014ec7cb01b1 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, stageName:template.txt)] 
(snip)
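For contrast (again an illustrative Python sketch, not Nextflow's code), a deep-style key hashes only the file's bytes, so it is stable across different staging paths, at the cost of reading every input in full:

```python
import hashlib
import os
import tempfile

def deep_key(path: str) -> str:
    # Deep-style cache key: hash the file's contents, ignoring its location.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Stage identical content under two different paths, as Batch does between runs:
with tempfile.TemporaryDirectory() as d:
    p1 = os.path.join(d, "run1_template.txt")
    p2 = os.path.join(d, "run2_template.txt")
    for p in (p1, p2):
        with open(p, "wb") as f:
            f.write(b"template contents\n")
    print(deep_key(p1) == deep_key(p2))  # True: content-only hash is path-stable
```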

@pditommaso
Member

pditommaso commented Mar 30, 2021

I've double-checked this and lenient works as well. You have to make sure the pipeline is run with process.cache = 'lenient' in the first place.

@brandoncazander
Contributor Author

I've double-checked this and lenient works as well. You have to make sure the pipeline is run with process.cache = 'lenient' in the first place.

I am setting process.cache = 'lenient' and the logs confirm this: [foo (Fifi)] cache hash: d2333aa2f61a1251c9442d04252e86de; mode: LENIENT; entries:

The issue is that lenient mode uses the path to the file along with its size, and the path to the file includes the staging path, which is different between runs on Batch.

Is there something else I could do to help demonstrate the behaviour to you? The example workflow I provided here is as minimal as I could get it. If not, that's fine, I'll see if I can get some extra cycles to work on a patch for what I believe to be the issue.

I appreciate you looking at this!

@pditommaso
Member

But why are you using lenient in the first place? With the uploaded patch, the cache will work for Batch across restarts without the need to specify any cache directive. Lenient is mainly intended for HPC shared file systems.

@brandoncazander
Contributor Author

But why are you using lenient in the first place? With the uploaded patch, the cache will work for Batch across restarts without the need to specify any cache directive. Lenient is mainly intended for HPC shared file systems.

Good point, I guess I misunderstood the caching modes in the first place. Thanks for the explanation and patch!

@pditommaso
Member

Good. Thanks for reporting the problem.
