
Hash of file included in repository changes between runs on Batch when using 'lenient' cache #1989

Closed
brandoncazander opened this issue Mar 24, 2021 · 10 comments

Comments

@brandoncazander
Contributor

Bug report

Expected behavior and actual behavior

I have a workflow that uses files tracked in the same repository, which are passed to my process along with other inputs (files stored on S3).

manifest = file("${baseDir}/manifests/my_manifest.tsv")
process_foo(manifest, other_inputs)

In my nextflow.config file, I specify cache = 'lenient' so that -resume works on Batch, and this works for all of my channels sourced from S3. However, all of my processes are re-run on Batch, whereas caching works locally.
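For context, the relevant setting looks like this (a minimal sketch of a nextflow.config; the actual file contains more):

```groovy
// nextflow.config (sketch): request lenient cache keys for every process
process.cache = 'lenient'
```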

Steps to reproduce the problem

I have an example repository here that uses a local file for input to a process. Running it with nextflow run main.nf -resume twice shows that the caching works on a local filesystem, but running it twice on Batch results in the whole workflow being re-executed.
https://github.com/brandoncazander/nextflow-file-caching-example

Program output

I ran the above workflow on Batch using -resume and -dump-hashes (thank you very much for this post) to figure out which hash was different. Here are the relevant sections:

First run

Mar-24 18:53:19.626 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [foo (Fifi)] cache hash: fd7dbfb25e5913cbca2ac1ad607946b3; mode: LENIENT; entries: 
  e43788d1332df5bc69827108419d7150 [java.util.UUID] 4e5df450-1c24-44a3-b348-70b22a2c0e05 
  2c5b2cb099d94748fe659e4cfc1d63f3 [java.lang.String] foo 
  defd25ba45902fd53b8dcfcb143f3925 [java.lang.String] """
    sha1sum ${template_file}
    cat ${template_file} > ${name}.txt
    echo ${name} >> ${name}.txt
    """
 
  3b67b3bb12729f568e831e72b2a90b6e [java.lang.String] opensuse/leap:latest 
  804956c6e764ab30963d196077092d8b [java.lang.String] name 
  101ad52bd9390c666a6ee58c083e782b [java.lang.String] Fifi 
  c4e99fc600782c59d03c0ef3c504dc17 [java.lang.String] template_file 
  3c662c3f258f87c7f8567dcafb28a78a [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, stageName:template.txt)] 
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ 
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true 

Second run

Mar-24 19:23:05.228 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [foo (Fifi)] cache hash: d2333aa2f61a1251c9442d04252e86de; mode: LENIENT; entries: 
  e43788d1332df5bc69827108419d7150 [java.util.UUID] 4e5df450-1c24-44a3-b348-70b22a2c0e05 
  2c5b2cb099d94748fe659e4cfc1d63f3 [java.lang.String] foo 
  defd25ba45902fd53b8dcfcb143f3925 [java.lang.String] """
    sha1sum ${template_file}
    cat ${template_file} > ${name}.txt
    echo ${name} >> ${name}.txt
    """
 
  3b67b3bb12729f568e831e72b2a90b6e [java.lang.String] opensuse/leap:latest 
  804956c6e764ab30963d196077092d8b [java.lang.String] name 
  101ad52bd9390c666a6ee58c083e782b [java.lang.String] Fifi 
  c4e99fc600782c59d03c0ef3c504dc17 [java.lang.String] template_file 
  f00084daa79510d22546014ec7cb01b1 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, stageName:template.txt)] 
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ 
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true 

Diff

Here is the hash entry that differs between the two runs.

135c135
<   3c662c3f258f87c7f8567dcafb28a78a [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, stageName:template.txt)] 
---
>   f00084daa79510d22546014ec7cb01b1 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, stageName:template.txt)] 

Environment

  • Nextflow version: 20.10.0.5430
  • Java version: openjdk 1.8.0_282
  • Operating system: Linux
  • Bash version: 5.0.17

Additional context

Let me know if there's any more context that I can provide. If you have an idea of where to look or how I can start troubleshooting the code around this, I would be happy to give that a shot as well. Thank you!

@brandoncazander
Contributor Author

Using cache = 'deep' does get around the issue, but I can't set this per input, and my process takes other input files that are very large.

I think I've narrowed down the issue: the hashFileMetadata method in the CacheHelper class is adding the absolute path of the file to the hash. When run locally, the absolute path of the file remains the same between runs, so the cache hit succeeds.

I can see the FileHolder object in the output of -dump-hashes and that the sourceObj and storePath attributes contain the full path to the staging area.

FileHolder(
    sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt,
    storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt,
    stageName:template.txt
)

The sourceObj is what is hashed.

When I use the files from s3, this is what the FileHolder looks like instead:

FileHolder(
    sourceObj:/bcazander-orchestration-ch/assets/template.txt,
    storePath:/bcazander-orchestration-ch/assets/template.txt,
    stageName:template.txt
)

Where I'm looking now is at the normalizeInputToFiles method in TaskProcessor, but I'm not sure if this is the correct place for a change. My hunch is that the FileHolder class will need to store a relative path as well, but I would love guidance from someone more experienced with the codebase here.
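This is not Nextflow's implementation, but a minimal Python sketch of the failure mode described above: a lenient-style cache key built from the absolute path plus the file size changes whenever the staging directory changes, even though the file contents do not (paths and sizes below are illustrative):

```python
import hashlib

def lenient_key(path: str, size: int) -> str:
    # Lenient-style cache key: absolute path plus file size (timestamps ignored).
    h = hashlib.md5()
    h.update(path.encode())
    h.update(str(size).encode())
    return h.hexdigest()

# The same template.txt, staged under two different run directories:
run1 = lenient_key("/workdir/stage/a7/template.txt", 42)
run2 = lenient_key("/workdir/stage/03/template.txt", 42)
print(run1 == run2)  # False: the path difference alone changes the key
```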

@pditommaso
Member

but running it twice on Batch results in the whole workflow being re-executed.

My understanding is that your NF execution is run remotely, and therefore the pipeline project is downloaded twice, causing the cache to be invalidated. Is this correct?

@brandoncazander
Contributor Author

My understanding is that your NF execution is run remotely, and therefore the pipeline project is downloaded twice, causing the cache to be invalidated. Is this correct?

That's correct. It happens when using the AWS Batch executor.

Your patch looks good; thanks for the quick action!

My one question is whether this will work in conjunction with the 'lenient' mode of caching, which is required for resume on shared filesystems (like S3). From my reading, it looks like hashFileAsset will only be called in the 'standard' mode of caching. I will see if I can find a way to test this.

@pditommaso
Member

lenient mode should not be affected by this problem, because the issue is caused by the file timestamps changing when the repository is downloaded across multiple runs

@brandoncazander
Contributor Author

lenient mode should not be affected by this problem, because the issue is caused by the file timestamps changing when the repository is downloaded across multiple runs

It does happen in lenient mode for me, unfortunately. The only mode not affected is 'deep', which I could use, but it slows down my workflow substantially, as my other inputs are large files.

You can see that the 'lenient' mode is in use in my output from the first/second run with -dump-hashes:

First run:

Mar-24 18:53:19.626 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [foo (Fifi)] cache hash: fd7dbfb25e5913cbca2ac1ad607946b3; mode: LENIENT; entries: 
(snip)
  3c662c3f258f87c7f8567dcafb28a78a [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/a7/0ae769aa87b10ec4772dd0ee4b151d/template.txt, stageName:template.txt)] 
(snip)

Second run:

Mar-24 19:23:05.228 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [foo (Fifi)] cache hash: d2333aa2f61a1251c9442d04252e86de; mode: LENIENT; entries: 
(snip)
  f00084daa79510d22546014ec7cb01b1 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, storePath:/bcazander-orchestration-ch/_nextflow/workdir/stage/03/1d5bce71948af02348239e0f81a90f/template.txt, stageName:template.txt)] 
(snip)
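For contrast (again an illustrative Python sketch, not Nextflow's code), a deep-style key hashes only the file's bytes, so it is stable across different staging paths, at the cost of reading every input in full:

```python
import hashlib
import os
import tempfile

def deep_key(path: str) -> str:
    # Deep-style cache key: hash the file's contents, ignoring its location.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Stage identical content under two different paths, as Batch does between runs:
with tempfile.TemporaryDirectory() as d:
    p1 = os.path.join(d, "run1_template.txt")
    p2 = os.path.join(d, "run2_template.txt")
    for p in (p1, p2):
        with open(p, "wb") as f:
            f.write(b"template contents\n")
    print(deep_key(p1) == deep_key(p2))  # True: content-only hash is path-stable
```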

@pditommaso
Member

pditommaso commented Mar 30, 2021

I've double-checked this and lenient works as well. You have to make sure the pipeline is run with process.cache = 'lenient' in the first place.

@brandoncazander
Contributor Author

I've double-checked this and lenient works as well. You have to make sure the pipeline is run with process.cache = 'lenient' in the first place.

I am setting process.cache = 'lenient' and the logs confirm this: [foo (Fifi)] cache hash: d2333aa2f61a1251c9442d04252e86de; mode: LENIENT; entries:

The issue is that lenient mode uses the path to the file along with its size, and the path to the file includes the staging path, which is different between runs on Batch.

Is there something else I could do to help demonstrate the behaviour to you? The example workflow I provided here is as minimal as I could get it. If not, that's fine, I'll see if I can get some extra cycles to work on a patch for what I believe to be the issue.

I appreciate you looking at this!

@pditommaso
Member

But why are you using lenient in the first place? With the uploaded patch, the cache will work for Batch across restarts without the need to specify any cache directive. Lenient is mainly intended for HPC shared file systems.

@brandoncazander
Contributor Author

But why are you using lenient in the first place? With the uploaded patch, the cache will work for Batch across restarts without the need to specify any cache directive. Lenient is mainly intended for HPC shared file systems.

Good point, I guess I misunderstood the caching modes in the first place. Thanks for the explanation and patch!

@pditommaso
Member

Good. Thanks for reporting the problem.
