
Unnecessary copy of input files to the task work directory #961

Closed

rsuchecki opened this issue Dec 5, 2018 · 16 comments

@rsuchecki (Contributor)
Bug report - or perhaps just an inconsistency

Expected behavior and actual behavior

As per the documentation,

Some caveats on glob pattern behavior:

  • Input files are not included in the list of possible matches.

This no longer holds if the scratch true directive is set for a process.

Steps to reproduce the problem

#!/usr/bin/env nextflow

Channel.from('Bonjour')
  .collectFile()
  .set { greetings }

process echo {
  scratch true
  input:
    file ('greeting.in')  from greetings

  output:
    file('*')

  script:
  """
  cat greeting.in greeting.in > greeting.out
  """
}
rm -rf work/ && nextflow run main.nf && tree -h work

Program output

With scratch false

File greeting.in is a symlink, as expected.

N E X T F L O W  ~  version 18.10.1
Launching `main.nf` [hungry_easley] - revision: 99f69e3c7e
[warm up] executor > local
[27/82d2c1] Submitted process > echo (1)
work
├── [4.0K]  27
│   └── [4.0K]  82d2c16f3fcc6f85b8195c0801ba57
│       ├── [  71]  greeting.in -> /tmp/hello/work/tmp/2c/711ca50bd2a4c3e3ac1b437bb769c9/collect-file.data
│       └── [  14]  greeting.out
└── [4.0K]  tmp
    └── [4.0K]  2c
        └── [4.0K]  711ca50bd2a4c3e3ac1b437bb769c9
            └── [   7]  collect-file.data

With scratch true

File greeting.in is a regular file; in a real-world scenario this could be a large file duplicated as many times as the process is executed.

N E X T F L O W  ~  version 18.10.1
Launching `main.nf` [mad_minsky] - revision: abe1eb2016
[warm up] executor > local
[35/8379d6] Submitted process > echo (1)
work
├── [4.0K]  35
│   └── [4.0K]  8379d6f078b84b19445786988c69e9
│       ├── [   7]  greeting.in
│       └── [  14]  greeting.out
└── [4.0K]  tmp
    └── [4.0K]  73
        └── [4.0K]  2c980de793155629eabb0faca75422
            └── [   7]  collect-file.data

Environment

  • Nextflow version: 18.10.1
  • Java version: 1.8.0_191
  • Operating system: Linux
@rsuchecki (Contributor Author) commented Dec 5, 2018

By the way, this is specific to the default stageOutMode 'copy' but not stageOutMode 'move' or stageOutMode 'rsync', so perhaps cp should not be used with -L, --dereference?
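For illustration, a minimal sketch of the -L behaviour in question (assuming GNU cp; the file names here are made up):

# minimal sketch, assuming GNU cp; file names are made up
echo data > original.txt
ln -s "$PWD/original.txt" staged.link

cp -L staged.link deref.copy      # -L/--dereference: deref.copy is a full copy of original.txt
cp -P staged.link kept.link       # -P/--no-dereference: kept.link is itself a symlink

ls -l deref.copy kept.link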

@pditommaso (Member) commented Dec 5, 2018

Many things here; let's focus on the issue subject. I've modified the test case as follows:

Channel.from('Bonjour')
  .collectFile()
  .set { greetings }

process echo {
  input:
    file ('greeting.in')  from greetings

  output:
    file('*') into result

  script:
  """
  cat greeting.in > greeting.out
  """
}

result.println()

It outputs one file independently of the use of scratch:

 $ nextflow run test.nf 
N E X T F L O W  ~  version 18.11.0-edge
Launching `test.nf` [stoic_hamilton] - revision: 1f7de396b7
[warm up] executor > local
[d8/93493b] Submitted process > echo (1)
/Users/pditommaso/projects/nextflow/work/d8/93493b90638693d394f44a2d321ad4/greeting.out

$ nextflow run test.nf -process.scratch true
N E X T F L O W  ~  version 18.11.0-edge
Launching `test.nf` [fervent_bassi] - revision: 1f7de396b7
[warm up] executor > local
[4e/d435a7] Submitted process > echo (1)
/Users/pditommaso/projects/nextflow/work/4e/d435a754f967b44eec35c914c2bfa2/greeting.out

Therefore the assertion "Input files are not included in the list of possible matches" is valid in both cases.

@rsuchecki (Contributor Author)

Thanks @pditommaso - I misdiagnosed the issue and (surprise, surprise) you are right about the main point, so this issue should be closed as it is misleading.

I guess the actual concern is the fact that input file symlinks are being resolved when copying from scratch. If the process in question is executed multiple times, we'll end up with multiple copies of input files.

@pditommaso (Member)

Damn! You are right; now I get what you mean. The problem is not that the input file is copied from the work dir to the scratch dir, but the other way around.

Now the problem is how to exclude a list of known file names from a cp * /target?

@pditommaso pditommaso changed the title Process output glob captures input files when scratch true Unnecessary copy of input files to the task work directory Dec 6, 2018
@rsuchecki (Contributor Author)

Correct, it is about files being copied from scratch to work dir.

As for how - one has to be careful, but in simple terms this should work (a rough sketch follows below):

  1. resolve glob(s)
  2. filter out staged-in filenames
  3. add any explicitly declared output filenames to capture inputs which are meant to be passed through the process (this may include one or more of the original inputs)
  4. cp...

but things could get tricky if multiple outputs are specified, especially in sub-dirs (?)
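A rough bash sketch of those steps, assuming GNU coreutils and two made-up helper files holding the staged-in names and the explicitly declared output names:

# rough sketch of the steps above; .stagein.list and .declared.list are made-up helper files
ls -1d * > .matched.list                                    # 1. resolve glob(s)
grep -vxF -f .stagein.list .matched.list > .tocopy.list     # 2. filter out staged-in filenames
cat .declared.list >> .tocopy.list                          # 3. add back explicitly declared outputs
sort -u .tocopy.list -o .tocopy.list
xargs -r -d '\n' cp --parents -t /target < .tocopy.list     # 4. cp...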

Exclusion would be ideal, as the current set-up leads to a range of different outcomes depending on the combination of settings used:

| stageIn | stageOut | comment |
| --- | --- | --- |
| copy | any | inputs captured by glob and copied/moved to work dir ❌ |
| link | move | hardlink preserved ✔️ |
| link | cp/rsync | hardlink lost ❌ |
| symlink | mv/rsync | soft link preserved ✔️ |
| symlink | cp | soft link resolved ❌; it should suffice to run cp * /target without -L, --dereference, but this is risky as symlinks may be created within some dir structure generated by the process script |
| rellink | | ignore, not allowed with scratch true |

So the cases with ✔️ indicate that although files are captured by glob, these are (sym/hard) links so additional consumption of disk space is not an issue.

As a separate point, can hard-links be of much use in the context of proper scratch space? That would typically be a separate file system?

@pditommaso (Member)

Sorry for the very late reply; starting from point 1 (resolve glob(s)), it is not so trivial.

The main problem is that the glob can only be resolved at runtime (a task can create any file), therefore it must be resolved by the Bash script. Ideally it would need a kind of cp <glob> /target --exclude <input file names list>.

@rsuchecki (Contributor Author) commented Dec 20, 2018

That is right, but the glob can be resolved in bash before cp is called. For example, use ls or find to match the glob pattern, then filter out inputs using a temp file of filenames, or, as below, process substitution to have one less file to worry about:

ls GLOB | grep -vFf <(echo ${InputFileNameList}) | xargs cp --target-directory /target

(where InputFileNameList is the NF variable)

Also interesting: extended globbing

Side note: the rsync version should be straightforward with --include and --exclude.
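For example, something along these lines might do for the rsync case (the list file name is an assumption):

# hedged sketch for the rsync case; .inputs.list is an assumed file holding the staged-in names, one per line
rsync -a --exclude-from=.inputs.list --exclude=.inputs.list ./ /target/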

@pditommaso (Member)

Nice, the extended globs, though I'm not sure they can help in this case. I'm thinking more of something like what is already implemented for Batch:

IFS=\$'\\n'
for name in \$(eval "ls -1d \$pattern");do
  if [[ -d "\$name" ]]; then
    $cli s3 cp --only-show-errors --recursive $encryption--storage-class $storage "\$name" "\$s3path/\$name"
  else
    $cli s3 cp --only-show-errors $encryption--storage-class $storage "\$name" "\$s3path/\$name"
  fi
done

Eval the output file name glob in a for loop, then check for each entry whether it matches an input file name (or a previous output glob); if it matches => skip it, otherwise copy it.

To check if a file name matches a glob, compgen can be used.
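For instance, a hedged sketch of that check ($pattern, $name and the target path are assumed names):

# hedged sketch; $pattern and $name are assumed variables
if compgen -G "$pattern" | grep -qxF "$name"; then
    : # skip: the name is among the files matched by the earlier pattern
else
    cp "$name" /target
fi

Alternatively, [[ $name == $pattern ]] (with the pattern unquoted) matches the name against the glob without touching the filesystem.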

@rsuchecki (Contributor Author)

Yes that looks ok, just a minor point - wondering if eval is needed here at all,

for name in \$(eval "ls -1d \$pattern");do 

should be equivalent to

for name in \$(ls -1d \$pattern);do 

@pditommaso (Member)

I remember that I tried not to use eval, but there was a problem which frankly I'm unable to recall now.

@pditommaso pditommaso added this to the v19.04.0 milestone Jan 7, 2019
@pditommaso (Member)

I've mocked up a possible patch:

# list of glob patterns already processed; names matching any of these are skipped
all_patterns=('')

nxf_cp() {
    local pattern=$1
    local target=$2
    IFS=$'\n'
    # expand the pattern and iterate over each matching file name
    for name in $(eval "ls -1d $pattern");do
       # skip names already covered by a previously processed pattern
       # (the -n guard ignores the empty seed entry, which would otherwise match everything)
       for var in "${all_patterns[@]}"; do [[ -n $var && $name =~ $var ]] && continue 2; done
       cp $name $target
    done
    unset IFS
    # remember this pattern so later calls do not copy the same files again
    all_patterns+=($pattern)
}
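A hypothetical invocation for the test case above, assuming the staged input name is seeded into the pattern list first (how it would get there is not part of the mock-up, and the target path is made up):

# hypothetical usage of the mock-up above
all_patterns+=('greeting.in')   # register the staged input so it is never copied back
nxf_cp '*' /work/dir            # copy the wildcard outputs, skipping names matching earlier patterns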

@rsuchecki (Contributor Author)

I don't fully understand the context of how you would apply it (copy scratch to work? or first use it for the copy from work to scratch, collecting the patterns to be excluded), but the mechanism appears to be sound, and if e.g. input file patterns are given they will not be copied.

Just thought about a scenario that can be relevant to this topic. I already have cases where I declare some file as one of the inputs AND one of the outputs - sort of passing it through the process - no issues there I think, BUT what if someone does that but also modifies such a file? I don't think that would be a reasonable thing to do, but if possible, it would affect caching and would introduce ambiguity into what is or is not copied from scratch.

@pditommaso (Member)

This is supposed to be used in place of the cp commands that stage out task outputs from the scratch dir to the task work dir:

then:
binding.unstage_outputs == '''\
mkdir -p /work/dir
cp -fRL test.bam /work/dir || true
cp -fRL test.bai /work/dir || true
'''.stripIndent().rightTrim()

Actually it should be extended to handle the command in a parametric manner so that it can use cp, mv or rsync along the same lines as this method.

> Just thought about a scenario that can be relevant to this topic. I already have cases where I declare some file as one of the inputs AND one of the outputs - sort of passing it through the process - no issues there I think, BUT what if someone does that but also modifies such a file? I don't think that would be a reasonable thing to do, but if possible, it would affect caching and would introduce ambiguity into what is or is not copied from scratch.

Currently NF excludes all the input file names from the output set when you specify a wildcard. It doesn't if the output is an exact (non-wildcard) name.

Therefore I think this method should be changed so that it:

  1. finds all output file names that contain no wildcard
  2. copies those files using the current mechanism
  3. creates a list of all input names plus the above outputs and uses it as an exclusion list
  4. copies the remaining outputs that include a wildcard with the above bash function (a condensed sketch follows)
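A condensed bash sketch of those four steps, assuming GNU coreutils; .staged_inputs, the exclusion-list file and the concrete names are assumptions rather than the actual implementation:

# hedged sketch of the four steps above
cp -fRL test.bam /work/dir || true          # 1-2. outputs declared without a wildcard, as today
cat .staged_inputs > .exclude.list          # 3. exclusion list = staged input names...
echo test.bam >> .exclude.list              #    ...plus the outputs already copied above
ls -1d * | grep -vxF -f .exclude.list | xargs -r -d '\n' cp -fRL -t /work/dir   # 4. remaining wildcard outputs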

@pditommaso pditommaso pinned this issue Jan 30, 2019
@pditommaso (Member)

A possible strategy I have identified to handle this consists of creating a file listing all staged task inputs just before the task execution, with something like:

ls -1a > .command.skip_files

Then use the following script to copy one or more task outputs, taking into account the staged inputs that need to be skipped:

nxf_sync() {
    local cmd=($1)      # command to be executed
    local source=$2     # source path, it may use a glob pattern
    local target=$3     # target directory
    local skip_inputs=${4:-true}    # flag to enable skipping of input files (true when source is a glob)
    local INPUT_FILES=.command.skip_files

    IFS=$'\n'
    # expand the source pattern and iterate over each matching file
    for name in $(eval "ls -1d $source");do
      # skip if the file name is in the list of input files
      if $skip_inputs && grep "^$name$" $INPUT_FILES &>/dev/null; then continue; fi
      # compute the new file name
      local newfile=$target/$name
      # create the target directory
      [[ ! -e $newfile ]] && mkdir -p $(dirname $newfile)
      # copy / move the file
      ${cmd[*]} $name $newfile
      # append the copied file to the skip list to avoid processing it again
      # if it is matched by a different source pattern
      echo $name >> $INPUT_FILES
    done
    unset IFS
}
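A hypothetical invocation for staging out the outputs of the earlier example (the target path is made up):

# hypothetical usage of nxf_sync above
ls -1a > .command.skip_files                    # record the staged inputs before the task runs
nxf_sync "cp -fRL" 'test.bam' /work/dir false   # exact output name: no input skipping needed
nxf_sync "cp -fRL" '*'        /work/dir true    # wildcard output: skip the staged inputs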

However, introducing this logic seems overkill to me. I think it would make more sense to put a warning in the documentation and eventually solve this in the future.

@Jokendo-collab

@rsuchecki I am running Nextflow version 19.04.0 and it stops, and when I check the slurm output error it shows this. What do you think is the problem?
N E X T F L O W ~ version 19.04.0
Launching main.nf [maniac_payne] - revision: 807b6d97e9

@rsuchecki (Contributor Author)

Can you be more specific @javanOkendo? Perhaps discuss on https://gitter.im/nextflow-io/nextflow?
