Unnecessary copy of input files to the task work directory #961
Comments
By the way, this is specific to the default |
Many things, let's focus on the issue subject. I've modified the test case as follows:
Channel.from('Bonjour')
    .collectFile()
    .set { greetings }
process echo {
    input:
    file ('greeting.in') from greetings
    output:
    file('*') into result
    script:
    """
    cat greeting.in > greeting.out
    """
}
result.println()
It outputs one file regardless of whether scratch is used:
Therefore the assertion "Input files are not included in the list of possible matches" is valid in both cases. |
Thanks @pditommaso - I misdiagnosed the issue and (surprise, surprise) you are right about the main point so this issue should be closed as it is misleading. I guess the actual concern is the fact that input file symlinks are being resolved when copying from scratch. If the process in question is executed multiple times, we'll end up with multiple copies of input files. |
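The effect described here can be reproduced outside Nextflow. The following is only a minimal sketch, assuming the copy out of scratch follows symlinks (as `cp -L` does); the directory names are made up for illustration and this is not Nextflow's actual wrapper code.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical layout: "store" holds the real input, "scratch" is the task scratch dir
demo=$(mktemp -d)
mkdir -p "$demo/store" "$demo/scratch" "$demo/work_deref" "$demo/work_keep"
echo 'Bonjour' > "$demo/store/greeting.in"

# Stage the input into scratch as a symlink
ln -s "$demo/store/greeting.in" "$demo/scratch/greeting.in"

# Copy back following links: the target becomes a full copy of the data
cp -R -L "$demo/scratch/greeting.in" "$demo/work_deref/greeting.in"

# Copy back preserving links: the target stays a symlink
cp -R -P "$demo/scratch/greeting.in" "$demo/work_keep/greeting.in"

if [ -L "$demo/work_deref/greeting.in" ]; then deref_kind=symlink; else deref_kind=regular; fi
if [ -L "$demo/work_keep/greeting.in" ]; then keep_kind=symlink; else keep_kind=regular; fi
echo "cp -L -> $deref_kind, cp -P -> $keep_kind"
```

With a dereferencing copy, every execution of the process that captures the input in its output glob ends up with its own full copy of the file.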
Damn! You are right, I get what you mean now: the problem is not that the input file is copied from the work dir to the scratch, but the other way around. Now the problem is how to exclude from a |
Correct, it is about files being copied from scratch to work dir. As for how - have to be careful, in simple terms this should work:
but things could get tricky if multiple outputs are specified, especially in sub-dirs (?) Exclusion would be ideal, as the current set-up leads to a range of different outcomes depending on the combination of settings used:
So the cases with ✔️ indicate that although files are captured by glob, these are (sym/hard) links so additional consumption of disk space is not an issue. As a separate point, can hard-links be of much use in the context of proper scratch space? That would typically be a separate file system? |
Sorry for the very late reply. Starting from point 1: resolving the glob(s) is not so trivial. The main problem is that the glob can only be resolved at runtime (a task can create any file), therefore it must be resolved by the Bash script. Ideally it would need a kind of |
That is right, but glob can be resolved in bash before
(where Also interesting: extended globbing Side note: the |
Nice, the extended globs, though I'm not sure they can help in this case. I'm thinking more of something like what is already implemented for Batch: nextflow/src/main/groovy/nextflow/cloud/aws/batch/S3Helper.groovy Lines 34 to 41 in 8932eb7
Eval the output file name glob in a for loop, then check for each entry whether it matches an input file name (or a previous output glob): if it matches => skip, otherwise copy it. To check whether a file name matches a glob, compgen can be used |
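The loop described in this comment could look roughly like the sketch below. It is not the code from the thread: the `inputs` array, the `pattern` value and the file names are hypothetical, and the copy step is replaced by collecting names so the logic can be checked in isolation. It relies on the Bash builtin `compgen -G`, which prints the existing paths matching a glob.

```shell
#!/usr/bin/env bash
set -eu

demo=$(mktemp -d)
cd "$demo"
touch greeting.in greeting.out report.txt

inputs=(greeting.in)      # hypothetical list of staged input names
pattern='greeting.*'      # hypothetical output glob
copied=()

# compgen -G prints one matching path per line
while IFS= read -r name; do
    skip=false
    for input in "${inputs[@]}"; do
        if [[ $name == "$input" ]]; then skip=true; break; fi
    done
    if $skip; then continue; fi
    copied+=("$name")     # a real wrapper would copy "$name" to the work dir here
done < <(compgen -G "$pattern")

echo "would copy: ${copied[*]}"
```

Note that `compgen -G` performs pathname expansion only, so a pattern that depends on brace expansion would need separate handling.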
Yes that looks ok, just a minor point - wondering if for name in \$(eval "ls -1d \$pattern");do should be equivalent to for name in \$(ls -1d \$pattern);do |
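On the question of equivalence: the two forms agree for plain wildcards, but they can differ when the pattern relies on brace expansion, because Bash performs brace expansion before variable expansion. A minimal sketch with made-up file names (the `\$` escaping from the Nextflow template is dropped here since this is plain Bash):

```shell
#!/usr/bin/env bash
set -u

demo=$(mktemp -d)
cd "$demo"
touch a.txt b.txt c.log

# Plain wildcard: pathname expansion runs after variable expansion, so both forms match
pattern='*.txt'
plain_eval=$(eval "ls -1d $pattern")
plain_noeval=$(ls -1d $pattern)

# Brace pattern: braces stored in a variable are only expanded if the string is re-parsed
pattern='{a,b}.txt'
brace_eval=$(eval "ls -1d $pattern")
brace_noeval=$(ls -1d $pattern 2>/dev/null || true)

echo "wildcard with eval:    $plain_eval"
echo "wildcard without eval: $plain_noeval"
echo "braces with eval:      $brace_eval"
echo "braces without eval:   ${brace_noeval:-<no match>}"
```

The flip side is that `eval` re-parses the whole string, so anything else embedded in the pattern (quotes, command substitutions) is interpreted as well.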
I remember that I tried to not use |
I've mocked up a possible patch
|
I don't fully understand the context of how you would apply it (copy scratch to work? or first use it for the copy from work to scratch, collecting the patterns to be excluded) but the mechanism appears to be sound, and if e.g. input file patterns are given they will not be copied. Just thought about a scenario that can be relevant to this topic. I already have cases where I declare some file as one of the inputs AND one of the outputs - sort of a pass-through of a process - no issues there I think, BUT what if someone does that, but also modifies such a file? I don't think that would be a reasonable thing to do, but if possible, it would affect caching and would introduce ambiguity into what is or is not copied from scratch. |
This is supposed to be used in the place of the nextflow/modules/nextflow/src/test/groovy/nextflow/executor/BashWrapperBuilderTest.groovy Lines 391 to 396 in 07df38e
Actually it should be extended to handle the command in a parametric manner so that it can use
Currently NF excludes all the input file names from the output set when you specify a wildcard. It doesn't if the output is a specific file name. Therefore I think this method should be changed so that:
|
A possible strategy I have identified to handle this consists of creating a file listing all the task's staged inputs just before the task execution, with something like:
Then use the following script to copy one or more task outputs, taking into account the staged inputs that need to be skipped:
nxf_sync() {
    local cmd=($1)                          # command to be executed, split on whitespace
    local source=$2                         # source path, it may use a glob pattern
    local target=$3                         # target directory
    local skip_inputs=${4:-true}            # flag to enable the skipping of input files (true when source is a glob)
    local INPUT_FILES=.command.skip_files   # list of names to skip, one per line
    local IFS=$'\n'                         # split the listing on newlines only; restored on return
    local name newfile
    # expand the source pattern and iterate over each matching file
    for name in $(eval "ls -1d $source"); do
        # skip if the file name is in the list of input files (fixed-string, whole-line match)
        if $skip_inputs && grep -Fxq -- "$name" "$INPUT_FILES" 2>/dev/null; then continue; fi
        # compute the new file name
        newfile=$target/$name
        # create the target directory
        [[ ! -e $newfile ]] && mkdir -p "$(dirname "$newfile")"
        # copy / move the file
        "${cmd[@]}" "$name" "$newfile"
        # append the copied file to the list so it is not processed again if matched by a different source pattern
        printf '%s\n' "$name" >> "$INPUT_FILES"
    done
}
However, introducing this logic seems overkill to me. I think it would make more sense to put a warning in the documentation and possibly solve it in the future. |
@rsuchecki I am running Nextflow version 19.04.0 and it stops; when I check the Slurm error output it shows this. What do you think is the problem? |
Can you be more specific @javanOkendo? Perhaps discuss on https://gitter.im/nextflow-io/nextflow? |
Bug report - or perhaps just an inconsistency
Expected behavior and actual behavior
As per documentation,
This no longer holds if scratch true directive is set for a process.
Steps to reproduce the problem
Program output
With scratch false
File greetings.in is a symlink as expected.
With scratch true
File greetings.in is a regular file, in a real case scenario this could be a large file duplicated as many times as process is executed.
Environment