Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resume should work with list-returning operators #1475

Closed
micans opened this issue Jan 29, 2020 · 17 comments
Closed

Resume should work with list-returning operators #1475

micans opened this issue Jan 29, 2020 · 17 comments
Labels

Comments

@micans
Copy link

micans commented Jan 29, 2020

Bug report

Expected behavior and actual behavior

A pipeline may restart a process when resuming if the process uses collate() or another list-returning operator.

Steps to reproduce the problem

Use a process that uses collate() on a sufficiently large input channel. (Can construct example if needed, but we discussed this on gitter). Run the workflow, run it again with -resume.

@pditommaso
Copy link
Member

A minimal example to replicate the problem is

process foo {
  input: val(x) from (1,2,3)
  output:  val(x) into ch 
  /echo $x/
}

process bar {
  echo true
  input: val(y) from ch.collate(3)
  /your_command --items $y/
}

I see your issue with the caching, however this is a subtle point because it really depends if the bar command does not care about the collated elements order or if it does. In the latter case, it's correct that the task is re-executed.

@micans
Copy link
Author

micans commented Jan 30, 2020

In my case the command does not care about the order of collated elements. I assume this is a common scenario, as collate depends on channel ordering which is unpredictable/random as far as the user is concerned. When I checked, it seemed to me that the resumed process collated elements in a different order even and I ended up with different 'collate' buckets. I will check my findings and either post an example here or retract the statement; hopefully later this week.

@pditommaso
Copy link
Member

pditommaso commented Jan 30, 2020

A quick workaround should be

process bar {
  echo true
  input: val(y) from ch.collate(3).toSortedList()
  /your_command --items $y/
}

@micans
Copy link
Author

micans commented Jan 30, 2020

With this code:

process foo {
  input: val(x) from (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
  output:  val(x) into ch
  /echo $x/
}

process bar {
  echo true
  input: val(y) from ch.collate(3)
  shell: 'echo -n !{y}'
}

I get:

node-10-4-3,it/cellgeni/patterns (master *%) (pat) nextflow run collate.nf 
N E X T F L O W  ~  version 19.10.0
Launching `collate.nf` [gloomy_mercator] - revision: 69a4f3f7dd
executor >  local (24)
[47/2fce86] process > foo (2) [100%] 18 of 18 ✔
[4f/8e209c] process > bar (6) [100%] 6 of 6 ✔
[13, 8, 7]
[6, 10, 11]
[15, 4, 5]
[17, 9, 16]
[1, 12, 14]
[3, 18, 2]

node-10-4-3,it/cellgeni/patterns (master *%) (pat) nextflow run collate.nf  -resume
N E X T F L O W  ~  version 19.10.0
Launching `collate.nf` [gigantic_engelbart] - revision: 69a4f3f7dd
executor >  local (6)
[d4/cd44d0] process > foo (11) [100%] 18 of 18, cached: 18 ✔
[ea/8670c2] process > bar (3)  [100%] 6 of 6 ✔
[15, 8, 2]
[10, 4, 7]
[18, 14, 5]
[1, 13, 6]
[9, 11, 12]
[3, 17, 16]

The issue that hampers me is that the collate acts on a different order entirely after the resumption.
When I use toSortedList() I get:

N E X T F L O W  ~  version 19.10.0
Launching `collate.nf` [spontaneous_cajal] - revision: 616eff383b
executor >  local (19)
[5c/6f7d91] process > foo (16) [100%] 18 of 18 ✔
[b2/23fe48] process > bar      [100%] 1 of 1 ✔
[[2, 9, 12], [3, 14, 1], [4, 15, 17], [8, 6, 7], [11, 13, 16], [18, 10, 5]]

So it changes the logic of the process.

@micans
Copy link
Author

micans commented Jan 30, 2020

As a side note, I use collate and metafile approaches to reduce the load on the file system. I will soon have workflows with close to 100k samples; in many workflows it helps a lot if I do some mini-batching internally (outside NF knowledge) using collate/metafile-type approaches. Whilst not perhaps philosophically pure, it saves a large amount of headaches and time, and gets more out of the resources we have.

@pditommaso
Copy link
Member

Sorry, toSortList is not a workaround. I agree the resume should be handled properly. I've identified the possible solution:

  1. make collate to return an ArrayBag instead ArrayList so that order is ignored when computing the hash
  2. add to collate a sort option to make the result stable across resume. Cons: you need to remember to enable it. Bit tricky.
  3. make the collate sorting of the result implicit. Cons: the elements must be Comparable which must not always be the case.
  4. add to collect a by which allows the specification of the length of the collected chunks and making it work in practice as the collate but without resume issue. Cons: overlap in the operator semantics, more complicated to implement.

Maybe the best choice is 3, showing a warning if the elements cannot be sorted and falling back on the old behaviour.

@micans
Copy link
Author

micans commented Jan 31, 2020

Can I verify one thing: There are two elements to this as far as I can see: the channel order and the chunk/bag sorting. You write about the sorting of the output, but I believe the bigger problem is the order of the channel input. The change of channel order leads to the issue that chunks become different. For example compare these outputs for two runs, the second with -resume. I've included a view() in the code, so it now looks like this:

process foo {
  input: val(x) from (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
  output:  val(x) into ch
  /echo $x/
}
process bar {
  echo true
  input: val(y) from ch.view().collate(3)
  shell: 'echo -n !{y}'
}
17 8 13 16 9 11 7 4 18 2 6 10 15 14 3 1 12 5
[16, 9, 11]
[7, 4, 18]
[17, 8, 13]
[2, 6, 10]
[15, 14, 3]
[1, 12, 5]

Second run using -resume:
7 17 16 14 8 4 9 15 13 12 1 18 3 6 10 2 5 11
[9, 15, 13]
[3, 6, 10]
[14, 8, 4]
[12, 1, 18]
[7, 17, 16]
[2, 5, 11]

You can see that collate chunks (viewed as bags) are defined by the channel order, and it is different between runs. The solutions above seem to focus on each individual output bag ordering, rather than the problem of the effect that changing channel orderings changes the bag definitions.

@pditommaso
Copy link
Member

pditommaso commented Jan 31, 2020

Then any of above will solve your use case, let's discuss a bit more on Gitter.

@micans
Copy link
Author

micans commented Feb 13, 2020

Two comments, one question.

(1) I've experienced/finally realised that collect() suffers from this as well. As far as I can see using toSortedList() instead alleviates this issue. In terms of -resume it seems to me it is best practice to either use toSortedList() instead of collect() when no collect closure is used, or to use .collect { stuff }.toSortedList() if a closure is used. Does that make sense?

(2) In my current pipeline the channel that needs collating has all its inputs ready. This allows

process foo {
  input: val(x) from (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
  output:  val(x) into ch
  /echo $x/
}
process bar {
  echo true
  input: val(y) from ch.toSortedList().flatMap().collate(3)
  shell: 'echo -n !{y}'
}

This approach seems to work on the small test case above. I assume there is no issue with that, not even when there are 100k elements in the channel ... I'll go and test 56k elements now.

@pditommaso
Copy link
Member

No, collect should be resume safe, provided you are not chunking/ manipulating the result.

The problem of toSortedList is that it's blocking (you cannot sort without knowing all items) and my understanding is that you want to avoid that.

@micans
Copy link
Author

micans commented Feb 14, 2020

ok, I misunderstood. Yes I would want to avoid that, but I found that in one particular pipeline it is actually OK to use toSortedList as the inputs arrive at the same time anyway.

@micans
Copy link
Author

micans commented Feb 21, 2020

Documenting here that e.g. first() also does not play nicely with resume:

process foo {
  input: val(x) from (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
  output:  val(x) into ch
  /echo $x/
}

process bar {
   echo true
   input: val(y) from ch.first()
   shell: 'echo -n !{y}'
}

bar will be re-run due to random ordering of ch. This affected a recent pipeline; a work-around can be found. It may be worthwhile collating these operators / resume subtleties.

@DaGaMs
Copy link
Contributor

DaGaMs commented Mar 11, 2020

I am having an issue with -resume restarting the pipeline from the start, despite the fact that all the processes up to a certain point have finished already. Having read (but only partly understood) the discussion here, I wonder if this has to do with the fact that I'm using merge somewhere in the pipeline (upstream of where it crashed)?

Here is part of my workflow:

workflow {
  main:
    Channel.of('tumour', 'control') | merge(Channel.of(params.tumour_frac, params.control_frac)) \
                                    | merge(Channel.fromPath([params.tumour_bam, params.control_bam])) \
                                    | subsample \
                                    | sortBamByName \
                                    | bamToFastq \
                                    | set { subsampledFastqs }
    // CRASH HAPPENED IN mapFastq
    subsampledFastqs | mapFastq | set { rawMappedBams }
[...]

The problem is that even though the two merges are entirely deterministic and all steps up to and including bamToFastq completed for the two files, nextflow keeps resuming from subsample. Is that because it thinks it can't know if the previous run's merge was incomplete?

@stale
Copy link

stale bot commented Aug 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@diego-rt
Copy link

Hi,

Is there any update on what's the solution to getting collate to play nicely with resume?

@nakib103
Copy link

nakib103 commented Feb 9, 2023

Hello,

I am also facing issue because of the channel order. Has there been any solution to handle this case?

@mribeirodantas
Copy link
Member

I can think of two possible solutions:

  1. If you don't mind blocking, toSortedList after collate should not break resume;
  2. You can also use the fair process directive to guarantee the order in the output channel, as this enables fair threading (possible decrease in performance, though). With the guarantee in the order, resume won't break.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants