Resume should work with list-returning operators #1475

micans · 2020-01-29T16:52:31Z

Bug report

Expected behavior and actual behavior

A pipeline may restart a process when resuming if the process uses collate() or another list-returning operator.

Steps to reproduce the problem

Use a process that uses collate() on a sufficiently large input channel. (Can construct example if needed, but we discussed this on gitter). Run the workflow, run it again with -resume.

The text was updated successfully, but these errors were encountered:

pditommaso · 2020-01-30T13:37:01Z

A minimal example to replicate the problem is

process foo {
  input: val(x) from (1,2,3)
  output:  val(x) into ch 
  /echo $x/
}

process bar {
  echo true
  input: val(y) from ch.collate(3)
  /your_command --items $y/
}

I see your issue with the caching, however this is a subtle point because it really depends if the bar command does not care about the collated elements order or if it does. In the latter case, it's correct that the task is re-executed.

micans · 2020-01-30T13:51:41Z

In my case the command does not care about the order of collated elements. I assume this is a common scenario, as collate depends on channel ordering which is unpredictable/random as far as the user is concerned. When I checked, it seemed to me that the resumed process collated elements in a different order even and I ended up with different 'collate' buckets. I will check my findings and either post an example here or retract the statement; hopefully later this week.

pditommaso · 2020-01-30T14:04:19Z

A quick workaround should be

process bar {
  echo true
  input: val(y) from ch.collate(3).toSortedList()
  /your_command --items $y/
}

micans · 2020-01-30T14:36:28Z

With this code:

process foo {
  input: val(x) from (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
  output:  val(x) into ch
  /echo $x/
}

process bar {
  echo true
  input: val(y) from ch.collate(3)
  shell: 'echo -n !{y}'
}

I get:

node-10-4-3,it/cellgeni/patterns (master *%) (pat) nextflow run collate.nf 
N E X T F L O W  ~  version 19.10.0
Launching `collate.nf` [gloomy_mercator] - revision: 69a4f3f7dd
executor >  local (24)
[47/2fce86] process > foo (2) [100%] 18 of 18 ✔
[4f/8e209c] process > bar (6) [100%] 6 of 6 ✔
[13, 8, 7]
[6, 10, 11]
[15, 4, 5]
[17, 9, 16]
[1, 12, 14]
[3, 18, 2]

node-10-4-3,it/cellgeni/patterns (master *%) (pat) nextflow run collate.nf  -resume
N E X T F L O W  ~  version 19.10.0
Launching `collate.nf` [gigantic_engelbart] - revision: 69a4f3f7dd
executor >  local (6)
[d4/cd44d0] process > foo (11) [100%] 18 of 18, cached: 18 ✔
[ea/8670c2] process > bar (3)  [100%] 6 of 6 ✔
[15, 8, 2]
[10, 4, 7]
[18, 14, 5]
[1, 13, 6]
[9, 11, 12]
[3, 17, 16]

The issue that hampers me is that the collate acts on a different order entirely after the resumption.
When I use toSortedList() I get:

N E X T F L O W  ~  version 19.10.0
Launching `collate.nf` [spontaneous_cajal] - revision: 616eff383b
executor >  local (19)
[5c/6f7d91] process > foo (16) [100%] 18 of 18 ✔
[b2/23fe48] process > bar      [100%] 1 of 1 ✔
[[2, 9, 12], [3, 14, 1], [4, 15, 17], [8, 6, 7], [11, 13, 16], [18, 10, 5]]

So it changes the logic of the process.

micans · 2020-01-30T14:42:36Z

As a side note, I use collate and metafile approaches to reduce the load on the file system. I will soon have workflows with close to 100k samples; in many workflows it helps a lot if I do some mini-batching internally (outside NF knowledge) using collate/metafile-type approaches. Whilst not perhaps philosophically pure, it saves a large amount of headaches and time, and gets more out of the resources we have.

pditommaso · 2020-01-31T11:05:41Z

Sorry, toSortList is not a workaround. I agree the resume should be handled properly. I've identified the possible solution:

make collate to return an ArrayBag instead ArrayList so that order is ignored when computing the hash
add to collate a sort option to make the result stable across resume. Cons: you need to remember to enable it. Bit tricky.
make the collate sorting of the result implicit. Cons: the elements must be Comparable which must not always be the case.
add to collect a by which allows the specification of the length of the collected chunks and making it work in practice as the collate but without resume issue. Cons: overlap in the operator semantics, more complicated to implement.

Maybe the best choice is 3, showing a warning if the elements cannot be sorted and falling back on the old behaviour.

micans · 2020-01-31T11:19:04Z

Can I verify one thing: There are two elements to this as far as I can see: the channel order and the chunk/bag sorting. You write about the sorting of the output, but I believe the bigger problem is the order of the channel input. The change of channel order leads to the issue that chunks become different. For example compare these outputs for two runs, the second with -resume. I've included a view() in the code, so it now looks like this:

process foo {
  input: val(x) from (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
  output:  val(x) into ch
  /echo $x/
}
process bar {
  echo true
  input: val(y) from ch.view().collate(3)
  shell: 'echo -n !{y}'
}

17 8 13 16 9 11 7 4 18 2 6 10 15 14 3 1 12 5
[16, 9, 11]
[7, 4, 18]
[17, 8, 13]
[2, 6, 10]
[15, 14, 3]
[1, 12, 5]

Second run using -resume:
7 17 16 14 8 4 9 15 13 12 1 18 3 6 10 2 5 11
[9, 15, 13]
[3, 6, 10]
[14, 8, 4]
[12, 1, 18]
[7, 17, 16]
[2, 5, 11]

You can see that collate chunks (viewed as bags) are defined by the channel order, and it is different between runs. The solutions above seem to focus on each individual output bag ordering, rather than the problem of the effect that changing channel orderings changes the bag definitions.

pditommaso · 2020-01-31T11:25:42Z

Then any of above will solve your use case, let's discuss a bit more on Gitter.

micans · 2020-02-13T21:51:26Z

Two comments, one question.

(1) I've experienced/finally realised that collect() suffers from this as well. As far as I can see using toSortedList() instead alleviates this issue. In terms of -resume it seems to me it is best practice to either use toSortedList() instead of collect() when no collect closure is used, or to use .collect { stuff }.toSortedList() if a closure is used. Does that make sense?

(2) In my current pipeline the channel that needs collating has all its inputs ready. This allows

process foo {
  input: val(x) from (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
  output:  val(x) into ch
  /echo $x/
}
process bar {
  echo true
  input: val(y) from ch.toSortedList().flatMap().collate(3)
  shell: 'echo -n !{y}'
}

This approach seems to work on the small test case above. I assume there is no issue with that, not even when there are 100k elements in the channel ... I'll go and test 56k elements now.

pditommaso · 2020-02-14T10:24:52Z

No, collect should be resume safe, provided you are not chunking/ manipulating the result.

The problem of toSortedList is that it's blocking (you cannot sort without knowing all items) and my understanding is that you want to avoid that.

micans · 2020-02-14T11:33:26Z

ok, I misunderstood. Yes I would want to avoid that, but I found that in one particular pipeline it is actually OK to use toSortedList as the inputs arrive at the same time anyway.

micans · 2020-02-21T22:13:15Z

Documenting here that e.g. first() also does not play nicely with resume:

process foo {
  input: val(x) from (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
  output:  val(x) into ch
  /echo $x/
}

process bar {
   echo true
   input: val(y) from ch.first()
   shell: 'echo -n !{y}'
}

bar will be re-run due to random ordering of ch. This affected a recent pipeline; a work-around can be found. It may be worthwhile collating these operators / resume subtleties.

DaGaMs · 2020-03-11T23:26:41Z

I am having an issue with -resume restarting the pipeline from the start, despite the fact that all the processes up to a certain point have finished already. Having read (but only partly understood) the discussion here, I wonder if this has to do with the fact that I'm using merge somewhere in the pipeline (upstream of where it crashed)?

Here is part of my workflow:

workflow {
  main:
    Channel.of('tumour', 'control') | merge(Channel.of(params.tumour_frac, params.control_frac)) \
                                    | merge(Channel.fromPath([params.tumour_bam, params.control_bam])) \
                                    | subsample \
                                    | sortBamByName \
                                    | bamToFastq \
                                    | set { subsampledFastqs }
    // CRASH HAPPENED IN mapFastq
    subsampledFastqs | mapFastq | set { rawMappedBams }
[...]

The problem is that even though the two merges are entirely deterministic and all steps up to and including bamToFastq completed for the two files, nextflow keeps resuming from subsample. Is that because it thinks it can't know if the previous run's merge was incomplete?

stale · 2020-08-08T23:36:58Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

diego-rt · 2022-04-11T09:46:22Z

Hi,

Is there any update on what's the solution to getting collate to play nicely with resume?

nakib103 · 2023-02-09T00:27:06Z

Hello,

I am also facing issue because of the channel order. Has there been any solution to handle this case?

mribeirodantas · 2023-02-09T04:06:42Z

I can think of two possible solutions:

If you don't mind blocking, toSortedList after collate should not break resume;
You can also use the fair process directive to guarantee the order in the output channel, as this enables fair threading (possible decrease in performance, though). With the guarantee in the order, resume won't break.

stale bot added the stale label Aug 8, 2020

stale bot closed this as completed Oct 8, 2020

skrakau mentioned this issue Jan 19, 2021

Handle caching issues qbic-pipelines/metapep#49

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resume should work with list-returning operators #1475

Resume should work with list-returning operators #1475

micans commented Jan 29, 2020

pditommaso commented Jan 30, 2020

micans commented Jan 30, 2020

pditommaso commented Jan 30, 2020 •

edited

micans commented Jan 30, 2020

micans commented Jan 30, 2020

pditommaso commented Jan 31, 2020

micans commented Jan 31, 2020 •

edited

pditommaso commented Jan 31, 2020 •

edited

micans commented Feb 13, 2020

pditommaso commented Feb 14, 2020

micans commented Feb 14, 2020

micans commented Feb 21, 2020

DaGaMs commented Mar 11, 2020

stale bot commented Aug 8, 2020

diego-rt commented Apr 11, 2022

nakib103 commented Feb 9, 2023

mribeirodantas commented Feb 9, 2023

Resume should work with list-returning operators #1475

Resume should work with list-returning operators #1475

Comments

micans commented Jan 29, 2020

Bug report

Expected behavior and actual behavior

Steps to reproduce the problem

pditommaso commented Jan 30, 2020

micans commented Jan 30, 2020

pditommaso commented Jan 30, 2020 • edited

micans commented Jan 30, 2020

micans commented Jan 30, 2020

pditommaso commented Jan 31, 2020

micans commented Jan 31, 2020 • edited

pditommaso commented Jan 31, 2020 • edited

micans commented Feb 13, 2020

pditommaso commented Feb 14, 2020

micans commented Feb 14, 2020

micans commented Feb 21, 2020

DaGaMs commented Mar 11, 2020

stale bot commented Aug 8, 2020

diego-rt commented Apr 11, 2022

nakib103 commented Feb 9, 2023

mribeirodantas commented Feb 9, 2023

pditommaso commented Jan 30, 2020 •

edited

micans commented Jan 31, 2020 •

edited

pditommaso commented Jan 31, 2020 •

edited