Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer error exit status to end of pipeline run #4446

Closed
adamrtalbot opened this issue Oct 26, 2023 · 27 comments · Fixed by #4686 or #5071
Closed

Defer error exit status to end of pipeline run #4446

adamrtalbot opened this issue Oct 26, 2023 · 27 comments · Fixed by #4686 or #5071
Assignees

Comments

@adamrtalbot
Copy link
Collaborator

really go as far as possible in running the pipeline, but then exit 1 at the very end if any errors were ignored.

This is a pretty useful feature. Imagine the scenario where you have 96 independent samples, if 1 fails and it brings the whole pipeline crashing down that's annoying, but if you use errorStrategy = 'ignore' then you have to write your own solution. We could have process.errorStrategy = 'catch'. This might be a separate ticket though.

Having some form of introspection on tasks in the workflow.onComplete block would also help for similar reasons and be more generally useful.

Originally posted by @adamrtalbot in #4365 (comment)

@pditommaso
Copy link
Member

Is it not finish doing this?

@ewels
Copy link
Member

ewels commented Oct 26, 2023

No, finish allows the currently running jobs to complete. This would tell Nextflow to continue submitting new jobs for as long as it can before dying.

The use case is for automated runs. If there is one bad sample out of hundreds, the run will fail early and need to wait until it can be manually restarted. Then the whole pipeline needs to run through, which is slow. With this option, the other samples would all process so when the pipeline is resumed without the bad sample it would complete very quickly.

@pditommaso
Copy link
Member

yeah, you are right. I was trying to be lazy 😆

@bentsherman
Copy link
Member

We could either change ignore to return a non-zero exit code or add a new strategy like ignoreThenFail, which one would you guys prefer?

@pinin4fjords
Copy link
Contributor

I think ignnoreThenFail is the more generally understandable solution.

@adamrtalbot
Copy link
Collaborator Author

Agreed.

@adamrtalbot
Copy link
Collaborator Author

If you want to be fancy, we could call it defer but I'd like a non-native speaker to check that makes sense.

@ewels
Copy link
Member

ewels commented Oct 30, 2023

I generally prefer verbose and clear, for me ignoreThenFail is the winner in this list 👆🏻 defer is more likely to send me scurrying to the docs.

@boothms
Copy link

boothms commented Jan 25, 2024

This is the solution that I was looking for. Is it still under consideration ?

@ghost
Copy link

ghost commented Jan 25, 2024

I would like to second @boothms message. This would greatly improve the runs with multiple samples where at least one process for one sample might fail. It would be great if the progress of the execution could continue for the rest of the healthy samples.

@ghost
Copy link

ghost commented Jan 25, 2024

@bentsherman , thank you so much for jumping on this. I have a quick question about the implementation.

If one sets the errorStrategy 'ignoreThenFail', that implies that the task will never be retried, correct?

Would there be a place where these 2 are not mutually exclusive? Try the maxRetries, and then if it fails, still have the possibility to defer the exit status to the end of the pipeline?

@bentsherman
Copy link
Member

You could do that with a closure:

process.errorStrategy = { task.attempt < 3 ? 'retry' : 'ignoreThenFail' }

@ghost
Copy link

ghost commented Jan 25, 2024

This is great. Thank you once again.

I'm really looking forward to testing this.

@ghost
Copy link

ghost commented Feb 22, 2024

Hi folks, I saw that the PR is almost done. Do you have an estimate on when this might be merged?

@adamrtalbot
Copy link
Collaborator Author

Relevant community discussion: https://community.seqera.io/t/handling-single-sample-failures/618/3

@pditommaso
Copy link
Member

So ignoreThenFail cannot be used because errorStrategy cannot be longer than 10 chars.

Two choices:

  1. find a shorted name
  2. turn this setting into a nextflow config option (not an errorStrategy) e.g. nextflow.failOnIgnoredErrors = true

@pinin4fjords
Copy link
Contributor

pinin4fjords commented Jun 13, 2024

Where does the 10 character limit derive from?

But from conversations with users, I think the errorStrategy is the most intuitive thing, the thing people expect to find.

ignoreFail? Though that sounds a bit like we're ignoring a failure.

@adamrtalbot
Copy link
Collaborator Author

My other suggestion was defer, which isn't as clear but is shorter than 10 characters.

Like Jon, I have to ask where does the 10 character limit come from?

@pditommaso
Copy link
Member

where does the 10 character limit come from

The platform db schema

@pinin4fjords
Copy link
Contributor

deferFail?

@FriederikeHanssen
Copy link

delayFail ?

@pinin4fjords
Copy link
Contributor

I asked ChatGPT:

  • failAtEnd
  • finishFail
  • runThenFail
  • completeErr
  • ignoreFail
  • tryThenFail
  • endWithErr

@pditommaso
Copy link
Member

and the winner is failAtEnd 😄

@bentsherman
Copy link
Member

I think I'm coming around to ignoreFail and just rely the language server to provide an inline explanation (i.e. hover hint, code completion) if the user isn't clear on the meaning

@pditommaso
Copy link
Member

I'm still not convinced about the best to proceed with this. Apart from the strategy name length, I'd like to avoid introducing a new error strategy for doing a similar thing to ignore, above all because this is going to impact some logic in Platform about how error are reported, etc.

I've made a variation in which the user is expected to use the string ignoreThenFail as error strategy, however behind the scenes is mapped to ignore, but still causing the failure on completion.

#5066

In this way, the changes is transparent from the point of view of Platform. However I fear that it can be difficult to determine why it failed because tasks will report ignore as retry action

@pditommaso pditommaso linked a pull request Jun 17, 2024 that will close this issue
@pditommaso pditommaso linked a pull request Jun 17, 2024 that will close this issue
@pditommaso
Copy link
Member

I've made two other implementation for this feature that prevent the propagation of a new error strategy literal.

PR #5066 implements a pseudo-error strategy named ignoreThenFail, which is replaced behind the scenes with ignore and triggers the expected behaviour. However it sounds a bit hacky and not transparent the behavioural.

PR #5071 implements the same logic as a nextflow config setting.

@bentsherman
Copy link
Member

I think I prefer #5071. When the workflow fails due to failOnIgnore, we can say as much in the error message so that it's clear why the workflow failed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment