Fix sporadic pipeline crash due to empty .exitcode file #3678

Lehmann-Fabian · 2023-02-23T14:57:35Z

Pipelines were crashing sporadically due to an empty .exitcode file. This was likely caused by incomplete data writes. To fix this issue, I added a sync command to ensure data is fully written before task completion.

I had a workflow where this error was reproducible in my particular environment, and with the sync, I have not faced the problem again.
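
A minimal sketch of the idea, not the actual `command-run.txt` template: the wrapper records the task's exit status in `.exitcode` and then calls `sync` so that buffered writes are flushed before the task is reported as finished. The function name `run_task` and its work-directory argument are hypothetical, and the sketch assumes the status is written before the sync.

```shell
# Hypothetical, simplified task wrapper (illustration only).
# Usage: run_task <workdir> <command> [args...]
run_task() {
    local workdir=$1
    shift
    local status=0

    # Run the task command and remember its exit status.
    "$@" || status=$?

    # Record the status where the scheduler side expects to find it.
    printf '%s\n' "$status" > "$workdir/.exitcode"

    # Ask the kernel to flush buffered writes to storage.
    # '|| true' keeps a failing or missing sync from changing the task result.
    sync || true

    return "$status"
}
```

For example, `run_task "$PWD" my_tool --input data.txt` would leave the status of `my_tool` in `./.exitcode` (here `my_tool` is a placeholder command).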

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>

pditommaso · 2023-02-24T15:04:58Z

modules/nextflow/src/main/resources/nextflow/executor/command-run.txt

@@ -107,6 +107,7 @@ on_exit() {
    [[ "$tee2" ]] && kill $tee2 2>/dev/null
    [[ "$ctmp" ]] && rm -rf $ctmp || true
    {{cleanup_cmd}}
+    sync || true


I wonder how much this could impact performance on a network file system. It may be better to add it as an opt-in setting.

Lehmann-Fabian · 2023-02-27

I don't see a problem with network file systems, as Nextflow already implicitly assumes that the data is synced after a task has finished:

- to read the .exitcode file if the head pod runs on a different machine than the task
- to run a subsequent task on another machine, all data should be flushed and available

As this works 99.9% of the time, the syncing is already done.

Making the sync opt-in will not help the user, as this bug occurs mostly at random. This was the first time a workflow crashed reproducibly, and we used it to find and fix the behavior.

However, as I only work with Kubernetes, I can only confirm this problem in Kubernetes. Accordingly, I can rewrite the commit so it only applies to Kubernetes.

I also do not see a problem with using sync on a network file system; it should always be supported.

jordeu · 2023-02-27

I think the same problem can happen on other platforms that are not Kubernetes and use overlay filesystems. Most likely we are not hitting this problem on other platforms because not all of them use the exit file to check the process exit status.

pditommaso · 2023-09-28

And finally we got the sync crashing user pipelines!

https://nextflow.slack.com/archives/C02T98A23U7/p1695892557040359?thread_ts=1695821392.412229&cid=C02T98A23U7

Lehmann-Fabian · 2023-09-28

I also faced a non-terminating sync, but it turned out to be a problem with our filesystem stack.
Disabling the sync frequently led to missing exitcodes for me.

pditommaso · 2023-09-28

An exit code error is better than a hanging job; at least it can be retried!

Lehmann-Fabian · 2023-09-28

Currently, Nextflow does not handle an empty exitcode well: it crashes the whole pipeline.

pditommaso · 2023-02-28T16:32:51Z

Ok, let's add it. But an environment variable should be added to disable it if required, e.g. NXF_DISABLE_FS_SYNC.

To make it programmatic, it could be convenient to move it into the cleanup script:

https://github.com/nextflow-io/nextflow/blob/main/modules/nextflow/src/main/groovy/nextflow/executor/BashWrapperBuilder.groovy#L227-L227
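
The opt-out behaviour could look roughly like the following sketch. It uses the variable name `NXF_DISABLE_FS_SYNC` suggested in this comment; the function name `fs_sync_unless_disabled` is hypothetical, and checking the variable inside the task script at run time is an assumption made for illustration (the PR may instead resolve it when the wrapper script is generated).

```shell
# Hypothetical helper: flush file system buffers unless the user opted out.
fs_sync_unless_disabled() {
    # Treat an unset variable as "false", i.e. the sync runs by default.
    if [ "${NXF_DISABLE_FS_SYNC:-false}" = "true" ]; then
        return 0
    fi
    # Never let a failing sync change the outcome of the task.
    sync || true
}
```

Comparing against the literal string `true` keeps the check strict: any other value, including an empty one, leaves the sync enabled.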

Signed-off-by: Lehmann_Fabian <fabian.lehmann@informatik.hu-berlin.de>

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>

pditommaso · 2023-03-01T09:44:48Z

Apologies if I gave you bad advice. In the end, it added more changes. I think at this point it makes more sense to add a variable in the template to handle it. Something like {{exit_cmd}} (?)

nextflow/modules/nextflow/src/main/resources/nextflow/executor/command-run.txt

Line 109 in fab6bd5

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>

pditommaso · 2023-09-28T20:25:11Z

The use of the sync command has been made an opt-in setting; see f0d5cc5.

Moreover, to better manage this error condition, the plan is to handle it as a task error, returning -1 as the exit status. See here.
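
As a rough illustration of that plan, the sketch below reads an exit-status file defensively and falls back to -1 when the file is missing, empty, or does not contain a plain non-negative integer. The function name `read_exit_status` is hypothetical, and this is shell for illustration only; it does not reflect how Nextflow itself implements the check.

```shell
# Hypothetical helper: print the exit status stored in FILE, or -1 when the
# file is missing, empty, or not a plain non-negative integer.
read_exit_status() {
    local file=$1
    local value=""

    if [ -r "$file" ]; then
        # Use only the first line and strip any whitespace around the number.
        value=$(head -n 1 "$file" | tr -d '[:space:]')
    fi

    case $value in
        ''|*[!0-9]*) printf '%s\n' -1 ;;        # empty or non-numeric content
        *)           printf '%s\n' "$value" ;;  # a usable exit status
    esac
}
```

With this shape, an unreadable status becomes an ordinary failed task that a retry policy can handle, rather than an error that stops the whole run.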

Lehmann-Fabian · 2023-09-29T07:54:12Z

Maybe you could log a hint pointing the user to this feature when an empty exitcode is found.

pditommaso · 2023-09-29T08:19:31Z

Do you have an example?

Lehmann-Fabian · 2023-10-02T07:11:44Z

How about:

Nextflow couldn't read the exitcode for process xy. To avoid this in the future, Nextflow can force an additional sync at the end of each process. See [link to the docs or the PR].

In my experience, once you have faced this, it will happen more often.
We have three Kubernetes clusters with a Ceph filesystem. All use different hardware and OS, but I faced empty exitcodes everywhere.
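
A sketch of how such a hint might be emitted, using the wording proposed above. The function name `warn_if_exitcode_empty` and the `WARN:` prefix are hypothetical, and the link placeholder is kept exactly as in the proposal.

```shell
# Hypothetical helper: warn on stderr when the exit-status file is missing
# or empty. Returns 1 in that case and 0 otherwise.
warn_if_exitcode_empty() {
    local process=$1
    local file=$2

    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$file" ]; then
        printf "WARN: Nextflow couldn't read the exitcode for process %s. " "$process" >&2
        printf "To avoid this in the future, Nextflow can force an additional sync " >&2
        printf "at the end of each process. See [link to the docs or the PR].\n" >&2
        return 1
    fi
}
```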

This commit adds the execution of the `sync` command on job completion to synchronise the file system status. This may help to prevent the lack or incomplete state of the `.exitcode` task file created by nextflow to notify the job completion.

The use of the sync can be disabled using the `NXF_DISABLE_FS_SYNC=true` environment variable.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann_Fabian <fabian.lehmann@informatik.hu-berlin.de>

Fix the problem that .exitfile was empty even if the task succeeded.

d638381

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>

bentsherman requested a review from pditommaso February 23, 2023 18:06

pditommaso requested a review from jordeu February 23, 2023 21:09

pditommaso reviewed Feb 24, 2023

jordeu approved these changes Feb 27, 2023

Merge branch 'master' into fixEmptyExitfile

5024077

Lehmann-Fabian added 4 commits February 28, 2023 17:50

Allow disabling of sync

2289544

Signed-off-by: Lehmann_Fabian <fabian.lehmann@informatik.hu-berlin.de>

Use SysEnv

7e4c6be

Signed-off-by: Lehmann_Fabian <fabian.lehmann@informatik.hu-berlin.de>

Fixed Tests

71916df

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>

Fixed AWS test

ce30425

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>

Lehmann-Fabian requested a review from pditommaso February 28, 2023 18:39

Merge branch 'master' into fixEmptyExitfile

eb04049

Lehmann-Fabian and others added 3 commits March 1, 2023 12:38

Separate sync method

c659418

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>

Merge branch 'master' into fixEmptyExitfile

62d3c90

Fix Google test

b237a9a

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>

pditommaso merged commit e29c4e4 into nextflow-io:master Mar 1, 2023
