Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sporadic pipeline crash due to empty .exitcode file #3678

Merged
merged 10 commits into from
Mar 1, 2023

Conversation

Lehmann-Fabian
Copy link
Contributor

Pipelines were crashing sporadically due to an empty .exitcode file. This was likely caused by incomplete data writes. To fix this issue, I added a sync command to ensure data is fully written before task completion.

I had a workflow where this error was reproducible in my particular environment, and with the sync, I have not faced the problem again.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@@ -107,6 +107,7 @@ on_exit() {
[[ "$tee2" ]] && kill $tee2 2>/dev/null
[[ "$ctmp" ]] && rm -rf $ctmp || true
{{cleanup_cmd}}
sync || true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how much this could impact performance on a network file system. It may be better to add it as an opt-in setting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a problem in network filesystems, as Nextflow implicitly already assumes that the data is synced after a task has finished.

  1. to read the .exitfile if the head-pod runs on another machine than the task
  2. to run a subsequent task on another machine, all data should be flushed and available

As this works 99.9% of the time, the syncing is already done.

Having the syncing as an opt-in will not help the user, as this bug occurs mostly randomly. This was the first time a workflow crashed repeatably, and we used this to find and fix the behavior.

However, as I only work with Kubernetes, I can only confirm this problem in Kubernetes. Accordingly, I can rewrite the commit so it only applies to Kubernetes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also do not see a problem to also use sync on network file system, it should always be supported.

I think that same problem can happen on other platforms that are not Kubernetes and use overlay filesystems. Most likely we are not hitting this problem on other platforms because not all of them use the exit file to check the process exit status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also faced a non-terminating sync, but it turned out to be a problem with our filesystem stack.
Disabling the sync frequently led to missing exitcodes for me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exit code error is better than hanging job, at least it can be retried!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, Nextflow does not handle the empty exitcode well, crashing the whole pipeline.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point

@pditommaso
Copy link
Member

Ok, let's add it. But it should be added an env variable setting to disable it if required e.g. NXF_DISABLE_FS_SYNC

To make it programmatic, it could be convenient to move it into the cleanup script

https://github.com/nextflow-io/nextflow/blob/main/modules/nextflow/src/main/groovy/nextflow/executor/BashWrapperBuilder.groovy#L227-L227

Signed-off-by: Lehmann_Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann_Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@pditommaso
Copy link
Member

Aplogies, if I gave you bad advice. At the end added more changes. I think at this point makes more sense adding a variable in the template to handle it. Something like {{exit_cmd}} (?)

Lehmann-Fabian and others added 3 commits March 1, 2023 12:38
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@pditommaso pditommaso merged commit e29c4e4 into nextflow-io:master Mar 1, 2023
@pditommaso
Copy link
Member

The use of sync command has been made an opt-in setting, see f0d5cc5.

Moreover, to better manage this error condition the plan is to handle as task error return -1 as exit status. See here.

@Lehmann-Fabian
Copy link
Contributor Author

Maybe you can log a reference to the user about this feature if an empty exitcode is found.

@pditommaso
Copy link
Member

Do you have an example?

@Lehmann-Fabian
Copy link
Contributor Author

How about:

Nextflow couldn't read the exitcode for process xy. To avoid this in the future, Nextflow can force an additional sync at the end of each process. See [link to the docs or the PR].

From my experience, it will happen more often if you face this once.
We have three Kubernetes clusters with CEPH filesystem. All use different hardware and OS, but I faced empty exitcodes everywhere.

abhi18av pushed a commit to abhi18av/nextflow that referenced this pull request Oct 28, 2023
This commit adds the execution of the `sync` command on job 
completion to synchronise the file system status. 

This may help to prevent the lack or incomplete state of the `.exitcode` 
task file created by nextflow to notify the job completion. 

The use of the sync can be disabled using the `NXF_DISABLE_FS_SYNC=true` 
environment variable  

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann_Fabian <fabian.lehmann@informatik.hu-berlin.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants