-
Notifications
You must be signed in to change notification settings - Fork 595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix sporadic pipeline crash due to empty .exitcode file #3678
Fix sporadic pipeline crash due to empty .exitcode file #3678
Conversation
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@@ -107,6 +107,7 @@ on_exit() { | |||
[[ "$tee2" ]] && kill $tee2 2>/dev/null | |||
[[ "$ctmp" ]] && rm -rf $ctmp || true | |||
{{cleanup_cmd}} | |||
sync || true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how much this could impact performance on a network file system. It may be better to add it as an opt-in setting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a problem in network filesystems, as Nextflow implicitly already assumes that the data is synced after a task has finished.
- to read the .exitfile if the head-pod runs on another machine than the task
- to run a subsequent task on another machine, all data should be flushed and available
As this works 99.9% of the time, the syncing is already done.
Having the syncing as an opt-in will not help the user, as this bug occurs mostly randomly. This was the first time a workflow crashed repeatably, and we used this to find and fix the behavior.
However, as I only work with Kubernetes, I can only confirm this problem in Kubernetes. Accordingly, I can rewrite the commit so it only applies to Kubernetes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also do not see a problem to also use sync
on network file system, it should always be supported.
I think that same problem can happen on other platforms that are not Kubernetes and use overlay filesystems. Most likely we are not hitting this problem on other platforms because not all of them use the exit file to check the process exit status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and finally we got the sync
crashing user pipelines!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also faced a non-terminating sync, but it turned out to be a problem with our filesystem stack.
Disabling the sync frequently led to missing exitcodes for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exit code error is better than hanging job, at least it can be retried!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, Nextflow does not handle the empty exitcode well, crashing the whole pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point
Ok, let's add it. But it should be added an env variable setting to disable it if required e.g. To make it programmatic, it could be convenient to move it into the cleanup script |
Signed-off-by: Lehmann_Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann_Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Aplogies, if I gave you bad advice. At the end added more changes. I think at this point makes more sense adding a variable in the template to handle it. Something like
|
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Maybe you can log a reference to the user about this feature if an empty exitcode is found. |
Do you have an example? |
How about:
From my experience, it will happen more often if you face this once. |
This commit adds the execution of the `sync` command on job completion to synchronise the file system status. This may help to prevent the lack or incomplete state of the `.exitcode` task file created by nextflow to notify the job completion. The use of the sync can be disabled using the `NXF_DISABLE_FS_SYNC=true` environment variable Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de> Signed-off-by: Lehmann_Fabian <fabian.lehmann@informatik.hu-berlin.de>
Pipelines were crashing sporadically due to an empty .exitcode file. This was likely caused by incomplete data writes. To fix this issue, I added a sync command to ensure data is fully written before task completion.
I had a workflow where this error was reproducible in my particular environment, and with the sync, I have not faced the problem again.