Time limits for each state for a task #618
If tasks are getting stuck in the INITIALIZING state, it probably indicates that the scheduler (e.g. AWS Batch) is killing the jobs for some reason or another and the state update from the worker isn't making it to the database. You can turn on state reconciliation for your backend, which may help:

Relevant config section: Code doc:

It probably wouldn't be all that hard to implement a routine that periodically scans QUEUED/INITIALIZING/RUNNING tasks and cancels them if they hit some sort of wall time specified in the config. However, it seems to me that this would just be masking an underlying issue.
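For illustration, the periodic scan described above could look roughly like the following Python sketch. It is not the project's implementation: the `Task` record, the `STATE_LIMITS` table, and the `cancel` callback are all hypothetical names, and the only value taken from this thread is the 6h limit for INITIALIZING (the others are placeholders).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical per-state wall-time limits. Only the 6h INITIALIZING value
# comes from this thread; the QUEUED and RUNNING values are placeholders.
STATE_LIMITS = {
    "QUEUED": timedelta(hours=24),
    "INITIALIZING": timedelta(hours=6),
    "RUNNING": timedelta(hours=48),
}


@dataclass
class Task:
    """Hypothetical task record; a real system would read this from its database."""
    id: str
    state: str
    state_entered_at: datetime


def find_expired(tasks, now, limits=STATE_LIMITS):
    """Return the tasks that have been in their current state longer than its limit."""
    expired = []
    for task in tasks:
        limit = limits.get(task.state)  # states without a limit are never expired
        if limit is not None and now - task.state_entered_at > limit:
            expired.append(task)
    return expired


def cancel_expired(tasks, cancel, now=None, limits=STATE_LIMITS):
    """Call cancel(task_id) for every expired task and return the cancelled IDs."""
    now = now or datetime.now(timezone.utc)
    cancelled = []
    for task in find_expired(tasks, now, limits):
        cancel(task.id)
        cancelled.append(task.id)
    return cancelled
```

A scheduler loop or cron job would call `cancel_expired` at a fixed interval, passing whatever function actually cancels a task in the backend. As noted above, this only bounds how long a stuck task lingers; it does not explain why the task got stuck.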
Thanks for the pointers. I enabled the reconciliation but could not see any improvement (I set it to check every 30m). I occasionally have jobs that are stuck in either the INITIALIZING or RUNNING state for days (until I kill them). The only common thread I have found between them is that they are stuck at stages that require transfers of many files (e.g. >40 files), each several GB in size. The plot comes from a job that has finished running and has been stuck transferring files to S3 for hours. Initially it transfers at high speed, and then it drops to a constant, very slow speed. I have seen similar plots for all the other stuck jobs I checked.
I've added an option to the worker config to limit the number of concurrent uploads/downloads. The default value is 10.
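To show what such a limit does in general terms, here is a minimal Python sketch that caps the number of simultaneous transfers with a bounded thread pool. It is not the worker's actual code or config: `transfer_all`, `transfer_one`, and `MAX_CONCURRENT_TRANSFERS` are hypothetical names, and only the default of 10 comes from the comment above.

```python
from concurrent.futures import ThreadPoolExecutor

# Mirrors the default of 10 mentioned above; the constant name is hypothetical.
MAX_CONCURRENT_TRANSFERS = 10


def transfer_all(paths, transfer_one, max_concurrent=MAX_CONCURRENT_TRANSFERS):
    """Run transfer_one(path) for every path, at most max_concurrent at a time.

    Results are returned in the same order as paths. An exception raised by
    any transfer propagates to the caller when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(transfer_one, paths))
```

With more than 40 multi-GB files per stage, a cap like this keeps the transfers from all competing for bandwidth and connections at once, which is presumably the failure mode the option is meant to avoid.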
Hi,
Is there a way to set some default time limits for each state of a job?
E.g. if a job stays in the INITIALIZING state for over 6h, consider it failed, cancel it, and transition it to a CANCELLED or ERROR state?
Thanks in advance for your help.