Parallel finemapping: Implement retry policy for Batch runs #3314
**Spot preemption**

When debugging the Batch runs, I ideally wanted to weed out all problems and tune the running parameters so that every task could be expected to succeed on the first attempt. Then we could specify maxRetryCount = 0 and forget about it. However, it turns out that when a Spot VM running a task is preempted, that also counts as a failure. This is a pretty dumb design on Google's part, but unfortunately there's no way to specify “retry on preemption but not on actual job failure”. In my experience running the batches, preemption is very rare, but it does happen. So we have two options: keep maxRetryCount = 0 and manually rerun the occasional preempted task, or set maxRetryCount > 0 and tolerate retries on genuine failures as well.
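For reference, here is a minimal sketch of where these two settings live in a Batch job spec (field names follow the Batch REST API; the rest of the job definition is omitted):

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "maxRetryCount": 0
      }
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "policy": {
          "provisioningModel": "SPOT"
        }
      }
    ]
  }
}
```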
**Task lifecycle policies**

The problem with maxRetryCount > 0 is this: if your code has a bug which causes all or most tasks to eventually fail, you will not notice straight away, because they will keep retrying in vain several times, wasting resources. There is a way to address this. Even though you can't explicitly handle preemption, what you can do is specify a task lifecycle policy: depending on the specific exit code of a task, you can send it for a retry (if it still hasn't exceeded maxRetryCount) or fail it immediately. So I set up the Batch v6 run like this, with maxRetryCount = 3:
You can take a look at what task lifecycle policies look like here.
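A setup like the one described above could be sketched as a Batch task spec roughly as follows (field names follow the Batch REST API; the specific exit codes are hypothetical and would be chosen by the runner script):

```json
{
  "taskGroups": [
    {
      "taskSpec": {
        "maxRetryCount": 3,
        "lifecyclePolicies": [
          {
            "actionCondition": { "exitCodes": [1] },
            "action": "FAIL_TASK"
          },
          {
            "actionCondition": { "exitCodes": [42] },
            "action": "RETRY_TASK"
          }
        ]
      }
    }
  ]
}
```

With a policy like this, a task exiting with code 1 (an unknown error) fails the job immediately, while a task exiting with the designated "retryable" code is retried up to maxRetryCount times.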
**Sporadic errors**

The approach described in the previous section almost worked; as I mentioned on Wednesday, only 4 out of 17,393 tasks failed for Batch run v6. It turned out that, when you run 17k+ jobs, some rare, sporadic errors will inevitably occur.
I think a good solution here is to add those errors to the runner script and retry specifically in those cases, since we know them to be sporadic. The script will still explicitly fail the job on any unknown error. I will not be re-running the 17k batch, because only 4 jobs failed and the rest of the data should be ready for downstream analysis, but I have added the modifications described above to the code; please see here.
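The in-script retry logic could look something like the sketch below. The error patterns and the wrapper name are hypothetical placeholders (the real list depends on the errors actually observed in the run); the point is that only known sporadic errors are retried, while any unknown error is re-raised immediately so a genuine bug fails fast:

```python
import time

# Hypothetical substrings identifying errors we know to be sporadic;
# the real patterns would come from the failures seen in the 17k run.
SPORADIC_ERRORS = (
    "Connection reset by peer",
    "503 Service Unavailable",
)

def run_with_retries(task, max_attempts=3, delay=0.0):
    """Run `task`, retrying only on known sporadic errors.

    Unknown errors are re-raised immediately, so a genuine bug in the
    task code fails fast instead of burning retries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            sporadic = any(pat in str(exc) for pat in SPORADIC_ERRORS)
            if not sporadic or attempt == max_attempts:
                raise
            time.sleep(delay)
```

In a Batch context, the wrapper could additionally translate the final outcome into a distinct exit code so the task lifecycle policy can tell sporadic failures apart from genuine ones.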
This issue is part of the #3302 epic.
The goal of this issue is to configure the retry policy in such a way that the entire run completes successfully while not retrying more tasks than necessary.