
Parallel finemapping: Implement retry policy for Batch runs #3314

Open · Tracked by #3302
Labels: gentropy (Relates to the genetics ETL)

tskir opened this issue May 3, 2024 · 3 comments

Comments

tskir commented May 3, 2024

This issue is a part of the #3302 epic.

The goal of this issue is to configure the retry policy so that the entire run completes successfully without retrying more tasks than necessary.

tskir added the gentropy (Relates to the genetics ETL) label on May 3, 2024

tskir (Author) commented May 3, 2024

Spot preemption

While debugging the Batch runs, my ideal was to weed out all problems and tune the run parameters so that every task could be expected to succeed on the first attempt. We could then specify maxRetryCount = 0 and forget about it.

However, it turns out that when a Spot VM running a task is preempted, this also counts as a task failure. This is a pretty dumb design on Google's part, but unfortunately there is no way to specify “retry on preemption but not on actual job failure”. In my experience running the batches, preemption is very rare, but it does happen.

So we have two options:

  • Do not use Spot VMs, which would immediately make our compute roughly 2× more expensive;
  • Concede that we have to specify maxRetryCount > 0, which is what I recommend doing. I think a value of 3 is sensible, and the vast majority of tasks will succeed on the first try (see the sketch below).
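For illustration, here is a minimal sketch of how Spot provisioning and maxRetryCount would sit together in a Batch job definition, assuming the google-cloud-batch Python client; the runner command and task count are placeholders, not our actual pipeline code.

```python
from google.cloud import batch_v1

# Placeholder runnable; the real runner script lives elsewhere.
runnable = batch_v1.Runnable(
    script=batch_v1.Runnable.Script(text="bash /opt/runner.sh"),
)

task_spec = batch_v1.TaskSpec(
    runnables=[runnable],
    max_retry_count=3,  # covers both genuine failures and Spot preemptions
)

allocation_policy = batch_v1.AllocationPolicy(
    instances=[
        batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
            policy=batch_v1.AllocationPolicy.InstancePolicy(
                provisioning_model=batch_v1.AllocationPolicy.ProvisioningModel.SPOT,
            )
        )
    ]
)

job = batch_v1.Job(
    task_groups=[batch_v1.TaskGroup(task_spec=task_spec, task_count=17393)],
    allocation_policy=allocation_policy,
)
```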

tskir (Author) commented May 3, 2024

Task lifecycle policies

The problem with maxRetryCount > 0 is this: if your code has a bug which causes all or most tasks to eventually fail, you will not notice straight away, because they will keep retrying in vain several times, wasting resources.

There is a way to address this. Even though you can't explicitly handle preemption, you can specify a task lifecycle policy: depending on the task's exit code, it can either be sent for a retry (if it hasn't yet exceeded maxRetryCount) or failed immediately.

So I set up the Batch v6 run like this, with maxRetryCount = 3:

  • Inside the runner script (sketched below):
    • If the Python part completed successfully, exit with code 0 → the task is done
    • If the Python part failed with a known error (a ValueError which Daniel C recently addressed), also exit with code 0
    • If the Python part failed with any other error, exit with a specific error code (I chose 73), which the lifecycle policy will pick up to fail the task immediately
  • Outside of the runner script: if the VM is preempted, this will not match any specific lifecycle policy, and the task will be retried up to 3 times.
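To make the above concrete, here is a hypothetical Python rendering of that exit-code logic; the actual runner script is not reproduced in this issue, so the function and constant names below are illustrative only.

```python
import sys

KNOWN_BENIGN = (ValueError,)  # the known ValueError recently addressed by Daniel C
FAIL_IMMEDIATELY = 73         # matched by the lifecycle policy -> fail the task at once


def run_finemapping_step() -> None:
    """Placeholder for the actual fine-mapping entry point."""
    ...


def main() -> int:
    try:
        run_finemapping_step()
    except KNOWN_BENIGN:
        return 0                 # known, benign failure: mark the task as done
    except Exception:
        return FAIL_IMMEDIATELY  # unknown error: fail the task without further retries
    return 0                     # success


if __name__ == "__main__":
    sys.exit(main())
```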

You can take a look at what task lifecycle policies look like here.
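For reference, the corresponding TaskSpec fragment could look roughly like this with the google-cloud-batch Python client (a sketch under that assumption, not the linked code):

```python
from google.cloud import batch_v1

# Fail the task immediately on exit code 73; any other non-zero exit
# (including one caused by Spot preemption) falls back to the retry budget.
task_spec = batch_v1.TaskSpec(
    max_retry_count=3,
    lifecycle_policies=[
        batch_v1.LifecyclePolicy(
            action=batch_v1.LifecyclePolicy.Action.FAIL_TASK,
            action_condition=batch_v1.LifecyclePolicy.ActionCondition(
                exit_codes=[73],
            ),
        )
    ],
)
```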

tskir (Author) commented May 3, 2024

Sporadic errors

The approach described in the previous comment almost worked; as I mentioned on Wednesday, only 4 out of 17,393 tasks failed in Batch run v6.

It turned out that, when you run 17k+ jobs, some rare events will happen:

  • In 3 tasks, the Spark context failed to initialise (ERROR SparkContext: Error initializing SparkContext). This doesn't look like it was caused by anything specific; probably the Spark daemon was busy with other tasks and didn't respond to the request within N seconds, or something similar.
  • In 1 task, Hail failed to download the data it needed due to a sporadic server error (requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))).

I think a good solution here is to add those errors to the runner script and specifically allow retries in those cases, because we know them to be sporadic; the script will still explicitly fail the task on any unknown error (see the sketch below).
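One possible way to wire this, consistent with the hypothetical runner sketch above (the error strings and exit codes are illustrative, not the actual code):

```python
import requests

FAIL_IMMEDIATELY = 73  # matched by the lifecycle policy -> fail the task at once
RETRYABLE = 1          # any other non-zero code -> Batch retries up to maxRetryCount

SPORADIC_MESSAGES = (
    "Error initializing SparkContext",                # sporadic Spark context failure
    "Remote end closed connection without response",  # sporadic Hail download failure
)


def classify_failure(exc: Exception) -> int:
    """Map a failure to an exit code, allowing retries only for known sporadic errors."""
    if isinstance(exc, requests.exceptions.ConnectionError):
        return RETRYABLE
    if any(msg in str(exc) for msg in SPORADIC_MESSAGES):
        return RETRYABLE
    return FAIL_IMMEDIATELY
```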

I will not be re-running the 17k batch, because only 4 tasks failed and the rest of the data should be ready for downstream analysis, but I have added the modifications described above to the code; please see here.
