-
Notifications
You must be signed in to change notification settings - Fork 250
hl.vep is flaky on dataproc, particularly when using more than two workers #12936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
|
I ran into this today...spent a bit of time debugging and was able to ssh to one of the workers and poke around the Adding |
@bpblanken do you mean at the end of vep-GRCh38.sh? |
@patrick-schultz There seems to be no harm in adding a restart to the bottom of vep-GRCh38.sh, could you PR that? |
@danking Yep! That's where I put it. Sorry I missed this notification! |
There is a known issue with the official Docker deb. If you uninstall docker and re-install it later, it might fail to start. The root cause is the One of Google's "Dataproc components" is Docker. I believe Google installed and then uninstalled docker in this image, thus leaving it in the broken state. For evidence of that: find docker on a worker node of a *non-Hail* Dataproc cluster
There is a However, if I install Docker by hand into this worker of a non-Hail Dataproc cluster, it just works. I also tried to replicate the failure using an initialization action, but that also just worked.
Our users often report this error. In my experience, it has happened in 2/8 test_dataproc steps that I have run myself or seen run. The more workers you have, the higher the chance at least one worker fails. As @bpblanken suggested here, restarting docker on a failed worker works. Docker starts fine. However, I missed a subtlety: we must restart after installation but before we try to pull our VEP docker image. I also added a sleep in hopes that gives various things a chance to die off. |
CHANGELOG: Mitigate hail-is#12936 in which VEP Dataproc clusters fail to start. The root cause is complex. Docker has a bug which prevents it from cleanly starting if it is *re* installed. Whatever Google is doing in Dataproc to configure their Docker "component" appears to trigger this bug. See for details: hail-is#12936 (comment) The basic fix is to sleep to allow the system to coalesce a bit and then to restart Docker.
Fixed in 0.2.121 by #13580 |
What happened?
Julia Sealock reported this https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/vep.20issue/near/352790173
We also saw it in test_dataproc. Cal also reported it.
Version
0.2.114
Relevant log output
No response
The text was updated successfully, but these errors were encountered: