hl.vep is flaky on dataproc, particularly when using more than two workers #12936

Closed
danking opened this issue Apr 26, 2023 · 7 comments · Fixed by #13327
@danking
Contributor

danking commented Apr 26, 2023

What happened?

Julia Sealock reported this https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/vep.20issue/near/352790173

We also saw it in test_dataproc. Cal also reported it.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 56 in stage 4.0 failed 20 times, most recent failure: Lost task 56.19 in stage 4.0 (TID 48622) (jsealock-schema-sw-43bq.c.daly-neale-sczmeta.internal executor 3): is.hail.utils.HailException: VEP command '/vep --format vcf --json --everything --allele_number --no_stats --cache --offline --minimal --assembly GRCh38 --fasta /opt/vep/.vep/homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz --plugin LoF,loftee_path:/opt/vep/Plugins/,gerp_bigwig:/opt/vep/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:/opt/vep/.vep/human_ancestor.fa.gz,conservation_file:/opt/vep/.vep/loftee.sql --dir_plugins /opt/vep/Plugins/ -o STDOUT' failed with non-zero exit status 125
  VEP Error output:
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.

	at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:17)
	at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:17)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.methods.VEP$.waitFor(VEP.scala:73)
	at is.hail.methods.VEP.$anonfun$execute$5(VEP.scala:231)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at is.hail.utils.richUtils.RichContextRDD$$anon$1.hasNext(RichContextRDD.scala:69)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at is.hail.io.RichContextRDDRegionValue$.writeRowsPartition(RichContextRDDRegionValue.scala:37)
	at is.hail.io.RichContextRDDLong$.$anonfun$writeRows$2(RichContextRDDRegionValue.scala:234)
	at is.hail.utils.richUtils.RichContextRDD$.writeParts(RichContextRDD.scala:42)
	at is.hail.utils.richUtils.RichContextRDD.$anonfun$writePartitions$1(RichContextRDD.scala:107)
	at is.hail.utils.richUtils.RichContextRDD.$anonfun$writePartitions$1$adapted(RichContextRDD.scala:105)
	at is.hail.sparkextras.ContextRDD.$anonfun$cmapPartitionsWithIndex$2(ContextRDD.scala:259)
	at is.hail.utils.richUtils.RichContextRDD.$anonfun$cleanupRegions$2(RichContextRDD.scala:60)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at is.hail.utils.richUtils.RichContextRDD$$anon$1.hasNext(RichContextRDD.scala:69)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
	at scala.collection.AbstractIterator.to(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1021)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2276)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2673)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2609)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2608)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2861)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2803)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2792)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2257)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2276)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2301)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
	at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:176)
	at is.hail.utils.richUtils.RichContextRDD.writePartitions(RichContextRDD.scala:105)
	at is.hail.io.RichContextRDDLong$.writeRows$extension(RichContextRDDRegionValue.scala:234)
	at is.hail.rvd.RVD.write(RVD.scala:779)
	at is.hail.expr.ir.TableNativeWriter.apply(TableWriter.scala:128)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:865)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:59)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:463)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:499)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:75)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:75)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:63)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:351)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:496)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:495)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)

Version

0.2.114

Relevant log output

No response

@danking danking added the bug label Apr 26, 2023
@tpoterba tpoterba self-assigned this Apr 26, 2023
@danking
Contributor Author

danking commented May 10, 2023

hl.vep is now ready to go in QoB

@bpblanken

bpblanken commented Jun 13, 2023

I ran into this today...spent a bit of time debugging and was able to ssh to one of the workers and poke around the docker logs. The issue appears to be some kind of race between the docker install that the VEP initialization script does and the limited number of retries by systemd to get the docker daemon up and running.

Adding sudo service docker restart at the end of the VEP initialization bash script worked as a short-term fix.
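
For anyone retracing the debugging described above, here is a rough sketch of a first check to run on a worker. The helper name socket_state is made up for illustration, the default path is taken from the error message in the original report, and none of this comes from Hail's init scripts.

```shell
#!/usr/bin/env bash

# socket_state [PATH]
# Hypothetical helper: reports whether PATH (default: the path named in the
# "Cannot connect to the Docker daemon" error) is a socket, some other kind
# of file, or absent.
socket_state() {
    local path
    path="${1:-/var/run/docker.sock}"
    if [ -S "$path" ]; then
        echo "socket"
    elif [ -e "$path" ]; then
        echo "not-a-socket"
    else
        echo "missing"
    fi
}
```

On a worker this could be paired with sudo systemctl status docker.socket docker.service and sudo journalctl -u docker.service -n 50 --no-pager to see why the daemon did not come up.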

@danking
Contributor Author

danking commented Jul 27, 2023

@bpblanken do you mean at the end of vep-GRCh38.sh?

@danking
Contributor Author

danking commented Jul 27, 2023

@patrick-schultz There seems to be no harm in adding a restart to the bottom of vep-GRCh38.sh, could you PR that?

@bpblanken

@danking Yep! That's where I put it. Sorry I missed this notification!

@danking danking reopened this Sep 6, 2023
@danking
Contributor Author

danking commented Sep 6, 2023

There is a known issue with the official Docker deb. If you uninstall docker and re-install it later, it might fail to start.

The root cause is the docker.socket systemd unit failing to start because there are "insufficient file descriptors available". I think this is confusing verbiage. The socket's name must be /var/run/docker.sock. Clearly, if that filename is already in use, we cannot create a new socket at that filename.

One of Google's "Dataproc components" is Docker. I believe Google installed and then uninstalled docker in this image, thus leaving it in the broken state. For evidence of that:

find docker on a worker node of a *non-Hail* Dataproc cluster
sudo find / -iname '*docker*'
/opt/conda/miniconda3/pkgs/dbus-1.13.6-h5008d03_3/info/recipe/patches/0004-disable-fd-limit-tests-not-supported-in-docker.patch
/opt/conda/miniconda3/pkgs/nbclassic-0.5.6-pyhb4ecaf3_1/site-packages/nbclassic/static/components/codemirror/mode/dockerfile
/opt/conda/miniconda3/pkgs/nbclassic-0.5.6-pyhb4ecaf3_1/site-packages/nbclassic/static/components/codemirror/mode/dockerfile/dockerfile.js
/opt/conda/miniconda3/pkgs/notebook-6.2.0-py38h578d9bd_0/lib/python3.8/site-packages/notebook/static/components/codemirror/mode/dockerfile
/opt/conda/miniconda3/pkgs/notebook-6.2.0-py38h578d9bd_0/lib/python3.8/site-packages/notebook/static/components/codemirror/mode/dockerfile/dockerfile.js
/opt/conda/miniconda3/lib/python3.8/site-packages/nbclassic/static/components/codemirror/mode/dockerfile
/opt/conda/miniconda3/lib/python3.8/site-packages/nbclassic/static/components/codemirror/mode/dockerfile/dockerfile.js
/opt/conda/miniconda3/lib/python3.8/site-packages/notebook/static/components/codemirror/mode/dockerfile
/opt/conda/miniconda3/lib/python3.8/site-packages/notebook/static/components/codemirror/mode/dockerfile/dockerfile.js
/opt/google-fluentd/embedded/lib/ruby/gems/2.7.0/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/test/cassettes/kubernetes_docker_metadata_dotted_labels.yml
/opt/google-fluentd/embedded/lib/ruby/gems/2.7.0/gems/fluent-plugin-kubernetes_metadata_filter-2.5.2/test/cassettes/kubernetes_docker_metadata_annotations.yml
/usr/share/man/man1/gcloud_artifacts_docker_images_scan.1.gz
/usr/share/man/man1/gcloud_artifacts_docker_images_list-vulnerabilities.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_images_describe.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_images_scan.1.gz
/usr/share/man/man1/gcloud_alpha_auth_configure-docker.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_images.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_images_list.1.gz
/usr/share/man/man1/gcloud_artifacts_docker_images_delete.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_images_delete.1.gz
/usr/share/man/man1/gcloud_alpha_artifacts_docker_images.1.gz
/usr/share/man/man1/gcloud_beta_auth_configure-docker.1.gz
/usr/share/man/man1/gcloud_artifacts_docker_tags_list.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_tags_list.1.gz
/usr/share/man/man1/gcloud_alpha_artifacts_docker.1.gz
/usr/share/man/man1/gcloud_alpha_artifacts_docker_images_list.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_tags.1.gz
/usr/share/man/man1/gcloud_alpha_artifacts_docker_images_delete.1.gz
/usr/share/man/man1/gcloud_alpha_artifacts_docker_tags_delete.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_tags_delete.1.gz
/usr/share/man/man1/gcloud_artifacts_docker_images_list.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_upgrade_print-iam-policy.1.gz
/usr/share/man/man1/gcloud_auth_configure-docker.1.gz
/usr/share/man/man1/gcloud_artifacts_docker_images_get-operation.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_tags_add.1.gz
/usr/share/man/man1/gcloud_alpha_artifacts_docker_tags.1.gz
/usr/share/man/man1/gcloud_artifacts_docker_images.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_upgrade.1.gz
/usr/share/man/man1/gcloud_artifacts_docker_tags_add.1.gz
/usr/share/man/man1/gcloud_artifacts_docker_tags.1.gz
/usr/share/man/man1/gcloud_alpha_artifacts_docker_tags_add.1.gz
/usr/share/man/man1/gcloud_alpha_artifacts_docker_images_describe.1.gz
/usr/share/man/man1/gcloud_artifacts_docker.1.gz
/usr/share/man/man1/gcloud_docker.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_images_get-operation.1.gz
/usr/share/man/man1/gcloud_alpha_artifacts_docker_tags_list.1.gz
/usr/share/man/man1/gcloud_artifacts_docker_images_describe.1.gz
/usr/share/man/man1/gcloud_artifacts_docker_tags_delete.1.gz
/usr/share/man/man1/gcloud_beta_artifacts_docker_images_list-vulnerabilities.1.gz
/usr/share/vim/vim81/syntax/dockerfile.vim
/usr/share/vim/vim81/ftplugin/dockerfile.vim
/usr/bin/docker-credential-gcloud
/usr/lib/google-cloud-sdk/platform/gsutil/third_party/google-auth-library-python/.kokoro/docker
/usr/lib/google-cloud-sdk/platform/gsutil/third_party/google-auth-library-python/.kokoro/docker/docs/Dockerfile
/usr/lib/google-cloud-sdk/platform/ext-runtime/ruby/templates/Dockerfile.template
/usr/lib/google-cloud-sdk/platform/ext-runtime/ruby/templates/dockerignore.template
/usr/lib/google-cloud-sdk/platform/ext-runtime/go/data/dockerignore
/usr/lib/google-cloud-sdk/platform/ext-runtime/go/data/Dockerfile
/usr/lib/google-cloud-sdk/platform/ext-runtime/java/data/dockerignore
/usr/lib/google-cloud-sdk/platform/ext-runtime/python/data/dockerignore
/usr/lib/google-cloud-sdk/platform/ext-runtime/python/data/Dockerfile.virtualenv.template
/usr/lib/google-cloud-sdk/platform/ext-runtime/python/data/Dockerfile.install_app
/usr/lib/google-cloud-sdk/platform/ext-runtime/python/data/Dockerfile.requirements_txt
/usr/lib/google-cloud-sdk/platform/ext-runtime/python/data/Dockerfile.preamble
/usr/lib/google-cloud-sdk/platform/ext-runtime/nodejs/data/dockerignore
/usr/lib/google-cloud-sdk/platform/ext-runtime/nodejs/data/Dockerfile
/usr/lib/google-cloud-sdk/platform/ext-runtime/php/templates/Dockerfile.template
/usr/lib/google-cloud-sdk/platform/ext-runtime/php/templates/Dockerfile.entrypoint.template
/usr/lib/google-cloud-sdk/platform/ext-runtime/php/templates/dockerignore.template
/usr/lib/google-cloud-sdk/bin/docker-credential-gcloud
/usr/lib/google-cloud-sdk/lib/surface/auth/configure_docker.py
/usr/lib/google-cloud-sdk/lib/surface/auth/docker_helper.py
/usr/lib/google-cloud-sdk/lib/surface/auth/__pycache__/docker_helper.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/surface/auth/__pycache__/configure_docker.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/surface/docker.py
/usr/lib/google-cloud-sdk/lib/surface/artifacts/docker
/usr/lib/google-cloud-sdk/lib/surface/__pycache__/docker.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/docker_image.py
/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/__pycache__/docker_image.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/ai/docker
/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/artifacts/docker_util.py
/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/artifacts/__pycache__/docker_util.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/docker
/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/docker/docker.py
/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/docker/__pycache__/docker.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/docker
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/docker_digest_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/docker_creds_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/docker_image_list_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/docker_http_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/docker_image_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/docker_session_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/__pycache__/docker_creds_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/__pycache__/docker_image_list_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/__pycache__/docker_digest_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/__pycache__/docker_session_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/__pycache__/docker_image_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2_2/__pycache__/docker_http_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/docker_creds_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2/docker_digest_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2/docker_creds_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2/docker_http_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2/docker_image_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2/docker_session_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2/__pycache__/docker_creds_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2/__pycache__/docker_digest_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2/__pycache__/docker_session_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2/__pycache__/docker_image_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v2/__pycache__/docker_http_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/docker_name_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v1/docker_creds_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v1/docker_http_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v1/docker_image_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v1/docker_session_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v1/__pycache__/docker_creds_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v1/__pycache__/docker_session_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v1/__pycache__/docker_image_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/v1/__pycache__/docker_http_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/__pycache__/docker_creds_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/client/__pycache__/docker_name_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/tools/docker_puller_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/tools/docker_pusher_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/tools/docker_appender_.py
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/tools/__pycache__/docker_appender_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/tools/__pycache__/docker_puller_.cpython-39.pyc
/usr/lib/google-cloud-sdk/lib/third_party/containerregistry/tools/__pycache__/docker_pusher_.cpython-39.pyc
/usr/local/share/google/dataproc/npd-config/docker-monitor-counter.json
/usr/local/share/google/dataproc/npd-config/docker-monitor.json
/usr/local/share/google/dataproc/npd-config/health-checker-docker.json
/usr/local/share/google/dataproc/npd-config/docker-monitor-filelog.json
/usr/local/share/google/dataproc/bdutil/fluentd/container_logging/plugin/test/Dockerfile
/usr/local/share/google/dataproc/bdutil/components/initialize/docker-ce.sh
/usr/local/share/google/dataproc/bdutil/components/install/docker-ce.sh
/usr/local/share/google/dataproc/bdutil/components/uninstall/docker-ce.sh
/usr/local/share/google/dataproc/bdutil/components/post-install/docker-ce.sh
/usr/local/share/google/dataproc/bdutil/components/activate/docker-ce.sh
/usr/local/share/google/dataproc/bdutil/components/shared/docker.sh
/usr/local/share/google/dataproc/bdutil/components/pre-uninstall/docker-ce.sh
/usr/local/share/google/dataproc/bdutil/configure_docker.sh
/run/docker.sock
/tmp/dataproc/uninstall/docker-ce
/tmp/dataproc/components/uninstall/docker-ce.running
/tmp/dataproc/components/uninstall/docker-ce.done
/tmp/dataproc/components/pre-uninstall/docker-ce.running
/tmp/dataproc/components/pre-uninstall/docker-ce.done
/etc/apt/preferences.d/docker-ce.pref
/etc/apt/preferences.d/docker-ce-cli.pref
/etc/apt/sources.list.d/docker.list
/var/lib/apt/lists/download.docker.com_linux_debian_dists_buster_InRelease
/var/lib/apt/lists/download.docker.com_linux_debian_dists_buster_stable_binary-amd64_Packages

There is a /run/docker.sock but notice it is not /var/run/....

However, if I install Docker by hand into this worker of a non-Hail Dataproc cluster, it just works.


I also tried to replicate the failure using an initialization action, but that also just worked.

gcloud dataproc clusters create dk-test2 --initialization-actions=gs://hail-common/dk-test.sh

gs://hail-common/dk-test.sh:

apt-get update
apt-get -y install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg2 \
    software-properties-common \
    tabix
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
apt-get update
apt-get install -y --allow-unauthenticated docker-ce

Our users often report this error. In my experience, it has happened in 2/8 test_dataproc steps that I have run myself or seen run. The more workers you have, the higher the chance at least one worker fails.
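
To put a rough number on "more workers, more chance of failure": if each worker independently fails to start Docker with probability p, the chance that at least one of n workers fails is 1 - (1 - p)^n. Independence and the per-worker rate are both assumptions here; the 2/8 figure above is per test run, not per worker. The function name p_any_failure is made up for this sketch.

```shell
#!/usr/bin/env bash

# p_any_failure P N
# Prints the probability that at least one of N workers fails, assuming each
# worker fails independently with probability P (an assumption, not measured).
p_any_failure() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - p) ^ n }'
}
```

For example, p_any_failure 0.05 2 and p_any_failure 0.05 20 show how quickly the cluster-level failure rate grows with worker count for a made-up per-worker rate of 5%.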

As @bpblanken suggested here, restarting docker on a failed worker works. Docker starts fine. However, I missed a subtlety: we must restart after installation but before we try to pull our VEP docker image.

I also added a sleep, in the hope that it gives whatever is still running a chance to die off.
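
A minimal sketch of that ordering (sleep, restart Docker, then wait for the daemon before pulling anything). This is not the code from the actual fix. The function names, the SETTLE_SECONDS variable, and the sleep and retry values are all hypothetical, and the sketch assumes the script runs as root, as the unprefixed apt-get calls in the test script above suggest.

```shell
#!/usr/bin/env bash

# retry_until_ok ATTEMPTS DELAY CMD [ARGS...]
# Runs CMD until it exits 0, at most ATTEMPTS times, sleeping DELAY seconds
# between attempts. Returns 1 if every attempt fails.
retry_until_ok() {
    local attempts delay i
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# restart_docker_then_wait
# Hypothetical wrapper: pause, restart Docker, then poll `docker info` until
# the daemon answers. SETTLE_SECONDS (default 30) is an illustrative knob.
restart_docker_then_wait() {
    sleep "${SETTLE_SECONDS:-30}"
    service docker restart || return 1
    retry_until_ok 10 5 docker info >/dev/null 2>&1
}
```

In an initialization script, restart_docker_then_wait would go after the apt-get install of docker-ce and before the first docker pull, so a worker whose daemon never comes up fails at cluster creation instead of in the middle of a job.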

danking pushed a commit to danking/hail that referenced this issue Sep 6, 2023
CHANGELOG: Mitigate hail-is#12936 in which VEP Dataproc clusters fail to start. The root cause is complex. Docker has a bug which prevents it from cleanly starting if it is *re* installed. Whatever Google is doing in Dataproc to configure their Docker "component" appears to trigger this bug.

See for details: hail-is#12936 (comment)

The basic fix is to sleep to allow the system to coalesce a bit and then to restart Docker.
@danking
Contributor Author

danking commented Sep 7, 2023

Fixed in 0.2.121 by #13580
