
HAIL versions 0.2.43+ fail in Spark worker with a SIGSEGV when running subject_qc on a separate spark cluster #8944

Closed
teaguesterling opened this issue Jun 10, 2020 · 11 comments · Fixed by #9169

@teaguesterling commented Jun 10, 2020

This may be an environment-specific bug, but we have shown that it only occurs with releases after 0.2.42 (in our environment) and only under very specific conditions.

When we run sample_qc on a matrix table with >= 354 partitions using an external Spark cluster (i.e. specifying master in hail.init), the Spark worker crashes with a SIGSEGV. The issue does not occur with variant_qc, but we do not know the full set of operations that trigger it.

Below is a test that consistently triggers the issue:

Setup:

$ $SPARK_HOME/sbin/start-master.sh --host localhost --port 7077
$ $SPARK_HOME/sbin/start-shuffle-service.sh
$ $SPARK_HOME/sbin/start-slave.sh spark://localhost:7077 --work-dir /scratch/local/

Test:

import hail

hail.init(master="spark://localhost:7077")

P = 1       # populations
S = 1000    # samples
V = 50000   # variants
for N in range(350, 400, 1):  # N = number of partitions
    try:
        mt = hail.balding_nichols_model(P, S, V, N)
        mt = hail.sample_qc(mt)
        mt = mt.filter_cols(mt.sample_qc.n_hom_var > V * 0.32)
        print("\n[PASS] with", N, "partitions:", mt.count())
    except Exception:
        print("\n[FAIL] with ", N, "partitions")
        break

Test Output (SIGSEGV is reported in Spark worker logs, see end):

2020-06-10 10:29:56 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 2.4.5
SparkUI available at http://US0HPN0036.cm.cluster:4047
Welcome to
	 __  __     <>__
	/ /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.44-6cfa355a1954
LOGGING: writing to /bmrn/apps/bmrn-hugelib/0.3.0/test/hail-20200610-1029-0.2.44-6cfa355a1954.log
2020-06-10 10:29:59 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 1000 samples, and 50000 variants...
[Stage 1:==========================>                           (171 + 80) / 350]
[PASS] with 350 partitions: (50000, 984)
2020-06-10 10:30:08 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 1000 samples, and 50000 variants...
[Stage 3:==========================>                           (169 + 80) / 351]
[PASS] with 351 partitions: (50000, 998)
2020-06-10 10:30:10 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 1000 samples, and 50000 variants...
[Stage 5:=====================================================> (344 + 8) / 352]
[PASS] with 352 partitions: (50000, 1000)
2020-06-10 10:30:13 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 1000 samples, and 50000 variants...
[Stage 7:=================================>                    (222 + 80) / 353]
[PASS] with 353 partitions: (50000, 973)
2020-06-10 10:30:15 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 1000 samples, and 50000 variants...
[Stage 9:>                                                        (0 + 18) / 18]
[FAIL] with  354 partitions
Traceback (most recent call last):
  File "test_11_cluster_sampleqc.py", line 20, in <module>
	print("\n[PASS] with", N, "partitions:", Y.count())
  File "/bmrn/apps/hail/0.2.44/python/hail-0.2.44-py3-none-any.egg/hail/matrixtable.py", line 2426, in count
	return Env.backend().execute(count_ir)
  File "/bmrn/apps/hail/0.2.44/python/hail-0.2.44-py3-none-any.egg/hail/backend/spark_backend.py", line 296, in execute
	result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/bmrn/apps/spark/2.4.5/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/bmrn/apps/hail/0.2.44/python/hail-0.2.44-py3-none-any.egg/hail/backend/spark_backend.py", line 41, in deco
	'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: ResultStage 9 (runJob at RVD.scala:688) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0   at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:882)     at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:878)    at scala.collection.Iterator$class.foreach(Iterator.scala:891)   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)       at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:878)     at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:691)  at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)       at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)       at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)       at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)       at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)   at org.apache.spark.scheduler.Task.run(Task.scala:123)  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)       at java.lang.Thread.run(Thread.java:748)

Java stack trace:
java.lang.RuntimeException: error while applying lowering 'InterpretNonCompilable'
		at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:26)
		at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:18)
		at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
		at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
		at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:18)
		at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
		at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:317)
		at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:304)
		at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:303)
		at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:20)
		at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:18)
		at is.hail.utils.package$.using(package.scala:600)
		at is.hail.annotations.Region$.scoped(Region.scala:18)
		at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:18)
		at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:229)
		at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:303)
		at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:323)
		at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
		at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:498)
		at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
		at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
		at py4j.Gateway.invoke(Gateway.java:282)
		at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
		at py4j.commands.CallCommand.execute(CallCommand.java:79)
		at py4j.GatewayConnection.run(GatewayConnection.java:238)
		at java.lang.Thread.run(Thread.java:748)

org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 9 (runJob at RVD.scala:688) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0      at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:882)     at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:878)    at scala.collection.Iterator$class.foreach(Iterator.scala:891)   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)       at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:878)     at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:691)  at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)       at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)       at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)       at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)   at org.apache.spark.scheduler.Task.run(Task.scala:123)  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)       at java.lang.Thread.run(Thread.java:748)
		at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
		at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
		at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
		at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
		at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
		at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
		at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1495)
		at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2109)
		at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
		at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
		at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
		at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
		at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
		at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
		at is.hail.rvd.RVD.combine(RVD.scala:688)
		at is.hail.expr.ir.Interpret$.run(Interpret.scala:804)
		at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
		at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
		at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
		at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:39)
		at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
		at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:50)
		at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
		at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
		at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:69)
		at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
		at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
		at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:69)
		at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
		at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:45)
		at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:20)
		at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:18)
		at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
		at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
		at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:18)
		at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
		at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:317)
		at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:304)
		at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:303)
		at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:20)
		at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:18)
		at is.hail.utils.package$.using(package.scala:600)
		at is.hail.annotations.Region$.scoped(Region.scala:18)
		at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:18)
		at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:229)
		at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:303)
		at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:323)
		at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
		at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:498)
		at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
		at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
		at py4j.Gateway.invoke(Gateway.java:282)
		at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
		at py4j.commands.CallCommand.execute(CallCommand.java:79)
		at py4j.GatewayConnection.run(GatewayConnection.java:238)
		at java.lang.Thread.run(Thread.java:748)




Hail version: 0.2.44-6cfa355a1954
Error summary: SparkException: Job aborted due to stage failure: ResultStage 9 (runJob at RVD.scala:688) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0        at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:882)     at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:878)    at scala.collection.Iterator$class.foreach(Iterator.scala:891)   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)       at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:878)     at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:691)  at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)       at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)       at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)       at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)   at org.apache.spark.scheduler.Task.run(Task.scala:123)  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)       at java.lang.Thread.run(Thread.java:748)

Spark Worker Logs (truncated to crash):

2020-06-10 10:09:36 INFO  ShuffleBlockFetcherIterator:54 - Started 0 remote fetches in 16 ms
2020-06-10 10:09:36 INFO  ShuffleBlockFetcherIterator:54 - Started 0 remote fetches in 17 ms
2020-06-10 10:09:36 INFO  ShuffleBlockFetcherIterator:54 - Started 0 remote fetches in 17 ms
2020-06-10 10:09:36 INFO  ShuffleBlockFetcherIterator:54 - Started 0 remote fetches in 17 ms
[thread 46926922934016 also had an error] [thread 46922053207808 also had an error] [thread 46926901880576 also had an error] [thread 46926888195840 also had an error] [thread 46926887143168 also had an error] [thread 46924854015744 also had an error] [thread 46924847699712 also had an error]
[thread 46926905038592 also had an error] [thread 46926895564544 also had an error] [thread 46926900827904 also had an error] [thread 46926929250048 also had an error] [thread 46926881888000 also had an error]
[thread 46924863489792 also had an error] [thread 46924861384448 also had an error] [thread 46926913459968 also had an error] [thread 46924843489024 also had an error] [thread 46926917670656 also had an error]

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002aaab5115c88, pid=34051, tid=0x00002aae05d1a700
#
# JRE version: OpenJDK Runtime Environment (8.0_242-b08) (build 1.8.0_242-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.242-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 5583 C2 __C111CompiledWithAggs.__m131wrapped(Lis/hail/annotations/Region;J)V (280 bytes) @ 0x00002aaab5115c88 [0x00002aaab5115ae0+0x1a8]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /local/scratch/app-20200610100916-0000/0/hs_err_pid34051.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

To summarize our observations:

  • The issue does not occur when hail is initialized without an existing spark master
  • The issue does not occur in HAIL versions prior to 0.2.43 (tested: 0.2.42, 0.2.40, 0.2.38, 0.2.34, 0.2.33 all passed and 0.2.43, 0.2.44 both failed)
  • The issue occurs consistently when the number of partitions is >= 354 (tested: 500, 450, 400, 360, 354, 1000) and does not occur with lower numbers of partitions (tested: 5, 10, 20, 50, 100, 200, 300, 350, 351, 352, 353); see the partition-count sketch after this list
  • Changing the number of variants and/or subjects does not appear to change the issue (we haven't tested this rigorously, but increasing/decreasing each by an order of magnitude produced the same behavior at the same number of partitions)
  • The issue also occurs on real datasets (large datasets imported from VCF files).
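
A hypothetical mitigation sketch, not part of the original report: since the crash only appears at >= 354 partitions in our tests, checking the partition count with `MatrixTable.n_partitions()` and coalescing below the threshold with `MatrixTable.naive_coalesce()` before the column aggregation may avoid the trigger. Whether coalescing actually sidesteps the bug is an assumption here, not something verified in this thread.

import hail as hl

hl.init(master="spark://localhost:7077")

P, S, V = 1, 1000, 50000
mt = hl.balding_nichols_model(P, S, V, n_partitions=500)  # 500 partitions reproduces the crash above
print("partitions before coalescing:", mt.n_partitions())

# Assumption: staying below the observed 354-partition threshold avoids the trigger.
mt = mt.naive_coalesce(350)
print("partitions after coalescing:", mt.n_partitions())

mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.n_hom_var > V * 0.32)
print(mt.count())

Note that naive_coalesce only lowers the partition count; it does not address the underlying bug, so this would at best be a stop-gap until a fixed release is available.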
@tpoterba (Contributor)

Thanks for the bug report. Will investigate!

@tpoterba (Contributor)

Really appreciate your effort in making this reproducible.

@teaguesterling (Author)

Please let me know if there is anything we can do to help!

@johnc1231 (Contributor)

Sorry it took a while to get to this; investigating.

@johnc1231 (Contributor)

Confirmed that I'm able to replicate this, also seeing 354 as the cutoff. Looking into it.

@teaguesterling (Author) commented Jun 24, 2020 via email

@johnc1231 (Contributor)

Haven't figured it out yet, but I reproduced the error with a simpler pipeline that uses a single annotate_cols instead of sample_qc:

import hail as hl

P = 1
S = 1000
V = 50000
for N in range(350, 400, 1):
    try:
        mt = hl.balding_nichols_model(P, S, V, N)
        mt = mt.annotate_cols(n_called=hl.agg.filter(hl.is_defined(mt.GT), hl.agg.count()))
        mt = mt.filter_cols(mt.n_called > 0)
        print("\n[PASS] with", N, "partitions:", mt.count())
    except Exception as e:
        print("\n[FAIL] with", N, "partitions")
        raise e

@johnc1231 (Contributor)

Also, it's possible to reproduce this on much smaller data:

import hail as hl

P = 1
S = 500
V = 2000
N = 354  # 354 and larger break; anything smaller works
try:
    mt = hl.balding_nichols_model(P, S, V, N)
    mt = mt.annotate_cols(n_called=hl.agg.filter(hl.is_defined(mt.GT), hl.agg.count()))
    mt = mt.filter_cols(mt.n_called > 0)
    print("\n[PASS] with", N, "partitions:", mt.count())
except Exception as e:
    print("\n[FAIL] with", N, "partitions")
    raise e

354 is still the breaking point.

@johnc1231 (Contributor)

Caused by #8794

johnc1231 assigned tpoterba and unassigned johnc1231 on Jul 20, 2020
@johnc1231 (Contributor)

Tim is going to look into this

tpoterba changed the title from "HAIL versions 0.2.44 (and 0.2.43) fail in Spark worker with a SIGSEGV when running subject_qc on a separate spark cluster" to "HAIL versions 0.2.43+ fail in Spark worker with a SIGSEGV when running subject_qc on a separate spark cluster" on Jul 21, 2020
tpoterba added a commit to tpoterba/hail that referenced this issue Jul 29, 2020
Fixes hail-is#8944

CHANGELOG: Fixed crash (error 134 or SIGSEGV) in `MatrixTable.annotate_cols`, `hl.sample_qc`, and more.
danking pushed a commit that referenced this issue Jul 29, 2020
Fixes #8944

CHANGELOG: Fixed crash (error 134 or SIGSEGV) in `MatrixTable.annotate_cols`, `hl.sample_qc`, and more.
@johnc1231 (Contributor)

Thanks a lot for reporting this error. Tim found the fix, and this detailed report was very helpful.

annamiraotoole pushed a commit to annamiraotoole/hail that referenced this issue Aug 3, 2020
Fixes hail-is#8944

CHANGELOG: Fixed crash (error 134 or SIGSEGV) in `MatrixTable.annotate_cols`, `hl.sample_qc`, and more.