jsmadsen said:
Editor’s Note:
The Hail team does not recommend the solution posted here, please read the entire thread for details and possible alternatives.
~ danking
This may be an issue with the UKB RAP but I cannot tell. In the simplest case, I am just trying to read and write a MatrixTable as:
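(The original cells are not preserved in this copy of the post; a minimal sketch of the round trip being described, using the notebook-local path that appears in the error below, would look roughly like this:)

import hail as hl

hl.init()

# stand-in dataset for illustration only; the real post used UKB data
mt = hl.balding_nichols_model(n_populations=3, n_samples=100, n_variants=100)

# write to the notebook-local path, then read it back
mt.write("file:///opt/notebooks/GIPR.mt", overwrite=True)
mt = hl.read_matrix_table("file:///opt/notebooks/GIPR.mt")
mt.show()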
I have attached the log for writing and reading the MatrixTable from the same environment. Trying to read the same MatrixTable in a different environment gives:
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
--> 702             printer.pretty(obj)
/opt/conda/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
--> 394                 return _repr_pprint(obj, self, cycle)
/opt/conda/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
--> 700     output = repr(obj)
/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in __repr__(self)
--> 2543         return self.__str__()
/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in __str__(self)
--> 2537         s = self.table_show.__str__()
/opt/conda/lib/python3.6/site-packages/hail/table.py in __str__(self)
--> 1294         return self._ascii_str()
/opt/conda/lib/python3.6/site-packages/hail/table.py in _ascii_str(self)
--> 1320         rows, has_more, dtype = self.data()
/opt/conda/lib/python3.6/site-packages/hail/table.py in data(self)
--> 1304             rows, has_more = t._take_n(self.n)
/opt/conda/lib/python3.6/site-packages/hail/table.py in _take_n(self, n)
--> 1451             rows = self.take(n + 1)
<decorator-gen-1119> in take(self, n, _localize)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
--> 614     return __original_func(*args_, **kwargs_)
/opt/conda/lib/python3.6/site-packages/hail/table.py in take(self, n, _localize)
--> 2121         return self.head(n).collect(_localize)
<decorator-gen-1113> in collect(self, _localize)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
--> 614     return __original_func(*args_, **kwargs_)
/opt/conda/lib/python3.6/site-packages/hail/table.py in collect(self, _localize)
--> 1920             return Env.backend().execute(e._ir)
/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
--> 98             raise e
/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
--> 74             result = json.loads(self._jhc.backend().executeJSON(jir))
/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
--> 1257             answer, self.gateway_client, self.target_id, self.name)
/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
--> 32             'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None

FatalError: FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-10-60-2-89.eu-west-2.compute.internal, executor 0): java.io.FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
    at is.hail.io.fs.HadoopFS.openNoCompression(HadoopFS.scala:83)
    at is.hail.io.fs.FS$class.open(FS.scala:139)
    at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
    at is.hail.HailContext$$anon$1.compute(HailContext.scala:276)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    ...
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2001)
    ...
    at is.hail.sparkextras.ContextRDD.runJob(ContextRDD.scala:351)
    at is.hail.rvd.RVD.head(RVD.scala:525)
    at is.hail.expr.ir.TableSubset$class.execute(TableIR.scala:1326)
    at is.hail.expr.ir.TableHead.execute(TableIR.scala:1332)
    at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:1845)
    at is.hail.expr.ir.Interpret$.run(Interpret.scala:819)
    ...
    at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:335)
    at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:377)
    ...
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

java.io.FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    ...
    at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.61-3c86d3ba497a
Error summary: FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
--> 345                 return method()
/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in _repr_html_(self)
--> 2546         s = self.table_show._repr_html_()
/opt/conda/lib/python3.6/site-packages/hail/table.py in _repr_html_(self)
--> 1309         return self._html_str()
/opt/conda/lib/python3.6/site-packages/hail/table.py in _html_str(self)
--> 1399         rows, has_more, dtype = self.data()
/opt/conda/lib/python3.6/site-packages/hail/table.py in data(self)
--> 1304             rows, has_more = t._take_n(self.n)
    ...

FatalError: FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist

SHORTENED DUE TO CHARACTER LIMIT. LOG ATTACHED.

Hail version: 0.2.61-3c86d3ba497a
Error summary: FileNotFoundException: File file:/opt/notebooks/GIPR.mt/rows/rows/parts/part-0-2-0-0-ba507024-1211-ab56-179a-832d6e98beb7 does not exist
I cannot for the life of me figure out what is going on. It seems the rows/rows/parts/ folder is empty somehow? I hope it makes sense but let me know if it does not.
Also, info on the Jupyter environment is here.
read_MatrixTable_same_environment.log (192.6 KB)
write_MatrixTable.log (178.0 KB)
read_matrixtable.log (197.2 KB)
tpoterba said:
What is the runtime? I’m assuming this is running on a cluster?
I think the core issue here is that the file system you’re writing to / reading from is a local file system, not network-visible. The issue is similar to the one described here:
Cannot load a Hail Table in Terra notebook
The core problem is that you’re loading a file on a local file system (visible only to the cluster driver node) across a cluster of multiple machines.
This is a bad error message (should probably say “file doesn’t exist”), but this is happening because if you don’t supply the file scheme (scheme://) then the default is used, which is HDFS for Dataproc clusters. There’s no folder in HDFS in the default user directory with the name training_pca.ht.
You can read this file because the read_table …
The solution is that you should write to a file system that’s network visible – a cloud object store like Google Storage / S3 / Azure Blob Storage, or a network file system like Lustre/HDFS.
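For example (a sketch only — the URIs below are placeholders for whatever network-visible storage the cluster actually has):

import hail as hl

hl.init()
mt = hl.balding_nichols_model(3, 100, 100)  # stand-in dataset

# HDFS is visible to every node of the Spark cluster
mt.write("hdfs:///user/jsmadsen/GIPR.mt", overwrite=True)
mt = hl.read_matrix_table("hdfs:///user/jsmadsen/GIPR.mt")

# equivalently, an object store:
# mt.write("gs://my-bucket/GIPR.mt", overwrite=True)
# mt.write("s3a://my-bucket/GIPR.mt", overwrite=True)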
danking said:
I want to echo Tim here that you really don’t want to do this. Your suggestion involves three transfers of data which will be prohibitively slow for all but the smallest datasets:
Write the data to Hadoop at data/1kg.mt.
Copy the data from Hadoop to your local file system (almost certainly of extremely limited size) at /opt/notebooks/1kg.mt.
Copy the data from your local file system to whatever this dx thing is.
I don’t know much about the DNANexus platform, but if you hope to do any serious work on large datasets, you need to figure out how to use an object store like Google Cloud Storage, Amazon S3, or Azure Blob Storage from within DNANexus.
The root problem is that the dxpy.download_folder command does not work the way you expect. I strongly suspect that it downloads to the Hadoop file system at /opt/notebooks/GIPR.mt. Downloading a huge file to a local file system just doesn’t make any sense.
Can you try the following and report back?
If that fails can you please execute the following cells in your Jupyter notebook?
jsmadsen said:
Hi Dan
Thanks for taking your time! I hope to do some serious work on the upcoming UKB 300k exomes release but their Material Transfer Agreement prohibits moving the data off of their own platform. It seems I am stuck with DNAnexus.
Will post the other outputs in a separate reply.
jsmadsen said:
I hope it is not too much trouble. What goes on behind the scenes of Hail is a bit beyond me.
Also, documentation, in case it helps.
danking said:
Huh, and what is the output of hadoop fs -ls /opt/notebooks/GIPR.mt/ ?
The second error message indicates that GIPR.mt is, somehow, not a folder.
jsmadsen said:
It does not exist after dxpy.download_folder. I think at this point, the issue might better be directed at UKB support, no?
danking said:
Agreed. I’m not sure what’s going on. It does seem that the dx downloads are sending the files into Hadoop (which is good!). In the future, I’d plan to use the Hadoop URLs (no file:) instead of the file URLs.
Good luck!
EDIT: Upon further investigation ourselves, I might be wrong about downloading to the Hadoop filesystem.
jsmadsen said:
I have at least ruled it out as a Hail issue. I figure more people are going to run into this when they have to move onto the Research Analysis Platform at the end of September.
jsmadsen said:
Alright, here is the right way to go about it as per DNAnexus support:
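(The actual cells from this post are not preserved in this copy; what follows is only a rough sketch of the dnax:// pattern from the DNAnexus documentation, with the database name and paths as placeholders — and note the editor’s warning above.)

import pyspark
import dxpy
import hail as hl

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
hl.init(sc=sc)

# back a Spark database with the project's DNAX store; the name is arbitrary
spark.sql("CREATE DATABASE IF NOT EXISTS my_hail_db LOCATION 'dnax://'")
# look up the database's ID (dxpy ships on the RAP); per the DNAnexus examples
db_id = dxpy.find_one_data_object(name="my_hail_db", classname="database")["id"]

# read/write MatrixTables through dnax:// URLs instead of file: paths
mt = hl.balding_nichols_model(3, 100, 100)  # stand-in dataset
mt.write(f"dnax://{db_id}/GIPR.mt", overwrite=True)
mt = hl.read_matrix_table(f"dnax://{db_id}/GIPR.mt")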
Thank you again for the help.
danking said:
Hey jsmadsen,
I am glad you have found a solution! I’m somewhat surprised this works!
Just to be absolutely clear, this line:
successfully reads a matrix table? And you can successfully execute the following command?
Are you able to read VCF files stored in dx using the dnax protocol as well?
jsmadsen said:
Hey Dan
It does indeed work. I have attached write_mt.log (279.9 KB) and read_mt.log (217.0 KB) (new environment), in case those are interesting.
It does not seem I can use dnax for reading Bulk files, but I may just be doing it wrong. At least, they are not included in the dispensed dataset.
henryhmo said:
I am also stuck on how to read the VCF files from the Bulk in DNAnexus with a cluster environment (as a bucket). Please follow up on this problem. Thanks!
I don’t think the instructions that DNANexus gave will be viable. The mounted folder is not visible to the workers. I think they are actually loading through the master node, which is very slow.
https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/pVCF_import.ipynb
https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data#import-pvcf-genomic-data-into-a-hail-matrixtable-mt
danking said:
Hi henryhmo,
I don’t have access to the DNA Nexus environment myself, so I can’t confirm or deny if this works. I agree, unless every Spark worker node has the VCFs network-mounted or stored in /mnt/project/..., this code will not work.
Hail cannot load data through the master node. If this code works for you, then it is indeed using the workers.
henryhmo said:
My colleague says that the /mnt/project/… route works, but it is excruciatingly slow. I guess the problem is the way they gzip the pVCF files (instead of bgz). Btw, how can we bgz a csv/tsv file outside of Hail? Thanks!
RossDeVito said:
Has your colleague found any faster solutions for loading the genotype data? Currently it appears we cannot use the BGEN files with Hail because of the compression type (discussion here), and for pVCFs the compression format is also an issue (it only works with force=True, which is slow and which the docs say is highly discouraged). The only other format is PLINK, which I’ll try.
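(For reference, a minimal sketch of the PLINK route — the paths are placeholders:)

import hail as hl

hl.init()
mt = hl.import_plink(
    bed="file:///mnt/project/path/to/genotypes.bed",
    bim="file:///mnt/project/path/to/genotypes.bim",
    fam="file:///mnt/project/path/to/genotypes.fam",
    reference_genome="GRCh38",  # adjust to the build of the data
)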
tpoterba said:
Your plight is sympathetic – here’s a PR for zstd support:
hail-is/hail#12576
Should be in a new release in a day or two if all goes well.
danking said:
You’re looking for the tool called “block gzip” / bgzip.
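For example, with htslib’s bgzip on the PATH (the filename is a placeholder):
bgzip -c phenotypes.tsv > phenotypes.tsv.bgz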
danking said:
Have you already tried force_bgz=True? If the files are block-gzip compressed but named .gz, force_bgz=True will treat them as block-gzipped.
I would be shocked if someone generated non-block-compressed pVCFs. They would be unusable for a lot of important use-cases!
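For instance (the path is a placeholder):

import hail as hl

hl.init()
mt = hl.import_vcf(
    "file:///mnt/project/path/to/chunk1.vcf.gz",  # block-gzipped despite the .gz name
    force_bgz=True,
    reference_genome="GRCh38",
)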
RossDeVito said:
Just ran it again and you’re right