
Wrong FS on loading json with spark from s3 #114

Closed
perunac opened this issue Jun 9, 2017 · 2 comments


perunac commented Jun 9, 2017

When trying to load a JSON file from S3 with Magellan on an AWS EMR cluster:
val polygons = spark.read.format("magellan").option("type", "geojson").load(inJson)
you get:

17/06/07 07:07:26 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-172-31-19-102.eu-west-1.compute.internal, executor 1): java.lang.IllegalArgumentException: Wrong FS: s3n://logindex-dev-data/eta/geojsons/convexhulloneline.json, expected: hdfs://ip-172-31-27-182.eu-west-1.compute.internal:8020
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:653)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
at magellan.mapreduce.WholeFileReader.nextKeyValue(WholeFileReader.scala:45)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
at scala.collection.AbstractIterator.fold(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

WholeFileReader gets the default FileSystem instead of resolving it from the split's path. On EMR clusters the default FileSystem is HDFS, so it ends up with the HDFS FileSystem rather than the one for S3 (EMRFS by default), and opening an s3n:// path against HDFS fails with the "Wrong FS" error above.
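
The standard Hadoop idiom is to resolve the FileSystem from the path itself rather than from the default configuration, via Path.getFileSystem. A minimal sketch of what the change around WholeFileReader.scala:45 might look like (the wrapper method is illustrative, not Magellan's exact code):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataInputStream, Path}

// Buggy pattern: FileSystem.get(conf) always returns the *default*
// FileSystem (HDFS on EMR), regardless of the split's URI scheme:
//   val fs = FileSystem.get(conf)

// Fix: resolve the FileSystem from the split's own path, so an s3n://
// path yields the S3 FileSystem (EMRFS) and hdfs:// yields HDFS.
def openSplit(path: Path, conf: Configuration): FSDataInputStream = {
  val fs = path.getFileSystem(conf)
  fs.open(path)
}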

@harsha2010 (Owner)

Thanks! Would you like to contribute the fix?


perunac commented Jun 9, 2017

@harsha2010 I don't really know how to fix it; the part about the split path is from someone on the AWS forums.
It really depends on timing. Could you estimate when you might be able to fix this?
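
Until the reader is fixed, one possible workaround is to stage the file onto the cluster's default FileSystem (HDFS on EMR) and point Magellan at that copy, since the default FS is the only one WholeFileReader currently resolves. A sketch, reusing inJson from above with a hypothetical HDFS destination:

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = spark.sparkContext.hadoopConfiguration
val src = new Path(inJson)                         // the s3n:// source
val dst = new Path("/tmp/convexhulloneline.json")  // hypothetical HDFS path
// Copy S3 -> default FS (HDFS), without deleting the source.
FileUtil.copy(src.getFileSystem(conf), src, FileSystem.get(conf), dst, false, conf)
val polygons = spark.read.format("magellan").option("type", "geojson").load(dst.toString)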
