
Wrong FS on loading json with spark from s3 #114

Closed
perunac opened this issue Jun 9, 2017 · 2 comments


perunac commented Jun 9, 2017

When trying to load a JSON file from S3 with Magellan on an AWS EMR cluster:
val polygons = spark.read.format("magellan").option("type", "geojson").load(inJson)
you get:

17/06/07 07:07:26 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-172-31-19-102.eu-west-1.compute.internal, executor 1): java.lang.IllegalArgumentException: Wrong FS: s3n://logindex-dev-data/eta/geojsons/convexhulloneline.json, expected: hdfs://ip-172-31-27-182.eu-west-1.compute.internal:8020
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:653)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
at magellan.mapreduce.WholeFileReader.nextKeyValue(WholeFileReader.scala:45)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
at scala.collection.AbstractIterator.fold(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

WholeFileReader gets the default FileSystem instead of resolving it from the split's path. On EMR clusters the default FileSystem is HDFS, so it ends up with the HDFS FileSystem rather than the one for S3 (EMRFS by default), and opening an s3n:// path against HDFS fails with the "Wrong FS" error above.
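
The standard Hadoop idiom is to resolve the FileSystem from the path itself rather than from the default configuration, via Path.getFileSystem. A minimal sketch of what the change around WholeFileReader.scala:45 might look like (the wrapper method is illustrative, not Magellan's exact code):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataInputStream, Path}

// Buggy pattern: FileSystem.get(conf) always returns the *default*
// FileSystem (HDFS on EMR), regardless of the split's URI scheme:
//   val fs = FileSystem.get(conf)

// Fix: resolve the FileSystem from the split's own path, so an s3n://
// path yields the S3 FileSystem (EMRFS) and hdfs:// yields HDFS.
def openSplit(path: Path, conf: Configuration): FSDataInputStream = {
  val fs = path.getFileSystem(conf)
  fs.open(path)
}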

@harsha2010 (Owner)

Thanks! Would you like to contribute the fix?


perunac commented Jun 9, 2017

@harsha2010 I don't really know how to fix it; the part about the split path is from someone on the AWS forums.
It really depends on timing. Could you estimate when you might be able to fix this?
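
Until the reader is fixed, one possible workaround is to stage the file onto the cluster's default FileSystem (HDFS on EMR) and point Magellan at that copy, since the default FS is the only one WholeFileReader currently resolves. A sketch, reusing inJson from above with a hypothetical HDFS destination:

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = spark.sparkContext.hadoopConfiguration
val src = new Path(inJson)                         // the s3n:// source
val dst = new Path("/tmp/convexhulloneline.json")  // hypothetical HDFS path
// Copy S3 -> default FS (HDFS), without deleting the source.
FileUtil.copy(src.getFileSystem(conf), src, FileSystem.get(conf), dst, false, conf)
val polygons = spark.read.format("magellan").option("type", "geojson").load(dst.toString)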
