
Error saving Pipeline model on s3 via Pyspark #41

Closed
rush4ratio opened this issue Oct 4, 2021 · 3 comments

Comments


rush4ratio commented Oct 4, 2021

I get the error below when I try to save the pipeline to s3, providing a URI with an s3:// scheme (note: certain info has been replaced with ellipses):

[Stage 56:=======================================================>(87 + 1) / 88]21/10/04 18:48:57 WARN TaskSetManager: Lost task 87.0 in stage 56.0 (TID 2541, ip-10-195-26-28.ec2.internal, executor 7): java.lang.IllegalArgumentException: Wrong FS: s3://../stages/3_HnswSimilarity_7c6781618491/indices/0, expected: hdfs://..
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:782)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:213)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1524)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1536)
at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:527)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:376)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:316)
at com.github.jelmerk.spark.knn.KnnModelWriter$$anonfun$saveImpl$1.apply$mcVJ$sp(KnnAlgorithm.scala:290)
at com.github.jelmerk.spark.knn.KnnModelWriter$$anonfun$saveImpl$1.apply(KnnAlgorithm.scala:283)
at com.github.jelmerk.spark.knn.KnnModelWriter$$anonfun$saveImpl$1.apply(KnnAlgorithm.scala:283)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.SparkContext$$anonfun$range$1$$anonfun$28$$anon$1.foreach(SparkContext.scala:765)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

21/10/04 18:48:58 ERROR TaskSetManager: Task 87 in stage 56.0 failed 4 times; aborting job
0%| | 0/1 [00:22<?, ?it/s]


Py4JJavaError Traceback (most recent call last)
/tmp/ipykernel_113378/3755559733.py in
1 nn_run(...
----> 2 save_model=True)

/tmp/ipykernel_113378/66834136.py in nn_run(input_cols, tbl_name, nn_dir, output_tbl, save_model, num_neighbors)
40
41 if save_model:
---> 42 nn_model.write().overwrite().save(os.path.join(nn_dir, f'{name}_nn.model'))
43
44 # pivoted_data.write.format("parquet").mode("append").save(os.path.join(nn_dir, f'{output_tbl}.parquet'))

/usr/lib/spark/python/pyspark/ml/util.py in save(self, path)
181 if not isinstance(path, basestring):
182 raise TypeError("path should be a basestring, got type %s" % type(path))
--> 183 self._jwrite.save(path)
184
185 def overwrite(self):

/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in call(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(

Py4JJavaError: An error occurred while calling o768.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 87 in stage 56.0 failed 4 times, most recent failure: Lost task 87.3 in stage 56.0 (TID 2544, ip-10-195-26-28.ec2.internal, executor 7): java.lang.IllegalArgumentException: Wrong FS: s3://../stages/3_HnswSimilarity_7c6781618491/indices/0, expected: hdfs://
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:782)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:213)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1524)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1536)
at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:527)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:376)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:316)
at com.github.jelmerk.spark.knn.KnnModelWriter$$anonfun$saveImpl$1.apply$mcVJ$sp(KnnAlgorithm.scala:290)
at com.github.jelmerk.spark.knn.KnnModelWriter$$anonfun$saveImpl$1.apply(KnnAlgorithm.scala:283)
at com.github.jelmerk.spark.knn.KnnModelWriter$$anonfun$saveImpl$1.apply(KnnAlgorithm.scala:283)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.SparkContext$$anonfun$range$1$$anonfun$28$$anon$1.foreach(SparkContext.scala:765)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2080)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2068)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2067)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2067)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:988)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:988)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:988)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2301)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2250)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2239)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:799)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:972)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:970)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:970)
at com.github.jelmerk.spark.knn.KnnModelWriter.saveImpl(KnnAlgorithm.scala:283)
at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:180)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
at org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:338)
at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:180)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Wrong FS: s3://../3_HnswSimilarity_7c6781618491/indices/0, expected: hdfs://
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:782)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:213)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1524)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1536)
at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:527)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:376)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:316)
at com.github.jelmerk.spark.knn.KnnModelWriter$$anonfun$saveImpl$1.apply$mcVJ$sp(KnnAlgorithm.scala:290)
at com.github.jelmerk.spark.knn.KnnModelWriter$$anonfun$saveImpl$1.apply(KnnAlgorithm.scala:283)
at com.github.jelmerk.spark.knn.KnnModelWriter$$anonfun$saveImpl$1.apply(KnnAlgorithm.scala:283)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.SparkContext$$anonfun$range$1$$anonfun$28$$anon$1.foreach(SparkContext.scala:765)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

It's probably analogous to: JohnSnowLabs/spark-nlp#121
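The "Wrong FS" failure comes from Hadoop's FileSystem.checkPath: the writer resolves the s3:// output path against the cluster's default (HDFS) FileSystem instead of the FileSystem that matches the path's own scheme. A minimal sketch of that scheme check in plain Python (hypothetical names and paths, just to illustrate why the s3:// URI is rejected — this is not the library's code):

```python
from urllib.parse import urlparse

def check_path(path: str, default_fs: str) -> None:
    """Mimic the shape of Hadoop's FileSystem.checkPath: reject paths
    whose scheme differs from the filesystem they are resolved against."""
    path_scheme = urlparse(path).scheme
    fs_scheme = urlparse(default_fs).scheme
    if path_scheme and path_scheme != fs_scheme:
        raise ValueError(f"Wrong FS: {path}, expected: {fs_scheme}://")

# Resolving an hdfs:// path against the HDFS default filesystem is fine:
check_path("hdfs://namenode/models/stage/0", "hdfs://namenode")
# Resolving an s3:// path against it raises, matching the trace above:
# check_path("s3://bucket/models/stage/0", "hdfs://namenode")
```

The fix on the library side is to obtain the FileSystem from the destination path itself (in Hadoop terms, Path.getFileSystem(conf)) rather than using the default one.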

jelmerk pushed a commit that referenced this issue Oct 5, 2021

jelmerk commented Oct 5, 2021

Thanks, I released a new version. Could you try it with --packages 'com.github.jelmerk:hnswlib-spark_2.3.0_2.11:0.0.48'?
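For reference, a typical way to pick the fixed artifact up from an interactive session (the launcher and the _2.3.0_2.11 Spark/Scala suffix are assumptions — match them to your cluster):

```shell
# Launch PySpark with the patched package coordinate from the comment above;
# Spark resolves it from Maven Central at startup.
pyspark --packages 'com.github.jelmerk:hnswlib-spark_2.3.0_2.11:0.0.48'
```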

@rush4ratio (Author)

Thanks! Works as expected.


jelmerk commented Oct 5, 2021

Thanks for verifying!
