Spark driver cannot access --py-files and --files (http / gs / s3 ) #945
Comments
@ocassetti I thought only the executors download those remote files on the fly; the driver doesn't need to, as it merely passes the configuration to the executors. That's what I observed for --files
@dexterhu the driver also requires those files
Anyone? I am having the same issue accessing files from --files in Spark on K8s.
I think it depends on the Spark version of the job. Spark 3.0's driver may download them, but Spark 2.4's does not.
I have the same issue with Spark 3.0.0
The same in PySpark 3.2.0, with both spark-submit and spark-on-k8s-operator. Has anyone managed to work it out?
I am not too sure if this is an issue with Spark/K8s or the entry-point of the image, but the result is that the driver cannot access the files passed through `--py-files` and `--files` when these come from HTTP / HDFS, including `gs://` and `s3://`.
Note that the files are actually correctly downloaded by Spark on the driver pod, but they are downloaded into `SPARK_LOCAL_DIRS` while the working directory is `/opt/spark/work-dir`; the result is that the files will not be found by Python. This relates to the comment made by @mrow4a in #181, where he said he had to use `SparkFiles.get(...)`. So either the working directory is not set correctly (entry-point) or the files are downloaded to the wrong place.
Note that everything works just fine in YARN and Mesos, which makes me think the problem could be in the Spark/K8s code, but I am not too sure where in Spark that might be.
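The mismatch described above (files landing in `SPARK_LOCAL_DIRS` while the process runs from `/opt/spark/work-dir`) can be sketched locally without a cluster. This is illustrative only: `data.txt` is a hypothetical `--files` artifact, and the temp directories stand in for `SPARK_LOCAL_DIRS` and the work dir.

```python
import os
import tempfile

download_dir = tempfile.mkdtemp()  # stand-in for SPARK_LOCAL_DIRS
work_dir = tempfile.mkdtemp()      # stand-in for /opt/spark/work-dir

# Simulate Spark downloading a --files artifact into download_dir.
with open(os.path.join(download_dir, "data.txt"), "w") as f:
    f.write("hello")

# The driver process runs from the work dir, not the download dir.
os.chdir(work_dir)

# Opening the file by its bare name fails: wrong working directory.
try:
    with open("data.txt") as f:
        pass
    found = True
except FileNotFoundError:
    found = False
print(found)  # -> False

# The chdir workaround: move into the download dir before opening.
os.chdir(download_dir)
with open("data.txt") as f:
    print(f.read())  # -> hello
```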
Here are the steps to replicate the issue: `--files` and `--py-files` are specified using HTTP / HDFS, e.g. https://raw.githubusercontent.com/ocassetti/spark-docker/master/samples/main.py loads `lib.zip` as an extra path; everything will work with https://raw.githubusercontent.com/ocassetti/spark-docker/master/samples/main_chdir.py
All files can be found here: https://github.com/ocassetti/spark-docker/tree/master/samples
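Since main.py loads `lib.zip` as an extra path, the import side of the problem can also be sketched locally. Everything here is illustrative, not the actual contents of the samples repo: `mylib` is a hypothetical package, and the temp directory stands in for wherever Spark drops `--py-files` on the driver.

```python
import os
import sys
import tempfile
import zipfile

files_root = tempfile.mkdtemp()  # stand-in for the --py-files download dir

# Simulate Spark having downloaded lib.zip into that directory.
lib_zip = os.path.join(files_root, "lib.zip")
with zipfile.ZipFile(lib_zip, "w") as zf:
    zf.writestr("mylib/__init__.py", "VALUE = 42\n")

# Importing mylib fails unless the zip's real location is on sys.path;
# a bare "lib.zip" relative to /opt/spark/work-dir would not be found.
sys.path.insert(0, lib_zip)
import mylib

print(mylib.VALUE)  # -> 42
```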
If you can point out where the issue might be in either Spark or the K8s Spark operator, I will be happy to look into it.
Further info
Logs from executing main.py in K8S
Logs from executing main.py in YARN