
Spark driver cannot access --py-files and --files (http / gs / s3 ) #945

ocassetti opened this issue Jun 13, 2020 · 6 comments

@ocassetti

I am not sure whether this is an issue with Spark on K8s or with the entry point of the image, but the result is that the driver cannot access the files passed through --py-files and --files when these come from HTTP / HDFS, including gs:// and s3://.

Note that the files are actually downloaded correctly by Spark on the driver pod, but they are downloaded into SPARK_LOCAL_DIRS while the working directory is /opt/spark/work-dir, so the files will not be found by Python.
This relates to the comment made by @mrow4a in #181, where he said he had to use SparkFiles.get(...).
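
For context, a minimal sketch of that SparkFiles-based access on the driver (the data.txt name matches the --files argument in the reproduction below; everything else is an assumption, not the code from #181):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Minimal sketch: read a --files artifact via SparkFiles instead of using a
# path relative to the working directory.
spark = SparkSession.builder.appName("sparkfiles-access").getOrCreate()

# SparkFiles.get() returns the absolute path of the staged copy (under
# SPARK_LOCAL_DIRS on the driver pod), not a path under /opt/spark/work-dir.
path = SparkFiles.get("data.txt")
with open(path) as f:
    print(f.read())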

So either the working directory is not set correctly (entry point) or the files are downloaded to the wrong place.

Note that everything works just fine on YARN and Mesos, which makes me think the problem could be in the Spark-on-K8s code, but I am not sure where in Spark that might be.

Here are the steps to reproduce the issue:

  1. Submit a Spark application where --files and --py-files are specified using HTTP / HDFS, e.g.
spark-submit \
   --master k8s://https://172.17.0.2:8443 \
   --deploy-mode cluster \
   --name ocassetti-test \
   --conf spark.executor.instances=2 \
   --conf spark.kubernetes.namespace=spark \
   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
   --py-files  https://raw.githubusercontent.com/ocassetti/spark-docker/master/samples/lib.zip \
   --conf spark.kubernetes.pyspark.pythonVersion="3" \
   --files https://raw.githubusercontent.com/ocassetti/spark-docker/master/samples/data.txt \
   --conf spark.kubernetes.container.image=gcr.io/spark-operator/spark-py:v2.4.5 \
   https://raw.githubusercontent.com/ocassetti/spark-docker/master/samples/main.py

  2. Check the driver container: it will fail with an error about not being able to find the file.

https://raw.githubusercontent.com/ocassetti/spark-docker/master/samples/main.py loads lib.zip as an extra path.

  3. By adding the following to the main file, everything works (see the sketch after this list):
os.chdir(SparkFiles.getRootDirectory())

https://raw.githubusercontent.com/ocassetti/spark-docker/master/samples/main_chdir.py
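
A minimal sketch of what main_chdir.py presumably does (the file read and the SparkSession setup are assumptions; the actual sample is at the URL above):

import os

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chdir-workaround").getOrCreate()

# Move the working directory to the root directory where Spark staged the
# --files / --py-files downloads, so relative paths resolve on the driver.
os.chdir(SparkFiles.getRootDirectory())

# data.txt was passed with --files; it is now visible from the cwd.
with open("data.txt") as f:
    print(f.read())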

All files can be found here
https://github.com/ocassetti/spark-docker/tree/master/samples

If you can point me out where the issue might be in either Spark or K8S-Spark operator I will be happy to look into it.

Further info

Logs from executing main.py in K8S

============
[], [], []
============
/opt/spark/work-dir
[], [], []
...

Logs from executing main.py in YARN

/tmp/hadoop-oscar/nm-local-dir/usercache/oscar/appcache/application_1592055847768_0003/container_1592055847768_0003_01_000001
['main.py', 'data.txt', 'py4j-0.10.7-src.zip', 'lib.zip', 'pyspark.zip', '.default_container_executor.sh.crc', 'default_container_executor.sh', '.default_container_executor_session.sh.crc', 'default_container_executor_session.sh', '.launch_container.sh.crc', 'launch_container.sh', '.container_tokens.crc', 'container_tokens'], ['__spark_conf__', '__spark_libs__', 'tmp'], ['main.py', 'data.txt', 'py4j-0.10.7-src.zip', 'lib.zip', 'pyspark.zip', '.default_container_executor.sh.crc', 'default_container_executor.sh', '.default_container_executor_session.sh.crc', 'default_container_executor_session.sh', '.launch_container.sh.crc', 'launch_container.sh', '.container_tokens.crc', 'container_tokens']
[], [], []
Hello :)!
Hello from file :D
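
(For reference, a hypothetical reconstruction of the kind of diagnostic that produces the listings above; the actual main.py is linked earlier.)

import os

# Hypothetical diagnostic, an assumption about main.py: print the driver's
# working directory and its contents. On K8s the listing comes back empty;
# on YARN it contains data.txt, lib.zip and the container scripts.
print("============")
print(os.getcwd())
print(os.listdir("."))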

@dexterhu

dexterhu commented Oct 9, 2020

@ocassetti I thought only the executors download those remote files on the fly; the driver doesn't need to, as it merely passes the configuration to the executors. That's what I observed for --files.

@ocassetti
Author

@dexterhu the driver also requires those files

@eduardorochasoares

Anyone? I am having the same issue accessing files from --files in Spark on K8s.

@dexterhu

I think it depends on the Spark version of the job. Spark 3.0's driver may download it, but not Spark 2.4.
If Spark 3.0 still doesn't, it's a Spark bug or a feature request.

@joanjiao2016

I have the same issue with Spark 3.0.0.

@kacperlukawski

The same in PySpark 3.2.0, with both spark-submit and spark-on-k8s-operator. Has anyone managed to work it out?
