Make the operator work for PySpark in spark master #181

Closed · 2 tasks done
liyinan926 opened this issue Jun 11, 2018 · 13 comments · Fixed by #222
Labels: enhancement (New feature or request)

liyinan926 (Collaborator) commented Jun 11, 2018

The operator currently does not support PySpark, which is now available in the master branch of Spark. A couple of changes, tracked in this issue's task list, are needed to make the operator support PySpark in the master branch.
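
A minimal sketch of the kind of application spec this would enable (the field names and the example path are illustrative, not the final API):

spec:
  type: Python
  mode: cluster
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"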

liyinan926 added the enhancement (New feature or request) label on Jun 11, 2018
mrow4a (Contributor) commented Jun 12, 2018

@liyinan926 If this is not high priority for you, you can assign it to me; I will be playing with Spark master soon anyway (1st task only).

CC @prasanthkothuri

liyinan926 (Collaborator, Author) commented

@mrow4a Are you still interested in taking this?

mrow4a (Contributor) commented Jul 13, 2018

@liyinan926 Yes, on it now. I see cluster mode is broken for us in the master branch: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L87

mrow4a (Contributor) commented Jul 13, 2018

It appears that the flag is not taken into account if the deploy mode is specified differently in the properties file. It seems to launch in cluster mode, but the printout of the driver log is a bit confusing.

mrow4a (Contributor) commented Jul 17, 2018

@liyinan926 As I understand from #129, the goal is to reuse the existing API, right (to freeze it for the beta release)?

liyinan926 (Collaborator, Author) commented

@mrow4a Can you clarify what you mean by "the goal is to reuse existing API, right (to freeze it for beta release)"?

liyinan926 (Collaborator, Author) commented

Oops, I think I accidentally marked #129 as all done and closed it. The PySpark one is not done yet. Reopened #129.

mrow4a (Contributor) commented Jul 18, 2018

@liyinan926 Python works out of the box when there are no --py-files dependencies. Now the question is about the API:

spec:
  type: Python
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  deps:
    pyfiles:
      - {{ some-path-to-dependency }}

or reusing the existing deps->files, where type: Python would resolve it to --py-files:

spec:
  type: Python
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  deps:
    files:
      - {{ some-path-to-dependency }}

What do you think? (I personally would just resolve it by type: Python and mention it in the README.)

liyinan926 (Collaborator, Author) commented

@mrow4a I think we should keep pyFiles, as files may contain non-Python files, so blindly resolving files to --py-files is not always appropriate.
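
A spec could then carry both kinds of dependencies side by side, with pyFiles mapping to --py-files and files staying as --files. A rough sketch, reusing the placeholder style from above:

spec:
  type: Python
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  deps:
    files:
      - {{ some-path-to-a-data-file }}
    pyFiles:
      - {{ some-path-to-a-python-dependency }}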

mrow4a (Contributor) commented Jul 18, 2018

@liyinan926 You are totally right; I just remembered that some cases pass both --files and --py-files.

liyinan926 (Collaborator, Author) commented Jul 18, 2018

We still need to add support for the new config options spark.kubernetes.memoryOverheadFactor and spark.kubernetes.pyspark.pythonVersion introduced in apache/spark#21092. Both can be represented as optional fields in SparkApplicationSpec, particularly spark.kubernetes.pyspark.pythonVersion.
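
Until dedicated fields exist, one way to pass both options through could be the spec's sparkConf map, assuming it forwards arbitrary Spark properties; a rough sketch with example values:

spec:
  type: Python
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  sparkConf:
    "spark.kubernetes.memoryOverheadFactor": "0.4"
    "spark.kubernetes.pyspark.pythonVersion": "3"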

mrow4a (Contributor) commented Jul 19, 2018

Something very strange is happening with --py-files:

+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n s3a://spark-on-k8s-cluster/spark-app-dependencies/default/pyspark-pi/wordcount_python_job.zip ']'
+ PYTHONPATH='/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:s3a://spark-on-k8s-cluster/spark-app-dependencies/default/pyspark-pi/wordcount_python_job.zip'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ '[' 2 == 2 ']'
++ python -V
+ pyv='Python 2.7.5'
+ export PYTHON_VERSION=2.7.5

And of course it does not work: it adds the remote dependency to PYTHONPATH as an s3a URI. This needs investigation.

mrow4a (Contributor) commented Jul 19, 2018

Strange: to make .zip files work, I needed to add

    # Look up the local path where Spark placed the distributed archive on the driver
    wordcount_job_path = SparkFiles.get("wordcount_python_job.zip")
    # Add that local copy to the Python path of the driver and executors
    spark.sparkContext.addPyFile(wordcount_job_path)

otherwise, with just addPyFile, it gives me:

py4j.protocol.Py4JJavaError: An error occurred while calling o65.addFile.
: java.io.FileNotFoundException: File file:/opt/spark/work-dir/wordcount_python_job.zip does not exist
