Make the operator work for PySpark in spark master #181

Closed · 2 tasks done
liyinan926 opened this issue Jun 11, 2018 · 13 comments · Fixed by #222
Labels: enhancement (New feature or request)

liyinan926 (Collaborator) commented Jun 11, 2018

The operator currently does not support PySpark, which is now available in the master branch of Spark. A couple of changes, tracked in this issue's task list, are needed to make the operator support PySpark in the master branch.
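
A minimal sketch of the kind of application spec this would enable (the field names and the example path are illustrative, not the final API):

spec:
  type: Python
  mode: cluster
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"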

liyinan926 added the enhancement (New feature or request) label on Jun 11, 2018
mrow4a (Contributor) commented Jun 12, 2018

@liyinan926 If this is not high priority for you, you can assign it to me; I will be playing with Spark master soon anyway (1st task only).

CC @prasanthkothuri

liyinan926 (Collaborator, Author) commented

@mrow4a Are you still interested in taking this?

mrow4a (Contributor) commented Jul 13, 2018

@liyinan926 Yes, on it now. I see cluster mode is broken for us in the master branch: https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L87

mrow4a (Contributor) commented Jul 13, 2018

It appears that the flag is not taken into account if the deploy mode is specified differently in the properties file. It seems to launch in cluster mode, but the printout of the driver log is a bit confusing.

mrow4a (Contributor) commented Jul 17, 2018

@liyinan926 As I understand from #129, the goal is to reuse the existing API, right (to freeze it for the beta release)?

liyinan926 (Collaborator, Author) commented

@mrow4a Can you clarify what you mean by "the goal is to reuse existing API, right (to freeze it for beta release)"?

liyinan926 (Collaborator, Author) commented

Oops, I think I accidentally marked #129 as all done and closed it. The PySpark one is not done yet. Reopened #129.

mrow4a (Contributor) commented Jul 18, 2018

@liyinan926 Python works out of the box when there are no --py-files dependencies. Now the question is about the API:

spec:
  type: Python
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  deps:
    pyfiles:
      - {{ some-path-to-dependency }}

or reusing the existing deps->files, where type: Python would resolve it to --py-files:

spec:
  type: Python
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  deps:
    files:
      - {{ some-path-to-dependency }}

What do you think? (I personally would just resolve it by type: Python and mention it in the README.)

liyinan926 (Collaborator, Author) commented

@mrow4a I think we should keep pyFiles, as files may contain non-Python files, so blindly resolving files to --py-files is not always appropriate.
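
A spec could then carry both kinds of dependencies side by side, with pyFiles mapping to --py-files and files staying as --files. A rough sketch, reusing the placeholder style from above:

spec:
  type: Python
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  deps:
    files:
      - {{ some-path-to-a-data-file }}
    pyFiles:
      - {{ some-path-to-a-python-dependency }}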

mrow4a (Contributor) commented Jul 18, 2018

@liyinan926 You are totally right; I just remembered that some cases pass both --files and --py-files.

liyinan926 (Collaborator, Author) commented Jul 18, 2018

We still need to add support for the new config options spark.kubernetes.memoryOverheadFactor and spark.kubernetes.pyspark.pythonVersion introduced in apache/spark#21092. Both can be represented as optional fields in SparkApplicationSpec, particularly spark.kubernetes.pyspark.pythonVersion.
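
Until dedicated fields exist, one way to pass both options through could be the spec's sparkConf map, assuming it forwards arbitrary Spark properties; a rough sketch with example values:

spec:
  type: Python
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  sparkConf:
    "spark.kubernetes.memoryOverheadFactor": "0.4"
    "spark.kubernetes.pyspark.pythonVersion": "3"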

mrow4a (Contributor) commented Jul 19, 2018

Something very strange is happening with --py-files:

+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n s3a://spark-on-k8s-cluster/spark-app-dependencies/default/pyspark-pi/wordcount_python_job.zip ']'
+ PYTHONPATH='/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:s3a://spark-on-k8s-cluster/spark-app-dependencies/default/pyspark-pi/wordcount_python_job.zip'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ '[' 2 == 2 ']'
++ python -V
+ pyv='Python 2.7.5'
+ export PYTHON_VERSION=2.7.5

And of course it does not work: it adds the remote dependency to PYTHONPATH as an s3a URI. This needs investigation.

mrow4a (Contributor) commented Jul 19, 2018

Strange: to make .zip files work, I needed to add

    # Look up the local path where Spark placed the distributed archive on the driver
    wordcount_job_path = SparkFiles.get("wordcount_python_job.zip")
    # Add that local copy to the Python path of the driver and executors
    spark.sparkContext.addPyFile(wordcount_job_path)

otherwise, with just addPyFile, it gives me:

py4j.protocol.Py4JJavaError: An error occurred while calling o65.addFile.
: java.io.FileNotFoundException: File file:/opt/spark/work-dir/wordcount_python_job.zip does not exist
