
Add PySpark support to sparkctl and Spark Operator. #222

Merged
merged 3 commits into kubeflow:master from upstream_pyspark on Jul 27, 2018

Conversation

mrow4a (Contributor) commented Jul 19, 2018

Description

This PR adds support for PySpark, and thus closes #181.

@liyinan926
CC @prasanthkothuri

type: Python
pythonVersion: "2"
mode: cluster
image: "gcr.io/ynli-k8s/spark:v2.4.0-SNAPSHOT"
mrow4a (Contributor, Author):
@liyinan926 For now I have placed the SNAPSHOT tag of your image.

liyinan926 (Collaborator):

That's fine. I will make sure the image exists before merging this.

mrow4a (Contributor, Author):

Great!

// This sets the major Python version of the docker image used to run the driver and executor containers. Can either be 2 or 3, default 2.
// Optional.
PythonVersion *string `json:"pythonVersion,omitempty"`
// This sets the Memory Overhead Factor that will allocate memory to non-JVM memory.
liyinan926 (Collaborator):

It's worth mentioning that the value of this field will be overridden by `Spec.Driver.MemoryOverhead` and `Spec.Executor.MemoryOverhead` if they are set. Correspondingly, it's worth mentioning the same in the comments for `Spec.Driver.MemoryOverhead` and `Spec.Executor.MemoryOverhead` as well.

mrow4a (Contributor, Author):

Addressed

if len(localPyFiles) > 0 {
	uploadedPyFiles, err := uploadLocalDependencies(app, localPyFiles)
	if err != nil {
		return fmt.Errorf("failed to upload local files: %v", err)
liyinan926 (Collaborator):

s/local files/local pyfiles/.

can be used.

```
val absPathToFile = SparkFiles.get("data-file-1.txt")
```
liyinan926 (Collaborator):

This does not apply to Spark 2.3.x, in which the files are downloaded by the init container to where fileDownloadDir points to. In Spark 2.4, this is the right way.

### Python Support

Python support can be enabled by setting `mainApplicationFile` with the path to your Python application.
Optionally, the `pythonVersion` parameter can be used to set the major Python version of the docker image used
to run the driver and executor containers.
liyinan926 (Collaborator):

Please use "field .spec.pythonVersion".
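For orientation, here is a minimal sketch of how the fields discussed in this PR fit together in a SparkApplication spec. It simply combines the fragments quoted in the diff hunks of this conversation, so treat it as illustrative rather than authoritative:

```
spec:
  type: Python
  # Major Python version of the docker image; can be "2" or "3", default "2".
  pythonVersion: "2"
  mode: cluster
  image: "gcr.io/ynli-k8s/spark:v2.4.0-SNAPSHOT"
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pyfiles.py
```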

```
mainApplicationFile: local:///opt/spark/examples/src/main/python/pyfiles.py
```

Some PySpark applications need additional Python packages additionally to the main application resource to run.
liyinan926 (Collaborator):

additionally is duplicated and can be removed.


Some PySpark applications need additional Python packages additionally to the main application resource to run.
Such dependencies are specified using the `--py-files` option of `spark-submit` command.
liyinan926 (Collaborator):

I would rephrase this sentence as "Such dependencies are specified using the optional field .spec.deps.pyFiles, which translates to the --py-files option of the spark-submit command.".
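As an illustration of the field named above, a hedged sketch of declaring extra Python dependencies; the nesting under `spec.deps.pyFiles` follows the reviewer's wording, and the archive URL is a placeholder, not taken from this PR:

```
spec:
  deps:
    # Translates to the --py-files option of spark-submit.
    pyFiles:
      - "https://example.com/python-dep.zip"   # placeholder dependency
```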

can be used.

```
python_dep_file_path = SparkFiles.get("python-dep.zip")
```
liyinan926 (Collaborator):

Ditto. Using SparkFiles.get() only works for Spark 2.4.

docs/api.md Outdated
@@ -53,6 +54,7 @@ A `SparkApplicationSpec` has the following top-level fields:
| `NodeSelector` | `spark.kubernetes.node.selector.[labelKey]` | Node selector of the driver pod and executor pods, with key `labelKey` and value as the label's value. |
| `MaxSubmissionRetries` | N/A | The maximum number of times to retry a failed submission. |
| `SubmissionRetryInterval` | N/A | The unit of intervals in seconds between submission retries. Depending on the implementation, the actual interval between two submission retries may be a multiple of `SubmissionRetryInterval`, e.g., if linear or exponential backoff is used. |
| `MemoryOverheadFactor` | `spark.kubernetes.memoryOverheadFactor` | This sets the Memory Overhead Factor that will allocate memory to non-JVM memory. For JVM-based jobs this value will default to 0.10, for non-JVM jobs 0.40. |
liyinan926 (Collaborator):

It's worth mentioning that the value of this field will be overridden by Spec.Driver.MemoryOverhead and Spec.Executor.MemoryOverhead if they are set.
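To make the precedence concrete, a hedged sketch of the interaction described in this comment; the YAML key names (`memoryOverheadFactor`, `driver.memoryOverhead`, `executor.memoryOverhead`) are assumed from the Go fields and the api.md row quoted above, and the values are arbitrary examples:

```
spec:
  # Used to size non-JVM memory when no explicit overhead is set below.
  memoryOverheadFactor: "0.4"
  driver:
    memoryOverhead: "512m"   # if set, takes precedence over memoryOverheadFactor for the driver
  executor:
    memoryOverhead: "1g"     # if set, takes precedence over memoryOverheadFactor for executors
```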

liyinan926 (Collaborator) commented:
Please also make sure to run `go fmt ./...` to format the code.

mrow4a force-pushed the upstream_pyspark branch 2 times, most recently from cdc634d to fc5ae03 on July 27, 2018 08:20
mrow4a (Contributor, Author) commented Jul 27, 2018

@liyinan926 Addressed required changes.

```
spark.sparkContext.addPyFile(dep_file_path)
```

Note that Python binding for PySpark will available in Apache Spark 2.4,
liyinan926 (Collaborator):

will be available.


Note that Python binding for PySpark will available in Apache Spark 2.4,
and currently requires building custom 2.4.0-SNAPSHOT Docker image.
liyinan926 (Collaborator):

currently requires building a custom Docker image from the Spark master branch.

liyinan926 (Collaborator) left a review comment:

LGTM with minor comments.

mrow4a (Contributor, Author) commented Jul 27, 2018

@liyinan926 done

liyinan926 (Collaborator) commented:
Thanks! Will merge once I have the image gcr.io/ynli-k8s/spark:v2.4.0-SNAPSHOT pushed.

liyinan926 (Collaborator) commented:

@mrow4a Can you change the image to gcr.io/ynli-k8s/spark-py:v2.4.0-SNAPSHOT?

liyinan926 (Collaborator) commented:

@mrow4a No worries, I will update the image after merging. Thanks!

liyinan926 merged commit c21d0c6 into kubeflow:master on Jul 27, 2018
mrow4a deleted the upstream_pyspark branch on July 30, 2018 12:47
Development

Successfully merging this pull request may close these issues:

Make the operator work for PySpark in spark master
2 participants