
Add PySpark support to sparkctl and Spark Operator. #222

Merged
merged 3 commits into kubeflow:master from upstream_pyspark on Jul 27, 2018

Conversation

mrow4a (Contributor) commented Jul 19, 2018

Description

This PR adds support for PySpark, and thus closes #181.

@liyinan926
CC @prasanthkothuri

type: Python
pythonVersion: "2"
mode: cluster
image: "gcr.io/ynli-k8s/spark:v2.4.0-SNAPSHOT"
mrow4a (Contributor, Author):
@liyinan926 For now I have placed the SNAPSHOT tag of your image.

liyinan926 (Collaborator):

That's fine. I will make sure the image exists before merging this.

mrow4a (Contributor, Author):

Great!

// This sets the major Python version of the docker image used to run the driver and executor containers. Can either be 2 or 3, default 2.
// Optional.
PythonVersion *string `json:"pythonVersion,omitempty"`
// This sets the Memory Overhead Factor that will allocate memory to non-JVM memory.
liyinan926 (Collaborator):

It's worth mentioning that the value of this field will be overridden by `Spec.Driver.MemoryOverhead` and `Spec.Executor.MemoryOverhead` if they are set. Correspondingly, it's worth mentioning the same in the comments for `Spec.Driver.MemoryOverhead` and `Spec.Executor.MemoryOverhead` as well.

mrow4a (Contributor, Author):

Addressed

if len(localPyFiles) > 0 {
	uploadedPyFiles, err := uploadLocalDependencies(app, localPyFiles)
	if err != nil {
		return fmt.Errorf("failed to upload local files: %v", err)
liyinan926 (Collaborator):

s/local files/local pyfiles/.

can be used.

```
val absPathToFile = SparkFiles.get("data-file-1.txt")
```
liyinan926 (Collaborator):

This does not apply to Spark 2.3.x, in which the files are downloaded by the init container to where fileDownloadDir points to. In Spark 2.4, this is the right way.

### Python Support

Python support can be enabled by setting `mainApplicationFile` with the path to your Python application.
Optionally, the `pythonVersion` parameter can be used to set the major Python version of the docker image used
to run the driver and executor containers.
liyinan926 (Collaborator):

Please use "field .spec.pythonVersion".
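For orientation, here is a minimal sketch of how the fields discussed in this PR fit together in a SparkApplication spec. It simply combines the fragments quoted in the diff hunks of this conversation, so treat it as illustrative rather than authoritative:

```
spec:
  type: Python
  # Major Python version of the docker image; can be "2" or "3", default "2".
  pythonVersion: "2"
  mode: cluster
  image: "gcr.io/ynli-k8s/spark:v2.4.0-SNAPSHOT"
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pyfiles.py
```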

```
mainApplicationFile: local:///opt/spark/examples/src/main/python/pyfiles.py
```

Some PySpark applications need additional Python packages additionally to the main application resource to run.
liyinan926 (Collaborator):

additionally is duplicated and can be removed.


Some PySpark applications need additional Python packages additionally to the main application resource to run.
Such dependencies are specified using the `--py-files` option of `spark-submit` command.
liyinan926 (Collaborator):

I would rephrase this sentence as "Such dependencies are specified using the optional field .spec.deps.pyFiles, which translates to the --py-files option of the spark-submit command.".
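As an illustration of the field named above, a hedged sketch of declaring extra Python dependencies; the nesting under `spec.deps.pyFiles` follows the reviewer's wording, and the archive URL is a placeholder, not taken from this PR:

```
spec:
  deps:
    # Translates to the --py-files option of spark-submit.
    pyFiles:
      - "https://example.com/python-dep.zip"   # placeholder dependency
```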

can be used.

```
python_dep_file_path = SparkFiles.get("python-dep.zip")
```
liyinan926 (Collaborator):

Ditto. Using SparkFiles.get() only works for Spark 2.4.

docs/api.md Outdated
@@ -53,6 +54,7 @@ A `SparkApplicationSpec` has the following top-level fields:
| `NodeSelector` | `spark.kubernetes.node.selector.[labelKey]` | Node selector of the driver pod and executor pods, with key `labelKey` and value as the label's value. |
| `MaxSubmissionRetries` | N/A | The maximum number of times to retry a failed submission. |
| `SubmissionRetryInterval` | N/A | The unit of intervals in seconds between submission retries. Depending on the implementation, the actual interval between two submission retries may be a multiple of `SubmissionRetryInterval`, e.g., if linear or exponential backoff is used. |
| `MemoryOverheadFactor` | `spark.kubernetes.memoryOverheadFactor` | This sets the Memory Overhead Factor that will allocate memory to non-JVM memory. For JVM-based jobs this value will default to 0.10, for non-JVM jobs 0.40. |
liyinan926 (Collaborator):

It's worth mentioning that the value of this field will be overridden by Spec.Driver.MemoryOverhead and Spec.Executor.MemoryOverhead if they are set.
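To make the precedence concrete, a hedged sketch of the interaction described in this comment; the YAML key names (`memoryOverheadFactor`, `driver.memoryOverhead`, `executor.memoryOverhead`) are assumed from the Go fields and the api.md row quoted above, and the values are arbitrary examples:

```
spec:
  # Used to size non-JVM memory when no explicit overhead is set below.
  memoryOverheadFactor: "0.4"
  driver:
    memoryOverhead: "512m"   # if set, takes precedence over memoryOverheadFactor for the driver
  executor:
    memoryOverhead: "1g"     # if set, takes precedence over memoryOverheadFactor for executors
```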

liyinan926 (Collaborator) commented:
Please also make sure to run `go fmt ./...` to format the code.

mrow4a force-pushed the upstream_pyspark branch 2 times, most recently from cdc634d to fc5ae03 on July 27, 2018 08:20
mrow4a (Contributor, Author) commented Jul 27, 2018

@liyinan926 Addressed required changes.

```
spark.sparkContext.addPyFile(dep_file_path)
```

Note that Python binding for PySpark will available in Apache Spark 2.4,
liyinan926 (Collaborator):

will be available.


Note that Python binding for PySpark will available in Apache Spark 2.4,
and currently requires building custom 2.4.0-SNAPSHOT Docker image.
liyinan926 (Collaborator):

currently requires building a custom Docker image from the Spark master branch.

liyinan926 (Collaborator) left a review comment:

LGTM with minor comments.

mrow4a (Contributor, Author) commented Jul 27, 2018

@liyinan926 done

liyinan926 (Collaborator) commented:
Thanks! Will merge once I have the image gcr.io/ynli-k8s/spark:v2.4.0-SNAPSHOT pushed.

liyinan926 (Collaborator) commented:

@mrow4a Can you change the image to gcr.io/ynli-k8s/spark-py:v2.4.0-SNAPSHOT?

liyinan926 (Collaborator) commented:

@mrow4a No worries, I will update the image after merging. Thanks!

liyinan926 merged commit c21d0c6 into kubeflow:master on Jul 27, 2018
mrow4a deleted the upstream_pyspark branch on July 30, 2018 12:47
Development

Successfully merging this pull request may close these issues:

Make the operator work for PySpark in spark master
2 participants