Add PySpark support to sparkctl and Spark Operator. #222
Conversation
type: Python
pythonVersion: "2"
mode: cluster
image: "gcr.io/ynli-k8s/spark:v2.4.0-SNAPSHOT"
@liyinan926 For now I placed the SNAPSHOT tag of your image.
That's fine. I will make sure the image exists before merging this.
Great!
// image used to run the driver and executor containers. Can either be 2 or 3, default 2.
// Optional.
PythonVersion *string `json:"pythonVersion,omitempty"`
// This sets the Memory Overhead Factor that will allocate memory to non-JVM memory.
It's worth mentioning that the value of this field will be overridden by `Spec.Driver.MemoryOverhead` and `Spec.Executor.MemoryOverhead` if they are set. Correspondingly, it's worth mentioning the same in the comments for `Spec.Driver.MemoryOverhead` and `Spec.Executor.MemoryOverhead` as well.
Addressed
sparkctl/cmd/create.go
Outdated
if len(localPyFiles) > 0 {
	uploadedPyFiles, err := uploadLocalDependencies(app, localPyFiles)
	if err != nil {
		return fmt.Errorf("failed to upload local files: %v", err)
s/local files/local pyfiles/.
docs/user-guide.md
Outdated
can be used.

```
val absPathToFile = SparkFiles.get("data-file-1.txt")
This does not apply to Spark 2.3.x, in which the files are downloaded by the init container to where `fileDownloadDir` points to. In Spark 2.4, this is the right way.
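For illustration, a minimal PySpark sketch of the Spark 2.4 behavior described above; the file name `data-file-1.txt` comes from the docs excerpt, and the app name is made up:

```python
# Minimal sketch, assuming Spark 2.4+ and a data file shipped with the
# application (e.g. via spec.deps.files). File name taken from the docs above.
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-example").getOrCreate()

# SparkFiles.get resolves the absolute local path of a distributed file.
abs_path = SparkFiles.get("data-file-1.txt")
with open(abs_path) as f:
    print(f.read())

spark.stop()
```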
docs/user-guide.md
Outdated
### Python Support

Python support can be enabled by setting `mainApplicationFile` with path to your python application.
Optionally, `pythonVersion` parameter can be used to set the major Python version of the docker image used
Please use "field `.spec.pythonVersion`".
docs/user-guide.md
Outdated
mainApplicationFile: local:///opt/spark/examples/src/main/python/pyfiles.py
```

Some PySpark applications need additional Python packages additionally to the main application resource to run.
`additionally` is duplicated and can be removed.
docs/user-guide.md
Outdated
```

Some PySpark applications need additional Python packages additionally to the main application resource to run.
Such dependencies are specified using the `--py-files` option of `spark-submit` command.
I would rephrase this sentence as "Such dependencies are specified using the optional field `.spec.deps.pyFiles`, which translates to the `--py-files` option of the `spark-submit` command.".
docs/user-guide.md
Outdated
can be used.

```
python_dep_file_path = SparkFiles.get("python-dep.zip")
Ditto. Using `SparkFiles.get()` only works for Spark 2.4.
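Putting the two quoted snippets together, a minimal Spark 2.4-only sketch of loading a zipped dependency at runtime (file and variable names taken from the docs excerpts above):

```python
# Minimal sketch, Spark 2.4+ only: resolve a zip shipped via spec.deps.pyFiles
# and register it so executors can import the modules packaged inside.
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyfiles-zip-example").getOrCreate()

dep_file_path = SparkFiles.get("python-dep.zip")
spark.sparkContext.addPyFile(dep_file_path)
```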
docs/api.md
Outdated
@@ -53,6 +54,7 @@ A `SparkApplicationSpec` has the following top-level fields:
| `NodeSelector` | `spark.kubernetes.node.selector.[labelKey]` | Node selector of the driver pod and executor pods, with key `labelKey` and value as the label's value. |
| `MaxSubmissionRetries` | N/A | The maximum number of times to retry a failed submission. |
| `SubmissionRetryInterval` | N/A | The unit of intervals in seconds between submission retries. Depending on the implementation, the actual interval between two submission retries may be a multiple of `SubmissionRetryInterval`, e.g., if linear or exponential backoff is used. |
| `MemoryOverheadFactor` | `spark.kubernetes.memoryOverheadFactor` | This sets the Memory Overhead Factor that will allocate memory to non-JVM memory. For JVM-based jobs this value will default to 0.10, for non-JVM jobs 0.40. |
It's worth mentioning that the value of this field will be overridden by `Spec.Driver.MemoryOverhead` and `Spec.Executor.MemoryOverhead` if they are set.
Please also make sure to run
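For illustration, a rough sketch of how the factor plays out, assuming Spark's usual rule that the overhead is the larger of `factor * memory` and a 384 MiB floor:

```python
# Rough sketch; assumes overhead = max(factor * container_memory, 384 MiB).
def memory_overhead_mib(memory_mib: int, factor: float) -> int:
    return int(max(factor * memory_mib, 384))

print(memory_overhead_mib(4096, 0.10))  # JVM job:     409 MiB
print(memory_overhead_mib(4096, 0.40))  # non-JVM job: 1638 MiB
```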
Force-pushed from cdc634d to fc5ae03.
@liyinan926 Addressed required changes.
docs/user-guide.md
Outdated
spark.sparkContext.addPyFile(dep_file_path)
```

Note that Python binding for PySpark will available in Apache Spark 2.4,
will be available.
docs/user-guide.md
Outdated
```

Note that Python binding for PySpark will available in Apache Spark 2.4,
and currently requires building custom 2.4.0-SNAPSHOT Docker image.
currently requires building a custom Docker image from the Spark master branch.
LGTM with minor comments.
@liyinan926 done
Thanks! Will merge once I have the image
@mrow4a Can you change the image to
@mrow4a No worries, I will update the image after merging. Thanks!
Description
This PR adds support for PySpark, and thus closes #181.
@liyinan926
CC @prasanthkothuri