
Initial documentation on design and API definition #68

Merged
merged 3 commits into kubeflow:master on Feb 19, 2018

Conversation

liyinan926
Collaborator

@liyinan926 commented Jan 31, 2018

@foxish @kow3ns @enisoc @ifilonenko @kimoonkim

#54. Opened this PR as a way of getting feedback on the CRD API definition for the upcoming alpha release. Regarding SparkPodSpec, I have a TODO item to look into v1.PodSpec and see if it should be used instead.

docs/design.md Outdated
@@ -0,0 +1,59 @@
# Spark Operation Design


Spark Operation or Spark Operator Design?

Collaborator Author

Hmm, typos. Should be Operator.

docs/design.md Outdated

### Controller, Submission Runner, and Spark Pod Monitor

The `SparkApplication` controller, or CRD controller for short, watches events of creation, updates, and deletion of `SparkApplication` objects in any namespace in a Kubernetes cluster, and acts on the watch events. When it receives a new `SparkApplication` object, it prepares a submission and sends the submission to the submission runner, which actually submits the application to run in the Kubernetes cluster. The submission includes the list of arguments for the `spark-submit` command. The submission runner has a configurable number of workers for submitting applications to run in the cluster.

To ensure level-triggered semantics, it should be possible to phrase your CRD controller's behavior in terms like, "For each SparkApplication object that exists, the controller will ..."

The current wording includes things like, "When it receives a new object, it sends a submission," which sounds like edge-triggered logic. In a level-triggered system, you simply observe that an object exists, and you have no idea whether it's the first time you've seen it or not. In the controller design, it's important to define how it will decide whether the observed SparkApplication was already submitted, for example. Or, will it re-submit on every sync, and some downstream component is responsible for deciding whether it was already submitted?

Collaborator Author

The current wording includes things like, "When it receives a new object, it sends a submission," which sounds like edge-triggered logic. In a level-triggered system, you simply observe that an object exists, and you have no idea whether it's the first time you've seen it or not.

The implementation actually submits whenever the AddFunc callback is called, though with re-sync disabled. It's a good point.

In the controller design, it's important to define how it will decide whether the observed SparkApplication was already submitted

Good point. Will give this more thought.

Collaborator Author

Gave this more thought. The current behavior is that the operator submits when a new SparkApplication object is added, i.e., when AddFunc is called. It also submits if re-submission is necessary, for example when the application should be restarted because of the user-specified RestartPolicy. I think the question that remains is how updates to the specifications of existing SparkApplication objects should be handled. Currently updates are no-ops.

Stepping back a bit, compared to workloads like deployments, daemonsets, or statefulsets, Spark applications are often batch-oriented and are meant to run to completion. So the desired state of a Spark application is simple: running to completion, with the driver pod eventually in the succeeded state. From a Spark perspective, updates to the specification of an application could mean a user intention to re-run the application. A decision needs to be made here regarding how to handle the instance of the application that is currently running when the update happens. However, due to the batch-oriented nature, updates do not necessarily mean an intention to re-run. But if this is put into the context of the operator and the declarative nature of the CRD, I think we can safely assume that an update to the specification of a SparkApplication object carries the user intention to re-run the application with the updated spec.

To track thoughts and work on handling updates, I created #83.

Collaborator Author

I have updated the design doc with thoughts and changes made to address some of the concerns in d92abcf. @enisoc PTAL.


![Architecture Diagram](architecture-diagram.png)

Specifically, a user uses the `sparkctl` (or `kubectl`) to create a `SparkApplication` object. The `SparkApplication` controller receives the object through a watcher from the API server, creates a submission carrying the `spark-submit` arguments, and sends the submission to the *submission runner*. The submission runner submits the application to run and creates the driver pod of the application. Upon starting, the driver pod creates the executor pods. While the application is running, the *Spark pod monitor* watches the pods of the application and sends status updates of the pods back to the controller, which then updates the status of the application accordingly.

When you talk about creating the driver Pod, and the driver creating executor Pods, are these really bare Pods? For example, what happens if a Pod gets preempted or drained?

Collaborator Author

They are bare pods. The Spark submission client (called by spark-submit) creates the driver pod, which in turn creates the executor pods. We (the Spark k8s community) haven't really given much thought/discussion to pod preemption AFAIK. But this is a good point. I will need to do some investigation.


I assume the driver would just make a replacement. It has a reconciliation loop, right?

Collaborator Author

No, there's no reconciliation loop in Spark on K8s itself. The submission client simply creates the driver pod and watches the pod if it is configured to wait for the application to complete. If the pod completes, fails, or gets deleted, the client simply terminates and reports the exit code of the driver pod if it's available. This issue actually falls into the scope of a high-availability driver, which is critical for Spark Streaming applications. Some people from Tencent addressed the failure case by using a K8s Job to run the driver. This is the only take on the issue that I am aware of.


Is supporting Spark Streaming and/or high availability a goal of this current work? If so, replacing deleted Pods is critical.

Even if not, Kubernetes users generally expect Pods to be disposable and automatically replaced if necessary, even for one-off batch work as you noted with the Job API. We would need a very good reason to not support this notion, especially for something we call an Operator (which should automate as much as possible anything a human would otherwise have to do).

In the context of batch workloads, this is becoming more critical with the upcoming introduction of priority and preemption. The level of Pod churn is going to increase significantly, and batch workloads are going to need to be resilient to Pod rescheduling in order to make progress, since they are inherently prime targets for preemption.

Collaborator Author

Yes, supporting restart with a configurable restart policy is a goal and a feature. Unlike pods in other types of workloads, particularly stateless workloads, the driver pod manages the executor pods and owns them (through OwnerReference on the executor pods by design). The Spark submission client is responsible for building and creating the driver pod. In this design, replacing the driver pod is done by invoking the submission client to re-create the driver pod.

docs/design.md Outdated

The `SparkApplication` controller, or CRD controller for short, watches events of creation, updates, and deletion of `SparkApplication` objects in any namespace in a Kubernetes cluster, and acts on the watch events. When it receives a new `SparkApplication` object, it prepares a submission and sends the submission to the submission runner, which actually submits the application to run in the Kubernetes cluster. The submission includes the list of arguments for the `spark-submit` command. The submission runner has a configurable number of workers for submitting applications to run in the cluster.

The controller is also responsible for updating the status of a `SparkApplication` object with the help of the Spark pod monitor, which watches Spark pods and updates the `SparkApplicationStatus` field of the corresponding `SparkApplication` objects based on the status of the pods. The Spark pod monitor watches events of creation, updates, and deletion of Spark pods, creates status update messages based on the status of the pods, and sends the messages to the controller to process. The controller maintains the collection of `SparkApplication` objects of running applications and updates their status based on the messages received.

The pattern used by existing controllers is to have a single function ("sync") that observes the current state, takes action, and updates status. Is there a reason you're deviating from that to what sounds like a decomposition of those steps into an asynchronous set of agents within the controller?

Collaborator Author

The thinking behind this was to achieve a clear separation of concerns: the CRD controller is responsible for creating submissions (the kind of action it takes) and updating status, whereas the submission runner handles actually submitting applications to run. Along the same lines, the pod monitor is responsible for tracking the status of pods. I'm going to take a closer look at the pattern used by the core controllers.
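
As a rough illustration of the submission runner idea (a configurable pool of workers running `spark-submit`), here is a minimal sketch; the `submission` struct and `runWorkers` function are invented for the example and are not the operator's actual code:

```go
package runner

import (
	"log"
	"os/exec"
)

// submission carries the spark-submit arguments prepared by the controller.
// The struct and field names are invented for this sketch.
type submission struct {
	appName string
	args    []string
}

// runWorkers starts the configured number of workers, each consuming
// submissions from the shared channel and invoking spark-submit.
func runWorkers(workers int, submissions <-chan submission) {
	for i := 0; i < workers; i++ {
		go func() {
			for s := range submissions {
				out, err := exec.Command("spark-submit", s.args...).CombinedOutput()
				if err != nil {
					log.Printf("submission of %s failed: %v: %s", s.appName, err, out)
				}
			}
		}()
	}
}
```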

docs/design.md Outdated

The controller is also responsible for updating the status of a `SparkApplication` object with the help of the Spark pod monitor, which watches Spark pods and updates the `SparkApplicationStatus` field of the corresponding `SparkApplication` objects based on the status of the pods. The Spark pod monitor watches events of creation, updates, and deletion of Spark pods, creates status update messages based on the status of the pods, and sends the messages to the controller to process. The controller maintains the collection of `SparkApplication` objects of running applications and updates their status based on the messages received.

As part of preparing a submission for a newly created `SparkApplication` object, the controller parses the object and adds configuration options for adding certain annotations to the driver and executor pods of the application. The annotations are later used by the Spark pod initializer to configure the pods before they start to run. For example, if a Spark application needs a certain Kubernetes ConfigMap to be mounted into the driver and executor pods, the controller adds an annotation that specifies the name of the ConfigMap to mount. Later the Spark pod initializer sees the annotation on the pods and mounts the ConfigMap into the pods.

Is there a reason this has to be done as an initializer (or mutating admission), rather than just propagating the information directly down the chain? For example, I would have expected that the SparkApplication controller could create a SparkDriver object, which contains the information necessary for the SparkDriver controller to create executor Pods that have all the needed ConfigMaps mounted to begin with.

Do you need to inject things indirectly because you don't have control over the driver code that's creating the executor Pods? If so, why?

Collaborator Author

Because the operator has no control over the driver/executor creation code in Spark itself. The operator simply calls the spark-submit command, which in turn calls the submission client in Spark that creates the driver pod. The Kubernetes mode of Spark supports a limited set of customization points for pods.

docs/api.md Outdated
# SparkApplication API

The Spark Operator uses a [CustomResourceDefinition](https://kubernetes.io/docs/concepts/api-extension/custom-resources/)
named `SparkApplication` for specifying Spark applications to be run in a Kubernetes cluster. Similarly to other kinds of Kubernetes resources, a `SparkApplication` consists of a specification in a `Spec` field of type `SparkApplicationSpec` and a `Status` field of type `SparkApplicationStatus`. The definition is organized in the following structure. The v1alpha1 version of the API definition is implemented [here](../pkg/apis/sparkoperator.k8s.io/v1alpha1/typpes.go).


typpes.go in the link looks like a typo with two p's.
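
For readers unfamiliar with CRD type definitions, the Spec/Status shape described above typically looks something like the following in Go. This is a simplified, illustrative sketch, not the contents of the linked types.go; only the two spec fields from the api.md table are shown:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// SparkApplication pairs a Spec (desired state) with a Status (observed
// state), like other Kubernetes resources.
type SparkApplication struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   SparkApplicationSpec   `json:"spec"`
	Status SparkApplicationStatus `json:"status,omitempty"`
}

// SparkApplicationSpec describes the application to submit. Only two fields
// from the api.md table are shown; names and types are simplified here.
type SparkApplicationSpec struct {
	Type string `json:"type"` // Java, Scala, Python, or R
	Mode string `json:"mode"` // cluster or client
}

// SparkApplicationStatus records the observed state of the application.
type SparkApplicationStatus struct {
	AppState string `json:"appState,omitempty"`
}
```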

docs/api.md Outdated
| Field | Spark configuration property or `spark-submit` option | Note |
| ------------- | ------------- | ------------- |
| `Type` | N/A | The type of the Spark application. Valid values are `Java`, `Scala`, `Python`, and `R`. |
| `Mode` | `--mode` | Spark deployment mode. Valid values are `clsuter` and `client`. |


Typo, clsuter.

docs/design.md Outdated
`SparkApplication` objects and acts on the watch events,
* a *submission runner* that runs `spark-submit` for submissions received from the controller,
* a *Spark pod monitor* that watches for Spark pods and sends pod status updates to the controller,
* a Spark pod [initializer](https://kubernetes.io/docs/admin/extensible-admission-controllers/#initializers)that performs initialization tasks on Spark driver and executor pods based on the annotations on the pods added by the controller,


Add a space before that performs.

@kimoonkim

@liyinan926 I took a glance at the docs and I like them so far. But I am not an expert on K8s operators or CRDs. I am interested in the points raised by @enisoc, so I'll keep my eyes on how we answer them. Thanks!


| Field | Spark configuration property or `spark-submit` option | Note |
| ------------- | ------------- | ------------- |
| `Type` | N/A | The type of the Spark application. Valid values are `Java`, `Scala`, `Python`, and `R`. |

Consider using jsonschema to validate things like this (e.g. enums).
Documentation is here:
https://github.com/kubernetes/community/blob/c3516b15302936570756ba4862506a24b168cc65/contributors/design-proposals/api-machinery/customresources-validation.md#examples

This feature is alpha in 1.8 and beta in 1.9.
If you create the CRD with jsonschema, it should do nothing in GKE 1.8 and have an effect on 1.9 clusters.

Collaborator Author

Thanks!
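
If the Go client is used to create the CRD, the enum validation suggested above could be attached roughly like this (a sketch using the apiextensions v1beta1 types; the `spec.type` property path and the function names are assumptions made for the example):

```go
package crd

import apiextv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"

// typeEnum restricts spec.type to the values listed in api.md. Enum values
// are raw JSON, hence the quoted strings.
func typeEnum() apiextv1beta1.JSONSchemaProps {
	var values []apiextv1beta1.JSON
	for _, v := range []string{`"Java"`, `"Scala"`, `"Python"`, `"R"`} {
		values = append(values, apiextv1beta1.JSON{Raw: []byte(v)})
	}
	return apiextv1beta1.JSONSchemaProps{Type: "string", Enum: values}
}

// validation builds the schema to set on the CRD's validation field. API
// servers where the feature is not enabled (e.g. 1.8) simply ignore it.
func validation() *apiextv1beta1.CustomResourceValidation {
	return &apiextv1beta1.CustomResourceValidation{
		OpenAPIV3Schema: &apiextv1beta1.JSONSchemaProps{
			Properties: map[string]apiextv1beta1.JSONSchemaProps{
				"spec": {
					Properties: map[string]apiextv1beta1.JSONSchemaProps{
						"type": typeEnum(),
					},
				},
			},
		},
	}
}
```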


![Architecture Diagram](architecture-diagram.png)

Specifically, a user uses the `sparkctl` (or `kubectl`) to create a `SparkApplication` object. The `SparkApplication` controller receives the object through a watcher from the API server, creates a submission carrying the `spark-submit` arguments, and sends the submission to the *submission runner*. The submission runner submits the application to run and creates the driver pod of the application. Upon starting, the driver pod creates the executor pods. While the application is running, the *Spark pod monitor* watches the pods of the application and sends status updates of the pods back to the controller, which then updates the status of the application accordingly.

Is supporting Spark Streaming and/or high availability a goal of this current work? If so, replacing deleted Pods is critical.

Even if not, Kubernetes users generally expect Pods to be disposable and automatically replaced if necessary, even for one-off batch work as you noted with the Job API. We would need a very good reason to not support this notion, especially for something we call an Operator (which should automate as much as possible anything a human would otherwise have to do).

In the context of batch workloads, this is becoming more critical with the upcoming introduction of priority and preemption. The level of Pod churn is going to increase significantly, and batch workloads are going to need to be resilient to Pod rescheduling in order to make progress, since they are inherently prime targets for preemption.


## The CRD Controller

The `SparkApplication` controller, or CRD controller for short, watches events of creation, updates, and deletion of `SparkApplication` objects in any namespace in a Kubernetes cluster, and acts on the watch events. When a new `SparkApplication` object is added (i.e., when the `AddFunc` callback function of the `ResourceEventHandlerFuncs` is called), it enqueues the object into an internal work queue, from which a worker picks it up, prepares a submission, and sends the submission to the submission runner, which actually submits the application to run in the Kubernetes cluster. The submission includes the list of arguments for the `spark-submit` command. The submission runner has a configurable number of workers for submitting applications to run in the cluster. When a `SparkApplication` object is deleted, the object is dequeued from the internal work queue and all the Kubernetes resources associated with the application get deleted or garbage collected.

I'm still concerned that this is relying on event handlers in order to function correctly. Kubernetes controllers should be able to pick up at any point and begin converging toward the desired state.

Consider this scenario:

  1. Your controller goes down for upgrade.
  2. A SparkApplication object is created.
  3. Your controller comes back up.

If you rely on getting the AddFunc callback to know that you haven't submitted a SparkApplication yet, you'll never converge toward the desired state (you'll never run spark-submit). Note that this scenario is just an example; many scenarios can cause you to miss events, which is why we don't rely on them (e.g. why we have informer resync period).

In order to be resilient, Kubernetes controllers are designed to be fully functional in the absence of any events. It must be possible to explain the design as:

  1. List all objects of type SparkApplication.
  2. For each SparkApplication that exists, list all objects that it depends on (Pods it owns, etc.).
  3. Compare observed state with desired state and decide what action to take, if any.

In this case, step 3 might include something like: check if a Job exists to run the spark-submit command, and create the Job if needed.

The event handlers are merely an optimization on top of this pattern to improve response time relative to polling.
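
For illustration, the level-triggered pattern described in this comment could be sketched roughly as follows; the `observer` interface, `syncAll`, and the helper method names are assumptions made up for the sketch, not existing operator code:

```go
package controller

import "log"

// sparkApp is a stand-in for the SparkApplication API type (illustrative).
type sparkApp struct{ Name string }

// observer abstracts the calls a level-triggered sync needs; all names here
// are assumptions for the sake of the sketch.
type observer interface {
	listSparkApplications() ([]sparkApp, error)
	driverPodExists(app sparkApp) (bool, error)
	enqueueSubmission(app sparkApp)
	updateStatus(app sparkApp)
}

// syncAll is one pass of a level-triggered reconcile: it works purely from
// the observed state and does not care which events (if any) triggered it.
func syncAll(c observer) {
	apps, err := c.listSparkApplications() // 1. list every SparkApplication
	if err != nil {
		log.Printf("listing SparkApplications: %v", err)
		return
	}
	for _, app := range apps {
		submitted, err := c.driverPodExists(app) // 2. observe dependent objects
		if err != nil {
			log.Printf("observing %s: %v", app.Name, err)
			continue
		}
		if !submitted { // 3. act only if the desired state is not yet met
			c.enqueueSubmission(app)
		}
		c.updateStatus(app)
	}
}
```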

Collaborator Author

Is the informer supposed to deliver events for all existing objects when the operator starts (or restarts)? That is what I observed when restarting the controller.
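
For context, client-go shared informers do deliver an Add callback for every object returned by the initial list when the informer starts, which matches that observation. A minimal sketch of the event-handler wiring the doc describes, using client-go's cache and workqueue packages (the function name and the decision to enqueue on every event are illustrative):

```go
package controller

import (
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// newQueueAndHandlers wires ResourceEventHandlerFuncs to a rate-limited work
// queue. On (re)start the informer's initial list is delivered through
// AddFunc for every existing object, so existing SparkApplications are
// enqueued as well.
func newQueueAndHandlers(informer cache.SharedIndexInformer) workqueue.RateLimitingInterface {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	enqueue := func(obj interface{}) {
		key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
		if err != nil {
			utilruntime.HandleError(err)
			return
		}
		queue.AddRateLimited(key)
	}
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(old, new interface{}) { enqueue(new) },
		DeleteFunc: enqueue,
	})
	return queue
}
```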

docs/design.md Outdated

The `SparkApplication` controller, or CRD controller for short, watches events of creation, updates, and deletion of `SparkApplication` objects in any namespace in a Kubernetes cluster, and acts on the watch events. When a new `SparkApplication` object is added (i.e., when the `AddFunc` callback function of the `ResourceEventHandlerFuncs` is called), it enqueues the object into an internal work queue, from which a worker picks it up, prepares a submission, and sends the submission to the submission runner, which actually submits the application to run in the Kubernetes cluster. The submission includes the list of arguments for the `spark-submit` command. The submission runner has a configurable number of workers for submitting applications to run in the cluster. When a `SparkApplication` object is deleted, the object is dequeued from the internal work queue and all the Kubernetes resources associated with the application get deleted or garbage collected.

When a `SparkApplication` object gets updated (i.e., when the `UpdateFunc` callback function of the `ResourceEventHandlerFuncs` is called), e.g., because the user applied an update with `kubectl apply`, the controller checks if the application specification in `SparkApplicationSpec` has changed. If the application specification remains the same, the controller simply ignores the update. Otherwise, if the update was made to the application specification, the controller cancels the current run of the application by deleting the driver pod of the current run, and submits a new run of the application with the updated specification. Note that deleting the driver pod of the old run of the application effectively kills the run and causes the executor pods to be deleted as well because the driver is the owner of the executor pods.

causes the executor pods to be deleted as well because the driver is the owner of the executor pods

Do you mean specifically that the executor Pods have an OwnerReference pointing to the driver Pod? Since Pods are considered disposable, it's generally dangerous to have Pods owning other Pods. All Pods should be owned by some controller. If the driver Pod itself is acting as a controller, there should be some CRD that represents "the abstract notion of a driver instance", which should have a lifetime that's independent of any Pod. This SparkDriver CRD would be the owner of the executor Pods.

Collaborator Author

Yes, and this is by design in Spark on K8s itself. When the Spark scheduler backend code running in the driver creates the executor pods, it adds OwnerReference to the executor pods that points to the driver pod. On top of that, the controller adds OwnerReference to the pods that points to the SparkApplication CRD object.
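
For illustration, such an owner reference on an executor pod would look roughly like this (a sketch; the actual wiring lives in the Spark scheduler backend and the controller):

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ownExecutorPod adds an OwnerReference to an executor pod pointing at the
// driver pod, so deleting the driver garbage-collects the executors.
func ownExecutorPod(executor, driver *corev1.Pod) {
	executor.OwnerReferences = append(executor.OwnerReferences, metav1.OwnerReference{
		APIVersion: "v1",
		Kind:       "Pod",
		Name:       driver.Name,
		UID:        driver.UID,
	})
}
```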

docs/design.md Outdated

When a `SparkApplication` object gets updated (i.e., when the `UpdateFunc` callback function of the `ResourceEventHandlerFuncs` is called), e.g., because the user applied an update with `kubectl apply`, the controller checks if the application specification in `SparkApplicationSpec` has changed. If the application specification remains the same, the controller simply ignores the update. Otherwise, if the update was made to the application specification, the controller cancels the current run of the application by deleting the driver pod of the current run, and submits a new run of the application with the updated specification. Note that deleting the driver pod of the old run of the application effectively kills the run and causes the executor pods to be deleted as well because the driver is the owner of the executor pods.

The controller is also responsible for updating the status of a `SparkApplication` object with the help of the Spark pod monitor, which watches Spark pods and updates the `SparkApplicationStatus` field of the corresponding `SparkApplication` objects based on the status of the pods. The Spark pod monitor watches events of creation, updates, and deletion of Spark pods, creates status update messages based on the status of the pods, and sends the messages to the controller to process. When the controller receives a status update message, it gets the corresponding `SparkApplication` object from the cache store and updates the `Status` accordingly.

Similar to above, Kubernetes controllers should be able to compute status without relying on any events. It should be possible to describe the design as:

  1. List all objects of type SparkApplication.
  2. For each SparkApplication that exists, list all objects that it depends on (Pods it owns, etc.).
  3. Compute a Status value based on the observed state.

Note that steps 1 and 2 are the same as for the reconcile loop described above. That's why controllers usually do both things in one loop: list everything, compute status, take action, repeat.

Collaborator Author

Please see reply above.

docs/design.md Outdated

The controller is also responsible for updating the status of a `SparkApplication` object with the help of the Spark pod monitor, which watches Spark pods and updates the `SparkApplicationStatus` field of the corresponding `SparkApplication` objects based on the status of the pods. The Spark pod monitor watches events of creation, updates, and deletion of Spark pods, creates status update messages based on the status of the pods, and sends the messages to the controller to process. When the controller receives a status update message, it gets the corresponding `SparkApplication` object from the cache store and updates the `Status` accordingly.

As described in [API Definition](api.md), the `Status` field (of type `SparkApplicationStatus`) records the overall state of the application as well as the state of each executor pod. Note that the overall state of an application is determined by the driver pod state, except when submission fails, in which case no driver pod gets launched. Particularly, the final application state is set to the termination state of the driver pod when applicable, i.e., `COMPLETED` if the driver pod completed or `FAILED` if the driver pod failed. If the driver pod gets deleted while running, the final application state is set to `FAILED`. If submission fails, the application state is set to `FAILED_SUBMISSION`.

Since the controller needs to be designed without any knowledge of events (to be resilient in a distributed system), you have to assume it's impossible to tell the difference between "my driver Pod never started" and "my driver Pod got deleted". You can either consider this a feature (if you delete the driver Pod, we'll pretend the submission never happened and re-submit it for you), or if avoiding that is desired, you'll have to persist the fact that the driver Pod did start somewhere other than the Pod object itself.
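
For illustration, the status derivation described in the quoted paragraph roughly amounts to a mapping like the following (a sketch; the `RUNNING` value for non-terminal phases is an assumption, while the other state names come from the doc):

```go
package controller

import corev1 "k8s.io/api/core/v1"

// applicationState derives the overall application state from the driver pod,
// following the rules in the design doc: the driver's terminal phase decides
// the final state, a missing/deleted driver counts as FAILED, and a failed
// submission (no driver pod ever launched) is FAILED_SUBMISSION.
func applicationState(driver *corev1.Pod, submissionFailed bool) string {
	switch {
	case submissionFailed:
		return "FAILED_SUBMISSION"
	case driver == nil:
		// Indistinguishable from "never started" without extra bookkeeping,
		// which is exactly the concern raised in the review comment above.
		return "FAILED"
	case driver.Status.Phase == corev1.PodSucceeded:
		return "COMPLETED"
	case driver.Status.Phase == corev1.PodFailed:
		return "FAILED"
	default:
		return "RUNNING" // non-terminal state name assumed for this sketch
	}
}
```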


As described in [API Definition](api.md), the `Status` field (of type `SparkApplicationStatus`) records the overall state of the application as well as the state of each executor pod. Note that the overall state of an application is determined by the driver pod state, except when submission fails, in which case no driver pod gets launched. Particularly, the final application state is set to the termination state of the driver pod when applicable, i.e., `COMPLETED` if the driver pod completed or `FAILED` if the driver pod failed. If the driver pod gets deleted while running, the final application state is set to `FAILED`. If submission fails, the application state is set to `FAILED_SUBMISSION`.

As part of preparing a submission for a newly created `SparkApplication` object, the controller parses the object and adds configuration options for adding certain annotations to the driver and executor pods of the application. The annotations are later used by the Spark pod initializer to configure the pods before they start to run. For example, if a Spark application needs a certain Kubernetes ConfigMap to be mounted into the driver and executor pods, the controller adds an annotation that specifies the name of the ConfigMap to mount. Later the Spark pod initializer sees the annotation on the pods and mounts the ConfigMap into the pods.

If the spark-submit command isn't flexible enough to let us pass this info in directly, maybe we should consider bypassing it and creating the driver Pod ourselves (managed by a controller). I think avoiding an initializer would be a big win for user experience, especially since initializers are alpha (and dangerous) and admission webhooks are a burden to maintain (and also dangerous) for such a simple use case. Dynamic admission is a big hammer that is intended for use cases that are so broad, it's impossible to fix things at the source (like, enforce this policy for all Pods that all users of the cluster create for any app).

I assume doing that would require reimplementing some things that spark-submit in k8s mode already does, but we should at least document the alternative and justify the compromise in design for time to market.

Collaborator Author

There's a lot of behind-the-scenes work that the submission client does and adds to the driver pod at run time, so bypassing the submission client is not really an option. That's also one of the major reasons the design chose to have the submission client re-create the driver pod, instead of having the operator re-create it directly, when a restart is necessary. I agree that both the initializer and the mutating webhook are dangerous. But the idea of using an initializer actually came from discussions within the community on extending the ability to customize the pods without introducing too many Spark configuration properties, as we already have a lot. PodPresets were considered as an option. Given that the initializer may or may not go to beta and people are encouraged to move to using a mutating webhook, I think that's the way to go.
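
To make the annotation-driven customization concrete, the mutation an initializer or webhook would apply could look roughly like this (a sketch; the annotation key and mount path are made up for the example):

```go
package webhook

import corev1 "k8s.io/api/core/v1"

// configMapAnnotation is a hypothetical annotation key the controller would
// set on driver/executor pods; the real key name may differ.
const configMapAnnotation = "sparkoperator.example.com/configmap"

// mountAnnotatedConfigMap adds a volume and per-container mount for the
// ConfigMap named in the annotation, if present.
func mountAnnotatedConfigMap(pod *corev1.Pod) {
	name, ok := pod.Annotations[configMapAnnotation]
	if !ok {
		return
	}
	pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{
		Name: name,
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: name},
			},
		},
	})
	for i := range pod.Spec.Containers {
		pod.Spec.Containers[i].VolumeMounts = append(pod.Spec.Containers[i].VolumeMounts,
			corev1.VolumeMount{Name: name, MountPath: "/etc/spark/" + name})
	}
}
```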


## Handling Retries of Failed Submissions

The submission of an application may fail for various reasons. Sometimes a submission may fail due to transient errors, and a retry may succeed. The Spark Operator supports retries of failed submissions through a combination of the `MaxSubmissionRetries` field of `SparkApplicationSpec` and the `SubmissionRetries` field of `SparkApplicationStatus` (see the [API Definition](api.md) for more details). When the operator decides to retry a failed submission, it simply enqueues the `SparkApplication` object of the application into the internal work queue, from which it gets picked up by a worker that will handle the submission.

How is info about previous attempts persisted in API objects so that it survives controller restart?

Collaborator Author

There's a MaxSubmissionRetries field in Spec and a SubmissionRetries field in Status.
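
In other words, because both counters live on the API object itself, the retry decision can be re-derived after a controller restart; roughly (a sketch, with field types assumed):

```go
package controller

// shouldRetrySubmission re-derives the retry decision purely from fields
// persisted on the SparkApplication object, so it survives a controller
// restart. The int32 types are assumed here.
func shouldRetrySubmission(submissionFailed bool, submissionRetries, maxSubmissionRetries int32) bool {
	return submissionFailed && submissionRetries < maxSubmissionRetries
}
```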

@liyinan926
Collaborator Author

@enisoc per our discussion offline, I created #90 to look into enabling cache re-sync and the implications of that. For bookkeeping, the other two concerns are: 1) lifecycle management of the CustomResourceDefinition of SparkApplication should be owned by the users instead of the controller, and particularly, the controller should not delete the CustomResourceDefinition when it's terminated; and 2) better alternatives to initializers and mutating webhooks for pod customization.

@liyinan926
Collaborator Author

Merging this now. #90 and #91 as follow-ups.

@liyinan926 liyinan926 merged commit 64626c8 into kubeflow:master Feb 19, 2018