KEP-20, Part 1: Extend admission webhook and implement `Instance.PlanExecution` (#1188)

This is part one of the KEP-20 implementation. 
**tl;dr**:
- added a new `Instance.Spec.PlanExecution` field holding the name of the plan to run; it is populated either:
    - directly by the user through a manual plan trigger, or
    - indirectly by the instance admission webhook on parameter updates or upgrades
- the `PlanExecution` field is not yet honored by the instance controller; the latter still uses the old logic to decide what to do next. This will be covered in part two of the implementation.
- the admission webhook (`instance_admission.go`) was extended with a set of rules guarding against conflicts between directly and indirectly triggered plans, also in combination with upgrades. See this beautiful (!) [table](https://github.com/kudobuilder/kudo/pull/1188/files#diff-3ee7e3fb450b5a364e3d9139fb154ad2R78) and this [test](https://github.com/kudobuilder/kudo/pull/1188/files#diff-c3a36ee52a0157a756e794446d2f4e6bR67) for more details. If an instance update is invalid, an error (status code and error message) is returned to the user
- KEP-20 was extended to reflect the current implementation approach
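
For illustration, a minimal sketch of what a direct (manual) plan trigger looks like from a Go client. The `triggerPlan` helper, the instance coordinates, and the `backup` plan are illustrative, not part of this PR; the webhook semantics are as described above:

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/kudobuilder/kudo/pkg/apis/kudo/v1beta1"
)

// triggerPlan sketches a manual plan trigger: it sets the new
// Spec.PlanExecution field and lets the admission webhook decide
// whether the trigger is admissible.
func triggerPlan(ctx context.Context, c client.Client, namespace, name, plan string) error {
	instance := &v1beta1.Instance{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, instance); err != nil {
		return err
	}
	instance.Spec.PlanExecution = v1beta1.PlanExecution{PlanName: plan}

	// If the trigger conflicts with a running plan, the webhook rejects the
	// update and the error (status code + message) surfaces right here.
	return c.Update(ctx, instance)
}
```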

Co-authored-by: alenkacz <avarkockova@mesosphere.com>
Aleksey Dukhovniy and alenkacz committed Jan 28, 2020
1 parent cf56876 commit 0ed9632
Showing 20 changed files with 1,076 additions and 339 deletions.
5 changes: 3 additions & 2 deletions cmd/manager/main.go
@@ -38,6 +38,7 @@ import (
	"github.com/kudobuilder/kudo/pkg/controller/operator"
	"github.com/kudobuilder/kudo/pkg/controller/operatorversion"
	"github.com/kudobuilder/kudo/pkg/version"
+	kudohook "github.com/kudobuilder/kudo/pkg/webhook"
)

// parseSyncPeriod determines the minimum frequency at which watched resources are reconciled.
@@ -125,7 +126,7 @@ func main() {
	if strings.ToLower(os.Getenv("ENABLE_WEBHOOKS")) == "true" {
		log.Printf("Setting up webhooks")

-		if err := registerWebhook("/validate", &v1beta1.Instance{}, &webhook.Admission{Handler: &v1beta1.InstanceValidator{}}, mgr); err != nil {
+		if err := registerWebhook("/validate", &v1beta1.Instance{}, &webhook.Admission{Handler: &kudohook.InstanceAdmission{}}, mgr); err != nil {
			log.Printf("unable to create instance validation webhook: %v", err)
			os.Exit(1)
		}
@@ -143,7 +144,7 @@

// registerWebhook method registers the passed webhook using a given prefix (e.g. "/validate") and runtime object
// (e.g. v1beta1.Instance) to generate a webhook path, e.g. "/validate-kudo-dev-v1beta1-instances". The webhook
-// has to implement http.Handler interface (see v1beta1.InstanceValidator for an example)
+// has to implement the http.Handler interface (see v1beta1.InstanceAdmission for an example)
func registerWebhook(prefix string, obj runtime.Object, hook http.Handler, mgr manager.Manager) error {
	path, err := webhookPath(prefix, obj, mgr)
	if err != nil {
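For context, a hypothetical minimal shape of such a handler (the actual `InstanceAdmission` in `pkg/webhook` is more involved): controller-runtime's `webhook.Admission` wraps an `admission.Handler` and implements `http.Handler` for the registration above.

```go
package webhook

import (
	"context"
	"net/http"

	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"

	"github.com/kudobuilder/kudo/pkg/apis/kudo/v1beta1"
)

// InstanceAdmission sketch: decodes the Instance and validates plan triggers.
type InstanceAdmission struct {
	decoder *admission.Decoder
}

func (ia *InstanceAdmission) Handle(ctx context.Context, req admission.Request) admission.Response {
	instance := &v1beta1.Instance{}
	if err := ia.decoder.Decode(req, instance); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}
	// ...apply the trigger rules from the table referenced above...
	return admission.Allowed("")
}

// InjectDecoder is called by the manager when the webhook is registered.
func (ia *InstanceAdmission) InjectDecoder(d *admission.Decoder) error {
	ia.decoder = d
	return nil
}
```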
18 changes: 18 additions & 0 deletions config/crds/kudo.dev_instances.yaml
@@ -77,6 +77,24 @@ spec:
              additionalProperties:
                type: string
              type: object
+            planExecution:
+              description: 'There are two ways a plan execution can be triggered: 1)
+                indirectly, through an update of a corresponding parameter in the
+                InstanceSpec.Parameters map 2) directly, through setting of the
+                InstanceSpec.PlanExecution field. While indirect (1) triggers happen
+                every time a user changes a parameter, a directly (2) triggered plan
+                is reserved for situations when parameters don''t change, e.g. a
+                periodic backup is triggered, overriding the existing backup file.
+                Additionally, this opens room for canceling and overriding currently
+                running plans in the future. Note: the PlanExecution field defines
+                the plan name and corresponding parameters of the plan that IS CURRENTLY
+                being executed. Once the instance controller (IC) is done with the
+                execution, this field will be cleared.'
+              properties:
+                planName:
+                  type: string
+              required:
+              - planName
+              type: object
          type: object
        status:
          description: InstanceStatus defines the observed state of Instance
106 changes: 65 additions & 41 deletions keps/0020-manual-plan-execution.md
@@ -3,10 +3,12 @@ kep-number: 20
title: Manual plan execution
authors:
- "@alenkacz"
- "@zen-dog"
owners:
- "@alenkacz"
- "@zen-dog"
creation-date: 2019-11-08
-last-updated: 2019-11-08
+last-updated: 2020-01-20
status: provisional
see-also:
- KEP-18
@@ -26,7 +28,7 @@ see-also:
  * [Update the status of Instance with name of the plan you want to run, let controller pick it up](#update-the-status-of-instance-with-name-of-the-plan-you-want-to-run-let-controller-pick-it-up)
    * [Pros](#pros)
    * [Cons](#cons)
-  * [Expose HTTP API from controller to allow receiving commands](#expose-http-api-from-controller-to-allow-receiving-commands)
+  * [Expose custom HTTP API from controller-manager to allow receiving commands](#expose-custom-http-api-from-controller-manager-to-allow-receiving-commands)
    * [Pros](#pros-1)
    * [Cons](#cons-1)
  * [Have a separate CRD just for maintaining commands to execute a plan](#have-a-separate-crd-just-for-maintaining-commands-to-execute-a-plan)
@@ -35,11 +37,11 @@ see-also:
  * [Introduce new sub-resource for triggering plan executions](#introduce-new-sub-resource-for-triggering-plan-executions)
    * [Pros](#pros-3)
    * [Cons](#cons-3)
+  * [Introducing a new field in the Instance.Spec](#introducing-a-new-field-in-the-instancespec)
+    * [Pros](#pros-4)
+    * [Cons](#cons-4)
  * [User Stories](#user-stories)
-    * [As an operator (person) I want to be able to run a custom plan on my KUDO operator in order to trigger non-default plan](#as-an-operator-person-i-want-to-be-able-to-run-a-custom-plan-on-my-kudo-operator-in-order-to-trigger-non-default-plan)
-    * [As a stateful service operator (person) I want to be able to run backup on my stateful service when a backup plan is defined.](#as-a-stateful-service-operator-person-i-want-to-be-able-to-run-backup-on-my-stateful-service-when-a-backup-plan-is-defined)
-  * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional)
-  * [Risks and Mitigations](#risks-and-mitigations)
+  * [The Solution](#the-solution)

[Tools for generating]: https://github.com/ekalinin/github-markdown-toc

@@ -70,26 +72,26 @@ This is by far the simplest solution but probably not the one with cleanest design
newStatus := instance.Status.DeepCopy()
// TODO planStatus now contains status of that plan we want to run
// for implementation details of that look at instance.StartPlanExecution
-planStatus.Status = ExecutionPending
-planStatus.UID = uuid.NewUUID()
+newStatus.Status = ExecutionPending
+newStatus.UID = uuid.NewUUID()

instance.Status = newStatus
client.Status().Update(context.TODO(), instance)
```

This also relies on the fact that on the server side we have an admission webhook that won't allow setting a status like that when another plan is running. Such an update would be rejected.

-Although it’s very easy to do this, it’s not very kubernetes idiomatic way of doing things especially because Status should never be updated from a client and it should just capture the state of the object.
+Although it’s very easy to do this, it’s not a very Kubernetes-idiomatic way of doing things, especially because Status should never be updated by a client; it should just capture the state of the object.

-For some background, including definition of status subresource by [API conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status): `The status summarizes the current state of the object in the system, and is usually persisted with the object by an automated processes but may be generated on the fly. At some cost and perhaps some temporary degradation in behavior, the status could be reconstructed by observation if it were lost.`
+For some background, including the definition of the status sub-resource by the [API conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status): `The status summarizes the current state of the object in the system, and is usually persisted with the object by an automated processes but may be generated on the fly. At some cost and perhaps some temporary degradation in behavior, the status could be reconstructed by observation if it were lost.`

#### Pros

-Very easy to implement considering the current state of the code
+- Easy to implement considering the current state of the code

#### Cons

-No way to start a plan with just pure kubectl - this is a reply from sig-cli channel where I posed that question (`It doesn't make sense to update the status manually, it should reflect the current state in the cluster. So no, you cannot set it though kubectl`)
+- No way to start a plan with just pure kubectl - this is a reply from the sig-cli channel where I posed that question (`It doesn't make sense to update the status manually, it should reflect the current state in the cluster. So no, you cannot set it though kubectl`)

### Expose custom HTTP API from controller-manager to allow receiving commands

@@ -101,14 +103,14 @@ The validation whether a plan can be started or not would have to be part of the

#### Pros

-The complexity of the implementation is still pretty managable and lower than having a full-featured sub-resource
+- The complexity of the implementation is still pretty manageable and lower than having a full-featured sub-resource

#### Cons

-Have to maintain a separate API (with API documentation)
-It is a non-kubernetes API, so no clients for various languages and no kubectl support (although if we keep the API that simple, almost a CURL can do the trick)
-More complex manager pod deployment with additional service
-How do we manage RBAC?
+- Have to maintain a separate API (with API documentation)
+- It is a non-Kubernetes API, so no clients for various languages and no kubectl support (although if we keep the API that simple, plain cURL can almost do the trick)
+- More complex manager pod deployment with an additional service
+- How do we manage RBAC?

### Have a separate CRD just for maintaining commands to execute a plan

@@ -128,53 +130,75 @@ Status:
  Status: Accepted # one of Accepted, Rejected, Cancelled (added in the future)
```

-PlanExecutionRequest will also contain status subresource but it will be a status reflecting whether the request was accepted or not, it should in no way be a duplication of `Instance` status as that would move us to pre-KEP-18 situation where updating status in two separate CRDs caused some incosistencies as that cannot be done in a transactional way.
+`PlanExecutionRequest` will also contain a status sub-resource, but it will be a status reflecting whether the request was accepted or not; it should in no way be a duplication of the `Instance` status, as that would move us back to the pre-KEP-18 situation where updating the status in two separate CRDs caused inconsistencies, since that cannot be done in a transactional way.

-But at the same time this seems to be the way people are leaning toward when coming to similar use-cases. The discussion under [this issue](https://github.com/kubernetes/kubernetes/issues/72637#issuecomment-515566586) is kind of touching on the topic of Request objects. What this [KEP](https://github.com/metal3-io/metal3-docs/pull/48/files) proposes for reboot is also kind of similar to what we’re trying to do. Alongside what they’re proposing we might not even need a controller for this CRD and we could treat it as “queue” meaning every request will be fulfilled so e.g. when you create a request and a plan is running, instance will pick up the next request when it has a capacity to run another plan.
+But at the same time, this seems to be the way people are leaning toward for similar use-cases. The discussion under [this issue](https://github.com/kubernetes/kubernetes/issues/72637#issuecomment-515566586) touches on the topic of Request objects. What this [KEP](https://github.com/metal3-io/metal3-docs/pull/48/files) proposes for reboot is also similar to what we’re trying to do. Along the lines of what they’re proposing, we might not even need a controller for this CR and could treat it as a “queue”, meaning every request will be fulfilled, so e.g. when you create a request while a plan is running, the instance will pick up the next request when it has capacity to run another plan.

The implementation will have the following properties:
-- there is no controller for PlanExecutionRequest
-- you cannot tell status of the execution just from the PlanExecutionRequest object
-- PlanExecutionRequest is managed by the `InstanceController`. We might keep N last PERs for audit purposes, and delete old PERs eventually. All PERs will be GCed on Instance deletion.

+- there is no controller for `PlanExecutionRequest`s
+- you cannot tell status of the execution just from the `PlanExecutionRequest` object
+- `PlanExecutionRequest` is managed by the `InstanceController`. We might keep N last PERs for audit purposes, and delete old PERs eventually. All PERs will be GCed on Instance deletion.

#### Pros

-Kubernetes-native way of API for executing plans, can be easily done via kubectl or any other kubernetes client
-Seems to currently be the de-facto pattern to do these things before sub-resources via webhooks are introduced for CRDs.
+- Kubernetes-native API for executing plans, can easily be done via kubectl or any other Kubernetes client
+- Currently seems to be the de-facto pattern for doing these things until custom sub-resources for CRDs are introduced.

#### Cons

-Challenging to communicate back to the user if execution was possible and how it happened (would need a good support for this in CLI)
-Have to maintain another CRD
-Need to think about garbage collection of the Request objects
+General:
+- Challenging to communicate back to the user if the execution was possible and how it happened (would need good support for this in the CLI)
+- Maintaining another CRD
+- Garbage collection of the Request objects

+Have to maintain a request queue. This is trickier than it looks at first glance:
+- having a Request sequence at all is counter-intuitive: e.g. an operator will have to check what's already in the queue before scheduling a new Request. If the current plan is stuck in execution (due to an error, insufficient cluster resources or otherwise), an operator would likely want to correct the existing plan instead of scheduling a new one. This will require an additional layer of API complexity (a scheduling call vs. overriding the current one)
+- not all Request sequences are meaningful, e.g. an Upgrade followed by a Request for a non-existing, renamed or otherwise outdated plan can't be executed, and there is no meaningful way of telling the user why the Request was not handled (except for logs or events)

+Cross-resource consistency:
+This is a fundamental limitation of Kubernetes (or rather, it is inherited from etcd): if a controller has to read/write two resources to make a decision, there is always room for inconsistencies. And while there are strategies for dealing with said inconsistencies, most of which fall into the "eventually-consistent" category, this is still not trivial. While an Instance can always be updated (triggering plans) and upgraded directly, a user can always manually create Requests in parallel. There is always ambiguity about what is handled first, and a problem of no immediate feedback to the user. And webhooks cannot help here, because a webhook can only meaningfully guard **one** resource, not multiple.

+**tl;dr**: We believe that a queue is not a meaningful paradigm when working with data services (or other complex applications). Each plan execution has a big failure potential, and scheduling a series of such Requests complicates dealing with such systems unnecessarily. Dealing with cross-resource inconsistencies and the lack of immediate feedback just adds insult to injury.

### Introduce new sub-resource for triggering plan executions

-The ideal solution for this would be if custom sub-resources for CRDs are supported. That is not the case right now, see discussion under [this issue](https://github.com/kubernetes/kubernetes/issues/72637).
+Another solution to this problem would be Kubernetes support for CR sub-resources. That is not the case right now; see the discussion under [this issue](https://github.com/kubernetes/kubernetes/issues/72637).

-The only solution here is to switch to a custom aggregated API server. That means:
-Running api server deployment
-Moving instance CRD to be managed by this extension api server (breaking change)
-Running another controller-manager (? could we use the same one ?)
+In the absence of CR sub-resource support, the only way of implementing it is to switch to a custom aggregated API server. That means:
+- Running an API server deployment
+- Moving the instance CRD to be managed by this extension API server (breaking change)
+- Running another controller-manager (could we use the same one?)
TODO: is this supported in kubectl somehow?

#### Pros

-More kubernetes native way of implementing such API with build in RBAC
+- A more Kubernetes-native way of implementing such an API, with built-in RBAC

#### Cons

-Lots of added complexity in terms of deployment and debugging
-Does not look like people are really doing this that much (they usually rather find a way how to make their use case work with plain CRDs)
+- Lots of added complexity in terms of deployment and debugging
+- It does not look like people are really doing this much (they usually find a way to make their use case work with plain CRDs)

-### User Stories

-#### As an operator (person) I want to be able to run a custom plan on my KUDO operator in order to trigger non-default plan
+### Introducing a new field in the Instance.Spec

+- This is very similar to the sub-resource solution. Same pros and none of the API-server-related cons.

-#### As a stateful service operator (person) I want to be able to run backup on my stateful service when a backup plan is defined.
+#### Pros

+- Technically the simplest way of implementing the feature
+- We can guard against the inconsistencies of plan triggering in the Instance webhook
+- Immediate feedback for the CLI when e.g. a plan cannot be started, since the webhook can decline requests with a status code and an error message

+#### Cons
+- Growing `Instance.Spec` and some denormalization of the data model

+### User Stories

-### Implementation Details/Notes/Constraints [optional]
+- As an operator (person) I want to be able to run a custom plan on my KUDO operator in order to trigger a non-default plan
+- As a stateful service operator (person) I want to be able to run a backup on my stateful service when a backup plan is defined.

-TODO: choose a solution
+### The Solution

-### Risks and Mitigations
+After lots of thought experiments and prototyping around potential solutions, we decided to move forward with introducing a new field in the `Instance.Spec`. The pros of keeping parameter updates, upgrades and directly triggered plans consistent and providing immediate user feedback far outweigh the data-model denormalization cons. Should Kubernetes implement support for custom sub-resources, we might switch to that in the future.
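
To make the webhook guarantee concrete, here is a minimal, hypothetical sketch of the kind of rule the admission webhook enforces. The real rule table in `instance_admission.go` also covers parameter updates and upgrades; the function and its name below are illustrative only:

```go
package webhook

import (
	"fmt"

	"github.com/kudobuilder/kudo/pkg/apis/kudo/v1beta1"
)

// validateTrigger is a simplified, hypothetical version of the webhook rules:
// at most one plan may be triggered at a time. A rejection is returned to the
// user immediately as an admission error (status code + message).
func validateTrigger(old, new *v1beta1.Instance) error {
	oldPlan := old.Spec.PlanExecution.PlanName
	newPlan := new.Spec.PlanExecution.PlanName

	if oldPlan != "" && newPlan != "" && oldPlan != newPlan {
		return fmt.Errorf("plan %q is currently running, can not trigger plan %q", oldPlan, newPlan)
	}
	return nil
}
```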
17 changes: 17 additions & 0 deletions pkg/apis/kudo/v1beta1/instance_types.go
@@ -29,6 +29,23 @@ type InstanceSpec struct {
	OperatorVersion corev1.ObjectReference `json:"operatorVersion,omitempty"`

	Parameters map[string]string `json:"parameters,omitempty"`

+	PlanExecution PlanExecution `json:"planExecution,omitempty"`
}

+// There are two ways a plan execution can be triggered:
+//  1) indirectly, through an update of a corresponding parameter in the InstanceSpec.Parameters map
+//  2) directly, through setting of the InstanceSpec.PlanExecution field
+// While indirect (1) triggers happen every time a user changes a parameter, a directly (2) triggered
+// plan is reserved for situations when parameters don't change, e.g. a periodic backup is triggered,
+// overriding the existing backup file. Additionally, this opens room for canceling and overriding
+// currently running plans in the future.
+// Note: the PlanExecution field defines the plan name and corresponding parameters of the plan that
+// IS CURRENTLY being executed. Once the instance controller (IC) is done with the execution, this
+// field will be cleared.
+type PlanExecution struct {
+	PlanName string `json:"planName" validate:"required"`
+
+	// Future PE options like Force: bool. Not needed for now
+}

// InstanceStatus defines the observed state of Instance
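
Looking ahead to part two (not part of this commit; the instance controller still ignores the field here), the comment above implies clearing semantics roughly along these lines. A hypothetical sketch, with an illustrative helper name:

```go
package instance

import (
	"github.com/kudobuilder/kudo/pkg/apis/kudo/v1beta1"
)

// clearFinishedPlanExecution sketches the expected part-two behavior: once
// the triggered plan reaches a terminal state, the instance controller clears
// Spec.PlanExecution so the webhook can admit the next trigger.
func clearFinishedPlanExecution(instance *v1beta1.Instance, planFinished bool) {
	if planFinished && instance.Spec.PlanExecution.PlanName != "" {
		instance.Spec.PlanExecution = v1beta1.PlanExecution{}
	}
}
```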
