KEP-0018 Controller Redesign #830
Conversation
I left a few minor nits/suggestions. My biggest pain point is `lastAppliedInstanceSpec` being part of `Instance.Status`. But I don't have a better solution for this, so 🤷. Overall it is certainly a big step towards a more robust design.
keps/0018-controller-overhaul.md
Outdated
```yaml
state: IN_PROGRESS
activePlan: deploy-1478631057 # (will be null if no plan is run right now)
lastAppliedInstanceSpec:
  version: 123 # probably some version from metadata?
```
Do you mean the Kubernetes `resourceVersion`? I think we should introduce our own counter that is increased by the controller on each Instance update.
Yeah, I don't have a clear idea right now what that version should be, or whether we even need it. Maybe @gerred will have an opinion.
We should look at the server-side apply APIs for this. If we implement it on our CRD now, we might be able to delete a bunch of code later once that is on by default.
keps/0018-controller-overhaul.md
Outdated
```yaml
state: IN_PROGRESS
delete: false
- upgrade: null # (never run)
state: IN_PROGRESS
```
wdyt about using `aggregatedStatus` to express the fact that this is indeed aggregated from `status.planStatus`:
Suggested change:
```yaml
aggregatedStatus:
  - planStatus: IN_PROGRESS
activePlan: deploy-1478631057 # (will be null if no plan is run right now)
```
P.S. Also, why do we use `state` and not `status` in plans/phases?
Yeah, it should be `status`. Changed.
keps/0018-controller-overhaul.md
Outdated
`planStatus` is a property that basically replaces the current PlanExecution CRD - it reports the status of the plans that are currently running, or the last runs of all the plans available for that operator version. This is also what `kudo plan history` and `kudo plan status` will query to get an overview of the plans.

`lastAppliedInstanceSpec` persists the instance spec from the previous successfully finished deploy/upgrade/update plan. We need this to be able to solve flaw n.1 described in the summary. It gets updated after a plan succeeds.
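For readers following along, the fragments quoted throughout this review roughly assemble into the following `Instance.Status` shape. This is a sketch only: the nested `parameters` payload is an illustrative assumption, and the top-level field names are still under discussion in this thread.

```yaml
status:
  status: IN_PROGRESS                # aggregated from planStatus below
  activePlan: deploy-1478631057      # (will be null if no plan is run right now)
  planStatus:
    - deploy:
        status: IN_PROGRESS
    - upgrade: null                  # (never run)
  lastAppliedInstanceSpec:           # spec from the last successful deploy/upgrade/update
    version: 123                     # probably some version from metadata?
    spec:
      parameters:                    # illustrative parameter payload
        BROKER_COUNT: "3"
```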
I'm not happy about `lastAppliedInstanceSpec` being part of the `Instance.Status` field, but then again: it is certainly not `metadata` or `spec`, and I don't see a better place for it, so 🤷.
Do we know of anybody who has to keep a CRD history? And if yes, how do they solve it?
`apply` does this now on the client for all resources; server-side apply semantics will bring this to the server. Check out this managed-fields section @alenkacz @zen-dog:
https://kubernetes.io/docs/reference/using-api/api-concepts/#server-side-apply
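For context, server-side apply tracks field ownership in `metadata.managedFields`. A minimal sketch of the shape, per the linked docs (the manager name and API group/version here are illustrative assumptions, not KUDO's final values):

```yaml
metadata:
  managedFields:
    - manager: kudo-manager          # illustrative manager name
      operation: Apply
      apiVersion: kudo.dev/v1beta1   # assumed group/version
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:parameters:
            f:BROKER_COUNT: {}
```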
keps/0018-controller-overhaul.md
Outdated
### Admission webhook

Part of the solution (addressing problem n.3 from the summary) is an admission webhook. It will guard the Instance CRD and disallow any changes to the spec while a plan is running.
Is the admission webhook part of the refactoring (1), or does the refactoring allow us to add the webhook later (2), solving the consistency issues? If (2) is the case, then we should probably mention it as an optional goal that can be tackled in a separate effort.
Good point, I've marked it as a "stretch goal" in the following paragraph.
Co-Authored-By: Aleksey Dukhovniy <alex.dukhovniy@googlemail.com>
Part of the solution (addressing problem n.3 from the summary - ensuring atomicity) is an admission webhook. It will guard the Instance CRD and disallow any changes to the spec while a plan is running. In the Kubernetes world, an admission webhook is the only way to prevent a resource from being updated - all subsequent filters (like controller predicates) are called AFTER the change was applied, so it is too late to validate there.

Although the admission webhook addresses one of the problems outlined in this KEP, it is considered a stretch goal and can be delivered in a following release (it should NOT be part of the initial refactoring).
I think I like this. I don't think there's a way to actually make this conflict-free without some sort of lock, so this is a good solution by me.
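As a sketch of what the stretch goal could look like on the registration side, a `ValidatingWebhookConfiguration` intercepting Instance updates might be wired up roughly like this. All names, the service, the path, and the API group/version are assumptions for illustration; the actual deny/allow logic would live in the manager and reject spec changes while a plan is active:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kudo-instance-admission          # illustrative name
webhooks:
  - name: instance-admission.kudo.dev
    rules:
      - apiGroups: ["kudo.dev"]          # assumed API group
        apiVersions: ["v1beta1"]         # assumed version
        operations: ["UPDATE"]
        resources: ["instances"]
    clientConfig:
      service:
        name: kudo-controller-manager    # illustrative service name
        namespace: kudo-system
        path: /validate-kudo-dev-v1beta1-instance
    failurePolicy: Fail                  # reject updates if the webhook is unavailable
    sideEffects: None
    admissionReviewVersions: ["v1"]
```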
- overall, aim to follow controller-runtime best practices as much as possible
- implementing this will be a breaking change, meaning that KUDO on existing clusters will have to be reinstalled to work (CRDs dropped and recreated)
- temporarily we won't be able to execute manual jobs until we agree on a design there (none of the current operators use it anyway, so this should be fine for one release)
I think this is a great opportunity to put together a state-flow diagram that goes inline with this KEP, so that we have an agreed notion of how this controller should work. It would be great for the documentation and for discussing the core controller loop as well.
Love that idea! I'll work on it
If you want to put your diagram into the KEP, http://asciiflow.com/ can be your friend ;)
1. When the KUDO manager is down (or restarted), we might miss an update on a CRD, meaning we won't execute the plan that was supposed to run, or we execute an incorrect one ([issue](https://github.com/kudobuilder/kudo/issues/422))
2. Multiple plans can seemingly be in progress, leading to a misleading status being communicated ([issue](https://github.com/kudobuilder/kudo/issues/628))
3. No way to ensure atomicity of plan execution (given the current CRD design, where information is spread across several CRDs)
4. Very low test coverage and overall confidence in the code
Are there any linters/metrics we can put together? Not as something we have to live by, but just to give us a general sense, in data, of whether or not we are improving. Maybe a separate KEP, but I don't think we need to institutionalize anything; just collect data as we go, even if we're running it manually.
You mean around coverage? Overall I don't like measuring code coverage; getting that number higher doesn't always improve things. Right now there are 0 unit tests, and my goal was to go above that.
But I know that the claim I made here sounds very generic, and it would be good to have something objective that would mean it's good enough... I am open to suggestions here :)
Nah, not coverage. I'm thinking of simple gut checks we can run, like gocyclo and a few other stats. Basically, we should get a general sense that we are improving those main functions by driving the different sources of complexity down. Totally agree that code coverage on its own doesn't really tell us anything.
keps/0018-controller-overhaul.md
Outdated
```yaml
activePlan: deploy-1478631057 # (will be null if no plan is run right now)
lastAppliedInstanceSpec:
  version: 123 # probably some version from metadata?
  spec:
```
Suggestion: `lastAppliedInstanceSpec.spec` should probably be `lastAppliedInstance.spec`.
What this PR does / why we need it:
Introduces new KEP - all other should be answered in the KEP itself.
Which issue(s) this PR fixes: