KEP-0018 Controller Redesign #830
Conversation
I left a few minor nits/suggestions. My biggest pain point is `lastAppliedInstanceSpec` being part of `Instance.Status`. But I don't have a better solution for this, so 🤷. Overall it is certainly a big step towards a more robust design.
keps/0018-controller-overhaul.md
Outdated
```yaml
state: IN_PROGRESS
activePlan: deploy-1478631057 # (will be null if no plan is run right now)
lastAppliedInstanceSpec:
  version: 123 # probably some version from metadata?
```
Do you mean the Kubernetes `resourceVersion`? I think we should introduce our own counter that is increased by the controller on each Instance update.
Yeah, I don't have a clear idea right now what that version should be, or whether we even need it. Maybe @gerred will have an opinion.
We should look at the server-side apply APIs for this. If we implement it on our CRD now, we might be able to delete a bunch of code later once that is on by default.
keps/0018-controller-overhaul.md
Outdated
```yaml
state: IN_PROGRESS
delete: false
- upgrade: null # (never run)
state: IN_PROGRESS
```
wdyt about using `aggregatedStatus` to express the fact that this is indeed aggregated from `status.planStatus`:
Suggested change:
```yaml
aggregatedStatus:
  - planStatus: IN_PROGRESS
activePlan: deploy-1478631057 # (will be null if no plan is run right now)
```
P.S. Also, why do we use `state` and not `status` in plans/phases?
Yeah, it should be `status`. Changed.
keps/0018-controller-overhaul.md
Outdated
`planStatus` is a property that basically replaces the current PlanExecution CRD - it reports the status of the plans that are currently running, or the last runs of all the plans available for that operator version. This is also what `kudo plan history` and `kudo plan status` will query to get an overview of the plans.

`lastAppliedInstanceSpec` persists the instance spec from the previous successfully finished deploy/upgrade/update plan. We need this to be able to solve flaw n.1 described in the summary. It gets updated after a plan succeeds.
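For readers following along, the fragments quoted throughout this review roughly assemble into the following `Instance.Status` shape. This is a sketch only: the nested `parameters` payload is an illustrative assumption, and the top-level field names are still under discussion in this thread.

```yaml
status:
  status: IN_PROGRESS                # aggregated from planStatus below
  activePlan: deploy-1478631057      # (will be null if no plan is run right now)
  planStatus:
    - deploy:
        status: IN_PROGRESS
    - upgrade: null                  # (never run)
  lastAppliedInstanceSpec:           # spec from the last successful deploy/upgrade/update
    version: 123                     # probably some version from metadata?
    spec:
      parameters:                    # illustrative parameter payload
        BROKER_COUNT: "3"
```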
I'm not happy about `lastAppliedInstanceSpec` being part of the `Instance.Status` field, but then again: it is certainly not `metadata` or `spec`, and I don't see a better place for it, so 🤷.
Do we know of anybody who has to keep a CRD history? And if yes, how do they solve it?
`apply` does this now on the client for all resources; server-side apply semantics will bring this to the server. Check out this managed-fields section @alenkacz @zen-dog:
https://kubernetes.io/docs/reference/using-api/api-concepts/#server-side-apply
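For context, server-side apply tracks field ownership in `metadata.managedFields`. A minimal sketch of the shape, per the linked docs (the manager name and API group/version here are illustrative assumptions, not KUDO's final values):

```yaml
metadata:
  managedFields:
    - manager: kudo-manager          # illustrative manager name
      operation: Apply
      apiVersion: kudo.dev/v1beta1   # assumed group/version
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:parameters:
            f:BROKER_COUNT: {}
```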
keps/0018-controller-overhaul.md
Outdated
### Admission webhook

Part of the solution (addressing problem n.3 from the summary) is an admission webhook. It will guard the Instance CRD and disallow any changes to the spec while a plan is running.
Is the admission webhook part of the refactoring (1), or does the refactoring allow us to add the webhook later (2), solving the consistency issues? If (2) is the case, then we should probably mention it as an optional goal that can be tackled in a separate effort.
Good point, I've marked it as a "stretch goal" in the following paragraph.
Co-Authored-By: Aleksey Dukhovniy <alex.dukhovniy@googlemail.com>
Part of the solution (addressing problem n.3 from the summary - ensuring atomicity) is an admission webhook. It will guard the Instance CRD and disallow any changes to the spec while a plan is running. In the Kubernetes world, an admission webhook is the only way to prevent a resource from being updated - all subsequent filters (like controller predicates) are called AFTER the change was applied, so it is too late to validate there.

Although the admission webhook addresses one of the problems outlined in this KEP, it is considered a stretch goal and can be delivered in a following release (it should NOT be part of the initial refactoring).
I think I like this. I don't think there's a way to actually make this conflict-free without some sort of lock, so this is a good solution by me.
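As a sketch of what the stretch goal could look like on the registration side, a `ValidatingWebhookConfiguration` intercepting Instance updates might be wired up roughly like this. All names, the service, the path, and the API group/version are assumptions for illustration; the actual deny/allow logic would live in the manager and reject spec changes while a plan is active:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kudo-instance-admission          # illustrative name
webhooks:
  - name: instance-admission.kudo.dev
    rules:
      - apiGroups: ["kudo.dev"]          # assumed API group
        apiVersions: ["v1beta1"]         # assumed version
        operations: ["UPDATE"]
        resources: ["instances"]
    clientConfig:
      service:
        name: kudo-controller-manager    # illustrative service name
        namespace: kudo-system
        path: /validate-kudo-dev-v1beta1-instance
    failurePolicy: Fail                  # reject updates if the webhook is unavailable
    sideEffects: None
    admissionReviewVersions: ["v1"]
```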
- overall, aim to follow controller-runtime best practices as much as possible
- implementing this will be a breaking change, meaning that KUDO on existing clusters will have to be reinstalled to work (CRDs dropped and recreated)
- temporarily we won't be able to execute manual jobs until we agree on a design there (none of the current operators use it anyway, so this should be fine for one release)
I think this is a great opportunity to put together a state-flow diagram that goes inline with this KEP, so that we have an agreed notion of how this controller should work. It would be great for the documentation and for discussing the core controller loop as well.
Love that idea! I'll work on it
If you want to put your diagram into the KEP, http://asciiflow.com/ can be your friend ;)
1. When the KUDO manager is down (or restarted), we might miss an update on a CRD, meaning we won't execute the plan that was supposed to run, or we execute an incorrect one ([issue](https://github.com/kudobuilder/kudo/issues/422))
2. Multiple plans can seemingly be in progress, leading to a misleading status being communicated ([issue](https://github.com/kudobuilder/kudo/issues/628))
3. No way to ensure atomicity of plan execution (given the current CRD design, where information is spread across several CRDs)
4. Very low test coverage and overall confidence in the code
Are there any linters/metrics we can put together? Not as something we have to live by, but just to give us a general sense, in data, of whether or not we are improving. Maybe a separate KEP, but I don't think we need to institutionalize anything; just collect data as we go, even if we're running it manually.
You mean around coverage? Overall I don't like measuring code coverage; getting that number higher doesn't always improve things. Right now there are 0 unit tests, and my goal was to go above that.
But I know that the claim I made here sounds very generic, and it would be good to have something objective that would mean it's good enough... I am open to suggestions here :)
Nah, not coverage. I'm thinking of simple gut checks we can run, like gocyclo and a few other stats. Basically, we should get a general sense that we are improving those main functions by driving the different sources of complexity down. Totally agree that code coverage on its own doesn't really tell us anything.
keps/0018-controller-overhaul.md
Outdated
```yaml
activePlan: deploy-1478631057 # (will be null if no plan is run right now)
lastAppliedInstanceSpec:
  version: 123 # probably some version from metadata?
  spec:
```
Suggestion: `lastAppliedInstanceSpec.spec` should probably be `lastAppliedInstance.spec`.
What this PR does / why we need it:
Introduces new KEP - all other should be answered in the KEP itself.
Which issue(s) this PR fixes: