KEP-0018 Controller Redesign #830

Merged
merged 13 commits from av/kep-controller into master on Sep 19, 2019

Conversation

alenkacz
Contributor

What this PR does / why we need it:
Introduces a new KEP - everything else should be answered in the KEP itself.

Which issue(s) this PR fixes:

@zen-dog zen-dog changed the title KEP about controller redesign KEP-0018 Controller Redesign Sep 16, 2019
Contributor

@zen-dog zen-dog left a comment

I left a few minor nits/suggestions. My biggest pain point is lastAppliedInstanceSpec being part of Instance.Status, but I don't have a better solution for this, so 🤷‍♂. Overall it is certainly a big step towards a more robust design.

keps/0018-controller-overhaul.md
state: IN_PROGRESS
activePlan: deploy-1478631057 # (will be null if no plan is run right now)
lastAppliedInstanceSpec:
  version: 123 # probably some version from metadata?
Contributor

Do you mean the K8s resourceVersion? I think we should introduce our own counter that is increased by the controller on each Instance update.

Contributor Author

Yeah, I don't have a clear idea right now what that version should be or whether we even need it. Maybe @gerred will have some opinion.

Member

We should look at the server-side apply APIs for this. If we implement it on our CRD now, we might be able to delete a bunch of code later once that is on by default.

state: IN_PROGRESS
delete: false
- upgrade: null # (never run)
state: IN_PROGRESS
Contributor

wdyt about using aggregatedStatus to express the fact that this is indeed aggregated from the status.planStatus:

Suggested change
state: IN_PROGRESS
aggregatedStatus:
- planStatus: IN_PROGRESS
activePlan: deploy-1478631057 # (will be null if no plan is run right now)

P.S. also, why do we use state and not status in plans/phases?

Contributor Author

yeah, it should be status - changed

`planStatus` is a property that basically replaces the current PlanExecution CRD - it reports the status of the plans that are running right now, or the last runs of all the plans available for that operator version. This is also what `kudo plan history` and `kudo plan status` will query to get an overview of the plans.

`lastAppliedInstanceSpec` persists the state of the instance from the previous successfully finished deploy/upgrade/update plan. We need this to be able to solve flaw no. 1 described in the summary. It gets updated after a plan succeeds.
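
To make the proposed shape more concrete, here is a minimal, purely illustrative Go sketch of how such a status could be modelled; the type and field names below are assumptions drawn from this discussion, not the actual KUDO API:

```go
// Illustrative sketch only - names are assumptions, not the final KUDO API.
package v1alpha1

// InstanceSpec is a simplified stand-in for the existing Instance spec.
type InstanceSpec struct {
	Parameters map[string]string `json:"parameters,omitempty"`
}

// PlanStatus replaces the PlanExecution CRD: one entry per plan of the
// current OperatorVersion, describing its current or last run.
type PlanStatus struct {
	Status          string `json:"status,omitempty"` // e.g. IN_PROGRESS, COMPLETE, NEVER_RUN
	LastFinishedRun string `json:"lastFinishedRun,omitempty"`
}

// InstanceStatus is a sketch of the proposed status sub-resource.
type InstanceStatus struct {
	// Aggregated view over all plan statuses below.
	Status     string `json:"status,omitempty"`
	ActivePlan string `json:"activePlan,omitempty"` // empty if no plan is running

	// Per-plan status, keyed by plan name; queried by `kudo plan history/status`.
	PlanStatus map[string]PlanStatus `json:"planStatus,omitempty"`

	// Spec of the last successfully finished deploy/upgrade/update plan,
	// updated only after a plan succeeds.
	LastAppliedInstanceSpec *InstanceSpec `json:"lastAppliedInstanceSpec,omitempty"`
}
```
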
Contributor

I'm not happy about lastAppliedInstanceSpec being part of Instance.Status field but then again: it is certainly not metadata or spec and I don't see a better place for it so 🤷‍♂

Contributor

Do we know of anybody who has to hold a CRD history? And if yes - how do they solve it?

Member

Apply does this now on the client for all resources; server-side apply semantics will bring this to the server. Check out this managed-fields section, @alenkacz @zen-dog:

https://kubernetes.io/docs/reference/using-api/api-concepts/#server-side-apply
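
For context, a minimal sketch of what leaning on server-side apply could look like from a controller built on controller-runtime; the field-manager name `kudo-manager` and the helper function are assumptions for illustration, not part of the KEP:

```go
// Illustrative sketch only: submit the desired Instance via server-side apply,
// letting the API server track field ownership in metadata.managedFields
// instead of KUDO keeping its own last-applied bookkeeping.
package apply

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

func applyDesiredState(ctx context.Context, c client.Client, desired client.Object) error {
	// client.Apply issues a server-side apply PATCH; FieldOwner identifies
	// KUDO as the manager of the fields it sets.
	return c.Patch(ctx, desired, client.Apply, client.FieldOwner("kudo-manager"))
}
```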


### Admission webhook

Part of the solution (addressing problem no. 3 from the summary) is an admission webhook. It will guard the Instance CRD and disallow any changes to the spec while a plan is running.
Contributor

Is the admission hook part of the refactoring (1), or does the refactoring allow us to add the hook later (2), solving the consistency issues? If (2) is the case, then we should probably mention it as an optional goal that can be tackled in a separate effort.

Contributor Author

Good point, I've marked it as a "stretch goal" in the following paragraph.

alenkacz and others added 8 commits September 17, 2019 14:35
Co-Authored-By: Aleksey Dukhovniy <alex.dukhovniy@googlemail.com>
@zen-dog zen-dog added the kind/kep Kudo Enhancement label Sep 17, 2019


Part of the solution (addressing problem no. 3 from the summary - ensuring atomicity) is an admission webhook. It will guard the Instance CRD and disallow any changes to the spec while a plan is running. In the Kubernetes world, an admission webhook is the only way to prevent a resource from being updated: all subsequent filters (like controller predicates) are called AFTER the change has been applied, so it is too late to validate there.

Although the admission webhook addresses one of the problems outlined in this KEP, it is considered a stretch goal and can be delivered in a following release (it should NOT be part of the initial refactoring).
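
As a rough illustration of the check such a webhook would perform on UPDATE requests (the types here are simplified stand-ins, not the actual implementation):

```go
// Illustrative sketch only: the core check a validating admission webhook
// would perform for the Instance CRD. Types are simplified stand-ins.
package webhook

import (
	"fmt"
	"reflect"
)

type InstanceSpec struct {
	Parameters map[string]string
}

type InstanceStatus struct {
	ActivePlan string // empty when no plan is running
}

type Instance struct {
	Spec   InstanceSpec
	Status InstanceStatus
}

// validateUpdate rejects any spec change while a plan is in progress. In the
// real webhook this runs on admission, i.e. before the change is persisted -
// unlike controller predicates, which only see it afterwards.
func validateUpdate(old, updated Instance) error {
	if old.Status.ActivePlan != "" && !reflect.DeepEqual(old.Spec, updated.Spec) {
		return fmt.Errorf("plan %s is in progress, spec changes are not allowed", old.Status.ActivePlan)
	}
	return nil
}
```
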
Member

I think I like this. I don't think there's a way to actually make this conflict-free without some sort of lock, so this is a good solution by me.

- overall, aim to follow as many controller-runtime best practices as possible
- the implementation of this will be a breaking change, meaning that KUDO on existing clusters will have to be reinstalled to work (CRDs dropped and recreated)
- temporarily, we won't be able to execute manual jobs until we agree on a design there (none of the current operators use them anyway, so this should be fine for one release)

Member

I think this is a great opportunity to put together a state flow diagram that goes inline with this KEP, so that we have an agreed notion of how this controller should work. It would be great for the documentation and for discussing the core controller loop as well.

Contributor Author

Love that idea! I'll work on it

Contributor

If you want to put your diagram into the KEP, http://asciiflow.com/ can be your friend ;)

1. When the KUDO manager is down (or restarted), we might miss an update on a CRD, meaning we won't execute the plan that was supposed to run, or we execute an incorrect one ([issue](https://github.com/kudobuilder/kudo/issues/422))
2. Multiple plans can seemingly be in progress, leading to a misleading status being communicated ([issue](https://github.com/kudobuilder/kudo/issues/628))
3. There is no way to ensure atomicity of plan execution (given the current CRD design, where information is spread across several CRDs)
4. Very low test coverage and overall confidence in the code
Member

Are there any linters/metrics we can put together? Not as something we have to live by, but just to give us a general sense, in data, of whether or not we are improving. Maybe a separate KEP, but I don't think we need to institutionalize anything - just collect data as we go, even if we're running it manually.

Contributor Author

You mean around coverage? Overall I don't like measuring code coverage; getting that number higher doesn't always improve things. Right now there are 0 unit tests, and my goal was to get above that.

But I know that the claim I made here sounds very generic, and it would be good to have something objective that would tell us it's good enough... I am open to suggestions here :)

Member

Nah, not coverage. I'm thinking simple gut checks we can run, like gocyclo and a few other stats. Basically, we should get a general sense of whether we are improving those main functions by driving the different sources of complexity down. Totally agree that code coverage on its own doesn't really tell us anything.

activePlan: deploy-1478631057 # (will be null if no plan is run right now)
lastAppliedInstanceSpec:
  version: 123 # probably some version from metadata?
  spec:
Contributor

suggestion: lastAppliedInstanceSpec.spec should probably be lastAppliedInstance.spec

@alenkacz alenkacz merged commit 190c86f into master Sep 19, 2019
@alenkacz alenkacz deleted the av/kep-controller branch October 30, 2019 09:07
Labels: kind/kep Kudo Enhancement
3 participants