
Cluster Controller #7459

Closed
lavalamp opened this issue Apr 28, 2015 · 10 comments
Labels
area/upgrade · kind/design · priority/backlog · sig/cluster-lifecycle

Comments

@lavalamp
Member

This is my post-1.0 vision for how cluster upgrades happen. Writing it up since it's been mentioned a few times.

cluster-controller

  • A new cluster component.
  • Runs in a pod, with admin privileges.
  • Takes over from the bootstrapping process when a new cluster is started.
  • Keeps a desired cluster state object, stored in the system with other API objects (a rough sketch follows this list). This includes:
    • The desired node configuration.
    • The desired master component versions & configurations.
  • Runs a sync loop, observing cluster state, and changing it to match desired state.
    • Important: rate limited!
  • Provisioning a new node and/or reinstalling an existing node is pushed into the cloudprovider interface.
    • We add a declarative node configuration so we can say "make node X look like {config}" or "make a new node like {config}". We could start out with this being the salt script we currently use.
  • Cluster upgrades and rollbacks are done by pushing a new desired state.
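For concreteness, here is a minimal sketch (in Go) of the desired-state object and the rate-limited sync loop described in the list above. Every type, field, and function name here is a hypothetical illustration, not an existing Kubernetes API:

```go
package main

import (
	"fmt"
	"time"
)

// ClusterSpec is the desired state the cluster-controller reconciles toward.
type ClusterSpec struct {
	NodeConfig       string            // declarative node config, e.g. the salt script we use today
	NodeCount        int               // desired number of nodes
	MasterComponents map[string]string // component name -> desired version
}

// ClusterStatus is what the controller observes about the running cluster.
type ClusterStatus struct {
	ReadyNodes       int
	MasterComponents map[string]string // component name -> observed version
}

// syncOnce compares observed state to desired state and issues at most one
// corrective action per call, so the ticker in main acts as a crude rate limit.
func syncOnce(spec ClusterSpec, status ClusterStatus) {
	if status.ReadyNodes < spec.NodeCount {
		fmt.Println("provisioning one node via the cloudprovider interface using", spec.NodeConfig)
		return
	}
	for name, want := range spec.MasterComponents {
		if got := status.MasterComponents[name]; got != want {
			fmt.Printf("upgrading %s from %q to %q\n", name, got, want)
			return
		}
	}
}

// observe would query the API server and cloud provider; stubbed here.
func observe() ClusterStatus {
	return ClusterStatus{
		ReadyNodes:       2,
		MasterComponents: map[string]string{"kube-apiserver": "v1.0.6"},
	}
}

func main() {
	spec := ClusterSpec{
		NodeConfig:       "salt://node.sls",
		NodeCount:        3,
		MasterComponents: map[string]string{"kube-apiserver": "v1.1.0"},
	}
	for range time.Tick(30 * time.Second) { // rate-limited sync loop
		syncOnce(spec, observe())
	}
}
```

An upgrade or rollback would then just be a change to the stored ClusterSpec; the loop converges the cluster toward it, one bounded step at a time.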
@lavalamp lavalamp added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. kind/design Categorizes issue or PR as related to design. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Apr 28, 2015
@kelseyhightower
Contributor

@lavalamp Thanks for posting this. We are doing a lot of this work in Tectonic today and from this list I see plenty of areas where we can collaborate around managing the cluster lifecycle.

@alex-mohr
Contributor

@lavalamp Thanks -- and this does mirror some of what GKE has started to build as well, but we'd rather have more of that functionality living within the cluster than outside it.

FYI @roberthbailey @zmerlynn

@mbforbes
Contributor

/sub

@davidopp
Member

/subscribe

@alex-mohr alex-mohr added this to the v1.0-post milestone Apr 30, 2015
@alex-mohr alex-mohr added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Apr 30, 2015
@alex-mohr
Contributor

Flagging this as v1.0-post because I don't think we block v1 on it, but also upping priority to p1 because I want us to drive this soon after v1 ships.

That said, @kelseyhightower if you have cycles in the short term and/or want to upstream what you've found works in Tectonic, happy to review.

@davidopp
Member

Just to make sure I understand: when you say "provisioning a new node and/or reinstalling an existing node is pushed into the cloudprovider interface", you mean the cluster-controller, which is running the sync loop, will call into the cloudprovider interface to provision/reinstall?

@alex-mohr
Contributor

@davidopp I read @lavalamp's proposal as an implementation proposal to meet a requirement of roughly "in order to support k8s.apiserver.resizeClusterToHaveNodes(N), we need to make some 'master' component have sufficient knowledge and capability to dynamically create new nodes according to a template."

It's still unclear to me whether that's exactly a ClusterController that does it all, or whether the necessary functionality gets federated out to a ProvisionsAndDeletesNodesController, a ManagesExistingNodesController, a CloudProviderReconcilerController, etc.

Longer term, we'll want to support more than one node type, so my vague mental model looks a lot like making a "NodeReplicationController" a first-class entity, allowing multiples of them, and having an optional overall CapacityManagerController that can turn the various NodeReplicationController knobs as needed to meet SLOs.

And of course, if someone wants to manually manage their node database or use their own custom automation, I assume we have some form of opt outs for all of it, except for the basic RegisterNodeWithKubernetes plumbing.
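To make that mental model a bit more concrete, here is a minimal Go sketch of the "NodeReplicationController as a first-class entity plus an overall CapacityManagerController" idea above. All names here are hypothetical illustrations, not existing APIs:

```go
package main

import "fmt"

// NodeReplicationController declares a pool of identically configured nodes.
type NodeReplicationController struct {
	Name     string
	Template string // declarative node config shared by every node in the pool
	Replicas int    // desired node count for this pool
}

// CapacityManager turns the replica knobs on the pools to meet an SLO.
type CapacityManager struct {
	Pools []*NodeReplicationController
}

// Scale adds a node to the first pool when observed utilization exceeds the target.
func (c *CapacityManager) Scale(utilization, target float64) {
	if utilization > target && len(c.Pools) > 0 {
		c.Pools[0].Replicas++
		fmt.Printf("scaling pool %q to %d replicas\n", c.Pools[0].Name, c.Pools[0].Replicas)
	}
}

func main() {
	general := &NodeReplicationController{Name: "general", Template: "salt://node.sls", Replicas: 3}
	cm := CapacityManager{Pools: []*NodeReplicationController{general}}
	cm.Scale(0.92, 0.80) // utilization above target, so add a node to the pool
}
```

Supporting multiple node types would then just mean multiple NodeReplicationController objects, with the capacity manager (or a human, for the opt-out case) adjusting their replica counts.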

@lavalamp
Member Author

lavalamp commented May 1, 2015

@davidopp Yes.

@alex-mohr Yeah, that's one possible factoring of this, although we may want to defer some of that complexity until after we have the basic path working.

@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015
@roberthbailey roberthbailey added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. team/control-plane labels Aug 27, 2015
@bgrant0607 bgrant0607 removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Sep 10, 2015
@bgrant0607
Member

Let's not build something that's unnecessarily restricted to just cluster configuration.

Related: component registration #13216

First thing we should do is move addons to Deployment, when that is ready. Then, once we have kubectl apply (#1702), updates of addons will be fully declarative.

If we want to effectively run kubectl apply as a continuous service, then we need something like Deployment Manager #3685.

Also, component configuration should be stored by the system #1627. If we need a way to orchestrate a rolling update of node configuration, we should think about how to do that in a way that would be useful in other scenarios, as well.

cc @jackgr @karlkfi @nikhiljindal @derekwaynecarr @fgrzadkowski

@bgrant0607 bgrant0607 added this to the v1.2-candidate milestone Sep 12, 2015
@davidopp davidopp added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Dec 15, 2015
@roberthbailey
Contributor

Quoting from Kubernetes Architectural Roadmap (was Core/Layers Working Doc):

The community has developed numerous tools, such as minikube, kubeadm, bootkube, kube-aws, kops, kargo, kubernetes-anywhere, and so on. As can be seen from the diversity of tools, there is no one-size-fits-all solution for cluster deployment and management (e.g., upgrades).

I'm going to close this generic issue and let each provisioning solution solve this as they see fit.
