This repository has been archived by the owner on May 22, 2020. It is now read-only.

Minimalistic Machines API proposal. #298

Merged

Conversation

pipejakob
Contributor

This is a proposal to add a new API for managing Nodes in a declarative way: Machines.

It is part of the overall Cluster API effort.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 19, 2017
@k8s-reviewable

This change is Reviewable

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 19, 2017
@pipejakob pipejakob force-pushed the machines_proposal branch 7 times, most recently from e4232ee to 0302668 Compare October 19, 2017 08:01
@mrIncompetent
Contributor

Hey,

we are working on a very similar concept:
https://github.com/kube-node/nodeset

Basic idea is to rely on the already existing node resources.
We defined 2 extra resources: NodeSet and NodeClass.

Flow is:

  • nodeset-controller creates node resources
  • node-controller provisions machines at the cloud-provider based upon the node resource and the assigned nodeclass

We have an example node-controller (using docker-machine): https://github.com/kube-node/kube-machine
Example nodeset controller (using archon): https://github.com/kube-node/archon-nodeset
We are working right now on a generic nodeset controller & a gke-nodeset controller.

How can we align and possibly collaborate on this topic?

Contributors (not sorted by anything):
@sttts @ledzep2 @s-urbaniak @adieu @scheeles @realfake @chaosaffe @guusvw @metalmatze

@pipejakob
Contributor Author

@mrIncompetent Your nodeset looks similar to what we're trying to do, but I'm wondering if there's a write-up somewhere of the explicit goals of the project? It's not clear to me from reading the code where its boundaries are, or what user problems it's trying to solve.

For instance, I don't see any notion of software versions in the nodeset, so it doesn't seem like you're targeting being able to upgrade nodes like we are. Also, only having a "set" concept for nodes without being able to address specific ones seems like it would give you no control over which nodes to scale down when you have excess capacity (which could be important if you want to target the most idle nodes for deletion). I could be wrong, though, if you're intending for that use case to be handled by deleting Node objects themselves.

That said, it would be great to have any/all of its contributors join us in our ongoing Cluster/Machines API discussions. We're meeting weekly on Wednesdays at 11:00 PST via Zoom (https://zoom.us/j/166836624). If you want to get the invite on your calendar, you can join the SIG Cluster Lifecycle mailing list, where we can also start a thread to discuss this before our next Zoom meeting.

Contributor

@justinsb justinsb left a comment

Looks great!


The ProviderConfig is recommended to be a serialized API object in a format
owned by that provider, akin to the [Component Config](https://goo.gl/opSc2o)
pattern. This will allow the configuration to be strongly typed, versioned, and
Contributor

I personally think this is a little bit of a cheat, and I'd like us to avoid this if possible, but I also recognize that we'll always need an escape hatch in the long term. My concern in the short/medium term is that allowing this means we avoid defining common fields where we actually could do so.

Contributor Author

I absolutely, 100% agree that it's a simplification for the short-term. I'll be fully transparent here:

My other proposal was a little too over-indexed on trying to support a cloud-agnostic cluster autoscaler as an initial customer. That was appealing from a design standpoint, because the cluster autoscaler already exists and actually has concrete requirements to build off of, but also seemed like a logical thing you would want to do with a declarative NodeSet concept. After a lot of discussions, though, the feedback was:

  1. For a very first proposal of a brand-new API, it had a lot of new concepts to grasp.

To be fair, I had new types across two dimensions: Machine / MachineSet / MachineDeployment in one dimension, and Machine / MachineClass / MachineTemplate across the other. I was urged to think about the absolute minimum we could get away with for the first iteration, and only add more complexity and new concepts later when we were certain they were necessary.

My other design also had the requirement that in order to create even a single simple node, you had to first create a cloud-specific MachineTemplate, then register it as a MachineClass, and then finally create a Machine that referenced the class. It seemed like a lot of overhead just to support a single custom node that you had no intent to reuse. So, one design principle that emerged from that was that if I just want to create a single Machine and not care about portability, I should be able to. I think there's still room to evolve the API and introduce other concepts down the road, but it seems reasonable to me that if you don't care about portability, it should be possible to create a single custom Machine without the overhead of needing to use other concepts, which means that we'll need something akin to the opaque ProviderConfig blob we have now to actually be able to feed the right values into the cloud-specific node creation APIs.

  2. Rebasing the cluster autoscaler on top of our new APIs isn't actually delivering any new value, since cluster autoscaling already exists today.

It's a nice-to-have down the road, but in order to get the momentum we want for the project, we should focus first on the new functionality this enables that was never possible before (like cloud- and deployment-agnostic cluster upgrades). If you remove cluster autoscaling (and node autoprovisioning) as a client, at least for the short term, then I think most of the benefits of having more cloud-agnostic fields in the MachineSpec disappear as well.

I could definitely be wrong here, and welcome more feedback. But, the intent of this proposal was "what is the absolute bare minimum that we can all agree is a good starting point?" and then continue to add more as we see appropriate.

Some of the things I had considered for inclusion in the MachineSpec:

  • OS image
    • This could be represented as a single string across most providers, but the values wouldn't be portable anyway. If the values are cloud-specific, then moving the entire field to be cloud-specific means that each provider can represent this however it would like. For instance, in GCE, OS images are actually more naturally represented in a structured way, since they have a project, family, name, etc. In DigitalOcean, OS Images can also be referred to via int IDs, so an IntOrStr might be more appropriate there.
  • Disk configuration
    • I don't know if I personally have enough of a grasp on how much power is needed to be useful here. I've swung back and forth between two extremes: a simplified view of just specifying a total amount of working space desired, and on the other end of the spectrum, having a full array of structs to represent disks, along with fields for how much capacity each disk should have, whether it should be bootable, etc. At this point, deferring disk setup to the ProviderConfig lets us look at how different early adopters represent this in their config, and we can always bubble it back up to a generic representation in MachineSpec in the future once we think we have a grasp of where we want to end up on the power/usability trade-off scale.
  • Preemptible
    • AWS and GCE support these, but I think Azure only supports low-priority VMs in Batch, and I don't know if this concept exists in other on-premise environments like vSphere.
  • Topology
    • Like OS Image, these are similar-enough concepts across environments (regions, zones, availability zones, etc), but I couldn't see the actual values being portable at all. Also, the number of dimensions you might want to support in different environments might differ wildly. In some clouds, a single availability zone might be enough to specify. If you're on-premise, you might want a lot of custom fields like datacenter, rack, etc.

There are many more, but you get the point. Are there particular concepts that you feel we should represent in the MachineSpec now (or in the near term)?


@justinsb it shouldn't be a problem to lift common fields into this API as we notice them. Providers can always extend their controllers to check for the API field and fall back on their old provider-specific field.

Contributor

@mvladev mvladev Nov 5, 2017

@pipejakob After re-watching all the meeting recordings, I see the justification for using a serialized blob - simplicity. I'm 100% for simplicity. However, I think that providerConfig is very similar to EnvVar in a Container: there you can define a literal value when you don't need reusability (or just want to move quickly), and a reference when you do. Something like

providerConfig: # either value or valueRef
  value: >
    {
      "apiVersion": "gceproviderconfig/v1alpha1",
      "kind": "GCEProviderConfig",
      "machineType": "n1-standard-1",
      ...
    }
  valueRef:
    apiVersion: gceproviderconfig/v1alpha1
    kind: GCEProviderConfig
    name: config-1

could be suitable for both use cases, and it feels very similar to what people already know and use in Pods.

Edit:
even better would be if we had:

providerConfig:
  value: # either value (runtime.Object/runtime.RawExtension) or valueRef
    apiVersion: gceproviderconfig/v1alpha1
    kind: GCEProviderConfig
    machineType: n1-standard-1
    project: test-1-test
    ...
  valueRef:
    apiVersion: gceproviderconfig/v1alpha1
    kind: GCEProviderConfig
    name: config-1

p.s. this is a re-post of a previous comment (wrong user)
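
A minimal Go sketch of the value/valueRef union described above, using illustrative type and field names rather than anything defined by the proposal:

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// ProviderConfig mirrors the EnvVar Value/ValueFrom pattern: exactly one
// of the two fields is expected to be set.
type ProviderConfig struct {
	// Inline, provider-owned configuration (opaque to this API).
	// +optional
	Value *runtime.RawExtension `json:"value,omitempty"`

	// Reference to a separately stored configuration object, for reuse
	// across Machines.
	// +optional
	ValueRef *corev1.ObjectReference `json:"valueRef,omitempty"`
}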

Contributor

Why not use runtime.Object/runtime.RawExtension instead of a string?


I second this. runtime.RawExtension needs an apiVersion and will probably fall back to Unstructured (@mrIncompetent can you confirm that?). Alternatively, a generic JSON type will do, compare https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1/types_jsonschema.go#L28.

Contributor

@sttts runtime.RawExtension.Object would fall back to runtime.Unknown.
Though I've been using it without apiVersion, more like https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1/types_jsonschema.go#L28
So I only used runtime.RawExtension.Raw. In this case, RawExtension.Object will be nil.

Contributor

@mvladev @pipejakob @justinsb @sttts @roberthbailey
Any objections to using runtime.RawExtension for providerConfig?


sounds good

Contributor

No objection, it's way better than string.
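
To make the agreed-upon direction concrete, here is a rough sketch of a RawExtension-backed providerConfig field and how a provider controller might decode it. GCEProviderConfig and its fields are assumptions for illustration, not types defined by this proposal:

package example

import (
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
)

// MachineSpecFragment shows only the provider-owned portion of the spec.
type MachineSpecFragment struct {
	// Provider-specific configuration, embedded as a serialized object
	// rather than a plain string.
	ProviderConfig runtime.RawExtension `json:"providerConfig,omitempty"`
}

// GCEProviderConfig is a hypothetical provider-owned type; each provider
// versions its own config independently of the Machines API.
type GCEProviderConfig struct {
	APIVersion  string `json:"apiVersion"`
	Kind        string `json:"kind"`
	MachineType string `json:"machineType"`
	Project     string `json:"project"`
}

// decodeProviderConfig reads the raw bytes; as noted above, only
// RawExtension.Raw is populated and RawExtension.Object may be nil.
func decodeProviderConfig(spec MachineSpecFragment) (*GCEProviderConfig, error) {
	cfg := &GCEProviderConfig{}
	if err := json.Unmarshal(spec.ProviderConfig.Raw, cfg); err != nil {
		return nil, fmt.Errorf("unable to decode provider config: %v", err)
	}
	return cfg, nil
}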

update machine

and allow the provider to decide if it is capable of performing an in-place
update, or if a full Node replacement is necessary.
Contributor

Agree - and this could also be an option / policy that can be set by the machine controller. For example, kops encourages full node replacement (a little more reliable, perhaps). But certainly it is slower, and some people would likely choose in-place replacement if it was available. And if we have a kops-controller, it'll have to support in-place for bare-metal.

// |               | Master present        | Master absent          |
// +---------------+-----------------------+------------------------+
// | Node present: | Install control plane | Join the cluster as    |
// |               | and be schedulable    | just a node            |
Contributor

I'm not sure schedulable exists as a concept any more... AIUI the masters are supposed to be schedulable, but tainted so that user pods aren't scheduled to them. But e.g. a monitoring daemonset or networking daemonset should tolerate the taint and thus run on the master.

Contributor Author

Ah, this is poor wording on my part. If I replace "schedulable" with "untainted" and "unschedulable" with "tainted," is that sufficient? Or should we not differentiate between (Master) and (Master, Node)? Or do you think there's a better way to represent the desire to install the control plane altogether?


From what @justinsb said, it sounds like there shouldn't be a distinction between Master and Node, and I think I agree with this position. Is there any scenario where an un-tainted master would be significantly preferable to a tainted one?

type MachineStatus struct {
// If the corresponding Node exists, this will point to its object.
// +optional
NodeRef *api.ObjectReference
Contributor

Not sure if we should define a field like ProviderId (https://github.com/kubernetes/kubernetes/blob/master/pkg/api/types.go#L3092), for the window in between when a machine is created and when it registers with kube-apiserver. But happy to wait and see...

Contributor

This should be *corev1.ObjectReference


// When was this status last observed
// +optional
LastUpdated metav1.Time
Contributor

Let's make clear that this is last-transition, rather than a heartbeat. (If we hadn't done the node heartbeat, I'd wager we'd be at 10k nodes by now...)

Contributor Author

Will do.

Name string

// Semantic version of the container runtime to use
Version string
Contributor

Going to guess these should be optional to mean "use the controller-recommended setting for the k8s/kubelet version" (which IMO is what we should be encouraging!)

Contributor Author

Ah, good point. I think it's a good idea to make these optional at cluster creation time, but do you think they should stay optional afterwards?


Be careful with default values for fields. Changing defaults can easily break an API and unlike ProviderConfig there's no clear provider-specific versioning mechanism for the defaults applied to these common fields.

pipejakob added a commit to pipejakob/kube-deploy that referenced this pull request Oct 23, 2017
Committing these types so we can start prototyping against them, but the
full proposal is still under review (and accepting feedback) here:

kubernetes-retired#298
@luxas
Contributor

luxas commented Oct 24, 2017

@kubernetes/sig-cluster-lifecycle-pr-reviews

@adieu

adieu commented Oct 25, 2017

@pipejakob Just wanted to share some thoughts while we were working on a project called Archon with similar ideas.

The basic idea is the same. Using declarative resources in Kubernetes to define the node machines and delegating real work to controllers. We chose Instance and InstanceGroup as the resource names but they could be easily mapped to Machine and MachineSet.

Here is the definition for Instance:

type Instance struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata"`
	Spec              InstanceSpec       `json:"spec,omitempty"`
	Status            InstanceStatus     `json:"status,omitempty"`
	Dependency        InstanceDependency `json:"-"`
}

type InstanceSpec struct {
	OS                  string                 `json:"os,omitempty"`
	Image               string                 `json:"image,omitempty"`
	InstanceType        string                 `json:"instanceType,omitempty"`
	NetworkName         string                 `json:"networkName,omitempty"`
	ReclaimPolicy       InstanceReclaimPolicy  `json:"reclaimPolicy,omitempty"`
	Files               []FileSpec             `json:"files,omitempty"`
	Secrets             []LocalObjectReference `json:"secrets,omitempty"`
	Configs             []ConfigSpec           `json:"configs,omitempty"`
	Users               []LocalObjectReference `json:"users,omitempty"`
	Hostname            string                 `json:"hostname,omitempty"`
	ReservedInstanceRef *LocalObjectReference  `json:"reservedInstanceRef,omitempty"`
}

type InstanceStatus struct {
	Phase      InstancePhase       `json:"phase,omitempty"`
	Conditions []InstanceCondition `json:"conditions,omitempty"`
	// TODO: allow multiple ips
	PrivateIP         string      `json:"privateIP,omitempty"`
	PublicIP          string      `json:"publicIP,omitempty"`
	InstanceID        string      `json:"instanceID,omitempty"`
	CreationTimestamp metav1.Time `json:"creationTimestamp,omitempty" protobuf:"bytes,8,opt,name=creationTimestamp"`
}

type FileSpec struct {
	Name               string `json:"name,omitempty" yaml:"name,omitempty"`
	Encoding           string `json:"encoding,omitempty" yaml:"encoding,omitempty" valid:"^(base64|b64|gz|gzip|gz\\+base64|gzip\\+base64|gz\\+b64|gzip\\+b64)$"`
	Content            string `json:"content,omitempty" yaml:"content,omitempty"`
	Template           string `json:"template,omitempty" yaml:"template,omitempty"`
	Owner              string `json:"owner,omitempty" yaml:"owner,omitempty"`
	UserID             int    `json:"userID,omitempty" yaml:"userID,omitempty"`
	GroupID            int    `json:"groupID,omitempty" yaml:"groupID,omitempty"`
	Filesystem         string `json:"filesystem,omitempty" yaml:"filesystem,omitempty"`
	Path               string `json:"path,omitempty" yaml:"path,omitempty"`
	RawFilePermissions string `json:"permissions,omitempty" yaml:"permissions,omitempty" valid:"^0?[0-7]{3,4}$"`
}

We put lots of information in InstanceSpec because we think Instance should contain all the information needed to create the machine, and later we could introduce InstanceGroup and InstanceDeployment using Instance as a base. It's just like the relationship between Pod, ReplicaSet and Deployment, which could be easily adopted by Kubernetes users.

To adapt to different cloud controllers, we put common fields like OS, Image, and InstanceType in InstanceSpec and left cloud-specific configs in annotations and Files. A File is like a Unix file. We use File to inject files directly into the target machine, and controllers can watch a specific path to retrieve additional configuration information, much like using /proc/ files. There should be an agreement between the controller author and the controller user on the path and format of the configuration files, but it should be hidden from the generic view since it's an implementation detail. The concept behind the special File idea and your ProviderConfig idea is the same: we should leave some extension points for the controllers.

There are two minor differences:

We chose to treat Instance as a read-only resource, just like Pod. One can only modify the machine by deleting and recreating the Instance resource. In order to reuse some existing machines, we introduced a new concept called ReservedInstance. We found this approach easier to implement and reason about. However, if we went down the mutable path, controller authors could still stick with the immutable approach if they like, so there's no real problem here.

Another difference is that we target Archon as a general-purpose computing resource management tool instead of a Kubernetes-specific one. The fundamental design could be used to build an etcd cluster or any distributed system, but we have Kubernetes support built in. I understand that kube-deploy is a Kubernetes project and its main focus is on Kubernetes, but in the real world sysadmins manage a bunch of other servers besides the Kubernetes cluster. Many of them have their own tools for server bootstrapping and configuration. It will be very hard to persuade them to adopt a new tool which can only be used to manage Kubernetes clusters. If we support generic server management, adapters can be made for existing tools. terraform-provider-archon is an adapter we made for existing Terraform users.

Whether the Machine resource represents only a Kubernetes node or a more generic server is an important design decision. Maybe we should hear from more people about the pros and cons. Personally I would support the generic server model, because it's easy to add a higher-level abstraction for Kubernetes but not vice versa. There are use cases like a dedicated etcd cluster or a storage cluster which could be covered by this model. In order to be more generic, we have resources like Network and User in Archon, and controllers to manage VPCs and certificates.

I just wanted to raise another question here. Is the Machine resource more like an Ingress resource or a Pod resource? The answer will influence the controller design. If it's like Pod, then we probably have a central controller and something like CRI for different backends. If it's like Ingress, then we would not have a master controller and all the controllers would consume the resource definition by themselves. We could leave this question aside until we begin to implement the controller, but I think the answer will influence the design of the Machine resource.

Working with @mrIncompetent and others, we introduced concepts like NodeSet and NodeClass. As you said in your comments, it's another dimension. We made archon-nodeset to consume the NodeSet resources and translate them into Archon InstanceGroup resources. Hopefully the work we have done in the kube-node project can be used as a reference for your Machine, MachineClass and MachineTemplate design.

BTW, we use jsonnet to build the final InstanceGroup definition in modules and we bundled the jsonnet files into one single executable file to improve user experience. I think they might be useful for your Machine resource too. I'll share more information on this topic in another thread.

I'm really looking forward to seeing a common resource shared by all the Kubernetes bootstrapping tools get defined. My thoughts on this topic might not be correct or optimal; I just wanted to bring up some of the design decisions to be made. Hopefully we can attract more people interested in this idea and polish the resource definition together.

ContainerRuntime ContainerRuntimeInfo
}

type ContainerRuntimeInfo struct {


What is the scope of this particular object/concept? Should we expect to see {Rkt,Frakti,Containerd}RuntimeConfig structs at some point?

Contributor Author

The intended purpose was to:

  1. Know what runtime to install at provisioning time. It's completely fine for an implementation to say "sorry, I only know how to install Docker. Anything else will result in an UnsupportedConfiguration error." Or even "I have no idea how to install that version of Docker," etc.
  2. Know how to set kubelet's --container-runtime flag.

You're right to question how this will evolve in the future, with cluster admins potentially wanting to fine-tune the settings of the runtime itself. I think it would largely follow the same pattern we use to handle the provider-specific configuration for Machines. At first, we would likely have an opaque blob to capture all of the settings that were fed to the runtime installer, which would allow each container runtime to version its configuration independently of the Machines API. Then, we could potentially upgrade this to an ObjectReference so that identical configuration wouldn't have to be inlined with every object that uses it. Any time we have configuration that seems to be useful across every runtime we support, we can graduate it out of the opaque config blob and into the ContainerRuntimeInfo struct if desired.

I think the ContainerRuntimeInfo is a good candidate for culling in v1alpha1, actually, until there's a strong need to add it. Definitely worth discussion.
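
For illustration, the evolution described above could look roughly like the following; the Config field is a hypothetical future addition, not part of the current proposal:

package example

import (
	"k8s.io/apimachinery/pkg/runtime"
)

type ContainerRuntimeInfo struct {
	// Name of the runtime to install (e.g. "docker"), also used to set
	// kubelet's --container-runtime flag.
	Name string

	// Semantic version of the container runtime to use.
	Version string

	// Hypothetical: an opaque, runtime-owned configuration blob, versioned
	// independently of the Machines API (the same pattern as ProviderConfig).
	// +optional
	Config *runtime.RawExtension
}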

// | Node absent:  | Install control plane | Invalid configuration  |
// |               | and be unschedulable  |                        |
// +---------------+-----------------------+------------------------+
Roles []string

I'd suggest just making this a string and having a single identifier for each possible configuration. It's much less confusing that way. e.g. have "SchedulableMaster," "UnschedulableMaster," and "Node."

Contributor Author

I actually originally had it that way in an earlier draft that was Google-internal only, and got the opposite feedback from Brian Grant, Tim Hockin, and Daniel Smith: make it a list of strings instead.

I think there might be contention over exactly what roles or node installation scenarios we want to support directly in this API, however.


Can you ping me the internal doc so I can get up to speed on the rationale?


type MachineSpec struct {
// This ObjectMeta will autopopulate the Node created. Use this to
// indicate what labels, annotations, name prefix, etc., should be used

Is the Name of the MachineSpec used as the name prefix? This sounds reasonable to me; just double checking.

Contributor Author

Actually, I was aiming to follow the same pattern as we do for pods (and other objects): you can specify name: value if you know exactly what name you want to use, or generateName: value to use value as a prefix and have a unique suffix generated for you.

I can call this out much more explicitly, though.


Isn't generateName handled generically by the API server? How will you prevent it from renaming the MachineSpec you create, so that providers can generate the names (e.g. GKE has a way of generating names that is not the same as what the API server does)?

OTOH you could just force providers to use the name generated by the API server. Though IDK what kinds of incompatibility that would introduce.

Keep in mind that using generateName will prevent you from making idempotent creation requests, because it is not deterministic.
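
For concreteness, a small Go sketch of the two naming patterns discussed above, with illustrative values:

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// namingExamples shows the two ObjectMeta naming patterns.
func namingExamples() (metav1.ObjectMeta, metav1.ObjectMeta) {
	// Exact name: you know precisely what the object should be called.
	exact := metav1.ObjectMeta{Name: "my-machine"}

	// Prefix: the API server appends a unique suffix (e.g. "node-x7k2p"),
	// which also means the creation request is not idempotent.
	prefixed := metav1.ObjectMeta{GenerateName: "node-"}

	return exact, prefixed
}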

attempt to upgrade machine in-place
if error:
create new machine
delete old machine

Remember to drain the old machine before deleting it, so that the containers get a chance to exit gracefully. IIRC drain also ensures you respect any pod disruption budgets that are set up in the cluster.

Contributor

@mtaufen there is an open issue to move the drain operation into the k8s server.

In the meantime, I think it's much better to have a special drain controller that marks Machines for draining, adding a finalizer to prevent deletion and removing it once the node has been drained.
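
A rough sketch of that drain-controller flow, assuming a hypothetical finalizer name and caller-supplied drain and finalizer-removal helpers (none of this is an existing API):

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Assumed finalizer name; the real value would be defined by the API.
const drainFinalizer = "machine.cluster.k8s.io/drain"

// Machine is a minimal stand-in so the sketch is self-contained.
type Machine struct {
	ObjectMeta metav1.ObjectMeta
}

func hasFinalizer(m *Machine, name string) bool {
	for _, f := range m.ObjectMeta.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

// reconcileDeletion drains first, then removes the finalizer so the API
// server can complete the delete. PodDisruptionBudgets may keep the drain
// call failing until evictions are allowed; the controller simply retries.
func reconcileDeletion(m *Machine, drain func(*Machine) error, removeFinalizer func(*Machine, string) error) error {
	if m.ObjectMeta.DeletionTimestamp == nil || !hasFinalizer(m, drainFinalizer) {
		return nil
	}
	if err := drain(m); err != nil {
		return err
	}
	return removeFinalizer(m, drainFinalizer)
}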

built on top of the Machines API would follow the same pattern:

for machine in machines:
attempt to upgrade machine in-place

Just a note: With in-place upgrades, providers should determine how disruptive a given in-place mutation is and ensure that they respect the pod disruption budget.

## In-place vs. Replace

One simplification that might be controversial in this proposal is the lack of
API control over "in-place" versus "replace" reconciliation strategies. For

Users may end up wanting this to make a trade-off between the disruptiveness and cleanliness of a rollout, but I think it's fine to push this down to the ProviderConfig and leave it out of the top-level API.



* Dynamic API endpoint

This proposal lacks the ability to declaratively update the kube-apiserver
endpoint for the kubelet to register with. This feature could be added later,

I'm unclear on what this section means, but it kinda sounds like something CRD could handle?

// controller observes that the spec has changed and no longer matches
// reality, it should update Ready to false before reconciling the
// state, and then set back to true when the state matches the spec.
Ready bool
Contributor

What criteria will govern this value?
Also is there an expectation that this value will always be accurate?

Contributor

This feels like it might be a duplication of data that already exists in the API. Moreover, what is actually updating this? My take is that this would be a lot of overhead and may not scale. Making this generic seems to be an issue. Can we rely on the kubelet's ready status, or what other options do we have?

Contributor

@krisnova krisnova left a comment

I am very much in favor of building out a pre-alpha version of this so we can start testing sooner rather than later. The whole point is that we can version these, and I will probably have much more concrete feedback once I've tried mutating infrastructure with these 😄

TLDR; LGTM

@mattbates

mattbates commented Oct 28, 2017

cc @munnerz @simonswine

@pipejakob
Contributor Author

@adieu Yes, Archon is one of the many projects we looked at when starting to work on the Cluster API project. I do think there's a lot of overlap conceptually, but that the two efforts have fundamentally different goals.

As you say yourself, Archon's Instances are deliberately general purpose and Kubernetes-unaware. I believe that this makes configuring a Kubernetes cluster much closer to the Kubernetes The Hard Way experience than the ease of other installers. Rather than saying conceptually "I would like to use the 1.8.1 version of the control plane," one must explicitly model the entire static pod manifest of every component, including every flag to pass them, the liveness probe to use, the volumes to mount, etc. This offers infinite flexibility, but because you give this much power to the end user by only abstracting away the concept of files to place on disk, I believe it actually becomes much more tedious to configure a functioning cluster from scratch, with no guarantee that two clusters created by two different users actually look the same. They could choose to put their static pod manifests in different directories, or even run all of their control plane components via systemd instead of the kubelet. This flexibility is very powerful, but it makes it very difficult to operate on those clusters in a generic way.

One of the most important use cases we're targeting with the Cluster and Machines APIs is for developers to be able to write operational tooling on top of these APIs that is completely agnostic of the cluster's environment, the cloud that it's running in, and even the deployment mechanism used to provision the cluster. With the current proposal of having Kubernetes concepts be first-class citizens of these APIs, we will enable tooling like generic cluster upgraders that only have to update the value of a single field on an object and have the right thing happen, with little room for shooting oneself in the foot.

In the Archon world, I don't see how a tool could generically upgrade a single Instance without understanding whether that Instance is supposed to be running the control plane, or just run a kubelet that registers with a cluster master. Further, the tool will have to understand which Files refer to which control plane components in order to understand how to even upgrade them. Flag names can change between different Kubernetes versions, so the upgrade tool would need to know about what name transformations, deprecations, and additions to make to the flags passed to each component separately. Please correct me if I'm wrong, but I think any tooling written on top of Archon's Instances would need to deeply inspect every object and have many switch statements to know how to safely upgrade or downgrade Instances or the components running on them, or else have these sections maintained by hand by the cluster admin. Also, if an Instance is not Kubernetes-aware, then would tooling that deletes an Instance need to handle safely evicting workloads from that Instance first?

I understand the desire to create a completely generic abstraction of hosts, but I think that direction is off the spectrum of what would be usable for the use cases we're targeting with this particular project.

However, one way that I can see these two projects potentially collaborating is by having a Machine -> Instance adapter that allows us to take advantage of all of the existing work in Archon in order to jumpstart the number of providers supported by the Machines API, and to keep taking advantage of any new providers implemented in the Archon project.

As for your question of whether a Machine object is modeled more like an Ingress than a Pod, I would say Ingress (based on my understanding of the intent of your question). It is fully expected that your cluster should have a cloud- or environment-specific controller handling your Machine objects. However, most of the Machine's spec is generic, so the power comes from these objects being operated on in a generic way. You don't need to know which cloud a Machine is in to change its kubelet version, but a cloud-specific controller will handle reconciling the real world with that new declared spec.

@kfox1111

@pipejakob Couldn't the abstraction still be done with Archon-like language, with some kind of helper on the host and annotations of some kind on the node? Say the user sets a "k8s version: previous+1" annotation. The node itself could pull the annotation and tweak things, like performing a kubelet upgrade. Maybe some of the logic you're talking about belongs in kubeadm? Being a node-level thing has the advantage of being very agnostic to the tool (it could be driven in the image by Go, Ansible, Chef, etc.). You could map annotations to Chef roles, for example.

It would be really nice if both use cases could be handled by the same object with some extra fields somehow. For an operator, the line between a fully managed k8s node and "oh, I need to override this one thing on the host" is sometimes very fine. Being able to mix both use cases together would really be helpful, I think.


type MachineVersionInfo struct {
// Semantic version of kubelet to run
Kubelet string
Contributor

Is this an optional value? In most cases you want to match your kubelet and api-server versions.

Contributor Author

I was thinking that it would be optional at installation time, but that the installer would fill it in with the value that was actually used. Then, tooling built on top of this API can inspect and potentially modify a concrete value here, instead of having to reason about what an empty value here means. I'll document the expectations here more clearly.

Contributor

Can we make this a struct? The kubelet, to me, is an object, not a string. Is a struct with a single string a good start? For instance, we have multiple components. Also, how does this relate (or not) to component config?

Contributor Author

I'm not against making this a struct, but I'm wondering how you see that evolving in the future. What other fields would you envision in the struct? The way this is laid out currently is:

machine:
  spec:
    versions:
      kubelet: 1.8.0

Are you hoping to have different ways of specifying the version of kubelet to use beyond a single semver, or were you hoping to gather any configuration related to the kubelet into a single struct, rather than having the version be stand-alone here?
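
For reference, the two shapes under discussion look roughly like this; KubeletInfo and its fields are hypothetical, not part of the proposal:

// As proposed: the version is a bare semver string.
type VersionsAsString struct {
	Kubelet string
}

// The struct variant being suggested.
type VersionsAsStruct struct {
	Kubelet KubeletInfo
}

type KubeletInfo struct {
	// Semantic version of kubelet to run.
	Version string
	// Other kubelet-related settings could be added here later without
	// changing the shape of the parent struct again.
}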

import (
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/kubernetes/pkg/api"
Contributor

This can be removed and we can use k8s.io/api/core/v1 instead.

Contributor Author

Yup, this was an accident that got fixed in the merged types but not this PR. I'll update the PR with the newest definition from the codebase.


// If set, indicates that there is a problem reconciling state, and
// will be set to a human readable string to indicate the problem.
ErrorMessage *string
}
Contributor

Where does the provider put the status for the cloud resources it creates? ProviderA might create/update/delete security groups, keys, or anything related for every machine it reconciles, and those cloud resources are going to be completely different from ProviderB's resources.

Contributor

Would ErrorReason *MachineStatusError and ErrorMessage *string make more sense as a list of a struct that encapsulates those values? What does the event data structure look like? This almost appears to be a list of Error Events.

Contributor

// +optional
// If set, indicates that there is a problem reconciling state, and
// will be set to a human readable string to indicate the problem.
ErrorMessage *string

Maybe this is a better question, since the struct may be below. What is the ErrorMessage? How does it relate to ErrorReason?

Contributor Author

We discussed this in one of the Cluster API breakout sessions, but I forgot to follow-up here: it's completely up to the controller as an implementation detail. A decent pattern is for the controller to add custom annotations to the Machines it's reconciling, to keep track of information about external resources it has created. The controller could also create its own ConfigMaps or CustomResources to have better control (and RBAC scoping) of its state, or even store state outside of the cluster if it makes sense.
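
A small sketch of the annotation pattern just described; the annotation key is an illustrative assumption:

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// recordExternalResource stores a cloud resource ID on the Machine's
// metadata so the controller can find it on the next reconcile.
func recordExternalResource(meta *metav1.ObjectMeta, instanceID string) {
	if meta.Annotations == nil {
		meta.Annotations = map[string]string{}
	}
	meta.Annotations["machine.example.com/cloud-instance-id"] = instanceID
}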

Contributor Author

As for ErrorMessage / ErrorReason, this is definitely one of the warts of the design so far that I'd like to flesh out with any better approaches.

We did get the advice from Brian Grant and Eric Tune that the existing pattern of Conditions in object statuses is near-deprecated. As part of trying to bring several controllers to GA, a survey was sent out to understand how people use Conditions and how effective they are for the intent. The overall response was that having Conditions be a list of state transitions generally made them not useful for the kinds of checks people wanted to make against the Status, which are to answer the question "is this done yet?". This resulted in clients always just looking at the most recent Condition in the list and treating it as the current state, which on top of making the client logic more difficult, also made them deteriorate into Phases anyway (which are thoroughly deprecated).

So, they suggested two different patterns to replace Conditions:

  1. If you really want a timeseries stream of state transitions, we should use Events.
  2. For Status fields that we think clients will care to watch, we should just have fine-grained top-level entries (rather than lists) for the current state.

We're on the bleeding edge, so there aren't other parts of Kubernetes that have migrated off of Conditions yet, and we might be setting the precedents, or we may just need to slightly alter our types once better recommendations are in place.

The Error* fields were my attempt at (2), with the general guideline that if a client modifies a field in the Machine's Spec, it should watch the corresponding field of the Machine Status to see whether or not it has been reconciled, while watching the Error* fields for any errors that occur in the meantime. If you're updating the version of kubelet in the Spec, you should watch the corresponding field in the Status to know when it's been reconciled. This works decently with the Error* fields so long as you have a single controller responsible for the entirety of the Machine Spec, but it breaks down somewhat if you want different controllers to handle different fields of the same object, or to handle reconciling the same Machine under different circumstances.

For instance, one controller may be responsible whenever a full VM replacement is needed, while another may specialize in being able to update a VM in-place for certain Spec changes. It's not fantastic if they're unable to report errors separately, and instead have to overwrite the same fields in the Status. One mitigation is to always publish an Event on the Machine as well, so anyone who cares can still see the full stream of all errors. Another mitigation is to provide very strong guidance over what constitutes an error worth reporting in the Status.

I'll clarify this in the documentation, but I think it's a good idea for Machine.Status.Error* to be reserved for errors that are considered terminal, rather than transient. A terminal error would be something like the fact that the Machine.Spec has an invalid configuration, so the controller won't be able to make progress until someone fixes some aspect of it. Another terminal error would be if the machine-controller is getting Unauthorized or Permission Denied responses from the cloud provider it's calling to create/delete VMs -- it's likely going to require some manual intervention to fix IAM permissions or service credentials before it's able to do anything useful. However, any transient service failures can just be logged in the controller's output and/or added as Events to the Machine, since they should only represent delays in reconciliation and not errors that require intervention.

If two different controllers want to report terminal errors on the same Machine object, then I think it's okay that they are overwriting each other's errors in the Machine.Status, since they are both valid errors that need to be taken care of. By definition, neither controller is able to make progress until someone steps in and modifies the Machine.Spec or some aspect of the environment to fix the error, so someone will need to address both errors anyway (which may have the same underlying root cause). Based on the timing, they'll either see one error or the other first, and will have to fix it. Then, that error will disappear as the first controller gets unwedged, but a second error might replace it from the second controller, and the admin can take action on that as well.

We can figure out some way to represent both errors in the Status at the same time, but I'm not sure how much value there is in that over the above model, or above the cluster admin looking at the Machine events anyway.
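
A sketch that consolidates the guidance above; the example MachineStatusError values are illustrative assumptions, not a defined enum:

// Only the error-related fragment of MachineStatus is shown.
type MachineStatusFragment struct {
	// Terminal problems only: the controller cannot make progress until a
	// human fixes the Spec or the environment. Transient failures should
	// go to controller logs and Events instead.
	// +optional
	ErrorReason *MachineStatusError

	// Human-readable companion to ErrorReason.
	// +optional
	ErrorMessage *string
}

type MachineStatusError string

const (
	// Hypothetical examples of terminal conditions.
	InvalidConfigurationMachineError MachineStatusError = "InvalidConfiguration"
	UnauthorizedMachineError         MachineStatusError = "Unauthorized"
)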

@tamalsaha

tamalsaha commented Nov 6, 2017

Do we really need a new object that represents a Node? We can just use the existing Node object. I think using a separate object type will create another point of reconciliation.

Especially for cloud providers like AWS/GCE, where the actual nodes are created via an autoscaler (hence their names are generated randomly), this will require syncing among 3 things: the cloud provider's InstanceGroup, []Machine, and []Node.

In case of appscode/pharmer, we defined a NodeGroup object https://github.com/appscode/pharmer/blob/dd266cded7e686bdbdc037496351b947bf8081eb/apis/v1alpha1/node.go#L16

This gets translated to the appropriate group concept for cloud providers that support them (AWS & GCE). For a simple VPS provider, this just becomes a simple loop over Node creation. To maintain the sync, we pass the NodeGroup name via kubelet's --node-labels flag.

@roberthbailey
Contributor

Thanks @pipejakob both for your work putting this PR together and handling feedback, and also for helping us wrap it up as you ramp up on Istio.

I will try to take a pass at comparing the types files, but it may take a day or two so if someone wants to jump in (@medinatiger? @jessicaochen?) that'd be really helpful.

I've created #503 and #504 to address your last comment.

@roberthbailey
Contributor

@mvladev volunteered during our meeting today to take a pass on step one (sanity checking the latest changes). Once that is done I'll do a quick pass and lgtm as well.

@mvladev
Contributor

mvladev commented Jan 18, 2018

@roberthbailey after I:

  • pulled the latest master branch
  • manually merged @pipejakob's branch into master and copy-pasted machines-api/types.go into api/cluster/v1alpha1/types.go

I see that there is a little misalignment in the comment for MachineSpec's ObjectMeta:

$ git diff

diff --git a/cluster-api/api/cluster/v1alpha1/types.go b/cluster-api/api/cluster/v1alpha1/types.go
index c5b829a7..d2787c39 100644
--- a/cluster-api/api/cluster/v1alpha1/types.go
+++ b/cluster-api/api/cluster/v1alpha1/types.go
@@ -168,8 +168,8 @@ type Machine struct {

 type MachineSpec struct {
        // This ObjectMeta will autopopulate the Node created. Use this to
-       // indicate what labels, annotations, name prefix, etc., should be used
-       // when creating the Node.
+       // indicate what labels, annotations, etc., should be used when
+       // creating the Node.
        // +optional
        metav1.ObjectMeta `json:"metadata,omitempty"`

The added lines are coming from machines-api/types.go.

Except for that small comment difference, types and structs match the proposal.

@roberthbailey
Contributor

Thanks Martin!

@pipejakob - go ahead and delete machines-api/types.go because as soon as I lgtm this will be merged by the submit queue.

@krisnova
Contributor

krisnova commented Feb 5, 2018

hey @pipejakob can you address the conflict? Also the PR LGTM

@pipejakob
Contributor Author

pipejakob commented Feb 5, 2018

On it, but it looks like the upstream build of cluster-api-gcp is broken, so I can't fully test my rebase yet. I'll wait for #561 or a similar patch to fix the build.

@rsdcastro

#568 is the fix for the compilation error, which is being reviewed.

cc @karan @jessicaochen

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 7, 2018
@pipejakob
Contributor Author

Okay, rebased and tested end-to-end via cluster-api-gcp. Ready to ship?

@roberthbailey
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pipejakob, roberthbailey

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 7, 2018
This is a proposal to add a new API for managing Nodes in a declarative
way: Machines.

It is part of the overall Cluster API effort.
@k8s-ci-robot k8s-ci-robot merged commit a53df90 into kubernetes-retired:master Feb 7, 2018
medinatiger added a commit to medinatiger/kube-deploy that referenced this pull request Feb 15, 2018
k4leung4 pushed a commit to k4leung4/kube-deploy that referenced this pull request Apr 4, 2018
Committing these types so we can start prototyping against them, but the
full proposal is still under review (and accepting feedback) here:

kubernetes-retired#298
k4leung4 pushed a commit to k4leung4/kube-deploy that referenced this pull request Apr 4, 2018
…posal

Minimalistic Machines API proposal.
k4leung4 pushed a commit to k4leung4/kube-deploy that referenced this pull request Apr 4, 2018