This repository has been archived by the owner on May 22, 2020. It is now read-only.

Minimalistic Machines API proposal. #298

Merged

Conversation

pipejakob
Contributor

This is a proposal to add a new API for managing Nodes in a declarative way: Machines.

It is part of the overall Cluster API effort.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 19, 2017
@k8s-reviewable

This change is Reviewable

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 19, 2017
@pipejakob pipejakob force-pushed the machines_proposal branch 7 times, most recently from e4232ee to 0302668 Compare October 19, 2017 08:01
@mrIncompetent
Contributor

Hey,

we are working on a very similar concept:
https://github.com/kube-node/nodeset

Basic idea is to rely on the already existing node resources.
We defined 2 extra resources: NodeSet and NodeClass.

Flow is:

  • nodeset-controller creates node resources
  • node-controller provisions machines at the cloud-provider based upon the node resource and the assigned nodeclass

We have an example node-controller (using docker-machine): https://github.com/kube-node/kube-machine
Example nodeset controller (using archon): https://github.com/kube-node/archon-nodeset
We are working right now on a generic nodeset controller & a gke-nodeset controller.

How can we align and possibly collaborate on this topic?

Contributors (not sorted by anything):
@sttts @ledzep2 @s-urbaniak @adieu @scheeles @realfake @chaosaffe @guusvw @metalmatze

@pipejakob
Contributor Author

@mrIncompetent Your nodeset looks similar to what we're trying to do, but I'm wondering if there's a write-up somewhere of the explicit goals of the project? It's not clear to me from reading the code where its boundaries are, or what user problems it's trying to solve.

For instance, I don't see any notion of software versions in the nodeset, so it doesn't seem like you're targeting being able to upgrade nodes like we are. Also, only having a "set" concept for nodes without being able to address specific ones seems like it would give you no control over which nodes to scale down when you have excess capacity (which could be important if you want to target the most idle nodes for deletion). I could be wrong, though, if you're intending for that use case to be handled by deleting Node objects themselves.

That said, it would be great to have any/all of its contributors join us in our ongoing Cluster/Machines API discussions. We're meeting weekly on Wednesdays at 11:00 PST via Zoom (https://zoom.us/j/166836624). If you want to get the invite on your calendar, you can join the SIG Cluster Lifecycle mailing list, where we can also start a thread to discuss this before our next Zoom meeting.

Contributor

@justinsb justinsb left a comment

Looks great!


The ProviderConfig is recommended to be a serialized API object in a format
owned by that provider, akin to the [Component Config](https://goo.gl/opSc2o)
pattern. This will allow the configuration to be strongly typed, versioned, and
Contributor

I personally think this is a little bit of a cheat, and I'd like us to avoid this if possible, but I also recognize that we'll always need an escape hatch in the long term. My concern in the short/medium term is that allowing this means we avoid defining common fields where we actually could do so.

Contributor Author

I absolutely, 100% agree that it's a simplification for the short-term. I'll be fully transparent here:

My other proposal was a little too over-indexed on trying to support a cloud-agnostic cluster autoscaler as an initial customer. That was appealing from a design standpoint, because the cluster autoscaler already exists and actually has concrete requirements to build off of, but also seemed like a logical thing you would want to do with a declarative NodeSet concept. After a lot of discussions, though, the feedback was:

  1. For a very first proposal of a brand-new API, it had a lot of new concepts to grasp.

To be fair, I had new types across two dimensions: Machine / MachineSet / MachineDeployment in one dimension, and Machine / MachineClass / MachineTemplate across the other. I was urged to think about the absolute minimum we could get away with for the first iteration, and only add more complexity and new concepts later when we were certain they were necessary.

My other design also had the requirement that in order to create even a single simple node, you had to first create a cloud-specific MachineTemplate, then register it as a MachineClass, and then finally create a Machine that referenced the class. It seemed like a lot of overhead just to support a single custom node that you had no intent to reuse. So, one design principle that emerged from that was that if I just want to create a single Machine and not care about portability, I should be able to. I think there's still room to evolve the API and introduce other concepts down the road, but it seems reasonable to me that if you don't care about portability, it should be possible to create a single custom Machine without the overhead of needing to use other concepts, which means that we'll need something akin to the opaque ProviderConfig blob we have now to actually be able to feed the right values into the cloud-specific node creation APIs.

  2. Rebasing the cluster autoscaler on top of our new APIs isn't actually delivering any new value, since cluster autoscaling already exists today.

It's a nice-to-have down the road, but in order to get the momentum we want for the project, we should focus first on the new functionality this enables that was never possible before (like cloud- and deployment-agnostic cluster upgrades). If you remove cluster autoscaling (and node autoprovisioning) as a client, at least for the short term, then I think most of the benefits of having more cloud-agnostic fields in the MachineSpec disappear as well.

I could definitely be wrong here, and welcome more feedback. But, the intent of this proposal was "what is the absolute bare minimum that we can all agree is a good starting point?" and then continue to add more as we see appropriate.

Some of the things I had considered for inclusion in the MachineSpec:

  • OS image
    • This could be represented as a single string across most providers, but the values wouldn't be portable anyway. If the values are cloud-specific, then moving the entire field to be cloud-specific means that each provider can represent this however it would like. For instance, in GCE, OS images are actually more naturally represented in a structured way, since they have a project, family, name, etc. In DigitalOcean, OS Images can also be referred to via int IDs, so an IntOrStr might be more appropriate there.
  • Disk configuration
    • I don't know if I personally have enough of a grasp on how much power is needed to be useful here. I've swung back and forth between two extremes: a simplified view of just specifying a total amount of working space desired, and on the other end of the spectrum, having a full array of structs to represent disks, along with fields for how much capacity each disk should have, whether it should be bootable, etc. At this point, deferring disk setup to the ProviderConfig lets us look at how different early adopters represent this in their config, and we can always bubble it back up to a generic representation in MachineSpec in the future once we think we have a grasp of where we want to end up on the power/usability trade-off scale.
  • Preemptible
    • AWS and GCE support these, but I think Azure only supports low-priority VMs in Batch, and I don't know if this concept exists in other on-premise environments like vSphere.
  • Topology
    • Like OS Image, these are similar-enough concepts across environments (regions, zones, availability zones, etc), but I couldn't see the actual values being portable at all. Also, the number of dimensions you might want to support in different environments might differ wildly. In some clouds, a single availability zone might be enough to specify. If you're on-premise, you might want a lot of custom fields like datacenter, rack, etc.

There are many more, but you get the point. Are there particular concepts that you feel we should represent in the MachineSpec now (or in the near term)?


@justinsb it shouldn't be a problem to lift common fields into this API as we notice them. Providers can always extend their controllers to check for the API field and fall back on their old provider-specific field.

Contributor

@mvladev mvladev Nov 5, 2017

@pipejakob After re-watching all the meeting recordings, I see the justification for using a serialized blob - simplicity. I'm 100% for simplicity. However, I think that providerConfig is very similar to EnvVar in a Container: there you can define a literal value when you don't need reusability (or just want to move quickly), and a reference when you do. Something like

providerConfig: # either value or valueRef
  value: >
    {
      "apiVersion": "gceproviderconfig/v1alpha1",
      "kind": "GCEProviderConfig",
      "machineType": "n1-standard-1",
      ...
    }
  valueRef:
    apiVersion: gceproviderconfig/v1alpha1
    kind: GCEProviderConfig
    name: config-1

could be suitable for both use cases, and it feels very similar to what people already know and use in Pods.

Edit:
even better would be if we had:

providerConfig:
  value: # either value (runtime.Object/runtime.RawExtension) or valueRef
    apiVersion: gceproviderconfig/v1alpha1
    kind: GCEProviderConfig
    machineType: n1-standard-1
    project: test-1-test
    ...
  valueRef:
    apiVersion: gceproviderconfig/v1alpha1
    kind: GCEProviderConfig
    name: config-1

p.s. this is a re-post of a previous comment (wrong user)
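
A minimal Go sketch of the value/valueRef union described above, using illustrative type and field names rather than anything defined by the proposal:

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// ProviderConfig mirrors the EnvVar Value/ValueFrom pattern: exactly one
// of the two fields is expected to be set.
type ProviderConfig struct {
	// Inline, provider-owned configuration (opaque to this API).
	// +optional
	Value *runtime.RawExtension `json:"value,omitempty"`

	// Reference to a separately stored configuration object, for reuse
	// across Machines.
	// +optional
	ValueRef *corev1.ObjectReference `json:"valueRef,omitempty"`
}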

Contributor

Why not use runtime.Object/runtime.RawExtension instead of a string?


I second this. runtime.RawExtension needs an apiVersion and will probably fall back to Unstructured (@mrIncompetent can you confirm that?). Alternatively, a generic JSON type will do, compare https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1/types_jsonschema.go#L28.

Contributor

@sttts runtime.RawExtension.Object would fall back to runtime.Unknown.
Though I've been using it without apiVersion, more like https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1/types_jsonschema.go#L28
So I only used runtime.RawExtension.Raw. In this case, RawExtension.Object will be nil.

Contributor

@mvladev @pipejakob @justinsb @sttts @roberthbailey
Any objections to using runtime.RawExtension for providerConfig?


sounds good

Contributor

No objection, it's way better than string.
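
To make the agreed-upon direction concrete, here is a rough sketch of a RawExtension-backed providerConfig field and how a provider controller might decode it. GCEProviderConfig and its fields are assumptions for illustration, not types defined by this proposal:

package example

import (
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
)

// MachineSpecFragment shows only the provider-owned portion of the spec.
type MachineSpecFragment struct {
	// Provider-specific configuration, embedded as a serialized object
	// rather than a plain string.
	ProviderConfig runtime.RawExtension `json:"providerConfig,omitempty"`
}

// GCEProviderConfig is a hypothetical provider-owned type; each provider
// versions its own config independently of the Machines API.
type GCEProviderConfig struct {
	APIVersion  string `json:"apiVersion"`
	Kind        string `json:"kind"`
	MachineType string `json:"machineType"`
	Project     string `json:"project"`
}

// decodeProviderConfig reads the raw bytes; as noted above, only
// RawExtension.Raw is populated and RawExtension.Object may be nil.
func decodeProviderConfig(spec MachineSpecFragment) (*GCEProviderConfig, error) {
	cfg := &GCEProviderConfig{}
	if err := json.Unmarshal(spec.ProviderConfig.Raw, cfg); err != nil {
		return nil, fmt.Errorf("unable to decode provider config: %v", err)
	}
	return cfg, nil
}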

update machine

and allow the provider to decide if it is capable of performing an in-place
update, or if a full Node replacement is necessary.
Contributor

Agree - and this could also be an option / policy that can be set by the machine controller. For example, kops encourages full node replacement (a little more reliable, perhaps). But certainly it is slower, and some people would likely choose in-place replacement if it was available. And if we have a kops-controller, it'll have to support in-place for bare-metal.

// |               | Master present        | Master absent          |
// +---------------+-----------------------+------------------------+
// | Node present: | Install control plane | Join the cluster as    |
// |               | and be schedulable    | just a node            |
Contributor

I'm not sure schedulable exists as a concept any more... AIUI the masters are supposed to be schedulable, but tainted so that user pods aren't scheduled to them. But e.g. a monitoring daemonset or networking daemonset should tolerate the taint and thus run on the master.

Contributor Author

Ah, this is poor wording on my part. If I replace "schedulable" with "untainted" and "unschedulable" with "tainted," is that sufficient? Or should we not differentiate between (Master) and (Master, Node)? Or do you think there's a better way to represent the desire to install the control plane altogether?


From what @justinsb said, it sounds like there shouldn't be a distinction between Master and Node, and I think I agree with this position. Is there any scenario where an un-tainted master would be significantly preferable to a tainted one?

type MachineStatus struct {
// If the corresponding Node exists, this will point to its object.
// +optional
NodeRef *api.ObjectReference
Contributor

Not sure if we should define a field like ProviderId (https://github.com/kubernetes/kubernetes/blob/master/pkg/api/types.go#L3092), for the window in between when a machine is created and when it registers with kube-apiserver. But happy to wait and see...

Contributor

This should be *corev1.ObjectReference


// When was this status last observed
// +optional
LastUpdated metav1.Time
Contributor

Let's make clear that this is last-transition, rather than a heartbeat. (If we hadn't done the node heartbeat, I'd wager we'd be at 10k nodes by now...)

Contributor Author

Will do.

Name string

// Semantic version of the container runtime to use
Version string
Contributor

Going to guess these should be optional to mean "use the controller-recommended setting for the k8s/kubelet version" (which IMO is what we should be encouraging!)

Contributor Author

Ah, good point. I think it's a good idea to make these optional at cluster creation time, but do you think they should stay optional afterwards?


Be careful with default values for fields. Changing defaults can easily break an API and unlike ProviderConfig there's no clear provider-specific versioning mechanism for the defaults applied to these common fields.

pipejakob added a commit to pipejakob/kube-deploy that referenced this pull request Oct 23, 2017
Committing these types so we can start prototyping against them, but the
full proposal is still under review (and accepting feedback) here:

kubernetes-retired#298
@luxas
Contributor

luxas commented Oct 24, 2017

@kubernetes/sig-cluster-lifecycle-pr-reviews

@adieu

adieu commented Oct 25, 2017

@pipejakob Just wanted to share some thoughts while we were working on a project called Archon with similar ideas.

The basic idea is the same. Using declarative resources in Kubernetes to define the node machines and delegating real work to controllers. We chose Instance and InstanceGroup as the resource names but they could be easily mapped to Machine and MachineSet.

Here is the definition for Instance:

type Instance struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata"`
	Spec              InstanceSpec       `json:"spec,omitempty"`
	Status            InstanceStatus     `json:"status,omitempty"`
	Dependency        InstanceDependency `json:"-"`
}

type InstanceSpec struct {
	OS                  string                 `json:"os,omitempty"`
	Image               string                 `json:"image,omitempty"`
	InstanceType        string                 `json:"instanceType,omitempty"`
	NetworkName         string                 `json:"networkName,omitempty"`
	ReclaimPolicy       InstanceReclaimPolicy  `json:"reclaimPolicy,omitempty"`
	Files               []FileSpec             `json:"files,omitempty"`
	Secrets             []LocalObjectReference `json:"secrets,omitempty"`
	Configs             []ConfigSpec           `json:"configs,omitempty"`
	Users               []LocalObjectReference `json:"users,omitempty"`
	Hostname            string                 `json:"hostname,omitempty"`
	ReservedInstanceRef *LocalObjectReference  `json:"reservedInstanceRef,omitempty"`
}

type InstanceStatus struct {
	Phase      InstancePhase       `json:"phase,omitempty"`
	Conditions []InstanceCondition `json:"conditions,omitempty"`
	// TODO: allow multiple ips
	PrivateIP         string      `json:"privateIP,omitempty"`
	PublicIP          string      `json:"publicIP,omitempty"`
	InstanceID        string      `json:"instanceID,omitempty"`
	CreationTimestamp metav1.Time `json:"creationTimestamp,omitempty" protobuf:"bytes,8,opt,name=creationTimestamp"`
}

type FileSpec struct {
	Name               string `json:"name,omitempty" yaml:"name,omitempty"`
	Encoding           string `json:"encoding,omitempty" yaml:"encoding,omitempty" valid:"^(base64|b64|gz|gzip|gz\\+base64|gzip\\+base64|gz\\+b64|gzip\\+b64)$"`
	Content            string `json:"content,omitempty" yaml:"content,omitempty"`
	Template           string `json:"template,omitempty" yaml:"template,omitempty"`
	Owner              string `json:"owner,omitempty" yaml:"owner,omitempty"`
	UserID             int    `json:"userID,omitempty" yaml:"userID,omitempty"`
	GroupID            int    `json:"groupID,omitempty" yaml:"groupID,omitempty"`
	Filesystem         string `json:"filesystem,omitempty" yaml:"filesystem,omitempty"`
	Path               string `json:"path,omitempty" yaml:"path,omitempty"`
	RawFilePermissions string `json:"permissions,omitempty" yaml:"permissions,omitempty" valid:"^0?[0-7]{3,4}$"`
}

We put lots of information in InstanceSpec because we think Instance should contain all the information needed to create the machine, and later we could introduce InstanceGroup and InstanceDeployment using Instance as a base. It's just like the relationship between Pod, ReplicaSet and Deployment, which could be easily adopted by Kubernetes users.

To adapt to different cloud controllers, we put common fields like OS, Image, and InstanceType in InstanceSpec and left cloud-specific configs in annotations and Files. A File is like a Unix file. We use File to inject files directly into the target machine, and controllers can watch a specific path to retrieve additional configuration information, much like using /proc/ files. There should be an agreement between the controller author and the controller user on the path and format of the configuration files, but it should be hidden from the generic view since it's an implementation detail. The concept behind the special File idea and your ProviderConfig idea is the same: we should leave some extension points for the controllers.

There are two minor differences:

We chose to treat Instance as a read-only resource, just like Pod. One can only modify the machine by deleting and recreating the Instance resource. In order to reuse some existing machines, we introduced a new concept called ReservedInstance. We found this approach easier to implement and reason about. However, if we went down the mutable path, controller authors could still stick with the immutable approach if they like, so there's no real problem here.

Another difference is that we target Archon as a general-purpose computing resource management tool instead of a Kubernetes-specific one. The fundamental design could be used to build an etcd cluster or any distributed system, but we have Kubernetes support built in. I understand that kube-deploy is a Kubernetes project and its main focus is on Kubernetes, but in the real world sysadmins manage a bunch of other servers besides the Kubernetes cluster. Many of them have their own tools for server bootstrapping and configuration. It will be very hard to persuade them to adopt a new tool which can only be used to manage Kubernetes clusters. If we support generic server management, adapters can be made for existing tools. terraform-provider-archon is an adapter we made for existing Terraform users.

Whether the Machine resource represents only a Kubernetes node or a more generic server is an important design decision. Maybe we should hear from more people about the pros and cons. Personally I would support the generic server model, because it's easy to add a higher-level abstraction for Kubernetes but not vice versa. There are use cases like a dedicated etcd cluster or a storage cluster which could be covered by this model. In order to be more generic, we have resources like Network and User in Archon, and controllers to manage VPCs and certificates.

I just wanted to raise another question here. Is the Machine resource more like an Ingress resource or a Pod resource? The answer will influence the controller design. If it's like Pod, then we probably have a central controller and something like CRI for different backends. If it's like Ingress, then we would not have a master controller and all the controllers would consume the resource definition by themselves. We could leave this question aside until we begin to implement the controller, but I think the answer will influence the design of the Machine resource.

Working with @mrIncompetent and others, we introduced concepts like NodeSet and NodeClass. As you said in your comments, it's another dimension. We made archon-nodeset to consume the NodeSet resources and translate them into Archon InstanceGroup resources. Hopefully the work we have done in the kube-node project can be used as a reference for your Machine, MachineClass and MachineTemplate design.

BTW, we use jsonnet to build the final InstanceGroup definition in modules and we bundled the jsonnet files into one single executable file to improve user experience. I think they might be useful for your Machine resource too. I'll share more information on this topic in another thread.

I'm really looking forward to seeing a common resource shared by all the Kubernetes bootstrapping tools get defined. My thoughts on this topic might not be correct or optimal; I just wanted to bring up some of the design decisions to be made. Hopefully we can attract more people interested in this idea and polish the resource definition together.

ContainerRuntime ContainerRuntimeInfo
}

type ContainerRuntimeInfo struct {


What is the scope of this particular object/concept? Should we expect to see {Rkt,Frakti,Containerd}RuntimeConfig structs at some point?

Contributor Author

The intended purpose was to:

  1. Know what runtime to install at provisioning time. It's completely fine for an implementation to say "sorry, I only know how to install Docker. Anything else will result in an UnsupportedConfiguration error." Or even "I have no idea how to install that version of Docker," etc.
  2. Know how to set kubelet's --container-runtime flag.

You're right to question how this will evolve in the future, with cluster admins potentially wanting to fine-tune the settings of the runtime itself. I think it would largely follow the same pattern we use to handle the provider-specific configuration for Machines. At first, we would likely have an opaque blob to capture all of the settings that were fed to the runtime installer, which would allow each container runtime to version its configuration independently of the Machines API. Then, we could potentially upgrade this to an ObjectReference so that identical configuration wouldn't have to be inlined with every object that uses it. Any time we have configuration that seems to be useful across every runtime we support, we can graduate it out of the opaque config blob and into the ContainerRuntimeInfo struct if desired.

I think the ContainerRuntimeInfo is a good candidate for culling in v1alpha1, actually, until there's a strong need to add it. Definitely worth discussion.
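
For illustration, the evolution described above could look roughly like the following; the Config field is a hypothetical future addition, not part of the current proposal:

package example

import (
	"k8s.io/apimachinery/pkg/runtime"
)

type ContainerRuntimeInfo struct {
	// Name of the runtime to install (e.g. "docker"), also used to set
	// kubelet's --container-runtime flag.
	Name string

	// Semantic version of the container runtime to use.
	Version string

	// Hypothetical: an opaque, runtime-owned configuration blob, versioned
	// independently of the Machines API (the same pattern as ProviderConfig).
	// +optional
	Config *runtime.RawExtension
}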

// | Node absent:  | Install control plane | Invalid configuration  |
// |               | and be unschedulable  |                        |
// +---------------+-----------------------+------------------------+
Roles []string

I'd suggest just making this a string and having a single identifier for each possible configuration. It's much less confusing that way. e.g. have "SchedulableMaster," "UnschedulableMaster," and "Node."

Contributor Author

I actually originally had it that way in an earlier draft that was Google-internal only, and got the opposite feedback from Brian Grant, Tim Hockin, and Daniel Smith: make it a list of strings instead.

I think there might be contention over exactly what roles or node installation scenarios we want to support directly in this API, however.


Can you ping me the internal doc so I can get up to speed on the rationale?


type MachineSpec struct {
// This ObjectMeta will autopopulate the Node created. Use this to
// indicate what labels, annotations, name prefix, etc., should be used

Is the Name of the MachineSpec used as the name prefix? This sounds reasonable to me; just double checking.

Contributor Author

Actually, I was aiming to follow the same pattern as we do for pods (and other objects): you can specify name: value if you know exactly what name you want to use, or generateName: value to use value as a prefix and have a unique suffix generated for you.

I can call this out much more explicitly, though.


Isn't generateName handled generically by the API server? How will you prevent it from renaming the MachineSpec you create, so that providers can generate the names (e.g. GKE has a way of generating names that is not the same as what the API server does)?

OTOH you could just force providers to use the name generated by the API server. Though IDK what kinds of incompatibility that would introduce.

Keep in mind that using generateName will prevent you from making idempotent creation requests, because it is not deterministic.
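
For concreteness, a small Go sketch of the two naming patterns discussed above, with illustrative values:

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// namingExamples shows the two ObjectMeta naming patterns.
func namingExamples() (metav1.ObjectMeta, metav1.ObjectMeta) {
	// Exact name: you know precisely what the object should be called.
	exact := metav1.ObjectMeta{Name: "my-machine"}

	// Prefix: the API server appends a unique suffix (e.g. "node-x7k2p"),
	// which also means the creation request is not idempotent.
	prefixed := metav1.ObjectMeta{GenerateName: "node-"}

	return exact, prefixed
}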

attempt to upgrade machine in-place
if error:
create new machine
delete old machine

Remember to drain the old machine before deleting it, so that the containers get a chance to exit gracefully. IIRC drain also ensures you respect any pod disruption budgets that are set up in the cluster.

Contributor

@mtaufen there is an open issue to move the drain operation into the k8s server.

In the meantime, I think it's much better to have a special drain controller that marks Machines for draining, adding a finalizer to prevent deletion and removing it once the node has been drained.
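
A rough sketch of that drain-controller flow, assuming a hypothetical finalizer name and caller-supplied drain and finalizer-removal helpers (none of this is an existing API):

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Assumed finalizer name; the real value would be defined by the API.
const drainFinalizer = "machine.cluster.k8s.io/drain"

// Machine is a minimal stand-in so the sketch is self-contained.
type Machine struct {
	ObjectMeta metav1.ObjectMeta
}

func hasFinalizer(m *Machine, name string) bool {
	for _, f := range m.ObjectMeta.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

// reconcileDeletion drains first, then removes the finalizer so the API
// server can complete the delete. PodDisruptionBudgets may keep the drain
// call failing until evictions are allowed; the controller simply retries.
func reconcileDeletion(m *Machine, drain func(*Machine) error, removeFinalizer func(*Machine, string) error) error {
	if m.ObjectMeta.DeletionTimestamp == nil || !hasFinalizer(m, drainFinalizer) {
		return nil
	}
	if err := drain(m); err != nil {
		return err
	}
	return removeFinalizer(m, drainFinalizer)
}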

built on top of the Machines API would follow the same pattern:

for machine in machines:
attempt to upgrade machine in-place

Just a note: With in-place upgrades, providers should determine how disruptive a given in-place mutation is and ensure that they respect the pod disruption budget.

## In-place vs. Replace

One simplification that might be controversial in this proposal is the lack of
API control over "in-place" versus "replace" reconciliation strategies. For

Users may end up wanting this to make a trade-off between the disruptiveness and cleanliness of a rollout, but I think it's fine to push this down to the ProviderConfig and leave it out of the top-level API.



* Dynamic API endpoint

This proposal lacks the ability to declaratively update the kube-apiserver
endpoint for the kubelet to register with. This feature could be added later,

I'm unclear on what this section means, but it kinda sounds like something CRD could handle?

// controller observes that the spec has changed and no longer matches
// reality, it should update Ready to false before reconciling the
// state, and then set back to true when the state matches the spec.
Ready bool
Contributor

What criteria will govern this value?
Also is there an expectation that this value will always be accurate?

Contributor

This feels like it might be a duplication of data that already exists in the API. Moreover, what is actually updating this? My take is that this would be a lot of overhead and may not scale. Making this generic seems to be an issue. Can we rely on the kubelet's ready status, or what other options do we have?

Contributor

@krisnova krisnova left a comment

I am very much in favor of building out a pre-alpha version of this so we can start testing sooner rather than later. The whole point is that we can version these, and I will probably have much more concrete feedback once I've tried mutating infrastructure with these 😄

TLDR; LGTM

@mattbates

mattbates commented Oct 28, 2017

cc @munnerz @simonswine

@pipejakob
Contributor Author

@adieu Yes, Archon is one of the many projects we looked at when starting to work on the Cluster API project. I do think there's a lot of overlap conceptually, but that the two efforts have fundamentally different goals.

As you say yourself, Archon's Instances are deliberately general purpose and Kubernetes-unaware. I believe that this makes configuring a Kubernetes cluster much closer to the Kubernetes The Hard Way experience than the ease of other installers. Rather than saying conceptually "I would like to use the 1.8.1 version of the control plane," one must explicitly model the entire static pod manifest of every component, including every flag to pass them, the liveness probe to use, the volumes to mount, etc. This offers infinite flexibility, but because you give this much power to the end user by only abstracting away the concept of files to place on disk, I believe it actually becomes much more tedious to configure a functioning cluster from scratch, with no guarantee that two clusters created by two different users actually look the same. They could choose to put their static pod manifests in different directories, or even run all of their control plane components via systemd instead of the kubelet. This flexibility is very powerful, but it makes it very difficult to operate on those clusters in a generic way.

One of the most important use cases we're targeting with the Cluster and Machines APIs is for developers to be able to write operational tooling on top of these APIs that is completely agnostic of the cluster's environment, the cloud that it's running in, and even the deployment mechanism used to provision the cluster. With the current proposal of having Kubernetes concepts be first-class citizens of these APIs, we will enable tooling like generic cluster upgraders that only have to update the value of a single field on an object and have the right thing happen, with little room for shooting oneself in the foot.

In the Archon world, I don't see how a tool could generically upgrade a single Instance without understanding whether that Instance is supposed to be running the control plane, or just run a kubelet that registers with a cluster master. Further, the tool will have to understand which Files refer to which control plane components in order to understand how to even upgrade them. Flag names can change between different Kubernetes versions, so the upgrade tool would need to know about what name transformations, deprecations, and additions to make to the flags passed to each component separately. Please correct me if I'm wrong, but I think any tooling written on top of Archon's Instances would need to deeply inspect every object and have many switch statements to know how to safely upgrade or downgrade Instances or the components running on them, or else have these sections maintained by hand by the cluster admin. Also, if an Instance is not Kubernetes-aware, then would tooling that deletes an Instance need to handle safely evicting workloads from that Instance first?

I understand the desire to create a completely generic abstraction of hosts, but I think that direction is off the spectrum of what would be usable for the use cases we're targeting with this particular project.

However, one way that I can see these two projects potentially collaborating is by having a Machine -> Instance adapter that allows us to take advantage of all of the existing work in Archon in order to jumpstart the number of providers supported by the Machines API, and to keep taking advantage of any new providers implemented in the Archon project.

As for your question of whether a Machine object is modeled more like an Ingress than a Pod, I would say Ingress (based on my understanding of the intent of your question). It is fully expected that your cluster should have a cloud- or environment-specific controller handling your Machine objects. However, most of the Machine's spec is generic, so the power comes from these objects being operated on in a generic way. You don't need to know which cloud a Machine is in to change its kubelet version, but a cloud-specific controller will handle reconciling the real world with that new declared spec.

@kfox1111

@pipejakob Couldn't the abstraction still be done with Archon-like language, with some kind of helper on the host and annotations of some kind on the node? Say the user sets a "k8s version: previous+1" annotation. The node itself could pull the annotation and tweak things, like performing a kubelet upgrade. Maybe some of the logic you're talking about belongs in kubeadm? Being a node-level thing has the advantage of being very agnostic to the tool (it could be driven in the image by Go, Ansible, Chef, etc.). You could map annotations to Chef roles, for example.

It would be really nice if both use cases could be handled by the same object with some extra fields somehow. For an operator, the line between a fully managed k8s node and "oh, I need to override this one thing on the host" is sometimes very fine. Being able to mix both use cases together would really be helpful, I think.


type MachineVersionInfo struct {
// Semantic version of kubelet to run
Kubelet string
Contributor

Is this an optional value? In most cases you want to match your kubelet and api-server versions.

Contributor Author

I was thinking that it would be optional at installation time, but that the installer would fill it in with the value that was actually used. Then, tooling built on top of this API can inspect and potentially modify a concrete value here, instead of having to reason about what an empty value here means. I'll document the expectations here more clearly.

Contributor

Can we make this a struct? The kubelet, to me, is an object, not a string. Is a struct with a single string a good start? For instance, we have multiple components. Also, how does this relate (or not) to component config?

Contributor Author

I'm not against making this a struct, but I'm wondering how you see that evolving in the future. What other fields would you envision in the struct? The way this is laid out currently is:

machine:
  spec:
    versions:
      kubelet: 1.8.0

Are you hoping to have different ways of specifying the version of kubelet to use beyond a single semver, or were you hoping to gather any configuration related to the kubelet into a single struct, rather than having the version be stand-alone here?
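
For reference, the two shapes under discussion look roughly like this; KubeletInfo and its fields are hypothetical, not part of the proposal:

// As proposed: the version is a bare semver string.
type VersionsAsString struct {
	Kubelet string
}

// The struct variant being suggested.
type VersionsAsStruct struct {
	Kubelet KubeletInfo
}

type KubeletInfo struct {
	// Semantic version of kubelet to run.
	Version string
	// Other kubelet-related settings could be added here later without
	// changing the shape of the parent struct again.
}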

import (
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/kubernetes/pkg/api"
Contributor

This can be removed and we can use k8s.io/api/core/v1 instead.

Contributor Author

Yup, this was an accident that got fixed in the merged types but not this PR. I'll update the PR with the newest definition from the codebase.


// If set, indicates that there is a problem reconciling state, and
// will be set to a human readable string to indicate the problem.
ErrorMessage *string
}
Contributor

Where does the provider put the status for the cloud resources it creates? ProviderA might create/update/delete security groups, keys, or anything related for every machine it reconciles, and those cloud resources are going to be completely different from ProviderB's resources.

Contributor

Would ErrorReason *MachineStatusError and ErrorMessage *string make more sense as a list of a struct that encapsulates those values? What does the event data structure look like? This almost appears to be a list of Error Events.

Contributor

// +optional
// If set, indicates that there is a problem reconciling state, and
// will be set to a human readable string to indicate the problem.
ErrorMessage *string

Maybe this is a better question, since the struct may be below. What is the ErrorMessage? How does it relate to ErrorReason?

Contributor Author

We discussed this in one of the Cluster API breakout sessions, but I forgot to follow-up here: it's completely up to the controller as an implementation detail. A decent pattern is for the controller to add custom annotations to the Machines it's reconciling, to keep track of information about external resources it has created. The controller could also create its own ConfigMaps or CustomResources to have better control (and RBAC scoping) of its state, or even store state outside of the cluster if it makes sense.
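
A small sketch of the annotation pattern just described; the annotation key is an illustrative assumption:

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// recordExternalResource stores a cloud resource ID on the Machine's
// metadata so the controller can find it on the next reconcile.
func recordExternalResource(meta *metav1.ObjectMeta, instanceID string) {
	if meta.Annotations == nil {
		meta.Annotations = map[string]string{}
	}
	meta.Annotations["machine.example.com/cloud-instance-id"] = instanceID
}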

Contributor Author

As for ErrorMessage / ErrorReason, this is definitely one of the warts of the design so far that I'd like to flesh out with any better approaches.

We did get the advice from Brian Grant and Eric Tune that the existing pattern of Conditions in object statuses is near-deprecated. As part of trying to bring several controllers to GA, a survey was sent out to understand how people use Conditions and how effective they are for the intent. The overall response was that having Conditions be a list of state transitions generally made them not useful for the kinds of checks people wanted to make against the Status, which are to answer the question "is this done yet?". This resulted in clients always just looking at the most recent Condition in the list and treating it as the current state, which on top of making the client logic more difficult, also made them deteriorate into Phases anyway (which are thoroughly deprecated).

So, they suggested two different patterns to replace Conditions:

  1. If you really want a timeseries stream of state transitions, we should use Events.
  2. For Status fields that we think clients will care to watch, we should just have fine-grained top-level entries (rather than lists) for the current state.

We're on the bleeding edge, so there aren't other parts of Kubernetes that have migrated off of Conditions yet, and we might be setting the precedents, or we may just need to slightly alter our types once better recommendations are in place.

The Error* fields were my attempt at (2), with the general guideline that if a client modifies a field in the Machine's Spec, it should watch the corresponding field of the Machine Status to see whether or not it has been reconciled, while watching the Error* fields for any errors that occur in the meantime. If you're updating the version of kubelet in the Spec, you should watch the corresponding field in the Status to know when it's been reconciled. This works decently with the Error* fields so long as you have a single controller responsible for the entirety of the Machine Spec, but it breaks down somewhat if you want different controllers to handle different fields of the same object, or to handle reconciling the same Machine under different circumstances.

For instance, one controller may be responsible whenever a full VM replacement is needed, while another may specialize in being able to update a VM in-place for certain Spec changes. It's not fantastic if they're unable to report errors separately, and instead have to overwrite the same fields in the Status. One mitigation is to always publish an Event on the Machine as well, so anyone who cares can still see the full stream of all errors. Another mitigation is to provide very strong guidance over what constitutes an error worth reporting in the Status.

I'll clarify this in the documentation, but I think it's a good idea for Machine.Status.Error* to be reserved for errors that are considered terminal, rather than transient. A terminal error would be something like the fact that the Machine.Spec has an invalid configuration, so the controller won't be able to make progress until someone fixes some aspect of it. Another terminal error would be if the machine-controller is getting Unauthorized or Permission Denied responses from the cloud provider it's calling to create/delete VMs -- it's likely going to require some manual intervention to fix IAM permissions or service credentials before it's able to do anything useful. However, any transient service failures can just be logged in the controller's output and/or added as Events to the Machine, since they should only represent delays in reconciliation and not errors that require intervention.

If two different controllers want to report terminal errors on the same Machine object, then I think it's okay that they are overwriting each other's errors in the Machine.Status, since they are both valid errors that need to be taken care of. By definition, neither controller is able to make progress until someone steps in and modifies the Machine.Spec or some aspect of the environment to fix the error, so someone will need to address both errors anyway (which may have the same underlying root cause). Based on the timing, they'll either see one error or the other first, and will have to fix it. Then, that error will disappear as the first controller gets unwedged, but a second error might replace it from the second controller, and the admin can take action on that as well.

We can figure out some way to represent both errors in the Status at the same time, but I'm not sure how much value there is in that over the above model, or above the cluster admin looking at the Machine events anyway.
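
A sketch that consolidates the guidance above; the example MachineStatusError values are illustrative assumptions, not a defined enum:

// Only the error-related fragment of MachineStatus is shown.
type MachineStatusFragment struct {
	// Terminal problems only: the controller cannot make progress until a
	// human fixes the Spec or the environment. Transient failures should
	// go to controller logs and Events instead.
	// +optional
	ErrorReason *MachineStatusError

	// Human-readable companion to ErrorReason.
	// +optional
	ErrorMessage *string
}

type MachineStatusError string

const (
	// Hypothetical examples of terminal conditions.
	InvalidConfigurationMachineError MachineStatusError = "InvalidConfiguration"
	UnauthorizedMachineError         MachineStatusError = "Unauthorized"
)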

@tamalsaha

tamalsaha commented Nov 6, 2017

Do we really need a new object that represents a Node? We can just use the existing Node object. I think using a separate object type will create another point of reconciliation.

Especially for cloud providers like AWS/GCE, where the actual nodes are created via an autoscaler (hence their names are generated randomly), this will require syncing among 3 things: the cloud provider's InstanceGroup, []Machine, and []Node.

In case of appscode/pharmer, we defined a NodeGroup object https://github.com/appscode/pharmer/blob/dd266cded7e686bdbdc037496351b947bf8081eb/apis/v1alpha1/node.go#L16

This gets translated to the appropriate group concept for cloud providers that support them (AWS & GCE). For a simple VPS provider, this just becomes a simple loop over Node creation. To maintain the sync, we pass the NodeGroup name via kubelet's --node-labels flag.

@roberthbailey
Contributor

Thanks @pipejakob both for your work putting this PR together and handling feedback, and also for helping us wrap it up as you ramp up on Istio.

I will try to take a pass at comparing the types files, but it may take a day or two so if someone wants to jump in (@medinatiger? @jessicaochen?) that'd be really helpful.

I've created #503 and #504 to address your last comment.

@roberthbailey
Contributor

@mvladev volunteered during our meeting today to take a pass on step one (sanity checking the latest changes). Once that is done I'll do a quick pass and lgtm as well.

@mvladev
Contributor

mvladev commented Jan 18, 2018

@roberthbailey after I:

  • pulled the latest master branch
  • manually merged @pipejakob's branch into master and copy-pasted machines-api/types.go into api/cluster/v1alpha1/types.go

I see that there is a little misalignment in the comment for MachineSpec's ObjectMeta:

$ git diff

diff --git a/cluster-api/api/cluster/v1alpha1/types.go b/cluster-api/api/cluster/v1alpha1/types.go
index c5b829a7..d2787c39 100644
--- a/cluster-api/api/cluster/v1alpha1/types.go
+++ b/cluster-api/api/cluster/v1alpha1/types.go
@@ -168,8 +168,8 @@ type Machine struct {

 type MachineSpec struct {
        // This ObjectMeta will autopopulate the Node created. Use this to
-       // indicate what labels, annotations, name prefix, etc., should be used
-       // when creating the Node.
+       // indicate what labels, annotations, etc., should be used when
+       // creating the Node.
        // +optional
        metav1.ObjectMeta `json:"metadata,omitempty"`

The added lines are coming from machines-api/types.go.

Except for that small comment difference, types and structs match the proposal.

@roberthbailey
Contributor

Thanks Martin!

@pipejakob - go ahead and delete machines-api/types.go because as soon as I lgtm this will be merged by the submit queue.

@krisnova
Contributor

krisnova commented Feb 5, 2018

hey @pipejakob can you address the conflict? Also the PR LGTM

@pipejakob
Contributor Author

pipejakob commented Feb 5, 2018

On it, but it looks like the upstream build of cluster-api-gcp is broken, so I can't fully test my rebase yet. I'll wait for #561 or a similar patch to fix the build.

@rsdcastro

#568 is the fix for the compilation error, which is being reviewed.

cc @karan @jessicaochen

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 7, 2018
@pipejakob
Contributor Author

Okay, rebased and tested end-to-end via cluster-api-gcp. Ready to ship?

@roberthbailey
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pipejakob, roberthbailey

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 7, 2018
This is a proposal to add a new API for managing Nodes in a declarative
way: Machines.

It is part of the overall Cluster API effort.
@k8s-ci-robot k8s-ci-robot merged commit a53df90 into kubernetes-retired:master Feb 7, 2018
medinatiger added a commit to medinatiger/kube-deploy that referenced this pull request Feb 15, 2018
k4leung4 pushed a commit to k4leung4/kube-deploy that referenced this pull request Apr 4, 2018
Committing these types so we can start prototyping against them, but the
full proposal is still under review (and accepting feedback) here:

kubernetes-retired#298
k4leung4 pushed a commit to k4leung4/kube-deploy that referenced this pull request Apr 4, 2018
…posal

Minimalistic Machines API proposal.
k4leung4 pushed a commit to k4leung4/kube-deploy that referenced this pull request Apr 4, 2018