[Federation] Federated statefulsets design proposal #437
Conversation
cc @kubernetes/sig-federation-misc @smarterclayton

> 1 – A unique, consistent and discoverable identity for each replica/instance across the federated clusters.

> 2 – Predictability on the number of instances and sequentialization of pod creation (for example, instance 1 creation starts only when instance 0 is up and running).
We've discussed relaxing this.
Yes, I have specified (later in the doc) that we might not be able to guarantee this requirement for the federated statefulsets.
I have now dropped this requirement altogether from this section.
Please break your lines at some fixed interval; the proposal is unreviewable as is.

Thanks @smarterclayton for having a look. I have added line breaks on all the lines. PTAL.
Some high level questions:
> If we consider the use cases listed above, the main design requirements can roughly be listed as:

> 1 – A unique, consistent and discoverable identity for each replica/instance across the federated clusters.
I'd like a new design requirement.
- It must be possible for the federated set to SAFELY form an initial quorum that adds the rest of the set.
Any design which doesn't allow (for instance) cluster 1 to form an initial quorum and then safely add members in cluster 2 (or any other order, really) is a non-starter, because then a stateful set at the federation level has a different set of guarantees.
I think that's implicit in some of your comments, but it needs to be explicit, obvious, and impossible to break.
When you mention quorum, what quorum are you referring to?
@smarterclayton I have mentioned the requirement explicitly, as you suggested. I have also dropped the sequentialization requirement, because we are not honoring that as of now.
@chrislovecnm I think @smarterclayton is referring to the quorum that application pod instances form after discovering each other.
Questions. Looks awesome!!!
> 1 – A stateful app, for reasons of high availability, wants its stateful pods distributed across different clusters, such that the set can withstand cluster failures. This represents an app with one single global quorum.

> 2 – A stateful app wants replicas distributed in multiple clusters, such that it can form multiple smaller-sized local clusters using only local replicas.
What about two stateful sets that need to communicate with each other? Many applications, such as Cassandra, Kafka and Elasticsearch, include the capability to span physical data centers.
The scenario you mention is in fact solved by design alternative 2 specified below, where a particular statefulset instance gets both a local and a global identity. Applications can choose to use either identity, or both at the same time. The use cases I mentioned are for reference alone and not exhaustive. Do you think it's important to mention another use case covering the scenario you mention?
> ## Storage volumes

> There are two ways a federated statefulset can be assigned persistent storage.
How are we handling storage classes? Am I missing that section?
I think we need not mention anything specific to storage classes. The chosen design intends to reuse the cluster-local statefulset implementation when the statefulsets are deployed in the individual clusters.
One catch is that some of the federated clusters might not have the given storage class available. For now, I think we leave it to the user to ensure that the specified storage class is available across all the federated clusters mentioned in the federated statefulset spec. The behavior when this does not happen is still deterministic: volume provisioning for the stateful pods in a cluster that does not have that class will fail.
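For concreteness, here is a minimal Go sketch (using the `k8s.io/api` types) of the volumeClaimTemplate that would be propagated unchanged to every member cluster. The claim name `data` and the storage class value are illustrative; the point is that nothing in the federation control plane verifies that the named class exists in each cluster.

```go
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// federatedVolumeClaimTemplate is the claim template the federation would copy
// as-is into the statefulset created in each member cluster. If a cluster has
// no storage class with this name, volume provisioning fails only there.
func federatedVolumeClaimTemplate(storageClass string) corev1.PersistentVolumeClaim {
	return corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &storageClass, // must exist in every member cluster the user targets
		},
	}
}

// withClaim attaches the template to the otherwise unchanged per-cluster spec.
func withClaim(spec appsv1.StatefulSetSpec, claim corev1.PersistentVolumeClaim) appsv1.StatefulSetSpec {
	spec.VolumeClaimTemplates = append(spec.VolumeClaimTemplates, claim)
	return spec
}
```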
Should we say that we are going to test against them?
I am sorry, I did not get you here. You mean test against storage classes?
> elaborated by @quinton-hoole

> Strictly speaking, migration between clusters in the same zone is quite feasible.
Please add more clarity to this statement.
More content as well.
Sure, will update with something more explanatory.
More detail is still required here. If storage can be replicated between clusters (at least those hosted by the same infrastructure, e.g. GCE), then including the cluster name in the identity doesn't seem like such a good idea.
> If we consider the use cases listed above, the main design requirements can roughly be listed as:

> 1 – A unique, consistent and discoverable identity for each replica/instance across the federated clusters.
When you mention quorum, what quorum are you referring to?
> 2 – Predictability on the number of instances and sequentialization of pod creation (for example, instance 1 creation starts only when instance 0 is up and running).

> 3 – The ability to scale across clusters in some deterministic fashion.
Can this be optional? Some technologies do not allow more than one pod at a time.
The scaling feature will be on demand, but I think we need to have the ability in the federated statefulsets.
> _3 – What happens if a cluster dies_

> Nothing; the statefulset would need to run with fewer replicas.
One section that I am not seeing is networking. Highly available distributed applications often use patterns such as gossip, which can require every pod to talk to every other pod. Thoughts on how we are going to handle those patterns?
We handle this using DNS names, similar to the current k8s statefulset implementation. The "Instance identity and discovery" sections specify the details.
Apps need to discover and use the DNS names, not the IPs, to communicate across instances.
But how will the routing work? Does CNI provide the capability to route between a headless service in the UK and a headless service in the US? I'm a bit confused.
This routing is not handled by CNI.
It will be handled very similarly to the way it is handled for federated services
(see one of my other replies to your comments about the need for an ELB).
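To make "similar to federated services" concrete, here is a rough Go sketch of the names a peer might resolve, from most local to most global. The zone/region/global suffix layout mirrors the federated-service DNS convention; applying it to per-pod records is an assumption made for illustration, not a committed format in this proposal.

```go
package sketch

import "fmt"

// candidateDNSNames returns the names a stateful pod's peers could try, most
// local first, if the federation programs DNS for pod records the same way it
// already does for federated services. All suffix parameters are assumptions.
func candidateDNSNames(pod, svc, ns, federation, zone, region, dnsSuffix string) []string {
	base := fmt.Sprintf("%s.%s.%s.%s.svc", pod, svc, ns, federation)
	return []string{
		fmt.Sprintf("%s.%s.%s.%s", base, zone, region, dnsSuffix), // same-zone endpoint, if any
		fmt.Sprintf("%s.%s.%s", base, region, dnsSuffix),          // fall back to the region
		fmt.Sprintf("%s.%s", base, dnsSuffix),                     // finally, any cluster globally
	}
}
```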
In the current proposal, when a new cluster joins and the federation already has a running statefulset, nothing will happen (meaning no rebalance). However, if the statefulset is scaled after the cluster joins, it might get replica(s).
What you mention is a valid scenario, but this is probably something we need to live with for now. I believe federated statefulsets will still be pretty useful even with this constraint.

Thanks @smarterclayton @chrislovecnm @madhusudancs for the comments!
Couple more questions :)
> In the case of an in-cluster statefulset, pods discover each other using the in-cluster DNS names.
> A headless service with selectors enables creation of DNS records against the pod names.
> This cannot work across clusters, as local pod IPs cannot be reached across the clusters.
Is this true with all CNI providers or are we just saying that this is the expectation?
As of now (to my limited knowledge), CNI overlays are limited to within a cluster. However, there probably isn't much of a technical bottleneck stopping an overlay from working across clusters. Perhaps that might be another proposal for cluster federation in the pipeline, if the need arises. Right now this is true of all CNI providers (again, to my limited knowledge :) ).
There's nothing preventing a CNI overlay from working cross-cluster. Tigera has discussed use cases involving multi-cluster overlay in the past.
@caseydavenport Thoughts?
If that is true, and there is a solution which can easily enable federated objects to use or access networks across k8s clusters, I would want to pursue it.
In the absence of such an available or usable solution, we go ahead with the current proposal and improve it when a cross-cluster overlay is easy to achieve or inherently built in as cross-cluster communication (probably another proposal within the scope of federation).
@irfanurrehman because this is internal to the cluster, using a networking solution that can mesh or expose pod IPs is a possible solution. The networking is complex, so for smaller deployments load balancers may be another solution as well.
This is a challenging problem to say the least ;) But I have done a POC with Weave and have gotten the network setup that I would need.
> A headless service with selectors enables creation of DNS records against the pod names.
> This cannot work across clusters, as local pod IPs cannot be reached across the clusters.

> The proposal to make it work across clusters is to assign a service of type 'LoadBalancer' to each pod instance that is created in the clusters.
Is there any way to do this differently? Think about this at scale. Cost, quotas and provisioning would get seriously fun :) How are we doing guaranteed network identity for each ELB?
Also, one of the amazing advantages would be to have a bare-metal stateful set talking to a cloud stateful set. How are we going to allow for that? Think about expanding your footprint during a busy shopping season.
I concur, this is a horrible idea. Imagine trying to manage a Cassandra cluster of hundreds of nodes per DC. This isn't feasible. Do one headless service per k8s cluster, then create a VPN link between the virtual networks and set up DNS resolution across clusters.
@chrislovecnm you are completely right about the cost involved in provisioning ELBs. Right now the options (using available k8s features) to communicate across clusters are:
- ELBs provisioned from the cloud provider
- Ingress (auto-provisioning is tricky, and quite cloud specific)
- NodePort
As of today, federated services have out-of-the-box support for ELB only, and proper support for Ingress and NodePort in federated services is still evolving (somebody please correct me if I am wrong). The proposal in this design is in fact to adhere to the same facility/evolution of federated services for cross-cluster communication.
The point I am trying to make is that there already are use cases which demand cross-cluster/federated statefulsets and would benefit, even with just the current proposal in place.
Better solutions can evolve over time.
@mstump thanks for your suggestion. Can you please elaborate a little more on whether this functionality would go into some existing k8s/federation feature, or be a new feature in itself? The point is that the overlay you mention ideally needs to be possible as an auto-provisioned method of communication across clusters. If it makes sense in the near term and is doable, I would not mind pursuing it.
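For reference, a minimal Go sketch of what "one LoadBalancer service per pod instance" would look like. It relies on the `statefulset.kubernetes.io/pod-name` label that the in-cluster StatefulSet controller already puts on every pod; the service naming scheme and port are illustrative only, not part of the proposal's API.

```go
package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// perPodLoadBalancerService builds the Service that would front a single
// stateful pod so peers in other clusters can reach it. One such Service per
// replica is exactly the cost/quota concern raised in this thread.
func perPodLoadBalancerService(podName string, port int32) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name: fmt.Sprintf("%s-lb", podName), // hypothetical naming scheme
		},
		Spec: corev1.ServiceSpec{
			Type: corev1.ServiceTypeLoadBalancer,
			Selector: map[string]string{
				// label set by the in-cluster StatefulSet controller on each pod
				"statefulset.kubernetes.io/pod-name": podName,
			},
			Ports: []corev1.ServicePort{{
				Port:       port,
				TargetPort: intstr.FromInt(int(port)),
			}},
		},
	}
}
```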
> In this approach, the federated statefulset controller will behave quite similarly to the federated replicaset or the federated deployment controller.
> The federated controller would create and monitor individual statefulsets (rather than pods directly), partitioning and distributing the total stateful replicas across the federated clusters.

> As a proposal in this design, we suggest the possibility of the pods having multiple identities.
Doesn't this break the contract that stateful sets already have? I probably need more details to fully understand this design case.
No, it does not. In a normal in-cluster statefulset, a stateful pod gets two identities:
- the DNS name accessible within the cluster
- the hostname visible to the stateful app
The suggestion in this design retains both and then adds another DNS name accessible across clusters. I specify the additional DNS name as another identity, hence multiple identities. I don't know of any such use case as of now, but if needed the same design can be extended to have even more DNS names, visible locally or globally, reflecting more identities.
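A minimal Go sketch of the identity set a single pod would carry under this design. The first two names already exist for in-cluster statefulsets; the third, federation-wide name is the addition proposed here, and its exact format (cluster name embedded, federation DNS zone suffix) is an illustrative assumption.

```go
package sketch

import "fmt"

// podIdentities lists the names one stateful pod would answer to. Only the
// third entry is new in this design; its layout is assumed for illustration.
func podIdentities(pod, svc, ns, cluster, fedZone string) []string {
	return []string{
		pod, // hostname visible to the application inside the pod
		fmt.Sprintf("%s.%s.%s.svc.cluster.local", pod, svc, ns),           // in-cluster DNS name via the headless service
		fmt.Sprintf("%s.%s.%s.%s.svc.%s", pod, cluster, svc, ns, fedZone), // assumed federation-level (cross-cluster) name
	}
}
```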
> We propose using alternative 1 listed here, as it fits the broader scheme of things, is more consistent with the user expectation of being able to query all needed resources from the federation control plane, and is less confusing to use at the same time.

> # Conclusion
With some applications, you can only create one pod at a time. How does this design proposal maintain the ordinal order as defined by the stateful set contract?
Does this proposal define the capability of migrating from a regular stateful set to a federated stateful set?
If you see line 99 (and as suggested by @smarterclayton), the proposal is to not have a hard requirement to preserve the order of stateful pod creation.
This document has not handled the migration of a regular in-cluster statefulset to a federated one. Do you think it's needed to address that in this design?
> It is known that, as of now, even if the same persistent volume can be used in a different cluster, k8s does not yet directly provide an API which can aid this. In the absence of a direct way to quickly migrate the storage data from one zone to another, or from one cloud provider to another as the case may be, the proposal is to disallow migration of pods across clusters.

> ## Scale up/Scale down
For the database and search systems that I can think of, this would be undesirable. k8s isn't workload aware: it doesn't know what it means to scale up or re-shard data, or all the implications of those actions. Additionally, workloads across DCs are not always homogeneous or evenly distributed; you want to be able to scale them independently.
I did not get the question completely.
Are you saying that scale up/down of a statefulset is undesirable, or that an overall scale up/down of a federated statefulset, whose pods are distributed across clusters, is undesirable?
In either case, I think it's wise to at least have a scale function available to users, rather than not having this function at all.
> (1) only allow the use of node-local persistent storage (bad),

> (2) disallow migration of replicas between clusters if they use non-node-local volumes (simple, but perhaps overly restrictive)
This is the preferred method. Don't attempt to move storage across clusters. Most distributed databases have their own notion of identity, consistency and replication. This is an app-specific concern; don't overcomplicate k8s by trying to manage replication for them.
Agreed! That is what this doc also proposes.
> ### Replica distribution (across federated clusters)

> The proposed default behaviour of the federation controller, when a statefulset creation request is sent to the federation API server, would be to partition the statefulset replicas and create a statefulset in each of the clusters with the reduced replica number (after partitioning), quite similar to the behaviour of the replicaset or daemonset controllers of the k8s federation.
I don't understand why this is desirable or needed. Let each statefulset be managed independently, but create network links between the federated sets.
The point is that the federation user needs to see a consolidated view of the statefulset. Also, the current k8s federation design philosophy is that the federation ideally appears as just another k8s cluster, where some users might not even know that they are talking to a federation and not a normal k8s cluster.
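For illustration, a minimal Go sketch of the default partitioning behaviour described in the quoted text above: an even split of the total replica count across member clusters, with the remainder handed out one replica at a time in the order the clusters are listed, so the result is deterministic. Weighted per-cluster preferences (which the other federated workload controllers support) are omitted; this is an assumption-level sketch, not the controller's actual code.

```go
package sketch

// partitionReplicas splits a federation-level replica count across member
// clusters: an even base share plus one extra replica for the first
// total % len(clusters) clusters, in the given (stable) order.
func partitionReplicas(total int32, clusters []string) map[string]int32 {
	out := make(map[string]int32, len(clusters))
	if len(clusters) == 0 {
		return out
	}
	base := total / int32(len(clusters))
	rem := total % int32(len(clusters))
	for i, c := range clusters {
		out[c] = base
		if int32(i) < rem {
			out[c]++ // remainder replicas go to the earliest-listed clusters
		}
	}
	return out
}

// Example: partitionReplicas(7, []string{"us-east", "eu-west", "ap-south"})
// yields {"us-east": 3, "eu-west": 2, "ap-south": 2}.
```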
cc @kubernetes/sig-apps-misc @kubernetes/sig-apps-feature-requests

More line breaks and minor nits.

@irfanurrehman I think that the design looks good overall. It would be useful to add a few non-trivial, concrete examples to illustrate both useful applications and some of the explicit limitations. Off the top of my head, these examples might include:
> in each of the clusters.
> It will further partition the total number of replicas and create statefulsets with partitioned
> replica numbers into at least 1 or more clusters.
> The noteworthy point is the proposal that federated stateful controller would additionally modify
With this you might hit a name length limit problem. When any of the names is close to the max and you concatenate them, you might exceed the allowed limit. It is an important problem that should be described in this proposal, imho.
Thanks for the specific suggestion, and apologies for the delayed response.
I have added a solution for this, implementable as an admission controller, but given that it's a difficult problem to hit, it might not be a necessary implementation in the first-phase solution. Hope that is OK?
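A rough Go sketch of the admission-time check being discussed, under the assumption that the generated per-cluster pod name concatenates the federated statefulset name, the cluster name and an ordinal. Pod hostnames must be valid DNS-1123 labels (63 characters max), which is the limit most likely to be hit; the naming scheme itself is assumed for illustration.

```go
package sketch

import "fmt"

const maxDNSLabelLen = 63 // DNS-1123 label limit that pod hostnames must satisfy

// validateFederatedName rejects combinations whose worst-case generated pod
// name (assumed scheme: <setName>-<clusterName>-<ordinal>) would not fit in a
// DNS label. maxOrdinalDigits bounds the ordinal suffix, e.g. 3 for up to 999.
func validateFederatedName(setName, clusterName string, maxOrdinalDigits int) error {
	worstCase := len(setName) + 1 + len(clusterName) + 1 + maxOrdinalDigits
	if worstCase > maxDNSLabelLen {
		return fmt.Errorf("generated pod names for %q in cluster %q could reach %d characters, exceeding the %d-character DNS label limit",
			setName, clusterName, worstCase, maxDNSLabelLen)
	}
	return nil
}
```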
#503 also seems very relevant for this proposal.
Yes, it is indeed relevant. My suggestion for now, however, is to treat the federated statefulset update design as separate from (or a later extension of) the federated statefulset feature, the same way it is happening for this feature in local k8s.
This PR hasn't been active in 109 days. Closing this PR. Please reopen if you would like to work towards merging this change, if/when the PR is ready for the next round of review. cc @irfanurrehman @quinton-hoole You can add the 'keep-open' label to prevent this from happening again, or add a comment to keep it open for another 90 days.
Why was this closed?
It auto-closed because it was not merged and has not received attention. I don't think this proposal is complete enough to implement.
@jwaldrip, as @kow3ns correctly pointed out, it was auto-closed. Meanwhile, SIG Multicluster re-examined its priority list a little while back, and we refocused our efforts on moving the federation code out of core first and then moving the existing features to GA as the top priorities, rather than implementing advanced features like this now. This does not mean we are not going to implement this; however, it will take some time until we get back to it (probably a quarter). Reopening this, as it will be implemented anyhow.
@kubernetes/sig-multicluster-feature-requests
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Don't know if I'm allowed to respond here. I am waiting for this feature, so I don't want this issue to be closed (because of lifecycle/rotten), otherwise it gets completely forgotten...
/remove-lifecycle rotten
@dionysius This is not dead. It's being pursued in https://github.com/kubernetes/federation. Don't worry :-)
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Design proposal for federated statefulsets.
@kubernetes/sig-federation-misc @kubernetes/sig-federation-proposals @kubernetes/sig-federation-pr-reviews
cc @quinton-hoole @deepak-vij @shashidharatd @dhilipkumars
Please feel free to cc any probable reviewers not part of the targeted groups.