Federated Deployment Design Doc #325
Conversation
cc @chom
## Background

With Federation we are trying to provide Kubernetes objects and functionality spanning over multiple clusters in geographically distributed locations. One of the key objects in Kubernetes is Deployment. Deployment allows users to maintain a defined number of pod replicas and conveniently update them if needed. The update functionality is the main thing that differentiates them from ReplicaSets. ReplicaSets update everything at once while Deployments allow a slow, rolling update and a rollback in case the update is unsuccessful.
I will drop "in geographically distributed locations". "spanning over multiple clusters" is enough.
Done.
+ For each of the pod templates, Federated Deployment will make sure that a puppetized FederatedReplicaSet with the name deployment_name + “-” + pod_template_hash is created, with the appropriate revision number.
+ Local deployment checks will occur according to annotation (if present) or alphabetically.
Didn't understand this. Can you please elaborate?
Done.
+ For each of the pod templates, Federated Deployment will make sure that a puppetized FederatedReplicaSet with the name deployment_name + “-” + pod_template_hash is created, with the appropriate revision number.
+ Local deployment checks will occur according to annotation (if present) or alphabetically.
+ If Federated Deployment with `strategy == RollingUpdate` spots a Local Deployment with a different pod template it updates its only if there is no other update going on. Updates are spotted by checking if:
updates its -> updates it
Done.
+ Local deployment checks will occur according to annotation (if present) or alphabetically.
+ If Federated Deployment with `strategy == RollingUpdate` spots a Local Deployment with a different pod template it updates its only if there is no other update going on. Updates are spotted by checking if:
+ `deployment.status.updatedReplicas == deployment.status.replicas`
I assume you mean the negation of these conditions, i.e. an update is going on if observed_generation != generation.
Yes. Fixed.
+ `deployment.status.unavailableReplicas == 0`
+ `deployment.status.observed_generation == deployment.metadata.generation`

The Local Deployment will get a chance to be updated at a later time. Every change in Local Deployments triggers a Federated Deployment reconciliation. So once one Local Deployment is done with the update and publishes its new state, the reconciliation will be triggered and the next Local Deployment update started.
Why wait for a rollout to complete before triggering a new one? This is not what happens in Kubernetes. If a rollout was going on (let's say half of the pods were updated) and the user updates the deployment again, then the deployment controller will just drop the in-progress rollout and start rolling out the new template.
This is not about federated deployment updates but about the federated deployment updating its children. Obviously if the federated deployment is updated then the old local updates need to be stopped. Added a comment on that.
Aah, this is for updating one cluster at a time. I think we can clarify it more. Added some comments.
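To make the one-cluster-at-a-time behaviour concrete, here is a minimal Go sketch (illustrative only, with simplified stand-in types rather than the real Kubernetes API) of how a controller could combine the three conditions above with the annotation-defined or alphabetical ordering:

```go
// Illustrative sketch, not the actual controller code.
package main

import (
	"fmt"
	"sort"
)

// localDeployment is a simplified view of a Deployment in one member cluster.
type localDeployment struct {
	PodTemplateHash     string // hash of the pod template currently rolled out
	Generation          int64
	ObservedGeneration  int64
	Replicas            int32
	UpdatedReplicas     int32
	UnavailableReplicas int32
}

// updateInProgress mirrors the negated conditions from the proposal:
// an update is still going on if any of the three checks fails.
func updateInProgress(d localDeployment) bool {
	return d.UpdatedReplicas != d.Replicas ||
		d.UnavailableReplicas != 0 ||
		d.ObservedGeneration != d.Generation
}

// nextClusterToUpdate returns the first out-of-date cluster in order, but only
// if no cluster is currently mid-rollout. An empty string means "wait".
func nextClusterToUpdate(desiredHash string, order []string, locals map[string]localDeployment) string {
	sort.Strings(order) // alphabetical fallback; an ordering annotation could override this
	for _, d := range locals {
		if updateInProgress(d) {
			return "" // only one cluster is updated at a time
		}
	}
	for _, c := range order {
		if d, ok := locals[c]; ok && d.PodTemplateHash != desiredHash {
			return c
		}
	}
	return ""
}

func main() {
	locals := map[string]localDeployment{
		"us-east": {PodTemplateHash: "v2", Generation: 3, ObservedGeneration: 3, Replicas: 5, UpdatedReplicas: 5},
		"eu-west": {PodTemplateHash: "v1", Generation: 2, ObservedGeneration: 2, Replicas: 5, UpdatedReplicas: 5},
	}
	fmt.Println(nextClusterToUpdate("v2", []string{"us-east", "eu-west"}, locals)) // eu-west
}
```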
Moreover, “puppetized” Federated ReplicaSet (PFRS) support will be added to the Federated ReplicaSet Controller:

+ PFRS are marked with an annotation. The presence of the annotation makes the FRS a puppet.
+ PFRS are created by Federated Deployments and contain the proper pod spec.
To clarify, not just the pod spec: PFRS will get the complete ObjectMeta and ReplicaSetSpec (everything except replicas) from the deployment, and the federated deployment controller will keep it updated, for example: add a deletion timestamp, update the generation, etc.
Done.
+ PFRS are created by Federated Deployments and contain the proper pod spec.
+ PFRS monitors all underlying clusters where the replicas could potentially be found and updates their statuses and spec.Replicas based on the underlying cluster content.
+ PFRS never updates RS in underlying clusters.
Worth adding imo (feel free to disagree):
- PFRS will be recreated by federated deployment controller if user deletes it.
- federated deployment controller will overwrite any direct updates to PFRS by user.
- It will be deleted by federated deployment controller when the federated deployment is deleted.
Done.
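A minimal sketch of the PFRS lifecycle rules agreed above (create if missing, overwrite direct edits, delete together with the Federated Deployment, never touch member-cluster ReplicaSets). The annotation key and the types below are placeholders, not the final API:

```go
// Illustrative sketch only; types and the annotation key are assumptions.
package main

import "fmt"

const puppetAnnotation = "federation.kubernetes.io/is-puppet" // placeholder; the real key is still TBD in this thread

type shadowFRS struct {
	Name        string
	Annotations map[string]string
	Spec        string // stands in for the ObjectMeta + ReplicaSetSpec copied from the deployment
}

// reconcileShadow returns the desired puppet/shadow FRS for a federated
// deployment, recreating it if missing and overwriting any direct user edits.
// It never produces writes to ReplicaSets in member clusters.
func reconcileShadow(fdDeleted bool, desiredSpec, name string, existing *shadowFRS) *shadowFRS {
	if fdDeleted {
		return nil // the puppet FRS is deleted together with the federated deployment
	}
	want := &shadowFRS{
		Name:        name,
		Annotations: map[string]string{puppetAnnotation: "true"},
		Spec:        desiredSpec,
	}
	if existing == nil || existing.Spec != desiredSpec {
		return want // recreate if the user deleted it, or overwrite direct edits
	}
	return existing
}

func main() {
	got := reconcileShadow(false, "nginx:1.11", "nginx-701339712", nil)
	fmt.Printf("%+v\n", got)
}
```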
## Tests

Apart from similar tests that should be done for FRS, FD tests need to check:
+ If a failure in update stops the whoe update.
whoe -> whole
+ If a new update is requested before the previous one was completed.
+ If all of the rollout mechanics and kubectl commands work.
+ If Federated ReplicaSets have the correct statuses and replica counts.
Also:
- Make sure deletion works as expected: FRS is not deleted unless FD is deleted and is deleted when FD is deleted.
- federated deployment controller overwrites any direct updates to PFRS by users.
Done.
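For illustration only, the test matrix above (including the deletion and overwrite cases) could be organized as a table-driven Go test; the scaffolding below is a sketch, not the real e2e suite:

```go
// Placeholder test scaffolding; case names track the bullets in this section.
package federateddeployment

import "testing"

func TestFederatedDeployment(t *testing.T) {
	cases := []string{
		"failure in one cluster stops the whole rolling update",
		"new update requested before the previous one completes",
		"rollout mechanics and kubectl rollout subcommands work",
		"shadow Federated ReplicaSets report correct statuses and replica counts",
		"shadow FRS is only deleted when the Federated Deployment is deleted",
		"direct user edits to the shadow FRS are overwritten by the controller",
	}
	for _, name := range cases {
		t.Run(name, func(t *testing.T) {
			t.Skip("placeholder: the e2e implementation would go here")
		})
	}
}
```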
When adding the federation layer over the deployment, things get more complicated. Federated Deployment needs to control the regular deployment. That’s understandable and in line with other controllers. Now what about ReplicaSets? Requirements [R1] [R2] [R3] imply that there should be ReplicaSets at the federation level.
<img src="federated-deployment-2.png" width="500"/>
Thanks a lot for the great images!
Ack :)
Moreover, “puppetized” Federated ReplicaSet (PFRS) support will be added to the Federated ReplicaSet Controller:
+ PFRS are marked with an annotation. The presence of the annotation makes the FRS a puppet.
We need to come up with a clear annotation. As you said, "is-puppet" is one option.
Also cc @kubernetes/sig-federation-proposals
+ [R4] Allow similar extra scheduling features as Federated ReplicaSet:
  + Specifying min/max/weight for each of the clusters
  + Being able to rebalance replicas once a cluster is added, lost or if there is no capacity.
+ [R5] Each of the underlying clusters should be able to work independently if the federation control plane is down and be as easily updatable as federation. This basically means that Federated Deployment should create fully-functional deployments in the clusters.
I think it would be helpful to detail the use cases justifying this requirement. Without it, Federated Deployment could be a lot simpler.
Added.
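As a rough illustration of the [R4] knobs, a per-cluster preference structure could look like the sketch below. The field names and JSON shape are assumptions for this discussion, not the actual Federated ReplicaSet preferences API:

```go
// Illustrative sketch of per-cluster min/max/weight scheduling preferences.
package main

import (
	"encoding/json"
	"fmt"
)

type clusterPreference struct {
	MinReplicas int64  `json:"minReplicas,omitempty"`
	MaxReplicas *int64 `json:"maxReplicas,omitempty"` // nil means unbounded
	Weight      int64  `json:"weight"`                // relative share of the remaining replicas
}

type schedulingPreferences struct {
	Rebalance bool                         `json:"rebalance"` // move replicas when clusters join, leave, or run out of capacity
	Clusters  map[string]clusterPreference `json:"clusters"`  // "*" could act as a default for unlisted clusters
}

func main() {
	max := int64(20)
	prefs := schedulingPreferences{
		Rebalance: true,
		Clusters: map[string]clusterPreference{
			"us-east": {MinReplicas: 2, Weight: 2},
			"eu-west": {MinReplicas: 2, MaxReplicas: &max, Weight: 1},
		},
	}
	out, _ := json.MarshalIndent(prefs, "", "  ")
	fmt.Println(string(out))
}
```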
<img src="federated-deployment-3.png" width="500"/>
Federated nginx-701339712 gets all of the statistics as well as spec elements (like the total number of replicas) from ReplicaSets in the underlying clusters. Obviously none of the spec changes should be pushed back. With this change the whole thing would look like a well-working “whole” while in fact Federated ReplicaSets would be “puppetized” - their original functionality (creating rs in clusters and controlling the balancing) would be turned off.
'puppet' already has strong connotations to the ops/devops crowd. Assuming a strategy of reusing ReplicaSet to track aggregated stats, a term like 'shadow' might be more appropriate.
Changed to shadow.
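A small sketch of the aggregation described above, with simplified stand-in types: the shadow FRS gets its spec.Replicas and status by summing over member-cluster ReplicaSets, and nothing is written back to the clusters:

```go
// Illustrative aggregation sketch, not the real controller code.
package main

import "fmt"

type rsStatus struct {
	Replicas, ReadyReplicas, AvailableReplicas int32
}

type clusterRS struct {
	SpecReplicas int32
	Status       rsStatus
}

// aggregate sums spec and status across clusters; the result is written to
// the shadow FRS (a read-only view), never to member-cluster ReplicaSets.
func aggregate(locals map[string]clusterRS) (specReplicas int32, status rsStatus) {
	for _, rs := range locals {
		specReplicas += rs.SpecReplicas
		status.Replicas += rs.Status.Replicas
		status.ReadyReplicas += rs.Status.ReadyReplicas
		status.AvailableReplicas += rs.Status.AvailableReplicas
	}
	return specReplicas, status
}

func main() {
	spec, st := aggregate(map[string]clusterRS{
		"us-east": {SpecReplicas: 3, Status: rsStatus{Replicas: 3, ReadyReplicas: 3, AvailableReplicas: 3}},
		"eu-west": {SpecReplicas: 2, Status: rsStatus{Replicas: 2, ReadyReplicas: 1, AvailableReplicas: 1}},
	})
	fmt.Printf("spec.replicas=%d status=%+v\n", spec, st)
}
```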
The implementation of Federated Deployment Controller is largely based on the existing Federated ReplicaSet Controller. The following changes, however, will be added (on top of obvious changes like “s/replicaset/deployment/”):
+ For each of the pod templates, Federated Deployment will make sure that a puppetized FederatedReplicaSet with the name deployment_name + “-” + pod_template_hash is created, with the appropriate revision number.
So a PFRS would be created per template+revision and aggregate stats for the replica sets in member clusters that share that template+revision? Would it do anything else, and if not, is it worth reusing ReplicaSet vs creating a new federation-specific object to track those numbers? It seems kind of hacky to use an existing object for an entirely new purpose.
I agree that it is a bit hacky. However, in general, we would like to avoid creating new objects for federation. There is already support for deployment+replicasets in kubectl rollout, and if we had different objects we would have to rewrite the whole thing.
Personally, I would find this proposal much easier to follow if it did a step-by-step sequence of what happens when I create and update a federated deployment in each individual cluster.
@derekwaynecarr Ok - I can add a section describing that.
@derekwaynecarr Added.
Comments addressed - PTAL.
@Kargakis and @kubernetes/sig-apps-misc
The implementation of Federated Deployment Controller is largely based on the existing Federated ReplicaSet Controller. The following changes, however, will be added (on top of obvious changes like “s/replicaset/deployment/”):
+ For each of the pod templates, Federated Deployment will make sure that a shadow FederatedReplicaSet with the name deployment_name + “-” + pod_template_hash is created, with the appropriate revision number.
There are discussions going on about changing this. Also, between two cluster versions pod-template-hash may not be consistent, so there's no guarantee the federated RS matches up to the version the local D creates.
@mwielgus I assumed the shadow replica set would be hashing the template defined for the federated deployment, and wouldn't need to rely on a cluster-specific hash. Is that not the case?
Hashing the podtemplate or any kind of api object is not reliable due to api changes. I will prepare a proposal during the weekend for switching to something stable.
I am thinking of combining either the uid or the name of a deployment with its podtemplate generation (new concept that is not there yet) to produce the new hash.
Opened #384
Thanks for pointing that out @Kargakis and @smarterclayton.
Agreed that we can't rely on the rs in underlying clusters.
We do want to share the code with the kube deployment controller. So we will create the shadow rs with the same name pattern.
Does kubectl rollout rely on rs name?
> Does kubectl rollout rely on rs name?
Not a subcommand that I know of
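For reference, the naming scheme under discussion is deployment_name + “-” + pod_template_hash. The FNV hash of a serialized template in the sketch below is only a stand-in; as noted above, hashing API objects is not stable across versions and #384 proposes an alternative, so treat this as illustrative:

```go
// Illustrative sketch of the proposed child-name construction.
package main

import (
	"fmt"
	"hash/fnv"
)

// shadowRSName builds the child name from the federated deployment name and
// some serialization of its pod template (the exact hash is under discussion).
func shadowRSName(deploymentName, serializedPodTemplate string) string {
	h := fnv.New32a()
	h.Write([]byte(serializedPodTemplate))
	return fmt.Sprintf("%s-%d", deploymentName, h.Sum32())
}

func main() {
	fmt.Println(shadowRSName("nginx", `{"containers":[{"name":"nginx","image":"nginx:1.11"}]}`))
}
```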
<img src="federated-deployment-1.png" width="400"/>
When adding the federation layer over the deployment, things get more complicated. Federated Deployment needs to control the regular deployment. That’s understandable and in line with other controllers. Now what about ReplicaSets? Requirements [R1] [R2] [R3] imply that there should be ReplicaSets at the federation level.
I don't necessarily think that the design of a federated deployment should assume it understands the implementation of a local deployment. We have talked significantly about having custom deployment strategies. That's information that the federator can't and shouldn't have.
I would have started from the opposite position - that federated deployments should be modifying deployments in a cluster and watching them, vs doing anything that assumes it understands how the deployment works in that cluster.
My impression was that a federated deployment was going to monitor deployments in the cluster and aggregate statistics about those deployments in the shadow replicasets. I don't think the intent is for the shadow replicasets to be more than a convenient place to store data, and that their use is just confusing things. Maybe another object or 3rd party resource would be a better match for this use case?
@smarterclayton I agree with you and that's how we started. That is what we had in our initial design proposal (which is what is implemented now). In that design the deployment just delegates to underlying deployments.
The problem with that is that rollback breaks if a cluster was added after the version we want to rollback to. For ex: a cluster was added at version 5 and we now want to revert to version 3. The new cluster has no idea of what the template was at revision 3. So we need a way to keep track of templates at all versions in federation.
@marun You are right. The federated replicaset is just to store data. The reason we chose to store it as a replicaset instead of any other resource is because kubectl rollout commands expect a replicaset.
I'm not sure it's desirable to require isomorphic behavior between all federated objects and their cluster-local equivalents. It might simplify the initial implementation, but the longer-term cognitive and maintenance burden is likely to be considerable. Cross-cluster coordination of deployments is likely to have different behavioral characteristics compared to in-cluster deployments, so why should a federated deployment mirror, from a user perspective, how a non-federated deployment works?
@marun Well, if a federated object works significantly differently than its "regular" equivalent then it probably should have a different name/definition to avoid confusion. I don't believe we have the capacity to come up with a brand new object hierarchy in the near future. As for deployments - @nikhiljindal, @csbell and I analyzed probably all possible ways to have deployments with rollout support and it seemed that the option with shadow replicasets was the lesser evil. However if you have a better/different idea I would be happy to discuss/consider it.
@mwielgus I'm not sure how the behavior of federated deployments can't differ significantly from regular deployments. The mechanics of coordinating across clusters is not the same as within a cluster. I agree that an entirely new object hierarchy may not be appropriate, but it may make sense to special-case resources like deployments. Why is it necessary to store deployment state (vs using cached state)?
## Design
Regular Deployments don’t maintain the replicas by themselves. They let ReplicaSets keep track of them. Each of the deployments creates a separate ReplicaSet for each of the pod template versions they have. To do a rolling upgrade the deployment slowly adds replicas to the replica set with the new pod template and takes them from the old one. The names of ReplicaSets are in the form of deployment_name + “-” + podspec_hash. So it is relatively easy to tell which replicaset is responsible for pods with the given spec (if you are a machine, of course).
As commented elsewhere, podtemplate hash is not stable across Kubernetes versions. We will need to change the naming scheme for ReplicaSets. The simplest solution is deployment.name + deployment.templategeneration. Or hash that and append to deployment.name but the new hash should differ in size from the old one so we can avoid conflicts with old replica sets that we cannot migrate.
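Not part of the proposal itself, but to make the Design paragraph above concrete: a simplified sketch of how a regular Deployment shifts replicas from the old ReplicaSet to the new one during a rolling update (surge/maxUnavailable handling omitted):

```go
// Simplified illustration of a rolling update step, not the real controller.
package main

import "fmt"

// rollingStep moves up to `step` replicas from the old ReplicaSet to the new
// one and returns the updated counts.
func rollingStep(oldReplicas, newReplicas, desired, step int32) (int32, int32) {
	move := step
	if oldReplicas < move {
		move = oldReplicas
	}
	if newReplicas+move > desired {
		move = desired - newReplicas
	}
	return oldReplicas - move, newReplicas + move
}

func main() {
	oldRS, newRS := int32(5), int32(0)
	for oldRS > 0 {
		oldRS, newRS = rollingStep(oldRS, newRS, 5, 1)
		fmt.Printf("old=%d new=%d\n", oldRS, newRS)
	}
}
```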
+ SFRS monitors all underlying clusters where the replicas could potentially be found and updates their ReplicaSetStatuses and ReplicaSetSpec.Replicas based on the underlying cluster content.
+ SFRS never updates RS in underlying clusters.
+ SFRS will be recreated by federated deployment controller if user deletes it.
+ Federated Deployment controller will overrite any direct updates to SFRS by user.
typo
Done.
@marun The problem with a special object for storing revisions is that then we will have to maintain two versions of kubectl deployment-related code - one for federation and one for regular Kubernetes. I would rather like to avoid that.
@mwielgus I think I must be missing some context. How do you propose to provide UX parity between federated and non-federated deployments given the extra dimension of the member clusters implied by a federated deployment? Maybe we can discuss in tomorrow's sig meeting? Higher bandwidth might make it easier for you to educate me.
@marun Unfortunately, I won't be at the sig meeting. As for the UX - I don't see any problem there. We have regular kubectl support for all other objects.
Reviewers,
Ping
<img src="federated-deployment-3.png" width="500"/>
Federated nginx-701339712 gets all of the statistics as well as spec elements (like the total number of replicas) from ReplicaSets in the underlying clusters. Obviously none of the spec changes should be pushed back. With this change the whole thing would look like a well-working “whole” while in fact Federated ReplicaSets would be “shadows” (or puppets) - their original functionality (creating rs in clusters and controlling the balancing) would be turned off.
How about the deployment controller updating Federated nginx-701339712 based on the federated deployment (total replicas and updated replicas) instead of the federated replicaset controller updating it based on underlying replicasets? That way we won't be relying on the internal implementation of underlying deployments (whether they create a rs or not and with what name).
Honestly, I don't understand the benefit of having the federated controller update the federated replica set status based on local replica sets.
+ Local deployment checks will occur according to annotation (if present) or alphabetically, so that the order of cluster updates is predictable.
+ If Federated Deployment with `strategy == RollingUpdate` spots a Local Deployment with a different pod template it updates it only if there is no other update going on. Updates are spotted by checking if:
Start this by saying that federated deployment controller updates one cluster at a time.
+ Local deployment checks will occur according to annotation (if present) or alphabetically, so that the order of cluster updates is predictable.
+ If Federated Deployment with `strategy == RollingUpdate` spots a Local Deployment with a different pod template it updates it only if there is no other update going on. Updates are spotted by checking if:
I am no expert, but I think we can rewrite this along the lines of: "The federated deployment controller updates local deployments in one cluster at a time. It determines if an update is going on in a cluster by checking: .
If it finds a cluster which should be updated and there is no update going on in any other cluster, then it starts a rolling update in that cluster."
Thanks for your updates to the proposal based on previous comments @mwielgus. I see that the main concern from reviewers is that this proposal relies on the internal implementation of deployments in underlying clusters. As it stands today, the internal implementation can generate different rs names across different kube versions (because of API changes). #384 will at least fix that. But then, any future changes to the way replicasets are named will break us. Here is how an alternate implementation could look (one which does not depend on the internal implementation of deployments in underlying clusters):
When a rolling update is going on, it can set:
What do you guys think about that?
@nikhiljindal I don't understand why it fixes the naming problem. You need to name the shadow replicaset somehow and somehow get the stats there from the proper local replica sets. Who exactly is doing it (fed-dep or fed-rs) seems slightly irrelevant to me.
+ History - provide all of the revisions of federated deployment.
+ Pause/Resume - provide an ability to pause and resume an ongoing update
+ Status - provide rollout status information
+ Undo - provide go back to the previous version
s/go back to the previous version/rollbacks to previous versions/
+ Pause/Resume - provide an ability to pause and resume an ongoing update
+ Status - provide rollout status information
+ Undo - provide go back to the previous version
+ Require no/almost-no fixes in kubectl
I guess federated resources have their own aliases in the command line, right?
No. The same kubectl commands (like kubectl get rs) can work for both kube-apiserver and federation-apiserver.
Yes, but we don't need to rely on the shadow replicaset having a fixed name.
We decided to do it in fed-rs since it already has that logic (update the status of fed rs based on rs's with the same name in underlying clusters). This is not true anymore.
@mwielgus It's not clear why you closed this. Could you provide some context? If you're not able or interested in working on this any further, I'd be keen to find someone else to take it over. Any objections to that?
@quinton-hoole It is rather unlikely that I will have capacity to work on this in the foreseeable future. Feel free to take over this PR.
OK, thanks @mwielgus, taking it over.
Grammar/style edits part 2
This PR hasn't been active in 30 days. It will be closed in 59 days (Jan 8, 2018). cc @csbell @irfanurrehman @mwielgus @nikhiljindal @quinton-hoole You can add the 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days.
As I won't work on this PR I see little point in keeping it open. @quinton-hoole If you want to continue the effort please pick the commits and proceed in your own PR.
cc: @nikhiljindal @csbell @madhusudancs @kubernetes/sig-federation-misc