KEP-27: Detailed Control for Pod Restarts #1449
Conversation
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
This is a great addition for improving stateful deployments of operators. Just have a comment regarding the backwards compatibility but happy to ship this as-is.
```yaml
forcePodRestart: "false"
```

> The default value for this parameter would be `true` to keep backwards compatibility. The general behavior should stay
I'm all for keeping backwards compatibility. On the other hand I feel that enabling forced pod restarts is a feature that users have to opt into, i.e. the default should be `false`. Forced pod restarts enable updating config maps on parameter updates, with the caveat of changing some assumptions regarding pod lifetimes. As such, the default should be what would be the expected behavior: pods not restarting.
I'm not sure. I think the pod restarting is something that you "usually" want. There may be applications that are "ConfigMap-aware" and re-read config files without a pod restart. For these, a default of `false` would be OK, but I think most applications will require the pod restarts for most of the parameters.
Having `false` as the default will probably lead to weird issues for operator developers, where they wonder why they've updated a parameter but the running application keeps running with an old value.
This sounds like something that would be useful to set as a "default" for all parameters, but I'm not sure adding that complexity would be good.
@zmalik what would you say should be the default?
@ANeumann82 before we decide how to change the current behavior, I would like to understand better why it's there in the first place.
Do we have a record of the initial motivation/discussion/KEP for this `kudo.dev/last-plan-execution-uid` feature? It seems like you're implying that this was an intentional mechanism put in place to force pod restarts on any parameter change or upgrade. Did I read that right?
Or perhaps you are just describing the current effect, and this was not a designed feature but just an accident - i.e. we were trying to mimic what kustomize was doing?
Given the above lack of information it is hard to judge, but I have to say that in my opinion this feature is surprising in the negative sense:
- it is not documented anywhere, as far as I can tell,
- it changes the default behaviour of k8s statefulsets and deployments,
- the name of the annotation does not provide any hint about its (implied?) intent - for example, when I first saw this annotation being changed very recently (when debugging the issue described in mesosphere/kudo-cassandra-operator#78, "Bump to kudo v0.11.1"), it was far from obvious to me why this was happening,
- there is a serious scoping issue - it affects all statefulsets/deployments in an operator, regardless of whether they actually use the parameter at all. Yes, there are ways to work around that by triggering a special plan, but this also has issues.
Adding a `forcePodReload` flag builds more complexity on an unstable foundation, which I think is exactly what we should be avoiding while we're still in the v0... stage.
In general I'd vote for breaking compatibility in this case and getting rid of this feature altogether. We did break compatibility for more minor reasons in the past.
If an operator developer wants a certain statefulset/deployment to restart on a parameter change, this can be achieved in another way: either using the tool you mentioned, or by us providing a hash function that an operator developer can use, as seen below, to explicitly declare which parameters require a restart:
```yaml
spec:
  template:
    metadata:
      annotations:
        config-values-hash: {{ hash(
          .Params.SOMETHING_THAT_AFFECTS_THIS_POD,
          .Params.SOMETHING_ELSE_THAT_AFFECTS_THIS_POD,
          ...
        ) }}
```
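The suggested template-side `hash(...)` function could behave roughly like this sketch (Python used only for illustration; `params_hash` is a hypothetical name, not an existing KUDO helper):

```python
import hashlib

def params_hash(*values: str) -> str:
    # Hypothetical helper: a stable hash over the parameter values
    # an operator developer lists explicitly in the template.
    h = hashlib.sha256()
    for v in values:
        h.update(v.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") != ("a", "bc")
    return h.hexdigest()[:16]

# A change in any listed parameter changes the annotation value,
# which in turn makes the StatefulSet controller roll the pods.
assert params_hash("3", "512m") == params_hash("3", "512m")
assert params_hash("3", "512m") != params_hash("5", "512m")
```

The separator byte matters: without it, moving characters between adjacent parameters would collide to the same hash.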
To give this discussion some context: the feature (though not documented) is very much intentional. Originally we used stakater/Reloader, but it came with its own issues, so we decided to implement our own solution, which right now is applied to all templates and surely could be smarter by updating/restarting only the things that are affected.
@porridge Thanks for that comment! I think @zen-dog answered some of the questions about the origin of this issue.
I kind of like your approach with the hash, although I don't like putting it directly into the statefulset configuration: if we think about an operator with >150 parameters, the call to `hash(PARAM1, PARAM2, ..., PARAMN)` would get really long, and it would require an operator developer to touch at least two or more places when adding a parameter.
I've added a second proposal to the document that takes up the idea of a hash calculation based on parameters, but with the hash-params defined on the parameters themselves; please have a look.
Nice work, I left a few questions and naming suggestions.
LGTM! But let's get a few more eyes on it before we proceed.
Similar to another KEP, I would love to discuss alternatives first.
Changing params requires a plan to execute... I'm not sure I understand the need to annotate the param. Aren't we talking about what should trigger a plan? Perhaps I'm missing something.
> This KEP describes addition of a flag for parameters that controls the forced pod reloading
>
> ## Motivation
This motivation is incomplete... it describes how something will affect another thing, but it doesn't encompass the motivation of why a user would want to do that. Without the details of the motivation, it is challenging to understand whether this is the best solution.
we need to know the use cases which are in and out of scope for this KEP... is this:
- to restart a stuck pod?
- to restart pods based on new config?
- is this in a step, or in a plan, or...?
I've added a more detailed motivation section.
About the second comment:
- No - I'm not sure how a pod can be stuck, but it's not about that.
- Yes, basically.
- It affects the whole plan. The effect for pod restarts is calculated based on the changed parameters: if all changed parameters have `forcePodRestart` set to `false`, then the Pods will not automatically be restarted.
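The rule described here can be sketched as follows (illustrative Python, not actual KUDO code; the parameter-attribute shape is an assumption):

```python
def pods_need_restart(changed_params: dict) -> bool:
    # Proposed rule: restart unless *every* changed parameter
    # explicitly opted out with forcePodRestart: "false".
    # The flag defaults to "true" for backwards compatibility.
    return any(
        attrs.get("forcePodRestart", "true") != "false"
        for attrs in changed_params.values()
    )

# NODE_COUNT opts out, JVM_OPTS keeps the default -> restart happens
assert pods_need_restart({"NODE_COUNT": {"forcePodRestart": "false"},
                          "JVM_OPTS": {}})
# every changed parameter opted out -> no restart
assert not pods_need_restart({"NODE_COUNT": {"forcePodRestart": "false"}})
```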
> ### Goals
>
> Make it possible for an operator developer to specify that a parameter will *not* automatically restart all pods from
> a stateful set or a deployment.
wow.. this went a different way... the goal is to be able to change a param and NOT have pods restart...
Is this only when a plan is being triggered? Or?
Yes, exactly. This is calculated when a plan is triggered, and it affects whether the `currentPlanUID` is written into `spec.template.spec.metadata.attributes`.
```yaml
- name: NODE_COUNT
  description: "Number of Cassandra nodes."
  default: "3"
  forcePodRestart: "false"
```
is this only for pods? how do we know that? what if there is control on a statefulset or deployment or...?
I'm questioning the name `forcePodRestart`.
It's generally for statefulsets and deployments, yes. It would affect all resources that create pods, so jobs as well - although since they don't usually keep pods running, it won't really change behavior there.
I'm open to other suggestions with regard to naming, but `forcePodRestart` was the most apt name I could come up with.
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
I like
…es, cleanup Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
This is awesome. Some nits in-line, and can I also ask you to include the references provided by @zen-dog somewhere in the text? These things are precious.
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
While I welcome the fine-grained control this KEP offers over restarting pods, I can't help but wonder if we're putting too high of a micromanagement burden on the operator developers. With big operators (>100 parameters, dozens of configMaps/secrets/statefulSets), an operator developer has to keep in mind all the dependency chains and update them accordingly.
KUDO, among other things, is supposed to be this higher-level abstraction layer that provides the magic glue that makes developing and operating complex applications easier. In this sense, I'd like to explore the alternative proposal that builds a dependency graph of used resources and parameters. We don't have to make a 100% generic solution to support all cases. Instead, we should concentrate on the 80/20 case at hand and limit it to k8s Pods, Jobs, StatefulSets, Deployments, ConfigMaps, and Secrets. This will already solve our own issues and probably most of our users'. And if more fine-grained control is necessary later, we can always reiterate.
> If multiple parameters are changed in one update, the `forcePodRestart` flags of all attributes are `OR`ed: If at least one
> parameter has the `forcePodRestart` set to `true`, the pods will execute a rolling restart.
>
> This solution would be very easy to implement, but may be hard to explain and requires intimate knowledge of KUDO internals
Well, if this is hard to explain, then `{{ .ParamGroups.MAIN_GROUP.hash }}` and managing all the `GROUP`s might be even harder 😉
This is looking good!
but I feel we need a bit of discussion, as the motivation covers only one use case: scaling the sts/deployments up and down. And for that, we are introducing another dimension in parameters (groups) along with the existing ones, which might make the developer experience not so pleasant.
I would really like to talk about all the use cases where developers might use them, and whether we can solve them in a less intrusive way. I see two use cases in the motivation, but the use case where a parameter update should only update the `configmap` and not the `sts/deployment` is already possible by having a specific plan that only updates configmaps.
> Allow the operator developer to define groups of parameters that can then be used to trigger a restart of pods from
> a deployment or stateful set.
>
> Add a list of `groups` that each parameter is part of:
so with this, operator developers will have to define which plan each parameter might trigger + define all the groups each parameter belongs to?
In the easiest case, to replicate the current behavior, the operator developer would just have to add:

```yaml
spec:
  template:
    metadata:
      annotations:
        config-values-hash: {{ .ParamGroupHashes.ALL }}
```

without defining any custom groups.
But to define custom control, yes, the operator developer would have to add all required parameters to a custom group.
It doesn't affect which plans a parameter is used in, though.
> a rolling restart of Pods, and sometimes the restart is unwanted and can negatively affect the state of the application
> the operator is controlling.
>
> One example would be the `replica` count of a StatefulSet: An increase here should only start new Pods and not restart
do we have more examples other than `replica`? I feel we just have this one specific use case of scaling sts/deployments up and down.
In the case of the other example, where an application regularly re-reads the config files, the trigger plan of the CM shouldn't include the sts/deployment in any of its tasks. That won't trigger any update.
> instead of the `last-plan-execution-uid` in a deployment or stateful set to trigger the reload of the pods:
>
> ```yaml
> spec:
> ```
let's take the example of Cassandra. If a statefulset is using `MAIN_GROUP` like this:

```yaml
annotations:
  config-hash: {{ .ParamGroups.MAIN_GROUP.hash }}
```

and a `configmap` uses the replica count to generate the configuration, as Cassandra does:

```yaml
seeds: "{{- range $i, $node := until (int (min 2 .Params.NODE_COUNT)) -}}
```

then that would still trigger a rolling restart of the stateful set with this solution. Even though the content of `seeds` is the same when scaling up, it might be different when scaling down.
So we will be asking operator developers to restrict the use of the parameters they need to generate configuration?
That would depend on whether the `NODE_COUNT` param is in the `MAIN_GROUP`.
In the case of Cassandra, I would argue that `NODE_COUNT` would not be in the `MAIN_GROUP`, even though it is used in the specific config map.
Which is actually an interesting argument against the automatic dependency graph...
> I would argue that NODE_COUNT would not be in the MAIN_GROUP, even though it is used in the specific config map.

that's buggy: what if the seeds content has actually changed and we really do need a rolling restart?

> Which is actually an interesting argument against the automatic dependency graph...

I see it as an argument against the current approach 🤔
If the dependency graph is implemented correctly, it will know that the seeds are the same, so there is no need to restart; and if they aren't the same, then it is totally valid to restart all pods.
I guess that depends on how the dependency graph is built. At the moment, the idea is to determine which parameters are used in which resource - and if any parameter changed, we need to restart.
Maybe it'd be possible to fully ignore the parameters in the dependency graph and only look at the rendered resources: if the rendered resource is the same as the currently deployed one, we don't need to restart.
The problem with this is the comparison, as K8s adds all the additional fields. But we do something similar with the three-way merge when applying resources, so it might be possible.
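A minimal sketch of that rendered-resource comparison, assuming we simply drop the server-populated fields before hashing (the exact field list is illustrative; a real implementation would need the merge machinery KUDO already uses):

```python
import copy
import hashlib
import json

def rendered_hash(resource: dict) -> str:
    # Hash a rendered resource after dropping fields the API server
    # fills in, so only template-driven content affects the result.
    r = copy.deepcopy(resource)
    r.pop("status", None)
    meta = r.get("metadata", {})
    for noisy in ("resourceVersion", "uid", "creationTimestamp", "managedFields"):
        meta.pop(noisy, None)
    # canonical JSON so key order does not change the hash
    return hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()

deployed = {"kind": "ConfigMap",
            "metadata": {"name": "cfg", "resourceVersion": "42"},
            "data": {"seeds": "node-0,node-1"}}
rerendered = {"kind": "ConfigMap",
              "metadata": {"name": "cfg"},
              "data": {"seeds": "node-0,node-1"}}
# identical rendered content -> same hash -> no restart needed
assert rendered_hash(deployed) == rendered_hash(rerendered)
```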
looks like my reject was reset... I would like to review today before we merge this
@kensipe Your reject wasn't reset - you're still marked as "Changes Requested" and the PR is blocked. I just re-requested your review because I changed so much :)
I think I'm a fan of a combination of ideas surfaced in this KEP discussion...
I like the label or annotation with a hash of the params used...
I like the idea of that being auto-generated based on the params applied to that template (for params that are "included").
Then any applied change that has a diff in the hash would require a restart (I would prefer this to be separate logic... the controller reconciliation should pick up on the diff and restart, or another controller for params).
I don't understand why the attribute `forcePodRestart` needs to include `force`. Can we just have `restart`? Or `podRestart`?
The grouping of params is an interesting idea... but can we do better? Can we have a convention-over-configuration approach?
> An operator would still want to configure if and when pod restarts may be required, as a pod can be aware of changing
> ConfigMaps.
>
> Calculating the dependency graph would be a very complex undertaking, and may not even be possible in all cases. The
I'm not sure this is true... I'm not sure what you may have in mind for a dependency graph, but it is fairly easy to get a list of tasks for a plan and a list of resources for those tasks, and it should be easy to get all the params used by those resources (we do this for package verification). Then we could have an attribute on params which indicates that a param is to be excluded (inclusion by default) from a triggered update. The automatic hash would be the collection of all "included" params for all resources of that plan. I think I like this approach more. It would be great to reduce the burden on the operator dev. I'm not a fan (as of yet) of the "grouping" of params.
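The "collect included params per plan" idea might look roughly like this (a sketch; the substring-based usage detection and the `exclude` attribute are assumptions made for illustration, not existing KUDO behavior):

```python
import hashlib

def plan_param_hash(plan_templates, params):
    # Collect every non-excluded parameter that any template in the
    # plan references, then hash the (name, value) pairs.
    used = set()
    for template in plan_templates:
        for name, attrs in params.items():
            # naive usage check: does the template mention .Params.<name>?
            if ".Params." + name in template and not attrs.get("exclude", False):
                used.add((name, attrs["value"]))
    return hashlib.sha256(repr(sorted(used)).encode()).hexdigest()

templates = ["replicas: {{ .Params.NODE_COUNT }}",
             "mem: {{ .Params.MEMORY }}"]
a = plan_param_hash(templates, {
    "NODE_COUNT": {"value": "3", "exclude": True},   # opted out
    "MEMORY": {"value": "512m"}})
b = plan_param_hash(templates, {
    "NODE_COUNT": {"value": "5", "exclude": True},   # changed, but excluded
    "MEMORY": {"value": "512m"}})
assert a == b  # the excluded param's change does not alter the hash
```

A production version would need real template parsing rather than a substring check (e.g. `.Params.NODE` would falsely match `.Params.NODE_COUNT` here).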
Yeah, I've talked with @zen-dog about the complexity of building this dependency graph, and it seems doable, although I feel it might be a bit complex. But I'll expand on this idea a bit more.
we should add another approach which is parameter-free: we can use the tasks' rendered resources to build a dependency graph. If the rendered content of a CM (resource) is the same, we know we don't need to update all dependent resources. But if the content of the CM has changed, we need to update the dependent resources.
All resources can have an annotation with the hash of the rendered resources on which they depend, and KUDO just updates the hash on each update.
When I say rendered resources, I mean a rendered version of the templates, for example https://github.com/kudobuilder/operators/blob/master/repository/kafka/operator/templates/health-check.yaml, where we can control the content - because with Kubernetes objects, any small status update will change the hash.
That sounds workable. I have some thoughts about resources that are deployed by different plans, but that should be solvable. I'll flesh that out and add it to the KEP.
@zmalik I've updated the KEP with your proposal, and I think that looks like the best solution so far.
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
keps/0027-pod-restart-controls.md
> ## Proposal 2 - Dependency Graph
>
> KUDO will analyze the operator and build a dependency graph of resources:
what about, instead of analyzing, providing developers a way to register the dependencies?

```yaml
spec:
  template:
    metadata:
      annotations:
        health-check-cm: {{ .TemplatesHash.healthcheck.yaml }}
        bootstrap-script: {{ .TemplatesHash.bootstrap.sh.yaml }}
```

On each update, we just update the hashes where they are used, which makes the "dependency" approach easier.
WDYT?
That would be easier on us, but prone to getting out of sync with real dependencies. Analyzing actual ones means we have a single source of truth.
Hmmm, I don't think the analysis would be too hard in these cases, so I don't think that part makes it more compelling.
If we were to go with this proposal, we'd at least have to replace any dots and special chars:

```yaml
bootstrap-script: {{ .TemplatesHash.bootstrap-sh-yaml }}
```

This might get more annoying if we at some point allow resources to be in subdirectories.
Pro:
- It would remove the requirement to have an annotation in a configmap/secret when that resource should not be included in the hash
- It would make it very clear which resources trigger a pod restart

Con:
- It would require operator developers to add new resources to the annotations.
- To replicate the current behavior, an operator developer would have to add all used resources in annotations.

I'm very undecided. My gut feeling likes the dependency variant better.
keps/0027-pod-restart-controls.md
> - Each deployment, stateful set and batch job will be analysed and contains a list of required resources (ConfigMaps and Secrets).
> - When a resource is deployed, KUDO calculates a hash from all dependent resources and adds that to the spec template annotation.
>
> The resources required by a top-level resource may not necessarily be deployed in the same plan as the resource itself. This can lead to an update of a required resource in a plan that does not deploy the top-level resource, and vice versa. To correctly calculate the hash of the required resources, this needs to be done in different ways:
> The resources required by a top-level resource may not necessarily be deployed in the same plan as the resource itself

I think it's reasonable to simply document this as a current restriction (unless we actually need this as of today). E.g. pipe-tasks can also only be referenced within the same plan, and so far nobody has complained 😉
I'd leave that as an implementation option. I think it'd be worth it to make it "correct" and have no restrictions, but if it turns out to be a hard problem to implement, we can go with the restriction.
keps/0027-pod-restart-controls.md
> - A config map that is used by a stateful set or deployment may not necessarily require a pod restart:
>   - In the Cassandra backup/restore branch, there is a sidecar container that waits for a backup command to be triggered. It reads the mounted config map every time the action is triggered, therefore removing the need to restart the full pod when the config map changes.
>   - If this config map were to be included in the hash calculation, it could trigger a restart of pods the next time the stateful set is deployed.
> - A solution would be to allow a special annotation in config maps and secrets that lets KUDO skip this resource in the hash calculation.
Simple to explain and implement 👍
My vote goes to the second proposal!
I think the newer proposal is worth pursuing. More risky, but it would indeed put much less burden on the operator developer.
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann aneumann@mesosphere.com
Relates to #1424