Increase the default weight of pod topology spread #88174
Quote from the deprecation policy:
I think the rule doesn't apply to this proposal since it does not impact the correctness of applications or the administration of the cluster.
Since nodepreferavoidpods is already set to a weight of 1000 by default, it sounds fair to me to increase the default weight for inter-pod affinity, node affinity, taint/toleration and pod topology spread. BTW: to some extent, we can use e2e tests to verify the proposed default weights. Right now we have some e2e tests covering priorities (scoring), but they depend on kubernetes/test/e2e/scheduling/priorities.go (line 345 in c2c3c47) and hence are IMO a bit fragile.
Strongly agree with this change. But, IMO, we may need to reconsider the default weights of each scoring plugin. Maybe they should have more weight tiers, rather than simply 1 or 1000. For example, from practical experience, in most cases
IMO, a reasonable criterion to guide which Score plugins should have a non-default weight is whether the Score plugin is associated with an API (e.g., the PodTopologySpread Score plugin is associated with
If we give a higher weight only to the API-facing scores, then workloads that don't use these features are not affected, which is a good thing. I think a good weight to give these scores is one that prevents them from being diluted among the rest of the scores. So, maybe a weight that is half the sum of all non-API-facing scores.
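As a quick arithmetic sketch of this "half the sum" rule: assuming the four non-API-facing default scores mentioned later in the thread, each at the default weight of 1 (the plugin names here are illustrative, not pulled from actual scheduler code), the proposal works out to a weight of 2 for the API-facing scores:

```python
# Hypothetical illustration of the "half the sum of the non-API-facing
# weights" proposal. Plugin names and the weight-1 defaults are assumptions
# drawn from this discussion, not real scheduler configuration.
non_api_facing_weights = {
    "NodeResourcesLeastAllocated": 1,
    "NodeResourcesBalancedAllocation": 1,
    "ImageLocality": 1,
    "DefaultPodTopologySpread": 1,
}

api_facing_weight = sum(non_api_facing_weights.values()) // 2
print(api_facing_weight)  # 2
```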
/cc @lavalamp Daniel, can you comment on whether or not the proposed change conforms with the k8s deprecation policy?
I hate to say it, but I think you have to make the new set of weights an option and use the deprecation process to move people over. It's quite difficult to predict what a change in these weights might do to the overall scheduling behavior, no? If it was already possible to specify the settings (via flag or config file), then maybe there's a case to be made that the deprecation process is heavier than needed here. To make that case, you'd need to test and be VERY sure that the new settings are actually an improvement. The change sounds like an improvement, but I'd like to see that quantified / tested somehow. Perhaps even more importantly, I'd like to know the conditions under which it makes things worse, and along which dimensions. @liggitt can give a second opinion if you don't like this answer :)
This may have an impact on availability (since by default the scheduler will prefer underutilized nodes less) and on pod startup (less preference for a node that already has the image) IF the pod has an explicit preference for being on a specific node or with a specific pod.
Those weights are actually configurable via the component config file, but most cloud providers don't allow customers to provide a custom configuration; they just use the default.
Regarding the change of behavior, this effectively creates two-level scoring: the first level is by user-specified preferences (affinity), the second level is by workload-agnostic ones (e.g., image existence). If users are not specifying preferences, then this change has no impact.
Those are good points, but how many users specify preferences? Should we encourage those users to use non-optional constraints instead? I can imagine arguing that making the preferences more likely to be obeyed makes it rarer, and therefore more surprising and worse, when they can't be honored.
I'm not sure if I'm understanding this correctly. Examples of non-optional scores are "topology spreading", "resource utilization" and "image locality". Users (i.e., Pods) cannot opt in or out of them. Cluster operators (on prem or cloud providers) can opt out or change weights, but for a whole cluster.
Remember that we are talking about node scoring. The nodes to be scored should have already passed the filters. Let's take an API-facing score (node affinity) and a non-API-facing score (image locality) to illustrate the effect of weights: Imagine a pod is using "preferred" node affinity and there are no nodes available that satisfy that affinity. That means that in this case all nodes would get a low (or zero) score for node affinity, so the score given by image locality will provide the difference in score between all nodes. Now, let's say some nodes satisfy the node affinity. Those nodes will have the same high score for node affinity, but they will differ in image locality, which will be (in this case) the determining factor to select a node. However, in real (default) clusters we have more than 2 scores. And the problem is that pod preferences end up being diluted among all the other default scores summed up. What the proposal would achieve is that we first honor pod preferences and then we differentiate among the resulting nodes using the default scores.
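A toy numeric sketch of this dilution effect (all node scores are hypothetical; the plugin names are illustrative, and the weighted-sum scoring follows the mechanism described in the issue):

```python
# Toy numbers: two candidate nodes that already passed filtering.
# Node A satisfies the pod's preferred node affinity; node B does not,
# but B wins on the workload-agnostic scores.
scores = {
    # plugin: (node_a, node_b), each normalized to [0, 100]
    "NodeAffinity":         (100, 0),
    "LeastAllocated":       (40, 90),
    "BalancedAllocation":   (50, 80),
    "ImageLocality":        (0, 100),
}

def total(node_idx, weights):
    # Node score = sum of (plugin score * plugin weight).
    return sum(w * scores[p][node_idx] for p, w in weights.items())

equal = {p: 1 for p in scores}
print(total(0, equal), total(1, equal))      # 190 270: affinity diluted, B wins

boosted = dict(equal, NodeAffinity=2)        # raise only the API-facing weight
print(total(0, boosted), total(1, boosted))  # 290 270: the preference now wins
```

With equal weights, the explicit preference is outvoted by the sum of the generic scores; doubling just the API-facing weight restores it without touching pods that express no preference.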
OK, so I guess the case where this hurts users is:
But it's still probabilistic, so we can expect that some percentage of the time the preferences will continue to not get honored? Numerically, suppose 50% have preferences, 2% of those got them wrong, 80% of those are currently diluted (i.e., not already broken), and 10% of those are in environments where they'll suddenly get honored 100% after this change (becoming broken). That's .5*.02*.8*.1 = .08% of users, or 1 out of 1250 users this change could break. That's still maybe a little high for my taste, but maybe you have an alternative estimate?
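The back-of-envelope estimate above checks out:

```python
# Reproducing the estimate from the comment above; all fractions are the
# commenter's assumptions, not measured data.
have_prefs = 0.50      # fraction of users with preferences
wrong_prefs = 0.02     # of those, fraction who got them wrong
diluted = 0.80         # of those, fraction currently diluted (not yet broken)
newly_honored = 0.10   # of those, fraction suddenly honored after the change

broken = have_prefs * wrong_prefs * diluted * newly_honored
print(round(broken * 100, 2))  # 0.08 (% of users)
print(round(1 / broken))       # 1250: roughly 1 in 1250 users
```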
I think we can tweak how many of the currently diluted cases become honored. In #88174 (comment) I proposed that we use a weight of half the sum of the non-API-facing weights. We can be less aggressive and do a quarter or just 2 (we currently put a weight of 1 for all).
On another note, I would say weights are actually part of the component config API. In both v1alpha1 and the new v1alpha2 (#87617), you can tweak the weights of individual scores. Users might be setting up these weights assuming that the defaults are 1. This is not a problem with the old Policy API, because you have to specify all the scores with their weights. So, if we decide to change the weights, better do so now for v1alpha2.
Daniel, I don't expect that 50% of k8s workloads have preferred pod/node affinities, but I don't have data to suggest an alternative estimate. I guess we could make the change behind a feature flag that gets graduated over multiple releases. |
It is permissible for the default to change over versions, in fact that's the way to do it for API types. Since the type is alpha, I guess it's OK to just do this. So, if the config file is v1alpha1, do the old defaults; if it's v1alpha2, use the new ones. How does that sound? |
sgtm. But the question of what's a good weight for api-facing scores still remains. By default we only have 4 non-api-facing scores:
So my two proposals are actually the same number: 2 😓 |
What about the case where a config file is not provided at all (which is very common across all cloud providers)? I guess we will have to keep the old defaults, and so we didn't actually solve the issue, right? it will also be confusing to users who pass a config file that doesn't change any of the weights. |
If there's no config file at all, I think it's fair game; users can easily get the old behavior if needed. Others might disagree with me, but that's what I think. A completely separate thought: are you sure changing the weights is good? What about changing the way they're evaluated? In ML, for example, it's common to use a softmax function at the last layer to force a NN to make a choice / assign a consistent set of probabilities to each option. I think the way that might look here is instead of using F(N) as F's priority for node N, use exp(F(N)) / SUM (N0->Nn) { exp(F(N)) }. This would have the effect of magnifying minor preferences. Basically, I'm claiming that priority functions should be distributing the same amount of "probability mass" over the N possible nodes. It's not fair for one function to say ".3" for everything while another says ".6" for everything. You maybe need to normalize this before applying the weighting function anyway. But if you do the softmax, you might not even need the weight, as priority functions that give each node the same score will basically be no-ops, letting ones with differing opinions win.
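A minimal sketch of the softmax idea described above (this is not how the scheduler evaluates scores today; it just illustrates the exp(F(N)) / SUM exp(F(N)) formula from the comment):

```python
import math

def softmax(raw):
    """Turn one priority function's raw node scores into a probability
    distribution, so every function spends the same total 'probability mass'."""
    exps = [math.exp(s) for s in raw]
    z = sum(exps)
    return [e / z for e in exps]

# A function that is indifferent (same score everywhere) becomes a no-op:
flat = softmax([0.3, 0.3, 0.3])
print(flat)  # uniform: each node gets 1/3

# A function with an actual opinion concentrates its mass:
opinionated = softmax([2.0, 0.0, 0.0])
print(opinionated)  # most of the mass lands on the first node
```

The indifferent function contributes the same amount to every node, so only the functions with differing opinions move the ranking, which is exactly the claimed effect.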
Interesting proposal, but don't you think weights will still be needed because unlike NN, the scheduler didn't go through a training phase to discover which features (in this case priority functions) provide the most important signal? |
Let's consider a scenario with three simplified scores:
Now let's assume they give the following scores:
Notice that they all have the same distribution (so they all give good signal). Even when using softmax the sum of scores (with the same weight for every score) will be:
Nodes 1, 4 and 7 have the same score and thus the same probability to be chosen. Now, in the default scheduler, we have 4 automatic scores, so the user preference is quite unlikely to get respected. It's not necessarily a matter of good signal from the score. |
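The original score tables are not shown above, so here is a hypothetical reconstruction of the three-way tie: three scores with the same distribution, each favoring a different node, summed after softmax normalization. The node scores are invented for illustration only.

```python
import math

def softmax(raw):
    z = sum(math.exp(s) for s in raw)
    return [math.exp(s) / z for s in raw]

# Hypothetical stand-in for the elided tables: 9 nodes, three scores with
# the same shape, each favoring a different node (1-indexed: 1, 4 and 7).
user_affinity  = [1, 0, 0, 0, 0, 0, 0, 0, 0]  # the user's preference
utilization    = [0, 0, 0, 1, 0, 0, 0, 0, 0]
image_locality = [0, 0, 0, 0, 0, 0, 1, 0, 0]

per_score = [softmax(s) for s in (user_affinity, utilization, image_locality)]
totals = [sum(col) for col in zip(*per_score)]

best = max(totals)
tied = [i + 1 for i, t in enumerate(totals) if abs(t - best) < 1e-9]
print(tied)  # [1, 4, 7]: the user's preferred node is only one of three winners
```

Even though every score gives a clean signal, by symmetry the preferred node ties with two nodes the user never asked for, so the preference is honored only a third of the time.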
Yup good points-- I agree a weight is still needed. If all the numbers are 0 or 1 (I had thought that it was a floating point and not a binary signal) then I don't think softmax is going to be different than a linear normalization (?). I guess it's debatable whether one wants a priority function that selects fewer nodes to have its opinion on those nodes be weighted more highly, which is what normalization would do. Anyway sorry if this is a rabbit hole. :) I think this idea would make more sense if the priority functions could express a strength for their preference and not just yes/no. |
It's not binary necessarily... I was over simplifying. Node affinity and image locality certainly are, but node utilization is not. |
It might be worth some experiments, then... if you have some way of telling whether or not it's actually better?
We have an HA use case that involves spreading pods across nodes and zones with preferred scheduling, since we'd rather schedule the pods than not. Before 1.18, we've been using something like the following:
Now with the 1.18 beta, we tried the following:
The latter yields better scheduling results than the former on a 9-node multi-zone cluster with 3 nodes in each zone. I'm not sure how these results fit into the discussion so far, but I would have expected either option by itself to yield a better spread. Combining the two lets me reinforce my desires to the scheduler. IMO, I would like to see the scheduler give my explicit request more weight.
@rtheis thanks a lot for sharing. Giving more weight to explicit requests is what we are proposing. However, I'm a little surprised that you need these constraints. The scheduler already spreads by default for pods belonging to services, replica sets or stateful sets. Are these pods not members of any of those? I assume you are not modifying the scheduler configuration? But, of course, there are other scores in play. It makes sense that spreading constraints are more effective than antiaffinity, because there is a lot of antiaffinity with the node that is ultimately selected as well. Adding both specs works too. However, this increases scheduling latency, as we have to calculate more things. The main motivation for this issue was actually to make scoring more effective for the opposite case, which is not done by default: "packing". But your use case is just as valid.
@alculquicondor These pods are part of a deployment. I tried without any scheduling preferences and the results are similar to podAntiAffinity preferred scheduling. Here is the baseline deployment yaml used for the tests:
I have not modified the scheduler configuration. |
Yes, deployments are affected because they create a replica set that owns the pods. But there are always other scores at play that might also be important, which is why a "preferred" spreading is never going to be perfect. This is kind of the point that @lavalamp was trying to make: guarding users from over-fitting and degrading the clusters in some way.
urgh, this will be tricky to do since defaulting will be done via component config, and so we don't have a way to communicate whether the values were defaults or explicitly set by the user. If defaulting is done in factory.go then it might be possible, but we shouldn't do that. |
Defaulting is going to happen in v1alpha2/defaults.go in the near future |
We can have it in the internal type, but not the external one then. This will hide it from the users. But how do we communicate that to conversion logic?! |
It doesn't seem there is a way because defaulting happens before conversion... |
To make the new parameter semantically clear, we can treat it as a weight override rather than a percentage of the plugin weight, and so its default would be 1 rather than 50%. Internally, the plugin computes the ratio as the override weight divided by the plugin weight, which it uses to scale scores computed for pods that don't have topology spread specified in their spec.
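A sketch of those override semantics (the names `override_weight` and `plugin_weight` are hypothetical illustrations of the proposal, not actual component-config fields):

```python
# Hypothetical sketch of the "weight override" semantics proposed above.
def scaled_score(raw_score, plugin_weight, override_weight=1,
                 pod_has_spread_constraints=False):
    if pod_has_spread_constraints:
        # Pod-specified constraints use the raw score, so they get the
        # full plugin weight when the framework multiplies it in.
        return raw_score
    # Pods relying on default spreading get scaled so their effective
    # weight is override_weight instead of plugin_weight.
    return raw_score * (override_weight / plugin_weight)

# With plugin weight 2 and the default override of 1, default spreading
# counts half as much as pod-specified spreading:
print(scaled_score(100, plugin_weight=2))   # 50.0
print(scaled_score(100, plugin_weight=2, pod_has_spread_constraints=True))  # 100
```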
Got it now. I thought the default constraints would default to nil if not specified by users.
I think we should introduce a field called "multiplexer" and default to 1, so it's friendly to defaulting and conversion. |
+1.
What if we start giving meaning to MaxSkew for scoring? If you set a low MaxSkew, you probably want it to have a higher effect than if you set a bigger value. We can discuss the exact formula, but I think adding these semantics makes sense.
That's possible. Current Score function doesn't take
Note that if we add another parameter to the args, it would only be relevant for scoring, adding to the confusion.
The scores
Each score on its own gives the differentiation we want, but they kind of have different scales. As a result
Simulations
Environment
I'm using a 9-node cluster (3 in each zone) as an example. I've set different initial loads, and in all cases I simulate a Deployment of 30 pods (each assumed to take 1% of node resources). Here are the scenarios, where numbers are percentage of use:
Results
For equal score weights (current configuration), this is the distribution of new pods:
Note how a node with only 40% utilization gets only 1 or 0 pods.
For double score weights
Only at 80% utilization do we observe bigger skews, which is reasonable. But the spreading is still pretty good at 40%.
For planned default spreading
See KEP, but this might change, depending on feedback here.
Conclusion
I think this is enough justification to increase the weight of
Sounds reasonable. It's mathematically equivalent to changing the formula to e.g. Have you considered doing that instead of changing the weight? It seems like that'd help everyone except those who specifically already increased
It'd make sense to me if the available functions were in the same scale / units so that they'd compose better, but it's not obvious how to do that--what would the units be? One thought is to make the units "std deviations from the mean"--if we knew the output frequency of each function, we could maybe do that. So basically this would measure how unusual a preference is, I guess. (I don't think this is an actionable thought. It really bugs me that the outputs of these functions are combined when they are not really in the same units, though!)
Yes, 100 is the maximum score we have defined for any plugin to have.
I'm not convinced this would work. The variance in my examples for initial resource usage is quite high, so it would still get a lot of importance. But you have triggered me to start thinking about the linearity of the scores. The formula
On the other hand, I'll experiment with changing the linearity of either function, but I'd prefer to do so for the beta plugin.
Here are 2 alternatives, with the same situations as #88174 (comment):
A: Modify PodTopologySpread
B: Modify NodeResourcesLeastAllocated
Conclusion
With either solution, we wouldn't need to modify the weights 🥳 I think option B is simpler to understand and gives slightly better results (only noticeable in the last case), so that would be my algorithmic preference. However, I'm fine with option A.
Yeah, that's a great idea. You can generalize that and give each function both a linear weight and a "gamma" (name from image processing): (but in this case I'm in favor of just making a targeted fix to that one function)
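The "gamma" idea can be sketched with the standard gamma curve from image processing: normalize the score, raise it to a power, then apply the linear weight. This is an illustration of the suggestion, not anything taken from the scheduler:

```python
# Sketch of a per-function gamma on top of the linear weight. gamma > 1
# amplifies differences between mid-range scores; gamma < 1 flattens them;
# gamma = 1 recovers plain weighted scoring.
MAX_SCORE = 100

def weighted_gamma_score(raw, weight=1.0, gamma=1.0):
    normalized = raw / MAX_SCORE
    return weight * MAX_SCORE * normalized ** gamma

print(weighted_gamma_score(50, gamma=1.0))            # 50.0: identity
print(weighted_gamma_score(50, gamma=2.0))            # 25.0: amplified
print(round(weighted_gamma_score(50, gamma=0.5), 1))  # 70.7: flattened
```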
Awesome, |
how would option B impact node ranking in the absence of spreading? would image size scores start to have bigger impact? |
I think SIG Scheduling should probably have a discussion. I could go either way, my naive inclination is to just go for the change, but I'm sure it'll break someone. We probably need a way to get back the old behavior, or a suggested weight change that would approximate it (not clear how well that could do though). I also have @ahg-g's question, which is how well does this new value compose with other functions? I feel like the proper answer looks like an evaluation criteria + {long list of scenarios} + {list of workloads to schedule} + randomized testing / benchmarking. Basically the multi-dimensional version of the analysis here. That sounds like a lot of work but it would permit actual optimization of these functions.
configurable gamma is an obvious way... :/ |
I don't think we want another config parameter, we already have the weight concept and it is configurable, so I think we should go with that, just double the spreading score weight by default, and users who don't favor this behavior can configure it back to 1.
Yeah, that sounds expedient.
In short, yes. Although that score has a mechanism to avoid the "node heating problem" (see kubernetes/pkg/scheduler/framework/plugins/imagelocality/image_locality.go, lines 105 to 112 at 0f2cccc).
But that same reasoning doesn't apply to option A (improving PodTopologySpreading), for the following reasons:
I would argue that option
I am not clear on the following: you say "we could have undesirable consequences unless we carefully analyze the impact." but then you argue for the option at the end, can you clarify?
We had default spreading that was giving a similarly strong signal; we should compare with that. My concern with modifying the
Sorry, I meant option A, improve PodTopologySpread |
One potential advantage of doubling the weight is that topology spread currently offers a signal comparable to default spreading, which we can think of as a baseline for spreading signal since it has been used for a while now. Doubling the weight will allow admins to revert to the old behavior if they wish to do so. Also, from my perspective, doubling the weight gives better placement at high utilization (
Problem:
Preferred pod/node affinity and preferred PodTopologySpread may not be respected because of conflicts with other scoring functions, like node utilization or image locality.
Scheduler scoring functions have weights; by default they are all set to 1 (apart from one): https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/algorithmprovider/registry.go#L118-L125. Having equal weights is problematic because of the way we score: each Score plugin returns a score in [0, MaxScore], which gets multiplied by the configured weight of the plugin, and a node's score is the sum over all weighted scores. Therefore, even if the user explicitly indicates that they prefer to place pod x on node y, this preference may not be respected and the scheduler may decide to place the pod on node z because it has lower utilization and already has the container image that the pod requests.
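The weighted-sum mechanism above can be sketched in a few lines (plugin names and scores are illustrative, not real scheduler output):

```python
# Minimal sketch of the scoring described above: each plugin returns a
# score in [0, MaxScore] per node, the framework multiplies it by the
# plugin's configured weight, and a node's final score is the sum.
def node_score(plugin_scores, weights):
    return sum(weights[p] * s for p, s in plugin_scores.items())

weights = {"NodeAffinity": 1, "LeastAllocated": 1, "ImageLocality": 1}

node_y = {"NodeAffinity": 100, "LeastAllocated": 30, "ImageLocality": 0}
node_z = {"NodeAffinity": 0, "LeastAllocated": 90, "ImageLocality": 80}

# With equal weights, the explicit preference for node y loses to the
# generic scores favoring node z:
print(node_score(node_y, weights), node_score(node_z, weights))  # 130 170
```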
Proposed Solution:
Scoring functions that allow users to explicitly express preference should have higher weights than generic scoring functions. This includes: inter-pod affinity, node affinity, taint/toleration and pod topology spread.
This issue is proposing to change the default behavior of the scheduler. I am not sure how the k8s deprecation policy applies here, although I would argue that this change should help the scheduler make decisions closer to users' expectations.
We will start with pod topology spread since, as of creating this issue, the feature is still in beta.
/sig scheduling