Increase the default weight of pod topology spread #88174

Closed
ahg-g opened this issue Feb 14, 2020 · 61 comments · Fixed by #91258
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@ahg-g
Member

ahg-g commented Feb 14, 2020

Problem:
Preferred pod/node affinity and preferred PodTopologySpread may not be respected because of conflicts with other scoring functions, like node utilization or image locality.

Scheduler scoring functions have weights; by default they are all set to 1 (apart from one): https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/algorithmprovider/registry.go#L118-L125. Having equal weights is problematic because of the way we score: each Score plugin returns a score in [0, MaxScore], which gets multiplied by the configured weight of the plugin, and a node's score is the sum of all weighted plugin scores. Therefore, even if the user explicitly indicates that they prefer to place pod x on node y, this preference may not be respected: the scheduler may decide to place the pod on node z because it has lower utilization and already has the container image that the pod is requesting.
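
A minimal sketch of this weighted-sum scoring (illustrative only, not the actual scheduler code; the plugin names and score values below are made up):

package main

import "fmt"

type pluginScore struct {
	name   string
	weight int64
	scores map[string]int64 // node name -> score in [0, 100]
}

// totalScores sums weight * score per node across all plugins,
// mirroring how the final node score is computed.
func totalScores(plugins []pluginScore, nodes []string) map[string]int64 {
	totals := make(map[string]int64, len(nodes))
	for _, p := range plugins {
		for _, n := range nodes {
			totals[n] += p.weight * p.scores[n]
		}
	}
	return totals
}

func main() {
	nodes := []string{"y", "z"}
	plugins := []pluginScore{
		// The user explicitly prefers node y via preferred node affinity.
		{"NodeAffinity", 1, map[string]int64{"y": 100, "z": 0}},
		// Generic scores happen to favor node z.
		{"NodeResourcesLeastAllocated", 1, map[string]int64{"y": 20, "z": 90}},
		{"ImageLocality", 1, map[string]int64{"y": 0, "z": 60}},
	}
	// With equal weights, z (150) beats y (120) despite the explicit preference.
	fmt.Println(totalScores(plugins, nodes))
}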

Proposed Solution:
Scoring functions that allow users to explicitly express preference should have higher weights than generic scoring functions. This includes: inter-pod affinity, node affinity, taint/toleration and pod topology spread.

This issue proposes to change the default behavior of the scheduler. I am not sure how the k8s deprecation policy applies here, but I would argue that this change helps the scheduler make decisions closer to users' expectations.

We will start with pod topology spread since, as of creating this issue, the feature is still in beta.

/sig scheduling

@ahg-g ahg-g added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 14, 2020
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 14, 2020
@ahg-g
Member Author

ahg-g commented Feb 14, 2020

Quote from the deprecation policy:

Rule #7: Deprecated behaviors must function for no less than 1 year after their announced deprecation.

This does not imply that all changes to the system are governed by this policy. This applies only to significant, user-visible behaviors which impact the correctness of applications running on Kubernetes or that impact the administration of Kubernetes clusters, and which are being removed entirely.

I think the rule doesn't apply to this proposal since it doesn't impact the correctness of applications or the administration of the cluster.

@Huang-Wei
Member

Huang-Wei commented Feb 14, 2020

As nodepreferavoidpods already sets its weight to 1000 by default, it sounds fair to me to increase the default weight for inter-pod affinity, node affinity, taint/toleration and pod topology spread.

BTW: to some extent, we can use e2e tests to verify the proposed default weights. Right now we have some e2e tests covering priorities (scoring), but they depend on createBalancedPodForNodes to ensure cpu/mem usage is at the same ratio:

func createBalancedPodForNodes(f *framework.Framework, cs clientset.Interface, ns string, nodes []v1.Node, requestedResource *v1.ResourceRequirements, ratio float64) error {

and hence are, IMO, a bit fragile.

@wgliang
Contributor

wgliang commented Feb 15, 2020

As nodepreferavoidpods already sets its weight to 1000 by default, it sounds fair to me to increase the default weight for inter-pod affinity, node affinity, taint/toleration and pod topology spread.

Strongly agree with this change. But, IMO, we may need to reconsider the default weights of each scoring plugin. Maybe there should be more weight tiers, rather than simply 1 or 1000. For example, from practical experience, in most cases nodeaffinity should not be set to the same weight as nodepreferavoidpods. (Yes, we need to consider this for users, because most users may not notice this change.)

@Huang-Wei
Member

IMO a reasonable criterion to guide which Score plugins should have a non-default weight is: whether the Score plugin is associated with an API (e.g., the PodTopologySpread Score plugin is associated with .spec.topologySpreadConstraints[*].whenUnsatisfiable == ScheduleAnyway).

@alculquicondor
Member

If we give a higher weight only to the API-facing scores, then workloads that don't use these features are not affected, which is a good thing.

I think a good weight to give these scores is one that avoids them being diluted among the rest of the scores. So, maybe a weight that is half the sum of all non-API-facing scores.
Now, if a user uses the API for more than one score, I think we can safely assume that they want equal weights for them.

@ahg-g
Member Author

ahg-g commented Feb 19, 2020

/cc @lavalamp

Daniel, can you comment on whether or not the proposed change conforms with k8s deprecation policy?

@lavalamp
Member

I hate to say it, but I think you have to make the new set of weights an option and use the deprecation process to move people over. It's quite difficult to predict what a change in these weights might do to the overall scheduling behavior, no?

If it was already possible to specify the settings (via flag or config file), then maybe there's a case to be made that the deprecation process is heavier than needed here. To make that case, you'd need to test and be VERY sure that the new settings are actually an improvement. The change sounds like an improvement, but I'd like to see that quantified / tested somehow. Perhaps even more importantly, I'd like to know the conditions under which it makes things worse, and along which dimensions.

@liggitt can give a second opinion if you don't like this answer :)

@ahg-g
Member Author

ahg-g commented Feb 19, 2020

I hate to say it, but I think you have to make the new set of weights an option and use the deprecation process to move people over. It's quite difficult to predict what a change in these weights might do to the overall scheduling behavior, no?

This may have an impact on availability (since the scheduler will give less preference to underutilized nodes) and pod startup latency (less preference for nodes that already have the image) IF the pod has an explicit preference for being on a specific node or with a specific pod.

If it was already possible to specify the settings (via flag or config file), then maybe there's a case to be made that the deprecation process is heavier than needed here.

Those weights are actually configurable via the component config file, but most cloud providers don't allow customers to provide a custom configuration; they just use the defaults.

To make that case, you'd need to test and be VERY sure that the new settings are actually an improvement. The change sounds like an improvement, but I'd like to see that quantified / tested somehow. Perhaps even more importantly, I'd like to know the conditions under which it makes things worse, and along which dimensions.

@liggitt can give a second opinion if you don't like this answer :)

Regarding the change of behavior, this effectively creates two-level scoring: the first level is user-specified preferences (affinity), the second is workload-agnostic scores (e.g., image existence). If users are not specifying preferences, then this change has no impact.

@lavalamp
Member

Those are good points, but how many users specify preferences? Should we encourage those users to use non-optional constraints, instead? I can imagine arguing that making the preferences more likely to be obeyed makes it rarer and therefore more surprising and worse when they can't be honored.

@alculquicondor
Member

alculquicondor commented Feb 19, 2020

Should we encourage those users to use non-optional constraints, instead?

I'm not sure if I'm understanding this correctly. Examples of non-optional scores are "topology spreading", "resource utilization" and "image locality". Users (aka Pods) cannot opt in to or out of them. Cluster operators (on prem or cloud-providers) can opt out or change weights, but for a whole cluster.

and worse when they can't be honored.

Remember that we are talking about node scoring. The nodes to be scored should have already passed the filters. Let's take an API-facing score (node affinity) and non-API-facing score (image locality) for illustrating the effect of weights:

Imagine a pod is using "preferred" node affinity and there are no nodes available that satisfy that affinity. That means that in this case all nodes would get a low (or zero) score for node affinity. In this case the score given by image locality will provide the difference in score between all nodes.

Now, let's say some nodes satisfy the node affinity. Those nodes will have the same high score for node affinity. But they will differ in image locality, which will be (in this case) the determining factor to select a node.

However, in real (default) clusters we have more than 2 scores. And the problem is that pod preferences end up being diluted with all the other default scores summed up. What the proposal would achieve is that we first honor pod preferences and then we differentiate among the resulting nodes using the default scores.

@lavalamp
Member

OK, so I guess the case where this hurts users is:

  • They specified preferences
  • The preferences are actually bad (plot twist)
  • ...but they're diluted so much that they're not having much effect
  • but now the scheduler starts honoring them more often

But it's still probabilistic, so we can expect some percentage of the time, the preferences continue to not get honored?

Numerically, suppose 50% have preferences, 2% of those got them wrong, 80% of those are currently diluted (i.e., not already broken), and 10% of those are in environments where they'll suddenly get honored 100% after this change (becoming broken).

That's .5*.02*.8*.1 = .08% of users, or 1 out of 1250 users this change could break.

That's still maybe a little high for my taste, but maybe you have an alternative estimate?

@alculquicondor
Member

I think we can tweak how many of the currently diluted cases become honored.

In #88174 (comment) I proposed that we use a weight of half the sum of the non-API-facing weights. We can be less aggressive and do a quarter or just 2 (we currently put weight of 1 for all).

@alculquicondor
Member

On another note, I would say weights are actually part of the component config API. In both v1alpha1 and the new v1alpha2 (#87617), you can tweak the weights of individual scores. Users might be setting up these weights assuming that the defaults are 1. This is not a problem with the old Policy API, because you have to specify all the scores with their weights.

So, if we decide to change the weights, better do so now for v1alpha2.

@ahg-g
Member Author

ahg-g commented Feb 20, 2020

Daniel, I don't expect that 50% of k8s workloads have preferred pod/node affinities, but I don't have data to suggest an alternative estimate.

I guess we could make the change behind a feature flag that gets graduated over multiple releases.

@lavalamp
Member

It is permissible for the default to change over versions, in fact that's the way to do it for API types. Since the type is alpha, I guess it's OK to just do this.

So, if the config file is v1alpha1, do the old defaults; if it's v1alpha2, use the new ones. How does that sound?

@alculquicondor
Member

sgtm. But the question of what's a good weight for api-facing scores still remains.

By default we only have 4 non-api-facing scores:

  • NodeResourcesBalancedAllocation
  • ImageLocality
  • NodeResourcesLeastAllocated
  • DefaultPodTopologySpread

So my two proposals are actually the same number: 2 😓

@ahg-g
Member Author

ahg-g commented Feb 20, 2020

What about the case where a config file is not provided at all (which is very common across all cloud providers)? I guess we will have to keep the old defaults, and so we didn't actually solve the issue, right? It will also be confusing to users who pass a config file that doesn't change any of the weights.

@lavalamp
Member

If there's no config file at all, I think it's fair game, users can easily get the old behavior if needed. Others might disagree with me, but that's what I think.

A completely separate thought: are you sure changing the weights is good? What about changing the way they're evaluated?

E.g. in ML it's common to use e.g. a softmax function at the last layer to force a NN to make a choice / assign a consistent set of probabilities to each option. I think the way that might look here is instead of using F(N) as F's priority for node N, use exp(F(N)) / SUM (N0->Nn) { exp(F(N)) }. This would have the effect of magnifying minor preferences.

Basically, I'm claiming that priority functions should be distributing the same amount of "probability mass" over the N possible nodes. It's not fair for one function to say ".3" for everything while another says ".6" for everything. You maybe need to normalize this before applying the weighting function anyway. But if you do the softmax, you might not even need the weight, as priority functions that give each node the same score will basically be no-ops, letting ones with differing opinions win.
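
A rough sketch of this softmax idea (purely illustrative; the scheduler does not do this today): each priority function distributes the same total "probability mass" over the candidate nodes, so a function that scores every node identically becomes a no-op.

package main

import (
	"fmt"
	"math"
)

// softmax turns raw per-node scores into a probability distribution,
// so every priority function spends the same total "probability mass".
func softmax(scores []float64) []float64 {
	exp := make([]float64, len(scores))
	var sum float64
	for i, s := range scores {
		exp[i] = math.Exp(s)
		sum += exp[i]
	}
	for i := range exp {
		exp[i] /= sum
	}
	return exp
}

func main() {
	// A function with a clear opinion still differentiates nodes...
	fmt.Println(softmax([]float64{0, 0, 1})) // ~[0.21 0.21 0.58]
	// ...while an indifferent function becomes a no-op.
	fmt.Println(softmax([]float64{1, 1, 1})) // [0.33 0.33 0.33]
}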

@ahg-g
Member Author

ahg-g commented Feb 20, 2020

Interesting proposal, but don't you think weights will still be needed because unlike NN, the scheduler didn't go through a training phase to discover which features (in this case priority functions) provide the most important signal?

@alculquicondor
Member

alculquicondor commented Feb 20, 2020

Let's consider a scenario with three simplified scores:

  1. user selected: a score that prefers a particular set of nodes (maybe a particular zone).
  2. automatic: existence of an image.
  3. automatic: low utilization

Now let's assume they give the following scores:

s1: 0 0 0 1 1 1 0 0 0
s2: 1 1 0 0 0 0 1 0 0
s3: 1 0 0 1 0 0 1 0 0

Notice that they all have the same distribution (so they all give good signal). Even when using softmax the sum of scores (with the same weight for every score) will be:

t: 2 1 0 2 1 1 2 0 0

Nodes 1, 4 and 7 have the same score and thus the same probability to be chosen.
I would argue that the user's intention was that node 4 be selected. It's still called a preference because if s1 gave 0 to all nodes, the pod could still go to any node, and the automatic scores would make the whole decision.

Now, in the default scheduler, we have 4 automatic scores, so the user preference is quite unlikely to get respected. It's not necessarily a matter of good signal from the score.
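
A small sketch reproducing this toy scenario (illustrative only): with equal weights nodes 1, 4 and 7 tie, while giving the user-facing score s1 a weight of 2 makes node 4 win outright.

package main

import "fmt"

// weightedSum adds up weight * score per node across all score functions.
func weightedSum(weights []int, scores [][]int) []int {
	totals := make([]int, len(scores[0]))
	for i, s := range scores {
		for j, v := range s {
			totals[j] += weights[i] * v
		}
	}
	return totals
}

func main() {
	s1 := []int{0, 0, 0, 1, 1, 1, 0, 0, 0} // user selected (e.g. a preferred zone)
	s2 := []int{1, 1, 0, 0, 0, 0, 1, 0, 0} // image locality
	s3 := []int{1, 0, 0, 1, 0, 0, 1, 0, 0} // low utilization

	fmt.Println(weightedSum([]int{1, 1, 1}, [][]int{s1, s2, s3})) // [2 1 0 2 1 1 2 0 0]: nodes 1, 4, 7 tie
	fmt.Println(weightedSum([]int{2, 1, 1}, [][]int{s1, s2, s3})) // [2 1 0 3 2 2 2 0 0]: node 4 wins
}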

@lavalamp
Member

Yup good points-- I agree a weight is still needed.

If all the numbers are 0 or 1 (I had thought that it was a floating point and not a binary signal) then I don't think softmax is going to be different than a linear normalization (?).

I guess it's debatable whether one wants a priority function that selects fewer nodes to have its opinion on those nodes be weighted more highly, which is what normalization would do.

Anyway sorry if this is a rabbit hole. :) I think this idea would make more sense if the priority functions could express a strength for their preference and not just yes/no.

@alculquicondor
Member

It's not necessarily binary... I was oversimplifying. Node affinity and image locality certainly are, but node utilization is not.

@lavalamp
Member

lavalamp commented Feb 21, 2020 via email

@rtheis

rtheis commented Mar 6, 2020

We have an HA use case that involves spreading pods across nodes and zones with preferred scheduling, since we'd rather schedule the pods than not. Before 1.18, we've been using something like the following:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: run
                  operator: In
                  values:
                  - both
              topologyKey: failure-domain.beta.kubernetes.io/zone
            weight: 50
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: run
                  operator: In
                  values:
                  - both
              topologyKey: kubernetes.io/hostname
            weight: 50

Now with the 1.18 beta, we tried the following:

      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: failure-domain.beta.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            run: both
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            run: both

The latter yields better scheduling results than the former on a 9-node multi-zone cluster with 3 nodes in each zone. So while topologySpreadConstraints was better, it surprisingly didn't yield a perfect spread. So I decided to combine both and ... boom ... we had a perfect pod spread for our deployment with 9 replicas.

I'm not sure how these results fit into the discussions so far but I would have expected either option by itself to yield a better spread. Combining the two lets me reinforce my desires to the scheduler. IMO, I would like to see the scheduler give my explicit request more weight.

@alculquicondor
Member

@rtheis thanks a lot for sharing. Giving more weight to explicit requests is what we are proposing.

However, I'm a little surprised that you need these constraints. The scheduler already spreads by default when using services, replica sets or stateful sets. Are these pods not members of any of those? I assume you are not modifying the scheduler configuration? But, of course, there are other scores in play.

It makes sense that spreading constraints are more effective than antiaffinity, because there is a lot of antiaffinity with the node that is ultimately selected as well. Adding both specs works too. However, this increases scheduling latency, as we have to calculate more things.

The main motivation for this issue was actually to make scoring more effective for the opposite case, which is not done by default: "packing". But your use case is just as valid.

@rtheis

rtheis commented Mar 6, 2020

@alculquicondor These pods are part of a deployment. I tried without any scheduling preferences and the results are similar to podAntiAffinity preferred scheduling. Here is the baseline deployment yaml used for the tests:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: none
  name: none
spec:
  selector:
    matchLabels:
      run: none
  template:
    metadata:
      labels:
        run: none
    spec:
      containers:
      - name: none
        image: k8s.gcr.io/pause:3.1

I have not modified the scheduler configuration.

@alculquicondor
Member

Yes, deployments are affected because they create a replica set that owns the pods. But there are always other scores at play that might also be important, which is why a "preferred" spreading is never going to be perfect. This is kind of the point that @lavalamp was trying to make: guarding users from over-fitting and degrading the clusters in some way.

@ahg-g
Member Author

ahg-g commented Apr 18, 2020

urgh, this will be tricky to do since defaulting will be done via component config, and so we don't have a way to communicate whether the values were defaults or explicitly set by the user. If defaulting is done in factory.go then it might be possible, but we shouldn't do that.

@alculquicondor
Member

Defaulting is going to happen in v1alpha2/defaults.go in the near future

@ahg-g
Member Author

ahg-g commented Apr 18, 2020

We can have it in the internal type, but not the external one then. This will hide it from the users. But how do we communicate that to conversion logic?!

@ahg-g
Member Author

ahg-g commented Apr 18, 2020

It doesn't seem there is a way because defaulting happens before conversion...

@ahg-g
Member Author

ahg-g commented Apr 18, 2020

To make the new parameter semantically clear, we can treat it as a weight override rather than a percentage of the plugin weight, and so its default would be 1 rather than 50%. Internally, the plugin computes the ratio as the override weight divided by the plugin weight, and uses it to scale scores computed for pods that don't have topology spread specified in their spec.
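
A hypothetical sketch of this override semantics (the function and parameter names below are made up for illustration, not an actual kube-scheduler API):

package main

import "fmt"

// scaleDefaultSpreadScore scales a score by overrideWeight/pluginWeight, so that
// spreading applied via default constraints can be weighted independently of
// spreading requested explicitly in the pod spec. Names are hypothetical.
func scaleDefaultSpreadScore(score, overrideWeight, pluginWeight int64) int64 {
	if pluginWeight == 0 {
		return score
	}
	return score * overrideWeight / pluginWeight
}

func main() {
	// Plugin weight 2, override weight 1: pods without explicit constraints get
	// an effective weight of 1, while explicit constraints keep weight 2.
	fmt.Println(scaleDefaultSpreadScore(100, 1, 2)) // 50
}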

@Huang-Wei
Member

@Huang-Wei yes we are, but isn't the plan to have a default value for DefaultConstraints when not set: two constraints, one for zone and one for hostname, which mimic the current defaultpodspread behaviour and allows us to deprecate it?

Got it now. I thought the default constraints would default to nil if not specified by users.

It doesn't seem there is a way because defaulting happens before conversion...

I think we should introduce a field called "multiplexer" and default to 1, so it's friendly to defaulting and conversion.

@Huang-Wei
Member

To make the new parameter semantically clear, we can treat it as a weight override rather than a percentage of the plugin weight, and so its default would be 1 rather than 50%. Internally, the plugin computes the ratio as the override weight divided by the plugin weight, and uses it to scale scores computed for pods that don't have topology spread specified in their spec.

+1.

@alculquicondor
Member

alculquicondor commented Apr 19, 2020 via email

@Huang-Wei
Member

Huang-Wei commented Apr 20, 2020

What if we start giving meaning to MaxSkew for scoring?

That's possible. The current Score function doesn't take maxSkew into consideration - it just follows a simple "the fewer matching Pods a topology has, the higher the score given to the Nodes in that topology" principle.

@alculquicondor
Member

Note that if we add another parameter to the args, it would only be relevant for scoring, adding to the confusion.

@ahg-g ahg-g changed the title Increase the weights of scoring plugins that implement explicit preferences expressed in the pod spec Increase the default weight of pod topology spread May 5, 2020
@alculquicondor
Member

alculquicondor commented May 14, 2020

The scores

  • PodTopologySpread: This score calculation is somewhat complicated, but can be summarized as:

    1. Calculate the number of matching pods per topology. There is a new feature that takes the skew into consideration: Use maxSkew in score calculation #90820
    2. Normalize by (minCount+maxCount-currCount)/maxCount*100.

    The effect is that nodes in the topology with the lowest count get a score of 100. The rest of the scores range from a certain number up to 100. A 0 score is given to a node with maxCount when minCount is 0. This formula has the effect of reducing differentiation among nodes as minCount increases.

  • NodeResourcesLeastAllocated: The formula is very simple, the score is equal to the percentage of free space (averaged for cpu and memory).

Each score on its own gives the differentiation we want, but they have somewhat different scales. As a result, NodeResourcesLeastAllocated tends to be stronger.
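
A rough sketch of the two normalization formulas described above, assuming scores on a 0-100 scale (a simplification of the real plugins):

package main

import "fmt"

// podTopologySpreadScore follows (minCount+maxCount-currCount)/maxCount*100:
// nodes in the least-crowded topology score 100, and differentiation shrinks
// as minCount grows.
func podTopologySpreadScore(minCount, maxCount, currCount int64) int64 {
	if maxCount == 0 {
		return 100
	}
	return (minCount + maxCount - currCount) * 100 / maxCount
}

// leastAllocatedScore is simply the percentage of free capacity,
// averaged over cpu and memory.
func leastAllocatedScore(freeCPUPct, freeMemPct int64) int64 {
	return (freeCPUPct + freeMemPct) / 2
}

func main() {
	fmt.Println(podTopologySpreadScore(0, 10, 10)) // 0: most-crowded topology when minCount is 0
	fmt.Println(podTopologySpreadScore(8, 10, 10)) // 80: far less differentiation at higher minCount
	fmt.Println(leastAllocatedScore(60, 80))       // 70
}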

Simulations

Environment

I'm using a 9-node cluster (3 nodes in each zone) as an example. I've set different initial loads, and in all cases I simulate a Deployment of 30 pods (each assumed to take 1% of node resources).

Here are the scenarios, where numbers are percentage of use:

[[10, 10, 0], [5, 10, 0], [5, 5, 5]]
[[20, 20, 0], [10, 20, 0], [10, 10, 10]]
[[40, 40, 0], [20, 40, 0], [20, 20, 20]]
[[80, 80, 0], [40, 80, 0], [40, 40, 40]]

Results

For equal score weights (current configuration), this is the distribution of new pods:

maxSkew=1 for both nodes and zones

[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[2, 2, 6], [3, 2, 5], [4, 3, 3]]
[[0, 0, 9], [3, 0, 8], [4, 3, 3]]

Note how a node with only 40% utilization gets only 1 or 0 pods.

For double score weights

maxSkew=1 for both nodes and zones

[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[2, 2, 6], [3, 2, 5], [4, 3, 3]]

Only at 80% utilization do we observe bigger skews, which is reasonable. But the spreading is still pretty good at 40%.

For planned default spreading

See KEP, but this might change, depending on feedback here.

[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[1, 1, 7], [4, 0, 6], [4, 4, 3]]

Conclusion

I think this is enough justification to increase the weight of PodTopologySpread to 2 in its 2nd beta release.

@lavalamp
Member

lavalamp commented May 14, 2020

Sounds reasonable.

It's mathematically equivalent to changing the formula to e.g. 200*(minCount+maxCount-currCount)/maxCount, right?

Have you considered doing that instead of changing the weight? It seems like that'd help everyone except those who specifically already increased PodTopologySpread, instead of just people who are using the defaults?

@lavalamp
Member

Each score on their own give the differentiation we want, but they kind of have different scales

It'd make sense to me if the available functions were in the same scale / units so that they'd compose better, but it's not obvious how to do that--what would the units be? One thought is to make the units "std deviations from the mean"--if we knew the output frequency of each function, we could maybe do that. So basically this would measure how unusual a preference is, I guess.

(I don't think this is an actionable thought. It really bugs me that the outputs of these functions are combined when they are not really in the same units, though!)

@alculquicondor
Member

It's mathematically equivalent to changing the formula to e.g. 200*(minCount+maxCount-currCount)/maxCount, right?

Yes

Have you considered doing that instead of changing the weight?

100 is the maximum score we have defined for any plugin.

One thought is to make the units "std deviations from the mean"--if we knew the output frequency of each function, we could maybe do that. So basically this would measure how unusual a preference is, I guess.

I'm not convinced this would work. The variance in my examples for initial resource usage is quite high. So it would still get a lot of importance.

But you have triggered me to start thinking about the linearity of the scores. The formula (minCount+maxCount-currCount)/maxCount is good because it reduces the variance as minCount increases. However, that reduction is too aggressive.

On the other hand, score=remaining_capacity is completely linear.

I'll experiment with changing the linearity of either function, preferring to do so for the beta plugin.

@alculquicondor
Member

alculquicondor commented May 15, 2020

Here are 2 alternatives, evaluated with the same scenarios as #88174 (comment)

A Modify PodTopologySpread

((minCount+maxCount-currCount)/maxCount) ^ 2

[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[1, 1, 7], [3, 1, 7], [4, 3, 3]]

B Modify NodeResourcesLeastAllocated

score = sqrt(remaining_capacity)

[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[3, 3, 4], [3, 3, 4], [4, 3, 3]]
[[1, 1, 7], [4, 1, 6], [4, 3, 3]]

Conclusion

With either solution, we wouldn't need to modify the weights 🥳
We don't get spreading quite as good as simply doubling the weight, but that is only in the last scenario, where we have usages of 80%. So I would call this rather a win (we should be paying greater attention to resource usage at that point).

I think option B is simpler to understand and gives slightly better results (only noticeable in the last case). So this would be my algorithmic preference.

However, I'm fine with option A.
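
A sketch of the two alternatives (assumed formulas for illustration, not the merged implementation): option A squares the PodTopologySpread normalization to sharpen its signal, while option B takes the square root of free capacity to flatten NodeResourcesLeastAllocated.

package main

import (
	"fmt"
	"math"
)

// Option A: square the PodTopologySpread ratio, rescaled to [0, 100].
func optionA(minCount, maxCount, currCount float64) float64 {
	r := (minCount + maxCount - currCount) / maxCount
	return r * r * 100
}

// Option B: score = sqrt(remaining capacity), rescaled to [0, 100].
func optionB(freePct float64) float64 {
	return math.Sqrt(freePct/100) * 100
}

func main() {
	// A keeps the spreading signal strong where it matters,
	// while B compresses the utilization signal toward the top of the range.
	fmt.Println(optionA(8, 10, 10)) // 64
	fmt.Println(optionB(20))        // ~44.7
}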

@lavalamp
Member

Yeah, that's a great idea.

You can generalize that and give each function both a linear weight and a "gamma" (name from image processing): weight * (function_output ^ gamma), or even go for weight * ((bias + function_output) ^ gamma)

(but in this case I'm in favor of just making a targeted fix to that one function)
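
A small sketch of that generalization (purely illustrative; no such knob exists in the scheduler config):

package main

import (
	"fmt"
	"math"
)

// shapedScore applies both a linear weight and a "gamma" exponent to a
// score function's output: weight * (bias + output)^gamma.
func shapedScore(output, weight, bias, gamma float64) float64 {
	return weight * math.Pow(bias+output, gamma)
}

func main() {
	// gamma > 1 sharpens differences between nodes, gamma < 1 flattens them,
	// and gamma == 1 with bias 0 reduces to the current weighted sum.
	fmt.Println(shapedScore(0.8, 1, 0, 2))   // 0.64
	fmt.Println(shapedScore(0.8, 1, 0, 0.5)) // ~0.89
	fmt.Println(shapedScore(0.8, 1, 0, 1))   // 0.8
}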

@alculquicondor
Member

Awesome.
Any concerns against option B, given that it's an older feature?

@ahg-g
Member Author

ahg-g commented May 15, 2020

How would option B impact node ranking in the absence of spreading? Would image size scores start to have a bigger impact?

@lavalamp
Member

I think SIG Scheduling should probably have a discussion. I could go either way, my naive inclination is to just go for the change, but I'm sure it'll break someone. We probably need a way to get back the old behavior, or a suggested weight change that would approximate it (not clear how well that could do though).

I also have @ahg-g's question, which is how well does this new value compose with other functions?

I feel like the proper answer looks like an evaluation criteria + {long list of scenarios} + {list of workloads to schedule} + randomized testing / benchmarking. Basically the multi-dimensional version of the analysis here. That sounds like a lot of work but it would permit actual optimization of these functions.

@lavalamp
Member

a way to get back the old behavior

configurable gamma is an obvious way... :/

@ahg-g
Member Author

ahg-g commented May 15, 2020

I don't think we want another config parameter; we already have the weight concept and it is configurable, so I think we should go with that: just double the spreading score weight by default, and users who don't favor this behavior can configure it back to 1.

@lavalamp
Member

lavalamp commented May 15, 2020 via email

@alculquicondor
Member

alculquicondor commented May 18, 2020

How would option B impact node ranking in the absence of spreading? Would image size scores start to have a bigger impact?

In short, yes, although that score has a mechanism to avoid the "node heating problem":

// scaledImageScore returns an adaptively scaled score for the given state of an image.
// The size of the image is used as the base score, scaled by a factor which considers how much nodes the image has "spread" to.
// This heuristic aims to mitigate the undesirable "node heating problem", i.e., pods get assigned to the same or
// a few nodes due to image locality.
func scaledImageScore(imageState *framework.ImageStateSummary, totalNumNodes int) int64 {
	spread := float64(imageState.NumNodes) / float64(totalNumNodes)
	return int64(float64(imageState.Size) * spread)
}

We could have undesirable consequences unless we carefully analyze the impact.

But that same reasoning doesn't apply to option A (improving PodTopologySpread), for the following reason:

I would argue that option A is better than simply doubling the weight because its signal is more conservative when it matters (when nodes have high usage).

@ahg-g
Member Author

ahg-g commented May 18, 2020

I am not clear on the following: you say "we could have undesirable consequences unless we carefully analyze the impact." but then you argue for the option at the end, can you clarify?

The score in 1.18 had a pretty weak signal that was only useful with a very small number of pods. We have drastically improved this in #90475, so it can only now be compared to other scores.

We had default spreading that was giving a similarly strong signal; we should compare with that.

My concern with modifying the NodeResourcesLeastAllocated is that we may break someone. Increasing the weight is something that users can revert, while option B can't be reverted.

@alculquicondor
Member

Sorry, I meant option A, improve PodTopologySpread

@ahg-g
Member Author

ahg-g commented May 19, 2020

One potential advantage to doubling the weight is that topology spread currently offers a signal comparable to default spreading, which we can think of as a baseline for spreading signal since it has been used for a while now. Doubling the weight will allow admins to revert back to the old behavior if they wish to do so.

Also, from my perspective, doubling the weight gives better placement at high utilization ([[2, 2, 6], [3, 2, 5], [4, 3, 3]] vs [[1, 1, 7], [3, 1, 7], [4, 3, 3]]).
