
"scheduler: non-compatible change in default topology spread constraints" #102136

Closed
gaorong opened this issue May 19, 2021 · 24 comments · Fixed by #105046
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@gaorong
Contributor

gaorong commented May 19, 2021

What happened:

Since v1.19, the scheduler uses the PodTopologySpread plugin for default spreading with the following constraints, and the original SelectorSpread plugin is disabled by default.

- whenUnsatisfiable: ScheduleAnyway
  topologyKey: kubernetes.io/hostname
  maxSkew: 3
- whenUnsatisfiable: ScheduleAnyway
  topologyKey: topology.kubernetes.io/zone
  maxSkew: 5
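
For reference, in the scheduler source these system defaults are declared as a plain slice of v1.TopologySpreadConstraint. The following is a sketch, not a verbatim copy of the tree:

import v1 "k8s.io/api/core/v1"

// Sketch of the system default constraints applied when DefaultingType is System.
var systemDefaultConstraints = []v1.TopologySpreadConstraint{
	{
		TopologyKey:       v1.LabelHostname, // kubernetes.io/hostname
		WhenUnsatisfiable: v1.ScheduleAnyway,
		MaxSkew:           3,
	},
	{
		TopologyKey:       v1.LabelTopologyZone, // topology.kubernetes.io/zone
		WhenUnsatisfiable: v1.ScheduleAnyway,
		MaxSkew:           5,
	},
}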

These default constraints assume nodes have a topology.kubernetes.io/zone label. The algorithm ignores nodes that don't have all the required topology keys present when scoring:

if !nodeLabelsMatchSpreadConstraints(node.Labels, s.Constraints) {
	// Nodes which don't have all required topologyKeys present are ignored
	// when scoring later.
	s.IgnoredNodes.Insert(node.Name)
	continue
}
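
For context, the nodeLabelsMatchSpreadConstraints helper just requires every constraint's topologyKey to be present as a node label; roughly (paraphrased, not copied verbatim):

// constraints is the plugin's internal constraint slice (topologyKey, maxSkew, ...).
func nodeLabelsMatchSpreadConstraints(nodeLabels map[string]string, constraints []topologySpreadConstraint) bool {
	for _, c := range constraints {
		if _, ok := nodeLabels[c.TopologyKey]; !ok {
			// A single missing topology key is enough to exclude the node.
			return false
		}
	}
	return true
}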

However, in many on-premise clusters these labels are not added by default and are not guaranteed to exist, so PodTopologySpread does not take effect at all.

Worse, if we enable the NodeResourcesMostAllocated plugin and PodTopologySpread does not take effect as described above, NodeResourcesMostAllocated dominates scheduling and is likely to bind all pods belonging to a ReplicaSet to the same node, which leads to low availability and is not acceptable to our users.

In the SelectorSpread plugin, we have already taken care of this case:

// If there is zone information present, incorporate it
if haveZones {

After switching to PodTopologySpread, the zone is no longer handled this way, so the behavior is not compatible with the original.
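
To illustrate the old behavior, here is a simplified sketch of how SelectorSpread blended the two signals, folding zone spreading into the final score only when zone labels were present (zoneWeighting is roughly 2/3 in the original; the names below are illustrative):

const zoneWeighting = 2.0 / 3.0

// finalNodeScore blends per-node and per-zone spreading scores.
// Without zone information it falls back to pure per-node spreading,
// so clusters without zone labels still get meaningful spreading.
func finalNodeScore(nodeScore, zoneScore float64, haveZones bool) float64 {
	if !haveZones {
		return nodeScore
	}
	// If there is zone information present, incorporate it.
	return (1-zoneWeighting)*nodeScore + zoneWeighting*zoneScore
}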

What you expected to happen:

The default PodTopologySpread plugin should handle the case in which nodes do not have zone labels and still spread pods properly.

How to reproduce it (as minimally and precisely as possible):

Create a Deployment in a cluster whose nodes do not have zone labels and observe the default spreading.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    master branch

/assign
/sig scheduling

@gaorong gaorong added the kind/bug Categorizes issue or PR as related to a bug. label May 19, 2021
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 19, 2021
@gaorong
Contributor Author

gaorong commented May 19, 2021

A potential fix would look like this: in the default-constraints case, we do not ignore nodes in the PreScore phase, so nodes with no topology labels are grouped into an empty-value topology and pods are counted per topology as usual. In the later Score phase, those empty-value topologies do not contribute to the final score because of the label check below, so the result favours nodes that have all required topology keys present and penalizes those that don't.

for i, c := range s.Constraints {
	if tpVal, ok := node.Labels[c.TopologyKey]; ok {
		var cnt int64
		if c.TopologyKey == v1.LabelHostname {
			cnt = int64(countPodsMatchSelector(nodeInfo.Pods, c.Selector, pod.Namespace))
		} else {
			pair := topologyPair{key: c.TopologyKey, value: tpVal}
			cnt = *s.TopologyPairToPodCounts[pair]
		}
		score += scoreForCount(cnt, c.MaxSkew, s.TopologyNormalizingWeight[i])
	}
} 
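
As a rough illustration of this proposal (not the change that was eventually merged), the PreScore check could be relaxed only when the constraints came from system defaulting; pl.systemDefaulted below is a hypothetical flag:

if !nodeLabelsMatchSpreadConstraints(node.Labels, s.Constraints) {
	if !pl.systemDefaulted { // hypothetical: true only for the built-in system defaults
		// Administrator-provided constraints keep the old behavior.
		s.IgnoredNodes.Insert(node.Name)
		continue
	}
	// Otherwise fall through: missing topology keys resolve to an
	// empty-value topology, so per-topology pod counts still accumulate,
	// and the Score-phase label check above skips the missing key.
}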

Does this fix make sense? Feel free to leave comments.

@ahg-g
Member

ahg-g commented May 19, 2021

There are two options:

  1. Add topology.kubernetes.io/zone label to the nodes. This is reasonable since this label (along with topology.kubernetes.io/region) is now standard: https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/1659-standard-topology-labels

  2. Update the default pod spreading constraints in the scheduler ComponentConfig to only use the hostname constraint.

@alculquicondor we faced/discussed this issue before, what was the conclusion?

@alculquicondor
Member

My suggestion was to special-case the system defaulting to still provide some spreading if the zone label doesn't exist: #98480 (comment)

Feel free to implement that suggestion. This should be eligible for backporting.

@alculquicondor
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 19, 2021
@alculquicondor
Member

cc @Huang-Wei who was in the previous thread.

@alculquicondor
Member

Btw, in the meantime, you can disable the feature gate DefaultPodTopologySpread

@Huang-Wei
Member

@alculquicondor's suggestion looks good, but where can we achieve that? Not during init time, as defaultConstraints is instantiated without knowing the underlying node labels:

if args.DefaultingType == config.SystemDefaulting {
	pl.defaultConstraints = systemDefaultConstraints
}

If at runtime, that implies we have to do it in every scoring cycle and iterate over the filtered nodes to know whether the zone label is present.

@Huang-Wei
Member

Does this fix make sense? Feel free to leave comments.

This may work except for the unnecessary calculation on the "artificial" empty-value topology.

@alculquicondor
Member

Yes, it should be at runtime. We already iterate over all the nodes in PreScore, so I don't see why it wouldn't work.

Does this fix make sense?

Something like that, but it should be done just for the System defaulting mode. If the cluster administrator sets another default, we assume they know what they are doing.
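
One way to express that runtime decision, sketched under the assumptions discussed here (per-pod constraints keep the strict check, and a hypothetical systemDefaulted field records that the defaults came from SystemDefaulting):

// Decide once per scheduling cycle whether a missing topology label
// should exclude a node from the spreading calculation.
requireAllTopologies := len(pod.Spec.TopologySpreadConstraints) > 0 || !pl.systemDefaulted
for _, n := range filteredNodes {
	node := n.Node()
	if requireAllTopologies && !nodeLabelsMatchSpreadConstraints(node.Labels, s.Constraints) {
		s.IgnoredNodes.Insert(node.Name)
		continue
	}
	// ...count pods per topology pair as before...
}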

@ahg-g
Member

ahg-g commented May 19, 2021

This isn't pretty, but I guess we have to do it because it was a breaking change.

@ahg-g
Member

ahg-g commented May 25, 2021

/retitle "scheduler: non-compatible change in default topology spread constraints"

@k8s-ci-robot k8s-ci-robot changed the title scheduler: non-compatilbel change in default topology spread constraints "scheduler: non-compatible change in default topology spread constraints" May 25, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2021
@alculquicondor
Member

We probably need to solve this before graduating to GA: kubernetes/enhancements#1258

@gaorong I noticed you still have a PR open. Are you still working on it?

@alculquicondor
Member

/unassign @gaorong
as they seem inactive

I can take it over
/assign

@alculquicondor
Member

I'm not going to backport this, but I will support a cherry pick if anybody needs it. 1.20 would be the oldest that can be fixed.

@alculquicondor
Member

cc @damemi

@ahg-g
Member

ahg-g commented Nov 19, 2021

/open

for backporting

@alculquicondor
Member

/reopen

@k8s-ci-robot
Contributor

@alculquicondor: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 22, 2021
@ahg-g
Member

ahg-g commented Nov 22, 2021

we need to backport this

@alculquicondor
Member

Opened: #106604, #106605, #106607

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 22, 2021
@ahg-g
Member

ahg-g commented Dec 23, 2021

/close

@k8s-ci-robot
Contributor

@ahg-g: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
