
serviceSpreadPriority not properly scoring nodes after being mapped to podTopologySpread #98480

Closed
damemi opened this issue Jan 27, 2021 · 25 comments

Comments

@damemi
Contributor

damemi commented Jan 27, 2021

What happened:
SelectorSpreadPriority and ServiceSpreadPriority were mapped to PodTopologySpread in #95448. However, in some scheduling runs with serviceSpreadPriority enabled, the (newly mapped) topology spread plugin reports score: 0 for all nodes, leading to pods being scheduled on the same node instead of being spread.

I0125 15:53:53.585512       1 generic_scheduler.go:504] Plugin PodTopologySpread scores on kxxx/service-spreading-d7bt9 => [{kxxx25-vnvkw-worker-scvsw 0} {kxxx25-vnvkw-worker-w6wvw 0}]

What you expected to happen:
Nodes to be scored according to the appropriate selector spreading.

How to reproduce it (as minimally and precisely as possible):
Enable serviceSpreadPriority in the scheduler policy.cfg (and disable everything else)

$ cat policy.cfg
{
    "kind": "Policy",
    "apiVersion": "v1",
    "predicates": [],
    "priorities": [
        {"name": "ServiceSpreadingPriority", "weight": 1}
    ]
}

Create an RC and Service matching:

{
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        {
            "apiVersion": "v1",
            "kind": "ReplicationController",
            "metadata": {
                "labels": {
                    "name": "service-spreading"
                },
                "name": "service-spreading"
            },
            "spec": {
                "replicas": 5,
                "template": {
                    "metadata": {
                        "labels": {
                            "name": "service-spreading"
                        }
                    },
                    "spec": {
                        "containers": [
                            {
                                "image": "openshift/hello-openshift",
                                "name": "service-pod"
                            }
                        ]
                    }
                }
            }
        },
        {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "labels": {
                    "name": "service-spreading"
                },  
                "name": "service-spreading"
            },
            "spec": {
                "ports": [
                    {
                        "name": "http",
                        "port": 27017,
                        "protocol": "TCP",
                        "targetPort": 8080
                    }
                ],  
                "selector": {
                    "name": "service-spreading"
                }   
            }
        }
    ]
}

See scores in scheduler logs (with v=10):

I0125 15:53:53.585512       1 generic_scheduler.go:504] Plugin PodTopologySpread scores on kxxx/service-spreading-d7bt9 => [{kxxx25-vnvkw-worker-scvsw 0} {kxxx25-vnvkw-worker-w6wvw 0}]

Anything else we need to know?:
We understand that both the Policy API and serviceSpread are deprecated. If the simple solution is "don't use them", that is acceptable :) Just raising this bug to get more info.

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/sig scheduling

@damemi damemi added the kind/bug Categorizes issue or PR as related to a bug. label Jan 27, 2021
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 27, 2021
@k8s-ci-robot
Contributor

@damemi: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@damemi
Contributor Author

damemi commented Jan 27, 2021

/cc @alculquicondor @ingvagabund

@alculquicondor
Member

Can you show the output of the component config?
Nice thing that you added it :)

@ingvagabund
Contributor

I0121 12:33:56.933501       1 configfile.go:72] Using component config:
apiVersion: kubescheduler.config.k8s.io/v1beta1
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig
  qps: 50
enableContentionProfiling: true
enableProfiling: true
healthzBindAddress: 0.0.0.0:10251
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  leaseDuration: 15s
  renewDeadline: 10s
  resourceLock: configmaps
  resourceName: kube-scheduler
  resourceNamespace: openshift-kube-scheduler
  retryPeriod: 2s
metricsBindAddress: 0.0.0.0:10251
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      kind: DefaultPreemptionArgs
      minCandidateNodesAbsolute: 100
      minCandidateNodesPercentage: 10
    name: DefaultPreemption
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      kind: NodeResourcesLeastAllocatedArgs
      resources:
      - name: cpu
        weight: 1
      - name: memory
        weight: 1
    name: NodeResourcesLeastAllocated
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      defaultingType: System
      kind: PodTopologySpreadArgs
    name: PodTopologySpread
  plugins:
    bind:
      enabled:
      - name: DefaultBinder
        weight: 0
    filter:
      enabled:
      - name: NodeUnschedulable
        weight: 0
      - name: TaintToleration
        weight: 0
    permit: {}
    postBind: {}
    postFilter:
      enabled:
      - name: DefaultPreemption
        weight: 0
    preBind: {}
    preFilter: {}
    preScore:
      enabled:
      - name: PodTopologySpread
        weight: 0
    queueSort:
      enabled:
      - name: PrioritySort
        weight: 0
    reserve: {}
    score:
      enabled:
      - name: NodeResourcesLeastAllocated
        weight: 2
      - name: PodTopologySpread
        weight: 10
  schedulerName: default-scheduler

@alculquicondor
Member

alculquicondor commented Jan 27, 2021

oh, I think I know why. Do your nodes have hostname AND zone label?

This might be related to the Note here: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#internal-default-constraints

Unfortunately, that means you cannot use the Policy API to configure default spreading. This is technically a regression if your nodes don't have zone labels.
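For reference, one quick way to check whether the nodes carry both topology labels (this is just a generic kubectl label-columns query, nothing specific to this cluster):

kubectl get nodes -L topology.kubernetes.io/zone,kubernetes.io/hostname

Nodes that show an empty value in the zone column would fall into exactly the case described in that note.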

@alculquicondor
Member

The workaround would be to add a zone label to all the nodes, for example with the command below.
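(Node name and zone value here are placeholders, not taken from this cluster:)

kubectl label node <node-name> topology.kubernetes.io/zone=<zone>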

But perhaps we can do something more in code (effectively lifting the note above). If defaultingType: System, ignore non-existing labels. @Huang-Wei WDYT?

@ingvagabund
Contributor

ingvagabund commented Jan 27, 2021

We have failure-domain.beta.kubernetes.io/zone, topology.kubernetes.io/zone and kubernetes.io/hostname label keys set. Any other labels missing?

@alculquicondor
Member

topology.kubernetes.io/zone and kubernetes.io/hostname are the correct ones.

Does this reproduce without the Policy API?

@damemi
Contributor Author

damemi commented Jan 27, 2021

Does this reproduce without the Policy API?

Is ServiceSpreadPriority available as a CC plugin? I thought the only way to set that was with the Policy API.

We haven't seen any problems with TopologySpreadConstraints itself

@alculquicondor
Member

TopologySpreadConstraints with defaultingType: System is the ServiceSpreadPriority
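In component config terms that is the pluginConfig entry already visible in the dump above; a trimmed sketch of the relevant part:

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: PodTopologySpread
    args:
      defaultingType: System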

@alculquicondor
Member

I'm trying to reproduce while introducing a few log lines in scoring. However, my docker installation is somehow busted, so I can't run kind.

Earlier I noticed a log line in kube-scheduler indicating that it didn't have permission to list ReplicationControllers, so maybe that's related. However, the Service should cover for that. Have you tried creating the Service ahead of time?

And again, also worth testing if default spreading works without Policy API (no configuration should be enough for that).

@Huang-Wei
Member

Huang-Wei commented Jan 27, 2021

Is ServiceSpreadPriority available as a CC plugin?

@damemi It's available as a plugin. And you have to enable it explicitly via CC (disabled by default), and also ensure the args.affinityLabels is not nil.

@Huang-Wei
Member

But perhaps we can do something more in code (effectively lifting the note above). If defaultingType: System, ignore non-existing labels. @Huang-Wei WDYT?

So the root cause is that the cluster has both legacy topology labels (failure-domain.beta.kubernetes.io/*) and the up-to-date ones?

@alculquicondor
Member

We haven't found the root cause. My guess was that the nodes didn't have the required labels, but that's not the case.

It's available as a plugin. And you have to enable it explicitly via CC (disabled by default), and also ensure the args.affinityLabels is not nil.

I think you are referring to the ServiceAffinity?

@alculquicondor
Member

@ingvagabund can you check the scheduler logs to see if it has any issues listing ReplicationControllers and Services?

@ingvagabund
Contributor

ingvagabund commented Jan 27, 2021

There are no such issues. However, checking the nodes in this particular case, I don't see any zone labels set :-| It looks like we don't set zone labels in every cloud provider installation. In this case the scheduling was carried out over vSphere zones, which do not have the zone label set. That explains why this case is not working as expected. @alculquicondor @Huang-Wei thanks for the pointers!!!

@alculquicondor
Member

Awesome!

We can still think about #98480 (comment), as this is a breaking change.

@Huang-Wei
Member

I think you are referring to the ServiceAffinity?

oops, indeed. Nvm :)

@Huang-Wei
Member

@alculquicondor If users specify SelectorSpreadPriority, internally we set the defaulting type of PodTopologySpread to System, and that pluginConfig will be valid for Filter as well. So does that mean that in this case, non-existing topology labels will even block regular pods from scheduling? (Is the reason Mike and Jan didn't run into this just because they disabled PodTopologySpread in Filter?)

@alculquicondor
Member

alculquicondor commented Jan 28, 2021

So does that mean in this case, non-existing topology labels will even blocking regular pods from scheduling?

No, because the System default doesn't have hard pod spreading.

@Huang-Wei
Member

No, because the System default doesn't have hard pod spreading.

Thanks, I see systemDefaultConstraints uses WhenUnsatisfiable=ScheduleAnyway.
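For anyone following along, a rough sketch of what those system defaults look like in the in-tree PodTopologySpread plugin (written from memory, so treat the exact MaxSkew values as approximate):

package podtopologyspread // illustrative sketch only

import v1 "k8s.io/api/core/v1"

// Built-in defaults applied when defaultingType is System. Both constraints are
// soft (ScheduleAnyway), so they only influence scoring and never filter nodes out.
var systemDefaultConstraints = []v1.TopologySpreadConstraint{
	{MaxSkew: 3, TopologyKey: "kubernetes.io/hostname", WhenUnsatisfiable: v1.ScheduleAnyway},
	{MaxSkew: 5, TopologyKey: "topology.kubernetes.io/zone", WhenUnsatisfiable: v1.ScheduleAnyway},
}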

@alculquicondor
Member

Given that this used to work when zone labels were not set (before moving to pod topology spread), my recommendation would be to do #98480 (comment) so that this is not a breaking change for some clusters.

@Huang-Wei wdyt? any takers?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2021
@alculquicondor
Member

/close
in favor of #102136

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close
in favor of #102136

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
