
serviceSpreadPriority not properly scoring nodes after being mapped to podTopologySpread #98480

Closed
damemi opened this issue Jan 27, 2021 · 25 comments

Comments

@damemi
Contributor

damemi commented Jan 27, 2021

What happened:
SelectorSpreadPriority and ServiceSpreadPriority were mapped to PodTopologySpread in #95448. However, in some scheduling runs with serviceSpreadPriority enabled, the (newly mapped) topology spread plugin reports score: 0 for all nodes, leading to pods being scheduled on the same node instead of being spread.

I0125 15:53:53.585512       1 generic_scheduler.go:504] Plugin PodTopologySpread scores on kxxx/service-spreading-d7bt9 => [{kxxx25-vnvkw-worker-scvsw 0} {kxxx25-vnvkw-worker-w6wvw 0}]

What you expected to happen:
Nodes to be scored according to the appropriate selector spreading.

How to reproduce it (as minimally and precisely as possible):
Enable serviceSpreadPriority in the scheduler policy.cfg (and disable everything else)

$ cat policy.cfg
{
    "kind": "Policy",
    "apiVersion": "v1",
    "predicates": [],
    "priorities": [
        {"name": "ServiceSpreadingPriority", "weight": 1}
    ]
}

Create an RC and Service matching:

{
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        {
            "apiVersion": "v1",
            "kind": "ReplicationController",
            "metadata": {
                "labels": {
                    "name": "service-spreading"
                },
                "name": "service-spreading"
            },
            "spec": {
                "replicas": 5,
                "template": {
                    "metadata": {
                        "labels": {
                            "name": "service-spreading"
                        }
                    },
                    "spec": {
                        "containers": [
                            {
                                "image": "openshift/hello-openshift",
                                "name": "service-pod"
                            }
                        ]
                    }
                }
            }
        },
        {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "labels": {
                    "name": "service-spreading"
                },  
                "name": "service-spreading"
            },
            "spec": {
                "ports": [
                    {
                        "name": "http",
                        "port": 27017,
                        "protocol": "TCP",
                        "targetPort": 8080
                    }
                ],  
                "selector": {
                    "name": "service-spreading"
                }   
            }
        }
    ]
}

See scores in scheduler logs (with v=10):

I0125 15:53:53.585512       1 generic_scheduler.go:504] Plugin PodTopologySpread scores on kxxx/service-spreading-d7bt9 => [{kxxx25-vnvkw-worker-scvsw 0} {kxxx25-vnvkw-worker-w6wvw 0}]

Anything else we need to know?:
We understand that both the Policy API and serviceSpread are deprecated. If the simple solution is "don't use them", that is acceptable :) Just raising this bug to get more info.

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/sig scheduling

@damemi damemi added the kind/bug Categorizes issue or PR as related to a bug. label Jan 27, 2021
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 27, 2021
@k8s-ci-robot
Contributor

@damemi: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@damemi
Contributor Author

damemi commented Jan 27, 2021

/cc @alculquicondor @ingvagabund

@alculquicondor
Member

Can you show the output of the component config?
Nice thing that you added it :)

@ingvagabund
Contributor

I0121 12:33:56.933501       1 configfile.go:72] Using component config:
apiVersion: kubescheduler.config.k8s.io/v1beta1
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  kubeconfig: /etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig
  qps: 50
enableContentionProfiling: true
enableProfiling: true
healthzBindAddress: 0.0.0.0:10251
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  leaseDuration: 15s
  renewDeadline: 10s
  resourceLock: configmaps
  resourceName: kube-scheduler
  resourceNamespace: openshift-kube-scheduler
  retryPeriod: 2s
metricsBindAddress: 0.0.0.0:10251
parallelism: 16
percentageOfNodesToScore: 0
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      kind: DefaultPreemptionArgs
      minCandidateNodesAbsolute: 100
      minCandidateNodesPercentage: 10
    name: DefaultPreemption
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      kind: NodeResourcesLeastAllocatedArgs
      resources:
      - name: cpu
        weight: 1
      - name: memory
        weight: 1
    name: NodeResourcesLeastAllocated
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1beta1
      defaultingType: System
      kind: PodTopologySpreadArgs
    name: PodTopologySpread
  plugins:
    bind:
      enabled:
      - name: DefaultBinder
        weight: 0
    filter:
      enabled:
      - name: NodeUnschedulable
        weight: 0
      - name: TaintToleration
        weight: 0
    permit: {}
    postBind: {}
    postFilter:
      enabled:
      - name: DefaultPreemption
        weight: 0
    preBind: {}
    preFilter: {}
    preScore:
      enabled:
      - name: PodTopologySpread
        weight: 0
    queueSort:
      enabled:
      - name: PrioritySort
        weight: 0
    reserve: {}
    score:
      enabled:
      - name: NodeResourcesLeastAllocated
        weight: 2
      - name: PodTopologySpread
        weight: 10
  schedulerName: default-scheduler

@alculquicondor
Member

alculquicondor commented Jan 27, 2021

oh, I think I know why. Do your nodes have hostname AND zone label?

This might be related to the Note here: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#internal-default-constraints

Unfortunately, that means you cannot use the Policy API to configure default spreading. This is technically a regression if your nodes don't have zone labels.
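For reference, one quick way to check whether the nodes carry both topology labels (this is just a generic kubectl label-columns query, nothing specific to this cluster):

kubectl get nodes -L topology.kubernetes.io/zone,kubernetes.io/hostname

Nodes that show an empty value in the zone column would fall into exactly the case described in that note.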

@alculquicondor
Member

The workaround would be to add a zone label to all the nodes, for example with the command below.
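(Node name and zone value here are placeholders, not taken from this cluster:)

kubectl label node <node-name> topology.kubernetes.io/zone=<zone>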

But perhaps we can do something more in code (effectively lifting the note above). If defaultingType: System, ignore non-existing labels. @Huang-Wei WDYT?

@ingvagabund
Contributor

ingvagabund commented Jan 27, 2021

We have failure-domain.beta.kubernetes.io/zone, topology.kubernetes.io/zone and kubernetes.io/hostname label keys set. Any other labels missing?

@alculquicondor
Member

topology.kubernetes.io/zone and kubernetes.io/hostname are the correct ones.

Does this reproduce without the Policy API?

@damemi
Contributor Author

damemi commented Jan 27, 2021

Does this reproduce without the Policy API?

Is ServiceSpreadPriority available as a CC plugin? I thought the only way to set that was with the Policy API.

We haven't seen any problems with TopologySpreadConstraints itself

@alculquicondor
Member

TopologySpreadConstraints with defaultingType: System is the ServiceSpreadPriority
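In component config terms that is the pluginConfig entry already visible in the dump above; a trimmed sketch of the relevant part:

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: PodTopologySpread
    args:
      defaultingType: System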

@alculquicondor
Member

I'm trying to reproduce while introducing a few log lines in scoring. However, my docker installation is somehow busted, so I can't run kind.

Earlier I noticed a log line in kube-scheduler indicating that it didn't have permission to list ReplicationControllers, so maybe that's related. However, the Service should cover for that. Have you tried creating the Service ahead of time?

And again, also worth testing if default spreading works without Policy API (no configuration should be enough for that).

@Huang-Wei
Member

Huang-Wei commented Jan 27, 2021

Is ServiceSpreadPriority available as a CC plugin?

@damemi It's available as a plugin. And you have to enable it explicitly via CC (disabled by default), and also ensure the args.affinityLabels is not nil.

@Huang-Wei
Member

But perhaps we can do something more in code (effectively lifting the note above). If defaultingType: System, ignore non-existing labels. @Huang-Wei WDYT?

So the root cause is that the cluster has both legacy topology labels (failure-domain.beta.kubernetes.io/*) and the up-to-date ones?

@alculquicondor
Member

We haven't found the root cause. My guess was that the nodes didn't have the required labels, but that's not the case.

It's available as a plugin. And you have to enable it explicitly via CC (disabled by default), and also ensure the args.affinityLabels is not nil.

I think you are referring to the ServiceAffinity?

@alculquicondor
Member

@ingvagabund can you check the scheduler logs to see if it has any issues listing ReplicationControllers and Services?

@ingvagabund
Contributor

ingvagabund commented Jan 27, 2021

There are no such issues. However, checking the nodes in this particular case, I don't see any zone labels set :-| It looks like we don't set zone labels in every cloud provider installation. In this case the scheduling was carried out over vSphere zones, which do not have the zone label set. That explains why this case is not working as expected. @alculquicondor @Huang-Wei thanks for the pointers!!!

@alculquicondor
Member

Awesome!

We can still think about #98480 (comment), as this is a breaking change.

@Huang-Wei
Member

I think you are referring to the ServiceAffinity?

oops, indeed. Nvm :)

@Huang-Wei
Member

@alculquicondor If users specify SelectorSpreadPriority, internally we set the defaulting type of PodTopologySpread to System, and that pluginConfig will be valid for Filter as well. So does that mean that in this case, non-existing topology labels will even block regular pods from scheduling? (Is the reason Mike and Jan didn't run into this just because they disabled PodTopologySpread in Filter?)

@alculquicondor
Member

alculquicondor commented Jan 28, 2021

So does that mean in this case, non-existing topology labels will even blocking regular pods from scheduling?

No, because the System default doesn't have hard pod spreading.

@Huang-Wei
Member

No, because the System default doesn't have hard pod spreading.

Thanks, I see systemDefaultConstraints uses WhenUnsatisfiable=ScheduleAnyway.
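For anyone following along, a rough sketch of what those system defaults look like in the in-tree PodTopologySpread plugin (written from memory, so treat the exact MaxSkew values as approximate):

package podtopologyspread // illustrative sketch only

import v1 "k8s.io/api/core/v1"

// Built-in defaults applied when defaultingType is System. Both constraints are
// soft (ScheduleAnyway), so they only influence scoring and never filter nodes out.
var systemDefaultConstraints = []v1.TopologySpreadConstraint{
	{MaxSkew: 3, TopologyKey: "kubernetes.io/hostname", WhenUnsatisfiable: v1.ScheduleAnyway},
	{MaxSkew: 5, TopologyKey: "topology.kubernetes.io/zone", WhenUnsatisfiable: v1.ScheduleAnyway},
}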

@alculquicondor
Member

Given that this used to work when zone labels were not set (before moving to pod topology spread), my recommendation would be to do #98480 (comment) so that this is not a breaking change for some clusters.

@Huang-Wei wdyt? any takers?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2021
@alculquicondor
Member

/close
in favor of #102136

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close
in favor of #102136

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
