
Scheduler sometimes preempts unnecessary pods #70622

Closed
Huang-Wei opened this issue Nov 4, 2018 · 7 comments · Fixed by #70898
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@Huang-Wei
Member

Huang-Wei commented Nov 4, 2018

What happened:

Sometimes the scheduler doesn't preempt pods along the exactly correct path. The good news is that the final state is accurate: pods that should be preempted are eventually preempted.

What you expected to happen:

The internal preemption process should be exactly correct as well, so that no unnecessary preemptions are produced.

How to reproduce it (as minimally and precisely as possible):

The following test is performed on an 8-core worker node. You might need to adjust the CPU requests/limits to reproduce.

Step 0: To start from a clean environment in which no pod occupies CPU, I edited all workloads to remove their CPU requests/limits, so the node reports 0 CPU usage:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests  Limits
  --------  --------  ------
  cpu       0 (0%)    0 (0%)
  memory    0 (0%)    0 (0%)
Step 1: Create 4 priority classes

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: p1
value: 1
globalDefault: false
description: "Priority p1 of value 1."
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: p2
value: 2
globalDefault: false
description: "Priority p2 of value 2."
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: p3
value: 3
globalDefault: false
description: "Priority p3 of value 3."
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: p4
value: 4
globalDefault: false

Step 2: Create priority{1,2,3}.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab1-1
spec:
  replicas: 5
  selector:
    matchLabels:
      app: pause1
  template:
    metadata:
      labels:
        app: pause1
    spec:
      priorityClassName: p1
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 400m
          limits:
            cpu: 400m
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab1-2
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pause2
  template:
    metadata:
      labels:
        app: pause2
    spec:
      priorityClassName: p2
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 500m
          limits:
            cpu: 500m
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab1-3
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pause3
  template:
    metadata:
      labels:
        app: pause3
    spec:
      priorityClassName: p3
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 950m
          limits:
            cpu: 950m

By now, deploy1, deploy2, and deploy3 occupy 7800m of CPU:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests     Limits
  --------  --------     ------
  cpu       7800m (97%)  7800m (97%)
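As a sanity check (not part of the original report), the 7800m figure follows directly from the replica counts and per-pod requests in the manifests above; a quick sketch in Python:

```python
# Sanity check: total CPU requested by the three deployments above.
# Replica counts and per-pod CPU requests (in millicores) are taken
# from the priority{1,2,3}.yaml manifests.
deployments = {
    "lab1-1": (5, 400),  # 5 replicas x 400m (priority p1)
    "lab1-2": (4, 500),  # 4 replicas x 500m (priority p2)
    "lab1-3": (4, 950),  # 4 replicas x 950m (priority p3)
}

total_m = sum(replicas * cpu for replicas, cpu in deployments.values())
print(total_m)  # 7800

# On an 8-core node (8000m allocatable), integer percent as kubectl reports:
print(total_m * 100 // 8000)  # 97
```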
Step 3: Create a high-priority deployment4 to see how preemption works

apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab1-4
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause4
  template:
    metadata:
      labels:
        app: pause4
    spec:
      priorityClassName: p4
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 4000m
          limits:
            cpu: 4000m

The expected result is that the pods in deploy1 and deploy2 become pending, while the pods in deploy3 are NOT touched. In the end, the pods in deploy3 and deploy4 are running.
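That expectation follows from the scheduler's preemption logic, which picks victims from the lowest priority upward and only as many as needed for the incoming pod to fit. A minimal Python sketch of that selection (a deliberate simplification for illustration, not the actual scheduler code):

```python
# Simplified sketch of preemption victim selection: evict the lowest-priority
# pods first, stopping as soon as the incoming pod fits on the node.
ALLOCATABLE_M = 8000                      # 8-core node, in millicores
incoming = {"priority": 4, "cpu": 4000}   # the p4 pod from deployment4

# Running pods as (priority, cpu request in millicores), per the manifests.
running = [(1, 400)] * 5 + [(2, 500)] * 4 + [(3, 950)] * 4

def pick_victims(running, incoming, allocatable):
    free = allocatable - sum(cpu for _, cpu in running)
    victims = []
    # Only lower-priority pods are candidates; try lowest priority first.
    for prio, cpu in sorted(p for p in running if p[0] < incoming["priority"]):
        if free >= incoming["cpu"]:
            break                         # incoming pod already fits
        victims.append((prio, cpu))
        free += cpu
    return victims

victims = pick_victims(running, incoming, ALLOCATABLE_M)
print(victims)
# All five p1 pods and all four p2 pods are victims; no p3 pod is touched.
```

With 200m free and 4000m requested, evicting the five p1 pods (2000m) and the four p2 pods (2000m) is exactly enough, so the p3 pods should never be considered — which is why preempting any of them is unnecessary.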

But it turns out that's not the case; see the detailed log.

Anything else we need to know?:

It's easy to reproduce in a multi-node environment (kubeadm), but not as easy to reproduce in a single-node environment (hack/local-up-cluster.sh).

Environment:

  • Kubernetes version (use kubectl version): v1.11.3, v1.12.1, and master branch
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu
  • Kernel (e.g. uname -a):
  • Install tools: kubeadm, or hack/local-up-cluster.sh
  • Others:

/kind bug
/sig scheduling

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Nov 4, 2018
@Huang-Wei
Member Author

70622.log

@Huang-Wei
Member Author

@bsalamat this is the issue I talked to you about earlier in Slack.

I can reproduce it on 1.11 every time. On 1.12 it's not as easy to reproduce, but I do see this issue as well. So I believe it's a bug that surfaces under rare (racy) conditions.

@bsalamat
Member

bsalamat commented Nov 6, 2018

I remember that you were able to reproduce this in your cluster brought up with kubeadm, but I was not able to reproduce it in a cluster brought up with "kube-up.sh". Have you tried bringing a cluster up with kube-up.sh to see if you can reproduce it?

@Huang-Wei
Copy link
Member Author

@bsalamat by "kube-up.sh" you mean "hack/local-up-cluster.sh"?

@bsalamat
Member

bsalamat commented Nov 6, 2018

No, I mean ./cluster/kube-up.sh.

@Huang-Wei
Member Author

Huang-Wei commented Nov 6, 2018

@bsalamat I don't have a paid GCE/AWS account lol, so I've never played with that. Right now I can easily reproduce it on v1.11.3 with hack/local-up-cluster.sh, and with kubeadm v1.11.x and kubeadm v1.12.x.

@bsalamat
Member

bsalamat commented Nov 6, 2018

Hmm, it's odd that the issue is not reproducible when the cluster is not created by kubeadm.

@Huang-Wei Huang-Wei changed the title Scheduler sometimes doesn't preempts pods in a correct path Scheduler sometimes preempts unnecessary pods Nov 12, 2018