
Scheduler sometimes preempts unnecessary pods #70622

Closed
Huang-Wei opened this issue Nov 4, 2018 · 7 comments · Fixed by #70898
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@Huang-Wei
Member

Huang-Wei commented Nov 4, 2018

What happened:

Sometimes the scheduler doesn't preempt pods along the exactly correct path. The good news is that the final state is accurate: pods that should be preempted are eventually preempted.

What you expected to happen:

The internal preemption process should be exactly correct as well, so that no unnecessary preemptions are produced.

How to reproduce it (as minimally and precisely as possible):

The following test is performed on an 8-core worker node. You might need to adjust the CPU requests/limits to reproduce.

Step 0: To start from a clean environment in which no pod occupies CPU, I edited all workloads to remove their CPU requests/limits, so the node reports 0 CPU usage:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests  Limits
  --------  --------  ------
  cpu       0 (0%)    0 (0%)
  memory    0 (0%)    0 (0%)
Step 1: Create 4 priority classes

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: p1
value: 1
globalDefault: false
description: "Priority p1 of value 1."
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: p2
value: 2
globalDefault: false
description: "Priority p2 of value 2."
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: p3
value: 3
globalDefault: false
description: "Priority p3 of value 3."
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: p4
value: 4
globalDefault: false

Step 2: Create priority{1,2,3}.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab1-1
spec:
  replicas: 5
  selector:
    matchLabels:
      app: pause1
  template:
    metadata:
      labels:
        app: pause1
    spec:
      priorityClassName: p1
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 400m
          limits:
            cpu: 400m
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab1-2
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pause2
  template:
    metadata:
      labels:
        app: pause2
    spec:
      priorityClassName: p2
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 500m
          limits:
            cpu: 500m
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab1-3
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pause3
  template:
    metadata:
      labels:
        app: pause3
    spec:
      priorityClassName: p3
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 950m
          limits:
            cpu: 950m

By now, deploy1, deploy2, and deploy3 occupy 7800m of CPU:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests     Limits
  --------  --------     ------
  cpu       7800m (97%)  7800m (97%)
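As a sanity check (not part of the original report), the 7800m figure follows directly from the replica counts and per-pod requests in the manifests above; a quick sketch in Python:

```python
# Sanity check: total CPU requested by the three deployments above.
# Replica counts and per-pod CPU requests (in millicores) are taken
# from the priority{1,2,3}.yaml manifests.
deployments = {
    "lab1-1": (5, 400),  # 5 replicas x 400m (priority p1)
    "lab1-2": (4, 500),  # 4 replicas x 500m (priority p2)
    "lab1-3": (4, 950),  # 4 replicas x 950m (priority p3)
}

total_m = sum(replicas * cpu for replicas, cpu in deployments.values())
print(total_m)  # 7800

# On an 8-core node (8000m allocatable), integer percent as kubectl reports:
print(total_m * 100 // 8000)  # 97
```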
Step 3: Create a high-priority deployment4 to see how preemption works

apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab1-4
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause4
  template:
    metadata:
      labels:
        app: pause4
    spec:
      priorityClassName: p4
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 4000m
          limits:
            cpu: 4000m

The expected result is that the pods in deploy1 and deploy2 become pending, while the pods in deploy3 are NOT touched. In the end, the pods in deploy3 and deploy4 are running.
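That expectation follows from the scheduler's preemption logic, which picks victims from the lowest priority upward and only as many as needed for the incoming pod to fit. A minimal Python sketch of that selection (a deliberate simplification for illustration, not the actual scheduler code):

```python
# Simplified sketch of preemption victim selection: evict the lowest-priority
# pods first, stopping as soon as the incoming pod fits on the node.
ALLOCATABLE_M = 8000                      # 8-core node, in millicores
incoming = {"priority": 4, "cpu": 4000}   # the p4 pod from deployment4

# Running pods as (priority, cpu request in millicores), per the manifests.
running = [(1, 400)] * 5 + [(2, 500)] * 4 + [(3, 950)] * 4

def pick_victims(running, incoming, allocatable):
    free = allocatable - sum(cpu for _, cpu in running)
    victims = []
    # Only lower-priority pods are candidates; try lowest priority first.
    for prio, cpu in sorted(p for p in running if p[0] < incoming["priority"]):
        if free >= incoming["cpu"]:
            break                         # incoming pod already fits
        victims.append((prio, cpu))
        free += cpu
    return victims

victims = pick_victims(running, incoming, ALLOCATABLE_M)
print(victims)
# All five p1 pods and all four p2 pods are victims; no p3 pod is touched.
```

With 200m free and 4000m requested, evicting the five p1 pods (2000m) and the four p2 pods (2000m) is exactly enough, so the p3 pods should never be considered — which is why preempting any of them is unnecessary.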

But it turns out that's not the case; see the detailed log.

Anything else we need to know?:

It's easy to reproduce in a multi-node environment (kubeadm), but not as easy to reproduce in a single-node environment (hack/local-up-cluster.sh).

Environment:

  • Kubernetes version (use kubectl version): v1.11.3, v1.12.1, and master branch
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu
  • Kernel (e.g. uname -a):
  • Install tools: kubeadm, or hack/local-up-cluster.sh
  • Others:

/kind bug
/sig scheduling

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Nov 4, 2018
@Huang-Wei
Member Author

70622.log

@Huang-Wei
Member Author

@bsalamat this is the issue I talked to you about earlier in Slack.

I can reproduce it on 1.11 every time. On 1.12 it's not as easy to reproduce, but I do see this issue as well. So I believe it's a bug that surfaces under rare (racy) conditions.

@bsalamat
Member

bsalamat commented Nov 6, 2018

I remember that you were able to reproduce this in your cluster brought up with kubeadm, but I was not able to reproduce it in a cluster brought up with "kube-up.sh". Have you tried bringing a cluster up with kube-up.sh to see if you can reproduce it?

@Huang-Wei
Copy link
Member Author

@bsalamat by "kube-up.sh" you mean "hack/local-up-cluster.sh"?

@bsalamat
Member

bsalamat commented Nov 6, 2018

No, I mean ./cluster/kube-up.sh.

@Huang-Wei
Member Author

Huang-Wei commented Nov 6, 2018

@bsalamat I don't have a paid GCE/AWS account lol, so I've never played with that. Right now I can easily reproduce it on v1.11.3 with hack/local-up-cluster.sh, and with kubeadm v1.11.x and kubeadm v1.12.x.

@bsalamat
Member

bsalamat commented Nov 6, 2018

Hmm, it's odd that the issue is not reproducible when the cluster is not created by kubeadm.

@Huang-Wei Huang-Wei changed the title Scheduler sometimes doesn't preempts pods in a correct path Scheduler sometimes preempts unnecessary pods Nov 12, 2018