
koord-scheduler: DeviceShare supports preempting devices #1146

Merged

Conversation

@eahydra (Member) commented Mar 27, 2023

Ⅰ. Describe what this PR does

Enhanced the DeviceShare scheduling plugin:

  • support preempting devices
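
For context, the DeviceShare plugin allocates GPUs from a per-node Device custom resource reported by the node agent. Assuming the Device CRD from koordinator is installed (the Device object shares its node's name), the scheduler's view of a node's GPUs can be inspected with something like:

$ kubectl get devices.scheduling.koordinator.sh cn-beijing.10.0.3.245 -o yaml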

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

  1. Create a Pod that requests all GPUs on a specified node. In the scenario I tested, the node has two GPU instances, so koordinator.sh/gpu: "200" claims both of them.
$ cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test-reserve-gpu
  name: test-gpu-deploy
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: test-reserve-gpu
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test-reserve-gpu
        koordinator.sh/qosClass: LS
    spec:
      containers:
      - args:
        - "3600"
        command:
        - sleep
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
        name: test
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
            koordinator.sh/gpu: "200"
      schedulerName: koord-scheduler
EOF
deployment.apps/test-gpu-deploy created
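
Optionally, confirm up front that the node really exposes two GPUs' worth of capacity. The exact resource names depend on the device-reporting setup, so simply dumping the node's allocatable resources is the safest check:

$ kubectl get node cn-beijing.10.0.3.245 -o jsonpath='{.status.allocatable}'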
  2. Check the Pod status
$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP         NODE                    NOMINATED NODE   READINESS GATES
test-gpu-deploy-849874876f-cfvrr   1/1     Running   0          10s   10.0.3.7   cn-beijing.10.0.3.245   <none>           <none>
  3. Create a Pod with higher priority than the pre-created Pod
$ cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: high-test-reserve-gpu
  name: high-test-gpu-deploy
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: high-test-reserve-gpu
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: high-test-reserve-gpu
        koordinator.sh/qosClass: LS
    spec:
      containers:
      - args:
        - "3600"
        command:
        - sleep
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
        name: test
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
            koordinator.sh/gpu: "100"
      priorityClassName: system-cluster-critical
      schedulerName: koord-scheduler
EOF
deployment.apps/high-test-gpu-deploy created
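
Note that this example reuses the built-in system-cluster-critical PriorityClass only for convenience; any PriorityClass whose value is higher than the victim Pod's would trigger the same preemption. A hypothetical custom class would look like:

$ cat << EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000
globalDefault: false
description: "For Pods that may preempt lower-priority GPU consumers."
EOF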
  4. Watch the Pod status
$ kubectl get pod -o wide
NAME                                   READY   STATUS        RESTARTS   AGE   IP         NODE                    NOMINATED NODE          READINESS GATES
high-test-gpu-deploy-dcd465c44-mq9bv   0/1     Pending       0          4s    <none>     <none>                  cn-beijing.10.0.3.245   <none>
test-gpu-deploy-849874876f-8fpsk       0/1     Pending       0          4s    <none>     <none>                  <none>                  <none>
test-gpu-deploy-849874876f-cfvrr       1/1     Terminating   0          21s   10.0.3.4   cn-beijing.10.0.3.245   <none>                  <none>
  5. Get the events of the preempted Pod
$ kubectl get event | grep cfvrr
38s         Normal    Preempted           pod/test-gpu-deploy-849874876f-cfvrr        Preempted by default/high-test-gpu-deploy-dcd465c44-mq9bv on node cn-beijing.10.0.3.245

As the result shows, Pod test-gpu-deploy-849874876f-cfvrr was preempted by high-test-gpu-deploy-dcd465c44-mq9bv. Afterwards, the high-priority Pod is Running while the replacement low-priority Pod stays Pending because no GPUs are left:

$ kubectl get pod 
NAME                                   READY   STATUS    RESTARTS   AGE
high-test-gpu-deploy-dcd465c44-mq9bv   1/1     Running   0          6m39s
test-gpu-deploy-849874876f-8fpsk       0/1     Pending   0          6m38s
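
To double-check that the preemptor actually received the freed devices, inspect the allocation that koord-scheduler records on the Pod. Assuming the allocation is stored in the scheduling.koordinator.sh/device-allocated annotation (the exact key may vary by version), dumping the annotations is enough:

$ kubectl get pod high-test-gpu-deploy-dcd465c44-mq9bv -o jsonpath='{.metadata.annotations}'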

Ⅳ. Special notes for reviews

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

Signed-off-by: Joseph <joseph.t.lee@outlook.com>
@koordinator-bot koordinator-bot bot requested a review from buptcozy March 27, 2023 11:30
@eahydra eahydra changed the title koord-scheduler: DeviceShare support preempting devices koord-scheduler: DeviceShare supports preempting devices Mar 27, 2023
@eahydra eahydra added this to the v1.2 milestone Mar 27, 2023
@eahydra eahydra added the enhancement New feature or request label Mar 27, 2023
@codecov bot commented Mar 27, 2023

Codecov Report

Patch coverage: 66.66% and no project coverage change.

Comparison is base (b7d7a45) 66.77% compared to head (e7bc64a) 66.77%.

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #1146    +/-   ##
========================================
  Coverage   66.77%   66.77%            
========================================
  Files         271      271            
  Lines       29603    29751   +148     
========================================
+ Hits        19766    19865    +99     
- Misses       8425     8464    +39     
- Partials     1412     1422    +10     
Flag Coverage Δ
unittests 66.77% <66.66%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/scheduler/plugins/deviceshare/plugin.go 71.42% <52.04%> (-16.37%) ⬇️
pkg/scheduler/plugins/deviceshare/utils.go 92.19% <57.89%> (-7.81%) ⬇️
pkg/scheduler/plugins/deviceshare/device_cache.go 89.68% <96.49%> (+3.45%) ⬆️
pkg/scheduler/plugins/deviceshare/allocator.go 87.50% <100.00%> (ø)

... and 2 files with indirect coverage changes


@jasonliu747 (Member) left a comment:

/lgtm

@hormes (Member) commented Mar 29, 2023

/approve

@koordinator-bot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hormes, jasonliu747

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@koordinator-bot koordinator-bot bot merged commit bda1e74 into koordinator-sh:main Mar 29, 2023
9 checks passed