koord-scheduler: DeviceShare skips unhealthy device instances #1159

@eahydra eahydra commented Mar 29, 2023

Ⅰ. Describe what this PR does

  1. Resolve issue #1136 ([BUG] failed to allocate GPU on Reserve/PreBind phase) by skipping unhealthy devices during allocation.
  2. Merge the implementations of tryAllocateCommonDevice and tryAllocateGPU. The two functions share essentially the same logic; the only difference is that the GPU path must also fill in ResourceGPUMemory/ResourceGPUMemoryRatio.
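
For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the idea, not the code in this PR. It assumes a single allocation loop that skips unhealthy instances and accepts an optional hook for the GPU-only step. The type `device`, the functions `tryAllocate`, `fits`, and `fillGPUMemory`, the short resource names, and the rule that derives memory from the ratio (or the reverse) are all hypothetical and chosen only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Hypothetical short resource names; the real plugin uses its own constants
// (ResourceGPUMemory / ResourceGPUMemoryRatio).
const (
	gpuMemory      = "gpu-memory"
	gpuMemoryRatio = "gpu-memory-ratio"
)

// device is a simplified, hypothetical view of one device instance.
type device struct {
	Minor   int
	Healthy bool
	Total   map[string]int64
	Free    map[string]int64
}

// fillFunc lets a device type complete the granted resources before the fit check.
type fillFunc func(d device, granted map[string]int64)

var errNoDevice = errors.New("no healthy device satisfies the request")

// tryAllocate is one shared path for every device type: it walks the devices
// in minor order, skips unhealthy ones, and returns the first that fits.
func tryAllocate(devices []device, request map[string]int64, fill fillFunc) (int, map[string]int64, error) {
	candidates := append([]device(nil), devices...)
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].Minor < candidates[j].Minor })
	for _, d := range candidates {
		if !d.Healthy {
			continue // an unhealthy instance is never a candidate
		}
		granted := make(map[string]int64, len(request))
		for name, qty := range request {
			granted[name] = qty
		}
		if fill != nil {
			fill(d, granted) // GPU-only step; other device types pass nil
		}
		if !fits(d.Free, granted) {
			continue
		}
		return d.Minor, granted, nil
	}
	return -1, nil, errNoDevice
}

// fits reports whether every wanted quantity is available in free.
func fits(free, want map[string]int64) bool {
	for name, qty := range want {
		if free[name] < qty {
			return false
		}
	}
	return true
}

// fillGPUMemory derives whichever of memory / memory ratio is missing from
// the device's total memory (assumed rule, for illustration only).
func fillGPUMemory(d device, granted map[string]int64) {
	total := d.Total[gpuMemory]
	if total <= 0 {
		return
	}
	ratio, hasRatio := granted[gpuMemoryRatio]
	mem, hasMem := granted[gpuMemory]
	switch {
	case hasRatio && !hasMem:
		granted[gpuMemory] = total * ratio / 100
	case hasMem && !hasRatio:
		granted[gpuMemoryRatio] = mem * 100 / total
	}
}

func main() {
	full := map[string]int64{gpuMemory: 16000, gpuMemoryRatio: 100}
	devices := []device{
		{Minor: 0, Healthy: false, Total: full, Free: full},
		{Minor: 1, Healthy: true, Total: full, Free: full},
	}
	minor, granted, err := tryAllocate(devices, map[string]int64{gpuMemoryRatio: 50}, fillGPUMemory)
	fmt.Println(minor, granted[gpuMemory], err)
}
```

Passing `nil` as the hook gives the common-device behaviour, which is why a single function can serve both cases in this sketch.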

Ⅱ. Does this pull request fix one issue?

Fixes #1136

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

codecov bot commented Mar 30, 2023

Codecov Report

Patch coverage: 90.16% and project coverage change: -0.16 ⚠️

Comparison is base (bda1e74) 66.87% compared to head (dd836a4) 66.71%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1159      +/-   ##
==========================================
- Coverage   66.87%   66.71%   -0.16%     
==========================================
  Files         272      273       +1     
  Lines       29891    29878      -13     
==========================================
- Hits        19989    19933      -56     
- Misses       8478     8520      +42     
- Partials     1424     1425       +1     
| Flag | Coverage Δ |
|---|---|
| unittests | 66.71% <90.16%> (-0.16%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| pkg/scheduler/plugins/deviceshare/device_cache.go | 86.75% <87.50%> (-2.94%) ⬇️ |
| pkg/scheduler/plugins/deviceshare/utils.go | 91.30% <100.00%> (-0.90%) ⬇️ |

... and 3 files with indirect coverage changes


Signed-off-by: Joseph <joseph.t.lee@outlook.com>
@eahydra eahydra force-pushed the fix_deviceshare_reserve_prebind_error branch from 4a13a8b to dd836a4 on March 30, 2023 02:24
@jasonliu747 jasonliu747 left a comment

/lgtm

hormes commented Mar 30, 2023

/approve

@koordinator-bot
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hormes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@koordinator-bot koordinator-bot bot merged commit a18df74 into koordinator-sh:main Mar 30, 2023
9 checks passed
Development

Successfully merging this pull request may close these issues.

[BUG] failed to allocate GPU on Reserve/PreBind phase