Fix device plugin re-registration #63118
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jiayingz, vikaschoudhary16
Automatic merge from submit-queue (batch tested with PRs 62951, 57460, 63118).
Sorry I didn't have a lot of time to look at this issue in detail. Looking a bit more at the code, it seems to me that the "device state" and the "API server state" are not really separated, which in this case leads to working on the "device state" to fix an "API server state" issue. "device state" = devices stored in the endpoint. What do you think if we fixed the issue as follows:
Can you please explain this point a bit more? I think I got this: you mean there is no need to store in the "API server state". I just need to think through this a bit more. Thanks for the input, will get back.
Instead of keeping the endpoint around when its gRPC connection is closed you delete it. |
…-of-#63118-upstream-release-1.10-1524807783

Automatic merge from submit-queue.

cherry pick of #63118: Fix race between stopping old and starting new endpoint

**What this PR does / why we need it**: Cherry pick of #63118 on release-1.10.

#63118: Fix device plugin re-registration

**Special notes for your reviewer**:

**Release note**:
```release-note
Fix issue where on re-registration of device plugin, `allocatable` was not getting updated. This issue makes devices invisible to the Kubelet if device plugin restarts. Only work-around, if this fix is not there, is to restart the kubelet and then start device plugin.
```

/cc @jiayingz
…-of-#63118-upstream-release-1.9-1524808696

Automatic merge from submit-queue.

Automated cherry pick of #63118: Fix race between stopping old and starting new endpoint

**What this PR does / why we need it**: Cherry pick of #63118 on release-1.9

#63118: Fix device plugin re-registration

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*: Fixes #

**Special notes for your reviewer**:

**Release note**:
```release-note
Fix issue where on re-registration of device plugin, `allocatable` was not getting updated. This issue makes devices invisible to the Kubelet if device plugin restarts. Only work-around, if this fix is not there, is to restart the kubelet and then start device plugin.
```

/cc @jiayingz
What this PR does / why we need it:
While registering a new endpoint, the device manager copies all the devices from the old endpoint for the same resource, then stops the old endpoint and starts the new one.
There is no synchronization between stopping the old endpoint and starting the new one. While stopping the old endpoint, the manager marks its devices (which have also been copied to the new endpoint) as "Unhealthy".
In endpoint.go, when the plugin reports its devices as healthy after a restart, the endpoint finds the same health state (healthy) already stored in its own database and therefore does not update the manager's database.
The solution in this PR is to mark the devices as unhealthy before copying them to the new endpoint.
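The sequence described above can be replayed in a small, deterministic model. This is only a sketch under simplifying assumptions, not the kubelet code: the `endpoint` and `manager` types, the `update`, `stop`, `reRegister` and `simulate` functions, and the `markFirst` flag are all hypothetical names introduced for illustration, and the real race is reduced here to one fixed ordering of events.

```go
package main

import "fmt"

const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// endpoint is a hypothetical stand-in for a per-plugin endpoint.
// It keeps its own copy of each device's health.
type endpoint struct {
	devices map[string]string // device ID -> health
}

// manager is a hypothetical stand-in for the device manager's own database.
type manager struct {
	healthy map[string]bool // device ID -> considered allocatable
}

// update applies a health report from the plugin. The manager is only
// notified for devices whose health differs from the endpoint's stored copy.
func (e *endpoint) update(m *manager, reported map[string]string) {
	for id, health := range reported {
		if old, ok := e.devices[id]; ok && old == health {
			continue // no change from the endpoint's point of view
		}
		e.devices[id] = health
		m.healthy[id] = health == Healthy
	}
}

// stop models stopping an endpoint: the manager marks its devices unhealthy.
func (e *endpoint) stop(m *manager) {
	for id := range e.devices {
		m.healthy[id] = false
	}
}

// reRegister copies devices from the old endpoint into a new one and stops
// the old one. With markFirst set (the approach described in the PR text),
// the copied devices are marked unhealthy before being stored.
func reRegister(m *manager, old *endpoint, markFirst bool) *endpoint {
	fresh := &endpoint{devices: map[string]string{}}
	for id, health := range old.devices {
		if markFirst {
			health = Unhealthy
		}
		fresh.devices[id] = health
	}
	old.stop(m)
	return fresh
}

// simulate runs register -> re-register -> healthy report and returns
// whether the manager ends up seeing the device as healthy.
func simulate(markFirst bool) bool {
	m := &manager{healthy: map[string]bool{}}
	old := &endpoint{devices: map[string]string{}}
	old.update(m, map[string]string{"dev0": Healthy})

	fresh := reRegister(m, old, markFirst)
	fresh.update(m, map[string]string{"dev0": Healthy})
	return m.healthy["dev0"]
}

func main() {
	fmt.Println("healthy in manager without fix:", simulate(false))
	fmt.Println("healthy in manager with fix:", simulate(true))
}
```

In this model, copying the devices as-is leaves the manager with an unhealthy device because the new endpoint sees no change in the healthy report. Marking the copies unhealthy first makes the same report register as a change, so the manager is updated.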
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #62773
Special notes for your reviewer:
Release note:
/cc @jiayingz @vishh @RenaudWasTaken @derekwaynecarr