Fix device plugin re-registration #63118

Merged


vikaschoudhary16
Contributor

@vikaschoudhary16 vikaschoudhary16 commented Apr 25, 2018

What this PR does / why we need it:
When a device plugin re-registers, the device manager copies all devices from the old endpoint for the same resource into the new endpoint, then stops the old endpoint and starts the new one.

Stopping the old endpoint and starting the new one are not synchronized. While stopping the old endpoint, the manager marks its devices (which have also been copied to the new endpoint) as "Unhealthy" in the manager database.

In endpoint.go, when the plugin reports its devices as healthy after the restart, the endpoint finds the same health state (healthy) already in its own database, sees no change, and therefore does not update the manager database.

The solution in this PR is to mark the devices as unhealthy before copying them to the new endpoint.
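
The sequence above can be illustrated with a small Go sketch. This is a toy model under stated assumptions, not the kubelet code: the `manager`, `endpoint`, `reRegister`, and `allocatableAfterRestart` names are hypothetical, the health strings are plain constants, and stop and start run in a fixed order instead of concurrently so the result is deterministic.

```go
package main

import "fmt"

const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a simplified stand-in for a plugin-reported device.
type Device struct {
	ID     string
	Health string
}

// manager is a toy model of the manager database that feeds allocatable.
type manager struct {
	healthy map[string]bool
}

// update applies health changes forwarded by an endpoint.
func (m *manager) update(changed []Device) {
	for _, d := range changed {
		if d.Health == Healthy {
			m.healthy[d.ID] = true
		} else {
			delete(m.healthy, d.ID)
		}
	}
}

// endpoint is a toy model of a per-plugin endpoint with its own device cache.
type endpoint struct {
	devices map[string]Device
	notify  func([]Device)
}

// report forwards only devices whose health differs from the cached value.
func (e *endpoint) report(devs []Device) {
	var changed []Device
	for _, d := range devs {
		if old, ok := e.devices[d.ID]; !ok || old.Health != d.Health {
			changed = append(changed, d)
		}
		e.devices[d.ID] = d
	}
	if len(changed) > 0 {
		e.notify(changed)
	}
}

// stop tells the manager that every device of this endpoint is unhealthy.
func (e *endpoint) stop() {
	var devs []Device
	for _, d := range e.devices {
		d.Health = Unhealthy
		devs = append(devs, d)
	}
	e.notify(devs)
}

// reRegister copies the old endpoint's devices into a new endpoint and stops
// the old one. With markUnhealthy set, the copies are marked unhealthy first.
func reRegister(m *manager, old *endpoint, markUnhealthy bool) *endpoint {
	copied := make(map[string]Device, len(old.devices))
	for id, d := range old.devices {
		if markUnhealthy {
			d.Health = Unhealthy
		}
		copied[id] = d
	}
	old.stop()
	return &endpoint{devices: copied, notify: m.update}
}

// allocatableAfterRestart simulates one plugin restart and returns how many
// devices the manager considers healthy afterwards.
func allocatableAfterRestart(markUnhealthy bool) int {
	m := &manager{healthy: map[string]bool{}}
	old := &endpoint{devices: map[string]Device{}, notify: m.update}
	devs := []Device{{"gpu0", Healthy}, {"gpu1", Healthy}}
	old.report(devs)
	e := reRegister(m, old, markUnhealthy)
	e.report(devs) // the restarted plugin reports its devices as healthy again
	return len(m.healthy)
}

func main() {
	fmt.Println("without fix:", allocatableAfterRestart(false)) // without fix: 0
	fmt.Println("with fix:", allocatableAfterRestart(true))     // with fix: 2
}
```

In this model the copied "Healthy" entries hide the transition from the new endpoint, so the manager never hears that the devices came back. Marking the copies unhealthy makes the first post-restart report count as a change.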

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #62773

Special notes for your reviewer:

Release note:

Fix an issue where `allocatable` was not updated when a device plugin re-registered. The issue makes devices invisible to the kubelet if the device plugin restarts. Without this fix, the only workaround is to restart the kubelet and then start the device plugin.

/cc @jiayingz @vishh @RenaudWasTaken @derekwaynecarr

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Apr 25, 2018
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 25, 2018
@jiayingz
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiayingz, vikaschoudhary16

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2018
@vikaschoudhary16 vikaschoudhary16 changed the title from "Fix race between stopping old and starting new endpoint" to "Fix device plugin re-registration" Apr 25, 2018
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 62951, 57460, 63118). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 046baee into kubernetes:master Apr 25, 2018
@vikaschoudhary16 vikaschoudhary16 deleted the start-stop-race branch April 25, 2018 09:10
@RenaudWasTaken
Contributor

Sorry I didn't have a lot of time to look at this issue in detail.

Looking a bit more at the code, it seems to me that the "device state" and the "API server state" are not really separated, which in this case leads to working on the "device state" to fix an "API server state" issue.

"device state" = devices stored in the endpoint
"API server state" = devices stored in the manager which will be fed to the API server in Capacity and Allocatable

What do you think about fixing the issue as follows:

  • When creating an endpoint use the devices from the "API server state" (devices will necessarily be in Unhealthy state if the endpoint was lost)
  • Move the "stop time" to the "API server state"
  • Make the endpoint's lifetime be as short as the gRPC connection
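
One possible reading of the first two points, as a hedged Go sketch: the manager owns the long-lived records (health and stop time), and a new endpoint is seeded from them. All names here (`deviceRecord`, `endpointLost`, `newEndpoint`, `demo`, the `example.com/gpu` resource) are hypothetical and do not come from the kubelet source.

```go
package main

import (
	"fmt"
	"time"
)

// deviceRecord is a hypothetical entry in the manager-owned "API server state".
type deviceRecord struct {
	Health    string
	StoppedAt time.Time // zero while the owning plugin is connected
}

// manager is a toy stand-in for the device manager. In this sketch it is the
// only owner of long-lived device data.
type manager struct {
	devices map[string]map[string]*deviceRecord // resource -> device ID -> record
}

// endpointLost records in the manager state that the plugin went away.
func (m *manager) endpointLost(resource string, now time.Time) {
	for _, rec := range m.devices[resource] {
		rec.Health = "Unhealthy"
		rec.StoppedAt = now
	}
}

// endpoint holds only the per-connection "device state".
type endpoint struct {
	devices map[string]string // device ID -> health
}

// newEndpoint seeds a fresh endpoint from the manager state, so devices of a
// lost endpoint start out unhealthy.
func (m *manager) newEndpoint(resource string) *endpoint {
	e := &endpoint{devices: map[string]string{}}
	for id, rec := range m.devices[resource] {
		e.devices[id] = rec.Health
	}
	return e
}

// demo loses an endpoint, creates a new one, and returns the seeded health of
// gpu0 plus whether a stop time was recorded in the manager state.
func demo() (string, bool) {
	m := &manager{devices: map[string]map[string]*deviceRecord{
		"example.com/gpu": {"gpu0": {Health: "Healthy"}},
	}}
	m.endpointLost("example.com/gpu", time.Now())
	e := m.newEndpoint("example.com/gpu")
	return e.devices["gpu0"], !m.devices["example.com/gpu"]["gpu0"].StoppedAt.IsZero()
}

func main() {
	health, stopped := demo()
	fmt.Println(health, stopped) // Unhealthy true
}
```

Because the new endpoint starts from "Unhealthy", the first healthy report from the restarted plugin registers as a change without any special-case copying.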

@vikaschoudhary16
Contributor Author

vikaschoudhary16 commented Apr 25, 2018

> Make the endpoint's lifetime be as short as the gRPC connection

Can you please explain this point more? I think I got this. You mean no need to store in "API server state". I just need to think through this a bit more. Thanks for the input. Will get back.

@RenaudWasTaken
Contributor

Instead of keeping the endpoint around when its gRPC connection is closed, you delete it.
The data needed later should be kept in, and retrieved from, the "API server state".
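
A rough Go sketch of that lifetime rule, assuming the connection is modeled as a channel that is closed when the gRPC stream ends. The `runEndpoint` and `demo` names, the map layout, and the resource name are illustrative assumptions, not the kubelet API.

```go
package main

import (
	"fmt"
	"sync"
)

// manager keeps the set of live endpoints and the longer-lived health data
// (the "API server state" in the discussion above).
type manager struct {
	mu        sync.Mutex
	endpoints map[string]struct{}          // one entry per connected plugin
	health    map[string]map[string]string // resource -> device ID -> health
}

// runEndpoint registers an endpoint, blocks until the connection is closed,
// then deletes the endpoint and marks its devices unhealthy in manager state.
func (m *manager) runEndpoint(resource string, connClosed <-chan struct{}) {
	m.mu.Lock()
	m.endpoints[resource] = struct{}{}
	m.mu.Unlock()

	<-connClosed // stands in for the gRPC stream ending

	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.endpoints, resource)
	for id := range m.health[resource] {
		m.health[resource][id] = "Unhealthy"
	}
}

// demo runs one endpoint, closes its connection, and reports whether the
// endpoint still exists and what health the manager holds for gpu0.
func demo() (bool, string) {
	m := &manager{
		endpoints: map[string]struct{}{},
		health:    map[string]map[string]string{"example.com/gpu": {"gpu0": "Healthy"}},
	}
	conn := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		m.runEndpoint("example.com/gpu", conn)
	}()
	close(conn)
	wg.Wait()
	_, alive := m.endpoints["example.com/gpu"]
	return alive, m.health["example.com/gpu"]["gpu0"]
}

func main() {
	alive, health := demo()
	fmt.Println(alive, health) // false Unhealthy
}
```

With this shape there is no stale endpoint cache to reconcile on re-registration, since nothing outlives the connection except the manager's own records.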

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 27, 2018
k8s-github-robot pushed a commit that referenced this pull request Apr 30, 2018
…-of-#63118-upstream-release-1.10-1524807783

Automatic merge from submit-queue.

cherry pick of #63118: Fix race between stopping old and starting new endpoint

**What this PR does / why we need it**:
Cherry pick of #63118 on release-1.10.
#63118: Fix device plugin re-registration

**Special notes for your reviewer**:

**Release note**:

```release-note
Fix issue where on re-registration of device plugin, `allocatable` was not getting updated. This issue makes devices invisible to the Kubelet if device plugin restarts. Only work-around, if this fix is not there, is to restart the kubelet and then start device plugin.
```
/cc @jiayingz
k8s-github-robot pushed a commit that referenced this pull request May 3, 2018
…-of-#63118-upstream-release-1.9-1524808696

Automatic merge from submit-queue.

Automated cherry pick of #63118: Fix race between stopping old and starting new endpoint

**What this PR does / why we need it**:
Cherry pick of #63118 on release-1.9
#63118: Fix device plugin re-registration

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
Fix issue where on re-registration of device plugin, `allocatable` was not getting updated. This issue makes devices invisible to the Kubelet if device plugin restarts. Only work-around, if this fix is not there, is to restart the kubelet and then start device plugin.
```
/cc @jiayingz
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

[deviceplugin] allocatable resource remains 0 after device-plugin restart
5 participants