[1.25] node: device-mgr: Fix recovery flow by ensuring healthy devices exist and pre-allocated devices are healthy #117738

swatisehgal · 2023-05-03T00:00:53Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Cherry-pick of the following PR (/commits) on release-1.25:

92e0020 of e2e: node remove: kubevirt device plugin #115926: This commit contains sample device plugin refactoring.
node: device-mgr: sample device plugin: Add support to control registration process #115107: This PR updates sample device plugin to enable the e2e node tests (or any other entity with full access to the node filesystem) to control the registration process. This feature is used in node: device-mgr: Handle recovery flow by checking if healthy devices exist #114640 where e2e test is implemented to simulate scenarios where application pods requesting devices come up before the device plugin pod on node reboot/ kubelet restart.
test: Fix path to e2e node sample device plugin #116240: This PR fixes a broken file path which got skipped as in the sample device plugin PR above.
node: e2e device plugin test improvements #117057
node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2 #116376: This PR contains the core changes where we ensure that healthy devices exist and and pre-allocated devices are healthy. In addition to that, this PR contains e2e test to reproduce a scenario where application pod comes up before device plugin register itself to kubelet resulting in an admission error.

Rationale for the changes
In case of node reboot/kubelet restart, the flow of events involves obtaining the state from the checkpoint file followed by setting the healthDevices/unhealthyDevices to its zero value. This is done to allow the device plugin to re-register itself so that capacity can be updated appropriately.

During the allocation phase, we need to check if the resources requested by the pod have been registered AND healthy devices are present on the node to be allocated.

Also we need to move this check above needed==0 where needed is required - devices allocated to the container (which is obtained from the checkpoint file) because even in cases where no additional devices have to be allocated (as they were pre-allocated), we still need to make sure that the devices that were previously allocated are healthy.

For more details refer to the comment here: #109595 (comment).

Which issue(s) this PR fixes:

Fixes #109595

Does this PR introduce a user-facing change?

During device plugin allocation, resources requested by the pod can only be allocated if the device plugin has registered itself to kubelet AND healthy devices are present on the node to be allocated. If these conditions are not sattsfied, the pod would fail with `UnexpectedAdmissionError` error.

Start to consolidate the sample device plugin utility and constants in a central place, because we need to use it in different e2e tests. Having a central dependency is better than a maze of entangled e2e tests depending on each other helpers. Signed-off-by: Francesco Romani <fromani@redhat.com>

Update the sample device plugin to enable the e2e node tests (or any other entity with full access to the node filesystem) to control the registration process. We add a new environment variable `REGISTER_CONTROL_FILE`. The value of this variable must be a file which prevents the plugin to register itself while it's present. Once removed, the plugin will go on and complete the registration. The plugin will automatically detect the parent directory on which the file resides and detect deletions, unblocking the registration process. If the file is specified but unaccessible, the plugin will fail. If the file is not specified, the registration process will progress as usual and never pause. The plugin will need read access to the parent directory. This feature is useful because it is not possible to control the order in which the pods are recovered after node reboot/kubelet restart. In this approach, the testing environment will create a directory and then a empty file to pause the registration process of the plugin. Once pointed to that file, the plugin will start and wait for it to be deleted. Only after the directory has been deleted, the plugin would proceed to registration. This feature is used in kubernetes#114640 where e2e test is implemented to simulate scenarios where application pods requesting devices come up before the device plugin pod on node reboot/ kubelet restart. Co-authored-by: Francesco Romani <fromani@redhat.com> Signed-off-by: Swati Sehgal <swsehgal@redhat.com>

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>

The existing path is incorrect (missing `sample-device-plugin`) directory and thus causing test failures. The full path should be `test/e2e/testing-manifests/sample-device-plugin/sample-device-plugin.yaml`. Signed-off-by: David Porter <david@porter.me>

Rather than only returning a string forcing us to log failure with `framework.Fail`, we return a string and error to handle error cases more conventionally. This enables us to use the `parseLog` function inside `Eventually` and `Consistently` blocks, or in general to delegate the error processing and enable better composability. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Co-authored-by: Francesco Romani <fromani@redhat.com>

We rename to make the intent more explicit; We make it global to be able to reuse the value all across the module (e.g. to check the node readiness) later on. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Co-authored-by: Francesco Romani <fromani@redhat.com>

With this change the error message are more helpful and easier to troubleshoot in case of test failures. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>

swatisehgal · 2023-05-03T11:30:04Z

/test pull-kubernetes-unit
unrelated test failure

swatisehgal · 2023-05-03T15:36:08Z

/test pull-kubernetes-e2e-gce-ubuntu-containerd-serial

rphillips · 2023-05-04T18:11:09Z

/lgtm

k8s-ci-robot · 2023-05-04T18:11:16Z

LGTM label has been added.

Git tree hash: 1b63c67942c091880eca3b0f8b493da67086c6d2

ffromani · 2023-05-05T12:14:28Z

/lgtm

disclosure: I co-authored a large part of the changes
similar considerations to #116337 (comment) applies.

The majority of the effort here is spent adding/improving e2e tests to verify this change. While this comes with a cost (larger PR, more code to backport) overall I'm leaning towards preferring a larger backport like this with extensive e2e coverage, because this is a obscure bug in a difficult-to-trigger flow.

While I reckon there's probably no universally right answer to the dilemma between larger PR (with more tests) vs targeted fix, IMO the basic principle that hard-to-trigger flows should have more test coverage, not less (exactly because they are hard to trigger!) prevails here.

bart0sh · 2023-05-05T15:28:48Z

/assign @klueska

dchen1107 · 2023-05-10T20:28:21Z

The same patch-set for 1.26 is applied here for 1.25. Also reviewed the diff between this one and the backport for 1.27.

/lgtm
/approve from SIG Node perspective.

xmudrii

/lgtm
/approve

k8s-ci-robot · 2023-05-11T17:06:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, swatisehgal, xmudrii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/cm/OWNERS~~ [dchen1107]
~~test/OWNERS~~ [dchen1107]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ffromani and others added 7 commits May 2, 2023 20:27

node: device-mgr: sample device plugin: manifest to avoid registration

02e1fff

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>

node: device-plugin: e2e: Annotate device check with error message

20f6635

With this change the error message are more helpful and easier to troubleshoot in case of test failures. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>

k8s-ci-robot added this to the v1.25 milestone May 3, 2023

k8s-ci-robot requested review from bart0sh and klueska May 3, 2023 00:01

k8s-ci-robot added area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 3, 2023

swatisehgal force-pushed the device-mgr-fix-1.25 branch from 3ec62b1 to 1fd579b Compare May 3, 2023 01:14

bart0sh added this to WIP in SIG Node PR Triage May 3, 2023

swatisehgal force-pushed the device-mgr-fix-1.25 branch 3 times, most recently from e25d942 to 16324a1 Compare May 3, 2023 10:36

SergeyKanzhelev added this to Triage in SIG Node CI/Test Board May 3, 2023

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 3, 2023

swatisehgal moved this from WIP to Needs Reviewer in SIG Node PR Triage May 3, 2023

SergeyKanzhelev moved this from Triage to PRs - Needs Reviewer in SIG Node CI/Test Board May 3, 2023

swatisehgal mentioned this pull request May 4, 2023

[release-4.12] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists openshift/kubernetes#1564

Closed

k8s-ci-robot assigned rphillips May 4, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2023

k8s-ci-robot assigned ffromani May 5, 2023

k8s-ci-robot assigned klueska May 5, 2023

bart0sh moved this from Needs Reviewer to Needs Approver in SIG Node PR Triage May 5, 2023

swatisehgal mentioned this pull request May 10, 2023

node: ci: Enable nodefeature test execution in CI for release branches kubernetes/test-infra#29483

Open

k8s-ci-robot assigned dchen1107 May 10, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2023

xmudrii approved these changes May 11, 2023

View reviewed changes

SIG Node CI/Test Board automation moved this from PRs - Needs Reviewer to PRs - Needs Approver May 11, 2023

k8s-ci-robot assigned xmudrii May 11, 2023

xmudrii added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label May 11, 2023

k8s-ci-robot removed the do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. label May 11, 2023

k8s-ci-robot merged commit 0a7c2a9 into kubernetes:release-1.25 May 11, 2023
17 checks passed

SIG Node CI/Test Board automation moved this from PRs - Needs Approver to Done May 11, 2023

SIG Node PR Triage automation moved this from Needs Approver to Done May 11, 2023

swatisehgal deleted the device-mgr-fix-1.25 branch May 16, 2023 11:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[1.25] node: device-mgr: Fix recovery flow by ensuring healthy devices exist and pre-allocated devices are healthy #117738

[1.25] node: device-mgr: Fix recovery flow by ensuring healthy devices exist and pre-allocated devices are healthy #117738

swatisehgal commented May 3, 2023 •

edited

swatisehgal commented May 3, 2023

swatisehgal commented May 3, 2023

rphillips commented May 4, 2023

k8s-ci-robot commented May 4, 2023

ffromani commented May 5, 2023

bart0sh commented May 5, 2023

dchen1107 commented May 10, 2023

xmudrii left a comment

k8s-ci-robot commented May 11, 2023

[1.25] node: device-mgr: Fix recovery flow by ensuring healthy devices exist and pre-allocated devices are healthy #117738

[1.25] node: device-mgr: Fix recovery flow by ensuring healthy devices exist and pre-allocated devices are healthy #117738

Conversation

swatisehgal commented May 3, 2023 • edited

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Does this PR introduce a user-facing change?

swatisehgal commented May 3, 2023

swatisehgal commented May 3, 2023

rphillips commented May 4, 2023

k8s-ci-robot commented May 4, 2023

ffromani commented May 5, 2023

bart0sh commented May 5, 2023

dchen1107 commented May 10, 2023

xmudrii left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented May 11, 2023

swatisehgal commented May 3, 2023 •

edited