Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.25] node: device-mgr: Fix recovery flow by ensuring healthy devices exist and pre-allocated devices are healthy #117738

Merged
merged 18 commits into from May 11, 2023

Conversation

swatisehgal
Copy link
Contributor

@swatisehgal swatisehgal commented May 3, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Cherry-pick of the following PR (/commits) on release-1.25:

  1. 92e0020 of e2e: node remove: kubevirt device plugin #115926: This commit contains sample device plugin refactoring.
  2. node: device-mgr: sample device plugin: Add support to control registration process #115107: This PR updates sample device plugin to enable the e2e node tests (or any other entity with full access to the node filesystem) to control the registration process. This feature is used in node: device-mgr: Handle recovery flow by checking if healthy devices exist #114640 where e2e test is implemented to simulate scenarios where application pods requesting devices come up before the device plugin pod on node reboot/ kubelet restart.
  3. test: Fix path to e2e node sample device plugin #116240: This PR fixes a broken file path which got skipped as in the sample device plugin PR above.
  4. node: e2e device plugin test improvements #117057
  5. node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2 #116376: This PR contains the core changes where we ensure that healthy devices exist and and pre-allocated devices are healthy. In addition to that, this PR contains e2e test to reproduce a scenario where application pod comes up before device plugin register itself to kubelet resulting in an admission error.

Rationale for the changes
In case of node reboot/kubelet restart, the flow of events involves obtaining the state from the checkpoint file followed by setting the healthDevices/unhealthyDevices to its zero value. This is done to allow the device plugin to re-register itself so that capacity can be updated appropriately.

During the allocation phase, we need to check if the resources requested by the pod have been registered AND healthy devices are present on the node to be allocated.

Also we need to move this check above needed==0 where needed is required - devices allocated to the container (which is obtained from the checkpoint file) because even in cases where no additional devices have to be allocated (as they were pre-allocated), we still need to make sure that the devices that were previously allocated are healthy.

For more details refer to the comment here: #109595 (comment).

Which issue(s) this PR fixes:

Fixes #109595

Does this PR introduce a user-facing change?

During device plugin allocation, resources requested by the pod can only be allocated if the device plugin has registered itself to kubelet AND healthy devices are present on the node to be allocated. If these conditions are not sattsfied, the pod would fail with `UnexpectedAdmissionError` error.

ffromani and others added 7 commits May 2, 2023 20:27
Start to consolidate the sample device plugin utility
and constants in a central place, because we need
to use it in different e2e tests.

Having a central dependency is better than a maze of
entangled e2e tests depending on each other helpers.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Update the sample device plugin to enable the e2e node tests (or any
other entity with full access to the node filesystem) to control the
registration process. We add a new environment variable `REGISTER_CONTROL_FILE`.
The value of this variable must be a file which prevents the plugin
to register itself while it's present. Once removed, the plugin will
go on and complete the registration. The plugin will automatically
detect the parent directory on which the file resides and detect
deletions, unblocking the registration process. If the file is specified
but unaccessible, the plugin will fail. If the file is not specified,
the registration process will progress as usual and never pause.
The plugin will need read access to the parent directory.

This feature is useful because it is not possible to control the order
in which the pods are recovered after node reboot/kubelet restart.

In this approach, the testing environment will create a directory and
then a empty file to pause the registration process of the plugin.
Once pointed to that file, the plugin will start and wait for it to
be deleted. Only after the directory has been deleted,
the plugin would proceed to registration.

This feature is used in kubernetes#114640 where e2e test is implemented to
simulate scenarios where application pods requesting devices come up before
the device plugin pod on node reboot/ kubelet restart.

Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
The existing path is incorrect (missing `sample-device-plugin`)
directory and thus causing test failures. The full path should be
`test/e2e/testing-manifests/sample-device-plugin/sample-device-plugin.yaml`.

Signed-off-by: David Porter <david@porter.me>
Rather than only returning a string forcing us to log failure with
`framework.Fail`, we return a string and error to handle error cases
more conventionally. This enables us to use the `parseLog` function
inside `Eventually` and `Consistently` blocks, or in general to delegate
the error processing and enable better composability.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
We rename to make the intent more explicit;
We make it global to be able to reuse the value all across the module
(e.g. to check the node readiness) later on.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
With this change the error message are more helpful and easier
to troubleshoot in case of test failures.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone May 3, 2023
@k8s-ci-robot k8s-ci-robot added do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet labels May 3, 2023
@k8s-ci-robot k8s-ci-robot added area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 3, 2023
@bart0sh bart0sh added this to WIP in SIG Node PR Triage May 3, 2023
@swatisehgal swatisehgal force-pushed the device-mgr-fix-1.25 branch 3 times, most recently from e25d942 to 16324a1 Compare May 3, 2023 10:36
@swatisehgal
Copy link
Contributor Author

/test pull-kubernetes-unit
unrelated test failure

@swatisehgal
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce-ubuntu-containerd-serial

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 3, 2023
@swatisehgal swatisehgal moved this from WIP to Needs Reviewer in SIG Node PR Triage May 3, 2023
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to PRs - Needs Reviewer in SIG Node CI/Test Board May 3, 2023
@rphillips
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 1b63c67942c091880eca3b0f8b493da67086c6d2

@ffromani
Copy link
Contributor

ffromani commented May 5, 2023

/lgtm

disclosure: I co-authored a large part of the changes
similar considerations to #116337 (comment) applies.

The majority of the effort here is spent adding/improving e2e tests to verify this change. While this comes with a cost (larger PR, more code to backport) overall I'm leaning towards preferring a larger backport like this with extensive e2e coverage, because this is a obscure bug in a difficult-to-trigger flow.

While I reckon there's probably no universally right answer to the dilemma between larger PR (with more tests) vs targeted fix, IMO the basic principle that hard-to-trigger flows should have more test coverage, not less (exactly because they are hard to trigger!) prevails here.

@bart0sh
Copy link
Contributor

bart0sh commented May 5, 2023

/assign @klueska

@dchen1107
Copy link
Member

The same patch-set for 1.26 is applied here for 1.25. Also reviewed the diff between this one and the backport for 1.27.

/lgtm
/approve from SIG Node perspective.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2023
Copy link
Member

@xmudrii xmudrii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

SIG Node CI/Test Board automation moved this from PRs - Needs Reviewer to PRs - Needs Approver May 11, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, swatisehgal, xmudrii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xmudrii xmudrii added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label May 11, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. label May 11, 2023
@k8s-ci-robot k8s-ci-robot merged commit 0a7c2a9 into kubernetes:release-1.25 May 11, 2023
17 checks passed
SIG Node CI/Test Board automation moved this from PRs - Needs Approver to Done May 11, 2023
SIG Node PR Triage automation moved this from Needs Approver to Done May 11, 2023
@swatisehgal swatisehgal deleted the device-mgr-fix-1.25 branch May 16, 2023 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

None yet

9 participants