[1.25] node: device-mgr: Fix recovery flow by ensuring healthy devices exist and pre-allocated devices are healthy #117738
Start to consolidate the sample device plugin utilities and constants in a central place, because we need to use them in different e2e tests. Having a central dependency is better than a maze of entangled e2e tests depending on each other's helpers. Signed-off-by: Francesco Romani <fromani@redhat.com>
Update the sample device plugin to enable the e2e node tests (or any other entity with full access to the node filesystem) to control the registration process.

We add a new environment variable `REGISTER_CONTROL_FILE`. The value of this variable must be a file which prevents the plugin from registering itself while it is present. Once the file is removed, the plugin will go on and complete the registration. The plugin automatically detects the parent directory in which the file resides and watches it for deletions, which unblock the registration process. If the file is specified but inaccessible, the plugin will fail. If the file is not specified, the registration process will progress as usual and never pause. The plugin needs read access to the parent directory.

This feature is useful because it is not possible to control the order in which the pods are recovered after node reboot/kubelet restart. In this approach, the testing environment creates a directory and then an empty file to pause the registration process of the plugin. Once pointed to that file, the plugin will start and wait for it to be deleted. Only after the file has been deleted does the plugin proceed to registration.

This feature is used in kubernetes#114640, where an e2e test is implemented to simulate scenarios in which application pods requesting devices come up before the device plugin pod on node reboot/kubelet restart.

Co-authored-by: Francesco Romani <fromani@redhat.com> Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
The existing path is incorrect (it is missing the `sample-device-plugin` directory), which causes test failures. The full path should be `test/e2e/testing-manifests/sample-device-plugin/sample-device-plugin.yaml`. Signed-off-by: David Porter <david@porter.me>
Rather than only returning a string forcing us to log failure with `framework.Fail`, we return a string and error to handle error cases more conventionally. This enables us to use the `parseLog` function inside `Eventually` and `Consistently` blocks, or in general to delegate the error processing and enable better composability. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Co-authored-by: Francesco Romani <fromani@redhat.com>
We rename it to make the intent more explicit, and we make it global so the value can be reused across the module (e.g. to check the node readiness) later on. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Co-authored-by: Francesco Romani <fromani@redhat.com>
With this change the error messages are more helpful, which makes test failures easier to troubleshoot. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
/test pull-kubernetes-unit
/test pull-kubernetes-e2e-gce-ubuntu-containerd-serial
/lgtm
LGTM label has been added. Git tree hash: 1b63c67942c091880eca3b0f8b493da67086c6d2
/lgtm disclosure: I co-authored a large part of the changes. The majority of the effort here is spent adding/improving e2e tests to verify this change. While this comes with a cost (larger PR, more code to backport), overall I'm leaning towards preferring a larger backport like this with extensive e2e coverage, because this is an obscure bug in a difficult-to-trigger flow. While I reckon there's probably no universally right answer to the dilemma between a larger PR (with more tests) and a targeted fix, IMO the basic principle that hard-to-trigger flows should have more test coverage, not less (exactly because they are hard to trigger!) prevails here.
/assign @klueska
The same patch-set for 1.26 is applied here for 1.25. Also reviewed the diff between this one and the backport for 1.27. /lgtm
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dchen1107, swatisehgal, xmudrii
What type of PR is this?
/kind bug
What this PR does / why we need it:
Cherry-pick of the following PR (/commits) on release-1.25:
Rationale for the changes
In case of node reboot/kubelet restart, the flow of events involves obtaining the state from the checkpoint file, followed by setting `healthyDevices`/`unhealthyDevices` to their zero values. This is done to allow the device plugin to re-register itself so that capacity can be updated appropriately. During the allocation phase, we need to check that the resources requested by the pod have been registered AND that healthy devices are present on the node to be allocated.
We also need to move this check above the `needed==0` check, where `needed` is `required` minus the devices already allocated to the container (obtained from the checkpoint file), because even in cases where no additional devices have to be allocated (as they were pre-allocated), we still need to make sure that the devices that were previously allocated are healthy. For more details refer to the comment here: #109595 (comment).
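The ordering of the checks described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the kubelet's device manager code: `deviceState`, `devicesToAllocate`, and the resource name `example.com/resource` are hypothetical, and plain maps stand in for the manager's internal bookkeeping.

```go
package main

import "fmt"

// deviceState is a hypothetical, simplified view of the device manager state.
type deviceState struct {
	registered map[string]bool            // resources whose plugin has registered
	healthy    map[string]map[string]bool // resource -> set of healthy device IDs
}

// devicesToAllocate returns how many additional devices are needed for a
// container. The health checks run before the needed == 0 shortcut, so a
// container whose devices were pre-allocated (restored from the checkpoint)
// is still rejected if those devices are not healthy.
func devicesToAllocate(s deviceState, resource string, required int, preAllocated []string) (int, error) {
	// 1. The resource must have been registered by a device plugin.
	if !s.registered[resource] {
		return 0, fmt.Errorf("cannot allocate unregistered resource %q", resource)
	}
	// 2. At least one healthy device must be present on the node.
	healthy := s.healthy[resource]
	if len(healthy) == 0 {
		return 0, fmt.Errorf("no healthy devices present for resource %q", resource)
	}
	// 3. Every previously allocated device must still be healthy.
	for _, id := range preAllocated {
		if !healthy[id] {
			return 0, fmt.Errorf("previously allocated device %q of %q is not healthy", id, resource)
		}
	}
	// 4. Only now compute needed; zero means nothing more to allocate.
	needed := required - len(preAllocated)
	return needed, nil
}

func main() {
	const resource = "example.com/resource" // hypothetical resource name

	// State right after a kubelet restart: the checkpoint says Dev-1 was
	// allocated, but the plugin has not re-registered yet.
	state := deviceState{
		registered: map[string]bool{},
		healthy:    map[string]map[string]bool{},
	}
	_, err := devicesToAllocate(state, resource, 1, []string{"Dev-1"})
	fmt.Println("before re-registration:", err)

	// The plugin re-registers and reports Dev-1 as healthy.
	state.registered[resource] = true
	state.healthy[resource] = map[string]bool{"Dev-1": true}
	needed, err := devicesToAllocate(state, resource, 1, []string{"Dev-1"})
	fmt.Println("after re-registration:", needed, err)
}
```

If the `needed == 0` shortcut came first, the first call in `main` would succeed even though no plugin is registered, which is the recovery-flow problem this PR addresses.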
Which issue(s) this PR fixes:
Fixes #109595
Does this PR introduce a user-facing change?