-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[1.25] node: device-mgr: Fix recovery flow by ensuring healthy devices exist and pre-allocated devices are healthy #117738
[1.25] node: device-mgr: Fix recovery flow by ensuring healthy devices exist and pre-allocated devices are healthy #117738
Commits on May 2, 2023
-
e2e: node: unify sample device plugin utilities
Start to consolidate the sample device plugin utility and constants in a central place, because we need to use it in different e2e tests. Having a central dependency is better than a maze of entangled e2e tests depending on each other helpers. Signed-off-by: Francesco Romani <fromani@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for dd9dbcc - Browse repository at this point
Copy the full SHA dd9dbccView commit details -
node: device-mgr: sample device plugin: control registration process
Update the sample device plugin to enable the e2e node tests (or any other entity with full access to the node filesystem) to control the registration process. We add a new environment variable `REGISTER_CONTROL_FILE`. The value of this variable must be a file which prevents the plugin to register itself while it's present. Once removed, the plugin will go on and complete the registration. The plugin will automatically detect the parent directory on which the file resides and detect deletions, unblocking the registration process. If the file is specified but unaccessible, the plugin will fail. If the file is not specified, the registration process will progress as usual and never pause. The plugin will need read access to the parent directory. This feature is useful because it is not possible to control the order in which the pods are recovered after node reboot/kubelet restart. In this approach, the testing environment will create a directory and then a empty file to pause the registration process of the plugin. Once pointed to that file, the plugin will start and wait for it to be deleted. Only after the directory has been deleted, the plugin would proceed to registration. This feature is used in kubernetes#114640 where e2e test is implemented to simulate scenarios where application pods requesting devices come up before the device plugin pod on node reboot/ kubelet restart. Co-authored-by: Francesco Romani <fromani@redhat.com> Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 3a9c226 - Browse repository at this point
Copy the full SHA 3a9c226View commit details -
node: device-mgr: sample device plugin: manifest to avoid registration
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 02e1fff - Browse repository at this point
Copy the full SHA 02e1fffView commit details -
test: Fix path to e2e node sample device plugin
The existing path is incorrect (missing `sample-device-plugin`) directory and thus causing test failures. The full path should be `test/e2e/testing-manifests/sample-device-plugin/sample-device-plugin.yaml`. Signed-off-by: David Porter <david@porter.me>
Configuration menu - View commit details
-
Copy full SHA for 7d1756a - Browse repository at this point
Copy the full SHA 7d1756aView commit details -
node: device-plugins: e2e: Refactor parse log to return string and error
Rather than only returning a string forcing us to log failure with `framework.Fail`, we return a string and error to handle error cases more conventionally. This enables us to use the `parseLog` function inside `Eventually` and `Consistently` blocks, or in general to delegate the error processing and enable better composability. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Co-authored-by: Francesco Romani <fromani@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for d00f41a - Browse repository at this point
Copy the full SHA d00f41aView commit details -
node: device-plugins: e2e: s/devLen/expectedSampleDevsAmount
We rename to make the intent more explicit; We make it global to be able to reuse the value all across the module (e.g. to check the node readiness) later on. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Co-authored-by: Francesco Romani <fromani@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 6bae194 - Browse repository at this point
Copy the full SHA 6bae194View commit details -
node: device-plugin: e2e: Annotate device check with error message
With this change the error message are more helpful and easier to troubleshoot in case of test failures. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 20f6635 - Browse repository at this point
Copy the full SHA 20f6635View commit details
Commits on May 3, 2023
-
node: device-plugin: e2e: Isolate test to pod restart scenario
Rather than testing out for both pod restart and kubelet restart, we change the tests to just handle pod restart scenario. Clarify the test purpose and add extra check to tighten the test. We would be adding additional tests to cover kubelet restart scenarios in subsequent commits. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Signed-off-by: Francesco Romani <fromani@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for b9ecf98 - Browse repository at this point
Copy the full SHA b9ecf98View commit details -
node: device-plugin: e2e: Update test description to make it explicit
Explicitly state that the test involves kubelet restart and device plugin re-registration (no pod restart) We remove the part of the code where we wait for the pod to restart as this test case should no longer involve pod restart. In addition to that, we use `waitForNodeReady` instead of `WaitForAllNodesSchedulable` for ensuring that the node is ready for pods to be scheduled on it. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Co-authored-by: Francesco Romani <fromani@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 60d81f0 - Browse repository at this point
Copy the full SHA 60d81f0View commit details -
node: device-plugin: e2e: Provide sleep intervals via constants
Based on whether the test case requires pod restart or not, the sleep interval needs to be updated and we define constants to represent the two sleep intervals that can be used in the corresponding test cases. Signed-off-by: Swati Sehgal <swsehgal@redhat.com> Co-authored-by: Francesco Romani <fromani@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for e53a718 - Browse repository at this point
Copy the full SHA e53a718View commit details -
node: device-plugin: e2e: Add test case for kubelet restart
Capture explicitly a test case pertaining to kubelet restart but with no pod restart and device plugin re-registration. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 57b644b - Browse repository at this point
Copy the full SHA 57b644bView commit details -
node: device-mgr: Handle recovery by checking if healthy devices exist
In case of node reboot/kubelet restart, the flow of events involves obtaining the state from the checkpoint file followed by setting the `healthDevices`/`unhealthyDevices` to its zero value. This is done to allow the device plugin to re-register itself so that capacity can be updated appropriately. During the allocation phase, we need to check if the resources requested by the pod have been registered AND healthy devices are present on the node to be allocated. Also we need to move this check above `needed==0` where needed is required - devices allocated to the container (which is obtained from the checkpoint file) because even in cases where no additional devices have to be allocated (as they were pre-allocated), we still need to make sure he devices that were previously allocated are healthy. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 337e6e7 - Browse repository at this point
Copy the full SHA 337e6e7View commit details -
node: device-mgr: e2e: Implement End to end test
This commit reuses e2e tests implmented as part of kubernetes#110729. The commit is borrowed from the aforementioned PR as is to preserve authorship. Subsequent commit will update the end to end test to simulate the problem this PR is trying to solve by reproducing the issue: 109595. Co-authored-by: Francesco Romani <fromani@redhat.com> Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 4eb2808 - Browse repository at this point
Copy the full SHA 4eb2808View commit details -
node: device-mgr: e2e: Update the e2e test to reproduce issue:109595
Breakdown of the steps implemented as part of this e2e test is as follows: 1. Create a file `registration` at path `/var/lib/kubelet/device-plugins/sample/` 2. Create sample device plugin with an environment variable with `REGISTER_CONTROL_FILE=/var/lib/kubelet/device-plugins/sample/registration` that waits for a client to delete the control file. 3. Trigger plugin registeration by deleting the abovementioned directory. 4. Create a test pod requesting devices exposed by the device plugin. 5. Stop kubelet. 6. Remove pods using CRI to ensure new pods are created after kubelet restart. 7. Restart kubelet. 8. Wait for the sample device plugin pod to be running. In this case, the registration is not triggered. 9. Ensure that resource capacity/allocatable exported by the device plugin is zero. 10. The test pod should fail with `UnexpectedAdmissionError` 11. Delete the test pod. 12. Delete the sample device plugin pod. 13. Remove `/var/lib/kubelet/device-plugins/sample/` and its content, the directory created to control registration Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 6c0c91e - Browse repository at this point
Copy the full SHA 6c0c91eView commit details -
node: device-mgr: e2e: adapt to sample device plugin refactoring
These updates are to adapt to the sample device plugin refactoring done here: kubernetes@92e0020. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 9fc2d5c - Browse repository at this point
Copy the full SHA 9fc2d5cView commit details -
node: device-plugin: e2e: Capture pod admission failure
This test captures that scenario where after kubelet restart, application pod comes up and the device plugin pod hasn't re-registered itself, the pod fails with admission error. It is worth noting that once the device plugin pod has registered itself, another application pod requesting devices ends up running successfully. For the test case where kubelet is restarted and device plugin has re-registered without involving pod restart, since the pod after kubelet restart ends up with admission error, we cannot be certain the device that the second pod (pod2) would get. As long as, it gets a device we consider the test to pass. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for b50d53f - Browse repository at this point
Copy the full SHA b50d53fView commit details -
node: device-plugin: add node reboot test scenario
Add a test suit to simulate node reboot (achieved by removing pods using CRI API before kubelet is restarted). Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 4ba6f87 - Browse repository at this point
Copy the full SHA 4ba6f87View commit details -
node: device-plugin: e2e: Additional test cases
Additional test cases added: Keeps device plugin assignments across pod and kubelet restarts (no device plugin re-registration) Keeps device plugin assignments after the device plugin has re-registered (no kubelet or pod restart) Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Configuration menu - View commit details
-
Copy full SHA for 7f1809d - Browse repository at this point
Copy the full SHA 7f1809dView commit details