Support device health-check #362

Open
adrianchiris opened this issue Jul 13, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

@adrianchiris
Contributor

What would you like to be added?

Support periodically checking device health and notifying kubelet of changes to devices via the ListAndWatch RPC call.

What is the use case for this feature / enhancement?

Devices may become unhealthy, e.g. a resource was consumed by a workload and became corrupted while in use. We should report this to kubelet so that the device is not requested for future workloads.

https://github.com/kubernetes/kubernetes/blob/234d7311822aecb8c5f4115107007b8420d9316b/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.proto#L58
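
A minimal sketch of what such a periodic check could look like, not the plugin's actual code. `Device`, `ProbeFunc`, `refresh` and `watchHealth` are hypothetical names; `Device` is a local stand-in that mirrors only the `ID` and `Health` fields of the kubelet device plugin API, and the probe logic is left to the caller. A real ListAndWatch implementation would forward each snapshot to kubelet over its gRPC stream instead of a plain channel.

```go
package main

import (
	"context"
	"time"
)

// Health values; assumed to match the strings used by the kubelet device plugin API.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a hypothetical stand-in for the device plugin API's device message.
type Device struct {
	ID     string
	Health string
}

// ProbeFunc reports whether the device with the given ID is currently healthy.
type ProbeFunc func(id string) bool

// refresh re-probes every device in place and reports whether any health value changed.
func refresh(devs []Device, probe ProbeFunc) bool {
	changed := false
	for i := range devs {
		h := Unhealthy
		if probe(devs[i].ID) {
			h = Healthy
		}
		if devs[i].Health != h {
			devs[i].Health = h
			changed = true
		}
	}
	return changed
}

// watchHealth probes devs every interval and sends a copy of the full device
// list on updates whenever at least one health value changed. It returns when
// ctx is cancelled.
func watchHealth(ctx context.Context, devs []Device, probe ProbeFunc, interval time.Duration, updates chan<- []Device) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if !refresh(devs, probe) {
				continue
			}
			// Copy so the receiver never observes later in-place updates.
			snapshot := append([]Device(nil), devs...)
			select {
			case updates <- snapshot:
			case <-ctx.Done():
				return
			}
		}
	}
}
```

Sending the whole list rather than a diff follows the way ListAndWatch is described in the linked api.proto: the stream returns the new device list whenever a device's state changes.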

@adrianchiris added the enhancement label Jul 13, 2021
@TothFerenc
Contributor

Isn't this a bug, since it is listed as a supported feature? https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin#features

@ipatrykx
Contributor

It seems there is some handler for updateSignal, but I assume that has to be issued by kubelet (?). Do you want the DP to 'proactively' scan device health status and then pass that info to kubelet on its own?

My other question is about the plans to make the DP track devices (as in issue 276): should the DP then also track the health status of 'consumed' devices? I wonder whether that is even achievable, since the devices are moved into the container's namespace.

@adrianchiris
Contributor Author

Isn't it a bug as it is mentioned as a supported feature?:

Maybe a documentation bug :) I don't remember having this logic in the DP.

@ipatrykx I think we should first define what a healthy device is.

A good start, IMO: a device is considered healthy if all resources relevant to that device are present in the system.
I am unsure how to check this for allocated devices.
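
To make the proposed definition concrete, here is a rough sketch under stated assumptions. `resourcesPresent` and `pathExists` are hypothetical helpers, and which paths count as "relevant resources" for a device (for example its sysfs directory or a device node) is left open here, as it is in the discussion above. The sketch only covers devices visible from the host and does not address allocated devices.

```go
package main

import "os"

// pathExists reports whether path can be stat'ed from the current namespace.
func pathExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// resourcesPresent reports whether every required resource of a device is
// present, according to the supplied exists check. A device with no required
// resources is treated as healthy.
func resourcesPresent(required []string, exists func(string) bool) bool {
	for _, r := range required {
		if !exists(r) {
			return false
		}
	}
	return true
}
```

Passing the existence check as a parameter keeps the definition separate from how presence is detected, so a different check could be swapped in without changing the rule itself.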
