[CI] Give sycl user permission to do GPU reset in HIP/CUDA docker images #15017
Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
@wphuhn-intel Ping on this one? Need your take on any security impact. Thanks!
My brief thoughts, because I'm still researching this:
Thanks for the feedback! When the CI runs, it actually runs as the Basically we need to have

Each runner only runs one GitHub CI job at a time, and these machines are only used for GitHub CI. There are no other workloads on the host. I expect many other things would break if multiple jobs could be run at the same time or if the host was doing other stuff.

An idea I had was maybe we could trigger a script to run outside of Docker in CI, but giving GitHub access to the actual runner outside of Docker seems really bad.

We need to reset the GPU because sometimes tests can cause GPU hangs and break the CI for everyone. Obviously that needs to be investigated separately, but we shouldn't hold everyone up.

Let me know if you have any ideas.
AFAIK, that's the case with all CI machines except the new PVC runners. Their GPU resources are shared with other teams as well; therefore, we don't reset the GPU on PVC runners.
I did a fair amount of looking into this yesterday, and I was able to find capability settings for Nvidia: But I'm still looking for others.

That being said, have you guys considered HEALTHCHECK? (https://docs.docker.com/reference/dockerfile/#healthcheck) This is a Docker security best practice, and you could do something like
Thanks a lot for the deep investigation, William. I'm about to go on vacation, so my next response might be delayed.
Dropping this for now, thanks
I'm working on adding GPU reset support for AMD/NVIDIA to hopefully improve runner stability.
In order to run the commands to reset the GPU, we need `sudo`.

We also need `sudo` to reset on Intel GPUs, and we have already granted the `sycl` user `sudo` permission in the Docker image used for Intel GPUs.

However, the image used for AMD and NVIDIA does not inherit from the `ubuntu2204_base` Dockerfile like the Intel GPU image does, so we need to add the same thing there.

We also need to use the `--privileged` docker flag for AMD/NVIDIA. Again, we already do this for Intel GPUs. We need it to access the `sysfs` interface to do a GPU reset.
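
For readers unfamiliar with the pattern, here is a rough sketch of what granting the `sycl` user passwordless `sudo` in a Dockerfile could look like. This is not the actual diff from this PR. It assumes a Debian/Ubuntu-based image in which the `sycl` user already exists, and the sudoers file name `/etc/sudoers.d/sycl` is a hypothetical choice made for illustration.

```dockerfile
# Illustrative sketch only, not the change made in this PR.
# Assumes a Debian/Ubuntu base image and an existing `sycl` user.

# Install sudo and clean the apt cache to keep the layer small.
RUN apt-get update && \
    apt-get install -y --no-install-recommends sudo && \
    rm -rf /var/lib/apt/lists/*

# Allow the sycl user to run any command via sudo without a password.
# The drop-in file name is hypothetical; sudoers drop-ins should be mode 0440.
RUN echo "sycl ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/sycl && \
    chmod 0440 /etc/sudoers.d/sycl
```

The `--privileged` flag is separate from the image contents: it is passed when the container is started (for example `docker run --privileged <image>`), so it would be set wherever the CI job defines its container options rather than in the Dockerfile.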