
Conversation

sarnex
Contributor

@sarnex sarnex commented Aug 8, 2024

I'm working on adding GPU reset support for AMD/NVIDIA to hopefully improve runner stability.

In order to run the commands to reset the GPU, we need sudo.
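For context, the reset commands look roughly like the following; this is illustrative only, not necessarily the exact commands the CI will end up running:

```bash
# Illustrative examples of GPU reset commands that require root;
# not necessarily the exact commands the CI will run.
sudo nvidia-smi --gpu-reset -i 0                      # NVIDIA: reset (idle) GPU 0
sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover   # AMD: trigger recovery via debugfs
```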

We also need sudo to reset on Intel GPUs, and we have already granted the sycl user sudo permission in the Docker image used for Intel GPUs.

However, the image used for AMD and NVIDIA does not inherit from the ubuntu2204_base Dockerfile the way the Intel GPU image does, so we need to add the same thing there.
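A minimal sketch of the kind of change involved (not necessarily the exact lines from ubuntu2204_base; it assumes the sycl user already exists in the image):

```dockerfile
# Sketch only: grant the existing sycl user passwordless sudo,
# mirroring what the Intel GPU base image already does.
RUN apt-get update && apt-get install -y --no-install-recommends sudo \
    && rm -rf /var/lib/apt/lists/* \
    && echo "sycl ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
```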

We also need to use the --privileged Docker flag for AMD/NVIDIA, as we already do for Intel GPUs; it's needed to access the sysfs interface for the GPU reset.

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
@sarnex sarnex marked this pull request as ready for review August 8, 2024 18:54
@sarnex sarnex requested a review from a team as a code owner August 8, 2024 18:54
@sarnex sarnex requested a review from aelovikov-intel August 8, 2024 18:54
@sarnex sarnex changed the title [CI] Allow sycl user to run sudo in HIP/CUDA docker images [CI] Give sycl user permission to do GPU reset in HIP/CUDA docker images Aug 8, 2024
@sarnex
Contributor Author

sarnex commented Aug 12, 2024

@wphuhn-intel Ping on this one? Need your take on any security impact. Thanks!

@wphuhn-intel

My brief thoughts, because I'm still researching this:

  • I see USER root at the top of these Dockerfiles, but it never transitions to USER sycl. Are these Dockerfiles running as root by default?
  • Running --privileged containers is not a good idea, to put it lightly (see https://blog.trailofbits.com/2019/07/19/understanding-docker-container-escapes/). I'm currently trying to find out whether there's a kernel capability that would allow rebooting the GPUs without giving the container the keys to the kingdom, so that something like --cap-drop all --cap-add make-the-gpu-work could be used instead (roughly the shape sketched after this list).
  • Which leads to my biggest question: why does a given container need the ability to reset the GPU? Won't this interrupt all other workloads running on the host?
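Roughly the shape I have in mind for the second point above; this is hypothetical, SYS_ADMIN is only a placeholder guess for whatever capability (if any) turns out to be enough, and <runner-image> is not a real image name:

```bash
# Hypothetical shape only, not a verified invocation.
# SYS_ADMIN is a placeholder guess; <runner-image> is not a real image name.
docker run --rm --cap-drop ALL --cap-add SYS_ADMIN <runner-image>
```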

@sarnex
Contributor Author

sarnex commented Aug 12, 2024

@wphuhn-intel

Thanks for the feedback!

When the CI runs, it actually runs as the sycl user; you can see that here. This script is what the CI runs when it enters the Docker container.

Basically we need /sys/kernel/debug/ mounted and writable from inside the container. The only thing I found that worked was --privileged, but I'm absolutely not a Docker expert, so maybe there is something else we can do.
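One untested idea for something narrower than --privileged would be to bind-mount only debugfs and add a single capability back; I haven't verified that a reset actually works this way (<runner-image> is a placeholder):

```bash
# Untested sketch of a narrower alternative to --privileged.
# Whether a GPU reset actually succeeds with only this is unverified.
docker run --rm \
  --cap-add SYS_ADMIN \
  -v /sys/kernel/debug:/sys/kernel/debug \
  <runner-image>
```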

Each runner only runs one GitHub CI job at a time, and these machines are only used for GitHub CI, so there are no other workloads on the host. I'd expect many other things to break if multiple jobs could run at the same time or if the host were doing other work. We need to reset the GPU because tests sometimes cause GPU hangs and break the CI for everyone; obviously the hangs need to be investigated separately, but we shouldn't hold everyone up in the meantime. One idea I had was to trigger a reset script outside of Docker from CI, but giving GitHub access to the actual runner outside of Docker seems really bad.

Let me know if you have any ideas.

@uditagarwal97
Contributor

> Each runner only runs one GitHub CI job at a time, and these machines are only used for GitHub CI.

AFAIK, that's the case with all CI machines except the new PVC runners: their GPU resources are shared with other teams as well, so we don't reset the GPU on the PVC runners.

@wphuhn-intel

I did a fair amount of looking into this yesterday, and I was able to find capability settings for NVIDIA:

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html#dockerfiles

But I'm still looking for others.
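For reference, the per-image settings described in that doc look roughly like this (the values are just an example):

```dockerfile
# Example from the linked NVIDIA Container Toolkit doc: select which devices
# and driver capabilities the runtime injects into the container.
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
```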

That being said, have you guys considered HEALTHCHECK (https://docs.docker.com/reference/dockerfile/#healthcheck)? This is a Docker security best practice, and you could do something like HEALTHCHECK {sycl-ls,nvidia-smi,whatever-amd-uses} as a basic "is the GPU still responding?" test that could be handled host-side.
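One possible shape of that suggestion (the command and intervals here are illustrative, not a recommendation for specific values):

```dockerfile
# Illustrative HEALTHCHECK: periodically verify the GPU still responds so the
# host can react to an "unhealthy" container status.
HEALTHCHECK --interval=5m --timeout=1m --retries=3 \
    CMD nvidia-smi || exit 1
```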

@sarnex
Contributor Author

sarnex commented Aug 15, 2024

Thanks a lot for the deep investigation, William. I'm about to go on vacation, so my next response might be delayed.

@sarnex
Contributor Author

sarnex commented Sep 27, 2024

Dropping this for now, thanks
