
Conversation

sarnex
Contributor

@sarnex sarnex commented Aug 8, 2024

I'm working on adding GPU reset support for AMD/NVIDIA to hopefully improve runner stability.

In order to run the commands to reset the GPU, we need sudo.
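For context, the reset commands look roughly like the following; this is illustrative only, not necessarily the exact commands the CI will end up running:

```bash
# Illustrative examples of GPU reset commands that require root;
# not necessarily the exact commands the CI will run.
sudo nvidia-smi --gpu-reset -i 0                      # NVIDIA: reset (idle) GPU 0
sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover   # AMD: trigger recovery via debugfs
```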

We also need sudo to reset on Intel GPUs, and we have already granted the sycl user sudo permission in the Docker image used for Intel GPUs.

However, the image used for AMD and NVIDIA does not inherit from the ubuntu2204_base Dockerfile the way the Intel GPU image does, so we need to add the same thing there.
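A minimal sketch of the kind of change involved (not necessarily the exact lines from ubuntu2204_base; it assumes the sycl user already exists in the image):

```dockerfile
# Sketch only: grant the existing sycl user passwordless sudo,
# mirroring what the Intel GPU base image already does.
RUN apt-get update && apt-get install -y --no-install-recommends sudo \
    && rm -rf /var/lib/apt/lists/* \
    && echo "sycl ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
```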

We also need to use the --privileged Docker flag for AMD/NVIDIA, as we already do for Intel GPUs; it's needed to access the sysfs interface for the GPU reset.

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
@sarnex sarnex marked this pull request as ready for review August 8, 2024 18:54
@sarnex sarnex requested a review from a team as a code owner August 8, 2024 18:54
@sarnex sarnex requested a review from aelovikov-intel August 8, 2024 18:54
@sarnex sarnex changed the title [CI] Allow sycl user to run sudo in HIP/CUDA docker images [CI] Give sycl user permission to do GPU reset in HIP/CUDA docker images Aug 8, 2024
@sarnex
Contributor Author

sarnex commented Aug 12, 2024

@wphuhn-intel Ping on this one? Need your take on any security impact. Thanks!

@wphuhn-intel

My brief thoughts, because I'm still researching this:

  • I see USER root at the top of these Dockerfiles, but it never transitions to USER sycl. Are these Dockerfiles running as root by default?
  • Running --privileged containers is not a good idea, to put it lightly (see https://blog.trailofbits.com/2019/07/19/understanding-docker-container-escapes/). I'm currently trying to find out whether there's a kernel capability that would allow rebooting the GPUs without giving the container the keys to the kingdom, so that something like --cap-drop all --cap-add make-the-gpu-work could be used instead (roughly the shape sketched after this list).
  • Which leads to my biggest question: why does a given container need the ability to reset the GPU? Won't this interrupt all other workloads running on the host?
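Roughly the shape I have in mind for the second point above; this is hypothetical, SYS_ADMIN is only a placeholder guess for whatever capability (if any) turns out to be enough, and <runner-image> is not a real image name:

```bash
# Hypothetical shape only, not a verified invocation.
# SYS_ADMIN is a placeholder guess; <runner-image> is not a real image name.
docker run --rm --cap-drop ALL --cap-add SYS_ADMIN <runner-image>
```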

@sarnex
Contributor Author

sarnex commented Aug 12, 2024

@wphuhn-intel

Thanks for the feedback!

When the CI runs, it actually runs as the sycl user; you can see that here. This script is what the CI runs when it enters the Docker container.

Basically we need /sys/kernel/debug/ mounted and writable from inside the container. The only thing I found that worked was --privileged, but I'm absolutely not a Docker expert, so maybe there is something else we can do.
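One untested idea for something narrower than --privileged would be to bind-mount only debugfs and add a single capability back; I haven't verified that a reset actually works this way (<runner-image> is a placeholder):

```bash
# Untested sketch of a narrower alternative to --privileged.
# Whether a GPU reset actually succeeds with only this is unverified.
docker run --rm \
  --cap-add SYS_ADMIN \
  -v /sys/kernel/debug:/sys/kernel/debug \
  <runner-image>
```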

Each runner only runs one GitHub CI job at a time, and these machines are only used for GitHub CI, so there are no other workloads on the host. I'd expect many other things to break if multiple jobs could run at the same time or if the host were doing other work. We need to reset the GPU because tests sometimes cause GPU hangs and break the CI for everyone; obviously the hangs need to be investigated separately, but we shouldn't hold everyone up in the meantime. One idea I had was to trigger a reset script outside of Docker from CI, but giving GitHub access to the actual runner outside of Docker seems really bad.

Let me know if you have any ideas.

@uditagarwal97
Contributor

> Each runner only runs one GitHub CI job at a time, and these machines are only used for GitHub CI.

AFAIK, that's the case with all CI machines except the new PVC runners: their GPU resources are shared with other teams as well, so we don't reset the GPU on the PVC runners.

@wphuhn-intel

I did a fair amount of looking into this yesterday, and I was able to find capability settings for NVIDIA:

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html#dockerfiles

But I'm still looking for others.
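For reference, the per-image settings described in that doc look roughly like this (the values are just an example):

```dockerfile
# Example from the linked NVIDIA Container Toolkit doc: select which devices
# and driver capabilities the runtime injects into the container.
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
```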

That being said, have you guys considered HEALTHCHECK (https://docs.docker.com/reference/dockerfile/#healthcheck)? This is a Docker security best practice, and you could do something like HEALTHCHECK {sycl-ls,nvidia-smi,whatever-amd-uses} as a basic "is the GPU still responding?" test that could be handled host-side.
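One possible shape of that suggestion (the command and intervals here are illustrative, not a recommendation for specific values):

```dockerfile
# Illustrative HEALTHCHECK: periodically verify the GPU still responds so the
# host can react to an "unhealthy" container status.
HEALTHCHECK --interval=5m --timeout=1m --retries=3 \
    CMD nvidia-smi || exit 1
```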

@sarnex
Contributor Author

sarnex commented Aug 15, 2024

Thanks a lot for the deep investigation, William. I'm about to go on vacation, so my next response might be delayed.

@sarnex
Contributor Author

sarnex commented Sep 27, 2024

Dropping this for now, thanks
