[BUG] Cannot run deepspeed in containers #4971
Comments
@zxti - that's odd, this shouldn't be the case. Could you share a sample Docker container from NGC that you're using so I can test and try to repro on our side?
@loadams I'll check which Docker container I was specifically using. But for now, you can also repro in Colab. Let me know if this helps: https://colab.research.google.com/drive/1vpmay34Wfc31ilOHSB4G6WIGsmw8H0-0?usp=sharing
It was nvcr.io/nvidia/pytorch:23.12-py3
I encountered the same issue when running the CIFAR example inside the Docker image. What fixed it for me was running the image in privileged mode: $ docker run -it --rm --privileged ...
@zxti - does that help resolve the issue for you?
Hi @loadams, unfortunately no. As I mentioned in the initial post, you can work around the issue if you have privileged access to the host, but in many environments (such as RunPod, Colab, etc.) you don't have this.
Describe the bug
Unable to run deepspeed cifar in containerized environments (unless you control the container host) - I think because of numactl?
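Since numactl is only a suspicion at this point, a quick check of whether it works at all inside the container may help narrow things down. The sketch below is not part of DeepSpeed: `numactl_status` is a hypothetical helper, and it assumes that `numactl --show` runs into the same restrictions as whatever the launcher does with numactl, which has not been confirmed in this thread.

```python
import shutil
import subprocess


def numactl_status(timeout=10.0):
    """Return 'missing', 'ok', or 'failed: <detail>' for the numactl binary.

    Hypothetical helper for diagnosis only; not a DeepSpeed API.
    """
    path = shutil.which("numactl")
    if path is None:
        # numactl is not installed or not on PATH in this environment.
        return "missing"
    try:
        # `numactl --show` prints the NUMA policy of the current process.
        result = subprocess.run(
            [path, "--show"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"failed: {exc}"
    if result.returncode == 0:
        return "ok"
    # Keep whatever numactl printed so it can be pasted into the issue.
    detail = (result.stderr or result.stdout).strip()
    return f"failed: {detail}"


if __name__ == "__main__":
    print(numactl_status())
```

If this reports a failure in an unprivileged container but "ok" when the same image is started with --privileged, that would support the numactl theory; if it reports "ok" in both, the cause is probably elsewhere.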
To Reproduce
Steps to reproduce the behavior:
Start a container, such as one from NGC, or open Google Colab.
Clone the DeepSpeedExamples repository.
Go to training/cifar.
Run run_ds_moe.sh. (If using Colab, adjust the script to set all GPU/expert variables to 1.)
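The steps above can be scripted roughly as follows. This is a sketch under assumptions: the repository URL is assumed to be the public microsoft/DeepSpeedExamples repo, `repro_steps` and `run_repro` are hypothetical names, and the Colab-specific edit to run_ds_moe.sh is left as a manual step because the variable names are not given here.

```python
import subprocess
from pathlib import Path

# Assumed location of the examples repository (not stated in the report).
REPO_URL = "https://github.com/microsoft/DeepSpeedExamples"


def repro_steps(workdir):
    """Return (command, working directory) pairs for the reproduction."""
    base = Path(workdir)
    repo = base / "DeepSpeedExamples"
    return [
        # Clone the examples repository into workdir.
        (["git", "clone", REPO_URL, str(repo)], base),
        # Run the MoE CIFAR script from training/cifar.
        (["bash", "run_ds_moe.sh"], repo / "training" / "cifar"),
    ]


def run_repro(workdir="."):
    """Execute each step in order, stopping at the first failure."""
    for command, cwd in repro_steps(workdir):
        subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    run_repro()
```

On Colab, edit run_ds_moe.sh after the clone and before the second step, as described above.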
Expected behavior
The script runs without errors.
ds_report output
System info (please complete the following information):
Launcher context
Using the deepspeed launcher.