[BUG] Cannot run deepspeed in containers #4971

Open
zxti opened this issue Jan 17, 2024 · 6 comments

zxti commented Jan 17, 2024

Describe the bug
Unable to run the DeepSpeed CIFAR example in containerized environments (unless you control the container host) - I think because of numactl?

To Reproduce
Steps to reproduce the behavior:

1. Start a container, such as one from NGC, or go to Google Colab.
2. git clone DeepSpeedExamples.
3. Go to training/cifar.
4. Run run_ds_moe.sh. (If using Colab, adjust the script to set all GPU/expert count variables to 1.)

Expected behavior
Runs.

ds_report output

[2024-01-17 19:54:12,758] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-17 19:54:14,599] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-01-17 19:54:14,605] [INFO] [runner.py:571:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None --bind_cores_to_rank cifar10_deepspeed.py --log-interval 100 --deepspeed --deepspeed_config ds_config.json --moe --ep-world-size 1 --num-experts 1 --top-k 1 --noisy-gate-policy RSample --moe-param-group
[2024-01-17 19:54:17,084] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-17 19:54:19,709] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.19.3-1+cuda12.2
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.19.3-1
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.19.3-1
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.19.3-1+cuda12.2
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.19.3-1
[2024-01-17 19:54:19,710] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-01-17 19:54:19,710] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-01-17 19:54:19,710] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-01-17 19:54:19,710] [INFO] [launch.py:163:main] dist_world_size=1
[2024-01-17 19:54:19,710] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
set_mempolicy: Operation not permitted
setting membind: Operation not permitted
[2024-01-17 19:54:20,732] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 8319
[2024-01-17 19:54:20,732] [ERROR] [launch.py:321:sigkill_handler] ['numactl', '-m', '0', '-C', '0', '/usr/bin/python3', '-u', 'cifar10_deepspeed.py', '--local_rank=0', '--log-interval', '100', '--deepspeed', '--deepspeed_config', 'ds_config.json', '--moe', '--ep-world-size', '1', '--num-experts', '1', '--top-k', '1', '--noisy-gate-policy', 'RSample', '--moe-param-group'] exits with return code = 1
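
The failing command in the last line above is the numactl wrapper that the launcher adds when --bind_cores_to_rank is passed (run_ds_moe.sh appears to pass it), and the set_mempolicy/mbind calls it makes are not permitted inside an unprivileged container. As a sketch of a possible workaround, assuming nothing in the example actually depends on core binding, the same script arguments from the log above can be launched without that flag:

$ deepspeed cifar10_deepspeed.py --log-interval 100 --deepspeed \
    --deepspeed_config ds_config.json --moe --ep-world-size 1 --num-experts 1 \
    --top-k 1 --noisy-gate-policy RSample --moe-param-group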

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • GPU count and types: 1xT4
  • Python version: 3.10.2

Launcher context
Using deepspeed launcher

zxti added the bug and training labels on Jan 17, 2024
loadams self-assigned this on Jan 22, 2024

loadams commented Jan 22, 2024

@zxti - that's odd, this shouldn't be the case. Could you share a sample docker container from NGC that you're using so I can test and try to repro on our side?


zxti commented Jan 22, 2024

@loadams I'll check which docker container I was specifically using.

But for now - you can also repro in Colab; let me know if this helps: https://colab.research.google.com/drive/1vpmay34Wfc31ilOHSB4G6WIGsmw8H0-0?usp=sharing


zxti commented Jan 22, 2024

It was nvcr.io/nvidia/pytorch:23.12-py3


iKrishneel commented Feb 7, 2024

I encountered the same issue when running the CIFAR example inside a Docker container, and the solution was to run the container in --privileged mode.

$ docker run -it --rm --privileged ...
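
A possibly narrower alternative than --privileged, untested here, would be to grant only the SYS_NICE capability, since Docker's default seccomp profile appears to gate set_mempolicy/mbind behind CAP_SYS_NICE:

$ docker run -it --rm --cap-add=SYS_NICE ...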


loadams commented Feb 7, 2024

@zxti - does that help resolve the issue for you?


zxti commented Feb 7, 2024

Hi @loadams, unfortunately no - this is what I mentioned in the initial post: you can work around the issue if you have privileged access to the container host, but in many environments (such as RunPod, Colab, etc.) you don't.
