[BUG] Cannot run deepspeed in containers #4971

Open
zxti opened this issue Jan 17, 2024 · 6 comments

zxti commented Jan 17, 2024

Describe the bug
Unable to run the DeepSpeed CIFAR example in containerized environments (unless you control the container host) - I think because of numactl?

To Reproduce
Steps to reproduce the behavior:

1. Start a container, such as one from NGC, or go to Google Colab.
2. git clone DeepSpeedExamples.
3. Go to training/cifar.
4. Run run_ds_moe.sh. (If using Colab, adjust the script to set all GPU/expert count variables to 1.)

Expected behavior
Runs.

ds_report output

[2024-01-17 19:54:12,758] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-17 19:54:14,599] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-01-17 19:54:14,605] [INFO] [runner.py:571:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None --bind_cores_to_rank cifar10_deepspeed.py --log-interval 100 --deepspeed --deepspeed_config ds_config.json --moe --ep-world-size 1 --num-experts 1 --top-k 1 --noisy-gate-policy RSample --moe-param-group
[2024-01-17 19:54:17,084] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-17 19:54:19,709] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.19.3-1+cuda12.2
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.19.3-1
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.19.3-1
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.19.3-1+cuda12.2
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-01-17 19:54:19,710] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.19.3-1
[2024-01-17 19:54:19,710] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-01-17 19:54:19,710] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-01-17 19:54:19,710] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-01-17 19:54:19,710] [INFO] [launch.py:163:main] dist_world_size=1
[2024-01-17 19:54:19,710] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
set_mempolicy: Operation not permitted
setting membind: Operation not permitted
[2024-01-17 19:54:20,732] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 8319
[2024-01-17 19:54:20,732] [ERROR] [launch.py:321:sigkill_handler] ['numactl', '-m', '0', '-C', '0', '/usr/bin/python3', '-u', 'cifar10_deepspeed.py', '--local_rank=0', '--log-interval', '100', '--deepspeed', '--deepspeed_config', 'ds_config.json', '--moe', '--ep-world-size', '1', '--num-experts', '1', '--top-k', '1', '--noisy-gate-policy', 'RSample', '--moe-param-group'] exits with return code = 1
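
The failing command in the last line above is the numactl wrapper that the launcher adds when --bind_cores_to_rank is passed (run_ds_moe.sh appears to pass it), and the set_mempolicy/mbind calls it makes are not permitted inside an unprivileged container. As a sketch of a possible workaround, assuming nothing in the example actually depends on core binding, the same script arguments from the log above can be launched without that flag:

$ deepspeed cifar10_deepspeed.py --log-interval 100 --deepspeed \
    --deepspeed_config ds_config.json --moe --ep-world-size 1 --num-experts 1 \
    --top-k 1 --noisy-gate-policy RSample --moe-param-group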

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • GPU count and types: 1xT4
  • Python version: 3.10.2

Launcher context
Using deepspeed launcher

zxti added the bug and training labels on Jan 17, 2024
loadams self-assigned this on Jan 22, 2024

loadams commented Jan 22, 2024

@zxti - that's odd, this shouldn't be the case. Could you share a sample docker container from NGC that you're using so I can test and try to repro on our side?


zxti commented Jan 22, 2024

@loadams I'll check which docker container I was specifically using.

But for now - you can also repro in Colab; let me know if this helps: https://colab.research.google.com/drive/1vpmay34Wfc31ilOHSB4G6WIGsmw8H0-0?usp=sharing


zxti commented Jan 22, 2024

It was nvcr.io/nvidia/pytorch:23.12-py3


iKrishneel commented Feb 7, 2024

I encountered the same issue when running the CIFAR example inside a Docker container, and the solution was to run the container in --privileged mode.

$ docker run -it --rm --privileged ...
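
A possibly narrower alternative than --privileged, untested here, would be to grant only the SYS_NICE capability, since Docker's default seccomp profile appears to gate set_mempolicy/mbind behind CAP_SYS_NICE:

$ docker run -it --rm --cap-add=SYS_NICE ...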


loadams commented Feb 7, 2024

@zxti - does that help resolve the issue for you?


zxti commented Feb 7, 2024

Hi @loadams, unfortunately no - this is what I mentioned in the initial post: you can work around the issue if you have privileged access to the container host, but in many environments (such as RunPod, Colab, etc.) you don't.
