Closed
Description
According to the docs, torchrun should work with the following setup:
export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
torchrun --nnodes=1 --nproc_per_node=8 your_script.py
In practice, however, it fails:
NCCL_DEBUG=TRACE LD_AUDIT=/usr/local/lib/python3.12/dist-packages/mscclpp/lib/libmscclpp_audit_nccl.so MSCCLPP_DEBUG=TRACE torchrun --master_port=29503 --nnodes=1 --nproc_per_node=4 test/torch/correctness_test.py --collective allreduce --nelem 1048576 --dtype fp16
W1021 13:58:14.016000 12663 torch/distributed/run.py:792]
W1021 13:58:14.016000 12663 torch/distributed/run.py:792] *****************************************
W1021 13:58:14.016000 12663 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1021 13:58:14.016000 12663 torch/distributed/run.py:792] *****************************************
my-dev-1:12728:12728 [0] MSCCLPP INFO TcpBootstrap : Using eth0:10.133.0.138<0>
my-dev-1:12728:12728 [0] MSCCLPP INFO TcpBootstrap : Using eth0:10.133.0.138<0>
my-dev-1:12728:12728 [0] MSCCLPP INFO rank 0 nranks 4 - connecting to 10.133.0.138<33005>
my-dev-1:12730:12730 [2] MSCCLPP INFO TcpBootstrap : Using eth0:10.133.0.138<0>
my-dev-1:12729:12729 [1] MSCCLPP INFO TcpBootstrap : Using eth0:10.133.0.138<0>
my-dev-1:12730:12730 [2] MSCCLPP INFO rank 2 nranks 4 - connecting to 10.133.0.138<33005>
my-dev-1:12729:12729 [1] MSCCLPP INFO rank 1 nranks 4 - connecting to 10.133.0.138<33005>
my-dev-1:12731:12731 [3] MSCCLPP INFO TcpBootstrap : Using eth0:10.133.0.138<0>
my-dev-1:12731:12731 [3] MSCCLPP INFO rank 3 nranks 4 - connecting to 10.133.0.138<33005>
my-dev-1:12731:12731 [3] MSCCLPP INFO rank 3 - unix socket server started
my-dev-1:12728:12728 [0] MSCCLPP INFO rank 0 - unix socket server started
my-dev-1:12730:12730 [2] MSCCLPP INFO rank 2 - unix socket server started
my-dev-1:12729:12729 [1] MSCCLPP INFO rank 1 - unix socket server started
[rank0]:[W1021 13:58:19.822731766 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
W1021 13:58:20.229000 12663 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 12728 closing signal SIGTERM
W1021 13:58:20.229000 12663 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 12731 closing signal SIGTERM
E1021 13:58:20.643000 12663 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 12729) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
test/torch/correctness_test.py FAILED
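For anyone reading the log above: the launcher reports `exitcode: -6` for local_rank 1. By the usual Python convention a negative exit code means the child was killed by that signal number, and signal 6 is SIGABRT, which is consistent with the `terminate called without an active exception` lines (the C++ runtime aborting). A minimal sketch for decoding such codes follows; `describe_exitcode` is a hypothetical helper name, not part of torch or mscclpp.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Turn a child exit code into a readable description.

    Assumes the convention used by Python's process APIs:
    a negative value -N means "terminated by signal N".
    """
    if exitcode < 0:
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # The value reported for pid 12729 in the log above.
    print(describe_exitcode(-6))
```

This only names the signal. It does not say which component called abort, so a native backtrace from the failing rank would still be needed to locate the cause.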
I also tried invoking it with 'LD_PRELOAD=/usr/local/lib/python3.12/dist-packages/mscclpp/lib/libmscclpp_nccl.so' instead of LD_AUDIT and got the same error.
By the way, the mpirun tests pass on my H200. I could not find any torchrun test in .azure-pipelines/nccl-api-test.yaml, so is torchrun unsupported? If so, when will it be supported?
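One detail that may be worth ruling out: the documented setup exports both LD_AUDIT and LD_LIBRARY_PATH, while the failing command, at least as shown on its command line, sets only LD_AUDIT. Whether that matters here is not established. The sketch below builds an environment that mirrors the two documented exports so both variables are always set together; `mscclpp_env` is a hypothetical helper, and the install directory is simply the directory portion of the library path quoted in this report.

```python
import os


def mscclpp_env(install_dir, base_env=None):
    """Return a copy of the environment with the two documented variables set.

    Mirrors:
      export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
      export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_AUDIT"] = os.path.join(install_dir, "libmscclpp_audit_nccl.so")
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend the install dir; skip the trailing ':' when nothing was set before.
    env["LD_LIBRARY_PATH"] = install_dir + (":" + existing if existing else "")
    return env


if __name__ == "__main__":
    # Directory taken from the library path in this report.
    env = mscclpp_env("/usr/local/lib/python3.12/dist-packages/mscclpp/lib")
    print(env["LD_AUDIT"])
    print(env["LD_LIBRARY_PATH"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching torchrun, which keeps the settings scoped to that one launch rather than the whole shell session.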