[Bug] mscclpp can not replace nccl in torchrun cases #654

@zhangandy0727-jpg

According to the documentation, mscclpp can replace NCCL under torchrun like this:

export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
torchrun --nnodes=1 --nproc_per_node=8 your_script.py

In practice, however, the run fails:

NCCL_DEBUG=TRACE LD_AUDIT=/usr/local/lib/python3.12/dist-packages/mscclpp/lib/libmscclpp_audit_nccl.so MSCCLPP_DEBUG=TRACE torchrun --master_port=29503 --nnodes=1 --nproc_per_node=4 test/torch/correctness_test.py --collective allreduce --nelem 1048576 --dtype fp16
W1021 13:58:14.016000 12663 torch/distributed/run.py:792]
W1021 13:58:14.016000 12663 torch/distributed/run.py:792] *****************************************
W1021 13:58:14.016000 12663 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1021 13:58:14.016000 12663 torch/distributed/run.py:792] *****************************************
my-dev-1:12728:12728 [0] MSCCLPP INFO TcpBootstrap : Using eth0:10.133.0.138<0>
my-dev-1:12728:12728 [0] MSCCLPP INFO TcpBootstrap : Using eth0:10.133.0.138<0>
my-dev-1:12728:12728 [0] MSCCLPP INFO rank 0 nranks 4 - connecting to 10.133.0.138<33005>
my-dev-1:12730:12730 [2] MSCCLPP INFO TcpBootstrap : Using eth0:10.133.0.138<0>
my-dev-1:12729:12729 [1] MSCCLPP INFO TcpBootstrap : Using eth0:10.133.0.138<0>
my-dev-1:12730:12730 [2] MSCCLPP INFO rank 2 nranks 4 - connecting to 10.133.0.138<33005>
my-dev-1:12729:12729 [1] MSCCLPP INFO rank 1 nranks 4 - connecting to 10.133.0.138<33005>
my-dev-1:12731:12731 [3] MSCCLPP INFO TcpBootstrap : Using eth0:10.133.0.138<0>
my-dev-1:12731:12731 [3] MSCCLPP INFO rank 3 nranks 4 - connecting to 10.133.0.138<33005>
my-dev-1:12731:12731 [3] MSCCLPP INFO rank 3 - unix socket server started
my-dev-1:12728:12728 [0] MSCCLPP INFO rank 0 - unix socket server started
my-dev-1:12730:12730 [2] MSCCLPP INFO rank 2 - unix socket server started
my-dev-1:12729:12729 [1] MSCCLPP INFO rank 1 - unix socket server started
[rank0]:[W1021 13:58:19.822731766 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
W1021 13:58:20.229000 12663 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 12728 closing signal SIGTERM
W1021 13:58:20.229000 12663 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 12731 closing signal SIGTERM
E1021 13:58:20.643000 12663 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 12729) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

test/torch/correctness_test.py FAILED
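
For reading the log above: torchrun reports `exitcode: -6` for local_rank 1. A negative exit code from a Python-launched child normally means the process was killed by the signal with that number, and signal 6 is SIGABRT on Linux. That is consistent with the "terminate called without an active exception" lines, which the C++ runtime prints just before aborting. It does not by itself say which library triggered the abort. A small sketch for decoding such codes (the helper name `describe_exit_code` is made up for illustration):

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a child exit code as reported by torchrun/subprocess.

    Negative values mean "terminated by signal -exitcode";
    non-negative values are ordinary exit statuses.
    """
    if exitcode < 0:
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


# The value reported in the log above.
print(describe_exit_code(-6))
```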

I also tried 'LD_PRELOAD=/usr/local/lib/python3.12/dist-packages/mscclpp/lib/libmscclpp_nccl.so' instead of LD_AUDIT and got the same error.
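
To narrow this down, it may help to check which NCCL-like shared objects each rank actually has mapped. The following is only a diagnostic sketch, assuming Linux and the usual `/proc/self/maps` layout; the helper name `mapped_libraries` and the keyword list are my own choices, not part of mscclpp or PyTorch. It will not list an NCCL that is statically linked into another library.

```python
from pathlib import Path


def mapped_libraries(keywords, maps_text=None):
    """Return sorted paths of mapped files whose file name contains a keyword.

    maps_text defaults to the contents of /proc/self/maps (Linux only);
    passing it explicitly makes the function usable on saved output.
    """
    if maps_text is None:
        maps_text = Path("/proc/self/maps").read_text()
    found = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        name = path.rsplit("/", 1)[-1]
        if any(keyword in name for keyword in keywords):
            found.add(path)
    return sorted(found)


# Example use inside the test script, once the process group is initialized:
#   print(mapped_libraries(("nccl", "mscclpp")))
```

If only the stock NCCL library shows up in a rank, the interposition did not take effect in that process. If both show up, the abort more likely happens after mscclpp is already in use.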

By the way, the mpirun tests pass on my H200. I could not find any torchrun invocation in .azure-pipelines/nccl-api-test.yaml. Does that mean torchrun is currently unsupported? If so, when is support planned?
