Using COLLNET failed with sharp plugin #1219

Open
shanleo2024 opened this issue Mar 12, 2024 · 15 comments
Comments

@shanleo2024

shanleo2024 commented Mar 12, 2024

Hi, I ran all_reduce_perf with the SHARP plugin, using these two parameters:
-x NCCL_COLLNET_ENABLE=1 \
-x NCCL_ALGO=CollNet \

The sharp_am service is running on the OpenSM master node:

service sharp_am status
Redirecting to /bin/systemctl status sharp_am.service
● sharp_am.service - SHARP Aggregation Manager (sharp_am). Version: 3.0.0
Loaded: loaded (/etc/systemd/system/sharp_am.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/sharp_am.service.d
└─Service.conf
Active: active (running) since Fri 2024-03-15 09:51:55 CST; 12min ago
Main PID: 17459 (sharp_am)
Tasks: 40 (limit: 26213)
Memory: 26.3M
CGroup: /system.slice/sharp_am.service
└─17459 /opt/hpc/software/mpi/hpcx/v2.12.0/sharp/bin/sharp_am -O -/opt/hpc/software/mpi/hpcx/v2.12.0/sharp/conf/sharp_am.cfg

From the SHARP log, I can see that SHARP was initialized successfully.
However, there are error logs when running the test with SHARP:

[C25L19:1:28576 unique id 1201360626793454896] ERROR collect_ports_data: device mlx5_0 port 1 is not valid (port is used by SM)
[C25L19:1:28576 unique id 1201360626793454896] ERROR collect_ports_data: failed to find valid ports
[C25L19:1:28576 unique id 1201360626793454896] ERROR sharp_get_local_data: error retrieving local data for process number 1
[C25L19:1:28576 - context.c:415] ERROR sharp_get_local_data failed: Could not open any IB device(-47)
[C25L18:0:29488 - context.c:433] ERROR OOB Gather failed on comm world, ret:3. rank:0
[C25L18:0:29488 - context.c:635] ERROR empty proceseses data ..
[C25L18:0:29491 - context.c:702] INFO job (ID: 1201360608447302825) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[C25L18:0:29491 - context.c:895] INFO sharp_job_id:28 resv_key: tree_type:LLT tree_idx:0 treeID:0x1 caps:0x26 quota:(osts:23 user_data_per_ost:1024 max_groups:23 max_qps:1 max_group_channels:1)
[C25L18:0:29491 - context.c:899] INFO sharp_job_id:28 tree_type:SAT tree_idx:1 treeID:0x40 caps:0x36
[C25L18:0:29491 - comm.c:403] INFO [group#:0] job_id:28 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[C25L18:0:29491 - comm.c:403] INFO [group#:1] job_id:28 group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0

Can you help me understand why this error occurs and how to resolve it?
Thanks a lot.

@shanleo2024
Author

The log shows: device mlx5_0 port 1 is not valid (port is used by SM)
Does this mean that if mlx5_0 is used by the SM, it cannot be used by SHARP at the same time?
I filtered out mlx5_0 using NCCL_IB_HCA, and the error logs disappeared.
I have also read issue #320.
Is my understanding correct?
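
For readers who want to reproduce the filtering step, here is a minimal Python sketch. It assumes the SHARP error line has exactly the format quoted in the log above, and it builds an exclusion value with the `^=` (exclude, exact match) prefix described in the NCCL documentation for NCCL_IB_HCA. The helper names `devices_used_by_sm` and `nccl_ib_hca_exclusion` are hypothetical and not part of NCCL or SHARP; the thread does not say which NCCL_IB_HCA value was actually used.

```python
import re

# Matches the SHARP error line quoted in this thread (format assumed from that log).
SM_PORT_RE = re.compile(r"device (\S+) port (\d+) is not valid \(port is used by SM\)")


def devices_used_by_sm(log_text):
    """Return the sorted, de-duplicated device names that SHARP flagged as SM ports."""
    return sorted({m.group(1) for m in SM_PORT_RE.finditer(log_text)})


def nccl_ib_hca_exclusion(devices):
    """Build an NCCL_IB_HCA value that excludes the given devices by exact name."""
    if not devices:
        return None  # nothing to exclude, leave NCCL_IB_HCA unset
    return "^=" + ",".join(devices)


sample = (
    "[C25L19:1:28576 unique id 1201360626793454896] ERROR collect_ports_data: "
    "device mlx5_0 port 1 is not valid (port is used by SM)"
)
print(nccl_ib_hca_exclusion(devices_used_by_sm(sample)))
```

The printed value could then be passed as `-x NCCL_IB_HCA=...`. Note that excluding a device this way removes it from NCCL's network traffic entirely, not only from SHARP.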

@AddyLaddy
Collaborator

Maybe try adding -x SHARP_ALLOW_SM_PORT=1

@shanleo2024
Author

I tried the environment variable: -x SHARP_ALLOW_SM_PORT=1
But errors still occur:

[C25L18:0:6928 - context.c:695] INFO job (ID: 1201360872013076329) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[C25L18:0:6928 - context.c:885] INFO sharp_job_id:63 resv_key: tree_type:LLT tree_idx:0 treeID:0x1 caps:0x26 quota:(osts:23 user_data_per_ost:1024 max_groups:23 max_qps:1 max_group_channels:1)
[C25L18:0:6928 - context.c:889] INFO sharp_job_id:63 tree_type:SAT tree_idx:1 treeID:0x40 caps:0x36
[C25L18:0:6928 - comm.c:370] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)
[C25L19:1:10202 - comm.c:370] ERROR Failed to lock SAT tree(ID:0x40 ret:0x4)

This looks like the same issue I raised here:
Mellanox/nccl-rdma-sharp-plugins#151

Thank you in advance for your help.

@shanleo2024
Author

Hi @AddyLaddy,
Do you have any other suggestions?
Thank you in advance.

@shanleo2024
Author

The error log shows that sharp_coll_comm_init fails in ncclSharpConnect:
C25L19:24661:24716 [3] sharp_plugin.c:320 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)

What does this log mean? I can run sharp_hello successfully.

[root@C25L18 device]# $HPCX_SHARP_DIR/bin/sharp_hello -d mlx5_0:1 -v 3

[C25L18:0:10352 - context.c:695] INFO job (ID: 1201355100769761442) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)

[C25L18:0:10352 - context.c:885] INFO sharp_job_id:15 resv_key: tree_type:LLT tree_idx:0 treeID:0x1 caps:0x26 quota:(osts:153 user_data_per_ost:1024 max_groups:153 max_qps:1 max_group_channels:1)

[C25L18:0:10352 - comm.c:397] INFO [group#:0] job_id:15 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x10000000000) mlid:c009

Test Passed.

[root@C25L18 device]#

Can anyone help? Many thanks!

@l00010728

I have hit the same problem. Nodes with eight 400G IB HCAs are connected to one IB switch. I used sharp_hello to test all HCAs, and only the first HCA showed the following error:
[image]

However, I can run nccl-tests by adding -x SHARP_ALLOW_SM_PORT=1.

@shanleo2024
Author

I can run sharp_hello successfully.

[root@C25L18 device]# $HPCX_SHARP_DIR/bin/sharp_hello -d mlx5_0:1 -v 3

[C25L18:0:10352 - context.c:695] INFO job (ID: 1201355100769761442) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)

[C25L18:0:10352 - context.c:885] INFO sharp_job_id:15 resv_key: tree_type:LLT tree_idx:0 treeID:0x1 caps:0x26 quota:(osts:153 user_data_per_ost:1024 max_groups:153 max_qps:1 max_group_channels:1)

[C25L18:0:10352 - comm.c:397] INFO [group#:0] job_id:15 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x10000000000) mlid:c009

Test Passed.

[root@C25L18 device]#

@RyoYang

RyoYang commented Apr 10, 2024

I hit the same issue in both distributed training and nccl-tests when adding these SHARP parameters:

        -x NCCL_COLLNET_ENABLE=1 \
        -x SHARP_COLL_ENABLE_SAT=1 \
        -x SHARP_COLL_LOG_LEVEL=3 \
        -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \

Issues:

[vmss000000:0:5803 - tree_ops.c:312] ERROR Failed to lock SAT tree(ID:0x41. ret:0x4)
[vmss000000:0:5803 - comm.c:379] ERROR Failed to lock SAT tree(ID:0x41 ret:0xffffffee)

vmss000001:6566:7295 [7] sharp_plugin.c:351 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)


vmss000002:7005:7494 [7] sharp_plugin.c:351 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)


vmss000000:5803:6295 [7] sharp_plugin.c:351 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)


vmss000003:6522:7256 [7] sharp_plugin.c:351 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)

All the nodes can pass sharp_hello tests.
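
The two return codes in these logs appear to be the same value in different notations: `0xffffffee` read as a signed 32-bit integer is -18, which matches the plugin's "Streaming Tree lock failed (-18)". A minimal Python sketch of that conversion follows; `to_int32` is a hypothetical helper name, and the sketch does not say what -18 means inside SHARP.

```python
def to_int32(value):
    """Interpret an integer as a two's-complement signed 32-bit value."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    if value & 0x80000000:  # sign bit set, so the value is negative
        return value - 0x100000000
    return value


print(to_int32(0xFFFFFFEE))  # -18, as reported by sharp_plugin.c
print(to_int32(0x4))         # 4, as reported by tree_ops.c
```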

@shanleo2024
Author

This seems to be the same issue: the error occurs when calling sharp_coll_comm_init in ncclSharpConnect.

@Lzhang-hub

I have the same issue. I use the same Docker image to run nccl-tests and a training job. nccl-tests runs normally, but the training job produces this NCCL log:

[node:0:174 - context.c:857][2024-05-30 12:26:47] INFO sharp_job_id:692  resv_key: tree_type:LLT tree_idx:0  treeID:8 caps:0x66 quota:(osts:23 user_data_per_ost:1024 max_groups:12 max_qps:1 max_group_channels:1)
[node:0:174 - context.c:872][2024-05-30 12:26:47] INFO sharp_job_id:692  tree_type:SAT tree_idx:1  treeID:520 caps:0x76
node:174:4368 [3] NCCL INFO SHARP rank 0/4 initialized on mlx5_4:1
[node:0:174 - tree_ops.c:314][2024-05-30 12:26:47] ERROR Failed to lock SAT tree(ID:0x208. ret:0x4)
[node:0:174 - comm.c:379][2024-05-30 12:26:47] ERROR Failed to lock SAT tree(ID:0x208 ret:0xffffffee)

node:174:4368 [3] sharp_plugin.c:354 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)

node:174:4359 [3] NCCL INFO transport.cc:327 -> 2
node:176:4360 [5] NCCL INFO CollNet 05/0 : 5 [receive] via COLLNET/SHARP/5/GDRDMA
[node:0:175 - context.c:660][2024-05-30 12:26:47] INFO job (ID: 35887842743275558) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[node:0:175 - context.c:857][2024-05-30 12:26:48] INFO sharp_job_id:693  resv_key: tree_type:LLT tree_idx:0  treeID:8 caps:0x66 quota:(osts:23 user_data_per_ost:1024 max_groups:12 max_qps:1 max_group_channels:1)
[node:0:175 - context.c:872][2024-05-30 12:26:48] INFO sharp_job_id:693  tree_type:SAT tree_idx:1  treeID:520 caps:0x76
node:175:4366 [4] NCCL INFO SHARP rank 0/4 initialized on mlx5_5:1
[node:0:175 - tree_ops.c:314][2024-05-30 12:26:48] ERROR Failed to lock SAT tree(ID:0x208. ret:0x4)
[node:0:175 - comm.c:379][2024-05-30 12:26:48] ERROR Failed to lock SAT tree(ID:0x208 ret:0xffffffee)

node:175:4366 [4] sharp_plugin.c:354 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)

node:175:4357 [4] NCCL INFO transport.cc:327 -> 2
node:177:4362 [6] NCCL INFO CollNet 06/0 : 6 [receive] via COLLNET/SHARP/6/GDRDMA
[node:0:176 - context.c:660][2024-05-30 12:26:48] INFO job (ID: 35887007889204304) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[node:0:176 - context.c:857][2024-05-30 12:26:49] INFO sharp_job_id:694  resv_key: tree_type:LLT tree_idx:0  treeID:8 caps:0x66 quota:(osts:23 user_data_per_ost:1024 max_groups:12 max_qps:1 max_group_channels:1)
[node:0:176 - context.c:872][2024-05-30 12:26:49] INFO sharp_job_id:694  tree_type:SAT tree_idx:1  treeID:520 caps:0x76
node:176:4372 [5] NCCL INFO SHARP rank 0/4 initialized on mlx5_6:1
[node:0:176 - tree_ops.c:314][2024-05-30 12:26:49] ERROR Failed to lock SAT tree(ID:0x208. ret:0x4)
[node:0:176 - comm.c:379][2024-05-30 12:26:49] ERROR Failed to lock SAT tree(ID:0x208 ret:0xffffffee)

node:176:4372 [5] sharp_plugin.c:354 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)

node:176:4360 [5] NCCL INFO transport.cc:327 -> 2
node:178:4361 [7] NCCL INFO CollNet 07/0 : 7 [receive] via COLLNET/SHARP/7/GDRDMA
[node:0:177 - context.c:660][2024-05-30 12:26:49] INFO job (ID: 35887715155457631) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[node:0:177 - context.c:857][2024-05-30 12:26:49] INFO sharp_job_id:695  resv_key: tree_type:LLT tree_idx:0  treeID:8 caps:0x66 quota:(osts:23 user_data_per_ost:1024 max_groups:12 max_qps:1 max_group_channels:1)
[node:0:177 - context.c:872][2024-05-30 12:26:49] INFO sharp_job_id:695  tree_type:SAT tree_idx:1  treeID:520 caps:0x76
node:177:4370 [6] NCCL INFO SHARP rank 0/4 initialized on mlx5_7:1
[node:0:177 - tree_ops.c:314][2024-05-30 12:26:49] ERROR Failed to lock SAT tree(ID:0x208. ret:0x4)
[node:0:177 - comm.c:379][2024-05-30 12:26:49] ERROR Failed to lock SAT tree(ID:0x208 ret:0xffffffee)

node:177:4370 [6] sharp_plugin.c:354 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)

node:177:4362 [6] NCCL INFO transport.cc:327 -> 2
[node:0:178 - context.c:660][2024-05-30 12:26:49] INFO job (ID: 35887593619058448) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[node:0:178 - context.c:857][2024-05-30 12:26:50] INFO sharp_job_id:696  resv_key: tree_type:LLT tree_idx:0  treeID:8 caps:0x66 quota:(osts:23 user_data_per_ost:1024 max_groups:12 max_qps:1 max_group_channels:1)
[node:0:178 - context.c:872][2024-05-30 12:26:50] INFO sharp_job_id:696  tree_type:SAT tree_idx:1  treeID:520 caps:0x76
node:178:4364 [7] NCCL INFO SHARP rank 0/4 initialized on mlx5_11:1
[node:0:178 - tree_ops.c:314][2024-05-30 12:26:50] ERROR Failed to lock SAT tree(ID:0x208. ret:0x4)
[node:0:178 - comm.c:379][2024-05-30 12:26:50] ERROR Failed to lock SAT tree(ID:0x208 ret:0xffffffee)

node:178:4364 [7] sharp_plugin.c:354 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)

node:178:4361 [7] NCCL INFO transport.cc:327 -> 2
node:178:4361 [7] NCCL INFO init.cc:718 -> 2
node:178:4361 [7] NCCL INFO NCCL_ALGO set by environment to nvls
node:176:4360 [5] NCCL INFO init.cc:718 -> 2
node:177:4362 [6] NCCL INFO init.cc:718 -> 2
node:177:4362 [6] NCCL INFO NCCL_ALGO set by environment to nvls
node:173:4363 [2] NCCL INFO init.cc:718 -> 2
node:175:4357 [4] NCCL INFO init.cc:718 -> 2
node:175:4357 [4] NCCL INFO NCCL_ALGO set by environment to nvls
node:174:4359 [3] NCCL INFO init.cc:718 -> 2
node:172:4358 [1] NCCL INFO init.cc:718 -> 2
node:172:4358 [1] NCCL INFO NCCL_ALGO set by environment to nvls
node:172:4358 [1] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
node:172:4358 [1] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer

node:171:4356 [0] transport.cc:356 NCCL WARN Cannot initialize CollNet, using point-to-point network instead

@Lzhang-hub

@RyoYang @shanleo1986 can you run nccl-tests correctly? I can run nccl-tests normally, but the training job fails with the error log above.

@AddyLaddy
Collaborator

NCCL INFO NCCL_ALGO set by environment to nvls
NVLS is NVLink SHARP, which is an intra-node-only algorithm. You should not set NCCL_ALGO unless you are doing fine-grained investigation or benchmarking. NCCL automatically chooses the best ALGO/PROTO based on the job size, topology, network, and message size.
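
The advice above can be turned into a small pre-launch check. The following is a minimal Python sketch, not an NCCL feature: `check_nccl_algo` is a hypothetical helper that inspects an environment mapping and returns human-readable warnings. The NVLS/CollNet rule encodes only what the comment above states.

```python
def check_nccl_algo(env):
    """Return warnings about NCCL_ALGO for the given environment mapping."""
    warnings = []
    algo = env.get("NCCL_ALGO")
    if algo is None:
        return warnings  # NCCL will choose the algorithm itself
    warnings.append(
        f"NCCL_ALGO={algo} is set; unset it unless you are benchmarking."
    )
    # Per the comment above, NVLS is intra-node only, so it conflicts with CollNet.
    if algo.strip().lower() == "nvls" and env.get("NCCL_COLLNET_ENABLE") == "1":
        warnings.append(
            "NCCL_ALGO=nvls is intra-node only, but NCCL_COLLNET_ENABLE=1 requests CollNet."
        )
    return warnings


for line in check_nccl_algo({"NCCL_ALGO": "nvls", "NCCL_COLLNET_ENABLE": "1"}):
    print(line)
```

In a real job the mapping would typically be `os.environ`, checked before launching the training process.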

@Lzhang-hub

I unset NCCL_ALGO and got the same error.
Besides, I can run nccl-tests with the same env and get a busbw of 480 GB/s, so I think it used SHARP successfully. But the training job, with the same image and NCCL env, produces this error log.
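
When a benchmark works but a training job does not, one way to confirm that both really see the same settings is to dump and compare the relevant variables from inside each process. A minimal Python sketch follows, assuming the variables of interest all start with `NCCL_` or `SHARP_`; `relevant` and `diff_env` are hypothetical helper names.

```python
PREFIXES = ("NCCL_", "SHARP_")  # assumption: only these prefixes matter here


def relevant(env):
    """Keep only the NCCL_* and SHARP_* entries of an environment mapping."""
    return {k: v for k, v in env.items() if k.startswith(PREFIXES)}


def diff_env(env_a, env_b):
    """Return {name: (value_in_a, value_in_b)} for every variable that differs."""
    a, b = relevant(env_a), relevant(env_b)
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(set(a) | set(b))
        if a.get(k) != b.get(k)
    }


bench = {"NCCL_COLLNET_ENABLE": "1", "PATH": "/usr/bin"}
train = {"NCCL_COLLNET_ENABLE": "1", "NCCL_ALGO": "nvls", "PATH": "/bin"}
print(diff_env(bench, train))
```

Each mapping could be captured by writing `dict(os.environ)` to a file from one rank of each job. A variable that is missing on one side shows up with `None` for that side.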

@Lzhang-hub

Update:
I found a SAT-related environment variable.
I set SHARP_COLL_SAT_LOCK_BATCH_SIZE=1 and the error disappeared, but busbw is low.
[image]

@shanleo2024
Author

I can run IB SHARP with NCCL_IB_HCA=mlx5_0, but it does not work with multiple IB cards.
