TL/MLX5: add device mem mcast bcast #989

Open

MamziB wants to merge 2 commits into master

Conversation

MamziB
Collaborator

@MamziB MamziB commented Jun 13, 2024

TL/MLX5: add device mem mcast bcast

Collaborator

@samnordmann samnordmann left a comment


Thanks! Can you please address the following comments?

  1. Can you add some context and an explanation of the "what", "why", and "how" of this PR? Also, IMHO, using the term "CUDA memory" would be more explicit.
  2. We have in UCC a component, "mc", that is an interface to the different memory types and provides alloc, free, memcpy, memset, etc. We should use it instead; it provides better performance and cleaner code.
  3. Please make sure building UCC is possible even without CUDA support.
  4. Please fix the CI issues.
  5. Can you add a test for this feature?

Review threads (outdated, resolved) on:
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c (3 threads)
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h
src/components/tl/mlx5/tl_mlx5.c
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h
@MamziB
Collaborator Author

MamziB commented Jun 17, 2024

Thanks! Can you please address the following comments?

1. Can you add some context and an explanation of the "what", "why", and "how" of this PR? Also, IMHO, using the term "CUDA memory" would be more explicit.

If GPU Direct RDMA is available, we can directly call ibv_post_send()/ibv_post_recv() using GPU buffers. Therefore, if this feature is enabled, we pre-post GPU buffers into our receive queue instead of CPU buffers. This makes it possible to receive multicast packets directly into GPU memory without additional copies or staging.
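
Roughly, the idea is something like the following sketch (not the actual code in this PR; it assumes GPUDirect RDMA, i.e. nvidia-peermem, is loaded so that ibv_reg_mr() can register device memory, and the helper name post_gpu_recv_buffer() is made up for illustration):

#include <stdint.h>
#include <string.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

static int post_gpu_recv_buffer(struct ibv_pd *pd, struct ibv_qp *qp, size_t len)
{
    void              *gpu_buf;
    struct ibv_mr     *mr;
    struct ibv_sge     sge;
    struct ibv_recv_wr wr, *bad_wr;

    /* Allocate the receive buffer in device memory instead of host memory. */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        return -1;
    }

    /* With GPUDirect RDMA the HCA can DMA straight into this buffer.
     * For a UD (multicast) QP the first 40 bytes of each received
     * packet hold the GRH. */
    mr = ibv_reg_mr(pd, gpu_buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (mr == NULL) {
        cudaFree(gpu_buf);
        return -1;
    }

    sge.addr   = (uintptr_t)gpu_buf;
    sge.length = (uint32_t)len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = (uintptr_t)gpu_buf;
    wr.sg_list = &sge;
    wr.num_sge = 1;

    /* Multicast packets arriving on this QP now land directly in GPU memory. */
    return ibv_post_recv(qp, &wr, &bad_wr);
}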

We have in UCC a component, "mc", that is an interface to the different memory types and provides alloc, free, memcpy, memset, etc. We should use it instead; it provides better performance and cleaner code.
Please refer to #989 (comment).

Please make sure building UCC is possible even without CUDA support.
Sure, will do.

Please fix the CI issues.
Sure, will do.

Can you add a test for this feature?
Sure, I will open a new PR for it.

@MamziB
Collaborator Author

MamziB commented Jun 17, 2024

Hi @samnordmann, thanks for the constructive comments. I have added a new commit. Please let me know if you have more comments.

@MamziB MamziB self-assigned this Jun 17, 2024
@samnordmann
Collaborator

Thanks! Can you please address the following comments?

Can you add some context and an explanation of the "what", "why", and "how" of this PR? Also, IMHO, using the term "CUDA memory" would be more explicit.

If GPU Direct RDMA is available, we can directly call ibv_post_send()/ibv_post_recv() using GPU buffers. Therefore, if this feature is enabled, we pre-post GPU buffers into our receive queue instead of CPU buffers. This makes it possible to receive multicast packets directly into GPU memory without additional copies or staging.

So this adds support for the CUDA memory type for the user's buffer? I don't understand, since the CUDA memory type is supposed to be supported already, as indicated here:

ucc_status_t ucc_tl_mlx5_team_get_scores(ucc_base_team_t * tl_team,

We have in UCC a component, "mc", that is an interface to the different memory types and provides alloc, free, memcpy, memset, etc. We should use it instead; it provides better performance and cleaner code.

Please refer to #989 (comment).

I am sorry, but I don't understand why you think it is better not to use the mc component. This component exists exactly for this purpose and is used everywhere in the codebase. Using the component has many benefits (which I can list if needed), while I fail to see any concrete benefit of not using it.

Please make sure building UCC is possible even without CUDA support.

Sure, will do.

This comment is not addressed. Every CUDA API call should be wrapped in the appropriate compiler guard (another advantage of using mc). This is related to the error seen in the CI. The program should compile with a configure command of the form configure --with-tls=ucp,mlx5 --without-cuda; otherwise it will be rejected by the tests.
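
Something along these lines (a minimal sketch; HAVE_CUDA is the assumed configure-generated macro and alloc_staging() is a made-up helper, so the actual names in the tree may differ):

#include <stdlib.h>
#if HAVE_CUDA
#include <cuda_runtime.h>
#endif

/* Allocate a staging buffer in device memory when CUDA was enabled at
 * configure time, and fall back to host memory otherwise, so that
 * "configure --with-tls=ucp,mlx5 --without-cuda" still builds and links. */
static int alloc_staging(void **buf, size_t len, int want_gpu)
{
#if HAVE_CUDA
    if (want_gpu) {
        return (cudaMalloc(buf, len) == cudaSuccess) ? 0 : -1;
    }
#else
    (void)want_gpu; /* CUDA disabled: the GPU path is compiled out entirely. */
#endif
    *buf = malloc(len);
    return (*buf != NULL) ? 0 : -1;
}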

Please fix the CI issues.

Sure, will do.

There are still the same issues.

Can you add a test for this feature?

Sure, I will open a new PR for it.

OK, I think it is important to test this feature before merging it. This test should be triggered by the CI.

@samnordmann samnordmann self-requested a review June 18, 2024 12:08
Collaborator

@samnordmann samnordmann left a comment


Thanks. Some comments still need to be addressed.

@MamziB
Collaborator Author

MamziB commented Jul 9, 2024

@samnordmann thanks for the constructive comments. Please see the new commit.

@samnordmann samnordmann self-requested a review July 10, 2024 13:00
Collaborator

@samnordmann samnordmann left a comment


Thanks! Can you please address the following comments?
Can you add some context and an explanation of the "what", "why", and "how" of this PR? Also, IMHO, using the term "CUDA memory" would be more explicit.

If GPU Direct RDMA is available, we can directly call ibv_post_send()/ibv_post_recv() using GPU buffers. Therefore, if this feature is enabled, we pre-post GPU buffers into our receive queue instead of CPU buffers. This makes it possible to receive multicast packets directly into GPU memory without additional copies or staging.

So this adds support for the CUDA memory type for the user's buffer? I don't understand, since the CUDA memory type is supposed to be supported already, as indicated here:

ucc_status_t ucc_tl_mlx5_team_get_scores(ucc_base_team_t * tl_team,

Can you please help me understand this? I still don't understand the motivation for this PR, or what would happen today if a user used mcast with a GPU buffer (which, as pointed out above, is already enabled). Please provide a description.

We have in UCC a component, "mc", that is an interface to the different memory types and provides alloc, free, memcpy, memset, etc. We should use it instead; it provides better performance and cleaner code.

Please refer to #989 (comment).

I am sorry, but I don't understand why you think it is better not to use the mc component. This component exists exactly for this purpose and is used everywhere in the codebase. Using the component has many benefits (which I can list if needed), while I fail to see any concrete benefit of not using it.

Thanks for replacing the CUDA memory calls with mc calls. However, the goal is also to use the mc component for the CPU memory calls; it would remove a lot of duplication. Can you please make this change?
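
For example, something like this (a rough sketch; the signatures are from components/mc/ucc_mc.h as far as I remember them, and alloc_and_fill() is just an illustrative name, so please double-check against the headers):

#include "components/mc/ucc_mc.h"

/* One code path for both host and CUDA buffers: the memory type is a
 * runtime parameter, so the mcast code needs no #ifdef or duplicated
 * alloc/copy logic. */
static ucc_status_t alloc_and_fill(ucc_memory_type_t mt, size_t len,
                                   const void *host_src,
                                   ucc_mc_buffer_header_t **hdr)
{
    ucc_status_t status;

    status = ucc_mc_alloc(hdr, len, mt);
    if (status != UCC_OK) {
        return status;
    }

    /* mc dispatches to the right copy engine based on the memory types. */
    return ucc_mc_memcpy((*hdr)->addr, host_src, len, mt,
                         UCC_MEMORY_TYPE_HOST);
}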

Please fix the CI issues.

Sure, will do.

There are still the same issues.

The CI is still red. Can you please rebase this PR and re-run the CI so we can check?

Can you add a test for this feature?

Sure, I will open a new PR for it.

OK, I think it is important to test this feature before merging it. This test should be triggered by the CI.

Can you please add these tests?

Review threads (outdated, resolved) on:
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c (2 threads)
src/components/tl/mlx5/tl_mlx5.c
@MamziB
Collaborator Author

MamziB commented Jul 11, 2024

@samnordmann, can you please take a look at the new commit?

With the new design changes, we have a performance issue: if I use ucc_mc_memcpy(), I get much worse performance than calling cudaMemcpy() directly. Please see below:

cudaMemcpy:

[1,0]<stdout>:# OSU MPI-CUDA Broadcast Latency Test v7.2
[1,0]<stdout>:# Datatype: MPI_CHAR.
[1,0]<stdout>:# Size       Avg Latency(us)
[1,0]<stdout>:1                       6.50
[1,0]<stdout>:2                       6.48
[1,0]<stdout>:4                       6.45
[1,0]<stdout>:8                       6.41
[1,0]<stdout>:16                      6.37
[1,0]<stdout>:32                      6.35
[1,0]<stdout>:64                      6.17
[1,0]<stdout>:128                     6.39
[1,0]<stdout>:256                     6.46
[1,0]<stdout>:512                     6.63
[1,0]<stdout>:1024                    6.61
[1,0]<stdout>:2048                    6.64
[1,0]<stdout>:4096                   10.49
[1,0]<stdout>:8192                   15.14
[1,0]<stdout>:16384                  32.25
[1,0]<stdout>:32768                  37.82
[1,0]<stdout>:65536                  84.40
[1,0]<stdout>:131072                180.79
[1,0]<stdout>:262144                470.13
[1,0]<stdout>:524288                949.81
[1,0]<stdout>:1048576              1956.74

ucc_mc_memcpy():

[1,0]<stdout>:# OSU MPI-CUDA Broadcast Latency Test v7.2
[1,0]<stdout>:# Datatype: MPI_CHAR.
[1,0]<stdout>:# Size       Avg Latency(us)
[1,0]<stdout>:1                     365.79
[1,0]<stdout>:2                     360.71
[1,0]<stdout>:4                     361.11
[1,0]<stdout>:8                     360.60
[1,0]<stdout>:16                    363.05
[1,0]<stdout>:32                    361.96
[1,0]<stdout>:64                    361.18
[1,0]<stdout>:128                   362.93
[1,0]<stdout>:256                   360.26
[1,0]<stdout>:512                   359.99
[1,0]<stdout>:1024                  362.55
[1,0]<stdout>:2048                  359.93
[1,0]<stdout>:4096                  679.74
[1,0]<stdout>:8192                 1200.42
[1,0]<stdout>:16384                2305.80
[1,0]<stdout>:32768                4513.86
[1,0]<stdout>:65536                6907.43
[1,0]<stdout>:131072              13578.23
[1,0]<stdout>:262144              26846.44
[1,0]<stdout>:524288              53841.84
[1,0]<stdout>:1048576            107402.99

@MamziB
Collaborator Author

MamziB commented Jul 11, 2024

So the performance gap comes from ucc_mc_memcpy() using an asynchronous cudaMemcpy under the hood, and such a copy involves overhead related to managing CUDA streams. We do not require an asynchronous copy in our design, so I added a new mc function that uses a synchronous cudaMemcpy and used it in our design.
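
In other words, something along these lines (a simplified sketch; the actual mc entry point added in the commit may be named differently):

#include <stddef.h>
#include <cuda_runtime.h>

/* Blocking copy that bypasses stream/event management entirely.
 * cudaMemcpyDefault infers the direction from the pointer types
 * (requires unified virtual addressing) and returns only once the
 * copy has completed. */
static int mcast_memcpy_sync(void *dst, const void *src, size_t len)
{
    return (cudaMemcpy(dst, src, len, cudaMemcpyDefault) == cudaSuccess) ? 0 : -1;
}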

@MamziB
Collaborator Author

MamziB commented Jul 11, 2024

@samnordmann Thanks for the constructive comments. I addressed all your comments and fixed the performance issues. Please take a look at the updated commit.

@MamziB
Collaborator Author

MamziB commented Jul 11, 2024

@Sergei-Lebedev, can you also please let me know if you have any comments on this PR? I have already addressed all of Sam's comments.

Collaborator

@samnordmann samnordmann left a comment


Thanks!
