TL/MLX5: add device mem mcast bcast #989

Open

MamziB wants to merge 2 commits into master

Conversation

MamziB
Collaborator

@MamziB MamziB commented Jun 13, 2024

TL/MLX5: add device mem mcast bcast

Collaborator

@samnordmann samnordmann left a comment


Thanks! Can you please address the following comments?

  1. Can you add some context and an explanation of the "what", "why", and "how" of this PR? Also, IMHO, using the term "CUDA memory" would be more explicit.
  2. We have in UCC a component, "mc", that is an interface to the different memory types and provides alloc, free, memcpy, memset, etc. We should use it instead; it provides better performance and cleaner code.
  3. Please make sure building UCC is possible even without CUDA support.
  4. Please fix the CI issues.
  5. Can you add a test for this feature?

Review threads (outdated, resolved) on:
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c (3 threads)
src/components/tl/mlx5/mcast/tl_mlx5_mcast_helper.h
src/components/tl/mlx5/tl_mlx5.c
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h
@MamziB
Collaborator Author

MamziB commented Jun 17, 2024

Thanks! Can you please address the following comments?

1. Can you add some context and an explanation of the "what", "why", and "how" of this PR? Also, IMHO, using the term "CUDA memory" would be more explicit.

If GPU Direct RDMA is available, we can directly call ibv_post_send()/ibv_post_recv() using GPU buffers. Therefore, if this feature is enabled, we pre-post GPU buffers into our receive queue instead of CPU buffers. This makes it possible to receive multicast packets directly into GPU memory without additional copies or staging.
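
Roughly, the idea is something like the following sketch (not the actual code in this PR; it assumes GPUDirect RDMA, i.e. nvidia-peermem, is loaded so that ibv_reg_mr() can register device memory, and the helper name post_gpu_recv_buffer() is made up for illustration):

#include <stdint.h>
#include <string.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

static int post_gpu_recv_buffer(struct ibv_pd *pd, struct ibv_qp *qp, size_t len)
{
    void              *gpu_buf;
    struct ibv_mr     *mr;
    struct ibv_sge     sge;
    struct ibv_recv_wr wr, *bad_wr;

    /* Allocate the receive buffer in device memory instead of host memory. */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        return -1;
    }

    /* With GPUDirect RDMA the HCA can DMA straight into this buffer.
     * For a UD (multicast) QP the first 40 bytes of each received
     * packet hold the GRH. */
    mr = ibv_reg_mr(pd, gpu_buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (mr == NULL) {
        cudaFree(gpu_buf);
        return -1;
    }

    sge.addr   = (uintptr_t)gpu_buf;
    sge.length = (uint32_t)len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = (uintptr_t)gpu_buf;
    wr.sg_list = &sge;
    wr.num_sge = 1;

    /* Multicast packets arriving on this QP now land directly in GPU memory. */
    return ibv_post_recv(qp, &wr, &bad_wr);
}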

We have in UCC a component, "mc", that is an interface to the different memory types and provides alloc, free, memcpy, memset, etc. We should use it instead; it provides better performance and cleaner code.
Please refer to #989 (comment).

Please make sure building UCC is possible even without CUDA support.
Sure, will do.

Please fix the CI issues.
Sure, will do.

Can you add a test for this feature?
Sure, I will open a new PR for it.

@MamziB
Collaborator Author

MamziB commented Jun 17, 2024

Hi @samnordmann, thanks for the constructive comments. I have added a new commit. Please let me know if you have more comments.

@MamziB MamziB self-assigned this Jun 17, 2024
@samnordmann
Collaborator

Thanks! Can you please address the following comments?

Can you add some context and an explanation of the "what", "why", and "how" of this PR? Also, IMHO, using the term "CUDA memory" would be more explicit.

If GPU Direct RDMA is available, we can directly call ibv_post_send()/ibv_post_recv() using GPU buffers. Therefore, if this feature is enabled, we pre-post GPU buffers into our receive queue instead of CPU buffers. This makes it possible to receive multicast packets directly into GPU memory without additional copies or staging.

So this adds support for the CUDA memory type for the user's buffer? I don't understand, since the CUDA memory type is supposed to be supported already, as indicated here:

ucc_status_t ucc_tl_mlx5_team_get_scores(ucc_base_team_t * tl_team,

We have in UCC a component, "mc", that is an interface to the different memory types and provides alloc, free, memcpy, memset, etc. We should use it instead; it provides better performance and cleaner code.

Please refer to #989 (comment).

I am sorry, but I don't understand why you think it is better not to use the mc component. This component exists exactly for this purpose and is used everywhere in the codebase. Using the component has many benefits (which I can list if needed), while I fail to see any concrete benefit of not using it.

Please make sure building UCC is possible even without CUDA support.

Sure, will do.

This comment is not addressed. Every CUDA API call should be wrapped in the appropriate compiler guard (another advantage of using mc). This is related to the error seen in the CI. The program should compile with a configure command of the form configure --with-tls=ucp,mlx5 --without-cuda; otherwise it will be rejected by the tests.
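
Something along these lines (a minimal sketch; HAVE_CUDA is the assumed configure-generated macro and alloc_staging() is a made-up helper, so the actual names in the tree may differ):

#include <stdlib.h>
#if HAVE_CUDA
#include <cuda_runtime.h>
#endif

/* Allocate a staging buffer in device memory when CUDA was enabled at
 * configure time, and fall back to host memory otherwise, so that
 * "configure --with-tls=ucp,mlx5 --without-cuda" still builds and links. */
static int alloc_staging(void **buf, size_t len, int want_gpu)
{
#if HAVE_CUDA
    if (want_gpu) {
        return (cudaMalloc(buf, len) == cudaSuccess) ? 0 : -1;
    }
#else
    (void)want_gpu; /* CUDA disabled: the GPU path is compiled out entirely. */
#endif
    *buf = malloc(len);
    return (*buf != NULL) ? 0 : -1;
}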

Please fix the CI issues.

Sure, will do.

There are still the same issues.

Can you add a test for this feature?

Sure, I will open a new PR for it.

OK, I think it is important to test this feature before merging it. This test should be triggered by the CI.

@samnordmann samnordmann self-requested a review June 18, 2024 12:08
Collaborator

@samnordmann samnordmann left a comment


Thanks. Some comments still need to be addressed.

@MamziB
Collaborator Author

MamziB commented Jul 9, 2024

@samnordmann thanks for the constructive comments. Please see the new commit.

@samnordmann samnordmann self-requested a review July 10, 2024 13:00
Collaborator

@samnordmann samnordmann left a comment


Thanks! Can you please address the following comments?
Can you add some context and an explanation of the "what", "why", and "how" of this PR? Also, IMHO, using the term "CUDA memory" would be more explicit.

If GPU Direct RDMA is available, we can directly call ibv_post_send()/ibv_post_recv() using GPU buffers. Therefore, if this feature is enabled, we pre-post GPU buffers into our receive queue instead of CPU buffers. This makes it possible to receive multicast packets directly into GPU memory without additional copies or staging.

So this adds support for the CUDA memory type for the user's buffer? I don't understand, since the CUDA memory type is supposed to be supported already, as indicated here:

ucc_status_t ucc_tl_mlx5_team_get_scores(ucc_base_team_t * tl_team,

Can you please help me understand this? I still don't understand the motivation for this PR, or what would happen today if a user used mcast with a GPU buffer (which, as pointed out above, is already enabled). Please provide a description.

We have in UCC a component, "mc", that is an interface to the different memory types and provides alloc, free, memcpy, memset, etc. We should use it instead; it provides better performance and cleaner code.

Please refer to #989 (comment).

I am sorry, but I don't understand why you think it is better not to use the mc component. This component exists exactly for this purpose and is used everywhere in the codebase. Using the component has many benefits (which I can list if needed), while I fail to see any concrete benefit of not using it.

Thanks for replacing the CUDA memory calls with mc calls. However, the goal is also to use the mc component for the CPU memory calls; it would remove a lot of duplication. Can you please make this change?
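
For example, something like this (a rough sketch; the signatures are from components/mc/ucc_mc.h as far as I remember them, and alloc_and_fill() is just an illustrative name, so please double-check against the headers):

#include "components/mc/ucc_mc.h"

/* One code path for both host and CUDA buffers: the memory type is a
 * runtime parameter, so the mcast code needs no #ifdef or duplicated
 * alloc/copy logic. */
static ucc_status_t alloc_and_fill(ucc_memory_type_t mt, size_t len,
                                   const void *host_src,
                                   ucc_mc_buffer_header_t **hdr)
{
    ucc_status_t status;

    status = ucc_mc_alloc(hdr, len, mt);
    if (status != UCC_OK) {
        return status;
    }

    /* mc dispatches to the right copy engine based on the memory types. */
    return ucc_mc_memcpy((*hdr)->addr, host_src, len, mt,
                         UCC_MEMORY_TYPE_HOST);
}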

Please fix the CI issues.

Sure, will do.

There are still the same issues.

The CI is still red. Can you please rebase this PR and re-run the CI so we can check?

Can you add a test for this feature?

Sure, I will open a new PR for it.

OK, I think it is important to test this feature before merging it. This test should be triggered by the CI.

Can you please add these tests?

Review threads (outdated, resolved) on:
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c (2 threads)
src/components/tl/mlx5/tl_mlx5.c
@MamziB
Collaborator Author

MamziB commented Jul 11, 2024

@samnordmann, can you please take a look at the new commit?

With the new design changes, we have a performance issue: if I use ucc_mc_memcpy(), I get much worse performance than calling cudaMemcpy() directly. Please see below:

cudaMemcpy:

[1,0]<stdout>:# OSU MPI-CUDA Broadcast Latency Test v7.2
[1,0]<stdout>:# Datatype: MPI_CHAR.
[1,0]<stdout>:# Size       Avg Latency(us)
[1,0]<stdout>:1                       6.50
[1,0]<stdout>:2                       6.48
[1,0]<stdout>:4                       6.45
[1,0]<stdout>:8                       6.41
[1,0]<stdout>:16                      6.37
[1,0]<stdout>:32                      6.35
[1,0]<stdout>:64                      6.17
[1,0]<stdout>:128                     6.39
[1,0]<stdout>:256                     6.46
[1,0]<stdout>:512                     6.63
[1,0]<stdout>:1024                    6.61
[1,0]<stdout>:2048                    6.64
[1,0]<stdout>:4096                   10.49
[1,0]<stdout>:8192                   15.14
[1,0]<stdout>:16384                  32.25
[1,0]<stdout>:32768                  37.82
[1,0]<stdout>:65536                  84.40
[1,0]<stdout>:131072                180.79
[1,0]<stdout>:262144                470.13
[1,0]<stdout>:524288                949.81
[1,0]<stdout>:1048576              1956.74

ucc_mc_memcpy():

[1,0]<stdout>:# OSU MPI-CUDA Broadcast Latency Test v7.2
[1,0]<stdout>:# Datatype: MPI_CHAR.
[1,0]<stdout>:# Size       Avg Latency(us)
[1,0]<stdout>:1                     365.79
[1,0]<stdout>:2                     360.71
[1,0]<stdout>:4                     361.11
[1,0]<stdout>:8                     360.60
[1,0]<stdout>:16                    363.05
[1,0]<stdout>:32                    361.96
[1,0]<stdout>:64                    361.18
[1,0]<stdout>:128                   362.93
[1,0]<stdout>:256                   360.26
[1,0]<stdout>:512                   359.99
[1,0]<stdout>:1024                  362.55
[1,0]<stdout>:2048                  359.93
[1,0]<stdout>:4096                  679.74
[1,0]<stdout>:8192                 1200.42
[1,0]<stdout>:16384                2305.80
[1,0]<stdout>:32768                4513.86
[1,0]<stdout>:65536                6907.43
[1,0]<stdout>:131072              13578.23
[1,0]<stdout>:262144              26846.44
[1,0]<stdout>:524288              53841.84
[1,0]<stdout>:1048576            107402.99

@MamziB
Collaborator Author

MamziB commented Jul 11, 2024

So the performance gap comes from ucc_mc_memcpy() using an asynchronous cudaMemcpy under the hood, and such a copy involves overhead related to managing CUDA streams. We do not require an asynchronous copy in our design, so I added a new mc function that uses a synchronous cudaMemcpy and used it in our design.
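
In other words, something along these lines (a simplified sketch; the actual mc entry point added in the commit may be named differently):

#include <stddef.h>
#include <cuda_runtime.h>

/* Blocking copy that bypasses stream/event management entirely.
 * cudaMemcpyDefault infers the direction from the pointer types
 * (requires unified virtual addressing) and returns only once the
 * copy has completed. */
static int mcast_memcpy_sync(void *dst, const void *src, size_t len)
{
    return (cudaMemcpy(dst, src, len, cudaMemcpyDefault) == cudaSuccess) ? 0 : -1;
}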

@MamziB
Collaborator Author

MamziB commented Jul 11, 2024

@samnordmann Thanks for the constructive comments. I addressed all your comments and fixed the performance issues. Please take a look at the updated commit.

@MamziB
Collaborator Author

MamziB commented Jul 11, 2024

@Sergei-Lebedev, can you also please let me know if you have any comments on this PR? I have already addressed all of Sam's comments.

Collaborator

@samnordmann samnordmann left a comment


Thanks!
