
Conversation

@eddy16112 (Contributor) commented Apr 29, 2022

Provide CPU-only collective routines including alltoall, alltoallv and allgather for the case when there are multiple ranks per node.

We adopt an API design similar to NCCL's and MPI's. Here is a description of the APIs:

  • collInit and collFinalize: similar to MPI_Init and MPI_Finalize; they are called by the top-level task of each node.
  • collGetUniqueId: similar to ncclGetUniqueId; only rank 0 needs to call it to get a unique id. The unique id is used to distinguish communicators.
  • collCommCreate and collCommDestroy: similar to ncclCommInitRank and ncclCommDestroy; only the ranks participating in the communicator need to call them to initialize/finalize it. They are thread-safe.
  • collAllgather, collAlltoall and collAlltoallv: collective routines called by each rank within a communicator.

The implementation works both with and without LEGATE_USE_GASNET. When LEGATE_USE_GASNET is enabled, we use multi-threaded MPI with tagging to implement the collective communications. When it is disabled, we use a shared buffer that lets the ranks (threads) within the same node exchange data.
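For illustration, here is a minimal sketch of how a task could drive these routines. The call order (init, unique id on rank 0, communicator creation, collective call, teardown) follows the description above; the prototypes, the CollUniqueId/CollComm types, and the argument lists are assumptions made for this sketch and may not match the actual headers.

// Hypothetical C++ sketch; only the function names come from this PR's description.
#include <cstddef>
#include <vector>

using CollUniqueId = int;   // assumed: opaque id produced by rank 0
struct CollComm;            // assumed: opaque communicator handle

// Assumed prototypes mirroring the NCCL/MPI-style routines named above.
void collInit(int argc, char** argv);
void collFinalize();
void collGetUniqueId(CollUniqueId* id);  // called by rank 0 only
void collCommCreate(CollComm** comm, int nranks, int rank, CollUniqueId id);
void collCommDestroy(CollComm* comm);
void collAlltoall(const void* sendbuf, void* recvbuf,
                  std::size_t count_per_rank, CollComm* comm);

void run_collective(int nranks, int my_rank, CollUniqueId id) {
  // Every rank participating in the communicator creates it with the same id.
  CollComm* comm = nullptr;
  collCommCreate(&comm, nranks, my_rank, id);

  // Each rank sends one element to every rank and receives one from every rank.
  std::vector<int> send(nranks, my_rank), recv(nranks, 0);
  collAlltoall(send.data(), recv.data(), 1, comm);

  collCommDestroy(comm);
}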

@magnatelee (Contributor)
@eddy16112 what else do you need to fix for the merge? unless there's anything else you want to change, feel free to merge it

@eddy16112 (Contributor, Author)
> @eddy16112 what else do you need to fix for the merge? unless there's anything else you want to change, feel free to merge it

I am OK to merge it now, but it won't work until we add MPI_Init to legion_python.

if (current_unique_id > MAX_NB_COMMS) {
log_coll.fatal(
"Please increase the LEGATE_MAX_CPU_COMMS by export LEGATE_MAX_CPU_COMMS=new number, current "
"value "
Contributor


Please fix this weird line formatting.

Contributor Author


I have fixed it; sometimes pre-commit can reformat the code in a weird way.

@magnatelee (Contributor)
> I am OK to merge it now, but it won't work until we add MPI_Init to legion_python.

#264 should address the problem.


def destroy(self) -> None:
    Communicator.destroy(self)
    self._runtime.issue_execution_fence(block=True)
Contributor


Two comments: 1) please leave a comment explaining why we need this blocking; 2) do this only if you have any CPU communicators, since many programs never create them.

Contributor Author


This is the destructor of the communicator, so if programs do not create any communicators, the destroy function will not be called, right?

Contributor


Communicator.destroy is invoked unconditionally once per communicator type; it's Communicator._finalize that is invoked for each created communicator. The fence is there to block on the tasks launched by those Communicator._finalize calls, so you don't need it if no communicator is created. I assume you still need to call legate_cpucoll_finalize even when no CPU communicator is created, but I believe that function can be invoked without the fence in that case.
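A hedged sketch of the shape being suggested here (the _created_cpu_comm flag is a hypothetical name introduced only for illustration; only destroy, Communicator.destroy, Communicator._finalize, issue_execution_fence, and legate_cpucoll_finalize appear in this thread):

def destroy(self) -> None:
    # Communicator.destroy runs once per communicator type and calls
    # Communicator._finalize for each communicator that was actually created.
    Communicator.destroy(self)
    # The fence only needs to block on the tasks launched by those
    # _finalize calls, so it can be skipped when no CPU communicator exists.
    if self._created_cpu_comm:  # hypothetical flag tracking communicator creation
        self._runtime.issue_execution_fence(block=True)
    # Per the discussion, legate_cpucoll_finalize would still be invoked
    # afterwards even when no CPU communicator was created (placement assumed).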

@eddy16112 (Contributor, Author) commented Jun 14, 2022

Got it, I have pushed a fix for it: 10c988e

@magnatelee (Contributor)
LGTM. Feel free to merge it when you can

@eddy16112 (Contributor, Author)
> LGTM. Feel free to merge it when you can

Thank you, I will update the description and then merge it.

@eddy16112 eddy16112 merged commit 9640528 into nv-legate:branch-22.07 Jun 14, 2022
elliottslaughter pushed a commit to elliottslaughter/legate that referenced this pull request Mar 26, 2025
