-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TL/UCP: ranks reordering #698
TL/UCP: ranks reordering #698
Conversation
de2d995
to
6361687
Compare
@@ -386,6 +456,9 @@ ucc_status_t ucc_sbgp_create(ucc_topo_t *topo, ucc_sbgp_type_t type) | |||
case UCC_SBGP_FULL: | |||
status = sbgp_create_full(topo, sbgp); | |||
break; | |||
case UCC_SBGP_FULL_HOST_ORDERED: | |||
status = sbgp_create_full_ordered(topo, sbgp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it host list order preserved when doing this ordering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it sorts hosts by hashes, for instance
original mapping:
rank0 - hostA (hash: 0xB) socket 0
rank1 - hostA (hash: 0xB) socket 1
rank2 - hostA (hash: 0xB) socket 0
rank3 - hostA (hash: 0xB) socket 1
rank4 - hostB (hash: 0xA) socket 0
rank5 - hostB (hash: 0xA) socket 1
rank6 - hostB (hash: 0xA) socket 0
rank7 - hostB (hash: 0xA) socket 1
reordering:
rank4, rank6, rank5, rank7, rank0, rank2, rank1, rank3
|
||
enum ucc_tl_ucp_task_flags { | ||
/*indicates whether subset field of tl_ucp_task is set*/ | ||
UCC_TL_UCP_TASK_FLAG_SUBSET = UCC_BIT(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this flag? the original idea was that subset is set to EP_MAP_FULL by default. So if collective supports it, then it always uses "subset" which evaluates to no-op for FULL. Otherwise works as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's used to identify if subset is already set and shouldn't be changed. For example, in service collective.
sbgp->rank_map[i] = sorted[i].id; | ||
} | ||
|
||
/*TODO: try to detect map by numa,socket,node and use UCC_EP_MAP_CB to save |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should use ucc_ep_map_from_array to try to pack the ep map (detect strided/flat). Alloc int *rank, fill in line 430, and then ucc_ep_map_from_array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's already used in ucc_sbgp_create, but it can't detect map by numa, socket or node because they are not really strided
* For regular allgather with rank reordering both endpoints | ||
* and blocks permutation are necessary. | ||
*/ | ||
ucc_rank_t (*get_send_block)(ucc_subset_t *subset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could maybe try to remove that conditionals for services vs regular. IF we introduce tl_ucp_team->reorder_map explciitly rathar than updating subset. Then:
allgather_ring_init() {
if (!team_is_single_node(tl_ucp_team) && cfg->allgather_use_host_ordering) {
if (task->subset == FULL) {
sbgp = get_sbgp(HOST_ORDERED);
task->reorder_map = sbgp.map;
} else {
task->reorder_map = FULL; // this is true for service
}
}
then in the collective you would use double mapping team->subset->reorder_map ? Now that i think of it, we maybe even can udpate subset as ucc_ep_map_create_nested(subset, reorder_map) and it will work? (need to check)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i understand, semantic is diffrrent for service allgather and regular allgather. Service allgather with subset looks to me like a lightweight allgather with new subteam and so new topo based permutation required in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah.. i was thinking how to avoid that difference of flows between service and regular. At first i thought we can unify them, but looks like we can't. we need separate maps for "peer" and "block" computations for the 2 cases. so basically i meant to be explcit: 1 thing is "subset", it running an algorithm over only a handful of procs out of the team; another thin is "reordering".
Suppose we would introduce "reorder_map" default to FULL, optionally set to get_sbgp(HOST_ORDERED).map. Then in your allgather implementation you would have:
sendto = ucc_ep_map_eval(task->subset.map, (trank + 1) % tsize);
recvfrom = ucc_ep_map_eval(task->subset.map, (trank - 1 + tsize) % tsize);
sendto = ucc_ep_map_eval(task->reorder.map, sendto);
recvfrom = ucc_ep_map_eval(task->reorder.map, recvfrom);
...
get_recv_block {
...
ucc_ep_map_eval(task->reorder_map, (trank - step - 1 + tsize) % tsize);
}
So, this way you don't have fn pointers for send/get block. the behavior is achieved by setting reorder_map = FULL for service. Then it would even could be possible to support reordering for subset (though i don't see a use case for that now. maybe if we would like to support arbitrary size active sets we could use re-ordering for them).
Anyways, this was my logic. And when i wrote this i quickly thought that maybe this even allows to build nested map (my comment above) - but that was incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code snippet above is not correct, subset.map is actually reverse permutation that map from new numbering to old numbering, subset.myrank is rank number in new numbering. In ring algorithm we first need to find neighbors in new numbering and then translate it back to old numbering to know EPs. However, if we want to combine multiple permutations only reverse permutation is not enough, we also need forward permutation that translates from old numbering to new numbering.
Since we are not going to use subset and reordering simultaneously anyway, maybe add union to tl_ucp task structure:
union {
ucc_subset_t subset;
ucc_subset_t reordering;
}
If algorithm supports both subset and reordering it will use subset, otherwise it will use reordering field. In those algorithms that support both we will need somehow to check if rank permutation is enough, or we also need block permutation. This can be done by providing function pointers to blocks call functions or adding flag that says to permute ranks only and don't touch blocks.
6361687
to
c8ecddb
Compare
@vspetrov are you ok with the changes? |
ping @vspetrov |
@@ -1,5 +1,5 @@ | |||
/** | |||
* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2023
@@ -1,5 +1,5 @@ | |||
/** | |||
* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2023
@@ -1,5 +1,5 @@ | |||
/** | |||
* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2023
@@ -1,5 +1,6 @@ | |||
/* | |||
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2023
c8ecddb
to
ba88854
Compare
ba88854
to
534673f
Compare
bot:retest |
* TL/UCP: ranks reordering * REVIEW: fix review comments
* TL/UCP: ranks reordering * REVIEW: fix review comments
What
Use topology information in TL UCP to reorder ranks so that ranks first grouped by node, by socket within node, by numa within socket.
Why ?
Permutation is used in TL UCP allgather to form better ring algorithm.
Performance data collected on N2 PPN48 (AMD EPYC 7443: 24 cores, 2 sockets, 8 NUMAs)