Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/UCP: ranks reordering #698

Merged
merged 2 commits into from
Mar 20, 2023

Conversation

Sergei-Lebedev
Copy link
Contributor

What

Use topology information in TL UCP to reorder ranks so that ranks first grouped by node, by socket within node, by numa within socket.

Why ?

Permutation is used in TL UCP allgather to form better ring algorithm.
Performance data collected on N2 PPN48 (AMD EPYC 7443: 24 cores, 2 sockets, 8 NUMAs)
ompi
ucc no topo
ucc with topo

@@ -386,6 +456,9 @@ ucc_status_t ucc_sbgp_create(ucc_topo_t *topo, ucc_sbgp_type_t type)
case UCC_SBGP_FULL:
status = sbgp_create_full(topo, sbgp);
break;
case UCC_SBGP_FULL_HOST_ORDERED:
status = sbgp_create_full_ordered(topo, sbgp);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it host list order preserved when doing this ordering?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sorts hosts by hashes, for instance
original mapping:
rank0 - hostA (hash: 0xB) socket 0
rank1 - hostA (hash: 0xB) socket 1
rank2 - hostA (hash: 0xB) socket 0
rank3 - hostA (hash: 0xB) socket 1
rank4 - hostB (hash: 0xA) socket 0
rank5 - hostB (hash: 0xA) socket 1
rank6 - hostB (hash: 0xA) socket 0
rank7 - hostB (hash: 0xA) socket 1
reordering:
rank4, rank6, rank5, rank7, rank0, rank2, rank1, rank3

src/components/tl/ucp/allgather/allgather.c Outdated Show resolved Hide resolved

enum ucc_tl_ucp_task_flags {
/*indicates whether subset field of tl_ucp_task is set*/
UCC_TL_UCP_TASK_FLAG_SUBSET = UCC_BIT(0),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this flag? the original idea was that subset is set to EP_MAP_FULL by default. So if collective supports it, then it always uses "subset" which evaluates to no-op for FULL. Otherwise works as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's used to identify if subset is already set and shouldn't be changed. For example, in service collective.

sbgp->rank_map[i] = sorted[i].id;
}

/*TODO: try to detect map by numa,socket,node and use UCC_EP_MAP_CB to save
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should use ucc_ep_map_from_array to try to pack the ep map (detect strided/flat). Alloc int *rank, fill in line 430, and then ucc_ep_map_from_array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's already used in ucc_sbgp_create, but it can't detect map by numa, socket or node because they are not really strided

* For regular allgather with rank reordering both endpoints
* and blocks permutation are necessary.
*/
ucc_rank_t (*get_send_block)(ucc_subset_t *subset,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could maybe try to remove that conditionals for services vs regular. IF we introduce tl_ucp_team->reorder_map explciitly rathar than updating subset. Then:

allgather_ring_init() {
     if (!team_is_single_node(tl_ucp_team) && cfg->allgather_use_host_ordering) {
             if (task->subset == FULL) {
                   sbgp = get_sbgp(HOST_ORDERED);
                   task->reorder_map = sbgp.map;
             } else {
                   task->reorder_map = FULL; // this is true for service 
             }
     }

then in the collective you would use double mapping team->subset->reorder_map ? Now that i think of it, we maybe even can udpate subset as ucc_ep_map_create_nested(subset, reorder_map) and it will work? (need to check)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i understand, semantic is diffrrent for service allgather and regular allgather. Service allgather with subset looks to me like a lightweight allgather with new subteam and so new topo based permutation required in that case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah.. i was thinking how to avoid that difference of flows between service and regular. At first i thought we can unify them, but looks like we can't. we need separate maps for "peer" and "block" computations for the 2 cases. so basically i meant to be explcit: 1 thing is "subset", it running an algorithm over only a handful of procs out of the team; another thin is "reordering".

Suppose we would introduce "reorder_map" default to FULL, optionally set to get_sbgp(HOST_ORDERED).map. Then in your allgather implementation you would have:

    sendto   = ucc_ep_map_eval(task->subset.map, (trank + 1) % tsize);
    recvfrom = ucc_ep_map_eval(task->subset.map, (trank - 1 + tsize) % tsize);
    sendto   = ucc_ep_map_eval(task->reorder.map, sendto);
    recvfrom = ucc_ep_map_eval(task->reorder.map, recvfrom);

   ...
   get_recv_block {
     ...
   ucc_ep_map_eval(task->reorder_map, (trank - step - 1 + tsize) % tsize);
   }

So, this way you don't have fn pointers for send/get block. the behavior is achieved by setting reorder_map = FULL for service. Then it would even could be possible to support reordering for subset (though i don't see a use case for that now. maybe if we would like to support arbitrary size active sets we could use re-ordering for them).

Anyways, this was my logic. And when i wrote this i quickly thought that maybe this even allows to build nested map (my comment above) - but that was incorrect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code snippet above is not correct, subset.map is actually reverse permutation that map from new numbering to old numbering, subset.myrank is rank number in new numbering. In ring algorithm we first need to find neighbors in new numbering and then translate it back to old numbering to know EPs. However, if we want to combine multiple permutations only reverse permutation is not enough, we also need forward permutation that translates from old numbering to new numbering.
Since we are not going to use subset and reordering simultaneously anyway, maybe add union to tl_ucp task structure:

union {
  ucc_subset_t subset;
  ucc_subset_t reordering;
} 

If algorithm supports both subset and reordering it will use subset, otherwise it will use reordering field. In those algorithms that support both we will need somehow to check if rank permutation is enough, or we also need block permutation. This can be done by providing function pointers to blocks call functions or adding flag that says to permute ranks only and don't touch blocks.

@manjugv
Copy link
Contributor

manjugv commented Feb 15, 2023

@vspetrov are you ok with the changes?

@manjugv
Copy link
Contributor

manjugv commented Mar 15, 2023

ping @vspetrov

@@ -1,5 +1,5 @@
/**
* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2023

@@ -1,5 +1,5 @@
/**
* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2023

@@ -1,5 +1,5 @@
/**
* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2023

@@ -1,5 +1,6 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2023

@Sergei-Lebedev
Copy link
Contributor Author

bot:retest

@Sergei-Lebedev Sergei-Lebedev merged commit 1a84a3d into openucx:master Mar 20, 2023
@Sergei-Lebedev Sergei-Lebedev deleted the topic/rank_reordering branch March 20, 2023 16:43
jeniaka pushed a commit to jeniaka/ucc that referenced this pull request Jun 22, 2023
* TL/UCP: ranks reordering

* REVIEW: fix review comments
janjust pushed a commit to janjust/ucc that referenced this pull request Jan 31, 2024
* TL/UCP: ranks reordering

* REVIEW: fix review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants