New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCP: Optimize resources for homogeneous clusters #3056
Conversation
+ Implement UCX_UNIFIED_MODE variable, which is supposed to enable various optimizations sutiable for homogeneous clusters. + The first optimization is saving of tl resources.If there are several interfaces with the same capabilities on the same device, select the one with the best perf charsteristics and close the others.
Test PASSed. |
Test FAILed. |
bot:mlx:retest |
Test FAILed. |
Test PASSed. |
Test PASSed. |
src/ucp/core/ucp_context.c
Outdated
@@ -234,6 +234,10 @@ static ucs_config_field_t ucp_config_table[] = { | |||
"another thread, or incoming active messages, but consumes more resources.", | |||
ucs_offsetof(ucp_config_t, ctx.flush_worker_eps), UCS_CONFIG_TYPE_BOOL}, | |||
|
|||
{"UNIFIED_MODE", "n", | |||
"Enable various optimizations intended for homogeneous environment. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need better explanation what this means , something like "it's guaranteed that the local transport resources/devices of all entities which connect to each other are the same"
src/ucp/core/ucp_context.c
Outdated
* for each particular device. | ||
*/ | ||
if (config->ctx.unified_mode && | ||
ucp_tls_array_is_present((const char**)config->tls.names, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed?
src/ucp/core/ucp_worker.c
Outdated
ucp_worker_iface_t *wiface) | ||
{ | ||
ucp_context_h ctx = worker->context; | ||
ucp_rsc_index_t id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"id" -> "rsc_index"
src/ucp/core/ucp_worker.c
Outdated
/* Check that another iface: | ||
* 1. Supports all capabilities of the target iface (at least) | ||
* 2. Has the same or better performance charasteristics */ | ||
if (ucs_test_all_flags(if_iter->attr.cap.flags, wiface->attr.cap.flags) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need to test all flags? maybe define some mask of what is really needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, better check that all caps are covered. Flags can be relevant to one of the features (maybe except UCT_IFACE_FLAG_CONNECT_TO_IFACE). We could try to use different masks depending on features requested, but what would be the real use case which benefits from that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to ignore UCT_IFACE_FLAG_CONNECT_TO_IFACE/EP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
src/ucp/core/ucp_worker.c
Outdated
* 2. Has the same or better performance charasteristics */ | ||
if (ucs_test_all_flags(if_iter->attr.cap.flags, wiface->attr.cap.flags) && | ||
(if_iter->attr.overhead <= wiface->attr.overhead) && | ||
(if_iter->attr.latency.overhead <= wiface->attr.latency.overhead) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's check latency as function of NP instead of overhead,growth
maybe this way we can avoid creating dc on small scale and rc on large scale
src/ucp/core/ucp_worker.c
Outdated
/* Do not check this iface anymore, because better one exists. | ||
* It helps to avoid the case when two interfaces with the same caps | ||
* and performance exclude each other. */ | ||
wiface->flags |= UCP_WORKER_IFACE_FLAG_SUBOPTIMAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe UCP_WORKER_IFACE_FLAG_UNUSED?
src/ucp/core/ucp_worker.c
Outdated
|
||
if (num_ifaces < context->num_tls) { | ||
/* Some ifaces need to be closed */ | ||
ifaces = ucs_calloc(num_ifaces, sizeof(ucp_worker_iface_t), "ucp iface"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why need to calloc? we can use memmove to overwrite the array in-place
src/ucp/core/ucp_worker.c
Outdated
memcpy(ifaces + iface_id, wiface, sizeof(*wiface)); | ||
++iface_id; | ||
} else { | ||
ucs_debug("closing suboptimal interface[%d]=%p on " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe print which resource replaces which
e.g closing resource[0] rc/mlx5_0:1 since resource[1] rc/mlx5_0:1 is better
src/ucp/core/ucp_worker.c
Outdated
|
||
/* Cache tl_bitmap on the context, so the next workers would not need | ||
* to select best ifaces. */ | ||
ucs_atomic_cswap64(&context->tl_bitmap, 0ul, tl_bitmap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure we need atomic, if several threads write they would write same value anyway, no?
ucs_status_t status; | ||
|
||
UCP_WORKER_THREAD_CS_ENTER_CONDITIONAL(worker); | ||
|
||
for (rsc_index = 0; rsc_index < worker->context->num_tls; ++rsc_index) { | ||
if (worker->ifaces[rsc_index].iface == NULL) { | ||
ucs_for_each_bit(rsc_index, worker->context->tl_bitmap) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need ucs_for_each_bit here? can't we just go over the array consecutively (without knowing rsc_index)?
(this comment can be relevant to other for loops which were converted to ucs_for_each_bit...)
Test FAILed. |
Test FAILed. |
Test FAILed. |
bot:mlx:retest |
Test FAILed. |
bot:mlx:retest |
Test PASSed. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
if (ucs_test_all_flags(if_iter->attr.cap.flags, test_flags) && | ||
(if_iter->attr.overhead <= wiface->attr.overhead) && | ||
(if_iter->attr.bandwidth >= wiface->attr.bandwidth) && | ||
(if_iter->attr.priority >= wiface->attr.priority)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: in order to avoid more 'if' nesting level, maybe inverse the condition and do 'continue'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure I get it: currently in most cases we should fail on the very first condition checking capability bits, then all other conditions will not be checked. If everything, but latency matches, we will have to calculate latency funcs anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just suggested to reduce nesting level to make it easier to read, not for performance (therefore it's 'minor')
src/ucp/core/ucp_worker.c
Outdated
|
||
latency_iter = ucp_worker_iface_latency(worker, if_iter); | ||
latency_cur = ucp_worker_iface_latency(worker, wiface); | ||
epsilon = (latency_iter + latency_cur) *1e-6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after '*'
src/ucp/core/ucp_worker.c
Outdated
@@ -804,52 +823,47 @@ static ucs_status_t ucp_worker_select_best_ifaces(ucp_worker_h worker, | |||
{ | |||
ucp_context_h context = worker->context; | |||
uint64_t tl_bitmap = 0; | |||
ucp_worker_iface_t *ifaces; | |||
ucp_rsc_index_t repl_ifaces[context->num_tls]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is gcc extension, let's use ucs_alloca
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, variable length array was added in C99
Test FAILed. |
Test FAILed. |
Test PASSed. |
src/ucp/core/ucp_worker.c
Outdated
ucp_rsc_index_t repl_ifaces[context->num_tls]; | ||
ucp_context_h context = worker->context; | ||
uint64_t tl_bitmap = 0; | ||
ucp_rsc_index_t *repl_ifaces = ucs_alloca(context->num_tls); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would use context->num_tls * sizeof(ucp_rsc_index_t)
, to not assume it's 1 byte
Test PASSed. |
Conflicts: src/ucp/tag/rndv.c
Test PASSed. |
Test PASSed. |
What
Implement UCX_UNIFIED_MODE variable, which is supposed to enable various optimizations suitable for homogeneous clusters. The first optimization is saving of tl resources. If there are several interfaces with the same capabilities on the same device, select the one with the best perf characteristics and close the others.
Why ?
This optimization implies the following benefits for homogeneous clusters:
How ?
Worker open all available ifaces (to get its capabilities) and compares them for replaceability. If some ifaces on the same device provide the same capabilities, worker would select the best performing one and will close the others. Then the worker would cache the map of selected ifaces on the context for other workers to use it rather than selecting ifaces by themselves.