UCP: Optimize resources for homogeneous clusters #3056

Merged
merged 10 commits on Dec 2, 2018

Conversation

brminich
Contributor

What

Implement the UCX_UNIFIED_MODE variable, which enables various optimizations suitable for homogeneous clusters. The first optimization is saving tl resources: if there are several interfaces with the same capabilities on the same device, select the one with the best performance characteristics and close the others.

Why?

This optimization implies the following benefits for homogeneous clusters:

  • Lower memory consumption by the workers
  • Smaller worker addresses
  • Avoiding unneeded iface progress (tx)

How?

The worker opens all available ifaces (to query their capabilities) and compares them for replaceability. If some ifaces on the same device provide the same capabilities, the worker selects the best-performing one and closes the others. The worker then caches the bitmap of selected ifaces on the context, so that subsequent workers can reuse it instead of selecting ifaces themselves.
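A minimal sketch of the selection idea, using simplified stand-in structs and illustrative field names rather than the real ucp_worker_iface_t / uct_iface_attr_t layout:

#include <stdint.h>

/* Simplified stand-ins for the real interface attributes; names are
 * illustrative, not the actual UCX structures. */
typedef struct {
    uint64_t cap_flags;   /* capability bit mask */
    double   overhead;    /* per-message overhead (seconds) */
    double   bandwidth;   /* bytes per second */
    int      dev_index;   /* device this iface belongs to */
    int      suboptimal;  /* set when a better iface exists on the same device */
} iface_info_t;

/* Nonzero if 'better' can replace 'victim': same device, superset of
 * capabilities, and equal or better performance characteristics. */
static int iface_can_replace(const iface_info_t *victim,
                             const iface_info_t *better)
{
    return (victim->dev_index == better->dev_index) &&
           ((better->cap_flags & victim->cap_flags) == victim->cap_flags) &&
           (better->overhead  <= victim->overhead) &&
           (better->bandwidth >= victim->bandwidth);
}

/* Mark every iface for which a better (not yet excluded) alternative exists,
 * so that two identical ifaces never exclude each other. */
static void mark_suboptimal_ifaces(iface_info_t *ifaces, int num)
{
    int i, j;

    for (i = 0; i < num; ++i) {
        for (j = 0; j < num; ++j) {
            if ((i != j) && !ifaces[j].suboptimal &&
                iface_can_replace(&ifaces[i], &ifaces[j])) {
                ifaces[i].suboptimal = 1;
                break;
            }
        }
    }
}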

+ Implement UCX_UNIFIED_MODE variable, which is supposed to enable
various optimizations suitable for homogeneous clusters.

+ The first optimization is saving of tl resources. If there are several
interfaces with the same capabilities on the same device, select the one
with the best perf characteristics and close the others.
@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5634/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8299/ for details (Mellanox internal link).

@yosefe
Contributor

yosefe commented Nov 26, 2018

bot:mlx:retest

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8315/ for details (Mellanox internal link).

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5645/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8316/ for details (Mellanox internal link).

@@ -234,6 +234,10 @@ static ucs_config_field_t ucp_config_table[] = {
"another thread, or incoming active messages, but consumes more resources.",
ucs_offsetof(ucp_config_t, ctx.flush_worker_eps), UCS_CONFIG_TYPE_BOOL},

{"UNIFIED_MODE", "n",
"Enable various optimizations intended for homogeneous environment. \n",
Contributor

need a better explanation of what this means, something like "it's guaranteed that the local transport resources/devices of all entities which connect to each other are the same"
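One possible rewording along the lines suggested above, reusing the config-table pattern from the surrounding hunk (illustrative, not necessarily the merged text):

{"UNIFIED_MODE", "n",
 "Enable optimizations intended for homogeneous environments, i.e. when it is\n"
 "guaranteed that the local transport resources/devices of all entities which\n"
 "connect to each other are the same.",
 ucs_offsetof(ucp_config_t, ctx.unified_mode), UCS_CONFIG_TYPE_BOOL},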

* for each particular device.
*/
if (config->ctx.unified_mode &&
ucp_tls_array_is_present((const char**)config->tls.names,
Contributor

why is this needed?

ucp_worker_iface_t *wiface)
{
ucp_context_h ctx = worker->context;
ucp_rsc_index_t id;
Contributor

"id" -> "rsc_index"

/* Check that another iface:
* 1. Supports all capabilities of the target iface (at least)
* 2. Has the same or better performance characteristics */
if (ucs_test_all_flags(if_iter->attr.cap.flags, wiface->attr.cap.flags) &&
Contributor

do we really need to test all flags? maybe define some mask of what is really needed?

Contributor Author

IMO, it's better to check that all caps are covered. Each flag can be relevant to one of the features (maybe except UCT_IFACE_FLAG_CONNECT_TO_IFACE). We could try to use different masks depending on the requested features, but what real use case would benefit from that?

Contributor

just to ignore UCT_IFACE_FLAG_CONNECT_TO_IFACE/EP

Contributor Author

ok
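A sketch of what was agreed, assuming the standard UCT flag names; a test_flags variable of this kind shows up in a later hunk, but the exact form here is illustrative:

/* Drop the connection-establishment bits before comparing capabilities, so
 * that CONNECT_TO_IFACE vs. CONNECT_TO_EP alone does not prevent one iface
 * from replacing another. */
test_flags = wiface->attr.cap.flags &
             ~(UCT_IFACE_FLAG_CONNECT_TO_IFACE | UCT_IFACE_FLAG_CONNECT_TO_EP);

if (ucs_test_all_flags(if_iter->attr.cap.flags, test_flags)) {
    /* if_iter provides at least every remaining capability of wiface */
}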

* 2. Has the same or better performance characteristics */
if (ucs_test_all_flags(if_iter->attr.cap.flags, wiface->attr.cap.flags) &&
(if_iter->attr.overhead <= wiface->attr.overhead) &&
(if_iter->attr.latency.overhead <= wiface->attr.latency.overhead) &&
Contributor

let's check latency as a function of NP instead of the raw overhead/growth fields
maybe this way we can avoid creating dc on small scale and rc on large scale
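A sketch of comparing latency at the estimated job scale rather than by the raw overhead/growth fields; the est_num_eps field name is an assumption, and the helper mirrors the ucp_worker_iface_latency() call seen later in this PR:

/* Latency as seen at the estimated job scale, not just the constant part */
static double iface_latency_at_scale(ucp_worker_h worker,
                                     const ucp_worker_iface_t *wiface)
{
    return wiface->attr.latency.overhead +
           wiface->attr.latency.growth * worker->context->config.est_num_eps;
}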

/* Do not check this iface anymore, because a better one exists.
* It helps to avoid the case when two interfaces with the same caps
* and performance exclude each other. */
wiface->flags |= UCP_WORKER_IFACE_FLAG_SUBOPTIMAL;
Contributor

maybe UCP_WORKER_IFACE_FLAG_UNUSED?


if (num_ifaces < context->num_tls) {
/* Some ifaces need to be closed */
ifaces = ucs_calloc(num_ifaces, sizeof(ucp_worker_iface_t), "ucp iface");
Contributor

why do we need calloc? we can use memmove to overwrite the array in-place
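A minimal sketch of the suggested in-place compaction, in plain C with a generic element type (not the actual UCX code):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Keep only the elements whose bit is set in 'keep_bitmap', compacting the
 * array in place; returns the number of elements kept. */
static size_t compact_array(void *array, size_t elem_size, size_t num,
                            uint64_t keep_bitmap)
{
    char   *base = array;
    size_t  i, out = 0;

    for (i = 0; i < num; ++i) {
        if (keep_bitmap & (1ul << i)) {
            if (out != i) {
                memmove(base + out * elem_size, base + i * elem_size, elem_size);
            }
            ++out;
        }
    }
    return out;
}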

memcpy(ifaces + iface_id, wiface, sizeof(*wiface));
++iface_id;
} else {
ucs_debug("closing suboptimal interface[%d]=%p on "
Contributor

maybe print which resource replaces which
e.g. closing resource[0] rc/mlx5_0:1 since resource[1] rc/mlx5_0:1 is better


/* Cache tl_bitmap on the context, so the next workers would not need
* to select best ifaces. */
ucs_atomic_cswap64(&context->tl_bitmap, 0ul, tl_bitmap);
Contributor

not sure we need an atomic; if several threads write, they would write the same value anyway, no?

ucs_status_t status;

UCP_WORKER_THREAD_CS_ENTER_CONDITIONAL(worker);

for (rsc_index = 0; rsc_index < worker->context->num_tls; ++rsc_index) {
if (worker->ifaces[rsc_index].iface == NULL) {
ucs_for_each_bit(rsc_index, worker->context->tl_bitmap) {
Contributor

why do we need ucs_for_each_bit here? can't we just go over the array consecutively (without knowing rsc_index)?
(this comment can be relevant to other for loops which were converted to ucs_for_each_bit...)
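An illustrative fragment of the consecutive alternative being asked about, assuming the worker keeps a compacted ifaces array with a num_ifaces count (field names are assumptions):

/* Walk the compacted array directly instead of decoding tl_bitmap */
for (iface_id = 0; iface_id < worker->num_ifaces; ++iface_id) {
    wiface = &worker->ifaces[iface_id];
    /* ... progress / arm / query wiface as before ... */
}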

@yosefe added the Optimization (Code / performance optimization) label Nov 27, 2018
@swx-jenkins1

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5661/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8338/ for details (Mellanox internal link).

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8339/ for details (Mellanox internal link).

@yosefe
Contributor

yosefe commented Nov 28, 2018

bot:mlx:retest

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8345/ for details (Mellanox internal link).

@yosefe
Contributor

yosefe commented Nov 28, 2018

bot:mlx:retest

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8350/ for details (Mellanox internal link).

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5669/ for details.

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5672/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8359/ for details (Mellanox internal link).

if (ucs_test_all_flags(if_iter->attr.cap.flags, test_flags) &&
(if_iter->attr.overhead <= wiface->attr.overhead) &&
(if_iter->attr.bandwidth >= wiface->attr.bandwidth) &&
(if_iter->attr.priority >= wiface->attr.priority)) {
Contributor

minor: in order to avoid another 'if' nesting level, maybe invert the condition and do 'continue'.

Contributor Author

not sure I get it: currently, in most cases we should fail on the very first condition, which checks the capability bits, so the other conditions will not be evaluated. If everything but latency matches, we will have to calculate the latency functions anyway

Contributor

just suggested to reduce nesting level to make it easier to read, not for performance (therefore it's 'minor')
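The pattern being suggested, sketched on the condition quoted above:

if (!ucs_test_all_flags(if_iter->attr.cap.flags, test_flags) ||
    (if_iter->attr.overhead  > wiface->attr.overhead)        ||
    (if_iter->attr.bandwidth < wiface->attr.bandwidth)       ||
    (if_iter->attr.priority  < wiface->attr.priority)) {
    continue;  /* if_iter cannot replace wiface, check the next candidate */
}
/* replacement handling continues here, one nesting level shallower */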


latency_iter = ucp_worker_iface_latency(worker, if_iter);
latency_cur = ucp_worker_iface_latency(worker, wiface);
epsilon = (latency_iter + latency_cur) *1e-6;
Contributor

space after '*'

@@ -804,52 +823,47 @@ static ucs_status_t ucp_worker_select_best_ifaces(ucp_worker_h worker,
{
ucp_context_h context = worker->context;
uint64_t tl_bitmap = 0;
ucp_worker_iface_t *ifaces;
ucp_rsc_index_t repl_ifaces[context->num_tls];
Contributor

this is a gcc extension, let's use ucs_alloca

Contributor

AFAIK, variable length array was added in C99

@swx-jenkins1

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5676/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8363/ for details (Mellanox internal link).

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5677/ for details.

ucp_rsc_index_t repl_ifaces[context->num_tls];
ucp_context_h context = worker->context;
uint64_t tl_bitmap = 0;
ucp_rsc_index_t *repl_ifaces = ucs_alloca(context->num_tls);
Contributor

I would use context->num_tls * sizeof(ucp_rsc_index_t), to not assume it's 1 byte
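The corrected call the comment asks for, keeping the element size explicit:

ucp_rsc_index_t *repl_ifaces = ucs_alloca(context->num_tls *
                                          sizeof(ucp_rsc_index_t));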

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8364/ for details (Mellanox internal link).

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5681/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8367/ for details (Mellanox internal link).
