UCP: Optimize resources for homogeneous clusters #3056

brminich · 2018-11-24T20:53:04Z

What

Implement the UCX_UNIFIED_MODE variable, which enables various optimizations suitable for homogeneous clusters. The first optimization saves transport (tl) resources: if several interfaces on the same device have the same capabilities, select the one with the best performance characteristics and close the others.

Why ?

This optimization implies the following benefits for homogeneous clusters:

Lower memory consumption by the workers
Smaller worker addresses
Avoiding unneeded iface progressions (tx)

How ?

The worker opens all available ifaces (to get their capabilities) and compares them for replaceability. If several ifaces on the same device provide the same capabilities, the worker selects the best-performing one and closes the others. The worker then caches the map of selected ifaces on the context, so that other workers can use it rather than selecting ifaces themselves.
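
The selection step described above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code of this PR: `demo_iface_t`, `demo_iface_covers()` and `demo_select_best_ifaces()` are hypothetical names, and the real comparison works on UCT iface attributes rather than on this reduced struct.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_IFACES 64 /* one bit per iface in a 64-bit bitmap */

/* Hypothetical, simplified description of an opened interface */
typedef struct {
    const char *dev_name;  /* device the iface was opened on */
    uint64_t    cap_flags; /* capability bits */
    double      overhead;  /* lower is better */
    double      bandwidth; /* higher is better */
    int         priority;  /* higher is better */
} demo_iface_t;

/* Nonzero if 'a' can replace 'b': same device, every capability of 'b'
 * is present in 'a', and performance is the same or better. */
static int demo_iface_covers(const demo_iface_t *a, const demo_iface_t *b)
{
    return (strcmp(a->dev_name, b->dev_name) == 0) &&
           ((a->cap_flags & b->cap_flags) == b->cap_flags) &&
           (a->overhead  <= b->overhead) &&
           (a->bandwidth >= b->bandwidth) &&
           (a->priority  >= b->priority);
}

/* Returns a bitmap with one bit set for every iface that should stay open. */
static uint64_t demo_select_best_ifaces(const demo_iface_t *ifaces, unsigned num)
{
    uint64_t dropped = 0;
    uint64_t keep    = 0;
    unsigned i, j;

    assert(num <= DEMO_MAX_IFACES);

    for (i = 0; i < num; ++i) {
        for (j = 0; j < num; ++j) {
            /* Ignore ifaces that were already dropped, so that two ifaces
             * with identical caps and performance do not exclude each other */
            if ((j == i) || (dropped & (UINT64_C(1) << j))) {
                continue;
            }
            if (demo_iface_covers(&ifaces[j], &ifaces[i])) {
                dropped |= UINT64_C(1) << i;
                break;
            }
        }
        if (!(dropped & (UINT64_C(1) << i))) {
            keep |= UINT64_C(1) << i;
        }
    }
    return keep;
}
```

The returned bitmap plays the role of the map that the first worker would cache on the context for later workers.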

swx-jenkins1 · 2018-11-24T21:35:42Z

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5634/ for details.

mellanox-github · 2018-11-24T23:33:18Z

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8299/ for details (Mellanox internal link).

yosefe · 2018-11-26T00:05:20Z

bot:mlx:retest

mellanox-github · 2018-11-26T02:30:24Z

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8315/ for details (Mellanox internal link).

swx-jenkins1 · 2018-11-26T08:47:00Z

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5645/ for details.

mellanox-github · 2018-11-26T10:59:28Z

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8316/ for details (Mellanox internal link).

yosefe · 2018-11-27T11:00:10Z

src/ucp/core/ucp_context.c

@@ -234,6 +234,10 @@ static ucs_config_field_t ucp_config_table[] = {
   "another thread, or incoming active messages, but consumes more resources.",
   ucs_offsetof(ucp_config_t, ctx.flush_worker_eps), UCS_CONFIG_TYPE_BOOL},

+  {"UNIFIED_MODE", "n",
+   "Enable various optimizations intended for homogeneous environment. \n",


need a better explanation of what this means, something like "it's guaranteed that the local transport resources/devices of all entities which connect to each other are the same"

yosefe · 2018-11-27T11:00:29Z

src/ucp/core/ucp_context.c

+     * for each particular device.
+     */
+    if  (config->ctx.unified_mode &&
+         ucp_tls_array_is_present((const char**)config->tls.names,


why is this needed?

yosefe · 2018-11-27T11:04:17Z

src/ucp/core/ucp_worker.c

+                                        ucp_worker_iface_t *wiface)
+{
+    ucp_context_h ctx = worker->context;
+    ucp_rsc_index_t id;


"id" -> "rsc_index"

yosefe · 2018-11-27T11:04:51Z

src/ucp/core/ucp_worker.c

+        /* Check that another iface:
+         * 1. Supports all capabilities of the target iface (at least)
+         * 2. Has the same or better performance charasteristics */
+        if (ucs_test_all_flags(if_iter->attr.cap.flags, wiface->attr.cap.flags) &&


do we really need to test all flags? maybe define some mask of what is really needed?

IMO, better check that all caps are covered. Flags can be relevant to one of the features (maybe except UCT_IFACE_FLAG_CONNECT_TO_IFACE). We could try to use different masks depending on features requested, but what would be the real use case which benefits from that?

just to ignore UCT_IFACE_FLAG_CONNECT_TO_IFACE/EP
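
A minimal sketch of the masking idea from this thread: strip the connection-mode bits from the target's flags before testing that the candidate covers them. The `DEMO_*` flag values and function names are hypothetical and chosen for illustration only; the real flag values come from the UCT headers.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only, not the real UCT values */
#define DEMO_FLAG_AM_SHORT         (UINT64_C(1) << 0)
#define DEMO_FLAG_PUT_ZCOPY        (UINT64_C(1) << 1)
#define DEMO_FLAG_CONNECT_TO_IFACE (UINT64_C(1) << 2)
#define DEMO_FLAG_CONNECT_TO_EP    (UINT64_C(1) << 3)

/* Bits that describe how a connection is made, not what the iface can do */
#define DEMO_CONNECT_MASK \
    (DEMO_FLAG_CONNECT_TO_IFACE | DEMO_FLAG_CONNECT_TO_EP)

/* Nonzero if every bit of 'mask' is set in 'value' */
static int demo_test_all_flags(uint64_t value, uint64_t mask)
{
    return (value & mask) == mask;
}

/* Nonzero if the candidate provides all capabilities of the target,
 * ignoring the connection-mode bits. */
static int demo_caps_covered(uint64_t candidate_flags, uint64_t target_flags)
{
    uint64_t test_flags = target_flags & ~DEMO_CONNECT_MASK;

    /* The mask must never hide a real capability bit */
    assert((DEMO_CONNECT_MASK & (DEMO_FLAG_AM_SHORT | DEMO_FLAG_PUT_ZCOPY)) == 0);
    return demo_test_all_flags(candidate_flags, test_flags);
}
```

With this mask, an iface that connects to a remote iface and one that connects to a remote endpoint can replace each other as long as the rest of their capabilities match.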

yosefe · 2018-11-27T11:05:58Z

src/ucp/core/ucp_worker.c

+         * 2. Has the same or better performance charasteristics */
+        if (ucs_test_all_flags(if_iter->attr.cap.flags, wiface->attr.cap.flags) &&
+            (if_iter->attr.overhead          <= wiface->attr.overhead) &&
+            (if_iter->attr.latency.overhead  <= wiface->attr.latency.overhead) &&


let's check latency as a function of NP instead of comparing overhead and growth separately
maybe this way we can avoid creating dc on small scale and rc on large scale
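
To illustrate the suggestion, here is a sketch that evaluates latency as a linear function of the number of endpoints instead of comparing the two coefficients separately. The struct and function names are hypothetical, and the numbers used in any comparison are made up; only the `overhead` and `growth` field names are taken from the diff above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical linear model: latency(np) = overhead + growth * np */
typedef struct {
    double overhead; /* constant part, in seconds */
    double growth;   /* per-endpoint part, in seconds */
} demo_linear_growth_t;

/* Estimated latency for a job with 'num_eps' endpoints */
static double demo_latency(const demo_linear_growth_t *lat, unsigned num_eps)
{
    assert(lat != NULL);
    return lat->overhead + (lat->growth * num_eps);
}
```

With made-up coefficients such as `{1.0e-6, 0.0}` for one transport and `{0.8e-6, 1.0e-9}` for another, the second is faster for 10 endpoints and slower for 1000, so which iface counts as "better" depends on the scale.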

yosefe · 2018-11-27T11:06:29Z

src/ucp/core/ucp_worker.c

+            /* Do not check this iface anymore, because better one exists.
+             * It helps to avoid the case when two interfaces with the same caps
+             * and performance exclude each other. */
+            wiface->flags |= UCP_WORKER_IFACE_FLAG_SUBOPTIMAL;


maybe UCP_WORKER_IFACE_FLAG_UNUSED?

yosefe · 2018-11-27T11:07:53Z

src/ucp/core/ucp_worker.c

+
+    if (num_ifaces < context->num_tls) {
+        /* Some ifaces need to be closed */
+        ifaces = ucs_calloc(num_ifaces, sizeof(ucp_worker_iface_t), "ucp iface");


why need to calloc? we can use memmove to overwrite the array in-place
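
The in-place alternative suggested here could look roughly like the sketch below: surviving entries are shifted toward the front of the same array, so no second allocation is needed. `demo_wiface_t` and `demo_compact_ifaces()` are hypothetical names, not the types used in the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced worker iface entry */
typedef struct {
    int id;     /* stands in for the real iface handle */
    int unused; /* nonzero if a better iface replaces this one */
} demo_wiface_t;

/* Compacts 'ifaces' in place and returns the number of entries kept. */
static size_t demo_compact_ifaces(demo_wiface_t *ifaces, size_t num)
{
    size_t src, dst = 0;

    assert((ifaces != NULL) || (num == 0));

    for (src = 0; src < num; ++src) {
        if (ifaces[src].unused) {
            continue; /* the caller would close this iface here */
        }
        if (dst != src) {
            /* Move one surviving entry down to the next free slot */
            memmove(&ifaces[dst], &ifaces[src], sizeof(ifaces[src]));
        }
        ++dst;
    }
    return dst;
}
```

Because each move copies a single element between two distinct slots, source and destination never overlap, so `memcpy` would work equally well in this sketch.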

yosefe · 2018-11-27T11:09:01Z

src/ucp/core/ucp_worker.c

+                memcpy(ifaces + iface_id, wiface, sizeof(*wiface));
+                ++iface_id;
+            } else {
+                ucs_debug("closing suboptimal interface[%d]=%p on "


maybe print which resource replaces which,
e.g. closing resource[0] rc/mlx5_0:1 since resource[1] rc/mlx5_0:1 is better

yosefe · 2018-11-27T11:10:51Z

src/ucp/core/ucp_worker.c

+
+        /* Cache tl_bitmap on the context, so the next workers would not need
+         * to select best ifaces. */
+        ucs_atomic_cswap64(&context->tl_bitmap, 0ul, tl_bitmap);


not sure we need atomic, if several threads write they would write same value anyway, no?
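
For reference, the "set only if still zero" pattern under discussion can be sketched with C11 atomics. This uses `<stdatomic.h>` as a stand-in and is not the UCS helper from the diff; `demo_publish_bitmap()` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stores 'computed' into '*cached' only if nothing was cached yet and
 * returns the value that ends up cached. */
static uint64_t demo_publish_bitmap(_Atomic uint64_t *cached, uint64_t computed)
{
    uint64_t expected = 0;

    assert(computed != 0); /* zero is reserved for "not cached yet" */

    /* Succeeds only for the first writer; later callers leave the cache
     * untouched */
    atomic_compare_exchange_strong(cached, &expected, computed);
    return atomic_load(cached);
}
```

If every worker is guaranteed to compute the same bitmap, as the comment above suggests, the compare-and-swap protects nothing that a plain store would lose; it only matters when different writers could produce different values.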

yosefe · 2018-11-27T11:13:35Z

src/ucp/rma/flush.c

    ucs_status_t status;

    UCP_WORKER_THREAD_CS_ENTER_CONDITIONAL(worker);

-    for (rsc_index = 0; rsc_index < worker->context->num_tls; ++rsc_index) {
-        if (worker->ifaces[rsc_index].iface == NULL) {
+    ucs_for_each_bit(rsc_index, worker->context->tl_bitmap) {


why do we need ucs_for_each_bit here? can't we just go over the array consecutively (without knowing rsc_index)?
(this comment can be relevant to other for loops which were converted to ucs_for_each_bit...)
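
A small sketch of the distinction raised here: walking the set bits of a bitmap yields resource indices, while the position in a compacted array is just a running counter. The helper names are hypothetical and the loop is a portable stand-in for a for-each-bit macro, assuming the worker's iface array is stored densely.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit; 'v' must be nonzero */
static unsigned demo_lowest_bit(uint64_t v)
{
    unsigned idx = 0;

    assert(v != 0);
    while ((v & 1) == 0) {
        v >>= 1;
        ++idx;
    }
    return idx;
}

/* Stores the resource index of every set bit into 'rsc_indices' (which
 * must have room for 64 entries) and returns how many were stored.
 * Entry k corresponds to position k of a densely packed iface array. */
static unsigned demo_bitmap_to_indices(uint64_t tl_bitmap, unsigned *rsc_indices)
{
    unsigned count = 0;

    while (tl_bitmap != 0) {
        rsc_indices[count++] = demo_lowest_bit(tl_bitmap);
        tl_bitmap           &= tl_bitmap - 1; /* clear the lowest set bit */
    }
    return count;
}
```

A loop that never needs the resource index, such as a flush over all ifaces, can iterate positions `0..count-1` directly and skip the bitmap walk.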

swx-jenkins1 · 2018-11-27T19:31:25Z

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5661/ for details.

mellanox-github · 2018-11-27T19:44:27Z

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8338/ for details (Mellanox internal link).

mellanox-github · 2018-11-27T23:01:35Z

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8339/ for details (Mellanox internal link).

yosefe · 2018-11-28T16:24:33Z

bot:mlx:retest

mellanox-github · 2018-11-28T19:12:22Z

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8345/ for details (Mellanox internal link).

yosefe · 2018-11-28T22:47:47Z

bot:mlx:retest

mellanox-github · 2018-11-29T01:14:14Z

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8350/ for details (Mellanox internal link).

swx-jenkins1 · 2018-11-29T11:27:24Z

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5669/ for details.

swx-jenkins1 · 2018-11-30T08:31:58Z

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5672/ for details.

mellanox-github · 2018-11-30T10:16:14Z

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8359/ for details (Mellanox internal link).

yosefe · 2018-11-30T11:40:22Z

src/ucp/core/ucp_worker.c

+        if (ucs_test_all_flags(if_iter->attr.cap.flags, test_flags) &&
+            (if_iter->attr.overhead  <= wiface->attr.overhead)      &&
+            (if_iter->attr.bandwidth >= wiface->attr.bandwidth)     &&
+            (if_iter->attr.priority  >= wiface->attr.priority)) {


minor: in order to avoid more 'if' nesting level, maybe inverse the condition and do 'continue'.

not sure I get it: currently in most cases we should fail on the very first condition (the capability bits check), so none of the other conditions will be checked. If everything but latency matches, we will have to calculate the latency funcs anyway

just suggested to reduce nesting level to make it easier to read, not for performance (therefore it's 'minor')
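
The readability suggestion amounts to the following shape: reject non-matching candidates early with `continue`, so the remaining checks sit at the loop's base nesting level. `demo_cand_t` and `demo_find_replacement()` are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced set of iface attributes */
typedef struct {
    uint64_t cap_flags;
    double   overhead;
    double   bandwidth;
    int      priority;
} demo_cand_t;

/* Returns the index of the first iface that can replace ifaces[target],
 * or -1 if there is none. */
static int demo_find_replacement(const demo_cand_t *ifaces, int num, int target)
{
    const demo_cand_t *cur, *iter;
    int i;

    assert((target >= 0) && (target < num));
    cur = &ifaces[target];

    for (i = 0; i < num; ++i) {
        iter = &ifaces[i];
        if ((i == target) ||
            ((iter->cap_flags & cur->cap_flags) != cur->cap_flags) ||
            (iter->overhead  > cur->overhead)  ||
            (iter->bandwidth < cur->bandwidth) ||
            (iter->priority  < cur->priority)) {
            continue; /* not a replacement, try the next iface */
        }

        /* Further checks (e.g. a latency comparison) would go here, one
         * nesting level shallower than inside a positive 'if' */
        return i;
    }
    return -1;
}
```

The behavior is the same as with the positive condition; only the nesting changes.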

yosefe · 2018-11-30T11:40:56Z

src/ucp/core/ucp_worker.c

+
+            latency_iter = ucp_worker_iface_latency(worker, if_iter);
+            latency_cur  = ucp_worker_iface_latency(worker, wiface);
+            epsilon      = (latency_iter + latency_cur) *1e-6;


space after '*'
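
The hunk above computes a tolerance proportional to the two latencies. How the PR then uses it is not shown here, so the comparison below is an assumption: it treats the candidate as "not worse" when its latency is lower or within that tolerance. `demo_latency_not_worse()` is a hypothetical name.

```c
#include <assert.h>

/* Nonzero if 'latency_iter' is lower than 'latency_cur' or equal to it
 * within a relative tolerance (assumed use of epsilon). */
static int demo_latency_not_worse(double latency_iter, double latency_cur)
{
    /* Same formula as in the hunk above, with a space after '*' */
    double epsilon = (latency_iter + latency_cur) * 1e-6;

    assert((latency_iter >= 0.0) && (latency_cur >= 0.0));
    return latency_iter < (latency_cur + epsilon);
}
```

A relative tolerance like this avoids treating two ifaces as different because of floating-point rounding in the latency estimate.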

yosefe · 2018-11-30T11:43:12Z

src/ucp/core/ucp_worker.c

@@ -804,52 +823,47 @@ static ucs_status_t ucp_worker_select_best_ifaces(ucp_worker_h worker,
 {
    ucp_context_h context = worker->context;
    uint64_t tl_bitmap    = 0;
-    ucp_worker_iface_t *ifaces;
+    ucp_rsc_index_t repl_ifaces[context->num_tls];


this is gcc extension, let's use ucs_alloca

AFAIK, variable-length arrays were added in C99

swx-jenkins1 · 2018-11-30T15:53:30Z

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5676/ for details.

mellanox-github · 2018-11-30T18:16:17Z

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8363/ for details (Mellanox internal link).

swx-jenkins1 · 2018-11-30T18:30:09Z

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5677/ for details.

yosefe · 2018-11-30T20:04:17Z

src/ucp/core/ucp_worker.c

-    ucp_rsc_index_t repl_ifaces[context->num_tls];
+    ucp_context_h context        = worker->context;
+    uint64_t tl_bitmap           = 0;
+    ucp_rsc_index_t *repl_ifaces = ucs_alloca(context->num_tls);


I would use context->num_tls * sizeof(ucp_rsc_index_t), so as not to assume it's 1 byte
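
A sketch of the sizing point, using plain `alloca()` as a stand-in for the UCS wrapper. It assumes a platform that provides `<alloca.h>` (for example glibc); `demo_rsc_index_t` is a hypothetical type made deliberately wider than one byte so that a missing `sizeof` would matter.

```c
#include <alloca.h> /* assumption: glibc-style alloca(); not part of ISO C */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t demo_rsc_index_t; /* deliberately wider than 1 byte */

/* Fills a stack-allocated scratch array with 0..num_tls-1 and returns
 * the sum of its elements. */
static unsigned demo_sum_of_indices(unsigned num_tls)
{
    demo_rsc_index_t *repl;
    unsigned i, sum = 0;

    assert((num_tls > 0) && (num_tls <= 64));

    /* Size in bytes = element count * element size. Passing only
     * 'num_tls' would under-allocate for any type wider than 1 byte. */
    repl = alloca(num_tls * sizeof(*repl));

    for (i = 0; i < num_tls; ++i) {
        repl[i] = (demo_rsc_index_t)i;
    }
    for (i = 0; i < num_tls; ++i) {
        sum += repl[i];
    }
    return sum;
}
```

Using `sizeof(*repl)` rather than naming the type keeps the allocation correct even if the element type changes later.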

mellanox-github · 2018-11-30T20:10:09Z

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8364/ for details (Mellanox internal link).

swx-jenkins1 · 2018-12-01T18:46:27Z

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5681/ for details.

mellanox-github · 2018-12-01T20:31:46Z

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8367/ for details (Mellanox internal link).

brminich added 2 commits November 24, 2018 20:10

UCP: Worker iface addressing refactoring

4ffcb1b

brminich added 2 commits November 26, 2018 10:05

UCP: Fix coverity errors

95ec8d5

UCP: Set UCP_MAX_RESOURCES to 64

fdfbaa1

yosefe reviewed Nov 27, 2018

yosefe added the Optimization Code / performance optimization label Nov 27, 2018

UCP: CR comments p1

22af5f5

UCP: Fix worker->num_ifaces initialization

810c1e7

UCP: use memcpy instead of memmove for wifaces

961138a

yosefe reviewed Nov 30, 2018

UCP: CR comments p2

614d0c3

UCP: Fix alloca comp error on ARM

efde530

yosefe reviewed Nov 30, 2018

Merge remote-tracking branch 'upstream/master' into topic/ucp_homo_mode

70ba628

Conflicts: src/ucp/tag/rndv.c

yosefe approved these changes Dec 1, 2018

yosefe merged commit 787abd7 into openucx:master Dec 2, 2018

shamisp mentioned this pull request Mar 5, 2019

UCT/UGNI: Update UGNI to new UCT api. #3314

Merged

artpol84 mentioned this pull request Mar 8, 2019

bfrops: added flexible-size integer packing to bfrops v4 openpmix/openpmix#1130

Closed
