
TOOLS/PERFTEST: ucc perftest #166

Merged: 2 commits into openucx:master from topic/perftest on Apr 28, 2021

Conversation

@Sergei-Lebedev (Contributor) commented Apr 26, 2021

What

Adds UCC internal performance tests.

Why?

UCC perftest is not a replacement for well-known benchmarks such as OSU; instead, the goal is to cover usage scenarios specific to UCC, such as persistent collectives, asymmetric memory types, multithreading, and measuring collective bandwidth.

How?

ucc_perftest is compiled as a separate binary in tools/perftest. Different backends can be used for OOB, but right now only MPI is available. The main components are (see the measurement-loop sketch after this list):
ucc_pt_benchmark - common logic for starting benchmark
ucc_pt_coll - collective abstraction for benchmarking
ucc_pt_bootstrap - OOB backend abstraction
ucc_pt_comm - ucc_lib + ucc_context + ucc_team
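
As a rough illustration of what the per-size measurement loop for a persistent collective can look like, here is a minimal sketch: the ucc_collective_* and ucc_context_progress calls are the public UCC API, while the helper name, parameters, and timing choices are assumptions, not the PR's exact code (error checking omitted for brevity).

```cpp
// Minimal sketch of one timed measurement for a persistent collective.
// The ucc_* calls are the public UCC API; everything else is illustrative.
#include <chrono>
#include <ucc/api/ucc.h>

double measure_avg_us(ucc_coll_args_t *args, ucc_team_h team,
                      ucc_context_h ctx, int n_warmup, int n_iter)
{
    ucc_coll_req_h req;

    // Persistent usage: init the collective once, post it many times.
    ucc_collective_init(args, &req, team);
    auto run_once = [&]() {
        ucc_collective_post(req);
        while (ucc_collective_test(req) == UCC_INPROGRESS) {
            ucc_context_progress(ctx);
        }
    };
    for (int i = 0; i < n_warmup; i++) {
        run_once();
    }
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n_iter; i++) {
        run_once();
    }
    auto stop = std::chrono::high_resolution_clock::now();
    ucc_collective_finalize(req);
    return std::chrono::duration<double, std::micro>(stop - start).count()
           / n_iter;
}
```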

Running allreduce on 8 ranks with ucc_perftest and OSU with ucc coll component (using Val's MPI driver):

  • UCC Perftest
Collective: Allreduce
Memory type: host
Data type: 11
Operation type: 2
Warmup: 200; Iterations: 100
       Count        Size                Time, us
                                 avg         min         max
         128         512        6.48        5.76        7.18
         256        1024        7.65        7.19        8.10
         512        2048        9.89        9.56       10.14
        1024        4096       16.18       16.02       16.41
        2048        8192       25.03       24.55       25.46
        4096       16384       41.91       41.30       43.14
        8192       32768       56.47       55.63       57.13
       16384       65536      101.32      100.09      103.56
       32768      131072      182.24      179.16      185.13
       65536      262144      352.69      320.45      377.47
      131072      524288      682.40      624.88      727.74
      262144     1048576     1304.06     1286.65     1320.36
      524288     2097152     2641.49     2627.99     2674.91
     1048576     4194304     6433.45     6370.56     6472.88
     2097152     8388608    15113.78    14921.75    15312.76
     4194304    16777216    44856.27    44334.32    45363.89
  • OSU Allreduce
# OSU MPI Allreduce Latency Test v5.6.2
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
512                     7.24              6.86              7.52         100
1024                    8.29              8.01              8.46         100
2048                   10.53              9.93             11.18         100
4096                   15.03             14.50             15.64         100
8192                   25.11             24.40             25.80         100
16384                  37.26             36.14             38.33         100
32768                  60.45             58.65             62.75         100
65536                 109.76            107.78            112.19         100
131072                190.86            180.98            196.24         100
262144                355.12            349.44            363.02         100
524288                708.33            678.93            734.56         100
1048576              1356.70           1325.77           1399.61         100
2097152              2690.97           2642.03           2774.06         100
4194304              6500.48           6298.73           6777.99         100
8388608             15228.75          14709.94          15732.21         100
16777216            45990.79          45844.17          46254.13         100

UCCCHECK_GOTO(ucc_collective_post(req), free_req, st);
do {
    UCCCHECK_GOTO(ucc_context_progress(ctx), free_req, st);
} while (ucc_collective_test(req) == UCC_INPROGRESS);
Collaborator:

check for NOT_SUPPORTED or other error codes?
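
One way the snippet above could surface UCC_ERR_NOT_SUPPORTED and other failures instead of treating every non-OK status as "still in progress" (a sketch reusing the snippet's names, not the PR's final code):

```cpp
// Sketch: capture the test status so error codes (e.g. UCC_ERR_NOT_SUPPORTED)
// break the loop and propagate, rather than spinning forever.
ucc_status_t status;

UCCCHECK_GOTO(ucc_collective_post(req), free_req, st);
do {
    UCCCHECK_GOTO(ucc_context_progress(ctx), free_req, st);
    status = ucc_collective_test(req);
} while (status == UCC_INPROGRESS);
if (status != UCC_OK) {
    st = status;
    goto free_req;
}
```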

coll_args.dst.info.mem_type = mt;
}

ucc_status_t ucc_pt_coll_allreduce::get_coll(size_t count,
Collaborator:

imho "get_coll" confusing name, since you not "getting" anything. i would call it "init_args"

ucc_context_params_t context_params;
ucc_team_params_t team_params;
ucc_status_t st;
st = ucc_lib_config_read("TORCH", nullptr, &lib_config);
Collaborator:

TORCH -> PERF_TEST ?

std::memset(&lib_params, 0, sizeof(ucc_lib_params_t));
lib_params.mask = UCC_LIB_PARAM_FIELD_THREAD_MODE;
lib_params.thread_mode = UCC_THREAD_SINGLE;
st = ucc_init(&lib_params, lib_config, &lib);
Collaborator:

UCC_CHECKGOTO(st)?

UCC_CONTEXT_PARAM_FIELD_OOB;
context_params.type = UCC_CONTEXT_SHARED;
context_params.oob = bootstrap->get_context_oob();
ucc_context_create(lib, &context_params, context_config, &context);
Collaborator:

st = ucc_context_create(...)?
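
A sketch of the status handling these two comments ask for, reusing the UCCCHECK_GOTO macro from the progress-loop snippet above; the cleanup label names are hypothetical:

```cpp
// Check the ucc_init and ucc_context_create return codes instead of
// ignoring them; "free_lib_config" and "free_lib" are illustrative labels.
UCCCHECK_GOTO(ucc_init(&lib_params, lib_config, &lib),
              free_lib_config, st);
UCCCHECK_GOTO(ucc_context_create(lib, &context_params, context_config,
                                 &context), free_lib, st);
```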

bench.mt = UCC_MEMORY_TYPE_HOST;
bench.op = UCC_OP_SUM;
bench.inplace = false;
bench.n_iter = 10;
Collaborator:

I think it is very useful to have an iter/warmup pair for at least "small" and "large" message sizes. For small (say, up to 64K) we want warmup 100, iter 1000, I believe. 10/10 will have a lot of noise at high core counts.
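
For reference, a hypothetical helper implementing the size-based split this thread converges on (the 64K threshold comes from the comment above; all names are illustrative, not the PR's code):

```cpp
// Hypothetical helper: pick warmup/iteration counts by message size.
// Small messages need more iterations to average out noise; large ones
// fewer, to keep runtime bounded. Counts follow this review thread.
#include <cstddef>

static void get_iters(size_t msg_size, int &n_warmup, int &n_iter)
{
    if (msg_size <= 65536) {   // "small" messages, up to 64K
        n_warmup = 100;
        n_iter   = 1000;
    } else {                   // "large" messages
        n_warmup = 20;
        n_iter   = 200;
    }
}
```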

config(cfg),
comm(communicator)
{
coll = new ucc_pt_coll_allreduce(cfg.dt, cfg.mt, cfg.op, cfg.inplace);
Collaborator:

How will you run a test with multiple collectives? Will you create multiple ucc_pt_benchmark objects in main? Right now everything is tied to a single benchmark: there is one arg parse, which creates one config, with one coll. It looks like that would have to change somehow. Or is the use case one coll_type at a time? That would also be fine imho, just making sure I understand.

bench.mt = UCC_MEMORY_TYPE_HOST;
bench.op = UCC_OP_SUM;
bench.inplace = false;
bench.n_iter_large = 1000;
Collaborator:

large should be 20/200 and small 100/1000 - vice versa

Contributor (Author):

oops 😀, fixed

@bureddy (Collaborator) commented Apr 28, 2021

> Data type: 11
> Operation type: 2

Can we convert these numbers to names?

@Sergei-Lebedev (Author):

> Data type: 11
> Operation type: 2
>
> Can we convert these numbers to names?

Yep, fixed. The header looks like this now:

Collective:             Allreduce
Memory type:            host
Data type:              float32
Operation type:         sum
Warmup:
  small                 100
  large                 20
Iterations:
  small                 1000
  large                 200

split warmup and iterations for small and large tests

fix error checking

fix config read prefix

fix datatype and reduction print
@Sergei-Lebedev merged commit d13e395 into openucx:master on Apr 28, 2021
@Sergei-Lebedev deleted the topic/perftest branch on April 28, 2021 19:38
@vspetrov mentioned this pull request on May 11, 2021