bandwidth/latency for ugni btl is poor #1005

@hppritcha

There's been a dramatic collapse in the bandwidth of the ugni BTL on Cray XC,
particularly in the rendezvous path.

Thanks to Jerome Vienne (TACC) for discovering this. The behavior is as if there is
some kind of serialization of RDMA requests.

Here's what one gets on a Cray XC using Open MPI for osu_bw:

# OSU MPI Bandwidth Test v5.0
# Size      Bandwidth (MB/s)
1                       1.26
2                       2.53
4                       5.07
8                      10.03
16                     20.17
32                     40.06
64                     80.63
128                   159.03
256                   307.02
512                   515.40
1024                  853.31
2048                 1401.59
4096                 2036.56
8192                 2181.38
16384                2731.88
32768                2983.61
65536                3063.85
131072               2989.02
262144               2947.41
524288               2936.12
1048576              3013.37
2097152              3045.48
4194304              3066.06

whereas with Cray MPICH:

# OSU MPI Bandwidth Test v5.0
# Size      Bandwidth (MB/s)
1                       1.77
2                       3.49
4                       7.08
8                      14.14
16                     27.63
32                     55.29
64                    110.43
128                   221.12
256                   429.90
512                   832.11
1024                 1226.09
2048                 1837.54
4096                 2486.40
8192                 5518.25
16384                8121.97
32768                9060.66
65536                9509.80
131072               9764.89
262144               9861.44
524288               9908.75
1048576              9914.67
2097152              9839.37
4194304              9146.36
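
To put a number on the gap between the two runs, here is a minimal Python sketch that parses `osu_bw` output and computes the per-size bandwidth ratio. The helper names `parse_osu_bw` and `bandwidth_ratio` are made up for illustration and are not part of the OSU benchmarks or Open MPI. Only three rows from each table above are pasted in to keep it short.

```python
def parse_osu_bw(text):
    """Parse osu_bw output into {message_size_bytes: bandwidth_MB_per_s}."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines printed by osu_bw.
        if not line or line.startswith("#"):
            continue
        size, bw = line.split()
        results[int(size)] = float(bw)
    return results


def bandwidth_ratio(reference, candidate):
    """Return {size: candidate_bw / reference_bw} for sizes present in both runs."""
    return {
        size: candidate[size] / reference[size]
        for size in sorted(reference)
        if size in candidate
    }


# Rows copied from the Open MPI run above.
open_mpi = parse_osu_bw("""
# OSU MPI Bandwidth Test v5.0
# Size      Bandwidth (MB/s)
4096                 2036.56
8192                 2181.38
1048576              3013.37
""")

# The same sizes from the Cray MPICH run above.
cray_mpich = parse_osu_bw("""
# OSU MPI Bandwidth Test v5.0
# Size      Bandwidth (MB/s)
4096                 2486.40
8192                 5518.25
1048576              9914.67
""")

ratios = bandwidth_ratio(cray_mpich, open_mpi)
for size, ratio in ratios.items():
    print(f"{size:>8} bytes: Open MPI reaches {ratio:.0%} of Cray MPICH bandwidth")
```

On these rows the ratio drops sharply between 4096 and 8192 bytes and stays low for large messages. Pasting in the full tables shows the same pattern across all sizes.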

Verified that on systems not using nativized Slurm the same behavior persists with either aprun or mpirun.
