Closed
Description
The bandwidth of the ugni BTL on Cray XC has collapsed dramatically,
particularly in the rendezvous path.
Thanks to Jerome Vienne (TACC) for discovering this. The behavior looks as if
RDMA requests are being serialized in some way.
Here's what one gets on a Cray XC using Open MPI for osu_bw:
# OSU MPI Bandwidth Test v5.0
# Size Bandwidth (MB/s)
1 1.26
2 2.53
4 5.07
8 10.03
16 20.17
32 40.06
64 80.63
128 159.03
256 307.02
512 515.40
1024 853.31
2048 1401.59
4096 2036.56
8192 2181.38
16384 2731.88
32768 2983.61
65536 3063.85
131072 2989.02
262144 2947.41
524288 2936.12
1048576 3013.37
2097152 3045.48
4194304 3066.06
whereas with Cray MPICH:
# OSU MPI Bandwidth Test v5.0
# Size Bandwidth (MB/s)
1 1.77
2 3.49
4 7.08
8 14.14
16 27.63
32 55.29
64 110.43
128 221.12
256 429.90
512 832.11
1024 1226.09
2048 1837.54
4096 2486.40
8192 5518.25
16384 8121.97
32768 9060.66
65536 9509.80
131072 9764.89
262144 9861.44
524288 9908.75
1048576 9914.67
2097152 9839.37
4194304 9146.36
Verified that on systems not using nativized Slurm, the same behavior persists with either aprun or mpirun.
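To put a number on the gap, here is a small sketch that computes the Cray MPICH / Open MPI bandwidth ratio for each message size. The figures are copied from the two osu_bw runs above; the names `SIZES`, `OPEN_MPI`, `CRAY_MPICH` and `ratios` are just for illustration. Reading the jump between 4096 and 8192 bytes as the eager-to-rendezvous switch is an assumption on my part, not something the numbers prove.

```python
# Bandwidth figures (MB/s) copied verbatim from the two osu_bw runs above.
# Message sizes are the powers of two from 1 byte to 4 MiB.
SIZES = [2 ** i for i in range(23)]

OPEN_MPI = [
    1.26, 2.53, 5.07, 10.03, 20.17, 40.06, 80.63, 159.03,
    307.02, 515.40, 853.31, 1401.59, 2036.56, 2181.38, 2731.88, 2983.61,
    3063.85, 2989.02, 2947.41, 2936.12, 3013.37, 3045.48, 3066.06,
]

CRAY_MPICH = [
    1.77, 3.49, 7.08, 14.14, 27.63, 55.29, 110.43, 221.12,
    429.90, 832.11, 1226.09, 1837.54, 2486.40, 5518.25, 8121.97, 9060.66,
    9509.80, 9764.89, 9861.44, 9908.75, 9914.67, 9839.37, 9146.36,
]


def ratios():
    """Return (message size, Cray MPICH bandwidth / Open MPI bandwidth) pairs."""
    return [(s, c / o) for s, o, c in zip(SIZES, OPEN_MPI, CRAY_MPICH)]


if __name__ == "__main__":
    for size, ratio in ratios():
        print(f"{size:>8} {ratio:6.2f}x")
```

Up to 4096 bytes the ratio stays below about 1.7x. From 8192 bytes upward it is above 2.5x and exceeds 3x for most large messages, which is where Open MPI plateaus at roughly 3000 MB/s.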