# Running our Ping Pong test

I am going to run the ping-pong test with the MVAPICH2 MPI implementation:

In [None]:
module unload cse6230
module load cse6230/gcc-omp-gpu

In [None]:
which mpicc

How many cores do I have?

In [None]:
echo ${PBS_NP}

Where are they?

In [None]:
cat ${PBS_NODEFILE} | uniq
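
`uniq` only collapses adjacent duplicates, which is fine when the node file lists each host in one contiguous run. A slightly more defensive sketch sorts first and also counts the entries per host. The node file here is a made-up stand-in (hypothetical hostnames `node-a` and `node-b`) so that the snippet runs outside a PBS job; inside a job you would point it at `${PBS_NODEFILE}` instead.

```shell
# Hypothetical stand-in for ${PBS_NODEFILE}: one line per allocated core.
nodefile=$(mktemp)
printf '%s\n' node-a node-a node-b node-a node-b node-b > "$nodefile"

# Count how many slots (lines) each host contributes, printed as "host count".
per_host=$(sort "$nodefile" | uniq -c | awk '{ print $2, $1 }')
echo "$per_host"

# Total number of slots; inside a PBS job this is expected to match ${PBS_NP}.
total=$(wc -l < "$nodefile" | tr -d ' ')
echo "total $total"
rm -f "$nodefile"
```

For the made-up file this prints three slots for each of the two hosts and a total of 6.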

In [1]:
make clean
make osu_latency

rm -f *.o osu_latency
mpicc  -g -Wall -O3 -I./ -c osu_util.c
mpicc  -g -Wall -O3 -I./ -c osu_util_mpi.c
mpicc  -g -Wall -O3 -I./ -c osu_latency.c
mpicc  osu_util.o osu_util_mpi.o osu_latency.o -o osu_latency


Let's run the basic test:

In [2]:
mpirun -np 2 ./osu_latency

MPI Version: 3.0
Intel(R) MPI Library 5.1.1 for Linux* OS

MPI # Procs: 2
MPI Wtime 1.57106e+09, precision 1e-06
MPI Wtime is global
MPI proc 0 host: rich133-h35-16-l.pace.gatech.edu
MPI proc 1 host: rich133-h35-16-l.pace.gatech.edu
# OSU MPI Latency Test
# Size            Latency (us)
0                         0.65
1                         0.66
2                         0.66
4                         0.65
8                         0.65
16                        0.65
32                        0.67
64                        0.72
128                       0.76
256                       0.76
512                       1.02
1024                      1.18
2048                      1.45
4096                      2.23
8192                      3.64
16384                     6.35
32768                    11.72
65536                    13.88
131072                   25.26
262144                   45.74
524288                   86.47
1048576                 168.52
2097152                 333.34


**What are our estimates for the latency and bandwidth from this? What is the inverse bandwidth?**
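
One way to get rough numbers is to fit the usual linear model T(n) ≈ α + βn, where α is the latency and β is the inverse bandwidth. The sketch below takes α from the 0-byte row and β from the slope between the two largest message sizes, using rows copied from the output above. That choice of rows is my assumption, not the only reasonable fit.

```shell
# Rows copied from the osu_latency output above: message size (bytes), latency (us).
# Only the smallest and the two largest sizes are used for this rough estimate.
fit=$(awk '
  # Skip comment lines, then remember every (size, latency) pair in order.
  /^#/ { next }
  NF == 2 { n++; size[n] = $1; lat[n] = $2 }
  END {
    alpha = lat[1]                                       # latency of the 0-byte message (us)
    beta  = (lat[n] - lat[n-1]) / (size[n] - size[n-1])  # inverse bandwidth (us/byte)
    printf "latency %.2f us\n", alpha
    printf "inverse bandwidth %.3f ns/byte\n", beta * 1000
    printf "bandwidth %.2f GB/s\n", 1 / beta / 1000      # bytes/us = MB/s, then MB/s to GB/s
  }
' <<'EOF'
# Size            Latency (us)
0                         0.65
1048576                 168.52
2097152                 333.34
EOF
)
echo "$fit"
```

With these three rows I get about 0.65 us of latency and an inverse bandwidth of about 0.157 ns/byte, which is roughly 6.36 GB/s.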

Those numbers are similar to memory bandwidth numbers, which suggests that MPI is taking advantage of a direct memcpy between processes. Is that true? Let's turn off shared memory and see:

In [None]:
mpirun -np 2 -env MV2_USE_SHARED_MEM 0 ./osu_latency

Let's make sure we control where processes are placed by pinning them (like OpenMP affinity):

In [None]:
mpirun -np 2 -env MV2_USE_SHARED_MEM 1 -env MV2_ENABLE_AFFINITY 1 ./osu_latency
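
A simple way to see what pinning does, independent of any particular MPI's options, is to print the list of CPUs the Linux kernel allows a process to run on. The sketch below does this for the current shell. It relies on Linux's `/proc` and is not an MVAPICH2 feature; the helper name `show_allowed_cpus` is made up for illustration.

```shell
# Print the CPUs the current process (and the children it spawns) may run on.
# Linux only: falls back to "unavailable" where /proc/self/status is missing.
show_allowed_cpus() {
  if [ -r /proc/self/status ]; then
    # Second field of the Cpus_allowed_list line: a range or comma-separated list.
    awk '/^Cpus_allowed_list:/ { print $2 }' /proc/self/status
  else
    echo "unavailable"
  fi
}

allowed=$(show_allowed_cpus)
echo "allowed CPUs: $allowed"
```

Under the launcher, something like `mpirun -np 2 grep Cpus_allowed_list /proc/self/status` would report one line per rank, so a pinned run should show a much narrower CPU list than an unpinned one.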

Now what if I run with 28 processes (the number of cores on one node)? What about 112 (the number of cores on all nodes)?

In [None]:
mpirun -np 28 -env MV2_USE_SHARED_MEM 1 -env MV2_ENABLE_AFFINITY 1 ./osu_latency | tail -n 25

In [None]:
mpirun -np 112 -env MV2_USE_SHARED_MEM 1 -env MV2_ENABLE_AFFINITY 1 ./osu_latency | tail -n 25

All of the ping-pongs so far have been between rank 0 and rank 1. With affinity enabled, these two ranks are in the same NUMA domain. What if I ping-pong between other ranks?

In [None]:
mpirun -np 112 -env MV2_USE_SHARED_MEM 1 -env MV2_ENABLE_AFFINITY 1 ./osu_latency 0 14 | tail -n 25

In [None]:
mpirun -np 112 -env MV2_USE_SHARED_MEM 1 -env MV2_ENABLE_AFFINITY 1 ./osu_latency 0 28 | tail -n 25

In [None]:
mpirun -np 112 -env MV2_USE_SHARED_MEM 1 -env MV2_ENABLE_AFFINITY 1 ./osu_latency 0 111 | tail -n 25

In [None]:
mpirun -np 112 -env MV2_USE_SHARED_MEM 1 -env MV2_ENABLE_AFFINITY 1 ./osu_latency 27 111 | tail -n 25
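
To reason about which of the pairs above cross a socket or a node boundary, it helps to map a rank to a (node, socket) pair. The sketch below assumes block placement (ranks fill one node before moving to the next), 28 cores per node as stated above, and two 14-core sockets per node. The socket layout and the helper name `rank_location` are assumptions for illustration; the real mapping depends on the launcher and the binding policy.

```shell
# Assumed layout: 28 cores per node (from the text), 14 cores per socket (assumption).
cores_per_node=28
cores_per_socket=14

# Hypothetical helper: print "node socket" for a rank under block placement.
rank_location() {
  echo "$(( $1 / cores_per_node )) $(( ($1 % cores_per_node) / cores_per_socket ))"
}

# The rank pairs used in the runs above.
for pair in "0 1" "0 14" "0 28" "0 111" "27 111"; do
  set -- $pair
  echo "ranks $1 and $2: (node socket) = ($(rank_location "$1")) vs ($(rank_location "$2"))"
done
```

Under these assumptions, ranks 0 and 1 share a socket, 0 and 14 share a node but not a socket, and 0 and 28, 0 and 111, and 27 and 111 are on different nodes.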