p100_configure

1. Set up Kokkos

You can use the latest version of Kokkos. The experiments in the paper "Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures" used Kokkos version 2.04.04.

mkdir $HOME/kokkoskernels_spgemm_benchmark
cd $HOME/kokkoskernels_spgemm_benchmark
git clone git@github.com:kokkos/kokkos.git

2. Get KokkosKernels

cd $HOME/kokkoskernels_spgemm_benchmark
git clone git@github.com:kokkos/kokkos-kernels.git

Update: As of 03/07/2018, these functionalities are in the master branch. No need to check out the develop branch.

~~Currently KokkosKernels-spgemm updates are not on the master branch yet (12/20/2017). Checkout the develop branch.~~

cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels
git checkout master #git checkout develop #outdated as of 03/07/2018.

3. Update the compileKokkosKernels.sh located at example/buildlib.

cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels/example/buildlib
vi compileKokkosKernels.sh

Below is an example compileKokkosKernels.sh for P100 GPUs on a Power8 host, with the Cuda and OpenMP execution spaces enabled.

KOKKOS_PATH=${HOME}/kokkoskernels_spgemm_benchmark/kokkos #path to kokkos source
KOKKOSKERNELS_SCALARS='double' #we only need double
KOKKOSKERNELS_LAYOUTS=LayoutLeft #the layout types to instantiate.
KOKKOSKERNELS_ORDINALS=int #ordinal types to instantiate
KOKKOSKERNELS_OFFSETS=int #offset types to instantiate
KOKKOSKERNELS_PATH=${HOME}/kokkoskernels_spgemm_benchmark/kokkos-kernels #path to kokkos-kernels top directory.
KOKKOSKERNELS_OPTIONS=eti-only #options for kokkoskernels  
CXXFLAGS="-Wall -pedantic -Werror -O3 -g -Wshadow -Wsign-compare -Wignored-qualifiers -Wempty-body -Wclobbered -Wuninitialized"
CXX=${KOKKOS_PATH}/bin/nvcc_wrapper #FIX: 02/27/18 from CXX=g++
KOKKOS_DEVICES=OpenMP,Cuda  #we need both cuda and openmp/serial execution space.
KOKKOS_ARCHS=Pascal60,Power8
KOKKOSKERNELS_TPLS="cusparse" ###to enable cusparse 

../../scripts/generate_makefile.bash --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars=${KOKKOSKERNELS_SCALARS} --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --kokkos-path=${KOKKOS_PATH} --with-devices=${KOKKOS_DEVICES} --arch=${KOKKOS_ARCHS} --compiler=${CXX} --with-options=${KOKKOSKERNELS_OPTIONS}  --cxxflags="${CXXFLAGS}" --with-tpls=${KOKKOSKERNELS_TPLS}

Load the compiler and CUDA modules.

module load gcc/5.4.0
module load cuda/8.0.44
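
Before compiling, it can help to confirm that the tools the script above relies on are reachable after the modules are loaded. The following is a minimal sketch, not part of KokkosKernels: the function name `check_toolchain` is hypothetical, and it only assumes the directory layout used on this page (`nvcc_wrapper` under `${KOKKOS_PATH}/bin`).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not shipped with Kokkos or KokkosKernels).
# Assumes the directory layout used on this page.
KOKKOS_PATH="${KOKKOS_PATH:-${HOME}/kokkoskernels_spgemm_benchmark/kokkos}"

# Print every missing prerequisite; return non-zero if any is missing.
check_toolchain() {
  local kokkos_path="$1" missing=0 tool
  # CXX in compileKokkosKernels.sh points at this wrapper.
  if [ ! -x "${kokkos_path}/bin/nvcc_wrapper" ]; then
    echo "missing: ${kokkos_path}/bin/nvcc_wrapper"
    missing=1
  fi
  # Host compiler and CUDA compiler provided by the loaded modules.
  for tool in g++ nvcc; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "missing on PATH: ${tool}"
      missing=1
    fi
  done
  return "${missing}"
}

if check_toolchain "${KOKKOS_PATH}"; then
  echo "toolchain looks complete"
fi
```

If anything is reported as missing, re-check the `module load` lines and the Kokkos clone location before running the build.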

4. Compile KokkosKernels.

cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels/example/buildlib
./compileKokkosKernels.sh
make build-test -j

5. Running Benchmarks.

Allocate a node using the appropriate scheduling command for your system.
Download a UFL sparse matrix; this example uses audikw_1.
Each experiment is run 6 times; use the "repeat" keyword to change this ("repeat 15" repeats 15 times).
The first run is always discarded as warm-up. For each algorithm below, we run with [32, 64, 128] threads.
The examples below use ".bin" files for faster I/O. ".mtx" files can also be used; the correct reader is chosen based on the file suffix. For faster experiments, convert mtx files to bin files with KokkosKernels_MatrixConverter.exe as shown below.

./KokkosKernels_MatrixConverter.exe --in_mtx audikw_1.mtx --out_mtx audikw_1.bin
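
When several matrices are involved, the conversion above can be wrapped in a loop. This is a sketch only: `CONVERTER` and `convert_all` are illustrative names, and the flags are copied unchanged from the command above rather than taken from any other documentation.

```shell
#!/usr/bin/env bash
# Sketch: convert every .mtx file in a directory to .bin, skipping files
# that were already converted. CONVERTER and convert_all are illustrative names.
CONVERTER="${CONVERTER:-./KokkosKernels_MatrixConverter.exe}"

convert_all() {
  local dir="$1" mtx bin
  for mtx in "${dir}"/*.mtx; do
    [ -e "${mtx}" ] || continue   # no .mtx files: the glob stays literal
    bin="${mtx%.mtx}.bin"         # audikw_1.mtx -> audikw_1.bin
    if [ -e "${bin}" ]; then
      echo "skip ${mtx} (already converted)"
      continue
    fi
    # Same flags as the single-file command on this page.
    "${CONVERTER}" --in_mtx "${mtx}" --out_mtx "${bin}"
  done
}

# Usage: convert_all /path/to/matrices
```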

Set the environment variables and go to the executables folder.

export OMP_PROC_BIND=spread 
export OMP_PLACES=threads
cd $HOME/kokkoskernels_spgemm_benchmark/kokkos-kernels/example/buildlib/perf_test
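
The runs in the rest of this page differ only in the `--algorithm` flag, so they can be generated from one loop. The sketch below only prints the command lines (a dry run); `SPGEMM_EXE` and `spgemm_cmd` are illustrative names, and the log file names are an assumption, not something the benchmark produces itself.

```shell
#!/usr/bin/env bash
# Sketch: build the command line for each algorithm run shown on this page.
# SPGEMM_EXE and spgemm_cmd are illustrative names, not part of KokkosKernels.
SPGEMM_EXE="${SPGEMM_EXE:-./KokkosSparse_spgemm.exe}"

# Print the command for one algorithm.
# "default" means no --algorithm flag, which selects KKSPGEMM.
spgemm_cmd() {
  local matrix="$1" algorithm="$2"
  if [ "${algorithm}" = "default" ]; then
    echo "${SPGEMM_EXE} --cuda 0 --amtx ${matrix}"
  else
    echo "${SPGEMM_EXE} --cuda 0 --amtx ${matrix} --algorithm ${algorithm}"
  fi
}

# Dry run: print one command per algorithm, each redirected to its own log.
for algorithm in default kklp kkmem cusparse; do
  echo "$(spgemm_cmd audikw_1.bin "${algorithm}") > spgemm_${algorithm}.log"
done
```

Piping the printed lines to `bash` would execute them one after another on the allocated node.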

Running default algorithm: KKSPGEMM. Runtime: ~0.82 seconds

bash-4.2$ ./KokkosSparse_spgemm.exe --cuda 0 --amtx  audikw_1.bin 
B is not provided. Multiplying AxA.
macro  KOKKOS_ENABLE_CUDA      : defined
macro  CUDA_VERSION          = 8000 = version 8.0
Kokkos::Cuda[ 0 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K : Selected
Kokkos::Cuda[ 1 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Kokkos::Cuda[ 2 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Kokkos::Cuda[ 3 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Using A matrix for B as well
mm_time:0.793758 symbolic_time:0.152127 numeric_time:0.641631
mm_time:0.820052 symbolic_time:0.180181 numeric_time:0.639871
mm_time:0.820202 symbolic_time:0.180132 numeric_time:0.64007
mm_time:0.820129 symbolic_time:0.180188 numeric_time:0.639941
mm_time:0.819986 symbolic_time:0.180096 numeric_time:0.63989
mm_time:0.820145 symbolic_time:0.180194 numeric_time:0.639951

Running KKLP. Runtime: ~0.82 seconds

bash-4.2$  ./KokkosSparse_spgemm.exe --cuda 0 --amtx  audikw_1.bin --algorithm kklp
B is not provided. Multiplying AxA.
macro  KOKKOS_ENABLE_CUDA      : defined
macro  CUDA_VERSION          = 8000 = version 8.0
Kokkos::Cuda[ 0 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K : Selected
Kokkos::Cuda[ 1 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Kokkos::Cuda[ 2 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Kokkos::Cuda[ 3 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Using A matrix for B as well
mm_time:0.804836 symbolic_time:0.158642 numeric_time:0.646194
mm_time:0.822598 symbolic_time:0.180988 numeric_time:0.64161
mm_time:0.821275 symbolic_time:0.180926 numeric_time:0.640349
mm_time:0.820478 symbolic_time:0.180434 numeric_time:0.640044
mm_time:0.820358 symbolic_time:0.180384 numeric_time:0.639974
mm_time:0.815298 symbolic_time:0.174672 numeric_time:0.640626

Running KKMEM. Runtime: ~1.26 seconds

bash-4.2$  ./KokkosSparse_spgemm.exe --cuda 0 --amtx  audikw_1.bin --algorithm kkmem
B is not provided. Multiplying AxA.
macro  KOKKOS_ENABLE_CUDA      : defined
macro  CUDA_VERSION          = 8000 = version 8.0
Kokkos::Cuda[ 0 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K : Selected
Kokkos::Cuda[ 1 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Kokkos::Cuda[ 2 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Kokkos::Cuda[ 3 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Using A matrix for B as well
mm_time:1.23297 symbolic_time:0.172269 numeric_time:1.06071
mm_time:1.26545 symbolic_time:0.206706 numeric_time:1.05875
mm_time:1.26592 symbolic_time:0.206755 numeric_time:1.05917
mm_time:1.26359 symbolic_time:0.204676 numeric_time:1.05892
mm_time:1.26324 symbolic_time:0.206811 numeric_time:1.05643
mm_time:1.26511 symbolic_time:0.206649 numeric_time:1.05846

Running cuSPARSE. Runtime: ~2.05 seconds

bash-4.2$  ./KokkosSparse_spgemm.exe --cuda 0 --amtx audikw_1.bin --algorithm cusparse
B is not provided. Multiplying AxA.
macro  KOKKOS_ENABLE_CUDA      : defined
macro  CUDA_VERSION          = 8000 = version 8.0
Kokkos::Cuda[ 0 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K : Selected
Kokkos::Cuda[ 1 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Kokkos::Cuda[ 2 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Kokkos::Cuda[ 3 ] Tesla P100-SXM2-16GB capability 6.0, Total Global Memory: 15.89 G, Shared Memory per Block: 48 K
Using A matrix for B as well
mm_time:2.54251 symbolic_time:0.826121 numeric_time:1.71639
mm_time:2.05222 symbolic_time:0.334994 numeric_time:1.71723
mm_time:2.05029 symbolic_time:0.334889 numeric_time:1.71541
mm_time:2.05034 symbolic_time:0.33488 numeric_time:1.71546
mm_time:2.05032 symbolic_time:0.335036 numeric_time:1.71528
mm_time:2.05113 symbolic_time:0.334912 numeric_time:1.71622
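
Since the first run of each experiment is treated as warm-up, the approximate runtimes quoted above correspond to the remaining runs. A minimal sketch for averaging them from a saved log follows; `summarize_spgemm` is a hypothetical helper, and it assumes the `name:value` layout of the `mm_time:` lines shown in the logs on this page.

```shell
#!/usr/bin/env bash
# Sketch: average the timings printed by KokkosSparse_spgemm.exe,
# dropping the first (warm-up) run. summarize_spgemm is an illustrative
# helper, not part of KokkosKernels.
summarize_spgemm() {
  awk '
    /^mm_time:/ {
      runs++
      if (runs == 1) next            # first run is warm-up
      for (i = 1; i <= NF; i++) {    # fields look like name:value
        split($i, kv, ":")
        sum[kv[1]] += kv[2]
      }
    }
    END {
      if (runs < 2) { print "not enough runs"; exit 1 }
      n = runs - 1
      printf "runs:%d mm_time:%.6f symbolic_time:%.6f numeric_time:%.6f\n",
             n, sum["mm_time"] / n, sum["symbolic_time"] / n, sum["numeric_time"] / n
    }
  ' "$@"
}

# Usage (either form):
#   ./KokkosSparse_spgemm.exe --cuda 0 --amtx audikw_1.bin | summarize_spgemm
#   summarize_spgemm saved_run.log
```

Lines that do not start with `mm_time:` (device listings, macro reports) are ignored, so the full program output can be passed through unchanged.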