Skip to content

SpGEMM_Benchmarks

Siva Rajamanickam edited this page Apr 24, 2018 · 26 revisions

Notes

  • Kokkos 2.50 is used for below experiments. The KokkosKernels algorithms described here are released as part of Kokkos Kernels version 2.60.
  • April 24, 2018: Kokkos::deep_copy has a minor bug in Kokkos 2.60 (https://github.com/kokkos/kokkos/issues/1583). Use Kokkos 2.50 until the next Kokkos release.

KNL Benchmark

This benchmark includes comparison of

  • KokkosKernels methods
    • KKSpGEMM (default algorithm in KokkosKernels)
    • KKMEM
    • KKDENSE
  • MKL methods: (intel-18.0.128)
    • MKL-INS: mkl-inspector executor.
    • MKL7: two-phase mkl with no sorting option:7
    • MKL8: two-phase mkl with output sorting option:8

The comparison is performed for both KNL's cache mode, and flat ddr memory. All experiments use quadrant mode that has single NUMA domain. All runtimes correspond to the runtime of "NoReuse" case where both symbolic and numeric phases are executed.

Setting up the compilation and run time parameters:

  • Setting up and replicating results on KNLs: Explains how the experiments are compiled and run with cache-mode. Same procedure can be followed to run on flat-ddr. Only changes required are:
    • Node-allocation: allocate node in flat memory mode.
    • Run with "numactl --membind 0 executable ..." to skip the use of MCDRAM.

Experiments on Flat DDR memory mode

  • KNL DDR RAW TABLE: gives a table including the runtimes of the 6 methods on 83 multiplications. For each matrix, table gives the best runtime of the method among different number of threads. Last 6 column gives the number of threads that the method achieved the best performance. For example, all methods achieve the best runtime for audikw_1 with 128 threads except MKL8 on DDR.

  • KNL DDR Performance Profile: gives the performance profiles of the algorithms for KNL's with flat ddr. For a given x, the y value indicates the number of problem cases, for which a method is less than x times slower than the best result achieved among the compared methods for each individual problem. The max value of y at x=1 is the number of problem cases for which a method achieved the best performance. The x value for which y=83 is the largest slowdown a method showed over any problem, compared with the best observed performance for that problem over all methods. KNL DDR Performance Profile

Experiments on Cache mode

Experiments using processor masking

Above experiments are run using the below environment variable.

export OMP_PROC_BIND=spread 

Most methods achieve their best performance using this configuration in most of the datasets. However, this configuration uses 256 threads on 68 cores of KNL (or 128 threads on 68 cores) with either 3 or 4 threads running each core when used without a processor masking. Below we run another benchmark using MPI masking by adding mpirun -np 1 -map-by socket:PE=64 before the executable. This limits only the use of 64 cores. We observe similar results as above.

  • KNL DDR MODE - Performance Profile with MPI Processor Masking KNL DDR Performance Profile
  • KNL CACHE MODE - Performance Profile with MPI Processor Masking KNL CACHE MODE - Performance Profile

Power8 Benchmark

This benchmark includes comparison of

  • KokkosKernels methods
    • KKSpGEMM (default algorithm in KokkosKernels)
    • KKMEM
    • KKDENSE
  • viennaCL OpenMP spgemm methods (v. 1.7.1).

All runtimes correspond to the runtime of "NoReuse" case where both symbolic and numeric phases are executed.

Setting up the compilation and run time parameters:

Experiment Resuts

P100 Benchmark

This benchmark includes comparison of

  • KokkosKernels methods
    • KKSpGEMM (default algorithm in KokkosKernels)
    • KKMEM
    • KKLP
  • viennaCL Cuda Implementation (v. 1.7.1)
  • cuSPARSE (cuda-8)
  • Nsparse: (v-1.2 July, 2017)

All runtimes correspond to the runtime of "NoReuse" case where both symbolic and numeric phases are executed.

Setting up the compilation and run time parameters:

Experiment Results on Flat DDR memory mode

  • P100 RAW TABLE: gives a table including the runtimes of the 6 methods on 81 multiplications.

  • P100 Performance Profile: gives the performance profiles of the algorithms for P100 GPUs. P100 Performance Profile

  • KKSPGEMM Speedup w.r.t. Nsparse: gives the speedup of KKSPGEMM w.r.t. Nsparse. Each bar shows the geometric mean of the 10 multiplications that are sorted based on FLOPs. KKSPGEMM Speedup w.r.t. Nsparse

Clone this wiki locally