Comparing approaches for CUDA-based vector multiplication.
In each of the experiments given below, we multiply two floating-point vectors x and y, with element counts from 10^6 to 10^9, using CUDA.
Each element count is attempted with various approaches, running each approach 5 times to get a reliable time measurement. Multiplication here stands in for any memory-aligned independent operation, that is, a map() operation.
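The repeated-run timing described above could be sketched roughly as follows. This is only an illustration: the helper name measureAverageMs is hypothetical, and averaging the 5 runs is an assumption (the text only says each approach is run 5 times). CUDA events are used so that asynchronous kernel launches are included in the measurement.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from the original experiments): returns the
// average time in milliseconds that `fn` takes over `repeat` runs.
template <class F>
float measureAverageMs(F fn, int repeat) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  float total = 0;
  for (int r = 0; r < repeat; ++r) {
    cudaEventRecord(start, 0);
    fn();                        // e.g. launch the multiply kernel here
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait for all work recorded before `stop`
    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);
    total += ms;
  }
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return total / repeat;
}
```

A call such as measureAverageMs(launchFn, 5), where launchFn is any callable that launches the kernel under test, would then give one time value per approach and element count.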
In this experiment (adjust-launch), we multiply two floating-point vectors x and y using CUDA. Each element count is attempted with various CUDA launch configs. Results indicate that a grid_limit of 16384/32768 and a block_size of 128/256 are suitable for both float and double. Using a grid_limit of MAX and a block_size of 256 could be a decent choice.
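To make the two launch parameters concrete, here is a minimal sketch of a grid-stride multiply kernel. It assumes that block_size is the number of threads per block and that grid_limit caps the number of blocks launched; the names multiplyKernel and multiplyCuda are illustrative and may differ from the code used in the experiment. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Each thread starts at its global index and advances by the total number
// of launched threads, so a capped grid still covers all N elements.
template <class T>
__global__ void multiplyKernel(T *a, const T *x, const T *y, size_t N) {
  size_t stride = (size_t) gridDim.x * blockDim.x;
  size_t i = (size_t) blockIdx.x * blockDim.x + threadIdx.x;
  for (; i < N; i += stride)
    a[i] = x[i] * y[i];
}

// Hypothetical host wrapper: copies x and y to the device, launches the
// kernel with the given block size and grid limit, and returns x*y.
template <class T>
std::vector<T> multiplyCuda(const std::vector<T>& x, const std::vector<T>& y,
                            int blockSize, int gridLimit) {
  size_t N = x.size(), bytes = N * sizeof(T);
  std::vector<T> a(N);
  if (N == 0) return a;
  T *aD, *xD, *yD;
  cudaMalloc(&aD, bytes);
  cudaMalloc(&xD, bytes);
  cudaMalloc(&yD, bytes);
  cudaMemcpy(xD, x.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(yD, y.data(), bytes, cudaMemcpyHostToDevice);
  // One thread per element, unless that would exceed the grid limit.
  size_t blocks = (N + blockSize - 1) / blockSize;
  int grid = (int) std::min(blocks, (size_t) gridLimit);
  multiplyKernel<<<grid, blockSize>>>(aD, xD, yD, N);
  cudaMemcpy(a.data(), aD, bytes, cudaMemcpyDeviceToHost);
  cudaFree(aD);
  cudaFree(xD);
  cudaFree(yD);
  return a;
}
```

Under this reading, a small grid_limit makes each thread loop over many elements, while a very large one approaches one element per thread.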
In this experiment (adjust-duty), we compare various per-thread duty values for CUDA-based vector multiplication. Each element count is attempted with various CUDA launch configs and per-thread duties. Results indicate no significant difference between this approach and the adjust-launch approach.
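One plausible way to express a per-thread duty is to size the grid so that each thread handles roughly that many elements in a grid-stride loop. The sketch below assumes this interpretation; the function gridSizeForDuty and its parameters are hypothetical and not taken from the experiment's code.

```cuda
#include <algorithm>
#include <cstddef>

// Hypothetical helper: number of blocks to launch so that each thread
// processes about `duty` elements, capped at `gridLimit` blocks.
inline int gridSizeForDuty(size_t N, int blockSize, int duty, int gridLimit) {
  size_t perBlock = (size_t) blockSize * duty;     // elements covered per block
  size_t blocks = (N + perBlock - 1) / perBlock;   // ceiling division
  blocks = std::max(blocks, (size_t) 1);           // a launch needs at least one block
  return (int) std::min(blocks, (size_t) gridLimit);
}
```

With a duty of 1 this reduces to one element per thread up to the grid limit. For the larger element counts the grid limit caps the launch either way, which would be consistent with the two experiments showing similar timings, though that explanation is a guess rather than a measured result.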