
Examples

Matrix multiplication

First, import the modules needed for this example:

[code included from ../../../numba/cuda/tests/doc_examples/test_matmul.py]
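The included file is not reproduced in this rendering, so the listings below are sketches reconstructed from the surrounding prose and the printed output, not verbatim copies of that file. The imports the example needs are:

import math
import numpy as np
from numba import cuda, float32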

Here is a naïve implementation of matrix multiplication using a CUDA kernel:

[code included from ../../../numba/cuda/tests/doc_examples/test_matmul.py]
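A sketch of the naïve kernel (the kernel name matmul is illustrative):

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            # Every thread re-reads a full row of A and a full column
            # of B from device memory, with no reuse across threads.
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp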

This implementation is straightforward and intuitive but performs poorly, because the same matrix elements are loaded multiple times from device memory, which is slow (some devices have transparent data caches, but these may not be large enough to hold both inputs in their entirety at once).

It will be faster if we use a blocked algorithm to reduce accesses to device memory. CUDA provides fast shared memory for threads in a block to cooperatively compute on a task. The following implements a faster version of the square matrix multiplication using shared memory:

[code included from ../../../numba/cuda/tests/doc_examples/test_matmul.py]
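A sketch of the tiled kernel; the tile width TPB (threads per block along each axis) and the name fast_matmul are assumptions:

TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define arrays in shared memory.
    # The size and type of the arrays must be known at compile time.
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element of the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload one TPB x TPB tile of A and of B into shared memory,
        # zero-filling positions that fall outside the inputs.
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading.
        cuda.syncthreads()

        # Compute the partial dot product on the shared-memory tiles.
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads are done with the tiles before they
        # are overwritten in the next iteration.
        cuda.syncthreads()

    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp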

Because shared memory is a limited resource, the code preloads a small block at a time from the input arrays. Then, it calls numba.cuda.syncthreads() to wait until all threads have finished preloading before doing the computation on the shared memory. It synchronizes again after the computation to ensure all threads have finished with the data in shared memory before overwriting it in the next loop iteration.

An example usage of this function is as follows:

[code included from ../../../numba/cuda/tests/doc_examples/test_matmul.py]
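A plausible driver, with 4x4 host arrays chosen so that the result matches the output shown below:

x_h = np.arange(16).reshape([4, 4])
y_h = np.ones([4, 4])
z_h = np.zeros([4, 4])

# Move the inputs and the output buffer to the device.
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)        # device result
print(x_h @ y_h)  # NumPy reference for comparison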

This passes a CUDA memory check test, which can help with debugging. Running the code above produces the following output:

$ python fast_matmul.py
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]

Note

For high-performance matrix multiplication in CUDA, see also the CuPy implementation.

The approach outlined here generalizes to non-square matrix multiplication by adjusting the blockspergrid variable, as sketched below:
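A sketch of that adjustment, reusing the x_h, y_h, and threadsperblock names from the drivers on this page: the grid must cover the largest extent of either input along each axis, not just the shape of the output, so that every tile of both inputs is preloaded:

grid_y_max = max(x_h.shape[0], y_h.shape[0])
grid_x_max = max(x_h.shape[1], y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)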

Again, here is an example usage:

[code included from ../../../numba/cuda/tests/doc_examples/test_matmul.py]
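A driver for the non-square case; the shapes (5x23 times 23x7) are inferred from the 5x7 output below:

x_h = np.arange(115).reshape([5, 23])
y_h = np.ones([23, 7])
z_h = np.zeros([5, 7])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0], y_h.shape[0])
grid_x_max = max(x_h.shape[1], y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h @ y_h)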

and the corresponding output:

$ python nonsquare_matmul.py
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]