Slices of dynamic shared arrays all alias #5073

Closed
gmarkall opened this issue Jan 13, 2020 · 3 comments

Labels
bug CUDA CUDA related issue/PR

Comments

@gmarkall
Member

The following:

from numba import cuda, int32
import numpy as np


@cuda.jit(debug=True)
def sm_slice(x):
    dynsmem = cuda.shared.array(0, dtype=int32)
    sm1 = dynsmem[0:1]
    sm2 = dynsmem[1:2]

    sm1[0] = 1
    sm2[0] = 2
    x[0] = sm1[0]
    x[1] = sm2[0]


arr = np.zeros(2, dtype=np.int32)
nshared = 2 * arr.dtype.itemsize
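# Launch configuration: 1 block, 1 thread, stream 0, nshared bytes of dynamic shared memory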
sm_slice[1, 1, 0, nshared](arr)
print(arr)

when run on a CUDA device:

$ python alias.py 
[2 2]

and on the simulator:

$ NUMBA_ENABLE_CUDASIM=1 python alias.py 
[1 2]

Although we haven't documented dynamic shared memory (yet - I discovered this while writing examples), I think the simulator behaviour should be considered correct, because it matches what a normal user would expect to happen.
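
For comparison, plain NumPy slices on the host behave the way the simulator does: non-overlapping slices are distinct views, and a write through one is not visible through the other. A minimal host-side analogue of the kernel body:

import numpy as np

# Non-overlapping slices of the same array are distinct views, so the two
# writes land at different elements.
buf = np.zeros(2, dtype=np.int32)
sm1 = buf[0:1]
sm2 = buf[1:2]
sm1[0] = 1
sm2[0] = 2
print(buf)  # [1 2]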

The PTX (edited for clarity) clearly does nothing to differentiate the slices, but I have yet to trace the issue back any further:

.visible .entry sm_slice(
        .param .u64 sm_slice_param_0,
        .param .u64 sm_slice_param_1,
        .param .u64 sm_slice_param_2,
        .param .u64 sm_slice_param_3,
        .param .u64 sm_slice_param_4,
        .param .u64 sm_slice_param_5,
        .param .u64 sm_slice_param_6
)
{
        .reg .b32       %r<5>;
        .reg .b64       %rd<5>;


        ld.param.u64    %rd1, [sm_slice_param_4];
        cvta.to.global.u64      %rd2, %rd1;
        mov.u64         %rd3, 0;
        mov.u64         %rd4, %rd3;
        // dbg 11
        mov.u32         %r1, 1;
        st.shared.u32   [_cudapy_smem_1], %r1;
        // dbg 12
        mov.u32         %r2, 2;
        st.shared.u32   [_cudapy_smem_1], %r2;
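        // NOTE: same shared-memory address as the store above, so the slices alias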
        // dbg 13
        ld.shared.u32   %r3, [_cudapy_smem_1];
        st.global.u32   [%rd2], %r3;
        // dbg 14
        ld.shared.u32   %r4, [_cudapy_smem_1];
        st.global.u32   [%rd2+4], %r4;
        ret;
}
@stuartarchibald added the needtriage and CUDA (CUDA related issue/PR) labels on Jan 13, 2020
@stuartarchibald
Contributor

Thanks for the report. I'm of the view that the simulator behaviour is the expected behaviour too. Mark as a bug to fix?

@gmarkall
Member Author

Sounds good - hoping to look into this fairly soon, so I don't have to write "Don't use slices of dynamic shared memory because it doesn't work" in the docs :-)

@gmarkall
Member Author

One cause of the problem is that the shape and element count of a dynamic shared array are both 0. If I manually hack the creation of the shared array so that its shape and element count are correct for this particular instance:

diff --git a/numba/cuda/cudaimpl.py b/numba/cuda/cudaimpl.py
index 31ede32..170b57f 100644
--- a/numba/cuda/cudaimpl.py
+++ b/numba/cuda/cudaimpl.py
@@ -629,6 +629,9 @@ def _generic_array(context, builder, shape, dtype, symbol_name, addrspace,
     elemcount = reduce(operator.mul, shape)
     lldtype = context.get_data_type(dtype)
     laryty = Type.array(lldtype, elemcount)
+    if elemcount == 0:
+        elemcount = 2
+        shape = (2,)
 
     if addrspace == nvvm.ADDRSPACE_LOCAL:
         # Special case local addrespace allocation to use alloca

then the correct output is produced in this case.
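
As an illustration of the same root cause (a sketch for comparison, not something verified in this thread): a statically sized shared array has a non-zero shape at compile time, so the equivalent kernel should produce the expected [1 2]:

from numba import cuda, int32
import numpy as np


@cuda.jit
def sm_slice_static(x):
    # Static shared array: the shape (2,) is known at compile time, so slice
    # bounds are computed from a non-zero shape.
    smem = cuda.shared.array(2, dtype=int32)
    sm1 = smem[0:1]
    sm2 = smem[1:2]

    sm1[0] = 1
    sm2[0] = 2
    x[0] = sm1[0]
    x[1] = sm2[0]


arr = np.zeros(2, dtype=np.int32)
sm_slice_static[1, 1](arr)
print(arr)  # expected: [1 2]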

gmarkall added a commit to gmarkall/numba that referenced this issue Jan 17, 2020
The root of the issue was that computations of indices and slice bounds
were incorrect because the shape of dynamic shared memory is generally
(0,). To fix this, we compute the shape (1-D only) of dynamic shared
arrays using the dynamic shared memory size and the itemsize of the type
of the array when it is created.

This is implemented by reading the special register %dynamic_smem_size -
unfortunately NVVM doesn't provide an intrinsic for this, so we access
it using inline assembly.
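
A minimal llvmlite sketch of the approach described above; the helper names, the "=r" constraint, and the 32-bit width are assumptions for illustration, not necessarily what the actual patch emits:

from llvmlite import ir

def read_dynamic_smem_size(builder):
    # Hypothetical helper: read the PTX special register %dynamic_smem_size
    # (dynamic shared memory size in bytes) via inline assembly, since NVVM
    # provides no intrinsic for it.
    fnty = ir.FunctionType(ir.IntType(32), [])
    get_size = ir.InlineAsm(fnty, "mov.u32 $0, %dynamic_smem_size;", "=r")
    return builder.call(get_size, [])

def dynamic_smem_nelem_1d(builder, itemsize):
    # The 1-D element count is the byte size divided by the dtype's itemsize.
    nbytes = read_dynamic_smem_size(builder)
    return builder.udiv(nbytes, ir.Constant(ir.IntType(32), itemsize))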
seibert added a commit that referenced this issue Feb 26, 2020
Fix #5073: Slices of dynamic shared memory all alias