Allow multiple outputs for guvectorize on CUDA target #8303

Closed · s-m-e opened this issue Aug 1, 2022 · 2 comments · Fixed by #8341
Labels: CUDA (CUDA related issue/PR), feature_request

@s-m-e commented Aug 1, 2022

Feature request

The following works just fine for targets `cpu` and `parallel`:

```python
from numba import guvectorize
import numpy as np

@guvectorize(
    'void(f8[:],f8[:],f8[:],f8[:],f8[:],f8[:])',
    '(n),(n),(n)->(n),(n),(n)',
    target='cpu',
    nopython=True,
)
def foo(a, b, c, x, y, z):
    for idx in range(a.shape[0]):
        x[idx], y[idx], z[idx] = a[idx] * b[idx], b[idx] * c[idx], c[idx] * a[idx]

LEN = 100_000_000

data = np.arange(0, 3 * LEN, dtype='f8').reshape(3, LEN)
res = np.zeros_like(data)

foo(data[0], data[1], data[2], res[0], res[1], res[2])  # providing views on `res`

print(res)
```

However, it fails to compile for target `cuda` with `AssertionError: only support 1 output`. It would be really useful if this also worked on CUDA.

As far as I can tell, this restriction is currently not documented.
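
Until then, one possible workaround is to pack the three results into a single 2-D output, since a single output of higher rank compiles fine on the CUDA target. Below is a minimal sketch (not from the original report): `foo_packed` and the dummy `dims` input are illustrative names, and `dims` exists only so the layout can name the output's leading dimension `m`.

```python
from numba import guvectorize
import numpy as np

# Workaround sketch: fold the three outputs into one (m, n) output.
# The dummy input `dims` only supplies the size m of the leading output axis.
@guvectorize(
    'void(f8[:],f8[:],f8[:],f8[:],f8[:,:])',
    '(n),(n),(n),(m)->(m,n)',
    target='cuda',
)
def foo_packed(a, b, c, dims, out):
    for idx in range(a.shape[0]):
        out[0, idx] = a[idx] * b[idx]
        out[1, idx] = b[idx] * c[idx]
        out[2, idx] = c[idx] * a[idx]

LEN = 1_000

data = np.arange(0, 3 * LEN, dtype='f8').reshape(3, LEN)
res = np.zeros_like(data)

foo_packed(data[0], data[1], data[2], np.zeros(3), res)
print(res)  # rows hold a*b, b*c, c*a
```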


For a longer discussion, context, and examples, see here.

@guilhermeleobas (Contributor)

Hi @s-m-e, thanks for the report. I'll put this up for discussion at the next triage meeting.

gmarkall added the CUDA (CUDA related issue/PR) label on Aug 3, 2022
gmarkall added a commit to gmarkall/numba that referenced this issue Aug 3, 2022
Testing with this example from numba#8303:

```python
from numba import guvectorize, jit, cuda
import numpy as np

TARGET = 'cuda'
assert TARGET in ('cpu', 'parallel', 'cuda')

if TARGET == 'cuda':
    hjit = cuda.jit
    hkwargs = dict(device=True, inline=True)
else:
    hjit = jit
    hkwargs = dict(nopython=True, inline='always')

def _parse_signature(s):
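    # Expand shorthand types: 'M' is a 3x3 matrix of 'V', 'V' is a 3-tuple
    # of 'f', and 'f' is float64. Replacement order matters: 'M' before 'V',
    # and 'V' before 'f'.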
    s = s.replace('M', 'Tuple([V,V,V])')
    s = s.replace('V', 'Tuple([f,f,f])')
    return s.replace('f', 'f8')

@hjit(_parse_signature('V(V,M)'), **hkwargs)
def matmul_VM_(a, b):
    return (
        a[0] * b[0][0] + a[1] * b[1][0] + a[2] * b[2][0],
        a[0] * b[0][1] + a[1] * b[1][1] + a[2] * b[2][1],
        a[0] * b[0][2] + a[1] * b[1][2] + a[2] * b[2][2],
    )

@guvectorize(
    _parse_signature('void(f[:],f[:],f[:],f[:],f[:],f[:])'),
    '(n),(n),(n)->(n),(n),(n)',  # "layout" in numba terms
    target=TARGET,
    nopython=True,
)
def foo(a, b, c, x, y, z):
    R = (
        (0.2, 0.8, 0.3),
        (0.3, 0.5, 0.6),
        (0.4, 0.1, 0.8),
    )
    for idx in range(a.shape[0]):
        x[idx], y[idx], z[idx] = matmul_VM_((a[idx], b[idx], c[idx]), R)

LEN = 100_000_000

data = np.arange(0, 3 * LEN, dtype='f8').reshape(3, LEN)
res = np.zeros_like(data)

foo(data[0], data[1], data[2], res[0], res[1], res[2])
print(res)
```

There are still some mismatches to fix up.
@gmarkall (Member) commented Aug 3, 2022

@s-m-e I'm starting to work on this on this branch: https://github.com/gmarkall/numba/tree/issue-8303

Presently I'm just hacking away to see what needs changing ("brute force" modifications 🙂) because I'm not too familiar with the code in this area. Hopefully it will just need a bunch of small fixes to turn the assumption of a single output into iterations over multiple outputs in various locations.
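
For anyone following along, the shape of that change is roughly the following. This is a hypothetical illustration with made-up helper names, not Numba's actual internals:

```python
import numpy as np

# Old pattern: a hard-wired single-output assumption.
def allocate_outputs_single(out_dtypes, shape):
    assert len(out_dtypes) == 1, 'only support 1 output'
    return np.empty(shape, dtype=out_dtypes[0])

# Generalized pattern: iterate over however many outputs the gufunc
# layout declares and handle each one.
def allocate_outputs_multi(out_dtypes, shape):
    return [np.empty(shape, dtype=dt) for dt in out_dtypes]
```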
