cuda segfault nvvm with opt+debug #7214

Closed
2 tasks done
stuartarchibald opened this issue Jul 16, 2021 · 7 comments · Fixed by #7222
Labels
bug - segfault (Bugs that cause SIGSEGV, SIGABRT, SIGILL, SIGBUS); CUDA (CUDA related issue/PR)

Comments

@stuartarchibald
Contributor

Reporting a bug

  • I have tried using the latest released version of Numba (the most recent is
    visible in the change log: https://github.com/numba/numba/blob/master/CHANGE_LOG).
  • I have included a self contained code sample to reproduce the problem.
    i.e. it's possible to run as 'python bug.py'.

This segfaults:

from numba import cuda

@cuda.jit(debug=True)
def f(b, n):
    result = 0
    for i in range(n):
        result += b

f[1, 1](1, 2)

Under cuda-gdb, the backtrace points to the nvvmCompileProgram call in libnvvm.so.

Setting opt=False or debug=False avoids the segfault.

CC @gmarkall

@stuartarchibald added the CUDA and bug - segfault labels Jul 16, 2021
@danielverschueren

danielverschueren commented Jul 16, 2021

Thanks for opening this ticket @stuartarchibald.

I apologise if this is not the right place to post this, but I thought it might be related and helpful to the issue:

I am experiencing some more strange issues with while/for loops in Numba: matrix indexing in a while loop also causes segfaults. Here is a minimal example that reproduces the error:

from numba import cuda
import numpy as np

# 3 kernels, 2 fail on segfault
# fmat      indexes a matrix b[j,i] with 2 variables i,j; fails on segfault
# flinTwo   indexes an array b[j+i] with 2 variables i,j; passes
# flinThree indexes an array b[j+i+m] with 3 variables i,j,m; fails on segfault

@cuda.jit(debug=True)
def fmat(b, n):
    # this kernel fails... segfault
    result = 0
    j = 1
    i = 0
    while (j < b.shape[0]): # don't overflow
        while (i < n) and (i < b.shape[1]): # don't overflow
            result = b[j,i]
            i += 1
        j += 1
        i = 0

@cuda.jit(debug=True)
def flinTwo(b, m, n):
    # this kernel is happy, no faults!
    result = 0
    j = 0
    i = 0
    while (j < 2):
        while (i < n):
            if (j+i+m) < len(b): # don't overflow
                result = b[j+i]
            i += 1
        j += 1
        i = 0

@cuda.jit(debug=True)
def flinThree(b, m, n):
    # this kernel fails... segfault
    result = 0
    j = 0
    i = 0
    while (j < 2):
        while (i < n):
            if (j+i+m) < len(b): # don't overflow
                result = b[j+i+m]
            i += 1
        j += 1
        i = 0


a = np.asarray([[0,1,2,3,4,5],[0,1,2,3,4,5]])
d = a.flatten()
a_d = cuda.to_device(a)
d_d = cuda.to_device(d)

flinTwo[1,1](d_d, 0, 5)
print('flinTwo completes, no failures')

# this segfaults 
flinThree[1, 1](d_d, 0, 5)

# this segfaults too
fmat[1,1](a_d, 5)

I assume this is related to the issue for which this ticket was opened, so I am posting it here. Any fix or workaround would be appreciated!

@gmarkall
Member

NVVM does not support turning on debug without turning off optimization, as noted in the description of the -g option at https://docs.nvidia.com/cuda/libnvvm-api/group__compilation.html#group__compilation_1g76ac1e23f5d0e2240e78be0e63450346.

To compile with debug, opt=False must also be passed:

from numba import cuda

@cuda.jit(debug=True, opt=False)
def f(b, n):
    result = 0
    for i in range(n):
        result += b

f[1, 1](1, 2)

and

from numba import cuda
import numpy as np

# 3 kernels; 2 segfaulted before opt=False was added
# fmat      indexes a matrix b[j,i] with 2 variables i,j; segfaulted
# flinTwo   indexes an array b[j+i] with 2 variables i,j; passed
# flinThree indexes an array b[j+i+m] with 3 variables i,j,m; segfaulted

@cuda.jit(debug=True, opt=False)
def fmat(b, n):
    # this kernel segfaulted without opt=False
    result = 0
    j = 1
    i = 0
    while (j < b.shape[0]): # don't overflow
        while (i < n) and (i < b.shape[1]): # don't overflow
            result = b[j,i]
            i += 1
        j += 1
        i = 0

@cuda.jit(debug=True, opt=False)
def flinTwo(b, m, n):
    # this kernel is happy, no faults!
    result = 0
    j = 0
    i = 0
    while (j < 2):
        while (i < n):
            if (j+i+m) < len(b): # don't overflow
                result = b[j+i]
            i += 1
        j += 1
        i = 0

@cuda.jit(debug=True, opt=False)
def flinThree(b, m, n):
    # this kernel segfaulted without opt=False
    result = 0
    j = 0
    i = 0
    while (j < 2):
        while (i < n):
            if (j+i+m) < len(b): # don't overflow
                result = b[j+i+m]
            i += 1
        j += 1
        i = 0


a = np.asarray([[0,1,2,3,4,5],[0,1,2,3,4,5]])
d = a.flatten()
a_d = cuda.to_device(a)
d_d = cuda.to_device(d)

flinTwo[1,1](d_d, 0, 5)
print('flinTwo completes, no failures')

# this segfaulted without opt=False
flinThree[1, 1](d_d, 0, 5)

# this segfaulted too without opt=False
fmat[1,1](a_d, 5)

both execute for me with no segfault or other issue.

For now (until the debug kwarg is reorganized to disentangle exception checking from debuginfo generation), we should emit a warning telling users that they also need to pass opt=False.
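
A warning of this kind could look roughly like the sketch below. It is illustrative only and is not the code from the eventual fix: `check_debug_opt` and `DebugOptWarning` are hypothetical names, and the defaults (debug=False, opt=True) are assumed from the behaviour described in this issue, where passing debug=True alone triggers the crash.

```python
import warnings


class DebugOptWarning(UserWarning):
    """Hypothetical warning category for this sketch; not part of Numba."""


def check_debug_opt(debug=False, opt=True):
    """Warn when debug and opt are both enabled.

    Returns True if a warning was issued, False otherwise.
    The default values are an assumption made for this sketch.
    """
    if debug and opt:
        warnings.warn(
            "debug=True with opt=True is not supported by NVVM; "
            "pass opt=False as well.",
            DebugOptWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
        return True
    return False
```

Calling `check_debug_opt(debug=True)` issues the warning, while `check_debug_opt(debug=True, opt=False)` stays silent. Warning rather than raising keeps existing code that relies on debug=True for exception checking working.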

gmarkall added a commit to gmarkall/numba that referenced this issue Jul 20, 2021
As per Issue numba#7214, using debug=True with opt=True can cause NVVM to
crash (or the link can fail with an error message from ptxas). We can't
easily stop the user from doing this by default as it might break a lot
of existing code (particularly that which uses debug=True to force a
check for exceptions) but we can warn them.

This commit adds a warning when debug and opt are both true, and some
tests for most cases. The debuginfo tests are also updated so that they
don't trigger the warning and make the test suite noisy. However, I
chose not to "fix" the tests that check for a raised exception (e.g. in
test_exception and test_fastmath) - I think the test suite should be
emitting warnings there to highlight/remind about the fact that we need
to split debug=True into at least two options (e.g. debuginfo=True and
exceptions=True).

Fixes numba#7214.
@gmarkall
Member

PR #7222 adds the relevant warning.

@sgbaird
Contributor

sgbaird commented Aug 28, 2021

I think this might be the issue I'm experiencing in #7326

@gmarkall
Member

I think this might be the issue I'm experiencing in #7326

I don't see any use of debug=True there - why do you think this is the same issue? (have I missed the use of debug?)

@sgbaird
Contributor

sgbaird commented Aug 31, 2021

It was with the environment variable (unless I'm mistaken about its usage):

os.environ["NUMBA_CUDA_DEBUGINFO"] = "1"
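
Because this environment variable can turn debug on without any debug=True appearing in the code, a script may want to detect it up front. The following is a minimal stdlib-only sketch; the helper name `env_enables_cuda_debuginfo` is hypothetical, and treating any non-zero integer as "enabled" is an assumption for illustration, not Numba's own parsing logic.

```python
import os


def env_enables_cuda_debuginfo(environ=None):
    """Return True if NUMBA_CUDA_DEBUGINFO is set to a non-zero integer.

    `environ` defaults to os.environ; a plain dict can be passed for testing.
    Values that are not integers are treated as "not enabled" (an assumption).
    """
    if environ is None:
        environ = os.environ
    raw = environ.get("NUMBA_CUDA_DEBUGINFO", "0")
    try:
        return int(raw) != 0
    except ValueError:
        return False
```

If the helper returns True, kernels compiled with the default optimization setting may hit the same problem discussed in this issue, so passing opt=False to those kernels is the workaround suggested above.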

@gmarkall
Member

Ah! You're right!
