cuda segfault nvvm with opt+debug #7214

stuartarchibald · 2021-07-16T11:15:20Z

Reporting a bug

I have tried using the latest released version of Numba (the most recent is
visible in the change log: https://github.com/numba/numba/blob/master/CHANGE_LOG).
I have included a self-contained code sample to reproduce the problem,
i.e. it's possible to run as 'python bug.py'.

This segfaults:

from numba import cuda

@cuda.jit(debug=True)
def f(b, n):
    result = 0
    for i in range(n):
        result += b

f[1, 1](1, 2)

Under cuda-gdb the backtrace points to the nvvmCompileProgram call in libnvvm.so.

Setting opt=False or debug=False avoids the segfault.

CC @gmarkall


danielverschueren · 2021-07-16T15:46:23Z

Thanks for opening this ticket @stuartarchibald.

I apologise if this is not the right place to post this, but I thought it might be related and helpful to the issue:

I am experiencing some more strange issues with while/for loops in numba: matrix indexing in a while loop also creates segfaults. I have attached a minimal example here that produces the error:

from numba import cuda
import numpy as np

# 3 kernels, 2 fail on segfault
# fmat      indexes a matrix b[j,i] with 2 variables i,j; fails on segfault
# flinTwo   indexes an array b[j+i] with 2 variables i,j; passes
# flinThree indexes an array b[j+i+m] with 3 variables i,j,m; fails on segfault

@cuda.jit(debug=True)
def fmat(b, n):
    # this kernel fails... segfault
    result = 0
    j = 1
    i = 0
    while (j < b.shape[0]): # don't overflow
        while (i < n) and (i < b.shape[1]): # don't overflow
            result = b[j,i]
            i += 1
        j += 1
        i = 0

@cuda.jit(debug=True)
def flinTwo(b, m, n):
    # this kernel is happy, no faults!
    result = 0
    j = 0
    i = 0
    while (j < 2):
        while (i < n):
            if (j+i+m) < len(b): # don't overflow
                result = b[j+i]
            i += 1
        j += 1
        i = 0

@cuda.jit(debug=True)
def flinThree(b, m, n):
    # this kernel fails... segfault
    result = 0
    j = 0
    i = 0
    while (j < 2):
        while (i < n):
            if (j+i+m) < len(b): # don't overflow
                result = b[j+i+m]
            i += 1
        j += 1
        i = 0


a = np.asarray([[0,1,2,3,4,5],[0,1,2,3,4,5]])
d = a.flatten()
a_d = cuda.to_device(a)
d_d = cuda.to_device(d)

flinTwo[1,1](d_d, 0, 5)
print('flinTwo completes, no failures')

# this segfaults 
flinThree[1, 1](d_d, 0, 5)

# this segfaults too
fmat[1,1](a_d, 5)

I assume this issue is related to the issue on which the ticket was opened, so I post it here. Any fix or workaround would be appreciated!

gmarkall · 2021-07-20T09:58:29Z

NVVM does not support turning on debug without also turning off optimization. This is noted in the description of the -g option in https://docs.nvidia.com/cuda/libnvvm-api/group__compilation.html#group__compilation_1g76ac1e23f5d0e2240e78be0e63450346.
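As a rough illustration of that constraint, here is a minimal sketch of how the two kwargs could map onto libNVVM option strings. The helper name nvvm_options is hypothetical and is not Numba's actual code; the only thing taken from the libNVVM documentation is that -g is valid only together with -opt=0.

```python
from typing import List


def nvvm_options(debug: bool, opt: bool) -> List[str]:
    """Hypothetical helper: build libNVVM option strings from two flags.

    Assumption for illustration only: opt=True maps to -opt=3 and
    opt=False maps to -opt=0. This is not how Numba is known to do it.
    """
    options = ["-opt=3" if opt else "-opt=0"]
    if debug:
        if opt:
            # libNVVM documents -g as valid only with -opt=0, so refuse
            # the combination instead of handing it to the compiler.
            raise ValueError("-g is only valid together with -opt=0")
        options.append("-g")
    return options


print(nvvm_options(debug=True, opt=False))  # ['-opt=0', '-g']
```

Rejecting the combination up front is only one possible design; the thread goes on to prefer a warning so that existing code keeps working.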

To compile with debug, opt=False also needs adding:

from numba import cuda

@cuda.jit(debug=True, opt=False)
def f(b, n):
    result = 0
    for i in range(n):
        result += b

f[1, 1](1, 2)

and

from numba import cuda
import numpy as np

# 3 kernels, 2 fail on segfault
# fmat       indexes a matrix b[i,j] with 2 variables i,j; fails on segfault
# flinTwo   indexes an array b[i+j] with 2 variables i,j; passes
# flinthree indexes an array b[i+m+j] with 3 variables i,j,m; fails on segfault

@cuda.jit(debug=True, opt=False)
def fmat(b, n):
    # this kernel fails... segfault
    result = 0
    j = 1
    i = 0
    while (j < b.shape[0]): # don't overflow
        while (i < n) and (i < b.shape[1]): # don't overflow
            result = b[j,i]
            i += 1
        j += 1
        i = 0

@cuda.jit(debug=True, opt=False)
def flinTwo(b, m, n):
    # this kernel is happy, no faults!
    result = 0
    j = 0
    i = 0
    while (j < 2):
        while (i < n):
            if (j+i+m) < len(b): # don't overflow
                result = b[j+i]
            i += 1
        j += 1
        i = 0

@cuda.jit(debug=True, opt=False)
def flinThree(b, m, n):
    # this kernel is happy, no faults!
    result = 0
    j = 0
    i = 0
    while (j < 2):
        while (i < n):
            if (j+i+m) < len(b): # don't overflow
                result = b[j+i+m]
            i += 1
        j += 1
        i = 0


a = np.asarray([[0,1,2,3,4,5],[0,1,2,3,4,5]])
d = a.flatten()
a_d = cuda.to_device(a)
d_d = cuda.to_device(d)

flinTwo[1,1](d_d, 0, 5)
print('flinTwo completes, no failures')

# this segfaults 
flinThree[1, 1](d_d, 0, 5)

# this segfaults too
fmat[1,1](a_d, 5)

both execute for me with no segfault or other issue.

For now, (prior to other fixes / reorganization of the debug kwarg to disentangle exception checking and debuginfo generation), we should emit a warning to inform the user they ought to also use opt=False.

As per Issue numba#7214, using debug=True with opt=True can cause NVVM to crash (or the link can fail with an error message from ptxas). We can't easily stop the user from doing this by default as it might break a lot of existing code (particularly that which uses debug=True to force a check for exceptions) but we can warn them.

This commit adds a warning when debug and opt are both true, and some tests for most cases. The debuginfo tests are also updated so that they don't trigger the warning and make the test suite noisy.

However, I chose not to "fix" the tests that check for a raised exception (e.g. in test_exception and test_fastmath) - I think the test suite should be emitting warnings there to highlight/remind about the fact that we need to split debug=True into at least two options (e.g. debuginfo=True and exceptions=True).

Fixes numba#7214.
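The check described above could look roughly like the following sketch. The names DebugOptWarning and warn_if_debug_and_opt are hypothetical, chosen for illustration; the actual warning class and message used in the PR may differ. It assumes opt defaults to True, as the original report implies.

```python
import warnings


class DebugOptWarning(UserWarning):
    """Hypothetical warning category, used only for this sketch."""


def warn_if_debug_and_opt(debug: bool, opt: bool = True) -> bool:
    """Warn, rather than raise, when debug and opt are both enabled.

    Returns True if a warning was emitted. Warning instead of raising
    keeps existing code that relies on debug=True for exception checks
    running.
    """
    if debug and opt:
        warnings.warn(
            "debug=True with opt=True is not supported by NVVM; "
            "pass opt=False as well",
            DebugOptWarning,
            stacklevel=2,
        )
        return True
    return False


warn_if_debug_and_opt(debug=True, opt=False)  # silent
warn_if_debug_and_opt(debug=True)             # emits DebugOptWarning
```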

gmarkall · 2021-07-20T11:49:45Z

PR #7222 adds the relevant warning.

sgbaird · 2021-08-28T07:55:14Z

I think this might be the issue I'm experiencing in #7326

gmarkall · 2021-08-31T10:29:26Z

> I think this might be the issue I'm experiencing in #7326

I don't see any use of debug=True there - why do you think this is the same issue? (have I missed the use of debug?)

sgbaird · 2021-08-31T12:24:19Z

It was with the environment variable (unless I'm mistaken about its usage):

os.environ["NUMBA_CUDA_DEBUGINFO"] = "1"
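To show why no debug=True needs to appear in the code for this to happen, here is a small sketch of how such an environment variable could act as the default for the debug kwarg. The helper effective_debug is hypothetical, and the assumption that NUMBA_CUDA_DEBUGINFO is parsed as an integer and used only when debug is not passed explicitly is an illustration, not a statement of Numba's implementation.

```python
def effective_debug(jit_kwargs: dict, environ: dict) -> bool:
    """Hypothetical helper: resolve the debug flag for one kernel.

    Assumption: the environment variable supplies the default, and an
    explicit debug=... kwarg on the decorator overrides it.
    """
    env_default = int(environ.get("NUMBA_CUDA_DEBUGINFO", "0")) != 0
    return bool(jit_kwargs.get("debug", env_default))


env = {"NUMBA_CUDA_DEBUGINFO": "1"}

# A plain @cuda.jit (no kwargs) would then compile with debug enabled,
# and with optimization still on by default, giving the same opt+debug
# combination as in the original report.
print(effective_debug({}, env))                # True
print(effective_debug({"debug": False}, env))  # False
```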

gmarkall · 2021-08-31T14:04:44Z

Ah! You're right!

stuartarchibald added the "CUDA" (CUDA related issue/PR) and "bug - segfault" (Bugs that cause SIGSEGV, SIGABRT, SIGILL, SIGBUS) labels Jul 16, 2021

gmarkall mentioned this issue Jul 20, 2021

CUDA: Warn when debug=True and opt=True #7222

Merged

gmarkall mentioned this issue Jul 20, 2021

[FEA] Properly raise when attempting to cast NA to bool inside UDFs rapidsai/cudf#8774

Open

sklam closed this as completed in #7222 Aug 11, 2021

gmarkall mentioned this issue Aug 31, 2021

CUDA: NVVM crash with reproducer #7326

Open
