
Bounds checking #4432

Open · wants to merge 20 commits into master

Conversation

@asmeurer (Contributor) commented Aug 9, 2019

This is still a work in progress, but feedback is welcome. I'm still new to LLVM so let me know if I should be generating the code in a better way.

I have changed the API of get_item_pointer and get_item_pointer2 to include the context argument, which is required to raise an exception.

Still TODO:

  • Add tests
  • Add the flag to the public API (right now it is enabled by default for
    testing purposes).
  • Document the flag
  • Figure out why the parallel tests fail with bounds checking enabled
  • Figure out why the cuda tests fail with bounds checking enabled
  • Add a CI run with bounds checking globally enabled
  • See if missing broadcast errors are related to this #4432 (comment)
  • Fix memory leak in the tests
  • Add support for bounds checking fancy indexing

TODOs that should probably wait for a future PR:

  • Make the error message match object mode. This would require being fancier about how the exception is generated. For now, if NUMBA_FULL_TRACEBACKS=1 is set, the index, shape, and axis are printed as a debug message.
  • Make the error message show the location in the code

I'll need help on how to do the last item.

There is also a boundcheck flag in some places in the code, which doesn't seem to do anything. I have named my flag boundscheck with an s, as that seemed better, but I don't particularly care if you decide another name is better.

asmeurer added 2 commits Aug 9, 2019
This is required to be able to raise exceptions, for instance, for bounds
checking.
So far we need to:

- Add tests
- Make the error message match object mode
- Add the flag to the public API (right now it is enabled by default for
  testing purposes)
    with if_unlikely(builder, out_of_bounds_upper):
        context.call_conv.return_user_exc(builder, IndexError, (msg,))
    out_of_bounds_lower = builder.icmp_signed('<', ind, ind.type(0))
    with if_unlikely(builder, out_of_bounds_lower):

@asmeurer (Author) commented Aug 9, 2019

Should I merge these two ifs? I didn't see how to do a boolean OR in the llvmlite documentation.

@stuartarchibald (Contributor) commented Aug 12, 2019

You possibly want something like this:

            out_of_bounds_upper = builder.icmp_signed('>=', ind, dimlen)
            out_of_bounds_lower = builder.icmp_signed('<', ind, ind.type(0))
            predicate = builder.or_(out_of_bounds_upper, out_of_bounds_lower)
            with if_unlikely(builder, predicate):
                context.call_conv.return_user_exc(builder, IndexError, (msg,))

Docs: http://llvmlite.pydata.org/en/latest/user-guide/ir/ir-builder.html#llvmlite.ir.IRBuilder.or_

@asmeurer (Author) commented Aug 12, 2019

It says or_ is bitwise OR (like | instead of ||). I don't know if that matters here. I mainly want to make sure I generate code that LLVM optimizes correctly (how can I check that?).

@stuartarchibald (Contributor) commented Aug 12, 2019

It is indeed bitwise OR; I suggested it because it has the same net effect here, though it will likely generate an or op whereas a logical OR would likely generate jump(s). To get something logical-OR-like (i.e. ||) it's a bit like what you have already, but the branches need nesting so that the second condition is only checked if the first passes. I'm assuming that's what you are thinking of?

You can check the LLVM IR with NUMBA_DUMP_LLVM=1 as an env var, or the .inspect_llvm method on the dispatcher. The assembly can be seen with the .inspect_asm method on the dispatcher. How were you hoping it would optimise?
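For example, a quick way to look at the IR from Python (a sketch; boundscheck is the flag spelling this PR proposes, and first/x are illustrative names):

    from numba import njit
    import numpy as np

    @njit(boundscheck=True)  # flag name as proposed in this PR
    def first(x):
        return x[0]

    first(np.ones(3))  # compile for float64[::1]

    # Print the LLVM IR for each compiled signature; the bounds check shows
    # up as icmp/br instructions ahead of the load.
    for sig, llvm_ir in first.inspect_llvm().items():
        print(sig)
        print(llvm_ir)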

@asmeurer (Author) commented Aug 12, 2019

I mean if the index is already known to be less than the dimension or nonnegative it should optimize out the if.

@stuartarchibald (Contributor) commented Aug 12, 2019

> If I'm reading the LLVM correctly (and I'm not very good at reading LLVM, so I could be wrong), it saves the IndexError as a predefined constant. So that adds a bit of overhead to every function call (though it looks like it does it for every exception, so maybe that's not my problem to solve here). Also it implies that it would be more efficient to have only one return_user_exc here.

I'm not sure how this adds overhead to a function call, any chance you could please expand on this? Thanks.

@asmeurer (Author) commented Aug 12, 2019

> I think under very specific conditions it might be possible to do this, but all the information would likely have to be constant at compile time. Also, any if from the pattern being added here will likely create a branch in a tight indexing loop and prevent vectorization.

So if I have something like

A = 0
for i in range(x.shape[0]):
    A += x[i]

it won't be able to deduce that 0 <= i < x.shape[0] and so the bounds checking can be optimized away?

> I'm not sure how this adds overhead to a function call, any chance you could please expand on this? Thanks.

I think I misread the LLVM. The exceptions are saved as constant pickles on the functions, but are only unpickled when needed. So it adds a memory overhead to the function definition, but that's it. It's probably better to not have the exception there twice but maybe not a huge deal.

More importantly, since the exceptions are precomputed, I don't see how to raise an exception with a dynamic message, namely including the dimension in the error message like NumPy does. Would we need to add support for pickling a function that returns the exception with the given message? Or is this already something that can be done? If it's not easy to do this, I might skip it for this PR, if that's alright.

@stuartarchibald (Contributor) commented Aug 13, 2019

> I think under very specific conditions it might be possible to do this, but all the information would likely have to be constant at compile time. Also, any if from the pattern being added here will likely create a branch in a tight indexing loop and prevent vectorization.

> So if I have something like
>
>     A = 0
>     for i in range(x.shape[0]):
>         A += x[i]
>
> it won't be able to deduce that 0 <= i < x.shape[0] and so the bounds checking can be optimized away?

In theory, yes, but it depends on what code gets generated. I think that if an unsigned comparison operator were used in the bounds check then LLVM is more likely to be able to work it out, if a signed comparison is done then it won't. I just generated the assembly for:

  1. no boundscheck
  2. boundscheck with signed comparison
  3. boundscheck with unsigned comparison

Cases 1 and 3 are reasonably similar, but 1 has a deeper loop unroll. Case 2 has no vectorisation because there's a test/jump in the middle of the loop.

I think more can be done in terms of identifying unsigned ranges and suspect that this is likely to help.
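One classic trick here (a sketch against the llvmlite builder, reusing ind, dimlen, and if_unlikely from the diff above): a single unsigned comparison folds both bounds tests into one, because a negative signed index reinterpreted as unsigned becomes a huge value that also fails the upper-bound test.

    # ind and dimlen are signed i64 values; comparing them as unsigned makes
    # any negative ind compare >= dimlen, so one test covers both bounds.
    out_of_bounds = builder.icmp_unsigned('>=', ind, dimlen)
    with if_unlikely(builder, out_of_bounds):
        context.call_conv.return_user_exc(builder, IndexError, (msg,))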

> I'm not sure how this adds overhead to a function call, any chance you could please expand on this? Thanks.

> I think I misread the LLVM. The exceptions are saved as constant pickles on the functions, but are only unpickled when needed. So it adds a memory overhead to the function definition, but that's it. It's probably better to not have the exception there twice but maybe not a huge deal.
>
> More importantly, since the exceptions are precomputed, I don't see how to raise an exception with a dynamic message, namely including the dimension in the error message like NumPy does. Would we need to add support for pickling a function that returns the exception with the given message? Or is this already something that can be done? If it's not easy to do this, I might skip it for this PR, if that's alright.

Numba stores exception strings as compile time constants so you cannot raise an exception with a dynamically generated string. So yes, skipping it is fine :)

@stuartarchibald (Contributor) commented Aug 15, 2019

I made all range(x) calls unsigned in #4446 and was reminded of the amount of work needed to implement this correctly due to the NumPy casting rule (generalizing as): uint* <op> int* -> float.

@asmeurer (Author) commented Aug 15, 2019

> Numba stores exception strings as compile time constants so you cannot raise an exception with a dynamically generated string. So yes, skipping it is fine :)

I think you could serialize a function that creates the IndexError rather than the IndexError itself. Then modify return_user_exc so it can call the function to create the exception before raising it.

The thing I'm not sure about is the builder instructions for converting the index, shape, and axis into strings so they can be put in the message.
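At the Python level, the serialization half of that idea might look like this sketch (make_index_error is hypothetical; as noted above, the missing piece is producing the integer arguments from LLVM-level values):

    import pickle

    def make_index_error(index, axis, size):
        # Build the NumPy-style message at raise time, not compile time.
        return IndexError("index %d is out of bounds for axis %d with size %d"
                          % (index, axis, size))

    payload = pickle.dumps(make_index_error)  # stored at compile time
    exc = pickle.loads(payload)(10, 0, 4)     # built with runtime values
    print(exc)  # index 10 is out of bounds for axis 0 with size 4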

(Two review threads on numba/cgutils.py, marked outdated and resolved.)
@stuartarchibald (Contributor) commented Aug 12, 2019

> There is also a boundcheck flag in some places in the code, which doesn't seem to do anything. I have named my flag boundscheck with an s, as that seemed better, but I don't particularly care if you decide another name is better.

Agree, boundscheck makes sense. I also think renaming the current flag to that and reusing it if convenient would be good; else just remove the current flag as IIRC it's dead.

@asmeurer (Author) commented Aug 12, 2019

I guess this isn't accounting for the numpy shape:

>>> import numba
>>> import numpy
>>> @numba.jit
... def test(x):
...     return x[:, 0]
>>> test(numpy.empty((0, 1)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index is out of bounds

This is the source of the test failures.

Is the numpy shape saved somewhere? If not, I guess we can just disable bounds checks if one of the shapes is 0. We can't get segfaults or nonsense memory values in that case. It mainly would help for type checking.

@stuartarchibald (Contributor) commented Aug 15, 2019

> I guess this isn't accounting for the numpy shape:
>
>     >>> import numba
>     >>> import numpy
>     >>> @numba.jit
>     ... def test(x):
>     ...     return x[:, 0]
>     >>> test(numpy.empty((0, 1)))
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     IndexError: index is out of bounds
>
> This is the source of the test failures.
>
> Is the numpy shape saved somewhere? If not, I guess we can just disable bounds checks if one of the shapes is 0. We can't get segfaults or nonsense memory values in that case. It mainly would help for type checking.

The shape is an argument to the get_item_pointer2 function, so I guess you can generate a branch based on that? I'm not sure that disabling bounds checking on arrays where one of the shape elements is 0 would be desirable: you can still access outside the allocated region, and indeed the page, since it's just a GEP that's being emitted. E.g. with no bounds check:

import numba
import numpy
@numba.jit
def test(x):
    big_index = 10000000000000000
    return x[big_index, big_index]

print(test(numpy.empty((0, 1))))

will likely segfault.


This little snippet might help you with debugging; it lets you print an iterable from the compiled code:

        def print_iterable(builder, name, fmt, iterable):
            printf(builder, "{}\n".format(name))
            for x in iterable:
                printf(builder, fmt, x)
            printf(builder, "\n")

        print_iterable(builder, "shape", "%d, ", shape)
        print_iterable(builder, "indices", "%d, ", indices)

if you paste that into the if boundscheck: branch body and run the failing example above you should see:

shape
0, 1, 
indices
0, 0, 
Traceback (most recent call last):
  File "test_4432.py", line 21, in <module>
    print(test(numpy.empty((0, 1))))
IndexError: index is out of bounds branch1

To accommodate the 0d case I suspect a:

    with builder.if_else(zero_d_predicate) as (then, otherwise):
        with then:
            <stuff>
        with otherwise:
            <stuff>

might be useful?

Hope this helps?

@asmeurer (Author) commented Aug 15, 2019

Oh the problem isn't that it's removing the numpy shape as I was assuming. It's that bounds checking for an empty dimension shouldn't do the index >= dimension check.

Also regarding the error thing, I hadn't realized that it reuses the pickled exception if it is used multiple times. I had seen it twice in there but only because I had different error messages for the upper and lower out of bounds at the time. So I was worried that you would get 2n pickled exceptions saved to the compiled function for n array accesses, but that isn't the case.

Thanks for the debug suggestion. I didn't notice the printf function.

@asmeurer (Author) commented Aug 15, 2019

Can we distinguish these two cases inside numba?

>>> a = np.empty((0, 1))
>>> a[:, 0]
array([], dtype=float64)
>>> a[0, 0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 0 is out of bounds for axis 0 with size 0

The above printf code shows the shape as (0, 1) and indices as (0, 0) for both.

@asmeurer (Author) commented Aug 15, 2019

This code does not segfault for me (in numba master):

from numba import njit
@njit
def test(x):
    return x[:, 10000000000000000000]
import numpy as np
a = np.empty((0, 1))
test(a)
@asmeurer (Author) commented Aug 15, 2019

I think what we have to do is to implement separate bounds checking in basic_indexing() in arrayobj.py (and possibly other places if they deal with slices as well). That function already does its own wraparound logic (it's the reason there is a flag for it in get_item_pointer, so it can avoid doing it twice), so it's reasonable for it to do bounds checking as well. As far as I can tell the input to get_item_pointer for a[:, 0] and a[0, 0] is identical. Only basic_indexing() knows when the index is a slice, and hence can be used on an empty dimension.

I don't actually understand how numpy handles shape-0 arrays. Does it just create a placeholder pointer?

In NumPy with a = np.empty((0, 1)), a[0, 0] raises IndexError, but a[:, 0] does not.

It is impossible to distinguish these two cases in get_item_pointer, which both want a pointer to the "start" of the empty array. So I have factored out the bounds checking into a helper function, and called it separately in basic_indexing() (the same as is already done with wraparound correction). That way, we turn off bounds checking for slice indices on shape 0 axes. Note that slice indices are already not bounds checked from above, e.g., a[0:100] works even if a has fewer than 100 elements. It works this way even on Python lists.
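A minimal sketch of the factored-out helper described above (the actual name and signature in the PR may differ; if_unlikely, builder, and context are the cgutils/llvmlite objects used earlier in the thread):

    def do_boundscheck(context, builder, ind, dimlen,
                       msg="index is out of bounds"):
        # Raise IndexError if ind >= dimlen.
        out_of_bounds_upper = builder.icmp_signed('>=', ind, dimlen)
        with if_unlikely(builder, out_of_bounds_upper):
            context.call_conv.return_user_exc(builder, IndexError, (msg,))
        # Raise IndexError if ind < 0 (after wraparound correction).
        out_of_bounds_lower = builder.icmp_signed('<', ind, ind.type(0))
        with if_unlikely(builder, out_of_bounds_lower):
            context.call_conv.return_user_exc(builder, IndexError, (msg,))

basic_indexing() can then call this per axis for integer indices only, skipping slice indices entirely.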
@asmeurer (Author) commented Aug 15, 2019

I've implemented that in the latest commit.

asmeurer added 5 commits Aug 15, 2019
Python/NumPy doesn't do any bounds checking on slices.
… when num == 0
This is only temporary, until we can include it in the error message proper. I
have enabled it when NUMBA_FULL_TRACEBACKS is set (I don't want to add a new
debug flag for it since it should hopefully be removed soon).
@stuartarchibald (Contributor) commented Aug 16, 2019

> >>> a = np.empty((0, 1))
> >>> a[:, 0]
> array([], dtype=float64)
> >>> a[0, 0]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> IndexError: index 0 is out of bounds for axis 0 with size 0

By the time we get to GEP I think it's probably too late; the contextual slice info is lost. I think this is because a slice looks like accessing index 0, i.e. "where does the data start for a slice" == "what is index 0". This https://github.com/numba/numba/pull/4432/files#diff-36c9ab8fa31f756e0da52905db0a95deR717-R722 is just working out where in the data block the relevant data is.

@asmeurer (Author) commented Aug 16, 2019

I figured out that issue (see my most recent comments/commits).

Right now I'm trying to figure out the test failures, which are presently failing because bounds checking is still enabled by default. It looks like there are some issues with both cuda and parallel.

Here's an example:

from numba import njit
import numpy as np

@njit(parallel=True)
def test(X):
    X[:0] += 1
    return X

test(np.ones(10))

Enabling the debug output which I just added, you get

debug: IndexError: index 1 is out of bounds for axis 0 with size 0
debug: IndexError: index 0 is out of bounds for axis 0 with size 0
debug: IndexError: index 7 is out of bounds for axis 0 with size 0
debug: IndexError: index 5 is out of bounds for axis 0 with size 0
debug: IndexError: index 9 is out of bounds for axis 0 with size 0
debug: IndexError: index 4 is out of bounds for axis 0 with size 0
debug: IndexError: index 3 is out of bounds for axis 0 with size 0
debug: IndexError: index 6 is out of bounds for axis 0 with size 0
debug: IndexError: index 8 is out of bounds for axis 0 with size 0
debug: IndexError: index 2 is out of bounds for axis 0 with size 0
IndexError: index is out of bounds

I guess parallel takes slices and converts them into explicit indices. But bounds checking should not be enabled for slices, so I need to figure out where that is happening and how to disable it for that code path, similar to what I did for the single core slice code path.

I haven't looked at the cuda issues yet, but I think they are similar.

As an aside, it could be a good idea to have a CI build that runs the test suite with bounds checking enabled. I already found one function in numba that failed with it enabled (see 0b95b7c).

@stuartarchibald (Contributor) commented Aug 16, 2019

> I figured out that issue (see my most recent comments/commits).
>
> Right now I'm trying to figure out the test failures, which are presently failing because bounds checking is still enabled by default. It looks like there are some issues with both cuda and parallel.
>
> Here's an example:
>
>     from numba import njit
>     import numpy as np
>
>     @njit(parallel=True)
>     def test(X):
>         X[:0] += 1
>         return X
>
>     test(np.ones(10))
>
> Enabling the debug output which I just added, you get
>
>     debug: IndexError: index 1 is out of bounds for axis 0 with size 0
>     debug: IndexError: index 0 is out of bounds for axis 0 with size 0
>     debug: IndexError: index 7 is out of bounds for axis 0 with size 0
>     debug: IndexError: index 5 is out of bounds for axis 0 with size 0
>     debug: IndexError: index 9 is out of bounds for axis 0 with size 0
>     debug: IndexError: index 4 is out of bounds for axis 0 with size 0
>     debug: IndexError: index 3 is out of bounds for axis 0 with size 0
>     debug: IndexError: index 6 is out of bounds for axis 0 with size 0
>     debug: IndexError: index 8 is out of bounds for axis 0 with size 0
>     debug: IndexError: index 2 is out of bounds for axis 0 with size 0
>     IndexError: index is out of bounds
>
> I guess parallel takes slices and converts them into explicit indices. But bounds checking should not be enabled for slices, so I need to figure out where that is happening and how to disable it for that code path, similar to what I did for the single core slice code path.

Were I to guess, this is where I'd start looking. github.com/numba/numba/blob/f629736997eaf83040caecd1fea7322577b7fd89/numba/parfor.py#L1921-L1934
Try here:

numba/parfor.py, lines 2214 to 2220 at f629736:

    if isinstance(value_typ, types.npytypes.Array):
        value_var = ir.Var(scope, mk_unique_var("$value_var"), loc)
        self.typemap[value_var.name] = value_typ.dtype
        getitem_call = ir.Expr.getitem(value, index_var, loc)
        self.calltypes[getitem_call] = signature(
            value_typ.dtype, value_typ, index_var_typ)
        true_block.body.append(ir.Assign(getitem_call, value_var, loc))

> I haven't looked at the cuda issues yet, but I think they are similar.
>
> As an aside, it could be a good idea to have a CI build that runs the test suite with bounds checking enabled. I already found one function in numba that failed with it enabled (see 0b95b7c).

If we have the compute capacity then I think this makes sense.

@stuartarchibald (Contributor) commented Aug 29, 2019

Note to self: need to find a way to get an ir.Loc to the site where the exception is created; otherwise the exception will correctly report out of bounds, but the site it is raised from will be the jitted function, not the corresponding source line.

@stuartarchibald (Contributor) commented Sep 3, 2019

@DrTodd13 This is the issue with parfors and bounds checking; if you have any ideas they'd be appreciated, thanks! #4432 (comment)

@asmeurer (Author) commented Sep 3, 2019

The bounds check should only happen when the getitem is physically executed, unless I am misunderstanding how things work. Is the parfor generated from the slice doing unnecessary redundant getitems? I haven't yet been able to debug why it happens (any suggestions for how to debug this sort of thing, beyond "learn how to read LLVM"?).

And if it is, and they're not easy to remove, is there some way to annotate the getitem so that get_item_pointer knows it shouldn't be bounds checked? The parfor logic happens in a completely different stage from get_item_pointer.

@asmeurer (Author) commented Sep 27, 2019

OK, regarding parfor, I'm pretty sure there's a bug

>>> from numba import njit
>>> import numpy as np
>>> @njit(parallel=True)
... def test(X):
...     X[:0] = 2
...     return X
>>> test(np.ones(10))
array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])
>>> def test2(X):
...     X[:0] = 2
...     return X
>>> test2(np.ones(10))
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

I opened #4630 for it.

@asmeurer (Author) commented Sep 27, 2019

Regarding the test_array_ctypes_ref_error_in_parallel test failure (#2887), it seems to be unrelated to this change. I think there's a race condition on the memory being freed, and the extra time spent by the bounds check causes it to be hit.

Consider this behavior from the test, slightly modified to call the callback twice. This is on master:

# test.py
from numba import njit
import numpy as np

# Issue #2887
from ctypes import CFUNCTYPE, c_void_p, c_int32, c_double, c_bool

@CFUNCTYPE(c_bool, c_void_p, c_int32, c_void_p)
def callback(inptr, size, outptr):
    # A ctypes callback that manipulate the incoming pointers.
    try:
        inbuf = (c_double * size).from_address(inptr)
        outbuf = (c_double * 1).from_address(outptr)
        a = np.ndarray(size, buffer=inbuf, dtype=np.float64)
        b = np.ndarray(1, buffer=outbuf, dtype=np.float64)
        b[0] = (a + a.size)[0]
        return True
    except:
        import traceback
        traceback.print_exc()
        return False

# parallel=True was required in the original test to reproduce the error,
# but as noted below it is not actually needed.
@njit()
def foo(size):
    arr = np.ones(size)
    out = np.empty(1)
    # Exercise array.ctypes
    inct = arr.ctypes
    outct = out.ctypes
    # The reference to `arr` is dead by now
    # print(inct.data)
    status = callback(inct.data, size, outct.data)
    print(status, out)
    status = callback(inct.data, size, outct.data)
    print(status, out)

size = 3
foo(size)

$ python test.py
True [4.]
True [-9.86830992e+148]

If you remove the @njit decorator, you get

True [4.]
True [4.]

I've also removed the parallel decorator, which isn't needed to reproduce it. Probably the parallel=True also caused the same race condition to be hit with its overhead, as the function doesn't actually use any parallelism.

So I think issue #2887 wasn't actually fixed.

asmeurer added 4 commits Sep 27, 2019
…oundscheck flag to @jit
@asmeurer (Author) commented Sep 27, 2019

I've added a boundscheck flag to @jit (replacing the old unused boundcheck flag), defaulting to False. I also added a NUMBA_BOUNDSCHECK environment variable which globally overrides the flag. I've added documentation for both.
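For reference, the intended usage looks like this (a sketch; get is an illustrative name):

    from numba import njit
    import numpy as np

    @njit(boundscheck=True)
    def get(x, i):
        return x[i]

    get(np.ones(3), 10)  # raises IndexError instead of reading out of bounds

With NUMBA_BOUNDSCHECK=1 (or =0) in the environment, the check is forced on (or off) regardless of the decorator argument.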

What remains to do is to add tests for this, and also add (and fix) a CI build with NUMBA_BOUNDSCHECK=1. I'm hoping someone else will look at the parfor issue (#4630) and the #2893 issue. I still need to look into the CUDA failures and see if bounds checking even makes sense on CUDA or if the flag should be disallowed there (unfortunately the Azure build logs from before are gone now, so I have to wait for a rebuild; would it be possible to change the Azure build logs so they stick around forever?).

By the way, why is jit documented separately in its docstring and in the RST docs? Wouldn't it be better to pull in the docstring using autodoc?

asmeurer added 2 commits Sep 30, 2019
Jit flags
---------

These variables globally override flags to the :func:`~numba.jit` decorator.

@dhirschfeld (Contributor) commented Oct 1, 2019

It would be extremely surprising to me if I explicitly wrote boundscheck=False and that didn't have any effect because of a hidden environment variable NUMBA_BOUNDSCHECK=1 which I didn't even know about.

I guess you want to give a user the ability to modify the default behaviour of a library they didn't write so maybe it's unavoidable?

@asmeurer (Author) commented Oct 2, 2019

Yes, that's the idea: add a test build for numba itself with it globally enabled, but I can definitely see this being useful for general code that uses numba. If you get weird results, you can turn this on to see if it's an out of bounds access. You can get the same thing by disabling jit entirely (NUMBA_DISABLE_JIT), but sometimes that's too slow to be tenable. I didn't realize this would be controversial.

I can make it so that explicitly writing jit(boundscheck=False) overrides even the environment variable. That would be a rather rare thing to write, since the default is already False. But that also makes the overall logic for the flag and the environment variable a bit confusing IMO.

@dhirschfeld (Contributor) commented Oct 2, 2019

I think it's fine as-is - just something for a user to watch out for. Assuming they're not sharing a computer they would have had to set the env var themselves so they should know about it...

@asmeurer (Author) commented Oct 2, 2019

Aside from a lack of tests, which I will work on soon, this is ready for review.

One question about tests: is there a good way to test the environment variable behavior? I want to test that it overrides the flag correctly, but it would need to be done in a subprocess.

asmeurer changed the title from "[WIP] Bounds checking" to "Bounds checking" on Oct 2, 2019
@asmeurer (Author) commented Oct 2, 2019

Also where's the best place to add a CI build with bounds checking enabled (assuming you still agree it's a good idea)?

@stuartarchibald (Contributor) commented Oct 3, 2019

> Aside from a lack of tests, which I will work on soon, this is ready for review.
>
> One question about tests: is there a good way to test the environment variable behavior? I want to test that it overrides the flag correctly, but it would need to be done in a subprocess.

@asmeurer you probably want something like this:

def test_disable_performance_warnings(self):
    not_found_ret_code = 55
    found_ret_code = 99
    expected = "'parallel=True' was specified but no transformation"
    # NOTE: the error_usecases is needed as the NumbaPerformanceWarning's
    # for parallel=True failing to parallelise do not appear for functions
    # defined by string eval/exec etc.
    parallel_code = """if 1:
        import warnings
        from numba.tests.error_usecases import foo
        import numba
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter('always')
            foo()
        for x in w:
            if x.category == numba.errors.NumbaPerformanceWarning:
                if "%s" in str(x.message):
                    exit(%s)
        exit(%s)
        """ % (expected, found_ret_code, not_found_ret_code)
    # run in the standard env, warning should raise
    popen = subprocess.Popen([sys.executable, "-c", parallel_code])
    out, err = popen.communicate()
    self.assertEqual(popen.returncode, found_ret_code)
    # run in an env with performance warnings disabled, should not warn
    env = dict(os.environ)
    env['NUMBA_DISABLE_PERFORMANCE_WARNINGS'] = "1"
    popen = subprocess.Popen([sys.executable, "-c", parallel_code], env=env)
    out, err = popen.communicate()
    self.assertEqual(popen.returncode, not_found_ret_code)
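Adapted to NUMBA_BOUNDSCHECK, the same pattern might look like this (a sketch; the return codes and the oob function are illustrative):

    import os
    import subprocess
    import sys

    boundscheck_code = """if 1:
        import numpy as np
        from numba import njit

        @njit(boundscheck=False)  # the env var should override this flag
        def oob(x):
            return x[100]

        try:
            oob(np.ones(4))
        except IndexError:
            exit(99)  # bounds check fired
        exit(55)      # the out-of-bounds read went unchecked
        """

    env = dict(os.environ)
    env['NUMBA_BOUNDSCHECK'] = '1'
    popen = subprocess.Popen([sys.executable, "-c", boundscheck_code], env=env)
    popen.communicate()
    assert popen.returncode == 99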

> Also where's the best place to add a CI build with bounds checking enabled (assuming you still agree it's a good idea)?

Probably the Azure tests: https://github.com/numba/numba/blob/5bd018fd55ab2ff0916ab2c6360334fe94e471a4/azure-pipelines.yml
with a conditional export here:

# switch off color messages
export NUMBA_DISABLE_ERROR_MESSAGE_HIGHLIGHTING=1
# switch on developer mode
export NUMBA_DEVELOPER_MODE=1
# enable the fault handler
export PYTHONFAULTHANDLER=1

I think it'd be wise to see how much of an impact it has on runtime before committing to having it on in public CI. Have you run the test suite locally with it on/off to get some indication? Thanks.

@asmeurer (Author) commented Oct 3, 2019

I'll run the tests locally now, but based on old CI builds on this branch (enabled and disabled), it doesn't seem like there is much of a difference.

@sklam (Contributor) commented Oct 4, 2019

Replying to #4432 (comment) regarding issue #2887.

TL;DR

Pass array.ctypes into a C-callable instead of passing array.ctypes.data. The latter is not holding a reference to array.

Details

The problem seen here is different. The use of arr.ctypes.data yields a plain integer that holds no reference to the array arr, so it is possible that arr has already been deallocated by the time the pointer is used.

The numpy docs mention the following:

> Note that unlike data_as, a reference will not be kept to the array: code like ctypes.c_void_p((a + b).ctypes.data) will result in a pointer to a deallocated array, and should be spelt (a + b).ctypes.data_as(ctypes.c_void_p)

(from https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.ctypes.html)

The trouble is complicated by:

  • CPython would keep arr around if it is a local variable, but that is an implementation detail. Numba is more eager to clean up dead variables (determined by static analysis), so it can remove unused variables very early.
  • In pure Python, such code will likely give the illusion of working: 1) arr.ctypes creates a cyclic reference that defers deallocation further (until the cycle detector kicks in); 2) NumPy's deallocation routine does not erase the underlying data or unmap it, so reading and writing the deallocated buffer will likely not cause any error. Numba, on the other hand, overwrites the first cache line of the to-be-deallocated buffer with memset(ptr, 0xDE, MIN(size, 256)); to make use-after-free errors more obvious. The -9.86830992e+148 value is the 0xDEDEDEDEDEDEDEDE byte sequence interpreted as a float64.
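That interpretation is easy to confirm with the standard library:

    import struct

    # The 0xDE fill pattern read back as a little-endian float64.
    print(struct.unpack('<d', bytes([0xDE] * 8))[0])  # approx. -9.8683e+148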

Since numba doesn't implement arr.ctypes.data_as, the recommendation is to pass arr.ctypes, which ctypes will gladly accept for an argument expecting a pointer type like c_void_p.

EDIT:

For those curious, here's a script that shows the cyclic reference of a.ctypes:

from ctypes import c_double
import weakref
import gc

import numpy as np


def foo():
    a = np.ones(1)
    wref = weakref.ref(a)
    return a.ctypes.data, wref


def bar():
    a, wref = foo()
    # The following proves that `a.ctypes` created a cyclic reference;
    # comment out gc.collect() to keep the array alive.
    gc.collect()
    print('dead?', wref() is None)
    inptr = a
    inbuf = (c_double * 1).from_address(inptr)
    print('inbuf[0] =', inbuf[0])


bar()
@asmeurer (Author) commented Oct 4, 2019

@sklam what should be done about the test case for that issue that's in the code? It fails when bounds checking is enabled.

@sklam (Contributor) commented Oct 4, 2019

Changing

    status = callback(inct.data, size, outct.data)

to

    status = callback(inct, size, outct)

seems to work.

@asmeurer (Author) commented Oct 4, 2019

OK, that seems fine. Does that still test the original issue?

@asmeurer (Author) commented Oct 4, 2019

The tests seem to take about the same amount of time with NUMBA_BOUNDSCHECK=1. Maybe it would be useful to benchmark the bounds checking in general. Is there a good example you know of that I can benchmark it on?

@sklam (Contributor) commented Oct 4, 2019

Maybe something like a mergesort on array: https://github.com/numba/numba/blob/master/examples/mergesort.py

@asmeurer (Author) commented Oct 4, 2019

Looks like about 10% to 30% slower.

No bounds checking

$ ipython
Python 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import examples.mergesort                                                                                                                                                                                              

In [2]: import numpy as np                                                                                                                                                                                                     

In [3]: arr = np.random.random(10**5)                                                                                                                                                                                          

In [4]: # warmup                                                                                                                                                                                                               

In [5]: examples.mergesort.mergesort(arr.copy());                                                                                                                                                                              

In [6]: examples.mergesort.mergesort_inplace(arr.copy());                                                                                                                                                                      

In [7]: %timeit examples.mergesort.mergesort(arr.copy())                                                                                                                                                                       
14.5 ms ± 45.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit examples.mergesort.mergesort_inplace(arr.copy())                                                                                                                                                               
1.83 s ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: # Confirm no bounds checking                                                                                                                                                                                           

In [10]: from numba import njit                                                                                                                                                                                                

In [11]: @njit 
    ...: def test(x): 
    ...:     return x[1] 
    ...:                                                                                                                                                                                                                       

In [12]: test(np.array([1]))                                                                                                                                                                                                   
Out[12]: 140162938617040

Bounds checking

$ NUMBA_BOUNDSCHECK=1 ipython
Python 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import examples.mergesort

In [2]: import numpy as np

In [3]: arr = np.random.random(10**5)

In [4]: # warmup

In [5]: examples.mergesort.mergesort(arr.copy());

In [6]: examples.mergesort.mergesort_inplace(arr.copy());

In [7]: %timeit examples.mergesort.mergesort(arr.copy())
16.5 ms ± 45.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit examples.mergesort.mergesort_inplace(arr.copy())
2.42 s ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: # Confirm bounds checking is enabled

In [10]: from numba import njit

In [11]: @njit
    ...: def test(x):
    ...:     return x[1]
    ...:

In [12]: test(np.array([1]))
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-12-63f95a072376> in <module>
----> 1 test(np.array([1]))

IndexError: index is out of bounds

In [13]: 16.5/14.5
Out[13]: 1.1379310344827587

In [14]: 2.42/1.83
Out[14]: 1.3224043715846994

I do wonder, though, going back to #4432 (comment), to what degree LLVM is able to optimize out the bounds checks. Most of the indices in the mergesort algorithm come from loops that are based on the array size, so it's not clear to me whether every index actually gets a bounds check or whether LLVM is able to optimize some of them away, and I don't know how to check that. Is there a good example where the array indices cannot possibly be known at compile time, like if they are inputs to the function?
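For a benchmark where the indices cannot be known at compile time, something like this gather loop might be a better stress test (a hypothetical sketch): the index values only arrive at run time, so LLVM cannot prove them in-bounds and every access should keep its check.

    from numba import njit
    import numpy as np

    @njit
    def gather(x, idx):
        total = 0.0
        for i in idx:  # data-dependent indices: bounds cannot be proven
            total += x[i]
        return total

    x = np.random.random(10**6)
    idx = np.random.randint(0, x.size, size=10**6)
    gather(x, idx)  # warm up, then %timeit with NUMBA_BOUNDSCHECK=0 vs =1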

One of the tests currently fails with a memory leak error, but I'm not sure
how to fix it.
    noboundscheck(a)
    noboundscheck(b)
    with self.assertRaises(IndexError):
        boundscheck(a)

@asmeurer (Author) commented Oct 11, 2019

This line causes a memory leak error, and I'm not clear why. Does it have to do with the fact that it raises an exception? The other test doesn't have any memory leak errors.

I was also having issues getting the return_type argument to compile_isolated correct for this test, so it could be related.

Or it's possible there is a bug in the code. How can you debug memory leaks?
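One possible starting point is the NRT allocation counters that the test suite's MemoryLeakMixin machinery checks (a sketch; the module path is for this era of numba and may move):

    from numba.runtime import nrt

    # A leak shows up as allocations that were never freed once the jitted
    # call (including any raised exception) has fully unwound.
    stats = nrt.rtsys.get_allocation_stats()
    print('data leaks:', stats.alloc - stats.free)
    print('meminfo leaks:', stats.mi_alloc - stats.mi_free)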

@asmeurer (Author) commented Oct 11, 2019

I just noticed this doesn't handle fancy indexing.
