
BUG: Buttress handling of extreme values in randint #8846

Merged
1 commit merged into numpy:master on May 9, 2017
Conversation

@gfyoung (Contributor) commented Mar 26, 2017

In randint, we perform integer comparisons to check whether the bounds are violated. With numpy signed integers, those comparisons can be prone to overflow bugs at the extremes. This PR makes the function robust against those bugs.

cc @eric-wieser
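
For context, a minimal sketch (not from the PR; assuming numpy's promotion behavior at the time, cf. #8851) of how such a comparison can go wrong at the extremes:

import numpy as np

# Mixed int64/uint64 comparisons promoted both operands to float64,
# which cannot represent every 64-bit integer exactly, so distinct
# extreme values could compare as equal:
big = np.uint64(2**63)        # 9223372036854775808
near = np.int64(2**63 - 1)    # 9223372036854775807
print(big == near)            # True under float64 promotion, even
                              # though the integers differ by one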


for dt in self.itype:
    # np.bool_ has no iinfo entry, so its bounds [0, 2) are special-cased
    lbnd = 0 if dt is np.bool_ else np.iinfo(dt).min
    ubnd = 2 if dt is np.bool_ else np.iinfo(dt).max + 1
Member:
Would np.iinfo(np.bool_) be a reasonable thing for numpy to contain? (EDIT: #8849)

Contributor Author:

Perhaps, as it does not work at the moment:

>>> np.iinfo(np.bool_)
...
ValueError: Invalid integer data type.

@eric-wieser (Member) commented Mar 27, 2017

So, one other problem to address is generating np.uint64s in [0, u64top), where u64top = 2^64. I can think of a few solutions (superfluous casting added to be explicit about types; int is the Python 3 int):

  1. np.random.randint(int(0), int(u64top))
    Require the arguments to be ints, making this impossible to achieve by passing uint64s. But having a function that can't be invoked correctly with np scalars is unpleasant, and this can't be generalized to ufuncs. This is what this PR implements.
    Obviously we want this to work so as not to surprise users, but I don't think this is enough.

  2. np.random.randint(np.uint64(0), np.uint64(u64top - 1), closed=True)
    Allow the endpoint to be included. Could also be called endpoint for consistency with np.linspace. But closed ranges are uncommon in Python, and we already removed random_integers in DEP: Deprecate random_integers #6931. This is at least more explicit, but might still not be desirable.

  3. np.random.randint(np.uint64(0), np.uint64(0), wrap=True)
    The semantics would be that the number line of np.uint64 is considered circular: two equal bounds would denote the full range, and asking for numbers between any two points would otherwise be valid. This would completely disable the bound checking currently in place, as every set of inputs would be valid. This is more confusing than 2.

Of course you could just ask for a direct 64-bit integer, but it would be nice to cover the case where a user API wants to generate up to some variable without a special case.
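
As a sketch (not from the thread) of why the half-open form in option 1 forces Python ints, assuming 64-bit numpy of this era:

import numpy as np

u64top = 2**64

# The exclusive bound itself does not fit in a uint64:
try:
    np.uint64(u64top)          # one past the largest uint64 value
except OverflowError:
    pass

# Arbitrary-precision Python ints can express it, which is what
# this PR's fix accepts:
sample = np.random.randint(int(0), int(u64top), dtype=np.uint64)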

Thoughts?

@gfyoung (Contributor Author) commented Mar 27, 2017

@eric-wieser : np.iinfo(np.uint64).max = 18446744073709551615 = 2^64 - 1. What you're suggesting is to generate random integers beyond even what a uint64 can hold, which is in fact NOT the point of this PR.

@eric-wieser (Member) commented Mar 27, 2017

@gfyoung: Good catch on the typo. I meant generating numbers in [0, 2^64), i.e. a half-open range, whose values would all fit in a uint64.

ilow = int(low)
ihigh = int(high)

if ilow < lowbnd:
Member:

Related issue that is the root cause here: #8851

Contributor Author:

Indeed, a lot of "quirks" (i.e. bugs) surfaced when I started working on this PR.

Member:

Actually, I think I'd be happy with this patch if you add a TODO comment that this is a workaround for gh-8851.

Contributor Author:

I have no problems doing that.

@gfyoung (Contributor Author) commented Apr 26, 2017

@eric-wieser : Got around to looking at this again, and I'm trying to refresh myself on what exactly is blocking this PR from being merged. Let me respond to the suggestions you provided above:

  1. The casting to int is to avoid the annoying comparison you explicitly raised as an issue in BUG: incorrect equality of np.int64 and np.uint64 #8851. If that bug didn't exist, I wouldn't need to cast to Python int. I think this is the most robust solution for the time being. I understand that you want to make this a ufunc, but that is some way off IMO. This patch is otherwise robust and clean.

  2. Yep, I don't think we'll be allowing closed intervals as the standard any time soon.

  3. Agreed that this option is not desirable.

@eric-wieser (Member):

Main blocker is that I want someone else to weigh in here, and that this raises a bunch of other issues*. I agree that np.random.randint having different behaviour on integers and numpy scalars is a bug, and I also think that this is an OK patch to make things consistent.

But I think this is in some sense a workaround: we should actually just fix the relational operators here rather than work around them. np.int32 <op> np.uint32 already does the "right" thing (i.e., not what C would do) for relational operators, so I'd propose we add qQ and Qq ufunc loops for the relational operators (and probably lL and Ll, since sometimes l == q...).

Would that fix this bug?
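
A sketch of the asymmetry being described, assuming numpy's promotion rules at the time:

import numpy as np

# int32 vs uint32 promote to int64, which holds both ranges exactly:
print(np.int32(-1) < np.uint32(0))             # True, as expected

# int64 vs uint64 promote to float64, losing integer precision:
print(np.int64(2**62) < np.uint64(2**62 + 1))  # False: both round to 2.0**62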


*For instance:

i64max = np.int64(np.iinfo(np.int64).max)
my_data = np.array([i64max - 4] * 5 + [i64max])

data_range = my_data.min(), my_data.max()

# generate some fake data in the same range as the existing data
faked_data = np.random.randint(data_range[0], data_range[1] + 1)

This fails because we already get an overflow before we even make it to randint, so this case is an argument for re-introducing closed ranges (wrapping makes no sense on signed integers, as it is not well-defined anyway). Interestingly, this case doesn't come up for uint64s, as silent promotion to float happens.

@gfyoung (Contributor Author) commented Apr 26, 2017

Okay. Fair enough. Two points in response:

  1. I think fixing the comparison operators would patch this bug. However, I don't know how simple that will be. The question is whether it is worthwhile to wait or to submit this now.

  2. Could you tag this for 1.13 just so we can ensure that we get eyes on this?

@eric-wieser added this to the 1.13.0 release milestone on Apr 26, 2017
@rkern (Member) commented Apr 27, 2017

I like the idea of adding those comparison ufunc loops. Adding new ufunc loops is pretty straightforward. It might complicate someone's theoretical table of how dtype promotion works, but for the comparison operators, I'm okay with that. :-)

@gfyoung (Contributor Author) commented Apr 27, 2017

@rkern : FYI, see #8851 for further discussion of the comparison operators.

except Exception as e:
    raise AssertionError("No error should have been raised, "
                         "but one was with the following "
                         "message:\n\n%s" % str(e))
Member:

This is a little weird - better to leave the raw exception so that the tests produce a useful stack trace, I think

@gfyoung (Contributor Author) commented Apr 27, 2017:

I did this because I wanted to know which element in the for-loop, if any, failed. No stack-trace is going to tell you that AFAICT.

Member:

Maybe we need an assert_no_exception or some such.

Member:

I don't think assert_no_exception would be all that useful - wouldn't it otherwise have to go on every single function call? Based on @gfyoung's reasoning, parameterized test cases would solve this.

> I did this because I wanted to know which element in the for-loop, if any, failed.

In what way does this do a better job of doing that, given that dt is not used in the message?
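
For illustration, a hypothetical variant (the dtype list stands in for the test's self.itype, and a fixed randint is assumed) that would actually name the failing case:

import numpy as np

for dt in [np.int8, np.int64, np.uint64]:   # stand-in for self.itype
    lbnd = np.iinfo(dt).min
    ubnd = np.iinfo(dt).max + 1
    try:
        np.random.randint(lbnd, ubnd, dtype=dt)
    except Exception as e:
        # naming dt is what makes the failing element identifiable
        raise AssertionError("randint raised unexpectedly for dtype "
                             "%s: %s" % (dt.__name__, e))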

  raise ValueError("low >= high")

  with self.lock:
-     ret = randfunc(low, high - 1, size, self.state_address)
+     ret = randfunc(ilow, ihigh - 1, size, self.state_address)
Member:

If we claim that everything should work if comparisons get fixed, then this line doesn't need to change - right?

@gfyoung (Contributor Author) commented Apr 29, 2017:

That is correct. The following lines will not be needed:

ilow = int(low)
ihigh = int(high)

Member:

Let me be more clear: I think we should continue to pass low and high on this line - there is no reason to change to ilow and ihigh here.

@gfyoung (Contributor Author) commented Apr 29, 2017:

That's not the case:

>>> uint64_max = np.uint64(np.iinfo(np.uint64).max)
>>> uint64_max - 1
1.8446744073709552e+19

This PR could not demonstrate more clearly just how fragile numpy's operators are when handling large uint64 values.

Member:

Darn, you're right. In that case, the comment above saying "remove these lines" should also point out that we need them to work around the subtraction as well. Or just put a comment to that effect next to the subtraction.

Alternatively, np.subtract(high, 1, dtype=dtype) would probably be safe here too.

@gfyoung (Contributor Author) commented Apr 29, 2017:

Fair enough. In fact, even your suggestion is not safe:

>>> uint64_max = np.uint64(np.iinfo(np.uint64).max)
>>>
# correct but bad for randint
>>> np.subtract(uint64_max, 1, dtype=np.int64)
-2
>>>
>>> np.subtract(uint64_max, 1, dtype=None)   # oops
1.8446744073709552e+19

Member:

Not sure what you meant by # correct but bad for randint. You're in for a bad time if you ask for an upper bound that doesn't come close to fitting in the container type. What does passing dtype=None into randint do normally?

@gfyoung (Contributor Author) commented Apr 29, 2017:

Actually, looking at those examples again, I realize that those aren't valid in the context of randint. The only time that we could do this subtraction is if dtype=np.uint64, in which case the subtraction actually works correctly.

So ignore everything I said in the previous comment. I would still prefer high - 1 (the current workaround) over np.subtract, as that will make it clear to us later that we need to patch this once #8851 is fixed.

dtype=None is not valid (per the docs) and will error as an invalid dtype.

Member:

np.subtract has the benefit of extending to broadcasting more naturally.

I'm happy with the current workaround, provided there's a comment indicating that the subtraction too is a workaround.
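
A sketch of the broadcasting point, assuming bounds may eventually be arrays (gh-6938):

import numpy as np

# high - 1 is fine for scalars, but np.subtract extends elementwise:
high = np.array([10, 20, 2**64 - 1], dtype=np.uint64)
print(np.subtract(high, 1, dtype=np.uint64))
# [ 9 19 18446744073709551614]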

Contributor Author:

Comment has been appended to the TODO I wrote earlier.

@charris (Member) commented May 7, 2017

What is the status of this?

@gfyoung (Contributor Author) commented May 7, 2017

If I'm not mistaken, @eric-wieser has OK'ed the changes, but he wanted feedback from other maintainers before merging this PR.

@charris (Member) commented May 8, 2017

@eric-wieser Seems reasonable to me. I'll merge this later if you don't beat me to it.

@eric-wieser (Member):

My main qualm with this patch is that we lock ourselves into behaviour that will be difficult to maintain in #6938.

@gfyoung (Contributor Author) commented May 8, 2017

@eric-wieser : What behavior are you referring to?

@charris (Member) commented May 8, 2017

@eric-wieser Looks like it just fixes a bug in the current version. What am I missing?

@eric-wieser (Member) commented May 9, 2017

> What am I missing?

That once we have broadcasting, casting the results to int is not a particularly viable strategy, as they are no longer scalars; the equivalent would be .astype(object), which would be pretty slow.
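
To make that concrete, a sketch (hypothetical, per gh-6938) of what the scalar int() cast would have to become once the bounds are arrays:

import numpy as np

low = np.array([0, 5], dtype=np.uint64)
high = np.array([10, 2**64 - 1], dtype=np.uint64)

# The array analogue of int(low) is an object array of Python ints:
low_obj = low.astype(object)     # exact, but element-by-element
high_obj = high.astype(object)   # and therefore slow at scale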

@eric-wieser (Member) commented May 9, 2017

I guess my concern comes down to the fact that at some level this is a workaround for an underlying bug (uint64/int64 operations), which leaves us two options in future:

  • Fix the underlying bug to match the behaviour of the workaround
  • Decide on a different fix when we finally address the bug, and have to leave the workaround in for compatibility

@gfyoung (Contributor Author) commented May 9, 2017

True, this is a workaround in some respects, but the patch should not affect the numbers generated by this function. I don't see how correctly implementing uint64 subtraction and comparison would change what numbers we get.

@eric-wieser (Member):

Worst case scenario, we decide that uint64/int64 ops are an error or warning, rather than being broken like they are now. If that happens, this would switch from "code that should work" to "code that shouldn't work". But that's an unlikely resolution.

Let's put this in, with the assumption that in the unlikely event supporting this becomes a burden, we can just go through a deprecation cycle.

@gfyoung (Contributor Author) commented May 9, 2017

Fair enough. If that does end up happening, I'd have no problem putting together the deprecation patches, given how unlikely it is.

@charris (Member) commented May 9, 2017

I suppose the other option is to raise an error for the bad combinations.

@charris merged commit 2b4ecc4 into numpy:master on May 9, 2017
@charris (Member) commented May 9, 2017

Thanks @gfyoung .
