-
-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add edge keyword argument to digitize #16937
base: main
Are you sure you want to change the base?
Conversation
Ok so my solution to the overflow was to cast integers to 64 bit which will only fail if the greatest bin is |
What are your thoughts on my proposal in #16933? |
Yeah that would be a good solution. Then the edge argument could just be converted to the boolean list to pass to searchsorted. |
numpy/lib/function_base.py
Outdated
idx = -1 if delta == mono else 0 | ||
if np.issubdtype(bins.dtype, _nx.integer): | ||
bins = _nx.int64(bins) | ||
if abs(bins[idx]) >= 2 ** 63 - 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rejects uint64 2**64-2
, but should not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_nx.int64(2 ** 63)
throws an error but _nx.int64(2 ** 63 - 1)
does not so as far as I know that's the top of the range. That seems like a pretty big range of acceptable numbers, I think that should do it if that's the range of int64
, not sure it would be worth extending to larger numbers unless there is some particular reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment is about uint64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm running into some real funny behavior with bins = np.uint64([0, 2**64 - 2])
where bins[-1] += 1
gives [0, 0]
as does bins[-1] -= 1
. And so does bins[-1] = np.uint64(bins[-1] + 1)
. Very strange.
This should definitely not be figured out in np.digitize and should be it's own issue. For digitize, I'd be in favor of just leaving it as is and not supporting the very narrow use-case of unsigned integers above 2 ** 63 at least until that gets figured out elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the issue you're seeing is that 1
is of type int64
. bins[i] += np.uint64(1)
should work for the uint64 case, although will fail for the int64
case.
and not supporting the very narrow use-case of unsigned integers above 2 ** 63
I'd wager that anyone using uint64
probably wants this extra range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the help, okay now this should work for uint64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't think we want to take the approach of fiddling with inputs to searchsorted
, when it would be far less error-prone to just handle the flag inside search-sorted.
@@ -1740,13 +1740,18 @@ def test_large_integers_increasing(self): | |||
# gh-11022 | |||
x = 2**54 # loses precision in a float | |||
assert_equal(np.digitize(x, [x - 1, x + 1]), 1) | |||
assert_equal(np.digitize(x, [x - 1, x + 1], False, True), 1) | |||
with assert_raises(RuntimeError): | |||
np.digitize(x, [x - 1, 2 ** 63 - 1], False, True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it's better than producing an incorrect result, I see no reason why we should want this to fail, 1
is clearly a correct result. I suppose NotImplementedError
would at least convey that this is our fault not the callers, but I don't really like that approach.
Additionally, the following should not fail, but does:
np.digitize(np.uint64(x), np.array([x - 1, 2 ** 63 - 1], dtype=np.uint64), False, True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it would be nice to support that but I tried implementing a branching logic for uint64
but numpy behaved very funny with the large numbers (see above).
Adding an edge keyword allows a simpler syntax for passing in bins while including edge cases that would otherwise provoke slightly weird behavior as in
resulting in
when users might prefer
as they only passed two ranges and all the values were within the ranges including edge cases.