Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add edge keyword argument to digitize #16937

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

alexrockhill
Copy link

Adding an edge keyword allows a simpler syntax for passing in bins while including edge cases that would otherwise provoke slightly weird behavior as in

x = [1, 2, 3]
bins = [0, 1.5, 3]

resulting in

np.digitize = [1, 2, 3]

when users might prefer

np.digitize = [1, 2, 2]

as they only passed two ranges and all the values were within the ranges including edge cases.

@alexrockhill
Copy link
Author

Ok so my solution to the overflow was to cast integers to 64 bit which will only fail if the greatest bin is 2 ** 63 - 1 but it will raise a runtime error.

@eric-wieser
Copy link
Member

What are your thoughts on my proposal in #16933?

@alexrockhill
Copy link
Author

What are your thoughts on my proposal in #16933?

Yeah that would be a good solution. Then the edge argument could just be converted to the boolean list to pass to searchsorted.

idx = -1 if delta == mono else 0
if np.issubdtype(bins.dtype, _nx.integer):
bins = _nx.int64(bins)
if abs(bins[idx]) >= 2 ** 63 - 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rejects uint64 2**64-2, but should not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_nx.int64(2 ** 63) throws an error but _nx.int64(2 ** 63 - 1) does not so as far as I know that's the top of the range. That seems like a pretty big range of acceptable numbers, I think that should do it if that's the range of int64, not sure it would be worth extending to larger numbers unless there is some particular reason.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment is about uint64

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm running into some real funny behavior with bins = np.uint64([0, 2**64 - 2]) where bins[-1] += 1 gives [0, 0] as does bins[-1] -= 1. And so does bins[-1] = np.uint64(bins[-1] + 1). Very strange.

This should definitely not be figured out in np.digitize and should be it's own issue. For digitize, I'd be in favor of just leaving it as is and not supporting the very narrow use-case of unsigned integers above 2 ** 63 at least until that gets figured out elsewhere.

Copy link
Member

@eric-wieser eric-wieser Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the issue you're seeing is that 1 is of type int64. bins[i] += np.uint64(1) should work for the uint64 case, although will fail for the int64 case.

and not supporting the very narrow use-case of unsigned integers above 2 ** 63

I'd wager that anyone using uint64 probably wants this extra range.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the help, okay now this should work for uint64

@charris charris changed the title MRG, ENH: added edge keyword argument to digitize ENH: Add edge keyword argument to digitize Jul 25, 2020
Copy link
Member

@eric-wieser eric-wieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't think we want to take the approach of fiddling with inputs to searchsorted, when it would be far less error-prone to just handle the flag inside search-sorted.

@@ -1740,13 +1740,18 @@ def test_large_integers_increasing(self):
# gh-11022
x = 2**54 # loses precision in a float
assert_equal(np.digitize(x, [x - 1, x + 1]), 1)
assert_equal(np.digitize(x, [x - 1, x + 1], False, True), 1)
with assert_raises(RuntimeError):
np.digitize(x, [x - 1, 2 ** 63 - 1], False, True)
Copy link
Member

@eric-wieser eric-wieser Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's better than producing an incorrect result, I see no reason why we should want this to fail, 1 is clearly a correct result. I suppose NotImplementedError would at least convey that this is our fault not the callers, but I don't really like that approach.

Additionally, the following should not fail, but does:

np.digitize(np.uint64(x), np.array([x - 1, 2 ** 63 - 1], dtype=np.uint64), False, True)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it would be nice to support that but I tried implementing a branching logic for uint64 but numpy behaved very funny with the large numbers (see above).

Base automatically changed from master to main March 4, 2021 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Awaiting a decision
Status: Needs decision
Development

Successfully merging this pull request may close these issues.

3 participants