BUG: Allow any integer type in bincount (fixes #823) #4366

Closed

Conversation

jaimefrio
Member

Gives array-likes that cannot be cast to np.intp a second chance, by comparing their entries one by one to NPY_INTP_MAX and forcing the cast if none exceeds it. Fixes #823.

@jaimefrio
Member Author

This is a simpler, slower alternative to the type specific functions I had implemented in #4330. Since solving this issue is orthogonal to whether an axis argument is ever added to bincount, I thought it made sense to implement it separately.

@juliantaylor
Contributor

Wouldn't it be simpler to just try to cast the non-intp array to NPY_UINTP, then get the maximum and, if it fits, cast back to intp? Probably also faster than the function-pointer compare loop.

Actually, you could probably save the cast if the rest of the code works on npy_uintp as documented: just create an npy_uintp view of the npy_intp array after the minimum check.
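
A rough Python-level sketch of what this suggestion amounts to (illustrative only; the actual change would live in C, and the function name here is made up):

import numpy as np

def _to_intp_via_uintp(values):
    # Cast to the unsigned pointer-sized type first, check the maximum
    # against the largest intp value, and only then cast down to intp.
    # Negative entries of a signed input wrap to huge unsigned values
    # here, so they trip the same check.
    arr = np.asarray(values).astype(np.uintp)
    if arr.size and arr.max() > np.iinfo(np.intp).max:
        raise ValueError("value too large to be converted to intp")
    return arr.astype(np.intp)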

@jaimefrio
Member Author

On a 32-bit system uintp is uint32, but you could also be receiving uint64 or int64, and neither can be safely cast to uint32.

A potentially faster alternative without multiple functions for different types is to try to create an NPY_INTP array; if that fails, create it with the type set to NULL, and if the resulting type is an integer, force-cast it to NPY_ULONGLONG and check there for items exceeding NPY_INTP_MAX. The only problem then would be how to tell negative values apart from values that are too large, so as to give a meaningful error to the user.
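
A Python-level sketch of that fallback path (illustrative only; the initial attempt at a plain intp conversion is omitted, and the names are not from the PR):

import numpy as np

def _to_intp_widened(values):
    # Let NumPy pick the array's type, then widen integer input to
    # unsigned long long and bounds-check against the largest intp value
    # before the final cast.
    arr = np.asarray(values)
    if not np.issubdtype(arr.dtype, np.integer):
        raise TypeError("cannot safely convert array data to intp")
    wide = arr.astype(np.ulonglong)  # forced cast: negatives wrap to huge values
    if wide.size and wide.max() > np.iinfo(np.intp).max:
        # a negative entry and a genuinely too-large entry are
        # indistinguishable at this point, which is the problem noted above
        raise ValueError("array values out of range for intp")
    return wide.astype(np.intp)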

@juliantaylor
Contributor

Maybe it's better to add type-specific functions; even on 64-bit you might not want to cast your large int16 array to 64-bit integers.

@jaimefrio
Member Author

I may have gotten a little carried away there, but type-specific functions were frowned upon by @charris in #4330, which is what prompted this more flexible, less efficient approach, as a test of a simplification for #4330.

I think he was concerned that the additional complexity did not justify speeding up a function whose speed nobody is complaining about. Hence this approach, which leaves the original path unchanged and provides a slower alternative path for inputs that previously did not work.

I am happy to rewrite it either way, or with some middle-ground approach, but I need your guidance as to which approach best fits NumPy.

@charris
Member

charris commented Feb 26, 2014

I'd say just chunk the conversions.

@charris
Member

charris commented Feb 26, 2014

Chunking would probably be a good idea for any integer type bigger than uint16 anyway. Not sure of the best chunk size though, maybe 64K.

@jaimefrio
Member Author

By "chunking" you mean something similar to what NpyIter does when buffering? That works wonderfully well, but the overhead of setting up NpyIter kills performance for small arrays. Before I go and reinvent the wheel, is there any standard way of "chunking" that I am not aware of?

@charris
Member

charris commented Feb 27, 2014

I was thinking that you could leave the type unspecified to begin with, then convert the resulting array to npy_intp in chunks. But there is a problem with the minmax function, as it only works with npy_intp. You could call the max/min methods of the array, however; that would probably be slower, especially for small arrays.

What you have now doesn't bother me that much, but as Julian says, it does waste space. But no more than it did before.
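
For what it's worth, a Python-level sketch of the chunked conversion being described (the chunk size and names are illustrative, not the PR's code):

import numpy as np

CHUNK = 1 << 16  # the 64K figure floated above

def _to_intp_chunked(values):
    # Validate and convert an integer array to intp one chunk at a time,
    # so no full-size widened temporary is ever allocated.
    arr = np.asarray(values)
    out = np.empty(arr.shape, dtype=np.intp)
    flat_in, flat_out = arr.ravel(), out.ravel()
    bound = np.iinfo(np.intp).max
    for start in range(0, flat_in.size, CHUNK):
        chunk = flat_in[start:start + CHUNK]
        if chunk.min() < 0 or chunk.max() > bound:
            raise ValueError("array values out of range for intp")
        flat_out[start:start + CHUNK] = chunk
    return out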

@seberg
Member

seberg commented Feb 27, 2014

If you find a cool solution for this, put it into the indexing machinery ;). It currently just casts to np.intp, potentially creating integer wraparound problems too... The nicest way would probably be to plug some warning/error mechanism into the integer casting functions, so that we can do NPY_ATTEMPT_SAME_KIND casting in the NpyIter and get an error; if not at the beginning, then at least some error at all, even if in indexing things could potentially be corrupted by then.

@jaimefrio
Member Author

That's a nasty shortcut you are taking there! ;-)

>>> a = np.array([np.iinfo(np.uintp).max])
>>> a
array([18446744073709551615], dtype=uint64)
>>> a[a]
array([18446744073709551615], dtype=uint64)
>>> a.astype(np.intp) # this is what's going on under the hood
array([-1], dtype=int64)

As you say, the proper solution would probably be to have a set of int-to-int conversion functions that check for in-bounds values and have a return value that can be used to signal overflows in the conversion. I think I have followed the path of how casting works all the way down, so it would mean duplicating some of those functions for ints. That's definitely a bigger bite than I can chew with my current knowledge of the code base, so I won't even dream of attempting it.

Quite frankly, if something as ubiquitous as indexing can live with this behavior, the hack in this PR is unwarranted. Until an elegant solution is devised for that, perhaps the best solution for this particular bug (which is everywhere, see #4384) is to follow the leader and do as indexing does, i.e. something equivalent to NPY_SAME_KIND_CASTING?

On a 64-bit system, the worst that could happen is that you get an error about a negative number when the values are actually too large, and we can always change the error to "incorrect values in array" so as not to mislead the user:

>>> np.array([np.iinfo(np.intp).max+1])
array([9223372036854775808], dtype=uint64)
>>> np.array([np.iinfo(np.intp).max+1]).astype(np.intp)
array([-9223372036854775808], dtype=int64)

On 32-bit systems, messier bugs can creep in:

>>> np.array([np.iinfo(np.uintp).max+1])
array([4294967296], dtype=int64)
>>> np.array([np.iinfo(np.uintp).max+1]).astype(np.intp)
array([0])

Which is not good: you should have gotten a "Maximum allowed dimension exceeded" error, but at least it's wrong in a way consistent with indexing...

@seberg
Member

seberg commented Feb 27, 2014

It is kinda impressive that nobody ever noticed and bothered to open an issue about it ;). But I didn't feel like specializing a load of inner loops for this, and the old code did a forced cast too...
The correct way to fix it is indeed to work out how the inner loops could better warn/raise about integer wraparound problems. Once we have that, we can at least raise an error reasonably. Of course, for indexing specifically, specializing the inner loops would be plausible, but unless someone screams about np.int_ indexing being very common and slow on Windows machines...

One advantage indexing has over bincount is that you can't even fathom the idea of indexing with such a large integer ;).

@charris
Member

charris commented Feb 27, 2014

True, true. My 64-bit machine only uses 48 bits for memory addressing. That's what, 256 terabytes or so? I've been robbed.

@jaimefrio
Member Author

Actually, bincount and repeat (#4384) are not very different from indexing in that respect: such huge numbers, if properly handled, would lead to some form of "array too big" error. By forcing the cast, you only change where the error happens. The only situation where an overflow may wrap around to give incorrect results is on 32-bit systems; see the example above where 2**32 is magically converted to 0. So I will restate my proposal:

We (that would be me) make bincount (and repeat, and any other function that takes integers and tries to cast them to npy_intp) broken in a way consistent with indexing. We close those issues, open a new one about the indexing force-cast problem, and list there the functions that would need to be corrected if that ever gets figured out.

@jaimefrio
Member Author

np.tile already seems to force the cast, so the error messages may need redoing:

>>> np.tile([1,2,3], np.array([2**63]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\WinPython-64bit-2.7.6.2\python-2.7.6.amd64\lib\site-packages\numpy\lib\shape_base.py", line 829, in tile
    c = c.reshape(-1, n).repeat(nrep, 0)
ValueError: negative dimensions are not allowed

@matthew-brett
Contributor

We just ran into this (kind of) problem: dipy/dipy#789. Not bincount, but repeat.


min = (const void *)PyArray_DATA(arr_cmp2);
max = min + stride;
while (size--) {

Is it me, or is there no iteration happening here?

return 0;
}
else if (cmp(a, max, arr) > 0) {
PyErr_SetString(PyExc_ValueError,

I think these errors would make more sense emitted by the caller.

@anirudh2290
Member

@jaimefrio this PR has been stale for a while; do you still want to continue working on it?

@charris
Member

charris commented Dec 28, 2020

I'm going to close this; it is stalled and needs a rebase. @jaimefrio please make a new PR if you want to pursue this.

@charris charris closed this Dec 28, 2020
Successfully merging this pull request may close these issues.

bincount does not accept input of type > N.uint16 (Trac #225)