ENH: Add np.search() #9055
This should probably be in lib/extras.py, not core.
Also, commit message should start with ENH too, not just the PR title.
numpy/core/fromnumeric.py
Both arrays are first flattened to 1D, and the values in `a` must be unique.
This is an accelerated equivalent of:

`np.array([ int(np.where(a == val)[0]) for val in v ])`
I think this would be better documented in terms of the more obvious `a.index(v)`. Have a look at the docstring for `isin` - I suspect your description can match almost exactly, but swapping `a in b` for `a.index(b)`.
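As a rough illustration of the `a.index(v)` framing suggested here, the following is a slow reference sketch, not the PR's implementation. The helper name `index_reference` is hypothetical; it relies only on Python's `list.index`, which raises `ValueError` for a missing value and returns the first match when duplicates are present.

```python
import numpy as np

def index_reference(a, v):
    """Slow reference: position in flattened `a` of each value in flattened `v`."""
    a_list = np.ravel(a).tolist()
    # list.index returns the first matching position and raises ValueError on a miss
    return np.array([a_list.index(val) for val in np.ravel(v).tolist()])

a = np.array([30, 10, 20])
v = np.array([20, 30, 10, 20])
result = index_reference(a, v)
print(result)  # [2 0 1 2]
```

A vectorized version would be expected to agree with this reference whenever every value in `v` occurs in `a`.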
numpy/core/fromnumeric.py
Find indices into `a` where elements in `v` match those in `a`.

Find the indices into an array `a` whose values match those queried in `v`.
Both arrays are first flattened to 1D, and the values in `a` must be unique.
Can we remove the flattening of `v`? Seems like we can meaningfully preserve shape here, and `searchsorted` already knows how to handle Nd input as its second argument.
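A small sketch of the shape-preserving behaviour mentioned above, assuming every value in `v` is present in `a` (the sample arrays are made up for illustration):

```python
import numpy as np

a = np.array([30, 10, 20])
order = a.argsort()                      # indices that sort `a`: [1, 2, 0]
v = np.array([[20, 30], [10, 20]])       # 2-D query, not flattened

# searchsorted accepts an N-d second argument and returns the same shape
pos = a.searchsorted(v, sorter=order)    # positions within the sorted order
indices = order[pos]                     # map back to positions in the original `a`

print(indices.shape)     # (2, 2)
print(indices.tolist())  # [[2, 0], [1, 2]]
```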
numpy/core/fromnumeric.py
asortis = a.argsort()
## TODO: which is preferable?:
return asortis[a[asortis].searchsorted(v)]
#return asortis[a.searchsorted(v, sorter=asortis)]
The second will be faster, as it doesn't construct a temporary array
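The two candidate expressions from the TODO can be checked against each other. The sketch below only confirms that they give the same indices on made-up data; it does not measure speed, so the performance claim would still need timing on realistic inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.permutation(1000)          # unique values in random order
v = rng.choice(a, size=50)         # queries that are all present in `a`

asortis = a.argsort()

# Variant 1: materialize a sorted copy of `a`, then search it
via_copy = asortis[a[asortis].searchsorted(v)]

# Variant 2: search `a` in place, using `sorter` to avoid the temporary array
via_sorter = asortis[a.searchsorted(v, sorter=asortis)]

same = np.array_equal(via_copy, via_sorter)
round_trip = np.array_equal(a[via_sorter], v)
```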
numpy/core/fromnumeric.py
a, v = np.ravel(a), np.ravel(v)
if len(a) != len(np.unique(a)):
    raise ValueError("values in `a` must be unique for unambiguous results")
if not np.in1d(v, a).all():
This should be .isin
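For reference, a minimal sketch of the membership check written with `np.isin`, which keeps the shape of its first argument. The sample values and the `missing` variable are illustrative only.

```python
import numpy as np

a = np.array([30, 10, 20])
v = np.array([[20, 99], [10, 20]])

# Boolean mask with the same shape as `v`: True where the value occurs in `a`
present = np.isin(v, a)

# Values that would trigger the "not found" branch
missing = v[~present]

print(present.tolist())  # [[True, False], [True, True]]
print(missing.tolist())  # [99]
```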
numpy/core/fromnumeric.py
"""
a, v = np.ravel(a), np.ravel(v)
if len(a) != len(np.unique(a)):
    raise ValueError("values in `a` must be unique for unambiguous results")
Seems to me that we should just allow this - computing `unique` is expensive, especially if the input is already unique.
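One way to picture what "just allow this" could mean is sketched below. It assumes a stable sort (`kind='stable'`, an assumption not taken from the PR) so that a left-sided search lands on the first occurrence of a duplicated value, with no call to `np.unique`.

```python
import numpy as np

a = np.array([5, 3, 5, 3])         # duplicates are allowed in this sketch
v = np.array([5, 3])

# A stable sort keeps equal values in their original relative order
order = a.argsort(kind='stable')   # [1, 3, 0, 2]

# side='left' lands on the first of any run of equal values in sorted order
pos = a.searchsorted(v, side='left', sorter=order)
indices = order[pos]

print(indices.tolist())  # [0, 1], the first occurrence of 5 and of 3
```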
Changes, including extensions, to the existing API require discussion on the mailing list, please send a message explaining the proposed change. My personal opinion is that the code you are proposing solves a too restricted problem. If it could handle both values in
In summary, unless someone comes up with some super smart way of handling things, it seems to me that a function like this is doomed, because it either will solve a too restricted problem or will produce a too complex output.
Is there a reason that this is not the approach taken in
I suppose the trickery in
I am inclined to agree with @jaimefrio here. There isn't really a natural way to make this function work for finding all matches in NumPy, because NumPy doesn't have good ways to represent ragged arrays.
Sorry for the delay, and thanks for the comments. @eric-wieser, I've tried to address your comments. I can't find any file named After thinking about this some more, it seems to me that @jaimefrio, OK, I'll send a message to the mailing list. My motivation for this is that I often find myself using
But because I rarely test either of these assumptions, I always feel like I'm abusing Regarding this being something that solves a restricted problem, I feel that's a benefit, not a drawback. I think it's safest to just raise an error if
Does that sound like a reasonable way to handle values in
I think it's important for this function to have two branches, based upon whether it's worth sorting Roughly speaking, if
Core NumPy functions should not return masked arrays, so This leaves
Although adding a complementary
Maybe the use-case of finding the index of nearest element in the array could also somehow be handled. I can't be the only one using something like
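For the nearest-element use case raised here, a possible sketch built on `searchsorted` follows. The function name `nearest_index` is hypothetical, the inputs are assumed to be floating point with at least two elements in `a`, and ties are resolved toward the smaller neighbour; none of this comes from the PR itself.

```python
import numpy as np

def nearest_index(a, x):
    """Index into flattened `a` of the element closest to each value in `x`.

    Assumes `a` has at least two elements and a dtype where subtraction is safe.
    """
    a = np.asarray(a, dtype=float).ravel()
    x = np.asarray(x, dtype=float)
    order = a.argsort(kind='stable')
    s = a[order]                                   # sorted copy of `a`
    # Candidate neighbours: s[right - 1] <= x <= s[right], clipped to valid range
    right = np.clip(s.searchsorted(x), 1, len(s) - 1)
    left = right - 1
    choose_left = (x - s[left]) <= (s[right] - x)
    return order[np.where(choose_left, left, right)]

a = np.array([10.0, 0.0, 4.0])
x = np.array([1.0, 3.5, 9.0, -5.0, 20.0])
nearest = nearest_index(a, x)
print(nearest.tolist())  # [1, 2, 0, 1, 0]
```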
Thanks again for the comments. I've made some changes. @shoyer, I've added a @jaimefrio, I've added a @eric-wieser, I'm still not sure where to put this. It's still in
else:
    indices = np.zeros_like(v, dtype=int)
    indices[hits] = sortis[sideis[hits]]
    indices[misses] = fill_value
This seems like a case where `np.where` makes sense:
indices = np.where(hits, sortis[sideis], fill_value)
This would also mean you don't need to compute `misses`.
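A self-contained sketch of the `np.where` form follows. One detail is worth flagging: `np.where` evaluates `sortis[sideis]` for every element, including misses, and a left-sided search can return `len(a)` for a value larger than everything in `a`. The clipping step below is an added assumption to keep that lookup in bounds for a non-empty `a`; it is not part of the suggestion as written.

```python
import numpy as np

a = np.array([30, 10, 20])
v = np.array([20, 99, 10, 5])
fill_value = -1

sortis = a.argsort()
lis = a.searchsorted(v, side='left', sorter=sortis)
ris = a.searchsorted(v, side='right', sorter=sortis)
hits = lis != ris                        # True where the value in `v` occurs in `a`

# Assumption: clip so that misses past the end do not index out of bounds
safe = np.minimum(lis, len(a) - 1)
indices = np.where(hits, sortis[safe], fill_value)

print(indices.tolist())  # [2, -1, 1, -1]
```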
By the way, this is only lightly tested so far. See the Examples section in the docstring.
lis = a.searchsorted(v, side='left', sorter=sortis)
ris = a.searchsorted(v, side='right', sorter=sortis)
sideis = {'first':lis, 'last':ris-1}[which]
hits = lis != ris # elements in v that are in a
The searches are the second most expensive operations after the sorting, or the most expensive one if the needle is larger than the haystack. So it is best to not do two of them if at all possible. And if you don't want to find all occurrences it is certainly possible:
if which == 'first':
    index = a.searchsorted(v, side='left', sorter=sortis)
    inbounds = index < len(a)
elif which == 'last':
    index = a.searchsorted(v, side='right', sorter=sortis) - 1
    inbounds = index >= 0
else:
    raise ValueError('Relevant message goes here.')
hits = a[index[inbounds]] == v[inbounds]
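The single-search idea above can be fleshed out into a runnable sketch. The function name `find_first_or_last`, the `fill_value` default, and the use of a stable sort are all assumptions for illustration. Since `index` refers to positions in sorted order, this sketch routes the equality check through `sortis` and scatters the result back to the full shape of `v`.

```python
import numpy as np

def find_first_or_last(a, v, which='first', fill_value=-1):
    """Index into flattened `a` of each value in `v`, or `fill_value` on a miss."""
    a = np.asarray(a).ravel()
    v = np.asarray(v)
    sortis = a.argsort(kind='stable')
    if which == 'first':
        index = a.searchsorted(v, side='left', sorter=sortis)
        inbounds = index < len(a)
    elif which == 'last':
        index = a.searchsorted(v, side='right', sorter=sortis) - 1
        inbounds = index >= 0
    else:
        raise ValueError("`which` must be 'first' or 'last'")
    # Only in-bounds candidates can be hits; compare them via the sort order
    hits = np.zeros(v.shape, dtype=bool)
    hits[inbounds] = a[sortis[index[inbounds]]] == v[inbounds]
    result = np.full(v.shape, fill_value, dtype=np.intp)
    result[hits] = sortis[index[hits]]
    return result

a = np.array([5, 3, 5, 3])
v = np.array([5, 3, 7, 1])
first = find_first_or_last(a, v, which='first')
last = find_first_or_last(a, v, which='last')
print(first.tolist())  # [0, 1, -1, -1]
print(last.tolist())   # [2, 3, -1, -1]
```

Only one `searchsorted` call is made per query, at the cost of one extra gather through `sortis` to confirm each candidate.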
I would skip the Look at the random module for an example of Cython in NumPy. We might have a few other miscellaneous
I am going to close this. @mspacek Thanks for the work, feel free to pursue this in a new PR.
Find the indices into an array `a` whose values match those queried in `v`.
This is a PR to go with issue #9052. I'm not at all sure where this should live. It's in fromnumeric.py right now, which is probably wrong. Of course, there would also need to be documentation changes and some kind of tests added, but I thought I'd put this here now to get some feedback.