Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add np.search() #9055

Closed
wants to merge 2 commits into from
Closed

ENH: Add np.search() #9055

wants to merge 2 commits into from

Conversation

mspacek
Copy link
Contributor

@mspacek mspacek commented May 5, 2017

Find the indices into an array a whose values match those queried in v.

This is a PR to go with issue #9052. I'm not at all sure where this should live. It's in fromnumeric.py right now, which is probably wrong. Of course, there would also need to be documentation changes and some kind of tests added, but I thought I'd put this here now to get some feedback.

Copy link
Member

@eric-wieser eric-wieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be in lib/extras.py, not core.

Also, commit message should start with ENH too, not just the PR title

Both arrays are first flattened to 1D, and the values in `a` must be unique.
This is an accelerated equivalent of:

`np.array([ int(np.where(a == val)[0]) for val in v ])`
Copy link
Member

@eric-wieser eric-wieser May 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be better documented in terms of the more obvious a.index(v). Have a look at the docstring for isin - I suspect your description can match almost exactly, but swapping a in b for a.index(b)

Find indices into `a` where elements in `v` match those in `a`.

Find the indices into an array `a` whose values match those queried in `v`.
Both arrays are first flattened to 1D, and the values in `a` must be unique.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the flattening of v? Seems like we can meaningfully preserve shape here, and searchsorted already knows how to handle Nd input as its second argument

asortis = a.argsort()
## TODO: which is preferable?:
return asortis[a[asortis].searchsorted(v)]
#return asortis[a.searchsorted(v, sorter=asortis)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second will be faster, as it doesn't construct a temporary array

a, v = np.ravel(a), np.ravel(v)
if len(a) != len(np.unique(a)):
raise ValueError("values in `a` must be unique for unambiguous results")
if not np.in1d(v, a).all():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be .isin

"""
a, v = np.ravel(a), np.ravel(v)
if len(a) != len(np.unique(a)):
raise ValueError("values in `a` must be unique for unambiguous results")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me that we should just allow this - computing unique is expensive, especially if the input is already unique

@jaimefrio
Copy link
Member

Changes, including extensions, to the existing API require discussion on the mailing list, please send a message explaining the proposed change.

My personal opinion is that the code you are proposing solves a too restricted problem. If it could handle both values in v not found in a and found more than once in a, then maybe it would make sense. But that is problematic:

  • The simplest way to detect missing values is to run np.searchsorted on your arrays twice, one with side='left', the other with side='right'. If you get the same index for a value on both searches, then the value is not in the array.
  • Running two searches would also let you find multiples instances of a value: if searchsorted returns indices i_l and i_r for left and right searches of value v, then that value is found in indices asortis[i_l:i_r] of the array a.
  • The problem with this is that there is no easy way to return these results to the user in a numpythonic way. Because the number of times each value appears in an array could be different, you would need to return a list of arrays, with one array of indices per value in v. But lists of arrays are not what you typically want in NumPy, so it's not clear to me that this is very helpful.

In summary, unless someone comes up with some super smart way of handling things, it seems to me that a function like this is doomed, because it either will solve a too restricted problem or will produce a too complex output.

@eric-wieser
Copy link
Member

If you get the same index for a value on both searches, then the value is not in the array.

Is there a reason that this is not the approach taken in in1d?

@jaimefrio
Copy link
Member

I suppose the trickery in in1d is faster? For one data point it is:

n = 10000

a = np.random.randint(2 * n, size=n)
b = np.random.randint(2 * n, size=n)

def f1(a, b):
  return np.in1d(a, b)

def f2(a, b):
  sorter = np.argsort(b)
  left = np.searchsorted(b, a, side='left', sorter=sorter)
  right = np.searchsorted(b, a, side='right', sorter=sorter)
  return left != right

assert np.all(f1(a, b) == f2(a, b))

%timeit f1(a, b)
100 loops, best of 3: 2.71 ms per loop

%timeit f2(a, b)
100 loops, best of 3: 4.39 ms per loop

@shoyer
Copy link
Member

shoyer commented May 8, 2017

I am inclined to agree with @jaimefrio here. There isn't really a natural way to make this function work for finding all matches in NumPy, because numpy doesn't have good ways to represent ragged arrays.

@mspacek mspacek changed the title ENH: Add np.argmatch() ENH: Add np.search() May 8, 2017
@mspacek
Copy link
Contributor Author

mspacek commented May 8, 2017

Sorry for the delay, and thanks for the comments.

@eric-wieser, I've tried to address your comments. I can't find any file named lib/extras.py. Did you mean lib/ma/extras.py? For now I've left it in fromnumeric.py, right above the definition for searchsorted.

After thinking about this some more, it seems to me that np.search might be a better name than argmatch or argval or find or index. It complements np.searchsorted in that it doesn't assume that the input array a is sorted. Otherwise, it does pretty much the same thing.

@jaimefrio, OK, I'll send a message to the mailing list. My motivation for this is that I often find myself using searchsorted for exactly this purpose because it's fast and convenient, and usually the following assumptions are valid:

  1. a is sorted
  2. v is a subset of a

But because I rarely test either of these assumptions, I always feel like I'm abusing searchsorted, and I would guess many others do the same. Looking at my own habits and uses, it seems to me that finding the indices of matching values of one array in another is a more common use case than finding insertion indices of one array into another sorted array.

Regarding this being something that solves a restricted problem, I feel that's a benefit, not a drawback. I think it's safest to just raise an error if v is not a subset of a. However, I came across numpy_indexed.indices() which seems to do exactly what I'm proposing here. That function has a kwarg missing, which I quite like:

missing : {'raise', 'ignore', 'mask' or int}
        if `missing` is 'raise', a KeyError is raised if not all elements of `that` are present in `this`
        if `missing` is 'mask', a masked array is returned,
        where items of `that` not present in `this` are masked out
        if `missing` is 'ignore', all elements of `that` are assumed to be present in `this`,
        and output is undefined otherwise
        if missing is an integer, this is used as a fill-value

Does that sound like a reasonable way to handle values in v that might be missing in a?

@shoyer
Copy link
Member

shoyer commented May 9, 2017

I think it's important for this function to have two branches, based upon whether it's worth sorting a or not. Otherwise, users will start using this as an alternative to np.nonzero(v == a) (for scalar a) and be surprised when it's slower.

Roughly speaking, if len(v) < log(len(a)), we don't want to sort a and instead need to do a naive loop over a and v (needs to be done in C or Cython for performance) that can exit early if it finds a match. If A = len(a) and V = len(v), the time complexity here is roughly O(A log(A) + V log(V)) for the sort based solution vs O(A*V) for the loop.

That function has a kwarg missing, which I quite like:

Core NumPy functions should not return masked arrays, so 'mask' is out. I'm also not sure how 'ignore' would be helpful -- usually we avoid undefined behavior unless there are significant performance benefits, which I don't think there are in this case.

This leaves 'raise' and int. Raising should be the default behavior, so I would suggest a default value of None to indicate raising an error (rather than a string literal), and allowing an optional integer fill value. For that matter, it would probably be better to call this fill_value rather than missing.

@eric-wieser
Copy link
Member

Core NumPy functions should not return masked arrays, so 'mask'

Although adding a complementary np.ma.search would be sensible

@Tillsten
Copy link

Maybe the use-case of finding the index of nearest element in the array could also somehow be handled. I can't be the only one using something like np.argmin(np.abs(arr-value_to_find)) quite often.

@mspacek
Copy link
Contributor Author

mspacek commented May 15, 2017

Thanks again for the comments. I've made some changes.

@shoyer, I've added a fill_value kwarg. As for optimization, I could probably write up a naive loop in Cython if given a bit of guidance regarding how to integrate it with the rest of the code. Doing it directly in C might be a bit much for me...

@jaimefrio, I've added a which kwarg, which controls whether to return the index of the first or last hit for values in v that have multiple hits in a. Maybe someone can come up with a better kwarg name?

@eric-wieser, I'm still not sure where to put this. It's still in fromnumeric.py

else:
indices = np.zeros_like(v, dtype=int)
indices[hits] = sortis[sideis[hits]]
indices[misses] = fill_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a case where np.where makes sense:

indices = np.where(hits, sortis[sideis], fill_value)

This would also mean you don't need to compute misses.

@mspacek
Copy link
Contributor Author

mspacek commented May 15, 2017

By the way, this is only lightly tested so far. See the the Examples section in the docsting.

lis = a.searchsorted(v, side='left', sorter=sortis)
ris = a.searchsorted(v, side='right', sorter=sortis)
sideis = {'first':lis, 'last':ris-1}[which]
hits = lis != ris # elements in v that are in a
Copy link
Member

@jaimefrio jaimefrio May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The searches are the second most expensive operations after the sorting, or the most expensive one if the needle is larger than the haystack. So it is best to not do two of them if at all possible. And if you don't want to find all occurrences it is certainly possible:

if which == 'first':
    index = a.searchsorted(v, side='left', sorter=sortis)
    inbounds = index < len(a)
elif which == 'last':
    index = a.searchsorted(v, side='right', sorter=sortis) - 1
    inbounds = index >= 0
else:
    raise ValueError('Relevant message goes here.')
hits = a[index[inbounds]] == v[inbounds]

@shoyer
Copy link
Member

shoyer commented May 15, 2017

I would skip the which argument even if we don't change the name. Behavior that can be accomplished equivalently with a single function call / indexing operation is not worthy of a keyword argument.

Look at the random module for an example of Cython in NumPy. We might have a few other miscellaneous .pyx files hanging around, too.

Base automatically changed from master to main March 4, 2021 02:04
@InessaPawson InessaPawson added the 52 - Inactive Pending author response label Jun 8, 2022
@charris
Copy link
Member

charris commented Feb 20, 2023

I am going to close this. @mspacek Thanks for the work, feel free to pursue this in a new PR.

@charris charris closed this Feb 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants