ENH: Add np.search() #9055

mspacek · 2017-05-05T13:21:20Z

Find the indices into an array a whose values match those queried in v.

This is a PR to go with issue #9052. I'm not at all sure where this should live. It's in fromnumeric.py right now, which is probably wrong. Of course, there would also need to be documentation changes and some kind of tests added, but I thought I'd put this here now to get some feedback.

eric-wieser

This should probably be in lib/extras.py, not core.

Also, commit message should start with ENH too, not just the PR title

eric-wieser · 2017-05-05T13:44:11Z

numpy/core/fromnumeric.py

+    Both arrays are first flattened to 1D, and the values in `a` must be unique.
+    This is an accelerated equivalent of:
+
+    `np.array([ int(np.where(a == val)[0]) for val in v ])`


I think this would be better documented in terms of the more obvious a.index(v). Have a look at the docstring for isin - I suspect your description can match almost exactly, but swapping a in b for a.index(b)

eric-wieser · 2017-05-05T13:44:39Z

numpy/core/fromnumeric.py

+    Find indices into `a` where elements in `v` match those in `a`.
+
+    Find the indices into an array `a` whose values match those queried in `v`.
+    Both arrays are first flattened to 1D, and the values in `a` must be unique.


Can we remove the flattening of v? Seems like we can meaningfully preserve shape here, and searchsorted already knows how to handle Nd input as its second argument

eric-wieser · 2017-05-05T13:46:09Z

numpy/core/fromnumeric.py

+    asortis = a.argsort()
+    ## TODO: which is preferable?:
+    return asortis[a[asortis].searchsorted(v)]
+    #return asortis[a.searchsorted(v, sorter=asortis)]


The second will be faster, as it doesn't construct a temporary array

eric-wieser · 2017-05-05T13:49:03Z

numpy/core/fromnumeric.py

+    a, v = np.ravel(a), np.ravel(v)
+    if len(a) != len(np.unique(a)):
+        raise ValueError("values in `a` must be unique for unambiguous results")
+    if not np.in1d(v, a).all():


This should be .isin

eric-wieser · 2017-05-05T13:50:47Z

numpy/core/fromnumeric.py

+    """
+    a, v = np.ravel(a), np.ravel(v)
+    if len(a) != len(np.unique(a)):
+        raise ValueError("values in `a` must be unique for unambiguous results")


Seems to me that we should just allow this - computing unique is expensive, especially if the input is already unique

jaimefrio · 2017-05-05T14:34:23Z

Changes, including extensions, to the existing API require discussion on the mailing list, please send a message explaining the proposed change.

My personal opinion is that the code you are proposing solves a too restricted problem. If it could handle both values in v not found in a and found more than once in a, then maybe it would make sense. But that is problematic:

The simplest way to detect missing values is to run np.searchsorted on your arrays twice, one with side='left', the other with side='right'. If you get the same index for a value on both searches, then the value is not in the array.
Running two searches would also let you find multiples instances of a value: if searchsorted returns indices i_l and i_r for left and right searches of value v, then that value is found in indices asortis[i_l:i_r] of the array a.
The problem with this is that there is no easy way to return these results to the user in a numpythonic way. Because the number of times each value appears in an array could be different, you would need to return a list of arrays, with one array of indices per value in v. But lists of arrays are not what you typically want in NumPy, so it's not clear to me that this is very helpful.

In summary, unless someone comes up with some super smart way of handling things, it seems to me that a function like this is doomed, because it either will solve a too restricted problem or will produce a too complex output.

eric-wieser · 2017-05-05T14:42:02Z

If you get the same index for a value on both searches, then the value is not in the array.

Is there a reason that this is not the approach taken in in1d?

jaimefrio · 2017-05-05T15:35:32Z

I suppose the trickery in in1d is faster? For one data point it is:

n = 10000

a = np.random.randint(2 * n, size=n)
b = np.random.randint(2 * n, size=n)

def f1(a, b):
  return np.in1d(a, b)

def f2(a, b):
  sorter = np.argsort(b)
  left = np.searchsorted(b, a, side='left', sorter=sorter)
  right = np.searchsorted(b, a, side='right', sorter=sorter)
  return left != right

assert np.all(f1(a, b) == f2(a, b))

%timeit f1(a, b)
100 loops, best of 3: 2.71 ms per loop

%timeit f2(a, b)
100 loops, best of 3: 4.39 ms per loop

shoyer · 2017-05-08T04:56:12Z

I am inclined to agree with @jaimefrio here. There isn't really a natural way to make this function work for finding all matches in NumPy, because numpy doesn't have good ways to represent ragged arrays.

mspacek · 2017-05-08T19:29:03Z

Sorry for the delay, and thanks for the comments.

@eric-wieser, I've tried to address your comments. I can't find any file named lib/extras.py. Did you mean lib/ma/extras.py? For now I've left it in fromnumeric.py, right above the definition for searchsorted.

After thinking about this some more, it seems to me that np.search might be a better name than argmatch or argval or find or index. It complements np.searchsorted in that it doesn't assume that the input array a is sorted. Otherwise, it does pretty much the same thing.

@jaimefrio, OK, I'll send a message to the mailing list. My motivation for this is that I often find myself using searchsorted for exactly this purpose because it's fast and convenient, and usually the following assumptions are valid:

a is sorted
v is a subset of a

But because I rarely test either of these assumptions, I always feel like I'm abusing searchsorted, and I would guess many others do the same. Looking at my own habits and uses, it seems to me that finding the indices of matching values of one array in another is a more common use case than finding insertion indices of one array into another sorted array.

Regarding this being something that solves a restricted problem, I feel that's a benefit, not a drawback. I think it's safest to just raise an error if v is not a subset of a. However, I came across numpy_indexed.indices() which seems to do exactly what I'm proposing here. That function has a kwarg missing, which I quite like:

missing : {'raise', 'ignore', 'mask' or int}
        if `missing` is 'raise', a KeyError is raised if not all elements of `that` are present in `this`
        if `missing` is 'mask', a masked array is returned,
        where items of `that` not present in `this` are masked out
        if `missing` is 'ignore', all elements of `that` are assumed to be present in `this`,
        and output is undefined otherwise
        if missing is an integer, this is used as a fill-value

Does that sound like a reasonable way to handle values in v that might be missing in a?

shoyer · 2017-05-09T17:54:28Z

I think it's important for this function to have two branches, based upon whether it's worth sorting a or not. Otherwise, users will start using this as an alternative to np.nonzero(v == a) (for scalar a) and be surprised when it's slower.

Roughly speaking, if len(v) < log(len(a)), we don't want to sort a and instead need to do a naive loop over a and v (needs to be done in C or Cython for performance) that can exit early if it finds a match. If A = len(a) and V = len(v), the time complexity here is roughly O(A log(A) + V log(V)) for the sort based solution vs O(A*V) for the loop.

That function has a kwarg missing, which I quite like:

Core NumPy functions should not return masked arrays, so 'mask' is out. I'm also not sure how 'ignore' would be helpful -- usually we avoid undefined behavior unless there are significant performance benefits, which I don't think there are in this case.

This leaves 'raise' and int. Raising should be the default behavior, so I would suggest a default value of None to indicate raising an error (rather than a string literal), and allowing an optional integer fill value. For that matter, it would probably be better to call this fill_value rather than missing.

eric-wieser · 2017-05-09T23:31:16Z

Core NumPy functions should not return masked arrays, so 'mask'

Although adding a complementary np.ma.search would be sensible

Tillsten · 2017-05-10T20:01:19Z

Maybe the use-case of finding the index of nearest element in the array could also somehow be handled. I can't be the only one using something like np.argmin(np.abs(arr-value_to_find)) quite often.

mspacek · 2017-05-15T14:44:17Z

Thanks again for the comments. I've made some changes.

@shoyer, I've added a fill_value kwarg. As for optimization, I could probably write up a naive loop in Cython if given a bit of guidance regarding how to integrate it with the rest of the code. Doing it directly in C might be a bit much for me...

@jaimefrio, I've added a which kwarg, which controls whether to return the index of the first or last hit for values in v that have multiple hits in a. Maybe someone can come up with a better kwarg name?

@eric-wieser, I'm still not sure where to put this. It's still in fromnumeric.py

perimosocordiae · 2017-05-15T15:11:33Z

numpy/core/fromnumeric.py

+    else:
+        indices = np.zeros_like(v, dtype=int)
+        indices[hits] = sortis[sideis[hits]]
+        indices[misses] = fill_value


This seems like a case where np.where makes sense:

indices = np.where(hits, sortis[sideis], fill_value)

This would also mean you don't need to compute misses.

mspacek · 2017-05-15T15:25:45Z

By the way, this is only lightly tested so far. See the the Examples section in the docsting.

jaimefrio · 2017-05-15T15:44:22Z

numpy/core/fromnumeric.py

+    lis = a.searchsorted(v, side='left', sorter=sortis)
+    ris = a.searchsorted(v, side='right', sorter=sortis)
+    sideis = {'first':lis, 'last':ris-1}[which]
+    hits = lis != ris # elements in v that are in a


The searches are the second most expensive operations after the sorting, or the most expensive one if the needle is larger than the haystack. So it is best to not do two of them if at all possible. And if you don't want to find all occurrences it is certainly possible:

if which == 'first': index = a.searchsorted(v, side='left', sorter=sortis) inbounds = index < len(a) elif which == 'last': index = a.searchsorted(v, side='right', sorter=sortis) - 1 inbounds = index >= 0 else: raise ValueError('Relevant message goes here.') hits = a[index[inbounds]] == v[inbounds]

shoyer · 2017-05-15T18:43:13Z

I would skip the which argument even if we don't change the name. Behavior that can be accomplished equivalently with a single function call / indexing operation is not worthy of a keyword argument.

Look at the random module for an example of Cython in NumPy. We might have a few other miscellaneous .pyx files hanging around, too.

charris · 2023-02-20T20:41:41Z

I am going to close this. @mspacek Thanks for the work, feel free to pursue this in a new PR.

eric-wieser reviewed May 5, 2017

View reviewed changes

charris added the 01 - Enhancement label May 5, 2017

ENH: Add np.search()

47fd6ad

mspacek force-pushed the argmatch branch from 58a235a to 47fd6ad Compare May 8, 2017 19:00

mspacek changed the title ~~ENH: Add np.argmatch()~~ ENH: Add np.search() May 8, 2017

charris added the component: numpy._core label May 10, 2017

Add kwargs fill_value and which

7e6a509

perimosocordiae reviewed May 15, 2017

View reviewed changes

jaimefrio reviewed May 15, 2017

View reviewed changes

Base automatically changed from master to main March 4, 2021 02:04

InessaPawson added the 52 - Inactive Pending author response label Jun 8, 2022

charris closed this Feb 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add np.search() #9055

ENH: Add np.search() #9055

mspacek commented May 5, 2017

eric-wieser left a comment •

edited

Loading

eric-wieser May 5, 2017 •

edited

Loading

eric-wieser May 5, 2017

eric-wieser May 5, 2017

eric-wieser May 5, 2017

eric-wieser May 5, 2017

jaimefrio commented May 5, 2017

eric-wieser commented May 5, 2017

jaimefrio commented May 5, 2017

shoyer commented May 8, 2017

mspacek commented May 8, 2017 •

edited

Loading

shoyer commented May 9, 2017 •

edited

Loading

eric-wieser commented May 9, 2017

Tillsten commented May 10, 2017

mspacek commented May 15, 2017

perimosocordiae May 15, 2017

mspacek commented May 15, 2017

jaimefrio May 15, 2017 •

edited

Loading

shoyer commented May 15, 2017

charris commented Feb 20, 2023

ENH: Add np.search() #9055

ENH: Add np.search() #9055

Conversation

mspacek commented May 5, 2017

eric-wieser left a comment • edited Loading

Choose a reason for hiding this comment

eric-wieser May 5, 2017 • edited Loading

Choose a reason for hiding this comment

eric-wieser May 5, 2017

Choose a reason for hiding this comment

eric-wieser May 5, 2017

Choose a reason for hiding this comment

eric-wieser May 5, 2017

Choose a reason for hiding this comment

eric-wieser May 5, 2017

Choose a reason for hiding this comment

jaimefrio commented May 5, 2017

eric-wieser commented May 5, 2017

jaimefrio commented May 5, 2017

shoyer commented May 8, 2017

mspacek commented May 8, 2017 • edited Loading

shoyer commented May 9, 2017 • edited Loading

eric-wieser commented May 9, 2017

Tillsten commented May 10, 2017

mspacek commented May 15, 2017

perimosocordiae May 15, 2017

Choose a reason for hiding this comment

mspacek commented May 15, 2017

jaimefrio May 15, 2017 • edited Loading

Choose a reason for hiding this comment

shoyer commented May 15, 2017

charris commented Feb 20, 2023

eric-wieser left a comment •

edited

Loading

eric-wieser May 5, 2017 •

edited

Loading

mspacek commented May 8, 2017 •

edited

Loading

shoyer commented May 9, 2017 •

edited

Loading

jaimefrio May 15, 2017 •

edited

Loading