n-array versions of set operations, especially intersect1d #5179

Closed
cbrueffer opened this issue Oct 13, 2014 · 9 comments

@cbrueffer
Contributor

The stock Python set operations support multiple inputs, while the NumPy equivalents only support two arrays.

A version of intersect1d supporting n arrays was posted to numpy-discussion back in 2009: http://numpy-discussion.10968.n7.nabble.com/intersect1d-for-N-input-arrays-td25956.html

The posted multi-array version works well for me; it would be great to see it (or a variant) in NumPy.
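
For reference, a minimal sketch of a sort-based n-array intersection in the spirit of that mailing-list post (the name intersect_n and this implementation are illustrative, not the posted code; it assumes each input array already has unique elements):

import numpy as np

def intersect_n(*arrays):
    # Illustrative sketch: assumes at least one input array and
    # unique elements within each array (the assume_unique case).
    m = len(arrays)
    if m == 1:
        return np.sort(arrays[0])
    stacked = np.concatenate(arrays)
    stacked.sort()
    # After sorting, a value present in all m arrays fills exactly
    # m consecutive slots, so compare each element with the one
    # m - 1 positions ahead.
    head = stacked[:-(m - 1)]
    return head[head == stacked[m - 1:]]

# e.g. intersect_n(np.array([1, 2]), np.array([2, 3]))  ->  array([2])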

@njsmith
Member

njsmith commented Oct 13, 2014

I can't think why we would have any objection to these in principle, though for the usual free-software reasons I also doubt anyone will jump up to do it for you. Want to prepare a pull request?

@cbrueffer
Contributor Author

Absolutely. Before I do that, do you have an opinion on the issues raised in the referenced email thread? I suppose for API stability the n-array versions should be added as separate functions, but should they be implemented by repeatedly calling the respective two-array versions, or is something like the provided implementation preferable?

@njsmith
Member

njsmith commented Oct 13, 2014

API issues should be raised on numpy-discussion for the community to weigh in on (most people don't read the bug tracker).

I guess it's possible that there are people passing the assume_unique argument as a positional arg, in which case yeah, you might need to make a new function.

I don't really care about the implementation much myself, since I'm not the one who'll be using it :-), and implementations are easily fixable later as long as the API is good.

@jaimefrio
Member

Well, since the method, both old and new, is sorting-based, when you intersect m arrays of n elements each, the whole operation is O(m*n*log(m*n)). If, on the other hand, you do m-1 pairwise intersections, you end up with a worst case of O((m-1)*2*n*log(2*n)). And that worst case only happens if all arrays contain the exact same items, which is probably not the most typical of cases.

My point is that there probably is no major advantage to what you propose over something like reduce(np.intersect1d, arrays), and it may even perform worse in many typical use cases. So saving a few keystrokes doesn't seem to justify adding anything to the API.
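
To illustrate the reduce-based approach (on Python 3, reduce must be imported from functools; it was a builtin on the Python 2 current at the time):

from functools import reduce
import numpy as np

arrays = [np.array([1, 2, 3, 4]),
          np.array([2, 3, 4, 5]),
          np.array([3, 4, 5, 6])]

# Left fold of pairwise intersections:
# intersect1d(intersect1d(arrays[0], arrays[1]), arrays[2])
reduce(np.intersect1d, arrays)  # -> array([3, 4])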

@cbrueffer
Contributor Author

Wow, I wasn't aware of reduce() until now, thanks for the hint! I agree this is a better solution. I'll submit a patch which adds an example of this instead.

@njsmith
Member

njsmith commented Oct 13, 2014

Built-in reduce annoyingly requires an identity, so I think it'd be

reduce(np.intersect1d, arrays, [])

@cbrueffer
Contributor Author

The third argument is the initializer, so adding [] results in the first intersection being between [] and array1 (if I've understood the API docs correctly), resulting in []. Not exactly what's intended :-)
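
A quick check of that pitfall (illustrative):

from functools import reduce
import numpy as np

arrays = [np.array([1, 2, 3]), np.array([2, 3, 4])]

# The initializer is the left operand of the first call, so every
# element ends up intersected with the empty array:
reduce(np.intersect1d, arrays, [])  # -> an empty array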

@njsmith
Member

njsmith commented Oct 13, 2014

Doh, I was thinking of union, obviously. Well, give it a try - it's possible you'll have to do something awkward like

reduce(intersect1d, arrs[1:], arrs[0])

but I'm on my phone, so don't take my word for it (as if you would after my last suggestion...).
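
That form does work, and it matches the two-argument call: without an initializer, reduce seeds the fold with the first element. A quick illustration:

from functools import reduce
import numpy as np

arrs = [np.array([1, 2, 3]), np.array([2, 3]), np.array([2, 4])]

reduce(np.intersect1d, arrs[1:], arrs[0])  # -> array([2])
reduce(np.intersect1d, arrs)               # -> array([2]), same fold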

@cbrueffer
Contributor Author

reduce(np.intersect1d, arrays) as suggested by Jaime actually works perfectly for me. Thanks to both of you for the input!

cbrueffer added a commit to cbrueffer/numpy that referenced this issue Oct 15, 2014
The approach was suggested by Jaime Frio in issue numpy#5179.