Support for np.partition #3320

rjenc29 · 2018-09-18T07:46:20Z

No description provided.

codecov-io · 2018-09-18T09:37:37Z

Codecov Report

Merging #3320 into master will increase coverage by 0.04%.
The diff coverage is 79.87%.

@@            Coverage Diff             @@
##           master    #3320      +/-   ##
==========================================
+ Coverage   80.57%   80.61%   +0.04%     
==========================================
  Files         390      390              
  Lines       78248    78608     +360     
  Branches     8834     8885      +51     
==========================================
+ Hits        63045    63368     +323     
- Misses      13821    13843      +22     
- Partials     1382     1397      +15

stuartarchibald · 2018-09-19T08:35:37Z

Thanks for opening this PR. The failures from 32bit linux coming from the Azure builds can be ignored at present whilst issues with the CI instructions are sorted out!

rjenc29 · 2018-09-19T10:29:19Z

Sure thing – great to see the nascent Azure builds!

So, on to this PR….

In the interests of full disclosure:

I struggled a bit to get my head around the behaviour of np.partition in the general case (multi-dimensional array containing some NaNs etc).
I haven’t attempted to replicate the ‘introselect’ algorithm; I just thought I’d wire in the existing select / partition algos which median and percentile use.

Given (1) and (2), I decided to reimplement all of the (relevant) unit tests from numpy and add a few of my own – therefore, the tests are in general not comparing versus numpy output, but rather asserting certain properties of the output which must be true for a correctly partitioned array (of which there may be many for a given input array and partition index).

The performance is marginally better or the same as numpy for ‘small’ problems but doesn’t scale particularly well and is outperformed by numpy for ‘not small’ problems. I can provide more info / examples / a notebook if helpful.

From a review perspective, whenever you have some time:

The output will not in general match numpy except with regards to those properties which be true for a correctly partitioned array (in my understanding). Is this acceptable?
The scaling / performance properties are not great – I can think of some areas which might contribute but some core dev advice would be appreciated.
Should we look to implement np.argpartition whilst we're at it?
The usual stuff, please :)

seibert · 2018-09-19T13:56:03Z

FYI: The Linux 32 failure on Azure can be fixed if you pull #3309 into this PR.

stuartarchibald · 2018-09-28T17:37:33Z

@rjenc29 Thanks for this. To your points...

I think this is ok, unless the algorithms in NumPy are replicated exactly matching will be very hard. Further the guarantee over the output is The ordering of the elements in the two partitions is undefined., so users shouldn't be relying on output order.
I think that two things are adversely impacting the performance, the first is the NaN checking, the second is the reuse of _select which doesn't seem to have the same level of efficiency as NumPy's equivalent. I'm of the view that whilst it'd be great to match or beat the performance of NumPy, first, let's make sure this is all working and there's an implementation to work from.
Could do, though I think let's get this in first as I expect any changes/improvements to this would be reflected in argpartition?

rjenc29 · 2018-09-29T18:17:05Z

Thanks for the feedback. I tried out a couple of options to reduce the cost of the NaN checking / number of pivots & I think you're on the money.

The last commit provides a speedup over my original of x2 to x5, depending on the nature of the inputs, and means there's a wider range of problem sizes for which the performance is the same or better than NumPy.

It does however involve touching the _partition algorithm - in theory, I haven't changed its default behaviour (functional or non functional), but it's reached quite widely so there's attendant risk. It also might look a bit weird.

It would look less weird & be less risky if I instead implemented a 'nan aware partition' but it might look a bit duplicitous as it's largely 'the same' except for the predicate which is used to determine whether a pivot is needed.

Anyway, if you have time to take a look, let me know what you reckon.

stuartarchibald · 2018-10-03T11:43:21Z

@rjenc29 no problem, thanks for looking at this again, glad to hear that my guess at performance bottlenecks was correct and some speed up was achieved. I think changing the _partition alg is probably a necessity here, baking the NaN handling into the pivoting saves multiple array walking operations and multiple branches for result computation. To avoid the duplication/propagating kwargs factory functions may help as named algs can be trivially composed with them, have a look at: stuartarchibald@90acad4 and see what you think, you are welcome to use/not use it!

rjenc29 · 2018-10-04T11:55:45Z

Thanks for the factory functions suggestion - very neat. Included in the last set of commits.

stuartarchibald

Thanks for the updates to the impl based on initial feedback. I've given the code some initial review now. Most of my comments are inline but there are a couple of more general things to resolve.

If the tests came from NumPy/are re-impls, a reference to their original source (with SHA in the URL) would be good to add to the comments in the test code.
Query... Given we have a source of truth in NumPy, could some of the tests check that the set of unordered elements match NumPy's (I guess this is perhaps equivalent to matching </> sorted[kth], in which case I think this is done in some places already so no need to worry, this is my query!)?

Other than that, the impl and testing is generally good, though the tests need expanding to cover types that are legal for a (see inline comment for details).

Thanks again.

stuartarchibald · 2018-10-05T10:06:41Z

numba/targets/arraymath.py

+@overload(np.partition)
+def np_partition(a, kth):
+
+    if isinstance(kth, (types.Array, types.Sequence)):


Perhaps fold this branching into:

kthdt = getattr(kth, 'dtype', kth) if not isinstance(kthdt, types.Integer): raise TypeError('Partition index must be integer')

?

I also think that a needs some type legalization similar to that of kth, what is valid as a type for a? Seems like Numpy rejects at least scalars and 0d arrays like np.array(1). I guess a could feasibly a sequence type too?

Took out the branching and added support for Boolean (who knew)!

Put in a guard to reject 0D arrays and support array-like inputs (tuple, list).

stuartarchibald · 2018-10-05T10:13:39Z

numba/targets/arraymath.py

+            raise TypeError('Partition index must be integer')
+
+    def np_partition_impl(a, kth):
+        if len(a.flat) == 0:


if a.size == 0 ? saves the flat iter and len calls

stuartarchibald · 2018-10-05T10:34:47Z

numba/targets/arraymath.py

+        if not isinstance(kth, types.Integer):
+            raise TypeError('Partition index must be integer')
+
+    def np_partition_impl(a, kth):


I think a needs to be put through _asarray. Without this asking for array attrs on a is an error. Also see my note on type legalization above.

As an aside, it's weird in NumPy that this is valid:

np.partition([], 2) np.partition(np.array([]), 2)

Agreed, now using _asarray.

Those two edge cases are pretty weird - I added them explicitly to check the behaviour is equivalently weird.

rjenc29 · 2018-10-05T16:18:07Z

Added references to the original NumPy tests and a sanity check along the lines suggested which I have dropped in a few places.

Added a bunch of additional tests - both positive (e.g. support for array-like inputs) and negative (reject 0D arrays).

The last CI failed but I suspect this is spurious. I will commit something tomorrow morning to give CI a kick.

Cheers!

stuartarchibald · 2018-10-05T16:29:20Z

Thanks @rjenc29 , I've given CI a shove. By the nature of the way it broke/got stuck, there may be a memory leak.

rjenc29 · 2018-10-05T16:39:53Z

Great, thanks! I will keep an eye out.

…gible

rjenc29 · 2018-10-06T11:57:26Z

Added handling for another edge case (reject multi-dimensional kth) and some minor tweaks.

A few of the CI jobs timed out; those which did not have passed.

I will hold off on any more changes at this time; if you have time to review, that would be great.

rjenc29 · 2018-10-07T12:09:28Z

5 jobs timed out in the last CI run (4 on the previous).

Just a thought - there's a way of speeding up the case where kth elements are sequential - e.g. np.partition(some_array, (1, 2, 3, 4)) - it's a bit fiddly, but it seems to work. Anyway, I will park it for now rather than keep moving the review goal posts :)

seibert · 2018-10-08T01:15:32Z

Aside: Our test duration has been creeping up on the Travis CI 50 minute maximum for a while, and it seems we've finally hit it. Might be a good time to abandon it since the same tests finish in 27 minutes on Azure.

stuartarchibald · 2018-10-09T10:22:08Z

Moving test duration discussion to #3391.

numba/tests/test_np_functions.py

stuartarchibald · 2018-10-15T12:13:31Z

@sklam as this contains a variant of my patch stuartarchibald@90acad4 please could you take a quick look. I'm of the view that this PR is good to go.

stuartarchibald · 2018-10-16T14:52:39Z

@rjenc29 this has a conflict since merging #3337, please could you merge in/rebase on master? Thanks.

rjenc29 · 2018-10-16T16:10:30Z

Done - will keep an eye on CI...

stuartarchibald · 2018-10-16T17:04:04Z

@rjenc29 thanks for the fix up. @sklam confirmed "Looks good" over private channel (thanks!). This can be merged once CI passes.

rjenc29 added 7 commits September 14, 2018 07:57

initial commit

45423a7

handle invalid kths

849a2ce

lots of tests

9adad60

reduce search space for subsequent pivots

44a3d6f

update travis

3869e06

minor refinements

dc8170e

added some comments

cba8568

stuartarchibald added this to the Numba 0.41 RC milestone Sep 19, 2018

stuartarchibald self-requested a review September 19, 2018 08:35

Merge branch 'master' into partition

f4cf6fb

rjenc29 changed the title ~~[WIP] Support for np.partition~~ Support for np.partition Sep 24, 2018

Merge branch 'master' into partition

b03cc56

np.partition speedup

d68de2f

rjenc29 added 3 commits October 4, 2018 08:43

Merge branch 'master' into partition

3cf00b2

factories for select and partition

888103e

use seeded rng in tests

8df9811

stuartarchibald requested changes Oct 5, 2018

View reviewed changes

rjenc29 added 5 commits October 5, 2018 12:34

use a.size to determine if a is empty

bbdd731

take out branching in initial kth validation

380f301

checks for validity of a and add tests

8e9301b

add sanity check versus numpy

ebd6e61

add comment to ref NumPy tests

a0544c5

handle boolean inputs

821f060

rjenc29 added 3 commits October 5, 2018 20:32

handle multi-dimensional kth and make exception messages more intelli…

511faa7

…gible

reorder checks

2e3f111

relocate comment

614fcb8

additional test and rename variable

6293617

stuartarchibald mentioned this pull request Oct 9, 2018

Some tests are taking disproportionate amounts of time to run #3391

Open

1 task

stuartarchibald reviewed Oct 11, 2018

View reviewed changes

numba/tests/test_np_functions.py Show resolved Hide resolved

stuartarchibald approved these changes Oct 15, 2018

View reviewed changes

rjenc29 added 2 commits October 16, 2018 16:56

merge master

8ee1adb

fix mismerge - variable rename and additional tests

bc83639

stuartarchibald merged commit 92d7b37 into numba:master Oct 17, 2018

stuartarchibald mentioned this pull request Apr 2, 2019

Support for numpy.partition #3185

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for np.partition #3320

Support for np.partition #3320

rjenc29 commented Sep 18, 2018

codecov-io commented Sep 18, 2018 •

edited

stuartarchibald commented Sep 19, 2018

rjenc29 commented Sep 19, 2018

seibert commented Sep 19, 2018

stuartarchibald commented Sep 28, 2018

rjenc29 commented Sep 29, 2018

stuartarchibald commented Oct 3, 2018

rjenc29 commented Oct 4, 2018

stuartarchibald left a comment

stuartarchibald Oct 5, 2018

stuartarchibald Oct 5, 2018

rjenc29 Oct 5, 2018

stuartarchibald Oct 5, 2018

rjenc29 Oct 5, 2018

stuartarchibald Oct 5, 2018

rjenc29 Oct 5, 2018

rjenc29 commented Oct 5, 2018

stuartarchibald commented Oct 5, 2018

rjenc29 commented Oct 5, 2018

rjenc29 commented Oct 6, 2018

rjenc29 commented Oct 7, 2018

seibert commented Oct 8, 2018

stuartarchibald commented Oct 9, 2018

stuartarchibald commented Oct 15, 2018

stuartarchibald commented Oct 16, 2018

rjenc29 commented Oct 16, 2018

stuartarchibald commented Oct 16, 2018

Support for np.partition #3320

Support for np.partition #3320

Conversation

rjenc29 commented Sep 18, 2018

codecov-io commented Sep 18, 2018 • edited

Codecov Report

stuartarchibald commented Sep 19, 2018

rjenc29 commented Sep 19, 2018

seibert commented Sep 19, 2018

stuartarchibald commented Sep 28, 2018

rjenc29 commented Sep 29, 2018

stuartarchibald commented Oct 3, 2018

rjenc29 commented Oct 4, 2018

stuartarchibald left a comment

Choose a reason for hiding this comment

stuartarchibald Oct 5, 2018

Choose a reason for hiding this comment

stuartarchibald Oct 5, 2018

Choose a reason for hiding this comment

rjenc29 Oct 5, 2018

Choose a reason for hiding this comment

stuartarchibald Oct 5, 2018

Choose a reason for hiding this comment

rjenc29 Oct 5, 2018

Choose a reason for hiding this comment

stuartarchibald Oct 5, 2018

Choose a reason for hiding this comment

rjenc29 Oct 5, 2018

Choose a reason for hiding this comment

rjenc29 commented Oct 5, 2018

stuartarchibald commented Oct 5, 2018

rjenc29 commented Oct 5, 2018

rjenc29 commented Oct 6, 2018

rjenc29 commented Oct 7, 2018

seibert commented Oct 8, 2018

stuartarchibald commented Oct 9, 2018

stuartarchibald commented Oct 15, 2018

stuartarchibald commented Oct 16, 2018

rjenc29 commented Oct 16, 2018

stuartarchibald commented Oct 16, 2018

codecov-io commented Sep 18, 2018 •

edited