Support for np.partition #3320
Codecov Report
@@            Coverage Diff             @@
##           master    #3320      +/-   ##
==========================================
+ Coverage   80.57%   80.61%   +0.04%
==========================================
  Files         390      390
  Lines       78248    78608     +360
  Branches     8834     8885      +51
==========================================
+ Hits        63045    63368     +323
- Misses      13821    13843      +22
- Partials     1382     1397      +15
Thanks for opening this PR. The 32-bit Linux failures on the Azure builds can be ignored for now while issues with the CI instructions are sorted out!
Sure thing – great to see the nascent Azure builds! So, on to this PR…. In the interests of full disclosure:
Given (1) and (2), I decided to reimplement all of the (relevant) unit tests from numpy and add a few of my own – therefore, the tests are in general not comparing versus numpy output, but rather asserting certain properties of the output which must be true for a correctly partitioned array (of which there may be many for a given input array and partition index). The performance is marginally better or the same as numpy for ‘small’ problems but doesn’t scale particularly well and is outperformed by numpy for ‘not small’ problems. I can provide more info / examples / a notebook if helpful. From a review perspective, whenever you have some time:
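As an illustration of the property-based checks described above, here is a minimal sketch, assuming a 1-D, NaN-free input. The helper name `is_valid_partition` is hypothetical and is not taken from the PR's test suite.

```python
import numpy as np


def is_valid_partition(out, original, kth):
    """Return True if `out` is one of the valid partitions of `original` at `kth`.

    Hypothetical helper; assumes 1-D input without NaN.
    """
    out = np.asarray(out)
    expected_sorted = np.sort(np.asarray(original))

    # The output must be a rearrangement of the input.
    if not np.array_equal(np.sort(out), expected_sorted):
        return False

    # The kth element must be the one a full sort would put there.
    if out[kth] != expected_sorted[kth]:
        return False

    # Everything before kth is <= it, everything after is >= it.
    return bool(np.all(out[:kth] <= out[kth]) and np.all(out[kth + 1:] >= out[kth]))


a = np.array([3, 1, 4, 1, 5, 9, 2, 6])
partition_ok = is_valid_partition(np.partition(a, 3), a, 3)
unpartitioned_ok = is_valid_partition(a, a, 3)
```

Because many outputs are valid for a given input, a check of this kind accepts any of them rather than requiring an element-for-element match.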
FYI: The Linux 32 failure on Azure can be fixed if you pull #3309 into this PR.
@rjenc29 Thanks for this. To your points...
Thanks for the feedback. I tried out a couple of options to reduce the cost of the NaN checking / number of pivots & I think you're on the money. The last commit provides a speedup over my original of 2x to 5x, depending on the nature of the inputs, and means there's a wider range of problem sizes for which the performance is the same as or better than NumPy. It does however involve touching the It would look less weird & be less risky if I instead implemented a 'nan aware partition' but it might look a bit duplicative as it's largely 'the same' except for the predicate which is used to determine whether a pivot is needed. Anyway, if you have time to take a look, let me know what you reckon.
@rjenc29 no problem, thanks for looking at this again, glad to hear that my guess at performance bottlenecks was correct and some speed up was achieved. I think changing the |
Thanks for the factory functions suggestion - very neat. Included in the last set of commits.
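For readers unfamiliar with the pattern, the idea of one selection routine parametrised by its comparison can be sketched in plain Python as below. This is not the code from the PR: `make_less_than` and `make_select` are hypothetical names, the selection loop is a textbook Hoare-style quickselect, and placing NaN last is an assumption made for the illustration.

```python
import math


def make_less_than(nan_aware):
    """Hypothetical factory returning the comparison used while partitioning."""
    if nan_aware:
        def less_than(x, y):
            # Treat NaN as larger than every number so that it ends up last.
            if math.isnan(x):
                return False
            if math.isnan(y):
                return True
            return x < y
    else:
        def less_than(x, y):
            return x < y
    return less_than


def make_select(less_than):
    """Hypothetical factory returning a quickselect built on `less_than`."""
    def select(values, k):
        # Rearrange `values` in place so that values[k] is in sorted position.
        lo, hi = 0, len(values) - 1
        while lo < hi:
            pivot = values[(lo + hi) // 2]
            i, j = lo, hi
            while i <= j:
                while less_than(values[i], pivot):
                    i += 1
                while less_than(pivot, values[j]):
                    j -= 1
                if i <= j:
                    values[i], values[j] = values[j], values[i]
                    i += 1
                    j -= 1
            if k <= j:
                hi = j        # kth element lies in the left part
            elif k >= i:
                lo = i        # kth element lies in the right part
            else:
                break         # values[k] already equals the pivot
        return values
    return select


nan_select = make_select(make_less_than(nan_aware=True))
data = nan_select([3.0, math.nan, 1.0, 2.0, math.nan, 0.0], 2)
```

The point of the factory is that the NaN-aware and plain variants share the selection loop and differ only in the comparison passed in.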
Thanks for the updates to the impl based on initial feedback. I've given the code some initial review now. Most of my comments are inline but there are a couple of more general things to resolve.
- If the tests came from NumPy/are re-impls, a reference to their original source (with SHA in the URL) would be good to add to the comments in the test code.
- Query... Given we have a source of truth in NumPy, could some of the tests check that the set of unordered elements matches NumPy's (I guess this is perhaps equivalent to matching < / > sorted[kth], in which case I think this is done in some places already, so no need to worry, this is my query!)?
Other than that, the impl and testing is generally good, though the tests need expanding to cover types that are legal for the a argument (see inline comment for details).
Thanks again.
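The query about using NumPy as a source of truth could look something like the sketch below: since the order within each side of the partition is unspecified, each side is sorted before comparing. The helper name `same_partition_as_numpy` is hypothetical, and the sketch assumes 1-D, NaN-free input.

```python
import numpy as np


def same_partition_as_numpy(candidate_out, a, kth):
    """Compare a candidate partition with NumPy's, ignoring order within each side.

    Hypothetical helper; assumes 1-D input without NaN.
    """
    expected = np.partition(a, kth)
    got = np.asarray(candidate_out)
    return bool(
        got[kth] == expected[kth]
        and np.array_equal(np.sort(got[:kth]), np.sort(expected[:kth]))
        and np.array_equal(np.sort(got[kth + 1:]), np.sort(expected[kth + 1:]))
    )


a = np.array([7, 2, 9, 2, 5, 1])
candidate = np.sort(a)  # stand-in for the output of the implementation under test
agrees = same_partition_as_numpy(candidate, a, 3)
```

A fully sorted array is used as the stand-in candidate here because it is always a valid partition for any kth.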
numba/targets/arraymath.py
Outdated
@overload(np.partition)
def np_partition(a, kth):

    if isinstance(kth, (types.Array, types.Sequence)):
Perhaps fold this branching into:
kthdt = getattr(kth, 'dtype', kth)
if not isinstance(kthdt, types.Integer):
    raise TypeError('Partition index must be integer')
?
I also think that a needs some type legalization similar to that of kth: what is valid as a type for a? Seems like NumPy rejects at least scalars and 0d arrays like np.array(1). I guess a could feasibly be a sequence type too?
Took out the branching and added support for Boolean (who knew)!
Put in a guard to reject 0D arrays and support array-like inputs (tuple, list).
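The checks discussed in this thread happen at typing time inside Numba, but their intent can be shown with a rough runtime analogue in plain NumPy. This is only a sketch: `check_partition_args` is a hypothetical name, and the error messages are illustrative rather than the ones used in the PR.

```python
import numpy as np


def check_partition_args(a, kth):
    """Hypothetical runtime analogue of the argument checks discussed above."""
    arr = np.asarray(a)  # accept lists and tuples as well as arrays
    if arr.ndim == 0:
        # Scalars and 0-d arrays such as np.array(1) are rejected.
        raise TypeError('The first argument must be at least 1-D')

    kth_arr = np.asarray(kth)
    is_integer = np.issubdtype(kth_arr.dtype, np.integer)
    is_boolean = kth_arr.dtype == np.bool_
    if not (is_integer or is_boolean):
        raise TypeError('Partition index must be integer')
    return arr, kth_arr


def raises_type_error(a, kth):
    # Small helper for the demonstration: report whether validation fails.
    try:
        check_partition_args(a, kth)
    except TypeError:
        return True
    return False


list_ok = not raises_type_error([3, 1, 2], 1)
tuple_ok = not raises_type_error((3, 1, 2), True)
zero_d_rejected = raises_type_error(np.array(1), 0)
float_kth_rejected = raises_type_error([3, 1, 2], 1.5)
```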
numba/targets/arraymath.py
Outdated
        raise TypeError('Partition index must be integer')

    def np_partition_impl(a, kth):
        if len(a.flat) == 0:
if a.size == 0? Saves the flat iter and len calls.
Done
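A quick check in plain NumPy (outside Numba) that the two spellings agree, including for empty and multi-dimensional arrays; the sample arrays are arbitrary.

```python
import numpy as np

samples = [np.array([]), np.zeros((0, 3)), np.arange(6).reshape(2, 3)]

# a.size reads an attribute; len(a.flat) builds a flat iterator first.
pairs = [(a.size, len(a.flat)) for a in samples]
```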
    if not isinstance(kth, types.Integer):
        raise TypeError('Partition index must be integer')

    def np_partition_impl(a, kth):
I think a needs to be put through _asarray. Without this, asking for array attrs on a is an error. Also see my note on type legalization above.
As an aside, it's weird in NumPy that this is valid:
np.partition([], 2)
np.partition(np.array([]), 2)
Agreed, now using _asarray.
Those two edge cases are pretty weird - I added them explicitly to check the behaviour is equivalently weird.
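The two edge cases can be reproduced as follows. This sketch relies on the behaviour reported in the thread, namely that NumPy accepts an out-of-range kth when the input is empty; that may depend on the NumPy version in use.

```python
import numpy as np

# kth=2 is out of range for an empty input, yet no error is raised
# (behaviour as reported in the review comment above).
from_list = np.partition([], 2)
from_array = np.partition(np.array([]), 2)

print(from_list.shape, from_array.shape)
```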
Added references to the original NumPy tests and a sanity check along the lines suggested, which I have dropped into a few places. Added a bunch of additional tests - both positive (e.g. support for array-like inputs) and negative (reject 0D arrays). The last CI run failed but I suspect this is spurious. I will commit something tomorrow morning to give CI a kick. Cheers!
Thanks @rjenc29, I've given CI a shove. Judging by the way it broke/got stuck, there may be a memory leak.
Great, thanks! I will keep an eye out.
Added handling for another edge case (reject multi-dimensional kth) and some minor tweaks. A few of the CI jobs timed out; those which did not have passed. I will hold off on any more changes at this time; if you have time to review, that would be great.
5 jobs timed out in the last CI run (4 on the previous). Just a thought - there's a way of speeding up the case where kth elements are sequential - e.g. |
Aside: Our test duration has been creeping up on the Travis CI 50 minute maximum for a while, and it seems we've finally hit it. Might be a good time to abandon it since the same tests finish in 27 minutes on Azure.
Moving test duration discussion to #3391.
@sklam, as this contains a variant of my patch stuartarchibald@90acad4, please could you take a quick look? I'm of the view that this PR is good to go.
Done - will keep an eye on CI...