Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for np.partition #3320

Merged
merged 25 commits into from Oct 17, 2018
Merged

Support for np.partition #3320

merged 25 commits into from Oct 17, 2018

Conversation

rjenc29
Copy link
Contributor

@rjenc29 rjenc29 commented Sep 18, 2018

No description provided.

@codecov-io
Copy link

codecov-io commented Sep 18, 2018

Codecov Report

Merging #3320 into master will increase coverage by 0.04%.
The diff coverage is 79.87%.

@@            Coverage Diff             @@
##           master    #3320      +/-   ##
==========================================
+ Coverage   80.57%   80.61%   +0.04%     
==========================================
  Files         390      390              
  Lines       78248    78608     +360     
  Branches     8834     8885      +51     
==========================================
+ Hits        63045    63368     +323     
- Misses      13821    13843      +22     
- Partials     1382     1397      +15

@stuartarchibald
Copy link
Contributor

Thanks for opening this PR. The failures from 32bit linux coming from the Azure builds can be ignored at present whilst issues with the CI instructions are sorted out!

@stuartarchibald stuartarchibald added this to the Numba 0.41 RC milestone Sep 19, 2018
@rjenc29
Copy link
Contributor Author

rjenc29 commented Sep 19, 2018

Sure thing – great to see the nascent Azure builds!

So, on to this PR….

In the interests of full disclosure:

  1. I struggled a bit to get my head around the behaviour of np.partition in the general case (multi-dimensional array containing some NaNs etc).
  2. I haven’t attempted to replicate the ‘introselect’ algorithm; I just thought I’d wire in the existing select / partition algos which median and percentile use.

Given (1) and (2), I decided to reimplement all of the (relevant) unit tests from numpy and add a few of my own – therefore, the tests are in general not comparing versus numpy output, but rather asserting certain properties of the output which must be true for a correctly partitioned array (of which there may be many for a given input array and partition index).

The performance is marginally better or the same as numpy for ‘small’ problems but doesn’t scale particularly well and is outperformed by numpy for ‘not small’ problems. I can provide more info / examples / a notebook if helpful.

From a review perspective, whenever you have some time:

  1. The output will not in general match numpy except with regards to those properties which be true for a correctly partitioned array (in my understanding). Is this acceptable?
  2. The scaling / performance properties are not great – I can think of some areas which might contribute but some core dev advice would be appreciated.
  3. Should we look to implement np.argpartition whilst we're at it?
  4. The usual stuff, please :)

@seibert
Copy link
Contributor

seibert commented Sep 19, 2018

FYI: The Linux 32 failure on Azure can be fixed if you pull #3309 into this PR.

@rjenc29 rjenc29 changed the title [WIP] Support for np.partition Support for np.partition Sep 24, 2018
@stuartarchibald
Copy link
Contributor

@rjenc29 Thanks for this. To your points...

  1. I think this is ok, unless the algorithms in NumPy are replicated exactly matching will be very hard. Further the guarantee over the output is The ordering of the elements in the two partitions is undefined., so users shouldn't be relying on output order.
  2. I think that two things are adversely impacting the performance, the first is the NaN checking, the second is the reuse of _select which doesn't seem to have the same level of efficiency as NumPy's equivalent. I'm of the view that whilst it'd be great to match or beat the performance of NumPy, first, let's make sure this is all working and there's an implementation to work from.
  3. Could do, though I think let's get this in first as I expect any changes/improvements to this would be reflected in argpartition?

@rjenc29
Copy link
Contributor Author

rjenc29 commented Sep 29, 2018

Thanks for the feedback. I tried out a couple of options to reduce the cost of the NaN checking / number of pivots & I think you're on the money.

The last commit provides a speedup over my original of x2 to x5, depending on the nature of the inputs, and means there's a wider range of problem sizes for which the performance is the same or better than NumPy.

It does however involve touching the _partition algorithm - in theory, I haven't changed its default behaviour (functional or non functional), but it's reached quite widely so there's attendant risk. It also might look a bit weird.

It would look less weird & be less risky if I instead implemented a 'nan aware partition' but it might look a bit duplicitous as it's largely 'the same' except for the predicate which is used to determine whether a pivot is needed.

Anyway, if you have time to take a look, let me know what you reckon.

@stuartarchibald
Copy link
Contributor

@rjenc29 no problem, thanks for looking at this again, glad to hear that my guess at performance bottlenecks was correct and some speed up was achieved. I think changing the _partition alg is probably a necessity here, baking the NaN handling into the pivoting saves multiple array walking operations and multiple branches for result computation. To avoid the duplication/propagating kwargs factory functions may help as named algs can be trivially composed with them, have a look at: stuartarchibald@90acad4 and see what you think, you are welcome to use/not use it!

@rjenc29
Copy link
Contributor Author

rjenc29 commented Oct 4, 2018

Thanks for the factory functions suggestion - very neat. Included in the last set of commits.

Copy link
Contributor

@stuartarchibald stuartarchibald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates to the impl based on initial feedback. I've given the code some initial review now. Most of my comments are inline but there are a couple of more general things to resolve.

  1. If the tests came from NumPy/are re-impls, a reference to their original source (with SHA in the URL) would be good to add to the comments in the test code.
  2. Query... Given we have a source of truth in NumPy, could some of the tests check that the set of unordered elements match NumPy's (I guess this is perhaps equivalent to matching </> sorted[kth], in which case I think this is done in some places already so no need to worry, this is my query!)?

Other than that, the impl and testing is generally good, though the tests need expanding to cover types that are legal for a (see inline comment for details).

Thanks again.

@overload(np.partition)
def np_partition(a, kth):

if isinstance(kth, (types.Array, types.Sequence)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps fold this branching into:

    kthdt = getattr(kth, 'dtype', kth)
    if not isinstance(kthdt, types.Integer):
        raise TypeError('Partition index must be integer')

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that a needs some type legalization similar to that of kth, what is valid as a type for a? Seems like Numpy rejects at least scalars and 0d arrays like np.array(1). I guess a could feasibly a sequence type too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took out the branching and added support for Boolean (who knew)!

Put in a guard to reject 0D arrays and support array-like inputs (tuple, list).

raise TypeError('Partition index must be integer')

def np_partition_impl(a, kth):
if len(a.flat) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a.size == 0 ? saves the flat iter and len calls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if not isinstance(kth, types.Integer):
raise TypeError('Partition index must be integer')

def np_partition_impl(a, kth):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a needs to be put through _asarray. Without this asking for array attrs on a is an error. Also see my note on type legalization above.

As an aside, it's weird in NumPy that this is valid:

np.partition([], 2)
np.partition(np.array([]), 2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, now using _asarray.

Those two edge cases are pretty weird - I added them explicitly to check the behaviour is equivalently weird.

@rjenc29
Copy link
Contributor Author

rjenc29 commented Oct 5, 2018

Added references to the original NumPy tests and a sanity check along the lines suggested which I have dropped in a few places.

Added a bunch of additional tests - both positive (e.g. support for array-like inputs) and negative (reject 0D arrays).

The last CI failed but I suspect this is spurious. I will commit something tomorrow morning to give CI a kick.

Cheers!

@stuartarchibald
Copy link
Contributor

Thanks @rjenc29 , I've given CI a shove. By the nature of the way it broke/got stuck, there may be a memory leak.

@rjenc29
Copy link
Contributor Author

rjenc29 commented Oct 5, 2018

Great, thanks! I will keep an eye out.

@rjenc29
Copy link
Contributor Author

rjenc29 commented Oct 6, 2018

Added handling for another edge case (reject multi-dimensional kth) and some minor tweaks.

A few of the CI jobs timed out; those which did not have passed.

I will hold off on any more changes at this time; if you have time to review, that would be great.

@rjenc29
Copy link
Contributor Author

rjenc29 commented Oct 7, 2018

5 jobs timed out in the last CI run (4 on the previous).

Just a thought - there's a way of speeding up the case where kth elements are sequential - e.g. np.partition(some_array, (1, 2, 3, 4)) - it's a bit fiddly, but it seems to work. Anyway, I will park it for now rather than keep moving the review goal posts :)

@seibert
Copy link
Contributor

seibert commented Oct 8, 2018

Aside: Our test duration has been creeping up on the Travis CI 50 minute maximum for a while, and it seems we've finally hit it. Might be a good time to abandon it since the same tests finish in 27 minutes on Azure.

@stuartarchibald
Copy link
Contributor

Moving test duration discussion to #3391.

@stuartarchibald
Copy link
Contributor

@sklam as this contains a variant of my patch stuartarchibald@90acad4 please could you take a quick look. I'm of the view that this PR is good to go.

@stuartarchibald
Copy link
Contributor

@rjenc29 this has a conflict since merging #3337, please could you merge in/rebase on master? Thanks.

@rjenc29
Copy link
Contributor Author

rjenc29 commented Oct 16, 2018

Done - will keep an eye on CI...

@stuartarchibald
Copy link
Contributor

@rjenc29 thanks for the fix up. @sklam confirmed "Looks good" over private channel (thanks!). This can be merged once CI passes.

@stuartarchibald stuartarchibald merged commit 92d7b37 into numba:master Oct 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants