Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add argmax and argmin to ExtensionArray #27801

Merged
merged 36 commits into from Jul 8, 2020

Conversation

makbigc
Copy link
Contributor

@makbigc makbigc commented Aug 7, 2019

The methods added in this PR ignore nan ,i.e., skipna=True. The existing categorical.min return nan if categorical contain any nan. This behavior is expected in test_min_max (tests/arrays/categorical/test_analytics.py).

@gfyoung gfyoung added the ExtensionArray Extending pandas with custom dtypes or arrays. label Aug 9, 2019
pass
else:
ser = pd.Series(data)
with pytest.raises(TypeError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review.
I think this is non-sense to implement argmax, argmin, max and min for ArrowBoolArray which have two values only. Calling those methods will raise NotImplementedError. Getting but not calling the max and min attributes by gettattr doesn't raise any error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @gfyoung was asking that you assert something about the error message with the match= keyword.

Copy link
Contributor Author

@makbigc makbigc Dec 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those functions just raise the TypeError without any error message.


def test_min(self):
# GH 24382
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come these are empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The methods added in this PR ignore nan ,i.e., skipna=True. The existing categorical.min return nan if categorical contain any nan. This behavior is expected in test_min_max (tests/arrays/categorical/test_analytics.py).

If min and max of the generic EA ignoring nan is what we want, future PR is required to add skipna parameter to categorical.min and categorical.max.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want a release where these become out of sync, so perhaps a PR implementing skipna=True/False for Categorical first makes sense.

pandas/core/arrays/base.py Outdated Show resolved Hide resolved
pass
else:
ser = pd.Series(data)
with pytest.raises(TypeError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @gfyoung was asking that you assert something about the error message with the match= keyword.

pandas/tests/extension/base/methods.py Outdated Show resolved Hide resolved

def test_min(self):
# GH 24382
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want a release where these become out of sync, so perhaps a PR implementing skipna=True/False for Categorical first makes sense.

@jreback
Copy link
Contributor

jreback commented Sep 8, 2019

can you merge master; move the release note to 1.0

@makbigc
Copy link
Contributor Author

makbigc commented Sep 10, 2019

@jreback This PR is pending for #27929 in which whether the keyword numeric_only should be deprecated is still in discussion.

@jbrockmendel
Copy link
Member

@makbigc can you rebase? I think getting these implemented could help with some of the groupby cleanup ive been working on

@jbrockmendel
Copy link
Member

@makbigc can you rebase

@makbigc
Copy link
Contributor Author

makbigc commented Oct 29, 2019

sorry for the late reply. I will work on it.

@@ -6,6 +6,7 @@ What's new in 0.25.1 (July XX, 2019)
Enhancements
~~~~~~~~~~~~

- Add :meth:`ExtensionArray.argmax`, :meth:`ExtensionArray.max`, :meth:`ExtensionArray.argmin` and :meth:`ExtensionArray.min` (:issue:`24382`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to 1.0.0 now?

@alimcmaster1
Copy link
Member

This would definitely be a neat feature @makbigc are you still wanting to work on this?

@alimcmaster1
Copy link
Member

@makbigc can you merge master?

@pep8speaks
Copy link

pep8speaks commented Dec 4, 2019

Hello @makbigc! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-08 12:15:58 UTC

@makbigc
Copy link
Contributor Author

makbigc commented Dec 7, 2019

I don't know what linting error I have made.
Please tell me anything else to change.
Thanks.

@alimcmaster1
Copy link
Member

@makbigc mind merging master - will then take a look at any test failures.

@MarcoGorelli
Copy link
Member

@makbigc sorry to chase you up, just wanted to ask - are you still working on this? Thanks :)

@jorisvandenbossche
Copy link
Member

OK, I pushed some updates to this PR:

  • Removed min/max for now, to focus this PR on argmin/argmax (and it's still in the git history of course)
  • Made an implementation of argmin/argmax based on _values_for_argsort, with missing value handling, similar as we currently have pandas/core/sorting.py::nargsort
  • Expanded the tests to cover duplicated values / all missing values.

So all please take a new look!

I didn't yet hook this into Series.argmin/argmax (there is the question what the all-NaN case, where Series currently deviates from EA -> #33941).

What's also still missing is a skipna=False keyword, but let's start with reviewing the current updates (-> #33942 for an issue about this)

@jbrockmendel
Copy link
Member

Made an implementation of argmin/argmax based on _values_for_argsort

My understanding based on other threads is that we shouldn't be using values_for(argsort|factorize) for anything other than (argsort|factorize)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 9, 2020

@jbrockmendel I think argsort and argmax/min are basically the same regarding requirements on _values_for_argsort ?

Before we better define the expected sematics of _values_for_argsort, we indeed shouldn't use it in other context as argsort (eg indexing). But IMO argsort and argmax/min are the same kind of usage.

@jbrockmendel
Copy link
Member

Before we better define the expected sematics of _values_for_argsort, we indeed shouldn't use it in other context as argsort (eg indexing). But IMO argsort and argmax/min are the same kind of usage.

I dont understand this distinction. My understanding based on what you've said elsewhere is that we cant count on _values_for_argsort to be implemented, only argsort

@jorisvandenbossche
Copy link
Member

I dont understand this distinction.

_values_for_argsort is used to implement argsort, and hence has certain requirements on the return value of _values_for_argsort (sort in the same order, but in practice now don't need to handle missing values).
As far as I can see, argmax/argmin have the exact same requirements as argsort (at least I can't think of a case where argsort would work but argmax/argmin would fail or be incorrect. Can you?). But that is not at all the case for using those values in indexing context, or in joining operations, where there are additional requirements (hashable, handle missing values), and thus for those we can't simply assume that _values_for_argsort is usable for those use cases just because it is usable for argsort. But for argmax/argmin, I think we clearly can make this assumption.

My understanding based on what you've said elsewhere is that we cant count on _values_for_argsort to be implemented, only argsort

That was indeed the original idea: as EA author you can either implement _values_for_argsort, or either override EA.argsort with a custom implementation (which indeed means we can't rely on _values_for_argsort being usable directly).
But so, then we can simply expand this implmentation note in the docs to: you can either implement _values_for_argsort (and then argsort, argmin and argmax provide a default implementation based on those values), or implement argsort, argmin and argmax yourself. I don't think that is any problem?

Now, what became apparent in the recent discussions regarding those _values_for_argsort, is that this original idea of "either implement EA._values_for_argsort or either override EA.argsort" in practice isn't followed anymore. We "broke" that since the original EA implementation while fixing some bugs last year, so this is something to be fleshed out -> #27218 (and the last paragraph of #33276 (comment)). But if we can fix this at some point for argsort, we can at once fix it for argmax/argmin as well (as the fix will be the same, I assume). So I dont't think this "broken" part of the EA interface is a reason to not use _values_for_argsort for argmax/argmin.

@jorisvandenbossche
Copy link
Member

@jbrockmendel any response on the above? Do you understand why / are you OK with using _values_for_argsort here ?

@jbrockmendel
Copy link
Member

Do you understand why / are you OK with using _values_for_argsort here ?

OK with it

@jorisvandenbossche
Copy link
Member

And more specific comments on the PR?

@@ -319,6 +319,33 @@ def nargsort(
return indexer


def nargminmax(values, method: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don’t think these
should be in base.py

rather in array_ops

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I put them here is because I think it makes sense to keep it close to nargsort, since the code is very similar (using the same approach with the idx/non_nan_idx etc)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don’t think these should be in base.py

BTW, this is not base.py, but core/sorting.py, which groups a whole bunch of functionality related to sortable values.
(it might make sense to move sorting.py into the array_ops submodule, but I would do that as a separate move)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k that makes sense (and let’s move sorting.py) as a followon

@@ -319,6 +319,33 @@ def nargsort(
return indexer


def nargminmax(values, method: str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

annotate values as EA?

func = np.argmax if method == "argmax" else np.argmin

mask = np.asarray(isna(values))
values = values._values_for_argsort()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not wild about relying on non-public attrs here. could we have the EA method pass values and mask?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_values_for_argsort is a "public developer" API (it's part of the EA interface), that's the entire point of it.

I know we have had the discussion about the point of _values_for_argsort and in principle we could also do without. But at this point, we have that method, it is used for argsort as well, so I think it is most logical that I use it here. And we can continue that general discussion about _values_for_argsort elsewhere.

@jbrockmendel
Copy link
Member

And more specific comments on the PR?

With the EA methods in place, should we be dispatching to it from Series/DataFrame etc?

@jorisvandenbossche
Copy link
Member

should we be dispatching to it from Series/DataFrame etc?

Yes, see my "I didn't yet hook this into Series.argmin/argmax" above (#27801 (comment)).

There are some open questions regarding behaviour: #33941 (I suppose we are fine with new extension dtypes to deviate from the current Series behaviour, but not fully sure for longer existing ones (categorical, datetimetz etc)).

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question inilne.

Can you add this to the Methods in the ExtensionArray docstring?

pandas/core/sorting.py Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

Any more comments on the current PR?

(I know it still needs to be used in Series.argmin/argmax. I can do that here, but given the API changes ( #33941, #33942), could also do that in a follow-up PR)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jreback
Copy link
Contributor

jreback commented Jun 20, 2020

@makbigc can you merge master

@jorisvandenbossche
Copy link
Member

Going to merge this once CI passes (after the update with master)

@jorisvandenbossche jorisvandenbossche merged commit 16bbae0 into pandas-dev:master Jul 8, 2020
@jorisvandenbossche
Copy link
Member

@makbigc thanks for starting this PR, and sorry again the review / discussion process didn't go that smoothly initially.

@jorisvandenbossche
Copy link
Member

Opened #35178 as follow-up issue to actually use this in the Series methods

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement min, argmin, max, argmax on ExtensionArrays?
10 participants