Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: argsort behaviour for ExtensionArray with missing values #21801

Open
jorisvandenbossche opened this issue Jul 7, 2018 · 12 comments

Comments

@jorisvandenbossche
Copy link
Member

commented Jul 7, 2018

Currently we don't specify what the behaviour should be for ExtensionArray.argsort when there are missing values.
This is not a huge problem because the Series.sort_values deals with the missing values itself (only argsorts the non-missing data), but still we should pin down and test the behaviour.

I suppose we should follow numpy's example here and put them last (which is also consistent with the default behaviour of sort_values):

In [114]: a = np.array([1, 3, 2, np.nan, 4, np.nan])

In [115]: a.argsort()
Out[115]: array([0, 2, 1, 4, 3, 5])

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Apr 5, 2019

Some examples on this

Numpy sorts NaNs last:

In [5]: a = np.array([2, 1, np.nan, 2])

In [6]: a.argsort()
Out[6]: array([1, 0, 3, 2])

In [7]: a[a.argsort()] 
Out[7]: array([ 1.,  2.,  2., nan])

Categorical already existed for a long time, and has this "NaN first" behaviour

In [8]: a = pd.Categorical(['b', 'a', np.nan, 'b'])   

In [9]: a.argsort()       
Out[9]: array([2, 1, 0, 3])

In [10]: a[a.argsort()]      
Out[10]: 
[NaN, a, b, b]
Categories (2, object): [a, b]

Inconsistent here is that Categorical also has a sort_values method (which other EAs don't have) which puts NaNs last by default:

In [11]: a.sort_values()     
Out[11]: 
[a, b, b, NaN]
Categories (2, object): [a, b]

The new IntegerArray (and also the datetimelike arrays) copied this behaviour from Categorical:

In [12]: a = pd.array([2, 1, np.nan, 2], dtype='Int64')

In [13]: a.argsort()  
Out[13]: array([2, 1, 0, 3])

In [14]: a[a.argsort()]       
Out[14]: 
<IntegerArray>
[NaN, 1, 2, 2]
Length: 4, dtype: Int64

On the other hand, IntervalArray currently errors on it:

In [20]: a= pd.arrays.IntervalArray.from_tuples([(1, 2), (0, 1), np.nan, (1, 2)]) 

In [21]: a.argsort()  
...
TypeError: unorderable types: Interval() > float()
@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Apr 5, 2019

I would say ideally, we would follow the "NaN last" behaviour (consistent with numpy's argsort + consistent with the default of sort_values in pandas).

The main question would be how to get there: can we just change this? (that would be a hard breaking change, but maybe we can do it for the recent EAs?) How could we deprecate it? When deprecating, we want a way for users to already get the future behaviour (but adding a keyword only for the deprecation (which later needs to be deprecated itself) is also ugly ..).

One option could be to add a na_position keyword (-> None by default (current default), will change to 'last' in the future).
One disadvantage is that this makes the EA interface (what EA authors need to implement) more complex, as the argsort method needs to handle this additional case. And another disadvantage, is that this makes the _values_for_argsort option of the interface quite impossible to do (as the values might need to be different depending on whether the NAs sort first or last)

cc @jreback @TomAugspurger

@makbigc

This comment has been minimized.

Copy link
Contributor

commented May 12, 2019

When nan is invloved, argsort behaves in two ways.

The element of index 1 is always nan in the following.

Putting the nan in the beginning

In [27]: arr = integer_array([1, np.nan, 2])

In [28]: arr
Out[28]:

[1, NaN, 2]
Length: 3, dtype: Int64

In [29]: arr.argsort()
Out[29]: array([1, 0, 2])

Putting the nan in the end

In [20]: arr1 = integer_array([1, np.nan, 0], dtype='uint8')

In [21]: arr1
Out[21]:

[1, NaN, 0]
Length: 3, dtype: UInt8

In [22]: arr1.argsort()
Out[22]: array([2, 0, 1])

In [23]: idx = pd.Index([1, np.nan, 2])

In [24]: arr = idx.array

In [25]: arr
Out[25]:

[1.0, nan, 2.0]
Length: 3, dtype: float64

In [26]: arr.argsort()
Out[26]: array([0, 2, 1])

Should we standardize where nan to be placed? _values_for_argsort returns different value for nan in different EA.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented May 13, 2019

Still thinking through this... Right now, I'd prefer to not just break things by moving NaN to the end. So I think my preferred route is a keyword to argsort, but I could be convinced otherwise.

Specifically on

And another disadvantage, is that this makes the _values_for_argsort option of the interface quite impossible to do (as the values might need to be different depending on whether the NAs sort first or last)

We could document that _values_for_argsort is supposed to follow the convention that NaNs are sorted last. This may cause silent failures for 3rd party EA authors who currently behave differently (though we can ping the ones we know about on this issue).

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented May 13, 2019

My main hesitation to a keyword long-term is that it forces you to actually implement both behaviours, making the required argsort more complex.

If we would not want to keep the keyword long term, it could also be an option to only add it to Categorical (where we probably need to deprecate it), but do it as a breaking change to the other recent EAs

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Jul 2, 2019

@jorisvandenbossche this is probably a blocker for the RC, since we want to make a breaking change? I can maybe put up a PR tonight.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jul 2, 2019

Do you mean the Categorical part? (as the EA part is covered by #27137 ?)

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Jul 2, 2019

Yep.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jul 2, 2019

Did we already decide how to go about it?
Simply break, or add a keyword?

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Jul 2, 2019

Simple break IMO.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 2, 2019

agree

@jorisvandenbossche

This comment has been minimized.

Copy link
Member Author

commented Jul 2, 2019

OK

@jreback jreback modified the milestones: 0.25.0, 1.0 Jul 17, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.