BUG: Series.argsort() incorrect handling of NaNs #12694

seth-p · 2016-03-22T18:11:20Z

It appears that Series.argsort() is implemented as s.dropna().argsort().reindex(s.index, fill_value=-1) = np.argsort(s.dropna()).reindex(s.index, fill_value=-1).

There are two problems with this:
(a) Since the result represents integer indices into the original series s, the result should not have the same index as s.index -- it should either be a Series with index [0, 1, ...], or more likely simply be a NumPyarray;
(b) The way NaNs are effectively removed before calling np.argsort() leads to indexes that are no longer appropriate into the original Series, resulting in the nonsensical results shown in [9] and [10] below.

Note that: "As of NumPy 1.4.0 argsort works with real/complex arrays containing nan values." So it's not obvious to me that there's much to be gained from mucking around with np.argsort(s.values).

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 20:20:57) [MSC v.1600 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: s = pd.Series([200, 100, 400, 500, np.nan, 300], index=list('abcdef'))

In [4]: s
Out[4]:
a    200.0
b    100.0
c    400.0
d    500.0
e      NaN
f    300.0
dtype: float64

In [5]: s.argsort()
Out[5]:
a    1
b    0
c    4
d    2
e   -1
f    3
dtype: int64

In [6]: s.dropna().argsort().reindex(s.index, fill_value=-1)
Out[6]:
a    1
b    0
c    4
d    2
e   -1
f    3
dtype: int64

In [33]: np.argsort(s.dropna()).reindex(s.index, fill_value=-1)
Out[33]:
a    1
b    0
c    4
d    2
e   -1
f    3
dtype: int64

In [34]: np.argsort(s.dropna().values)
Out[34]: array([1, 0, 4, 2, 3], dtype=int64)

In [7]: np.argsort(s.values)
Out[7]: array([1, 0, 5, 2, 3, 4], dtype=int64)

In [8]: s[np.argsort(s.values)]  # desired result
Out[8]:
b    100.0
a    200.0
f    300.0
c    400.0
d    500.0
e      NaN
dtype: float64

In [9]: s[s.argsort()]  # nonsensical result
Out[9]:
b    100.0
a    200.0
e      NaN
c    400.0
f    300.0
d    500.0
dtype: float64

In [10]: s[s.argsort().values]  # nonsensical result
Out[10]:
b    100.0
a    200.0
e      NaN
c    400.0
f    300.0
d    500.0
dtype: float64

In [11]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3.1
Cython: None
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: None
patsy: 0.4.1
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.4
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 1.0b8
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None

The text was updated successfully, but these errors were encountered:

seth-p · 2016-03-22T19:10:50Z

Perhaps what the original implementation was after is pd.Series(np.arange(s.count()), index=s.index[np.argsort(s.values)[:s.count()]]).reindex(s.index, fill_value=-1):

In [60]: s
Out[60]:
a    200.0
b    100.0
c    400.0
d    500.0
e      NaN
f    300.0
dtype: float64

In [61]: pd.Series(np.arange(s.count()),
                   index=s.index[np.argsort(s.values)[:s.count()]]).\
             reindex(s.index, fill_value=-1)
Out[61]:
a    1
b    0
c    3
d    4
e   -1
f    2
dtype: int32

If one does go with this implementation (where the values represent the ordering), perhaps one should rename the method something other than argsort to avoid confusion with np.argsort (where the values represent indices).

jreback · 2016-03-22T19:54:04Z

@seth-p this is exactly the same as np.argsort. Its not implemented at all like you show. Its just a simple mask filled with .argsort. And it behaves quite close to np.argsort. This is not really used much in pandas as the Index generally serves this purpose.

s[s.argsort()] is of course non-sensical. The -1 are the nan placements.

Not really sure what the issue is about here. If/when we have integer NA then these -1's could be NaN.

seth-p · 2016-03-22T20:24:21Z

s.argsort() is not exactly the same as np.argsort() in the presence of NaNs, as shown by the difference between [76] and [77]:

In [75]: s.values
Out[75]: array([ 200.,  100.,  400.,  500.,   nan,  300.])

In [76]: np.argsort(s.values)
Out[76]: array([1, 0, 5, 2, 3, 4], dtype=int64)

In [77]: s.argsort().values
Out[77]: array([ 1,  0,  4,  2, -1,  3], dtype=int64)

The output of [76] makes sense:

In [80]: s[np.argsort(s.values)]
Out[80]:
b    100.0
a    200.0
f    300.0
c    400.0
d    500.0
e      NaN
dtype: float64

The output of [77] does not make sense, and is not useful.

Put another way, can you explain what the output of [77] means?
Can you give me an example of how one would use this (a) at all; (b) in a way that uses the fact that s.argsort() si a Series as opposed to simply a NumPy array?

The documentation just says "Overrides ndarray.argsort. Argsorts the value, omitting NA/null values, and places the result in the same locations as the non-NA values". My point is that (a) omitting NA/null values the way that it does before calling np.argsort, and (b) placing the result back in the same locations as the non-NA values leads to nonsensical results. Technically it does exactly what the documentation says, but -- going out on a limb here -- that can't possibly be what anyone really wants. The result simply makes no sense.

At the risk of repeating myself a bit, the values produced by np.argsort are (integer) indices into the values array, ordered by the values of the array. Placing those integer indices back into the Series with the original index simply makes no sense -- even in the absence of NaNs. Consider:

In [86]: s1
Out[86]:
a    200.0
b    100.0
c    400.0
d    500.0
e    900.0
f    300.0
dtype: float64

In [87]: s1.argsort()
Out[87]:
a    1
b    0
c    5
d    2
e    3
f    4
dtype: int64

In what way should f be associated with 4?

The presence of NaNs simply messes things up further.

I think the only two plausible implementations of argsort:
(1) like np.argsort, return a NumPy array with integer indices sorted by the values at their locations; or
(2) return a Series with the same index as the orgiinal series, and with values indicating the ordering of the values.
Here are examples of both (in the absence of NaNs, for simplicity):

In [86]: s1
Out[91]:
a    200.0
b    100.0
c    400.0
d    500.0
e    900.0
f    300.0
dtype: float64

In [88]: np.argsort(s1.values)
Out[88]: array([1, 0, 5, 2, 3, 4], dtype=int64)

In [89]: pd.Series(np.arange(s1.count()), index=s1.index[np.argsort(s1.values)[:s1.count()]]).reindex(s1.index, fill_value=-1)
Out[89]:
a    1
b    0
c    3
d    4
e    5
f    2
dtype: int32

Here f is associated with 2 in the sense that the value at f, 300, is the third (i.e. 2 when counting from 0) smallest value in the Series.

jreback · 2016-03-22T20:35:36Z

and I'll ask you back, what is the purpose of doing .argsort() in the first place? you can simply .sort_values(). In numpy it makes sense, but in pandas I don't even see a use for it.

jreback · 2016-03-22T20:36:42Z

To be honest its almost an internal routine.

In [11]: s[s.argsort()!=-1]
Out[11]: 
a    200.0
b    100.0
c    400.0
d    500.0
f    300.0
dtype: float64

In [12]: s.loc[s.argsort().index]
Out[12]: 
a    200.0
b    100.0
c    400.0
d    500.0
e      NaN
f    300.0
dtype: float64

seth-p · 2016-03-22T21:06:49Z

Of course you don't use the Pandas s.argsort() -- it doesn't make sense!!! :-)

Seriously:

s[s.argsort()!=-1] is simply the same as s.dropna(), no? So that doesn't really merit a separate function. (Just to be clear, this doesn't sort anything.) You could do s.dropna()[s.argsort()[s.argsort()!=-1]], but that is (i) really a mouthful to get around faulty NaN handling; and (ii) just uses the values and not the Series structure of s.argsort().
np.argsort(s.values) is useful. Suppose s1 and s2 are two Series with the same index, and you want the values in s2 corresponding to the 100 smallest values in s1: s2[np.argsort(s1.values)[:100]]. Seems legit to me.
My proposed alternative to Series.argsort(), new_argsort() below, is similarly (alternatively?) useful if you want to know the order of a given element. i.e. suppose s is a series of stock market caps, with s.index being tickers, then (-s).new_argsort()['MSFT'] being 8 tells you that 'MSFT' is the 9th largest stock (a value of -1 would indicate an original value of NaN). This too seems legit to me. (If I had to guess, I would guess that whoever implemented Series.argsort() thought that this is what they were implementing.)

    def new_argsort(self):
        return pd.Series(np.arange(self.count()),
                         index=self.index[np.argsort(self.values)[:self.count()]]).\
                   reindex(self.index, fill_value=-1)

jreback · 2016-03-22T21:10:47Z

@seth-p 2 you should simply use nlargest/nsmallest. these are orders of magnitute faster.

why go to all this trouble, and if you really really thing you actually need .argsort (which you don't), then just do

s.loc[s.argsort().index] makes WAY more sense in pandas.

seth-p · 2016-03-22T21:20:16Z

I didn't realize nlargest/nsmallest existed. I may well use them. I stumbled on argsort first, and wasted quite a bit of time before I realized it didn't do anything right.

Since s.argsort().index is always identically equal to s.index, I'm not sure what your example is meant to show.

But to be clear, the existing s.argsort() isn't simply useless in the sense that it's redundant; it's useless in the sense that it can't possibly be the right answer to any legitimate question, and any use of it will almost certainly lead to incorrect results.

To prevent someone else going though the headache I just did in misusing it, I strongly suggest that it (i) simply be deleted; (ii) be redefined to return np.argsort(self.values) (a NumPy array, not a Series); or (iii) be redefined as in my new_argsort() above.

jreback · 2016-03-22T21:25:34Z

@seth-p I tell you what, why don't you add an example (or 2) to .argsort. to clarify. It DOES exactly follow the doc-string. My point is that its not really useful in pandas context.

My example is showing idempotency of the tranform (as you showed s[np.argsort(s.values)]. That's exactly the point though, when you have a linked index & values you rarely (if ever) need to actually explicity use indexers directly, instead you use the index.

I'll move this to 0.19.0 (though would take a doc-PR in the meantime). and we can rediscuss then as any changes would not be done until then (as these are all API changes).

seth-p · 2016-03-23T04:33:43Z

I'll think about what I think should be done for 0.19.0, and perhaps take a stab at it.

I can't think of a doc-PR for the existing Series.argsort() other than a bold flashing red warning: "Don't use this function! It doesn't do anything you could possibly want. Use the NumPy Series.values.argsort() instead."

jorisvandenbossche · 2019-04-05T08:48:37Z

Going through some argsort/sorting issues, and I also stumbled on this. I agree with @seth-p that the current implementation of Series.argsort is very strange and basically useless.
(and yes, even if Series.argsort behaved normally, you should almost never have to use it as you can use sort_values, nsmallest, etc directly. But the method does exist, so then at least it should make sense)

It is also true that it behaves as documented: "Argsorts the value, omitting NA/null values, and places the result in the same locations as the non-NA values".
But, this is very different to how np.argsort handles NA-values, so this is a) unexpected and b) I also don't see the usefulness of this strange NA handling

Small example:

In [8]: s = pd.Series([2, 1, np.nan, 1]) 

In [9]: s  
Out[9]: 
0    2.0
1    1.0
2    NaN
3    1.0
dtype: float64

In [10]: s.argsort()   
Out[10]: 
0    1
1    2
2   -1
3    0
dtype: int64

So because the result of argsort is determined on the values of s without the missing value, there is no correct link between the values in the argsort result and the original series. Eg the 1, 2, 0 in the above result don't match with positions in the original Series, but they match positions in s.dropna().

I think ideally we should fix this. But, the question is how to do this?
We could deprecate that behaviour (specifically when there are NaNs), but how would we let users handle that deprecation? Because we can warn this will change in the future, but if they already want to opt in for the future behaviour, we would need to add a specific keyword for triggering this.

tgbrooks · 2019-04-23T17:22:08Z

I too encountered this issue. I wrote code for working with numpy arrays using argsort and then was very surprised to find that it didn't work when I later passed it a Series instead. In particular, I was calling numpy.argsort(array) and when array turned to Series, it was no longer doing the "numpy argsort" even though it really looks like that's what the code should be doing. I'd much rather have gotten an error or warning.

auderson · 2023-03-13T08:56:27Z

@mroeschke
When I was using this code today:

df.iloc[df["A"].abs().argsort()]

I encountered this issue. Of course nobody uses s[s.argsort()], but argsort still has some use cases.

I noticed that you have a nargsort in pandas.core.sorting, will it be good to replace Series.argsort with that?

The current implementation is incorrect:

pandas/pandas/core/series.py

Lines 3751 to 3754 in 6169cba

    
           if mask.any(): 
        
               result = np.full(len(self), -1, dtype=np.intp) 
        
               notmask = ~mask 
        
               result[notmask] = np.argsort(values[notmask], kind=kind)

The returned indices of non-nan items are meaningless when it's placed back to the original position.

jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Usage Question labels Mar 22, 2016

jreback added this to the 0.19.0 milestone Mar 22, 2016

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Mar 22, 2016

seth-p mentioned this issue Mar 24, 2016

WIP/API: Implemented NDFrame.argsort() and NDFrame.ordering(). #12707

Closed

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question labels Apr 5, 2019

karldw mentioned this issue Oct 15, 2019

nlargest gives a zero-row dataframe when ordering columns are all NaN #28984

Closed

mroeschke added the Bug label May 13, 2020

mroeschke removed the API Design label Apr 23, 2021

mzeitlin11 mentioned this issue Jun 19, 2021

BUG: fix Series.argsort #42090

Closed

2 tasks

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel added this to the 3.0 milestone Aug 24, 2023

mroeschke linked a pull request Apr 15, 2024 that will close this issue

CLN: enforce the deprecation of the Series.argsort NA behavior #58232

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Series.argsort() incorrect handling of NaNs #12694

BUG: Series.argsort() incorrect handling of NaNs #12694

seth-p commented Mar 22, 2016

seth-p commented Mar 22, 2016

jreback commented Mar 22, 2016

seth-p commented Mar 22, 2016

jreback commented Mar 22, 2016

jreback commented Mar 22, 2016

seth-p commented Mar 22, 2016

jreback commented Mar 22, 2016

seth-p commented Mar 22, 2016

jreback commented Mar 22, 2016

seth-p commented Mar 23, 2016

jorisvandenbossche commented Apr 5, 2019

tgbrooks commented Apr 23, 2019

auderson commented Mar 13, 2023 •

edited

BUG: Series.argsort() incorrect handling of NaNs #12694

BUG: Series.argsort() incorrect handling of NaNs #12694

Comments

seth-p commented Mar 22, 2016

seth-p commented Mar 22, 2016

jreback commented Mar 22, 2016

seth-p commented Mar 22, 2016

jreback commented Mar 22, 2016

jreback commented Mar 22, 2016

seth-p commented Mar 22, 2016

jreback commented Mar 22, 2016

seth-p commented Mar 22, 2016

jreback commented Mar 22, 2016

seth-p commented Mar 23, 2016

jorisvandenbossche commented Apr 5, 2019

tgbrooks commented Apr 23, 2019

auderson commented Mar 13, 2023 • edited

auderson commented Mar 13, 2023 •

edited