Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Series.argsort() incorrect handling of NaNs #12694

Open
seth-p opened this issue Mar 22, 2016 · 13 comments · May be fixed by #58232
Open

BUG: Series.argsort() incorrect handling of NaNs #12694

seth-p opened this issue Mar 22, 2016 · 13 comments · May be fixed by #58232
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@seth-p
Copy link
Contributor

seth-p commented Mar 22, 2016

It appears that Series.argsort() is implemented as s.dropna().argsort().reindex(s.index, fill_value=-1) = np.argsort(s.dropna()).reindex(s.index, fill_value=-1).

There are two problems with this:
(a) Since the result represents integer indices into the original series s, the result should not have the same index as s.index -- it should either be a Series with index [0, 1, ...], or more likely simply be a NumPyarray;
(b) The way NaNs are effectively removed before calling np.argsort() leads to indexes that are no longer appropriate into the original Series, resulting in the nonsensical results shown in [9] and [10] below.

Note that: "As of NumPy 1.4.0 argsort works with real/complex arrays containing nan values." So it's not obvious to me that there's much to be gained from mucking around with np.argsort(s.values).

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 20:20:57) [MSC v.1600 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: s = pd.Series([200, 100, 400, 500, np.nan, 300], index=list('abcdef'))

In [4]: s
Out[4]:
a    200.0
b    100.0
c    400.0
d    500.0
e      NaN
f    300.0
dtype: float64

In [5]: s.argsort()
Out[5]:
a    1
b    0
c    4
d    2
e   -1
f    3
dtype: int64

In [6]: s.dropna().argsort().reindex(s.index, fill_value=-1)
Out[6]:
a    1
b    0
c    4
d    2
e   -1
f    3
dtype: int64

In [33]: np.argsort(s.dropna()).reindex(s.index, fill_value=-1)
Out[33]:
a    1
b    0
c    4
d    2
e   -1
f    3
dtype: int64

In [34]: np.argsort(s.dropna().values)
Out[34]: array([1, 0, 4, 2, 3], dtype=int64)

In [7]: np.argsort(s.values)
Out[7]: array([1, 0, 5, 2, 3, 4], dtype=int64)

In [8]: s[np.argsort(s.values)]  # desired result
Out[8]:
b    100.0
a    200.0
f    300.0
c    400.0
d    500.0
e      NaN
dtype: float64

In [9]: s[s.argsort()]  # nonsensical result
Out[9]:
b    100.0
a    200.0
e      NaN
c    400.0
f    300.0
d    500.0
dtype: float64

In [10]: s[s.argsort().values]  # nonsensical result
Out[10]:
b    100.0
a    200.0
e      NaN
c    400.0
f    300.0
d    500.0
dtype: float64

In [11]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3.1
Cython: None
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: None
patsy: 0.4.1
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.4
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 1.0b8
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
@seth-p
Copy link
Contributor Author

seth-p commented Mar 22, 2016

Perhaps what the original implementation was after is pd.Series(np.arange(s.count()), index=s.index[np.argsort(s.values)[:s.count()]]).reindex(s.index, fill_value=-1):

In [60]: s
Out[60]:
a    200.0
b    100.0
c    400.0
d    500.0
e      NaN
f    300.0
dtype: float64

In [61]: pd.Series(np.arange(s.count()),
                   index=s.index[np.argsort(s.values)[:s.count()]]).\
             reindex(s.index, fill_value=-1)
Out[61]:
a    1
b    0
c    3
d    4
e   -1
f    2
dtype: int32

If one does go with this implementation (where the values represent the ordering), perhaps one should rename the method something other than argsort to avoid confusion with np.argsort (where the values represent indices).

@jreback
Copy link
Contributor

jreback commented Mar 22, 2016

@seth-p this is exactly the same as np.argsort. Its not implemented at all like you show. Its just a simple mask filled with .argsort. And it behaves quite close to np.argsort. This is not really used much in pandas as the Index generally serves this purpose.

s[s.argsort()] is of course non-sensical. The -1 are the nan placements.

Not really sure what the issue is about here. If/when we have integer NA then these -1's could be NaN.

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Usage Question labels Mar 22, 2016
@seth-p
Copy link
Contributor Author

seth-p commented Mar 22, 2016

s.argsort() is not exactly the same as np.argsort() in the presence of NaNs, as shown by the difference between [76] and [77]:

In [75]: s.values
Out[75]: array([ 200.,  100.,  400.,  500.,   nan,  300.])

In [76]: np.argsort(s.values)
Out[76]: array([1, 0, 5, 2, 3, 4], dtype=int64)

In [77]: s.argsort().values
Out[77]: array([ 1,  0,  4,  2, -1,  3], dtype=int64)

The output of [76] makes sense:

In [80]: s[np.argsort(s.values)]
Out[80]:
b    100.0
a    200.0
f    300.0
c    400.0
d    500.0
e      NaN
dtype: float64

The output of [77] does not make sense, and is not useful.

Put another way, can you explain what the output of [77] means?
Can you give me an example of how one would use this (a) at all; (b) in a way that uses the fact that s.argsort() si a Series as opposed to simply a NumPy array?

The documentation just says "Overrides ndarray.argsort. Argsorts the value, omitting NA/null values, and places the result in the same locations as the non-NA values". My point is that (a) omitting NA/null values the way that it does before calling np.argsort, and (b) placing the result back in the same locations as the non-NA values leads to nonsensical results. Technically it does exactly what the documentation says, but -- going out on a limb here -- that can't possibly be what anyone really wants. The result simply makes no sense.

At the risk of repeating myself a bit, the values produced by np.argsort are (integer) indices into the values array, ordered by the values of the array. Placing those integer indices back into the Series with the original index simply makes no sense -- even in the absence of NaNs. Consider:

In [86]: s1
Out[86]:
a    200.0
b    100.0
c    400.0
d    500.0
e    900.0
f    300.0
dtype: float64

In [87]: s1.argsort()
Out[87]:
a    1
b    0
c    5
d    2
e    3
f    4
dtype: int64

In what way should f be associated with 4?

The presence of NaNs simply messes things up further.

I think the only two plausible implementations of argsort:
(1) like np.argsort, return a NumPy array with integer indices sorted by the values at their locations; or
(2) return a Series with the same index as the orgiinal series, and with values indicating the ordering of the values.
Here are examples of both (in the absence of NaNs, for simplicity):

In [86]: s1
Out[91]:
a    200.0
b    100.0
c    400.0
d    500.0
e    900.0
f    300.0
dtype: float64

In [88]: np.argsort(s1.values)
Out[88]: array([1, 0, 5, 2, 3, 4], dtype=int64)

In [89]: pd.Series(np.arange(s1.count()), index=s1.index[np.argsort(s1.values)[:s1.count()]]).reindex(s1.index, fill_value=-1)
Out[89]:
a    1
b    0
c    3
d    4
e    5
f    2
dtype: int32

Here f is associated with 2 in the sense that the value at f, 300, is the third (i.e. 2 when counting from 0) smallest value in the Series.

@jreback
Copy link
Contributor

jreback commented Mar 22, 2016

and I'll ask you back, what is the purpose of doing .argsort() in the first place? you can simply .sort_values(). In numpy it makes sense, but in pandas I don't even see a use for it.

@jreback
Copy link
Contributor

jreback commented Mar 22, 2016

To be honest its almost an internal routine.

In [11]: s[s.argsort()!=-1]
Out[11]: 
a    200.0
b    100.0
c    400.0
d    500.0
f    300.0
dtype: float64
In [12]: s.loc[s.argsort().index]
Out[12]: 
a    200.0
b    100.0
c    400.0
d    500.0
e      NaN
f    300.0
dtype: float64

@seth-p
Copy link
Contributor Author

seth-p commented Mar 22, 2016

Of course you don't use the Pandas s.argsort() -- it doesn't make sense!!! :-)

Seriously:

  1. s[s.argsort()!=-1] is simply the same as s.dropna(), no? So that doesn't really merit a separate function. (Just to be clear, this doesn't sort anything.) You could do s.dropna()[s.argsort()[s.argsort()!=-1]], but that is (i) really a mouthful to get around faulty NaN handling; and (ii) just uses the values and not the Series structure of s.argsort().
  2. np.argsort(s.values) is useful. Suppose s1 and s2 are two Series with the same index, and you want the values in s2 corresponding to the 100 smallest values in s1: s2[np.argsort(s1.values)[:100]]. Seems legit to me.
  3. My proposed alternative to Series.argsort(), new_argsort() below, is similarly (alternatively?) useful if you want to know the order of a given element. i.e. suppose s is a series of stock market caps, with s.index being tickers, then (-s).new_argsort()['MSFT'] being 8 tells you that 'MSFT' is the 9th largest stock (a value of -1 would indicate an original value of NaN). This too seems legit to me. (If I had to guess, I would guess that whoever implemented Series.argsort() thought that this is what they were implementing.)
    def new_argsort(self):
        return pd.Series(np.arange(self.count()),
                         index=self.index[np.argsort(self.values)[:self.count()]]).\
                   reindex(self.index, fill_value=-1)

@jreback
Copy link
Contributor

jreback commented Mar 22, 2016

@seth-p 2 you should simply use nlargest/nsmallest. these are orders of magnitute faster.

why go to all this trouble, and if you really really thing you actually need .argsort (which you don't), then just do

s.loc[s.argsort().index] makes WAY more sense in pandas.

@seth-p
Copy link
Contributor Author

seth-p commented Mar 22, 2016

I didn't realize nlargest/nsmallest existed. I may well use them. I stumbled on argsort first, and wasted quite a bit of time before I realized it didn't do anything right.

Since s.argsort().index is always identically equal to s.index, I'm not sure what your example is meant to show.

But to be clear, the existing s.argsort() isn't simply useless in the sense that it's redundant; it's useless in the sense that it can't possibly be the right answer to any legitimate question, and any use of it will almost certainly lead to incorrect results.

To prevent someone else going though the headache I just did in misusing it, I strongly suggest that it (i) simply be deleted; (ii) be redefined to return np.argsort(self.values) (a NumPy array, not a Series); or (iii) be redefined as in my new_argsort() above.

@jreback
Copy link
Contributor

jreback commented Mar 22, 2016

@seth-p I tell you what, why don't you add an example (or 2) to .argsort. to clarify. It DOES exactly follow the doc-string. My point is that its not really useful in pandas context.

My example is showing idempotency of the tranform (as you showed s[np.argsort(s.values)]. That's exactly the point though, when you have a linked index & values you rarely (if ever) need to actually explicity use indexers directly, instead you use the index.

I'll move this to 0.19.0 (though would take a doc-PR in the meantime). and we can rediscuss then as any changes would not be done until then (as these are all API changes).

@jreback jreback added this to the 0.19.0 milestone Mar 22, 2016
@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Mar 22, 2016
@seth-p
Copy link
Contributor Author

seth-p commented Mar 23, 2016

I'll think about what I think should be done for 0.19.0, and perhaps take a stab at it.

I can't think of a doc-PR for the existing Series.argsort() other than a bold flashing red warning: "Don't use this function! It doesn't do anything you could possibly want. Use the NumPy Series.values.argsort() instead."

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question labels Apr 5, 2019
@jorisvandenbossche
Copy link
Member

Going through some argsort/sorting issues, and I also stumbled on this. I agree with @seth-p that the current implementation of Series.argsort is very strange and basically useless.
(and yes, even if Series.argsort behaved normally, you should almost never have to use it as you can use sort_values, nsmallest, etc directly. But the method does exist, so then at least it should make sense)

It is also true that it behaves as documented: "Argsorts the value, omitting NA/null values, and places the result in the same locations as the non-NA values".
But, this is very different to how np.argsort handles NA-values, so this is a) unexpected and b) I also don't see the usefulness of this strange NA handling

Small example:

In [8]: s = pd.Series([2, 1, np.nan, 1]) 

In [9]: s  
Out[9]: 
0    2.0
1    1.0
2    NaN
3    1.0
dtype: float64

In [10]: s.argsort()   
Out[10]: 
0    1
1    2
2   -1
3    0
dtype: int64

So because the result of argsort is determined on the values of s without the missing value, there is no correct link between the values in the argsort result and the original series. Eg the 1, 2, 0 in the above result don't match with positions in the original Series, but they match positions in s.dropna().

I think ideally we should fix this. But, the question is how to do this?
We could deprecate that behaviour (specifically when there are NaNs), but how would we let users handle that deprecation? Because we can warn this will change in the future, but if they already want to opt in for the future behaviour, we would need to add a specific keyword for triggering this.

@tgbrooks
Copy link

I too encountered this issue. I wrote code for working with numpy arrays using argsort and then was very surprised to find that it didn't work when I later passed it a Series instead. In particular, I was calling numpy.argsort(array) and when array turned to Series, it was no longer doing the "numpy argsort" even though it really looks like that's what the code should be doing. I'd much rather have gotten an error or warning.

@auderson
Copy link
Contributor

auderson commented Mar 13, 2023

@mroeschke
When I was using this code today:

df.iloc[df["A"].abs().argsort()]

I encountered this issue. Of course nobody uses s[s.argsort()], but argsort still has some use cases.

I noticed that you have a nargsort in pandas.core.sorting, will it be good to replace Series.argsort with that?

The current implementation is incorrect:

pandas/pandas/core/series.py

Lines 3751 to 3754 in 6169cba

if mask.any():
result = np.full(len(self), -1, dtype=np.intp)
notmask = ~mask
result[notmask] = np.argsort(values[notmask], kind=kind)

The returned indices of non-nan items are meaningless when it's placed back to the original position.

@jbrockmendel jbrockmendel added this to the 3.0 milestone Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
7 participants