df.sort_values() not respecting na_position with categoricals #22556

Closed
zapnat opened this issue Aug 31, 2018 · 4 comments

@zapnat
commented Aug 31, 2018

Problem description

DataFrame.sort_values() appears not to respect the na_position parameter when sorting by a categorical series:

>>> import numpy as np
>>> import pandas as pd
>>> c = pd.Categorical(['A', np.nan, 'B'], categories=['A','B'], ordered=True)
>>> df = pd.DataFrame({'c': c})
>>> df.sort_values(by='c', na_position='first')
     c
1  NaN
0    A
2    B
>>> df.sort_values(by='c', na_position='last')
     c
1  NaN
0    A
2    B

Unexpectedly, the NaNs always come first regardless of na_position.

Additional information

Calling sort_values() directly on the Categorical works as expected:

>>> c.sort_values(na_position='first')
[NaN, A, B]
Categories (2, object): [A < B]
>>> c.sort_values(na_position='last')
[A, B, NaN]
Categories (2, object): [A < B]
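
Since the standalone sort honors na_position, one possible workaround is to sort the column on its own and then reorder the frame by the resulting index. This is only a sketch: it assumes the frame has a unique index and that Series.sort_values() respects na_position for categorical data on the installed pandas version.

```python
import numpy as np
import pandas as pd

c = pd.Categorical(["A", np.nan, "B"], categories=["A", "B"], ordered=True)
df = pd.DataFrame({"c": c})

# Sort the single column as a Series, then reorder the frame by that index.
# Assumption: the index is unique, so .loc returns each row exactly once.
order = df["c"].sort_values(na_position="last").index
result = df.loc[order]
print(result)
```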

Strangely, df.sort_values() does seem to respect na_position if you sort by more than one column (even the same column):

>>> df.sort_values(by=['c','c'], na_position='first')
     c
1  NaN
0    A
2    B
>>> df.sort_values(by=['c','c'], na_position='last')
     c
0    A
2    B
1  NaN

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@gfyoung

Member

commented Sep 1, 2018

How odd! Investigation and PR are welcome!

@Naman-Garg-06

commented Sep 2, 2018

c = pd.Series(['A', np.nan, 'B'])
Using a plain Series instead of a Categorical avoids the problem.
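
A minimal sketch of that workaround, for anyone who wants to check it locally. Note that a plain Series drops the category ordering, so the values sort lexicographically rather than by the declared order A < B; for this example the two happen to coincide.

```python
import numpy as np
import pandas as pd

# Same data as the original report, but stored as a plain Series column.
df = pd.DataFrame({"c": pd.Series(["A", np.nan, "B"])})

nan_first = df.sort_values(by="c", na_position="first")
nan_last = df.sort_values(by="c", na_position="last")

print(nan_first.index.tolist())
print(nan_last.index.tolist())
```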

@datajanko

Contributor

commented Sep 8, 2018

I think the problem is in pandas/core/sorting.py

def nargsort(items, kind='quicksort', ascending=True, na_position='last'):
    """
    This is intended to be a drop-in replacement for np.argsort which
    handles NaNs. It adds ascending and na_position parameters.
    GH #6399, #5231
    """


    # specially handle Categorical
    if is_categorical_dtype(items):
        return items.argsort(ascending=ascending, kind=kind)

For categoricals, na_position is not passed on to argsort.

My suggestion: if na_position is 'first', put the NaNs first; otherwise put them last (after first checking whether any NaNs are present). Any other suggestions? I'd like to work on this.
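
The suggestion above could be sketched roughly as follows. This is not the pandas implementation: `categorical_argsort` is a hypothetical helper written only for illustration, and it relies on the fact that a Categorical stores missing values as code -1, which is presumably why a plain argsort on the codes always puts NaN first.

```python
import numpy as np
import pandas as pd


def categorical_argsort(cat, ascending=True, na_position="last"):
    """Hypothetical helper: argsort a Categorical, placing NaNs per na_position."""
    if na_position not in ("first", "last"):
        raise ValueError(f"invalid na_position: {na_position!r}")
    codes = np.asarray(cat.codes)
    is_na = codes == -1  # missing values are stored as code -1
    na_idx = np.flatnonzero(is_na)
    valid_idx = np.flatnonzero(~is_na)
    # Sort only the non-missing positions by their category code.
    order = valid_idx[np.argsort(codes[valid_idx], kind="mergesort")]
    if not ascending:
        order = order[::-1]  # note: ties end up reversed as well
    # Attach the NaN positions at the requested end.
    if na_position == "first":
        return np.concatenate([na_idx, order])
    return np.concatenate([order, na_idx])


c = pd.Categorical(["A", np.nan, "B"], categories=["A", "B"], ordered=True)
first = categorical_argsort(c, na_position="first")
last = categorical_argsort(c, na_position="last")
print(first, last)
```

Splitting the positions into missing and non-missing keeps the NaN handling independent of the sort direction, which matches the intent of na_position.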

By the way, Series does not use this sorting function, which I find a bit odd. Also, the problem does not appear with by=['c', 'c'] in zapnat's example because a lexsorter is used in that case.

@jreback

Contributor

commented Sep 8, 2018

I think this is a duplicate issue. Please search for an existing report.
