df.sort_values() not respecting na_position with categoricals #22556

zapnat · 2018-08-31T17:21:13Z

Problem description

DataFrame.sort_values() appears not to respect the na_position parameter when sorting by a categorical series:

>>> import numpy as np
>>> import pandas as pd
>>> c = pd.Categorical(['A', np.nan, 'B'], categories=['A','B'], ordered=True)
>>> df = pd.DataFrame({'c': c})
>>> df.sort_values(by='c', na_position='first')
     c
1  NaN
0    A
2    B
>>> df.sort_values(by='c', na_position='last')
     c
1  NaN
0    A
2    B

Unexpectedly, the NaNs always come first regardless of na_position.

Additional information

Calling sort_values() on the Categorical itself works as expected:

>>> c.sort_values(na_position='first')
[NaN, A, B]
Categories (2, object): [A < B]
>>> c.sort_values(na_position='last')
[A, B, NaN]
Categories (2, object): [A < B]

Strangely, df.sort_values() does seem to respect na_position if you sort by more than one column (even the same column):

>>> df.sort_values(by=['c','c'], na_position='first')
     c
1  NaN
0    A
2    B
>>> df.sort_values(by=['c','c'], na_position='last')
     c
0    A
2    B
1  NaN
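A possible workaround on affected versions, sketched under the assumption that Series.sort_values() honors na_position for categorical data and that the DataFrame index is unique: sort the column on its own, then reorder the rows by the resulting index. The names df_na_last and df_na_first are only illustrative.

```python
import numpy as np
import pandas as pd

c = pd.Categorical(["A", np.nan, "B"], categories=["A", "B"], ordered=True)
df = pd.DataFrame({"c": c})

# Sort the column as a Series, then reorder the DataFrame rows by the
# index of the sorted Series. This relies on the index labels being unique.
order_last = df["c"].sort_values(na_position="last").index
df_na_last = df.loc[order_last]

order_first = df["c"].sort_values(na_position="first").index
df_na_first = df.loc[order_first]

print(df_na_last)
print(df_na_first)
```

This avoids the single-column DataFrame.sort_values() path entirely, at the cost of an extra reindexing step.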

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

gfyoung · 2018-09-01T22:52:57Z

How odd! Investigation and PR are welcome!

Naman-Garg-06 · 2018-09-02T13:53:24Z

c = pd.Series(['A', np.nan, 'B'])
Using a Series instead of a Categorical avoids the problem.

datajanko · 2018-09-08T19:23:36Z

I think the problem is in pandas/core/sorting.py

def nargsort(items, kind='quicksort', ascending=True, na_position='last'):
    """
    This is intended to be a drop-in replacement for np.argsort which
    handles NaNs. It adds ascending and na_position parameters.
    GH #6399, #5231
    """


    # specially handle Categorical
    if is_categorical_dtype(items):
        return items.argsort(ascending=ascending, kind=kind)

For categoricals, na_position is not passed on.

My suggestion: if na_position is 'first', put the NaNs first; otherwise put them last (after checking whether any NaNs are present, of course). Any other suggestions? I'd like to work on this.

Btw: Series does not use this sorting function, which I find a bit odd. Additionally, the problem does not appear when using by=['c', 'c'] in zapnat's example because a lex sorter is used then.
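The suggestion above could be sketched roughly as follows. This is only an illustration of the idea, not the change that was eventually merged: the function name argsort_categorical is hypothetical, and it assumes that missing values in a Categorical are stored with code -1.

```python
import numpy as np
import pandas as pd


def argsort_categorical(cat, ascending=True, na_position="last"):
    """Hypothetical NaN-aware argsort for a pandas Categorical."""
    if na_position not in ("first", "last"):
        raise ValueError("invalid na_position: {!r}".format(na_position))

    # Widen the codes so that negating them below cannot overflow.
    codes = np.asarray(cat.codes).astype(np.int64)
    positions = np.arange(len(codes))
    is_na = codes == -1  # assumption: -1 marks a missing value

    # Sort only the non-missing entries, using a stable sort.
    valid_positions = positions[~is_na]
    valid_codes = codes[~is_na]
    keys = valid_codes if ascending else -valid_codes
    sorted_valid = valid_positions[keys.argsort(kind="mergesort")]

    # Place the missing entries at the requested end.
    na_positions = positions[is_na]
    if na_position == "first":
        return np.concatenate([na_positions, sorted_valid])
    return np.concatenate([sorted_valid, na_positions])


c = pd.Categorical(["A", np.nan, "B"], categories=["A", "B"], ordered=True)
first = argsort_categorical(c, na_position="first")
last = argsort_categorical(c, na_position="last")
print(first, last)
```

Sorting on the integer codes follows the category order, so the missing entries only need to be split off beforehand and appended at whichever end na_position asks for.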

jreback · 2018-09-08T19:57:25Z

I think this is a duplicate issue. Please do a search.

gfyoung added the Missing-data, Algos, and Bug labels Sep 1, 2018

staftermath mentioned this issue Sep 8, 2018

BUG: df.sort_values() not respecting na_position with categoricals #22556 #22640 (Merged)

jreback added this to the 0.24.0 milestone Oct 7, 2018

jreback closed this as completed in #22640 Oct 18, 2018

jreback pushed a commit that referenced this issue Oct 18, 2018

BUG: df.sort_values() not respecting na_position with categoricals #2…

32ef84b

…2556 (#22640)

tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018

BUG: df.sort_values() not respecting na_position with categoricals pa…

1df2321

…ndas-dev#22556 (pandas-dev#22640)
