df.sort_values() not respecting na_position with categoricals #22556

Closed
zapnat opened this issue Aug 31, 2018 · 4 comments · Fixed by #22640
Labels: Algos, Bug, Missing-data

Comments

@zapnat

zapnat commented Aug 31, 2018

Problem description

DataFrame.sort_values() appears not to respect the na_position parameter when sorting by a categorical series:

>>> import numpy as np
>>> import pandas as pd
>>> c = pd.Categorical(['A', np.nan, 'B'], categories=['A','B'], ordered=True)
>>> df = pd.DataFrame({'c': c})
>>> df.sort_values(by='c', na_position='first')
     c
1  NaN
0    A
2    B
>>> df.sort_values(by='c', na_position='last')
     c
1  NaN
0    A
2    B

Unexpectedly, the NaNs always come first regardless of na_position.

Additional information

Calling sort_values() directly on the Categorical works as expected:

>>> c.sort_values(na_position='first')
[NaN, A, B]
Categories (2, object): [A < B]
>>> c.sort_values(na_position='last')
[A, B, NaN]
Categories (2, object): [A < B]

Strangely, df.sort_values() does seem to respect na_position if you sort by more than one column (even the same column):

>>> df.sort_values(by=['c','c'], na_position='first')
     c
1  NaN
0    A
2    B
>>> df.sort_values(by=['c','c'], na_position='last')
     c
0    A
2    B
1  NaN

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

gfyoung added the Missing-data, Algos, and Bug labels Sep 1, 2018
@gfyoung
Member

gfyoung commented Sep 1, 2018

How odd! Investigation and PR are welcome!

@Naman-Garg-06

As a workaround, using a Series instead of a Categorical avoids the problem:

c = pd.Series(['A', np.nan, 'B'])
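
Converting to a plain Series drops the category order, so the values sort as ordinary strings. For cases where the order matters, here is a minimal sketch of an alternative that keeps the categorical dtype. It assumes a single sort column, and `sort_with_na_position` is a hypothetical helper name, not a pandas function. It sorts only the non-missing rows and then attaches the missing rows at the requested end, so it does not rely on na_position being honoured:

```python
import numpy as np
import pandas as pd


def sort_with_na_position(df, by, na_position="last"):
    """Sort df by one column, placing missing rows explicitly.

    Hypothetical helper for illustration; not part of pandas.
    """
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")
    missing = df[by].isna()
    # Only rows with a value are sorted, so na_position never comes into play.
    present_sorted = df[~missing].sort_values(by=by)
    parts = [df[missing], present_sorted]
    if na_position == "last":
        parts.reverse()
    return pd.concat(parts)


# Categories are deliberately ordered B < A to show that the order is kept.
c = pd.Categorical(["B", np.nan, "A"], categories=["B", "A"], ordered=True)
df = pd.DataFrame({"c": c})
print(sort_with_na_position(df, "c", na_position="last"))
print(sort_with_na_position(df, "c", na_position="first"))
```

The rows with missing values keep their original relative order in the result.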

@datajanko
Contributor

I think the problem is in pandas/core/sorting.py

def nargsort(items, kind='quicksort', ascending=True, na_position='last'):
    """
    This is intended to be a drop-in replacement for np.argsort which
    handles NaNs. It adds ascending and na_position parameters.
    GH #6399, #5231
    """


    # specially handle Categorical
    if is_categorical_dtype(items):
        return items.argsort(ascending=ascending, kind=kind)

For categoricals, na_position is not passed along.

My suggestion: if na_position is 'first', put the NaNs first; otherwise put them last (after checking whether any NaN is present, of course). Any other suggestions? I'd like to work on this.

By the way, Series does not use this sorting function, which I find a bit odd. Also, the problem does not appear with by=['c', 'c'] in zapnat's example because a lexicographic sorter is used in that case.
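
The suggestion above can be sketched on the integer codes behind a categorical, where pandas uses -1 to mark a missing value. This is only an illustration under stated assumptions, not the change that was eventually merged: `argsort_codes` is a hypothetical name, and it works on a plain Python list of codes rather than on a Categorical.

```python
def argsort_codes(codes, ascending=True, na_position="last"):
    """Return the positions that sort `codes`, treating -1 as missing.

    Hypothetical helper for illustration; not part of pandas.
    """
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")
    # Split positions into missing and present entries.
    missing = [i for i, code in enumerate(codes) if code == -1]
    present = [i for i, code in enumerate(codes) if code != -1]
    # Only the present entries are ordered by their code.
    present.sort(key=lambda i: codes[i], reverse=not ascending)
    # Missing entries go to whichever end the caller asked for.
    if na_position == "first":
        return missing + present
    return present + missing


# Codes for ['A', NaN, 'B'] with categories ['A', 'B'].
codes = [0, -1, 1]
print(argsort_codes(codes, na_position="first"))
print(argsort_codes(codes, na_position="last"))
```

Handling the missing positions separately means the ascending flag only affects the real values, so NaNs stay at the requested end in both sort directions.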

@jreback
Contributor

jreback commented Sep 8, 2018

I think this is a duplicate issue. Please do a search.
