mergesort is not stable when sorting by a categorical column #16793

Closed
has2k1 opened this issue Jun 28, 2017 · 2 comments · Fixed by #16834
has2k1 commented Jun 28, 2017

import pandas as pd
import numpy as np

n = 5  # not a problem for n < 5

df = pd.DataFrame({
    'x': pd.Categorical(np.repeat([1, 2, 3, 4], n), ordered=True)
})

df.sort_values('x', kind='mergesort')

output:

    x
0   1
1   1
2   1
3   1
4   1
8   2
7   2
9   2
5   2
6   2
10  3
11  3
12  3
13  3
14  3
18  4
15  4
16  4
17  4
19  4

Problem description

When sorting a dataframe by an ordered categorical column with kind='mergesort', the sort should be stable: rows with equal values should keep their original relative order. In the example above, the rows for x==2 and x==4 have been scrambled.
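To make "scrambled" concrete, here is a small pure-Python sketch that checks whether a sort order preserves the original order within ties. The helper name `is_stable_order` is hypothetical (it is not a pandas function); the `reported_index` values are copied from the output above.

```python
def is_stable_order(keys, order):
    """Return True if, among equal keys, original positions appear in increasing order.

    keys  -- the sort key for each original row position
    order -- original row positions in the order the sort returned them
    """
    last_seen = {}  # key -> last original position seen for that key
    for pos in order:
        key = keys[pos]
        if key in last_seen and pos < last_seen[key]:
            return False  # an earlier row came after a later one with the same key
        last_seen[key] = pos
    return True


# Keys for the example in this report: [1]*5 + [2]*5 + [3]*5 + [4]*5
keys = [value for value in (1, 2, 3, 4) for _ in range(5)]

# Index column copied from the reported output
reported_index = [0, 1, 2, 3, 4, 8, 7, 9, 5, 6,
                  10, 11, 12, 13, 14, 18, 15, 16, 17, 19]

reported_is_stable = is_stable_order(keys, reported_index)
expected_is_stable = is_stable_order(keys, list(range(20)))
print(reported_is_stable, expected_is_stable)
```

The check only looks at tie order, not at whether the keys themselves are sorted, which is all that stability is about.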

Expected Output

The index should remain in order since the column is already sorted.
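The expectation can be written as a quick check. This is a sketch that assumes a pandas version containing the fix referenced above (#16834); on the affected version (0.20.2) the flag would be False for this input.

```python
import numpy as np
import pandas as pd

n = 5

df = pd.DataFrame({
    'x': pd.Categorical(np.repeat([1, 2, 3, 4], n), ordered=True)
})

result = df.sort_values('x', kind='mergesort')

# The column is already sorted, so a stable sort must leave the index untouched.
index_unchanged = result.index.tolist() == list(range(len(df)))
print(index_unchanged)
```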

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.1-gentoo
machine: x86_64
processor: Intel(R)
byteorder: little
LC_ALL: en_US.utf8
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.0
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@chris-b1 chris-b1 added Bug Categorical Categorical Data Type labels Jun 29, 2017
@chris-b1 chris-b1 added this to the Next Major Release milestone Jun 29, 2017
@chris-b1

Yeah, looks like we throw away the kind argument on Categoricals. Shouldn't be too hard to trace through; PR welcome!

return items.argsort(ascending=ascending)
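For illustration only, here is a minimal sketch of what honoring `kind` could look like when the argsort is done over integer category codes. The function `argsort_codes` is hypothetical and is not the pandas implementation or the change made in the fix; it only relies on `numpy.argsort` accepting a `kind` argument.

```python
import numpy as np


def argsort_codes(codes, kind='quicksort'):
    """Hypothetical helper: argsort integer category codes, forwarding `kind`.

    Passing kind='mergesort' asks NumPy for a stable sort, so equal codes
    keep their original relative order.
    """
    return np.argsort(np.asarray(codes), kind=kind)


# Codes for the example in this report: four categories, five rows each
codes = np.repeat([0, 1, 2, 3], 5)

order = argsort_codes(codes, kind='mergesort')
print(order.tolist())
```

Since the codes are already sorted, a stable sort returns the positions in their original order.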

ri938 commented Jul 2, 2017

The code currently forbids anything but the default kind='quicksort' from being passed to the categorical argsort. See validate_argsort_with_ascending.
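As a rough illustration of the behavior described here, a validator that only accepts the default `kind` might look like the sketch below. The function `validate_kind_is_default` and its error message are hypothetical; this is not the body of `validate_argsort_with_ascending`.

```python
def validate_kind_is_default(kwargs, default_kind='quicksort'):
    """Hypothetical validator: reject any `kind` other than the default."""
    kind = kwargs.get('kind', default_kind)
    if kind != default_kind:
        raise ValueError(
            "the 'kind' parameter is not supported here "
            "(got {!r}, only {!r} is accepted)".format(kind, default_kind)
        )


# The default passes silently
validate_kind_is_default({'kind': 'quicksort'})

# A non-default value is rejected, so 'mergesort' can never reach the sort
try:
    validate_kind_is_default({'kind': 'mergesort'})
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

With a check like this in place, a caller asking for a stable sort either gets an error or, if the argument is dropped before validation, silently gets the default algorithm.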
