BUG: Always cast to Categorical in lexsort_indexer #36385

dsaxton · 2020-09-15T16:06:37Z

closes BUG: df.sort_values w/ key function fails with multiple sort columns and Categorical sorting #36383
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jbrockmendel · 2020-09-15T18:23:41Z

Not sure if this is related, but I noticed a problem in Categorical.sort_values setting self._codes = foo instead of self._codes[:] = foo
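For context on the distinction raised here: rebinding an attribute and assigning through a full slice behave differently when some other object shares the same buffer. The sketch below is illustrative only. `CodesHolder` is a hypothetical class, and a plain Python list stands in for the NumPy code array, so this is not the pandas implementation.

```python
class CodesHolder:
    """Hypothetical stand-in for an object that stores integer codes."""

    def __init__(self, codes):
        self._codes = codes

    def sort_rebind(self):
        # Rebinding: self._codes now points at a brand-new list, so any
        # other reference to the old list never sees the sorted values.
        self._codes = sorted(self._codes)

    def sort_in_place(self):
        # Slice assignment: the existing list is overwritten, so every
        # reference to it observes the sorted values.
        self._codes[:] = sorted(self._codes)


shared = [2, 0, 1]
holder = CodesHolder(shared)
holder.sort_rebind()
print(shared)         # the shared list is untouched
print(holder._codes)  # only the holder sees the sorted codes

shared = [2, 0, 1]
holder = CodesHolder(shared)
holder.sort_in_place()
print(shared)         # the shared list itself is now sorted
```

Which of the two is appropriate depends on whether callers are expected to observe the change through existing references.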

jreback · 2020-09-15T22:22:10Z

pandas/tests/frame/methods/test_sort_values.py

+        categories = ["c", "b", "a"]
+        df = pd.DataFrame({"x": [1, 1, 1], "y": ["a", "b", "c"]})
+
+        def sorter(key):


can you try when the categorical is ordered=False as well (parameterize)
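A rough sketch of what covering both `ordered=True` and `ordered=False` could look like, built from the data in the diff above. The body of `sorter`, the helper name `sort_with_categorical_key`, and the plain loop (used here in place of `pytest.mark.parametrize` so it runs without a test framework) are assumptions for illustration, not the test that was merged.

```python
import pandas as pd


def sort_with_categorical_key(ordered):
    """Sort by x, then by y using a custom category order for y."""
    categories = ["c", "b", "a"]
    df = pd.DataFrame({"x": [1, 1, 1], "y": ["a", "b", "c"]})

    def sorter(key):
        # Only the "y" column is mapped to a Categorical; "x" passes through.
        if key.name == "y":
            return pd.Series(
                pd.Categorical(key, categories=categories, ordered=ordered)
            )
        return key

    return df.sort_values(by=["x", "y"], key=sorter)


# Exercise both settings of the ordered flag.
for ordered in (True, False):
    result = sort_with_categorical_key(ordered)
    print(ordered, result["y"].tolist())
```

On a pandas version that includes this fix, both iterations are expected to return `y` in the order given by `categories`.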

Yeah this is broken actually. Will need to be more careful above.

Or maybe my expectation is off about how this should behave when the categorical is unordered. This is odd (it seems to respect the order in which categories are given even when ordered=False):

[ins] In [1]: import pandas as pd
         ...:
         ...: values = ["a", "b", "c"]
         ...:
         ...: cat = pd.Categorical(values, categories=["a", "b", "c"], ordered=False)
         ...: print(cat.sort_values())
         ...:
         ...: cat = pd.Categorical(values, categories=["c", "b", "a"], ordered=False)
         ...: print(cat.sort_values())
         ...:
         ...: print(pd.__version__)
         ...:
['a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
['c', 'b', 'a']
Categories (3, object): ['c', 'b', 'a']
1.2.0.dev0+390.g595791b6f.dirty

Maybe sorting an unordered categorical should actually be raising.

> Maybe sorting an unordered categorical should actually be raising.

had this discussion with @jorisvandenbossche a while back......

yeah sorting just gives back the same ordering as the categories that you have, they just don't mean anything.

so we do allow it.

what I think is broken is actually this

In [184]: pd.Categorical(pd.Categorical(values, categories=["a", "b", "c"], ordered=False), ordered=True)
Out[184]:
[a, b, c]
Categories (3, object): [a < b < c]

I think this should raise, though we do allow this via .set_ordered() so maybe it's ok

cc @TomAugspurger

since we aren't actually testing this, I think it's probably ok to merge this and open an issue for discussion.
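To make the comparison above concrete, here is a small sketch of the two routes from an unordered categorical to an ordered one: the constructor call shown in `In [184]` and the `.set_ordered()` method. The variable names are illustrative.

```python
import pandas as pd

values = ["a", "b", "c"]
unordered = pd.Categorical(values, categories=["a", "b", "c"], ordered=False)

# Route 1: pass the existing Categorical back through the constructor.
via_constructor = pd.Categorical(unordered, ordered=True)

# Route 2: the explicit method mentioned in the comment above.
via_method = unordered.set_ordered(True)

print(via_constructor.ordered, via_method.ordered)
print(unordered.ordered)  # the original object keeps its own flag
```

Both routes keep the existing categories and only change the `ordered` flag on the returned object.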

I would almost expect sorting an unordered categorical to simply return the original array (since maybe you could argue it's already "trivially ordered" in some sense) if it weren't to raise

Actually, R does the same thing as pandas interestingly enough:

> x <- factor(c("a", "b", "c", "a"), levels = c("c", "b", "a"), ordered = FALSE)
> x
[1] a b c a
Levels: c b a
> sort(x)
[1] c b a a
Levels: c b a

I guess because it's easiest just to always sort by the underlying codes.
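The "sort by the underlying codes" idea can be sketched in plain Python. The helper `sort_by_codes` is hypothetical and only mimics the concept (each value's code is its position in the category list); it is not how pandas or R implement it.

```python
def sort_by_codes(values, categories):
    """Sort values by their position in `categories` (their integer code)."""
    # Map each category to its code, i.e. its index in the category list.
    code_of = {category: code for code, category in enumerate(categories)}
    codes = [code_of[value] for value in values]
    # Stable argsort on the codes; an "ordered" flag never enters the picture.
    order = sorted(range(len(values)), key=codes.__getitem__)
    return [values[i] for i in order]


values = ["a", "b", "c", "a"]
print(sort_by_codes(values, ["a", "b", "c"]))  # ['a', 'a', 'b', 'c']
print(sort_by_codes(values, ["c", "b", "a"]))  # ['c', 'b', 'a', 'a']
```

The second call reproduces the order from the R session above: the result follows the category order regardless of whether that order is declared meaningful.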

> I would almost expect sorting an unordered categorical to simply return the original array (since maybe you could argue it's already "trivially ordered" in some sense)

@dsaxton strings also don't necessarily have a meaningful order, but we still sort them lexicographically. In the same way, we still sort an unordered categorical, using the order of the categories (which is the same as lexicographically sorted in most cases, unless you specified the categories manually in a certain order).

There are lots of reasons to allow sorting for an "unordered" categorical. One example is to get a deterministic order of your values, which can be useful regardless of the order of the categories having a meaning or not.

jreback · 2020-09-17T02:31:19Z

thanks @dsaxton

simonjayhawkins · 2020-09-17T14:30:10Z

@jreback you set the milestone to 1.2, but the change note is in 1.1.3. Which should it be?

simonjayhawkins · 2020-09-19T14:00:59Z

@jreback ok to backport? see #36385 (comment)

jreback · 2020-09-19T14:02:04Z

yeah this is ok

simonjayhawkins · 2020-09-19T14:02:23Z

thanks

simonjayhawkins · 2020-09-19T14:03:05Z

@meeseeksdev backport 1.1.x

BUG: Always cast to Categorical in lexsort_indexer

1df4dd7

dsaxton added the Bug, Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), and Categorical (Categorical Data Type) labels Sep 15, 2020

dsaxton added this to the 1.1.3 milestone Sep 15, 2020

Daniel Saxton added 4 commits September 15, 2020 11:07

Nit

9652042

Edit test

d45ae2d

Diff order

ab13b98

Drop import

595791b

jreback requested changes Sep 15, 2020

Daniel Saxton added 2 commits September 15, 2020 18:28

Param

578bb3d

Merge remote-tracking branch 'upstream/master' into categorical-sort-…

37d73e3

…values-with-key

jreback modified the milestones: 1.1.3, 1.2 Sep 17, 2020

jreback approved these changes Sep 17, 2020

jreback merged commit 70d618c into pandas-dev:master Sep 17, 2020

dsaxton deleted the categorical-sort-values-with-key branch September 17, 2020 02:37

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request Sep 17, 2020

BUG: Always cast to Categorical in lexsort_indexer (pandas-dev#36385)

7413f10

simonjayhawkins mentioned this pull request Sep 19, 2020

BUG: fix isin with nans and large arrays #36266

Merged

5 tasks

simonjayhawkins modified the milestones: 1.2, 1.1.3 Sep 19, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Sep 19, 2020

Backport PR pandas-dev#36385: BUG: Always cast to Categorical in lexs…

b7b0883

…ort_indexer

meeseeksmachine mentioned this pull request Sep 19, 2020

Backport PR #36385 on branch 1.1.x (BUG: Always cast to Categorical in lexsort_indexer) #36477

Merged

simonjayhawkins pushed a commit that referenced this pull request Sep 19, 2020

Backport PR #36385: BUG: Always cast to Categorical in lexsort_indexer (

d05a9ca

#36477) Co-authored-by: Daniel Saxton <2658661+dsaxton@users.noreply.github.com>

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

BUG: Always cast to Categorical in lexsort_indexer (pandas-dev#36385)

6c17901


dsaxton commented Sep 15, 2020

jbrockmendel commented Sep 15, 2020

jreback Sep 15, 2020

dsaxton Sep 15, 2020

dsaxton Sep 15, 2020

jreback Sep 15, 2020

jreback Sep 15, 2020

jreback Sep 15, 2020

jreback Sep 15, 2020

dsaxton Sep 15, 2020

dsaxton Sep 17, 2020

jorisvandenbossche Sep 19, 2020

jreback commented Sep 17, 2020

simonjayhawkins commented Sep 17, 2020

simonjayhawkins commented Sep 19, 2020

jreback commented Sep 19, 2020

simonjayhawkins commented Sep 19, 2020

simonjayhawkins commented Sep 19, 2020
