Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Always cast to Categorical in lexsort_indexer #36385

Merged
merged 7 commits into from
Sep 17, 2020
Merged

BUG: Always cast to Categorical in lexsort_indexer #36385

merged 7 commits into from
Sep 17, 2020

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Sep 15, 2020

@dsaxton dsaxton added Bug Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Categorical Categorical Data Type labels Sep 15, 2020
@dsaxton dsaxton added this to the 1.1.3 milestone Sep 15, 2020
@jbrockmendel
Copy link
Member

Not sure if this is related, but I noticed a problem in Categorical.sort_values setting self._codes = foo instead of self._codes[:] = foo

categories = ["c", "b", "a"]
df = pd.DataFrame({"x": [1, 1, 1], "y": ["a", "b", "c"]})

def sorter(key):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try when the categorical is ordered=False as well (parameterize)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is broken actually. Will need to be more careful above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe my expectation is off about how this should behave when the categorical is unordered. This is odd (it seems to respect the order in which categories are given even when ordered=False):

[ins] In [1]: import pandas as pd
         ...:
         ...:
         ...: values = ["a", "b", "c"]
         ...:
         ...: cat = pd.Categorical(values, categories=["a", "b", "c"], ordered=False)
         ...: print(cat.sort_values())
         ...:
         ...: cat = pd.Categorical(values, categories=["c", "b", "a"], ordered=False)
         ...: print(cat.sort_values())
         ...:
         ...: print(pd.__version__)
         ...:
['a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
['c', 'b', 'a']
Categories (3, object): ['c', 'b', 'a']
1.2.0.dev0+390.g595791b6f.dirty

Maybe sorting an unordered categorical should actually be raising.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe sorting an unordered categorical should actually be raising.

had this discussion with @jorisvandenbossche a while back......

yeah sorting just gives back the same ordering as the categories that you have, they just don't mean anything.

so we do allow it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I think is broken is actually this

In [184]: pd.Categorical(pd.Categorical(values, categories=["a", "b", "c"], ordered=False), ordered=True)                                                       
Out[184]: 
[a, b, c]
Categories (3, object): [a < b < c]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should raise, though we do allow this via .set_ordered() so maybe its ok

cc @TomAugspurger

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we aren't actually testing this likely i think prob ok to merge this and open an issue for discussion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would almost expect sorting an unordered categorical to simply return the original array (since maybe you could argue it's already "trivially ordered" in some sense) if it weren't to raise

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, R does the same thing as pandas interestingly enough:

> x <- factor(c("a", "b", "c", "a"), levels = c("c", "b", "a"), ordered = FALSE)
> x
[1] a b c a
Levels: c b a
> sort(x)
[1] c b a a
Levels: c b a

I guess because it's easiest just to always sort by the underlying codes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would almost expect sorting an unordered categorical to simply return the original array (since maybe you could argue it's already "trivially ordered" in some sense)

@dsaxton strings also don't necessarily have a meaningfull order, but we still sort them lexicographically. In the same way, we still sort an unordered categorical, using the order of the categories (which is the same as lexicographically sorted in most cases, unless you specified the categories manually in a certain order).

There are lots of reasons to allow sorting for an "unordered" categorical. One example is to get a deterministic order of your values, which can be useful regardless of the order of the categories having a meaning or not.

@jreback jreback modified the milestones: 1.1.3, 1.2 Sep 17, 2020
@jreback jreback merged commit 70d618c into pandas-dev:master Sep 17, 2020
@jreback
Copy link
Contributor

jreback commented Sep 17, 2020

thanks @dsaxton

@dsaxton dsaxton deleted the categorical-sort-values-with-key branch September 17, 2020 02:37
@simonjayhawkins
Copy link
Member

@jreback you milestone 1.2. change note in 1.1.3. which should it be?

@simonjayhawkins
Copy link
Member

@jreback ok to backport? see #36385 (comment)

@jreback
Copy link
Contributor

jreback commented Sep 19, 2020

yeah this is ok

@simonjayhawkins
Copy link
Member

thanks

@simonjayhawkins simonjayhawkins modified the milestones: 1.2, 1.1.3 Sep 19, 2020
@simonjayhawkins
Copy link
Member

@meeseeksdev backport 1.1.x

simonjayhawkins pushed a commit that referenced this pull request Sep 19, 2020
#36477)

Co-authored-by: Daniel Saxton <2658661+dsaxton@users.noreply.github.com>
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: df.sort_values w/ key function fails with multiple sort columns and Categorical sorting
5 participants