Categorical: don't sort the categoricals if Categorical(..., ordered=False) #9347

Closed
wants to merge 1 commit into from

Conversation

jankatins
Contributor

In mwaskom/seaborn#361 it was discussed that sorting the categories lexicographically is only appropriate if an order is specified or implied. If this is explicitly not the case, e.g. with Categorical(..., ordered=False), then the order should be taken from the order of appearance, similar to the current Series.unique() implementation.
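
To make the two candidate behaviours concrete, here is a minimal pure-Python sketch. `factorize_sketch` is a hypothetical stand-in written for illustration, not the pandas `factorize` function; it only mimics the effect of a `sort` flag on the resulting categories and codes.

```python
def factorize_sketch(values, sort=False):
    """Toy stand-in for factorize: return (codes, categories)."""
    # dict keeps insertion order, so this yields the order of first appearance
    categories = list(dict.fromkeys(values))
    if sort:
        # lexicographic order, i.e. the behaviour before this PR
        categories = sorted(categories)
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [position[v] for v in values]
    return codes, categories


values = ["b", "a", "c", "a"]
appearance = factorize_sketch(values, sort=False)
lexicographic = factorize_sketch(values, sort=True)
print(appearance)     # ([0, 1, 2, 1], ['b', 'a', 'c'])
print(lexicographic)  # ([1, 0, 2, 0], ['a', 'b', 'c'])
```

The values are the same in both cases; only the category order, and therefore the integer codes, differ.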

@jankatins
Contributor Author

Note: this should only be merged if Series.unique() is kept as it is now (#9346).

I'm also not so sure what is best here: if a categorical is changed from ordered==True to ordered==False, the order of the categories should not change, which means that in the following case the two categoricals are not equal:

cat1 = Series(Categorical(["a","c","b"], ordered=False))
cat2 = Series(Categorical(["a","c","b"]))
cat2.cat.ordered=False
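
A sketch of why the two would differ, assuming a current pandas release. The category order that this PR would produce for ordered=False is spelled out explicitly via `categories=` here, so the snippet does not depend on which default is in effect.

```python
import pandas as pd

values = ["a", "c", "b"]
# Categories in order of appearance, as this PR proposes for ordered=False
by_appearance = pd.Categorical(values, categories=["a", "c", "b"], ordered=False)
# Categories sorted lexicographically, then marked as unordered
by_sorting = pd.Categorical(values, categories=["a", "b", "c"], ordered=False)

same_values = list(by_appearance) == list(by_sorting)
same_categories = list(by_appearance.categories) == list(by_sorting.categories)
same_codes = list(by_appearance.codes) == list(by_sorting.codes)
print(same_values, same_categories, same_codes)
```

Both objects hold the same values, but their categories and codes are laid out differently.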

@shoyer
Member

shoyer commented Jan 23, 2015

IMO ordered is effectively part of the dtype and should be immutable.

@jankatins
Contributor Author

@shoyer: can you elaborate? :-) This is modeled after R's factor which lets you set this (but of course this will produce a new factor, not an inplace operation like we have it here)

@jankatins
Contributor Author

@shoyer the more I think about it, the more sense it makes to not let s.cat.ordered be changeable (only readable). In R that's implicitly the case, as every change creates a new factor.

If that's desirable, we need a change_ordered(new_order) method which returns a new copy of the categorical. And remove the setter...

@jreback @jorisvandenbossche any comments?

@@ -268,7 +268,7 @@ def __init__(self, values, categories=None, ordered=None, name=None, fastpath=Fa

         if categories is None:
             try:
-                codes, categories = factorize(values, sort=True)
+                codes, categories = factorize(values, sort=ordered if not ordered is None else True)
Member

More succinctly, this could be ordered is None or ordered
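
A quick check that the two expressions agree for the three values `ordered` can take here. The function names are made up for this illustration, and the first expression is written with `is not` instead of `not ... is` purely for readability.

```python
def sort_flag_original(ordered):
    # Expression from the diff
    return ordered if ordered is not None else True


def sort_flag_suggested(ordered):
    # Shorter form suggested in the review comment
    return ordered is None or ordered


results = {
    ordered: (sort_flag_original(ordered), sort_flag_suggested(ordered))
    for ordered in (None, True, False)
}
for ordered, (original, suggested) in results.items():
    print(ordered, original, suggested)
```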

@shoyer
Member

shoyer commented Jan 25, 2015

I don't think we necessarily need a change ordered method; it's straightforward enough to use Categorical.from_codes(orig.codes, orig.categories, ordered=False) or even just Categorical(orig, orig.categories, ordered=False).

By "part of the dtype" I'm referring to categorical data as it's defined, e.g., in dynd -- everything that's not the specific values of an array is part of the dtype. From a simplicity perspective, it's nice if that cannot change.
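
A minimal sketch of the `from_codes` route mentioned above, assuming a current pandas release. The original object is left untouched and a new unordered Categorical is built from its codes and categories.

```python
import pandas as pd

orig = pd.Categorical(["a", "c", "b", "a"], categories=["a", "c", "b"], ordered=True)

# Rebuild from the existing codes and categories; orig itself is not modified
unordered = pd.Categorical.from_codes(orig.codes, orig.categories, ordered=False)

print(orig.ordered, unordered.ordered)
print(list(unordered.categories))
```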

@jankatins
Contributor Author

But right now it can change, and only in place, because it is a setter: s.cat.ordered = False. It is actually the only part of the s.cat API that works in place by default (and exclusively); all others are methods with a default of inplace=False.

E.g.

df["cat"] = ...
df.cat.cat.ordered = False

will change the dataframe itself

For all else, you have to assign again or use inplace=True:

df["cat"] = df.cat.cat.reorder_categories([...])
df.cat.cat.reorder_categories([...], inplace=True)

@bashtage
Contributor

The current method to change the order is not friendly and requires more knowledge than most users would probably have. Of course, many people would just use

ordered = pd.Categorical(['a','b','c'])
unordered = pd.Categorical(ordered, ordered=False)

which is a little wasteful but simple. I could see something like

unordered = ordered.swap_ordering()

or

ordered.swap_ordering(inplace=True)

making it more obvious how to remove (or add) ordering.

One of the dangers of the current approach to ordering is

unordered = pd.Categorical(['a','c','d','b'], ordered=False)
unordered.ordered=True

which seems to work but only by accident.

@jankatins
Contributor Author

actually we tried to hide pd.Categorical() and promoted s.astype("category") and s.cat instead, so I don't think that's a good option...

@jreback
Contributor

jreback commented Jan 25, 2015

@JanSchulz I think this PR is fine.

A separate issue is whether to change what ordered does now (e.g. make it an immutable property), so let's open a new one for that.

@jreback jreback added API Design Categorical Categorical Data Type labels Jan 25, 2015
@jreback jreback added this to the 0.16.0 milestone Jan 25, 2015
@jreback
Contributor

jreback commented Jan 25, 2015

@JanSchulz pls add a release note (API section, maybe add a small example to be clear).

@jorisvandenbossche
Member

As a comparison, R does sort the categories lexicographically for unordered factors (and unordered is the default there).

Apart from that, I don't know whether I find it more logical to sort the categories or not. @JanSchulz, as you said in another discussion (about the sorting, I think), 'the order of appearance' in a dataset often does not really mean anything. So I don't know if it is worth changing.

@jreback
Contributor

jreback commented Mar 3, 2015

@JanSchulz can you add a release note (and maybe a short example of the behavior change to 0.16.0)

@jreback
Contributor

jreback commented Mar 5, 2015

@JanSchulz can you add a release note explaining this change? otherwise looks gtg.

@jorisvandenbossche
Member

@jreback @JanSchulz I am not really sure I think this is a good idea. I think having it sorted lexicographically is more logical, as there is no inherent order in the categories. But you still want to have this somewhat consistent. In that way, sorting it lexicographically makes the most sense I think.

This is also what R does (but there the default is unordered), and I think also what dynd does in factor_categorical (although I don't know if dynd has the distinction between ordered and unordered)

@jreback
Contributor

jreback commented Mar 6, 2015

In [33]: pd.Categorical(["a", "c", "b", "a"], categories=['a','c','b'], ordered=None)
Out[33]: 
[a, c, b, a]
Categories (3, object): [a < c < b]

In [34]: pd.Categorical(["a", "c", "b", "a"], categories=['a','c','b'], ordered=True)
Out[34]: 
[a, c, b, a]
Categories (3, object): [a < c < b]

In [35]: pd.Categorical(["a", "c", "b", "a"], categories=['a','c','b'], ordered=False)
Out[35]: 
[a, c, b, a]
Categories (3, object): [a, c, b]

In [36]: pd.Categorical(["a", "c", "b", "a"], ordered=None)
Out[36]: 
[a, c, b, a]
Categories (3, object): [a < b < c]

In [37]: pd.Categorical(["a", "c", "b", "a"], ordered=True)
Out[37]: 
[a, c, b, a]
Categories (3, object): [a < b < c]

In [38]: pd.Categorical(["a", "c", "b", "a"], ordered=False)
Out[38]: 
[a, c, b, a]
Categories (3, object): [a, b, c]

So I think the intent is to have the 2nd section (where no category order is specified explicitly) behave like the first section by default (for all cases). They will still be ordered/unordered as a Categorical type, as indicated by the ordered attribute.

This will effectively change the default ordering from lexicographic to discovery order (which is how .factorize() works).

a couple of options I see

  1. leave this alone
  2. make the default as I describe above, e.g. to the order of appearance
  3. require categories to be specified when ordered=True (e.g. force the user to say the actual ordering)
  4. leave this alone, but make ordered=None -> ordered=False, e.g. if you don't specify an ordering then your categories are sorted lexicographically but you are not considered ordered (this is the same as 1, except that we don't default to ordered)

aside from this, I think we need for #9190

def set_order(self, ordered, inplace=False):
    """
    Parameters
    ----------
    ordered : boolean
        set the ordered attribute for this Categorical to be the passed ordered
    inplace : boolean, default False
        modify the categorical inplace
    """

and then raise on cat.ordered = value (or could deprecate and suggest set_order)
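
A rough sketch of that proposal on a toy class. `ToyCategorical` is hypothetical and models nothing but the ordered flag; it is not the pandas implementation. It shows a read-only `ordered` property combined with a `set_order` method that returns a copy unless `inplace=True`.

```python
class ToyCategorical:
    """Hypothetical stand-in for Categorical; only models the ordered flag."""

    def __init__(self, codes, categories, ordered=False):
        self.codes = list(codes)
        self.categories = list(categories)
        self._ordered = bool(ordered)

    @property
    def ordered(self):
        # Read-only: no setter is defined, so assignment raises AttributeError
        return self._ordered

    def set_order(self, ordered, inplace=False):
        if inplace:
            self._ordered = bool(ordered)
            return None
        # Default: leave self untouched and hand back a modified copy
        return ToyCategorical(self.codes, self.categories, ordered=ordered)


cat = ToyCategorical([0, 1, 2, 0], ["a", "c", "b"], ordered=True)
unordered = cat.set_order(False)

try:
    cat.ordered = False
    assignment_raised = False
except AttributeError:
    assignment_raised = True

print(cat.ordered, unordered.ordered, assignment_raised)
```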

I think 4) is the most logical here. Basically turning default Categoricals to unordered. The ordering still remains lexicographic, unless otherwise overridden.

@jorisvandenbossche
Member

Yes, I also wanted to raise the idea to have unordered as a default (wanted to make an issue for that, but started with reraising in this issue). Do we discuss that here or open a new one?

I also think ordered=False as a default makes more sense:

  • It is more in line with the distinction of 'categorical' vs 'ordinal' variable (eg http://www.ats.ucla.edu/stat/mult_pkg/whatstat/nominal_ordinal_interval.htm), where we provide both in pandas through Categorical, but with the 'categorical' meaning as default.
  • Many common examples, say gender (F, M), country, color, etc., have no intrinsic order. So defaulting to unordered makes as much sense, I think.
  • In line with the default of R
  • it makes the default more straightforward (now None -> True or False depending on the data type)
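
The nominal vs. ordinal distinction can be illustrated as follows, assuming a current pandas release. The data are made up for this example: an unordered Categorical refuses order-based reductions such as `min()`, while an ordered one uses the declared category order rather than lexicographic order.

```python
import pandas as pd

# Nominal: no intrinsic order between the categories
nominal = pd.Categorical(["F", "M", "F"], ordered=False)
# Ordinal: the order is declared explicitly through the categories argument
ordinal = pd.Categorical(["low", "high", "low"], categories=["low", "high"], ordered=True)

try:
    nominal.min()
    nominal_min_error = None
except TypeError as exc:
    # Order-based reductions are not defined for unordered categoricals
    nominal_min_error = exc

print(type(nominal_min_error).__name__)
print(ordinal.min(), ordinal.max())
```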

@jankatins
Contributor Author

I can live with making it unordered as default.

Regarding the default order of the categories: I actually don't mind much. If you want to have it ordered, in most cases you have to reorder anyway... I would lean slightly towards leaving it as it is now (lexi sorting and not order of appearance). Someone who is willing to make a decision should do it by either pulling this PR or closing it :-)

-> In line with option 4 above (default to unordered but lexi-sort the categories)

Regarding the method which changes the "ordered" property: maybe a s.cat.as_[un]ordered(inplace=False) or s.cat.as_nominal/ordinal(inplace=False) would be more clear. set_order can mean both setting specifically ordered categories or setting the order property.
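
For reference, later pandas releases do provide s.cat.as_ordered() and s.cat.as_unordered(). A short sketch assuming such a release (the calls below take no inplace argument); each call returns a new Series and leaves the original unchanged.

```python
import pandas as pd

s = pd.Series(["a", "c", "b"], dtype="category")

ordered_s = s.cat.as_ordered()              # new Series with ordered=True
unordered_s = ordered_s.cat.as_unordered()  # and back to ordered=False

print(s.cat.ordered, ordered_s.cat.ordered, unordered_s.cat.ordered)
print(list(ordered_s.cat.categories))
```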

Not sure about deprecation or not. If deprecation, just one cycle?

@jreback
Contributor

jreback commented Mar 7, 2015

closing in favor of #9611

Labels
API Design Categorical Categorical Data Type

5 participants