Categorical: don't sort the categoricals if Categorical(..., ordered=False) #9347

jankatins · 2015-01-23T21:58:53Z

In mwaskom/seaborn#361 it was discussed that lexicographically sorting the categories is only appropriate if an order is specified or implied. If this is explicitly not done, e.g. with Categorical(..., ordered=False), then the order should be taken from the order of appearance, similar to the current Series.unique() implementation.
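
The two candidate orderings can be sketched in plain Python. The helper name `factorize_sketch` is hypothetical and only mimics the idea of a `sort` flag; it is not the pandas implementation.

```python
def factorize_sketch(values, sort=True):
    """Return (codes, categories) for a list of hashable values.

    Hypothetical helper for illustration only; not the pandas implementation.
    """
    # dict.fromkeys keeps the first occurrence of each value, in order of appearance
    categories = list(dict.fromkeys(values))
    if sort:
        # Lexicographic ordering of the distinct values
        categories = sorted(categories)
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [position[v] for v in values]
    return codes, categories


values = ["a", "c", "b", "a"]
sorted_codes, sorted_cats = factorize_sketch(values, sort=True)
seen_codes, seen_cats = factorize_sketch(values, sort=False)

print(sorted_cats)  # ['a', 'b', 'c']
print(seen_cats)    # ['a', 'c', 'b']
```

The values are identical in both cases; only the category order, and therefore the integer codes, differ.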

jankatins · 2015-01-23T21:59:11Z

Note this should only be taken if Series.unique() is kept as it is now (#9346)

I'm also not so sure what is best here: if a categorical is changed from ordered==True to ordered==False, the order of the categories should not change, which means that in the following case the two categoricals are not equal:

cat1 = Series(Categorical(["a","c","b"], ordered=False))
cat2 = Series(Categorical(["a","c","b"]))
cat2.cat.ordered=False

shoyer · 2015-01-23T22:10:31Z

IMO ordered is effectively part of the dtype and should be immutable.

jankatins · 2015-01-23T22:14:01Z

@shoyer: can you elaborate? :-) This is modeled after R's factor which lets you set this (but of course this will produce a new factor, not an inplace operation like we have it here)

jankatins · 2015-01-24T22:22:30Z

@shoyer the more I think about it, the more sense it makes to not let s.cat.ordered be changeable (only readable). In R that's implicitly the case, as every change creates a new factor.

If that's desirable, we need a change_ordered(new_order) method which returns a new copy of the categorical. And remove the setter...

@jreback @jorisvandenbossche any comments?

shoyer · 2015-01-25T01:46:15Z

pandas/core/categorical.py

@@ -268,7 +268,7 @@ def __init__(self, values, categories=None, ordered=None, name=None, fastpath=Fa

        if categories is None:
            try:
-                codes, categories = factorize(values, sort=True)
+                codes, categories = factorize(values, sort=ordered if not ordered is None else True)


More succinctly, this could be ordered is None or ordered
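
A quick check that the two expressions agree for the three values `ordered` is expected to take. The function names are hypothetical and exist only for this comparison.

```python
def sort_flag_original(ordered):
    # Expression from the diff, written with the more idiomatic "is not None"
    return ordered if ordered is not None else True


def sort_flag_succinct(ordered):
    # Shorter form suggested in the review comment
    return ordered is None or ordered


# Compare both forms for every value the parameter is expected to take
results = {ordered: sort_flag_succinct(ordered) for ordered in (None, True, False)}
for ordered in (None, True, False):
    assert sort_flag_original(ordered) == sort_flag_succinct(ordered)

print(results)  # {None: True, True: True, False: False}
```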

shoyer · 2015-01-25T01:50:39Z

I don't think we necessarily need a change ordered method; it's straightforward enough to use Categorical.from_codes(orig.codes, orig.categories, ordered=False) or even just Categorical(orig, orig.categories, ordered=False).
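
A minimal sketch of both spellings, assuming a recent pandas release (behaviour at the time of this discussion may have differed). The variable names are illustrative only.

```python
import pandas as pd

# An ordered categorical with an explicit, non-lexicographic category order
orig = pd.Categorical(["a", "c", "b", "a"], categories=["a", "c", "b"], ordered=True)

# Rebuild from codes and categories, dropping only the ordered flag
via_codes = pd.Categorical.from_codes(orig.codes, orig.categories, ordered=False)

# Or pass the existing categorical back through the constructor
via_ctor = pd.Categorical(orig, categories=orig.categories, ordered=False)

print(orig.ordered, via_codes.ordered, via_ctor.ordered)
print(list(via_codes.categories), list(via_ctor.categories))
```

Both routes return a new object and leave `orig` untouched, which is the property argued for here.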

By "part of the dtype" I'm referring to categorical data as it's defined, e.g., in dynd -- everything that's not the specific values of an array is part of the dtype. From a simplicity perspective, it's nice if that cannot change.

jankatins · 2015-01-25T09:52:30Z

But right now it can change, and only in place, because it is a setter: s.cat.ordered = False. It's actually the only part of the s.cat API that works in place by default (and exclusively); all others are methods with a default of inplace=False.

E.g.

df["cat"] = ...
df.cat.cat.ordered = False

will change the dataframe itself

For all else, you have to assign again or use inplace=True:

df["cat"] = df.cat.cat.reorder_categories([...])
df.cat.cat.reorder_categories([...], inplace=True)

bashtage · 2015-01-25T17:31:34Z

The current method to change the order is not friendly and requires more knowledge than most users would probably have. Of course, many people would just use

ordered = pd.Categorical(['a','b','c'])
unordered = pd.Categorical(ordered, ordered=False)

which is a little wasteful but simple. I could see something like

unordered = ordered.swap_ordering()

or

ordered.swap_ordering(inplace=True)

making it more obvious how to remove (or add) ordering.

One of the dangers of the current approach to ordering is

unordered = pd.Categorical(['a','c','d','b'], ordered=False)
unordered.ordered=True

which seems to work but only by accident.

jankatins · 2015-01-25T20:42:41Z

actually we tried to hide pd.Categorical() and promoted s.astype("category") and s.cat instead, so I don't think that's a good option...

jreback · 2015-01-25T21:31:41Z

@JanSchulz I think this PR is fine.

A separate issue is whether to change what ordered does now (e.g. make it an immutable property), so let's open a new one for that.

jreback · 2015-01-25T21:32:20Z

@JanSchulz pls add a release note (API section, maybe add a small example to be clear).

jorisvandenbossche · 2015-01-25T21:48:21Z

As a comparison, R does sort the categories lexicographically for unordered factors (but unordered is the default there).

And apart from that, I don't know whether I find it more logical to sort the categories or not. @JanSchulz As you said in another discussion (about the sorting, I think), 'the order of appearance' in a dataset often does not really mean anything. So I don't know if it's worth changing.

jreback · 2015-03-03T01:02:58Z

@JanSchulz can you add a release note (and maybe a short example of the behavior change to 0.16.0)

jreback · 2015-03-05T23:29:24Z

@JanSchulz can you add a release note explaining this change? otherwise looks gtg.

jorisvandenbossche · 2015-03-06T10:26:09Z

@jreback @JanSchulz I am not really sure this is a good idea. I think having the categories sorted lexicographically is more logical, as there is no inherent order in them, but you still want the result to be somewhat consistent. In that sense, sorting lexicographically makes the most sense, I think.

This is also what R does (but there the default is unordered), and I think also what dynd does in factor_categorical (although I don't know if dynd has the distinction between ordered and unordered)

jreback · 2015-03-06T13:10:13Z

In [33]: pd.Categorical(["a", "c", "b", "a"], categories=['a','c','b'], ordered=None)
Out[33]: 
[a, c, b, a]
Categories (3, object): [a < c < b]

In [34]: pd.Categorical(["a", "c", "b", "a"], categories=['a','c','b'], ordered=True)
Out[34]: 
[a, c, b, a]
Categories (3, object): [a < c < b]

In [35]: pd.Categorical(["a", "c", "b", "a"], categories=['a','c','b'], ordered=False)
Out[35]: 
[a, c, b, a]
Categories (3, object): [a, c, b]

In [36]: pd.Categorical(["a", "c", "b", "a"], ordered=None)
Out[36]: 
[a, c, b, a]
Categories (3, object): [a < b < c]

In [37]: pd.Categorical(["a", "c", "b", "a"], ordered=True)
Out[37]: 
[a, c, b, a]
Categories (3, object): [a < b < c]

In [38]: pd.Categorical(["a", "c", "b", "a"], ordered=False)
Out[38]: 
[a, c, b, a]
Categories (3, object): [a, b, c]

So I think the intent is to have the 2nd section (where no ordering is specified explicitly) be turned into the first section by default (for all cases). They will still be ordered/unordered as a Categorical type, as indicated by the ordered attribute.

This will effectively change the default ordering from lexicographic to discovery order (which is how .factorize() works).

A couple of options I see:

1. leave this alone
2. make the default as I describe above, i.e. the order of appearance
3. require categories to be specified when ordered=True (i.e. force the user to state the actual ordering)
4. leave this alone, but make ordered=None -> ordered=False, i.e. if you don't specify an ordering then your categories are sorted lexicographically but you are not considered ordered (this is the same as 1, except that we don't default to ordered)

Aside from this, I think we need the following for #9190:

def set_order(self, ordered, inplace=False):
    """
    Parameters
    ----------
    ordered : boolean
        set the ordered attribute for this Categorical to be the passed ordered
    inplace : boolean, default False
        modify the categorical inplace
    """

and then raise on cat.ordered = value (or could deprecate and suggest set_order)

I think 4) is the most logical here: basically turning default Categoricals to unordered. The ordering still remains lexicographic, unless otherwise overridden.
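
For illustration, option 4 corresponds to the behaviour below. This sketch assumes a recent pandas release, where a categorical built without `categories` or `ordered` gets sorted categories but is not flagged as ordered; it says nothing about the release under discussion here.

```python
import pandas as pd

values = ["a", "c", "b", "a"]

# No categories and no ordered flag: categories are inferred and sorted,
# but the result is not flagged as ordered
default = pd.Categorical(values)

# An ordering has to be requested explicitly, here with a custom category order
explicit = pd.Categorical(values, categories=["a", "c", "b"], ordered=True)

print(list(default.categories), default.ordered)
print(list(explicit.categories), explicit.ordered)
```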

jorisvandenbossche · 2015-03-06T13:25:05Z

Yes, I also wanted to raise the idea to have unordered as a default (wanted to make an issue for that, but started with reraising in this issue). Do we discuss that here or open a new one?

I also think ordered=False as a default makes more sense:

- It is more in line with the distinction between 'categorical' and 'ordinal' variables (e.g. http://www.ats.ucla.edu/stat/mult_pkg/whatstat/nominal_ordinal_interval.htm), where we provide both in pandas through Categorical, but with the 'categorical' meaning as default.
- Many common examples, such as gender (F, M), country, color, etc., have no intrinsic order. So defaulting to unordered makes at least as much sense, I think.
- It is in line with the default of R.
- It makes the default more straightforward (now None -> True or False depending on the data type).

jankatins · 2015-03-06T15:45:56Z

I can live with making it unordered as default.

Regarding the default order of the categories: I actually don't mind much; if you want to have it ordered, in most cases you have to reorder anyway... I would lean slightly towards leaving it as it is now (lexi sorting and not order of appearance). Someone who is willing to make a decision should do it by either pulling this PR or closing it :-)

-> In line with option 4 above (default to unordered but lexi-sort the categories)

Regarding the method which changes the "ordered" property: maybe a s.cat.as_[un]ordered(inplace=False) or s.cat.as_nominal/ordinal(inplace=False) would be more clear. set_order can mean both setting specifically ordered categories or setting the order property.

Not sure about deprecation or not. If deprecation, just one cycle?
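
A non-mutating accessor of this kind could be used as follows. The sketch assumes a recent pandas release, in which `Series.cat.as_ordered()` and `Series.cat.as_unordered()` exist and return a new object; whether they accept an `inplace` argument depends on the version, so none is passed here.

```python
import pandas as pd

s = pd.Series(["a", "c", "b", "a"]).astype("category")

# as_ordered() returns a new Series; s itself is left untouched
s_ordered = s.cat.as_ordered()

# as_unordered() drops the flag again, also on a new object
s_back = s_ordered.cat.as_unordered()

print(s.cat.ordered, s_ordered.cat.ordered, s_back.cat.ordered)
print(list(s_ordered.cat.categories))
```

Only the ordered flag changes; the categories keep the order they already had.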

jreback · 2015-03-07T23:14:32Z

closing in favor of #9611

jankatins mentioned this pull request Jan 24, 2015

Feature Request: categorical.reset_order #9190

Closed

shoyer reviewed Jan 25, 2015

jreback added the API Design and Categorical (Categorical Data Type) labels Jan 25, 2015

jreback added this to the 0.16.0 milestone Jan 25, 2015

jreback mentioned this pull request Mar 7, 2015

API: deprecate setting of .ordered directly (GH9347, GH9190) #9611

Closed

jreback closed this Mar 7, 2015

jreback mentioned this pull request Mar 10, 2015

API: deprecate setting of .ordered directly (GH9347, GH9190) #9622

Merged

jreback mentioned this pull request Apr 3, 2016

DOC: Categorical sort_values and sort Documentation #12785

Closed
