API: deprecate setting of .ordered directly (GH9347, GH9190) #9611

Closed
wants to merge 6 commits into from

jreback
Contributor

@jreback jreback commented Mar 7, 2015

closes #9347
closes #9190
closes #9148

so this is option 4).

Default is now ordered=False (independently of whether you specify categories or not)
The ordering is still what factorize does (seen order, if no categories are specified).

Add set_ordered to change the ordered flag, which by default returns a new object.

In [1]: cat = pd.Categorical([0,1,2])
In [2]: cat
Out[2]: 
[0, 1, 2]
Categories (3, int64): [0, 1, 2]

In [3]: cat.ordered
Out[3]: False

In [4]: cat.ordered=True
pandas/core/categorical.py:452: FutureWarning: Setting 'ordered' directly is deprecated, use 'set_ordered'

In [5]: cat = cat.set_ordered(True)

In [6]: cat
Out[6]: 
[0, 1, 2]
Categories (3, int64): [0 < 1 < 2]

In [7]: cat.ordered
Out[7]: True

In [8]: cat = pd.Categorical([0,1,2],ordered=True)

In [9]: cat
Out[9]: 
[0, 1, 2]
Categories (3, int64): [0 < 1 < 2]

In [12]: cat.ordered
Out[12]: True
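
A minimal sketch of the non-mutating pattern shown in the session above, assuming a reasonably recent pandas. It uses `as_ordered()`, the name this thread later settles on, rather than `set_ordered(True)`; direct assignment to `.ordered` is deliberately not shown, since the setter was later removed (see the commit referenced at the end of this thread).

```python
import pandas as pd

# Default construction: categories are inferred, the ordered flag is off.
cat = pd.Categorical([0, 1, 2])

# as_ordered() returns a new Categorical; `cat` itself is left untouched.
ordered_cat = cat.as_ordered()

# The flag can also be set up front in the constructor.
built_ordered = pd.Categorical([0, 1, 2], ordered=True)

print(cat.ordered, ordered_cat.ordered, built_ordered.ordered)
```

Returning a new object keeps the ordered flag from changing underneath other references to the same data.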

Further added the ability for astype to pass on keywords to the constructor

In [1]: Series(["a","b","c","a"]).astype('category',ordered=True)
Out[1]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]

In [2]: Series(["a","b","c","a"]).astype('category',categories=list('abcdef'),ordered=False)
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (6, object): [a, b, c, d, e, f]
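
The keyword passthrough above is what this PR proposed. As far as I know, later pandas releases dropped the `astype('category', ordered=...)` keywords and expect a `CategoricalDtype` object instead, so the sketch below assumes such a version (pandas 0.21 or newer). The variable names are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])

# Ordered categorical with explicitly listed categories.
ordered_abc = s.astype(CategoricalDtype(categories=["a", "b", "c"], ordered=True))

# Unordered categorical with extra, currently unused categories.
wide = s.astype(CategoricalDtype(categories=list("abcdef"), ordered=False))

print(ordered_abc.cat.ordered, list(wide.cat.categories))
```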

Furthermore, I put in a warning, as seemingly simple operations will fail because the user had an implicitly ordered Categorical.

In [1]: df = DataFrame({ 'A' : Series(list('aabc')).astype('category'), 'B' : np.arange(4) })

In [2]: df['A'].order()
TypeError: Categorical not ordered
you can use .set_ordered(True) to change the Categorical to an ordered one

In [3]: df.groupby('A').sum()
ValueError: cannot sort by an unordered Categorical in the grouper
you can set sort=False in the groupby expression or
make the categorical ordered by using .set_ordered(True)
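
The `sort=False` escape hatch mentioned in the error message can be sketched as follows. This assumes pandas 0.23 or newer; `observed=True` is my addition (not part of this PR) so that only categories present in the data appear in the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": pd.Series(list("aabc")).astype("category"),  # unordered by default
    "B": np.arange(4),
})

# sort=False asks groupby not to sort the group keys at all,
# so no ordering of the categorical is required.
totals = df.groupby("A", sort=False, observed=True)["B"].sum()

print(totals.to_dict())
```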

@jreback jreback added API Design Categorical Categorical Data Type labels Mar 7, 2015
@jreback jreback added this to the 0.16.0 milestone Mar 7, 2015
@jreback jreback force-pushed the cat branch 7 times, most recently from f73312b to d4e9254 Compare March 7, 2015 17:47
@jreback
Contributor Author

jreback commented Mar 7, 2015

One point to note.

Supplying categories DOES NOT make the categorical ordered. Ordering is a separate and distinct property.

In [3]: pd.Categorical(["a","a","b","b"], categories=["a","b","z"])
Out[3]: 
[a, a, b, b]
Categories (3, object): [a, b, z]

In [4]: pd.Categorical(["a","a","b","b"], categories=["a","b","z"], ordered=True)
Out[4]: 
[a, a, b, b]
Categories (3, object): [a < b < z]
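
To make the distinction concrete, here is a small sketch (assuming current pandas behaviour) in which both objects carry the same categories in the same sequence, yet only the one flagged `ordered=True` supports order-dependent reductions.

```python
import pandas as pd

values = ["a", "a", "b", "b"]
cats = ["a", "b", "z"]

unordered = pd.Categorical(values, categories=cats)
ordered = pd.Categorical(values, categories=cats, ordered=True)

# Same categories in the same sequence; only the flag differs.
same_categories = list(unordered.categories) == list(ordered.categories)

# min()/max() need an order, so the unordered variant refuses.
try:
    unordered.min()
    unordered_min_raised = False
except TypeError:
    unordered_min_raised = True

lowest = ordered.min()   # smallest observed value under a < b < z
highest = ordered.max()  # largest observed value; "z" never occurs in the data
```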

@jreback
Contributor Author

jreback commented Mar 7, 2015


In [2]: df['A'].order()
TypeError: Categorical not ordered
you can use .set_ordered(True) to change the Categorical to an ordered one
Contributor

The wording suggests that set_ordered(True) is inplace, and it exposes the underlying object name instead of talking about "categorical data". On the other hand, I tried to get a better wording for 5 min, but couldn't find one :-(

Contributor Author

trying to discourage the inplace option.

Contributor

I just saw that set_categories() has the same issue... ok, I drop my case here :-)

@jreback
Contributor Author

jreback commented Mar 7, 2015

@JanSchulz jreback@0becba3

@jreback
Contributor Author

jreback commented Mar 7, 2015

Regarding the original discussion in #9347, here is what the current impl. does:

In [14]: cat1 = Series(pd.Categorical(["a","c","b"], ordered=False))

In [16]: cat1
Out[16]: 
0    a
1    c
2    b
dtype: category
Categories (3, object): [a, c, b]

In [15]: cat2 = Series(pd.Categorical(["a","c","b"], ordered=True))

In [17]: cat2
Out[17]: 
0    a
1    c
2    b
dtype: category
Categories (3, object): [a < b < c]

In [19]: cat3 = Series(pd.Categorical(["a","c","b"], categories=['b','c','a'], ordered=True))

In [20]: cat3
Out[20]: 
0    a
1    c
2    b
dtype: category
Categories (3, object): [b < c < a]

@JanSchulz does this look right?

@jankatins
Contributor

No, the first example is what #9347 is/was all about and which was kind of rejected by all "option 4" votes:

  1. leave this alone, but make ordered=None -> ordered=False, e.g. if you don't specify an ordering then your ordering is lexicographic but you are not considered ordered (this is the same as 1), but we don't default the order

@jreback
Contributor Author

jreback commented Mar 7, 2015

ok, so I should leave the sorting alone, then just set the ordered flag. ok.

@jreback
Contributor Author

jreback commented Mar 7, 2015

ok, if I change back to what was there before (e.g. this should be option 4)

In [4]: cat1
Out[4]: 
0    a
1    c
2    b
dtype: category
Categories (3, object): [a, b, c]

In [5]: cat2
Out[5]: 
0    a
1    c
2    b
dtype: category
Categories (3, object): [a < b < c]

In [6]: cat3
Out[6]: 
0    a
1    c
2    b
dtype: category
Categories (3, object): [b < c < a]
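
The three cases above can be reproduced with a short sketch, assuming current pandas behaviour: inferred categories come out sorted whether or not the flag is set, while explicitly passed categories keep the sequence they were given in.

```python
import pandas as pd

data = ["a", "c", "b"]

cat1 = pd.Categorical(data)                # categories inferred, unordered
cat2 = pd.Categorical(data, ordered=True)  # categories inferred, ordered
cat3 = pd.Categorical(data, categories=["b", "c", "a"], ordered=True)

for cat in (cat1, cat2, cat3):
    print(list(cat.categories), cat.ordered)

# With the custom order b < c < a, "a" is the largest value.
largest = cat3.max()
```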

@jankatins
Contributor

yep

cat
cat.ordered

# you can set in the construtor
Contributor

typo: "constru_c_tor"

Contributor Author

done

@jorisvandenbossche
Member

@jreback @JanSchulz I don't have time today to look at this in more detail, so if possible, can you keep it open a bit longer?

But already one thing: getting errors with a plain groupby really does not look nice in my opinion. Can't we just sort it in the 'order' of the categories? I know it is not 'ordered' in the sense that the first element is regarded as lower than the last etc, but still, you have the 'order' of the categories in the categories attribute.

@jreback
Contributor Author

jreback commented Mar 8, 2015

So here is a possible compromise. Automatically ordering basically violates the API guarantee that we shouldn't allow sorting (or sort-like operations) on unordered Categoricals.

But we could simply do an implicit .as_ordered() inside the operation and warn (OrderingWarning or something) that what they are doing is bad. I think we could do this for all current raises for non-ordering operations, e.g. in groupby, min, max, order, argsort, searchsorted.

In [1]: df = DataFrame({ 'A' : Series(list('aabc')).astype('category'), 'B' : np.arange(4) })

In [2]: df['A'].order()
pandas/core/categorical.py:1012: OrderingWarning: Categorical is not ordered
sort will be in the order of the categories
you can use .as_ordered() to change the Categorical to an ordered one

  warn("Categorical is not ordered\n"
Out[2]: 
0    a
1    a
2    b
3    c
Name: A, dtype: category
Categories (3, object): [a, b, c]

@jreback
Contributor Author

jreback commented Mar 8, 2015

The latest commit changes all of the raises to OrderingWarning. So users can do what they were doing, but will now get a warning.

In [1]: s = Series([1,2,3,1], dtype="category")

In [2]: s.min()
pandas/core/categorical.py:843: OrderingWarning: Categorical is not ordered
sort will be in the order of the categories
you can use .as_ordered() to change the Categorical to an ordered one

  OrderingWarning)
Out[2]: 1

@jreback jreback force-pushed the cat branch 4 times, most recently from 5c61bc5 to bba7f56 Compare March 8, 2015 18:36
@jankatins
Contributor

I don't like the "warn + change instead of raise"-thing... The old thing (raise in groupby) is not so nice, but at least it's not changing a datatype. The way I think about categorical data is like a user-defined integer: you have certain valid values (0...INTMAX), maybe an order (0<1, etc). If we had an "unorderable int", it would be bad if min() just issued a warning and then proceeded as if it were a normal int.

The only place I find that "not so nice" is the groupby and I think that is more of a problem of the default of the groupby.

I see several ways to fix that, but all have drawbacks:

  • Use this "warn + change" approach as in the before last commit (see above...).
  • keep ordered=True as the implicit default -> makes the simple groupby case happy, but not so consistent with R and other expectations
  • change to sort=False in groupby -> API change :-(
  • change to sort=None and only make that False (with a warning) if the input is an unordered Categorical
  • simply raise an error for the "simple case", which lets the user know that they should always use explicit ordering information when creating a categorical: astype("category", ordered=True)

My preference would be sort=None or raise if unordered...

@shoyer
Member

shoyer commented Mar 8, 2015

I need to look into this more deeply, but my inclination would be to switch to sort=None in groupby, which could be interpreted as "sort if possible". In practice, that would mean "sort everything except unordered categoricals." I would certainly not change categorical dtypes in groupby operations if possible.... though in the current implementation (without CategoricalIndex), we basically do that anyways.

@jreback
Contributor Author

jreback commented Mar 8, 2015

this commit here implements the sort=None option.

and then we have the groupby back to 'working' (it's doing what it was doing before we changed the default), but if you force sort=True then it WILL raise.

.order()/min/max will still raise

In [1]: df = DataFrame({ 'A' : Series(list('aabc')).astype('category'), 'B' : np.arange(4) })

In [2]: df.groupby('A').sum()
Out[2]: 
   B
A   
a  1
b  2
c  3

In [3]: df['A'].order()
TypeError: Categorical not ordered
you can use .as_ordered() to change the Categorical to an ordered one
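
The tri-state `sort=None` rule described in this comment can be sketched in plain Python. `resolve_sort` and its parameters are hypothetical names made up for illustration; this is not the code from the commit, only the decision rule as stated in the thread.

```python
from typing import Optional


def resolve_sort(sort: Optional[bool], is_categorical: bool, is_ordered: bool) -> bool:
    """Hypothetical helper: decide whether groupby keys should be sorted.

    sort=None  -> "sort if possible": everything except unordered categoricals
    sort=True  -> sorting is forced, so an unordered categorical is an error
    sort=False -> never sort
    """
    unordered_categorical = is_categorical and not is_ordered
    if sort is None:
        return not unordered_categorical
    if sort and unordered_categorical:
        raise ValueError("cannot sort by an unordered Categorical in the grouper")
    return sort


# Plain (non-categorical) keys and ordered categoricals are sorted by default.
print(resolve_sort(None, is_categorical=False, is_ordered=False))
print(resolve_sort(None, is_categorical=True, is_ordered=True))
# Unordered categoricals are silently left unsorted under the default.
print(resolve_sort(None, is_categorical=True, is_ordered=False))
```

The point of the `None` default is that existing `groupby` calls keep working, while an explicit `sort=True` still surfaces the problem.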

@jreback
Contributor Author

jreback commented Mar 9, 2015

@jorisvandenbossche @shoyer

If you guys can have a look today would be gr8. Need this for the rc.

@JanSchulz made the list above.

I think sort=None is a reasonable approach to this.

@jorisvandenbossche
Member

I think there is another option that was not included in @JanSchulz's list of possible fixes:

  • allow sorting of unordered categoricals. And then we don't have to change anything in the groupby.
    This maybe sounds strange, but it is also what R does: it allows sorting, but not max/min eg:

    > cat <- as.factor(c('b', 'a', 'c', 'b'))
    > cat
    [1] b a c b
    Levels: a b c
    > sort(cat)
    [1] a b b c
    Levels: a b c
    > min(cat)
    Error in Summary.factor(c(2L, 1L, 3L, 2L), na.rm = FALSE) : 
    min not meaningful for factors
    > cat2 <- as.ordered(c('b', 'a', 'c', 'b'))
    > cat2
    [1] b a c b
    Levels: a < b < c
    > sort(cat2)
    [1] a b b c
    Levels: a < b < c
    > min(cat2)
    [1] a
    Levels: a < b < c
    

    Whether this is good or bad, I can't really say. But the question maybe is what the added value then is of using an ordered categorical over an unordered one.

But to put it the other way around: why not sort an unordered categorical? It is a convenience to be able to do this (see e.g. the groupby case), and what harm does it do that this works without having to convert it to an ordered categorical? It would just sort the values in the order of the categories.
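
For comparison with the R session, here is a sketch of the same idea in pandas, assuming a version where sorting an unordered categorical is allowed (current releases behave this way as far as I know). The categories are deliberately given in a non-alphabetical sequence to show that the sort follows them.

```python
import pandas as pd

# Unordered categorical whose categories are listed as c, b, a.
s = pd.Series(pd.Categorical(["b", "a", "c", "b"], categories=["c", "b", "a"]))

# Sorting follows the position of each value in `categories`,
# not alphabetical order, and does not make the data ordered.
sorted_s = s.sort_values()

print(list(sorted_s))
print(sorted_s.cat.ordered)
```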

@jreback In the case of using sort=None in groupby, what is the order of the result when grouping by an unordered categorical? The order of occurrence?

@jreback
Contributor Author

jreback commented Mar 9, 2015

@jorisvandenbossche sort=None -> sort=True for everything but unordered categoricals.

I like your soln. It basically doesn't care on groupby's which is fine; they will be ordered as they are ordered now (and not custom ordered like ordered categoricals).

If you think about it, we actually have 2 dtypes here (ordered & unordered categoricals), kind of like a sub-dtype.

@jreback
Contributor Author

jreback commented Mar 10, 2015

closing in favor of #9622

@jreback jreback closed this Mar 10, 2015
jreback pushed a commit that referenced this pull request Jul 19, 2016
xref #9611

Author: gfyoung <gfyoung17@gmail.com>

Closes #13671 from gfyoung/cat-set-order-removal and squashes the following commits:

58938e7 [gfyoung] CLN: removed setter method of categorical's ordered attribute
Development

Successfully merging this pull request may close these issues.

Feature Request: categorical.reset_order Ordered vs. Unordered Categoricals
4 participants