BUG, DEP, DOC: Patch and Align Categorical's Sorting API #12882

Closed
wants to merge 1 commit into
from

Conversation

Projects
None yet
3 participants
Member

gfyoung commented Apr 12, 2016

Clarifies the meaning of 'sort' in the context of Categorical to mean 'organization' rather than 'order',
as it is possible to call this method (as well as sort_values) when the Categorical is unordered.

Also patches a bug in Categorical.sort_values in which na_position was not being respected when
ascending was set to True. This commit aligns the behaviour with that of Series.

Finally, deprecates sort in favor of sort_values, which is in alignment with what was done with Series back in #10726.

Closes #12785.

@jreback jreback and 1 other commented on an outdated diff Apr 12, 2016

pandas/core/categorical.py
+ >>> c.sort_values()
+ [1, 1, 2, 2, 5]
+ Categories (3, int64): [1, 2, 5]
+ >>> c.sort_values(ascending=False)
+ [5, 2, 2, 1, 1]
+ Categories (3, int64): [1, 2, 5]
+
+ Inplace sorting can be done as well:
+
+ >>> c.sort_values(inplace=True)
+ >>> c
+ [1, 1, 2, 2, 5]
+ Categories (3, int64): [1, 2, 5]
+ >>>
+ >>> c = pd.Categorical([1, 2, 2, 1, 5])
+ >>> c.sort() # equivalent call
@jreback

jreback Apr 12, 2016

Contributor

.sort() is deprecated, so remove from the example

@gfyoung

gfyoung Apr 12, 2016

Member

The documentation doesn't say that. You might be thinking about order?

@jreback

jreback Apr 12, 2016

Contributor

exactly it was supposed to be deprecated I think. Can you look at the change in the sort PR and see. This didn't apply to Index, and for Series we DID deprecate it (along with order on both). So IIRC this should be deprecated.

@gfyoung

gfyoung Apr 12, 2016

Member

PR number just to double check?

EDIT: ignore, as I found it in test_categorical.py

@jreback jreback and 1 other commented on an outdated diff Apr 12, 2016

pandas/core/categorical.py
See Also
--------
- Category.sort_values
@jreback

jreback Apr 12, 2016

Contributor

hmm, this should be deprecated.

@gfyoung

gfyoung Apr 12, 2016

Member

I don't follow this comment. Could you clarify?

@jreback

jreback Apr 12, 2016

Contributor

see above

@jreback jreback and 1 other commented on an outdated diff Apr 12, 2016

pandas/tests/test_categorical.py
@@ -1307,6 +1307,57 @@ def test_sort(self):
exp = np.array(["a", "b", "c", "d"], dtype=object)
self.assert_numpy_array_equal(cat1.__array__(), exp)
+ # reverse
+ cat = Categorical(["a", "c", "c", "b", "d"], ordered=True)
+ res = cat.sort_values(ascending=False)
+ exp_val = np.array(["d", "c", "c", "b", "a"], dtype=object)
+ exp_categories = np.array(["a", "b", "c", "d"], dtype=object)
+ self.assert_numpy_array_equal(res.__array__(), exp_val)
+ self.assert_numpy_array_equal(res.categories, exp_categories)
+
+ # some NaN positions
@jreback

jreback Apr 12, 2016

Contributor

make new test method test_sort_values_na_position

Contributor

jreback commented Apr 12, 2016

yeah I am pretty sure that Categorical.sort was supposed to be deprecated (IOW, it should actually be calling .order rather than .sort_values. It was added for method compat with Series/Index, and never actually exposed in api.rst.

so @gfyoung pls change the doc-string, have it call .order, add a tests for the deprecation, and a note in the Deprecation section. It was a never advertised method, so I don't think this is a big deal.

Member

gfyoung commented Apr 12, 2016

@jreback : Wait...so the wrong method was deprecated then? Because in the documentation, order is deprecated but not sort.

Contributor

jreback commented Apr 12, 2016

no .order was ALSO deprecated

Member

gfyoung commented Apr 12, 2016

Read your comment above

Contributor

jreback commented Apr 12, 2016

all .sort and .order are deprecated in favor of .sort_values. https://github.com/pydata/pandas/pull/10726/files back in 0.17.0

Member

gfyoung commented Apr 12, 2016

Okay, so everyone calls sort_values now! Got it.

gfyoung changed the title from BUG, DOC: Clarify and patch Categorical's sort API to BUG, DEP, DOC: Clarify and patch Categorical's sort API Apr 12, 2016

gfyoung changed the title from BUG, DEP, DOC: Clarify and patch Categorical's sort API to BUG, DEP, DOC: Patch and Align Categorical's Sorting API Apr 12, 2016

@jreback jreback and 1 other commented on an outdated diff Apr 12, 2016

pandas/tests/test_categorical.py
with tm.assert_produces_warning(FutureWarning):
- c.order()
+ c.order() # deprecated in gh-10726
+ c.sort() # deprecated in gh-12882
@jreback

jreback Apr 12, 2016

Contributor

I think you need to do 2 asserts (e.g. say the first one DIDn't assert, but the 2nd one did) (you could do in a loop as well)

@gfyoung

gfyoung Apr 12, 2016

Member

Oh, whoops. Good point. I'll do the loop since both need to raise FutureWarning.

@jreback jreback commented on the diff Apr 12, 2016

pandas/core/categorical.py
- Categorical.sort is the equivalent but sorts the Categorical inplace.
-
- Parameters
- ----------
- inplace : boolean, default False
- Do operation in place.
- ascending : boolean, default True
- Sort ascending. Passing False sorts descending
- na_position : {'first', 'last'} (optional, default='last')
- 'first' puts NaNs at the beginning
- 'last' puts NaNs at the end
-
- Returns
- -------
- y : Category or None
+ DEPRECATED: use :meth:`Categorical.sort_values`. That function
@jreback

jreback Apr 12, 2016

Contributor

need an entry in whatsnew in the Deprecated section

jsexauer referenced this pull request Apr 12, 2016

Open

DEPR: deprecations from prior versions #6581

0 of 51 tasks complete

Is it possible there was a reason why sort was not deprecated? I can imagine it was done on purpose because a Categorical is actually more array-like than series-like, and then following the numpy 'rules' does make some sense.
But, I can't really remember if this was the reason to be honest, or whether it was an oversight.

@jorisvandenbossche jorisvandenbossche commented on an outdated diff Apr 12, 2016

pandas/core/categorical.py
@@ -1157,30 +1157,74 @@ def argsort(self, ascending=True, **kwargs):
return result
def sort_values(self, inplace=False, ascending=True, na_position='last'):
- """ Sorts the Category by category value returning a new Categorical by
- default.
-
- Only ordered Categoricals can be sorted!
-
- Categorical.sort is the equivalent but sorts the Categorical inplace.
+ """ Sorts the Categorical by category value returning a new
+ Categorical by default. While an ordering is applied to the
@jorisvandenbossche

jorisvandenbossche Apr 12, 2016

Owner

Can you leave the first sentence alone (apart from the Category -> Categorical correction of course). I mean, leave a blank line before the rest of the paragraph

@jorisvandenbossche jorisvandenbossche commented on the diff Apr 12, 2016

pandas/core/categorical.py
Parameters
----------
inplace : boolean, default False
Do operation in place.
ascending : boolean, default True
- Sort ascending. Passing False sorts descending
+ Order ascending. Passing False orders descending. The
@jorisvandenbossche

jorisvandenbossche Apr 12, 2016

Owner

I would leave this as 'Sort' instead of 'order'. Is there a reason you changed this?

@gfyoung

gfyoung Apr 12, 2016

Member

In the paragraph above, I took some care to explain that sort in the context of Categorical is not synonymous with order. That's why I changed it here.

@jorisvandenbossche

jorisvandenbossche Apr 12, 2016

Owner

Yes, but for exactly that reason I think 'sort' is the correct one to use.

I am not native english, but 'sort' is just organizing in groups, while 'order' has some meaning of, well, order. So the sort_values is thus in all cases sort and not always order

@jorisvandenbossche

jorisvandenbossche Apr 12, 2016

Owner

ah, but I see, here you refer of course to the ascending or not, which has an 'order' meaning .. :-)

@gfyoung

gfyoung Apr 12, 2016

Member

Yes, exactly. 😄

@gfyoung

gfyoung Apr 12, 2016

Member

As a native English speaker, I can say that your definitions are perfectly fine 😄 , but when it comes to using the term "sort" in the context of code writing, IINM the connotation is almost always "order". That's why I logged the issue in the first place.

@jorisvandenbossche

jorisvandenbossche Apr 12, 2016

Owner

Yep, that certainly true, and its what makes this case a bit confusing

Contributor

jreback commented Apr 12, 2016

@jorisvandenbossche no it was ADDED in that PR for conformity. with the Series/Index API. I think it was for conforming with a numpy-like object (which DOES have sort). But I don't think it really makes sense.

@jorisvandenbossche jorisvandenbossche commented on the diff Apr 12, 2016

pandas/core/categorical.py
Parameters
----------
inplace : boolean, default False
Do operation in place.
ascending : boolean, default True
- Sort ascending. Passing False sorts descending
+ Order ascending. Passing False orders descending. The
+ ordering parameter provides the method by which the
+ category values are organized.
@jorisvandenbossche

jorisvandenbossche Apr 12, 2016

Owner

The second sentence seems a bit redundant to me ?

@gfyoung

gfyoung Apr 12, 2016

Member

Again, it's that somewhat tedious differentiation between sort and order for Categorical. If you remove the last sentence, I think it gives the impression that sort and order are synonymous, which they are not in this case.

Member

gfyoung commented Apr 12, 2016

@jreback : Agreed. Judging from what I have seen with Series and DataFrame, the term sort is also a tad too generic to describe the types of ordering you can do with pandas objects.

Looking back at PR #10726 (the sort unification PR), my conclusion is that sort was already implemented (it was not added in that PR) and it was explicitly left alone for the reason I stated above. Categorical is an ndarray-substitute, so it makes sense to follow its API (at least for the methods with the same name, of course no problem to have an additional sort_values)

Member

gfyoung commented Apr 12, 2016

@jorisvandenbossche :

  1. Could you explain what you mean by that? I don't quite understand how Categorical can substitute for ndarray.

  2. Given that pandas is trying to move away from numpy, doesn't it make more sense to align the API with pandas objects and not numpy ones?

Contributor

jreback commented Apr 12, 2016

@jorisvandenbossche hmm, ok maybe I am wrong about when this was added. But we have removed (well deprecated) .sort from everything except Index (where it raises). So not sure we really need this addtl method.

Contributor

jreback commented Apr 12, 2016

.sort was added when we originally added Categorical (or well updated it): 0f62d3f

@gfyoung gfyoung BUG, DOC, DEP: Patch and Align Categorical's Sorting API
Clarifies the meaning of 'sort' in the context of
Categorical to mean 'organization' rather than 'order',
as it is possible to call this method (as well as
'sort_values') when the Categorical is unordered.

Also patches a bug in 'Categorical.sort_values' in
which 'na_position' was not being respected when
'ascending' was set to 'True'. This commit aligns
the behaviour with that of Series.

Finally, this commit deprecates 'sort' in favor
of 'sort_values,' which is in alignment with the
Series API as well.

Closes gh-12785.
f324a9c
Member

gfyoung commented Apr 13, 2016

@jreback : Travis is giving a green light once more, and I have addressed all of the changes / comments so far. Ready to merge if everyone is happy.

jreback added this to the 0.18.1 milestone Apr 14, 2016

jreback closed this in 2ea0601 Apr 14, 2016

Contributor

jreback commented Apr 14, 2016

thanks!

gfyoung deleted the gfyoung:categorical-sort-doc branch Apr 14, 2016

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this pull request Jun 19, 2017

@gfyoung gfyoung MAINT: Drop Categorical.order & sort
Deprecated back in 0.18.1

xref gh-12882
ab3264e

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this pull request Jun 19, 2017

@gfyoung gfyoung MAINT: Drop Categorical.order & sort
Deprecated back in 0.18.1

xref gh-12882
9891a5b

@jreback jreback added a commit that referenced this pull request Jun 19, 2017

@gfyoung @jreback gfyoung + jreback MAINT: Drop Categorical.order & sort (#16728)
Deprecated back in 0.18.1

xref gh-12882
8b5e3d6

@guillemborrell guillemborrell added a commit to guillemborrell/pandas that referenced this pull request Jul 7, 2017

@gfyoung @guillemborrell gfyoung + guillemborrell MAINT: Drop Categorical.order & sort (#16728)
Deprecated back in 0.18.1

xref gh-12882
a69eb57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment