Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: don't sort unique values from categoricals #9331

Merged
merged 1 commit into from
Feb 13, 2015

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Jan 22, 2015

This should resolve the inconsistency @mwaskom reported in #9148.

CC @jreback @TomAugspurger @JanSchulz

@jreback
Copy link
Contributor

jreback commented Jan 22, 2015

this makes sense

@mwaskom
Copy link
Contributor

mwaskom commented Jan 22, 2015

👍 thanks @shoyer

@shoyer shoyer added Categorical Categorical Data Type Enhancement labels Jan 22, 2015
@shoyer shoyer added this to the 0.16.0 milestone Jan 22, 2015
cat = Categorical(["a","b","a", np.nan], categories=["a","b","c"])
res = cat.unique()
exp = np.asarray(["a","b", np.nan], dtype=object)
self.assert_numpy_array_equal(res, exp)

cat = Categorical(["b", "a"], categories=["a", "b"], ordered=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment would be nice for the next coder :-)

@shoyer
Copy link
Member Author

shoyer commented Jan 22, 2015

Actually, does it even make sense to sort the unique values for ordered Categoricals? .unique() does not sort for any other dtype, and it's easy enough to chain .unique().order() if desired.

@jorisvandenbossche
Copy link
Member

@shoyer I just wanted to ask a similar question: what if you want the unique values in order of appearance when you have an ordered categorical? That is now rather impractical to obtain.

And another, thing: I don't think that .unique().order() is possible to obtain the unique ordered categories? As .unique already returns a plain ndarray, and this does not know anymore about the sort order of the categorical.

@shoyer
Copy link
Member Author

shoyer commented Jan 22, 2015

@jorisvandenbossche Maybe .unique() just should return another Categorical object then? It's straightforward enough to coerce to ndarray when desired via np.asarray.

@jorisvandenbossche
Copy link
Member

I think we had that discussion before if I recall correctly, and the we decided to change it from categorical to ndarray ... (but I should look it up)

Update: maybe I was wrong, I don't directly find it. There was a change in unique letting return only the "used" categories: #8937
Update2: ok, the discussion was about that it first returned an Index, and that was changed to an array.

@jankatins
Copy link
Contributor

I don't think .unique() should return anything else than what the rest of pandas returns so that there are as few surprises and needs for special casing Series/cols of categorical values.

[EDIT: on the other hand R returns a factor for unique(factor(...)): https://github.com//issues/8559#issuecomment-59416072]

from pandas.core.nanops import unique1d
# unlike np.unique, unique1d does not sort
unique_codes = unique1d(self.codes)
if self.ordered:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about this (and read the bugreports which have problems with this), the more I disagree with this change: if unique returns values in the order in which they appear in the rows, then this "contract" should be used. "order" only says that values in the categorical can be ordered, not that unique switches sorts the values. If one wants to have the categories in order, then take the categories directly (e.g. mwaskom/seaborn#361)

The only think missing is the "get me all unique values in the same order as specified by the categories", and one could give that a new keyword or a new method

@jorisvandenbossche
Copy link
Member

@JanSchulz for that last thing you mention (get me all unique values in the same order as specified by the categories), you can do:

In [44]: s = pd.Series(pd.Categorical(list('acbab'), list('abcde')))

In [45]: s
Out[45]:
0    a
1    c
2    b
3    a
4    b
dtype: category
Categories (5, object): [a < b < c < d < e]

In [46]: s.unique()
Out[46]: array(['a', 'b', 'c'], dtype=object)  # <----- this should then be [a, c, b] 

In [49]: s.cat.remove_unused_categories().cat.categories
Out[49]: Index([u'a', u'b', u'c'], dtype='object')   # <----- this stays as it is 

@jankatins
Copy link
Contributor

The last step is probably costly, as remove_unused_categories() does a complete redo of the factorization :-(

@jreback
Copy link
Contributor

jreback commented Jan 22, 2015

Here is a cheap way of doing the last step (ignoring that you need to take out -1 for nan and then re-add them)

In [25]: s.cat.categories.take(unique1d(s.cat.codes))
Out[25]: Index([u'a', u'c', u'b'], dtype='object')

In fact I think unique should just do this. It will handle ordered and unordered the same. Namely the same order as the categories are in (for non-ordered this is incidental, but makes sense).

In the actual Categorical

In [27]: c = s.values

In [28]: c.categories.take(unique1d(c.codes))
Out[28]: Index([u'a', u'c', u'b'], dtype='object')

@jankatins
Copy link
Contributor

@jreback: nope, that's the sorting order as defined on the rows ("by apearance"), not the categories. Yours should be the implementation of normal unique().

The last step above would be s.cat.categories.take(np.unique(s.cat.codes)), as that would be ordered in the same way as categories.

So basically:

def unique(sort=False)
   """ ...
   sort: sort the unique values by the order specified in `categories`
   """
    from pandas.core.nanops import unique1d
    # unlike np.unique, unique1d and pandas does not sort by default
    unique_codes = unique1d(self.codes)
    if sort:
        unique_codes = np.sort(unique_codes)
        # for compatibility with normal unique, which has nan last
        if unique_codes[0] == -1:
            unique_codes[0:-1] = unique_codes[1:]
            unique_codes[-1] = -1

@shoyer
Copy link
Member Author

shoyer commented Jan 22, 2015

OK, I think we are in agreement that unique should not sort by default. I don't think that should be an option, either -- that's not found in any other unique methods, and it's better to support composability than to add flags.

Either it should return a Categorical (which is fine) or it can return an ndarray (like it currently does). In the later case, it's easy to sort by turning things back into a categorical if desired: Categorical(c.unique(), c.categories).order(). Given the ease of doing this and the advantages of uniform signatures, I suppose we should probably stick with ndarray.

@TomAugspurger
Copy link
Contributor

+1 for no sort option and +1 for returning an ndarray.

On Thu, Jan 22, 2015 at 12:31 PM, Stephan Hoyer notifications@github.com
wrote:

OK, I think we are in agreement that unique should not sort by default. I
don't think that should be an option, either -- that's not found in any
other unique methods, and it's better to support composability than to add
flags.

Either it should return a Categorical (which is fine) or it can return an
ndarray (like it currently does). In the later case, it's easy to sort by
turning things back into a categorical if desired: Categorical(c.unique(),
c.categories).order(). Given the ease of doing this and the advantages of
uniform signatures, I suppose we should probably stick with ndarray.


Reply to this email directly or view it on GitHub
#9331 (comment).

@jorisvandenbossche
Copy link
Member

Also +1 for no sorting at all (ordered or unordered)

For the return value: Series.values also returns a Categorical instead of an array, so in that regard it would be somewhat consistent within the 'categorical series' to let Series.unique return a Categorical, but of course not consistent within unique

@shoyer shoyer changed the title ENH: don't sort unique values from unordered categoricals BUG: don't sort unique values from categoricals Jan 23, 2015
@shoyer shoyer added Bug and removed Enhancement labels Jan 23, 2015
@shoyer
Copy link
Member Author

shoyer commented Jan 23, 2015

OK, I went for the simple route of making unique never sort and return an ndarray.

In principle, returning a categorical from unique is the right idea -- that would be the right decision if categorical was a real numpy dtype. But with categorical being mostly but not fully ndarray-like, it just causes more hassle for now.

PS @stevesimmons you should setup your editor to trim trailing whitespace. It's the source of some spurious changes in this PR. Not a big deal, just something to keep in mind for next time.

if unique_codes[0] == -1:
unique_codes[0:-1] = unique_codes[1:]
unique_codes[-1] = -1
from pandas.core.nanops import unique1d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still need the nan insertion otherwise the -1 code would have meaning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback This is fed into take_1d which fills -1 with fill_value... which is happily exactly what we want here to handle NaN. That behavior is unchanged from before (and still tested). So I think this is OK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh yes that's right
ok then

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afaik unique sorts nan last...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JanSchulz Nope, unique1d does not sort NaN last. I modified the test involving NaNs to make sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, s.unique() also not. Seems that sorting an nan handling is only done in numpy...

s = pd.Series([1,2,3,4,5,np.nan,6,1,2,3,4])
s.unique()
array([  1.,   2.,   3.,   4.,   5.,  nan,   6.])

Sorry for the noise...

@shoyer
Copy link
Member Author

shoyer commented Jan 23, 2015

Just realized we don't actually guarantee ordering for unique values anywhere in general. Made a new issue about that: #9346.

This should resolve the inconsistency mwaskom reported in GH9148.

CC jreback TomAugspurger JanSchulz
@shoyer
Copy link
Member Author

shoyer commented Feb 12, 2015

OK, I updated the change note based on @JanSchulz's suggestion.

Any other comments before I merge this? I am inclined to stay with ndarray output for this change, though that is by no means a final decision -- we could change this to Categorical in the future if someone makes a case for that.

@jreback
Copy link
Contributor

jreback commented Feb 12, 2015

yep this is consistent with the rest of pandas so lgtm

@jorisvandenbossche
Copy link
Member

ok for me!
(only small side note that your formulation in the docstring does depend on the outcome of #9346)

shoyer added a commit that referenced this pull request Feb 13, 2015
BUG: don't sort unique values from categoricals
@shoyer shoyer merged commit e266c3d into pandas-dev:master Feb 13, 2015
@shoyer shoyer deleted the categorical-unique-order branch February 13, 2015 00:42
@shoyer
Copy link
Member Author

shoyer commented Feb 13, 2015

Merged.

@jorisvandenbossche agreed, may need to revisit this following #9346 (which will hopefully be before 0.16)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants