New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Should factorize(categorical) return a Categorical for uniques? #19721

Open
TomAugspurger opened this Issue Feb 15, 2018 · 14 comments

Comments

Projects
None yet
3 participants
@TomAugspurger
Contributor

TomAugspurger commented Feb 15, 2018

Returning a categorical feels more natural to me

In [11]: pd.factorize(pd.Categorical(['a', 'a', 'c']))
Out[11]: (array([0, 0, 1]), array([0, 1]))

That's kind of what we do for a DatetimeIndex with TZ:

In [10]: pd.factorize(pd.Series(pd.DatetimeIndex(['2017', '2017'], tz='US/Eastern')))
Out[10]:
(array([0, 0]),
 DatetimeIndex(['2017-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None))

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Feb 15, 2018

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Feb 16, 2018

Shouldn't it rather give the same as:

In [17]: pd.factorize(['a', 'a', 'c'])
Out[17]: (array([0, 0, 1]), array(['a', 'c'], dtype=object))

(the fact that it returns [0, 1] as the unique labels clearly is a bug I think, it seems to be factorizing the codes)

When factorizing a Categorical, I would expect to get back (codes, categories), not (codes, categorical)

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Feb 16, 2018

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Feb 16, 2018

Some other weirdness

In [1]: import pandas as pd

In [2]: pd.factorize(pd.Categorical(['a', 'a', 'b']))
Out[2]: (array([0, 0, 1]), array([0, 1]))

In [3]: pd.factorize(pd.Index(pd.Categorical(['a', 'a', 'b'])))
Out[3]:
(array([0, 0, 1]),
 CategoricalIndex([nan, nan], categories=['a', 'b'], ordered=False, dtype='category'))

In [4]: pd.factorize(pd.Series(pd.Categorical(['a', 'a', 'b'])))
Out[4]: (array([0, 0, 1]), Int64Index([0, 1], dtype='int64'))

Out[3] should be CategoricalIndex(['a', 'b']) or Categorical(['a', 'b']), or ndarray(['a', 'b']),

Out[4] should be similar.

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Feb 16, 2018

In general, I think the type of the uniques should be the same array-like as the input (or in case of Series/Index, the array-like that is backing the Series).
So eg for DatetimeIndex with tz or Series of such values, this should in the future give a DatetimeTZArray (and that it now gives a DatetimeIndex is then logical I think). For numerical Series/Index/array, it is a normal numpy array.
So in the future for a certain extension type (Index with such values, Series with such values of Array class itself), would return an extension array of its type?

But, I would personally deviate from this rule for Categoricals, and in those cases return the array-like of which the categories are made up, instead of a Categorical itself.

@jreback

This comment has been minimized.

Contributor

jreback commented Feb 16, 2018

the return values now are all as expected
by definition factorizing gives u a categorical return data ie codes and uniques (categories); further the categories are all array like

@jreback

This comment has been minimized.

Contributor

jreback commented Feb 16, 2018

hmm actually these do look a bit odd
factorizing should be in the categories not the codes

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Feb 16, 2018

the return values now are all as expected
by definition factorizing gives u a categorical return data ie codes and uniques (categories); further the categories are all array like

Jeff, did you look at the examples Tom showed? It is clearly not returning the " uniques (categories)" in case of categorical input.

And the general question is not only about the current behaviour, but what to do for the new extension arrays

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Feb 16, 2018

Cross posted :)

Let's limit it to just fixing

In [2]: pd.factorize(pd.Series(pd.Categorical(['a', 'a'])))
Out[2]: (array([0, 0]), Int64Index([0], dtype='int64'))

to return the unique categories.

Out[2]: (array([0, 0]), array(['a']))
@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Feb 16, 2018

Yes, agreed the above should be fixed.

But, I think we still need to discuss what it should do for ExtensionArrays.
Currently the docstring says that ndarray is returned for arrays, and Index for Index/Series, which is also what it does:

In [74]: for vals in [np.array([0, 1, 2]), np.array(['a', 'b', 'c']), pd.DatetimeIndex([1, 2, 3], tz='UTC'), pd.Categorical(['a', 'b', 'a'])]:
    ...:     for cont in [lambda x: x, pd.Series, pd.Index]:
    ...:         obj = cont(vals)
    ...:         codes, uniques = pd.factorize(obj)
    ...:         print(vals.dtype, type(obj).__name__, '  ->  ', type(uniques).__name__)
    ...:         
    ...:         
int64 ndarray   ->   ndarray
int64 Series   ->   Int64Index
int64 Int64Index   ->   Int64Index
<U1 ndarray   ->   ndarray
<U1 Series   ->   Index
<U1 Index   ->   Index
datetime64[ns, UTC] DatetimeIndex   ->   DatetimeIndex  # this deviation is normal since I used index as input (since there is no array-like yet)
datetime64[ns, UTC] Series   ->   DatetimeIndex
datetime64[ns, UTC] DatetimeIndex   ->   DatetimeIndex
category Categorical   ->   ndarray
category Series   ->   Int64Index
category CategoricalIndex   ->   CategoricalIndex

But, for a new ExtensionArray, I don't think we want to return an Index? As for now I think we will not yet support an extension array to be stored in an Index? Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Feb 16, 2018

But, for a new ExtensionArray, I don't think we want to return an Index? As for now I think we will not yet support an extension array to be stored in an Index? Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

Although, of course, in the idea that those unique values will typically be not that many in number (or at least much shorter than the input array), it is maybe not a problem to have a materialized ExtensionArray stored in Index.
(on the other hand, I think array-likes would be a more logical return value here, and I think we only return Index because of the absence of arrays likes for all types up to now)

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Feb 16, 2018

As for now I think we will not yet support an extension array to be stored in an Index?

Correct, though there was interest in that on Twitter.

Shouldn't it rather be a instance of the ExtensionArray itself (holding the unique values)?

I see use-cases for either ndarray[object] or ExtensionArray, since they support different operations. Does this warrant a new keyword to factorize?

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Feb 27, 2018

I'm revisiting this now. I think that the rule should be that uniques has the same type as the input (except Series -> Index)

  • factorize(ndarray)[1] -> ndarray
  • factorize(ExtensionArray)[1] -> ExtensionArray
  • factorize(Serise/Indxe)[1] -> Index

In particular, we'd change pd.factorize(Categorical) to return a Categorical for uniques.

Consider the case of unobserved categories. I'd like to treat these two arrays differently

In [6]: c1 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b'])

In [7]: c2 = pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c'])

If we return an ndarray for uniques, these will be the same:

In [8]: pd.factorize(c1)[1]
Out[8]: array(['a', 'b'], dtype=object)

In [9]: pd.factorize(c2)[1]
Out[9]: array(['a', 'b'], dtype=object)

But if we return a Categorical, we can preserve the unobsesrved categories in uniques.dtype. That also would avoid the somewhat strange situation when moving to a CategoricalIndex

In [10]: pd.factorize(pd.CategoricalIndex(c1))[1]
Out[10]: CategoricalIndex(['a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

In [11]: pd.factorize(pd.CategoricalIndex(c2))[1]
Out[11]: CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

It seems strange to be that the result dtype would change based on the container (to be clear, I'm OK with the output container changing from Categorical -> CategoricalIndex. But I'd like pd.factorize(cat)[1].dtype to match pd.factorize(cat_index)[1].dtype).)

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Feb 27, 2018

I still think that factorize(Series[ExtensionArray])[1] -> ExtensionArray might make more sense than returning an Index.

It seems strange to be that the result dtype would change based on the container (to be clear, I'm OK with the output container changing from Categorical -> CategoricalIndex. But I'd like pd.factorize(cat)[1].dtype to match pd.factorize(cat_index)[1].dtype).)

Agreed that the dtype should be the same.
But here, I also find it strange why factorize(categorical) would return a Categorical, while factorize(Series[categorical]) would return CategoricalIndex. I mean, I don't see any justification or logic for this difference? (apart from historical reasons, which might be enough to keep it like that and do the same for ExtensionArrays ..)

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Feb 27, 2018

apart from historical reasons

100% agreed. If I were starting from scratch I would say factorize(series)[1] -> array (ndarray or ExtensionArray). I'm just not sure that changing it is worthwhile. And why stop there? Shouldn't factorize(index) return an array? Index and Series are both just containers, so why box one and not the other?

I'll think about if there's a deprecation path, and if we even want to go down this path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment