Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Categorical type #16015

Merged
merged 7 commits into from
Sep 23, 2017
Merged

Categorical type #16015

merged 7 commits into from
Sep 23, 2017

Conversation

TomAugspurger
Copy link
Contributor

We extended the CategoricalDtype to accept optional categories and ordered
argument. CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

Dunno if we would want to do this for 0.20 or not. It ended up not being too big a change, so maybe.

@jreback
Copy link
Contributor

jreback commented Apr 16, 2017

.dtype on a Categorical/CI should construct this

Categorica.is_dtype_equal should just work

@jreback
Copy link
Contributor

jreback commented Apr 16, 2017

can u run asv on this to make sure nothing much changing perf wise

@TomAugspurger
Copy link
Contributor Author

.dtype on a Categorical/CI should construct this

Ha, that's why it felt like so much less code this time, because I forgot to do the other half :)

I'll work on this a bit more today, but will probably push to 0.21

@jreback
Copy link
Contributor

jreback commented Apr 16, 2017

yeah this feels like it should be in 0.21.0. maybe we also expose this in pandas.api.types.CategoricalDtype rather than at the top-level.

@@ -131,6 +171,33 @@ def __eq__(self, other):

return isinstance(other, CategoricalDtype)

@staticmethod
def _hash_categories(categories):
from pandas.tools.hashing import hash_array, _combine_hash_arrays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just usehash_pandas_object. Its an ordered hash of the data.

@@ -356,6 +357,70 @@ def test_not_string(self):
self.assertFalse(is_string_dtype(PeriodDtype('D')))


class TestCategoricalDtypeParametrized(object):
Copy link
Member

@gfyoung gfyoung Apr 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just call this TestCategoricalDtype (naming consistency)

result = pd.Series(['a', 'b'], dtype=pd.CategoricalDtype(['b', 'a']))
assert is_categorical_dtype(result)
tm.assert_index_equal(result.cat.categories, pd.Index(['b', 'a']))
assert result.cat.ordered is False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's more idiomatic: assert ... is False or assert not ... ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are asserting that the result is a boolean (False) , then the first is more appropriate. If its just not True then the 2nd is fine (e.g. imagine 0 satisfies 2nd but not 1st). In this case we do want to assert a boolean result.

def test_categories(self):
result = CategoricalDtype(['a', 'b', 'c'])
tm.assert_index_equal(result.categories, pd.Index(['a', 'b', 'c']))
assert result.ordered is False
Copy link
Member

@gfyoung gfyoung Apr 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as here

@jreback jreback added the Categorical Categorical Data Type label Apr 17, 2017
@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Apr 17, 2017
jreback added a commit to jreback/pandas that referenced this pull request Apr 23, 2017
jreback added a commit to jreback/pandas that referenced this pull request Apr 24, 2017
jreback added a commit that referenced this pull request Apr 24, 2017
* API: expose dtypes in pandas.api.types

xref #16015
xref apache/arrow#585
xref #16042
xref #15541 (comment)
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
@pep8speaks
Copy link

pep8speaks commented Jul 17, 2017

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on September 23, 2017 at 16:34 Hours UTC

@@ -618,3 +618,8 @@ def test_categorical_equality(self, ordered, other, expected):
c1 = CategoricalDtype(['a', 'b'], ordered)
result = c1 == other
assert result == expected

def test_mixed(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to fail at the moment. hash_object_array tries to work on the mixed array first, and if that fails it falls back to hash_object_array(a.astype(str)), which will cause a "collision" with the section set of categoricals. Not sure how I'll handle this yet.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. I don't like the MultiIndex change (already commened on that PR). instead I would simply just change it (rather than adding a kw); the small api break is fine.

CategoricalDtype
----------------

.. versionadded:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionchanged; it exists now.

@@ -6,6 +6,7 @@

from pandas.core.algorithms import factorize, unique, value_counts
from pandas.core.dtypes.missing import isna, isnull, notna, notnull
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think exposing this is a good idea here. We don't directly do this with other dtypes. instead the are in pandas.api.types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I could go either way on this. It would be the only type at the top level. On the other hand pd.api.types.CategoricalDtype(values) is quite long :)

I'll think about it some more, but you're probably right.

dtype = CategoricalDtype()
categories = getattr(dtype, 'categories', None)
ordered = getattr(dtype, 'ordered', False)
dtype = CategoricalDtype(categories=categories, ordered=ordered)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not really necessary if you allow CategoricalDtype(dtype) IOW, you can check whether the incoming categories is an instance of a CagetegoricalDtype and act accordingly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like overloading the categories like that.

Type for categorical data with the categories and orderedness,
but not the values.

.. versionadded:: 0.20.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionchanged

Examples
--------
>>> t = CategoricalDtype(categories=['b', 'a'], ordered=True)
>>> s = Series(['a', 'a', 'b', 'b', 'a'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't show anything not related to the actual dtype here.


def __new__(cls, categories=None, ordered=False, fastpath=False):
from pandas.core.indexes.base import Index
if categories is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the import inside the function

# validation
cls._validate_categories(categories, fastpath=fastpath)
cls._validate_ordered(ordered)
# We have a choice when hashing pandas unordered categoricals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use blank lines to indicate different thoughts


def __new__(cls, categories=None, ordered=False, fastpath=False):
from pandas.core.indexes.base import Index
if categories is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

categories could be a CategoricalDtype (see my comment above)

@TomAugspurger TomAugspurger force-pushed the categorical-type branch 2 times, most recently from d5eb041 to f069158 Compare August 18, 2017 17:39
@jorisvandenbossche
Copy link
Member

@TomAugspurger just to check: this is ready for review?

@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche Agreed about not having a fastpath in the public API. see 91752fe

Initially we used __new__ for initialization, since we cached the instance (even though it didn't take parameters). That commit cleans things up a bit.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look good. a few changes.

@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
Categorical
~~~~~~~~~~~

If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
.. autoclass:: api.types.CategoricalDtype
:members: categories, ordered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, we should use the auto* things more readily in other places. Maybe make an issue about this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we in general need to do this, as for most functions/methods, we already have the generated pages to link to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did this instead of autosummary since there are a bunch of unrelated methods that are just there for NumPy duck-typing.


A categorical's type is fully described by

1. its categories: a sequence of unique values and no missing values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe show as

1. ``categories``: ....
2. ``ordered``: .....

as these are the actual kwargs

expects a `dtype`. For example :func:`pandas.read_csv`,
:func:`pandas.DataFrame.astype`, or in the Series constructor.

As a convenience, you can use the string ``'category'`` in place of a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe put this in a warning/note as this is the backward compat aspect here. (and then refer to this in the whatsnew)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is back incompat about doing dtype='category' ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Jeff was saying that was the only way to do things in the past (backwards compatible). But I don't want to give the impression that we're thinking about deprecating dtype='category'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you didn't do that actual deprecation (or not yet)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you didn't do that actual deprecation (or not yet)?

I don't think this is desirable, 'category' is a nice convenience

Sorry, I was speaking about astype(dtype='category', categories=[...]). I suppose I misread Jeff's comment (as it was about backwards compatibility, I thought it was about that part, as this is the only part that might be back incompat, if we deprecate categories=.. (not dtype='category')

@@ -10,6 +10,8 @@ users upgrade to this version.
Highlights include:

- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data (:issue:`14711`, :issue:`15078`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add the link the sub-section (and then list the issues there instead)

expanded to include the ``categories`` and ``ordered`` attributes. A
``CategoricalDtype`` can be used to specify the set of categories and
orderedness of an array, independent of the data themselves. This can be useful,
e.g., when converting string data to a ``Categorical``:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this

data = self.categories._format_data(name=self.__class__.__name__)
return tpl.format(data, self.ordered)

def __repr__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

categories = list(categories) # breaks if a np.array of categories
cat_array = hash_tuples(categories)
else:
if categories.dtype == 'O':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

# we want to reuse self.dtype if possible, i.e. neither are
# overridden.
if dtype is not None and (categories is not None or
ordered is not None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

msg?

(pd.Categorical(['a', 'b']), CategoricalDtype(['a', 'b'])),
(pd.CategoricalIndex(['a', 'b']).dtype, CategoricalDtype(['a', 'b'])),
(pd.CategoricalIndex(['a', 'b']), CategoricalDtype(['a', 'b'])),
(CategoricalDtype(), CategoricalDtype()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave original pd.Categorical(['a', 'b']).dtype, CategoricalDtype()) which should still work

assert result == expected

def test_invalid_raises(self):
with tm.assert_raises_regex(TypeError, 'ordered'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might add a test for passing just a string for the first argument (should fail)
e.g.
CategoricalDtype('category') (see my comment about a from_string above, which we should add for compat.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 18, 2017 via email

@jreback
Copy link
Contributor

jreback commented Sep 18, 2017

Yes, but you didn't do that actual deprecation (or not yet)?

I don't think this is desirable, 'category' is a nice convenience

We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
Get a valid instance of `CategoricalDtype` as early as possible, and use that
throughout.
@TomAugspurger
Copy link
Contributor Author

416d1d7 has the changes to Categorical.__init__ that you requested @jreback. It's a bit messy since there are basically 3 ways for specifying the categories

  1. passing dtype=CategoricalDtype(...)
  2. passing categories=...
  3. passing a Categorical for values

I think all three are useful and (hopefully) well-tested now. I'll update the docstring to indicate the priority (basically moving that comment to the docstring)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple of comments, which can be put in a followup. let's merge on green.


Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
and since all instances ``CategoricalDtype`` compare equal to ``'`category'``,
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you should say CategoricalDtype(None, False) at the end.

categories = list(categories) # breaks if a np.array of categories
cat_array = hash_tuples(categories)
else:
if categories.dtype == 'O':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neh hash collisions basically don't happen in the space of things we are doing (normally). however you can collide by construction, e.g. code doing something wrong..

In any event we should move this code (can be a followup, maybe make an issue for things to do)

return hash(self) == hash(other)

def __unicode__(self):
tpl = u'CategoricalDtype(categories={}ordered={})'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have this hacky thing in Categorical to repr only 10 categories. CategoricalIndex actually correctly uses the option max_categories, so should do that here (again can do as a followup)

tm.assert_index_equal(result.categories, expected)
assert result.ordered is False

def test_constructor_tuples_datetimes(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is janky. I don't think we should actually allow it. Where did this example come from? So we allow tuples as categories, I guess so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was extracted from a groupby test that was failing when I improperly handled tuples of level-1. Agreed that it's weird, but I'm inclined to support it.

expanded to include the ``categories`` and ``ordered`` attributes. A
``CategoricalDtype`` can be used to specify the set of categories and
orderedness of an array, independent of the data themselves. This can be useful,
e.g., when converting string data to a ``Categorical`` (:issue:`14711`, :issue:`15078`):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is 1 more issue you listed in the top of the PR

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 23, 2017 via email

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Sep 23, 2017
@jreback jreback merged commit e57f189 into pandas-dev:master Sep 23, 2017
@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

nice patch @TomAugspurger !

@TomAugspurger TomAugspurger deleted the categorical-type branch September 24, 2017 14:00
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants