[WIP]API: CategoricalType for specifying categoricals #14698

Closed

@TomAugspurger TomAugspurger commented Nov 20, 2016

closes #14676

This adds a new top-level class pd.CategoricalType for representing the type information on a Categorical, without the values. Users can pass this anywhere they would previously pass 'category' if they need to specify non-default values for categories or ordered (you can still use 'category' of course):

In [1]: from pandas import *

In [2]: paste
    >>> t = CategoricalType(categories=['b', 'a'], ordered=True)
    >>> s = Series(['a', 'a', 'b', 'b', 'a'])
    >>> c = s.astype(t)
    >>> c

## -- End pasted text --
Out[2]:
0    a
1    a
2    b
3    b
4    a
dtype: category
Categories (2, object): [b < a]

This is the simplest possible change for now. A bigger change would be to make c.dtype return a CategoricalType instead of 'category'. We could probably do that in a way that's backwards compatible, but I'll need to think on it a bit.
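
To make "type information without the values" concrete, here is a minimal pure-Python sketch. It assumes the usual categorical layout, where values are stored as integer positions into the categories and -1 marks a value that is not a category. The `encode` helper is hypothetical and is not part of pandas or of this PR.

```python
# Hypothetical helper, not a pandas API: illustrates how the type
# (categories) is kept separate from the values (stored as codes).
def encode(values, categories):
    """Map each value to its position in `categories`; -1 if absent."""
    positions = {cat: i for i, cat in enumerate(categories)}
    return [positions.get(v, -1) for v in values]


categories = ('b', 'a')  # the type information; order matters if ordered=True
values = ['a', 'a', 'b', 'b', 'a']
codes = encode(values, categories)
print(codes)  # [1, 1, 0, 0, 1]
```

The same `categories` tuple can be reused to encode any number of value sequences, which is the role the proposed type object plays in `astype`.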

Other places this should work

Implementation-wise, I still need to write documentation, add more tests, and clean up some things like the repr. But I wanted to get this out there for discussion. @JanSchulz you might be interested.

@TomAugspurger TomAugspurger added the API Design, Categorical (Categorical Data Type), and Dtype Conversions (Unexpected or buggy dtype conversions) labels Nov 20, 2016
@TomAugspurger TomAugspurger added this to the 0.20.0 milestone Nov 20, 2016


class CategoricalType(CategoricalDtype):
"""
Contributor

this should be in with the rest of the dtypes

to be honest i wouldn't create this; just add it optionally into the existing

Contributor Author

Is that OK with the caching and stuff that is done on CategoricalDtype? This is on CategoricalDtype:

    def __new__(cls):

        try:
            return cls._cache[cls.name]
        except KeyError:
            c = object.__new__(cls)
            cls._cache[cls.name] = c
            return c

I haven't messed with extension types much. We could make the keys of that internal dict reflect the categories and ordered attributes.

Contributor

yes, you would add optional attributes and cache based on them

Contributor Author

K, this seems to be working. Just have to convert the categories to a tuple so that they're hashable. Thanks.
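
A minimal sketch of the cache keyed on categories and ordered, as discussed above. The class name `CachedDtype` is a hypothetical stand-in, and only the caching is modelled; it is not the code from this PR.

```python
class CachedDtype:
    """Hypothetical stand-in for CategoricalDtype; models caching only."""
    name = 'category'
    _cache = {}

    def __new__(cls, categories=None, ordered=False):
        # Lists are unhashable, so convert to a tuple before building the key.
        categories_ = categories if categories is None else tuple(categories)
        key = (cls.name, categories_, bool(ordered))
        try:
            return cls._cache[key]
        except KeyError:
            c = object.__new__(cls)
            c.categories = categories_
            c.ordered = bool(ordered)
            cls._cache[key] = c
            return c


t1 = CachedDtype(['a', 'b'], ordered=True)
t2 = CachedDtype(['a', 'b'], ordered=True)
t3 = CachedDtype(['a', 'c'])
print(t1 is t2)                        # True: same key, same cached object
print(t1 is t3)                        # False: different categories
print(CachedDtype() is CachedDtype())  # True: the parameterless singleton
```

Calling the class with no arguments still returns a single shared instance, so the behaviour of the original `__new__` is preserved for the plain `'category'` case.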

@TomAugspurger
Contributor Author

What would the ideal __repr__ of CategoricalDtype be? Currently it's just category. For places like DataFrame.info it's probably best to keep that as is, but that's not as useful if you're inspecting CategoricalDtype itself.

Also, for equality semantics, this is what I have right now:

    An instance of ``CategoricalDtype`` compares equal with any other
    instance of ``CategoricalDtype``, regardless of categories or ordered.
    In addition they compare equal to the string ``'category'``.
    To check whether two instances of a ``CategoricalDtype`` match,
    use the ``is`` operator.

    >>> t1 = CategoricalDtype(['a', 'b'], ordered=True)
    >>> t2 = CategoricalDtype(['a', 'c'], ordered=False)
    >>> t1 == t2
    True
    >>> t1 == 'category'
    True
    >>> t1 is t2
    False
    >>> t1 is CategoricalDtype(['a', 'b'], ordered=True)
    True

though I don't expect people to be working with these objects that much.
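
The equality rules described above could be sketched as follows. `EqDtype` is a hypothetical stand-in that models only `==`, hashing, and identity via the cache; it is an assumption about how such semantics might be written, not the PR's implementation.

```python
class EqDtype:
    """Hypothetical stand-in; models only the equality rules above."""
    name = 'category'
    _cache = {}

    def __new__(cls, categories=None, ordered=False):
        key = (None if categories is None else tuple(categories), bool(ordered))
        if key not in cls._cache:
            cls._cache[key] = object.__new__(cls)
        return cls._cache[key]

    def __eq__(self, other):
        # Equal to the string 'category' and to any other instance,
        # regardless of categories or ordered.
        if isinstance(other, str):
            return other == self.name
        return isinstance(other, EqDtype)

    def __hash__(self):
        # Objects that compare equal must hash equal, so hash on the name only.
        return hash(self.name)


t1 = EqDtype(['a', 'b'], ordered=True)
t2 = EqDtype(['a', 'c'], ordered=False)
print(t1 == t2)          # True
print(t1 == 'category')  # True
print(t1 is t2)          # False
print(len({t1, t2}))     # 1
```

One consequence of these semantics is visible in the last line: because all instances compare equal, a set or dict cannot tell differently parameterized instances apart.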

Type for categorical data with the categories and orderedness,
but not the values

.. versionadded:: 0.20.0
Contributor

this has been around for quite some time, you are adding parameter support in 0.20.0


Parameters
----------
categories : list or None
Contributor

list-like; use wording similar to what's in Categorical now


Examples
--------
>>> t = CategoricalDtype(categories=['b', 'a'], ordered=True)
Contributor

these examples are not relevant. This should not include Series. This is a self-contained type.

     name = 'category'
     type = CategoricalDtypeType
     kind = 'O'
     str = '|O08'
     base = np.dtype('O')
     _cache = {}

-    def __new__(cls):
+    def __new__(cls, categories=None, ordered=False):
+        categories_ = categories if categories is None else tuple(categories)
Contributor

this needs all of the validation logic (from Categorical) on the actual categories.
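
The review comment does not spell out which checks should move over. As a rough sketch, assuming the checks are "no null categories" and "categories must be unique", validation at construction time might look like this. `validate_categories` and its error messages are hypothetical.

```python
import math


def validate_categories(categories):
    """Hypothetical helper: return categories as a tuple or raise ValueError.

    Assumed rules (not confirmed by the PR): no nulls, no duplicates.
    """
    categories = tuple(categories)
    for cat in categories:
        is_nan = isinstance(cat, float) and math.isnan(cat)
        if cat is None or is_nan:
            raise ValueError("categories cannot be null")
    if len(set(categories)) != len(categories):
        raise ValueError("categories must be unique")
    return categories


print(validate_categories(['b', 'a']))  # ('b', 'a')
```

Running the checks before the cache lookup would keep invalid category sets from ever being stored in `_cache`.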

@jreback
Contributor

jreback commented Nov 21, 2016

you need to move over the repr (of the categories) and the validation logic on the categories to this class.

Further Categorical should use this internally for storage of categories / ordered.

This is a fairly invasive change and needs some thought.

@jreback
Contributor

jreback commented Nov 21, 2016

I would split off the actual issue change and make that a follow-up PR. Just putting the correct infrastructure in place can be the scope of this PR.

@TomAugspurger
Contributor Author

> This is a fairly invasive change and needs some thought.

Yep. I had hoped to do the minimal change of just providing a new API, but let's do it right the first time. I'll spend some time on this the next couple weeks and ping when I have something.

@TomAugspurger
Contributor Author

Closing for now. Will reopen a new PR on top of a PR fixing #14711

Development

Successfully merging this pull request may close these issues.

API: Expand DataFrame.astype to allow Categorical(categories, ordered)
2 participants