Make categories and ordered part of CategoricalDtype #14711

TomAugspurger · 2016-11-22T14:09:25Z

This is to discuss pushing the Categorical.categories and
Categorical.ordered information into the extension type CategoricalDtype.

pd.CategoricalDtype(categories, ordered=False)

Note that there is no values argument. This is a type constructor that
isn't attached to any specific Categorical instance.

Why?

In several places (read_csv(..., dtype=...), .astype(...), Series([], dtype=...))
we accept dtype='category', which takes the values
in the method (the series, the column from the CSV, etc.)
and hands them off to the value constructor, with no control over the
categories and ordered arguments.

Categorical(values, categories=None, ordered=False)

The proposal here would add the categories and ordered
attributes / arguments to CategoricalDtype and provide a common API
for specifying non-default parameters for the Categorical constructor
in methods like read_csv, astype, etc.

t = pd.CategoricalDtype(['low', 'med', 'high'], ordered=True)
pd.read_csv('foo.csv', dtype={'A': int, 'B': t})
pd.Series(['high', 'low', 'high'], dtype=t)

s = pd.Series(['high', 'low', 'high'])
s.astype(t)

We would continue to accept dtype='category'.

This becomes even more important when doing operations on larger-than-memory datasets with something like dask or even read_csv(..., chunksize=N). Right now you don't have an easy way to specify the categories or ordered for columns (assuming you know them ahead of time).
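To make the chunked case concrete, here is a minimal pure-Python sketch of why per-chunk inference is a problem. The `encode` helper is hypothetical and only mimics the idea of integer codes plus a category list; it is not how pandas does it.

```python
def encode(values, categories=None):
    """Map values to integer codes, inferring categories when none are given."""
    if categories is None:
        # Inference only sees the values present in this chunk.
        categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    # -1 marks a value that is not among the categories.
    return [position.get(v, -1) for v in values], list(categories)


chunk1 = ['low', 'high']
chunk2 = ['med', 'high']

# Per-chunk inference: each chunk ends up with its own category list,
# so the same code can mean different things in different chunks.
codes1, cats1 = encode(chunk1)
codes2, cats2 = encode(chunk2)

# Categories fixed ahead of time: codes are comparable across chunks.
known = ['low', 'med', 'high']
fixed1, _ = encode(chunk1, known)
fixed2, _ = encode(chunk2, known)
```

With inference, both chunks produce the codes [1, 0] even though code 1 stands for 'low' in the first chunk and 'med' in the second. With the categories given up front, the codes agree on their meaning.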

Issues

CategoricalDtype currently isn't part of the public API. Which methods
on it do we make public?
Equality semantics: For backwards compat, I think all instances
of CategoricalDtype should compare equal with all others. You can use
identity to check if two types are the same

t1 = pd.CategoricalDtype(['a', 'b'], ordered=True)
t2 = pd.CategoricalDtype(['a', 'b'], ordered=False)

t1 == t2  # True
t1 is t2  # False
t1 is t1  # True

Should the categories argument be required? Currently dtype='category'
says 1.) infer the categories based on the values, and 2.) it's unordered.
Would CategoricalDtype(None, ordered=False) be allowed?
Strictness? If I say

pd.Series(['a', 'b', 'c'], dtype=pd.CategoricalDtype(['a', 'b']))

What happens? I would probably expect a TypeError or ValueError as 'c'
isn't "supposed" to be there. Or do we replace 'c' with NA? Should
strict be another parameter to CategoricalDtype? (I don't think so.)
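The equality proposal above can be sketched in plain Python. `DtypeSketch` is a hypothetical stand-in, not the pandas class; it only shows what "all instances compare equal, use identity to tell them apart" would imply, including for hashing.

```python
class DtypeSketch:
    """Toy stand-in for a parametrized dtype (not the pandas class)."""

    def __init__(self, categories, ordered=False):
        self.categories = tuple(categories)
        self.ordered = ordered

    def __eq__(self, other):
        # Backwards-compatible rule from the proposal: any two instances
        # compare equal, regardless of their parameters.
        return isinstance(other, DtypeSketch)

    def __hash__(self):
        # Objects that compare equal must also hash equal.
        return hash(DtypeSketch)


t1 = DtypeSketch(['a', 'b'], ordered=True)
t2 = DtypeSketch(['a', 'b'], ordered=False)

same_by_equality = (t1 == t2)   # parameters are ignored
same_by_identity = (t1 is t2)   # distinct objects
distinct_in_set = len({t1, t2})  # equal + same hash collapses them
```

One consequence worth noting: under this rule a set or dict cannot tell the two types apart, so any cache keyed on the dtype itself would need a different key.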

I'm willing to work on this over the next couple weeks.

xref #14676 (astype)
xref #14503 (read_csv)

TomAugspurger · 2016-11-22T14:12:50Z

cc @JanSchulz (let me know if you want me to stop pinging you on these categorical issues 😄 ). Also @mrocklin I think this may interest you as the fastparquet stuff could use CategoricalDtype since you know the categories ahead of time.

jankatins · 2016-11-22T19:52:57Z

Will this also mean that Categorical is changed to save the categories/order information in the dtype object? As pandas2 seems to go the "we have our own dtypes" way, that would seem sensible (if the same also happens with the rest of the dtypes).

[Thanks for pinging me!]

TomAugspurger · 2016-11-22T20:10:58Z

Will this also mean that Categorical is changed to save the categories/order information in the dtype object?

Yeah, pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'], ordered=True).dtype would be CategoricalDtype(['a', 'b', 'c'], ordered=True).

Dunno if we want to expand the pd.Categorical constructor to take a dtype=CategoricalDtype option. Perhaps.
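As a rough illustration of the idea (not the pandas implementation), a container could keep its categories and ordered flag only on a dtype object and optionally accept that object directly. `ToyDtype` and `ToyCategorical` are hypothetical names.

```python
class ToyDtype:
    """Hypothetical dtype holding the categorical's parameters."""

    def __init__(self, categories, ordered=False):
        self.categories = tuple(categories)
        self.ordered = ordered


class ToyCategorical:
    """Hypothetical container that delegates its parameters to the dtype."""

    def __init__(self, values, categories=None, ordered=False, dtype=None):
        if dtype is None:
            if categories is None:
                # Default behaviour: infer categories from the values.
                categories = sorted(set(values))
            dtype = ToyDtype(categories, ordered)
        # When a dtype is passed, it wins over categories/ordered.
        self.dtype = dtype
        position = {c: i for i, c in enumerate(dtype.categories)}
        self.codes = [position.get(v, -1) for v in values]

    @property
    def categories(self):
        return self.dtype.categories

    @property
    def ordered(self):
        return self.dtype.ordered


c = ToyCategorical(['a', 'b'], categories=['a', 'b', 'c'], ordered=True)
# Reuse the same type for other values via the dtype argument.
c2 = ToyCategorical(['c', 'a'], dtype=c.dtype)
```

Here c.dtype carries ('a', 'b', 'c') and ordered=True, and c2 shares that exact dtype object instead of re-specifying the parameters.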

jreback · 2016-11-22T23:51:26Z

yes, this was the original plan when I first did this, but too many moving parts.

If this can be implemented in a back-compat manner (as this is pretty much an implementation detail), I think it would be great.

jorisvandenbossche · 2016-11-23T08:54:39Z

CategoricalDtype currently isn't part of the public API. Which methods on it do we make public?

Specifically for categorical, I suppose we would have categories and ordered attributes? But for the existing attributes/methods, maybe to start with we could say in the docs that none of these are public? (Not sure if there are any that you would want to use.)

Equality semantics: For backwards compat, I think all instances of CategoricalDtype should compare equal with all others. You can use identity to check if two types are the same

Sounds OK

Should the categories argument be required? Would CategoricalDtype(None, ordered=False) be allowed?

I would not allow this to start with. Is the idea that this way you can let the categories be inferred from the values, but set the ordered attribute through the dtype? Or what would be the application?

Strictness? I would probably expect a TypeError or ValueError as c isn't "supposed" to be there. Or do we replace 'c' with NA?

It seems that we currently replace with NA:

In [53]: pd.Categorical(['a', 'b', 'c'], categories=['a', 'b'])
Out[53]: 
[a, b, NaN]
Categories (2, object): [a, b]

In [54]: pd.Series(['a', 'b', 'c']).astype('category', categories=['a', 'b'])
Out[54]: 
0      a
1      b
2    NaN
dtype: category
Categories (2, object): [a, b]

although I also think I would like a TypeError more as the default behaviour.
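The two candidate behaviours can be put side by side in a small pure-Python sketch. The `coerce_to_categories` function and its `strict` flag are hypothetical and exist only to contrast the options; the thread has not settled on exposing such a parameter.

```python
def coerce_to_categories(values, categories, strict=False):
    """Keep values that are valid categories; handle the rest per `strict`."""
    allowed = set(categories)
    unknown = [v for v in values if v not in allowed]
    if strict and unknown:
        # Strict option: refuse values outside the declared categories.
        raise ValueError(f"values not in categories: {unknown!r}")
    # Lenient option (the current behaviour shown above): replace with NA,
    # represented here by None.
    return [v if v in allowed else None for v in values]


lenient = coerce_to_categories(['a', 'b', 'c'], ['a', 'b'])

try:
    coerce_to_categories(['a', 'b', 'c'], ['a', 'b'], strict=True)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
```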

jorisvandenbossche · 2016-11-23T12:25:07Z

@TomAugspurger How is this issue different from #14676, can that one be closed?

TomAugspurger · 2016-11-24T00:35:20Z

That one will be a trivial change to .astype once this is fixed. I think leave it open as a reminder to implement it once the CategoricalDtype changes are done.

TomAugspurger · 2016-11-30T22:46:37Z

Small update for future reference. I'm having trouble with the current approach where CategoricalDtype[categories, ordered] is a singleton, i.e. one instance of CategoricalDtype for each combination of categories and ordered. The problem is _cache being a (weakref) dictionary. We need

CategoricalDtype([1, 2])

to be different from

CategoricalDtype([1., 2.])

but if we use a dictionary to do singleton stuff, the keys compare equal so we only ever get one. @jreback do you have any links to info on what-all requirements the numpy duck-type has to satisfy?
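The key collision is easy to reproduce in plain Python, because 1 == 1.0 and hash(1) == hash(1.0). The sketch below uses a regular dict in place of the weakref cache, and `ToyDtype`, `get_dtype`, and `get_dtype_typed` are hypothetical names; including the element type in the key is just one possible workaround, not necessarily what pandas should do.

```python
class ToyDtype:
    """Hypothetical cached type object."""

    def __init__(self, categories, ordered):
        self.categories = list(categories)
        self.ordered = ordered


_cache = {}


def get_dtype(categories, ordered=False):
    # Naive key: tuples of ints and equal-valued floats compare equal.
    key = (tuple(categories), ordered)
    if key not in _cache:
        _cache[key] = ToyDtype(categories, ordered)
    return _cache[key]


_typed_cache = {}


def get_dtype_typed(categories, ordered=False):
    # Workaround sketch: make the element type part of the key.
    key = (tuple((type(c).__name__, c) for c in categories), ordered)
    if key not in _typed_cache:
        _typed_cache[key] = ToyDtype(categories, ordered)
    return _typed_cache[key]


ints = get_dtype([1, 2])
floats = get_dtype([1.0, 2.0])  # silently returns the int-based instance
```

With the naive key, `floats` is the very same object as `ints` and still holds integer categories. With the typed key, the two requests yield separate instances.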

jreback · 2016-12-01T00:17:01Z

so the cache would need to be changed to something like a WeakDictionary and use keys like:

(c.categories, c.ordered)

BUT the categories are an Index which are not hashable. But prob could do something like:

# showing that ordering does matter (I am deliberately including the index, so we track ordering)
In [50]: c = Series(pd.Categorical([1, 2], ordered=True))

In [51]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[51]: 6505546918052763540

In [52]: c = Series(pd.Categorical([2, 1], ordered=True))

In [53]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[53]: 6135121269729399882

In [54]: c = Series(pd.Categorical([1., 2.], ordered=True))

In [55]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[55]: 6505546918052763540

# you would actually store the CategoricalDtype here (and not c)
In [58]: {(6505546918052763540, c.cat.ordered) : c}
Out[58]: 
{(6505546918052763540, True): 0    1.0
 1    2.0
 dtype: category
 Categories (2, float64): [1.0 < 2.0]}

cc @mikegraham

jreback · 2016-12-01T00:17:53Z

I think I may add an option to hash_pandas_object to directly produce this (if so desired) as this is a data hash for a pandas object.

jreback · 2016-12-01T00:28:59Z

jreback@908161b

mikegraham · 2016-12-01T06:05:20Z

Since an index's __eq__ does not return a bool, I'm not sure that having indices be directly hashable for use as dict keys makes sense to me. It seems to me maybe some proxy is in order to be able to do this...

Is it the case the Index order matters when the categorical is ordered=True and not when not?

jreback · 2016-12-01T11:59:39Z

this is not competing with __eq__; we actually use .equals to compare equivalents generally for an Index

furthermore this by default DOES consider ordering (a user could control this by index=False)
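Why mixing in the index makes an XOR-reduced hash order-sensitive can be shown without pandas. This is a sketch under assumptions: `h64` and `combined_hash` are hypothetical helpers built on hashlib, standing in for the per-row hashes and the bitwise_xor reduction used above.

```python
import hashlib
from functools import reduce
from operator import xor


def h64(obj):
    """Stable 64-bit hash of an object's repr (illustrative only)."""
    digest = hashlib.blake2b(repr(obj).encode('utf8'), digest_size=8).digest()
    return int.from_bytes(digest, 'little')


def combined_hash(values, include_position=True):
    """XOR-reduce per-element hashes, optionally mixing in each position."""
    if include_position:
        # Hash (position, value) pairs, analogous to including the index.
        items = list(enumerate(values))
    else:
        # Hash values alone; XOR then forgets the order entirely.
        items = list(values)
    return reduce(xor, (h64(item) for item in items), 0)


unordered_a = combined_hash([1, 2], include_position=False)
unordered_b = combined_hash([2, 1], include_position=False)
ordered_a = combined_hash([1, 2])
ordered_b = combined_hash([2, 1])
```

XOR is commutative, so without positions the two orderings always reduce to the same value. With positions mixed in, each element's hash depends on where it sits, so reordering changes the result (barring a collision).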

TomAugspurger · 2016-12-01T12:50:02Z

@jreback great, thanks. Had to make a minor patch for a parameter that controls whether or not to categorize before hashing. We can't categorize in the CategoricalDtype constructor since we'll hit an infinite loop.

diff --git a/pandas/tools/hashing.py b/pandas/tools/hashing.py
index 3ed51497d..94739b199 100644
--- a/pandas/tools/hashing.py
+++ b/pandas/tools/hashing.py
@@ -86,7 +86,8 @@ def hash_pandas_object(obj, index=True, encoding='utf8', hash_key=None,
     return h
 
 
-def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
+def hash_array(vals, encoding='utf8', hash_key=None, reduce=False,
+               categorize_objects=True):
     """
     Given a 1d array, return an array of deterministic integers.
 
@@ -100,6 +101,7 @@ def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
     hash_key : string key to encode, default to _default_hash_key
     reduce : boolean, default False
         produce a single hash result
+    categorize_objects : bool
 
     Returns
     -------
@@ -137,16 +139,19 @@ def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
     else:
 
         # its MUCH faster to categorize object dtypes, then hash and rename
-        codes, categories = factorize(vals, sort=False)
-        categories = Index(categories)
-        c = Series(Categorical(codes, categories,
-                               ordered=False, fastpath=True))
-        vals = _hash.hash_object_array(categories.values,
-                                       hash_key,
-                                       encoding)
-
-        # rename & extract
-        vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        if categorize_objects:
+            codes, categories = factorize(vals, sort=False)
+            categories = Index(categories)
+            c = Series(Categorical(codes, categories,
+                                   ordered=False, fastpath=True))
+            vals = _hash.hash_object_array(categories.values,
+                                           hash_key,
+                                           encoding)
+
+            # rename & extract
+            vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        else:
+            vals = _hash.hash_object_array(np.asarray(vals), hash_key, encoding)
 
     # Then, redistribute these 64-bit ints within the space of 64-bit ints
     vals ^= vals >> 30

mikegraham · 2016-12-01T14:15:48Z

this is not competing with __eq__

I must have taken too literally what you meant when you showed an index being used as part of a dict key.

jreback · 2016-12-01T15:02:28Z

@TomAugspurger I wouldn't add it like that. Instead, simply call
_hash.hash_object_array(np.asarray(vals), hash_key, encoding) directly

jreback · 2016-12-01T15:03:44Z

@mikegraham yeah this is to 'hash' Index objects, so in theory it would work for this purpose, IOW, providing a data-hash. But I think we would need much more validation / testing if we actually wanted to replace the equiv of __hash__ for an Index (it may be worth it, but I suspect this is too compute intensive for general usage)

mikegraham · 2016-12-01T17:47:53Z

@jreback I read your code closer and I see what you've done now. I think this code might be a bit unsafe because of hash collisions. If this was a 160+ bit cryptographic hash, I could imagine feeling safe enough, but the hashing here.....I'm really not certain it's the right thing to do.

I'd consider keying on something like

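import numpy as np
from pandas.tools.hashing import hash_pandas_object

class UnorderedCategoricalKey(object):
    def __init__(self, index):
        self.index = index
    def __hash__(self):
        # XOR is commutative, so this hash ignores the order of the hashed values
        return int(np.bitwise_xor.reduce(hash_pandas_object(self.index).values))
    def __eq__(self, other):
        # full comparison, so a hash collision cannot conflate two keys
        return (isinstance(other, type(self)) and
                self.index.dtype is other.index.dtype and
                self.index.sort_values().equals(other.index.sort_values()))

class OrderedCategoricalKey(object):
    def __init__(self, index):
        self.index = index
    def __hash__(self):
        # siphash is a placeholder for an order-sensitive hash; it is not defined here
        return siphash(hash_pandas_object(self.index).data)
    def __eq__(self, other):
        return (isinstance(other, type(self)) and
                self.index.dtype is other.index.dtype and
                self.index.equals(other.index))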

Obviously having the key and value have to have references to the index is annoying, but you can get weakref semantics with some work.
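The point of pairing a hash with a full __eq__ can be checked in isolation: even a very weak hash stays safe as long as equality compares the actual contents. The sketch below is pure Python with a hypothetical `CategoriesKey` class and a deliberately bad hash so that every key collides; it does not use an Index or weak references.

```python
class CategoriesKey:
    """Hypothetical cache key: cheap hash, exact equality."""

    def __init__(self, categories, ordered):
        # Pair each category with its type so 1 and 1.0 stay distinct.
        self.typed = tuple((type(c), c) for c in categories)
        self.ordered = ordered

    def __hash__(self):
        # Deliberately weak so that collisions are guaranteed below.
        return len(self.typed)

    def __eq__(self, other):
        if not isinstance(other, CategoriesKey):
            return NotImplemented
        if self.ordered != other.ordered:
            return False
        if self.ordered:
            # Ordered: position matters.
            return self.typed == other.typed
        # Unordered: same members in any order.
        return frozenset(self.typed) == frozenset(other.typed)


cache = {}
cache[CategoriesKey(['a', 'b'], ordered=True)] = 'a < b'
cache[CategoriesKey(['b', 'a'], ordered=True)] = 'b < a'
cache[CategoriesKey(['a', 'b'], ordered=False)] = 'unordered'
# Same unordered categories in another order: overwrites the entry above.
cache[CategoriesKey(['b', 'a'], ordered=False)] = 'unordered again'
```

All four keys share one hash value, yet the dict ends up with three entries because lookups fall back to __eq__ on collision. The cost is that collisions degrade lookup speed rather than correctness.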

mrocklin · 2016-12-05T15:35:56Z

As predicted by @mikegraham , here is a failure:

In [1]: import numpy as np

In [2]: from pandas.tools.hashing import hash_array

In [3]: L = ['Ingrid-9Z9fKIZmkO7i7Cn51Li34pJm44fgX6DYGBNj3VPlOH50m7HnBlPxfIwFMrc
   ...: NJNMP6PSgLmwWnInciMWrCSAlLEvt7JkJl4IxiMrVbXSa8ZQoVaq5xoQPjltuJEfwdNlO6jo
   ...: 8qRRHvD8sBEBMQASrRa6TsdaPTPCBo3nwIBpE7YzzmyH0vMBhjQZLx1aCT7faSEx7PgFxQhH
   ...: dKFWROcysamgy9iVj8DO2Fmwg1NNl93rIAqC3mdqfrCxrzfvIY8aJdzin2cHVzy3QUJxZgHv
   ...: tUtOLxoqnUHsYbNTeq0xcLXpTZEZCxD4PGubIuCNf32c33M7HFsnjWSEjE2yVdWKhmSVodyF
   ...: 8hFYVmhYnMCztQnJrt3O8ZvVRXd5IKwlLexiSp4h888w7SzAIcKgc3g5XQJf6MlSMftDXm9l
   ...: IsE1mJNiJEv6uY6pgvC3fUPhatlR5JPpVAHNSbSEE73MBzJrhCAbOLXQumyOXigZuPoME7Qg
   ...: JcBalliQol7YZ9', 'Tim-b9MddTxOWW2AT1Py6vtVbZwGAmYCjbp89p8mxsiFoVX4FyDOF3
   ...: wFiAkyQTUgwg9sVqVYOZo09Dh1AzhFHbgij52ylF0SEwgzjzHH8TGY8Lypart4p4onnDoDvV
   ...: MBa0kdthVGKl6K0BDVGzyOXPXKpmnMF1H6rJzqHJ0HywfwS4XYpVwlAkoeNsiicHkJUFdUAh
   ...: G229INzvIAiJuAHeJDUoyO4DCBqtoZ5TDend6TK7Y914yHlfH3g1WZu5LksKv68VQHJriWFY
   ...: usW5e6ZZ6dKaMjTwEGuRgdT66iU5nqWTHRH8WSzpXoCFwGcTOwyuqPSe0fTe21DVtJn1FKj9
   ...: F9nEnR9xOvJUO7E0piCIF4Ad9yAIDY4DBimpsTfKXCu1vdHpKYerzbndfuFe5AhfMduLYZJi
   ...: 5iAw8qKSwR5h86ttXV0Mc0QmXz8dsRvDgxjXSmupPxBggdlqUlC828hXiTPD7am0yETBV0F3
   ...: bEtvPiNJfremszcV8NcqAoARMe']

In [4]: hash_array(np.asarray(L, dtype=object), 'utf8')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-5f657fa82496> in <module>()
----> 1 hash_array(np.asarray(L, dtype=object), 'utf8')

/home/mrocklin/workspace/pandas/pandas/tools/hashing.py in hash_array(vals, encoding, hash_key)
    127 
    128         # rename & extract
--> 129         vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
    130 
    131     # Then, redistribute these 64-bit ints within the space of 64-bit ints

/home/mrocklin/workspace/pandas/pandas/core/base.py in f(self, *args, **kwargs)
    208 
    209             def f(self, *args, **kwargs):
--> 210                 return self._delegate_method(name, *args, **kwargs)
    211 
    212             f.__name__ = name

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _delegate_method(self, name, *args, **kwargs)
   1959         from pandas import Series
   1960         method = getattr(self.categorical, name)
-> 1961         res = method(*args, **kwargs)
   1962         if res is not None:
   1963             return Series(res, index=self.index)

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in rename_categories(self, new_categories, inplace)
    756         """
    757         cat = self if inplace else self.copy()
--> 758         cat.categories = new_categories
    759         if not inplace:
    760             return cat

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _set_categories(self, categories, fastpath)
    586         """
    587 
--> 588         categories = self._validate_categories(categories, fastpath=fastpath)
    589         if (not fastpath and self._categories is not None and
    590                 len(categories) != len(self._categories)):

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _validate_categories(cls, categories, fastpath)
    572 
    573             if not categories.is_unique:
--> 574                 raise ValueError('Categorical categories must be unique')
    575 
    576         return categories

ValueError: Categorical categories must be unique

Both of these inputs hash down to the same value, which causes Pandas to correctly raise.
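The failure mode generalizes: renaming categories to their hashes only works while the hashes are unique. A toy reproduction, assuming a deliberately weak 8-bit `toy_hash` and a hypothetical `rename_categories` that enforces uniqueness the way the traceback above does:

```python
def toy_hash(text):
    """Deliberately weak 8-bit hash; stands in for any hash that can collide."""
    return sum(text.encode('utf8')) % 256


def rename_categories(categories, new_categories):
    """Replace category labels, insisting that the new labels stay unique."""
    if len(set(new_categories)) != len(new_categories):
        raise ValueError('Categorical categories must be unique')
    return list(new_categories)


categories = ['ab', 'ba']                    # two distinct labels
codes = [0, 1, 0]                            # values expressed as codes
hashed = [toy_hash(c) for c in categories]   # both labels hash to 195

try:
    rename_categories(categories, hashed)
    error = None
except ValueError as exc:
    error = str(exc)

# Looking hashes up by code avoids the uniqueness requirement entirely.
per_value = [hashed[code] for code in codes]
```

The lookup at the end is only one way to sidestep the error. It does not make the collision itself go away, which is why the quality of the underlying hash still matters.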

mrocklin · 2016-12-05T15:41:43Z

@mikegraham do you have suggestions on the best course of action here?

jreback · 2016-12-05T17:13:32Z

jreback@ac360e9

here is the test code.

I think this might be some sort of overflow in siphash.

mrocklin · 2016-12-05T17:16:33Z

Is siphash a cryptographic hash? If not then it should not surprise us to see collisions.

jreback · 2016-12-05T22:30:49Z

so these are failing if len(string) is an exact power of 2, e.g. these are 2**9; 2**9 + 1 is fine.

mikegraham · 2016-12-05T23:53:03Z

BTW, I had initially done some tests against externally-computed siphash values...we should probably get those into pandas' suite. I can work with you to do that.

jreback · 2016-12-06T00:00:34Z

@mikegraham that would be great

jreback · 2016-12-06T13:24:12Z

so #14804/#14805 seem to fix the issues. thanks @mikegraham

as an aside, if you want to do a PR to lock down the actual resultant hash values for a test series, that would be great (so we match external siphash and don't change in the future)

We extended the CategoricalDtype to accept optional categories and ordered arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still possible with categories=None, the default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories that are known ahead of time in other places, e.g. .astype, read_csv, and the Series constructor.

Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676

TomAugspurger added the API Design, Categorical (Categorical Data Type), Difficulty Intermediate, and Dtype Conversions (Unexpected or buggy dtype conversions) labels Nov 22, 2016

TomAugspurger added this to the 0.20.0 milestone Nov 22, 2016

TomAugspurger mentioned this issue Nov 22, 2016

[WIP]API: CategoricalType for specifying categoricals #14698

Closed

mikegraham mentioned this issue Dec 6, 2016

BUG: This should be a 64-bit int, not an 8-bit int #14804

Closed

jreback mentioned this issue Dec 6, 2016

BUG: fix hash collisions for from int overflow #14805

Merged

TomAugspurger mentioned this issue Dec 12, 2016

Different categories in different partitions dask/dask#1836

Closed

TomAugspurger mentioned this issue Jan 9, 2017

BUG: astype(CategoricalDtype) has no effect #15078

Closed

jreback modified the milestones: 0.20.0, 0.21.0 Mar 23, 2017

TomAugspurger mentioned this issue Apr 16, 2017

Categorical type #16015

Merged

jreback closed this as completed in #16015 Sep 23, 2017

jreback pushed a commit that referenced this issue Sep 23, 2017

Categorical type (#16015)

e57f189

Closes #14711 Closes #15078 Closes #14676

alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017

Categorical type (pandas-dev#16015)

da7ad15

Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676

No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

Categorical type (pandas-dev#16015)

3bb7929

Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676
