New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make categories and ordered part of CategoricalDtype #14711

Closed
TomAugspurger opened this Issue Nov 22, 2016 · 35 comments

Comments

Projects
None yet
6 participants
@TomAugspurger
Contributor

TomAugspurger commented Nov 22, 2016

This is to discuss pushing the Categorical.categories and
Categorical.ordered information into the extension type CategoricalDtype.

pd.CategoricalDtype(categories, ordered=False)

Note that there is no values argument. This is a type constructor, that
isn't attached to any specific Categorical instance.

Why?

Several times now (read_csv(..., dtype=...), .astype(...), Series([], dtype=...))
we have places where we accept dtype='category' which takes the values
in the method (the series, or column from the CSV, etc.)
and hands it off to the value constructor, with no control over the
categories and ordered arguments.

Categorical(values, categories=None, ordered=False)

The proposal here would add the categories and ordered
attributes / arguments to CategoricalDtype and provide a common API
for specifying non-default parameters for the Categorical constructor
in methods like read_csv, astype, etc.

t = pd.CategoricalDtype(['low', 'med', 'high'], ordered=True)
pd.read_csv('foo.csv', dtype={'A': int, 'B': t)
pd.Series(['high', 'low', 'high'], dtype=t)

s = pd.Series(['high', 'low', 'high'])
s.astype(t)

We would continue to accept dtype='category'.

This becomes even more import when doing operations on larger than memory datasets with something like dask or even (read_csv(..., chunksize=N)). Right now you don't have an easy way to specify the categories or ordered for columns (assuming you know them ahead of time).

Issues

  1. CategoricalDtype currently isn't part of the public API. Which methods
    on it do we make public?
  2. Equality semantics: For backwards compat, I think all instances
    of CategoricalDtype should compare equal with all others. You can use
    identity to check if two types are the same
t1 = pd.CategoricalDtype(['a', 'b'], ordered=True)
t2 = pd.CategoricalDtype(['a', 'b'], ordered=False)

t1 == t2  # True
t1 is t2  # False
t1 is t1  # True
  1. Should the categories argument be required? Currently dtype='category'
    says 1.) infer the categories based on the values, and 2.) it's unordered.
    Would CategoricalDtype(None, ordered=False) be allowed?
  2. Strictness? If I say
pd.Series(['a', 'b', 'c'], dtype=pd.CategoricalDtype(['a', 'b']))

What happens? I would probably expect a TypeError or ValueError as c
isn't "supposed" to be there. Or do we replace 'c' with NA? Should
strict be another parameter to CategoricalDtype (I don't think so).

I'm willing to work on this over the next couple weeks.

xref #14676 (astype)
xref #14503 (read_csv)

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Nov 22, 2016

Contributor

cc @janschulz (let me know if you want me to stop pinging you on this categorical issues 😄 ). Also @mrocklin I think this may interest you as the fastparquet stuff could use CategoricalDtype since you know the categories ahead of time.

Contributor

TomAugspurger commented Nov 22, 2016

cc @janschulz (let me know if you want me to stop pinging you on this categorical issues 😄 ). Also @mrocklin I think this may interest you as the fastparquet stuff could use CategoricalDtype since you know the categories ahead of time.

@jankatins

This comment has been minimized.

Show comment
Hide comment
@jankatins

jankatins Nov 22, 2016

Contributor

Will this also mean that Categorical is changed to save the categories/order information in the the dtype object? As pandas2 seems to go the "we ave our own dtypes"-way, it seems that would be a sensible way (if that would also happening with the rest of the dtypes)

[Thanks for pinging me!]

Contributor

jankatins commented Nov 22, 2016

Will this also mean that Categorical is changed to save the categories/order information in the the dtype object? As pandas2 seems to go the "we ave our own dtypes"-way, it seems that would be a sensible way (if that would also happening with the rest of the dtypes)

[Thanks for pinging me!]

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Nov 22, 2016

Contributor

Will this also mean that Categorical is changed to save the categories/order information in the the dtype object?

Yeah, pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'], ordered=True).dtype would be CategoricalDtype(['a', 'b', 'c'], ordered=True).

Dunno if we want to expand the pd.Categorical constructor to take a dtype=CategoricalDtype option. Perhaps.

Contributor

TomAugspurger commented Nov 22, 2016

Will this also mean that Categorical is changed to save the categories/order information in the the dtype object?

Yeah, pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'], ordered=True).dtype would be CategoricalDtype(['a', 'b', 'c'], ordered=True).

Dunno if we want to expand the pd.Categorical constructor to take a dtype=CategoricalDtype option. Perhaps.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 22, 2016

Contributor

yes, this was the original plan when I first did this, but too many moving parts.

If this can be implemented in a back-compat manner (as this is pretty much an implementation detail). I think would be great.

Contributor

jreback commented Nov 22, 2016

yes, this was the original plan when I first did this, but too many moving parts.

If this can be implemented in a back-compat manner (as this is pretty much an implementation detail). I think would be great.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 23, 2016

Member
  1. CategoricalDtype currently isn't part of the public API. Which methods on it do we make public?

Specific for categorical, I suppose we would have a categories and ordered attribute? But for the existing attributes/methods, maybe to start with we could say in the docs that none of these are public? (not sure if there are that you would want to use)

  1. Equality semantics: For backwards compat, I think all instances of CategoricalDtype should compare equal with all others. You can use identity to check if two types are the same

Sounds OK

  1. Should the categories argument be required? Would CategoricalDtype(None, ordered=False) be allowed?

I would not allow this to start with. Is the idea that you can this way let the categories be inferred from the values, but set the ordered attribute through the dtype? Or what would be the application?

  1. Strictness? I would probably expect a TypeError or ValueError as c isn't "supposed" to be there. Or do we replace 'c' with NA?

It seems that we currently replace with NA:

In [53]: pd.Categorical(['a', 'b', 'c'], categories=['a', 'b'])
Out[53]: 
[a, b, NaN]
Categories (2, object): [a, b]

In [54]: pd.Series(['a', 'b', 'c']).astype('category', categories=['a', 'b'])
Out[54]: 
0      a
1      b
2    NaN
dtype: category
Categories (2, object): [a, b]

although I also think I would like a TypeError more as the default behaviour.

Member

jorisvandenbossche commented Nov 23, 2016

  1. CategoricalDtype currently isn't part of the public API. Which methods on it do we make public?

Specific for categorical, I suppose we would have a categories and ordered attribute? But for the existing attributes/methods, maybe to start with we could say in the docs that none of these are public? (not sure if there are that you would want to use)

  1. Equality semantics: For backwards compat, I think all instances of CategoricalDtype should compare equal with all others. You can use identity to check if two types are the same

Sounds OK

  1. Should the categories argument be required? Would CategoricalDtype(None, ordered=False) be allowed?

I would not allow this to start with. Is the idea that you can this way let the categories be inferred from the values, but set the ordered attribute through the dtype? Or what would be the application?

  1. Strictness? I would probably expect a TypeError or ValueError as c isn't "supposed" to be there. Or do we replace 'c' with NA?

It seems that we currently replace with NA:

In [53]: pd.Categorical(['a', 'b', 'c'], categories=['a', 'b'])
Out[53]: 
[a, b, NaN]
Categories (2, object): [a, b]

In [54]: pd.Series(['a', 'b', 'c']).astype('category', categories=['a', 'b'])
Out[54]: 
0      a
1      b
2    NaN
dtype: category
Categories (2, object): [a, b]

although I also think I would like a TypeError more as the default behaviour.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 23, 2016

Member

@TomAugspurger How is this issue different from #14676, can that one be closed?

Member

jorisvandenbossche commented Nov 23, 2016

@TomAugspurger How is this issue different from #14676, can that one be closed?

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Nov 24, 2016

Contributor
Contributor

TomAugspurger commented Nov 24, 2016

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Nov 30, 2016

Contributor

Small update for future reference. I'm having trouble with the current approach where CategoricalDtype[categories, ordered] is a singleton, i.e. one instance of CategoricalDtype for each category and ordered. The problem is _cache being a (weakref) dictionary. We need

CategoricalDtype([1, 2])

to be different from

CategoricalDtype([1., 2.])

but if we use a dictionary to do singleton stuff, the keys compare equal so we only ever get one. @jreback do you have any links to info on what-all requirements the numpy duck-type has to satisfy?

Contributor

TomAugspurger commented Nov 30, 2016

Small update for future reference. I'm having trouble with the current approach where CategoricalDtype[categories, ordered] is a singleton, i.e. one instance of CategoricalDtype for each category and ordered. The problem is _cache being a (weakref) dictionary. We need

CategoricalDtype([1, 2])

to be different from

CategoricalDtype([1., 2.])

but if we use a dictionary to do singleton stuff, the keys compare equal so we only ever get one. @jreback do you have any links to info on what-all requirements the numpy duck-type has to satisfy?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 1, 2016

Contributor

so the cache would need to be changed to something like a WeakDictionary and use keys like:

(c.categories, c.ordered)

BUT the categories are an Index which are not hashable. But prob could do something like:

# showing that ordering does matter (I am deliberately including the index, so we track ordering)
In [50]: c = Series(pd.Categorical([1, 2], ordered=True))

In [51]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[51]: 6505546918052763540

In [52]: c = Series(pd.Categorical([2, 1], ordered=True))

In [53]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[53]: 6135121269729399882

In [54]: c = Series(pd.Categorical([1., 2.], ordered=True))

In [55]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[55]: 6505546918052763540

# you would actually store the CategoricalDtype here (and not c)
In [58]: {(6505546918052763540, c.cat.ordered) : c}
Out[58]: 
{(6505546918052763540, True): 0    1.0
 1    2.0
 dtype: category
 Categories (2, float64): [1.0 < 2.0]}

cc @mikegraham

Contributor

jreback commented Dec 1, 2016

so the cache would need to be changed to something like a WeakDictionary and use keys like:

(c.categories, c.ordered)

BUT the categories are an Index which are not hashable. But prob could do something like:

# showing that ordering does matter (I am deliberately including the index, so we track ordering)
In [50]: c = Series(pd.Categorical([1, 2], ordered=True))

In [51]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[51]: 6505546918052763540

In [52]: c = Series(pd.Categorical([2, 1], ordered=True))

In [53]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[53]: 6135121269729399882

In [54]: c = Series(pd.Categorical([1., 2.], ordered=True))

In [55]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[55]: 6505546918052763540

# you would actually store the CategoricalDtype here (and not c)
In [58]: {(6505546918052763540, c.cat.ordered) : c}
Out[58]: 
{(6505546918052763540, True): 0    1.0
 1    2.0
 dtype: category
 Categories (2, float64): [1.0 < 2.0]}

cc @mikegraham

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 1, 2016

Contributor

I think I may add an option to hash_pandas_object to directly produce this (if so desired) as this is a data hash for a pandas object.

Contributor

jreback commented Dec 1, 2016

I think I may add an option to hash_pandas_object to directly produce this (if so desired) as this is a data hash for a pandas object.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback
Contributor

jreback commented Dec 1, 2016

@mikegraham

This comment has been minimized.

Show comment
Hide comment
@mikegraham

mikegraham Dec 1, 2016

Contributor

Since an index's __eq__ does not return a bool, I'm not sure that having indices be directly hashable for use as dict keys makes sense to me. It seems to me maybe some proxy is in order to be able to do this...

Is it the case the Index order matters when the categorical is ordered=True and not when not?

Contributor

mikegraham commented Dec 1, 2016

Since an index's __eq__ does not return a bool, I'm not sure that having indices be directly hashable for use as dict keys makes sense to me. It seems to me maybe some proxy is in order to be able to do this...

Is it the case the Index order matters when the categorical is ordered=True and not when not?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 1, 2016

Contributor

this is not competing with eq ; we actually use .equals to compare equivalents generally for an Index

furthermore this by by default DOES consider ordering (a user could control this by index=False)

Contributor

jreback commented Dec 1, 2016

this is not competing with eq ; we actually use .equals to compare equivalents generally for an Index

furthermore this by by default DOES consider ordering (a user could control this by index=False)

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Dec 1, 2016

Contributor

@jreback great, thanks. Had to make a minor patch for a parameter that controls whether or not to categorize before hashing. We can't categorize in the CategoricalDtype constructor since we'll hit an infinite loop.

diff --git a/pandas/tools/hashing.py b/pandas/tools/hashing.py
index 3ed51497d..94739b199 100644
--- a/pandas/tools/hashing.py
+++ b/pandas/tools/hashing.py
@@ -86,7 +86,8 @@ def hash_pandas_object(obj, index=True, encoding='utf8', hash_key=None,
     return h
 
 
-def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
+def hash_array(vals, encoding='utf8', hash_key=None, reduce=False,
+               categorize_objects=True):
     """
     Given a 1d array, return an array of deterministic integers.
 
@@ -100,6 +101,7 @@ def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
     hash_key : string key to encode, default to _default_hash_key
     reduce : boolean, default False
         produce a single hash result
+    categorize_objects : bool
 
     Returns
     -------
@@ -137,16 +139,19 @@ def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
     else:
 
         # its MUCH faster to categorize object dtypes, then hash and rename
-        codes, categories = factorize(vals, sort=False)
-        categories = Index(categories)
-        c = Series(Categorical(codes, categories,
-                               ordered=False, fastpath=True))
-        vals = _hash.hash_object_array(categories.values,
-                                       hash_key,
-                                       encoding)
-
-        # rename & extract
-        vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        if categorize_objects:
+            codes, categories = factorize(vals, sort=False)
+            categories = Index(categories)
+            c = Series(Categorical(codes, categories,
+                                   ordered=False, fastpath=True))
+            vals = _hash.hash_object_array(categories.values,
+                                           hash_key,
+                                           encoding)
+
+            # rename & extract
+            vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        else:
+            vals = _hash.hash_object_array(np.asarray(vals), hash_key, encoding)
 
     # Then, redistribute these 64-bit ints within the space of 64-bit ints
     vals ^= vals >> 30
Contributor

TomAugspurger commented Dec 1, 2016

@jreback great, thanks. Had to make a minor patch for a parameter that controls whether or not to categorize before hashing. We can't categorize in the CategoricalDtype constructor since we'll hit an infinite loop.

diff --git a/pandas/tools/hashing.py b/pandas/tools/hashing.py
index 3ed51497d..94739b199 100644
--- a/pandas/tools/hashing.py
+++ b/pandas/tools/hashing.py
@@ -86,7 +86,8 @@ def hash_pandas_object(obj, index=True, encoding='utf8', hash_key=None,
     return h
 
 
-def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
+def hash_array(vals, encoding='utf8', hash_key=None, reduce=False,
+               categorize_objects=True):
     """
     Given a 1d array, return an array of deterministic integers.
 
@@ -100,6 +101,7 @@ def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
     hash_key : string key to encode, default to _default_hash_key
     reduce : boolean, default False
         produce a single hash result
+    categorize_objects : bool
 
     Returns
     -------
@@ -137,16 +139,19 @@ def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
     else:
 
         # its MUCH faster to categorize object dtypes, then hash and rename
-        codes, categories = factorize(vals, sort=False)
-        categories = Index(categories)
-        c = Series(Categorical(codes, categories,
-                               ordered=False, fastpath=True))
-        vals = _hash.hash_object_array(categories.values,
-                                       hash_key,
-                                       encoding)
-
-        # rename & extract
-        vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        if categorize_objects:
+            codes, categories = factorize(vals, sort=False)
+            categories = Index(categories)
+            c = Series(Categorical(codes, categories,
+                                   ordered=False, fastpath=True))
+            vals = _hash.hash_object_array(categories.values,
+                                           hash_key,
+                                           encoding)
+
+            # rename & extract
+            vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        else:
+            vals = _hash.hash_object_array(np.asarray(vals), hash_key, encoding)
 
     # Then, redistribute these 64-bit ints within the space of 64-bit ints
     vals ^= vals >> 30
@mikegraham

This comment has been minimized.

Show comment
Hide comment
@mikegraham

mikegraham Dec 1, 2016

Contributor

this is not competing with __eq__

I must have taken too literally what you meant when you showed an index being used as part of a dict key.

Contributor

mikegraham commented Dec 1, 2016

this is not competing with __eq__

I must have taken too literally what you meant when you showed an index being used as part of a dict key.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 1, 2016

Contributor

@TomAugspurger I wouldn't add like that. instead simply call
_hash.hash_object_array(np.asarray(vals), hash_key, encoding) directly

Contributor

jreback commented Dec 1, 2016

@TomAugspurger I wouldn't add like that. instead simply call
_hash.hash_object_array(np.asarray(vals), hash_key, encoding) directly

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 1, 2016

Contributor

@mikegraham yeah this is to 'hash' Index objects, so in theory it would work for this purpose, IOW, providing a data-hash. But I think we would need much more validation / testing if we actually wanted to replace the equiv of __hash__ for an Index (it may be worth it, but I suspect this is too compute intensive for general usage)

Contributor

jreback commented Dec 1, 2016

@mikegraham yeah this is to 'hash' Index objects, so in theory it would work for this purpose, IOW, providing a data-hash. But I think we would need much more validation / testing if we actually wanted to replace the equiv of __hash__ for an Index (it may be worth it, but I suspect this is too compute intensive for general usage)

@mikegraham

This comment has been minimized.

Show comment
Hide comment
@mikegraham

mikegraham Dec 1, 2016

Contributor

@jreback I read your code closer and I see what you've done now. I think this code might be a bit unsafe because of hash collisions. If this was a 160+ bit cryptographic hash, I could imagine feeling safe enough, but the hashing here.....I'm really not certain it's the right thing to do.

I'd consider keying on something like

class UnorderedCategoricalKey(object):
    def __init__(self, index):
        self.index = index
    def __hash__(self):
        return np.bitwise_xor.reduce(hash_pandas_object(c).values)
    def __eq__(self, other):
        return (isinstance(other, type(self)) and 
                self.index.dtype is other.index.dtype and
                self.index.sort_values().equals(other.index.sort_values()))

class OrderedCategoricalKey(object):
    def __init__(self, index):
        self.index = index
    def __hash__(self):
        return siphash(hash_pandas_object(c).data)
    def __eq__(self, other):
        return (isinstance(other, type(self)) and 
                self.index.dtype is other.index.dtype and
                self.index.equals(other.index))

Obviously having the key and value have to have references to the index is annoying, but you can get weakref semantics with some work.

Contributor

mikegraham commented Dec 1, 2016

@jreback I read your code closer and I see what you've done now. I think this code might be a bit unsafe because of hash collisions. If this was a 160+ bit cryptographic hash, I could imagine feeling safe enough, but the hashing here.....I'm really not certain it's the right thing to do.

I'd consider keying on something like

class UnorderedCategoricalKey(object):
    def __init__(self, index):
        self.index = index
    def __hash__(self):
        return np.bitwise_xor.reduce(hash_pandas_object(c).values)
    def __eq__(self, other):
        return (isinstance(other, type(self)) and 
                self.index.dtype is other.index.dtype and
                self.index.sort_values().equals(other.index.sort_values()))

class OrderedCategoricalKey(object):
    def __init__(self, index):
        self.index = index
    def __hash__(self):
        return siphash(hash_pandas_object(c).data)
    def __eq__(self, other):
        return (isinstance(other, type(self)) and 
                self.index.dtype is other.index.dtype and
                self.index.equals(other.index))

Obviously having the key and value have to have references to the index is annoying, but you can get weakref semantics with some work.

@mrocklin

This comment has been minimized.

Show comment
Hide comment
@mrocklin

mrocklin Dec 5, 2016

Contributor

As predicted by @mikegraham , here is a failure:

In [1]: import numpy as np

In [2]: from pandas.tools.hashing import hash_array

In [3]: L = ['Ingrid-9Z9fKIZmkO7i7Cn51Li34pJm44fgX6DYGBNj3VPlOH50m7HnBlPxfIwFMrc
   ...: NJNMP6PSgLmwWnInciMWrCSAlLEvt7JkJl4IxiMrVbXSa8ZQoVaq5xoQPjltuJEfwdNlO6jo
   ...: 8qRRHvD8sBEBMQASrRa6TsdaPTPCBo3nwIBpE7YzzmyH0vMBhjQZLx1aCT7faSEx7PgFxQhH
   ...: dKFWROcysamgy9iVj8DO2Fmwg1NNl93rIAqC3mdqfrCxrzfvIY8aJdzin2cHVzy3QUJxZgHv
   ...: tUtOLxoqnUHsYbNTeq0xcLXpTZEZCxD4PGubIuCNf32c33M7HFsnjWSEjE2yVdWKhmSVodyF
   ...: 8hFYVmhYnMCztQnJrt3O8ZvVRXd5IKwlLexiSp4h888w7SzAIcKgc3g5XQJf6MlSMftDXm9l
   ...: IsE1mJNiJEv6uY6pgvC3fUPhatlR5JPpVAHNSbSEE73MBzJrhCAbOLXQumyOXigZuPoME7Qg
   ...: JcBalliQol7YZ9', 'Tim-b9MddTxOWW2AT1Py6vtVbZwGAmYCjbp89p8mxsiFoVX4FyDOF3
   ...: wFiAkyQTUgwg9sVqVYOZo09Dh1AzhFHbgij52ylF0SEwgzjzHH8TGY8Lypart4p4onnDoDvV
   ...: MBa0kdthVGKl6K0BDVGzyOXPXKpmnMF1H6rJzqHJ0HywfwS4XYpVwlAkoeNsiicHkJUFdUAh
   ...: G229INzvIAiJuAHeJDUoyO4DCBqtoZ5TDend6TK7Y914yHlfH3g1WZu5LksKv68VQHJriWFY
   ...: usW5e6ZZ6dKaMjTwEGuRgdT66iU5nqWTHRH8WSzpXoCFwGcTOwyuqPSe0fTe21DVtJn1FKj9
   ...: F9nEnR9xOvJUO7E0piCIF4Ad9yAIDY4DBimpsTfKXCu1vdHpKYerzbndfuFe5AhfMduLYZJi
   ...: 5iAw8qKSwR5h86ttXV0Mc0QmXz8dsRvDgxjXSmupPxBggdlqUlC828hXiTPD7am0yETBV0F3
   ...: bEtvPiNJfremszcV8NcqAoARMe']

In [4]: hash_array(np.asarray(L, dtype=object), 'utf8')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-5f657fa82496> in <module>()
----> 1 hash_array(np.asarray(L, dtype=object), 'utf8')

/home/mrocklin/workspace/pandas/pandas/tools/hashing.py in hash_array(vals, encoding, hash_key)
    127 
    128         # rename & extract
--> 129         vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
    130 
    131     # Then, redistribute these 64-bit ints within the space of 64-bit ints

/home/mrocklin/workspace/pandas/pandas/core/base.py in f(self, *args, **kwargs)
    208 
    209             def f(self, *args, **kwargs):
--> 210                 return self._delegate_method(name, *args, **kwargs)
    211 
    212             f.__name__ = name

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _delegate_method(self, name, *args, **kwargs)
   1959         from pandas import Series
   1960         method = getattr(self.categorical, name)
-> 1961         res = method(*args, **kwargs)
   1962         if res is not None:
   1963             return Series(res, index=self.index)

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in rename_categories(self, new_categories, inplace)
    756         """
    757         cat = self if inplace else self.copy()
--> 758         cat.categories = new_categories
    759         if not inplace:
    760             return cat

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _set_categories(self, categories, fastpath)
    586         """
    587 
--> 588         categories = self._validate_categories(categories, fastpath=fastpath)
    589         if (not fastpath and self._categories is not None and
    590                 len(categories) != len(self._categories)):

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _validate_categories(cls, categories, fastpath)
    572 
    573             if not categories.is_unique:
--> 574                 raise ValueError('Categorical categories must be unique')
    575 
    576         return categories

ValueError: Categorical categories must be unique

Both of these inputs hash down to the same value, which causes Pandas to correctly raise.

Contributor

mrocklin commented Dec 5, 2016

As predicted by @mikegraham , here is a failure:

In [1]: import numpy as np

In [2]: from pandas.tools.hashing import hash_array

In [3]: L = ['Ingrid-9Z9fKIZmkO7i7Cn51Li34pJm44fgX6DYGBNj3VPlOH50m7HnBlPxfIwFMrc
   ...: NJNMP6PSgLmwWnInciMWrCSAlLEvt7JkJl4IxiMrVbXSa8ZQoVaq5xoQPjltuJEfwdNlO6jo
   ...: 8qRRHvD8sBEBMQASrRa6TsdaPTPCBo3nwIBpE7YzzmyH0vMBhjQZLx1aCT7faSEx7PgFxQhH
   ...: dKFWROcysamgy9iVj8DO2Fmwg1NNl93rIAqC3mdqfrCxrzfvIY8aJdzin2cHVzy3QUJxZgHv
   ...: tUtOLxoqnUHsYbNTeq0xcLXpTZEZCxD4PGubIuCNf32c33M7HFsnjWSEjE2yVdWKhmSVodyF
   ...: 8hFYVmhYnMCztQnJrt3O8ZvVRXd5IKwlLexiSp4h888w7SzAIcKgc3g5XQJf6MlSMftDXm9l
   ...: IsE1mJNiJEv6uY6pgvC3fUPhatlR5JPpVAHNSbSEE73MBzJrhCAbOLXQumyOXigZuPoME7Qg
   ...: JcBalliQol7YZ9', 'Tim-b9MddTxOWW2AT1Py6vtVbZwGAmYCjbp89p8mxsiFoVX4FyDOF3
   ...: wFiAkyQTUgwg9sVqVYOZo09Dh1AzhFHbgij52ylF0SEwgzjzHH8TGY8Lypart4p4onnDoDvV
   ...: MBa0kdthVGKl6K0BDVGzyOXPXKpmnMF1H6rJzqHJ0HywfwS4XYpVwlAkoeNsiicHkJUFdUAh
   ...: G229INzvIAiJuAHeJDUoyO4DCBqtoZ5TDend6TK7Y914yHlfH3g1WZu5LksKv68VQHJriWFY
   ...: usW5e6ZZ6dKaMjTwEGuRgdT66iU5nqWTHRH8WSzpXoCFwGcTOwyuqPSe0fTe21DVtJn1FKj9
   ...: F9nEnR9xOvJUO7E0piCIF4Ad9yAIDY4DBimpsTfKXCu1vdHpKYerzbndfuFe5AhfMduLYZJi
   ...: 5iAw8qKSwR5h86ttXV0Mc0QmXz8dsRvDgxjXSmupPxBggdlqUlC828hXiTPD7am0yETBV0F3
   ...: bEtvPiNJfremszcV8NcqAoARMe']

In [4]: hash_array(np.asarray(L, dtype=object), 'utf8')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-5f657fa82496> in <module>()
----> 1 hash_array(np.asarray(L, dtype=object), 'utf8')

/home/mrocklin/workspace/pandas/pandas/tools/hashing.py in hash_array(vals, encoding, hash_key)
    127 
    128         # rename & extract
--> 129         vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
    130 
    131     # Then, redistribute these 64-bit ints within the space of 64-bit ints

/home/mrocklin/workspace/pandas/pandas/core/base.py in f(self, *args, **kwargs)
    208 
    209             def f(self, *args, **kwargs):
--> 210                 return self._delegate_method(name, *args, **kwargs)
    211 
    212             f.__name__ = name

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _delegate_method(self, name, *args, **kwargs)
   1959         from pandas import Series
   1960         method = getattr(self.categorical, name)
-> 1961         res = method(*args, **kwargs)
   1962         if res is not None:
   1963             return Series(res, index=self.index)

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in rename_categories(self, new_categories, inplace)
    756         """
    757         cat = self if inplace else self.copy()
--> 758         cat.categories = new_categories
    759         if not inplace:
    760             return cat

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _set_categories(self, categories, fastpath)
    586         """
    587 
--> 588         categories = self._validate_categories(categories, fastpath=fastpath)
    589         if (not fastpath and self._categories is not None and
    590                 len(categories) != len(self._categories)):

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _validate_categories(cls, categories, fastpath)
    572 
    573             if not categories.is_unique:
--> 574                 raise ValueError('Categorical categories must be unique')
    575 
    576         return categories

ValueError: Categorical categories must be unique

Both of these inputs hash down to the same value, which causes Pandas to correctly raise.

@mrocklin

This comment has been minimized.

Show comment
Hide comment
@mrocklin

mrocklin Dec 5, 2016

Contributor

@mikegraham do you have suggestions on the best course of action here?

Contributor

mrocklin commented Dec 5, 2016

@mikegraham do you have suggestions on the best course of action here?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 5, 2016

Contributor

jreback@ac360e9

here is the test code.

I think this might be some sort of overflow in siphash.

Contributor

jreback commented Dec 5, 2016

jreback@ac360e9

here is the test code.

I think this might be some sort of overflow in siphash.

@mrocklin

This comment has been minimized.

Show comment
Hide comment
@mrocklin

mrocklin Dec 5, 2016

Contributor

Is siphash a cryptographic hash? If not then it should not surprise us to see collisions.

Contributor

mrocklin commented Dec 5, 2016

Is siphash a cryptographic hash? If not then it should not surprise us to see collisions.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 5, 2016

Contributor

so these are failing if len(string) is an exact power of 2, e.g. these area 2**9, 2**9 + 1 is fine.

Contributor

jreback commented Dec 5, 2016

so these are failing if len(string) is an exact power of 2, e.g. these area 2**9, 2**9 + 1 is fine.

@mrocklin

This comment has been minimized.

Show comment
Hide comment
@mrocklin

mrocklin Dec 5, 2016

Contributor

http://crypto.stackexchange.com/questions/17996/is-siphash-cryptographically-secure

This implies that collisions are likely. I don't think that this is a problem with the implementation. My guess is that we can't use siphash and assume that it won't collide.

Contributor

mrocklin commented Dec 5, 2016

http://crypto.stackexchange.com/questions/17996/is-siphash-cryptographically-secure

This implies that collisions are likely. I don't think that this is a problem with the implementation. My guess is that we can't use siphash and assume that it won't collide.

@mikegraham

This comment has been minimized.

Show comment
Hide comment
@mikegraham

mikegraham Dec 5, 2016

Contributor

siphash isn't a cryptographic hash. In many senses it should be about as good as one, but it's by definition 64-bit, which obviously will make collisions very normal.

I don't see the uniqueness assumption of categorical values as being needed to do this trick -- shouldn't you be able to accomplish the same optimization trick more reliably with a drop_duplicates and a merge?

Contributor

mikegraham commented Dec 5, 2016

siphash isn't a cryptographic hash. In many senses it should be about as good as one, but it's by definition 64-bit, which obviously will make collisions very normal.

I don't see the uniqueness assumption of categorical values as being needed to do this trick -- shouldn't you be able to accomplish the same optimization trick more reliably with a drop_duplicates and a merge?

@mikegraham

This comment has been minimized.

Show comment
Hide comment
@mikegraham

mikegraham Dec 5, 2016

Contributor

(I don't know what optimizations are hiding in the categorical implementation that are hard to reproduce in other contexts.)

Contributor

mikegraham commented Dec 5, 2016

(I don't know what optimizations are hiding in the categorical implementation that are hard to reproduce in other contexts.)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 5, 2016

Contributor

why do powers of 2 len hash the same? that seems very odd

Contributor

jreback commented Dec 5, 2016

why do powers of 2 len hash the same? that seems very odd

@mrocklin

This comment has been minimized.

Show comment
Hide comment
@mrocklin

mrocklin Dec 5, 2016

Contributor

@jreback one solution to this problem would be to keep this optimization in, but give up quickly if there are duplicate hash values and revert to the old solution. This optimization is probably good 95% of the time. The other 5% will just be slow.

Contributor

mrocklin commented Dec 5, 2016

@jreback one solution to this problem would be to keep this optimization in, but give up quickly if there are duplicate hash values and revert to the old solution. This optimization is probably good 95% of the time. The other 5% will just be slow.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 5, 2016

Contributor

@mikegraham this is not using categoricals at all
it is directly repro as above (with raw siphash)

is my key wrong?

Contributor

jreback commented Dec 5, 2016

@mikegraham this is not using categoricals at all
it is directly repro as above (with raw siphash)

is my key wrong?

@mikegraham

This comment has been minimized.

Show comment
Hide comment
@mikegraham

mikegraham Dec 5, 2016

Contributor

@jreback I hadn't ignored your finding there. I was getting a pandas master environment going to get a closer look at it.

Contributor

mikegraham commented Dec 5, 2016

@jreback I hadn't ignored your finding there. I was getting a pandas master environment going to get a closer look at it.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 5, 2016

Contributor

@mikebailey sure no prob

from pandas import _hash
In [26]: def f(N):
    ...:    result = _hash.hash_object_array(tm.rands_array(N, 2), '0123456789012345', 'utf-8')
    ...:    return result[0] == result[1]
    ...: 

In [27]: f(2**7)
Out[27]: False

In [28]: f(2**8)
Out[28]: True

In [29]: f(2**8+1)
Out[29]: False

In [30]: f(2**9)
Out[30]: True

In [31]: f(2**10)
Out[31]: True

In [32]: f(2**12)
Out[32]: True

In [33]: f(2**12+1)
Out[33]: False
Contributor

jreback commented Dec 5, 2016

@mikebailey sure no prob

from pandas import _hash
In [26]: def f(N):
    ...:    result = _hash.hash_object_array(tm.rands_array(N, 2), '0123456789012345', 'utf-8')
    ...:    return result[0] == result[1]
    ...: 

In [27]: f(2**7)
Out[27]: False

In [28]: f(2**8)
Out[28]: True

In [29]: f(2**8+1)
Out[29]: False

In [30]: f(2**9)
Out[30]: True

In [31]: f(2**10)
Out[31]: True

In [32]: f(2**12)
Out[32]: True

In [33]: f(2**12+1)
Out[33]: False
@mikegraham

This comment has been minimized.

Show comment
Hide comment
@mikegraham

mikegraham Dec 5, 2016

Contributor

Looks like we're using a uint8_t where we mean to be using a uint64_t...let me get you a PR to show you.

Contributor

mikegraham commented Dec 5, 2016

Looks like we're using a uint8_t where we mean to be using a uint64_t...let me get you a PR to show you.

@mikegraham

This comment has been minimized.

Show comment
Hide comment
@mikegraham

mikegraham Dec 5, 2016

Contributor

BTW, I had initially done some tests against externally-computed siphash values...we should probably get those into pandas' suite. I can work with you to do that.

Contributor

mikegraham commented Dec 5, 2016

BTW, I had initially done some tests against externally-computed siphash values...we should probably get those into pandas' suite. I can work with you to do that.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 6, 2016

Contributor

@mikegraham that would be great

Contributor

jreback commented Dec 6, 2016

@mikegraham that would be great

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 6, 2016

Contributor

so #14804/#14805 this seems to fix the issues. thanks @mikegraham

as an aside if you want to do a PR to lock-down the actual resultant hash values for a tests series would be great (so we match external siphash and don't change in the future)

Contributor

jreback commented Dec 6, 2016

so #14804/#14805 this seems to fix the issues. thanks @mikegraham

as an aside if you want to do a PR to lock-down the actual resultant hash values for a tests series would be great (so we match external siphash and don't change in the future)

@jreback jreback modified the milestones: 0.20.0, 0.21.0 Mar 23, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 25, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 30, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 31, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 6, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 10, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 15, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 15, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 17, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 20, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2017

ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes #14711
Closes #15078
Closes #14676

@jreback jreback closed this in #16015 Sep 23, 2017

jreback added a commit that referenced this issue Sep 23, 2017

Categorical type (#16015)
Closes #14711
Closes #15078
Closes #14676

alanbato added a commit to alanbato/pandas that referenced this issue Nov 10, 2017

No-Stream added a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment