Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make categories and ordered part of CategoricalDtype #14711

Closed
TomAugspurger opened this issue Nov 22, 2016 · 35 comments
Closed

Make categories and ordered part of CategoricalDtype #14711

TomAugspurger opened this issue Nov 22, 2016 · 35 comments
Labels
API Design Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Milestone

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 22, 2016

This is to discuss pushing the Categorical.categories and
Categorical.ordered information into the extension type CategoricalDtype.

pd.CategoricalDtype(categories, ordered=False)

Note that there is no values argument. This is a type constructor, that
isn't attached to any specific Categorical instance.

Why?

Several times now (read_csv(..., dtype=...), .astype(...), Series([], dtype=...))
we have places where we accept dtype='category' which takes the values
in the method (the series, or column from the CSV, etc.)
and hands it off to the value constructor, with no control over the
categories and ordered arguments.

Categorical(values, categories=None, ordered=False)

The proposal here would add the categories and ordered
attributes / arguments to CategoricalDtype and provide a common API
for specifying non-default parameters for the Categorical constructor
in methods like read_csv, astype, etc.

t = pd.CategoricalDtype(['low', 'med', 'high'], ordered=True)
pd.read_csv('foo.csv', dtype={'A': int, 'B': t)
pd.Series(['high', 'low', 'high'], dtype=t)

s = pd.Series(['high', 'low', 'high'])
s.astype(t)

We would continue to accept dtype='category'.

This becomes even more import when doing operations on larger than memory datasets with something like dask or even (read_csv(..., chunksize=N)). Right now you don't have an easy way to specify the categories or ordered for columns (assuming you know them ahead of time).

Issues

  1. CategoricalDtype currently isn't part of the public API. Which methods
    on it do we make public?
  2. Equality semantics: For backwards compat, I think all instances
    of CategoricalDtype should compare equal with all others. You can use
    identity to check if two types are the same
t1 = pd.CategoricalDtype(['a', 'b'], ordered=True)
t2 = pd.CategoricalDtype(['a', 'b'], ordered=False)

t1 == t2  # True
t1 is t2  # False
t1 is t1  # True
  1. Should the categories argument be required? Currently dtype='category'
    says 1.) infer the categories based on the values, and 2.) it's unordered.
    Would CategoricalDtype(None, ordered=False) be allowed?
  2. Strictness? If I say
pd.Series(['a', 'b', 'c'], dtype=pd.CategoricalDtype(['a', 'b']))

What happens? I would probably expect a TypeError or ValueError as c
isn't "supposed" to be there. Or do we replace 'c' with NA? Should
strict be another parameter to CategoricalDtype (I don't think so).

I'm willing to work on this over the next couple weeks.

xref #14676 (astype)
xref #14503 (read_csv)

@TomAugspurger
Copy link
Contributor Author

cc @JanSchulz (let me know if you want me to stop pinging you on this categorical issues 😄 ). Also @mrocklin I think this may interest you as the fastparquet stuff could use CategoricalDtype since you know the categories ahead of time.

@jankatins
Copy link
Contributor

Will this also mean that Categorical is changed to save the categories/order information in the the dtype object? As pandas2 seems to go the "we ave our own dtypes"-way, it seems that would be a sensible way (if that would also happening with the rest of the dtypes)

[Thanks for pinging me!]

@TomAugspurger
Copy link
Contributor Author

Will this also mean that Categorical is changed to save the categories/order information in the the dtype object?

Yeah, pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'], ordered=True).dtype would be CategoricalDtype(['a', 'b', 'c'], ordered=True).

Dunno if we want to expand the pd.Categorical constructor to take a dtype=CategoricalDtype option. Perhaps.

@jreback
Copy link
Contributor

jreback commented Nov 22, 2016

yes, this was the original plan when I first did this, but too many moving parts.

If this can be implemented in a back-compat manner (as this is pretty much an implementation detail). I think would be great.

@jorisvandenbossche
Copy link
Member

  1. CategoricalDtype currently isn't part of the public API. Which methods on it do we make public?

Specific for categorical, I suppose we would have a categories and ordered attribute? But for the existing attributes/methods, maybe to start with we could say in the docs that none of these are public? (not sure if there are that you would want to use)

  1. Equality semantics: For backwards compat, I think all instances of CategoricalDtype should compare equal with all others. You can use identity to check if two types are the same

Sounds OK

  1. Should the categories argument be required? Would CategoricalDtype(None, ordered=False) be allowed?

I would not allow this to start with. Is the idea that you can this way let the categories be inferred from the values, but set the ordered attribute through the dtype? Or what would be the application?

  1. Strictness? I would probably expect a TypeError or ValueError as c isn't "supposed" to be there. Or do we replace 'c' with NA?

It seems that we currently replace with NA:

In [53]: pd.Categorical(['a', 'b', 'c'], categories=['a', 'b'])
Out[53]: 
[a, b, NaN]
Categories (2, object): [a, b]

In [54]: pd.Series(['a', 'b', 'c']).astype('category', categories=['a', 'b'])
Out[54]: 
0      a
1      b
2    NaN
dtype: category
Categories (2, object): [a, b]

although I also think I would like a TypeError more as the default behaviour.

@jorisvandenbossche
Copy link
Member

@TomAugspurger How is this issue different from #14676, can that one be closed?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 24, 2016 via email

@TomAugspurger
Copy link
Contributor Author

Small update for future reference. I'm having trouble with the current approach where CategoricalDtype[categories, ordered] is a singleton, i.e. one instance of CategoricalDtype for each category and ordered. The problem is _cache being a (weakref) dictionary. We need

CategoricalDtype([1, 2])

to be different from

CategoricalDtype([1., 2.])

but if we use a dictionary to do singleton stuff, the keys compare equal so we only ever get one. @jreback do you have any links to info on what-all requirements the numpy duck-type has to satisfy?

@jreback
Copy link
Contributor

jreback commented Dec 1, 2016

so the cache would need to be changed to something like a WeakDictionary and use keys like:

(c.categories, c.ordered)

BUT the categories are an Index which are not hashable. But prob could do something like:

# showing that ordering does matter (I am deliberately including the index, so we track ordering)
In [50]: c = Series(pd.Categorical([1, 2], ordered=True))

In [51]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[51]: 6505546918052763540

In [52]: c = Series(pd.Categorical([2, 1], ordered=True))

In [53]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[53]: 6135121269729399882

In [54]: c = Series(pd.Categorical([1., 2.], ordered=True))

In [55]:  np.bitwise_xor.reduce(hash_pandas_object(c).values)
Out[55]: 6505546918052763540

# you would actually store the CategoricalDtype here (and not c)
In [58]: {(6505546918052763540, c.cat.ordered) : c}
Out[58]: 
{(6505546918052763540, True): 0    1.0
 1    2.0
 dtype: category
 Categories (2, float64): [1.0 < 2.0]}

cc @mikegraham

@jreback
Copy link
Contributor

jreback commented Dec 1, 2016

I think I may add an option to hash_pandas_object to directly produce this (if so desired) as this is a data hash for a pandas object.

@jreback
Copy link
Contributor

jreback commented Dec 1, 2016

jreback@908161b

@mikegraham
Copy link
Contributor

Since an index's __eq__ does not return a bool, I'm not sure that having indices be directly hashable for use as dict keys makes sense to me. It seems to me maybe some proxy is in order to be able to do this...

Is it the case the Index order matters when the categorical is ordered=True and not when not?

@jreback
Copy link
Contributor

jreback commented Dec 1, 2016

this is not competing with eq ; we actually use .equals to compare equivalents generally for an Index

furthermore this by by default DOES consider ordering (a user could control this by index=False)

@TomAugspurger
Copy link
Contributor Author

@jreback great, thanks. Had to make a minor patch for a parameter that controls whether or not to categorize before hashing. We can't categorize in the CategoricalDtype constructor since we'll hit an infinite loop.

diff --git a/pandas/tools/hashing.py b/pandas/tools/hashing.py
index 3ed51497d..94739b199 100644
--- a/pandas/tools/hashing.py
+++ b/pandas/tools/hashing.py
@@ -86,7 +86,8 @@ def hash_pandas_object(obj, index=True, encoding='utf8', hash_key=None,
     return h
 
 
-def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
+def hash_array(vals, encoding='utf8', hash_key=None, reduce=False,
+               categorize_objects=True):
     """
     Given a 1d array, return an array of deterministic integers.
 
@@ -100,6 +101,7 @@ def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
     hash_key : string key to encode, default to _default_hash_key
     reduce : boolean, default False
         produce a single hash result
+    categorize_objects : bool
 
     Returns
     -------
@@ -137,16 +139,19 @@ def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
     else:
 
         # its MUCH faster to categorize object dtypes, then hash and rename
-        codes, categories = factorize(vals, sort=False)
-        categories = Index(categories)
-        c = Series(Categorical(codes, categories,
-                               ordered=False, fastpath=True))
-        vals = _hash.hash_object_array(categories.values,
-                                       hash_key,
-                                       encoding)
-
-        # rename & extract
-        vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        if categorize_objects:
+            codes, categories = factorize(vals, sort=False)
+            categories = Index(categories)
+            c = Series(Categorical(codes, categories,
+                                   ordered=False, fastpath=True))
+            vals = _hash.hash_object_array(categories.values,
+                                           hash_key,
+                                           encoding)
+
+            # rename & extract
+            vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        else:
+            vals = _hash.hash_object_array(np.asarray(vals), hash_key, encoding)
 
     # Then, redistribute these 64-bit ints within the space of 64-bit ints
     vals ^= vals >> 30

@mikegraham
Copy link
Contributor

this is not competing with __eq__

I must have taken too literally what you meant when you showed an index being used as part of a dict key.

@jreback
Copy link
Contributor

jreback commented Dec 1, 2016

@TomAugspurger I wouldn't add like that. instead simply call
_hash.hash_object_array(np.asarray(vals), hash_key, encoding) directly

@jreback
Copy link
Contributor

jreback commented Dec 1, 2016

@mikegraham yeah this is to 'hash' Index objects, so in theory it would work for this purpose, IOW, providing a data-hash. But I think we would need much more validation / testing if we actually wanted to replace the equiv of __hash__ for an Index (it may be worth it, but I suspect this is too compute intensive for general usage)

@mikegraham
Copy link
Contributor

mikegraham commented Dec 1, 2016

@jreback I read your code closer and I see what you've done now. I think this code might be a bit unsafe because of hash collisions. If this was a 160+ bit cryptographic hash, I could imagine feeling safe enough, but the hashing here.....I'm really not certain it's the right thing to do.

I'd consider keying on something like

class UnorderedCategoricalKey(object):
    def __init__(self, index):
        self.index = index
    def __hash__(self):
        return np.bitwise_xor.reduce(hash_pandas_object(c).values)
    def __eq__(self, other):
        return (isinstance(other, type(self)) and 
                self.index.dtype is other.index.dtype and
                self.index.sort_values().equals(other.index.sort_values()))

class OrderedCategoricalKey(object):
    def __init__(self, index):
        self.index = index
    def __hash__(self):
        return siphash(hash_pandas_object(c).data)
    def __eq__(self, other):
        return (isinstance(other, type(self)) and 
                self.index.dtype is other.index.dtype and
                self.index.equals(other.index))

Obviously having the key and value have to have references to the index is annoying, but you can get weakref semantics with some work.

@mrocklin
Copy link
Contributor

mrocklin commented Dec 5, 2016

As predicted by @mikegraham , here is a failure:

In [1]: import numpy as np

In [2]: from pandas.tools.hashing import hash_array

In [3]: L = ['Ingrid-9Z9fKIZmkO7i7Cn51Li34pJm44fgX6DYGBNj3VPlOH50m7HnBlPxfIwFMrc
   ...: NJNMP6PSgLmwWnInciMWrCSAlLEvt7JkJl4IxiMrVbXSa8ZQoVaq5xoQPjltuJEfwdNlO6jo
   ...: 8qRRHvD8sBEBMQASrRa6TsdaPTPCBo3nwIBpE7YzzmyH0vMBhjQZLx1aCT7faSEx7PgFxQhH
   ...: dKFWROcysamgy9iVj8DO2Fmwg1NNl93rIAqC3mdqfrCxrzfvIY8aJdzin2cHVzy3QUJxZgHv
   ...: tUtOLxoqnUHsYbNTeq0xcLXpTZEZCxD4PGubIuCNf32c33M7HFsnjWSEjE2yVdWKhmSVodyF
   ...: 8hFYVmhYnMCztQnJrt3O8ZvVRXd5IKwlLexiSp4h888w7SzAIcKgc3g5XQJf6MlSMftDXm9l
   ...: IsE1mJNiJEv6uY6pgvC3fUPhatlR5JPpVAHNSbSEE73MBzJrhCAbOLXQumyOXigZuPoME7Qg
   ...: JcBalliQol7YZ9', 'Tim-b9MddTxOWW2AT1Py6vtVbZwGAmYCjbp89p8mxsiFoVX4FyDOF3
   ...: wFiAkyQTUgwg9sVqVYOZo09Dh1AzhFHbgij52ylF0SEwgzjzHH8TGY8Lypart4p4onnDoDvV
   ...: MBa0kdthVGKl6K0BDVGzyOXPXKpmnMF1H6rJzqHJ0HywfwS4XYpVwlAkoeNsiicHkJUFdUAh
   ...: G229INzvIAiJuAHeJDUoyO4DCBqtoZ5TDend6TK7Y914yHlfH3g1WZu5LksKv68VQHJriWFY
   ...: usW5e6ZZ6dKaMjTwEGuRgdT66iU5nqWTHRH8WSzpXoCFwGcTOwyuqPSe0fTe21DVtJn1FKj9
   ...: F9nEnR9xOvJUO7E0piCIF4Ad9yAIDY4DBimpsTfKXCu1vdHpKYerzbndfuFe5AhfMduLYZJi
   ...: 5iAw8qKSwR5h86ttXV0Mc0QmXz8dsRvDgxjXSmupPxBggdlqUlC828hXiTPD7am0yETBV0F3
   ...: bEtvPiNJfremszcV8NcqAoARMe']

In [4]: hash_array(np.asarray(L, dtype=object), 'utf8')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-5f657fa82496> in <module>()
----> 1 hash_array(np.asarray(L, dtype=object), 'utf8')

/home/mrocklin/workspace/pandas/pandas/tools/hashing.py in hash_array(vals, encoding, hash_key)
    127 
    128         # rename & extract
--> 129         vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
    130 
    131     # Then, redistribute these 64-bit ints within the space of 64-bit ints

/home/mrocklin/workspace/pandas/pandas/core/base.py in f(self, *args, **kwargs)
    208 
    209             def f(self, *args, **kwargs):
--> 210                 return self._delegate_method(name, *args, **kwargs)
    211 
    212             f.__name__ = name

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _delegate_method(self, name, *args, **kwargs)
   1959         from pandas import Series
   1960         method = getattr(self.categorical, name)
-> 1961         res = method(*args, **kwargs)
   1962         if res is not None:
   1963             return Series(res, index=self.index)

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in rename_categories(self, new_categories, inplace)
    756         """
    757         cat = self if inplace else self.copy()
--> 758         cat.categories = new_categories
    759         if not inplace:
    760             return cat

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _set_categories(self, categories, fastpath)
    586         """
    587 
--> 588         categories = self._validate_categories(categories, fastpath=fastpath)
    589         if (not fastpath and self._categories is not None and
    590                 len(categories) != len(self._categories)):

/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _validate_categories(cls, categories, fastpath)
    572 
    573             if not categories.is_unique:
--> 574                 raise ValueError('Categorical categories must be unique')
    575 
    576         return categories

ValueError: Categorical categories must be unique

Both of these inputs hash down to the same value, which causes Pandas to correctly raise.

@mrocklin
Copy link
Contributor

mrocklin commented Dec 5, 2016

@mikegraham do you have suggestions on the best course of action here?

@jreback
Copy link
Contributor

jreback commented Dec 5, 2016

jreback@ac360e9

here is the test code.

I think this might be some sort of overflow in siphash.

@mrocklin
Copy link
Contributor

mrocklin commented Dec 5, 2016

Is siphash a cryptographic hash? If not then it should not surprise us to see collisions.

@jreback
Copy link
Contributor

jreback commented Dec 5, 2016

so these are failing if len(string) is an exact power of 2, e.g. these area 2**9, 2**9 + 1 is fine.

@mikegraham
Copy link
Contributor

BTW, I had initially done some tests against externally-computed siphash values...we should probably get those into pandas' suite. I can work with you to do that.

@jreback
Copy link
Contributor

jreback commented Dec 6, 2016

@mikegraham that would be great

@jreback
Copy link
Contributor

jreback commented Dec 6, 2016

so #14804/#14805 this seems to fix the issues. thanks @mikegraham

as an aside if you want to do a PR to lock-down the actual resultant hash values for a tests series would be great (so we match external siphash and don't change in the future)

@jreback jreback modified the milestones: 0.20.0, 0.21.0 Mar 23, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 25, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 30, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 31, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 6, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 10, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 15, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 15, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 17, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 20, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2017
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
jreback pushed a commit that referenced this issue Sep 23, 2017
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

No branches or pull requests

6 participants