ValueError in Categorical Constructor with empty data and boolean categories #22702

Closed
TomAugspurger opened this issue Sep 13, 2018 · 4 comments

@TomAugspurger
Contributor

commented Sep 13, 2018

This works,

In [15]: pd.Categorical([], categories=['a', 'b'])
Out[15]: [], Categories (2, object): [a, b]

This doesn't

In [16]: pd.Categorical([], categories=[True, False])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-8e79cd310199> in <module>()
----> 1 pd.Categorical([], categories=[True, False])

~/sandbox/pandas/pandas/core/arrays/categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    426
    427         else:
--> 428             codes = _get_codes_for_values(values, dtype.categories)
    429
    430         if null_mask.any():

~/sandbox/pandas/pandas/core/arrays/categorical.py in _get_codes_for_values(values, categories)
   2449     (_, _), cats = _get_data_algo(categories, _hashtables)
   2450     t = hash_klass(len(cats))
-> 2451     t.map_locations(cats)
   2452     return coerce_indexer_dtype(t.lookup(vals), cats)
   2453

~/sandbox/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.map_locations()
   1330             raise KeyError(key)
   1331
-> 1332     def map_locations(self, ndarray[object] values):
   1333         cdef:
   1334             Py_ssize_t i, n = len(values)

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'unsigned long'

The values array there is array([], dtype=object). It should be int dtype by this point.

@pganssle

Contributor

commented Sep 13, 2018

I haven't bisected this to find exactly when it first appeared, but the issue is not present in version 0.20.3 and is present in version 0.21.0.
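For comparing versions, a small probe along these lines can be run under each installed pandas. It is only a sketch: try_empty_categorical is a hypothetical helper name, and the only pandas calls used are pd.Categorical and pd.__version__.

```python
import pandas as pd


def try_empty_categorical(categories):
    """Build an empty Categorical and report whether construction succeeded.

    Hypothetical helper for illustration; not part of pandas.
    """
    try:
        cat = pd.Categorical([], categories=categories)
    except ValueError as exc:
        # On affected versions the boolean case raises the
        # "Buffer dtype mismatch" error shown in the traceback.
        return "ValueError: {}".format(exc)
    return "ok ({} categories)".format(len(cat.categories))


print("pandas", pd.__version__)
string_result = try_empty_categorical(["a", "b"])
bool_result = try_empty_categorical([True, False])
print("string categories: ", string_result)
print("boolean categories:", bool_result)
```

On an affected version the second line should report the ValueError, while an unaffected version should report ok for both.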

@pganssle

Contributor

commented Sep 14, 2018

Turns out the first bad commit is 7818486859d1aba53ce359b93cfc772e688958e5.

@pganssle

Contributor

commented Sep 14, 2018

OK, so the problem is that in _get_codes_for_values, you have:

    if not is_dtype_equal(values.dtype, categories.dtype):
        values = ensure_object(values)
        categories = ensure_object(categories)

    (hash_klass, vec_klass), vals = _get_data_algo(values, _hashtables)
    (_, _), cats = _get_data_algo(categories, _hashtables)

For boolean categories, the categories' dtype is already object I guess, so is_dtype_equal(values.dtype, categories.dtype) returns True, the if not ... check fails, and no coercion to object takes place. But inside _get_data_algo there is a call to _ensure_data, which detects that categories is all booleans and converts it to uint64, which invalidates the assumption that values.dtype and categories.dtype are the same.
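The sequence described above can be modeled with plain NumPy. This is a toy sketch, not pandas code: toy_ensure_data is a hypothetical stand-in that assumes only the behavior described here (an object array made up entirely of booleans becomes uint64, anything else is left alone).

```python
import numpy as np


def toy_ensure_data(arr):
    """Hypothetical stand-in for the conversion described above.

    Assumption: a non-empty object array whose elements are all booleans
    is converted to uint64; every other array is returned unchanged.
    """
    is_all_bool = len(arr) > 0 and all(
        isinstance(x, (bool, np.bool_)) for x in arr
    )
    if arr.dtype == object and is_all_bool:
        return arr.astype("uint64")
    return arr


values = np.array([], dtype=object)
categories = np.array([True, False], dtype=object)

# Both arrays are object dtype, so the dtypes compare equal and
# nothing is coerced at this point.
dtypes_equal_before = values.dtype == categories.dtype

# The later conversion changes only the categories, so the two
# arrays no longer share a dtype.
vals = toy_ensure_data(values)
cats = toy_ensure_data(categories)
dtypes_equal_after = vals.dtype == cats.dtype

print(dtypes_equal_before, vals.dtype, cats.dtype, dtypes_equal_after)
```

In this model the hash table would be chosen from the object-dtype values but fed the uint64 categories, which is the same shape of mismatch the traceback reports.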

You would think this would also be a problem for pd.CategoricalIndex([], categories=[1, 2]), since integers should also be coerced, but evidently _ensure_data treats Index and ndarray differently, and ensure_object gives an ndarray while categories is an Index. PR incoming.

pganssle added a commit to pganssle/pandas that referenced this issue Sep 14, 2018

@TomAugspurger

Contributor Author

commented Sep 14, 2018

gfyoung added the Bug label Sep 14, 2018

jreback added this to the 0.24.0 milestone Sep 15, 2018

pganssle added a commit to pganssle/pandas that referenced this issue Sep 15, 2018

pganssle added a commit to pganssle/pandas that referenced this issue Sep 19, 2018

TomAugspurger added a commit that referenced this issue Sep 20, 2018

BUG: Empty CategoricalIndex fails with boolean categories (#22710)
* TST: Add failing test for empty bool Categoricals

* BUG: Failure in empty boolean CategoricalIndex

Fixes GH #22702.
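A regression check in the spirit of this commit might look like the sketch below. It is not the test added in the PR: check_empty_bool_categoricals is a hypothetical name, and it assumes a pandas version that includes the fix (the issue was assigned to the 0.24.0 milestone).

```python
import pandas as pd


def check_empty_bool_categoricals():
    """Hypothetical regression check; not the test from the PR."""
    # Empty Categorical with boolean categories should construct cleanly.
    cat = pd.Categorical([], categories=[True, False])
    assert len(cat) == 0
    assert list(cat.categories) == [True, False]

    # The same should hold for an empty CategoricalIndex.
    idx = pd.CategoricalIndex([], categories=[True, False])
    assert len(idx) == 0
    assert list(idx.categories) == [True, False]
    return cat, idx


cat, idx = check_empty_bool_categoricals()
```

On an affected version the first constructor call would raise the ValueError from the report instead of returning.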

Sup3rGeo added a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018

BUG: Empty CategoricalIndex fails with boolean categories (pandas-dev#22710)

* TST: Add failing test for empty bool Categoricals

* BUG: Failure in empty boolean CategoricalIndex

Fixes GH pandas-dev#22702.