ValueError in Categorical Constructor with empty data and boolean categories #22702

Closed
TomAugspurger opened this issue Sep 13, 2018 · 4 comments · Fixed by #22710
Labels
Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions

@TomAugspurger
Contributor

This works:

In [15]: pd.Categorical([], categories=['a', 'b'])
Out[15]: [], Categories (2, object): [a, b]

This doesn't:

In [16]: pd.Categorical([], categories=[True, False])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-8e79cd310199> in <module>()
----> 1 pd.Categorical([], categories=[True, False])

~/sandbox/pandas/pandas/core/arrays/categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    426
    427         else:
--> 428             codes = _get_codes_for_values(values, dtype.categories)
    429
    430         if null_mask.any():

~/sandbox/pandas/pandas/core/arrays/categorical.py in _get_codes_for_values(values, categories)
   2449     (_, _), cats = _get_data_algo(categories, _hashtables)
   2450     t = hash_klass(len(cats))
-> 2451     t.map_locations(cats)
   2452     return coerce_indexer_dtype(t.lookup(vals), cats)
   2453

~/sandbox/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.map_locations()
   1330             raise KeyError(key)
   1331
-> 1332     def map_locations(self, ndarray[object] values):
   1333         cdef:
   1334             Py_ssize_t i, n = len(values)

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'unsigned long'

The values there is array([], dtype=object). It should be int dtype by this point.
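A minimal sketch for checking whether a given pandas installation is affected. The helper name try_empty_categorical is hypothetical and only wraps the two calls from the report; it returns the exception rather than raising, so the script runs on both affected and fixed versions.

```python
import pandas as pd


def try_empty_categorical(categories):
    """Build an empty Categorical with the given categories.

    Returns the Categorical on success, or the ValueError instance
    if the constructor raises (as reported above for boolean categories).
    """
    try:
        return pd.Categorical([], categories=categories)
    except ValueError as exc:
        return exc


print("pandas", pd.__version__)
for cats in (["a", "b"], [True, False]):
    outcome = try_empty_categorical(cats)
    status = "raised ValueError" if isinstance(outcome, ValueError) else "ok"
    print(cats, "->", status)
```

On an affected version the boolean case is expected to report the ValueError, while the string case should succeed everywhere.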

@TomAugspurger TomAugspurger added Dtype Conversions Unexpected or buggy dtype conversions Categorical Categorical Data Type labels Sep 13, 2018
@pganssle
Contributor

pganssle commented Sep 13, 2018

I haven't bisected this to find exactly when it first appeared, but the issue is not present in version 0.20.3 and is present in version 0.21.0.

@pganssle
Contributor

Turns out the first bad commit is 7818486859d1aba53ce359b93cfc772e688958e5.

@pganssle
Contributor

pganssle commented Sep 14, 2018

Ok, so the problem is that in this function (_get_codes_for_values, per the traceback), you have:

    if not is_dtype_equal(values.dtype, categories.dtype):
        values = ensure_object(values)
        categories = ensure_object(categories)

    (hash_klass, vec_klass), vals = _get_data_algo(values, _hashtables)
    (_, _), cats = _get_data_algo(categories, _hashtables)

For boolean categories, the categories' dtype is already object, I guess, so is_dtype_equal(values.dtype, categories.dtype) returns True, the if not is_dtype_equal(...) branch is skipped, and no coercion to object takes place. But inside _get_data_algo there is a call to _ensure_data, which detects that categories is all booleans and converts it to uint64. That invalidates the assumption that values.dtype and categories.dtype are the same.

You would think that this would be a problem for pd.CategoricalIndex([], categories=[1, 2]) as well, since integers should also be coerced, but evidently _ensure_data treats Index and ndarray differently, and ensure_object gives an ndarray while categories is an Index. PR incoming.
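The sequence described in this comment can be modeled with plain NumPy. This is only an illustrative sketch: ensure_data_like is a hypothetical stand-in for the boolean-to-uint64 conversion attributed to _ensure_data above, not the pandas implementation, and the dtype comparison stands in for is_dtype_equal.

```python
import numpy as np


def ensure_data_like(arr):
    """Hypothetical stand-in for the conversion described above:
    a non-empty object array holding only booleans becomes uint64;
    anything else is returned unchanged."""
    if (
        arr.dtype == object
        and len(arr) > 0
        and all(isinstance(x, (bool, np.bool_)) for x in arr)
    ):
        return arr.astype(np.uint64)
    return arr


values = np.array([], dtype=object)                  # the empty data
categories = np.array([True, False], dtype=object)   # boolean categories

# Both dtypes are object, so the "coerce to object" guard does not fire.
guard_fires = values.dtype != categories.dtype

# The later conversion step then changes only one of the two arrays.
vals = ensure_data_like(values)
cats = ensure_data_like(categories)

print("guard fires:", guard_fires)
print("vals dtype:", vals.dtype, "| cats dtype:", cats.dtype)
```

In this model the two arrays leave the guard with matching dtypes and reach the hash table step with different ones, which is the kind of mismatch the "Buffer dtype mismatch" error reports.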

@TomAugspurger
Contributor Author

TomAugspurger commented Sep 14, 2018 via email

@gfyoung gfyoung added the Bug label Sep 14, 2018
@jreback jreback added this to the 0.24.0 milestone Sep 15, 2018
pganssle added a commit to pganssle/pandas that referenced this issue Sep 15, 2018
pganssle added a commit to pganssle/pandas that referenced this issue Sep 19, 2018
pganssle added a commit to pganssle/pandas that referenced this issue Sep 19, 2018
TomAugspurger pushed a commit that referenced this issue Sep 20, 2018
* TST: Add failing test for empty bool Categoricals

* BUG: Failure in empty boolean CategoricalIndex

Fixes GH #22702.
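
A regression check in the spirit of the commit titles above might look like the following. It is a sketch only: the function name is hypothetical, it is not the test added in the referenced commit, and it assumes a pandas version that includes the fix.

```python
import pandas as pd


def check_empty_bool_categoricals():
    """Hypothetical regression check: empty data with boolean categories
    should construct without raising and keep both categories."""
    cat = pd.Categorical([], categories=[True, False])
    assert len(cat) == 0
    assert list(cat.categories) == [True, False]

    idx = pd.CategoricalIndex([], categories=[True, False])
    assert len(idx) == 0
    assert list(idx.categories) == [True, False]
    return cat, idx


check_empty_bool_categoricals()
```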
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018
…#22710)

* TST: Add failing test for empty bool Categoricals

* BUG: Failure in empty boolean CategoricalIndex

Fixes GH pandas-dev#22702.
4 participants