Construction of Categorical from array with pd.NA failing #31927

Closed
jorisvandenbossche opened this issue Feb 12, 2020 · 2 comments · Fixed by #31939
Labels
Categorical Categorical Data Type NA - MaskedArrays Related to pd.NA and nullable extension arrays

@jorisvandenbossche
Member

So creating a Categorical from an array with pd.NA fails:

In [10]: pd.Categorical(np.array(["a", "b", pd.NA], dtype=object))  
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/scipy/pandas/pandas/core/arrays/categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    354             try:
--> 355                 codes, categories = factorize(values, sort=True)
    356             except TypeError:

~/scipy/pandas/pandas/core/algorithms.py in factorize(values, sort, na_sentinel, size_hint)
    635         codes, uniques = _factorize_array(
--> 636             values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
    637         )

~/scipy/pandas/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
    483     table = hash_klass(size_hint or len(values))
--> 484     uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)
    485 

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()

~/scipy/pandas/pandas/_libs/missing.pyx in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

while from a list actually "works":

In [11]: pd.Categorical(["a", "b", pd.NA]) 
Out[11]: 
[a, b, NaN]
Categories (2, object): [a, b]

This also means that creating a Categorical from a StringArray with missing values won't work (with the same error as above).
However, I am not sure whether it should work (at least until a "string" dtype Index can serve as the categories of the created Categorical).
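
For anyone hitting this before a fix lands, a minimal workaround sketch follows. It assumes only what is reported above, namely that the list-based constructor path handles `pd.NA`; converting the object array with `tolist()` is my suggestion, not something proposed in this thread.

```python
import numpy as np
import pandas as pd

# Object-dtype array containing pd.NA, as in the failing example above.
values = np.array(["a", "b", pd.NA], dtype=object)

# Workaround sketch (an assumption, not the eventual fix): route the data
# through a plain Python list, the path reported to work in this issue.
cat = pd.Categorical(values.tolist())

# Missing entries are not a category; they are encoded with code -1.
print(list(cat.codes))
print(list(cat.categories))
```

The missing value is not added to the categories; it only shows up as the `-1` code.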

@jorisvandenbossche jorisvandenbossche added Categorical Categorical Data Type NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Feb 12, 2020
@dsaxton
Member

dsaxton commented Feb 12, 2020

Not too familiar with this part of the code, but could the issue be here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/hashtable_class_helper.pxi.in#L433? Can that comparison be short-circuited somehow by checking if value is identically na_value2 or equal to na_value2?

@jorisvandenbossche
Member Author

Yes, that sounds correct (except that it's in the PyObject or String hashtable lower in the file, which has similar code).
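
To illustrate the short-circuit idea discussed above, here is a pure-Python sketch. It is not the pandas Cython code: `_FakeNA`, `naive_match`, and `guarded_match` are hypothetical names, and `_FakeNA` only imitates the two relevant behaviours of `pd.NA` (comparisons return NA, and `bool()` raises).

```python
class _FakeNA:
    """Hypothetical stand-in for pd.NA, for illustration only."""

    def __eq__(self, other):
        # Comparisons propagate the missing value instead of returning a bool.
        return self

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = _FakeNA()


def naive_match(value, na_value):
    # A plain equality test: NA == NA yields NA, and bool(NA) raises.
    return bool(value == na_value)


def guarded_match(value, na_value):
    # Identity check first, so NA is recognised without evaluating ==.
    if value is na_value:
        return True
    # If only one side is NA, == would still be ambiguous, so treat it as no match.
    if value is NA or na_value is NA:
        return False
    return bool(value == na_value)


try:
    naive_match(NA, NA)
except TypeError as exc:
    print("naive check failed:", exc)

print(guarded_match(NA, NA))    # identity short-circuits the comparison
print(guarded_match("a", NA))   # guarded, no ambiguous comparison evaluated
```

Note that the identity check alone only covers the case where the value is the NA object itself; the second guard is an extra assumption of this sketch for a non-NA value compared against NA.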

3 participants