Construction of Categorical from array with pd.NA failing #31927

jorisvandenbossche · 2020-02-12T15:11:38Z

So creating a Categorical from an array with pd.NA fails:

In [10]: pd.Categorical(np.array(["a", "b", pd.NA], dtype=object))  
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/scipy/pandas/pandas/core/arrays/categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    354             try:
--> 355                 codes, categories = factorize(values, sort=True)
    356             except TypeError:

~/scipy/pandas/pandas/core/algorithms.py in factorize(values, sort, na_sentinel, size_hint)
    635         codes, uniques = _factorize_array(
--> 636             values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
    637         )

~/scipy/pandas/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
    483     table = hash_klass(size_hint or len(values))
--> 484     uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)
    485 

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()

~/scipy/pandas/pandas/_libs/missing.pyx in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

while from a list actually "works":

In [11]: pd.Categorical(["a", "b", pd.NA]) 
Out[11]: 
[a, b, NaN]
Categories (2, object): [a, b]

This also means that creating a Categorical from a StringArray with missing values won't work (it fails with the same error as above).
That said, I am not sure it should work (at least not until we can have a "string" dtype Index as the categories of the created Categorical).
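For convenience, the two cases above can be put side by side in one script. This is only a sketch of the report: the variable names are illustrative, and the `try`/`except` is there so the script runs to completion whether or not the installed pandas version raises for the ndarray input.

```python
import numpy as np
import pandas as pd

values = ["a", "b", pd.NA]

# List input: reported above as working, with pd.NA treated as missing.
from_list = pd.Categorical(values)
print(from_list.codes.tolist())  # missing values are encoded as -1

# Object-dtype ndarray input: the failing case from this report.
try:
    from_array = pd.Categorical(np.array(values, dtype=object))
except TypeError as exc:
    # Affected versions end up here ("boolean value of NA is ambiguous").
    from_array = None
    print(f"TypeError: {exc}")
else:
    print(from_array.codes.tolist())
```

If both inputs are handled consistently, the two printed code lists should be identical.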


dsaxton · 2020-02-12T19:11:18Z

Not too familiar with this part of the code, but could the issue be here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/hashtable_class_helper.pxi.in#L433? Can that comparison be short-circuited somehow by checking if value is identically na_value2 or equal to na_value2?

jorisvandenbossche · 2020-02-12T19:59:38Z

Yes, that sounds correct (except that it's in the PyObject or String hashtable lower in the file, but with similar code)
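The short-circuit idea can be illustrated in plain Python, outside the Cython hashtable. This is a minimal sketch under stated assumptions: `FakeNA` is a hypothetical stand-in that mimics only the two relevant behaviours of pd.NA (comparisons return NA, truth-testing raises), and the self-inequality check is a simplified model of a missing-value test, not the actual hashtable code.

```python
class FakeNA:
    """Hypothetical stand-in for pd.NA (assumption, not the pandas class)."""

    def __eq__(self, other):
        return self  # comparisons propagate NA instead of returning a bool

    def __ne__(self, other):
        return self

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = FakeNA()


def is_missing_naive(value):
    # NaN-style check: truth-testing the comparison result calls __bool__,
    # which raises when the result is NA.
    return bool(value != value)


def is_missing_short_circuit(value):
    # Identity is checked first; `or` short-circuits, so the comparison is
    # never evaluated (and never truth-tested) when value is NA itself.
    return value is NA or bool(value != value)


print(is_missing_short_circuit(NA))
print(is_missing_short_circuit("a"))
try:
    is_missing_naive(NA)
except TypeError as exc:
    print(f"naive check raised: {exc}")
```

The identity test is cheap and never invokes `__bool__`, which is why placing it before the comparison avoids the error.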

jorisvandenbossche added the Categorical and NA - MaskedArrays labels Feb 12, 2020

jorisvandenbossche mentioned this issue Feb 12, 2020

Add test for MultiIndex Construction with pd.NA #31883

Closed

dsaxton mentioned this issue Feb 12, 2020

BUG: Fix construction of Categorical from pd.NA #31939

Merged


jreback added this to the 1.0.2 milestone Feb 16, 2020

jreback closed this as completed in #31939 Feb 23, 2020
