Missing Values and Categoricals - inconsistent dtypes #23242

@Dr-Irv

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas.api.types import CategoricalDtype

In [4]: s1 = pd.Series([np.nan, np.nan]).astype('category')

In [5]: s1
Out[5]:
0   NaN
1   NaN
dtype: category
Categories (0, float64): []

In [6]: s2 = pd.Series([np.nan, np.nan]).astype(CategoricalDtype([]))

In [7]: s2
Out[7]:
0    NaN
1    NaN
dtype: category
Categories (0, object): []

In [8]: pd.api.types.union_categoricals([s1,s2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-8e364c994bd7> in <module>
----> 1 pd.api.types.union_categoricals([s1,s2])

C:\Anaconda3\lib\site-packages\pandas\core\dtypes\concat.py in union_categoricals(to_union, sort_categories, ignore_order)
    361     if not all(is_dtype_equal(other.categories.dtype, first.categories.dtype)
    362                for other in to_union[1:]):
--> 363         raise TypeError("dtype of categories must be the same")
    364
    365     ordered = False

TypeError: dtype of categories must be the same

Problem description

In the example above, converting an all-NaN Series with astype('category') produces categories with dtype float64, whereas converting it with astype(CategoricalDtype([])) produces categories with dtype object.

There are a couple of issues that I don't know how to deal with:

  1. Once categories have a given underlying dtype, there is no way to change that dtype (e.g., in this example, I would want to change the dtype of the categories backing s1 to object).
  2. You can't specify the dtype of the underlying categories in the CategoricalDtype constructor.
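
For reference, here is a minimal sketch of one workaround for the first point. It rebuilds the categories as an explicit object-dtype Index and passes that to Series.cat.set_categories. The helper name with_object_categories is hypothetical, and the sketch assumes set_categories keeps the dtype of an Index it is given; I have not checked this against every pandas version.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals


def with_object_categories(s):
    """Hypothetical helper: same values, categories re-held as an object Index."""
    # Rebuild the categories as an explicit object-dtype Index.
    obj_categories = pd.Index(s.cat.categories, dtype=object)
    # Assumption: set_categories keeps the dtype of an Index passed to it.
    return s.cat.set_categories(obj_categories)


s1 = pd.Series([np.nan, np.nan]).astype("category")  # no categories at all
s3 = pd.Series(["a", np.nan]).astype("category")     # string categories

fixed = with_object_categories(s1)
# Both inputs now carry object categories, so the dtype check should pass.
combined = union_categoricals([fixed, with_object_categories(s3)])
```

This works around the problem per Series, but it still has to be applied by hand before every union, which is why I would prefer a way to control the categories dtype directly.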

Now, you might ask, why does this matter? Suppose I have data that I know to be categorical, with missing values, spread across two Series that were each converted using astype('category'), and I want to merge their categories with union_categoricals(). If one Series is entirely missing values, its categories have dtype float. If the other contains strings and missing values, its categories have dtype O. In that case union_categoricals() fails on the pair.

I know there are various workarounds for this, but I still think there should be some way to manage the underlying dtype of the categories of a CategoricalDtype.

Alternatively, maybe union_categoricals() should be smart enough to ignore the categories dtype of any input that has no categories when computing the union.
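
A rough sketch of what that alternative could look like as a user-level wrapper. This is not how pandas implements anything; union_ignoring_empty is a hypothetical name, and the fallback to object when every input is empty is my own assumption.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals


def union_ignoring_empty(to_union):
    """Hypothetical wrapper: inputs with zero categories adopt the categories
    dtype of the first input that has any, then the normal union runs."""
    to_union = list(to_union)
    # Reference dtype: first input with at least one category.
    # Assumption: fall back to object if every input is empty.
    target = next(
        (s.cat.categories.dtype for s in to_union if len(s.cat.categories) > 0),
        np.dtype(object),
    )
    aligned = [
        # Swap in an empty Index of the target dtype; the values stay all-NaN.
        s.cat.set_categories(pd.Index([], dtype=target))
        if len(s.cat.categories) == 0
        else s
        for s in to_union
    ]
    return union_categoricals(aligned)


empty = pd.Series([np.nan, np.nan]).astype("category")
filled = pd.Series(["a", np.nan, "b"]).astype("category")
result = union_ignoring_empty([empty, filled])
```

Inputs that do have categories are passed through untouched, so a genuine dtype mismatch between two non-empty inputs would still raise the same TypeError.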

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.8.1
pip: 10.0.1
setuptools: 40.4.3
Cython: 0.28.5
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: 0.10.9
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
