Description
Code Sample, a copy-pastable example if possible
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: from pandas.api.types import CategoricalDtype
In [4]: s1 = pd.Series([np.nan, np.nan]).astype('category')
In [5]: s1
Out[5]:
0 NaN
1 NaN
dtype: category
Categories (0, float64): []
In [6]: s2 = pd.Series([np.nan, np.nan]).astype(CategoricalDtype([]))
In [7]: s2
Out[7]:
0 NaN
1 NaN
dtype: category
Categories (0, object): []
In [8]: pd.api.types.union_categoricals([s1,s2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-8e364c994bd7> in <module>
----> 1 pd.api.types.union_categoricals([s1,s2])
C:\Anaconda3\lib\site-packages\pandas\core\dtypes\concat.py in union_categoricals(to_union, sort_categories, ignore_order)
361 if not all(is_dtype_equal(other.categories.dtype, first.categories.dtype)
362 for other in to_union[1:]):
--> 363 raise TypeError("dtype of categories must be the same")
364
365 ordered = False
TypeError: dtype of categories must be the same
Problem description
In the above, if you convert a Series using astype('category') and the Series has all NaN values, the underlying dtype of the categories is float64, while if you pass CategoricalDtype([]), the underlying dtype is object.
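The difference is easy to confirm directly from the categories index of each Series (s1 and s2 as defined above; the dtypes match what the repr shows):

# s1 was built with astype('category'), s2 with CategoricalDtype([])
print(s1.cat.categories.dtype)  # float64
print(s2.cat.categories.dtype)  # object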
There are a couple of issues that I don't know how to deal with:
- If you have categories of a certain underlying dtype, there is no way to change that dtype (e.g., in this example, I would want to change the underlying dtype of the categories backing s1 to be object).
- You can't specify the dtype of the underlying categories in the CategoricalDtype constructor (see the snippet after this list).
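On the second point, as far as I can tell the constructor only accepts categories and ordered, so there is no keyword to ask for a particular categories dtype (at least on 0.23.4):

from pandas.api.types import CategoricalDtype

# CategoricalDtype only takes categories and ordered -- no dtype argument,
# so an empty list of categories comes back as an object-dtype Index
empty_dtype = CategoricalDtype([])
print(empty_dtype.categories.dtype)  # object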
Now, you might ask, why does this matter? Let's suppose I have data that I know to be categorical, I have missing values, and I want to use union_categoricals() to merge the categories of two different Series that are both category dtype, each constructed using astype('category'). If one had all missing values, it ends up with underlying dtype float64, and if the second one had strings and missing values, it ends up with dtype O (object); then I can't do union_categoricals() on them.
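For example (a minimal sketch of that situation; the string values are just placeholders):

import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

# all missing values -> astype('category') infers float64 categories
all_missing = pd.Series([np.nan, np.nan]).astype('category')

# strings plus a missing value -> object-dtype categories
with_strings = pd.Series(['a', 'b', np.nan]).astype('category')

# raises TypeError: dtype of categories must be the same
union_categoricals([all_missing, with_strings])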
I know there are various workarounds for this, but I still think there should be some way to manage the underlying dtype of the categories of a CategoricalDtype.
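One workaround I've found is to re-cast the categories of the all-NaN Series to object before the union; I'm not sure this is the intended way, but it appears to work on 0.23.4 (s1 and s2 are the two Series from the example above):

from pandas.api.types import CategoricalDtype, union_categoricals

# rebuild s1's categorical dtype from its own categories, recast to object;
# passing an Index to CategoricalDtype seems to preserve that Index's dtype
object_categories = CategoricalDtype(s1.cat.categories.astype(object))
s1_object = s1.astype(object_categories)

# now both inputs have object-dtype categories and the union goes through
union_categoricals([s1_object, s2])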
Alternatively, maybe union_categoricals() should be smart enough that, when one of the inputs to the union has no categories at all, it ignores that input's categories dtype when doing the union.
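Roughly, I mean something along these lines (just a sketch of the behaviour I'd expect, not a proposed implementation; the hypothetical union_categoricals_skip_empty assumes Series inputs with category dtype, as in the example above):

import pandas as pd
from pandas.api.types import CategoricalDtype, union_categoricals

def union_categoricals_skip_empty(to_union, **kwargs):
    # sketch: ignore the categories dtype of inputs that have no categories
    non_empty = [s for s in to_union if len(s.cat.categories) > 0]
    if non_empty:
        # use the categories dtype of the first input that actually has categories
        empty_target = CategoricalDtype(non_empty[0].cat.categories[:0])
    else:
        # every input is all-NaN; fall back to object-dtype categories
        empty_target = CategoricalDtype(pd.Index([], dtype=object))
    recast = [
        s if len(s.cat.categories) > 0 else s.astype(empty_target)
        for s in to_union
    ]
    return union_categoricals(recast, **kwargs)

# with this, both the all-NaN/all-NaN case and the all-NaN/strings case go through
union_categoricals_skip_empty([s1, s2])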
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: 3.8.1
pip: 10.0.1
setuptools: 40.4.3
Cython: 0.28.5
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: 0.10.9
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None