BUG: pandas.SparseDtype from pandas.CategoricalDtype fails #39874

PetarMPetrov · 2021-02-17T22:33:31Z

Code Sample

import numpy as np
import pandas as pd
from scipy import sparse as sp_sparse

# Create categorical type and sparse type from it.
custom_type = pd.CategoricalDtype(categories=['Zero', 'One'])
categorical_sparse_type = pd.SparseDtype(dtype=custom_type, fill_value='Zero')

# Create sparse type from string type
string_sparse_type = pd.SparseDtype(dtype='str', fill_value='Zero')

# Dummy Data
data = np.array([['Zero', 'Zero'],
                 ['One', 'Zero']])

# Create sparse data frame from categorical sparse type
categorical_sparse_df = pd.DataFrame(
    data=data,
    columns=list('AB'),
).astype(categorical_sparse_type)

# Create sparse data frame from string sparse type
string_sparse_df = pd.DataFrame(
    data=data,
    columns=list('AB'),
).astype(string_sparse_type)

The following operation raises an error.

dense_df = categorical_sparse_df.sparse.to_dense()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-95-364b3ddaf122> in <module>
----> 1 dense_df = categorical_sparse_df.sparse.to_dense()

~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/sparse/accessor.py in to_dense(self)
    302         from pandas import DataFrame
    303 
--> 304         data = {k: v.array.to_dense() for k, v in self._parent.items()}
    305         return DataFrame(data, index=self._parent.index, columns=self._parent.columns)
    306 

~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/sparse/accessor.py in <dictcomp>(.0)
    302         from pandas import DataFrame
    303 
--> 304         data = {k: v.array.to_dense() for k, v in self._parent.items()}
    305         return DataFrame(data, index=self._parent.index, columns=self._parent.columns)
    306 

~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/sparse/array.py in to_dense(self)
   1132         arr : NumPy array
   1133         """
-> 1134         return np.asarray(self, dtype=self.sp_values.dtype)
   1135 
   1136     _internal_get_values = to_dense

~/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

TypeError: data type not understood
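
The last two frames suggest that the dtype handed to NumPy is the CategoricalDtype itself, which NumPy cannot interpret. Below is a minimal sketch of that failure in isolation, assuming this is indeed the cause. The helper name `numpy_understands` is hypothetical and is not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

custom_type = pd.CategoricalDtype(categories=['Zero', 'One'])


def numpy_understands(dtype):
    """Return True if NumPy can interpret `dtype` (hypothetical helper)."""
    try:
        np.dtype(dtype)
    except TypeError:
        # NumPy raises TypeError for dtype objects it does not recognise.
        return False
    return True


int_ok = numpy_understands('int64')
categorical_ok = numpy_understands(custom_type)
```

If the assumption holds, any code path that forwards the sparse values' dtype straight to `np.asarray` would fail in the same way for extension dtypes.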

In addition, the following does not raise an error, but applying groupby truncates the values in column B, which contains only "Zero", to "Z".

string_sparse_df.groupby(level=0).apply(lambda x:x)

	A	B
0	Zero	Z
1	One	Z

If the dense version of the data frame is used, the outcome is as expected.

string_sparse_df.sparse.to_dense().groupby(level=0).apply(lambda x:x)

	A	B
0	Zero	Zero
1	One	Zero
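
One plausible mechanism for the truncation to "Z" is NumPy's handling of the bare `str` dtype. This is an assumption and is not confirmed anywhere in this report. The sketch below uses NumPy only and shows that a buffer allocated with `dtype=str` gets one-character slots and silently truncates longer strings.

```python
import numpy as np

# Assumption: somewhere in the groupby path a buffer is allocated with the
# bare 'str' dtype. NumPy gives such a buffer a width of one character.
buffer = np.empty(2, dtype=str)

# Assigning a longer string does not raise; it is cut to fit the slot width.
buffer[:] = 'Zero'

truncated = buffer.tolist()
```

Whether pandas actually allocates such a buffer for an all-fill_value sparse column would need to be checked against the pandas source.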

Problem description

From the description of pandas.SparseDtype, my understanding is that the dtype argument can be an ExtensionDtype, which CategoricalDtype is. However, certain operations on a sparse data frame of this type (see the example above) raise a TypeError.

In addition, replacing the CategoricalDtype with a str type seems to partially fix the problem. However, it still causes issues with groupby when a column consists of only the fill_value.
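
Until this is resolved, one possible workaround is to store the integer category codes sparsely and rebuild the categorical columns after densifying. This is a sketch, not a fix in pandas itself, and the variable names are illustrative. It assumes the fill value 'Zero' is one of the declared categories.

```python
import numpy as np
import pandas as pd

custom_type = pd.CategoricalDtype(categories=['Zero', 'One'])
data = np.array([['Zero', 'Zero'],
                 ['One', 'Zero']])
df = pd.DataFrame(data=data, columns=list('AB'))

# Integer code of the fill value within the declared categories.
fill_code = custom_type.categories.get_loc('Zero')

# Store the integer codes sparsely instead of the categorical values.
sparse_codes = pd.DataFrame(
    {
        col: pd.arrays.SparseArray(
            pd.Categorical(df[col], dtype=custom_type).codes,
            fill_value=fill_code,
        )
        for col in df.columns
    },
    index=df.index,
)

# Densify the codes (a numeric subtype) and rebuild the categorical columns.
dense_codes = sparse_codes.sparse.to_dense()
dense_df = pd.DataFrame(
    {
        col: pd.Categorical.from_codes(
            dense_codes[col].to_numpy(), dtype=custom_type
        )
        for col in dense_codes.columns
    },
    index=dense_codes.index,
)
```

The trade-off is that the sparse frame holds codes rather than labels, so the CategoricalDtype has to be kept alongside it to decode the values later.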

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 7d32926
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-22-generic
Version : #23~18.04.1-Ubuntu SMP Thu Jun 6 08:37:25 UTC 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.2.2
numpy : 1.18.1
pytz : 2020.4
dateutil : 2.8.1
pip : 21.0.1
setuptools : 46.4.0.post20200518
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.0
bottleneck : 1.2.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : 2.7.1
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : 1.3.5
tables : 3.5.2
tabulate : None
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.44.1


PetarMPetrov added the Bug and Needs Triage labels on Feb 17, 2021
