BUG: pandas.SparseDtype from pandas.CategoricalDtype fails #39874

Open
PetarMPetrov opened this issue Feb 17, 2021 · 0 comments
Labels: Bug, Categorical (Categorical Data Type), ExtensionArray (Extending pandas with custom dtypes or arrays), Sparse (Sparse Data Type)


Code Sample

import numpy as np
import pandas as pd
from scipy import sparse as sp_sparse

# Create categorical type and sparse type from it.
custom_type = pd.CategoricalDtype(categories=['Zero', 'One'])
categorical_sparse_type = pd.SparseDtype(dtype=custom_type, fill_value='Zero')

# Create sparse type from string type
string_sparse_type = pd.SparseDtype(dtype='str', fill_value='Zero')

# Dummy Data
data = np.array([['Zero', 'Zero'],
                 ['One', 'Zero']])

# Create sparse data frame from categorical sparse type
categorical_sparse_df = pd.DataFrame(
    data=data,
    columns=list('AB'),
).astype(categorical_sparse_type)

# Create sparse data frame from string sparse type
string_sparse_df = pd.DataFrame(
    data=data,
    columns=list('AB'),
).astype(string_sparse_type)

The following operation raises a TypeError:

dense_df = categorical_sparse_df.sparse.to_dense()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-95-364b3ddaf122> in <module>
----> 1 dense_df = categorical_sparse_df.sparse.to_dense()

~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/sparse/accessor.py in to_dense(self)
    302         from pandas import DataFrame
    303 
--> 304         data = {k: v.array.to_dense() for k, v in self._parent.items()}
    305         return DataFrame(data, index=self._parent.index, columns=self._parent.columns)
    306 

~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/sparse/accessor.py in <dictcomp>(.0)
    302         from pandas import DataFrame
    303 
--> 304         data = {k: v.array.to_dense() for k, v in self._parent.items()}
    305         return DataFrame(data, index=self._parent.index, columns=self._parent.columns)
    306 

~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/sparse/array.py in to_dense(self)
   1132         arr : NumPy array
   1133         """
-> 1134         return np.asarray(self, dtype=self.sp_values.dtype)
   1135 
   1136     _internal_get_values = to_dense

~/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

TypeError: data type not understood

In addition, the following does not raise an error, but changes the "Zero"-only column in an unexpected way when groupby is applied.

string_sparse_df.groupby(level=0).apply(lambda x:x)
      A  B
0  Zero  Z
1   One  Z

If the dense version of the data frame is used, the outcome is as expected.

string_sparse_df.sparse.to_dense().groupby(level=0).apply(lambda x:x)
      A     B
0  Zero  Zero
1   One  Zero
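
The truncated "Z" in column B is consistent with how NumPy handles the unsized 'str' dtype, although nothing in this report confirms that this is the mechanism inside pandas. A minimal sketch, assuming the fill value is at some point written into a buffer allocated with dtype 'str' (the name `buffer` is illustrative):

```python
import numpy as np

# An unsized 'str' dtype has no room for characters; NumPy allocates
# one-character slots when an array is created from it.
buffer = np.empty(2, dtype="str")

# Assigning a longer string silently keeps only the first character.
buffer[:] = "Zero"

print(buffer.dtype)  # one-character unicode dtype
print(buffer)        # every entry is cut down to its first character
```

If this assumption holds, only a column made up entirely of the fill value would be affected, because such a column holds no stored values from which a wider string dtype could be inferred.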

Problem description

From the description of pandas.SparseDtype, my understanding is that the dtype argument can be an ExtensionDtype, which CategoricalDtype is. However, certain operations on a sparse data frame of such a type (see the example above) raise a TypeError.
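
The traceback ends inside np.asarray, which is called with dtype=self.sp_values.dtype, presumably the CategoricalDtype. A minimal sketch that isolates that step without the sparse accessor (the variable names are illustrative, and the exact error message differs between NumPy versions):

```python
import numpy as np
import pandas as pd

custom_type = pd.CategoricalDtype(categories=["Zero", "One"])

# NumPy only understands its own dtypes, so passing a pandas
# extension dtype as the dtype argument is rejected.
try:
    np.asarray(["Zero", "One"], dtype=custom_type)
    numpy_accepts_categorical = True
except TypeError as exc:
    numpy_accepts_categorical = False
    print(f"TypeError: {exc}")
```

This suggests the failure is not specific to the data in the code sample: any code path that hands an extension dtype to np.asarray would fail the same way.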

In addition, replacing the CategoricalDtype with a str type seems to partially fix the problem. However, it still causes issues with groupby when a column consists of only the fill_value.
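
Until this is resolved, one possible workaround is to store the integer category codes sparsely and rebuild the categorical values on demand. This is a sketch under stated assumptions, not something proposed by the pandas maintainers in this issue; the names `column`, `sparse_codes`, and `restored` are illustrative, and it assumes the fill value is the first category (code 0).

```python
import numpy as np
import pandas as pd

custom_type = pd.CategoricalDtype(categories=["Zero", "One"])
column = ["Zero", "One", "Zero", "Zero"]

# Encode the values as integer codes; "Zero" is category 0.
codes = pd.Categorical(column, dtype=custom_type).codes

# Store the codes sparsely, treating code 0 ("Zero") as the fill value.
sparse_codes = pd.arrays.SparseArray(codes, fill_value=0)

# Densify the codes and rebuild the categorical values when needed.
restored = pd.Categorical.from_codes(np.asarray(sparse_codes), dtype=custom_type)

print(list(restored))
```

The trade-off is that the sparse column no longer carries the category labels itself, so the CategoricalDtype has to be kept alongside it.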

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 7d32926
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-22-generic
Version : #23~18.04.1-Ubuntu SMP Thu Jun 6 08:37:25 UTC 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.2.2
numpy : 1.18.1
pytz : 2020.4
dateutil : 2.8.1
pip : 21.0.1
setuptools : 46.4.0.post20200518
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.0
bottleneck : 1.2.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : 2.7.1
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : 1.3.5
tables : 3.5.2
tabulate : None
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.44.1

PetarMPetrov added the Bug and Needs Triage labels on Feb 17, 2021.
jbrockmendel added the Dtype Conversions, ExtensionArray, Sparse, and Categorical labels and removed the Dtype Conversions and Needs Triage labels on Mar 24, 2021.