BUG: pandas 1.0 dropna error with categorical data if pd.options.mode.use_inf_as_na = True #33594

Closed
2 of 3 tasks
thebucc opened this issue Apr 16, 2020 · 3 comments · Fixed by #33629
Labels
Categorical (Categorical Data Type), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Regression (Functionality that used to work in a prior pandas version)

thebucc commented Apr 16, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype

# with categorical column and use_inf_as_na = True -> ERROR
pd.options.mode.use_inf_as_na = True
df1 = pd.DataFrame([['a1', 'good'], ['b1', 'good'], ['c1', 'good'], ['d1', 'bad']], columns=['C1', 'C2'])
df2 = pd.DataFrame([['a1', 'good'], ['b1', np.inf], ['c1', np.nan], ['d1', 'bad']], columns=['C1', 'C2'])
categories = CategoricalDtype(categories=['good', 'bad'], ordered=True)
df1.loc[:, 'C2'] = df1['C2'].astype(categories)
df2.loc[:, 'C2'] = df2['C2'].astype(categories)
df1.dropna(axis=0)  # ERROR
df2.dropna(axis=0)  # ERROR

Problem description

With the latest version of pandas (1.0.3, installed via pip on Python 3.6.8), DataFrame.dropna raises a TypeError if a column has a CategoricalDtype and pd.options.mode.use_inf_as_na = True.

Exception with traceback:

Traceback (most recent call last):
  File "", line 1, in
  File "/home/sbucc/miniconda3/envs/tf114/lib/python3.6/site-packages/pandas/core/frame.py", line 4751, in dropna
    count = agg_obj.count(axis=agg_axis)
  File "/home/sbucc/miniconda3/envs/tf114/lib/python3.6/site-packages/pandas/core/frame.py", line 7800, in count
    result = notna(frame).sum(axis=axis)
  File "/home/sbucc/miniconda3/envs/tf114/lib/python3.6/site-packages/pandas/core/dtypes/missing.py", line 376, in notna
    res = isna(obj)
  File "/home/sbucc/miniconda3/envs/tf114/lib/python3.6/site-packages/pandas/core/dtypes/missing.py", line 126, in isna
    return _isna(obj)
  File "/home/sbucc/miniconda3/envs/tf114/lib/python3.6/site-packages/pandas/core/dtypes/missing.py", line 185, in _isna_old
    return obj._constructor(obj._data.isna(func=_isna_old))
  File "/home/sbucc/miniconda3/envs/tf114/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 555, in isna
    return self.apply("apply", func=func)
  File "/home/sbucc/miniconda3/envs/tf114/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 442, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/sbucc/miniconda3/envs/tf114/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 390, in apply
    result = func(self.values, **kwargs)
  File "/home/sbucc/miniconda3/envs/tf114/lib/python3.6/site-packages/pandas/core/dtypes/missing.py", line 183, in _isna_old
    return _isna_ndarraylike_old(obj)
  File "/home/sbucc/miniconda3/envs/tf114/lib/python3.6/site-packages/pandas/core/dtypes/missing.py", line 283, in _isna_ndarraylike_old
    vec = libmissing.isnaobj_old(values.ravel())
TypeError: Argument 'arr' has incorrect type (expected numpy.ndarray, got Categorical)

This doesn't happen with pandas 0.24.0 or if pd.options.mode.use_inf_as_na = False (default).
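
Until a fix is available, one possible workaround is to leave the global option at its default and treat infinities as missing explicitly. The sketch below is only an illustration under that assumption: `dropna_with_inf` is a hypothetical helper name, not a pandas function, and the sample frame is made up for the example rather than taken from the report.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype


def dropna_with_inf(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows holding NaN/None or +/-inf, without use_inf_as_na.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Cells that are missing under the default (inf is not NA) rules.
    missing = df.isna()
    # Cells that are exactly +inf or -inf.
    infinite = df.isin([np.inf, -np.inf])
    bad_rows = (missing | infinite).any(axis=1)
    return df.loc[~bad_rows]


# Illustrative data: one numeric column with inf, one ordered categorical with a gap.
dtype = CategoricalDtype(categories=["good", "bad"], ordered=True)
df = pd.DataFrame(
    {
        "C1": [1.0, np.inf, 3.0, 4.0],
        "C2": pd.Series(["good", "good", None, "bad"]).astype(dtype),
    }
)

clean = dropna_with_inf(df)
print(clean.index.tolist())  # [0, 3]
```

This keeps the change local to one call instead of altering how every isna/notna check in the session behaves.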

Expected Output

no error

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-96-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200330
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

thebucc added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Apr 16, 2020
@TomAugspurger
Contributor

Can you edit your post to include the full traceback and remove the unnecessary examples (everything that works)? Just keep the DataFrame creation and the code that fails.

@dsaxton
Member

dsaxton commented Apr 18, 2020

Smaller example:

import pandas as pd

cat = pd.Categorical([1, 2])
pd.options.mode.use_inf_as_na = True
cat.isna()  # Works
pd.Series(cat).isna()  # Raises
pd.DataFrame(cat).isna()  # Raises

Also interesting is that use_inf_as_na disables NA checking for string dtype:

[ins] In [5]: pd.options.mode.use_inf_as_na = True

[ins] In [6]: arr = pd.array(["a", "b", None])

[ins] In [7]: arr.isna()
Out[7]: array([False, False, False])
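
For comparison, here is a minimal sketch of the same check with use_inf_as_na left at its default, where the missing entry would be expected to be flagged. The explicit `dtype="string"` is an assumption added here so the example does not depend on dtype inference.

```python
import pandas as pd

# use_inf_as_na is left untouched (default: False) in this sketch.
arr = pd.array(["a", "b", None], dtype="string")
mask = arr.isna()
print(mask.tolist())  # [False, False, True]
```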

dsaxton added the Categorical (Categorical Data Type) and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Apr 18, 2020
@simonjayhawkins
Member

> With the latest version of pandas (1.0.3, installed via pip on python 3.6.8)

Working in 0.25.3. Regression introduced in #29900 (i.e. 1.0.0).

9333e3d is the first bad commit
commit 9333e3d
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Sun Dec 1 10:11:52 2019 -0800

    DEPR: Categorical.ravel, get_dtype_counts, dtype_str, to_dense (#29900)

cc @jbrockmendel
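
The last frame of the traceback hands the result of values.ravel() to a helper that accepts only numpy.ndarray, and the bisected commit touches Categorical.ravel. The sketch below only illustrates the type distinction involved, using public API; it does not reproduce the internal code path, and it is not necessarily how the eventual fix works.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["good", None, "bad"], categories=["good", "bad"], ordered=True)

# A Categorical is an extension array, not an ndarray subclass, so a
# function typed to accept numpy.ndarray will reject it.
is_ndarray = isinstance(cat, np.ndarray)

# Converting explicitly yields a plain ndarray of the category values.
dense = np.asarray(cat)

# Both routes agree on which entries are missing (default options).
own_mask = cat.isna().tolist()
dense_mask = pd.isna(dense).tolist()
print(is_ndarray, own_mask, dense_mask)
```

Either calling the array's own isna() or converting with np.asarray before an ndarray-only routine avoids passing a Categorical where an ndarray is required.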

simonjayhawkins added the Regression (Functionality that used to work in a prior pandas version) label and removed the Bug label Apr 19, 2020
simonjayhawkins added this to the 1.1 milestone Apr 19, 2020
simonjayhawkins changed the milestone from 1.1 to 1.0.4 May 26, 2020