maybe_cast_to_integer_array fails when the input is all booleans #25211

Closed
vladserkoff opened this issue Feb 7, 2019 · 6 comments

Comments

@vladserkoff
Contributor

commented Feb 7, 2019

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd, numpy as np

# expected behaviour with ordinary dtype
In [2]: pd.Series([True, False], dtype=int)
Out[2]:
0    1
1    0
dtype: int64

# broken
In [3]: pd.Series([True, False], dtype=pd.Int64Dtype())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in _try_cast(arr, take_fast_path, dtype, copy, raise_cast_failure)
    694         if is_integer_dtype(dtype):
--> 695             subarr = maybe_cast_to_integer_array(arr, dtype)
    696

/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in maybe_cast_to_integer_array(arr, dtype, copy)
   1304         if not hasattr(arr, "astype"):
-> 1305             casted = np.array(arr, dtype=dtype, copy=copy)
   1306         else:

TypeError: data type not understood

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-3-b747cfcdf17f> in <module>
----> 1 pd.Series([True, False], dtype=pd.Int64Dtype())

/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    260             else:
    261                 data = sanitize_array(data, index, dtype, copy,
--> 262                                       raise_cast_failure=True)
    263
    264                 data = SingleBlockManager(data, index, fastpath=True)

/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_array(data, index, dtype, copy, raise_cast_failure)
    605             try:
    606                 subarr = _try_cast(data, False, dtype, copy,
--> 607                                    raise_cast_failure)
    608             except Exception:
    609                 if raise_cast_failure:  # pragma: no cover

/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in _try_cast(arr, take_fast_path, dtype, copy, raise_cast_failure)
    714             # create an extension array from its dtype
    715             array_type = dtype.construct_array_type()._from_sequence
--> 716             subarr = array_type(arr, dtype=dtype, copy=copy)
    717         elif dtype is not None and raise_cast_failure:
    718             raise

/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/integer.py in _from_sequence(cls, scalars, dtype, copy)
    301     @classmethod
    302     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 303         return integer_array(scalars, dtype=dtype, copy=copy)
    304
    305     @classmethod

/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/integer.py in integer_array(values, dtype, copy)
    109     TypeError if incompatible types
    110     """
--> 111     values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
    112     return IntegerArray(values, mask)
    113

/usr/local/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/integer.py in coerce_to_array(values, dtype, mask, copy)
    190     elif not (is_integer_dtype(values) or is_float_dtype(values)):
    191         raise TypeError("{} cannot be converted to an IntegerDtype".format(
--> 192             values.dtype))
    193
    194     if mask is None:

TypeError: bool cannot be converted to an IntegerDtype

Problem description

Pandas is unable to convert an array of bools to IntegerDtype, even though conversion to plain int is supported. Interestingly, if the array contains NaNs, the conversion works as expected.

In [4]: pd.Series([True, False, np.nan], dtype=pd.Int64Dtype())
Out[4]:
0      1
1      0
2    NaN
dtype: Int64

Expected Output

In [5]: pd.Series([True, False], dtype=pd.Int64Dtype())
Out[5]:
0      1
1      0
dtype: Int64

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.24.1
pytest: 4.2.0
pip: 19.0.1
setuptools: 40.7.3
Cython: 0.29.4
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.4
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.14
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml.etree: 4.3.0
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.2.17
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd

Member

commented Feb 8, 2019

Thanks for the report! It does seem strange that NA is required here.

Investigation and PRs are always welcome.

@vladserkoff

Contributor Author

commented Feb 8, 2019

I've found the source of the discrepancy between arrays with and without NA. It's in pandas.core.arrays.integer.coerce_to_array:

values = np.array(values, copy=copy)

elif not (is_integer_dtype(values) or is_float_dtype(values)):
    raise TypeError("{} cannot be converted to an IntegerDtype".format(
        values.dtype))

When np.array is called on an array that contains NAs, it casts the array to float:

In [1]: import numpy as np

In [2]: np.array([False, True]).dtype
Out[2]: dtype('bool')

In [3]: np.array([False, True, np.nan]).dtype
Out[3]: dtype('float64')

I'm not sure how to handle this properly, though. Should we cast boolean arrays to numpy floats first?
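To make the discrepancy concrete, here is a minimal sketch that reproduces the dtype inference with NumPy alone. The helper name `inferred_kind` is hypothetical, and `np.issubdtype` is used only as a stand-in for the `is_integer_dtype` / `is_float_dtype` check quoted above; it is not the pandas implementation.

```python
import numpy as np


def inferred_kind(values):
    """Return the dtype NumPy infers and whether it is an integer or float dtype.

    Hypothetical helper: np.issubdtype stands in for the pandas checks.
    """
    arr = np.array(values)
    passes = (np.issubdtype(arr.dtype, np.integer)
              or np.issubdtype(arr.dtype, np.floating))
    return arr.dtype, bool(passes)


# All booleans: NumPy keeps a bool dtype, which is neither integer nor float.
all_bools = inferred_kind([False, True])

# Booleans plus NaN: NumPy upcasts to float64, so the check passes.
bools_with_nan = inferred_kind([False, True, np.nan])

print(all_bools)
print(bools_with_nan)
```

Under these assumptions, only the all-boolean input falls through to the branch that raises.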

@gfyoung

Member

commented Feb 10, 2019

> should we cast boolean arrays to numpy floats first?

I would prefer casting to int actually. This approach seems a little special-casey, but trying to modify EA logic is by no means straightforward and has its own landmines. Give this a shot!
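The idea of casting to int could look roughly like the sketch below. It uses NumPy only, and the helper name `bools_to_int` is hypothetical. It illustrates the suggestion and is not the code from the eventual PR.

```python
import numpy as np


def bools_to_int(values):
    """Hypothetical helper: cast a boolean array to int64, leave other dtypes alone."""
    arr = np.asarray(values)
    if arr.dtype == np.bool_:
        # True -> 1, False -> 0
        return arr.astype(np.int64)
    return arr


converted = bools_to_int([True, False])
print(converted, converted.dtype)
```

After such a cast the array has an integer dtype, so an integer-or-float dtype check would no longer reject it.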

vladserkoff added a commit to vladserkoff/pandas that referenced this issue Feb 11, 2019

BUG: support casting from bool array to EA Integer dtype
Fixes pandas-dev#25211. Cast boolean array to int before casting to EA Integer dtype.
@vladserkoff

Contributor Author

commented Feb 11, 2019

Well, my PR seems to fix this issue but the tests are failing with several similar errors:

pandas/tests/extension/base/ops.py:33: AssertionError
___________ TestComparisonOps.test_compare_scalar[__gt__-Int32Dtype] ___________
[gw0] darwin -- Python 3.5.6 /Users/vsts/miniconda3/envs/pandas-dev/bin/python

self = <pandas.tests.extension.test_integer.TestComparisonOps object at 0x1378eb470>
data = <IntegerArray>
[  1,   2,   3,   4,   5,   6,   7,   8, NaN,  10,  11,  12,  13,  14,  15,
  16,  17,  18,  19,  20,  ...2,  83,  84,  85,  86,  87,  88,  89,  90,
  91,  92,  93,  94,  95,  96,  97, NaN,  99, 100]
Length: 100, dtype: Int32
all_compare_operators = '__gt__'

    def test_compare_scalar(self, data, all_compare_operators):
        op_name = all_compare_operators
        s = pd.Series(data)
>       self._compare_other(s, data, op_name, 0)

pandas/tests/extension/base/ops.py:148: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/extension/test_integer.py:154: in _compare_other
    self.check_opname(s, op_name, other)
pandas/tests/extension/test_integer.py:151: in check_opname
    other, exc=None)
pandas/tests/extension/base/ops.py:27: in check_opname
    self._check_op(s, op, other, op_name, exc)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandas.tests.extension.test_integer.TestComparisonOps object at 0x1378eb470>
s = 0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8     NaN
9      10
10     11
11     1... 91
91     92
92     93
93     94
94     95
95     96
96     97
97    NaN
98     99
99    100
Length: 100, dtype: Int32
op = <built-in function gt>, other = 0, op_name = '__gt__', exc = None

    def _check_op(self, s, op, other, op_name, exc=NotImplementedError):
        if exc is None:
            result = op(s, other)
            expected = s.combine(other, op)
>           self.assert_series_equal(result, expected)
E           AssertionError: Attributes are different
E           
E           Attribute "dtype" are different
E           [left]:  bool
E           [right]: Int64

It looks like these failures should not be related to my change, and I'd be happy for any help clarifying this.

@gfyoung

Member

commented Feb 11, 2019

@vladserkoff : This is what I was referring to when I was talking about how trying to modify EA logic is by no means straightforward and has its own landmines.

I would try casting before you hit any of the EA logic.

@vladserkoff

Contributor Author

commented Feb 12, 2019

@gfyoung, thanks. I've left the fix where it was and only made sure not to cast to int if the target dtype is not an integer. Tests are now OK, except for an unrelated ImportError in a Windows test:

_____________________________ test_oo_optimizable _____________________________
[gw1] win32 -- Python 2.7.15 C:\Miniconda\envs\pandas-dev\python.exe

    def test_oo_optimizable():
        # GH 21071
>       subprocess.check_call([sys.executable, "-OO", "-c", "import pandas"])

pandas\tests\test_downstream.py:63: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

popenargs = (['C:\\Miniconda\\envs\\pandas-dev\\python.exe', '-OO', '-c', 'import pandas'],)
kwargs = {}, retcode = 1
cmd = ['C:\\Miniconda\\envs\\pandas-dev\\python.exe', '-OO', '-c', 'import pandas']

    def check_call(*popenargs, **kwargs):
        """Run command with arguments.  Wait for command to complete.  If
        the exit code was zero then return, otherwise raise
        CalledProcessError.  The CalledProcessError object will have the
        return code in the returncode attribute.
    
        The arguments are the same as for the Popen constructor.  Example:
    
        check_call(["ls", "-l"])
        """
        retcode = call(*popenargs, **kwargs)
        if retcode:
            cmd = kwargs.get("args")
            if cmd is None:
                cmd = popenargs[0]
>           raise CalledProcessError(retcode, cmd)
E           CalledProcessError: Command '['C:\\Miniconda\\envs\\pandas-dev\\python.exe', '-OO', '-c', 'import pandas']' returned non-zero exit status 1

C:\Miniconda\envs\pandas-dev\lib\subprocess.py:190: CalledProcessError
---------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):

  File "<string>", line 1, in <module>

  File "pandas\__init__.py", line 35, in <module>

    "the C extensions first.".format(module))

ImportError: C extension: DLL load failed: The parameter is incorrect. not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.

Unfortunately Codecov is down by 50%, and I'm afraid I can't fix this as I'm new to it.
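The guard described at the start of this comment (cast to int only when the target dtype is an integer) might be sketched as follows. This is an assumption-laden illustration: the helper `maybe_cast_bools` is hypothetical, and a plain NumPy dtype stands in for the extension Integer dtype.

```python
import numpy as np


def maybe_cast_bools(values, target_dtype=None):
    """Hypothetical helper: cast bools to the target only if it is an integer dtype."""
    arr = np.asarray(values)
    wants_integer = (target_dtype is not None
                     and np.issubdtype(np.dtype(target_dtype), np.integer))
    if arr.dtype == np.bool_ and wants_integer:
        return arr.astype(np.dtype(target_dtype))
    # No integer target requested: leave the boolean array untouched.
    return arr


cast = maybe_cast_bools([True, False], "int64")
kept = maybe_cast_bools([True, False])
print(cast.dtype, kept.dtype)
```

With no integer target, boolean results such as comparison outputs stay boolean instead of being coerced to integers.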
