BUG: assigning Series.array / PandasArray to column fails #26390

Closed
jorisvandenbossche opened this issue May 14, 2019 · 6 comments

Comments

@jorisvandenbossche
Member

commented May 14, 2019

Assigning a PandasArray (so also the result of df['a'].array) of the correct length to add a column fails:

In [1]: df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['a', 'b', 'c', 'd']})                                                                                     

In [2]: df['c'] = pd.array([1, 2, None, 3])                                                                                                                   
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/scipy/pandas/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2672             try:
-> 2673                 return self._engine.get_loc(key)
   2674             except KeyError:

~/scipy/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/scipy/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'c'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/scipy/pandas/pandas/core/internals/managers.py in set(self, item, value)
   1048         try:
-> 1049             loc = self.items.get_loc(item)
   1050         except KeyError:

~/scipy/pandas/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2674             except KeyError:
-> 2675                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2676         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

~/scipy/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/scipy/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

~/scipy/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'c'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-2-03925b585d9b> in <module>
----> 1 df['c'] = pd.array([1, 2, None, 3])

~/scipy/pandas/pandas/core/frame.py in __setitem__(self, key, value)
   3334         else:
   3335             # set column
-> 3336             self._set_item(key, value)
   3337 
   3338     def _setitem_slice(self, key, value):

~/scipy/pandas/pandas/core/frame.py in _set_item(self, key, value)
   3410         self._ensure_valid_index(value)
   3411         value = self._sanitize_column(key, value)
-> 3412         NDFrame._set_item(self, key, value)
   3413 
   3414         # check if we are modifying a copy

~/scipy/pandas/pandas/core/generic.py in _set_item(self, key, value)
   3232 
   3233     def _set_item(self, key, value):
-> 3234         self._data.set(key, value)
   3235         self._clear_item_cache()
   3236 

~/scipy/pandas/pandas/core/internals/managers.py in set(self, item, value)
   1050         except KeyError:
   1051             # This item wasn't present, just insert at end
-> 1052             self.insert(len(self.items), item, value)
   1053             return
   1054 

~/scipy/pandas/pandas/core/internals/managers.py in insert(self, loc, item, value, allow_duplicates)
   1152 
   1153         block = make_block(values=value, ndim=self.ndim,
-> 1154                            placement=slice(loc, loc + 1))
   1155 
   1156         for blkno, count in _fast_count_smallints(self._blknos[loc:]):

~/scipy/pandas/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype, fastpath)
   3052         values = DatetimeArray._simple_new(values, dtype=dtype)
   3053 
-> 3054     return klass(values, ndim=ndim, placement=placement)
   3055 
   3056 

~/scipy/pandas/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
   2584             values = np.array(values, dtype=object)
   2585 
-> 2586         super().__init__(values, ndim=ndim, placement=placement)
   2587 
   2588     @property

~/scipy/pandas/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
     74 
     75     def __init__(self, values, placement, ndim=None):
---> 76         self.ndim = self._check_ndim(values, ndim)
     77         self.mgr_locs = placement
     78         self.values = values

~/scipy/pandas/pandas/core/internals/blocks.py in _check_ndim(self, values, ndim)
    111             msg = ("Wrong number of dimensions. values.ndim != ndim "
    112                    "[{} != {}]")
--> 113             raise ValueError(msg.format(values.ndim, ndim))
    114 
    115         return ndim

ValueError: Wrong number of dimensions. values.ndim != ndim [1 != 2]

Note that this only fails for PandasArray: in that case a FloatBlock, IntBlock, etc. is created, which expects 2D data, rather than an ExtensionBlock as is done for an "actual" ExtensionArray.

@shantanu-gontia

Contributor

commented May 14, 2019

This seems to work for me

In [1]: import pandas as pd                                                     

In [2]: pd.__version__                                                          
Out[2]: '0.24.2'

In [3]: df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['a', 'b', 'c', 'd']})       

In [4]: df['c'] = pd.array([1, 2, None, 3])                                     

In [5]: df                                                                      
Out[5]: 
   a  b     c
0  1  a     1
1  2  b     2
2  3  c  None
3  4  d     3

@jorisvandenbossche

Member Author

commented May 14, 2019

Indeed, on 0.24.2 it 'seems' to work, but there the PandasArray is incorrectly stored as-is (which I think we have since fixed). It is on master that this fails.

@shantanu-gontia

Contributor

commented May 14, 2019

The current master implementation does seem to convert any PandasArray to a NumPy array:

if isinstance(values, ABCPandasArray):
    values = values.to_numpy()
if isinstance(dtype, PandasDtype):
    dtype = dtype.numpy_dtype

@jorisvandenbossche

Member Author

commented May 14, 2019

Thanks for looking into it!
Yes, that's indeed what we want. But apparently something else still goes wrong.

@shantanu-gontia

Contributor

commented May 14, 2019

Without converting the PandasArray to a NumPy array, the block type assigned is ExtensionBlock. However, converting to a NumPy array results in the block type being ObjectBlock.

ExtensionBlock is a subclass of NonConsolidatableMixIn, which sets its _validate_ndim attribute to False, so no error is raised when the _check_ndim check is performed.

class NonConsolidatableMixIn:
    """ hold methods for the nonconsolidatable blocks """
    _can_consolidate = False
    _verify_integrity = False
    _validate_ndim = False

This is not true for ObjectBlock, whose _validate_ndim attribute is True. Hence, the error is raised.
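
The effect of the class-level flag can be sketched in plain Python. This is a simplified illustration, not the pandas source: ExtensionLikeBlock and ObjectLikeBlock are hypothetical stand-ins, and the constructor takes the number of dimensions of the values directly instead of an array.

```python
class Block:
    """Toy stand-in for a block that validates dimensionality."""

    _validate_ndim = True

    def __init__(self, values_ndim, ndim):
        # values_ndim plays the role of values.ndim in the traceback above.
        self.ndim = self._check_ndim(values_ndim, ndim)

    def _check_ndim(self, values_ndim, ndim):
        # The comparison only happens when the class opts in to validation.
        if self._validate_ndim and values_ndim != ndim:
            msg = ("Wrong number of dimensions. values.ndim != ndim "
                   "[{} != {}]")
            raise ValueError(msg.format(values_ndim, ndim))
        return ndim


class NonConsolidatableMixIn:
    """Mixin that switches the dimensionality check off."""

    _validate_ndim = False


class ExtensionLikeBlock(NonConsolidatableMixIn, Block):
    """Hypothetical analogue of ExtensionBlock: 1D values are accepted."""


class ObjectLikeBlock(Block):
    """Hypothetical analogue of ObjectBlock: 1D values are rejected."""


# The mixin comes first in the MRO, so its flag wins and no error is raised.
ExtensionLikeBlock(values_ndim=1, ndim=2)

# Without the mixin the check runs and raises ValueError.
try:
    ObjectLikeBlock(values_ndim=1, ndim=2)
except ValueError as exc:
    print(exc)
```

Under these assumptions, the same 1D input passes silently in one class and fails in the other, which matches the difference between 0.24.2 and master described above.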


If we pass a NumPy array instead of a PandasArray, then during the call to DataFrame._set_item (which in turn calls NDFrame._set_item),

pandas/pandas/core/frame.py

Lines 3400 to 3413 in e5d15b2

def _set_item(self, key, value):
    """
    Add series to DataFrame in specified column.
    If series is a numpy-array (not a Series/TimeSeries), it must be the
    same length as the DataFrames index or an error will be thrown.
    Series/TimeSeries will be conformed to the DataFrames index to
    ensure homogeneity.
    """
    self._ensure_valid_index(value)
    value = self._sanitize_column(key, value)
    NDFrame._set_item(self, key, value)

the _sanitize_column method, when given a NumPy array, explicitly converts it to two dimensions:

return np.atleast_2d(np.asarray(value))

This step is skipped when the PandasArray is converted to a NumPy array inside make_block. Perhaps we can add it after the to_numpy() conversion:

def make_block(values, placement, klass=None, ndim=None, dtype=None,
               fastpath=None):
    # Ensure that we don't allow PandasArray / PandasDtype in internals.
    # For now, blocks should be backed by ndarrays when possible.
    if isinstance(values, ABCPandasArray):
        values = values.to_numpy()
    if isinstance(dtype, PandasDtype):
        dtype = dtype.numpy_dtype
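
To make the dimensionality gap concrete, here is a minimal NumPy-only sketch. It does not call any pandas internals; values_1d is an assumed stand-in for what to_numpy() would hand back for the array in the report, and values_2d is what the np.atleast_2d step from _sanitize_column would produce from it.

```python
import numpy as np

# Assumed stand-in for the 1D ndarray obtained from the PandasArray.
values_1d = np.array([1, 2, None, 3], dtype=object)

# The extra step applied to plain NumPy input: promote to two dimensions.
values_2d = np.atleast_2d(np.asarray(values_1d))

print(values_1d.ndim, values_1d.shape)  # 1 (4,)
print(values_2d.ndim, values_2d.shape)  # 2 (1, 4)
```

The 1D array is the shape that triggers "[1 != 2]" in the traceback, while the promoted array has the two dimensions that a consolidatable block in a DataFrame expects.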

@shantanu-gontia

Contributor

commented May 15, 2019

If we simply add a line values = np.atleast_2d(np.asarray(values)) after Line 3036 in

if isinstance(values, ABCPandasArray):

then the following test will fail

def test_make_block_no_pandas_array():
    # https://github.com/pandas-dev/pandas/pull/24866
    arr = pd.array([1, 2])

    # PandasArray, no dtype
    result = make_block(arr, slice(len(arr)))
    assert result.is_integer is True
    assert result.is_extension is False

    # PandasArray, PandasDtype
    result = make_block(arr, slice(len(arr)), dtype=arr.dtype)
    assert result.is_integer is True
    assert result.is_extension is False

    # ndarray, PandasDtype
    result = make_block(arr.to_numpy(), slice(len(arr)), dtype=arr.dtype)
    assert result.is_integer is True
    assert result.is_extension is False

However, given this bug and the new approach of converting a PandasArray to a NumPy array before it is stored, is this test still valid?


Another solution would be to convert the PandasArray to a NumPy array in the _sanitize_column method, perhaps here

pandas/pandas/core/frame.py

Lines 3623 to 3625 in e5d15b2

# return internal types directly
if is_extension_type(value) or is_extension_array_dtype(value):
    return value

or to add a special case for ABCPandasArray.
