BUG: Assigning extension array value to series of dtype object fails if element type is array-like #42437

Open
3 tasks done
frreiss opened this issue Jul 7, 2021 · 10 comments
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves Needs Tests Unit test(s) needed to prevent regressions Regression Functionality that used to work in a prior pandas version

Comments

@frreiss
Contributor

frreiss commented Jul 7, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype

class StubDtype(ExtensionDtype):
    """Extension dtype whose elements are something that numpy.asarray() 
    will turn into an array (in this case a tuple)"""
    def __init__(self):
        pass
    
    @property
    def type(self):
        return tuple
    
    @property
    def name(self) -> str:
        return "StubDtype"
        
    @classmethod
    def construct_array_type(cls):
        return StubExtensionArray()

class StubExtensionArray(ExtensionArray):
    """Just enough of an extension array to run the four lines of code 
    that follow."""
    @property
    def dtype(self):
        return StubDtype()
    
    def copy(self):
        return StubExtensionArray()
    
    def __len__(self):
        return 5
    
    def __getitem__(self, key):
        # Every position in the array has the tuple (1, 2, 3)
        return (1, 2, 3)
    
    

data = StubExtensionArray()
series1 = pd.Series(data, name="data")
series2 = pd.Series(index=series1.index, dtype=object, name="data")
series2.loc[series1.index] = data

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-168-cfd127e7fa55> in <module>
     41 series1 = pd.Series(data, name="data")
     42 series2 = pd.Series(index=series1.index, dtype=object, name="data")
---> 43 series2.loc[series1.index] = data

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    721 
    722         iloc = self if self.name == "iloc" else self.obj.iloc
--> 723         iloc._setitem_with_indexer(indexer, value, self.name)
    724 
    725     def _validate_key(self, key, axis: int):

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
   1730             self._setitem_with_indexer_split_path(indexer, value, name)
   1731         else:
-> 1732             self._setitem_single_block(indexer, value, name)
   1733 
   1734     def _setitem_with_indexer_split_path(self, indexer, value, name: str):

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
   1966 
   1967         # actually do the set
-> 1968         self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
   1969         self.obj._maybe_update_cacher(clear=True)
   1970 

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/internals/managers.py in setitem(self, indexer, value)
    353 
    354     def setitem(self: T, indexer, value) -> T:
--> 355         return self.apply("setitem", indexer=indexer, value=value)
    356 
    357     def putmask(self, mask, new, align: bool = True):

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    325                     applied = b.apply(f, **kwargs)
    326                 else:
--> 327                     applied = getattr(b, f)(**kwargs)
    328             except (TypeError, NotImplementedError):
    329                 if not ignore_failures:

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/internals/blocks.py in setitem(self, indexer, value)
    965                 values[indexer] = value.to_numpy(value.dtype.numpy_dtype)
    966             else:
--> 967                 values[indexer] = np.asarray(value)
    968 
    969         # if we are an exact match (ex-broadcasting),

ValueError: shape mismatch: value array of shape (5,3) could not be broadcast to indexing result of shape (5,)

Problem description

If the user creates a Series of dtype object and attempts to set the values of that Series with an extension array, the block manager first passes the extension array through np.asarray() and then assigns the ndarray returned by np.asarray() into the block's values (See code here).

This logic assumes that np.asarray() will always return a 1D array. However, np.asarray() is not guaranteed to do so: if the items of its argument are array-like, np.asarray() will iterate over them and produce an array with two or more dimensions. That array can't be assigned to the series, and pandas raises the error "ValueError: shape mismatch...".

This problem is affecting the TensorArray extension type in Text Extensions for Pandas, because the elements of a TensorArray are tensors. The example code above shows a simpler case where the items of the extension array are Python tuples. In general, any item type that np.asarray() converts to an array of one or more dimensions will have this problem.
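The shape behavior described above can be reproduced with NumPy alone. The following is a minimal sketch, not the pandas code path itself; the variable names are illustrative, and the plain list of tuples merely stands in for the extension array.

```python
import numpy as np

# Five items, each a tuple, standing in for the extension array's elements.
items = [(1, 2, 3)] * 5

# np.asarray() descends into the tuples and builds a 2-D array.
as_2d = np.asarray(items)

# Pre-allocating a 1-D object array and filling it item by item
# keeps every tuple intact as a single element.
as_1d = np.empty(len(items), dtype=object)
for i, item in enumerate(items):
    as_1d[i] = item

# Assigning into a 1-D object array through an index array, roughly as
# the block's setitem does with `values[indexer] = np.asarray(value)`.
target = np.empty(5, dtype=object)
indexer = np.arange(5)
try:
    target[indexer] = as_2d
    error = None
except ValueError as exc:
    error = str(exc)  # shape mismatch between (5, 3) and (5,)

target[indexer] = as_1d  # the 1-D object array assigns cleanly
```

The 1-D object array is the shape the assignment needs, which is why any conversion that preserves one element per position avoids the error.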

Expected Output

The above code should fill series2 with the individual objects at each of the positions in the extension array. In the case of the example code above, that means that each element of series2 should contain the Python tuple (1, 2, 3).

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : f00ed8f
python : 3.7.10.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@simonjayhawkins
Member

code sample worked in 1.2.5

first bad commit: [527c789] API/BUG: always try to operate inplace when setting with loc/iloc[foo, bar] (#39163)

cc @jbrockmendel

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Aug 3, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.2 milestone Aug 3, 2021
@simonjayhawkins simonjayhawkins added Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 3, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021
@jbrockmendel
Member

Somewhere in there it is calling np.array on your EA, and that produces a 2D ndarray. If you add the following method to StubExtensionArray in your example (with import numpy as np at the top)

    def __array__(self, *args, **kwargs):
        arr = np.empty(5, dtype=object)
        for i in range(5):
            arr[i] = (1, 2, 3)
        return arr

then the code snippet works
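To see why the extra method changes the outcome, here is a standalone sketch that needs only NumPy. TupleBox and TupleBoxWithArray are hypothetical stand-ins, not pandas classes; the `copy=None` parameter is an assumption added so the hook also accepts the keyword that newer NumPy versions may pass.

```python
import numpy as np


class TupleBox:
    """Hypothetical sequence whose elements are tuples (no __array__ hook)."""

    def __init__(self, n):
        self._items = [(1, 2, 3)] * n

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        return self._items[i]


class TupleBoxWithArray(TupleBox):
    """Same container, plus an __array__ hook that returns a 1-D object array."""

    def __array__(self, dtype=None, copy=None):
        arr = np.empty(len(self), dtype=object)
        for i in range(len(self)):
            arr[i] = self[i]
        return arr


# Without the hook, NumPy treats the object as a nested sequence.
plain = np.asarray(TupleBox(5))

# With the hook, NumPy uses whatever __array__ returns.
fixed = np.asarray(TupleBoxWithArray(5))
```

Here `plain` is two-dimensional while `fixed` has one element per position, which is the shape the assignment in the original example expects.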

@frreiss
Contributor Author

frreiss commented Sep 10, 2021

@jbrockmendel if I'm reading your comment correctly, there's a (currently undocumented) requirement for Pandas extension types to implement the NumPy callback __array__(). If that's the case, then perhaps the ExtensionArray class should have an implementation of __array__() that calls self.to_numpy()?

@jbrockmendel
Member

If that's the case, then perhaps the ExtensionArray class should have an implementation of __array__() that calls self.to_numpy()?

Huh, I thought that was already there as the default implementation, but apparently not. PR would be welcome.

@simonjayhawkins
Member

The default implementation of to_numpy() starts with result = np.asarray(self, dtype=dtype), so adding something like ...

    def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
        """
        Correctly construct numpy arrays when passed to `np.asarray()`.
        """
        return self.to_numpy(dtype=dtype)

to the base class would raise RecursionError: maximum recursion depth exceeded
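The cycle can be shown in isolation. This is a sketch under stated assumptions: LoopingArray is a hypothetical class, not the pandas base class, and its to_numpy() copies only the first step described above (deferring to np.asarray()).

```python
import numpy as np


class LoopingArray:
    """Hypothetical class in which to_numpy() and __array__() call each other."""

    def to_numpy(self, dtype=None):
        # Defer to np.asarray(), as the default to_numpy() is described as doing.
        return np.asarray(self, dtype=dtype)

    def __array__(self, dtype=None, copy=None):
        # np.asarray() invokes this hook, which calls to_numpy(),
        # which calls np.asarray() again, and so on.
        return self.to_numpy(dtype=dtype)


try:
    np.asarray(LoopingArray())
    outcome = "no error"
except RecursionError:
    outcome = "RecursionError"
```

Breaking the cycle requires one of the two methods to build the ndarray without going through the other, for example by filling a pre-allocated object array element by element.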

@simonjayhawkins
Member

Huh, I thought that was already there as the default implementation, but apparently not. PR would be welcome.

moving to 1.3.4

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@simonjayhawkins
Member

changing milestone to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@jreback
Contributor

jreback commented Nov 28, 2021

this is good to fix but not necessary for 1.3.x

@simonjayhawkins
Member

moving off 1.3.x milestone

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.5, Contributions Welcome Dec 11, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel jbrockmendel added the Needs Tests Unit test(s) needed to prevent regressions label Jul 27, 2023
@jbrockmendel
Member

This works on main, could use a test
