
Incorrect behavior when concatenating multiple ExtensionBlocks with different dtypes #22994

Open
TomAugspurger opened this issue Oct 4, 2018 · 4 comments

@TomAugspurger (Contributor) commented Oct 4, 2018

In pandas/core/internals/managers.py:

if all(type(b) is type(blocks[0]) for b in blocks[1:]): # noqa

we check that all the blocks are of a single type.

For ExtensionBlocks, that's insufficient. If you try to concatenate two series with different EA dtypes, it'll call the first EA's _concat_same_type with arrays of the wrong type.

In [13]: from pandas.tests.extension.decimal.test_decimal import *

In [14]: import pandas as pd

In [15]: a = pd.Series(pd.core.arrays.integer_array([1, 2]))

In [16]: b = pd.Series(DecimalArray([decimal.Decimal(1), decimal.Decimal(2)]))

In [17]: pd.concat([a, b])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-714da278d09e> in <module>
----> 1 pd.concat([a, b])

~/sandbox/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    225                        verify_integrity=verify_integrity,
    226                        copy=copy, sort=sort)
--> 227     return op.get_result()
    228
    229

~/sandbox/pandas/pandas/core/reshape/concat.py in get_result(self)
    389
    390                 mgr = self.objs[0]._data.concat([x._data for x in self.objs],
--> 391                                                 self.new_axes)
    392                 cons = _concat._get_series_result_type(mgr, self.objs)
    393                 return cons(mgr, name=name).__finalize__(self, method='concat')

~/sandbox/pandas/pandas/core/internals/managers.py in concat(self, to_concat, new_axis)
   1637
   1638             if all(type(b) is type(blocks[0]) for b in blocks[1:]):  # noqa
-> 1639                 new_block = blocks[0].concat_same_type(blocks)
   1640             else:
   1641                 values = [x.values for x in blocks]

~/sandbox/pandas/pandas/core/internals/blocks.py in concat_same_type(self, to_concat, placement)
   2047         """
   2048         values = self._holder._concat_same_type(
-> 2049             [blk.values for blk in to_concat])
   2050         placement = placement or slice(0, len(values), 1)
   2051         return self.make_block_same_class(values, ndim=self.ndim,

~/sandbox/pandas/pandas/core/arrays/integer.py in _concat_same_type(cls, to_concat)
    386     def _concat_same_type(cls, to_concat):
    387         data = np.concatenate([x._data for x in to_concat])
--> 388         mask = np.concatenate([x._mask for x in to_concat])
    389         return cls(data, mask)
    390

~/sandbox/pandas/pandas/core/arrays/integer.py in <listcomp>(.0)
    386     def _concat_same_type(cls, to_concat):
    387         data = np.concatenate([x._data for x in to_concat])
--> 388         mask = np.concatenate([x._mask for x in to_concat])
    389         return cls(data, mask)
    390

AttributeError: 'DecimalArray' object has no attribute '_mask'

For EA blocks, we need to ensure that they're the same dtype. When they differ, we should fall back to object.

Checking the dtypes actually solves a secondary problem. On master, we allow concat([ Series[Period[D]], Series[Period[M]] ]), i.e. concatenating series of periods with different frequencies. If we still want to allow that, we need to bail out before we get down to PeriodArray._concat_same_type.
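For illustration, a minimal version of that Period case (the object dtype shown is what a dtype-checking fallback would produce; exact behavior varies by pandas version):

```python
import pandas as pd

# Two series of periods with different frequencies ('D' vs 'M').
s_day = pd.Series(pd.period_range("2018-01-01", periods=2, freq="D"))
s_month = pd.Series(pd.period_range("2018-01", periods=2, freq="M"))

# With a dtype check that falls back to object, the concatenated
# result keeps all four periods as Python objects.
result = pd.concat([s_day, s_month], ignore_index=True)
```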

@jorisvandenbossche (Member) commented Oct 5, 2018

Do we need more fine-grained control?

I could imagine that in some cases an ExtensionArray (e.g. one with a parametrized dtype) would like a smarter way to concat arrays with different dtypes than just converting to object.

Also, e.g. an IntegerArray with int64 and one with int32 would not need to be converted to object?
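For comparison, plain NumPy already promotes mixed integer widths to a common dtype rather than going to object:

```python
import numpy as np

# NumPy concatenation promotes int32 + int64 to the higher precision
# instead of falling back to object.
a = np.array([1, 2], dtype="int32")
b = np.array([3, 4], dtype="int64")
out = np.concatenate([a, b])
# out.dtype is int64
```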

@TomAugspurger (Contributor, Author) commented Oct 5, 2018

> Do we need more fine-grained control?

Absolutely. We're even getting there with Sparse, since it would like to take non-sparse arrays and make them sparse, rather than going to object. Right now I think we just special-case sparse before getting to concat.

I think we're coming up on the need for a general concat_array mechanism: we scan the list of types and try each one. I wonder how much we should piggyback on https://www.numpy.org/neps/nep-0018-array-function-protocol.html. It would be nice if a pd.api.extensions.concatenate_array simply became np.concatenate some day.
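As a rough sketch of how the NEP 18 protocol linked above dispatches (MyArray is a hypothetical stand-in, not a pandas class; requires a NumPy version with __array_function__ enabled):

```python
import numpy as np

class MyArray:
    """Hypothetical array opting in to NEP 18's __array_function__."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Intercept np.concatenate when the sequence contains MyArray.
        if func is np.concatenate:
            unwrapped = [x.data if isinstance(x, MyArray) else np.asarray(x)
                         for x in args[0]]
            return MyArray(np.concatenate(unwrapped))
        return NotImplemented

# NumPy sees MyArray defines __array_function__ and hands off to it,
# so np.concatenate returns a MyArray instead of raising.
result = np.concatenate([MyArray([1, 2]), MyArray([3])])
```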

@TomAugspurger referenced this issue Oct 11, 2018: SparseArray is an ExtensionArray #22325 (Merged)
@jorisvandenbossche (Member) commented Oct 15, 2018

Looking at the array function protocol as inspiration for the design looks like a good idea to me.

@TomAugspurger TomAugspurger added this to Orthogonal Blockers in DatetimeArray Refactor Oct 18, 2018

@TomAugspurger TomAugspurger moved this from Orthogonal Blockers to Done in DatetimeArray Refactor Oct 18, 2018

@TomAugspurger (Contributor, Author) commented Nov 6, 2018

Are we OK with pushing this to 0.25?

As a proposal, we can have something like the following:

Iterate through the dtypes, calling ExtensionDtype.get_concat_dtype(dtypes). As soon as an array type says "I know how to handle all these dtypes" by returning a non-None dtype, we stop looking.

import numpy as np

def get_concat_dtype(arrays):  # internal to pandas
    """
    Get the result dtype for concatenating many arrays.

    Parameters
    ----------
    arrays : Sequence[Union[numpy.ndarray, ExtensionArray]]

    Returns
    -------
    dtype : Union[ExtensionDtype, numpy.dtype]
        The NumPy dtype or ExtensionDtype to use for the concatenated
        array.
    """
    dtypes = [x.dtype for x in arrays]
    if len(set(dtypes)) == 1:
        return dtypes[0]

    seen = set()
    # iterate in order of `arrays`, consulting each distinct dtype once
    for dtype in dtypes:
        if dtype in seen:
            continue
        seen.add(dtype)

        # this assumes it's an extension dtype, which isn't correct
        result_dtype = dtype.get_concat_dtype(dtypes)
        if result_dtype is not None:
            return result_dtype

    return np.dtype('object')

class ExtensionDtype:
    ...
    @classmethod
    def get_concat_dtype(cls, dtypes):
        # part of the extension array API
        return None

So for SparseDtype, we would return a SparseDtype(object), or a SparseDtype(dtype) if the subtypes are all coercible. For IntegerArray, we would return the highest precision necessary (follow the rules of the underlying dtypes). Categorical could do something like union_categoricals.
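A hypothetical sketch of the IntegerArray case (MockIntegerDtype is invented for illustration, and written as an instance method for simplicity; a real masked integer dtype would differ):

```python
import numpy as np

class MockIntegerDtype:
    """Hypothetical integer extension dtype implementing the proposal."""

    def __init__(self, numpy_dtype):
        self.numpy_dtype = np.dtype(numpy_dtype)

    def get_concat_dtype(self, dtypes):
        # Claim the concat only if every dtype is one of ours; pick the
        # highest precision needed via NumPy's promotion rules.
        if all(isinstance(d, MockIntegerDtype) for d in dtypes):
            return MockIntegerDtype(
                np.result_type(*[d.numpy_dtype for d in dtypes]))
        return None  # decline; fall through to the object fallback

result = MockIntegerDtype("int32").get_concat_dtype(
    [MockIntegerDtype("int32"), MockIntegerDtype("int64")])
# result.numpy_dtype is int64
```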

Some questions:

  1. Should this be "type stable", i.e. should ExtensionDtype.get_concat_dtype only see dtypes and not values?
  2. Should we special-case self in ExtensionDtype.get_concat_dtype? Right now I made it a classmethod, but do we want access to the actual instance?